Convergent genetic and epidemiologic evidence suggests the importance of understanding psychiatric illness from a dimensional rather than solely a categorical perspective. The limitations of traditional diagnostic categories motivated a major NIMH-supported effort to identify measures of psychopathology that more closely align with underlying disease biology. At present, however, the available large clinical data sets, whether health claims, registries, or electronic health records, do not include such dimensional measures. Even with the integration of structured clinician- and patient-reported outcomes, generating such cohorts could require a decade or more. Moreover, coded data do not systematically capture clinically important concepts such as health behaviors or stressors. While such cohorts are developed, natural language processing can facilitate the application of existing electronic health records to enable precision medicine in psychiatry. Specifically, while traditional natural language tools focus on extracting individual terms, emerging methods, including those in development by the investigators, allow extraction of concepts and dimensions. The present investigation proposes to develop a toolkit for natural language processing of narrative patient notes to extract measures of psychopathology, including estimated RDoC domains. In preliminary investigations in a large health system, these tools have demonstrated both face validity and predictive validity. This toolkit also allows extraction of complex concepts from narrative notes, such as stressors and health behaviors. In the proposed study, these natural language processing tools will be applied to a large psychiatric inpatient data set as well as a large general medical inpatient data set, to derive measures of psychopathology and other topics.
The resulting measures will then be used in combination with coded data to build regression and machine-learning-based models to predict clinical outcomes including length of hospital stay and risk of readmission. The models will then be validated in independent clinical cohorts. By combining expertise in longitudinal clinical investigation, natural language processing, and machine learning, the proposed study brings together a team with the needed skills to develop a critical toolkit for understanding health records dimensionally. The resulting models can be applied to facilitate investigation of dimensions of psychopathology and related topics, allowing stratification of clinical risk to enable development of targeted interventions.

Public Health Relevance

Public health significance: Many aspects of psychiatric illness are not adequately captured by diagnostic codes. This study will apply natural language processing and machine learning to electronic health records from large health systems. The resulting symptom dimensions will allow better stratification of risk for clinically important outcomes, including prolonged hospital stays and early readmissions.

National Institutes of Health (NIH)
National Institute of Mental Health (NIMH)
Research Project (R01)
Study Section: Mental Health Services Research Committee (SERV)
Program Officer: Morris, Sarah E
Massachusetts General Hospital
United States
McCoy Jr, Thomas H; Hart, Kamber; Pellegrini, Amelia et al. (2018) Genome-wide association identifies a novel locus for delirium risk. Neurobiol Aging 68:160.e9-160.e14
McCoy Jr, Thomas H; Yu, Sheng; Hart, Kamber L et al. (2018) High Throughput Phenotyping for Dimensional Psychopathology in Electronic Health Records. Biol Psychiatry 83:997-1004
Snapper, Leslie A; Hart, Kamber L; Venkatesh, Kartik K et al. (2018) Cohort study of the relationship between individual psychotherapy and pregnancy outcomes. J Affect Disord 239:253-257
McCoy Jr, Thomas H; Castro, Victor M; Hart, Kamber L et al. (2018) Genome-wide Association Study of Dimensional Psychopathology Using Electronic Health Records. Biol Psychiatry 83:1005-1011
McCoy Jr, Thomas H; Perlis, Roy H (2018) Temporal Trends and Characteristics of Reportable Health Data Breaches, 2010-2017. JAMA 320:1282-1284
McCoy, Thomas H; Castro, Victor M; Snapper, Leslie A et al. (2017) Efficient genome-wide association in biobanks using topic modeling identifies multiple novel disease loci. Mol Med 23:285-294
McCoy, T H; Castro, V M; Snapper, L et al. (2017) Polygenic loading for major depression is associated with specific medical comorbidity. Transl Psychiatry 7:e1238