Convergent genetic and epidemiologic evidence suggests the importance of understanding psychiatric illness from a dimensional rather than solely a categorical perspective. The limitations of traditional diagnostic categories motivated a major NIMH-supported effort to identify measures of psychopathology that more closely align with underlying disease biology. At present, however, the available large clinical data sets, whether health claims, registries, or electronic health records, do not include such dimensional measures. Even with the integration of structure clinician and patient-reported outcomes, generating such cohorts could require a decade or more. Moreover, coded data does not systematically capture clinically-important concepts such as health behaviors or stressors. While such cohorts are developed, natural language processing can facilitate the application of existing electronic health records to enable precision medicine in psychiatry. Specifically, while traditional natural language tools focus on extracting individual terms, emerging methods including those in development by the investigators allow extraction of concepts and dimensions. The present investigation proposes to develop a toolkit for natural language processing of narrative patient notes to extract measures of psychopathology, including estimated RDoC domains. In preliminary investigations in a large health system, these tools have demonstrated both face validity and predictive validity. This toolkit also allows extraction o complex concepts from narrative notes, such as stressors and health behaviors. In the proposed study, these natural language processing tools will be applied to a large psychiatric inpatient data set as well as a large general medical inpatient data set, to derive measures of psychopathology and other topics. The resulting measures will then be used in combination with coded data to build regression and machine-learning-based models to predict clinical outcomes including length of hospital stay and risk of readmission. The models will then be validated in independent clinical cohorts. By combining expertise in longitudinal clinical investigation, natural language processing, and machine learning, the proposed study brings together a team with the needed skills to develop a critical toolkit for understanding health records dimensionally The resulting models can be applied to facilitate investigation of dimensions of psychopathology and related topics, allowing stratification of clinical risk to enable development of targeted interventions.
Public health significance many aspects of psychiatric illness are not adequately captured by diagnostic codes. This study will apply natural language processing and machine learning to electronic health records from large health systems. The resulting symptom dimensions will allow better stratification of risk for clinically-important outcomes, including prolonged hospital stays and early readmissions.