This Small Business Innovation Research (SBIR) Phase I project seeks to address the most significant and challenging software need in healthcare: cohort identification. A cohort is a group of patients with a common medical condition. Cohorts underpin modern medical care: they define treatment algorithms, measure quality improvement, support government initiatives, and form the core organizing unit for research trials. While manual techniques have been developed to identify a cohort within a healthcare organization's electronic medical record (EMR), all rely on a physician or coder identifying and marking every record for every applicable medical condition. This manual process is inaccurate and addresses only the most common conditions. The proposed novel approach applies big data techniques to the detailed unstructured narrative notes recorded for every patient at every encounter in every healthcare institution. The core technology required to extract unstructured data and make it usable in healthcare is natural language processing (NLP) combined with coded representations of clinical concepts (ontologies). This proposal brings together industry-leading teams and technologies to tackle the greatest data problem in healthcare, offering a unique opportunity to significantly influence care for decades to come.
The broader impact/commercial potential of this project includes creating the foundational infrastructure for the next generation of data-driven healthcare. Just as Google and Yahoo required advanced information extraction and search indexing techniques to make the vast amount of internet data usable, healthcare requires similar enabling technology. The healthcare challenge is even more complex given the multitude of natural language descriptions used by physicians and the complex logic that defines potential cohorts and algorithms. To address these issues, healthcare requires the category of technologies used by Google and Yahoo, but specialized for the healthcare domain. In healthcare, quality improvement requires recognizing at-risk cohorts in a population. Missing these cohorts and inadequately treating them can increase mortality by an order of magnitude, as in the case of deep vein thrombosis (DVT) in acute care. For quality measures being implemented by the federal government, defining and identifying cohorts is always the first step of tracking and reporting. Current processes are manual, limited, and inaccurate. By bringing evidence derived from clinical documentation that is already created in the current workflow to real-time and population-based treatment decisions, this intervention will form a foundation for data-driven care, supporting improved outcomes, shorter hospitalizations, and reduced direct medical costs.
Background
This Small Business Innovation Research (SBIR) project defines an innovative approach to solve a critical challenge in healthcare: measurement of clinical quality. Clinical quality metrics underlie quality improvement and are required inputs for all efforts to improve healthcare outcomes and reduce costs. Current systems to identify cohorts within electronic health records (EHR) are manually populated and, as a result, are known to be inaccurate and to support only a handful of the more than 600 nationally defined measures. With increasing government demand for quality measurement in healthcare, there is equally high demand for automated systems that can accurately identify patient cohorts, particularly those required for quality measures. A more advanced approach is to use the 80% of healthcare data that exists in unstructured narrative format within the EHR. This approach has previously been held back largely by technological limitations.

Project Approach
Automated data normalization processes were used to identify cohorts based on simple quality measures. Simple quality measures are based on single concepts, such as diabetes or hypertension. However, even for simple quality measures, the cohort can be difficult to identify accurately. For example, the term hypertension may appear on an EHR-based problem list for a patient. If the condition has since resolved, however, that entry may or may not be relevant to the quality cohort. To enable these types of important distinctions, feature vectors are required to algorithmically identify patients eligible for inclusion in quality measures. Applying this type of logic, and translating hundreds of extracted features into a handful of quality cohorts, required a standardized process. The project therefore required development of a filter, or inference layer, that identified the feature vectors linked to each cohort.
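The inference layer described above can be illustrated with a minimal sketch. It is not the project's implementation: the source describes a statistical inference engine, whereas this example substitutes a simple hand-written rule table to keep the idea visible. The `Feature` schema, the cohort names, and the `infer_cohorts` function are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """One extracted clinical concept (hypothetical schema, not from the source)."""
    concept: str   # normalized concept name, e.g. "hypertension"
    negated: bool  # True for mentions such as "no evidence of hypertension"
    status: str    # "active" or "resolved"
    source: str    # "narrative" or "problem_list"


# Hypothetical rule table: cohort name -> (required concept, resolved mentions count?)
COHORT_RULES = {
    "hypertension_control": ("hypertension", False),
    "history_of_hypertension": ("hypertension", True),
}


def infer_cohorts(features):
    """Return the set of cohort names supported by a patient's features."""
    cohorts = set()
    for cohort, (concept, allow_resolved) in COHORT_RULES.items():
        for feature in features:
            if feature.concept != concept or feature.negated:
                continue  # wrong concept, or the note rules the concept out
            if feature.status == "resolved" and not allow_resolved:
                continue  # resolved conditions do not qualify for this cohort
            cohorts.add(cohort)
            break  # one supporting feature is enough for this cohort
    return cohorts


# A resolved problem-list entry qualifies for only one of the two cohorts.
resolved_only = [Feature("hypertension", False, "resolved", "problem_list")]
print(sorted(infer_cohorts(resolved_only)))
```

The same extracted concept leads to different cohort decisions depending on its attributes, which is the distinction the hypertension example draws. A statistical engine would learn such mappings from annotated records rather than encode them by hand.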
Project Objectives
This Phase I program had two objectives:
1. Leverage Health Fidelity’s best-in-class REVEAL system, clinical data model, and advanced statistical modeling to develop a tool to enable patient cohort identification based on combined processed narratives and discrete data.
2. Use the tools developed in Objective 1 to compare processed narrative and discrete data against gold standard data for 10 quality cohorts.
Success criteria included: at least 10% improvement in sensitivity in correctly identifying cohorts associated with specific quality measures over discrete data alone, less than 5% decrease in specificity, and statistical significance.

Project Methods
For system development, this project used a large set of de-identified clinical records for training. For results analysis, this project used a separate set of 3,000 de-identified clinical records for validation. The validation record set contained patient records which were manually reviewed, annotated, and enhanced with 10 quality measures by a single independent clinician, who was not part of the development team. Upon sealing the results of the clinician gold standard annotation, the engineering team initiated an iterative development process using extracted features mapped to controlled vocabularies to create a statistical inference engine. The team integrated the statistical inference engine into the REVEAL tool and configured the tool to automatically accept and persist validation data from test data sets and gold standard data within a single defined data model. In order to compare the results of the test data sets to the gold standard data, the team created a data analysis tool to test system results for each individual quality measure and for the sum of measures.

Summary of Data and Conclusions
The cohort identification tool developed in this Phase I project met its success criteria.
Cohort identification based on discrete claims data alone, representing the most common current method of quality measurement, generated a sensitivity of 18.9% per individual patient encounter. Cohort identification based on unstructured processed data yielded a dramatically higher sensitivity of 78.6%, an absolute increase of 59.7 percentage points that puts it at 416% of the discrete-data figure. Both approaches had high specificity: 99.6% for discrete data and 96.2% for unstructured processed data. Results were statistically significant for each individual quality measure and for the sum of all 10 quality measures (p < .0001). The Phase I project served as a critical first step in developing an automated process to measure quality in healthcare. By identifying a better approach to accurately identify quality cohorts, this effort directly supports national initiatives to improve the quality and cost of care.
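The comparison against the gold standard and the check of the success criteria can be worked through in a few lines. The sketch below is illustrative only and is not the project's data analysis tool: the function names and the toy labels are hypothetical, and it assumes the 10% and 5% thresholds are meant as percentage points, which the text does not specify. It does not cover the statistical significance test, since the source does not name the test used.

```python
def sensitivity_specificity(gold, predicted):
    """Compute (sensitivity, specificity) from paired boolean labels.

    Assumes the gold standard contains at least one positive and one
    negative record; otherwise a ZeroDivisionError is raised.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # correctly included
    fn = sum(1 for g, p in pairs if g and not p)      # missed cohort members
    tn = sum(1 for g, p in pairs if not g and not p)  # correctly excluded
    fp = sum(1 for g, p in pairs if not g and p)      # wrongly included
    return tp / (tp + fn), tn / (tn + fp)


def meets_criteria(base_sens, new_sens, base_spec, new_spec,
                   min_gain=10.0, max_spec_drop=5.0):
    """Check the sensitivity gain and specificity drop, in percentage points.

    Interpreting the thresholds as percentage points is an assumption.
    """
    gain = new_sens - base_sens
    drop = base_spec - new_spec
    return gain >= min_gain and drop < max_spec_drop


# Figures reported in the summary, in percent.
DISCRETE_SENS, NARRATIVE_SENS = 18.9, 78.6
DISCRETE_SPEC, NARRATIVE_SPEC = 99.6, 96.2

absolute_gain = NARRATIVE_SENS - DISCRETE_SENS      # percentage points
ratio = NARRATIVE_SENS / DISCRETE_SENS              # narrative vs. discrete
specificity_drop = DISCRETE_SPEC - NARRATIVE_SPEC   # percentage points

print(f"sensitivity gain: {absolute_gain:.1f} points, ratio: {ratio:.2f}x")
print(f"specificity drop: {specificity_drop:.1f} points")
print("criteria met:", meets_criteria(DISCRETE_SENS, NARRATIVE_SENS,
                                      DISCRETE_SPEC, NARRATIVE_SPEC))
```

On real data, `sensitivity_specificity` would be called once per quality measure with the clinician's annotations as `gold` and the system output as `predicted`, and once more on the pooled records for the sum of measures.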