The goals of our project are as follows:
1. Create a corpus of temporally annotated data. Under the supervision of our consultants Dr. Frank Sacks, Dr. Vincent Carey, and two Registered Nurses, we will create a gold-standard annotation of events and temporal information within patient narratives from de-identified Electronic Health Record (EHR) data, following the CLEF and TimeML guidelines. We will use the framework of the Brandeis Annotation Tool, a system we have designed to facilitate the quick construction of accurately annotated corpora against a specified guideline. During the annotation process, we will extend the current event library and lexicon with medical event references, under the guidance of the Registered Nurses.
2. Adapt the TARSQI Toolkit (TTK) to targeted temporal properties and relations in the EHR domain. We will use the TARSQI Toolkit, a robust set of temporal processing algorithms we have designed for parsing natural language text, to automatically annotate the events and temporal information in EHR data. We will employ the Specialist Lexicon and other medical resources, together with the Brandeis AcroMed Medical Abbreviation Server and the terms introduced in Aim 1, to extend the toolkit's capabilities for recognizing and interpreting medical event information. Algorithms for identifying events, temporal expressions, and event anchorings and orderings will be trained against the gold standard created in Aim 1 and tested against held-out data.
3. Create a cross-document temporal database of medical events. Using the recognition algorithms introduced in Aim 2, we will create a searchable, temporally ordered database of medical events such as diseases, symptoms, surgeries/interventions, and test results. Events referred to multiple times in the data will be merged using a constraint-satisfaction analysis in order to create a more coherent narrative for a single patient over multiple records.
It is becoming increasingly common for medical researchers to use Electronic Health Records (EHRs) as a primary source of data for studying correlations between medical conditions and concepts. However, EHRs typically contain unstructured text, making them difficult to mine. This research will create a database of temporal orderings of events extracted from EHR patient narratives, using algorithms previously applied to news articles.
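To make the merging and ordering step of Aim 3 concrete, the following is a minimal Python sketch under stated assumptions, not the TARSQI Toolkit's implementation. It assumes that extraction has already produced pairwise BEFORE relations between event mentions and a list of mention pairs judged to refer to the same event. The function name `order_events` and its parameters are hypothetical, and a simple union-find merge followed by a topological sort stands in for the fuller constraint-satisfaction analysis: a cycle among the BEFORE relations is reported as an inconsistency.

```python
from graphlib import TopologicalSorter, CycleError


def order_events(before_pairs, same_event_pairs=()):
    """Merge coreferent mentions, then return the merged events in an
    order consistent with every (earlier, later) BEFORE constraint.

    Hypothetical helper for illustration only; not part of TTK.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge mentions that refer to the same underlying event.
    for a, b in same_event_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the lexicographically smaller name as representative.
            parent[max(ra, rb)] = min(ra, rb)

    # Map each merged event to the set of events that must precede it.
    predecessors = {}
    for earlier, later in before_pairs:
        e, l = find(earlier), find(later)
        if e == l:
            raise ValueError(f"{earlier!r} cannot precede itself after merging")
        predecessors.setdefault(e, set())
        predecessors.setdefault(l, set()).add(e)

    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError as exc:
        raise ValueError("inconsistent temporal constraints") from exc


# Two records mention the same test under different names ("ECG", "EKG").
timeline = order_events(
    before_pairs=[
        ("chest pain", "ECG"),
        ("EKG", "angioplasty"),
        ("angioplasty", "discharge"),
    ],
    same_event_pairs=[("ECG", "EKG")],
)
print(timeline)
```

In this example the two mentions of the test collapse into one event, which links the relations from both records into a single patient timeline. A production system would also need to handle relations other than BEFORE (for example overlap or inclusion) and partial orders with several valid timelines, which this sketch does not attempt.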