The Veterans Health Information Systems and Technology Architecture (VistA) is an integrated system of software applications that directly supports patient care at Veterans Health Administration (VHA) healthcare facilities. To facilitate veteran care, VistA maintains a massive repository of patient-related data, including over 1.3 billion textual documents (e.g., progress notes, discharge summaries). The Computerized Patient Record System (CPRS), a front-end application that interfaces with the VistA data repository, allows clinicians to enter, review, and update information concerning all aspects of a veteran's care in their electronic health record (EHR). For veterans with complex and chronic diseases, thousands or tens of thousands of text-based progress notes may be associated with their EHR. Searching through this vast amount of textual data to find useful information can be an arduous task because CPRS lacks sophisticated search capabilities. The VistA EHR system represents the cornerstone of clinical care in the VA. This pilot study is the first step in a program of research whose ultimate goal is to make finding relevant information within a veteran's EHR easier for clinicians, thus improving processes of care and, potentially, patient outcomes. The purpose of the proposed study is to determine whether information retrieval (IR) techniques found to be useful in searching large text-based data repositories, such as the Internet or PubMed, can be applied to progress notes from VistA. In addition, we will explore whether including information about clinically relevant concepts from a medical ontology improves IR results. A total of four IR systems will be examined: (1) vector space model (baseline); (2) vector space model enhanced with ontology weights; (3) latent semantic indexing model; and (4) latent semantic indexing model enhanced with ontology weights.
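To make the baseline concrete, the following is a minimal sketch of a vector space model, assuming TF-IDF term weighting and cosine similarity (a common choice, not one the abstract specifies). The function names, the sample notes, and the `term_weights` argument are hypothetical; `term_weights` simply stands in for ontology-derived concept weights and is not the study's actual weighting scheme.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(notes, term_weights=None):
    """Build one sparse TF-IDF vector (a dict) per note.

    term_weights is a hypothetical mapping of term -> multiplier that
    stands in for ontology-derived weights; unlisted terms get 1.0.
    """
    term_weights = term_weights or {}
    counts = [Counter(tokenize(note)) for note in notes]
    n_docs = len(notes)
    # Number of notes in which each term appears at least once.
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        vec = {}
        for term, tf in c.items():
            # Smoothed inverse document frequency.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            vec[term] = tf * idf * term_weights.get(term, 1.0)
        vectors.append(vec)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_similar(notes, query_index, term_weights=None):
    """Rank all other notes by similarity to the selected note."""
    vectors = tfidf_vectors(notes, term_weights)
    query = vectors[query_index]
    scored = [
        (i, cosine(query, v))
        for i, v in enumerate(vectors)
        if i != query_index
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented example notes, for illustration only.
notes = [
    "Wound culture positive for MRSA, started vancomycin.",
    "MRSA wound infection improving on vancomycin.",
    "Patient seen for routine diabetes follow up.",
]
ranking = rank_similar(notes, query_index=0)
weighted = rank_similar(notes, query_index=0, term_weights={"mrsa": 3.0})
```

Here `ranking` lists the other notes in descending order of similarity to the selected note, and `weighted` shows how up-weighting a clinically important term changes the scores.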
The SNOMED-CT ontology will be used, with concepts weighted by their relative importance within the ontology as computed by Google's PageRank algorithm. The four IR systems will be evaluated on their ability to find progress notes relevant to a selected note, where relevance will be judged by the clinical co-investigators. The document collection to be searched will consist of all progress notes over a 17-month period from a random sample of 20 patients from the James A. Haley Veterans Medical Center (JAHVMC) who tested positive for methicillin-resistant Staphylococcus aureus (MRSA) and five who did not test positive. Because MRSA infections are associated with prolonged hospital stays and with chronic conditions, these patients form an ideal cohort for testing IR systems. The EHRs of MRSA-positive patients are likely to contain large numbers of progress notes of a heterogeneous nature (e.g., physician notes, nursing notes, laboratory results). The large quantity and diverse types of notes associated with this complex condition will provide an excellent test of the effectiveness of the proposed IR techniques. The IR systems will be evaluated using measures derived from precision and recall. The exact Wilcoxon signed-rank test, a nonparametric test, will be used to compare all pairs of IR systems on each performance measure.
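As an illustration of the evaluation measures, the sketch below computes precision and recall over the top k results of a ranked list. The abstract does not say which derived measures will be used, so the cutoff-based form, the function name, and the note identifiers here are assumptions made only for the example.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Return (precision, recall) for the top k retrieved notes.

    Precision divides by k, the usual precision-at-k convention;
    recall divides by the total number of relevant notes.
    """
    top_k = ranked_ids[:k]
    hits = sum(1 for note_id in top_k if note_id in relevant_ids)
    precision = hits / k if k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Hypothetical ranked output of one IR system and the notes
# judged relevant by reviewers.
ranked = ["n7", "n2", "n9", "n4", "n1"]
relevant = {"n2", "n4", "n8"}
p, r = precision_recall_at_k(ranked, relevant, k=4)
```

Two of the top four notes are relevant, so precision is 2/4, and two of the three relevant notes were found, so recall is 2/3.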

Public Health Relevance

Information overload has been cited as a major concern of clinicians using the Veterans Health Administration's (VHA) electronic health record (EHR) system. In particular, clinicians have raised concerns over the number and length of progress notes and the difficulty of finding information within them. This is not surprising, since the most current version of the VHA's Computerized Patient Record System (CPRS) offers only a simple exact-text-matching information retrieval (IR) system, which typically returns either no progress notes or far too many. This problem is especially noticeable in Veterans with complex conditions (e.g., MRSA) because thousands of notes from which information could be obtained are associated with their EHR. This pilot study seeks to develop new IR systems, based on state-of-the-art statistical techniques, that could drastically improve search capabilities, reduce information overload, and improve patient care.

National Institutes of Health (NIH)
Veterans Affairs (VA)
Non-HHS Research Projects (I01)
Study Section
Blank (HSR7)
James A. Haley VA Medical Center
United States