There are nearly 7,000 diseases that have a prevalence of only one in 2,000 individuals or less. Yet, such rare diseases are estimated to collectively affect over 300 million people worldwide, representing a significant healthcare concern. Although rare diseases have predominantly genetic origins, nearly half of them do not manifest symptoms until adulthood and frequently confound discovery and diagnosis. Even in the case of early onset disorders, the sheer number of possible diagnoses can often overwhelm clinicians. As a result, rare diseases are often diagnosed with delay, misdiagnosed or even remain undiagnosed, not only disrupting patient lives but also hindering progress on our understanding of such diseases. Data science methods that mine large-scale retrospective health record data for phenotypic information will aid in timely and accurate diagnoses of rare diseases, especially when combined with additional data types, thus, having significant real- world impact. This proposal will integrate electronic health record (EHR) data sets with publicly available vocabularies and ontologies, and genomic data for the improved identification and characterization of patients with rare diseases, using approaches from machine learning, natural language processing (NLP) and basic bioinformatics. The work has three specific aims and will be carried out in two phases. During the mentored phase, the principal investigator (PI) will develop data-driven methods to extract standardized concepts related to rare diseases from clinical notes and infer the occurrence of each disease (Aim 1). He will also develop data science approaches to compare and contrast longitudinal patterns associated with patients' journeys through the healthcare system when seeking a diagnosis for a rare disease, and aid in clinical decision-making by leveraging these patterns (Aim 2). During the independent phase (Aim 3), computational methods will be developed for the integrated modeling and analysis of genotypic (from Aim 3) and phenotypic information (from Aims 1 and 2). Cohorts to be sequenced will cover diseases for which causal genes or disease definitions are unclear (discovery), as well as those for which these are well known (validation). This work will be carried out under the mentorship of four faculty members with complementary expertise in biomedical informatics, data science, NLP, and rare disease genomics at the University of Washington, the largest medical system in the Pacific Northwest (four million EHRs), world-renowned researchers in medical genetics, and a robust data science environment. In addition, under the direction of the mentoring team, the PI will complete advanced coursework, receive training in translational bioinformatics and clinical research informatics, submit manuscripts, and seek an independent research position. This proposal will yield preliminary results for subsequent studies on data-driven phenotyping and enable the realization of the PI's career goals by providing him with the necessary training to build on his machine learning and basic bioinformatics expertise to transition into an independent investigator in biomedical data science.

Public Health Relevance

Rare genetic diseases are estimated to affect the lives of 25 to 30 million Americans and their families, and present a significant economic burden on the healthcare system. Currently, our knowledge of the broad spectrum of the 7,000 observed rare diseases is limited to a few well-studied ones, hindering our ability to make correct and timely diagnoses. The objective of this study is to improve the identification of patients with rare diseases in healthcare systems by developing data science approaches that automatically recognize rare disease-related patterns in patient health records and correlate them with genomic data, thus, aiding in diagnosis and discovery.

National Institute of Health (NIH)
National Library of Medicine (NLM)
Career Transition Award (K99)
Project #
Application #
Study Section
Special Emphasis Panel (ZLM1)
Program Officer
Sim, Hua-Chuan
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
University of Washington
Other Health Professions
Schools of Medicine
United States
Zip Code