Researchers continually upload data into public repositories at a rapid pace, yet utilize few common standards for annotation, making it close to impossible to compare or associate data across studies. To address this problem, we will develop a defined meta- data model and build an integrated system called Phenotype Discovery (PhD) that enables researchers to query and find genomic studies of interest in public repositories as well as upload new data into our database (sdGaP), in a standardized manner. A Query Interpreter (QI) will utilize text mining and natural language processing techniques to map free text into concepts in biomedical ontologies, allowing non-structured queries to be answered efficiently. In Phase I of the project, we will develop a proof-of-concept system that can retrospectively structure phenotypic descriptions in dbGaP, and will work with domain experts in pneumology to build use cases and evaluate the automated mappings. In Phase II of the project, we will extend the domain expertise to cardiology, hematology, and sleep disorders to build a more comprehensive system, expanding the phenotype annotation to transcriptome databases, and integrating a flexible automated genotype annotation tool for sdGaP. We will develop a user-friendly interface to prospectively assist researchers in uploading their data with standardized phenotypic annotations. We will provide the tool for free from our website and continuously improve its quality, based on user feedback and usage data.

Public Health Relevance

Phenotype Discovery (PhD) represents a novel, automated system to describe the characteristics of patients whose genetic information is available in public data repositories, without compromising their privacy. This initiative is greatly needed so that more researchers can make use of data collected from projects funded by public agencies. PhD uses new methodology for natural language processing and semantic integration to interpret the narrative text as well as variables and their values from studies in genomic databases. Standardized terminologies will be utilized to ensure that data can be analyzed across different studies.

Agency
National Institute of Health (NIH)
Institute
National Heart, Lung, and Blood Institute (NHLBI)
Type
Exploratory/Developmental Cooperative Agreement Phase II (UH3)
Project #
4UH3HL108785-03
Application #
8733018
Study Section
Special Emphasis Panel (ZHL1-CSR-K (M1))
Program Officer
Papanicolaou, George
Project Start
2011-07-19
Project End
2014-09-23
Budget Start
2013-09-24
Budget End
2014-09-23
Support Year
3
Fiscal Year
2013
Total Cost
$500,000
Indirect Cost
$177,419
Name
University of California San Diego
Department
Biostatistics & Other Math Sci
Type
Schools of Arts and Sciences
DUNS #
804355790
City
La Jolla
State
CA
Country
United States
Zip Code
92093
Doan, Son; Conway, Mike; Phuong, Tu Minh et al. (2014) Natural language processing in biomedicine: a unified system architecture overview. Methods Mol Biol 1168:275-94