Databases such as dbGaP are extremely valuable resources of data assembled across multiple cohorts. The continued development of cost-effective high-throughput genotyping and sequencing technologies is producing vast amounts of genetic data. While such databases were formed to archive and distribute the results of previously performed genetic association analyses, an increasing number of studies have provided de-identified individual-level genotypic and phenotypic data that are available to outside researchers who have obtained the appropriate authorization. While the amount of data made available has increased dramatically in recent years, relatively little has been done to facilitate phenotype harmonization across studies. Many genetic epidemiologic studies of cardiovascular disease have multiple variables related to any given phenotype, resulting from different definitions and multiple measurements or subsets of data. A researcher searching such databases for the availability of phenotype and genotype combinations is confronted with a veritable mountain of variables to sift through. This often requires visiting multiple websites to gain additional information about the variables listed in the databases, and examining data distributions to assess similarities across cohorts. While the naming strategy for genetic variants is largely standardized across studies (e.g., "rs" numbers for single nucleotide polymorphisms, or SNPs), this is often not the case for phenotype variables. For a given study, there are often numerous versions of phenotypic variables. Researchers currently have to analyze and compare increasingly large numbers of variables with varying degrees of documentation to obtain the desired information. This is a time-consuming process that may still miss the most appropriate variables.
Moreover, every researcher who wants to compare the same datasets often needs to start from scratch, since there are no tools for sharing phenotype comparison results. The availability of informatics tools to make phenotype mapping more efficient and more accurate, along with intuitive phenotype query tools, would provide a major resource for researchers utilizing these databases. The tools we are proposing would allow researchers to (1) quickly obtain the information needed to assess whether a specific study will be useful for the hypothesis of interest; (2) exclude variables that do not meet research criteria; (3) ascertain which studies have combinations of phenotype and genetic information of interest; and (4) more easily expand research questions beyond the most basic main effects to more complex analyses, such as gene-by-environment interactions and multivariate tests incorporating multiple phenotypes. The increased utility will also enable larger meta-analyses to be performed, as researchers will be able to more quickly home in on outcomes, exclusionary variables, and covariates of interest, leading to increased statistical power to detect genetic associations.

Public Health Relevance

While the amount of genomic data (e.g., GWAS and sequencing data) made available has increased dramatically in recent years, relatively little has been done to facilitate phenotype harmonization across studies. The tools we are proposing would allow researchers to quickly identify data sets of interest; expand research questions beyond the most basic main effects to more complex analyses, such as gene-by-environment interactions and multivariate tests incorporating multiple phenotypes; and more easily perform larger meta-analyses by homing in on outcomes, exclusionary variables, and covariates of interest, increasing statistical power to detect genetic associations.

Agency: National Institutes of Health (NIH)
Institute: National Heart, Lung, and Blood Institute (NHLBI)
Type: Exploratory/Developmental Cooperative Agreement Phase II (UH3)
Project #: 4UH3HL108780-03
Application #: 8733017
Study Section: Special Emphasis Panel (ZHL1-CSR-K (M1))
Program Officer: Papanicolaou, George
Project Start: 2011-07-19
Project End: 2014-09-23
Budget Start: 2013-09-24
Budget End: 2014-09-23
Support Year: 3
Fiscal Year: 2013
Total Cost: $500,000
Indirect Cost: $85,559

Institution Name: University of Southern California
Department: Biostatistics & Other Math Sci
Institution Type: Schools of Engineering
DUNS #: 072933393
City: Los Angeles
State: CA
Country: United States
Zip Code: 90089