We propose to develop, implement, and streamline an informatics pipeline that fills the gap between data production and analysis for gene-region-specific, high-coverage data from the full-scale 1000 Genomes Project. The pipeline will process exome data generated by direct capture technologies and next-generation sequencing, a major part of the 1000 Genomes Project, to identify and catalog SNPs and indels, enabling a detailed understanding of how genetic variants are distributed within coding regions across the human population. We will develop and improve several software packages for read mapping, variant discovery, and data quality assurance, strengthening both their statistical rigor and their software engineering so that they are suitable for general use as a toolkit. We expect that both the exome variation data and the toolkit will play a critical role in future medical genetics research. We propose three specific Aims:
Aim 1. Establish QC metrics for gene-region data that apply across samples, populations, and technology platforms, allowing full data integration. We will explore the possible approaches to handling duplicate reads and assess their effects. We will then implement an informatics pipeline that applies these metrics to QC gene-region-specific data.
Aim 2. Develop and optimize a gene-region-specific pipeline for detecting genetic variation, and derive common quality metrics for variants that do not depend on the technology platform. The focus of this pipeline is to reliably discover nearly all genetic polymorphisms within coding sequences, including those with a minor allele frequency (MAF) as low as 0.1%. We will optimize our Atlas software for SNP and indel discovery, using Pilot 3 data as a validation exercise. We will also carry out genotyping and sequencing experiments to assess the quality of the SNP and indel calls, and then evaluate its performance against other available approaches.
Aim 3. Coordinate with the DCC to implement the gene-region-specific data processing pipeline. We will collaborate closely with the DCC to implement and streamline the pipeline so that it is readily applicable to gene-region data from the full-scale project. We will also facilitate the integration of the genetic variants and individual genotypes obtained from the different components of the 1000 Genomes Project.
Public Health Relevance: The pipeline will process gene-region-specific data, a major part of the 1000 Genomes Project, to catalog SNPs and indels within coding regions of the human genome. Once such a high-quality data set becomes available, we expect that the list of novel, rare non-synonymous SNPs will be immediately included and characterized in disease association studies.
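To give a sense of what the 0.1% frequency target in Aim 2 implies for sample size, the following is a minimal sketch of a standard sampling calculation, not a description of the project's own method. It assumes alleles are drawn independently, so an allele at frequency f is present at least once among 2N chromosomes with probability 1 - (1 - f)^(2N). The function name and the sample sizes in the loop are illustrative assumptions, not figures from the project.

```python
def prob_allele_sampled(maf: float, n_individuals: int) -> float:
    """Probability that an allele at frequency `maf` appears at least once
    among the 2 * n_individuals chromosomes of a diploid sample.

    Assumes independent draws (no relatedness or population structure).
    This is only the chance the allele is present in the sample; actually
    calling it also depends on coverage and on the variant caller.
    """
    if not 0.0 <= maf <= 1.0:
        raise ValueError("maf must be between 0 and 1")
    if n_individuals < 0:
        raise ValueError("n_individuals must be non-negative")
    return 1.0 - (1.0 - maf) ** (2 * n_individuals)


if __name__ == "__main__":
    # Hypothetical sample sizes, chosen only for illustration.
    for n in (100, 500, 1000, 2500):
        print(f"N={n}: P(present) = {prob_allele_sampled(0.001, n):.3f}")
```

Under these assumptions the probability rises with sample size, which is why discovering variants at this frequency requires sequencing on the order of a thousand or more individuals.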

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project--Cooperative Agreements (U01)
Project #: 3U01HG005211-02S1
Application #: 8527232
Study Section: Special Emphasis Panel (ZHG1-HGR-M (M2))
Program Officer: Brooks, Lisa
Project Start: 2009-09-18
Project End: 2014-06-30
Budget Start: 2012-07-01
Budget End: 2014-06-30
Support Year: 2
Fiscal Year: 2012
Total Cost: $200,000
Indirect Cost: $59,294

Name: Baylor College of Medicine
Department: Genetics
Type: Schools of Medicine
DUNS #: 051113330
City: Houston
State: TX
Country: United States
Zip Code: 77030

1000 Genomes Project Consortium; Auton, Adam; Brooks, Lisa D et al. (2015) A global reference for human genetic variation. Nature 526:68-74
Challis, Danny; Antunes, Lilian; Garrison, Erik et al. (2015) The distribution and mutagenesis of short coding INDELs from 1,128 whole exomes. BMC Genomics 16:143
Gray, Stacy W; Martins, Yolanda; Feuerman, Lindsay Z et al. (2014) Social and behavioral research in genomic sequencing: approaches from the Clinical Sequencing Exploratory Research Consortium Outcomes and Measures Working Group. Genet Med 16:727-35
Wang, Q Y; Song, J; Gibbs, R A et al. (2013) Characterizing polymorphisms and allelic diversity of von Willebrand factor gene in the 1000 Genomes. J Thromb Haemost 11:261-9
Wang, Yi; Lu, James; Yu, Jin et al. (2013) An integrative variant analysis pipeline for accurate genotype/haplotype inference in population NGS data. Genome Res 23:833-42
Challis, Danny; Yu, Jin; Evani, Uday S et al. (2012) An integrative variant analysis suite for whole exome next-generation sequencing data. BMC Bioinformatics 13:8
Lu, James T; Wang, Yi; Gibbs, Richard A et al. (2012) Characterizing linkage disequilibrium and evaluating imputation power of human genomic insertion-deletion polymorphisms. Genome Biol 13:R15
1000 Genomes Project Consortium; Abecasis, Goncalo R; Auton, Adam et al. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491:56-65
Evani, Uday S; Challis, Danny; Yu, Jin et al. (2012) Atlas2 Cloud: a framework for personal genome analysis in the cloud. BMC Genomics 13 Suppl 6:S19
Marth, Gabor T; Yu, Fuli; Indap, Amit R et al. (2011) The functional spectrum of low-frequency coding variation. Genome Biol 12:R84

Showing the most recent 10 out of 13 publications