Filling the data processing gap for exon-region specific data from 1000 Genomes

Gibbs, Richard

Abstract

We propose to develop, implement and streamline an informatics pipeline to fill the gap between production and analysis for gene-region specific high-coverage data from the full-scale 1000 Genomes Project. The pipeline will process exome data generated with direct capture technologies and next-generation sequencing, a major part of the 1000 Genomes Project, to identify and catalog SNPs and indels, enabling a detailed understanding of the distribution of genetic variants within coding regions across the human population. We will develop and improve several software packages for read mapping, variant discovery, and data quality assurance, strengthening both their statistical rigor and their software engineering so that they are suitable for general use as a toolkit. We expect that both the exome genetic variation data and the toolkit will play a critical role in future medical genetics research. We propose three specific Aims:
Aim 1. Develop QC metrics for gene-region data across different samples, populations and technological platforms, allowing for full data integration. We will explore the possible approaches to handling duplicate reads and their effects. An informatics pipeline that applies these metrics to QC gene-region specific data will be implemented.
Aim 2. Develop and optimize a gene-region specific pipeline for detecting genetic variation, and derive common quality metrics for variants regardless of technological platform. The focus of this data processing pipeline is to reliably discover nearly all genetic polymorphisms (up to 0.1% MAF) within the coding sequences. We will optimize our Atlas software for SNP and INDEL discovery, using Pilot 3 data as a validation exercise. We will also carry out genotyping and sequencing experiments to assess the quality of SNP/INDEL discoveries, and then evaluate and compare its performance with that of other available approaches.
Aim 3. Coordinate with the DCC to implement the gene-region specific data processing pipeline. We will collaborate closely with the DCC to implement and streamline this pipeline so that it is readily applicable to processing the gene-region data from the full-scale project. We will facilitate the integration of the genetic variations and individual genotypes obtained from different components of the 1000 Genomes Project.
Public Health Relevance: The developed pipeline will process gene-region specific data as a major part of the 1000 Genomes Project, to catalog SNPs and INDELs within coding regions of the human genome. Once such a high-quality data set becomes available, we expect that the list of novel rare non-synonymous SNPs will be immediately included and characterized in any disease association study.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project--Cooperative Agreements (U01)
Project #: 5U01HG005211-02
Application #: 7930731
Study Section: Special Emphasis Panel (ZHG1-HGR-M (M2))
Program Officer: Brooks, Lisa

Project Start: 2009-09-18
Project End: 2014-06-30
Budget Start: 2010-07-01
Budget End: 2014-06-30
Support Year: 2
Fiscal Year: 2010
Total Cost: $521,149

Institution

Name: Baylor College of Medicine
Department: Genetics
Type: Schools of Medicine
DUNS #: 051113330

City: Houston
State: TX
Country: United States
Zip Code: 77030

Related projects

NIH 2012 U01 HG	Filling the data processing gap for exon-region specific data from 1000 Genomes	Gibbs, Richard A. / Baylor College of Medicine	$200,000
NIH 2010 U01 HG	Filling the data processing gap for exon-region specific data from 1000 Genomes	Gibbs, Richard A. / Baylor College of Medicine	$521,149
NIH 2009 U01 HG	Filling the data processing gap for exon-region specific data from 1000 Genomes	Gibbs, Richard A. / Baylor College of Medicine	$548,375

Publications

1000 Genomes Project Consortium; Auton, Adam; Brooks, Lisa D et al. (2015) A global reference for human genetic variation. Nature 526:68-74

Challis, Danny; Antunes, Lilian; Garrison, Erik et al. (2015) The distribution and mutagenesis of short coding INDELs from 1,128 whole exomes. BMC Genomics 16:143

Gray, Stacy W; Martins, Yolanda; Feuerman, Lindsay Z et al. (2014) Social and behavioral research in genomic sequencing: approaches from the Clinical Sequencing Exploratory Research Consortium Outcomes and Measures Working Group. Genet Med 16:727-35

Wang, Q Y; Song, J; Gibbs, R A et al. (2013) Characterizing polymorphisms and allelic diversity of von Willebrand factor gene in the 1000 Genomes. J Thromb Haemost 11:261-9

Wang, Yi; Lu, James; Yu, Jin et al. (2013) An integrative variant analysis pipeline for accurate genotype/haplotype inference in population NGS data. Genome Res 23:833-42

Challis, Danny; Yu, Jin; Evani, Uday S et al. (2012) An integrative variant analysis suite for whole exome next-generation sequencing data. BMC Bioinformatics 13:8

Lu, James T; Wang, Yi; Gibbs, Richard A et al. (2012) Characterizing linkage disequilibrium and evaluating imputation power of human genomic insertion-deletion polymorphisms. Genome Biol 13:R15

1000 Genomes Project Consortium; Abecasis, Goncalo R; Auton, Adam et al. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491:56-65

Evani, Uday S; Challis, Danny; Yu, Jin et al. (2012) Atlas2 Cloud: a framework for personal genome analysis in the cloud. BMC Genomics 13 Suppl 6:S19

Marth, Gabor T; Yu, Fuli; Indap, Amit R et al. (2011) The functional spectrum of low-frequency coding variation. Genome Biol 12:R84

Showing the most recent 10 out of 13 publications
