We propose to develop, implement, and streamline an informatics pipeline that fills the gap between production and analysis for gene-region-specific, high-coverage data from the full-scale 1000 Genomes Project. The pipeline will process exome data generated with direct capture technologies and next-generation sequencing, a major part of the 1000 Genomes Project, to identify and catalog SNPs and indels, enabling a detailed understanding of the distribution of genetic variants within coding regions across the human population. We will develop and improve several software packages for read mapping, variant discovery, and data quality assurance, strengthening both their statistical rigor and their software engineering so that they are suitable for general use as a toolkit. We expect that both the exome variation data and the toolkit will play a critical role in future medical genetics research. We propose three specific aims:
Aim 1. Develop QC metrics for gene-region data across different samples, populations, and technology platforms, allowing full data integration. We will explore the possible approaches to handling duplicate reads and assess their effects. We will implement an informatics pipeline that applies these metrics to QC gene-region-specific data.
Aim 2. Develop and optimize a gene-region-specific pipeline for detecting genetic variation, and derive common quality metrics for variants that are independent of technology platform. The focus of this pipeline is to reliably discover nearly all genetic polymorphisms within the coding sequences, including those with a minor allele frequency (MAF) as low as 0.1%. We will optimize our Atlas software for SNP and indel discovery, using Pilot 3 data for validation. We will also carry out genotyping and sequencing experiments to assess the quality of the SNP and indel discoveries, and then evaluate and compare the performance of our approach with that of other available approaches.
Aim 3. Coordinate with the DCC to implement the gene-region-specific data processing pipeline. We will collaborate closely with the DCC to implement and streamline the pipeline so that it is readily applicable to gene-region data from the full-scale project. We will also facilitate the integration of the genetic variants and individual genotypes obtained from the different components of the 1000 Genomes Project.
Public Health Relevance: The pipeline will process gene-region-specific data, a major part of the 1000 Genomes Project, to catalog SNPs and indels within the coding regions of the human genome. Once this high-quality data set becomes available, we expect the novel rare non-synonymous SNPs it contains to be immediately included and characterized in disease association studies.