The 1000 Genomes Project is developing data resources and analytical methods required for the next stage of human genetics research: (a) discovering millions of novel polymorphisms with frequencies of 0.5%-10%, which can then be tested for association with disease via imputation and direct genotyping in patients, and (b) bringing together genome centers, technology companies and population geneticists in a collaborative framework to develop data formats, analytical methods and standards for sensitive, accurate genome-wide resequencing for rare variants. The Project's goals are aggressive: detection of rare variants requires unprecedented accuracy, and the challenge is compounded by the inclusion of multiple rapidly evolving next-generation sequencing platforms. Key tasks for Data Processing include (a) defining the biases and error processes characteristic of each sequencing platform, (b) determining how to use properly calibrated data to discover and genotype variants (SNP and structural), including making use of population genetics and prior array data for each sample, and (c) making it easy for users to browse the resulting data and integrate it into statistical genetic analysis of disease samples. As members of the 1000 Genomes Project Analysis Group, we propose three Aims. First, to develop, implement and apply methodology to convert raw intensity data from each platform into accurate four-base probabilities, refining and calibrating the underlying base-call probabilities to increase accuracy. Second, to develop and implement an integrated approach to SNP and CNV detection that utilizes these probabilities, combines information across multiple samples, and exploits existing information from genotyping arrays, increasing sensitivity and accuracy for both SNPs and structural variants. Third, to develop user-friendly software for browsing and applying 1000 Genomes Project data in disease research, making Project data on sequence variation and linkage disequilibrium accessible and easily usable to the wider genetics community. We have assembled an experienced and skilled team of statistical and population genetic analysts and software engineers, with a track record of contributions to the SNP Consortium, the HapMap project, and disease association studies. If funded, we will develop improved methods for interpreting raw next-generation sequencing data, and software tools that speed the application of Project data by the genetics community.
The data from the 1000 Genomes Project will provide the underpinnings for the execution and interpretation of all complex human disease genetic research that follows. As this constitutes the largest investment in human variation resource generation to date, it is imperative that the data from this project be processed and analyzed as accurately as possible, because the raw data is of such a scale that it cannot be maintained permanently. In addition, the methods developed in this proposal will be directly applied beyond the 1000 Genomes Project to medical sequencing efforts to unlock the genetics of complex disease.