Statistical Methods for Next-Generation Sequencing in Disease Association Studies

In this project we propose to develop statistical approaches and software for genotype calling and association testing in next-generation sequence data. The field is driven by molecular advances that allow for affordable, massively parallel sequencing. Statistical methods for next-generation sequence data in disease studies must develop rapidly to keep pace with the advancing molecular technology. Next-generation sequencing is based on random, short-read technology; thus the coverage of any nucleotide is highly variable and the reads are subject to error. Distinguishing random error from truly variable sites is required for "SNP-calling". One step beyond this is identifying the individual's actual genotype at the site. This is an inherently statistical problem, and we have yet to see it addressed in a statistically rigorous manner. The solution that we propose, and what makes our approach novel, assumes that we have a sample of individuals, each with next-generation sequence data. We anticipate that sequencing may ultimately replace GWAS SNP arrays for disease-association studies. While this may be several years away for whole-genome sequencing, sequencing enough people individually for a small association study is already becoming practical with target capture arrays. We can leverage the information from a sample of individuals with next-generation sequence data to estimate more accurately both an individual's genotype and the position-specific error rate. Our approach is to express the genotype probabilities and error rate in a likelihood framework. We can then use standard statistical theory to call genotypes. This approach should perform better than calling genotypes for a single individual at a time based on an arbitrary filter, as is currently done.
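To make the likelihood framework concrete, the following is a minimal sketch of one way such a multi-sample caller could be set up. It is an illustration under stated assumptions, not the method this project will deliver. It assumes a single biallelic site, a binomial model for the number of alternate-allele reads, Hardy-Weinberg genotype frequencies, and one symmetric error rate per site. The function name `call_genotypes` and its parameters are hypothetical. The allele frequency and error rate are estimated jointly across the sample by EM, and each individual receives posterior genotype probabilities rather than a hard call.

```python
import math


def call_genotypes(alt_counts, depths, n_iter=50, p=0.2, err=0.01):
    """EM sketch for one biallelic site across a sample (illustrative only).

    alt_counts[i] / depths[i]: alternate-allele reads and total reads for
    individual i. p is the alternate allele frequency and err the
    site-specific read error rate (both are starting values).
    Returns (posteriors, p, err), where posteriors[i][g] = P(genotype g | reads)
    for g = 0, 1, 2 copies of the alternate allele.
    """
    n = len(depths)
    post = []
    for _ in range(n_iter):
        # E-step: posterior genotype probabilities given current p and err.
        prior = [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]  # assumes Hardy-Weinberg
        read_alt = [err, 0.5, 1 - err]  # P(read shows alt allele | genotype)
        post = []
        for k, d in zip(alt_counts, depths):
            # Work on the log scale; the binomial coefficient cancels.
            logw = [
                math.log(prior[g])
                + k * math.log(read_alt[g])
                + (d - k) * math.log(1 - read_alt[g])
                for g in range(3)
            ]
            m = max(logw)
            w = [math.exp(x - m) for x in logw]
            s = sum(w)
            post.append([x / s for x in w])

        # M-step: allele frequency from expected alternate-allele counts.
        p = sum(w[1] + 2 * w[2] for w in post) / (2 * n)
        p = min(max(p, 1e-6), 1 - 1e-6)

        # M-step: error rate from expected mismatching reads in homozygotes
        # (reads from heterozygotes carry no information about err here).
        mismatches = sum(
            w[0] * k + w[2] * (d - k)
            for w, k, d in zip(post, alt_counts, depths)
        )
        hom_reads = sum((w[0] + w[2]) * d for w, d in zip(post, depths))
        if hom_reads > 0:
            err = min(max(mismatches / hom_reads, 1e-6), 0.5)
    return post, p, err


# Toy data (made up for illustration): six individuals at one site.
alt = [0, 5, 10, 0, 1, 0]
depth = [10, 10, 10, 8, 12, 2]
posteriors, freq, error_rate = call_genotypes(alt, depth)
calls = [max(range(3), key=lambda g: w[g]) for w in posteriors]
```

In this toy example the individual with one alternate read out of twelve is most plausibly a reference homozygote with a sequencing error, while the individual covered by only two reads keeps a noticeable posterior probability of being heterozygous. A fixed per-individual filter would not express that uncertainty.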
A distinct advantage of this statistical framework is that the uncertainty in the genotype calls can be incorporated directly into our disease-association tests (e.g., case-control and rare variant analysis). In this way we will increase the power of our association tests and reduce bias due to error or systematic missingness. Carrying next-generation sequence data through to the association tests provides a complete analysis pipeline from sequence to association.
Our project meets the goals of the GO grant program because of its potential for high impact in the short term. Methods development is particularly well-suited for, and in need of, a short-term infusion of support. The area of next-generation sequencing is growing rapidly, yet statistical methods to use these data effectively lag far behind molecular advances. Our project will supply the acceleration needed to deliver statistical approaches in time for the coming data from these new technologies.