Identification of genetic factors that contribute to complex diseases is one of the grand challenges of the post-genomic era. A series of exciting new findings has recently been made using the genome-wide association study (GWAS) design. However, moving from a confirmed association signal to the collection of causal variants at a given locus poses significant challenges. A desirable follow-up strategy for GWAS is to conduct a comprehensive resequencing analysis of the genomic regions of interest. This will allow scientists to discover and study all sequence variants in those regions, greatly increasing the chance of identifying new disease-causing mutations. Rapid advances in next-generation sequencing technologies are making such a strategy increasingly feasible. Novel statistical methods need to be developed to analyze the data generated by these new sequencing instruments. In this proposal, we focus on identifying single nucleotide polymorphisms (SNPs) from resequencing data generated on the Illumina Genome Analyzer platform. First, we will develop a probability-based model that allows us to simultaneously map multi-mapped short sequencing reads, identify sequencing errors, and call SNPs and their genotypes. Since our method will be developed under the Bayesian framework, additional information such as the genotypes obtained from GWAS can be incorporated as informative priors to improve our inference. Second, we will develop a probability-based approach that combines sequencing read data at selected loci from multiple individuals to improve SNP and genotype calling. The goal is to borrow strength across a pool of samples to resolve ambiguity at loci with low sequencing depth. We will implement our statistical methods in freely available software tools to facilitate the analysis of targeted resequencing studies.
Finally, we plan to apply our methods to data generated from real targeted resequencing studies that are being planned for psoriasis and type 2 diabetes through collaboration.
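To illustrate how an informative prior can enter Bayesian genotype calling, the following is a minimal sketch and not the proposed model. It assumes a single biallelic site, independent reads, and a fixed per-base error rate. The function name `genotype_posterior`, its parameters, and the default error rate are hypothetical choices made for illustration only.

```python
from math import comb

# Homozygous reference, heterozygous, homozygous alternate.
GENOTYPES = ("RR", "RA", "AA")


def genotype_posterior(n_ref, n_alt, error_rate=0.01, prior=(1 / 3, 1 / 3, 1 / 3)):
    """Posterior probability of each genotype at one biallelic site.

    Illustrative assumptions: reads are independent, and each base is
    miscalled as the other allele with probability `error_rate`.
    `prior` gives the prior probabilities of RR, RA, and AA, in that order.
    """
    n = n_ref + n_alt
    # Probability that a single read shows the alternate allele under each genotype.
    p_alt = {"RR": error_rate, "RA": 0.5, "AA": 1.0 - error_rate}

    weights = []
    for genotype, prior_prob in zip(GENOTYPES, prior):
        p = p_alt[genotype]
        # Binomial likelihood of the observed allele counts.
        likelihood = comb(n, n_alt) * p ** n_alt * (1.0 - p) ** n_ref
        weights.append(prior_prob * likelihood)

    total = sum(weights)
    return {g: w / total for g, w in zip(GENOTYPES, weights)}


# Low-depth site: two reference reads and one alternate read.
flat = genotype_posterior(2, 1)
# Hypothetical prior that strongly favors RR, e.g. derived from a GWAS genotype call.
informed = genotype_posterior(2, 1, prior=(0.90, 0.08, 0.02))
```

With only three reads, the posterior is sensitive to the prior: in this sketch, the informative prior moves probability mass toward the homozygous reference genotype relative to the flat prior. This is the sense in which external genotype information can help at loci with low sequencing depth.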
Next-generation sequencing technologies facilitate large-scale resequencing studies, which offer a better chance of identifying disease-causing mutations. In this proposal, we will develop novel statistical methods for identifying genetic variants from so-called ultra-high-throughput sequencing data. When completed, the methods and software tools will be made freely available to support better analysis of data generated from resequencing studies.