The etiology of complex diseases involves multiple genes and environmental factors. Since each individual gene locus is only a small part of the whole picture, association studies based on correlating variation at one or a few gene loci to disease outcomes may miss significant larger-scale associations. An attractive alternative that may be more revealing is to base association studies on correlations between disease outcomes and haplotypes across selected genomic regions. A prerequisite for association studies, whether they are based on a few loci or on larger-scale haplotypes, is an accurate method for haplotype frequency estimation in a given population. The differences between the haplotype frequencies in a healthy population and in a population of affected individuals may be subtle, so an accurate estimate of the haplotype frequencies is essential for disease association studies. Estimating haplotype frequencies is a non-trivial task because current sequencing methods may produce noisy or incomplete data, and because they typically yield genotypes, whose resolution into pairs of haplotypes is ambiguous. Existing methods for haplotype frequency estimation are mainly heuristic in nature, and they are only suitable for large samples of unrelated individuals from a homogeneous population over short genomic regions. Any deviation from these conditions may result in inaccurate estimates. The main goal of this project is to develop efficient and accurate tools for haplotype frequency estimation under different conditions, and to integrate these methods with novel tools for disease association studies.
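To make the ambiguity concrete, the following minimal Python sketch enumerates the haplotype pairs consistent with a genotype and then runs a textbook expectation-maximization (EM) estimate of haplotype frequencies under Hardy-Weinberg equilibrium. It is an illustration only, not one of the methods proposed in this project. The genotype encoding (0 and 1 for the two homozygous states, 2 for a heterozygous site) and the function names `compatible_pairs` and `em_haplotype_frequencies` are assumptions introduced for the example.

```python
from collections import defaultdict
from itertools import product

HET = 2  # assumed code for a heterozygous site; 0 and 1 denote homozygous sites


def compatible_pairs(genotype):
    """Return all unordered haplotype pairs consistent with a genotype."""
    het_sites = [i for i, g in enumerate(genotype) if g == HET]
    pairs = set()
    # Each heterozygous site can be phased two ways; homozygous sites are fixed.
    for bits in product((0, 1), repeat=len(het_sites)):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het_sites, bits):
            h1[site], h2[site] = b, 1 - b
        # Sorting the pair removes the (h1, h2) / (h2, h1) duplication.
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, iterations=50):
    """Textbook EM estimate of haplotype frequencies, assuming
    Hardy-Weinberg equilibrium and unrelated individuals."""
    resolutions = [compatible_pairs(g) for g in genotypes]
    haplotypes = sorted({h for pairs in resolutions for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}  # uniform start
    for _ in range(iterations):
        counts = defaultdict(float)
        for pairs in resolutions:
            # E-step: posterior weight of each phasing under current frequencies.
            weights = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: each individual contributes two haplotypes.
        n = 2 * len(genotypes)
        freq = {h: counts[h] / n for h in haplotypes}
    return freq


if __name__ == "__main__":
    # A genotype heterozygous at both of two sites has two possible phasings.
    print(compatible_pairs((2, 2)))
    print(em_haplotype_frequencies([(0, 0), (2, 2), (1, 1), (2, 2)]))
```

A genotype with k heterozygous sites admits 2^(k-1) distinct phasings for k >= 1, so the number of candidate resolutions grows exponentially with the length of the region. This sketch enumerates them all, which is only practical for short regions, and it relies on exactly the conditions (unrelated individuals, a homogeneous population) that the text above identifies as limiting.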
In particular, the following activities are proposed: develop accurate, efficient and robust methods for haplotype frequency estimation over short and long genomic regions; extend these methods to deal with small sample sizes and with deviations from Hardy-Weinberg equilibrium due to population substructure, and incorporate pedigree information into the haplotype frequency estimator; integrate the resulting tools with a systematic tool for disease association studies that searches for candidate loci automatically through repeated calls to the haplotype frequency estimator; and launch a web server that will allow geneticists to upload their data and run the programs developed in the project on the fly.
The direct effect of the project would be to reduce the sample size needed for association studies, thus making more studies possible under the same budget constraints. This in turn will lead to a better understanding of complex diseases, which may speed up the search for diagnosis and treatment tools. The mathematical models introduced in this project may shed light on haplotype structure and on evolution. Furthermore, the project will address optimization problems and statistical learning problems that may be of use beyond the scope of genetics. The diverse tasks of this project include algorithm design and implementation, software integration and biological modeling, so there is a wide range of activities suitable for students of all levels. This will give students an exciting exposure to multidisciplinary research involving computer science, statistics, genetics and mathematics. The methods developed in this project will be integrated into bioinformatics courses at UCSD, and the material will be made publicly available as PowerPoint presentations on the web. The software developed in this project will be integrated with the existing publicly available web server HAP.