Variation in our DNA (often inherited) can have important functional consequences, including susceptibility to diseases. However, much of variation is due to random drift and may have no functional consequence. Identifying the small subset of variations that are functionally important is key to a deeper understanding of the genetic basis of diseases and other phenotypes, and is the mainstay of statistical genetics and other fields. However, rapidly falling costs of genome sequencing implies that genomes of entire populations will be completely sequenced. The availability of tremendous amounts of genetic data, and the complexity of relations between genotypes and phenotypes changes the nature of inference problems from statistical to computational, and demands the use of algorithmic (combinatorial and machine learning) techniques. In this proposal, the PIs propose specific goals in three broad areas, which involve the use of algorithmic techniques in solving problems in genetics.
1. Epistatic interactions and geometric embedding: Epistatic interactions where two distant loci interact to jointly mediate the phenotype often confound analyses. However, with millions of loci, testing all pairs for interactions is computationally intractable. The PIs propose to develop fast algorithms for this problem. The approach depends upon the development of a metric embedding that maps the genotypes at a locus to a point in a high dimensional Euclidean metric, such that interacting pairs have small Euclidean distances. This metric embedding is novel, and allows the use of geometric algorithms for fast detection of epistasis. 2. Haplotype assembly: Haplotyping refers to the separation of the maternal and paternal chromosomes. Successful resolution has great impact in improving the efficacy of genetic association, and in understanding the genetic history of the population. The PIs propose the use of modern strobe-sequencing technologies and single genome amplification to dramatically expand the length of achievable haplotypes. One of the formulated problems maps naturally to connectivity in a new class of random graphs. 3. Pooled selection: The PIs propose the identification of regions under genetic selection, using next generation sequencing data. Specifically, the proposed tests work on pooled DNA, and partially sampled DNA, and employ a combination of techniques from population genetics and combinatorial optimization.
Broader Impact and Intellectual Merit The great promise of genomics is that our complete sequence will be an integral part of our medical record, and the major health prognostics will be informed by variation. However, the early research in correlating genotypes and phenotypes is stymied by lack of analysis tools. The problems addressed here are central to the domain and will clearly add to the toolkit of geneticists and biologists. The research also contributes directly to the CISE-CCF mission of developing novel algorithms for Computational Biology, as the proposed problems are uniquely at the intersection of algorithmic and genetics, and open new avenues of research in Computer Science.
Dissemination and outreach will continue through the length of the project contributing to the broader impact of this research. It will include invited and contributed presentations, publications, classroom projects, and collaborations. Software will be freely available as source-code, or web-tools, for academic, research and non-commercial purposes adding to the infrastructure of genetic analyses tools.