EAGER:  Haplotype Phasing Algorithms and Clark Consistency Graphs

Istrail, Sorin

Abstract

Currently, most SNP assay data is made available from genome-wide association studies (GWAS), which proceed by identifying a number of individuals carrying a disease or a trait and comparing them to those that do not or are not known to carry the disease/trait. Both sets of individuals are then genotyped for a large number of single-nucleotide polymorphism (SNP) genetic variants that are then tested for association to the disease/trait. GWAS studies using tens of thousands of individuals are becoming commonplace and are increasingly the norm in the association of genetic variants to disease. These studies generally proceed by pooling large amounts of genome-wide data from multiple studies, for a combined total of tens of thousands of individuals in a single meta-analysis study. It can be expected that hundreds of thousands, if not millions, of individuals will soon be studied for association to a single disease or trait. Although SNPs are the most abundant form of variation between two individuals, other forms of variation exist, such as copy-number variation ? large-scale chromosomal deletions, insertions, and duplications. These variations, which have been shown to be influential factors in many diseases, are not probed using the current technology of SNP arrays. An emerging trend in accounting for ?missing heritability? is ?parent-of-origin? effects, where genetic variants confer risk only when inherited from a specific parent. Long-range haplotype phasing is key to identifying the association of the haplotype pattern to the specific parent of origin.

The premise of this research is that long tracts are unlikely (to be shared) unless the haplotypes are identical by descent (IBD), in contrast to short shared tracts, which may be identical by state (IBS). A difficult algorithmic challenge is that of tract finding in genotype matrices of a sample of m people genotyped at n SNPs. The premise of our research is that long tracts are unlikely (to be shared) unless the haplotypes are identical by descent (IBD), in contrast to short shared tracts, which may be identical by state (IBS). A difficult algorithmic challenge is that of tract finding in genotype matrices of a sample of m people genotyped at n SNPs. To apply such a long-range phasing algorithm to the U.S. population, it is estimated that 2 million individuals must be genotyped. Algorithmic strategies proposed here show promise that the combinatorial structure of Clark Consistency Graphs can provide the basis for powerful algorithms that will decrease this number substantially. The primary output of this project will be new long-range phasing software, documentation, and source code, all to be immediately and continually available to the scientific community as open-source for research and education.

Project Report

This NSF proposal has been focused on the development of algorithms for genome-wide haplotype phasing.Two aims were proposed: AIM 1: Develop Algorithmic Graph Theory and Efficient Data Structures for Long-Range Clark Consistency Graphs and for Tract-Finding for GWAS Data Sets. AIM 2: Develop Algorithms and Software Library for Long-Range Haplotype Phasing of a Large Sample of Individuals. The importance of the problem and the need for powerful algorithms that work accurately for genome-wide size data is eloquently presented in the following quote: "Improving data quality is crucial, because if a human genome cannot be independently assembled then the sequence data cannot be sorted into the two sets of parental chromosomes, or haplotypes. This process haplotype phasing will become one of the most useful tools in genomic medicine. Establishing the complete set of genetic information that we received from each parent is crucial to understanding the links between heritability, gene function, regulatory sequences and our predisposition to disease." J. C. Venter, "Multiple personal genomes await," Nature, April 2010 In a seminal 1990 paper, "Inference of Haplotypes from PCR-amplified Samples of Diploid Populations," Andy Clark proposed a very fast greedy algorithm for separating the maternal and paternal haplotypes of an individual. The quest then, for each individual, is the identification of the two haplotypes (the maternal-paternal explanation) out of the (2 to the power k) possible haplotypes. Clarkâ€™s algorithm is an algorithm-schema, leaving a number of implementations details open. The algorithmic ideas continue to be effective and actively researched today; in fact, algorithms based on them are called "Clark Methods" in the literature. Experimental methods to date have not been able to provide a cost-effective technology for the haplotype phasing problem, and with todayâ€™s genome-wide association studies (GWAS) size samples (thousands of people, hundred of thousands of SNPs), computational methods are the only feasible approach to practical solutions for the problem. And even there, due to the huge SNP array data sets, very fast algorithmic methods like Clark Methods are very important. The search for fast and accurate computational methods for phasing is witnessed by the large literature on the subject and by the many software tools for haplotype phasing and competitions assessing these tools. In addition to the Clark Methods, a variety of other statistical and combinatorial models have been proposed: maximum-likelihood, Bayesian, parsimony, coalescent, perfect and imperfect phylogeny. The PIâ€™s work on a number of other synergetic projects has provided algorithmic insights into the haplotype phasing problem as well as a suite of software tools and an environment for haplotype reconstruction visualization. These include: a Genome Browser under development dedicated to GWAS analysis, the ARIADNE GWAS-Browser; unification of linkage disequilibrium measures; a suite of genomic tools including a variety of phasing algorithms based on parsimony haplotyping, maximum-likelihood and haplotyping, coalescent haplotyping, robustness of haplotype blocks analysis, tagging SNPs and minimum informative SNPs selection, SNPs and haplotype assembly, and imperfect ancestral recombination graphs reconstruction. In addition, libraries of tools developed by his group at Celera Genomics 2000-2005 for genome annotation, genome assembly algorithms, and assembly comparison tools, available in open source, provided an effective software development environment for his laboratory research work on in haplotype phasing algorithms. The Clark Consistency Graph Theory was introduced by Sharan, Halldorson and Istrail in 2006. This graph theoretic framework is the basis for the algorithm development of this proposal and responsible for the success of the methods obtained. The concept was reintroduced by the deCode Genomics Long Range Phasing paper as "haplotype sharing graphs." It was the data structure of Clark Consistency Graphs that allowed us to extend algorithmic strategies employed for the Icelandic population towards characteristics to samples of U.S. populations. Fundamental to the algorithmic success was the developmental of a new algorithm that optimally provides the solution of the identification of the identical-by-descent (IBD) tracts in genome-wide haplotype matrices (when the genotype data is phased). A number of other algorithmic synergetic contributions were made for the haplotype phasing problem: (1) genomic deletions and phasing corrections with applications to autism GWAS data, (2) haplotype assembly from next generation data as a back bone for phasing reconstruction, and (3) maximum likelihood phasing polynomial optimization. The work on this proposal provided a suite of software tools devoted to haplotype phasing that the PI used part of his work on teaching and developing (over the last four years) a graduate course on computational methods for GWAS analysis.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1048831
Program Officer: Sylvia Spengler

Project Start
Project End
Budget Start: 2011-01-01
Budget End: 2012-12-31
Support Year
Fiscal Year: 2010
Total Cost: $199,999
Indirect Cost

EAGER: Haplotype Phasing Algorithms and Clark Consistency Graphs
Istrail, Sorin
Brown University, Providence, RI, United States

Abstract

Project Report

Funding Agency

Institution

Comments

Recent in Grantomics:

Recently viewed grants:

Recently added grants:

Abstract

Project Report

Funding Agency

Institution

Comments