A tandem repeat (TR) is any pattern of nucleotides which occurs as repeating, consecutive copies along a DNA molecule. Often, the pattern copies are not identical. A TR can be polymorphic, that is, it can be different across individuals in a population: 1) the number of copies may be different, 2) the arrangement of non-identical copies may be dfferent, and 3) the copies may contain different small mutations. TR variants are known to affect important biological processes, such as chromatin structure, gene plasticity, gene expression, and disease states, so their discovery is crucial for correctly understanding complex bio-molecular interactions. While a conservative estimate suggests that 100,000 human TRs may be polymorphic, until recently, genome-wide study of TR polymorphism, in humans and other organisms, has been too difficult and costly, with the result that the true extent of polymorphism and its effects are unknown. New genome sequencing technologies offer the first real opportunity to fill in the details of TR diversity. These technologies sequence millions of high quality, short DNA fragments in a single experiment. Current sequencing projects are producing many billions of reads rich in TR variant information. Yet, current read mapping algorithms, which attempt to assign each read to its proper location on the reference genome, are not designed to detect TR variants.

This project has three central goals: 1. Algorithm Development; 2.Genome Studies; 3. Variation Curation in a public database. Strategies will be developed to accurately and efficiently map TR-containing reads to reference genome TR loci. Anticipated algorithmic developments include: 1) Optimization of tree-based alignment, for use when millions of short, disjoint sequences must be aligned to each other. The reads and references can each be merged into separate Patricia tree data structures and alignment computed between tree nodes, thereby eliminating redundant computation in the prefixes of the two sequence sets. 2) Production of space-saving, Burrows Wheeler transforms (BWT) of the most redundant tree parts by employing approximate shortest common superstrings (SCS) for the two sequence sets. 3) Development of an efficient Four-Russians style block computation for edit distance alignment in the trees by exploiting redundancy inherent in the small alphabet and block input scores, 4) Development of a bounding computation for edit-distance based on efficient, bit-register computation of longest common subsequence (LCS) alignment, and 5) Parallelization of all algorithms for further efficiency with multi-core processors, Single Instruction, Multiple Data (SIMD) bit-register computations, and highly parallel graphics processing units (GPUs). Data from six recently published whole human genomes, two human centenarian genomes, and the 1000 genomes project will be analyzed to discover TR variants. An internet-accessible, public database and analysis platform for curation and display of TR variants will be developed.

The TR variant discovery software and all data sets produced will directly enhance the infrastructure for TR diversity research in genome biology, genome evolution, and comparative genomics. The software and data will be freely available to the research community through a high capacity website maintained in the PI's lab at Boston University. The PI will participate in a variety of activities that link research and education and support participation by members of underrepresented groups, including provision of opportunities in research for graduate and undergraduate students, participation in high school enrichment and curriculum development projects, and editorship of an international journal engaged in the dissemination of bioinformatics research.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Application #
1017621
Program Officer
Sylvia Spengler
Project Start
Project End
Budget Start
2010-08-15
Budget End
2015-07-31
Support Year
Fiscal Year
2010
Total Cost
$500,000
Indirect Cost
Name
Boston University
Department
Type
DUNS #
City
Boston
State
MA
Country
United States
Zip Code
02215