Genetic variations are landmarks that allow us to track our genetic ancestry, and their structure within the genome informs us about the molecular and demographic forces that have shaped it. For medical research the most important polymorphisms are disease-causing variants, but non-functional polymorphisms are also useful as markers for linkage and association studies. Detecting single-nucleotide polymorphisms (SNPs) and short insertions/deletions (INDELs) in DNA sequences is challenging because one must align and compare sequences from varied sources and differentiate true polymorphisms from sequencing errors. There is a growing need to find rare, medically important alleles in deep alignments of clonal sequences and diploid sequence traces; to identify large numbers of markers for mapping studies in humans, model organisms, and plants; and to discover informative polymorphisms for pathogen strain identification. Building on our existing software, POLYBAYES, we propose to develop a general polymorphism discovery tool that meets these challenges. We will organize fragmentary sequences by layering them upon the genome reference sequence; discard paralogous sequences from similar, duplicated genome regions; and use base quality values in a rigorous Bayesian scheme to compare sequences of arbitrary quality standards. Specifically, we propose methods to align multi-exon genes, and novel methods for paralog filtering based either on complete mapping information or on genome distributions of sequence divergence. We will develop new algorithms for the difficult problem of INDEL detection; integrate heterozygote detection in diploid traces into our software; enhance sensitivity to detect rare alleles; and include a new measure to estimate the true positive rate of our candidate predictions. 
We will implement a fast, reliable, full-functionality discovery tool that is free for academic research, performs well in large discovery projects yet runs on desktop computers, and is easily accessible to biologists in small or medium-sized laboratories.
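
To illustrate how base quality values can feed a Bayesian comparison of true polymorphisms against sequencing errors, the following is a minimal sketch only. It is not the POLYBAYES algorithm. It assumes clonal (haploid) reads, the standard Phred relation between quality Q and error probability (10^(-Q/10)), errors spread evenly over the three other bases, a uniform prior over alleles, and equal allele frequencies at a polymorphic site. The function names and the default prior `prior_snp` are hypothetical.

```python
from itertools import combinations

BASES = "ACGT"


def phred_to_error(q):
    """Convert a Phred quality value to an error probability."""
    return 10 ** (-q / 10)


def base_likelihood(observed, true_base, q):
    """P(observed base | true base), assuming errors are spread
    evenly over the three other nucleotides (an assumption)."""
    e = phred_to_error(q)
    return 1 - e if observed == true_base else e / 3


def snp_posterior(column, prior_snp=0.001):
    """Posterior probability that an alignment column is polymorphic.

    column    -- list of (base, phred_quality) pairs, one per read
    prior_snp -- hypothetical prior probability of a polymorphic site
    """
    # Monomorphic model: every read carries the same true allele.
    mono = 0.0
    for allele in BASES:
        lik = 1.0
        for obs, q in column:
            lik *= base_likelihood(obs, allele, q)
        mono += lik / len(BASES)

    # Polymorphic model: two alleles, each read drawn from either
    # one with probability 1/2 (simplifying assumption).
    pairs = list(combinations(BASES, 2))
    poly = 0.0
    for a, b in pairs:
        lik = 1.0
        for obs, q in column:
            lik *= 0.5 * base_likelihood(obs, a, q) \
                 + 0.5 * base_likelihood(obs, b, q)
        poly += lik / len(pairs)

    numerator = prior_snp * poly
    return numerator / (numerator + (1 - prior_snp) * mono)


# Four high-quality reads per allele: strong evidence for a polymorphism.
well_supported = [("A", 40)] * 4 + [("G", 40)] * 4
# A single low-quality discordant base: more likely a sequencing error.
likely_error = [("A", 40)] * 7 + [("G", 5)]

print(snp_posterior(well_supported))
print(snp_posterior(likely_error))
```

Under these assumptions, a discordant base with a low quality value contributes little evidence for a polymorphism, while several high-quality discordant bases push the posterior toward one. This is the intuition behind weighting sequences of differing quality rather than applying a fixed quality cutoff.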

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
5R01HG003698-03
Application #
7270446
Study Section
Special Emphasis Panel (ZRG1-BST-D (51))
Program Officer
Brooks, Lisa
Project Start
2005-09-16
Project End
2010-07-31
Budget Start
2007-08-01
Budget End
2008-07-31
Support Year
3
Fiscal Year
2007
Total Cost
$350,215
Indirect Cost
Name
Boston College
Department
Biology
Type
Schools of Arts and Sciences
DUNS #
045896339
City
Chestnut Hill
State
MA
Country
United States
Zip Code
02467
1000 Genomes Project Consortium; Abecasis, Gonçalo R; Auton, Adam et al. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491:56-65
Huang, Weichun; Li, Leping; Myers, Jason R et al. (2012) ART: a next-generation sequencing read simulator. Bioinformatics 28:593-4
1000 Genomes Project Consortium; Abecasis, Gonçalo R; Altshuler, David et al. (2010) A map of human genome variation from population-scale sequencing. Nature 467:1061-73
Hillier, LaDeana W; Marth, Gabor T; Quinlan, Aaron R et al. (2008) Whole-genome sequencing and variant discovery in C. elegans. Nat Methods 5:183-8
Huang, Weichun; Marth, Gabor (2008) EagleView: a genome assembly viewer for next-generation sequencing technologies. Genome Res 18:1538-43
Quinlan, Aaron R; Stewart, Donald A; Stromberg, Michael P et al. (2008) Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nat Methods 5:179-81