Our goal is to develop a set of pre- and post-processing tools that are independent of the assembly software used and thus could be implemented immediately at all major sequencing/assembly centers. We will make our software and methods freely available as open source.

Pre-processing:
Specific Aim 1. Improving draft genome assemblies through better use of read data. (a) Vector trimming: Improperly trimmed vector sequences often cause genome assemblies to break unnecessarily. We propose an improved vector trimming method that determines the vector sequence automatically. (b) We propose to preprocess the read data to increase the amount of usable sequence on the 3' ends of the reads, using overlap-based error correction followed by overlap-based trimming. We propose to use low-quality bases on the 3' ends of the reads to help resolve repetitive regions. This procedure would be used to further improve the UMD Overlapper. (c) We propose to use our read extension and error correction routines to create better assemblies of genomes sequenced at low coverage.

Post-processing:
We propose to develop a set of assembler-independent techniques that can be used at any sequencing/assembly center or in the framework of any sequencing consortium.
Specific Aim 2. Assembly evaluation software. We have developed a Compression/Expansion (CE) statistic that allows us to detect misassembled regions in draft assemblies. We have also developed software that uses a shooting method to determine which inserts lie in easy-to-assemble (not necessarily unique) regions of the genome and to measure their sizes exactly. We propose to develop integrated assembly evaluation/misassembly detection software that uses statistics based on read and mate pair placements, together with data obtained from the shooting method, to detect problems in draft assemblies.
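As an illustration of the kind of mate-pair statistic described in Specific Aim 2, the sketch below computes a CE value for one position in an assembly. It assumes the usual z-score form of the statistic: the mean size of the inserts spanning the position, minus the library mean, divided by the standard error of that mean. The function names, the example library parameters, and the flagging threshold of 3.0 are hypothetical choices made for this example and are not taken from the project's software.

```python
import math
from statistics import mean


def ce_statistic(observed_sizes, library_mean, library_sd):
    """Return a compression/expansion z-score for one assembly position.

    observed_sizes: insert sizes, as implied by the assembly, of the mate
                    pairs that span the position (all from one library).
    library_mean, library_sd: expected insert size and standard deviation
                    of that library.
    """
    n = len(observed_sizes)
    if n == 0:
        raise ValueError("no spanning inserts at this position")
    if library_sd <= 0:
        raise ValueError("library_sd must be positive")
    # Standard error of the mean of n inserts drawn from the library.
    standard_error = library_sd / math.sqrt(n)
    return (mean(observed_sizes) - library_mean) / standard_error


def flag_region(ce, threshold=3.0):
    """Classify a CE value; the threshold is an assumed cutoff."""
    if ce <= -threshold:
        return "compression"  # inserts look too short: sequence may be missing
    if ce >= threshold:
        return "expansion"    # inserts look too long: extra sequence may be present
    return "ok"


if __name__ == "__main__":
    # Hypothetical library: mean insert size 3000 bp, standard deviation 300 bp.
    ce = ce_statistic([2400, 2400, 2400, 2400], 3000, 300)
    print(ce, flag_region(ce))
```

A strongly negative value suggests that the assembly has collapsed sequence at that position (for example, merged repeat copies), while a strongly positive value suggests inserted or duplicated sequence.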
Specific Aim 3. Assembly reconciliation. We propose to create software that enhances a given draft assembly using alternate draft assemblies of the same genome created from the same read data with different assembly programs, or with the same assembly program using different parameters.

The U.S. government spends hundreds of millions of dollars on whole genome shotgun sequencing. We believe that if the goals of this project are achieved, significantly better and cheaper genomes will be produced. The cost of using our techniques will be negligible compared to the cost of generating reads. Our approach may find more genes and regulatory regions and lead to a better understanding of the genetic structure of the various genomes. The ultimate goal of this project is to improve public health by better understanding the human genome and the genomes of other species.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
3R01HG002945-06S1
Application #
7920507
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Felsenfeld, Adam
Project Start
2009-09-05
Project End
2010-08-31
Budget Start
2009-09-05
Budget End
2010-08-31
Support Year
6
Fiscal Year
2009
Total Cost
$71,250
Indirect Cost
Name
University of Maryland College Park
Department
Other Basic Sciences
Type
Schools of Arts and Sciences
DUNS #
790934285
City
College Park
State
MD
Country
United States
Zip Code
20742
Muñoz, Adriana; Santos Muñoz, Daniella; Zimin, Aleksey et al. (2016) Evolution of transcriptional networks in yeast: alternative teams of transcriptional factors for different species. BMC Genomics 17:826
Li, Gang; Hillier, LaDeana W; Grahn, Robert A et al. (2016) A High-Resolution SNP Array-Based Linkage Map Anchors a New Domestic Cat Draft Genome Assembly and Provides Detailed Patterns of Recombination. G3 (Bethesda) 6:1607-16
Marçais, Guillaume; Yorke, James A; Zimin, Aleksey (2015) QuorUM: An Error Corrector for Illumina Reads. PLoS One 10:e0130821
Schrader, Lukas; Kim, Jay W; Ence, Daniel et al. (2014) Transposable element islands facilitate adaptation to novel environments in an invasive species. Nat Commun 5:5495
Zimin, Aleksey V; Marçais, Guillaume; Puiu, Daniela et al. (2013) The MaSuRCA genome assembler. Bioinformatics 29:2669-77
Patro, Rob; Sefer, Emre; Malin, Justin et al. (2012) Parsimonious reconstruction of network evolution. Algorithms Mol Biol 7:25
Salzberg, Steven L; Phillippy, Adam M; Zimin, Aleksey et al. (2012) GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res 22:557-67
Marçais, Guillaume; Kingsford, Carl (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27:764-70
Dalloul, Rami A; Long, Julie A; Zimin, Aleksey V et al. (2010) Multi-platform next-generation sequencing of the domestic turkey (Meleagris gallopavo): genome assembly and analysis. PLoS Biol 8:
Rapatski, Brandy; Yorke, James (2009) Modeling HIV outbreaks: the male to female prevalence ratio in the core population. Math Biosci Eng 6:135-43
