The two most widely used Next Generation Sequencing (NGS) technologies are 454 sequencing and Illumina sequencing. We propose to determine the best sequencing strategy, that is, the mix of 454 and Illumina read and mate pair data that produces the best possible assembly at the lowest cost.

We propose to continue developing our software for closing gaps and fixing mis-assemblies by our shooting method. We can extend the method to use additional NGS reads and mate pairs to close gaps in existing assemblies, increasing contiguity, and to find and correct mis-assemblies. This method can serve as a cheaper alternative to traditional finishing techniques.

The final product of any assembly project is a set of chromosome sequence files. We propose to develop improved software that produces chromosome sequences from the assembled contigs using mate pair and marker data. Our preliminary version works for assemblies that have large contigs (N50 size >100 kb). Genomes assembled from NGS data typically have small contigs (N50 size of 10-20 kb), so we propose to extend the software to make it applicable to genome assemblies built from NGS data.

We propose to use the experience gained in the previous project period to re-assemble the genomes of chicken, rat, and possibly other organisms of public health interest from the existing Trace Archive data combined with additional NGS data where available. Because NGS data is getting cheaper, many groups are now interested in sequencing various genomes. We therefore propose to produce de novo assemblies of insect and plant genomes, and of the genomes of other organisms of public health interest, in collaboration with the centers that generate the data. Our goal is to serve as an expert genome assembly group that provides its services and techniques to the community.
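The N50 contig sizes quoted above follow the usual definition: the largest length L such that contigs of length at least L together cover at least half of the total assembled bases. The following is a minimal illustrative sketch of that calculation, not code from the project; the function name `n50` and the sample lengths are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Illustrative helper (hypothetical name, not part of any assembler):
    walk the contigs from longest to shortest and return the length at
    which the running total first reaches half of the assembly size.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")

    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Compare 2 * running with total to avoid fractional halves.
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Made-up contig lengths in base pairs, for illustration only.
    example = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print(n50(example))
```

For the made-up lengths in the example, the total is 54, and the three longest contigs (10, 9, 8) sum to 27, so the function returns 8.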

Public Health Relevance

Advances in sequencing technologies have made it possible to obtain large amounts of sequence data quickly and at low cost compared with Sanger sequencing. Our goal is to contribute our techniques, software, and expertise in assembling short-read data to the community. We will continuously improve our methods to obtain the best possible assemblies of new genomes sequenced with the latest technologies. The ultimate goal of this project is to improve public health through a better understanding of the human genome and the genomes of other species.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG002945-09
Application #: 8509756
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Felsenfeld, Adam
Project Start: 2003-08-13
Project End: 2014-05-31
Budget Start: 2013-06-01
Budget End: 2014-05-31
Support Year: 9
Fiscal Year: 2013
Total Cost: $275,160
Indirect Cost: $84,160

Name: University of Maryland College Park
Department: Other Basic Sciences
Type: Schools of Arts and Sciences
DUNS #: 790934285
City: College Park
State: MD
Country: United States
Zip Code: 20742

Zimin, Aleksey V; Marcais, Guillaume; Puiu, Daniela et al. (2013) The MaSuRCA genome assembler. Bioinformatics 29:2669-77
Salzberg, Steven L; Phillippy, Adam M; Zimin, Aleksey et al. (2012) GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res 22:557-67
Marcais, Guillaume; Kingsford, Carl (2011) A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27:764-70
Dalloul, Rami A; Long, Julie A; Zimin, Aleksey V et al. (2010) Multi-platform next-generation sequencing of the domestic turkey (Meleagris gallopavo): genome assembly and analysis. PLoS Biol 8:
Zimin, Aleksey V; Delcher, Arthur L; Florea, Liliana et al. (2009) A whole-genome assembly of the domestic cow, Bos taurus. Genome Biol 10:R42
Rapatski, Brandy; Yorke, James (2009) Modeling HIV outbreaks: the male to female prevalence ratio in the core population. Math Biosci Eng 6:135-43
White, James Robert; Roberts, Michael; Yorke, James A et al. (2008) Figaro: a novel statistical method for vector sequence removal. Bioinformatics 24:462-7
Roberts, Michael; Zimin, Aleksey V; Hayes, Wayne et al. (2008) Improving Phrap-based assembly of the rat using "reliable" overlaps. PLoS One 3:e1836
Sindi, Suzanne S; Hunt, Brian R; Yorke, James A (2008) Duplication count distributions in DNA sequences. Phys Rev E Stat Nonlin Soft Matter Phys 78:061912
Zimin, Aleksey V; Smith, Douglas R; Sutton, Granger et al. (2008) Assembly reconciliation. Bioinformatics 24:42-5

Showing the most recent 10 out of 11 publications