The goal of this grant is to advance the genomic community's ability to accurately, easily and rapidly assemble genome sequences from shotgun sequencing data. Specifically, we plan to enhance and maintain a leading shotgun fragment assembly software program, the Celera Assembler. Shotgun DNA sequence data continues to be produced faster than it can be accurately assembled to determine genomic sequence. This problem will only be compounded as new sequencing machines, capable of producing large volumes of sequence data at low cost, come into active use. Assembly software is continually challenged to keep pace. The Celera Assembler was released into the public domain in 2004 and is managed as an open source project via the SourceForge repository. The quality and accuracy of assembly software have a direct impact on the cost of genomic sequencing projects and genome closure, and the accuracy of the resulting genome sequence has a direct impact on all of the health-related research that uses the sequence. We have assembled an exceptional team, comprising a large portion of the original development team, to improve and maintain the Celera Assembler and support its user base. The principal investigator was the co-leader of Celera Assembler development at Celera, and the three co-investigators all made significant contributions to the algorithms and code base. This team has demonstrated that enhancements to the Celera Assembler can significantly improve the quality of genome assemblies (11, 41). This grant will allow us to make the Celera Assembler more user friendly, more robust, capable of generating higher-quality assemblies, and able to incorporate data from new types of sequencers.
Towards this end, we will simplify and improve the algorithms and code, develop or incorporate analysis tools to assess the quality of assemblies, test the code on multiple computer platforms, debug the code on numerous organism assemblies, and develop a set of challenging benchmark assembly problems, based on real data, for use in rigorous regression testing to validate that improved algorithms yield improved results. All algorithmic improvements will be published in the scientific literature and documented in the code base. The entire code base and the supporting analysis, benchmark and regression tools will be maintained as an open source project.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section: Special Emphasis Panel (ZRG1-BST-L (51))
Program Officer: Lyster, Peter
J. Craig Venter Institute, Inc.
United States
Pearce, S L; Clarke, D F; East, P D et al. (2017) Genomic innovations, transcriptional plasticity and gene loss underlying the evolution and divergence of two highly polyphagous and invasive Helicoverpa pest species. BMC Biol 15:63
Gulia-Nuss, Monika; Nuss, Andrew B; Meyer, Jason M et al. (2016) Genomic insights into the Ixodes scapularis tick vector of Lyme disease. Nat Commun 7:10507
Marinotti, Osvaldo; Cerqueira, Gustavo C; de Almeida, Luiz Gonzaga Paula et al. (2013) The genome of Anopheles darlingi, the main neotropical malaria vector. Nucleic Acids Res 41:7387-400
Koren, Sergey; Schatz, Michael C; Walenz, Brian P et al. (2012) Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol 30:693-700
Prüfer, Kay; Munch, Kasper; Hellmann, Ines et al. (2012) The bonobo genome compared with the chimpanzee and human genomes. Nature 486:527-31
Miller, Webb; Hayes, Vanessa M; Ratan, Aakrosh et al. (2011) Genetic diversity and population structure of the endangered marsupial Sarcophilus harrisii (Tasmanian devil). Proc Natl Acad Sci U S A 108:12348-53
Koren, Sergey; Miller, Jason R; Walenz, Brian P et al. (2010) An algorithm for automated closure during assembly. BMC Bioinformatics 11:457
Miller, Jason R; Koren, Sergey; Sutton, Granger (2010) Assembly algorithms for next-generation sequencing data. Genomics 95:315-27
Rausch, Tobias; Koren, Sergey; Denisov, Gennady et al. (2009) A consistency-based consensus algorithm for de novo and reference-guided sequence assembly of short reads. Bioinformatics 25:1118-24
Denisov, Gennady; Walenz, Brian; Halpern, Aaron L et al. (2008) Consensus generation and variant detection by Celera Assembler. Bioinformatics 24:1035-40

Showing the most recent 10 out of 12 publications