Integrated Assembly Software for Sanger and Next Generation Sequence Technologies

Durfee, Timothy

Abstract

The advent of next-generation (Next-gen) sequencing technologies has set off a surge in whole genome sequencing and resequencing, exemplified spectacularly by four papers describing five complete human genomes in 2008 alone. One company, Knome, now even offers customers their entire genome sequence using Next-gen sequencing technology. These developments, together with targeted resequencing of genomes, presage the day of the $1000 human genome. Broad-scale whole human genome resequencing (WHGR) will have enormous impact on the areas of personalized medicine, human evolution and human diversity. To fully realize that potential, however, software capabilities must be dramatically enhanced to meet three significant challenges: the sheer volume of data generated in these projects, the diversity of technology-specific data characteristics, and the size of the 6 billion base pair diploid human genome itself. Moreover, we foresee the day when technology improvements and cost reductions make WHGR as commonplace as bacterial genome sequencing is today. For that to occur, assembly and analysis software must be accessible to a far broader and less computer-savvy range of researchers than the highly specialized bioinformatics teams that decode the information now. In addition, computer resources are far more limited, even for a well-funded research laboratory, than those available to a large sequencing center. Therefore, the overall goal of this proposal is to develop a Next-gen sequence assembly and analysis pipeline, DESKAPP, that will run on an affordable ($5000) high-end desktop computer and produce a human genome sequence in a reasonable timeframe (days, not weeks). WHGR by DESKAPP will involve a reference-guided main assembly as well as a de novo assembly branch to characterize regions of the new genome that are unique relative to the reference.
Merging the assemblies produces a complete sequence that can be evaluated for gene content, single nucleotide polymorphisms (SNPs) and structural variation (SV: indels, inversions, translocations), both by web-based searches of external databases to identify known allelic variation and by direct examination of the sequence to identify new polymorphisms. A Disk Sort Alignment (DSA) algorithm allows data sets that are far too large for in-memory processing to be evaluated and clustered for assembly by SeqMan N-Gen (SM N-Gen), our desktop assembly engine. Using a prototype DSA-SM N-Gen pipeline, we have processed the entire 7.4x 454 data set from the James Watson genome to a layout file in 31 hours using DSA and have assembled three chromosomes (8, 21, and X) using SM N-Gen. Assembly times varied from 1 hour for Chromosome 21 to 10.6 hours for an average-sized chromosome, such as Chromosome 8. Together, these results demonstrate the feasibility of constructing a DESKAPP pipeline for WHGR. The Phase II Aims are designed to build upon this foundation and produce a seamless pipeline for the desktop assembly and analysis of a human genome in a matter of days.
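
The abstract does not describe how Disk Sort Alignment works internally, so the following is not the DSA algorithm itself. It is a minimal sketch of the general technique the name suggests, an external (disk-based) merge sort, shown here ordering alignment hits by reference position when they cannot all be held in memory at once. The record layout (position, read ID), the function names, and the chunk_size parameter are all assumptions made for illustration.

```python
import heapq
import os
import tempfile
from itertools import islice


def _spill(sorted_chunk, tmp_dir):
    """Write one already-sorted chunk to a temporary run file (hypothetical helper)."""
    fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".run", text=True)
    with os.fdopen(fd, "w") as handle:
        for pos, read_id in sorted_chunk:
            # Assumes read IDs contain no tabs or newlines.
            handle.write(f"{pos}\t{read_id}\n")
    return path


def _read_run(path):
    """Stream (pos, read_id) tuples back from a run file, one line at a time."""
    with open(path) as handle:
        for line in handle:
            pos, read_id = line.rstrip("\n").split("\t")
            yield int(pos), read_id


def external_sort(hits, chunk_size, tmp_dir):
    """Yield hits in (pos, read_id) order, sorting at most chunk_size in memory at once."""
    hits = iter(hits)
    runs = []
    # Phase 1: sort memory-sized chunks and spill each one to disk.
    while True:
        chunk = sorted(islice(hits, chunk_size))
        if not chunk:
            break
        runs.append(_spill(chunk, tmp_dir))
    # Phase 2: k-way merge of the sorted runs, reading them lazily.
    try:
        yield from heapq.merge(*(_read_run(p) for p in runs))
    finally:
        for p in runs:
            os.remove(p)
```

Because the merge phase streams one record per run at a time, peak memory is governed by chunk_size rather than by the total number of hits, which is the property that makes this family of methods suitable for desktop hardware. Once hits are ordered by position, neighbouring records can be grouped into clusters in a single pass.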

Public Health Relevance

Next-gen sequencing technologies have started a new revolution throughout biology by providing DNA sequence data in unprecedented quantities at continually decreasing costs. This data will be invaluable in the emerging era of personalized medicine and in exploring the immense diversity of life. The goal of this project is to develop desktop computer software that will enable research laboratories and clinics of any size to realize the promise of these new technologies.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Small Business Innovation Research Grants (SBIR) - Phase II (R44)
Project #: 2R44GM082117-02A1
Application #: 7746688
Study Section: Special Emphasis Panel (ZRG1-GGG-J (10))
Program Officer: Lyster, Peter

Project Start: 2007-09-01
Project End: 2011-12-31
Budget Start: 2010-01-01
Budget End: 2010-12-31
Support Year: 2
Fiscal Year: 2010
Total Cost: $757,868

Institution

Name: Dnastar, Inc.
DUNS #: 130194947

City: Madison
State: WI
Country: United States
Zip Code: 53705

Related projects


NIH 2011 R44 GM	Integrated Assembly Software for Sanger and Next Generation Sequence Technologies Durfee, Timothy J. / Dnastar, Inc.	$722,920
NIH 2010 R44 GM	Integrated Assembly Software for Sanger and Next Generation Sequence Technologies Durfee, Timothy J. / Dnastar, Inc.	$757,868
