Complete genome de novo assembly software for the emerging long read sequencing era

Durfee, Timothy

Abstract

Despite the tremendous success of short read next-generation sequencing (NGS) technologies, their inherent inability to establish long range connectivity makes fundamental tasks such as genome closure, haplotype phasing and alternatively spliced transcript characterization all but impossible. Now, two long read sequencing providers, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), are producing data that can overcome these critical shortcomings. PacBio is capable of producing 10-20 kb reads and has seen increased adoption for closing microbial genomes in particular, but also for eukaryotic genomics and transcriptomics. ONT's MinION device is a portable real-time sequencing platform capable of producing 100 kb reads and has already been successfully applied to microbial sequencing and pathogen identification. ONT's new high-throughput instrument, the PromethION, is being released in 2016 and will have sufficient output for human genome scale experiments. The tremendous potential of both technologies is currently hampered by high error rates (10-20%), which make assembly and consensus calling extremely computationally challenging. Various command line software programs have been developed to tackle these challenges, but they typically require substantial bioinformatic expertise and computing resources/savvy and do not address the critical hurdles associated with diploid genomes. With long read sequencing poised to become a major resource for genomics, there is clearly an urgent need for integrated, easy-to-use assembly and analysis software that can handle and exploit the unique aspects of this data. Toward that end, we have developed a prototype de novo assembler based on our patented Disk Sort Alignment (DSA) algorithm that can assemble an uncorrected bacterial genome data set into a single contig with >99.2% base accuracy on a standard desktop computer in less than 3.5 hours.
The assembler uses DSA-determined read overlaps to construct an assembly string graph, from which a layout is fed to a novel consensus generator designed to maximize accuracy from this error-prone data. The overall goal of this direct-to-Phase II proposal is to transform the prototype into a fully scalable long read de novo assembler for both haploid and diploid genomes. We will first optimize the performance of the assembler components, building a solid foundation from which to incorporate the essential diploid-aware capabilities of 1) identifying large structural variation between two sister chromosomes, 2) adapting the consensus base caller to handle heterozygous SNVs and small indels and 3) exploiting the long range connectivity of the data to properly phase the variants and produce accurate haplotype sequences. Finally, we will leverage these tools to identify alternatively spliced transcripts and allele-specific expression from long read RNA-Seq data. Consistent with DNASTAR's 30-year history of delivering easy-to-use, expert-level software, this assembler will give any user access to these revolutionary long read sequencing technologies and those to come.
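
The overlap, layout and consensus stages named above can be illustrated with a toy sketch. This is not the DSA algorithm or DNASTAR's consensus generator, neither of which is described here. It assumes error-free reads from a single strand, finds overlaps by exact suffix-prefix matching, and uses a simple greedy layout in place of a true string graph. All function names (suffix_prefix_overlap, overlap_graph, greedy_layout, consensus, assemble) are hypothetical and chosen for illustration only.

```python
from collections import Counter, defaultdict


def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a that exactly equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads, min_len):
    """All-vs-all exact overlaps: maps (i, j) to the overlap length of read i followed by read j."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = suffix_prefix_overlap(a, b, min_len)
                if k:
                    edges[(i, j)] = k
    return edges


def greedy_layout(reads, edges):
    """Greedy stand-in for a string graph: keep the longest overlaps so that each
    read has at most one successor and one predecessor, and no cycle is formed."""
    succ, pred = {}, {}
    for (i, j), k in sorted(edges.items(), key=lambda e: -e[1]):
        if i in succ or j in pred:
            continue
        tail = j
        while tail in succ:          # walk the chain starting at j
            tail = succ[tail][0]
        if tail == i:                # adding i -> j would close a cycle
            continue
        succ[i] = (j, k)
        pred[j] = i
    layouts = []
    for start in range(len(reads)):
        if start in pred:
            continue                 # not the first read of a chain
        node, offset, path = start, 0, []
        while True:
            path.append((offset, reads[node]))
            if node not in succ:
                break
            nxt, k = succ[node]
            offset += len(reads[node]) - k
            node = nxt
        layouts.append(path)
    return layouts


def consensus(layout):
    """Column-wise majority vote over reads placed at their layout offsets."""
    columns = defaultdict(Counter)
    for offset, read in layout:
        for pos, base in enumerate(read):
            columns[offset + pos][base] += 1
    return "".join(columns[c].most_common(1)[0][0] for c in sorted(columns))


def assemble(reads, min_len=4):
    """Overlap -> layout -> consensus; returns one sequence per contig."""
    edges = overlap_graph(reads, min_len)
    return [consensus(layout) for layout in greedy_layout(reads, edges)]
```

Real long read data, with the 10-20% error rates noted in the abstract, would need error-tolerant alignment instead of exact matching, handling of both strands, and a far more careful consensus step. The majority vote here only hints at how stacked reads can outvote an isolated base error.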

Public Health Relevance

Emerging "long read" technologies have the ability to sequence DNA molecules one thousand times longer than current "next-generation" instruments. This remarkable advance has tremendous implications for genomic sciences, including supporting enhanced understanding of the causes of and cures for human disease. In this project, we will develop the software needed to accurately convert this new data into truly complete genome sequences for any individual.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Small Business Innovation Research Grants (SBIR) - Phase II (R44)
Project #: 1R44GM122120-01
Application #: 9255092
Study Section: Special Emphasis Panel (ZRG1-IMST-K (14)B)
Program Officer: Ravichandran, Veerasamy

Project Start: 2017-03-01
Project End: 2019-02-28
Budget Start: 2017-03-01
Budget End: 2018-02-28
Support Year: 1
Fiscal Year: 2017
Total Cost: $749,795

Institution

Name: Dnastar, Inc.
Type: Domestic for-Profits
DUNS #: 130194947

City: Madison
State: WI
Country: United States
Zip Code: 53705

Related projects


NIH 2018 R44 GM	Complete genome de novo assembly software for the emerging long read sequencing era Durfee, Timothy J. / Dnastar, Inc.
NIH 2017 R44 GM	Complete genome de novo assembly software for the emerging long read sequencing era Durfee, Timothy J. / Dnastar, Inc.	$749,795
