Despite the tremendous success of short-read next-generation sequencing (NGS) technologies, their inherent inability to establish long-range connectivity makes fundamental tasks such as genome closure, haplotype phasing, and characterization of alternatively spliced transcripts all but impossible. Now, two long-read sequencing providers, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), are producing data that can overcome these critical shortcomings. PacBio is capable of producing 10-20 kb reads and has seen increased adoption, particularly for closing microbial genomes but also for eukaryotic genomics and transcriptomics. ONT's MinION device is a portable real-time sequencing platform capable of producing 100 kb reads and has already been successfully applied to microbial sequencing and pathogen identification. ONT's new high-throughput instrument, the PromethION, is being released in 2016 and will have sufficient output for human genome-scale experiments. The tremendous potential of both technologies is currently hampered by high error rates (10-20%), which make assembly and consensus calling extremely computationally challenging. Various command-line software programs have been developed to tackle these challenges, but they typically require substantial bioinformatics expertise and computing resources and savvy, and they do not address the critical hurdles associated with diploid genomes. With long-read sequencing poised to become a major resource for genomics, there is clearly an urgent need for integrated, easy-to-use assembly and analysis software that can handle and exploit the unique aspects of this data. Toward that end, we have developed a prototype de novo assembler based on our patented Disk Sort Alignment (DSA) algorithm that can assemble an uncorrected bacterial genome data set into a single contig with >99.2% base accuracy on a standard desktop computer in less than 3.5 hours.
The assembler uses DSA-determined read overlaps to construct an assembly string graph, from which a layout is fed to a novel consensus generator designed to maximize accuracy from this error-prone data. The overall goal of this Direct-to-Phase II proposal is to transform the prototype into a fully scalable long-read de novo assembler for both haploid and diploid genomes. We will first optimize the performance of the assembler components, building a solid foundation from which to incorporate the essential diploid-aware capabilities of 1) identifying large structural variation between two sister chromosomes, 2) adapting the consensus base caller to handle heterozygous SNVs and small indels, and 3) exploiting the long-range connectivity of the data to properly phase the variants and produce accurate haplotype sequences. Finally, we will leverage these tools to identify alternatively spliced transcripts and allele-specific expression from long-read RNA-Seq data. Consistent with DNASTAR's 30-year history of delivering easy-to-use, expert-level software, this assembler will give any user access to these revolutionary long-read sequencing technologies and those to come.
Emerging "long read" technologies have the ability to sequence DNA molecules one thousand times longer than current "next-generation" instruments. This remarkable advance has tremendous implications for genomic sciences, including supporting enhanced understanding of the causes of and cures for human disease. In this project, we will develop the software needed to accurately convert this new data into truly complete genome sequences for any individual.