We will soon be able to exhaustively sequence the DNA and RNA of entire communities of bacteria, as well as every individual cell of a tumor. These two very different applications of sequencing share the need to rapidly and efficiently sort through large amounts of noisy sequence data (dozens to hundreds of terabases) to separate signal from noise and produce biological insight. However, current bioinformatics approaches for extracting information from this data cannot easily handle the vast amounts of data being acquired. The primary challenges in processing this sequence data are twofold: the relatively high error rate of 0.1-1% per base, and the volume of data we can now easily acquire with sequencers such as the Illumina HiSeq. For years, sequencing capacity has been doubling every 6 months, significantly faster than compute capacity. Since almost all extant bioinformatics analysis approaches require multiple passes across the primary data, and many analysis algorithms have not been parallelized, bioinformatics analysis capacity continues to lag ever further behind data generation capacity. In addition, many of the existing software packages cannot easily be retooled to take advantage of many-core or GPU algorithms, and hence will not benefit from expected advances in compute capacity and cyberinfrastructure. We propose to develop and implement novel streaming approaches for lossy compression and error correction in shotgun sequencing data. Our algorithms are few-pass (< 2 passes), require no sample-specific information, and can be implemented in fixed or low memory; moreover, they are amenable to parallelization and can run efficiently in many-core environments. When implemented as a prefilter to existing analysis packages, our approaches will eliminate or correct the majority of errors in data sets, dramatically reducing the computational space and time requirements for downstream analysis using existing packages.
Moreover, we will provide novel capability by extending error correction approaches to mRNAseq and metagenomic data sets.

Intellectual Merit: We will develop a range of algorithms for space- and time-efficient compression and error correction of short-read DNA and RNA sequence data. These strategies will substantially increase the scalability of many downstream analysis applications, ranging from community analysis of metagenomes to resequencing analysis of humans. We will provide analyses describing the tradeoffs between space and time efficiency and sensitivity, and deliver tested, documented reference implementations of our approaches that can be used by the community for practical evaluation and incorporation into analysis tools. Our approaches will significantly impact short-read sequence analysis by introducing efficient and effective streaming approaches to the two most common types of short-read analysis, mapping and assembly.
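
To make the idea of a single-pass, fixed-memory prefilter concrete, the following is a minimal sketch of one way streaming lossy compression of reads could work. It is an illustration under stated assumptions, not the project's reference implementation: the names `CountMinSketch` and `stream_filter`, the k-mer size, the cutoff, and the table dimensions are all hypothetical choices, and the rule used here (keep a read only while the median count of its k-mers is below a cutoff) is one plausible filtering criterion rather than a method stated in the abstract. Reverse-complement handling is omitted for brevity.

```python
import hashlib
from statistics import median


class CountMinSketch:
    """Approximate counter in fixed memory; it can overcount but never undercounts."""

    def __init__(self, width=100_003, depth=4):
        # Hypothetical sizes: memory use is width * depth counters,
        # independent of how many reads are streamed through.
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item):
        # One salted hash per row gives `depth` independent table positions.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, item):
        for row, col in self._slots(item):
            self.tables[row][col] += 1

    def count(self, item):
        # The smallest counter across rows is the least-inflated estimate.
        return min(self.tables[row][col] for row, col in self._slots(item))


def kmers(read, k):
    """Return all overlapping substrings of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def stream_filter(reads, k=5, cutoff=3, sketch=None):
    """Yield reads whose median k-mer count so far is below `cutoff`.

    Each read is examined once, in arrival order, so the whole
    data set is processed in a single pass.
    """
    if sketch is None:
        sketch = CountMinSketch()
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k: nothing to count
        if median(sketch.count(w) for w in words) < cutoff:
            # Only retained reads update the counts.
            for w in words:
                sketch.add(w)
            yield read
```

For instance, feeding this filter ten copies of one read followed by a single distinct read retains only the first few copies of the repeated read plus the distinct one, which is the sense in which the compression is lossy: redundant coverage is discarded while novel sequence is kept. The choice of a count-min sketch is what keeps memory fixed, at the cost of occasional overcounting from hash collisions.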

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG007513-02
Application #: 8703739
Study Section: Special Emphasis Panel (ZRG1-BST-N (52))
Program Officer: Bonazzi, Vivien
Project Start: 2013-07-19
Project End: 2016-05-31
Budget Start: 2014-06-01
Budget End: 2015-05-31
Support Year: 2
Fiscal Year: 2014
Total Cost: $204,209
Indirect Cost: $67,318

Name: Michigan State University
Department: Biostatistics & Other Math Sci
Type: Schools of Engineering
DUNS #: 193247145
City: East Lansing
State: MI
Country: United States
Zip Code: 48824

Crusoe, Michael R; Brown, C Titus (2016) Channeling Community Contributions to Scientific Software: A Sprint Experience. J Open Res Softw 4:
Crusoe, Michael R; Brown, C Titus (2016) Walking the Talk: Adopting and Adapting Sustainable Scientific Software Development processes in a Small Biology Lab. J Open Res Softw 4:
Crusoe, Michael R; Alameldin, Hussien F; Awad, Sherine et al. (2015) The khmer software package: enabling efficient nucleotide sequence analysis. F1000Res 4:900
Howe, Adina Chuang; Jansson, Janet K; Malfatti, Stephanie A et al. (2014) Tackling soil diversity with the assembly of large, complex metagenomes. Proc Natl Acad Sci U S A 111:4904-9
Wilson, Greg; Aruliah, D A; Brown, C Titus et al. (2014) Best practices for scientific computing. PLoS Biol 12:e1001745
Lau, Maggie C Y; Cameron, Connor; Magnabosco, Cara et al. (2014) Phylogeny and phylogeography of functional genes shared among seven terrestrial subsurface metagenomes reveal N-cycling and microbial evolutionary relationships. Front Microbiol 5:531
Zhang, Qingpeng; Pell, Jason; Canino-Koning, Rosangela et al. (2014) These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PLoS One 9:e101271
Goffredi, Shana K; Yi, Hana; Zhang, Qingpeng et al. (2014) Genomic versatility and functional variation between two dominant heterotrophic symbionts of deep-sea Osedax worms. ISME J 8:908-24