We will soon be able to exhaustively sequence the DNA and RNA of entire communities of bacteria, as well as every individual cell of a tumor. These two very different applications of sequencing share the need to rapidly and efficiently sort through large amounts of noisy sequence data (dozens to hundreds of terabases) to separate signal from noise and produce biological insight. However, current bioinformatics approaches for extracting information from this data cannot easily handle the vast amounts of data being acquired. The primary challenges in processing this sequence data are twofold: the relatively high error rate of 0.1-1% per base, and the volume of data we can now easily acquire with sequencers such as the Illumina HiSeq. For years, sequencing capacity has been doubling every 6 months, significantly faster than compute capacity. Since almost all extant bioinformatics analysis approaches require multiple passes across the primary data, and many analysis algorithms have not been parallelized, bioinformatics analysis capacity continues to lag ever further behind data generation capacity. In addition, many of the existing software packages cannot easily be retooled to take advantage of many-core or GPU algorithms, and hence will not benefit from expected advances in compute capacity and cyberinfrastructure.

We propose to develop and implement novel streaming approaches for lossy compression and error correction in shotgun sequencing data. Our algorithms are few-pass (< 2), require no sample-specific information, and can be implemented in fixed or low memory; moreover, they are amenable to parallelization and can run efficiently in many-core environments. When implemented as a prefilter to existing analysis packages, our approaches will eliminate or correct the majority of errors in data sets, dramatically reducing the computational space and time requirements for downstream analysis using existing packages.
Moreover, we will provide novel capability by extending error correction approaches to mRNAseq and metagenomic data sets.

Intellectual Merit: We will develop a range of algorithms for space- and time-efficient compression and error correction of short-read DNA and RNA sequence data. These strategies will substantially increase the scalability of many downstream analysis applications, ranging from community analysis of metagenomes to resequencing analysis of humans. We will provide analyses describing the tradeoffs between space and time efficiency and sensitivity, and deliver tested, documented reference implementations of our approaches that can be used by the community for practical evaluation and incorporation into analysis tools. Our approaches will significantly impact short-read sequence analysis by introducing efficient and effective streaming approaches to the two most common types of short-read analysis, mapping and assembly.
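
To make the idea of a single-pass, fixed-memory prefilter concrete, here is a minimal sketch of how streaming lossy compression of reads could look. It is an illustration under stated assumptions, not the project's reference implementation: it assumes a Count-Min-Sketch-style k-mer counter and a median k-mer abundance cutoff for deciding whether a read still adds new information. The names `CountMinSketch` and `stream_filter`, and the parameters `k`, `cutoff`, `width`, and `depth`, are hypothetical. Reverse complements and error correction are deliberately left out.

```python
import hashlib
from statistics import median


class CountMinSketch:
    """Fixed-memory approximate counter (hypothetical helper).

    Counts can be overestimated because of hash collisions,
    but they are never underestimated.
    """

    def __init__(self, width=100_003, depth=4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # One salted hash per row gives `depth` independent table positions.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            yield row, int.from_bytes(digest, "big") % self.width

    def add(self, kmer):
        for row, col in self._slots(kmer):
            self.tables[row][col] += 1

    def count(self, kmer):
        # The minimum over rows is the least collision-inflated estimate.
        return min(self.tables[row][col] for row, col in self._slots(kmer))


def kmers(read, k):
    """Return all overlapping substrings of length k."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def stream_filter(reads, k=20, cutoff=20, sketch=None):
    """Yield reads whose median k-mer count so far is below `cutoff`.

    Each read is inspected exactly once, and memory use is fixed by the
    sketch size, independent of how many reads are streamed through.
    """
    if sketch is None:
        sketch = CountMinSketch()
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k: nothing to count
        if median(sketch.count(w) for w in words) < cutoff:
            # Only reads that are kept contribute to the counts.
            for w in words:
                sketch.add(w)
            yield read


if __name__ == "__main__":
    reads = ["ACGTACGTAC"] * 10 + ["TTTTGGGGCC"]
    for kept in stream_filter(reads, k=4, cutoff=3):
        print(kept)
```

In this toy run the highly redundant read is passed through only until its k-mers are well represented, while the novel read is retained. A filter of this shape can sit in front of an existing mapper or assembler without the downstream tool needing to change.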

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section: Special Emphasis Panel (ZRG1-BST-N (52))
Program Officer: Bonazzi, Vivien
Michigan State University
Biostatistics & Other Math Sci
Schools of Engineering
East Lansing
United States
Zhang, Qingpeng; Pell, Jason; Canino-Koning, Rosangela et al. (2014) These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PLoS One 9:e101271