BIGDATA: Low-Memory Streaming Prefilters for Biological Sequencing Data

Brown, Charles

Abstract

We will soon be able to exhaustively sequence the DNA and RNA of entire communities of bacteria, as well as every individual cell of a tumor. These two very different applications of sequencing share the need to rapidly and efficiently sort through large amounts of noisy sequence data (dozens to hundreds of terabases) to separate signal from noise and produce biological insight. However, current bioinformatics approaches for extracting information from this data cannot easily handle the vast amounts of data being acquired. The primary challenges in processing this sequence data are twofold: the relatively high error rate of 0.1-1% per base, and the volume of data we can now easily acquire with sequencers such as the Illumina HiSeq. For years, sequencing capacity has been doubling every 6 months, significantly faster than compute capacity. Since almost all extant bioinformatics analysis approaches require multiple passes across the primary data, and many analysis algorithms have not been parallelized, bioinformatics analysis capacity continues to lag ever further behind data generation capacity. In addition, many of the existing software packages cannot easily be retooled to take advantage of many-core or GPU algorithms, and hence will not benefit from expected advances in compute capacity and cyberinfrastructure. We propose to develop and implement novel streaming approaches for lossy compression and error correction in shotgun sequencing data. Our algorithms are few-pass (< 2), require no sample-specific information, and can be implemented in fixed or low memory; moreover, they are amenable to parallelization and can run efficiently in many-core environments. When implemented as a prefilter to existing analysis packages, our approaches will eliminate or correct the majority of errors in data sets, dramatically reducing the computational space and time requirements for downstream analysis using existing packages. Moreover, we will provide novel capability by extending error correction approaches to mRNAseq and metagenomic data sets.

Intellectual Merit: We will develop a range of algorithms for space- and time-efficient compression and error correction of short-read DNA and RNA sequence data. These strategies will substantially increase the scalability of many downstream analysis applications, ranging from community analysis of metagenomes to resequencing analysis of humans. We will provide analyses describing the tradeoffs between space and time efficiency and sensitivity, and deliver tested, documented reference implementations of our approaches that can be used by the community for practical evaluation and incorporation into analysis tools. Our approaches will significantly impact short-read sequence analysis by introducing efficient and effective streaming approaches to the two most common types of short-read analysis, mapping and assembly.
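
To make the idea of a single-pass, fixed-memory prefilter concrete, the following is a minimal illustrative sketch and not the project's reference implementation. It assumes a Count-Min sketch as the probabilistic k-mer counter and a median-abundance cutoff as the rule for discarding redundant reads (a form of lossy compression). All names here (CountMinSketch, streaming_prefilter, the k and cutoff parameters) are hypothetical and chosen for illustration; reverse-complement handling and error correction are omitted for brevity.

```python
import hashlib
from statistics import median
from typing import Iterable, Iterator, List, Optional


class CountMinSketch:
    """Fixed-memory approximate counter (illustrative, not a library API).

    Counts can be overestimated because of hash collisions,
    but they are never underestimated.
    """

    def __init__(self, width: int = 1 << 16, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        # Memory use is width * depth counters, independent of input size.
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item: str) -> List[int]:
        # One salted hash per row gives `depth` independent slot choices.
        slots = []
        for row in range(self.depth):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([row])
            ).digest()
            slots.append(int.from_bytes(digest, "big") % self.width)
        return slots

    def add(self, item: str) -> None:
        for row, slot in enumerate(self._slots(item)):
            self.tables[row][slot] += 1

    def count(self, item: str) -> int:
        # The minimum over rows is the least collision-inflated estimate.
        return min(self.tables[row][slot]
                   for row, slot in enumerate(self._slots(item)))


def kmers(read: str, k: int) -> List[str]:
    """Return all overlapping substrings of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def streaming_prefilter(reads: Iterable[str], k: int = 5, cutoff: int = 3,
                        sketch: Optional[CountMinSketch] = None) -> Iterator[str]:
    """Yield reads in one pass, dropping those whose k-mers are already well covered.

    A read is dropped when the median estimated abundance of its k-mers
    has reached `cutoff` (an assumed, tunable threshold).
    """
    if sketch is None:
        sketch = CountMinSketch()
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k: nothing to count
        if median(sketch.count(w) for w in words) >= cutoff:
            continue  # redundant coverage: discard without updating counts
        for w in words:
            sketch.add(w)
        yield read
```

Because the filter is a generator, it can sit between a read parser and a downstream mapper or assembler without holding the data set in memory. Under these assumptions, feeding it ten copies of one read followed by one unrelated read with cutoff=3 would pass the first three copies and the unrelated read, and drop the rest.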

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG007513-02
Application #: 8703739
Study Section: Special Emphasis Panel (ZRG1-BST-N (52))
Program Officer: Bonazzi, Vivien

Project Start: 2013-07-19
Project End: 2016-05-31
Budget Start: 2014-06-01
Budget End: 2015-05-31
Support Year: 2
Fiscal Year: 2014
Total Cost: $204,209
Indirect Cost: $67,318

Institution

Name: Michigan State University
Department: Biostatistics & Other Math Sci
Type: Schools of Engineering
DUNS #: 193247145

City: East Lansing
State: MI
Country: United States
Zip Code: 48824

Related projects


NIH 2015 R01 HG	BIGDATA: Low-Memory Streaming Prefilters for Biological Sequencing Data Brown, C. Titus / University of California Davis	$211,884
NIH 2014 R01 HG	BIGDATA: Low-Memory Streaming Prefilters for Biological Sequencing Data Brown, Charles Titus / Michigan State University	$204,209
NIH 2013 R01 HG	BIGDATA: Low-Memory Streaming Prefilters for Biological Sequencing Data Brown, Charles Titus / Michigan State University	$249,888

Publications

Crusoe, Michael R; Brown, C Titus (2016) Channeling Community Contributions to Scientific Software: A Sprint Experience. J Open Res Softw 4:

Crusoe, Michael R; Brown, C Titus (2016) Walking the Talk: Adopting and Adapting Sustainable Scientific Software Development processes in a Small Biology Lab. J Open Res Softw 4:

Crusoe, Michael R; Alameldin, Hussien F; Awad, Sherine et al. (2015) The khmer software package: enabling efficient nucleotide sequence analysis. F1000Res 4:900

Howe, Adina Chuang; Jansson, Janet K; Malfatti, Stephanie A et al. (2014) Tackling soil diversity with the assembly of large, complex metagenomes. Proc Natl Acad Sci U S A 111:4904-9

Wilson, Greg; Aruliah, D A; Brown, C Titus et al. (2014) Best practices for scientific computing. PLoS Biol 12:e1001745

Lau, Maggie C Y; Cameron, Connor; Magnabosco, Cara et al. (2014) Phylogeny and phylogeography of functional genes shared among seven terrestrial subsurface metagenomes reveal N-cycling and microbial evolutionary relationships. Front Microbiol 5:531

Zhang, Qingpeng; Pell, Jason; Canino-Koning, Rosangela et al. (2014) These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. PLoS One 9:e101271

Goffredi, Shana K; Yi, Hana; Zhang, Qingpeng et al. (2014) Genomic versatility and functional variation between two dominant heterotrophic symbionts of deep-sea Osedax worms. ISME J 8:908-24
