We propose to investigate new computational approaches to two central problems of high-throughput sequence analysis: (1) quantification of transcript and species abundance in RNAseq and metagenomic data, and (2) improved error correction of sequencing reads. The proposed approaches to both problems derive from the ability to quickly count every instance of every k-mer (string of length k) within huge collections of sequence data. Extensive preliminary work on this problem, embodied in the k-mer counting software Jellyfish published by the project personnel, will be brought to bear and extended. Existing mapping-based computational techniques for quantifying transcript abundance have found wide applicability, but read mapping is error-prone due to, e.g., splice junctions, microexons, and variation from the reference sequence.
Aim 1 seeks to develop an alternative, mapping-free approach to transcript quantification from sequencing data that relies on clustering normalized k-mer count vectors to identify k-mers that are indicative of transcript or gene abundance. These k-mers form profiles that can be used to rapidly quantify expression of the given transcript or gene in subsequent experiments, with limited computational effort and without the challenging read-mapping step.
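As a rough illustration of what a "normalized k-mer count vector" could look like, the sketch below counts k-mers in each sample and expresses every k-mer as a tuple of per-sample relative frequencies. It is a toy in plain Python, not Jellyfish and not the method proposed in Aim 1: the function names `count_kmers` and `normalized_profile` are hypothetical, normalization by total k-mer count is an assumed choice, and reverse complements are ignored for simplicity. The resulting vectors are the kind of input a clustering step would consume.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every occurrence of every length-k substring in a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def normalized_profile(samples, k):
    """Map each k-mer to its vector of relative frequencies, one entry per sample.

    Assumption: each sample is normalized by its own total k-mer count.
    """
    per_sample = [count_kmers(reads, k) for reads in samples]
    # Guard against empty samples so the division below is always defined.
    totals = [sum(c.values()) or 1 for c in per_sample]
    kmers = sorted(set().union(*per_sample))
    return {
        kmer: tuple(c[kmer] / t for c, t in zip(per_sample, totals))
        for kmer in kmers
    }
```

K-mers whose vectors rise and fall together across samples would be candidates for the same cluster; how those clusters are formed and linked to transcripts or genes is left open here.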
Aim 2 tackles the problem of error correction of genomic and, more speculatively, RNAseq reads by developing more accurate k-mer filtering methods and more compact de Bruijn graph representations. The new filtering procedures aim to better distinguish correct from erroneous k-mers by simultaneously considering their position within the reads and the distribution of their quality scores across reads. Improved error correction and de Bruijn graph representations will support more efficient algorithms for super-read and unitig creation, the initial stages of assembly. The methods and software developed for both aims will make it substantially more feasible to complete high-throughput sequence analysis and assembly on widely available commodity computers.
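To make the idea of filtering k-mers by position and quality more concrete, the following sketch records, for each k-mer occurrence, its start position in the read and the lowest Phred quality among its bases, then applies a simple threshold rule. Everything here is an assumption for illustration: the names `kmer_evidence` and `trusted_kmers`, the thresholds `min_count` and `min_quality`, and the rule itself are hypothetical and are not the filtering procedure proposed in Aim 2. The toy rule uses only the quality values; positions are collected so that a more refined rule could use them.

```python
from collections import defaultdict


def kmer_evidence(reads, k):
    """Collect (start position, minimum base quality) for every k-mer occurrence.

    `reads` is a list of (sequence, qualities) pairs, where `qualities` holds
    one Phred score per base.
    """
    evidence = defaultdict(list)
    for seq, quals in reads:
        for i in range(len(seq) - k + 1):
            evidence[seq[i:i + k]].append((i, min(quals[i:i + k])))
    return evidence


def trusted_kmers(evidence, min_count=2, min_quality=20):
    """Hypothetical rule: keep k-mers with enough high-quality occurrences."""
    trusted = set()
    for kmer, occurrences in evidence.items():
        # Count only occurrences whose weakest base meets the quality threshold.
        good = [q for _, q in occurrences if q >= min_quality]
        if len(good) >= min_count:
            trusted.add(kmer)
    return trusted
```

A k-mer that appears only once, or only where a low-quality base is involved, is left out of the trusted set and would be a candidate for correction.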

Public Health Relevance

High-throughput sequence data has very recently become a widely used tool in basic and applied biological research. RNA transcript sequencing has been used to gain insight into developmental processes in model organisms and to understand the etiology and mechanism of diseases, and metagenomic sequence data has revealed the complex symbiosis between microorganisms and their environments. The more computationally efficient approaches proposed here to analyze this data will allow even research labs with limited computational power to quantify transcript and gene abundance faster and more accurately, leading to a better understanding of microbial communities, causes of genetic diseases, and responses to varied environments.

Agency: National Institutes of Health (NIH)
Type: Exploratory/Developmental Grants (R21)
Project #: 1R21HG006913-01
Application #: 8359445
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Bonazzi, Vivien
Project Start:
Project End:
Budget Start:
Budget End:
Support Year: 1
Fiscal Year: 2012
Total Cost:
Indirect Cost:

Name: University of Maryland College Park
Department: Biostatistics & Other Math Sci
Type: Other Specialized Schools
DUNS #: 790934285
City: College Park
State: MD
Country: United States
Zip Code: 20742
Patro, Rob; Mount, Stephen M; Kingsford, Carl (2014) Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat Biotechnol 32:462-4
Duggal, Geet; Wang, Hao; Kingsford, Carl (2014) Higher-order chromatin domains link eQTLs with the expression of far-away genes. Nucleic Acids Res 42:87-96