We propose to investigate new computational approaches to two central problems of high-throughput sequence analysis: (1) quantification of transcript and species abundance in RNAseq and metagenomic data, and (2) improved error correction of sequencing reads. The proposed approaches to both problems derive from the ability to quickly count every instance of every k-mer (string of length k) within huge collections of sequence data. Extensive preliminary work on this problem, embodied in the k-mer counting software Jellyfish published by the project personnel, will be brought to bear and extended. Existing mapping-based computational techniques for quantifying transcript abundance have found wide applicability, but read mapping is error prone due to, e.g., splice junctions, microexons, and variation from the reference sequence.
Aim 1 seeks to develop an alternative, mapping-free approach to transcript quantification from sequencing data that relies on clustering normalized k-mer count vectors to identify k-mers that are indicative of transcript or gene abundance. These k-mers form profiles that can be used to rapidly quantify expression of the given transcript or gene in subsequent experiments, with limited computational effort and without the challenging read mapping step.
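
As a rough illustration of what a normalized k-mer count vector might look like, the sketch below counts k-mers in several small read sets and converts the counts to relative frequencies, giving one vector per k-mer across samples. The function names (count_kmers, normalized_profiles) and the choice of relative frequency as the normalization are assumptions made for this example; the sketch does not use Jellyfish, ignores reverse complements, and is not the project's actual method.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def normalized_profiles(samples, k):
    """Return {kmer: [relative frequency in each sample]}.

    Normalization here is count / total k-mers in the sample,
    an assumed choice for illustration only.
    """
    per_sample = [count_kmers(reads, k) for reads in samples]
    totals = [sum(c.values()) or 1 for c in per_sample]  # avoid division by zero
    kmers = sorted(set().union(*per_sample))
    return {km: [c[km] / t for c, t in zip(per_sample, totals)] for km in kmers}


# Two toy "experiments"; real inputs would be millions of reads.
samples = [["ACGTAC", "CGTACG"], ["ACGTAC"]]
profiles = normalized_profiles(samples, 3)
```

Vectors of this kind could then be passed to any standard clustering routine; which clustering method the project uses is not stated here.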
Aim 2 tackles the problem of error correction of genomic and, more speculatively, RNAseq reads by developing more accurate k-mer filtering methods and more compact de Bruijn graph representations. The new filtering procedures aim to better distinguish correct from erroneous k-mers by simultaneously considering their position within the reads and the distribution of their quality scores across reads. Improved error correction and de Bruijn graph representations will support more efficient algorithms for super-read and unitig creation, the initial stages of assembly. The methods and software developed for both aims will make it substantially more feasible to complete high-throughput sequence analysis and assembly on widely available commodity computers.
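
For orientation, the sketch below shows the simplest form of k-mer filtering and de Bruijn graph construction: k-mers seen fewer than a threshold number of times are treated as likely errors, and the remaining k-mers define edges between their (k-1)-mer prefixes and suffixes. This is a plain count-threshold baseline, not the position-aware and quality-aware filtering proposed above; the names solid_kmers, de_bruijn, and min_count are hypothetical.

```python
from collections import Counter, defaultdict


def solid_kmers(reads, k, min_count=2):
    """Keep k-mers that occur at least min_count times (baseline filter)."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    return {km for km, c in counts.items() if c >= min_count}


def de_bruijn(kmers):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes it leads to."""
    graph = defaultdict(set)
    for km in kmers:
        graph[km[:-1]].add(km[1:])
    return graph


# The third read ends in T instead of A, producing one low-count k-mer (GTT).
reads = ["ACGTA", "ACGTA", "ACGTT"]
kept = solid_kmers(reads, 3)
graph = de_bruijn(kept)
```

In this toy input the filter drops GTT, so the graph contains a single unbranched path (AC, CG, GT, TA), which is the kind of path a unitig represents.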

Public Health Relevance

High-throughput sequence data has very recently become a widely used tool in basic and applied biological research. RNA transcript sequencing has been used to gain insight into developmental processes in model organisms and to understand the etiology and mechanism of diseases, and metagenomic sequence data has revealed the complex symbiosis between microorganisms and their environments. The more computationally efficient approaches proposed here for analyzing this data will allow even research labs with limited computational power to quantify transcript and gene abundance faster and more accurately, leading to a better understanding of microbial communities, causes of genetic diseases, and responses to varied environments.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Exploratory/Developmental Grants (R21)
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Good, Peter J
Carnegie Mellon University
Schools of Arts and Sciences
United States
Patro, Rob; Duggal, Geet; Love, Michael I et al. (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods 14:417-419
Sefer, Emre; Kingsford, Carl (2016) Diffusion archeology for diffusion progression history reconstruction. Knowl Inf Syst 49:403-427
Wang, Hao; McManus, Joel; Kingsford, Carl (2016) Isoform-level ribosome occupancy estimation guided by transcript abundance with Ribomap. Bioinformatics 32:1880-2
Solomon, Brad; Kingsford, Carl (2016) Fast search of thousands of short-read sequencing experiments. Nat Biotechnol 34:300-2
Spealman, Pieter; Wang, Hao; May, Gemma et al. (2016) Exploring Ribosome Positioning on Translating Transcripts with Ribosome Profiling. Methods Mol Biol 1358:71-97
Sefer, Emre; Duggal, Geet; Kingsford, Carl (2016) Deconvolution of Ensemble Chromatin Interaction Data Reveals the Latent Mixing Structures in Cell Subpopulations. J Comput Biol 23:425-38
Patro, Rob; Norel, Raquel; Prill, Robert J et al. (2016) A computational method for designing diverse linear epitopes including citrullinated peptides with desired binding affinities to intravenous immunoglobulin. BMC Bioinformatics 17:155
Kingsford, Carl; Patro, Rob (2015) Reference-based compression of short-read sequences using path encoding. Bioinformatics 31:1920-8
Patro, Rob; Mount, Stephen M; Kingsford, Carl (2014) Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms. Nat Biotechnol 32:462-4
Duggal, Geet; Wang, Hao; Kingsford, Carl (2014) Higher-order chromatin domains link eQTLs with the expression of far-away genes. Nucleic Acids Res 42:87-96
