We propose to investigate new computational approaches to two central problems of high-throughput sequence analysis: (1) quantification of transcript and species abundance in RNAseq and metagenomic data, and (2) improved error correction of sequencing reads. The proposed approaches to both problems derive from the ability to quickly count every instance of every k-mer (string of length k) within huge collections of sequence data. Extensive preliminary work on this problem, manifest in the k-mer counting software Jellyfish published by the project personnel, will be brought to bear and extended. Existing mapping-based computational techniques for quantifying transcript abundance have found wide applicability, but read mapping is error prone due to, e.g., splice junctions, microexons, and variation from the reference sequence.
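To make the k-mer counting primitive concrete, the following is a minimal in-memory sketch. It is not how Jellyfish is implemented; the function name `count_kmers` and the toy reads are illustrative assumptions, and a production counter would need a far more memory-efficient data structure.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across a collection of reads."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k along the read; short reads yield nothing.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Hypothetical toy input: two overlapping reads.
reads = ["ACGTAC", "CGTACG"]
counts = count_kmers(reads, 3)
```

Each read of length n contributes n - k + 1 k-mer occurrences, so the two 6-base reads above contribute eight occurrences in total for k = 3.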
Aim 1 seeks to develop an alternative, mapping-free approach to transcript quantification from sequencing data that relies on clustering normalized k-mer count vectors to identify k-mers that are indicative of transcript or gene abundance. These k-mers form profiles that can be used to rapidly quantify expression of the given transcript or gene in subsequent experiments, with limited computational effort and without the challenging read mapping step.
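One way to picture clustering normalized k-mer count vectors is sketched below. The normalization (dividing by each sample's total k-mer count), the cosine similarity measure, the greedy grouping, and all function names are assumptions chosen for illustration; the proposal does not specify which normalization or clustering method will be used.

```python
import math

def normalize(sample_counts):
    """Scale one sample's k-mer counts by that sample's total count."""
    total = sum(sample_counts.values())
    return {kmer: c / total for kmer, c in sample_counts.items()}

def profile_vectors(samples):
    """Build one vector per k-mer: its normalized count in each sample."""
    normed = [normalize(s) for s in samples]
    kmers = sorted(set().union(*normed))
    return {kmer: [n.get(kmer, 0.0) for n in normed] for kmer in kmers}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def greedy_cluster(vectors, threshold=0.99):
    """Assign each k-mer to the first cluster whose representative it resembles."""
    clusters = []  # list of (representative vector, member k-mers)
    for kmer, vec in vectors.items():
        for rep, members in clusters:
            if cosine(rep, vec) >= threshold:
                members.append(kmer)
                break
        else:
            # No existing cluster is similar enough; start a new one.
            clusters.append((vec, [kmer]))
    return [members for _, members in clusters]

# Hypothetical counts from two samples: AAA and AAC rise together, GGG falls.
samples = [
    {"AAA": 10, "AAC": 20, "GGG": 70},
    {"AAA": 30, "AAC": 60, "GGG": 10},
]
clusters = greedy_cluster(profile_vectors(samples))
```

In this toy input, AAA and AAC vary proportionally across the two samples and so fall into one group, which is the kind of co-varying k-mer set that could serve as an abundance profile.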
Aim 2 tackles the problem of error correction of genomic and, more speculatively, RNAseq reads by developing more accurate k-mer filtering methods and more compact de Bruijn graph representations. The new filtering procedures aim to distinguish correct from erroneous k-mers more reliably by simultaneously considering their position within the reads and the distribution of their quality scores across reads. Improved error correction and de Bruijn graph representations will be used to build more efficient algorithms for super-read and unitig creation, the initial stages of assembly. The methods and software developed for both aims will significantly increase the ability of high-throughput sequence analysis and assembly to be completed on widely available commodity computers.
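For orientation, a simple baseline form of quality-aware k-mer filtering can be sketched as follows. This is not the proposed filtering procedure: it ignores k-mer position within the read and uses fixed cutoffs, and the names `min_qual` and `min_count` and their default values are assumptions for illustration only.

```python
from collections import Counter

def high_quality_kmer_counts(reads, k, min_qual):
    """Count only k-mer occurrences whose bases all have quality >= min_qual.

    Each read is a (sequence, qualities) pair with one integer score per base.
    """
    counts = Counter()
    for seq, quals in reads:
        for i in range(len(seq) - k + 1):
            if min(quals[i:i + k]) >= min_qual:
                counts[seq[i:i + k]] += 1
    return counts

def trusted_kmers(reads, k, min_qual=20, min_count=2):
    """Return k-mers seen at least min_count times at high quality."""
    counts = high_quality_kmer_counts(reads, k, min_qual)
    return {kmer for kmer, c in counts.items() if c >= min_count}

# Hypothetical reads: the third ends in a low-quality, likely erroneous base.
reads = [
    ("ACGT", [30, 30, 30, 30]),
    ("ACGT", [30, 30, 30, 30]),
    ("ACGA", [30, 30, 30, 5]),
]
trusted = trusted_kmers(reads, 3)
```

Here the k-mer CGA occurs only once and overlaps a low-quality base, so it is excluded from the trusted set, while ACG and CGT are retained.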
High-throughput sequence data has very recently become a widely used tool in basic and applied biological research. RNA transcript sequencing has been used to gain insight into developmental processes in model organisms and to understand the etiology and mechanism of diseases, and metagenomic sequence data has revealed the complex symbiosis between microorganisms and their environments. The more computationally efficient approaches proposed here to analyze this data will allow even research labs with limited computational power to quantify transcript and gene abundance faster and more accurately, leading to a better understanding of microbial communities, causes of genetic diseases, and responses to varied environments.