The extraordinary advances in sequencing technology during the past decade have transformed genome sequencing into a routine experiment that can be performed by individual investigators. This is having an enormous impact on biology, not only by increasing the power and application of comparative genomics, but also via the unforeseen opportunity to perform low-cost, high-throughput molecular biology experiments using sequencing. New assays that probe the molecular biology of cells by reducing experiments to DNA fragment counting are known as sequence census methods. These methods have the potential to dramatically advance our understanding of the dynamics and structure of molecules and pathways (Wold and Myers, 2008). The successful application of sequence census methods to functional genomics depends on the ability to narrow a growing gap between sequencing output and analysis capability (McPherson, 2009). The analysis of high-throughput sequencing data is complicated not only by the vast quantities of data being produced (leading to difficult engineering challenges), but also by the non-trivial mathematical and statistical inference problems that must be solved to glean functional information from read counts. Experiments continue to grow in number and complexity, resulting in an unprecedented challenge for computational biologists. We have tackled some of these challenges in previous work. Our Cufflinks program (Trapnell et al., 2010) provides a suite of tools for processing and analyzing RNA-Seq data, which consist of reads that originate from mRNA fragments and that can be used to measure relative abundances of transcripts. We have also worked on the analysis of Methyl-Seq experiments for measuring methyl modification of CpG dinucleotides, and have developed approaches for normalizing fragment counts that are biased due to non-random fragmentation.
In the course of these projects, we have tackled and solved problems that are common to many sequence census experiments, yet many challenges remain. We propose to extend our previous work so that our tools can continue to develop with the technologies and allow for increasingly refined functional inferences. An additional, key aspect of our proposal rests on the recognition that our solutions for RNA-Seq can be organized in a modular framework that will make them much more generally applicable. This leads to a proposal to develop a general analysis infrastructure for sequence census experiments. In other words, the goal of this proposal is to develop a computational and statistical infrastructure for reconstructing the desired functional information from a wide range of sequence census experiments. Our proposal is organized into two parts that reflect these aims:
1. Further development of the Cufflinks suite of programs to address numerous remaining problems in RNA-Seq analysis. Specific projects are outlined in the proposal, and are based on the large amount of user-supplied feedback we have received in the months since releasing our software.
2. Development of a modular analysis framework consisting of tools that can be customized for the analysis of novel sequence census experiments. We have recognized that it is not only sequencing that is "high-throughput"; the number of experiments based on sequencing is also growing at an exponential rate. The organization of analysis tools into 'subroutines' that can be easily merged into analysis workflows is therefore essential.
In addition to reviewing our preliminary work and providing details on our planned approach to the research, we also provide letters of support from leading academic and industry experts, as well as sequencing facility directors, whom we will consult throughout the project. We believe this consultation is crucial for maintaining appropriate focus during a research program that will take years, while the field advances rapidly in months. We also plan to organize workshops to help train and educate users who rely on our tools, and who need to adapt to increasingly complex analysis systems.

Public Health Relevance

The availability of low-cost, high-throughput sequencing technologies is providing unprecedented opportunities for measuring cellular activity at the molecular level via sequence census reductions that are based on counting DNA fragments. However, non-trivial reductions require the solution of challenging mathematical inverse problems to glean information from the sequence, and depend on efficient algorithms suitable for vast quantities of data. We will build on an existing solution we have developed for RNA-Seq analysis to create a platform for the analysis of a wide range of transcription and translation measurement assays, and to develop a general infrastructure for the analysis of sequence census experiments.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG006129-02
Application #: 8321953
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Pazin, Michael J
Project Start: 2011-08-19
Project End: 2014-05-31
Budget Start: 2012-06-01
Budget End: 2013-05-31
Support Year: 2
Fiscal Year: 2012
Total Cost: $371,541
Indirect Cost: $121,541

Name: University of California Berkeley
Department: Biostatistics & Other Math Sci
Type: Schools of Arts and Sciences
DUNS #: 124726725
City: Berkeley
State: CA
Country: United States
Zip Code: 94704

Li, Bo; Tambe, Akshay; Aviran, Sharon et al. (2017) PROBer Provides a General Toolkit for Analyzing Sequencing-Based Toeprinting Assays. Cell Syst 4:568-574.e7
Schaeffer, L; Pimentel, H; Bray, N et al. (2017) Pseudoalignment for metagenomic read assignment. Bioinformatics 33:2082-2088
Fu, Audrey Qiuyan; Pachter, Lior (2016) Estimating intrinsic and extrinsic noise from single-cell gene expression measurements. Stat Appl Genet Mol Biol 15:447-471
Ntranos, Vasilis; Kamath, Govinda M; Zhang, Jesse M et al. (2016) Fast and accurate single-cell RNA-seq analysis by clustering of transcript-compatibility counts. Genome Biol 17:112
Pimentel, Harold; Sturmfels, Pascal; Bray, Nicolas et al. (2016) The Lair: a resource for exploratory analysis of published RNA-Seq data. BMC Bioinformatics 17:490
Bray, Nicolas L; Pimentel, Harold; Melsted, Páll et al. (2016) Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol 34:525-7
Singer, Meromit; Pachter, Lior (2015) Controlling for conservation in genome-wide DNA methylation studies. BMC Genomics 16:420
Aviran, Sharon; Pachter, Lior (2014) Rational experiment design for sequencing-based RNA structure mapping. RNA 20:1864-77
Roberts, Adam; Pachter, Lior (2013) Streaming fragment assignment for real-time analysis of sequencing experiments. Nat Methods 10:71-3
Roberts, Adam; Feng, Harvey; Pachter, Lior (2013) Fragment assignment in the cloud with eXpress-D. BMC Bioinformatics 14:358
