Methods for the analysis of RNA-Seq and related sequence census based experiments

Pachter, Lior

Abstract

The extraordinary advances in sequencing technology during the past decade have transformed genome sequencing into a routine experiment that can be performed by individual investigators. This is having an enormous impact on biology, not only by increasing the power and application of comparative genomics, but via the unforeseen opportunity to perform low-cost high-throughput molecular biology experiments using sequencing. New assays for probing the molecular biology of cells by reducing experiments to DNA fragment counting are known as sequence census methods. These methods have the potential to dramatically advance our understanding of the dynamics and structure of molecules and pathways (Wold and Myers, 2008). The successful application of sequence census methods to functional genomics depends on the ability to narrow a growing gap between sequencing output and analysis capability (McPherson, 2009). The analysis of high-throughput sequencing data is complicated not only by the vast quantities of data being produced (leading to difficult engineering challenges), but also by the non-trivial mathematical and statistical inference problems that must be solved to glean functional information from read counts. Experiments continue to grow in number and complexity, resulting in an unprecedented challenge for computational biologists.
We have tackled some of these challenges in previous work. Our Cufflinks program (Trapnell et al., 2010) provides a suite of tools for processing and analyzing RNA-Seq data, which consists of reads that originate from mRNA fragments and that can be used to measure relative abundances of transcripts. We have also worked on the analysis of Methyl-Seq experiments for measuring methyl modification of CpG dinucleotides, and have developed approaches for normalizing fragment counts that are biased due to non-random fragmentation.
In the course of these projects, we have tackled and solved problems that are common to many sequence census experiments, yet many challenges remain. We propose to extend our previous work so that our tools can continue to develop with the technologies and allow for increasingly refined functional inferences. However, there is an additional, key aspect of our proposal, based on the recognition that our solutions for RNA-Seq can be organized in a modular framework that will allow them to be much more generally applicable. This leads to a proposal to develop a general analysis infrastructure for sequence census experiments. In other words, the goal of this proposal is to develop a computational and statistical infrastructure for reconstructing the desired functional information from a wide range of sequence census experiments. Our proposal is organized into two parts that reflect these aims:
1. Further development of the Cufflinks suite of programs to address numerous remaining problems in RNA-Seq analysis. Specific projects are outlined in the proposal, and are based on large amounts of user-supplied feedback we have received in recent months since releasing our software.
2. Development of a modular analysis framework consisting of tools that can be customized for the analysis of novel sequence census experiments. We have recognized that it is not only sequencing that is "high-throughput"; the number of experiments based on sequencing is also growing at an exponential rate. The organization of analysis tools into 'subroutines' that can be easily merged into analysis workflows is therefore essential.
In addition to reviewing our preliminary work and providing details on our planned approach to the research, we also provide letters of support from leading academic and industry experts, as well as sequencing facility directors, whom we will consult throughout the project. We believe this is crucial in order to maintain appropriate focus during a research program that will take years, while the field advances rapidly in months. We also plan to organize workshops to help train and educate users who rely on our tools, and who need to adapt to increasingly complex analysis systems.
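The abstract notes that RNA-Seq reads "can be used to measure relative abundances of transcripts" and that this requires solving non-trivial inference problems. As a toy illustration of that kind of problem (not the Cufflinks implementation, which the abstract does not describe), the sketch below runs a textbook expectation-maximization loop over reads that may be compatible with several transcripts. The function name `em_abundances`, its parameters, and the uniform-position read model are all assumptions made for this example.

```python
def em_abundances(read_compat, lengths, n_iter=200):
    """Toy EM estimate of relative transcript abundances (hypothetical helper).

    read_compat: list of tuples, each listing the transcripts one read
                 is compatible with (every read needs at least one).
    lengths:     dict mapping transcript name -> effective length.
    Returns a dict mapping transcript name -> relative abundance (sums to 1).
    """
    if not read_compat:
        raise ValueError("at least one read is required")
    transcripts = sorted(lengths)
    n_reads = len(read_compat)
    # alpha[t]: probability that a randomly chosen read came from transcript t.
    alpha = {t: 1.0 / len(transcripts) for t in transcripts}

    for _ in range(n_iter):
        expected = dict.fromkeys(transcripts, 0.0)
        # E-step: split each read across its compatible transcripts.
        # Assumed model: a read starts at a uniform position, so the chance
        # of seeing this particular read from t is proportional to 1/length.
        for compat in read_compat:
            weights = {t: alpha[t] / lengths[t] for t in compat}
            total = sum(weights.values())
            for t, w in weights.items():
                expected[t] += w / total
        # M-step: new read-origin probabilities are the expected read shares.
        alpha = {t: expected[t] / n_reads for t in transcripts}

    # Convert read-origin probabilities to length-normalized abundances.
    norm = sum(alpha[t] / lengths[t] for t in transcripts)
    return {t: alpha[t] / lengths[t] / norm for t in transcripts}


# Made-up example: 60 reads unique to A, 20 unique to B, 20 ambiguous.
reads = [("A",)] * 60 + [("B",)] * 20 + [("A", "B")] * 20
rho = em_abundances(reads, {"A": 1000, "B": 1000})
print(rho)
```

With equal lengths, the ambiguous reads are divided in proportion to the current estimates, and the loop settles at relative abundances of about 0.75 for A and 0.25 for B. Real tools must also handle fragment-length distributions, sequence bias, and far larger inputs, none of which this sketch attempts.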

Public Health Relevance

The availability of low-cost high-throughput sequencing technologies is providing unprecedented opportunities for measuring cellular activity at the molecular level via sequence census reductions that are based on counting DNA fragments. However, non-trivial reductions require the solution of challenging mathematical inverse problems to glean information from the sequence, and depend on efficient algorithms suitable for vast quantities of data. We will build on an existing solution we have developed for RNA-Seq analysis to create a platform for analysis of a wide range of transcription and translation measurement assays, and to develop a general infrastructure for the analysis of sequence census experiments.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG006129-03
Application #: 8526489
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Pazin, Michael J

Project Start: 2011-08-19
Project End: 2014-05-31
Budget Start: 2013-06-01
Budget End: 2014-05-31
Support Year: 3
Fiscal Year: 2013
Total Cost: $353,656
Indirect Cost: $114,906

Institution

Name: University of California Berkeley
Department: Biostatistics & Other Math Sci
Type: Schools of Arts and Sciences
DUNS #: 124726725

City: Berkeley
State: CA
Country: United States
Zip Code: 94704

Related projects


NIH 2013 R01 HG	Methods for the analysis of RNA-Seq and related sequence census based experiments	Pachter, Lior S. / University of California Berkeley	$353,656
NIH 2012 R01 HG	Methods for the analysis of RNA-Seq and related sequence census based experiments	Pachter, Lior S. / University of California Berkeley	$371,541
NIH 2011 R01 HG	Methods for the analysis of RNA-Seq and related sequence census based experiments	Pachter, Lior S. / University of California Berkeley	$372,651

Publications

Schaeffer, L; Pimentel, H; Bray, N et al. (2017) Pseudoalignment for metagenomic read assignment. Bioinformatics 33:2082-2088

Li, Bo; Tambe, Akshay; Aviran, Sharon et al. (2017) PROBer Provides a General Toolkit for Analyzing Sequencing-Based Toeprinting Assays. Cell Syst 4:568-574.e7

Fu, Audrey Qiuyan; Pachter, Lior (2016) Estimating intrinsic and extrinsic noise from single-cell gene expression measurements. Stat Appl Genet Mol Biol 15:447-471

Ntranos, Vasilis; Kamath, Govinda M; Zhang, Jesse M et al. (2016) Fast and accurate single-cell RNA-seq analysis by clustering of transcript-compatibility counts. Genome Biol 17:112

Pimentel, Harold; Sturmfels, Pascal; Bray, Nicolas et al. (2016) The Lair: a resource for exploratory analysis of published RNA-Seq data. BMC Bioinformatics 17:490

Bray, Nicolas L; Pimentel, Harold; Melsted, Páll et al. (2016) Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol 34:525-7

Singer, Meromit; Pachter, Lior (2015) Controlling for conservation in genome-wide DNA methylation studies. BMC Genomics 16:420

Aviran, Sharon; Pachter, Lior (2014) Rational experiment design for sequencing-based RNA structure mapping. RNA 20:1864-77

Roberts, Adam; Schaeffer, Lorian; Pachter, Lior (2013) Updating RNA-Seq analyses after re-annotation. Bioinformatics 29:1631-7

Roberts, Adam; Pachter, Lior (2013) Streaming fragment assignment for real-time analysis of sequencing experiments. Nat Methods 10:71-3

Showing the most recent 10 out of 15 publications
