Assays based on next-generation sequencing technologies (*-seq assays) are widely used in the genomics community. As these assays mature and attempt to probe more subtle biological phenomena, new tools based on powerful statistical techniques will be needed to provide confidence in the resulting biological conclusions. To date, *-seq assay analysis tools fall into two distinct classes: mapping and quantification. Mapping tools attempt to match each read with a genomic location, whereas quantification tools infer biological features from the "mapped" reads. The results of the mapping are often highly dependent on tuning parameters and rarely, if ever, provide any notion of confidence. Downstream analysis tools typically take the provided mappings as gospel. This project will take a different approach. The investigators propose to use known physical and biochemical properties of the assay to model the assay. Such an approach will yield better mappings while providing a notion of confidence that can be made an integral part of downstream analysis. Extensive validation of the software and underlying models is planned in three organisms using data from five different validation experiments. The work proposed in this project will result in significant improvements in the analysis of *-seq data. If successful, this project will replace a host of mapping algorithms, peak callers, and transcript quantifiers, forming the foundation of a software suite for the integrative analysis of *-seq assays.

Public Health Relevance

The results of mapping reads in assays based on next-generation sequencing (e.g., ChIP-seq, RNA-seq, DNase-seq) are often highly dependent on tuning parameters without ever providing a notion of confidence, and downstream analysis tools typically take the provided mappings as gospel. Our approach is different: we make the known physical and biochemical properties of the assay, and the biological properties of the feature assayed, an integral part of the mapping process. Then, on the basis of our assay model, we set confidence limits on our mappings that can be made an integral part of downstream analysis, whether analytical or biological. Our working prototype is called Statmap. We intend it to replace a host of mappers, peak callers, and transcript quantifiers as the principal tool for analysis and quantification in the computational genomicist's arsenal by the end of 2012.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Exploratory/Developmental Grants (R21)
Project #
5R21HG006187-02
Application #
8290222
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Bonazzi, Vivien
Project Start
2011-06-27
Project End
2013-05-31
Budget Start
2012-06-01
Budget End
2013-05-31
Support Year
2
Fiscal Year
2012
Total Cost
$222,835
Indirect Cost
$72,835
Name
University of California Berkeley
Department
Biostatistics & Other Math Sci
Type
Schools of Arts and Sciences
DUNS #
124726725
City
Berkeley
State
CA
Country
United States
Zip Code
94704
Boley, Nathan; Stoiber, Marcus H; Booth, Benjamin W et al. (2014) Genome-guided transcript assembly by integrative analysis of RNA sequence data. Nat Biotechnol 32:341-6