The data generated by the ENCODE Consortium constitute an unprecedented opportunity to make biomedical inferences about the function and structure of the human genome. With nearly 1000 genome-wide assays, the depth of information now publicly available about each base-pair is staggering, and in the next round of consortium work this depth is likely to increase geometrically. In this proposal, we describe statistical challenges that, if met, will substantially enhance the capacity of the Analysis Working Group (AWG) to make functional biological inferences from ENCODE data. We will tackle these challenges, serving as a statistical "Research and Development" component with built-in experimental validation capabilities, and will provide the AWG with useful software implementations of the statistical tools we develop, as well as with iterative refinements of these tools grounded in experimental validations of our imputed networks. In particular, we will: 1) develop methods of dimension reduction that will aid the AWG in data visualization, summarization, and prediction studies (e.g. the prediction of transcription from chromatin data); 2) develop new quantitative network models of complex biological systems assayed by ENCODE; and 3) conduct targeted biological validation assays designed to interrogate important low-dimensional structures in our models and to feed back to improve both model structure and performance. Our approaches to dimension reduction will aid biologists in interpreting and formulating hypotheses from high-dimensional genomics data, our network models will facilitate the construction of interpretable predictive algorithms that lead directly to testable and quantifiable hypotheses, and our validation assays will ensure that inferences derived from our tools provide meaningful biological insights.
As in our work with the ENCODE and modENCODE data analysis centers, we will work closely with the AWG to ensure that our software implementations are immediately and maximally useful to the consortium, and that the overall course of our work on network inference and dimension reduction is focused on biological questions central to the interests of the consortium.

Public Health Relevance

In the next generation of the ENCODE Consortium, the ENCODE Data Coordination and Analysis Center (EDCAC) will be responsible for applying high-throughput bioinformatics and statistical techniques to interrogate all consortium data, per the direction of the AWG. However, as the dimensionality of the data increases geometrically, existing tools for integrative analysis will become less useful and new techniques and software will need to be developed. We propose to function as a "Statistical Research and Development" component of the ENCODE Consortium: we will tackle analytical bottlenecks that arise, and in many cases have already arisen, to facilitate the progress of the consortium.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project--Cooperative Agreements (U01)
Project #
1U01HG007031-01
Application #
8402497
Study Section
Special Emphasis Panel (ZHG1-HGR-M (M2))
Program Officer
Pazin, Michael J
Project Start
2012-09-17
Project End
2015-06-30
Budget Start
2012-09-17
Budget End
2013-06-30
Support Year
1
Fiscal Year
2012
Total Cost
$425,000
Indirect Cost
$110,998
Name
University of California Berkeley
Department
Miscellaneous
Type
Organized Research Units
DUNS #
124726725
City
Berkeley
State
CA
Country
United States
Zip Code
94704
Wang, Y X Rachel; Liu, Ke; Theusch, Elizabeth et al. (2018) Generalized correlation measure using count statistics for gene expression data with ordered samples. Bioinformatics 34:617-624
Basu, Sumanta; Kumbier, Karl; Brown, James B et al. (2018) Iterative random forests to discover predictive and stable high-order interactions. Proc Natl Acad Sci U S A 115:1943-1948
Shi, Funan; Huang, Haiyan (2017) Identifying Cell Subpopulations and Their Genetic Drivers from Single-Cell RNA-Seq Data Using a Biclustering Approach. J Comput Biol 24:663-674
Wu, Siqi; Joseph, Antony; Hammonds, Ann S et al. (2016) Stability-driven nonnegative matrix factorization to interpret spatial gene expression and build local gene networks. Proc Natl Acad Sci U S A 113:4290-5
Stoiber, Marcus H; Olson, Sara; May, Gemma E et al. (2015) Extensive cross-regulation of post-transcriptional regulatory networks in Drosophila. Genome Res 25:1692-702
Kim, Kyungpil; Bolotin, Eugene; Theusch, Elizabeth et al. (2014) Prediction of LDL cholesterol response to statin using transcriptomic and genetic variation. Genome Biol 15:460
Gerstein, Mark B; Rozowsky, Joel; Yan, Koon-Kiu et al. (2014) Comparative analysis of the transcriptome across distant species. Nature 512:445-8
Wang, Y X Rachel; Waterman, Michael S; Huang, Haiyan (2014) Gene coexpression measures in large heterogeneous samples using count statistics. Proc Natl Acad Sci U S A 111:16371-6
Brown, James B; Boley, Nathan; Eisman, Robert et al. (2014) Diversity and dynamics of the Drosophila transcriptome. Nature 512:393-9
Boley, Nathan; Wan, Kenneth H; Bickel, Peter J et al. (2014) Navigating and mining modENCODE data. Methods 68:38-47

Showing the most recent 10 out of 15 publications