This project aims to develop new methods for integrating large amounts of high-resolution data arising from different types of molecules and measurement methods; the goal is to ascertain how the molecules interact over time to carry out essential biological functions. Biological functions are carried out through the myriad interactions of biological molecules, such as when proteins bind to other proteins to modulate their activity or to nucleic acids to regulate genes. DNA sequencing has led to an explosive growth in data reporting genomic sequences and their variations, as well as gene expression through transcript profiling; the data streams from high-throughput technologies for protein and metabolite profiling are quickly catching up. This has led to ever-expanding repositories that archive, organize and share the resulting data. By connecting experimental conditions to the molecular profiles, researchers come to understand which molecular interactions occur and, from these, deduce many of the biological functions in living cells. Extracting meaningful biological insights from these data sets is challenging in two ways: the data sets are very large, so they require computational methods for basic handling, and each type of data differs from the others in many ways (type of noise, source of error, completeness, etc.), so each may require different statistical modeling to standardize it correctly before merging. When this standardization and merging are carried out correctly, the resulting high-dimensional data sets are suitable for a variety of predictive analytics that reveal functional modules in the molecular interactomes. Results from this project will be made available through web servers and open-source software. The integrated research and educational activities include interdisciplinary bioinformatics curriculum development, outreach to high school students and research opportunities for students from underrepresented groups.

A comprehensive understanding of the functional aspects of a gene or protein, such as its involvement in a particular biological process, its physical or genetic interactions, or its disease associations, is critical for both biology and translational medicine research. Since exhaustively characterizing genes or proteins through biological experiments is often intractable, systems-level integration of knowledge and computational hypothesis generation have garnered great interest in the field as an effective way to guide experiments. In this project, we will develop a novel computational framework for data integration and dimensionality reduction of heterogeneous network and functional genomic data to obtain informative data representations in a low-dimensional vector space. To utilize both molecular networks and evolutionary information, we will apply the proposed dimensionality reduction techniques to integrate sequence data and network data across multiple species for predicting gene function. Our approaches will enable integrated, cross-species, genome-scale gene function annotation. Through this integration, our methods can also infer functional homology or analogy between genes from different species that share only weak sequence similarity but have related biological functions. Results, software and additional information will be available at http://jianpeng.cs.illinois.edu.

Agency: National Science Foundation (NSF)
Institute: Division of Biological Infrastructure (DBI)
Application #: 1652815
Program Officer: Peter McCartney
Project Start:
Project End:
Budget Start: 2017-04-15
Budget End: 2022-03-31
Support Year:
Fiscal Year: 2016
Total Cost: $618,168
Indirect Cost:
Name: University of Illinois Urbana-Champaign
Department:
Type:
DUNS #:
City: Champaign
State: IL
Country: United States
Zip Code: 61820