With the advent of high-throughput biotechnology, researchers can now investigate a biological system with multiple bioassays using diverse types of study objects. The combination of multiple bioassays, study objects and exposure conditions produces a wealth of information about the biological system and, at the same time, poses the great challenge of how to integrate heterogeneous data sources and extract knowledge that cannot be gained from any single dataset alone. Depending on the complexity of this combination, direct data integration is often used when combining information from multiple bioassays on a cohort of similar study objects, whereas indirect integration via knowledge transfer is generally more attractive when incorporating relatively more diverse data types. Direct information integration relies on concordant and well-annotated data structures, and the connections between datasets are usually easily understood. These methods must nonetheless overcome many computational challenges arising from the different sizes, formats and dimensionalities of the data being integrated. In biological studies, researchers tend to study their system using models that vary in species (human vs. mouse), composition (tissue vs. cell lines, single cells) and/or exposure conditions. The resulting datasets are therefore drawn from different feature spaces and/or different sample distributions, and direct integration becomes infeasible. Each of these highly disparate datasets may provide complementary information that is key to the research question under study, so there is an urgent need for a transfer learning method tailored to high-throughput bioassay data that allows existing heterogeneous datasets to be repurposed in future studies.
To address these challenges, this project will develop new classes of computational methods for direct information integration and indirect knowledge transfer, and ultimately leverage the structures and relations among various bioassay datasets to better understand a biological system, for example, the mechanisms of cancer drug resistance. The research team will achieve these goals by pursuing the following two objectives. First, they will develop a novel formulation for information distillation on multiple -omics data so that knowledge can be readily transferred to future studies, a transfer that technical and platform bias would otherwise prevent. The framework consists of a supervised sparse clustering method for qualitative representation of coherent signatures, together with a co-clustering approach to detect local low-rank structures. Second, they will develop a novel transfer learning method that imposes the structural regularities learned from the source domains on any target domain. The key assumption is that these structural regularities are invariant to technical and platform bias, which makes them an ideal vehicle for knowledge transfer. The project is expected to produce novel computational tools that can effectively explore a wide range of heterogeneous datasets. It has great potential to minimize the cost of collecting new training data by maximizing the use of existing information and fully using the knowledge derived from it to strengthen our understanding of a biological or biomedical system. Hence, the project will have far-reaching economic and societal impacts.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.