Open online databases now hold nearly 2 million experiments, each recording the activity of tens of thousands of DNA elements, genes, or proteins in a particular biological sample. These samples come from a variety of tissues in humans and other organisms under numerous conditions, and they are an invaluable resource that other scientists can reanalyze to discover new biology. However, these data are severely underused because finding the samples of interest in the sea of all samples is still very hard: most sample descriptions are written in language that is hard to search unambiguously, or they fail to report every critical aspect of the sample source. This project seeks to democratize data-driven biology by annotating publicly available samples from six species (human and five animal models) on a massive scale, which will enable researchers to discover relevant published data for further analysis. A web interface will be developed to give researchers a single access point to all these annotations and the related search tools. The insights and tools from this project will have far-reaching implications for how biological researchers submit, store, manage, access, reuse, and re-share data. Improving data discoverability and reusability will increase the accuracy, efficiency, and reproducibility of biological research overall, saving resources by accelerating scientific discovery. The educational and training activities developed in this project will formalize and openly disseminate the “hidden curriculum” in modern bioinformatics: the abstract, experiential skills critical for holistic, practical competency in conducting large-scale data analysis and research in a rapidly changing bioinformatics landscape. This effort will create openness and equity in professional training in applying computing to the study of biology.

This project will develop new machine learning (ML) methods that use both text and molecular data to assign comprehensive, standardized annotations to publicly available omics samples. The barrier to complete and structured sample descriptors is twofold: (1) samples are routinely described using non-standard, varied terminologies written in unstructured natural language, and (2) even basic attributes, e.g., tissue or environment, are omitted from sample descriptions if they were not factors considered in the original study. The objective of this project is to remove both of these barriers by integrating state-of-the-art ML advances to: (1) develop ML methods to infer standardized annotations from plain text descriptions, jointly from multiple omics types; (2) develop ML models to predict structured metadata from molecular omics profiles, jointly from multiple species; and (3) develop methods to integrate these text- and omics-profile-based models to comprehensively annotate millions of samples, along with tools for researchers to use this massive resource to glean novel biology. Integrated with this research is an education plan to develop ties to formalize and openly disseminate the hidden curriculum in data-driven biology. All the results from this project, including data and code, will be available at www.thekrishnanlab.org.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Biological Infrastructure (DBI)
Application #: 2045651
Program Officer: Jean Gao
Project Start:
Project End:
Budget Start: 2021-05-01
Budget End: 2026-04-30
Support Year:
Fiscal Year: 2020
Total Cost: $128,521
Indirect Cost:
Name: Michigan State University
Department:
Type:
DUNS #:
City: East Lansing
State: MI
Country: United States
Zip Code: 48824