Detecting and quantifying products of cellular metabolism using mass spectrometry (MS) has already shown great promise in biomarker discovery, nutritional analysis, and other biomedical research fields. Despite recent advances in analysis techniques, our ability to interpret MS measurements remains limited. The biggest challenge in metabolomics is annotation, where measured compounds are assigned chemical identities. The annotation rates of current computational tools are low: in several surveyed metabolomics studies, fewer than 20% of all compounds are annotated. Another contributing factor to low annotation rates is the lack of systematic ways of designing a candidate set, a listing of putative chemical identities that can be used during annotation. Relying on existing databases is problematic because, given the large combinatorial space of molecular arrangements, many biologically relevant compounds are not catalogued in databases or documented in the literature. A secondary yet important challenge is interpreting the measurements to understand the metabolic activity of the sample under study. Current techniques are limited in their ability to use complex information about the sample to elucidate metabolic activity. The goal of this project is to develop computational techniques to advance the interpretation of large-scale metabolomics measurements. To address current challenges, we propose to pursue three Aims: (1) engineering candidate sets that enhance biological discovery; (2) developing new techniques for annotation, including deep learning and incremental build-out methods, to recommend novel chemical structures that best explain the measurements; and (3) constructing probabilistic models to analyze metabolic activity. Each technique will be rigorously validated computationally and experimentally using chemical standards. Two detailed case studies on the intestinal microbiota will allow us to further validate our tools.
Microbiota-derived metabolites have been detected in circulation and shown to engage host cellular pathways in organs and tissues beyond the digestive system. Identifying these metabolites is thus critical for understanding the metabolic function of the microbiota and elucidating their mechanisms. The complex test cases will challenge our techniques, provide feedback during development, and allow us to further disseminate our techniques. We will work closely with early adopters of our tools, as proposed in supporting letters, to further validate our tools and encourage wide adoption. All proposed tools will be open source and made accessible through the web. Our tools promise to change current practices in interpreting metabolomics data beyond what is currently possible with databases, current annotation tools, statistical and overrepresentation analysis, or combinations thereof. The use of machine learning and large data sets as proposed herein defines the most promising research direction in metabolomics analysis.

Public Health Relevance

Untargeted metabolomics is a recently developed technique that allows the measurement of thousands of molecules in a biological sample. This work proposes several novel computational techniques that address limitations of current metabolomics analysis tools. We anticipate that this work will advance discoveries in biomedical research and have direct benefits to human health.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Ravichandran, Veerasamy
Tufts University
Biostatistics & Other Math Sci
Biomed Engr/Col Engr/Engr Sta
United States