Mass spectrometry in combination with chromatography provides a powerful approach to characterize small molecules produced in cells, tissues and other biological systems. In essence, measured metabolites provide a functional readout of cellular state, allowing novel biological studies that advance our understanding of health and disease. Currently, the main bottleneck in metabolomics is determining the chemical identities associated with the spectral signatures of measured masses. Despite the growth of spectral databases and advances in annotation tools that recommend the chemical structure that best explains each signature, the large majority of measured masses cannot be assigned a chemical identity. There is now consensus that gleaning partial information regarding the measured spectra in terms of chemical substructure or chemical classification can inform biological studies. This consensus is reflected in the newly updated reporting standards for metabolite annotation as proposed by the Metabolite Identification Task Group of the Metabolomics Society. As we show in our Preliminary Results, spectral characterization results in ?features? that can enhance performance in machine-learning tasks such as annotation. This work aims to enhance the use and value of the metabolomics dataset in Metabolomics Workbench by: (1) developing machine-learning tools trained on this dataset to characterize unknown spectra, and (2) adding characterization information to the Metabolomics Workbench dataset.
In Aim 1, we identify spectral patterns (motifs) that can represent chemically meaningful groupings of peaks within the spectra (e.g., peaks associated with aromatic substructures, loss of a substructure fragment, etc.). We utilize neural topic models that use variational inference to identify such motifs. We expect such models to offer computational speedups and to identify more chemically coherent motifs when compared to earlier implementations of topic modeling. We generate motifs across all spectra in the Metabolomics Workbench and provide annotations for each spectrum.
In Aim 2, we map spectral signatures to chemical ontology classes. As ontologies are hierarchical and as a molecule can be associated with multiple classes at different hierarchical levels of an ontology, we cast this mapping problem as a hierarchical multi-label classification problem and use neural networks to implement such a classifier. The classifier will be trained using the Metabolomics Workbench dataset. Learned motifs from Aim 1 will be used as additional input features to improve classification. We expect that the developed classifier can be used by others to elucidate measurements of unidentified molecules with chemical ontology classes, or to generate ontology terms that can be used as features in downstream machine-learning tasks.

Public Health Relevance

The project proposes to investigate machine learning techniques to enhance the utility of a Common Fund data set hosted through the Metabolomics Workbench. This data set consists of biologically relevant molecules and information about their structural composition and their mass spectrometry signatures. We anticipate that our techniques will result in annotating and adding information to the data set, which in turn will advance discoveries in biomedical research and have direct benefits to human health.

Agency
National Institute of Health (NIH)
Institute
Office of The Director, National Institutes of Health (OD)
Type
Small Research Grants (R03)
Project #
1R03OD030601-01
Application #
10111982
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Kinsinger, Christopher
Project Start
2020-09-15
Project End
2021-08-31
Budget Start
2020-09-15
Budget End
2021-08-31
Support Year
1
Fiscal Year
2020
Total Cost
Indirect Cost
Name
Tufts University
Department
Biostatistics & Other Math Sci
Type
Biomed Engr/Col Engr/Engr Sta
DUNS #
073134835
City
Boston
State
MA
Country
United States
Zip Code
02111