With the number of human health studies involving metabolomics rising rapidly, methods that address key analytic barriers in metabolomics data analysis are critically important. Missing values (MVs) are a pervasive and often ignored issue in metabolomics, yet their treatment can have a substantial impact on differential abundance and other downstream statistical analyses. The MV problem is challenging mainly because the source of an MV is not always clear: the metabolite may be i) not biologically present in the sample, ii) present in the sample but at a concentration below the lower limit of detection (LOD), or iii) present in the sample but undetected due to technical issues in sample pre-processing (e.g., peak resolution). Commonly used methods (e.g., substitution by zero, the LOD, or the mean value) tend to be overly simplistic and produce suboptimal and potentially misleading results. Because the literature lacks imputation methods that properly account for the different types of missingness in metabolomics data, there is an urgent need to improve statistical models of MVs that are specific to metabolomics. We have recently developed a modified K-nearest neighbors (KNN) imputation algorithm, KNN-TN, that accounts for the truncation point (i.e., the LOD) in the data. In simulations derived from real metabolomics studies, this algorithm showed considerable improvement in imputation accuracy (root-mean-squared error) compared with single-value (LOD, mean, zero) imputation approaches and standard KNN imputation. In this proposal, we will develop an alternative Bayesian modeling approach that accounts for the uncertainty due to imputation and stabilizes estimates for small samples by sharing information across metabolites.
Further, we will evaluate the impact of MV imputation on downstream statistical analyses using simulations based on a wide variety of publicly available datasets from the Metabolomics Workbench. These analyses will allow us to make comprehensive recommendations to analysts about which imputation algorithm(s) are optimal in terms of biological impact. Lastly, we will develop publicly available software implementing all developed imputation methods, including a web-accessible interface to broaden outreach and impact. The overall long-term goal of this proposal is to develop user-friendly software and best-practice guidelines for imputation strategies in metabolomics data, thereby improving the accuracy of downstream statistical analysis and the resulting biological impact.
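As a rough illustration of the single-value baselines named above (zero, LOD, and mean substitution) and of scoring them by root-mean-squared error, the following minimal sketch simulates one metabolite whose values below the LOD go missing. It is not the KNN-TN algorithm or the proposed Bayesian model. The function names, the normal distribution for log-abundances, and the parameter values are all assumptions made for illustration, and only missingness type ii) (values below the LOD) is simulated.

```python
import math
import random
import statistics


def simulate_left_censored(n, mu, sigma, lod, seed=0):
    """Draw n hypothetical log-abundances; values below the LOD become None."""
    rng = random.Random(seed)
    truth = [rng.gauss(mu, sigma) for _ in range(n)]
    observed = [x if x >= lod else None for x in truth]
    return truth, observed


def impute_constant(observed, fill):
    """Replace every missing entry with a single fill value."""
    return [fill if x is None else x for x in observed]


def rmse_on_missing(truth, observed, imputed):
    """RMSE between true and imputed values, over the missing entries only."""
    sq_errs = [(t - i) ** 2
               for t, o, i in zip(truth, observed, imputed) if o is None]
    return math.sqrt(sum(sq_errs) / len(sq_errs)) if sq_errs else 0.0


# Assumed simulation settings (illustrative only).
LOD = 9.5
truth, observed = simulate_left_censored(n=500, mu=10.0, sigma=1.0,
                                         lod=LOD, seed=1)
detected = [x for x in observed if x is not None]

# The three single-value strategies mentioned in the abstract.
strategies = {
    "zero": 0.0,
    "lod": LOD,
    "mean": statistics.mean(detected),
}

results = {
    name: rmse_on_missing(truth, observed, impute_constant(observed, fill))
    for name, fill in strategies.items()
}

for name, value in results.items():
    print(f"{name:>4}: RMSE = {value:.3f}")
```

In this toy setting every missing value lies below the LOD, so the mean of the detected values always sits farther from the truth than the LOD itself. Real data mix all three sources of missingness, so this ordering should not be read as a general result.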

Public Health Relevance

Missing value imputation is a crucial step in the analytic pipeline for metabolomics data and can strongly affect downstream statistical analyses. Commonly used simplistic approaches (e.g., substitution by zero, the lower limit of detection, or the mean value) underestimate the variability associated with imputation and can produce biased and misleading results. This proposal addresses this critical analytic problem by 1) developing methods and software for missing value imputation in metabolomics data and 2) presenting guidelines for selecting the best imputation algorithm in terms of downstream analyses and the resulting biological impact.

Agency: National Institutes of Health (NIH)
Institute: National Cancer Institute (NCI)
Type: Small Research Grants (R03)
Project #: 1R03CA222446-01
Application #: 9433329
Study Section: Special Emphasis Panel (ZRG1)
Program Officer: Spalholz, Barbara A
Project Start: 2017-09-20
Project End: 2019-09-19
Budget Start: 2017-09-20
Budget End: 2019-09-19
Support Year: 1
Fiscal Year: 2017
Total Cost:
Indirect Cost:
Name: Ohio State University
Department: Miscellaneous
Type: Schools of Medicine
DUNS #: 832127323
City: Columbus
State: OH
Country: United States
Zip Code: 43210