This Small Business Innovation Research project addresses the problem of assessing reproducibility in analyzing high-throughput data. In feature selection for data with large numbers of features, it is well known that some features will appear to affect an outcome by chance, and that subsequent predictions based on these features may not be as successful as initial results would seem to indicate. Similarly, there are often multiple stages, and many parameters, involved in the multivariate assays designed to analyze high-throughput profiles. For example, good results achieved with a particular combination of settings for an instance of cross-validation may not generalize to other instances. The objective of this proposal is to extend new statistical methods for assessing reproducibility in replicate experiments to the context of machine learning, and demonstrate effectiveness in this application. The machine-learning methods to be investigated will include random forests, supervised principal components, lasso penalization and support vector machines. We will use simulated and real data from genomic applications to show the potential of this approach for providing reproducibility assessments that are not confounded with prespecified choices, for determining biologically relevant thresholds, for improving the accuracy of signal identification, and for identifying suboptimal results.
Relevance. Although today's high-throughput technologies offer the possibility of revolutionizing clinical practice, the analytical tools available for extracting information from this amount of data are not yet sufficiently developed for targeted exploration of the underlying biology.
This project directly addresses the need to make what the FDA terms In Vitro Diagnostic Multivariate Index Assays (IVDMIAs) transparent, interpretable, and reproducible. It is thus an opportunity to improve the analysis products and services provided to companies that identify, characterize, and validate biomarkers for clinical diagnostics and drug development decision points. The long-term goal of the proposed project is to develop a platform for biomarker discovery and integrative genomic analysis, with reproducibility assessment incorporated into multivariate assays. Such a platform will enable evaluation and improvement of approaches to detecting the biological factors that affect a particular outcome, and lead to more efficient and more effective methods for disease diagnosis, treatment monitoring, and therapeutic drug development.
Statistical models play a key role in medical research by uncovering information from data that leads to new diagnostics and therapies. However, the development of standards for reliability in biomedical data mining has not kept up with the rapid pace at which new data types and modeling approaches are being devised. This proposal develops new methods for quantifying reproducibility in biomedical data analyses that will have a far-reaching impact on public health by streamlining protocols, reducing costs, and offering more effective clinical support systems.