Predictive modeling of biomedical data arising from clinical studies for early detection, monitoring and prognosis of diseases is a crucial step in biomarker discovery. Because the data are typically measurements subject to error, and the sample size of any study is very small compared to the number of variables measured, the validity and verification of models arising from such data sets significantly affect the discovery of reliable discriminatory markers for a disease. An important opportunity to make the most of these scarce data is to combine information from multiple related data sets for more effective biomarker discovery. Because the costs of creating large data sets for every disease of interest are likely to remain prohibitive, methods that make more effective use of related biomarker discovery data sets remain important. Solution: This project develops and applies Transfer Rule Learning (TRL), a novel framework for integrative biomarker discovery from related but separate data sets, such as those generated from similar biomarker profiling studies. TRL alleviates the problem of data scarcity by providing automated ways to express, verify and use prior hypotheses generated from one data set while learning new knowledge from a related data set. This is the first study of transfer learning for biomarker discovery. Unlike other transfer learning approaches, TRL takes knowledge in the form of interpretable, modular classification rules and uses them to seed the learning of a rule model on a new data set. Classification rules simplify the extraction of discriminatory markers, and have been used successfully for biomarker discovery and verification in a non-integrative fashion.
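The idea of seeding a target rule model with rules from a source data set can be illustrated with a minimal sketch. This is not the project's TRL implementation: the rule representation, the precision threshold, the greedy single-condition learner, and the marker names and values below are all assumptions made up for illustration. Source rules are first verified against the target data, and only those that still hold are kept as seeds; new rules are then learned for the target samples the seeds do not cover.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Sample = Dict[str, float]            # marker name -> measured value
Condition = Tuple[str, str, float]   # (marker, ">" or "<=", threshold)

@dataclass(frozen=True)
class Rule:
    conditions: Tuple[Condition, ...]
    label: str

    def covers(self, x: Sample) -> bool:
        # A rule fires only if every condition holds; a missing marker never matches.
        for marker, op, thr in self.conditions:
            if marker not in x:
                return False
            v = x[marker]
            if (op == ">" and not v > thr) or (op == "<=" and not v <= thr):
                return False
        return True

def precision(rule: Rule, X: List[Sample], y: List[str]) -> Tuple[float, int]:
    """Return (fraction of covered samples carrying the rule's label, number covered)."""
    covered = [lab for x, lab in zip(X, y) if rule.covers(x)]
    if not covered:
        return 0.0, 0
    return sum(lab == rule.label for lab in covered) / len(covered), len(covered)

def verify_source_rules(source_rules, X, y, min_precision=0.8, min_cover=2):
    """Keep only the source rules that still hold on the target data."""
    kept = []
    for r in source_rules:
        p, n = precision(r, X, y)
        if p >= min_precision and n >= min_cover:
            kept.append(r)
    return kept

def learn_with_seeds(seeds, X, y, min_precision=0.8):
    """Start from the seed rules, then greedily add single-condition rules
    for target samples that no current rule covers (illustrative learner only)."""
    rules = list(seeds)
    while True:
        uncovered = [i for i, x in enumerate(X) if not any(r.covers(x) for r in rules)]
        if not uncovered:
            break
        best, best_gain = None, 0
        for i in uncovered:
            for marker in sorted(X[i]):
                for op in (">", "<="):
                    cand = Rule(((marker, op, X[i][marker]),), y[i])
                    p, _ = precision(cand, X, y)
                    gain = sum(cand.covers(X[j]) for j in uncovered)
                    if p >= min_precision and gain > best_gain:
                        best, best_gain = cand, gain
        if best is None:   # no acceptable rule left; stop rather than overfit
            break
        rules.append(best)
    return rules

def predict(rules, x: Sample, default: str = "unknown") -> str:
    # First matching rule wins; seeds come first, so transferred knowledge has priority.
    for r in rules:
        if r.covers(x):
            return r.label
    return default

# Hypothetical rules learned earlier on a source data set.
source_rules = [
    Rule((("marker_A", ">", 5.0),), "case"),
    Rule((("marker_B", "<=", 1.0),), "control"),  # does not hold on the target data
]

# Hypothetical target data set (values invented for the example).
X_target = [
    {"marker_A": 7.0, "marker_B": 0.5, "marker_C": 2.0},
    {"marker_A": 6.5, "marker_B": 0.8, "marker_C": 2.5},
    {"marker_A": 6.0, "marker_B": 3.0, "marker_C": 2.2},
    {"marker_A": 3.0, "marker_B": 2.0, "marker_C": 9.0},
    {"marker_A": 2.5, "marker_B": 2.5, "marker_C": 8.0},
    {"marker_A": 4.0, "marker_B": 3.5, "marker_C": 8.5},
]
y_target = ["case", "case", "case", "control", "control", "control"]

seeds = verify_source_rules(source_rules, X_target, y_target)
model = learn_with_seeds(seeds, X_target, y_target)
for rule in model:
    print(rule)
```

In this toy setting, the first source rule survives verification and is placed at the head of the target model, while the second is discarded because it contradicts the target labels. A real study would assess the resulting model on held-out data, for example by cross-validation, rather than on the data used to learn it.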
Specific Aims: This project tests the main hypothesis that TRL provides a mechanism for transfer learning of classification rules between related source and target data sets that improves performance on the target data, compared to learning without transfer. TRL will be evaluated using cross-validated classification accuracy and transfer measures, on related groups of existing biomarker discovery data sets obtained from multiple experimental platforms for lung cancer detection and prognosis. A new set of independent validation data will be generated for early detection of lung cancer to test the models generated on pilot data. The project will also examine how different modeling algorithms affect transfer outcomes. Significance: The TRL framework and tool are important for combined analysis and interpretation of clinical data, as they support incremental building, verification and refinement of rule models for predictive biomedicine. Applying TRL to real-world biomarker discovery data sets can yield insights into novel interactions involving known markers and into the most reliable biomarkers for early detection of disease, particularly lung cancer. This project has the potential to help create new diagnostic screening tools for lung cancer detection. It also lays a foundation for understanding the use of transfer learning in integrative biomarker discovery, which could lead to novel technologies for combining information from data and prior knowledge.
This project will develop much-needed computational methods for integrative biomarker discovery from related but separate data sets produced by predictive molecular profiling studies of disease. It will generate new experimental data for early detection of lung cancer, and has the potential to help create new diagnostic screening tools for lung cancer, a leading cause of cancer death in the United States.
Dutta-Moscato, Joyeeta; Gopalakrishnan, Vanathi; Lotze, Michael T et al. (2014) Creating a pipeline of talent for informatics: STEM initiative for high school students in computer science, biology, and biomedical informatics. J Pathol Inform 5:12