The research objective of this award is to develop a robust and efficient method to discover knowledge from large data sets that are high-dimensional, have imbalanced class distributions, and contain data arriving from different sources at different times. The key technique is a model-free knowledge discovery method that does not require the time-consuming pattern extraction step traditionally carried out in data mining. This is accomplished by converting a large data set into a compact contingency table through discretization, feature selection, and data summarization. A confusion-matrix-based measure of data set characteristics is used to ensure that feature selection is robust to class distribution, so that proper class separability is maintained in the reduced data set. When new data become available, the contingency table can be efficiently updated to enable incremental knowledge discovery. The developed method will be validated and applied to gene-expression-based diagnosis, which involves data sets with extremely high dimensionality (up to several hundred thousand features).
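The idea of summarizing data into a contingency table that can be updated incrementally can be illustrated with a minimal sketch. This is not the method developed under the award: the class name `ContingencyTable`, the helper `discretize`, the fixed cut points, and the class labels are all hypothetical, and the confusion-matrix-based feature selection step is assumed to have already chosen the features.

```python
from bisect import bisect_right
from collections import Counter


def discretize(value, edges):
    """Map a continuous value to a bin index, given sorted cut points."""
    return bisect_right(edges, value)


class ContingencyTable:
    """Counts of (discretized feature cell, class label) pairs.

    Hypothetical illustration: cut points are fixed up front, one sorted
    list per (already selected) feature.
    """

    def __init__(self, edges_per_feature):
        self.edges = edges_per_feature
        self.counts = Counter()

    def _cell(self, row):
        # One bin index per feature identifies a cell of the table.
        return tuple(discretize(v, e) for v, e in zip(row, self.edges))

    def update(self, rows, labels):
        """Fold a new batch into the table without revisiting old data."""
        for row, label in zip(rows, labels):
            self.counts[(self._cell(row), label)] += 1

    def class_counts(self, row):
        """Class frequencies observed in the cell that `row` falls into."""
        cell = self._cell(row)
        return {lab: n for (c, lab), n in self.counts.items() if c == cell}


# Two features: one cut point for the first, two for the second.
table = ContingencyTable([[0.5], [1.0, 2.0]])
table.update([[0.2, 1.5], [0.9, 0.3]], ["healthy", "disease"])
table.update([[0.1, 1.7]], ["healthy"])  # a later batch from another source
print(table.class_counts([0.3, 1.2]))
```

Because only cell counts are stored, a new batch costs time proportional to its own size, and the raw records seen earlier need not be kept.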
If successful, the results of this research will provide a knowledge discovery tool that enables domain experts to make better decisions in various applications, including evidence-based medicine in healthcare, fault diagnosis in maintenance, and customer relationship management in finance and retail. The target application of gene-expression-based diagnosis, if successful, would allow physicians to identify complex genetic traits that underlie different phenotypes, disease subtypes, and clinical outcomes. This will enable personalized intervention to maximize treatment effectiveness for individual patients. The collaboration with the College of Medicine and Cincinnati Children's Hospital makes it possible to access critical domain expertise and specialized computing resources, thus enhancing research and education infrastructure. Research results will be incorporated into both Engineering and Medicine graduate courses so that students can benefit from the most recent interdisciplinary research advances.