A significant challenge in the analysis of large-scale genomic and molecular profiles of cancer is the identification of distinct, molecularly independent disease subtypes and their association with clinically relevant outcomes. The main barriers to identifying molecularly defined, clinically relevant subtypes have been the high dimensionality of the feature space, limited sample sizes, and the low recurrence rate of mutations between patients. The intellectual merit of this project lies in developing theory, algorithms, and implementations of robust and scalable network-based machine learning and data mining techniques for disease subtype discovery in cancer from high-dimensional gene expression and gene mutation data. The results of the project can help to identify individual cancer, pan-cancer, and sex-specific subtypes, to better understand the nature of cancer, and to develop the most efficacious therapeutic strategies. The mathematical and machine-learning models developed in this study are general biological network-induced regularization models that are applicable to a broad range of supervised, semi-supervised, and unsupervised learning problems.
The goal of this project is to design novel network-based learning models that optimally integrate prior biological knowledge of gene regulatory mechanisms into learning algorithms. New group-based and Laplacian-based regularization techniques, together with restricted manifold learning in matrix factorization, are investigated to design reproducible models for disease subtyping. This is the first study to build an efficient toolkit for cancer subtype discovery that fully integrates discrete mutational profiles and continuous gene expression data. The project provides extensive cross-disciplinary training in Computer Science, Mathematics, and Engineering. The models developed during this study can be broadly applied as more genomic data for precision medicine become available.
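
To make the idea of Laplacian-based regularization in matrix factorization concrete, the following is a minimal sketch and not the project's actual model. It assumes a standard graph-regularized non-negative matrix factorization with multiplicative updates, in which a genes-by-samples matrix X is approximated by W H and the gene factor W is penalized by lam * trace(W^T L W), where L = D - A is the Laplacian of a gene network with adjacency matrix A. The function names, the toy data, and the parameters k, lam, and n_iter are all hypothetical choices made for illustration.

```python
import numpy as np


def graph_regularized_nmf(X, A, k, lam=1.0, n_iter=200, seed=0, eps=1e-10):
    """Illustrative graph-regularized NMF (assumed formulation, not the project's code).

    X : non-negative (genes x samples) data matrix
    A : symmetric non-negative (genes x genes) network adjacency matrix
    Approximately minimizes ||X - W H||_F^2 + lam * trace(W^T (D - A) W).
    """
    rng = np.random.default_rng(seed)
    n_genes, n_samples = X.shape
    D = np.diag(A.sum(axis=1))  # degree matrix of the gene network
    W = rng.random((n_genes, k)) + eps
    H = rng.random((k, n_samples)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        W *= (X @ H.T + lam * (A @ W)) / (W @ (H @ H.T) + lam * (D @ W) + eps)
        H *= (W.T @ X) / (W.T @ W @ H + eps)
    return W, H


def objective(X, A, W, H, lam):
    """Reconstruction error plus the Laplacian smoothness penalty on W."""
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.norm(X - W @ H) ** 2 + lam * np.trace(W.T @ L @ W)


# Toy data (entirely synthetic): two gene modules, two groups of samples.
rng = np.random.default_rng(1)
X = np.zeros((6, 8))
X[:3, :4] = 1.0
X[3:, 4:] = 1.0
X += 0.1 * rng.random(X.shape)

# Hypothetical gene network: genes within each module are connected.
A = np.zeros((6, 6))
A[:3, :3] = 1.0
A[3:, 3:] = 1.0
np.fill_diagonal(A, 0.0)

W, H = graph_regularized_nmf(X, A, k=2, lam=0.5)
labels = H.argmax(axis=0)  # assign each sample to its dominant factor
print(labels)
```

In this sketch the network penalty encourages connected genes to have similar loadings in W, and each sample is assigned to a candidate subtype by its largest coefficient in H. Discrete mutation profiles and group-based penalties, which the project also targets, would require a different loss or regularizer and are not covered here.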