This subproject is one of many research subprojects utilizing the resources provided by a Center grant funded by NIH/NCRR. The subproject and investigator (PI) may have received primary funding from another NIH source, and thus could be represented in other CRISP entries. The institution listed is for the Center, which is not necessarily the institution for the investigator.

Analysis of gene expression data for cancer classification can provide valuable information for early diagnosis and treatment. Extracting patterns computationally from microarray gene expression data is a non-trivial task that requires careful algorithm design and analysis tailored to the domain. Extracting biologically significant knowledge from these data is also a growing computational challenge: the number of genes, which can correspond to different time sequences or tissue types, exceeds the number of evaluated samples by several orders of magnitude. During this reporting period, we developed a formal approach for feature extraction of genes in three steps: first applying feature selection heuristics based on statistical impurity measures, then analyzing the associative dependencies between the genes, and finally assigning weights to the genes based on their degree of participation in the discovered rules. We then developed weighted Jaccard and vector cosine similarity measures to compute the similarity between the discovered rules. To demonstrate the usability and efficiency of the technique, we applied it to three publicly available multiclass cancer gene expression datasets and performed a biomedical literature search to support the effectiveness of our results.
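As an illustration of the rule-similarity step, the following is a minimal Python sketch, not the project's actual implementation. It assumes that each discovered rule can be represented as a list of gene identifiers, and that a gene's weight is the fraction of rules in which it participates; the abstract does not give the exact weighting formula, so that definition is an assumption. The function names and the gene labels in the usage note are hypothetical.

```python
from math import sqrt


def gene_weights(rules):
    """Weight each gene by the fraction of rules it appears in.

    Assumption: the abstract only says weights reflect a gene's "degree of
    participation in the rules"; rule frequency is one simple stand-in.
    """
    if not rules:
        return {}
    counts = {}
    for rule in rules:
        for gene in set(rule):  # count a gene at most once per rule
            counts[gene] = counts.get(gene, 0) + 1
    return {gene: n / len(rules) for gene, n in counts.items()}


def weighted_jaccard(rule_a, rule_b, weights):
    """Weighted Jaccard: weight of shared genes over weight of all genes."""
    a, b = set(rule_a), set(rule_b)
    total = sum(weights.get(g, 0.0) for g in a | b)
    if total == 0:
        return 0.0
    shared = sum(weights.get(g, 0.0) for g in a & b)
    return shared / total


def weighted_cosine(rule_a, rule_b, weights):
    """Cosine similarity of two rules viewed as gene-weight vectors.

    Each rule becomes a vector with the gene's weight where the gene is
    present and 0 elsewhere, so only shared genes contribute to the dot
    product.
    """
    a, b = set(rule_a), set(rule_b)
    dot = sum(weights.get(g, 0.0) ** 2 for g in a & b)
    norm_a = sqrt(sum(weights.get(g, 0.0) ** 2 for g in a))
    norm_b = sqrt(sum(weights.get(g, 0.0) ** 2 for g in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

For example, given rules such as `[["GENE_A", "GENE_B"], ["GENE_B", "GENE_C"]]` (placeholder gene labels), calling `gene_weights(rules)` and passing the result to `weighted_jaccard` or `weighted_cosine` yields a score in [0, 1] for each pair of rules. Both measures return 0 for rules that share no genes and 1 for identical rules, and genes that participate in many rules contribute more to the score than rarely used ones.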