Any given normal sample contains some set of cell types, say N of them. Any tissue array measurement of a sample is therefore a convex linear combination of N cell-type-specific signatures. Biopsy samples differ in their linear combination coefficients and also in the genotypes of the subjects. The presumption in this analysis is that genotypic differences between normal samples are not significant. (The entire analysis is pointless if the genotypic differences overwhelm the similarities attributable to abstract cell types.) So there is a column-wise convex matrix M_S (each column is non-negative and sums to one) such that for every gene g, S_g = B_g M_S, where B_g is the cell type expression vector for g and S_g is the vector of expression data for g across all subjects. The matrix M_S is independent of the gene, so this is an overdetermined system whose solution is to be determined by optimization. Each column of M_S depends only on the corresponding subject. Having found an overall consistent matrix M_S, we can deduce B for all genes. In reality we do not know N, so we have to do model comparison for the choice of N.

We modified the Non-negative Matrix Factorization algorithm due to Lee and Seung by adding a step in which the matrix M_S is made convex prior to the recursion step. We stopped the algorithm when updates no longer materially changed the matrix distance between S_g and B_g M_S. We showed that our algorithm is noise tolerant, giving reasonable results even with 50% noise. We applied a Minimum Description Length criterion to determine the correct value of N and found that, on test data with up to 40% added noise, we can still determine the correct value of N. We have also tested the algorithm for robustness against varying the number of subjects and the number of genes measured.

Our goal is now to validate our methods by applying our algorithm to clinical data and comparing our results to a pathologist's report.
Can one associate cancer genotypes with the exclusive-or appearance of specific extreme points? Can one use this approach for other decomposition problems?
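
The factorization described above can be illustrated with a minimal sketch. It is not the project's implementation: it assumes the standard Lee and Seung multiplicative updates for the Frobenius distance, assumes "made convex" means rescaling each column of M_S to sum to one, and uses hypothetical names (convex_nmf, n_types, tol) chosen only for this example. S stacks the per-gene vectors S_g as rows (genes by subjects), and B stacks the vectors B_g (genes by N).

```python
import numpy as np

def convex_nmf(S, n_types, n_iter=500, tol=1e-6, eps=1e-12, seed=0):
    """Approximate S (genes x subjects) as B @ M with B, M >= 0.

    Assumption: columns of M are rescaled to sum to one before each
    multiplicative update, so each subject is a convex mix of cell types.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_subjects = S.shape
    B = rng.random((n_genes, n_types)) + eps     # cell type signatures
    M = rng.random((n_types, n_subjects)) + eps  # mixing coefficients
    prev_err = np.inf
    for _ in range(n_iter):
        # Convexity step: make every column of M sum to one.
        M /= M.sum(axis=0, keepdims=True) + eps
        # Lee-Seung multiplicative updates (Frobenius objective).
        B *= (S @ M.T) / (B @ (M @ M.T) + eps)
        M *= (B.T @ S) / ((B.T @ B) @ M + eps)
        # Stop when the distance between S and B @ M barely changes.
        err = np.linalg.norm(S - B @ M)
        if abs(prev_err - err) < tol * max(err, eps):
            break
        prev_err = err
    # Return a column-wise convex M.
    M /= M.sum(axis=0, keepdims=True) + eps
    return B, M
```

Running this for several candidate values of n_types and comparing the fits under a model selection criterion would correspond to the choice of N discussed above; the Minimum Description Length criterion itself is not specified in the text and is not sketched here.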

Support Year: 1
Fiscal Year: 2007
Total Cost: $49,110
Country: United States