At present, several public efforts are under way to generate massive data sets derived from all available cancer tissues. Such data are already available for four forms of cancer (ovarian, lung, breast and colon), and more are on the way. A characteristic of these data sets is that the number of measured features is in the tens of thousands, while the number of tissue samples for each form of cancer is in the hundreds. The main challenge, therefore, is to extract the most informative features that can be used to distinguish one set of cancer patients from another, for example, those who respond to a particular form of therapy from those who do not. Such features, referred to as biomarkers, can then be used to develop therapies that are customized to focused groups or even individual patients. However, almost all available algorithms for extracting relevant features from big data sets face a "barrier" in that the number of features extracted is bounded below by the number of training samples. This number, which might be in the hundreds, is far too large to be useful in biological applications. This project proposes to develop novel algorithms for feature extraction that can break through this "barrier" and identify far fewer features than the number of training samples. The newly developed algorithms will be analyzed in terms of their statistical behavior and their optimality; in addition, they will be validated on actual data sets from lung, ovarian and endometrial cancer.

Another important aspect of current cancer therapy is the widespread acceptance of the need to use multi-drug combinations. When a patient is treated with a single drug, the tumor almost invariably grows back even if it shrinks initially, and the relapsed tumor is often resistant to the drug. Due to combinatorial explosion, it is not feasible to try out all possible combinations of drugs in experimental settings. Moreover, due to the complexity of the behavior of cancer cells, it is also not possible to develop analytical models for the mechanisms of action of multiple drugs used in combination. It is therefore imperative to develop methodologies for predicting the efficacy of multi-drug combinations while making almost no assumptions about the mechanism of action of each drug. This project proposes to use the so-called "maximum entropy method" to develop such a prediction methodology. The maximum entropy method was developed about fifty years ago in the context of deriving equilibrium statistical mechanics from information theory, and is widely accepted as one of the best methods to use when the goal is to minimize the number of a priori assumptions.
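To make the maximum entropy idea concrete, the following is a minimal textbook-style sketch, not the prediction methodology proposed in this project. It assumes a toy setting: a finite set of outcomes and a single known constraint (the mean). Among all distributions that satisfy the constraint, the one with the largest entropy has the form p_k proportional to exp(lam * v_k), and the multiplier lam is found here by bisection. The function names `maxent_distribution` and `entropy`, and the die-like example values, are illustrative assumptions.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def maxent_distribution(values, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Maximum entropy distribution on `values` with a prescribed mean.

    The solution has the form p_k ~ exp(lam * v_k). The mean is an
    increasing function of lam, so lam can be found by bisection.
    `target_mean` must lie strictly between min(values) and max(values).
    """
    values = list(values)

    def dist(lam):
        exps = [lam * v for v in values]
        m = max(exps)                       # subtract the max for stability
        weights = [math.exp(e - m) for e in exps]
        z = sum(weights)
        return [w / z for w in weights]

    def mean(lam):
        return sum(q * v for q, v in zip(dist(lam), values))

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist(0.5 * (lo + hi))

# Toy example: outcomes 1..6 whose average is known to be 4.5.
faces = [1, 2, 3, 4, 5, 6]
p = maxent_distribution(faces, 4.5)
print("probabilities:", [round(q, 4) for q in p])
print("entropy:", round(entropy(p), 4))
```

The only information used is the stated constraint; nothing else about the outcomes is assumed, which is the sense in which the method minimizes a priori assumptions.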

Intellectual Merit: Currently available algorithms for classification and regression, such as LASSO, the elastic net, and the Dantzig selector, share the property that the number of key features extracted is roughly equal to the number of training samples. However, even this number is too large to be of practical use in biological situations. Preliminary investigations of a new algorithm invented by the PI show that it does not have this limitation. Moreover, this new algorithm has shown promising performance on two types of cancer data sets: endometrial and ovarian. If a sound theoretical foundation can be established for the observed behavior of this algorithm, as well as for another that is still in the conceptual stage, that would be a very significant contribution to statistics and to machine learning theory. On another front, if it can be established through theory and experiment that the maximum entropy method can be used to predict the efficacy of multi-drug combinations, that would greatly advance both the theory of the method and the practical applicability of multi-drug therapy.
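For readers unfamiliar with the existing methods named above, the following is a minimal sketch of LASSO feature selection in the "many features, few samples" regime. It illustrates the baseline only, not the PI's new algorithm. It assumes the standard LASSO objective (1/2n)||y - Xb||^2 + lam*||b||_1, solved by plain coordinate descent in pure Python; the synthetic data, the problem sizes, and the helper names (`make_data`, `lasso_cd`, `lambda_max`) are hypothetical choices made for illustration.

```python
import random

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lambda_max(X, y):
    """Smallest penalty at which the LASSO solution is identically zero."""
    n, p = len(X), len(X[0])
    return max(abs(sum(X[i][j] * y[i] for i in range(n))) / n for j in range(p))

def lasso_cd(X, y, lam, sweeps=200):
    """Minimise (1/2n)||y - X b||^2 + lam * ||b||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta starting at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes feature j.
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

def make_data(n=8, p=40, seed=0):
    """Synthetic data: p features, n samples, only three features matter."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    true_beta = [0.0] * p
    true_beta[0], true_beta[1], true_beta[2] = 2.0, -1.5, 1.0
    y = [sum(X[i][j] * true_beta[j] for j in range(p)) + 0.1 * rng.gauss(0, 1)
         for i in range(n)]
    return X, y

X, y = make_data()
lam = 0.5 * lambda_max(X, y)
beta = lasso_cd(X, y, lam)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print(f"{len(X)} samples, {len(X[0])} features, {len(selected)} selected: {selected}")
```

Lowering `lam` toward zero lets more coefficients become nonzero, so the number of selected features can be compared directly with the number of samples as the penalty varies.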

Broader Impacts: Cancer is the second leading cause of death in the USA, in other industrialized countries, and also in newly industrializing countries. It is widely accepted that cancer is the most "individual" of diseases in that no two manifestations are alike. Therefore personalized therapy is the way forward. However, there are very few methodologies for developing personalized therapy that are agnostic to the type of cancer. The present project aims to develop precisely such methodologies. Given the large mindshare of cancer in the scientific community and in society at large, it can be safely assumed that if the project is successfully completed, the research findings will be followed up by the cancer research community. To hasten the process, the PI will work with several cancer researchers at the UT Southwestern Medical Center in Dallas and at the M. D. Anderson Cancer Center in Houston.

The project will entail the training of two graduate students and at least one undergraduate summer intern per year. This will increase the pool of trained researchers and disseminate the analytical approach to cancer therapy design to a broader audience.

Budget Start: 2013-07-01
Budget End: 2018-06-30
Fiscal Year: 2013
Total Cost: $369,605
Name: University of Texas at Dallas
City: Richardson
State: TX
Country: United States
Zip Code: 75080