This research will develop systematic procedures which take advantage of computer-related developments and advanced combinatorial optimization techniques, to build on previously successful ad-hoc methods for optimizing feature selection in data analysis, with special attention to bioinformatics. Knowledge extraction from data represents a fundamental challenge in information technology research. A very frequent type of knowledge extraction problem is that of analyzing archives of records or observations in order to discover hidden structural relationships. Problems of this type appear in numerous areas of science, technology, medicine, management, and in countless other areas of activity. The advent of the computer and of the Internet have radically increased the role of data analysis, by allowing not only the creation of large, meaningful datasets, but also by making them accessible to researchers all over the globe.
One of the most prominent areas of applications of data analysis is in systems biology and bioinformatics. In contrast to molecular biology investigations, which typically focus on single molecules, systems biology pays attention to tens or even hundreds of thousands of biological attributes at the same time. Moreover, the number of attributes included in a dataset is predicted to increase dramatically in the very near future. The new global approach to systems biology has been enabled by new technologies that have allowed the simultaneous measurement of large numbers of attributes and the generation of large multiparameter datasets. These biological datasets represent the domain of bioinformatics. Beside the classic methods of statistics, new approaches to the analysis of data are required. Among others, these methodologies prompt the development of entirely new research areas, including e.g., machine learning, data mining, neural networks, support vector machines. In all these areas of data analysis, the knowledge of a known (finite) subset of observations is used to derive conclusions about the entire set of possible observations.
While powerful analytic tools have been developed within the framework of statistics and of newer research areas based heavily on combinatorics, logic and optimization, the size of the problems in the area of bioinformatics (as well as in some other areas), leads to major computational difficulties, and raises the challenge of developing new approaches in which systematic heuristic procedures are integrated into solution algorithms, in order to intelligently reduce the size of the problems to be solved. By using ad-hoc combinations of heuristics and of combinatorics, logic, and optimization based algorithms, this project aims to achieve spectacular reductions of problem size without significant loss in the accuracy of the resulting models.
This project will introduce new concepts for evaluating the role and the impact of features in data analysis problems. These concepts combine elements of statistics, combinatorics, information theory, the theory of Boolean functions, the theories of games and of voting. On the algorithmic side, they open possibilities for systematic heuristic approaches to the elimination of redundant features. The project will also introduce new concepts for the comparative evaluation of pairs of features, including those of similarity and domination. These concepts combine elements from statistics, the theory of partially ordered sets, and that of Boolean algebra, and can add to the arsenal of tools available for the elimination of unnecessary features. As opposed to previous studies which view the sets of attributes just as collections of individual attributes, the project proposes a study of the combined efforts of attributes. The research plan includes the development of a local optimality criterion for a set of attributes which should combine the two desired characteristics of a "support set": it should allow the construction of accurate models, and it should, at the same time, be of a computationally manageable size.
In addition, the project will introduce new, synthetic "logical" attributes which allow, on the one hand, the compression of the dataset and, on the other hand, the possibility of finding clearly understandable and practical usable "logical" discriminants, which distinguish the positive observations from the negative ones. A very significant application of these ideas is the proposed algorithmic framework for optimizing feature selection. In the field of biological research and bioinformatics, the project introduces: (i) the concept of "groups of biomarkers" (obtainable through combinatorial optimization techniques), (ii) an algorithmic hypothesis generator for biological research, and (iii) a new approach for discovering new classes of observations with highly similar characteristics. A publicly available software package will result from the work and is expected to stimulate substantial research in the computational analysis of data, and in bioinformatics.
By combining elements of statistics, combinatorial optimization, information theory, the theories of Boolean functions, games, voting, and partially ordered sets, the project is expected to attract researchers from a variety of areas to collaborative studies. The publicly available software for the analysis of bioinformatics data, as well as the hypothesis generation system proposed, will provide major tools for stimulating research in biology and bioinformatics.