During the past decade, the new focus on genomics has highlighted a particular challenge: to integrate the different views of the genome that are provided by various types of experimental data. The long-term objective of this work is to provide a coherent computational framework for integrating and drawing inferences from a collection of genome-wide measurements. Hence, the proposed research plan develops algorithms and computational tools for learning from heterogeneous data sets. We focus on the analysis of the yeast genome because so many genome-wide data sets are currently available; however, the tools we develop will be applicable to any genome. We approach this task using two recent trends from the field of machine learning: kernel algorithms that represent data via specialized similarity functions, and transductive algorithms that exploit the availability of unlabeled test data during the training phase of the algorithm. We apply focus on two tasks: (1) classifying groups of genes that are of interest to our collaborators, including components of the spindle pole body, cell cycle regulated genes, and genes involved in meiosis and sporulation, splicing, alcohol metabolism, etc., and (2) prediction of protein-protein interactions. These two specific aims are not only important scientific tasks, but also represent typical challenges that future genomic studies will face. Accomplishing these aims requires the integration of many heterogeneous sources of data, the prediction of multiple properties of genes and proteins, the explicit introduction of domain knowledge, the automatic introduction of knowledge from side information, scalability to large data sizes, and tolerance of large levels of noise. ? ?
Showing the most recent 10 out of 20 publications