During the past decade, the new focus on genomics has highlighted a particular challenge: integrating the different views of the genome that are provided by various types of experimental data. The long-term objective of this work is to provide a coherent computational framework for integrating and drawing inferences from a collection of genome-wide measurements. To that end, the proposed research plan develops algorithms and computational tools for learning from heterogeneous data sets. We focus on the analysis of the yeast genome because so many genome-wide data sets are currently available; however, the tools we develop will be applicable to any genome. We approach this task using two recent trends from the field of machine learning: kernel algorithms that represent data via specialized similarity functions, and transductive algorithms that exploit the availability of unlabeled test data during the training phase of the algorithm. We focus on two tasks: (1) classifying groups of genes that are of interest to our collaborators, including components of the spindle pole body, cell cycle regulated genes, and genes involved in meiosis and sporulation, splicing, alcohol metabolism, etc., and (2) predicting protein-protein interactions. These two specific aims are not only important scientific tasks but also represent typical challenges that future genomic studies will face. Accomplishing these aims requires the integration of many heterogeneous sources of data, the prediction of multiple properties of genes and proteins, the explicit introduction of domain knowledge, the automatic introduction of knowledge from side information, scalability to large data sizes, and tolerance of high levels of noise.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Exploratory/Developmental Grants Phase II (R33)
Project #
5R33HG003070-02
Application #
6952028
Study Section
Special Emphasis Panel (ZRG1-SSS-Y (11))
Program Officer
Bonazzi, Vivien
Project Start
2004-09-30
Project End
2007-08-31
Budget Start
2005-09-01
Budget End
2006-08-31
Support Year
2
Fiscal Year
2005
Total Cost
$412,000
Indirect Cost
Name
University of Washington
Department
Genetics
Type
Schools of Medicine
DUNS #
605799469
City
Seattle
State
WA
Country
United States
Zip Code
98195
Muratore, Kathryn E; Engelhardt, Barbara E; Srouji, John R et al. (2013) Molecular function prediction for a family exhibiting evolutionary tendencies toward substrate specificity swapping: recurrence of tyrosine aminotransferase activity in the Iα subfamily. Proteins 81:1593-609
Sankararaman, Sriram; Kimmel, Gad; Halperin, Eran et al. (2008) On the inference of ancestries in admixed populations. Genome Res 18:668-75
Qiu, Jian; Noble, William Stafford (2008) Predicting co-complexed protein pairs from heterogeneous data. PLoS Comput Biol 4:e1000054
Pena-Castillo, Lourdes; Tasan, Murat; Myers, Chad L et al. (2008) A critical assessment of Mus musculus gene function prediction using integrated genomic evidence. Genome Biol 9 Suppl 1:S2
Bleakley, Kevin; Biau, Gerard; Vert, Jean-Philippe (2007) Supervised reconstruction of biological networks with local models. Bioinformatics 23:i57-65
Qiu, Jian; Hue, Martial; Ben-Hur, Asa et al. (2007) A structural alignment kernel for protein structures. Bioinformatics 23:1090-8
Vert, Jean-Philippe; Qiu, Jian; Noble, William S (2007) A new pairwise kernel for biological network inference with support vector machines. BMC Bioinformatics 8 Suppl 10:S8
Xing, Eric P; Jordan, Michael I; Sharan, Roded (2007) Bayesian haplotype inference via the Dirichlet process. J Comput Biol 14:267-84
Mann, Tobias P; Noble, William Stafford (2006) Efficient identification of DNA hybridization partners in a sequence database. Bioinformatics 22:e350-8
Lewis, Darrin P; Jebara, Tony; Noble, William Stafford (2006) Support vector machine learning from heterogeneous data: an empirical analysis using protein sequence and structure. Bioinformatics 22:2753-60

Showing the most recent 10 out of 20 publications