Pathway analysis of genomic data (the use of prior knowledge about how genes function together in biological systems) plays an increasingly critical role in gaining biological insights from large-scale genomic studies, particularly in cancer research. However, even the richest source of computer-accessible biological pathway information, the Gene Ontology (GO), is very incomplete, which hampers pathway analyses. Over the past three years, a GO Consortium (GOC) project has shown that a rigorous phylogenetic approach can increase the amount of knowledge for human genes fivefold through careful use of experimental data obtained in model organisms such as the mouse, fruit fly, and yeast. The GOC project, however, relies on expert human biologists and will not scale to the entire human genome. Here, we propose to develop a computational approach that leverages the experience gained in the GOC project. We will develop an accurate, scalable computational solution to the gene function inference problem, which will dramatically increase the amount of biological information that can be used in the analysis of genome-scale human datasets. In brief, the task is to integrate knowledge obtained from experiments across multiple organisms, in the context of the family tree that relates the genes, by constructing a probabilistic model of function conservation and divergence. The main application of the probabilistic model will be to infer the function of human genes from experiments in other organisms. While each gene family will have a specific model that depends on its own unique history, to avoid overfitting we will estimate only a small number of parameters that are shared across all families. We propose to use the same rigorous model of functional evolution as employed in the GOC project, which is based on evolutionary gain and loss of different kinds of functions (e.g., a catalytic function, a binding function, or even participation in a biological process or pathway), using not only GO annotations but also additional information such as protein domain structure and active sites. We will use the manually curated examples from the GO Consortium as a training set for developing, as well as a test set for assessing, our computational inference method. We expect that this work will result in a dramatic increase in the number of GO annotations for human genes, resulting in much more informative results from pathway analysis and thus generating additional insights into human disease risk, progression, and potential therapies. While our approach is general, we will focus manual validation on cancer-related pathways in order to ensure applicability specifically in cancer research.
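As an illustration only, the kind of inference described above can be sketched with a deliberately simplified two-state model (function absent or present) evaluated on a small gene tree with the standard pruning recursion. This is not the project's actual method: the names `Node`, `posterior_has_function`, `gain`, and `loss` are hypothetical, the toy tree and parameter values are invented for the example, and the sketch assumes the same gain and loss probability on every branch regardless of branch length.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A node in a rooted gene tree (hypothetical structure for this sketch)."""
    name: str
    children: List["Node"] = field(default_factory=list)
    # Leaf evidence: 1 = function observed, 0 = observed absent, None = unknown.
    observed: Optional[int] = None


def transition(gain: float, loss: float) -> List[List[float]]:
    """Per-branch matrix P[parent_state][child_state], shared by all branches."""
    return [[1.0 - gain, gain], [loss, 1.0 - loss]]


def partial_likelihood(node: Node, P, clamp: Dict[str, int]) -> List[float]:
    """Return [L(evidence below | node=0), L(evidence below | node=1)]."""
    if not node.children:
        state = clamp.get(node.name, node.observed)
        if state is None:
            return [1.0, 1.0]  # no evidence: both states equally compatible
        return [1.0 if state == 0 else 0.0, 1.0 if state == 1 else 0.0]
    like = [1.0, 1.0]
    for child in node.children:
        child_like = partial_likelihood(child, P, clamp)
        for s in (0, 1):
            # Sum over the child's state, weighted by the branch transition.
            like[s] *= P[s][0] * child_like[0] + P[s][1] * child_like[1]
    return like


def tree_likelihood(root: Node, P, root_prior: float, clamp: Dict[str, int]) -> float:
    like = partial_likelihood(root, P, clamp)
    return (1.0 - root_prior) * like[0] + root_prior * like[1]


def posterior_has_function(root: Node, target: str, gain: float, loss: float,
                           root_prior: float = 0.5) -> float:
    """P(target leaf has the function | evidence at the other leaves)."""
    P = transition(gain, loss)
    with_fn = tree_likelihood(root, P, root_prior, {target: 1})
    without_fn = tree_likelihood(root, P, root_prior, {target: 0})
    return with_fn / (with_fn + without_fn)


# Toy family: the human gene is unannotated; mouse and fly orthologs are annotated.
tree = Node("root", [
    Node("mammal_ancestor", [Node("human"), Node("mouse", observed=1)]),
    Node("fly", observed=1),
])
p = posterior_has_function(tree, "human", gain=0.05, loss=0.10)
print(f"P(human gene has function) = {p:.3f}")
```

In this sketch the two parameters `gain` and `loss` play the role of the small set of parameters shared across all families, while the tree itself carries the family-specific history. A fuller model would presumably also account for branch lengths, multiple function types, and features such as domain structure and active sites.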
An enormous amount of biological information has been painstakingly accumulated from experiments not only in human cells, but in many other organisms as well. This information has been critical for using genomics experiments to understand cancer progression and potential therapies, but current approaches make use of only a small fraction of the available information. This project will dramatically increase the amount of biological information available, improving genomics analysis in studies of cancer and other diseases.