III: Small: Inference of Causal Regulatory Relationships from Genetic Studies

Inference of biological networks from high-throughput genomic data is a central problem in bioinformatics, for which many different types of methods have been proposed and applied to a wide variety of datasets. Several recent studies have collected both genetic variation data and gene expression data from a set of genetically distinct strains of an organism; such datasets have several properties that are advantageous for inferring causal regulatory relationships between genes. A principled way of representing causal relationships is with graphical causal models, and a rich theory has been developed for inferring such models from observational data and interventions. However, this theory assumes full knowledge of the joint distribution, which is equivalent to having very large samples, so it is only guaranteed to work asymptotically. In this proposal, the team will extend causal inference methods in several directions motivated by applications to genetic views of genomics datasets, where samples are relatively small. In particular, they will apply their new methods to detecting the presence and absence of causal relationships between yeast genes. While the focus of this proposal is on applying the developed techniques to a specific problem in bioinformatics, the causal inference issues it addresses are the general issues faced when applying causal inference to finite samples, so many of the approaches developed will be applicable to a wide range of problems. The resulting methods will be made available to the scientific community through publicly available software.

The project involves the training of graduate and undergraduate students. The collaborative nature of the project will expose the students to the medical and genetics worlds and, at the same time, improve their abilities to design and implement solutions to complex algorithmic and statistical problems. The research will be converted into course materials for the interdisciplinary course Computational Genetics, which is taken by undergraduate and graduate students as well as students from the medical school.

Project Report

The most fundamental task in science is to explain observations. One way to explain observations is to characterize the underlying mechanisms that generated them. These mechanisms can be described as a set of cause and effect relationships. The goal of "causal inference" is to identify these cause and effect relationships from the data. Recently, there has been a breakthrough in this type of analysis that takes advantage of graphs representing cause and effect relationships. These "causal graphs" can then be used to model the underlying mechanisms that generated the observations. This project focused on developing techniques for inferring causal graphs from data. The project had two major aims. The first was to extend current techniques for causal inference to finite samples. The rich theory recently developed for causal graph inference assumes infinitely large samples. Our project developed techniques that use smaller samples and take into account the uncertainty that comes with having a small number of observations. The second aim was to apply these techniques to infer cause and effect relationships in biological data, and in particular to understand how genes interact with each other. We made progress on both aims. During the project we also made a connection between causal graph methods and methods for identifying which genetic variation is involved in traits. It turns out that these two apparently different types of analysis are closely connected, and we were able to use ideas from causal graphs to develop novel methods for genetic analysis. Each of these directions led to many publications, which disseminated the findings to the scientific community. In addition to these efforts on causal graphs, our project led to a significant educational effort. Most significantly, we developed an undergraduate Minor in Bioinformatics, which is approved to start in Fall 2012.
This Minor is particularly relevant to this type of project because it is housed in the Computer Science department. It is only the second undergraduate minor offered by the Engineering School at UCLA. The Minor includes several courses that directly prepare students for research. We originally expected the majority of students in the Minor to be Computer Science majors. Surprisingly, many students from the life sciences also decided to enroll. Through the Minor, these students will obtain a substantial background in computing, which will open up significant job opportunities in industry for which they would otherwise not have been prepared.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0916676
Program Officer: Sylvia J. Spengler
Project Start:
Project End:
Budget Start: 2009-09-01
Budget End: 2013-08-31
Support Year:
Fiscal Year: 2009
Total Cost: $499,444
Indirect Cost:
Name: University of California Los Angeles
Department:
Type:
DUNS #:
City: Los Angeles
State: CA
Country: United States
Zip Code: 90095