Many types of information in neuroscience and molecular biology can be described as a set of measurements taken repeatedly as some index changes its value. In some settings, such as transcriptomic data measuring gene activity, the index is time; in others, such as genetic association studies, the index is position along a genomic DNA sequence. In either case, the complete collection of data is referred to as a time series. Inference is the process of taking such time series, possibly corrupted by errors, and computing answers to questions of the following kinds: (1) What is the system that generated the time series? For instance, if the system is known to be a differential equation of a specific type, what are the parameter values in that equation? (2) Given a completely specified system and a time series, did that system generate that time series? For instance, if a biologist has hypothesized a system that describes the expression of a particular set of genes and then measures expression data, are the data compatible with the system, or equivalently, with the hypothesis? (3) Given two time series, were they generated by the same system? For instance, if the pattern of nerve firings in a neural system is recorded in two different experimental situations, is the pattern the same or different?

The four Principal Investigators focus on three biological application domains at three biological scales: (1) the phenotyping of animal and human ethanol-consumption behavior (whole-organism scale), (2) the pattern of action potentials measured on ensembles of neurons (cell-population scale), and (3) the time course of gene expression as governed by the regulatory circuits of the cell (cellular scale). These applications share several challenging characteristics: the information is distributed over long periods of time rather than concentrated in time; the systems include delays and feedback paths; and the systems are highly nonlinear, including switching behavior, rather than linear. The major methodologies that will be developed and combined to solve inference problems in these application areas are (a) information theory and stochastic control, (b) multi-scale approaches to learning the geometry of the data, and (c) computer algebra and symbolic computation. For example, dealing with delay and feedback in neuroscience systems, especially in the context of the interaction between information and stochastic control, requires a fundamental rethinking of classical information theory as it is employed in technology-based communication systems.
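As a concrete illustration of inference question (1), the following minimal sketch fits the parameters of a hypothesized differential equation to a noisy time series by least squares. The logistic growth model, the parameter names r and K, and the SciPy-based fitting routine are illustrative assumptions, not the project's actual methodology.

    # Sketch: estimating parameters of a hypothesized ODE from a noisy time series.
    # The model dx/dt = r*x*(1 - x/K) and all names here are assumptions for illustration.
    import numpy as np
    from scipy.integrate import odeint
    from scipy.optimize import least_squares

    def logistic_rhs(x, t, r, K):
        # Hypothesized model: dx/dt = r * x * (1 - x / K)
        return r * x * (1.0 - x / K)

    def simulate(params, t, x_init):
        r, K = params
        return odeint(logistic_rhs, x_init, t, args=(r, K)).ravel()

    def residuals(params, t, observed, x_init):
        return simulate(params, t, x_init) - observed

    # Synthetic "measured" time series: a known system plus measurement noise.
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 10.0, 50)
    true_r, true_K, x_init = 0.9, 5.0, 0.1
    observed = simulate((true_r, true_K), t, x_init) + rng.normal(0.0, 0.1, t.size)

    # Least-squares fit recovers estimates of the unknown parameters (r, K).
    fit = least_squares(residuals, x0=(0.5, 1.0), args=(t, observed, x_init),
                        bounds=([0.0, 0.1], [10.0, 100.0]))
    print("estimated r, K:", fit.x)

Questions (2) and (3) can then be posed in terms of how well such a fitted or hypothesized model explains a given series, for example by comparing residuals against the assumed noise level.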

As the cost of computing decreases, computing becomes increasingly pervasive. A major purpose of pervasive computing is the real-time collection of high-dimensional time series of very diverse types of data, including biological, medical, and financial data and the status of communication and power systems. The project will provide computational algorithms and software to analyze these data in more sophisticated ways and thereby extract richer information. Actions taken on the basis of this information, e.g., personalized medicine based on individual genomic information or more accurate and flexible control of power systems that avoids blackouts, will have important human and economic benefits to society. An important component of the project is educational: three graduate students working on the project will receive tuition and stipend, and an unrestricted number of undergraduates will participate in a variety of ways, e.g., through project courses. By attracting talented students to science and technology and providing challenging research experiences, the project will have important workforce benefits to society.

Project Report

In recent years, as a result of various technological breakthroughs, it has become possible to collect large amounts of biological data at multiple scales, ranging from single cells and cell lines to organs, organisms, and entire ecologies. These breakthroughs have been made possible by novel approaches in nanotechnology, molecular manipulation, sample preparation, and biochemical modification. One such breakthrough is exemplified by next-generation sequencing platforms; other examples include single-molecule analyses based on optical mapping (at the genomic level) and AFM mapping (at the transcriptomic level). Because of the various noise processes and errors in the data, careful experimental designs and accurate methods of data interpretation had to be developed, requiring sophisticated mathematics based on computational complexity analysis, probabilistic modeling, and Bayesian statistical inference algorithms. This project has achieved several important milestones in the rapid progress of this body of technologies.

Furthermore, it has become possible to organize the resulting data in terms of their spatio-temporal structure. For instance, if one considers the development and progression of cancer in terms of cellular heterogeneity and temporality, the amount and complexity of the data that can be collected are mind-boggling, and yet we still lack a good mechanistic interpretation of these data. To address these problems, scientists have relied on phenomenological models that organize the data into a succession of segments within which various subsystems evolve continuously and across which they repeatedly reorganize to achieve certain global properties. Discovering, characterizing, and interpreting these properties in a succinctly expressible language (e.g., modal logic) has been an important goal of our research. Moreover, certain important events not only obey chronological structure but also hint at causal dependencies that point to hitherto undiscovered mechanisms. These causal relationships can be categorized in terms of two related frameworks, "type causality" and "token causality," both of which can be expressed in the language of a propositional probabilistic temporal logic. In addition, since one may entertain a large number of plausible causal hypotheses, advanced empirical Bayesian methods are needed to separate spurious causal relations from genuine ones. Thus, our research program needed to develop a large number of inter-related algorithmic tools spanning machine learning, dynamic programming for segmentation, statistical inference, model checking, techniques for controlling false-discovery rates, and efficient parallelization and code optimization.

In the arena of technological advances, we have developed novel nanotechnologies based on querying short single molecules (e.g., cDNAs) with nano-cantilever-based AFM technologies. To achieve the desired accuracy, techniques from information theory were borrowed to devise experiments with enzymes and probes that implicitly perform error correction. Efficient and accurate image-analysis algorithms were developed to reduce sizing errors and improve data interpretation.
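As a minimal illustration of the probability-raising idea behind the "type causality" framework described above, the following sketch flags a candidate cause as prima facie causal for an effect when the effect becomes more likely within a short window after the cause than it is after an arbitrary time point. The event streams, window length, and function names are hypothetical and are not drawn from the project's actual software.

    # Sketch: temporal priority + probability raising (prima facie causality) on binary event streams.
    import numpy as np

    def window_prob(effect, start_times, window):
        # Probability that the effect occurs within `window` steps after each start time.
        n = len(effect)
        hits = [effect[t + 1 : min(t + 1 + window, n)].any() for t in start_times]
        return float(np.mean(hits)) if hits else 0.0

    def prima_facie(cause, effect, window):
        # cause, effect: binary arrays over discrete time; window: maximum lag in steps.
        cause = np.asarray(cause, dtype=bool)
        effect = np.asarray(effect, dtype=bool)
        baseline = window_prob(effect, range(len(cause)), window)         # P(e)
        conditional = window_prob(effect, np.flatnonzero(cause), window)  # P(e | c), cause occurs first
        return conditional > baseline, baseline, conditional

    # Toy event streams: the effect tends to fire one or two steps after the cause.
    cause  = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0])
    effect = np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0])
    print(prima_facie(cause, effect, window=2))

In practice, many such candidate relationships would be scored at once, and empirical Bayesian multiple-testing corrections of the kind mentioned above would be applied to separate spurious from genuine ones.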
Furthermore, a large portion of our efforts has been devoted to the problem of genome assembly, with a focus on generating haplotypic information, which is crucial in population-level genetic analysis. For instance, an algorithm dubbed SUTTA was developed to combine next-generation sequencing data with long-range information (from mate pairs and optical mapping) in order to discover structural variations in genomes, as well as information on how they might have been transmitted from parents to children in terms of the haplotypes of the children's diploid genomes. Another serious problem that has plagued this research field is the lack of reliable methodologies to assess the accuracy of existing genome-assembly algorithms. We developed a feature-based method to address these quality-assessment needs and have used it to analyze data from the Assemblathon and GAGE datasets. In the arena of model building, model checking, and causality analysis, we have developed new tools based on temporal logic and a probabilistic causality framework founded on temporal priority and probability raising. We have used these techniques to study biological processes such as the cell cycle, host-pathogen interactions, the evolution of drug resistance, metabolic cycles, hypoxia in cancer, the evolution of genetic codes, copy-number variations in cancer, cancer progression, and the interpretation of electronic medical records.
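The following sketch illustrates the spirit of the feature-based quality assessment mentioned above, in the style of a feature-response curve: contigs are ranked by size, and approximate genome coverage is reported as a budget of suspicious "features" (e.g., misassembly signatures or low-coverage regions) accumulates. The contig lengths, feature counts, and genome size are invented for illustration and do not come from the Assemblathon or GAGE data.

    # Sketch: feature-response-curve-style summary of assembly quality.
    def feature_response_curve(contigs, genome_size):
        # contigs: list of (length_bp, feature_count); returns (cumulative features, coverage) points.
        ranked = sorted(contigs, key=lambda c: c[0], reverse=True)
        points, cum_len, cum_feat = [], 0, 0
        for length, features in ranked:
            cum_len += length
            cum_feat += features
            points.append((cum_feat, cum_len / genome_size))
        return points

    # Toy assembly: five contigs with made-up lengths and feature counts.
    toy_assembly = [(120_000, 2), (95_000, 0), (60_000, 5), (40_000, 1), (15_000, 3)]
    for feat, cov in feature_response_curve(toy_assembly, genome_size=400_000):
        print(f"features <= {feat:3d}: approx. genome coverage {cov:.0%}")

An assembly whose curve reaches high coverage while accumulating few features is, under this kind of summary, preferable to one that needs many suspicious regions to cover the same fraction of the genome.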

Agency: National Science Foundation (NSF)
Institute: Division of Computing and Communication Foundations (CCF)
Application #: 0836649
Program Officer: Sankar Basu
Budget Start: 2008-09-15
Budget End: 2013-08-31
Fiscal Year: 2008
Total Cost: $480,000
Name: New York University
City: New York
State: NY
Country: United States
Zip Code: 10012