Columbia University and the University of California Berkley are awarded grants for the development of novel analytical tools that bridge interdisciplinary gaps between engineering and biological sciences by facilitating a synergistic integration of top-down statistical signal processing theory approaches with bottom-up methods that characterize biological networks as collections of basic biomolecular interactions. The former are the subject of the emerging engineering discipline: Genomic Signal Processing (GSP), while the latter are the domain of classical biochemistry/biophysics. The investigators recognize that the number of putative signal processing mechanisms that needs to be analyzed by GSP for a given biological system could be significantly reduced when their consistency with biochemical/biophysical laws is demanded. The resulting Biochemically-constrained GSP (BioGSP) approach is thus able to produce results on par with traditional GSP methods, but is significantly more efficient as well as assured to be in compliance with key molecular properties of biological mechanisms.

Biological systems consist of molecules and molecular complexes, whose interactions comprise intricate circuits and networks. Knowledge of their structure and function can lead to powerful new ways of controlling biological mechanisms, which may potentially enable new approaches to remedying faults in natural biological processes as well as to engineering denovo synthetic biomolecular designs. Recent advancements in experimental techniques have allowed us an unprecedented view of how these systems are structured. However, detailed understanding their function remains a challenge due, in large part to the scale and complexity of networks involved as well as the nonlinear nature of biochemical interactions among the various molecular species. This issue is particularly acute for genetic networks - both because of their importance to biological systems development and operation as well as due to the often complex regulatory patterns they employ. Further information about the project may be found at the PI web sites at www.ee.columbia.edu/~wangx/ and http://genomics.lbl.gov/index.html.

Project Report

The aim of biomolecularly- and, in particular, Biochemically-constrained Genomic Signal Processing is to bridge the interdisciplinary gaps between engineering and biological sciences, especially in the area of genomic regulation, thus facilitating a synergistic integration between signal processing with molecular biology. Intellectual Merit: A key goal of this project is to develop novel BioGSP techniques for inferring gene regulatory networks from heterogeneous expression and interaction data. We have made the following technical contributions. (1) Discovery of conserved motifs in genomic sequences represents one of essential bioinformatics problems. However, improving performance while maintaining input flexibility remains a key algorithmic challenge. We have developed BAMBI a sequential Monte Carlo algorithm based on the position weight matrix model that has the flexibility to also estimate motif length, number of instances, as well as their locations within each sequence. (2) In genome-wide association studies, thousands of individuals are genotyped in hundreds of thousands of single nucleotide polymorphisms (SNPs). Using a Tree-Based Deterministic sampling technique, we have developed an intuitive and conceptually simple phasing algorithm for trio data. The trade off between speed and accuracy achieved by our algorithm makes it a strong candidate for routine use on trio datasets. (3) Conserved motifs are indications of transcription factor binding sites and evidence of genes that are co-regulated and provide support for appropriately grouping genes. We have developed an algorithm that uses expression data and biochemical modeling to discover further interaction between genes. It uses the output of the motif discovery algorithm as an input and can refine the estimated gene regulatory network. (4) Biomolecular sequences (nucleic acids and proteins) can be described as non-stationary symbolic sequences on a given finite alphabet. A new Markov chain model is introduced in order to model the statistical periodicities in biomolecular sequences, which is useful for the detection of short exons or regions with a particular periodicity. (5) The haplotypes of an individual can be used to predict diseases and help designing drugs. However, experimentally determining haplotypes is expensive and time-consuming, so genotypes are usually measured instead. Given the set of genotypes for a group of unrelated individuals, it is possible to infer the haplotype pair for each subject based on the maximum parsimony principle. We have proposed two related formulations of the haplotype inference problem that translate the maximum parsimony principle into thesparse representation of genotypes. (6) We have developed a new genomic signal processing technique based on the Hierarchical Dirichlet Process (HDP) and apply it to the genomic network prediction and biological data clustering. The HDP allows multi-scale clustering for various features simultaneously. (7) To date, few algorithms have been proposed for using genome-wide deletion fitness data to infer the organization of these networks. We have developed a dynamical model and proposed an inference algorithm for using fitness data from single knockout libraries to identify the underlying gene-regulatory networks. Unlike most prior methods, our approach captures not only structural, but also dynamical and non-linear nature of biomolecular systems. (8) We have introduced a novel framework for hypothesis testing in the presence of unknown parameters, and we have applied the proposed test to the problem of detecting and estimating periodicities in DNA sequences and demonstrated the advantages of the new framework compared to the classical Neyman-Pearson approach and the GLRT. (9) Gene regulatory networks typically exhibit sparse connections. However, traditional inference algorithms such as the classical EM algorithm cannot exploit such sparse structures and therefore the results are less accurate. We have considered the problem of sparse parameter estimation in a general nonlinear dynamic system and have proposed a MAP-EM solution. (10) In most sequenced organisms the number of known regulatory genes (e.g., transcription factors (TFs)) vastly exceeds the number of experimentally-verified regulons which could be associated with them. We have developed an efficient algorithm that aims to identify a given transcription factor’s regulon through inference of its unknown binding sites, based on the discovery of its binding motif, which is also unknown and is estimated within the algorithm framework. Broader Impact: The recent emergence of quantitative and systems biology along with the increasing emphasis on dynamic modeling of genetic regulatory networks has resulted in calls for a major effort on the part of the signal processing community to find new and creative ways for combining its strengths with those of bioscientific disciplines so as to further promote the advancement of genetic science, biology, and medicine. Our research has culminated with substantial advancements in knowledge and understanding of some key organisms at the level of their underlying gene regulatory networks as well as formulation of novel general modeling, analysis, and computational methods for genetic and epigenetic science. Moreover, this interdisciplinary research project has provided invaluable opportunities for both graduate and undergraduate students to learn about as well as contribute to the exciting and rapidly developing field of Bioengineering.

Agency
National Science Foundation (NSF)
Institute
Division of Biological Infrastructure (DBI)
Application #
0850030
Program Officer
Peter H. McCartney
Project Start
Project End
Budget Start
2009-09-15
Budget End
2014-02-28
Support Year
Fiscal Year
2008
Total Cost
$615,264
Indirect Cost
Name
Columbia University
Department
Type
DUNS #
City
New York
State
NY
Country
United States
Zip Code
10027