Columbia University and the University of California, Berkeley are awarded grants for the development of novel analytical tools that bridge interdisciplinary gaps between engineering and the biological sciences. The tools integrate top-down approaches from statistical signal processing theory with bottom-up methods that characterize biological networks as collections of basic biomolecular interactions. The former are the subject of an emerging engineering discipline, Genomic Signal Processing (GSP), while the latter are the domain of classical biochemistry and biophysics. The investigators recognize that the number of putative signal processing mechanisms that GSP needs to analyze for a given biological system can be reduced significantly by requiring that they be consistent with biochemical and biophysical laws. The resulting Biochemically-constrained GSP (BioGSP) approach is thus able to produce results on par with traditional GSP methods, while being significantly more efficient and assured to comply with key molecular properties of biological mechanisms.

Biological systems consist of molecules and molecular complexes whose interactions comprise intricate circuits and networks. Knowledge of their structure and function can lead to powerful new ways of controlling biological mechanisms, which may enable new approaches to remedying faults in natural biological processes as well as to engineering de novo synthetic biomolecular designs. Recent advances in experimental techniques have given us an unprecedented view of how these systems are structured. However, a detailed understanding of their function remains a challenge, due in large part to the scale and complexity of the networks involved and to the nonlinear nature of biochemical interactions among the various molecular species. This issue is particularly acute for genetic networks, both because of their importance to the development and operation of biological systems and because of the often complex regulatory patterns they employ. Further information about the project may be found at the PI web sites at www.ee.columbia.edu/~wangx/ and http://genomics.lbl.gov/index.html.

Project Report

In this project we invented a number of new computational approaches to understanding how cells regulate their behavior to survive in variable environmental conditions.

(1) Gene expression underlies most essential cellular processes and is typically controlled by networks of regulatory interactions. Two basic mechanisms directly involved in regulating gene expression are transcription factor binding and site-specific recombination. In both cases, the proteins involved often attach to highly specific DNA segments (conserved regulatory motifs), which leads to activation or repression of gene expression, either through epigenetic interactions between transcription factors and components of the RNA polymerase machinery or through recombinase-mediated genetic and genomic modifications of the underlying DNA sequences. Thus, discovery of such motifs represents one of the essential problems in bioinformatics and computational biology. However, because individual binding sites are subject to context-specific optimizations of protein affinities as well as neutral alterations by random mutations, the nucleotide sequences of motif instances can display a significant degree of heterogeneity. Motif discovery from sequences thus becomes a computationally challenging task that has been the subject of much research in recent years. Furthermore, along with performance, one of the essential requirements for a practically useful motif discovery algorithm is input flexibility. To address these problems, we have developed BAMBI, a sequential Monte Carlo algorithm based on the position weight matrix model that has the flexibility to also estimate motif length and the number of instances, as well as their locations within each sequence.
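As a rough illustration of the position weight matrix model mentioned above, the following sketch builds a log-odds matrix from a handful of aligned sites and scans a sequence for the best-scoring window. It is not BAMBI and contains no sequential Monte Carlo step: the motif length is fixed by the input sites. The function names, the example sites, the pseudocount, and the uniform background frequency are all assumptions made for this example.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases keep a finite score.
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2((counts[b] / total) / background) for b in BASES})
    return pwm

def score_window(pwm, window):
    """Sum the per-position log-odds scores of one candidate site."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_site(pwm, sequence):
    """Return (start, score) of the highest-scoring window in the sequence."""
    width = len(pwm)
    scored = [(score_window(pwm, sequence[i:i + width]), i)
              for i in range(len(sequence) - width + 1)]
    score, start = max(scored)
    return start, score

# Hypothetical aligned sites and target sequence, chosen only for illustration.
known_sites = ["TGTGA", "TGTGA", "TGTCA", "TGAGA"]
pwm = build_pwm(known_sites)
start, score = best_site(pwm, "ACCGTTGTGACCA")
print(start, round(score, 2))
```

Here the scan picks out the consensus TGTGA at offset 5. A method such as BAMBI additionally has to infer the matrix, the motif length, and the number of instances from unaligned sequences, which this sketch takes as given.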
Using BAMBI, we have shown that the proposed approach can be used to find binding sites in synthetic data as well as in a DNA sequence database containing multiple binding site instances of cAMP receptor protein (CRP), a major prokaryotic transcription factor. The problem of discovering multiple motif instances of unknown length is particularly significant in the case of recombinases, whose target sites tend to be long (on the order of 30 bp or more) and to occur in multiple instances within relevant genomic loci (at least two are needed for DNA strand exchange). Using BAMBI, we were able to successfully identify these sites within sequences containing the compiled list of Din-family recombinase sites. The results show that BAMBI achieves better statistical performance in the described applications than four widely used profile-based motif discovery algorithms.

(2) Genome-wide fitness is an emerging type of high-throughput biological data generated for individual organisms by creating libraries of knockouts, subjecting them to broad ranges of environmental conditions, and measuring the resulting clone-specific fitness values. Since fitness is an organism-scale measure of gene regulatory network behavior, it may offer certain advantages over individual gene expression when insights into such phenotypic and functional features are of primary interest. In particular, the contribution of practically irrelevant genes may be effectively filtered out if they do not contribute substantially to the fitness state, regardless of their statistical significance or dynamic state. To exploit these advantages, we have developed a model and proposed an inference algorithm for using fitness data from knockout libraries to identify the underlying gene regulatory networks. Unlike most prior methods, the presented approach captures not only the structural but also the dynamical and non-linear nature of the biomolecular systems involved.
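To make "dynamical and non-linear" concrete, the sketch below simulates a toy discrete-time network in which each gene's expression relaxes toward a sigmoid of a weighted sum of the other genes, and a knockout is modelled by clamping a gene to zero. This is only an illustration of the general idea: the sigmoid form, the weight matrix `W`, the decay rate, and the function names are assumptions for this example and are not taken from the project's model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def step(x, W, decay=0.5, knocked_out=()):
    """One update: x_i <- (1 - decay) * x_i + decay * sigmoid(sum_j W[i][j] * x_j)."""
    nxt = []
    for i, row in enumerate(W):
        if i in knocked_out:
            nxt.append(0.0)  # a knocked-out gene is never expressed
            continue
        drive = sum(w * xj for w, xj in zip(row, x))
        nxt.append((1.0 - decay) * x[i] + decay * sigmoid(drive))
    return nxt

def simulate(x0, W, steps, knocked_out=()):
    """Return the list of expression states visited over the given number of steps."""
    x = [0.0 if i in knocked_out else v for i, v in enumerate(x0)]
    trajectory = [x]
    for _ in range(steps):
        x = step(x, W, knocked_out=knocked_out)
        trajectory.append(x)
    return trajectory

# Hypothetical two-gene network: gene 0 activates gene 1 (weight 4.0).
W = [[0.0, 0.0],
     [4.0, 0.0]]
wild_type = simulate([0.5, 0.5], W, steps=20)
knockout = simulate([0.5, 0.5], W, steps=20, knocked_out={0})
print(wild_type[-1], knockout[-1])
```

In this toy setting, removing gene 0 lowers the steady-state level of gene 1. Under a model of this kind the non-zero entries of `W` encode the network structure, so inferring the network amounts to estimating those parameters from data.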
A state-space model with a non-linear basis is used to describe gene regulatory networks dynamically. Network structure is then elucidated by estimating the unknown model parameters. An unscented Kalman filter is used to cope with the non-linearities introduced in the model, which also enables the algorithm to run in on-line mode for practical use. The algorithm gives satisfactory results for both synthetic data and empirical measurements of the GAL network in the yeast Saccharomyces cerevisiae and the TyrR–LiuR network in the bacterium Shewanella oneidensis.

(3) Molecular barcode arrays are widely employed in the analysis of large strain libraries, whereby probes linked to unique oligonucleotides ("antitags") are used to detect selected DNA targets ("tags") by highly specific hybridization. One of the major problems for such screen designs is thus ensuring a high degree of probe-target specificity and a low level of non-specific binding ("orthogonality") across the entire tag population ("collection"). Several approaches have previously been proposed for designing orthogonal DNA tags by studying their individual or pair-wise structures, such as Smith-Waterman sequence similarity, the widely used Nearest Neighbor (NN) method, and full thermodynamic estimates of sequences. However, these methods generally involve imposing various heuristic constraints ("design rules") on possible tag/antitag sequences in order to achieve probe-target specificity across the collection. The resulting lack of freedom in considering all putative sequences can lead to sub-optimal designs and to a reduced degree of orthogonality within the constructed tag/antitag (TaT) collections. The algorithm we developed finds orthogonal sequence sets whose properties compare favorably with those of orthogonal tags previously developed for the construction of gene knockout collections.
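As a small illustration of the pair-wise similarity screening mentioned above, the following sketch computes a Smith-Waterman local alignment score for every pair of tags and reports pairs that look too similar. It is a sequence-similarity proxy only, with no thermodynamic model, and it is not the algorithm developed in the project. The scoring parameters, the threshold, the example tags, and the function names are assumptions for this example.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

def cross_similar_pairs(tags, max_score):
    """Return index pairs of tags whose local alignment score exceeds max_score."""
    return [(i, j)
            for i in range(len(tags))
            for j in range(i + 1, len(tags))
            if smith_waterman(tags[i], tags[j]) > max_score]

# Hypothetical 10-mer tags; the first two share a long common suffix.
tags = ["ACGTACGTAC", "TTTTACGTAC", "GGCCGGCCGG"]
print(cross_similar_pairs(tags, max_score=5))
```

A fuller screen would also compare each tag against the reverse complements of the others and replace the alignment score with a hybridization free-energy estimate, such as one from a nearest-neighbor model.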

Agency: National Science Foundation (NSF)
Institute: Division of Biological Infrastructure (DBI)
Application #: 0850205
Program Officer: Peter H. McCartney
Project Start:
Project End:
Budget Start: 2009-09-15
Budget End: 2012-08-31
Support Year:
Fiscal Year: 2008
Total Cost: $127,427
Indirect Cost:
Name: University of California Berkeley
Department:
Type:
DUNS #:
City: Berkeley
State: CA
Country: United States
Zip Code: 94704