Plant research has enormous potential for transformative advances in sustainable food and energy production, environmentally responsible materials biosynthesis, and safe, effective medicines. Yet a basic understanding of how plant genes are 'turned on' or 'turned off' to produce these useful products remains a primary scientific bottleneck. Understanding how these genetic pathways function requires identifying and interpreting the patterns of DNA elements that ultimately control these genes. To address this fundamental scientific challenge in biological informatics, this project will use pattern recognition techniques to develop a detailed understanding of how plant genes are controlled by DNA-encoded information. To do so, we will develop specific computational techniques for building and interpreting pattern recognition models. The most important aspect of this research is creating algorithms that can recognize when several different DNA element patterns provide biologically relevant solutions for gene control. The algorithms, models, and biological outcomes developed by this project will be made available to the scientific community, including the plant science community. The underlying computational methods are expected to be applicable in many other fields of science and engineering. Teaching and training modules developed by this project will include outreach to underserved communities in rural Oregon, including hands-on science experiences for high school students as well as teacher training. The modules will draw special attention to the need for computationally trained researchers in the plant sciences, and to the need for a dramatically increased understanding of plant gene regulatory networks as a foundation for meeting new challenges in drug discovery, agriculture, and materials biosynthesis.

The combinatorial control of gene expression by Transcription Factors (TFs) is one of the most fundamental regulatory mechanisms across the eukaryotic tree of life, from plants to humans. The DNA region surrounding the start of a gene, called the promoter region, contains short regulatory sequence elements known as Transcription Factor Binding Sites (TFBSs). Although many exciting potential advances hinge on a detailed understanding of plant gene regulation, relatively little is known about the specific TFBS patterns that lead to gene expression in plants. This project uses and extends a machine learning model, developed by the Megraw lab, that predicts the presence of Transcription Start Sites (TSSs) with high accuracy and resolution. A primary advantage of this model is that it suggests specific sets of TFBS:promoter interactions that have the potential to 'turn on' a particular gene. The model makes novel use of currently available high-throughput TSS data to examine the global structure of promoters in the model plant Arabidopsis, and preliminary research with it suggests striking, unanticipated differences between plant and animal DNA regulatory codes. The goal of this project is to rigorously test and expand this promising new computational approach for dissecting which TFBS sets are optimal predictors of gene up-regulation in a biological sample. The relative simplicity of the Arabidopsis genome makes this challenge tractable, yet capable of providing transformative insight into gene regulation in a multicellular organism. The results of this project can be found at

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

National Science Foundation (NSF)
Division of Biological Infrastructure (DBI)
Program Officer: Jennifer Weller
Oregon State University
United States