PI: Michael Freeling (UC Berkeley) Key personnel: Brian C. Thomas (UC Berkeley)

This is a Genome-Enabled Plant Research project that will provide databases of regulatory DNA sequences, mapping tools and software needed by the plant breeding and research community. The primary biological research activity is to align each gene in sorghum with its orthologous gene in rice, evaluate exon annotations, and then define and store those noncoding sequences that have been conserved over evolutionary time, called "CNSs". These sequences are expected to comprise DNA sites that bind proteins and small RNAs that regulate gene availability or expression. When CNS discovery is complete, the plant community should have approximately 170,000 short (mean = 31 bp long) CNSs representing functional (conserved) noncoding regions sorted to particular genes. Noncoding sequences between CNSs that are not conserved will be used as control sequences. Many previously unknown functions will be discovered. CNSs will be associated with motifs, expression patterns, methylation status and the like, with results displayed on the project web site (http://synteny.cnr.berkeley.edu/) and available to the community through a DAS server and PlantGDB. Both the plant research community and this project need a comparative genomics software application that will compare, using static and on-the-fly procedures, any chromosomal region with any other from any plant. This project includes the design and description of this new software system, called CoGe. All results, databases, applications and code will be publicly available on this project's website, and in journal and community websites (http://sourceforge.net; http://plantgdb.org).
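
To make the idea of a conserved noncoding sequence concrete, the following is a minimal sketch, not the project's method: it reports maximal exact matches between two noncoding sequences, whereas real CNS discovery would rely on alignment with statistical thresholds. The function name shared_segments, the min_len cutoff and the flanking bases in the usage example are assumptions made for illustration only.

```python
def shared_segments(seq_a, seq_b, min_len=15):
    """Return sorted (start_a, start_b, length) tuples for maximal exact
    matches of at least min_len bases between two sequences.

    Toy stand-in for CNS detection: exact matching only, forward strand only.
    """
    hits = []
    n, m = len(seq_a), len(seq_b)
    # Each diagonal of the comparison matrix has a fixed offset = i - j.
    for offset in range(-(m - 1), n):
        i = max(offset, 0)
        j = i - offset
        run = 0  # length of the current run of matching bases
        while i < n and j < m:
            if seq_a[i] == seq_b[j]:
                run += 1
            else:
                if run >= min_len:
                    hits.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        # A run may extend to the end of the diagonal.
        if run >= min_len:
            hits.append((i - run, j - run, run))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical example: one shared 26-bp element with different flanks.
    element = "TGGAAGGTGCGGTGAACGGATCTGTC"
    upstream_a = "AAAA" + element + "CCCC"
    upstream_b = "GG" + element + "TT"
    for start_a, start_b, length in shared_segments(upstream_a, upstream_b):
        print(start_a, start_b, length)
```

Scanning diagonals keeps the sketch short and dependency-free; it is quadratic in sequence length, so it suits only short regions such as the noncoding space around a single gene pair.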

This project has broader impacts. The sorghum-rice CNS-discovery project generates pan-grass mapping markers. Grasses without sequence databases, orphan grass commodities exemplified by foxtail millet, should benefit most directly. At least one graduate student and several undergraduate researchers are essential participants in this research. This project includes summer internships for Bay Area high school students who are talented, underserved, and recommended to us by their biology teachers.

Project Report

This 5.5-year project generated over 40 publications and two research websites that are used extensively by the community: genomeevolution.com/CoGe (at iPlant) and qTeller.com (at UC-Berkeley). Freeling’s Google Citation Index provides impartial impact data.

Broader Impacts: Five graduate students, three postdoctoral students, ten collaborators, four programmers, nine high school student summer interns (most underserved), and three research visitors were supported.

Intellectual merit: Genomic DNA between transcriptional units is noncoding DNA, and much of this is either repetitive, transposon sequence or simple sequence. This DNA has been called "junk". Also between transcriptional units are regulatory elements that tell genes how, when, where and how much to express, and the sequences of these not-junk elements are often conserved during evolution. A typical eukaryotic gene is diagrammed (Fig. 1) and these cis-regulatory elements are labeled (e.g. distal enhancers). Modern genomic crop breeding has a resolution smaller than "a gene". Our primary aim is to place computed conserved noncoding sequences (CNSs) between genes to help breeders map agronomic traits (e.g. "drought resistance"). Both maize and Brassica breeders are using our CNS markers. We are just now publishing (Frontiers, 2013, in press; code on GitHub) our fully automated CNS Discovery Pipeline 3.0. Thousands of lines of Python code take as input two gene-annotated genomes and generate as output two lists: a gene pairs list and a CNS list. We proceeded stepwise. (1) Help obtain plant genomes. We were authors on the genome-release papers of papaya, sorghum, Chinese cabbage and banana. (2) We hired programmers and web developers to code my lab’s comparative genomics software as a public on-the-fly toolbox: CoGe (lead: student Eric Lyons). Fig. 1 is CoGe’s homepage, with an inset showing what CNSs look like graphed by CoGe.
(3) We coded new algorithms like "quota-align" to adapt CoGe to the multiply polyploid genomes of our crops. (4) There is a reciprocal relationship between tandem duplication events and retentions of post-tetraploid genes as pairs for genes categorized by functions (Fig. 3). The unique nature of this relationship proved that the reason genes are or are not retained following polyploidy involves maintaining the status quo of product stoichiometries. The applicable theory is the Gene Balance Hypothesis framed by Birchler, Veitia and Hurst. (5) Returning to CNSs, we learned that they are rich in known DNA-binding motifs, that particular G-box CNSs are light-responsive enhancers, that there is much overlap between CNSs and DNaseI-open chromatin, that retention increases with CNSs, and that CNSs on CNS-rich genes may be silencers. We showed that duplicate functional exons and CNSs are lost quickly and by deletion via intrachromosomal recombination. Thus, ancient tetraploids less than 20 million years old serve well as "deletion machines" useful to link specific DNA (often CNS) sequence with particular expression patterns. Figure 4 shows the proof-of-principle of our general method called "fractionation mutagenesis". We computed that the sequence 5’-TGGAAGGTGCGGTGAACGGATCTGTC would mean "‘on’ in pollen grains" and we proved this transgenically (inset: "blue" pollen grains in Fig. 4). To compare genes for RNA-seq expression pattern and quantity, we developed "qTeller.com". (Student James Schnable coded this public reads-to-FPKM pipeline and graphing tool.) The deletion that removed pollen expression (Fig. 4A) was first noticed by comparing maize doublets in qTeller-maize. There are now over 30 different community-generated RNA-seq experiments in qTeller-maize. (Additionally, this project invested in obtaining many RNA-seq endpoints for qTeller-Brassica.com.) Figure 5 shows a two-gene plot for a post-tetraploid pair of maize genes.
These two sister genes are obviously expressed in a similar way, but one gene is expressed at about twice the level of the other (2:1) in all experiments. Post-polyploid duplicate genes are not equally expressed. Furthermore, the underexpressed genes are almost always on one (the recessive) subgenome. So, for maize, underexpressed genes tend to reside on the subgenome that has suffered the most deletion of both genes and CNSs. This phenomenon is called "genome dominance". Our hypothesis is that one of the two subgenomes in maize (and all other allopolyploids) was epigenetically marked following polyploidy tens of millions of years ago, and that genes on that subgenome tended to express at lower levels thereafter; deleting duplicate genes on the recessive subgenome mattered little to selection for gene balance. Figure 1, the gene diagram, shows what we have found to be the best explanation of individual gene dominance: recessive genes have "larger" 5’ patches of smallRNA-targeted transposons. That is, each gene has a "heterochromatin" rheostat on it that variably governs its FPKMs by position effect; entire genomes get their transposon load in the parents, but they get these rheostats epigenetically set all together. Figure 6 is a cartoon of this whole-genome setting of expression quantity, and of how genome dominance is really the same phenomenon as inbreeding depression. That is, our mechanism for genome dominance is the mechanism beneath hybrid vigor, perhaps the most important agronomic trait of all.
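
The reads-to-FPKM conversion and the 2:1 comparison of sister genes described above can be illustrated with a minimal sketch. This is not qTeller's code: it uses the standard FPKM definition (fragments per kilobase of transcript per million mapped fragments), and the function names, the pseudocount and the example numbers are assumptions chosen for illustration.

```python
import math


def fpkm(read_count, gene_length_bp, total_mapped_reads):
    """Standard FPKM: count * 1e9 / (gene length in bp * total mapped reads)."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def log2_ratios(fpkm_a, fpkm_b, pseudocount=1.0):
    """Per-experiment log2 expression ratio of gene A over gene B.

    The pseudocount (an assumption of this sketch) avoids division by zero
    when a gene is silent in an experiment.
    """
    return [
        math.log2((a + pseudocount) / (b + pseudocount))
        for a, b in zip(fpkm_a, fpkm_b)
    ]


if __name__ == "__main__":
    # Hypothetical values: 500 reads on a 2,000-bp gene, 10 million mapped reads.
    print(fpkm(500, 2000, 10_000_000))
    # Hypothetical FPKMs for two sister genes across two experiments.
    print(log2_ratios([3.0, 7.0], [1.0, 3.0]))
```

A log2 ratio near 1 in every experiment corresponds to the consistent 2:1 bias between sister genes described in the report; a ratio near 0 would indicate equal expression.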

Agency
National Science Foundation (NSF)
Institute
Division of Integrative Organismal Systems (IOS)
Application #
0701871
Program Officer
Diane Jofuku Okamuro
Project Start
Project End
Budget Start
2007-09-15
Budget End
2013-02-28
Support Year
Fiscal Year
2007
Total Cost
$1,965,052
Indirect Cost
Name
University of California Berkeley
Department
Type
DUNS #
City
Berkeley
State
CA
Country
United States
Zip Code
94704