Understanding how non-coding DNA regulates gene expression is critical to addressing myriad problems in biotechnology and human health. This endeavor, however, has proven a major challenge, in part because the genome appears to encode many highly convoluted regulatory networks. Particularly in vertebrates, regulatory elements such as promoters or enhancers are highly diverse and contain dozens of DNA motifs separated by intervening sequences. Traditional sequence analysis strategies that focus on conservation are ineffective at singling out the functional motifs because regulatory DNA evolves rapidly and often even relocates. The transcription start site (TSS) is a landmark of gene regulation. Accurate TSS mapping makes it possible to define DNA motifs functionally associated with transcription. TSSs further allow anchoring and comparing the regulatory regions of orthologous genes across evolution, independent of direct sequence conservation. Although distantly related organisms typically lack homologous regulatory DNA, it remains to be explored to what extent specific sequence motifs are selectively conserved to drive gene expression. I therefore developed capped small RNA-seq (csRNA-seq), which accurately maps the TSSs of both stable transcripts (protein-coding and non-coding RNAs) and unstable transcripts (enhancer RNAs, divergent transcripts) to reveal active regulatory elements genome-wide. csRNA-seq requires only total RNA as starting material, thus enabling TSS profiling in virtually any eukaryotic organism from which RNA can be extracted. Eukarya, from unicellular protists to humans, vary in organismic, genetic and regulatory complexity. I hypothesize that this spectrum of diversity, combined with TSS mapping (csRNA-seq), can be exploited to uncover the key DNA motifs, and subsequently the TF networks, that regulate gene expression across the Eukarya. Analogous to the work of an archeologist at prehistoric sites, mapping TSSs along the tree of life "excavates" ancestral, less convoluted states of gene regulation. These insights should also be instrumental in better interpreting the human genome. To explore this central hypothesis, I seek to 1) implement tools for the analysis and visualization of csRNA-seq and facilitate the comparative analysis of annotated regulatory features across species, 2) identify the TF binding sites mediating transcription initiation across Eukarya, and 3) trace the evolution and usage of TF binding sites and their spatial organization in regulatory elements of orthologous genes or sets of genes. In preparation, I have generated data for 42 Eukarya spanning over 2 billion years of transcriptome evolution and joined an exceptional bioinformatics group, which also provides a unique opportunity for training critical for my successful transition to independence. This proposal, if successful, will reveal the major DNA motifs mediating transcription and markedly expand our mechanistic understanding of eukaryotic gene regulation. Furthermore, it will provide the greater scientific community with a novel method to capture nascent TSSs (csRNA-seq), a free software suite to facilitate analysis, a data portal for easy data access and browsing, and a unique dataset.

Public Health Relevance

A limiting factor in understanding gene regulation is insufficient knowledge of how regulatory information is encoded in the genome. By studying regulatory elements functionally defined by transcription across evolution, this proposal aims to reveal the key DNA motifs, transcription factors and regulatory networks mediating eukaryotic gene expression. These insights should also be instrumental in better interpreting the human genome.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Career Transition Award (K99)
Study Section
Special Emphasis Panel (ZGM1)
Program Officer
Sesma, Michael A
University of California, San Diego
Internal Medicine/Medicine
Schools of Medicine
La Jolla
United States