Nucleic-acid-protein interactions are fundamental to diverse biological processes, from gene expression to epigenetic control. While the primary sequence of DNA or RNA sets the structural landscape that establishes the biological function of nucleic acids, our ability to predict how perturbations in sequence affect this structure-function relationship, at either the intra- or inter-molecular interaction level, is limited. Because of the combinatorial complexity of these nucleic acid polymers, especially RNA, obtaining a comprehensive picture of the effects of multiple degrees of sequence perturbation requires high-throughput methods of assaying nucleic acid species. To this end, we have developed a platform for quantitative biochemistry of tens to hundreds of millions of diverse DNA or RNA molecules on an Illumina sequencing chip. By generating a diverse library of DNA sequences to be probed, we have constructed a post hoc DNA array, using the sequencing data to define the sequences of the clonal clusters (each containing approximately 500 fragments of DNA) on the chip. To probe RNA structures, where the need for combinatorial investigations of both structure and function is most acute, we use E. coli RNA polymerase (RNAP) to transcribe the immobilized dsDNA fragments into single-stranded RNA, which remains bound to its DNA of origin via a stable, stalled RNAP. Using this RNA array and custom-built fluorescence analysis software, we have demonstrated comprehensive investigations of the binding affinities of fluorescently labeled MS2 coat protein, a canonical RNA-binding protein.
By measuring the equilibrium constants and off-rates for MS2 for all possible single, double, and triple point mutants of the consensus stem-loop sequence, we demonstrate the power of this comprehensive analysis for understanding structure-function relationships in the context of the crystal structure of the interaction, as well as the evolutionary functional constraints on these interactions. By developing three different methods of generating diverse libraries of DNA and RNA on-chip, we will probe the relative affinities of Cas9 and TALEN for target sequences across all near-cognate sequences and across the entire genome. These quantitative investigations will provide detailed biophysical information about the specificity of these proteins, as well as their propensity for off-target binding. We will also develop three orthogonal methods for measuring RNA structure on-chip, including FRET-based methods to enable thermodynamic melting measurements. With these methods, we will carry out large-scale measurements of RNA stability across sequence space, probing all possible short hairpin structures as well as internally mismatched stem loops. These data will increase the number of available thermodynamic measurements of RNA by many orders of magnitude and can be readily incorporated into current RNA structure prediction suites. Finally, we will push the sensitivity of this high-throughput platform to the single-molecule level. As proof of principle, we will observe the folding kinetics of diverse DNA hairpins, opening the door to single-molecule methods across millions of diverse nucleic acid structures.
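To give a sense of the library sizes involved, the mutant set described above (every single, double, and triple point mutant of a stem-loop sequence) can be enumerated directly. The following is a minimal sketch only, not the library design used on the platform: the function names `point_mutants`, `mutant_library`, and `expected_count` are hypothetical, and the eight-nucleotide sequence in the usage note is a placeholder rather than the MS2 consensus stem-loop. For a sequence of length n, the number of variants with exactly k substitutions is C(n, k) x 3^k.

```python
from itertools import combinations, product
from math import comb

BASES = "ACGU"  # RNA alphabet


def point_mutants(seq, k):
    """Yield every variant of seq that differs from it at exactly k positions."""
    for positions in combinations(range(len(seq)), k):
        # At each chosen position, only the three non-original bases are allowed,
        # so every generated variant has exactly k substitutions.
        alternatives = [[b for b in BASES if b != seq[i]] for i in positions]
        for choice in product(*alternatives):
            variant = list(seq)
            for i, base in zip(positions, choice):
                variant[i] = base
            yield "".join(variant)


def mutant_library(seq, max_k=3):
    """Return a dict mapping k to the list of all k-substitution variants of seq."""
    return {k: list(point_mutants(seq, k)) for k in range(1, max_k + 1)}


def expected_count(n, k):
    """Closed-form count of variants with exactly k substitutions: C(n, k) * 3**k."""
    return comb(n, k) * 3 ** k
```

For the placeholder sequence `"ACGUACGU"`, `mutant_library("ACGUACGU")` yields 24 single, 252 double, and 1512 triple mutants, matching `expected_count(8, k)` for k = 1, 2, 3. The count grows quickly with sequence length and mutation order, which is why an array-based readout is attractive for this kind of scan.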
Nucleic-acid-protein interactions are extremely common in biology and are implicated in a wide variety of human diseases, but our ability to predict how changes in DNA, or especially RNA, sequences modify biological function is currently limited. We will develop a high-throughput means of measuring the inter- or intra-molecular interactions of nucleic acids directly on a high-throughput sequencing platform, providing a powerful source of data for quantitatively and comprehensively understanding these interactions. These datasets will allow a better understanding of how both common and rare genetic polymorphisms affect biological phenotype.