The human genome contains the structural and operational instructions for living cells, yet exactly what these instructions are and how they are utilized and encoded in the primary genomic sequence remains poorly understood. Arguably the only well-understood portions of the genome are protein-coding regions, which make up less than 2% of the genome. It has become increasingly clear that the non-coding genome encodes vast numbers of regulatory elements important for controlling gene expression levels in a cell-type-specific manner. Moreover, the overwhelming majority of disease- and trait-associated variants identified by genome-wide association studies (GWAS) lie in non-coding regions of the genome and are strongly enriched in regulatory elements. Despite this clear relevance, we still lack a complete understanding of the global organizing principles of the regulatory genome, such as how regulatory elements are distributed across the genome, what their occurrence patterns are across cell types, and how they are encoded in the genomic sequence. We hypothesize that the main reason for our limited understanding is not a lack of data, but that most data sets are generated and ultimately analyzed in isolation, which limits their full potential. To further our understanding of the organizing principles of the regulatory genome, it is therefore essential to take an en masse approach to data analysis, exploiting the dynamics across large numbers of observations. In this project, we will use this notion to develop methods for defining the first comprehensive and pragmatically useful human regulatory genome annotation based on the coordinated occurrence patterns of regulatory elements across hundreds of cell types and states. Beyond individual elements, we will define multi-kilobase domains of shared regulatory activity, which will shed light on the regulatory landscapes around genes and higher-order regulatory domains.
In addition, we will integrate regulatory annotations with orthogonal information based on functional genomics chromatin state data to arrive at a rich composite view of the regulatory genome. Lastly, we will develop the first fully data-driven system for designing and validating context-specific synthetic regulatory elements. We anticipate that our results will provide a new lens on the human regulatory genome, which will open up new research avenues in the areas of systems and synthetic biology, ultimately contributing to the understanding and treatment of human disease. We are determined to provide the genomics community with pragmatically useful regulatory genome annotations and tools to utilize these resources.
The human regulatory genome plays a crucial role in the coordination of gene expression in health and disease but remains poorly understood. This application proposes a general framework for elucidating the organizing principles of the regulatory genome through large-scale data integration, from a common coordinate system for regulatory elements to their universal annotation and in-depth characterization. This work will generate insights and testable hypotheses for the study of basic gene regulatory processes, the interpretation of genetic variation, and synthetic biology efforts, all of which are key areas contributing to the understanding and treatment of human disease.