Large sample sizes are increasingly common in genetics and genomics, particularly in human genetics, where samples of thousands of individuals or more are needed to detect variant associations with complex disease traits. A common feature of data from large samples is that the individuals in the study vary in how similar they are to one another, which can cause problems in downstream analyses (e.g., spurious associations) if not understood. Thus, uncovering population structure and dissecting it to understand its source is a common and important practice in large-scale studies. Here, we aim to solve challenges for visualizing population structure that regularly arise when researchers interact with large-scale population genomic data sets.
In Aim 1 we will develop a software tool for visualizing population structure using principal components analysis (PCA). This tool will streamline several steps that data scientists commonly reinvent as they analyze PCA outputs from genetic data. It will also make it clearer when a PCA may be returning anomalous results.
In Aim 2 we will develop a tool for producing geographic allele frequency maps of publicly available or user-generated allele frequency data.
In Aim 3 we will develop a visualization approach for displaying geographic regions where populations show unexpectedly high or low levels of differentiation.
In Aim 4 we will integrate these pieces of software into a single suite and link them to externally generated data sources and existing genome browsers. By developing these tools, we will help remove the need for independent researchers to write redundant scripts and so increase the pace of genomics research. Throughout the project we will pay special attention to developing user-friendly interactive data displays such as those generated by the Data-Driven Documents (D3) JavaScript visualization library. Where possible we will use simple yet flexible Python backends and provide complementary R libraries to facilitate customization and integration with existing analysis pipelines. While population genetic applications will motivate our work, the tools we are generating will be generally applicable to other forms of structured biomedical data.
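As an illustration of the kind of step that analysts routinely re-script when looking at population structure with PCA (Aim 1), the following is a minimal sketch in Python using NumPy. It is not the project's software: the function names `genotype_pca` and `simulate_two_populations`, the simulated allele frequencies, and the assumption of a complete 0/1/2 genotype matrix with no missing calls are all hypothetical choices made for this example.

```python
import numpy as np


def genotype_pca(genotypes, n_components=2):
    """Project individuals onto the top principal components.

    genotypes: (n_individuals, n_variants) array of 0/1/2 allele counts.
    Assumption for this sketch: no missing genotype calls.
    """
    G = np.asarray(genotypes, dtype=float)
    # Drop monomorphic variants, which carry no information and have zero variance.
    G = G[:, G.std(axis=0) > 0]
    # Center each variant and scale it to unit variance.
    Z = (G - G.mean(axis=0)) / G.std(axis=0)
    # The left singular vectors, scaled by the singular values, give PC coordinates.
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    coords = U[:, :n_components] * S[:n_components]
    # Fraction of total variance captured by each retained component.
    explained = (S ** 2 / np.sum(S ** 2))[:n_components]
    return coords, explained


def simulate_two_populations(n_per_pop=50, n_variants=500, seed=0):
    """Simulate genotypes for two populations with diverged allele frequencies.

    Hypothetical toy data: frequencies are shifted in opposite directions
    around a shared baseline, then genotypes are drawn as Binomial(2, p).
    """
    rng = np.random.default_rng(seed)
    base = rng.uniform(0.1, 0.9, size=n_variants)
    shift = rng.normal(0.0, 0.15, size=n_variants)
    freqs = [np.clip(base + shift, 0.01, 0.99),
             np.clip(base - shift, 0.01, 0.99)]
    genotypes = np.vstack(
        [rng.binomial(2, f, size=(n_per_pop, n_variants)) for f in freqs]
    )
    labels = np.repeat([0, 1], n_per_pop)
    return genotypes, labels


if __name__ == "__main__":
    genotypes, labels = simulate_two_populations()
    coords, explained = genotype_pca(genotypes)
    for pop in (0, 1):
        print(f"population {pop}: mean PC1 = {coords[labels == pop, 0].mean():.2f}")
    print("variance explained by PC1, PC2:", np.round(explained, 3))
```

The returned coordinates can be passed to any plotting library to draw the familiar PC1 versus PC2 scatter. Real data sets would additionally need handling for missing genotypes, linkage disequilibrium pruning, and outlier checks, none of which this sketch attempts.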
This project will provide tools for visualizing large-scale genetic data with population structure. While numerous advanced algorithms for summarizing population structure exist, the human interface to the outputs of these methods is lacking and has become a time sink during the analysis of large samples. In this project we will provide user-friendly tools that lower the barrier to understanding genetic variation datasets. In particular, we will develop tools for visualizing compressed representations of genetic variation (i.e., PCA results) and for showing how genetic diversity in a sample is distributed across geographic space.
Engelmann, Brett W; Hsiao, Chiaowen Joyce; Blischak, John D et al. (2018) A Methodological Assessment and Characterization of Genetically-Driven Variation in Three Human Phosphoproteomes. Sci Rep 8:12106
Dey, Kushal K; Hsiao, Chiaowen Joyce; Stephens, Matthew (2017) Visualizing the structure of RNA-seq expression data using grade of membership models. PLoS Genet 13:e1006599
Marcus, Joseph H; Novembre, John (2017) Visualizing the geography of genetic variants. Bioinformatics 33:594-595
Novembre, John; Peter, Benjamin M (2016) Recent advances in the study of fine-scale population structure in humans. Curr Opin Genet Dev 41:98-105
Petkova, Desislava; Novembre, John; Stephens, Matthew (2016) Visualizing spatial population structure with estimated effective migration surfaces. Nat Genet 48:94-100
de Manuel, Marc; Kuhlwilm, Martin; Frandsen, Peter et al. (2016) Chimpanzee genomic diversity reveals ancient admixture with bonobos. Science 354:477-481