Our overarching goal is to understand how information characterizing genes and their function can be organized, integrated, and then generalized to new contexts. This is a central question of the post-genomic era, and one that becomes ever more pressing as novel assays expand the scope, breadth, and detail of information describing gene properties. While the Gene Ontology is the most prominent and universal system for organizing gene function, hundreds of others exist, often serving specialized research interests. Most laboratories depend on the validity of some subset of these resources to design new experiments or interpret their results, but the quality of the resources is hard to ascertain directly, particularly when they are used in novel or complex integrative methodologies. Based on substantial preliminary data, we hypothesize that determining robustness and specificity will provide a highly general assessment of the utility of databases. We propose to use these properties to assess the entire corpus of resources organizing gene information, as well as the methods that exploit this information and the results that they report. Critically, determining robustness and specificity does not require validation with respect to "gold standard" information. By evaluating these resources with respect to their joint specificity and robustness, we will determine means of integrating and organizing their data for use in novel applications. Finally, we propose to apply our improvements in quality control to better target rare but robust results where this is an experimental goal, notably in rare diseases and single-cell expression. The three complementary objectives in this project are to:
1. Determine the uniqueness and robustness of data characterizing gene function. We will develop a formal approach for characterizing robustness and uniqueness/specificity by exploiting prior probability in the form of gene multifunctionality. We will evaluate robustness and specificity across essentially all complex and structured databases characterizing genes. These measures can be compared between databases or over time and provide a global landscape of data structure.
2. Test methods designed to exploit information describing gene function. Statistical and machine learning methods exploiting structured data will be assessed for robust and specific output. Data features driving performance in diverse applications will be identified, and complementary sources of data as well as community clusters will be defined.
3. Evaluate results that depended on the use of databases describing gene function. Using a combination of text-mining and figure-mining, we will assess the ongoing literature for novel, robust, and specific gene-function associations. We will characterize and evaluate the "dark matter" of gene-function association from the perspective of both unannotated genes and incomplete functions.
Recent advances in genetics have made it possible to assess gene function in many different contexts, and these data are captured in numerous resources that catalogue the varied properties of genes. The primary goal of this project is to explore and characterize the specificity and robustness of information about gene function within these databases. Because most biomedical experiments build on pre-existing knowledge, establishing guidelines for interpreting those data also has a large impact on our ability to understand novel results, particularly where they are unusual, as in many rare diseases.
Ballouz, Sara; Dobin, Alexander; Gingeras, Thomas R et al. (2018) The fractured landscape of RNA-seq alignment: the default in our STARs. Nucleic Acids Res 46:5125-5138
Crow, Megan; Paul, Anirban; Ballouz, Sara et al. (2018) Characterizing the replicability of cell types defined by single cell RNA-sequencing data using MetaNeighbor. Nat Commun 9:884
Crow, Megan; Gillis, Jesse (2018) Co-expression in Single-Cell Analysis: Saving Grace or Original Sin? Trends Genet 34:823-831