As proteins accumulate changes and diverge over time, they must continue to satisfy the structural and energetic constraints that enable them to function properly. Because of this, the diversity of protein sequences available from living organisms represents a wealth of data on the relationships between protein sequence, structure, and function. Extracting insights from these data, however, remains a challenge. In principle, the optimal approach to decoding functional information in protein sequence biodiversity is to use parametric statistical inference with realistic phylogeny-based models. Historically, this approach has been limited by the amount of sequence data available, and by the difficulty and prohibitive computational complexity of the probability calculations needed. Sequence biodiversity has become much more readily available in the post-genomic era. Yet despite important advances by a few groups, the vast majority of research making use of comparative sequence information has depended on models with numerous convenience-motivated assumptions. These assumptions ignore important biological effects such as the variation of structural and functional contexts across residues in a protein and over time. Such simplifications prevent the full potential of comparative genomic data from being brought to bear on the role of genetic variation in health and disease, and thus may pose significant roadblocks to biological and biomedical discovery. Here, we propose to take advantage of methods that we recently developed to remedy this situation by eliminating the need for potentially misleading model oversimplifications. We will use these methods to model complex context-dependent variation in evolutionary patterns across positions in proteins, and to model how these patterns change over time and in response to external influences.
The evolutionary patterns at different positions will be related to known structural and energetic features and, based on these analyses, new models will be developed that directly incorporate bona fide biological effects. Because of our algorithmic advances, these models can be built to more accurately reflect the true complexity of interdependent sequence, structural, and functional effects on evolutionary processes, yielding unparalleled power to detect subtle but meaningful effects, to provide better predictive capabilities, and to more precisely characterize the causes and consequences of protein evolution. The proposed research is broadly important for human health because of the central role that protein function and dysfunction play in the mechanisms and etiology of the vast majority of human diseases and health disorders. Thus, a better understanding of the relationship between sequence variation, structure, and function will yield better prediction of the effects of human mutations, greater understanding of protein function and its role in human biology, improved rational design of novel proteins that might be used to improve human health, and potentially more accurate structure-based drug design. The proposed research will focus on modeling proteins encoded in complete vertebrate mitochondrial genomes, taking advantage of the uniquely dense sequence sampling available across diverse vertebrate species. We will pay particular attention to primate (including human) mitochondrial genomes, so that our research will provide specific insight into how key oxidative phosphorylation proteins function, and how mutations in these genes lead to human diseases by disrupting structure and function.
Given the central importance of the mitochondrion to aging, to disease (e.g., diabetes, Parkinson's, Alzheimer's, and other neurological diseases), and to cellular processes including apoptotic cell death, such insights may prove directly beneficial as a hypothesis-generating and hypothesis-testing platform for experimental, pharmacological, and translational research in several areas. By design, this project will pave the way for future research on nuclear genome datasets as their number and diversity increase.
The proposed studies are broadly important for human health because a better understanding of the relationship between protein sequence variation, structure, and function will enable more accurate prediction of the effects of human mutations, improved prediction of protein structure (and thus of protein function and its role in human biology), and improved rational design of novel proteins that might be used to improve human health. In addition to general insight, the research will provide specific insight into how core mitochondrial proteins function, and how mutations in these genes might lead to disease by disrupting structure and function. Given the importance of mitochondria to aging, to disease (e.g., diabetes, Parkinson's, and other neurological diseases), and to cellular processes including development and programmed cell death (apoptosis), such insights may also prove directly beneficial as a hypothesis-generating and hypothesis-testing platform for experimental, pharmacological, and translational research in numerous areas.