Biological discoveries made in other organisms inform our understanding of human gene function because homologous protein sequences can be compared across species. Recent efforts to sequence a greater diversity of species for comparative analysis have been carried out primarily at the DNA level, with protein sequences subsequently translated in silico under an assumed genetic code. However, there is currently no informed way to select the correct genetic code for a newly sequenced organism, a choice that is critical for translating predicted protein sequences correctly. As more diverse organisms are sequenced, species using variant genetic codes continue to be found, suggesting that there may be a hidden diversity of alternative genetic codes across the tree of life.
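To illustrate why the choice of genetic code matters for in silico translation, the following minimal Python sketch translates the same open reading frame under the standard code and under a variant in which TAA and TAG encode glutamine (as in NCBI translation table 6, the ciliate nuclear code). The function names and the example sequence are hypothetical and chosen only for illustration; this is not the tool proposed in Aim 1.

```python
from itertools import product

# Codon order follows the NCBI convention: bases T, C, A, G, third position varying fastest.
BASES = "TCAG"
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def build_code(reassignments=None):
    """Return a codon -> amino acid table: the standard code plus optional reassignments."""
    codons = ["".join(c) for c in product(BASES, repeat=3)]
    table = dict(zip(codons, STANDARD_AAS))
    table.update(reassignments or {})
    return table


def translate(dna, table):
    """Translate a DNA string in frame 0; unknown codons become 'X', trailing bases are dropped."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


standard = build_code()
# Assumption for illustration: TAA and TAG are reassigned from stop to glutamine (Q).
ciliate = build_code({"TAA": "Q", "TAG": "Q"})

# Hypothetical example sequence: ATG GCT TAA GGA TAG AAA TGA
orf = "ATGGCTTAAGGATAGAAATGA"
print(translate(orf, standard))
print(translate(orf, ciliate))
```

Under the standard code the sequence contains two internal stop codons, whereas under the variant code the same nucleotides yield an uninterrupted protein, which is the kind of mistranslation an incorrect code assumption would introduce into predicted protein databases.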
Aim 1 proposes building a computational tool to predict the genetic code used by an organism from nucleotide sequence alone. This would fill a critical gap in genome annotation pipelines and help ensure the accuracy of protein sequence databases, which are predominantly composed of predicted protein sequences.
In Aim 2, the computational tool will be used to infer the genetic code of every publicly available genome. Any new genetic codes will then be validated by computational analysis of tRNA genes, experimental confirmation of tRNA expression via Northern blotting, and confirmation of altered codon translation via proteomic mass spectrometry.
In Aim 3, the updated distribution of alternative genetic codes will be used to address long-standing hypotheses in the field about how the genetic code evolves.
This research training plan is intended to prepare the PI for a career as an independent and interdisciplinary researcher. Training will take place in a collaborative computational laboratory, with access to a lab bench and shared equipment for the proposed experiments. The training plan will also include development of science communication skills, including oral presentations and writing.

Public Health Relevance

Efforts are underway to sequence genomes from across the tree of life, but there is currently no informed way of selecting which genetic code to use when translating predicted protein sequences in silico. This proposal outlines a plan to build a computational tool that infers the genetic code used by an organism from nucleotide sequence data alone, and then to characterize the distribution of alternative genetic codes in all sequenced organisms. This would not only help ensure the accuracy of protein sequence databases but would also allow us to address fundamental questions about how disruptive changes to protein translation can evolve.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Predoctoral Individual National Research Service Award (F31)
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Cubano, Luis Angel
Harvard University
Graduate Schools
United States