The last few decades have seen the birth and maturation of the field of molecular biology. Initially, mutant genes were the focal points of genome exploration. Now, entire genomes are routinely sequenced, and the resident genes are automatically identified by annotation algorithms. Alternatively, proteomic approaches prepare proteolytic peptides from whole-cell extracts for analysis by mass spectrometry. Each of these approaches is strongly biased toward large genes: large genes are frequent targets for mutation, long open reading frames are easily discerned in genomic sequence, and large proteins generate many peptides for mass spectrometry identification. This unintended bias has created a large gap in our understanding of molecular biology. Recent work in eukaryotes and prokaryotes alike has uncovered multitudes of small genes or their encoded proteins. The number of small proteins (considered as 50 aa or fewer) rivals that of traditionally large proteins, yet only a handful have been ascribed a function. The goal of this proposal is to propel this nascent field forward by facilitating both small protein discovery and functional characterization. Our preliminary data identify specific examples that clearly define cis and trans classes of function for short open reading frames and small proteins. These early leads will be pursued to fruition, providing the framework for the expanded, rigorous study needed in any new field. We will test an additional subset of small proteins for function, which we anticipate will reveal functions for each member of this training set while also establishing general principles for short open reading frames and small proteins. We will develop and apply our small protein approaches in mycobacteria. Mycobacteria offer many advantages for small protein study. Foremost is that they express >1000 small proteins in standard conditions. An extensive toolkit for modifying, culturing, and analyzing mycobacteria makes them very tractable.
A GC-rich genome provides codon bias selection as one criterion for identifying functional small proteins. Moreover, our findings of small gene and small protein function in standard laboratory conditions may directly provide insights into the biology and pathogenesis of infection. This proposal integrates the complementary expertise of investigators whose ongoing collaboration has already laid the requisite groundwork for this work. Through the proposed Aims, we will identify new functional roles of mycobacterial small proteins and develop an optimized small-proteomics pipeline for efficient application to other bacteria, archaea, and eukaryotes.
Small proteins (sproteins), defined as proteins of <50 amino acids, have been overlooked for decades for three reasons: genome annotation pipelines ignore them, their biochemical properties mean they are often discarded in standard protein preparations, and their small size makes them challenging to detect by mass spectrometry. The paucity of work on sproteins stands in stark contrast to their prevalence in all domains of life; recent application of genome-wide approaches has revealed large numbers of sproteins in eukaryotes and prokaryotes. Focusing on mycobacteria, where sproteins are found in large numbers, we will characterize regulatory ORFs that encode sproteins, identify functional roles for trans-acting sproteins, and develop methods to identify additional functional sproteins.