Protein overexpression is desirable for many biotechnology applications, ranging from vaccine production to drug discovery. Many mRNA sequences encode exactly the same protein sequence because multiple codons can map to the same amino acid. The recent emergence of experimental datasets of gene expression levels has created an opportunity to maximize protein expression through modeling and algorithm design. Using expression data for about 20,000 genes in common expression vectors, obtained in collaboration with the Northeast Structural Genomics Consortium, we observe that the folding free energy of the first ~50 coding nucleotides is strongly predictive of polypeptide expression level. Relatively little mRNA secondary structure at the start of the coding region is of particular importance, because that is where the ribosome assembles. Extensive mRNA secondary structure later in the gene also appears to be deleterious. We propose to develop algorithms that will enable biologists to evaluate whether native mRNA sequences are likely to express highly and to build synonymous mRNA sequences designed to optimize gene expression. In translation, splicing, and small interfering RNA gene regulation, a region of messenger RNA must be unfolded to allow binding of the ribosome, splice factors, or microRNAs. Understanding these unfolding free energy costs offers opportunities both to understand the biology of gene expression and to engineer changes in it algorithmically.
Engineering mRNA sequences to optimize polypeptide expression will facilitate the production of proteins for therapies, for the study of protein structure, and for drug discovery. Synonymous mutations have also been linked to human disease risk.
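The size of the synonymous design space mentioned above can be illustrated with a minimal Python sketch. It assumes the standard genetic code and uses hypothetical helper names (`translate`, `count_synonymous`, `synonymous_sequences`) that are not part of the proposed algorithms. It only counts and enumerates the mRNA sequences that encode a given peptide; it does not compute folding free energies, which would require a separate RNA structure prediction package.

```python
from itertools import product
from math import prod

# Standard genetic code in the conventional U, C, A, G ordering
# (third base varies fastest); '*' marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TO_AA = {
    b1 + b2 + b3: aa
    for (b1, b2, b3), aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Invert the table: amino acid -> list of its synonymous codons.
AA_TO_CODONS = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS.setdefault(aa, []).append(codon)


def translate(mrna):
    """Translate an mRNA coding sequence whose length is a multiple of 3."""
    if len(mrna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TO_AA[mrna[i:i + 3]] for i in range(0, len(mrna), 3))


def count_synonymous(protein):
    """Number of distinct mRNA sequences that encode the given protein."""
    return prod(len(AA_TO_CODONS[aa]) for aa in protein)


def synonymous_sequences(protein):
    """Lazily yield every mRNA sequence that encodes the given protein."""
    for codons in product(*(AA_TO_CODONS[aa] for aa in protein)):
        yield "".join(codons)


if __name__ == "__main__":
    peptide = "MKLS"
    print(count_synonymous(peptide))  # 1 * 2 * 6 * 6 = 72
    for seq in synonymous_sequences("MK"):
        print(seq, translate(seq))
```

Because the count is a product over residues, it grows exponentially with protein length, so exhaustive enumeration is only feasible for short peptides. A design algorithm would instead have to search or sample this space while scoring candidates, for example by the folding free energy of the first ~50 coding nucleotides.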