Peptide hormones regulate embryonic development and most physiological processes by acting as endocrine or paracrine signals. They are also a rich source of relatively safe medicines to treat both common and rare diseases. Yet finding peptide-coding genes below ~300 base pairs is inherently difficult because they lie within the noise of the genome. Recent multidisciplinary, proteophylogenomic studies in lower species, such as yeast and flies, have uncovered hundreds of new small protein-coding genes called ?smORFs?. In humans, recent work on the mitochondrial genome has also uncovered dozens of small peptide hormone genes called MDPs. Based on these and other studies, it is estimated that about 5% of proteins in the human nuclear genome have not yet been discovered, particularly those that encode small peptides below 100 amino acids. It is a well documented but rarely challenged practice to discard large quantities of sequencing and proteomic data because they do not match the annotated human genome. My overarching goal is to discover the human ?secretome? and make practical use of it to improve the human condition. Over the past few years, we have developed a unique pipeline of technologies that combines breakthroughs in math, computer hardware and software, proteomics, mass spectrometry, and HTS screening, each of which has been optimized and integrated. Our GeneFinder software modules, based on machine-learning, can process data 100 times faster than traditional methods and rapidly validate small human genes using public and in-house generated databases of genetic and proteomic data. Using the prototype version of the platform that finds conservation between humans, chimp, and macaque, we have discovered thousands of putative peptide-coding genes and validated hundreds of them.
We aim to (1) further improve the algorithm to increase its speed and accuracy, (2) improve the genome annotation for thousands of small novel genes, (3) determine their expression profiles in normal and diseased tissues, (4) explore their genetic association with disease loci, and (5) screen the first secretomic library to find hormones with novel biological and therapeutically relevant activities. The data, the software package, and libraries will be made available to the research community. In doing so, we will shed light on the dark matter of the human genome, the parts with the greatest therapeutic potential, thereby helping to steer and accelerate the pace of research and drug development for generations to come.
There has been a rapid expansion in the use of peptide hormones as drugs over the last decade, yet new research indicates that more than 90% of all hormones in the body (encoded by an additional 5% of the human genome) remain to be discovered. As a result, terabytes of data are discarded each week and innumerable opportunities for biological discovery are missed because, according to our findings, the majority of genes below ~300 base pairs are missing from the annotated human genome. We propose an integrated, multi- disciplinary approach to find, validate and characterize an estimated 4000-5000 new peptide-coding genes using a pioneering technology platform that combines breakthroughs in math, custom-built computer hardware and software, and wet-lab approaches, providing a far more complete roadmap for biology and medicine in the 21st century.