Proteins are the key functional molecules in cells and perform many biological tasks. These include catalyzing reactions, providing structure to cellular components, signaling between cells and regulating the expression of genes, among many others. Proteins are composed of chains of amino acids, assembled initially as a long linear sequence that then folds into a tightly controlled 3D structure, which gives each protein its highly specific function. The advent of genome sequencing, coupled to advances in mass spectrometry and allied computing techniques, has transformed the study of these molecules into a "Big Data" discipline. This particular branch of "'omics" is referred to as proteomics: the high-throughput study (identification and quantification) of all the proteins that can be detected in a given biological sample. Proteomics is used right across biological and biomedical research for profiling systems as varied as human, model organisms including plants, and infectious diseases/microbes, among many others. Many biological functions depend on chemical modifications that proteins can undergo, called post-translational modifications (PTMs). Because of PTMs, a single gene can give rise to a great number of different protein entities, which can potentially have different biological functions. PTMs can provide a rapid mechanism for changing function, such as switching an enzyme (biological catalyst) "on" and "off". Due to their functional importance, sites of PTMs on proteins are frequently the targets for drug design, particularly against cancer. In this grant, high-quality data analysis pipelines will be used to study the occurrence of the main types of PTMs across hundreds of proteomics datasets in the public domain, involving human and the main model organisms (e.g. mouse, rat and the model plant Arabidopsis).
The types and sites of post-translational modifications (PTMs) on proteins are rich and diverse, providing cells with a rapid mechanism for adapting function under different conditions. PTMs are widely studied across all areas of fundamental and applied life sciences research. Proteomics approaches using mass spectrometry (MS) provide the sole high-throughput means to detect and localize protein PTMs. Despite the biological importance of PTMs, PTM-relevant data are collated in the public domain via disparate resources that lack data provenance. An efficient way to improve the situation is to make PTM information derived from proteomics approaches available through UniProtKB (www.uniprot.org/), the world-leading protein knowledgebase. Hundreds of relevant PTM proteomics datasets are now in the public domain, since the proteomics community is widely embracing open data policies (e.g. through the resources PRIDE and PeptideAtlas, part of the ProteomeXchange consortium). We will develop open, reproducible pipelines and deploy them in the cloud to consistently re-analyse hundreds of PTM-relevant public datasets from human and the main model organisms. Complementary analysis approaches will be used: primarily standard protein database searches, but also spectral library-based and open modification searches. Special attention will be devoted to ensuring that PTM localization is accurate, and community guidelines will be developed with that goal in mind. The resulting data will be widely disseminated to UniProtKB and other knowledgebases (e.g. neXtProt) and made available at PRIDE, PeptideAtlas, and a new resource, PTMeXchange. These new PTM data will be integrated across studies at an unprecedented scale and accuracy, increasing statistical power. Finally, several demonstration studies will be performed to understand PTM motifs, function and evolution.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.