Mass spectrometry-based top-down proteomics has emerged as one of the most informative approaches in protein analysis because it provides a bird's-eye view of all intact proteoforms generated by post-translational modifications and sequence variations. A major challenge in proteoform identification by database search is the combinatorial explosion of possible proteoforms resulting from combinations of sequence variations, post-translational modifications, and other molecular events, such as protein degradation. Here, we propose a novel data model, called the mass graph, to efficiently represent a huge number of potential proteoforms, and design new mass graph-based alignment and filtering algorithms that precisely identify complex proteoforms at the proteome level. We will also develop a software pipeline that combines top-down mass spectrometry and RNA-Seq data to identify sample-specific proteoforms. The proposed research will be conducted by a group of researchers who have complementary expertise. All the proposed algorithms will be implemented as user-friendly open-source software tools.
This project addresses the proteoform identification problem using top-down mass spectrometry and top-down mass spectrometry-based proteogenomics. New data models and algorithms will be proposed for high-throughput, proteome-wide identification of complex proteoforms with post-translational modifications and sequence variations. Software tools based on these algorithms will facilitate the decoding of complex proteoforms, such as those of histone proteins, and the discovery of proteome biomarkers.