Mass spectrometry (MS) in combination with database searching is a popular and presumably accurate method for identifying proteins from species with known genomes. Proteins that are separated on a gel, digested with, e.g., trypsin, and extracted from the gel yield specific peptides, which can subsequently be analyzed by MS. The distribution of tryptic peptide masses, a so-called tryptic peptide map, is a protein fingerprint and can be compared with the sequence information stored in a database. Various scoring methods have been developed to find the protein candidate with the highest degree of similarity to the experimentally obtained peptide map. Due to imperfections in the separation and extraction, contamination during processing, etc., the tryptic peptide map is typically incomplete with respect to the protein identified, and it also contains a background of tryptic peptide masses from one or several other proteins. This means that a protein identification may not always be accurate and unambiguous. In light of some apparent problems in distinguishing good protein identification results from more uncertain ones, we are currently developing methods for determining the quality of protein identification by mass spectrometry. The approach is to perform protein identifications on hypothetical sets of tryptic peptide mass data generated by a computer. This gives us full control over the quality of the data and lets us vary the data as well as the search parameters in many different ways. Protein identifications based on realistic but random hypothetical data sets are particularly useful. Independently of the scoring method used in the identification, repeated use of random data sets can generate the probability density function for protein identification by chance.
Knowing this function under the conditions of a particular experiment, such as the size and mass accuracy of a peptide map, one can test the hypothesis that the identification score is an observation from a random distribution. This, in turn, allows a confidence level to be assigned to the identification. Furthermore, statistical analyses of protein identification with hypothetical data will allow us to determine the quality of currently employed scoring methods. Improved insight into the features that characterize a good scoring method will guide our future efforts to refine such methods for protein identification.
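The idea of estimating the chance score distribution from random data sets can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the database is a hypothetical list of random mass lists, masses are drawn uniformly between 800 and 3000 Da (real tryptic masses are not uniformly distributed), and the score is a simple count of matching masses that stands in for whichever scoring method is actually used. The function names are likewise hypothetical.

```python
import bisect
import random


def count_matches(peptide_map, protein_masses, tol):
    """Count measured masses lying within +/- tol of a theoretical mass.

    protein_masses must be sorted in ascending order.
    """
    hits = 0
    for m in peptide_map:
        # Index of the first theoretical mass that is >= m - tol.
        i = bisect.bisect_left(protein_masses, m - tol)
        if i < len(protein_masses) and protein_masses[i] <= m + tol:
            hits += 1
    return hits


def best_score(peptide_map, database, tol):
    """Score of the top-ranking database entry for one peptide map."""
    return max(count_matches(peptide_map, masses, tol) for masses in database)


def random_masses(rng, n, lo=800.0, hi=3000.0):
    """Assumption: uniformly distributed masses, a crude stand-in for realistic data."""
    return sorted(rng.uniform(lo, hi) for _ in range(n))


def null_scores(database, map_size, tol, n_trials, rng):
    """Top scores obtained by searching random peptide maps against the database."""
    return [
        best_score(random_masses(rng, map_size), database, tol)
        for _ in range(n_trials)
    ]


def empirical_p_value(observed, null):
    """Fraction of chance scores at least as high as the observed score.

    The add-one correction keeps the estimate away from exactly zero.
    """
    exceed = sum(1 for s in null if s >= observed)
    return (exceed + 1) / (len(null) + 1)


rng = random.Random(0)
# Hypothetical database: 200 proteins with 30 theoretical peptide masses each.
database = [random_masses(rng, 30) for _ in range(200)]
# Chance distribution for maps of 15 masses at a tolerance of 0.5 Da.
null = null_scores(database, map_size=15, tol=0.5, n_trials=500, rng=rng)
# Test whether an observed top score of 12 matches could plausibly arise by chance.
p = empirical_p_value(12, null)
print(f"empirical p-value: {p:.4f}")
```

The chance distribution has to be regenerated whenever the map size, mass tolerance, or database changes, since it is only valid under the conditions of the particular experiment. A small p-value argues against the hypothesis that the observed score is a draw from the random distribution.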