This subproject is one of many research subprojects utilizing the resources provided by a Center grant funded by NIH/NCRR. The subproject and investigator (PI) may have received primary funding from another NIH source, and thus could be represented in other CRISP entries. The institution listed is for the Center, which is not necessarily the institution for the investigator.
With the multitude of MS instruments and data analysis software platforms available for MS and proteomics, it becomes difficult to manipulate and manage the various data sets. We recently presented a software application that converts processed MS data files obtained on a variety of instruments into several common formats accepted by different software applications. We have further developed the program to add support for the mzXML format (Pedrioli et al., 2004) and to incorporate a front-end interface that can be linked to several web-based database search engines, including Mascot, ProteinProspector and BUPID (a peptide mass fingerprinting program based on a log-likelihood ratio model developed here at BUSM) (Tong et al., 2005).
The data processing software was developed using Microsoft Visual Basic 6.0. To add support for the mzXML format, we used MSXML 4.0 as the XML parser and built a Visual C++ library to decode the Base64-encoded peak list data in the mzXML file. Other supported data formats are intermediate files converted from raw data files using software from the manufacturers: LC MS/MS data are processed with Analyst QS (ABI/Sciex) or MassLynx/PLGS2.1 (Waters), MALDI MS data with MOverZ (Proteometrics LLC), and FTMS data with BUDA (O'Connor, www.bumc.bu.edu/FTMS). The BUPID program was developed in C under Linux and made accessible to the main program through a CGI-based web interface. The shell data conversion program provides a user-friendly GUI and can be operated in an unattended batch processing mode.
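The Base64 decoding of mzXML peak lists mentioned above can be illustrated with a short sketch. The original library was written in Visual C++; Python is used here only for brevity. The sketch assumes the usual mzXML layout, in which the peaks element holds Base64 text that decodes to big-endian (network byte order) floating-point values arranged as alternating m/z and intensity pairs, at 32-bit or 64-bit precision. The function names `decode_peaks` and `encode_peaks` are hypothetical and are not taken from the program described here.

```python
import base64
import struct


def decode_peaks(encoded: str, precision: int = 32) -> list[tuple[float, float]]:
    """Decode a Base64 peak list into (m/z, intensity) pairs.

    Assumes big-endian floats stored as alternating m/z and intensity values.
    """
    if precision not in (32, 64):
        raise ValueError("precision must be 32 or 64")
    width = precision // 8
    code = "f" if precision == 32 else "d"

    raw = base64.b64decode(encoded)
    if len(raw) % (2 * width) != 0:
        raise ValueError("decoded data is not a whole number of m/z-intensity pairs")

    count = len(raw) // width
    values = struct.unpack(f">{count}{code}", raw)  # ">" selects big-endian
    return list(zip(values[0::2], values[1::2]))


def encode_peaks(peaks: list[tuple[float, float]], precision: int = 32) -> str:
    """Inverse of decode_peaks, useful for round-trip checks."""
    if precision not in (32, 64):
        raise ValueError("precision must be 32 or 64")
    code = "f" if precision == 32 else "d"
    flat = [value for pair in peaks for value in pair]
    raw = struct.pack(f">{len(flat)}{code}", *flat)
    return base64.b64encode(raw).decode("ascii")


if __name__ == "__main__":
    sample = [(500.25, 1000.0), (1200.5, 250.0)]
    text = encode_peaks(sample)
    print(text)
    print(decode_peaks(text))
```

The precision would normally be read from the attributes of the peaks element in the file rather than assumed, and a real reader would first locate the element with an XML parser, as the program does with MSXML.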
Testing of the program was performed on existing MALDI-TOF MS, MALDI-FT MS and LC MS/MS data sets obtained in house. The program converted large volumes of data obtained on different instruments into the input formats of several commercially and publicly available search engines. Files were then submitted to the search engines for protein identification with the search settings specified by the user. For a batch of files, the search settings need to be specified only once, which allows unattended operation. Result files are automatically saved in HTML format and can be viewed directly inside the program. Our recent implementation of the mzXML format introduced by the Institute for Systems Biology affords the benefits of a common data format: summation of results obtained on different MS platforms, comparative analysis of MS methodology, and archiving of data so that they can be analyzed at a later date in house or at a different facility. The software provides an easy-to-use graphical interface for automatic MS data conversion and database searching. It can also easily be expanded to support more MS data types and linked to more database search engines.
The new search algorithm, Boston University Protein Identifier (BUPID), provides a robust and accurate statistical model for protein identification using MS data. The algorithm offers a number of new features:
1. Using the log-likelihood ratio as the scoring function, the algorithm can best distinguish correctly assigned peptides from incorrect assignments.
2. Matching peaks with a background-dependent threshold offers more flexibility and accuracy than the traditional mass window.
3. The statistical model provides results similar to or better than those of conventional database search engines.
We use the log-likelihood ratio to calculate the probability that a protein is present in the sample. The model distinguishes two hypotheses:
H0: a set of peaks in the spectrum is generated by the random background; and
HA: the same set of peaks is generated by peptides corresponding to a specific protein.
A peak is included in the set if the probability that it is produced by the protein is greater than the probability that it is produced by the random background. Final results are ranked by the E-value of their probability score, using the sequence information of the protein.
We have compared the performance of the BUPID server with that of several other public web-based database search engines. Peptide map data sets were obtained from existing ongoing projects. On average, BUPID database search results had 27% more true positives in the top 10 predictions than Mascot results. Within the top 100 predictions, BUPID also showed 27% more true positives than Mascot. In addition, BUPID found all five human hemoglobin proteins within the top 20 predictions in 6 cases, whereas Mascot succeeded in one case. When only peaks with higher than 5% relative intensity were used, the differences were 28% and 24% within the top 10 and top 100 predictions, respectively. With another MALDI data set (transcription factor E2F1 protein purified and separated on 1D SDS-PAGE), all five search engines returned similar results. BUPID also provides various data visualization tools that many users have found useful, including a combined view of a protein mixture and mass spectra of shared or similar peptides in different proteins. A typical BUPID run takes 2 to 3 minutes on a Pentium IV PC.
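The hypothesis test between H0 and HA described above can be sketched as follows. This is only an illustration of a log-likelihood ratio score with the stated peak-inclusion rule, not BUPID's actual model: the abstract does not give the probability functions, so the per-peak probabilities are supplied directly as hypothetical inputs, and the function name `log_likelihood_ratio` is invented for the example.

```python
import math


def log_likelihood_ratio(peaks: list[tuple[float, float]]) -> tuple[float, int]:
    """Sum log(P(peak | HA) / P(peak | H0)) over the peaks that favor HA.

    Each entry is (p_protein, p_background), both hypothetical inputs:
    p_protein is the probability of the peak under HA (the protein),
    p_background is its probability under H0 (random background).
    A peak is included only when p_protein exceeds p_background.
    Returns the total score and the number of included peaks.
    """
    score = 0.0
    included = 0
    for p_protein, p_background in peaks:
        if p_background <= 0.0:
            raise ValueError("background probability must be positive")
        if p_protein > p_background:
            score += math.log(p_protein / p_background)
            included += 1
    return score, included


if __name__ == "__main__":
    # Made-up probabilities for three peaks; the second favors the background
    # and is therefore left out of the set.
    example = [(0.8, 0.1), (0.05, 0.2), (0.5, 0.25)]
    print(log_likelihood_ratio(example))
```

Under this rule every included peak contributes a positive term, so a candidate protein's score rises with each peak that it explains better than the background does. Ranking by E-value, as described above, would need an additional step that this sketch does not attempt.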