The Human Genome Project is rapidly pouring a wealth of DNA sequence data into databases at the National Institutes of Health (NIH). Within this vast quantity of data lie the largely not-yet-understood "blueprints" that the individual cells in an organism use to build the array of proteins that serve as the molecular machines for executing the wide variety of biological processes necessary to sustain life. This ever-growing genome database serves as a fundamental resource in accelerating research that uses mass spectrometry to identify proteins. The database is much like having the answers to the odd-numbered problems in the back of the book. The difficulty for scientists then becomes how to pose an odd-numbered question and then decipher the answer. Mass spectrometry (MS) techniques produce two types of information from a single sample in a matter of minutes. The first is peptide mass. A so-called "peptide-mass fingerprint" is obtained after using an enzyme to digest a target protein into a mixture of smaller pieces called peptides. The molecular mass of each peptide in the mixture is measured with a mass spectrometer. The resulting set of masses constitutes a "fingerprint." The second is peptide sequence. In a tandem MS experiment, individual peptides in an unseparated mixture can be selectively fragmented. Subsequent measurement of the fragment masses yields data in the form of a "peptide fragment-ion tag" and allows sequence to be nominally derived from the mass differences between adjacent fragments. Because of the complexity of the data produced from these types of experiments and the tremendous sample throughput potential from automation of MS instruments, we can develop software for manipulating the data into a form that allows us to pose the question: "Is the sequence of the protein we have just analyzed in the genome database?"
If so, we and our collaborators could then begin to evaluate what is already known about the protein and how it might be important in the particular disease being studied. On those increasingly rare occasions when the particular protein sequence is not in the database, our data could be used to initiate gene-cloning efforts. Moreover, even with weak MS spectra from very low quantities of material, the combination of partly ambiguous mass and partial sequence data that can be obtained is of high discriminating power. Hence, genome database searches could lead to unambiguous, high-confidence protein identifications because only a minuscule fraction of the enormous number of theoretically possible sequences can exist in the limited genome size of a living organism.
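As an illustration only, the peptide-mass fingerprint search described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the software referred to in this abstract: the enzyme is assumed to be trypsin (cleaving after K or R unless followed by P) with no missed cleavages, masses are neutral monoisotopic values, the matching tolerance of 0.01 Da is arbitrary, and the function names and the toy sequence database are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per peptide for the free termini


def tryptic_digest(sequence):
    """Split a protein after K or R, except when the next residue is P.

    Assumption: complete digestion, no missed cleavages.
    """
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(observed_masses, sequence, tolerance=0.01):
    """Count observed masses that fall within `tolerance` Da of any
    theoretical tryptic peptide mass of `sequence`."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(
        any(abs(obs - theo) <= tolerance for theo in theoretical)
        for obs in observed_masses
    )


def rank_candidates(observed_masses, database, tolerance=0.01):
    """Rank database entries (name -> sequence) by number of matched masses."""
    scores = [
        (name, count_matches(observed_masses, seq, tolerance))
        for name, seq in database.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical toy database; real searches run against genome-derived
    # protein sequence collections.
    toy_database = {"toy_A": "MKWVTFISLLR", "toy_B": "GASPKVTCR"}
    # Simulate a measured fingerprint from toy_A.
    fingerprint = [peptide_mass(p) for p in tryptic_digest(toy_database["toy_A"])]
    for name, score in rank_candidates(fingerprint, toy_database):
        print(name, score)
```

Counting matched masses is the simplest possible score. It is used here only to show why a handful of peptide masses can discriminate among database entries; a practical search would also need to handle missed cleavages, modifications, and measurement error.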