In consultation with Dr. Spouge, Dr. Gonzalez-Delgado set up a semi-automated pipeline that uses the alignment program HMMer and the Pfam protein domain database to infer the functionality of viral proteins. The strategy is to take the genome of a human virus (many are now fully sequenced), extract the annotated genes, and run HMMer in reverse mode, that is, searching each viral protein against the Pfam domain models, to find significantly similar domains in Pfam. These domains then become candidates for inferring the functionality of the viral proteins. Once the significant Pfam domains have been found, the Pfam domain database specifies which of the proteins contributing to a given domain are human proteins. Those human proteins therefore become candidates for a previous horizontal gene transfer between the virus and its human host.

To inform the project with questions relevant to experimental virologists, Dr. Gonzalez-Delgado consulted with Drs. DeVico and Lewis at the Institute of Human Virology, who represent virologists interested in developing therapies for the human immunodeficiency virus, HIV. Drs. DeVico and Lewis (who are not experts in HIV evolution) told us that they were not aware of horizontal gene transfer between HIV and humans, although a priori it seems likely to occur, because as a retrovirus HIV inserts itself into the human genome and then replicates. Dr. Gonzalez-Delgado was able to find similarities between human proteins and more than 20% of the HIV genes. Encouraged by this result, Drs. DeVico and Lewis suggested that Dr. Gonzalez-Delgado examine human herpes virus, which has been a prototype for many studies using sequence alignment to find horizontal transfer between viruses and humans. In preliminary runs, Dr. Gonzalez-Delgado was able to use her pipeline to increase the proportion of human herpes virus genes showing statistically significant similarity to the human genome from about 20% to almost 50%, and to confirm many previous inferences of functionality from the database annotations for the human proteins.

Although there are many investigations of viral evolution using sequence similarity, our interest is not to retrace the evolutionary connections between viruses and their hosts. Rather, we are interested in positing possible functionality for viral proteins, a purpose for which the chosen approach has some practical advantages. First, positing functionality for a viral protein from a corresponding human protein requires annotation of the human protein. The sequences in the Pfam domain database are generally more heavily annotated than random protein sequences, because annotations propagate from one protein to another within each family. Thus, because most Pfam families contain several protein sequences of known function, inferring the function of a family is generally not difficult. Second, because the pipeline is based on the curated multiple alignments in Pfam, and because it uses protein searches based on position-specific scoring matrices, it has a significant advantage in sensitivity over pairwise alignment programs like BLAST. Third, the nature of the pipeline makes it a good candidate for complete automation. Fourth and finally, the use of a domain database instead of a protein database increases the sensitivity of the search for homology, because domains are the functional combinatorial units of evolution.
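The post-search steps of the pipeline described above (keep the significant Pfam domains, look up which human proteins belong to each domain family, and report the fraction of viral genes with a human candidate) can be illustrated with a minimal Python sketch. It is not the pipeline's actual code. The record layout, the function names, the E-value cutoff of 1e-3, and all gene, family, and protein identifiers are hypothetical placeholders chosen for illustration, and the hits are assumed to have already been parsed from the search output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainHit:
    gene: str      # viral gene identifier (placeholder)
    family: str    # Pfam family identifier (placeholder)
    evalue: float  # E-value reported for the gene/family match


def candidate_transfers(hits, human_members, max_evalue=1e-3):
    """Map each viral gene to human proteins in its significant families.

    max_evalue is an assumed significance cutoff, not a value from the text.
    """
    candidates = {}
    for hit in hits:
        if hit.evalue > max_evalue:
            continue  # not statistically significant under the assumed cutoff
        humans = human_members.get(hit.family, set())
        if humans:
            candidates.setdefault(hit.gene, set()).update(humans)
    return candidates


def fraction_with_candidates(all_genes, candidates):
    """Fraction of viral genes that have at least one human candidate."""
    if not all_genes:
        return 0.0
    return sum(1 for g in all_genes if g in candidates) / len(all_genes)


# Toy input: every identifier below is invented for the example.
hits = [
    DomainHit("geneA", "FAM_1", 1e-12),  # significant, family has human members
    DomainHit("geneB", "FAM_2", 0.5),    # not significant
    DomainHit("geneC", "FAM_3", 1e-6),   # significant, but no human members
]
human_members = {
    "FAM_1": {"HUMAN_P1", "HUMAN_P2"},
    "FAM_2": {"HUMAN_P3"},
    "FAM_3": set(),
}
all_genes = ["geneA", "geneB", "geneC", "geneD"]

candidates = candidate_transfers(hits, human_members)
fraction = fraction_with_candidates(all_genes, candidates)
print(candidates, fraction)
```

In this toy input only geneA passes both filters, so one of the four genes has a human candidate. A percentage like those quoted above corresponds to this kind of fraction computed over all annotated genes of a viral genome.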

National Library of Medicine