We have developed an automatic protein fingerprinting method for the evaluation of protein structural similarities based on secondary structure element compositions, spatial arrangements, lengths, and topologies. This method can rapidly identify proteins sharing structural homologies, as we demonstrate with five test cases: the globins, the mammalian trypsin-like serine proteases, the immunoglobulins, the cupredoxins, and the actin-like ATPase domain-containing proteins. Principal components analysis (PCA) of the similarity distance matrix calculated from an all-by-all comparison of 1031 unique chains in the Protein Data Bank has produced a distribution of structures within a high-dimensional structural space. Fifty percent of the variance observed for this distribution is captured by six axes, two of which encode structural variability within two large families, the immunoglobulins and the trypsin-like serine proteases. Many aspects of the spatial distribution remain stable upon reduction of the database to 140 proteins with minimal family overlap, although the axes correlated with specific structural families are no longer observed. A clear hierarchy of organization is seen in the arrangement of protein structures in the structural universe. At the highest level, protein structures populate regions corresponding to the all-alpha, all-beta, and alpha/beta superfamilies. Large protein families are arranged along family-specific axes, forming local densely populated regions within the space. The lowest level of organization is intrafamilial; homologous structures are ordered by variations in peripheral secondary structure elements or by conformational shifts in the tertiary structure.
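The PCA step described above can be illustrated with a minimal sketch. The abstract does not say how the distance matrix was fed to PCA, so this example assumes the simplest reading: each row of the symmetric distance matrix is treated as a feature vector, the columns are centered, and the principal axes are obtained by singular value decomposition. The function names `pca_of_distance_matrix` and `axes_for_variance`, and the toy point clusters standing in for structural fingerprints, are hypothetical and not taken from the original work.

```python
import numpy as np


def pca_of_distance_matrix(dist):
    """PCA on a distance matrix, treating each row as a feature vector.

    Assumption: this is one common way to apply PCA to an all-by-all
    distance matrix; the original method may differ.
    """
    dist = np.asarray(dist, dtype=float)
    centered = dist - dist.mean(axis=0)  # center every column
    # SVD of the centered matrix yields the principal axes (rows of vt)
    # and singular values whose squares are proportional to the variances.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    variances = s ** 2 / (dist.shape[0] - 1)
    explained = variances / variances.sum()  # fraction of variance per axis
    coords = u * s  # coordinates of each item along the principal axes
    return coords, explained


def axes_for_variance(explained, fraction=0.5):
    """Smallest number of leading axes whose cumulative variance reaches `fraction`."""
    cumulative = np.cumsum(explained)
    count = int(np.searchsorted(cumulative, fraction)) + 1
    return min(count, len(explained))  # guard against rounding at fraction=1.0


# Toy stand-in for structural fingerprints: three clusters of points,
# compared all-by-all with Euclidean distance (hypothetical data).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 3)) for c in centers])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords, explained = pca_of_distance_matrix(dist)
n_axes = axes_for_variance(explained, fraction=0.5)
```

In this sketch, `n_axes` plays the role of the "six axes" figure quoted above: it counts how many leading principal axes are needed to account for half of the total variance, and `coords` gives each item's position in the resulting space.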