Understanding printed documents is an intelligent activity. This research is about automating one aspect of analyzing a document image and deriving a high-level representation of its visual content. Documents often contain photographs and accompanying text. This effort is concerned with arriving at an integrated interpretation of the communicative unit consisting of a photograph and its caption. When text describes salient aspects of a photograph, it is possible to use the text to direct a vision system in understanding the photograph. There are two components to this research: the first deals with language issues and the second with the development of a vision subsystem. Methods of extracting visual information from text, specifically the cues required to identify salient objects, are to be studied; such information may be present in a variety of forms, based on both syntax and semantics. The role of textually extracted visual cues in performing visual object recognition is also to be studied. As a test of the theory, it is proposed to develop a system in which the result of parsing the caption of a newspaper photograph is used to identify human faces in the photograph. The face location subsystem will incorporate scale-invariant techniques and filters that characterize faces based on the presence of distinguishing visual features.