Institution: Boston College

Artists are the masters of visual perception. Studying art and vision together can provide new solutions to fundamental problems in computer vision. We focus on inferring scene layout from a single image. This problem has been studied since the earliest days of Artificial Intelligence research, resulting in a host of so-called Shape-from-X methods, where X could be shading, perspective, etc. Unfortunately, each of these methods works under its own assumptions, which often do not hold in real images. How these cues interact and integrate remains elusive. Painters constantly combine four techniques (occlusion, perspective, shading, and form) to effectively evoke a 3D percept from a 2D picture. Studying their techniques can lend insights into the computation of recovering scene layout from pixel values. The PI proposes to bring artists and vision scientists together to solve the computational problem of inferring scene layout from pictorial cues. The project pursues this goal in three areas: education, experiments, and computational modeling.

A new interdisciplinary course, Art and Visual Perception, has been developed at Boston College to give a comprehensive cross-examination of how art contributes to the understanding of vision, and how vision contributes to the generation and viewing of art. Students are actively engaged in both art practice and vision experiments. Learning art and vision together results in a deeper understanding than studying each discipline separately. Students' assignments also result in valuable datasets for vision research.

The computational approach to scene layout from pictorial cues in this project is to group pixels into spatially organized surfaces from a global integration of multiple pictorial cues in a spectral graph-theoretic framework. The goal is to turn artistic rendering knowledge on how these cues interact into a computational reality. The PI will study geometry (occlusion and perspective), appearance (brightness and color), and form using eye tracking and psychophysics experiments and computational models. These efforts are organized into two phases that progress from inferring the spatial layout from scenes made of planar surfaces (rooms and streets) to scenes made of curved surfaces (landscape and generic scenes).

Intellectual Merit

What is most remarkable about vision is its ability to perceive 3D spatial layout from a single 2D image. The proposed research replicates this ability in computation from a grouping perspective. Compared to statistical learning approaches, the grouping method is not only generic, and thus scales well with the number of scenes, but can also produce a precise organization of surfaces in the scene. Compared to traditional Shape-from-X approaches, the grouping method examines each pictorial cue in conjunction with the others. Integrating these multiple pictorial cues makes them applicable to real images for the first time. The PI has developed the essential grouping machinery in spectral graph theory for depth segregation. Compared to most existing formulations on this topic, it has unparalleled conceptual simplicity, computational efficiency, and guaranteed near-global optimality. The proposed research on brightness and color perception, in connection with Shape-from-Shading and surface organization, will help clarify the role of low-level and high-level mechanisms in the long-standing scientific debate between Hering and Helmholtz on color perception.

Broader Impact

This project bridges the gap between art and science not only in research but also in education by developing a new curriculum that traverses the areas of neuroscience, psychology, computer science, and visual arts, by involving students in art practice and scientific experiments, and by providing a forum for artists and scientists to exchange ideas on visual perception. These interdisciplinary efforts befit the liberal arts education tradition at Boston College. This project will not only benefit from the strong Fine Arts department on campus, but also cultivate computer science awareness and outreach to non-technical people, and promote the growth of the young Computer Science department at Boston College.


Project Report

The goal of our project is to infer scene layout from a single image, which is simply an array of numbers indicating the intensity of light at individual pixel locations. We need to organize these numbers into surfaces oriented towards the viewer in 3D space. When we understand the spatial layout, we can visualize what the scene looks like from a different vantage point (Figure 1).

Painters constantly use a combination of techniques to effectively evoke a 3D percept from a 2D picture, and studying their techniques can lend insights into the computation of recovering scene layout from pixel values (Figure 2). Occlusion effectively depicts elevation and the range of depth, and it is the most universal and earliest depth cue, developed along with line drawings. Perspective was popularized during the Renaissance, and it includes focal convergence, foreshortening, and texture gradients. Shading can be subtle but powerful, and requires keen observation and mastery of chiaroscuro. Form can evoke a rich sense of space and volume from a flat 2D pattern by its interaction with visual memory. This is achieved by the fine precision of simple shapes in Kelly’s work, and by the viewer’s long scrutiny and problem solving of complex geometrical configurations in Twaddle’s work.

We have developed a new interdisciplinary course on Art and Vision. We bring neuroscience, psychology, computer science, visual art, scientific imaging, and visualization together in examining how we perceive light, color, motion, shape, material, depth, and distance. Students learn basic drawing skills along with rudimentary intuitions in computation and programming. Emphasis is placed on appreciating how artistic rendering contributes to the understanding of the inner workings of the visual sense, and how more knowledge of visual perception leads to more effective visual communication.

In computer vision, some of these artistic techniques have counterparts known as Shape-from-X, where X could be junctions and contours, perspective, texture, shadows, or shading. Unfortunately, each Shape-from-X method makes its own assumptions, which can conflict with those of other methods and often hold poorly in real scene images. An alternative to Shape-from-X is statistical learning. Unlike a Shape-from-X method, which has its own stylized features, statistical learning approaches take many real images as training examples, extract many candidate features from them, and memorize the association between the 2D features and annotated 3D attributes. Given a new image, its features are computed and used as a query to the memory, and the most likely 3D attributes are retrieved. Success thus critically depends on how similar the test image is to the training images.

Our computational approach to scene layout from a single image is to pop surfaces out from a global integration of multiple sources of information. These cues act upon some intermediate representations (e.g. lines and planes), and their compatibility with each other is evaluated, so that scene layout emerges from the most consistent group of visual representations (Figure 3).

We have conducted a series of human vision experiments and made progress on several aspects of this computational framework. For example, we have developed a new integration machinery in spectral graph theory, called Angular Embedding, for reconciling multiple local pairwise measurements into a global ordering of elements in a metric space. The problem is similar to obtaining a consensus movie ranking from many users' individual rankings of movies. However, unlike conventional embedding, which ranks elements sequentially on a line, our angular embedding places elements in the complex plane, with angles encoding the positions and radii encoding the confidence in the positions. Elements with low confidence in their positions are placed near the origin of the complex plane, which naturally indicates that all angular positions become equally possible at the extreme (Figure 4). Angular embedding has been used for modeling the subjective experience of luminance from the objective intensity of an image, for reconstructing images from noisy pairwise intensity differences, and for segmenting an image into objects in depth layers (Figure 5).

We have developed another theoretical tool in pattern classification, called Power SVM, for effectively generalizing a concept from a few exemplars, i.e. images annotated with concept labels (Figure 6). Intuitively, given only a few frontal views of persons of interest, we can identify those with distinctive looks (e.g. Oppenheimer vs. others) more readily and confidently. That is, human vision recognizes more variants of a distinctive exemplar, and distinctiveness is relative to what the exemplar is discriminated against. In computer vision, we can evaluate each exemplar's distinctiveness and require our classifier to place more distinctive exemplars farther away from the decision boundary between two classes. This conceptual change is similar to generalizing a Voronoi diagram to a power diagram, where singular points become balls with distinctive radii and consequently their equidistance boundaries change in a complex way, providing a greater modeling capacity.

For more information about the project, please visit the PI's homepage.
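
The idea behind Angular Embedding described in this report (angles encode positions, radii encode confidence) can be illustrated with a small numerical sketch. This is not the project's implementation, and it does not reproduce the exact Angular Embedding criterion. It is a simplified spectral relaxation that takes the leading eigenvector of a confidence-weighted Hermitian matrix built from pairwise differences. The function name `embed_in_complex_plane`, the inputs `pairwise_diff` and `confidence`, and the toy data are all hypothetical and chosen only for illustration.

```python
import numpy as np

def embed_in_complex_plane(pairwise_diff, confidence):
    """Simplified sketch (an assumption, not the project's exact method).

    pairwise_diff[i, j]: measured difference position_i - position_j (radians).
    confidence[i, j]:    non-negative, symmetric trust in that measurement.
    Returns one complex number per element: its angle is the estimated
    position relative to element 0, its radius reflects overall confidence.
    """
    pairwise_diff = np.asarray(pairwise_diff, dtype=float)
    confidence = np.asarray(confidence, dtype=float)

    # Each pairwise measurement becomes a confidence-weighted unit rotation.
    H = confidence * np.exp(1j * pairwise_diff)
    H = (H + H.conj().T) / 2  # enforce Hermitian symmetry against round-off

    # eigh returns eigenvalues in ascending order; take the leading eigenvector.
    _, eigvecs = np.linalg.eigh(H)
    z = eigvecs[:, -1]

    # The solution is defined up to a global rotation; fix element 0 at angle 0.
    return z * np.exp(-1j * np.angle(z[0]))

# Toy data: four elements with known positions, noise-free measurements.
true_pos = np.array([0.0, 0.5, 1.0, 2.0])
pairwise_diff = true_pos[:, None] - true_pos[None, :]

# All measurements are trusted equally, except those involving element 3.
confidence = np.ones((4, 4)) - np.eye(4)
confidence[3, :3] = 0.1
confidence[:3, 3] = 0.1

z = embed_in_complex_plane(pairwise_diff, confidence)
angles = np.angle(z)  # estimated positions relative to element 0
radii = np.abs(z)     # smaller radius = less confident placement
print(np.round(angles, 3), np.round(radii, 3))
```

In this toy setting the angles recover the true positions relative to element 0, and the weakly measured element 3 ends up closest to the origin, mirroring the behavior described above. With noisy or conflicting measurements, the eigenvector gives only an approximate consensus.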

National Science Foundation (NSF)
Division of Information and Intelligent Systems (IIS)
Program Officer: Jie Yang
International Computer Science Institute
United States