Applications of Computer Vision to Computer Graphics
Vol. 33, No. 4, November 1999
From the Guest Editor
Realistic image synthesis is a central but elusive goal of computer graphics. Great strides have been made in modeling the propagation of light through an environment, as exemplified by modern rendering algorithms such as radiosity and stochastic ray tracing. However, creating models that look and move realistically remains a largely unsolved problem. Movies like Jurassic Park or Star Wars demonstrate thrilling possibilities - graphical models so rich and realistic that they integrate seamlessly with live-action footage. Yet creating such models requires great artistry and manual labor from a team of animators, and rendering these complex models is equally time-consuming. Motivated by these difficulties, there has been increasing interest in the computer graphics community in using computer vision to capture the real world from photographs and video.
The field of computer vision is devoted to the problem of modeling and interpreting the world through the analysis of images. By capturing the geometry, material properties and motion of real-world scenes, computer vision offers an attractive solution to the problem of model acquisition. Beyond providing a rich source of input for computer graphics, computer vision has the potential to impact computer graphics at a variety of levels. For example, a camera-equipped PC could interpret your gestures and motions directly, without the need for a mouse or 3D-input device. Artists and media producers could benefit from vision-assisted tools that simplify editing images and video. Rendering architectures could take advantage of image-based representations to render complex scenes more efficiently. These and other applications of computer vision are now generating great interest in the computer graphics community. In the last two years alone, four sessions at SIGGRAPH have been devoted to papers on image-based modeling and rendering (IBMR) techniques. Important work in IBMR and other graphics applications has also appeared at computer vision conferences such as the International Conference on Computer Vision and the Conference on Computer Vision and Pattern Recognition. Interest in this area is further evidenced by a flurry of recent workshops [1, 7], tutorials [3, 8, 13] and panels.
Computer graphics is also influencing research directions in computer vision, a trend that has accelerated in recent years. For instance, the work on bidirectional reflectance distribution function (BRDF) models for reflectance modeling, pioneered by Ken Torrance and others in the graphics community, led to the adoption of these models within the vision community and influenced the physics-based approach to computer vision. Similarly, alpha-channels have long been used in the graphics community as a representation for image transparency. In order to compute surface transparency, researchers in computer vision have begun to incorporate alpha-channels in stereo reconstruction algorithms [11, 12]. Motivated by rendering applications, a new crop of image-based modeling algorithms has emerged [5, 6] that optimizes fidelity of appearance rather than geometric accuracy. This shift in metrics has led to stunningly realistic renderings, notably of architecture and human faces, that appear virtually indistinguishable from reality.
In this special issue of Computer Graphics, we have invited a number of leading researchers and practitioners to describe recent results in this promising interdisciplinary area. These activities can be grouped into a number of broad application areas, which include the creation of 3D models, interactive image editing, special effects such as synthetic element insertion and user-interface tasks such as tracking and gesture recognition. Below, we list some of the more active application areas.
Creating Graphical Models
Computer vision techniques have a long history of use in the construction of 3D graphics models. Laser range-scanning, which is now used extensively in the movie and CAD industries, originated in computer vision and robotics laboratories. While commercial systems such as Cyberware and Cyra use laser stripes or scanning lasers and triangulation, newer systems such as the ZCAM use laser radar to give real-time depth images. Brian Curless’ article overviews these active range scanning techniques, which, when combined with sophisticated new stitching (merging) algorithms, can be used to create accurate, high-resolution models of real-world objects and scenes.
Computing stereo correspondences between two or more images in order to obtain a depth map, also a traditional computer vision technique, can now be done at frame rates on a personal computer (e.g., commercial systems from companies such as Point Grey Research and SRI International). While the quality of the range map may not be as good as that available with active methods, the hardware requirements are minimal. For objects such as human faces, surprisingly good reconstructions can be obtained.
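The core computation can be sketched as brute-force block matching. The toy NumPy example below is purely illustrative (not any particular commercial system): it assumes a rectified grayscale image pair and, for each pixel, keeps the horizontal disparity whose windowed sum of absolute differences (SAD) is smallest.

```python
import numpy as np

def block_match_disparity(left, right, max_disp=16, radius=3):
    """Brute-force SAD block matching on a rectified grayscale pair.

    For every pixel, tries each horizontal disparity d and keeps the one
    whose (2*radius+1)^2 window of absolute differences is smallest.
    """
    left = left.astype(np.float64)
    right = right.astype(np.float64)
    h, w = left.shape
    best_cost = np.full((h, w), np.inf)
    disparity = np.zeros((h, w), dtype=np.int64)
    for d in range(max_disp):
        # Align right-image pixels with their left-image counterparts.
        shifted = np.roll(right, d, axis=1)
        diff = np.abs(left - shifted)
        diff[:, :d] = 255.0  # columns with no valid counterpart: max penalty
        # Box-sum the absolute differences by accumulating shifted copies.
        cost = np.zeros_like(diff)
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                cost += np.roll(np.roll(diff, dy, axis=0), dx, axis=1)
        better = cost < best_cost
        disparity[better] = d
        best_cost[better] = cost[better]
    return disparity
```

Real-time systems replace these brute-force loops with incremental cost updates and SIMD arithmetic, but the underlying cost-volume search is the same.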
Physically-based models, i.e., deformable models that can automatically adapt themselves to image data, are another powerful way to extract three-dimensional information from imagery. Demetri Terzopoulos’ article reviews some of his work in this area, which includes the creation and animation of deformable models of vegetables, human faces and fish from natural imagery. He also describes how computer vision algorithms can be used to simulate the visual systems of virtual humans.
Photogrammetric techniques, which involve locating and matching features such as edges across several images, can be used to interactively construct 3D models of man-made objects. The Façade system, developed at Berkeley, is an example of such a system, as are a number of commercial packages such as PhotoModeler and 3D Builder. Paul Debevec’s article surveys some of these image-based modeling techniques, as well as methods for capturing the photometric properties of objects. For example, given several images of a scene, surface albedos and partial BRDFs can be estimated.
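As a toy illustration of the photometric side, in the simplest Lambertian case the shading equation is I = ρ max(0, n·l), so given recovered per-pixel normals and a known light direction, the albedo ρ follows by division. The helper below is a hypothetical sketch (its name and interface are our own, not from any system mentioned above):

```python
import numpy as np

def recover_albedo(intensity, normals, light_dir):
    """Per-pixel Lambertian albedo from one image with known geometry.

    intensity: (H, W) observed brightness; normals: (H, W, 3) unit surface
    normals; light_dir: (3,) unit vector pointing toward the light source.
    """
    shading = normals @ light_dir   # n . l at every pixel
    lit = shading > 1e-3            # skip shadowed or grazing pixels
    albedo = np.zeros_like(intensity)
    albedo[lit] = intensity[lit] / shading[lit]
    return albedo
```

Estimating a fuller BRDF works the same way in spirit, but requires observations of each surface point under several viewing and lighting directions.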
Vision-based techniques can also be used to capture the dynamic motions of a person. Current commercial systems require the use of special markers such as reflective balls affixed to the actor. Michael Gleicher’s article discusses such traditional motion capture systems, as well as more recent research papers, which suggest that it may be possible to track human motion without requiring special markers. This has the potential to make motion capture much less cumbersome and much more flexible. It may also enable motion capture from historical sources such as old film footage.
Interactive Manipulation and Editing
The interactive manipulation and editing of imagery and video footage can often be enhanced using "power-assist techniques" based on computer vision. Interactive dynamic contours such as snakes and intelligent scissors make the task of tracing and segmenting objects in images or video much easier. Eric Mortensen’s article on vision-assisted image editing discusses these techniques, as well as their application to problems such as automatically tracking segmented objects from frame to frame.
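The live-wire idea behind intelligent scissors can be sketched as a shortest-path search. In this bare-bones illustrative variant (not the actual cost function used in the published technique), pixels are graph nodes, stepping onto a high-gradient pixel is cheap, and Dijkstra's algorithm makes the path between two clicked points snap to object boundaries:

```python
import heapq
import numpy as np

def live_wire(image, seed, target):
    """Cheapest 4-connected pixel path from seed to target, where stepping
    onto a high-gradient pixel costs little -- so the path hugs edges.
    """
    h, w = image.shape
    gy, gx = np.gradient(image.astype(np.float64))
    grad = np.hypot(gx, gy)
    # Low cost on strong edges, high cost in flat regions.
    cost = 1.0 - grad / (grad.max() + 1e-9)
    dist = np.full((h, w), np.inf)
    prev = {}
    dist[seed] = 0.0
    heap = [(0.0, seed)]
    while heap:
        d, (y, x) = heapq.heappop(heap)
        if (y, x) == target:
            break
        if d > dist[y, x]:
            continue  # stale queue entry
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w:
                nd = d + cost[ny, nx]
                if nd < dist[ny, nx]:
                    dist[ny, nx] = nd
                    prev[(ny, nx)] = (y, x)
                    heapq.heappush(heap, (nd, (ny, nx)))
    # Walk back from target to seed to recover the path.
    path, node = [target], target
    while node != seed:
        node = prev[node]
        path.append(node)
    return path[::-1]
```

In an interactive tool the seed is the last committed point and the target follows the cursor, so the boundary curve updates in real time as the user moves the mouse.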
Feature tracking can also be a powerful tool for film and video editing. The two most common applications are the stabilization of video and the insertion of graphical objects into live footage. Stabilization can be based on one, two, three or four tracked points, which correspond to translational, similarity (rotation, scale and translation), affine and perspective motion models, respectively. There are now commercial systems based on such techniques that can insert or replace billboards in live broadcasts, typically at sporting events.
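As a concrete sketch of the point-based fitting step, the affine case can be solved by linear least squares over the tracked correspondences (the helper below is illustrative, not taken from any of the commercial systems mentioned):

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine transform mapping src points to dst points.

    src, dst: (N, 2) arrays of corresponding (x, y) positions, N >= 3.
    Returns a 2x3 matrix A such that dst ~= A @ [x, y, 1].
    """
    n = src.shape[0]
    ones = np.ones((n, 1))
    x = np.hstack([src, ones])           # (N, 3) homogeneous source points
    zeros = np.zeros_like(x)
    # Each correspondence contributes one equation per output coordinate.
    rows_x = np.hstack([x, zeros])       # equations for dst x-coordinates
    rows_y = np.hstack([zeros, x])       # equations for dst y-coordinates
    A_mat = np.vstack([rows_x, rows_y])  # (2N, 6) design matrix
    b = np.concatenate([dst[:, 0], dst[:, 1]])
    params, *_ = np.linalg.lstsq(A_mat, b, rcond=None)
    return params.reshape(2, 3)
```

To stabilize a sequence, each frame is then warped by the inverse of its fitted transform so that the tracked points stay fixed from frame to frame.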
When the camera motion is complex and 3D graphical elements are to be inserted (e.g., dinosaurs running behind trees), 3D camera motion must be recovered by tracking a large number of feature points. Doug Roble’s article discusses the application of such match move (camera motion recovery) software in the special effects industry. The other task in performing such insertion is to correctly separate foreground and background elements (a generalized version of the matte extraction problem). While today much of this work has to be done manually on a frame-by-frame basis, research on layered motion extraction will hopefully provide a mostly automated solution to this problem.
Image-based rendering, which involves synthesizing new views of an object or a scene directly from images, is currently a very active research area within the computer graphics community. The simplest form of image-based rendering - the interactive viewing of panoramic images - is very popular on the Web and in multimedia titles such as encyclopedias. Stitching such images together is a much easier problem than the general per-pixel matching required in stereo correspondence, and a number of commercial packages are available.
More ambitious uses of image-based rendering, such as rendering from Lightfields or Lumigraphs, are still mostly in the research stage. Leonard McMillan and Steven Gortler discuss these and other image-based rendering techniques. For example, as multi-texturing (the ability to use a blend of texture maps for rendering surfaces) becomes more widely available, view-dependent texture maps (e.g., as used in the Façade system) may become more commonplace. And while the use of image caching, as exemplified by the Talisman system, is not yet widespread, the use of natural (photographic) imagery as a source for computer graphics material is becoming more and more commonplace.
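To make the view-dependent texturing idea concrete, one simple scheme weights each source photograph by how close its capture direction is to the current viewing direction. The helper below is an illustrative assumption on our part (the cosine-power weighting and function name are not the Façade system's actual scheme):

```python
import numpy as np

def blend_weights(view_dir, capture_dirs, power=8.0):
    """Per-source-image blending weights for view-dependent texturing.

    view_dir: (3,) current viewing direction; capture_dirs: (N, 3)
    directions from which each source texture was photographed.
    Weights favor images shot from near the current viewpoint.
    """
    v = view_dir / np.linalg.norm(view_dir)
    c = capture_dirs / np.linalg.norm(capture_dirs, axis=1, keepdims=True)
    # Cosine similarity, clamped so back-facing captures get zero weight.
    cos = np.clip(c @ v, 0.0, 1.0)
    w = cos ** power
    return w / w.sum()
```

Raising the cosine to a power sharpens the falloff, so as the viewer moves, the rendering smoothly cross-fades between the nearest photographs instead of averaging them all.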
Natural (Perceptual) User Interfaces
Computer vision techniques have the potential to make the interaction between computers, graphics and users much more natural. For example, instead of using a mouse or joystick to control the viewing position, a user could just move her head, or use gestures. Similarly, the control of objects in an interactive application (simple CAD system or computer game) could be based on gestures. Bill Freeman and his colleagues at the Mitsubishi Electric Research Laboratories describe their work in this area and survey other efforts in vision-based human-computer interaction.
Vision techniques can also be used to replace traditional (e.g., electromagnetic) user tracking systems. For example, the UNC Hi-Ball system can estimate the position of a virtual reality system user with very high precision and at high frame rates. With these systems, the goal is not to capture the complete (articulated) motion of a user (as in motion capture), but rather to obtain information about his position and orientation. Trackers that do capture a user’s movements (sometimes only in 2D) can be used to control video games, e.g., to simulate fighting stances and actions or athletic motions such as jumping and throwing.
Professional vs. Consumer
The application areas above tend to fall into either the professional market or the consumer market. As a general rule, professionals are much more willing to undergo a training period to use a new technology if the eventual payback is higher productivity. For this reason, they are often the early adopters of vision-based technology. Consumers, on the other hand, expect the system’s use to be very simple and intuitive, even if they cannot achieve the same level of control or sophistication as professionals.
It is interesting to observe that features such as sophisticated contour-based segmentation tools, which were originally only available in professional image editing packages, can now be found in consumer-level products. We expect to see a similar "downward" migration of vision-based features and capabilities from professional to consumer applications. For example, "gamers," who are generally technologically savvy, may be interested in adopting vision-based interaction techniques long before the average "couch potato" might.
Imagine being able to visit Paris via your Web browser - to navigate through the rooms of the Louvre museum or walk down the Champs-Élysées at sunset. Just as present-day cameras capture scene appearance, cameras of the future will capture scene geometry and material properties in the form of 3D photographs. Indeed, the first generation of commercial range sensors is already available. The next few years will see the availability of 3D video cameras at a high-end consumer price point. These devices will capture highly accurate depth maps and registered color images at real-time rates. When it becomes possible to also capture surface reflectance (BRDFs), this technology will have a tremendous impact on the field of computer graphics. Modeling environments on the scale of an entire city will require ways of acquiring and integrating measurements from thousands or millions of camera viewpoints. While initial progress has been made in this regard, capturing large-scale environments remains a largely unsolved problem and a topic of active research.
Images offer a way of easily capturing extremely complex phenomena: water crashing over rocks on a beach, smoke wafting from a cigarette, a flag rippling in the wind, a volcanic eruption. Future animators could capture such phenomena with a video camera or other capture device, edit their appearance and construct new graphical worlds and animations. Beyond modeling geometry and appearance, these capabilities will require reconstructing scene dynamics from images, a topic in computer vision that is still in its infancy. Capturing purposive motions like "dancing" from images will require even more complicated analysis, in order to capture subtleties such as style, personality and technique. Modeling dynamic 3D scenes from video may be possible with future breakthroughs in computer vision. Most likely, obtaining such models will require input from an animator in order to determine what the right control knobs should be. Modeling motions of the sophistication of dancing will likely require going beyond pure image analysis and into the realm of art and artificial intelligence.
We currently interact with computers on their terms, via mice and keyboards. Future human-machine interactions may be radically different, based instead on speech, natural gestures and haptics. By placing cameras, microphones and other sensors around a room, it is possible to actively monitor a person’s "output" in terms of their position, movements and sounds. Future computers will be "aware"; they will be able to sense who you are, where you are, what you’re doing, where you’re looking, what you’re saying and potentially what your mood is. Such data will provide an extremely rich source of input, well beyond that which can be transmitted via the motion of a mouse or keystrokes. Current research in the area of computer-based user interfaces is already making important progress toward these goals, but this is still vast and largely uncharted territory.
These exciting applications have led to a new research area at the intersection of computer vision and computer graphics. This special issue of Computer Graphics brings together leading researchers and practitioners at the forefront of this area. The resulting collection of eight articles is meant to provide an introduction to this area and a starting point for further investigation. For a more comprehensive reference on computer vision resources, we refer you to the excellent Computer Vision Homepage. We believe this issue has relevance for a wide audience, including researchers, artists, animators, producers of special effects, game authors and a host of other computer graphics practitioners. We hope you find the issue enjoyable and informative.
Steven Seitz is an Assistant Professor of Robotics and Computer Science at Carnegie Mellon University, where he conducts research in image-based rendering, graphics and computer vision. His current research focuses on the problem of acquiring and manipulating visual representations of real environments using semi- and fully-automated techniques. This effort has led to the development of "View Morphing" techniques for interpolating different images of a scene and voxel-based algorithms for computing photorealistic scene reconstructions. His work in these areas has appeared at SIGGRAPH and in international computer vision conferences and journals. He co-organized courses on 3D photography at SIGGRAPH 99 and CVPR 99.
Richard Szeliski is a Senior Researcher in the Vision Technology Group at Microsoft Research, where he is pursuing research in 3D computer vision, video scene analysis and image-based rendering. His current focus is on constructing photorealistic 3D scene models from multiple images and video. Szeliski has published more than 60 research papers in computer vision, computer graphics, medical imaging, neural nets and parallel numerical algorithms, as well as the book Bayesian Modeling of Uncertainty in Low-Level Vision. He was a co-organizer of the ACM Workshop on Image-Based Modeling and Rendering and Associate Editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence.