Applications of Computer Vision to Computer Graphics

Vol.33 No.4 November 1999
ACM SIGGRAPH


Image-Based Modeling and Lighting



Paul E. Debevec University of California at Berkeley

Perhaps the most prominent thread in the development of computer graphics has been the pursuit of photorealistic imagery. Starting from the vector graphics of the 1960s, the graphics community has developed progressively more advanced display hardware, geometric representations, surface reflectance models and illumination algorithms, and by the 1990s photorealistic rendering had become a part of traditional computer graphics.

While photorealistic renderings can now be achieved with traditional computer graphics, they generally do not come easily. First, creating realistic models is a time- and talent-intensive task. With most software, the artist needs to build a detailed model piece by piece, and then specify the reflectance characteristics (color, texture, specularity, etc.) for all the various surfaces in the scene. Second, generating photorealistic renderings requires advanced techniques such as radiosity and global illumination, which tend to be very computationally intensive. In brief, modeling tends to be hard, and rendering is slow.

For applications such as interior design, architectural planning, simulation, virtual museums and cultural heritage, the objects and environments that one wishes to render using the computer often exist within the real world. In such cases, the modeling and rendering challenges increase considerably: since people are familiar with what the results should look like, they will have higher expectations of the quality of the results. However, objects and environments in the real world have a very useful property as well: they can be photographed. Additionally, a great deal of research has been done in the last several years to model the geometry and appearance of objects and scenes from photographs.

Computer vision concerns the development of algorithms to derive various sorts of information from digital images. The process of recovering 3D structure from 2D photographs, now known as image-based modeling, has been one of its central endeavors. Conversely, the process of photorealistically rendering objects and scenes from images and geometry is known as image-based rendering. As the models come from photographs, modeling need not be hard, and since the shading also comes from the images, rendering need not be slow. In the next few sections, I’ll discuss a number of image-based modeling techniques and some projects that have used them in conjunction with image-based rendering.

Shape from Silhouettes

Some work has been done in both vision and graphics to recover the shape of objects from their silhouette contours in multiple images. If the camera geometry is known for each image, each contour defines an infinite, cone-like region of space within which the object must lie. An estimate of the geometry of the object can thus be obtained by intersecting multiple such regions from different images. As a greater variety of views of the object are used, this technique will eventually recover the line hull of the object. For many objects the line hull will be the same as the shape itself, although for shapes with a complicated concave structure, some of the volume may never be intersected away.


Figure 1: Modeling from silhouettes. Three photographs of a 1980 Chevette are shown at the top of the picture. The box in the middle shows how the silhouettes of the photos are used to carve out the shape of the car and then to color it in. A sampling of frames from a fly-by animation of the Chevette model are shown across the bottom.

In the summer of 1991 I used a silhouette-based technique to construct an image-based model of a Chevette automobile. A friend and I parked the car next to a tall building in such a way that it was possible to take nearly orthographic pictures of the car from directly in front, in back, above and from the right and left. I digitized the photos and manually located the silhouette of the car in each picture. I then algorithmically intersected the extrusions of the silhouettes, pushing each along its viewing direction through a volume of voxels (volume elements) to carve out the shape of the car. With the shape recovered, I colored in the exposed voxel facets with the pixel values from the original images: facets pointing up took their colors from corresponding pixels in the top image; facets pointing forward took their colors from the front image, and so forth. Finally, I used this model to create an animation of the Chevette flying across the screen (see Figure 1).
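To make the carving step concrete, here is a minimal Python sketch of intersecting extruded silhouettes in a voxel grid. The grid resolution, axis conventions and placeholder masks are illustrative assumptions rather than the original project's code; in practice the masks come from the digitized, manually traced photographs.

# Minimal sketch of silhouette-based carving with orthographic views.
# Assumes three binary silhouette masks (top, front, side) that have
# already been registered to the faces of an N x N x N voxel grid.
import numpy as np

N = 128
occupied = np.ones((N, N, N), dtype=bool)   # start with a solid block of voxels

top   = np.ones((N, N), dtype=bool)  # silhouette seen looking down the z axis (placeholder)
front = np.ones((N, N), dtype=bool)  # silhouette seen looking down the y axis (placeholder)
side  = np.ones((N, N), dtype=bool)  # silhouette seen looking down the x axis (placeholder)

# Each silhouette is extruded along its viewing direction; a voxel survives
# only if it projects inside the silhouette in every view.
occupied &= top[:, :, np.newaxis]     # extrude top silhouette along z
occupied &= front[:, np.newaxis, :]   # extrude front silhouette along y
occupied &= side[np.newaxis, :, :]    # extrude side silhouette along x
# Exposed voxel facets can then be colored from the photograph whose
# viewing direction they face, as described above.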

What was compelling about the Chevette animation was that it didn’t look like a computer rendering — it looked like the real car. Not only was the shading believable, but it had the nuances of my particular Chevette — from its bent license plate to the chipped paint on the roof. Although the geometry recovered from just three silhouettes didn’t exactly conform to the shape of the car, the fact that the shading came from real photographs made the renderings believable. Also, because of the way in which the images were used to color the voxels, a synthetic view from one of the original directions would exactly reproduce that original view.

Silhouettes are one of the most basic cues to 3D structure, and despite their simplicity they’ve been useful in a wide variety of image-based modeling and rendering research. Richard Szeliski has used object silhouettes from a video stream of a rotating object to reconstruct its geometry [15]. A similar silhouette-based technique was used to provide an approximate estimate of object geometry to improve renderings in the Lumigraph image-based modeling and rendering system [5]. More recently, Steven Seitz and Charles Dyer have worked to derive voxel models by carving away any voxel whose color is not consistent in its projection into the various views [14].

Traditional Photogrammetry

The science of extracting 3D information from 2D photographs predates computer vision; it has a century-old history in the field of photogrammetry. Photography’s potential application to mapping was noted as early as 1851 by the French inventor Aimé Laussedat, and gained widespread use during World War I due to the work of mathematicians such as E. Kruppa who developed the algebraic aspects of the problem [6]. The basic idea is simple: when a picture is taken, the 3D world is projected in perspective onto a flat 2D image plane. As a result, a feature (for example, the top of a flagpole) seen at a particular location in an image can actually lie anywhere along a particular ray beginning at the camera center and extending out to infinity. This ambiguity can be resolved if the same feature is seen in two different photographs, which constrains the feature to lie on the intersection of the two corresponding rays. This process is known as triangulation.
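For readers curious about the algebra, the sketch below triangulates a 3D point as the midpoint of the shortest segment between the two viewing rays, since in practice noisy rays rarely intersect exactly. The camera centers and ray directions are assumed to be known from calibration; this is a minimal illustration, not production photogrammetry code.

import numpy as np

def triangulate(c1, d1, c2, d2):
    """Return the point closest to both rays c1 + t*d1 and c2 + s*d2
    (camera centers c, viewing directions d)."""
    d1, d2 = d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)
    # Solve for the ray parameters t, s that minimize the distance
    # between the two rays (standard closest-point-of-two-lines algebra).
    A = np.array([[d1 @ d1, -d1 @ d2],
                  [d1 @ d2, -d2 @ d2]])
    b = np.array([(c2 - c1) @ d1, (c2 - c1) @ d2])
    t, s = np.linalg.solve(A, b)
    return 0.5 * ((c1 + t * d1) + (c2 + s * d2))   # midpoint of the closest approach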

Using triangulation, any feature seen in at least two photographs taken from known locations can be localized in 3D. In fact, with a sufficient number of corresponding points, it is mathematically possible to solve for unknown camera positions as well. Such techniques have allowed photogrammetrists to create topographic maps of terrain from aerial surveys by marking corresponding points in aerial images. Beardsley, Torr, and Zisserman [1] have developed methods for determining both 3D point locations and camera positions from features tracked in a video sequence, a technique known in the vision community as structure from motion. Recovering camera paths from image sequences is known in the visual effects industry as match-moving, and Doug Roble’s article on pp. 58-60 in this issue describes some of these applications.

Shape from Stereo

Unfortunately, building 3D models with traditional photogrammetry can be tedious, requiring each vertex of the model to be marked in at least two images. If the computer were able to form the correspondences automatically, that would clearly be more efficient. This can in fact be done: when the two images are taken from nearby viewpoints, a program called a stereo correspondence algorithm can be used to determine which pixels in one image correspond to which pixels in another. Simple triangulation can then be used to compute a 3D depth measurement for each matched pixel. The process is similar to how our visual cortex fuses the two images cast on our retinas into a vivid perception of depth.
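As a concrete, if naive, illustration of what such an algorithm does, the sketch below matches a small window from the left image against candidate positions along the same scanline of the right image. Rectified grayscale images are assumed; real algorithms, such as the non-parametric census transform used in [18], are considerably more robust and efficient than this brute-force search.

import numpy as np

def disparity_map(left, right, max_disp=64, win=5):
    """Window-based stereo matching on a rectified grayscale pair."""
    left = left.astype(np.float32)     # avoid integer wrap-around in differences
    right = right.astype(np.float32)
    h, w = left.shape
    r = win // 2
    disp = np.zeros((h, w), dtype=np.float32)
    for y in range(r, h - r):
        for x in range(r + max_disp, w - r):
            patch = left[y - r:y + r + 1, x - r:x + r + 1]
            # Sum-of-absolute-differences cost for each candidate disparity;
            # the matching pixel in the right image lies to the left by d.
            costs = [np.abs(patch - right[y - r:y + r + 1,
                                          x - d - r:x - d + r + 1]).sum()
                     for d in range(max_disp)]
            disp[y, x] = np.argmin(costs)   # best-matching shift, in pixels
    return disp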

In the summer of 1994, I worked with Michael Naimark, John Woodfill and Leo Villareal on a project that used stereo correspondence to create a virtual tour of a trail in Canada’s Banff National Forest. The images came from a specially built stereo image capture rig that had two forward-looking 16mm movie cameras spaced eight inches apart on a hand-pushed cart. An encoder on one of the wheels triggered the cameras to take a stereo pair of images each time the cart was pushed one meter forward. One such pair is shown in Figs. 2a and 2b.

In a stereo pair, the images will be slightly different since the two cameras are in different locations. Because of the geometry of the image formation process, a feature observed in one image will shift to the left or right to some extent when observed in the other. Features that are far away will have very little disparity between the images, but features that are closer will project to noticeably different image locations. In fact, it is easy to show through similar triangles that a region’s depth in the scene is inversely proportional to its disparity between the images. A good introduction to the geometry and techniques of stereo correspondence, as well as several other image-based modeling techniques, can be found in the book Three-Dimensional Computer Vision by Olivier Faugeras [4].
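The similar-triangles relation can be written in one line: depth is the focal length times the camera baseline divided by the disparity. The focal length and the roughly 0.2 meter (eight inch) baseline below are illustrative values, not calibrated ones.

import numpy as np

def depth_from_disparity(disp, f_pixels=1200.0, baseline_m=0.2):
    """Convert a disparity map (in pixels) to depth (in meters)."""
    return f_pixels * baseline_m / np.maximum(disp, 1e-6)   # larger disparity -> closer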






Figure 2:  Modeling with stereo: Immersion ‘94. The top two photos are a stereo pair (reversed for cross-eyed stereo viewing) taken by Michael Naimark in Canada’s Banff National Forest. In the center is a depth map computed using a stereo correspondence algorithm; intensity indicates depth, with brighter pixels being closer. Pixels the algorithm did not reliably match are indicated in blue. Below are two virtual views generated by casting each pixel out into space based on its depth, and reprojecting it into a virtual camera. On the left is the result of virtually moving one meter forward, on the right, one meter backward. Note the dark de-occluded areas produced by these virtual camera moves; these areas were not seen in the original stereo pair. Such regions can be filled in from neighboring stereo pairs when they are available. (Images courtesy of Interval Research Corporation.)

For each stereo pair of images, we used the stereo algorithm by Woodfill and Zabih [18] to determine pixel correspondences between the left and right images. Successfully fusing the two images generates a depth measurement for every pixel. Such an array of depth values is called a depth map, and is often displayed with distant pixels in black and closer pixels in brighter shades, as seen in Fig. 2c. This is not a traditional 3D model — only the fronts of objects are represented, and in many places the depth estimates are inaccurate — but it is a 3D representation of the scene.

What’s exciting is that when an image has an associated depth map, its pixels need no longer sit flattened onto the image plane. Instead, we can project them out to their proper locations in 3D, and then project them back onto a new image plane, making a virtual view of the scene from a new location, as seen in Figs. 2d and 2e. In the original project, we also created animations of smoothly traveling down the trail by cross-blending between reprojections of the nearest available views of the scene, letting one reprojection fill in data where that of the other was missing.
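The following sketch shows this reprojection, in the spirit of Figs. 2d and 2e, for a simple pinhole camera with the image origin at the principal point. It is a bare-bones illustration under assumed conventions: splatting, hole filling and occlusion-compatible traversal are all left out.

import numpy as np

def reproject(image, depth, f, R, t):
    """image: HxWx3, depth: HxW (meters), f: focal length in pixels,
    R, t: rotation and translation of the virtual camera relative to the
    original one. Returns a forward-warped virtual view."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Cast each pixel out into space along its viewing ray, scaled by its depth.
    X = (xs - w / 2) * depth / f
    Y = (ys - h / 2) * depth / f
    pts = np.stack([X, Y, depth], axis=-1).reshape(-1, 3)
    # Transform into the virtual camera's frame and project back onto its image plane.
    cam = pts @ R.T + t
    z = np.maximum(cam[:, 2], 1e-6)
    u = (f * cam[:, 0] / z + w / 2).astype(int)
    v = (f * cam[:, 1] / z + h / 2).astype(int)
    out = np.zeros_like(image)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (cam[:, 2] > 0)
    out[v[ok], u[ok]] = image.reshape(-1, 3)[ok]   # nearest-pixel splat; gaps stay black
    return out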

The stereo rerendering technique produces eerily realistic renderings, but it is not without its limitations. Reprojecting images in this way introduces image resampling issues, and when the virtual viewpoint moves far away from the original views, errors in depth estimation and unseen areas will cause unacceptable artifacts in the renderings. However, images that appear just as realistic as the real world can still be rendered efficiently from a wide range of viewpoints, without the usual headaches of modeling and rendering a complex scene. Related work in this area includes Eric Chen and Lance Williams’ original view interpolation work on synthetic imagery [16], Laveau and Faugeras’ work with real stereo imagery [8], and McMillan and Bishop’s work [12] in reprojecting panoramic images. For more information on these image-based rendering techniques, see Steven Gortler and Leonard McMillan’s article in this issue of Computer Graphics on pp. 61-64.

The weakest aspect of stereo modeling techniques is that the algorithm must determine which pixels in the left image correspond to which pixels in the right image, which is often ambiguous since the region around a pixel is frequently similar in appearance to several regions in the other image. One way of addressing this problem is to make use of images taken from many locations; the Virtualized Reality project at Carnegie Mellon [7] performs stereo across images from over fifty live video streams.

The correspondence problem can be almost entirely solved by projecting a pattern of light onto the scene to disambiguate pixel correspondences. A special case of this is a triangulation-based laser range scanner, where a vertical stripe of light from a laser scans across the scene to make pixel correspondences obvious. The resulting geometry is often so precise that multiple scans from different directions can be combined to create globally accurate 3D models. To read more about active sensing techniques for geometry acquisition, see Brian Curless’s article in this issue of Computer Graphics on pp. 38-41.

Photogrammetric Modeling with Facade

For my Ph.D. thesis at UC Berkeley, I worked with Camillo Taylor, Jitendra Malik, George Borshukov and Yizhou Yu to develop a system for modeling and rendering architectural scenes from photographs. Architectural scenes are an interesting case of the general modeling problem since their geometry is typically very structured. At the same time, they are one of the most common types of environment one would want to model. The goal of our research has been to create a system for modeling architecture that is convenient, requires relatively few photographs and produces freely navigable and photorealistic results.

The product of this research is Facade, an interactive tool that enables a user to build photorealistic architectural models from a small set of photographs. In Facade, the user builds a 3D model of the scene by specifying a collection of geometric primitives such as boxes, arches, and surfaces of revolution. However, the user does not need to specify the dimensions or the locations of these components. Instead, the user corresponds edges in the model with edges marked in the photographs, and the computer works out the shapes and positions of the primitives to make the model agree with the photographed geometry.

Facade simplifies the reconstruction problem by solving directly for the architectural dimensions of the scene: the lengths of walls, the widths of doors and the heights of roofs, rather than the multitude of vertex coordinates that a standard photogrammetric approach would try to recover. As a result, the reconstruction problem becomes simpler by orders of magnitude, both in computational complexity and, more importantly, in the number of image features that it is necessary for the user to mark. The technique also allows the user to fully exploit architectural symmetries — modeling repeated structures and computing redundant dimensions only once — further simplifying the modeling task. Many of these techniques have been leveraged in MetaCreations’ new photogrammetric modeling product Canoma. (Website)
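The following sketch conveys the flavor of this approach in Python, using SciPy's least-squares solver. It is a deliberately stripped-down illustration, not the Facade algorithm itself: a single box primitive, known camera matrices and user-marked edges supplied as normalized 2D lines are all assumed, whereas the real system also recovers the camera positions and supports many primitive types and constraints.

import numpy as np
from scipy.optimize import least_squares

def project(P, X):
    """Project 3D points X (Nx3) with a 3x4 camera matrix P into image coordinates."""
    Xh = np.hstack([X, np.ones((len(X), 1))])
    x = Xh @ P.T
    return x[:, :2] / x[:, 2:3]

def residuals(dims, cameras, observations):
    """dims = (width, height, depth) of one box primitive.
    observations: list of (camera index, model edge as a vertex-index pair,
    marked image edge as a normalized line (a, b, c) with a*u + b*v + c = 0)."""
    w, h, d = dims
    verts = np.array([[x, y, z] for x in (0, w) for y in (0, h) for z in (0, d)])
    res = []
    for cam_idx, (i, j), line in observations:
        p = project(cameras[cam_idx], verts[[i, j]])
        # Perpendicular distance of both projected endpoints to the marked line.
        res.extend(p @ line[:2] + line[2])
    return np.array(res)

# Given an initial guess dims0, the solver refines the dimensions so the
# projected model edges fall onto the edges marked in the photographs:
# dims = least_squares(residuals, x0=dims0, args=(cameras, observations)).x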

Further research that we have done enables the computer to automatically refine a basic recovered model to conform to more complicated architectural geometry. The technique, called model-based stereo, displaces the surfaces of the model to make them maximally consistent with their appearance across multiple photographs. Thus, a user can model a bumpy wall as a flat surface, and the computer will compute the relief. This technique was employed in modeling the facade of a gothic cathedral for the interactive art installation "Rouen Revisited" shown at the SIGGRAPH 96 art show. (Website)

For rendering, we begin with the standard approach of forming texture maps by projecting the original photos onto the recovered geometry. However, we take the additional step of selectively blending between the original photographs depending on the user’s viewpoint. As a result, this view-dependent texture mapping approach makes the renderings look considerably more alive and realistic. Projects such as Levoy and Hanrahan’s Light Field Rendering [9] and Gortler et al.’s Lumigraph [5] show why this is the case: as the number of available views of the scene increases, the required accuracy of the geometry decreases.
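The blending idea can be sketched in a few lines: weight each photograph by how closely its viewing direction at a surface point agrees with the virtual viewing direction. The inverse-angle weighting below is an illustrative choice rather than the exact scheme of [3].

import numpy as np

def blend_weights(point, virtual_cam, photo_cams):
    """Return normalized blending weights for each source photograph at a surface point."""
    v = virtual_cam - point
    v = v / np.linalg.norm(v)
    weights = []
    for c in photo_cams:
        d = c - point
        d = d / np.linalg.norm(d)
        angle = np.arccos(np.clip(v @ d, -1.0, 1.0))
        weights.append(1.0 / (angle + 1e-6))   # smaller angular deviation -> larger weight
    w = np.array(weights)
    return w / w.sum()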


Figure 3: Modeling with Facade. This model of the Berkeley campus was constructed using the Facade photogrammetric modeling system from 20 photographs. The model was constructed for the short film, The Campanile Movie, shown at the SIGGRAPH 97 Electronic Theater, which featured an image-based fly-around of the Berkeley bell tower. With this in mind, the model is more detailed near the tower and fades out in complexity further away. Forty of the campus’ buildings are represented. The film model also includes photogrammetrically recovered terrain geometry that extends out to the horizon.



Figure 3 (cont’d). Two synthetically generated images from The Campanile Movie, created by projecting the photographs onto the recovered geometric model. Using graphics hardware, renderings such as these can be generated in real time. Animations are available on the web.


Figure 4: Image-based lighting. Above is a frame from the SIGGRAPH 99 Electronic Theater animation Fiat Lux in which hundreds of animated spheres and monoliths were rendered into an image-based 3D model of St. Peter’s Basilica. High dynamic range photography was employed to capture the illumination inside the basilica, allowing the synthetic objects to be illuminated with the light that was actually present. Also see the website.

The Campanile Movie

The most ambitious model we have constructed to date was for the animation The Campanile Movie shown at the SIGGRAPH 97 Electronic Theater. The main sequence of the film is a swooping fly-around of Berkeley’s bell tower, looking out across the surrounding campus. To create the animation, we built an image-based model of the tower and the surrounding campus — from the foot of the tower out to the horizon — from a set of 20 photographs. The photographs were taken from the ground, from the tower and from above the tower using a kite. The geometry we constructed was more detailed near the tower and less detailed further away. A total of 40 campus buildings were modeled to some extent; the buildings further away appeared only as texture maps projected onto the ground. The terrain geometry was inferred from the bases of the buildings, as well as a few scattered points recovered through triangulation, and a series of points along the horizon. There were a few thousand polygons in the model, and the 16 images used in rendering the scene fit snugly into the available texture memory. By making use of OpenGL and a special view-dependent texture-mapping technique [3], it was possible to render the scene in real time.
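At its core, projective texture mapping amounts to projecting each model vertex through the camera that took the photograph to obtain its texture coordinates. The CPU-side sketch below illustrates the idea under assumed conventions; in the film this work was done by the graphics hardware through OpenGL's texture-coordinate generation, and the 3x4 camera matrix is assumed to come from the photogrammetric reconstruction.

import numpy as np

def projective_tex_coords(P, vertices, width, height):
    """vertices: Nx3 model-space points; P: 3x4 camera matrix of the photograph.
    Returns Nx2 texture coordinates in [0, 1]."""
    Vh = np.hstack([vertices, np.ones((len(vertices), 1))])
    x = Vh @ P.T                       # project each vertex into the photograph
    u = x[:, 0] / x[:, 2] / width      # normalize pixel coordinates to [0, 1]
    v = x[:, 1] / x[:, 2] / height
    return np.stack([u, v], axis=1)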

The final effect was one that we had not seen before — a computer rendering, at a glance indistinguishable from the reality from which it was built, able to be viewed interactively in any direction and from any position for a wide range of views. Some images from the project are shown in Figure 3. George Borshukov, who worked on The Campanile Movie and is now at MANEX Entertainment, applied some of these same techniques to produce virtual camera moves for the Keanu Reeves film, The Matrix.

Beyond Appearance: Image-Based Lighting and Inverse Global Illumination

While image-based modeling and rendering allows us to move virtually through real scenes, it does not allow us to make changes to the scenes. The reason is that image-based renderings are simply interpolations of the appearance of the scene at the time the photographs were taken. Adding new objects, or changing the lighting, requires recomputing the illumination in the scene to show the new conditions.

To realistically add a synthetic object to an image-based model, such as a piece of furniture or a virtual actor, the object needs to be illuminated in a manner that is consistent with the rest of the scene. Furthermore, the object needs to affect the illumination of the rest of the scene: casting shadows, appearing in reflections and so forth. One method to do this is to measure the incident illumination from the scene at the location that the object is to be placed, and then to illuminate the synthetic object with this measurement of the real light using a global illumination algorithm [2]. This technique was used to produce the animation Fiat Lux, shown in Figure 4.
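For a purely diffuse object, the idea can be sketched directly: treat each pixel of the captured high dynamic range environment map as a small light source and sum its cosine-weighted contribution at the surface. A latitude-longitude radiance map and a Lambertian surface are assumed here; actual renderings such as Fiat Lux use a full global illumination system [2], which also lets the synthetic objects cast shadows and appear in reflections.

import numpy as np

def diffuse_irradiance(env, normal):
    """env: HxWx3 HDR radiance map in latitude-longitude format.
    normal: unit surface normal. Returns RGB irradiance at the surface point."""
    h, w, _ = env.shape
    theta = (np.arange(h) + 0.5) / h * np.pi          # polar angle of each row
    phi = (np.arange(w) + 0.5) / w * 2 * np.pi        # azimuth of each column
    st, ct = np.sin(theta), np.cos(theta)
    directions = np.stack([
        np.outer(st, np.cos(phi)),
        np.outer(st, np.sin(phi)),
        np.outer(ct, np.ones_like(phi))], axis=-1)    # HxWx3 unit light directions
    cosine = np.clip(directions @ normal, 0.0, None)  # only front-facing light contributes
    # Solid angle of each pixel shrinks toward the poles by sin(theta).
    d_omega = (np.pi / h) * (2 * np.pi / w) * st[:, None]
    return (env * (cosine * d_omega)[..., None]).sum(axis=(0, 1))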

Changing the lighting in a scene is a more involved process, since it requires computing a completely new solution of the interaction of the light and the surfaces in the scene. To do this requires estimating the reflectance properties of the surfaces in the scene, such as their diffuse color, their specular properties and whether they are metallic. Sato, Wheeler and Ikeuchi [13] and Stephen Marschner [11] have shown how the reflectance properties of an object can be estimated from a set of photographs under point-source illumination. Our current research at UC Berkeley in inverse global illumination [17] shows how to recover the reflectance properties of an entire scene under arbitrary lighting conditions. The result of such an analysis is a model consisting of geometry and reflectance properties, which can be rendered using traditional computer graphics techniques.
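In its very simplest form, the idea can be illustrated as follows: if the irradiance arriving at a surface point is known (for instance from the image-based lighting measurement above), a Lambertian albedo estimate is just the scaled ratio of observed radiance to irradiance, averaged over photographs. This is a toy version only; inverse global illumination [17] must also separate specular reflection and account for light interreflected between surfaces.

import numpy as np

def estimate_albedo(observed_rgb, irradiance_rgb):
    """observed_rgb: NxRGB radiance of one surface point in N photographs.
    irradiance_rgb: NxRGB irradiance at that point for each photograph."""
    irradiance = np.maximum(np.asarray(irradiance_rgb, dtype=float), 1e-8)
    # Lambertian model: radiance = (albedo / pi) * irradiance.
    ratios = np.asarray(observed_rgb, dtype=float) * np.pi / irradiance
    return ratios.mean(axis=0)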

Conclusion

The 1990s have been an exciting time for computer graphics research, as our ability to model and render aspects of the real world has evolved from approximate models of simple objects to detailed models of complex scenes. The impact of these new techniques has already been felt in the film industry, as virtual camera moves and image-based models are being increasingly used for visual effects. In the next decade we’ll be able to capture and display larger data sets, recompute lighting in real time, view scenes as immersive 3D spaces and populate these recreated spaces with photorealistic digital humans. The biggest challenge posed by this technology is a familiar one, and one that the graphics community has consistently taken on with enthusiasm and flourish: finding something great to do with it.

For Further Information

See also the computer vision website.

Animations from The Chevette Project, Facade, Rouen Revisited, The Campanile Movie, Rendering with Natural Light and Fiat Lux are available on the web.

References
  1.  Beardsley, Paul, Phil Torr and Andrew Zisserman. "3D Model Acquisition from Extended Image Sequences," in Proceedings European Conference on Computer Vision, pp. 683-695, 1996.
  2.  Debevec, Paul. "Rendering Synthetic Objects into Real Scenes: Bridging Traditional and Image-based Graphics with Global Illumination and High Dynamic Range Photography," SIGGRAPH 98 Conference Proceedings, pp. 189-198, July 1998.
  3.  Debevec, Paul E., Yizhou Yu and George D. Borshukov. "Efficient View-Dependent Image-Based Rendering with Projective Texture-Mapping," in 9th Eurographics Workshop on Rendering, pp. 105-116, June 1998.
  4.  Faugeras, Olivier. Three-Dimensional Computer Vision, MIT Press, 1993.
  5.  Gortler, Steven J., Radek Grzeszczuk, Richard Szeliski, Michael F. Cohen. "The Lumigraph," SIGGRAPH 96 Conference Proceedings, 1996, pp. 43-54.
  6.  Kruppa, E. "Zur Ermittlung eines Objectes aus zwei Perspektiven mit innerer Orientierung" in Sitz.-Ber. Akad. Wiss., Wien, Math. Naturw. Kl., Abt. 11a., pp. 1939-1948, 1913.
  7.  Kanade, T. et al. "Constructing Virtual Worlds Using Dense Stereo," Proceedings of Sixth IEEE International Conference on Computer Vision (ICCV’98), Bombay, India, January 1998, pp. 3-10.
  8.  Laveau, Stephane and Olivier Faugeras. "3D scene representation as a collection of images," Proceedings of 12th International Conference on Pattern Recognition, pp. 689-691, 1994.
  9.  Levoy, Marc and Pat Hanrahan. "Light Field Rendering," SIGGRAPH 96 Conference Proceedings, 1996, pp. 31-42.
  10.  Marschner, Stephen. Inverse Rendering for Computer Graphics, Ph.D. thesis, Cornell University, August 1998, see Website.
  11.  Marschner, Stephen R., Stephen H. Westin, Eric P. F. Lafortune, Kenneth E. Torrance and Donald P. Greenberg. "Image-based BRDF Measurement Including Human Skin," in Eurographics Rendering Workshop 1999, June 1999.
  12.  McMillan, Leonard and Gary Bishop. "Plenoptic Modeling: An Image-Based Rendering System," SIGGRAPH 95 Conference Proceedings, pp. 39-46, 1995.
  13.  Sato, Yoichi, Mark D. Wheeler and Katsushi Ikeuchi. "Object Shape and Reflectance Modeling from Observation," SIGGRAPH 97 Conference Proceedings, 1997, pp. 379-387.
  14.  Seitz, S. M. and C. R. Dyer. "Photorealistic Scene Reconstruction by Voxel Coloring," International Journal of Computer Vision, pp. 1-23, Volume 35, Number 2, 1999.
  15.  Szeliski, Richard. "Rapid Octree Construction from Image Sequences," CVGIP: Image Understanding, Volume 58, Number 1, July 1993, pp. 23-32.
  16.  Williams, Lance and Shenchang Eric Chen. "View Interpolation for Image Synthesis," SIGGRAPH 93 Conference Proceedings, 1993.
  17.  Yu, Yizhou, Paul Debevec, Jitendra Malik and Tim Hawkins. "Inverse Global Illumination: Recovering Reflectance Models of Real Scenes from Photographs," SIGGRAPH 99 Conference Proceedings, pp. 215-224, August 1999.
  18.  Zabih, Ramin and John Woodfill. "Non-parametric local transforms for computing visual correspondence," ECCV, Springer Verlag, May 1994, pp. 151-158.


Paul Debevec studied math and computer engineering at the University of Michigan and received his Ph.D. in computer science from UC Berkeley where he is now a Research Scientist. While helping develop computer graphics technology, he has also collaborated with leading media artists to apply such technology to creative applications. He has directed several internationally exhibited computer animations, many of which have premiered at the SIGGRAPH Electronic Theater. His current work involves applying new graphics techniques to digital filmmaking and cultural heritage.

Paul Debevec
387 Soda Hall #1776
Computer Science Division
UC Berkeley
Berkeley, CA 94720-1776

Tel: +1-510-642-9940
Fax: +1-510-642-5775
Website

The copyright of articles and images printed remains with the author unless otherwise indicated.