Applications of Computer Vision to Computer Graphics

Vol. 33 No. 4, November 1999
ACM SIGGRAPH


Vision in Film and Special Effects



Doug Roble
Digital Domain

Introduction

Computer vision techniques have become increasingly important in visual effects. When a digital artist inserts an effect into a live-action scene, the more information they have about the scene, the better. In visual effects, we use vision techniques to calculate the location of the camera, construct a 3D model of the scene and follow objects as they move through it.

Think back to the 1980s, before computer graphics in special effects really took hold. In most of the films of that era, it was very easy to spot the "effects shot" in a film. The give-away was the sudden halting of a previously dynamic, moving camera. This was done because inserting a special-effects character or object into a scene with a moving camera was very difficult. The camera move used in filming the live-action part of the scene had to be exactly reproduced by the camera on the effects set when the model or puppet was filmed. To accomplish this, one of two techniques was employed: 1) mechanical devices encoded the motion of the camera, and the encoded information was then used to control the motion of the camera on the effects set, or 2) an experienced motion control engineer matched the move by eye. Both of these techniques were terribly prone to error and were used only on high-budget productions.

In the 1990s, with the advent of widespread computer-generated effects and digital scanning of film, it became possible to apply computer vision techniques to extract information from the scene and help the digital artist blend an effect with the reality of the filmed image.

The Basics

Vision techniques are used to figure out the location and orientation of the camera when a photograph was taken. This is known as 3D tracking (also match moving or, in computer vision, view correlation). The simplest way to determine the camera location is to survey the geometry of the scene and use it to calculate the camera position.

To visualize this, think of taking a photograph of a room, developing the film as a slide, and, somehow, putting the slide back into the camera at the film plane. When viewed through this imaginary camera, the picture on the slide is superimposed on the scene in the camera’s viewfinder. Now, assuming that you haven’t changed the lens of the camera, it should be possible to return to the room and try to figure out where the camera was when the photograph was taken. By moving around you’ll soon find that there is one and only one location and orientation of the camera where all the features on the slide line up with all the points in the room.


Figure 1: 3D tracking software developed at Digital Domain was used on nearly every shot of the movie Titanic.

Software can do what you just did, given an image, a 3D model of the room and the correspondence between points on the 3D model and points on the image.

To create special effects, sequences of frames (i.e., shots) must be handled. By identifying the correspondence between points on the 3D model and the frames within a shot, software can construct an animation curve of the position and orientation of the camera. This curve can be exported to an animation package and used by an artist to animate the virtual camera to match the movement of the live-action camera.

This is only the beginning! Now that we have a moving camera, we can use photogrammetry to construct a better 3D representation of the scene. Surveying with traditional techniques is a time-consuming process. In the typical case, surveyors only have time to digitize just enough points to track the location of the camera. When the artist wants to insert a digital character into the scene, it is very useful if the artist has a good 3D model of the set. If the camera has moved enough, you can use two different camera positions to triangulate the location of any point visible in both views.
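
To make the triangulation step concrete, here is a minimal sketch of the standard linear (DLT) method, assuming each view's solved camera has already been turned into a 3x4 projection matrix; the names and the SVD-based solve are illustrative, not the specific code used in production.

    import numpy as np

    def triangulate(P1, P2, x1, x2):
        """Recover a 3D point from its pixel positions x1 and x2 in two views.

        P1 and P2 are the 3x4 projection matrices of the two solved cameras.
        """
        # Each view contributes two linear constraints on the homogeneous point X.
        A = np.vstack([
            x1[0] * P1[2] - P1[0],
            x1[1] * P1[2] - P1[1],
            x2[0] * P2[2] - P2[0],
            x2[1] * P2[2] - P2[1],
        ])
        # The least-squares solution is the right singular vector with the
        # smallest singular value.
        _, _, Vt = np.linalg.svd(A)
        X = Vt[-1]
        return X[:3] / X[3]  # de-homogenize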

Finally, it is possible to solve for the camera location and the position of tracking points without any survey data at all (this problem is referred to as structure from motion). In this case the camera must be moving so that we can see the scene from different vantage points. The calculations are more difficult and need more care to achieve a consistently stable solution.

There are alternate methods that can be used to achieve the same effect. The technique of recording the camera move using physical encoders hasn't gone away. There have also been attempts to use traditional motion capture methods to capture the motion of the camera. We've even tried to capture the motion of a helicopter-mounted camera using a combination of global positioning system receivers and accelerometers. Unfortunately, these techniques will only produce a rough track. They always fall short because of error in the recording, and because they capture neither the mechanical motion of the film in the gate nor exactly what happens with the image and the lens. Using the image itself to calculate the camera position always produces the most accurate results.

Many commercial packages are available to solve some or all of these problems. ras_track from Hammerhead, SoftScene from Geometrix, Matchmaker from Synapix and 3D Equalizer from Science D Visions are only a few of the packages currently available.

3D Tracking Basics

3D tracking is basically a big optimization problem. The 3D locations of points and the locations of those points on a 2D image are known. The search space is the set of parameters (x, y, z position; x, y, z rotation; and field of view) of the camera that took the photograph. This is a seven-dimensional minimization problem where the fitness measure is the error between the observed 2D points and the positions where the 3D points project onto the photograph for a given set of camera parameters.
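
A minimal sketch of that solve is shown below, assuming a simple pinhole projection, Euler-angle rotations, 2D points measured relative to the image center and scipy's general-purpose least-squares optimizer; the function names, parameterization and the default 2K-scan image height are illustrative, not those of any particular production tracker.

    import numpy as np
    from scipy.optimize import least_squares

    def rotation(rx, ry, rz):
        """Euler-angle rotation matrix (one of several possible conventions)."""
        cx, sx = np.cos(rx), np.sin(rx)
        cy, sy = np.cos(ry), np.sin(ry)
        cz, sz = np.cos(rz), np.sin(rz)
        Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
        Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
        Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
        return Rz @ Ry @ Rx

    def residuals(params, pts3d, pts2d, image_height):
        """Reprojection error for one frame; pts2d are pixels from image center."""
        tx, ty, tz, rx, ry, rz, fov = params
        # Transform surveyed points into camera space.
        cam = (pts3d - np.array([tx, ty, tz])) @ rotation(rx, ry, rz).T
        focal = 0.5 * image_height / np.tan(0.5 * fov)  # pixels, vertical FOV
        proj = focal * cam[:, :2] / cam[:, 2:3]          # perspective divide
        return (proj - pts2d).ravel()

    def solve_camera(pts3d, pts2d, guess, image_height=1556):
        """Minimize the reprojection error over the seven camera parameters."""
        return least_squares(residuals, guess,
                             args=(pts3d, pts2d, image_height)).x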

There are many ways to solve this. A good starting point is to read the article "View Correlation" by Rod Bogart in Graphics Gems II [1]. It is a simple, robust solution to the problem. After that, looking at computer vision literature (Three-Dimensional Computer Vision by Olivier Faugeras [4] is very good) will give you more details.

When using these techniques on film, one encounters problems that other disciplines can simply ignore. For visual effects, accuracy is paramount. Nothing spoils an effect more than a digital set extension that wiggles against the real set or a digital character whose feet don't stay planted on the ground. So, not only does the solution need to be accurate on a single frame, it must be accurate over the entire shot. The high resolution of film doesn't help, either. What might look fine at video resolution becomes glaringly wrong at film resolution.

In addition, the images that we work with are often less than ideal for applying computer vision techniques. Typical problems encountered include motion blur, low light, changing light and heavy film grain. Even worse, the set might change or be destroyed during a shot. On Dante's Peak we had to do a track where the nicely surveyed building caught fire during the shot! Our tracking reference points were destroyed before the shot finished!

The Problem with Field of View

In our quest for accuracy, the lens of the camera keeps challenging us. First, the field of view of the lens can cause problems. Suppose you are looking through the lens of an SLR camera and you want the subject to appear bigger in the frame. You have two options: either physically move the camera closer to the subject or reduce the field of view of the lens. If you experiment with this, you'll notice that zooming and moving the camera produce different images, but that small zooms and small motions of the camera look remarkably similar.

Surveyed objects help in determining the field of view. As previously mentioned, you don't need surveyed objects to do a track, and tracks are often done without them. For high accuracy during a shot with a zoom lens, however, surveyed objects remove some of the variables from the optimization and produce more stable and accurate results.

Even with surveyed objects, the field of view will "breathe" over a shot. Depending on the error in tracking the points, the solution for one frame might move the camera closer to the object and increase the field of view. On the next frame the camera might be moved further from the object and the field of view reduced. The solution might look correct on a single frame, but the motion of the camera is incorrect. To solve this, the artist calculates the camera parameter curves for an initial solution. Then, realizing that the zoom isn't nearly as jittery as it looks, the artist smooths the field of view curve. Finally, the camera parameters are recalculated with the field of view parameter locked to the smoothed curve. This produces very nice results very quickly.
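
A minimal sketch of that two-pass solve follows, reusing the hypothetical residuals() and solve_camera() functions from the earlier sketch; the moving-average smoother, window width and image height are illustrative placeholders.

    import numpy as np
    from scipy.optimize import least_squares

    def smooth(curve, width=9):
        """Simple moving-average smoothing of a per-frame parameter curve."""
        kernel = np.ones(width) / width
        return np.convolve(curve, kernel, mode='same')

    def two_pass_track(frame_pts2d, pts3d, guess, image_height=1556):
        # Pass 1: solve all seven camera parameters independently on each frame.
        first = np.array([solve_camera(pts3d, p2d, guess, image_height)
                          for p2d in frame_pts2d])
        fov = smooth(first[:, 6])  # smooth the jittery zoom curve
        # Pass 2: re-solve only the six pose parameters, with the field of
        # view locked to the smoothed curve on every frame.
        solved = []
        for p2d, pose, fv in zip(frame_pts2d, first[:, :6], fov):
            fit = least_squares(
                lambda p: residuals(np.append(p, fv), pts3d, p2d, image_height),
                pose)
            solved.append(np.append(fit.x, fv))
        return np.array(solved)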


Figure 2: Marker barrels were dropped on a glacier to make tracking easier.

Initially we used professional survey teams to survey the sets and locations of the shots. One of the first uses of my software was to track a helicopter move over a glacier. The glacier had no readily discernible points, so our survey crew dropped a set of 55-gallon drums on the glacier and surveyed them by having a particularly gutsy surveyor hang on the landing skids of a hovering helicopter, holding the reflecting prism on each barrel (see Figure 2). It wasn't safe to land a person on the glacier, and to make matters worse, it was around 10 below at the time!

Now, while we still use professional surveyors for some tasks, we typically survey sets using photogrammetry techniques.

Barrel/Lens Distortion

Unfortunately, the problems of the lens don’t stop there. In computer graphics, the camera is typically modeled as an ideal pinhole camera. Even the most expensive physical lenses only approximate a pinhole camera. So, our calculations of the camera position have inherent errors.

It is possible to solve for the non-linear distortion effects of a lens just as one solves for the field of view, but that presents a problem when the data is exported to an animation package or renderer. These packages support only a pinhole camera model, though some renderers do allow you to write your own camera "shader." Thus, the position of the camera that is calculated will appear incorrect when viewed through the animation package’s GUI. This is typically unacceptable for an artist trying to animate a character so that it sticks to a floor.

One solution to this is to undistort the images before tracking. We solve for the distortion present in the lens and warp the images to appear as if the camera were a pinhole camera. The 3D track is done with the new images and the artists can animate to them. When the effect is composited onto the final image, it is distorted to match. Unfortunately, this warping effectively reduces the resolution of the image by half, so rendering must be accomplished at twice the normal size.
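
As a rough illustration of the undistortion step, here is a sketch that assumes a single-term radial distortion model with an already-estimated coefficient k1 and a grayscale frame; the model and the resampling via scipy are stand-ins for whatever calibration and warping tools a production pipeline actually uses.

    import numpy as np
    from scipy.ndimage import map_coordinates

    def undistort(image, k1, cx, cy, focal):
        """Build an ideal pinhole image from a radially distorted frame."""
        h, w = image.shape[:2]
        ys, xs = np.mgrid[0:h, 0:w].astype(float)
        # Normalized coordinates of the ideal (undistorted) image we want.
        xn, yn = (xs - cx) / focal, (ys - cy) / focal
        r2 = xn * xn + yn * yn
        # Where each ideal pixel lands in the distorted source frame.
        xd = xn * (1 + k1 * r2) * focal + cx
        yd = yn * (1 + k1 * r2) * focal + cy
        # Resample the original frame at those positions (bilinear).
        return map_coordinates(image, [yd, xd], order=1)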

We typically ignore the effects of lens distortion unless it is particularly bad. The camera solution will have some built in error, but we have ways of dealing with that.

Other Problems

Other things can get in the way of a good track. A film set is a dynamic, busy place. Actors can walk in front of all the good tracking features in the set. The camera might be moving fast enough so that all the points are motion blurred. The lighting during a shot can change so significantly that automatic tracking methods fail. Finding and following feature points on natural objects (trees, rocks) is particularly difficult. In these cases we often add painted ping-pong balls as tracking points with the knowledge that we’ll "paint them out later"!

Problems such as these highlight the difference between typical computer vision applications and computer vision in visual effects. If actors completely obscure the tracking points during a shot, we can always "throw an artist at it" and manually determine a plausible camera move. Our computer vision problems do not need to be real-time solutions and if they are not perfect the first time around, the artist can nudge the solution until it’s correct.


Figure 3: 2D tracking was used to apply digital makeup to Brad Pitt in Interview with the Vampire.

2D Tracking

2D tracking is the process of following a feature as it moves through the images in a shot. It is used to follow the points needed for 3D tracking, and it is also used in the compositing process. In Interview with the Vampire, Brad Pitt's vampire makeup was too heavily applied in one scene (see Figure 3). This wasn't noticed until after the production had wrapped, making it impossible to refilm the scene. We were able to help by 2D tracking features on his face and compositing digital makeup to reduce the effects of the actual makeup.

These kinds of applications require that our 2D tracking solution be fast and sub-pixel accurate. 3D tracking also requires that a pattern be followed even as it changes shape due to changes in perspective as the camera moves.

Most commercially available 2D trackers have the user identify a feature to be tracked. Then, the small image of the feature is compared to the image on the next frame. An X and Y translation of the feature is computed. Unfortunately, if the feature changes shape, like the corner of a building as the camera circles it, the track of the feature will not be very accurate.
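
The sketch below shows the kind of translation-only search such a tracker performs, using normalized cross-correlation over a small search window; the window size, exhaustive search and function names are illustrative assumptions, not any vendor's actual implementation.

    import numpy as np

    def ncc(a, b):
        """Normalized cross-correlation between two equal-sized patches."""
        a = a - a.mean()
        b = b - b.mean()
        denom = np.sqrt((a * a).sum() * (b * b).sum())
        return (a * b).sum() / denom if denom else 0.0

    def track(pattern, frame, cx, cy, search=16):
        """Find the best x, y offset of pattern near its old top-left (cx, cy)."""
        ph, pw = pattern.shape
        best_score, best_xy = -1.0, (cx, cy)
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = cy + dy, cx + dx
                if x < 0 or y < 0:
                    continue  # outside the frame
                window = frame[y:y + ph, x:x + pw]
                if window.shape != pattern.shape:
                    continue  # ran off the edge of the frame
                score = ncc(pattern, window)
                if score > best_score:
                    best_score, best_xy = score, (x, y)
        return best_xy, best_score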

We have developed a 2D tracking solution that computes not only the X and Y translation of a pattern, but also its scale, rotation and skew. This, while increasing the search space of the solution and slowing down the process, makes the 2D track much more accurate and robust.

2D tracking falls under the category of vision-assisted image editing. See the article by Eric Mortensen on pp. 55-57 in this issue [7] for more information on this complex field.

Surveyless Tracking

The ultimate solution to 3D tracking is to not require any knowledge of the scene whatsoever. This is certainly possible, but care must be taken to get consistently stable solutions. In surveyless tracking, the artist places 2D trackers on multiple features in the image. As the camera moves through the shot, the locations of the points are tracked. These locations are fed simultaneously to an optimizer that solves for the location of the camera at each frame and the 3D locations of the points.
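
A minimal sketch of that joint solve is given below, packing all camera parameters and point positions into one vector and reusing the hypothetical per-frame residuals() from the earlier sketch; real structure-from-motion solvers add robust losses, sparse Jacobians and careful initialization, so this is only an outline of the idea.

    import numpy as np
    from scipy.optimize import least_squares

    def pack(cams, pts3d):
        return np.concatenate([cams.ravel(), pts3d.ravel()])

    def unpack(params, n_frames, n_points):
        cams = params[:n_frames * 7].reshape(n_frames, 7)
        pts3d = params[n_frames * 7:].reshape(n_points, 3)
        return cams, pts3d

    def joint_residuals(params, tracks, n_frames, n_points, image_height):
        cams, pts3d = unpack(params, n_frames, n_points)
        # tracks[f] holds the tracked 2D positions of every point in frame f.
        return np.concatenate([
            residuals(cams[f], pts3d, tracks[f], image_height)
            for f in range(n_frames)])

    def solve_structure_from_motion(tracks, cam_guess, pts_guess,
                                    image_height=1556):
        n_frames, n_points = len(tracks), len(pts_guess)
        x0 = pack(np.tile(cam_guess, (n_frames, 1)), np.asarray(pts_guess))
        fit = least_squares(joint_residuals, x0,
                            args=(tracks, n_frames, n_points, image_height))
        return unpack(fit.x, n_frames, n_points)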

There are, however, limitations to this approach. The camera must move quite a bit for an accurate calculation of the geometry. In shots where the camera doesn't move much, additional photos can be taken from different vantage points to provide the extra information. Additionally, the more that is known about the lens, the better. If the lens is calibrated by shooting a photograph of a reference object, the calculations are more accurate. If something is known about the geometry of the scene, however, this step can be eliminated: if three reference points are known to be related by a right angle, the focal length of the lens can be determined in software. Shots in which the camera is zooming are more difficult without surveyed objects.

Surveyless tracking software can gather no information about the scale of the objects in an image: the tracking algorithm cannot distinguish the image of a real car from that of a toy car. This is easily solved by measuring the distance between two points on the set, but you have to remember to do it!

With his Façade system, Paul Debevec [2, 3] has developed a very clever way of doing surveyless tracking and scene reconstruction. The user assembles a very simple model of the scene from standard geometric primitives, and the system can then accurately reconstruct architectural forms.

Conclusion

Computer vision for scene understanding has become a powerful tool in visual effects. The more an artist knows about the scene, the more they can alter it and add to it. Beyond 2D and 3D tracking, computer vision techniques are being used to create mattes for characters and reduce noise in images. Vision techniques are also being considered for motion capture of characters (see Michael Gleicher's article in this issue on pp. 51-54 for more information [6]).

When most people think of visual effects, they think of amazing creatures or fantastic sets. There is a large amount of work going on behind the scenes to make those creatures and sets look like they belong in the film. The audience never actually sees this work, but if it weren't done, they'd notice. Computer vision techniques help the digital artist immensely.

References
  1.  Bogart, Rod. "View Correlation," Graphics Gems II, J. Arvo, Editor, Academic Press, 1991.
  2.  Debevec, P., C. Taylor, J. Malik. "Modeling and Rendering Architecture from Photographs: A Hybrid Geometry- and Image-Based Approach," Computer Graphics, Proceedings of SIGGRAPH 96, August 1996, ACM SIGGRAPH, New York, NY, pp. 11-20.
  3.  Debevec, P. "Image-Based 3D Modeling," Computer Graphics, 33(4) November 1999, ACM SIGGRAPH, New York, NY, pp. 46-50.
  4.  Faugeras, O. Three-Dimensional Computer Vision, MIT Press, 1993.
  5.  Gleicher, M. and A. Witkin. "Through-the-Lens Camera Control," Computer Graphics, Proceedings of SIGGRAPH 92, July 1992, ACM SIGGRAPH, New York, NY, pp. 331-340.
  6.  Gleicher, M. "Motion Capture," Computer Graphics, 33(4) November 1999, ACM SIGGRAPH, New York, NY, pp. 51-54.
  7.  Mortensen, E. "Vision-Assisted Image Editing," Computer Graphics, 33(4) November 1999, ACM SIGGRAPH, New York, NY, pp. 55-57.


Doug Roble has been a Senior Software Engineer at Digital Domain for six years. He received his Ph.D. in computer science from The Ohio State University and his bachelor’s degree in electrical engineering from the University of Colorado. In 1999, he was awarded a Technical Achievement Award from the Academy of Motion Picture Arts and Sciences for his 3D tracking and scene reconstruction software.

Doug Roble
Digital Domain
300 Rose Avenue
Venice, CA 90291

The copyright of articles and images printed remains with the author unless otherwise indicated.