
Applications of Computer Vision to Computer Graphics

Vol.33 No.4 November 1999
ACM SIGGRAPH


Vision-Assisted Image Editing



Eric N. Mortensen
Brigham Young University

What Is It?

When I think of image editing, packages such as Photoshop and Paint Shop Pro come to mind. These packages are used to edit, transform or manipulate one or more images, typically with extensive user guidance, to produce a desired result. Tasks such as selection, matting, blending, warping and morphing are often tedious and time-consuming. Vision-assisted editing can lighten the burden, whether the goal is a simple cut-and-paste composition or a major special effect for a movie (see Doug Roble’s article in this issue). Thus, this article focuses on computer vision techniques that reduce (often significantly) the time and effort involved in editing images and video.

The goal of vision systems is to detect edges, regions, shapes, surface features, lighting properties, 3D geometry, etc. Most currently available image editing tools and filters utilize low-level, 2D geometric or image processing operations that manipulate pixels. However, vision techniques extract descriptive object or scene information, thus allowing a user to edit in terms of higher-level features.

Fully automatic computer vision remains a major focus in the computer vision community. Complete automation is certainly preferred for tasks such as robotic navigation, image/video compression, model-driven object delineation, multiple-image correspondence, image-based modeling, or any other application where autonomous interpretation of images/video is desired. However, general-purpose image editing will continue to require human guidance due to the essential role of the user in the creative process and in identifying which image components are of interest.

Most vision-assisted image editing techniques fall somewhere between user-assisted vision and vision-based interaction. User-assisted vision describes those techniques where the user interacts in image (or parameter) space to begin and/or guide a vision algorithm so that it produces a desired result. For example, Photoshop’s magic wand computes a connected region of similar pixels based on a mouse click in the area to be selected. Vision-based interaction refers to those methods where the computer has done some or all of the "vision" part and the user interacts within the resulting vision-based feature space. One example is the ICE (Interactive Contour Editing) system [4] that computes an image’s edge representation and then allows a user to interactively select edge groupings to extract or remove image features.

A tool is classified based on where a user can "touch" the data of the underlying vision function - the process that computes results from inputs. User-assisted vision manipulates the input (or domain) space of the vision function while vision-based interaction provides access to the result (or range). Some tools allow intervention at several steps in the process, including the ability to adjust partial or intermediate results.

Regardless of a tool’s classification, there are algorithmic properties that are desirable for image editing tools. These tools should be:

  •  Fast (efficient): the user does not want to wait between mouse clicks.
  •  Simple: while the underlying computation may be complex, the default interface presented to the user should be straightforward.
  •  Intelligent: it "understands" what the user wants based on a minimum of input; the less effort and time a user needs to direct the tool to the desired result, the more intelligent the tool.
  •  Predictable: the user can generally anticipate the tool’s response to a given input, even if unfamiliar with the underlying algorithm.
  •  Interactive: the user can interact with and guide the tool while it is in process.
  •  General: it has broad applicability and can be used effectively in a wide variety of situations.
  •  Robust: it is forgiving, to a degree, of noise in the input, both image and human. It accepts rough or approximate inputs and/or noisy images and still produces the desired result.
  •  Overridable: it should support the ability to force a specific desired behavior (within reasonable limits given the framework of the tool) or easily modify the generated output to achieve the desired results.

It is not necessary for a tool to possess all of these properties to be effective. Conversely, some properties necessitate others. For example, the more intelligent a tool is, the more need there may be to override when it "intelligently" does the wrong thing.

One property that is paramount in almost any tool is predictability. Trying to work with an unpredictable system is like trying to hammer a nail blindfolded (and perhaps as painful). Predictability is influenced not only by the underlying vision function and its associated properties, but also by the tool’s classification. Vision-based interaction tools tend to be more predictable. With vision-based interaction, the vision process has completed before interaction occurs. Thus, the mapping between human input and the resulting output is potentially more immediate and straightforward. However, in user-assisted vision tools, human input has to pass through the entire vision function, which can be very complex, thus confusing the mapping. That doesn’t mean that user-assisted tools are inherently unpredictable. A robust system and a simple interface go a long way in providing predictability, whether user-assisted or vision-based.

What’s Available?

When it comes to computer vision techniques that have been or could be applied to image editing, there is an abundance of good ideas. Since I don’t have the space, time, knowledge or desire to discuss every good idea out there, I will highlight only a few image editing tools.

Magic wand: for years, this tool and the lasso tool have been the mainstay selection tools for Photoshop users. As mentioned, magic wand is very much a user-assisted vision tool. There are various implementations of magic wand, but the basic idea is pretty much the same. Given a user-specified sample point or region, magic wand computes a region of connected pixels such that all the selected pixels fall within some adjustable tolerance of the sample statistic. For all its simplicity, magic wand is remarkably effective for certain types of objects. Yet, in terms of algorithmic properties, it seems somewhat slow, unintelligent, and unpredictable. Further, there is no ability to interact after the mouse click, and the override capability is limited to backing up or combining results from multiple mouse clicks. While the interface is straightforward (point, click and hope), adjusting the tolerance level is often hit and miss. A more active interface would go a long way in enhancing magic wand’s performance and overcoming its algorithmic shortcomings. For example, rather than presetting the tolerance, magic wand could compute and layer the results from multiple tolerances. Upon pressing the mouse button, magic wand could immediately display the zero-tolerance selection. Mouse movement (right/left or up/down) could then add on or peel away tolerance layers, displaying each result interactively. Such an interface would allow the user to quickly select among various results, providing an anchor towards the vision-based interaction side of the scale.
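To make the basic computation concrete, here is a minimal sketch of a magic wand in Python/NumPy. The function name, the 4-connected flood fill and the simple per-channel tolerance test are illustrative choices of mine rather than the implementation of any particular package:

    import numpy as np
    from collections import deque

    def magic_wand(image, seed, tolerance):
        # Grow a connected region outward from the clicked seed (row, col),
        # accepting neighbors whose values stay within `tolerance` of the
        # sampled seed value.
        h, w = image.shape[:2]
        sample = image[seed].astype(float)
        selected = np.zeros((h, w), dtype=bool)
        selected[seed] = True
        queue = deque([seed])
        while queue:
            r, c = queue.popleft()
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):  # 4-connected neighbors
                nr, nc = r + dr, c + dc
                if 0 <= nr < h and 0 <= nc < w and not selected[nr, nc]:
                    if np.all(np.abs(image[nr, nc].astype(float) - sample) <= tolerance):
                        selected[nr, nc] = True
                        queue.append((nr, nc))
        return selected

A call such as magic_wand(image, (120, 85), 12.0) returns a boolean mask of the selection; the tolerance-layering interface suggested above would amount to precomputing such masks for a range of tolerances and letting mouse motion choose among them.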





Figure 1: Images used in a composition. (a-c) Live-wire boundaries used for segmentation. (d) Final composition: Millennium.





Figure 2: ICE results [4]. (a) Original image. (b) Image contours with selected edges highlighted. (c) Reconstruction of image after deleting highlighted contours. Copyright 1998 IEEE.

Active Contours [7]: this technique has received a lot of attention in the vision community. Active contours are initialized by drawing an approximate contour around an object. They are called snakes since the contour appears to "wiggle" and "slide" as it seeks a minimum energy state combining smoothness and image features. By using the result from one frame as the approximate contour in the next frame, snakes can track a moving object through a video sequence. Algorithmically, snakes are fast, robust, seemingly intelligent and can be overridden by pushing on or pulling at portions of the snake during minimization. Since snakes are a user-assisted vision tool, the user is never quite sure of the snake’s final "resting" place given the initial contour. Even pushing and pulling only modify the vision function’s domain (or input) space. Fortunately, snakes are generally well behaved and predictable due to their robust minimum energy formulation.
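The formulation in [7] minimizes a continuous energy functional, but the flavor of the computation can be conveyed by the simpler greedy approximation sketched below (a discrete variant, not the original algorithm). Each iteration moves every contour point to the nearby pixel that best trades off stretching, bending and edge strength; the weights and search window are illustrative:

    import numpy as np

    def greedy_snake_step(points, edge_map, alpha=0.5, beta=0.5, window=1):
        # points:   (N, 2) array of (row, col) positions on a closed contour,
        #           assumed to lie inside the image.
        # edge_map: image whose large values mark strong edges (e.g. gradient
        #           magnitude), so -edge_map acts as the external energy.
        n = len(points)
        new_points = points.copy()
        for i in range(n):
            prev_pt = new_points[(i - 1) % n]   # already-updated neighbor
            next_pt = points[(i + 1) % n]
            best, best_energy = points[i], np.inf
            for dr in range(-window, window + 1):
                for dc in range(-window, window + 1):
                    cand = points[i] + np.array([dr, dc])
                    r, c = int(cand[0]), int(cand[1])
                    if not (0 <= r < edge_map.shape[0] and 0 <= c < edge_map.shape[1]):
                        continue
                    tension = np.sum((cand - prev_pt) ** 2)                   # stretch term
                    curvature = np.sum((prev_pt - 2 * cand + next_pt) ** 2)   # bend term
                    energy = alpha * tension + beta * curvature - edge_map[r, c]
                    if energy < best_energy:
                        best_energy, best = energy, cand
            new_points[i] = best
        return new_points

Iterating greedy_snake_step until the points stop moving mimics the "wiggle" toward a low-energy boundary; tracking a video object simply reuses the converged points as the starting contour for the next frame.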

Intelligent Scissors [9, 10]: based on the live-wire interactive optimal path selection tool, Intelligent Scissors allows a user to choose a minimum cost contour segment corresponding to a portion of the desired object boundary. As the mouse moves, Intelligent Scissors displays the minimum cost contour from the cursor position back to a user-specified "seed" point. It utilizes both user-assisted vision (placement of seed points) and vision-based interaction (selection of an optimal path). It is fast, robust, has general applicability and requires relatively little human guidance to achieve an acceptable result (see Figure 1). Unlike snakes, Intelligent Scissors requires a few seconds to precompute local edge costs, and it does not readily extend to video sequences (see What’s Next below). Also, Intelligent Scissors tends to be more sensitive to noise in low contrast areas. Some recent improvements to Intelligent Scissors [11] increase interactive responsiveness, provide subpixel boundary positioning and include an edge model to automatically compute a matte image and processed foreground image (the background component is removed from edge transitions).
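At its core, live-wire is a shortest-path computation: local costs are assigned to pixel links (low on strong edges) and Dijkstra’s algorithm expands minimum-cost paths outward from the seed, so the optimal path to any cursor position can be read back instantly. The sketch below uses a much-simplified per-pixel cost; the published formulation [9, 10] combines gradient magnitude, gradient direction and Laplacian zero-crossing features:

    import heapq
    import numpy as np

    def expand_seed(cost, seed):
        # Dijkstra expansion from the seed pixel over an 8-connected grid.
        # cost: per-pixel local edge cost (low on edges).  Returns a map of
        # predecessor links covering the whole image.
        h, w = cost.shape
        dist = np.full((h, w), np.inf)
        pred = {}
        dist[seed] = 0.0
        heap = [(0.0, seed)]
        while heap:
            d, (r, c) = heapq.heappop(heap)
            if d > dist[r, c]:
                continue                      # stale heap entry
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < h and 0 <= nc < w:
                        step = np.hypot(dr, dc) * cost[nr, nc]   # scale diagonal links
                        if d + step < dist[nr, nc]:
                            dist[nr, nc] = d + step
                            pred[(nr, nc)] = (r, c)
                            heapq.heappush(heap, (d + step, (nr, nc)))
        return pred

    def live_wire(pred, cursor):
        # Follow predecessor links from the cursor back to the seed.
        path = [cursor]
        while path[-1] in pred:
            path.append(pred[path[-1]])
        return path[::-1]

Because all paths are expanded when the seed is placed, live_wire(pred, cursor) is just a table lookup, which is what lets the displayed boundary follow the mouse in real time.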

Image Contour Editing (ICE) [4]: as noted above, ICE is a vision-based image editing tool. It automatically finds reliable image contours and encodes their location, orientation, blur and asymptotic intensity on both sides. By inverting the contour’s edge model, ICE is able to reconstruct a high-fidelity representation of the original image. Interactive selection and grouping of contours allows objects to be quickly extracted or eliminated from an image (see Figure 2). Currently, ICE has no override mechanism to manually adjust or define contours and their properties (e.g. to fill in occluded contours that become visible after removal of a foreground object), but there seems to be no reason why this couldn’t be added. While ICE is fairly fast during interaction, it does require a couple of minutes of preprocessing to compute the contour representation. However, preprocessing can be done off-line and the edge representation can be used to save the image in a compressed format, encoding and saving the difference between the original and reconstructed images. As such, image contours can be loaded and manipulated directly from the compressed format. The reduced data inherent in the contour representation provides not only for image compression, but could also facilitate efficient computation of other computer vision methods, such as automatic feature matching for image morphing.
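The published edge model is richer than I can reproduce here, but a Gaussian-blurred step conveys the general idea: the two asymptotic intensities on either side of a contour, joined by a smooth transition whose width is the measured blur, predict the intensity profile across the edge. The sketch below is only an illustrative stand-in, not necessarily the exact model of [4]:

    import numpy as np
    from scipy.special import erf

    def blurred_step_profile(x, i_left, i_right, blur):
        # Intensity at signed distance x from a contour located at x = 0:
        # a step from i_left to i_right smoothed by a Gaussian of width `blur`.
        phi = 0.5 * (1.0 + erf(np.asarray(x, dtype=float) / (blur * np.sqrt(2.0))))
        return i_left + (i_right - i_left) * phi

Inverting such a model means recovering (i_left, i_right, blur) for each contour from the observed image, which is what allows a high-fidelity image to be reconstructed from the contour representation alone.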

Morphing (and Warping): traditionally, image morphing and warping rely on extensive user input to manually specify how features map from one image to another. However, application of computer vision techniques has reduced the human effort. For example, snakes can assist in the tedious feature matching task [8]. Even better is to eliminate user-directed feature matching altogether, or at least minimize it. Gao and Sederberg [6] formulate image morphing as a shape blending problem [13] and automatically compute a mapping that minimizes the work required to "bend" and "stretch" a source image grid so that it conforms to a destination grid (example morphs can be seen on the web). Interactive placement of "anchor" points provides an intuitive and efficient override mechanism to guide the process when it needs a little help. Another attractive morphing idea uses example pose/expression images and optical flow to compute the pose/expression space of a person’s face and head [1]. Given pose and expression parameters, the system synthesizes a new image of the original person with the specified pose and expression. Alternatively, it can apply the parameters to someone else’s image and warp that person’s expression and pose to conform to the desired settings.
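Whatever technique supplies the correspondence, the final morph is typically a warp-and-cross-dissolve. The sketch below assumes a dense displacement field is already given (in [6] it would come from the work-minimizing mapping, in [1] from optical flow) and produces one in-between frame; evaluating the flow at the output pixel instead of the source pixel is a common backward-warping approximation, and nearest-neighbor sampling keeps the example short:

    import numpy as np

    def morph_frame(src, dst, flow, t):
        # flow[r, c] = (dr, dc) displacement carrying a src pixel to its
        # corresponding position in dst; t in [0, 1] moves from src to dst.
        h, w = src.shape[:2]
        rows, cols = np.mgrid[0:h, 0:w]
        # Sample src part-way back along the flow and dst part-way forward.
        src_r = np.clip(rows - t * flow[..., 0], 0, h - 1).astype(int)
        src_c = np.clip(cols - t * flow[..., 1], 0, w - 1).astype(int)
        dst_r = np.clip(rows + (1 - t) * flow[..., 0], 0, h - 1).astype(int)
        dst_c = np.clip(cols + (1 - t) * flow[..., 1], 0, w - 1).astype(int)
        warped_src = src[src_r, src_c].astype(float)
        warped_dst = dst[dst_r, dst_c].astype(float)
        return (1 - t) * warped_src + t * warped_dst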

What’s Next?

The recent convergence of vision and graphical techniques has produced several tools that estimate various physical scene or object properties, which in turn are used to render new scenes from different viewpoints. Such tools are ideal candidates for image/video editing applications. Traditional image and video editing has been limited to 2D operations. When cutting an object from one scene to paste it into another or when editing an object to get the required look, a desirable result is often dependent on matching or adjusting the camera angle, lighting conditions, focal length, noise properties, etc. By integrating ideas from plenoptic decomposition [14], image-based modeling [3] (see also Paul Debevec’s article in this issue on pp. 46-50), multi-view modeling [15], shape from video/shading/texture and/or a variety of other vision and graphics techniques - and by throwing in a little human guidance at times - a user could potentially extract and manipulate physical scene properties such as an object’s geometry, photometric properties [16], texture or color, etc. Further, imaging properties such as blur [9] (due to motion or focal length) and camera noise could be automatically modified (with an override option) to better match the object’s position in the destination scene.

When it comes to selection and matting of moving objects, video editing is usually a painstaking process. While snakes do a fair job of tracking moving object boundaries [7] and generating mattes [2], manual matte definition is still widely used. Consequently, there is ongoing work to extend Intelligent Scissors for temporal image sequences (i.e., video). The goal of this project is to maintain the same interactive style that made Intelligent Scissors an effective selection tool for still images. As such, Intelligent Scissors for video will allow a user to simultaneously view and interact with multiple video frames. Several frames are displayed as a montage and the user is free to interact with any frame (or to jump back and forth between frames). As the cursor moves in one frame, the live-wire is immediately updated and displayed in all frames. To make sure that an object is not tracked past a scene cut, temporal slice analysis could detect scene transitions (including cuts, wipes, and fades) [12]. Other extensions to Intelligent Scissors will provide tight coupling of contour properties between frames (to improve tracking capabilities) and a more intelligent cursor snap [5] (to ease seed point placement). With these additions, Intelligent Scissors should stay on the cutting edge of vision-assisted image editing.
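As a rough illustration of how temporal slice analysis can flag transitions, the sketch below stacks one image row from every frame into a spatiotemporal slice and reports frames where the slice changes abruptly. This only catches hard cuts under a hand-picked threshold; the gradual-transition analysis of [12] is considerably more involved:

    import numpy as np

    def cuts_from_temporal_slice(frames, row, threshold):
        # frames: sequence of images of equal size; row: index of the image
        # row used to build the (num_frames x width) spatiotemporal slice.
        slice_img = np.stack([np.asarray(f)[row].astype(float) for f in frames])
        diffs = np.abs(np.diff(slice_img, axis=0))
        per_frame = diffs.reshape(diffs.shape[0], -1).mean(axis=1)  # mean change per step
        return [t + 1 for t, d in enumerate(per_frame) if d > threshold]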

Acknowledgments

I would like to thank the co-editors, Steve Seitz and Rick Szeliski, as well as William Barrett for their input and suggestions. I would also like to thank James Elder for providing the images demonstrating ICE.

References
  1.  Beymer, D. and T. Poggio. "Image Representations for Visual Learning," Science, Vol. 272, No. 5270, pp. 1905-1909, June 28, 1996.
  2.  Covell, M. M. and T. J. Darrell. "Dynamic Occluding Contours: A New External-Energy Term for Snakes," in Proc. IEEE: Computer Vision and Pattern Recognition, Vol. II, pp. 232-238, June 1999.
  3.  Debevec, P. E., C. J. Taylor and J. Malik. "Modeling and Rendering Architecture from Photographs: A hybrid geometry- and image-based approach," Computer Graphics, SIGGRAPH 96 Conference Proceedings, August 1996, pp. 11-20.
  4.  Elder, J. H. and R. M. Goldberg. "Image Editing in the Contour Domain," Proc. IEEE: Computer Vision and Pattern Recognition, pp. 374-381, June, 1998.
  5.  Gleicher, M. "Image Snapping," Computer Graphics, SIGGRAPH 95 Conference Proceedings, August 1995, pp. 183-190.
  6.  Gao, P. and T. W. Sederberg. "A work minimization approach to image morphing," The Visual Computer, Vol. 14, pp. 390-400, 1998.
  7.  Kass, M., A. Witkin, and D. Terzopoulos. "Snakes: Active Contour Models," Int. Journal of Computer Vision, Vol. 1, No. 4, pp. 321-331, January 1988.
  8.  Lee, S. Y., K. Y. Chwa and S. Y. Shin. "Image Metamorphosis Using Snakes and Free-Form Deformations," Computer Graphics, SIGGRAPH 95 Conference Proceedings, August 1995, pp. 439-448.
  9.  Mortensen, E. N. and W. A. Barrett. "Intelligent Scissors for Image Composition," Computer Graphics, SIGGRAPH 95 Conference Proceedings, August 1995, pp. 191-198.
  10.  Mortensen, E. N. and W. A. Barrett. "Interactive Segmentation with Intelligent Scissors," Graphical Models and Image Processing, Vol. 60, No. 5, pp. 349-384, September 1998.
  11.  Mortensen, E. N. and W. A. Barrett. "Toboggan-Based Intelligent Scissors with a Four Parameter Edge Model," in Proc. IEEE: Computer Vision and Pattern Recognition, Vol. II, pp. 452-458, June 1999.
  12.  Ngo, C. W., T. C. Pong and R. T. Chin. "Detection of Gradual Transitions through Temporal Slice Analysis," in Proc. IEEE: Computer Vision and Pattern Recognition, Vol. I, pp. 36-41, June 1999.
  13.  Sederberg, T. W., P. Gao, G. Wang and H. Mu. "Shape blending: An intrinsic approach to the vertex path problem," Computer Graphics, SIGGRAPH 93 Conference Proceedings, August 1993, pp. 15-18.
  14.  Seitz, S. M. and K. N. Kutulakos. "Plenoptic Image Editing," IEEE Int. Conf. on Computer Vision Proceedings, pp. 17-24, January 1998.
  15.  Szeliski, R. "A Multi-View Approach to Motion and Stereo," IEEE Computer Vision and Pattern Recognition Proceedings, Vol. I, pp. 157-163, June 1999.
  16.  Yu, Y. and J. Malik. "Recovering Photometric Properties of Architectural Scenes from Photographs," Computer Graphics, SIGGRAPH 98 Conference Proceedings, July 1998, pp. 207-217.


Eric Mortensen is a Ph.D. student in computer science at Brigham Young University. He works under the direction of Dr. William Barrett in the Computer Graphics, Vision, and Image Processing Laboratory, a Utah state sponsored Center of Excellence. Eric’s research interests include interactive vision and graphics techniques, image and video segmentation and composition and image-based modeling.

Eric N. Mortensen
3361 TMCB
Computer Science Dept.
Brigham Young University
Provo, UT 84602-6576

Tel: +1-801-378-7605

The copyright of articles and images printed remains with the author unless otherwise indicated.