A Practical Approach to Motion Capture: Acclaim's optical motion capture system

Wes Trager, Advanced Technologies Group

Reference: "Character Motion Systems", SIGGRAPH 94: Course 9

Table of Contents

  1. Introduction
    1. Why motion capture?
    2. What are the requirements of a motion capture system?
  2. A brief history of the quest for motion capture
  3. An overview of current input systems
    1. Prosthetic
    2. Acoustic
    3. Magnetic
    4. Optical
  4. Skeletal Animation
    1. Inverse kinematics.
  5. Acclaim's System approach
    1. Knowledge-based system
    2. Sensors - how many, and where?
    3. The capture area.
    4. Capturing motion using multiple actors
    5. Editing motion
    6. Using captured motion - porting data sets
  6. Character Design
    1. Skin choices - single-skin vs. ball-joint.
    2. Secondary actions
    3. Complex secondary actions - Gestural animation.
    4. Direct gestural animation
    5. Interpretive gestural animation
    6. Lip-syncing - a special case of gestural animation
  7. The Skeleton
  8. Future issues.
  9. Appendix A: The Acclaim Skeleton File Format
  10. Appendix B: Acclaim Motion-Capture System - Technical Issues

Introduction

Why motion capture?

Motion capture has some very useful applications for many types of users. Its purpose is not simply to duplicate the movements of an actor, as some people have naively stated. Hopefully, in the course of these notes, we will communicate the fallacies inherent in statements such as "Why use motion capture when you could just have used the actor anyway. It's just a fad and will soon pass".

Motion capture allows you to:

What are the requirements of a motion capture system?


A brief history of the quest for motion capture

Almost from birth, human beings are experts at recognizing human movement. We know when something looks "right". Though artists' and animators' renderings of human movement may be artistically successful, they will fail the "Reality Test". Human beings perceive visual information in this order:

with movement being the strongest, since it is instinctively the most important for our survival.

Motion capture, as a process of taking a human being's movements and recording them in some fashion, is really nothing new. In the late 1800s several people (most notably Marey and Muybridge, with the help of photography) analyzed human movement for medical purposes, and also for the military.

In the early 20th century, motion capture techniques were used in traditional 2D cel animation by Disney and others. Photographed motion was used as a template for the artist/animator, who traced individual frames of film to create individual frames of drawn animation. This technique is generally referred to as rotoscoping. In the mid-1980s the type of motion capture used was really an extension of rotoscoping, where an actor's movements were filmed from more than one view. Witness points (i.e., markers) attached to the subject and visible on the film were then manually encoded as corresponding points on the 3D representation of the character in the computer. This process is sometimes called photogrammetry.

Towards the end of the 1980s motion capture as we know it today began to appear, with software that algorithmically applied captured witness points to some form of 3D object. At the same time, "skeletal systems" using a knowledge-based digital skeleton first appeared in the university research environment, gradually making their way into commercial turnkey animation systems.


An overview of current input systems

Prosthetic

This is one of the early methods for capturing motion from various parts of the human anatomy. These methods include simple "on/off" motion detection systems as well as complex motion tracking systems. The latter type of prosthetic motion capture could be an ideal approach were it not for the complex mechanical requirements and the performance-inhibiting qualities generally associated with such designs. However, the data provided can be clean rotational data collected in real time without any occlusion problems. This method is based on a set of armatures which must be attached all over the performer’s body. The armatures are connected to each other by a series of rotational and linear encoders. These encoders are in turn connected to an interface that can read all of them simultaneously in order to prevent data skewing. Finally, the performer’s motion can be analyzed through a set of trigonometric functions. The mechanical restrictions seem to be quite difficult to overcome, and will probably limit the use of these types of devices for character animation.

Acoustic

Acoustic capture is another method currently used for performance capture. This method involves the use of a triad of audio receivers. An array of audio transmitters is strapped to various parts of the performer’s body. The transmitters are sequentially triggered to output a "click", and each receiver measures the time it takes for the sound to travel from each transmitter. The three calculated distances, one per receiver, are triangulated to provide a point in 3D space. An inherent issue with this approach is the sequential nature of the position data it creates. In general, we would like to see a "snapshot" of the performer’s skeletal position rather than a time-skewed data stream. This position data is typically applied to an inverse kinematics system, which in turn drives an animated skeleton.
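The time-of-flight measurement described above can be sketched in a few lines. This is only an illustration, not the method of any particular acoustic system: the receiver layout, the speed-of-sound constant and the function names are all assumptions made for the example. Three distances leave a mirror ambiguity across the plane of the receivers, so the sketch assumes the performer is on the positive-z side of that plane.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second in air at about 20 C (assumed value)


def distance_from_delay(delay_s):
    """Convert a measured click travel time (seconds) to a distance (metres)."""
    return SPEED_OF_SOUND * delay_s


def trilaterate(d, i, j, r1, r2, r3):
    """Locate a transmitter from its distances to three receivers.

    Assumed receiver layout (hypothetical, chosen to keep the algebra short):
      receiver 1 at (0, 0, 0), receiver 2 at (d, 0, 0), receiver 3 at (i, j, 0).
    r1, r2, r3 are the distances to receivers 1, 2 and 3.
    Returns the solution with z >= 0.
    """
    x = (r1 * r1 - r2 * r2 + d * d) / (2.0 * d)
    y = (r1 * r1 - r3 * r3 + i * i + j * j) / (2.0 * j) - (i / j) * x
    z_squared = r1 * r1 - x * x - y * y
    # Measurement noise can push z_squared slightly below zero.
    z = math.sqrt(max(z_squared, 0.0))
    return (x, y, z)
```

Because the transmitters click one after another, each point recovered this way belongs to a slightly different instant, which is the time skew noted above.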

One of the big advantages of this method is the lack of occlusion problems normally associated with optical systems. However, there seem to be several negative factors associated with this method that may or may not impede its use. First, there is the fact that the cables can be a hindrance to various types of performances. Second, the current systems do not support enough transmitters to accurately capture the personality of the performance. Third is the size of the capture area, which is limited by the speed of sound in air and the number of transmitters. In addition, the accuracy of this approach can sometimes be affected by spurious sound reflections.

Magnetic

This is a popular method used for performance capture. Magnetic capture involves the use of a centrally located transmitter and a set of receivers which are strapped onto various parts of the performer’s body. These receivers are capable of measuring their spatial relationship to the transmitter. Each receiver is connected to an interface that can be synchronized so as to prevent data skew. The resulting data stream consists of 3D positions and orientations for each receiver. This data is typically applied to an inverse kinematics system to drive an animated skeleton. Like the acoustic method, the magnetic approach is free of occlusion problems. But it also shares the same negative factors, such as the hindrance of cables, the lack of sufficient receivers and the limited capture area. In addition, being magnetic, the system is affected by any sizable areas of metal in the vicinity of the capture area, such as girders, posts, etc.

Optical

Optical systems have become quite popular over the last couple of years. These systems can offer the performer the most freedom of movement since they do not require any cabling. Optical systems incorporate directionally-reflective balls referred to as markers which attach to the performer. Optical systems require at least three video cameras, each of which is equipped with a light source that is aligned to illuminate the field of view for that camera. Each camera is in turn connected to a synchronized frame buffer. The computer is presented with each camera view in order to calculate a 3D position of each marker; the resulting data stream therefore consists of 3D position data for each marker. This data is typically applied to an inverse kinematics system, to animate a skeleton.
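To make the 3D position calculation concrete, here is a minimal sketch of one common way to recover a marker position from two camera views: each camera contributes a ray from its center through the marker's image, and the marker is taken to be the midpoint of the shortest segment joining the two rays. The source does not say which reconstruction method optical systems use, so the technique, the function name and the two-camera simplification are all assumptions for illustration; a real system would combine three or more calibrated cameras.

```python
def closest_point_between_rays(c1, d1, c2, d2):
    """Midpoint of the shortest segment between two 3D rays.

    c1, c2: camera centers; d1, d2: ray directions toward the marker.
    All arguments are (x, y, z) tuples. Hypothetical helper for illustration.
    """
    def dot(u, v):
        return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]

    w = (c1[0] - c2[0], c1[1] - c2[1], c1[2] - c2[2])
    a = dot(d1, d1)
    b = dot(d1, d2)
    c = dot(d2, d2)
    d = dot(d1, w)
    e = dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; the marker cannot be located")
    s = (b * e - c * d) / denom  # parameter along the first ray
    t = (a * e - b * d) / denom  # parameter along the second ray
    p = tuple(c1[k] + s * d1[k] for k in range(3))
    q = tuple(c2[k] + t * d2[k] for k in range(3))
    return tuple((p[k] + q[k]) / 2.0 for k in range(3))
```

With noisy image measurements the two rays rarely intersect exactly, which is why the midpoint, rather than a true intersection, is returned.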

One typical problem with optical systems is the fact that it is quite easy for the performer to occlude, or hide, one or more markers thus creating "holes" in the data stream. This occlusion problem can be minimized by adding more cameras and/or more markers. However, adding more cameras makes tracking each marker more complex, resulting in increased CPU time. Increasing the number of markers can result in exponentially increasing the "confusion factor", i.e. keeping track of which marker is which. Optical systems are also limited by the resolution of the cameras and the sophistication of their tracking software.


Skeletal Animation

The practice of using a skeleton to control a 3D character has become very popular over the last few years, appearing in virtually all animation systems on the market. Skeletal animation allows the artist to easily position and control the rotation points of a 3D character. The artist can concentrate on animating the character using the skeletal system; he can then create a geometric "skin" (representing how the character should appear) and attach it to the animated skeleton.

Skeletal systems are hierarchical in nature and allow control of the character in an efficient manner. Skeletal systems with user-friendly and powerful interfaces can be excellent environments for the artist to control the complex algorithms involved in animating single-skin joint structures.

The availability of this kind of skeletal animation environment typically provides another benefit - the use of inverse kinematics to animate a character.

Inverse kinematics.

Inverse kinematics was a great breakthrough for 3D character animation, providing a "goal-directed" approach to animating a character. It allows the artist to control a 3D character's limbs by treating them as a mechanical linkage, or kinematic chain. Control points, connected to the ends of these chains, allow the entire chain to be manipulated by a single "handle". In the case of an arm, the handle would be the hand at the end of the arm.

These control points can also be driven by external data. In other words, inverse kinematics allows the artist to design a skeleton structure that can be driven from data sets created by a motion capture system. Typically these data sets are made up of points moving in 3D space, that represent frame-by-frame positions of sensors placed on the performer. Thus even capture systems with a limited number of sensor points can animate a more complex structure through the use of inverse kinematics.
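As a small illustration of a handle driving a kinematic chain, the sketch below solves a planar two-bone arm analytically: given a target position for the hand, it returns shoulder and elbow angles that place the hand there. The planar simplification, the bone lengths and the function names are assumptions for the example; commercial systems solve longer chains in 3D with their own methods.

```python
import math


def two_bone_ik(upper_len, lower_len, target_x, target_y):
    """Return (shoulder, elbow) angles in radians for a planar two-bone chain.

    The shoulder sits at the origin. The elbow angle is the bend of the
    lower bone relative to the upper bone. Hypothetical helper.
    """
    dist_sq = target_x * target_x + target_y * target_y
    dist = math.sqrt(dist_sq)
    if dist > upper_len + lower_len or dist < abs(upper_len - lower_len):
        raise ValueError("target is out of reach for this chain")
    cos_elbow = (dist_sq - upper_len ** 2 - lower_len ** 2) / (2.0 * upper_len * lower_len)
    cos_elbow = max(-1.0, min(1.0, cos_elbow))  # guard against rounding error
    elbow = math.acos(cos_elbow)
    shoulder = math.atan2(target_y, target_x) - math.atan2(
        lower_len * math.sin(elbow), upper_len + lower_len * math.cos(elbow))
    return shoulder, elbow


def hand_position(upper_len, lower_len, shoulder, elbow):
    """Forward kinematics: where the hand ends up for the given angles."""
    x = upper_len * math.cos(shoulder) + lower_len * math.cos(shoulder + elbow)
    y = upper_len * math.sin(shoulder) + lower_len * math.sin(shoulder + elbow)
    return x, y
```

Feeding a captured wrist position in as the target, frame by frame, is the sense in which a single sensor can animate a whole limb.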

However, if we attempt to drive the control points or handles directly with this data we quickly discover that this model does not come close to representing how the actual bones of the actor behave. The fact is that we are only monitoring how the skin is influenced by the actor's bone movements. The actor's real skeleton structure is extremely complex and not easily duplicated mathematically. Even more complex is its influence on the skin, combined with the effect of the underlying musculature.

The problem is typically resolved by loosely connecting the data set to the control points. We then allow the data set to influence the control points, but impose constraints and limits on the motion of the skeletal structure that the inverse kinematics solution must obey. Even if our 3D data was extremely accurate, there is such a fundamental difference between what this data represents and the complexity of what we are trying to model, that the end result is not very accurate at all.
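One very simple way to picture this loose coupling is at the level of a single joint angle: the data pulls the joint part of the way toward the value it asks for, and a joint limit then caps the result. The sketch below is an assumption-laden illustration only; the influence weight, the limit values and the function names are hypothetical, and real systems apply constraints inside the inverse kinematics solver itself.

```python
def clamp(value, low, high):
    """Restrict value to the closed range [low, high]."""
    return max(low, min(high, value))


def couple_joint_angle(current_deg, requested_deg, influence, limits_deg):
    """Move a joint angle toward the angle requested by the data, then apply limits.

    influence: 0.0 ignores the data, 1.0 follows it completely.
    limits_deg: (low, high) range the joint is allowed to reach.
    """
    low, high = limits_deg
    blended = current_deg + influence * (requested_deg - current_deg)
    return clamp(blended, low, high)
```

For example, an elbow limited to the range 0 to 150 degrees stops at 150 even when the data requests 200.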

As we increase the number of points captured, we depend less on inverse kinematics and more on the loose coupling mechanism. If we were able to track enough sensors, we would have no need for inverse kinematics. On the other hand, the coupling influence calculations would be intense, to say nothing of the task of connecting all these points to a skeleton. The amount of data required to explain a movement can be substantial when dealing with a large number of capture points.

After around 30 points, it becomes more economical to store movement data as a series of bone rotations rather than the raw 3D data set. In addition, rotation data will drive the skeleton directly, thereby eliminating the coupling influence calculations and significantly speeding up the skin animation process.

Even though we are dealing with vastly less data using a rotational solution than with a positional one, it is important to understand the complexity of the real data we are now gathering versus what you get from the inverse kinematics solution. Graphing rotational data for a given joint, as we have done below for the root node of a skeleton, shows how much detail is actually in real motion. Unlike positional data, which may well improve with some judiciously applied smoothing, rotational data should NOT be smoothed, since that would eliminate the frame-to-frame variational detail that gives the motion its realism and accuracy.


Acclaim's System approach

Acclaim's motion capture process is an optical one, currently utilizing a four-camera setup. Being optical, it is a tetherless system - the actor's natural motions are not hampered by wires or any other encumbrances.

Captured data is based on bone rotations, rather than on joint positions. This technique provides the most realistic and accurate body motion, without the need for an inverse-kinematic solution to derive rotational data from positional data. The data is derived in the actual capture process, using biomechanical knowledge about the way skeletal systems behave.

In addition, there are extensive editing tools available to massage the motion data after capture, providing further creative control over the final motion.

Besides providing a brief overview of the system, this discussion will focus not so much on the mechanics of the actual process, but on the issues involved in using a system like this, and in dealing with the resultant data in a practical fashion.

A knowledge-based system

The Acclaim system uses an expert-system approach during the tracking and analysis phase of capture, utilizing a knowledge-base of biomechanical information about how the human body behaves in motion. This body of knowledge includes information about limits, constraints, mass, moments of inertia, etc. The data from the capture session is manipulated by this system to produce a solution for the move.

The tracking process automatically tracks the sensor points on the actor from the multiple views provided by the four cameras. Generally at least two views are necessary at a time. This allows for some level of redundancy in the tracking process, since there will be many occasions where particular sensors are occluded from a given camera's view, by objects in the scene or other parts of the actor's body. However, sensors can be tracked from only one camera, and even if total occlusion occurs and points are no longer visible, this can be handled for short durations by running the system forwards and backwards around that area. In other words, as long as the points become visible again further along in the motion, the tracking algorithms will successfully maintain tracking through the occluded section.
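The idea of running forwards and backwards around an occluded section can be illustrated with a deliberately simple stand-in: predict forward from the last visible samples, predict backward from the first samples after the gap, and blend the two. This is not the tracking algorithm used in our system, which relies on biomechanical knowledge; the constant-velocity prediction and the function name below are assumptions made purely for illustration.

```python
def fill_gaps(track):
    """Fill interior runs of None in a list of samples for one coordinate.

    A gap is filled only if two visible samples exist on each side of it.
    Forward and backward constant-velocity predictions are cross-faded.
    """
    filled = list(track)
    n = len(track)
    i = 0
    while i < n:
        if track[i] is not None:
            i += 1
            continue
        start = i
        while i < n and track[i] is None:
            i += 1
        end = i  # index of the first visible sample after the gap, or n
        has_before = start >= 2 and track[start - 2] is not None
        has_after = end + 1 < n and track[end + 1] is not None
        if not (has_before and has_after):
            continue  # not enough context; leave the gap unfilled
        v_forward = track[start - 1] - track[start - 2]
        v_backward = track[end] - track[end + 1]  # step when moving back in time
        gap = end - start
        for k in range(gap):
            forward = track[start - 1] + v_forward * (k + 1)
            backward = track[end] + v_backward * (gap - k)
            w = (k + 1) / (gap + 1)  # trust the backward pass more near the gap's end
            filled[start + k] = (1.0 - w) * forward + w * backward
    return filled
```

Applied per axis to a marker's x, y and z samples, this bridges short dropouts; long gaps would need the kind of skeletal knowledge described above.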

Needless to say, the algorithms involved are extremely complex and based on many years of scientific and medical research. The difference here is that the solution reached at the end of the tracking process is based on an understanding of how the human skeleton behaves, and can accurately reproduce very sophisticated and complex motions, even at the most demanding areas like the shoulder.

Sensors - how many, and where?

The number of sensors or markers used in the capture process clearly has a major impact on the quality and realism of the resulting motion. In a simplistic sense, one can say that the more sensors used, the better and more accurate the motion - "more is more" in this case. At Acclaim, we have used up to 140 sensors in a capture session (across two actors); typically we will use around 50-60 sensors per actor. The nature of the movement being captured will to some degree dictate how many sensors to use. For instance, if there is to be little or no individual finger motion, then fewer sensors are needed on the hands.

The number of sensors is tied into the time taken to process the tracking algorithms; in fact, having a larger number of sensors can make it easier for the tracking algorithms, resulting in a stable solution in a shorter time.

However, one of the main reasons for having multiple sensors in a capture session has to do with getting around occlusion problems on the "set" where the capture is taking place. In other words, there need to be enough markers in use so that, even if the view from one camera is occluded, there is sufficient sensor data from the other cameras to provide a viable solution.

Placement of sensors is based mainly on joint position, i.e., where we want to record rotational information. However, they should also be positioned to facilitate viewing by the capture cameras. In actual fact there is a lot of latitude in where exactly to place the sensor - the important thing is that the location of the sensor on the subject doesn't move around during capture. Even though the sensors are placed on the outside of the body, i.e., on the skin or perhaps the clothing, the sensor position is computed as being at the actual joint by the biomechanical expert system.

Finally, there are individual calibration issues to consider, in terms of the length of time it takes to suit up and calibrate for different bodies. Ideally the entire process should take less than five minutes, and the calibration phase should be near instantaneous.

The capture area

In practical terms, the capture area ends up being defined as the maximum pixel area resolvable by the cameras accurately, based on the existing lighting conditions. The typical capture area is 12 by 14 by 9 feet, with four cameras covering the area in equal proportion. This is satisfactory for much basic motion, stunt interaction etc. and is fairly typical for several of the other capture systems also. This arrangement is in some senses dynamic, in that the cameras can be rearranged to suit an aerial capture as opposed to a scenario where most of the action occurs on the ground.

But sometimes, the nature of the action requires that a move be performed over a longer distance (and/or height), and cannot be done in pieces and joined later. For this reason, we employ the notion of capture zones.

Zones are independent areas of capture "space", each with their own camera setup, that are linked into a common tracking process so motion can be continuously tracked from one zone into another. This allows very complex and lengthy moves to be captured, without the effort of piecing together separate sequences later using motion-editing. In addition, from the director's point-of-view, there is no loss of continuity or inspiration for the actors, since the motion is performed as one continuous "take".

The camera setup can be optimized for the particular action taking place in that zone; for instance, cameras can be positioned to capture known "hidden" motion - i.e. actions not likely to be seen by the standard camera setup.

It is important, in a zoned system such as this, that calibration of the capture space be performed quickly, easily and reliably from zone to zone. Tools exist in our system to provide fast and accurate calibration of the space; this calibration only needs to change if camera positions change. Calibration of this type is non-trivial, since you need to be able to correct for different lens geometries.

Capturing motion using multiple actors

Wherever synthetic "actors" are used, their actions will often be composited in with those of a live actor. One of the problems with this technique is the difficulty the live actor has in talking to - and interacting with - nothing: the "Roger Rabbit" problem. Allowing the actor to interact with another person results in a better performance, often in far less time.

Most capture systems will only allow a single actor to be captured at a time. Using more than one actor compounds the problems associated with tethered systems. However, from an acting and choreography point of view, there is no substitute for the dynamics of multiple actors interacting. Directors can coach their actors in exactly the same way they would for any other type of filming; stunt men can fight, block moves etc., the same way they would normally.

Some types of interaction between characters, especially involving physical contact, can only be captured using multiple actors simultaneously. Shaking hands, "high 5" hand contact, arm-wrestling and tug-of-war-style rope pulling are just a few examples.

In the following illustration, two performers have been captured during a tug-of-war session; the dynamics of their movements would be difficult to achieve without the use of simultaneous capture.

Editing motion

Capturing motion in the fashion we have been discussing, with many sensors and sophisticated tracking algorithms, can provide final motion that is completely usable direct from the capture process. So, why would we need to edit the motion?

In fact, capturing an actor's movements and using them "as is" is really only part of the potential for using motion capture for character animation. There are many ways one would like to use captured motion as the "base" to work from, further stylizing and altering that motion to create the desired result.

A simple example would be to take the captured motion of an actor and alter it in a gross fashion, such as applying a rotation or translation to the move, so that the character takes a different path through the environment.
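A gross edit of this kind amounts to applying one rigid transform to the root's path. The sketch below rotates a sequence of root positions about the vertical axis and then translates them; it assumes a Y-up coordinate system and rotation about the world origin, and the function name is hypothetical.

```python
import math


def redirect_path(root_positions, yaw_degrees, offset):
    """Rotate root positions about the vertical (Y) axis, then translate them.

    root_positions: list of (x, y, z) tuples, one per frame.
    offset: (x, y, z) translation applied after the rotation.
    """
    angle = math.radians(yaw_degrees)
    c, s = math.cos(angle), math.sin(angle)
    ox, oy, oz = offset
    moved = []
    for x, y, z in root_positions:
        moved.append((c * x + s * z + ox, y + oy, -s * x + c * z + oz))
    return moved
```

The root's own heading rotation must be offset by the same yaw, otherwise the character would travel along the new path while still facing the old direction.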

In a similar vein, let's say we would like to take motion from one actor and apply it to another character. Here there are potential problems of scale, since the actors may well be sized or built differently -- this can become extreme when a normal actor's moves need to be applied to a character such as a dwarf or a Barbie doll, where the relative proportions of the two characters are vastly different.
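When motion is stored as bone rotations, the rotations themselves can often be carried over unchanged, and the scale problem shows up mainly in the root translation. A common first-order fix, sketched below, is to scale the root's translation by the ratio of leg lengths so that the smaller character covers proportionally less ground. This heuristic and the function name are assumptions for illustration, and it does not address differences in proportion between limbs.

```python
def retarget_root_translation(root_positions, source_leg_length, target_leg_length):
    """Scale per-frame root positions by the target/source leg-length ratio."""
    scale = target_leg_length / source_leg_length
    return [(x * scale, y * scale, z * scale) for x, y, z in root_positions]
```

For a character with legs half as long as the actor's, every stride then covers half the distance and the hips ride at half the height.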

Suppose we wanted to transfer the motion of our actor to an inanimate object, such as a chair. Here we have the problem of motion from a human character (two legs) applied to an object of four legs. Yet we still want the "upper-body" motion to be the same.

Crowd scenes can be created by duplicating motion n times for crowd members, with slight variations.

Often, the captured motion is right, but some particular element needs to be altered or exaggerated. Inverse kinematics techniques can be used to affect the motion; for instance, lifting the hands of a runner at the end of his run would raise his arms enough to create more drama as he breaks the tape, while still maintaining the realism of the captured motion as the major component of the move.

As we can see, there needs to exist a suite of tools able to dissect and manipulate motion data sets in a variety of ways, to cut and paste motions together seamlessly, and to interpolate one motion into another without losing the essential nature of each motion.

Using captured motion - porting data sets.

Motion data is all very well on its own, but what do you do with it once you've captured it, i.e. how do you use it in an animation? Whether the motion capture system is a proprietary one or commercially available, the motion data has to be useful to the widest number of potential users. In order to achieve this, it has to be easily input into one or more of the commercially-available animation systems.

Porting data sets into animation systems is a big issue. Every system out there has its own set of problems. Each system has its own approach to skeletal animation and skin manipulation. Some systems are polygonal, some patch based. Data formats differ, especially in terms of describing animation and skeletal structure, though there are common formats for object data understood by all. Some allow real-time input of data, some only through files. Some systems readily adapt to rotational data, others can only take in positional data.

Once the data is in there, other issues arise. For instance, in what order are rotations evaluated? Rotation ordering is in fact one of the subtlest problems one can encounter when porting data sets, since motion can appear to be correct for an object for much of a move; this is especially true with rotation ordering for the root node of the skeleton. For instance, a character can walk along fine, turn and then flip 360 degrees, because the XYZ rotation order is swapped. Differences in rotation ordering for leg bones can result in feet passing through the floor.
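The effect of rotation ordering is easy to demonstrate: the same two angles applied in a different order send a point to a different place. The sketch below uses plain 3x3 matrices acting on column vectors; the helper names are ours, and no particular animation system's convention is implied.

```python
import math


def rot_x(degrees):
    """Rotation matrix about the X axis."""
    a = math.radians(degrees)
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_z(degrees):
    """Rotation matrix about the Z axis."""
    a = math.radians(degrees)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def mat_mul(a, b):
    """Matrix product a * b for 3x3 matrices stored as nested lists."""
    return [[sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def apply(m, v):
    """Apply matrix m to column vector v."""
    return tuple(sum(m[r][k] * v[k] for k in range(3)) for r in range(3))


# "X then Z" means the X rotation acts on the vector first.
x_then_z = mat_mul(rot_z(90.0), rot_x(90.0))
z_then_x = mat_mul(rot_x(90.0), rot_z(90.0))

point = (1.0, 0.0, 0.0)
print(apply(x_then_z, point))  # approximately (0, 1, 0)
print(apply(z_then_x, point))  # approximately (0, 0, 1)
```

For small angles the two orderings nearly agree, which is why a mis-ordered data set can look correct until the character makes a large turn.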

However, some systems do not allow independent ordering of rotations. This can be simulated by placing zero-length joints in the hierarchy and distributing the rotations amongst them. This works well, but it makes selecting the appropriate joint difficult and confusing for the user, especially when adding inverse kinematics to the motion in the vendor's system.

Besides ordering, there are issues associated with orientation of the rotation axes relative to the bone; some systems specify bone direction as X, some Y, some Z, some arbitrarily. Do the axes rotate with the bone, or remain fixed relative to its orientation - i.e. are the rotations performed globally, or local to the bone? Does the bone inherit rotations from its neighbor up the hierarchy, or is its rotation relative to the origin?

These issues all need to be considered when porting data to existing animation systems, and in fact must be handled prior to inputting the data, since typically there are insufficient tools in the animation system itself to adjust the data and correct for differences in ordering, coordinate systems etc. The complexity of the data involved and its interrelationships can be immense. For each bone in the body, there will be some number of tracks of single-axis rotation data, with vast amounts of complex interrelationships between these tracks. Very few animation systems on the market will allow the user any kind of decent control over that complexity.


Character Design

Skin choices - single-skin vs. ball-joint.

The skin surface used to define a character is one of the major elements where the motion capture process can either succeed or fail. However good the skeletal motion is, if the skin model is badly made or does not behave properly, the final animated motion will never truly look good.

The user typically has a choice between polygon or spline-surface based approaches to skin creation. Splines have obvious qualities such as needing fewer objects to define simple forms, smooth edges during rendering and sophisticated tools for blending objects together. We feel, however, that the nature of a complex form such as the human body with all its wrinkles and folds (especially if clothed) is best modeled using polygons. The same effect could be achieved using splines, but you would end up using almost as many control vertices as polygon vertices, thus defeating one purpose of using splines.

In 3D modeling for character animation, at least as regards polygonal objects, models can be divided up roughly into two broad categories - single-skinned, continuous surfaces and hierarchically-structured models made from separate 3D objects (e.g. a ball-jointed figure). Today, most 3D systems will provide mechanisms for creating and manipulating both, though with varied success for the single-skinned variety.

Of course, these two are not mutually exclusive; frequently there will be a requirement for multiple single skins structured into a hierarchy, especially where objects need to be picked up and put down, or where there is additive clothing such as armor.

Using a single-skin surface object to define the skin has a number of advantages. A single skin can be pushed, pulled and sculpted into the final form much as a traditional sculptor does with clay. It is ideally suited to complex organic shapes like the human form.

Firstly, because it is one continuous surface, there are no seams where the sub-objects in the hierarchy meet. In other words, there will be no problems with individual objects interpenetrating each other where they meet. This has traditionally been the problem with hierarchically-structured models, especially when attempting to realistically model the human form. (Clearly this is not a problem if the desired character is not intended to be a realistic human character, but a robot, for instance.)

Secondly, the wealth of subtle and complex deformations that occur where the arm or leg bends, with the associated folding and creasing of skin, cannot be realistically achieved using separate objects for the limb parts. This becomes very obvious when texturing the object (see below).

In some 3D systems, techniques exist to handle multiple level-of-detail versions of objects. Skeletal motion would deform a simple version of a polygonal skin, which in turn would be used to control a more complex "smoothed" version of itself. Changes in vertex position in the "control" object automatically modify vertex positions (in a weighted fashion) on the target object.
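The weighted link between a control object and its detailed target can be sketched as follows: each vertex of the detailed skin stores weights for a few control vertices and is displaced by the weighted sum of their movements. The data layout and function name are assumptions for illustration; the 3D systems referred to above each have their own schemes.

```python
def drive_detail_mesh(detail_rest, control_rest, control_posed, weights):
    """Displace detail vertices by weighted control-vertex movement.

    detail_rest:   rest positions of the detailed skin, list of (x, y, z).
    control_rest:  rest positions of the simple control skin.
    control_posed: control skin positions after skeletal deformation.
    weights:       one dict per detail vertex, {control_index: weight},
                   with the weights of each vertex summing to 1.
    """
    posed = []
    for vertex, influences in zip(detail_rest, weights):
        dx = dy = dz = 0.0
        for index, weight in influences.items():
            rx, ry, rz = control_rest[index]
            px, py, pz = control_posed[index]
            dx += weight * (px - rx)
            dy += weight * (py - ry)
            dz += weight * (pz - rz)
        posed.append((vertex[0] + dx, vertex[1] + dy, vertex[2] + dz))
    return posed
```

Only the simple control skin needs to follow the skeleton every frame; the detailed skin is updated from it in a single pass.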

In the rendering phase, using a single-skin character has tremendous advantages in texturing. Texture can be applied over the entire surface without the problems associated with abutting textures and the resultant shading artifacts. Textures can be made to bend and stretch in a realistic fashion, especially in the folds and creases that occur where limbs bend (elbows and knees), and where muscles flex, like the bicep. The face, of course, is a prime example where this is the case.

Secondary actions

Another advantage to the single-skin approach is the ability to layer motions together in an additive fashion, to create more complex animation. When combined with captured motion, we can refer to these as secondary actions; in other words, additional motions that enhance the primary (captured) motion in some way.

These secondary actions can be as simple or as complex as required. For instance, animating a breathing behavior for a chest could be layered on top of any captured motion for a figure. If these secondary actions are driven by the motion itself, the effect is much more realistic. For example, the bicep swell that occurs when an arm is bent can be linked directly to bone rotation, so that as the arm bends, the bicep swells. The secondary action does not occur unless the arm bends.
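A secondary action driven by the motion itself can be as simple as a function from a bone rotation to a deformation amount. The sketch below maps the elbow bend to a bicep swell factor between zero and a maximum, with a smooth ease in and out; the angle range, the easing curve and the function name are all assumptions made for the example.

```python
def bicep_swell(elbow_bend_deg, max_bend_deg=150.0, max_swell=1.0):
    """Return a swell amount driven by how far the elbow is bent.

    0 degrees (straight arm) gives no swell; max_bend_deg gives max_swell.
    A smoothstep curve avoids a sudden start or stop of the bulge.
    """
    t = max(0.0, min(1.0, elbow_bend_deg / max_bend_deg))
    return max_swell * t * t * (3.0 - 2.0 * t)
```

Because the swell is computed from the rotation data each frame, it appears automatically on any captured move in which the arm bends, and never when it does not.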

Physically-based and procedural methods for creating secondary actions can also be used. For instance, a cape attached to a figure can flow out behind him, moving as he moves. This is a relatively new area for character animation, though there are a number of tools starting to appear in commercial systems for performing this kind of action.

All of these ideas embody the basic principle of putting the "behavior" of a skin onto the skeleton, irrespective of its actual motion. This means that when motion is applied the skin behaves in a certain way which is different from another skeleton/skin combination using the same motion.

Complex secondary actions - Gestural animation.

Often we would like to combine some very sophisticated secondary actions with our main captured motion, such as facial animation or intricate hand/finger actions. These actions are often referred to as "gestural animation". Gestural animation can be performed using motion capture, either live or as a separate "delayed" capture that is later combined with the main capture - these are the direct form of gestural animation. Alternatively gestural animation can be interpretive in nature, using the motion capture as a guide for how to affect the object being animated.

Direct gestural animation

Live capture allows both main and secondary actions to be captured simultaneously. The same actor must be used for both, which may or may not be desirable. Also, a potentially large number of sensors must be allocated to the gestural capture, thus reducing the number available for primary motion capture.

On the other hand, delayed capture allows a different actor to be used for each capture session. For example, a stunt double could be used for the main action, while the "talent" is used for the close-up facial animation. Since the captures are done separately, there is no need to worry about numbers of sensors; there is complete freedom to use as many sensors as required to accurately capture intricate hand movements or subtle facial expressions. These secondary actions are then "grafted" onto the existing motion, typically during motion editing.
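
Grafting can be pictured as merging the channels of the delayed capture into the frames of the main capture. The sketch below assumes a hypothetical representation in which each frame is a dictionary of channel names to values; real motion editors use their own data structures.

```python
def graft(main, secondary, offset=0):
    """Copy every channel of `secondary` onto `main`, starting at frame `offset`."""
    result = [dict(frame) for frame in main]  # leave the inputs untouched
    for i, frame in enumerate(secondary):
        j = i + offset
        if 0 <= j < len(result):  # ignore frames that fall outside the main take
            result[j].update(frame)
    return result
```

The `offset` argument is where the two sessions are lined up in time, so the facial take can start at any frame of the body take.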

In either case, direct capture is a brute-force way of producing facial or other gestural animation: the result is exactly what the actor performed.

Interpretive gestural animation

Gestural animation can also imply a higher-level notion of what constitutes a gesture or expression. The term "gesture" can be used to imply a grouping of simpler movements into one unit - consider a smile gesture as a composite made up of several independent mouth, cheek, eye and eyebrow movements combined together in various amounts. Those same movements, combined in different percentages, would constitute a different expression entirely, e.g. a frown. These gestures effectively represent various deformations of the object, to achieve certain effects.
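
One way to picture this is to store each gesture as a set of weights over the same low-level movements and to mix gestures by amount. The movement names and weights below are invented for illustration only.

```python
# Hypothetical gestures: each is a weighting of the same low-level movements.
GESTURES = {
    "smile": {"mouth_corners": 1.0, "cheeks": 0.6, "eyelids": 0.3, "brows": 0.2},
    "frown": {"mouth_corners": -0.8, "cheeks": 0.0, "eyelids": 0.2, "brows": -0.7},
}

def mix(amounts):
    """Combine gestures by amount into one activation per low-level movement."""
    total = {}
    for name, amount in amounts.items():
        for movement, weight in GESTURES[name].items():
            total[movement] = total.get(movement, 0.0) + amount * weight
    return total
```

Calling `mix({"smile": 0.5})` gives a half smile, and mixing two gestures sums their contributions to each movement before the deformation is applied to the object.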

A gesture can also be procedurally created by sampling motion input and then determining from that input some recognizable expression. This can be achieved in a number of ways. For instance, the determination can be rule-based, or made using pattern-matching techniques. Once the gestural expression has been identified, the corresponding gestural deformation is applied to the object. In this way, motion input is interpreted to mean whatever the animator wants it to mean.
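
A rule-based determination might look like the following sketch. It assumes the motion input has already been reduced to normalized offsets from a neutral pose; the channel names, the threshold and the rules themselves are hypothetical.

```python
def classify(sample, threshold=0.2):
    """Return a gesture name from normalized sensor offsets (0 = neutral pose)."""
    corners = sample["mouth_corners"]
    brows = sample["brows"]
    if corners > threshold:
        return "smile"
    if corners < -threshold and brows < -threshold:
        return "frown"
    return "neutral"
```

The returned name would then select the gestural deformation to apply, so the same sensor input can be made to trigger whatever expression the animator has assigned to it.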

Lip-syncing - a special case of gestural animation

We have already touched on aspects of facial animation in the previous section, since it is really a special case of gestural animation. What distinguishes facial animation is the need for lip-syncing - the synchronizing of mouth movements to the spoken dialog.

If live or delayed capture is used, then it is (generally) a simple synchronization issue of lining up the start of the dialog with the start of the animation. Of course, it may be the case that the voice-over is performed by a different individual from either the motion actor or the facial actor; in that case, it may be necessary to adjust the timing of either the animation or the audio. At present, several commercially available software packages include tools for syncing dialog to animation.

However, if a higher-level gestural approach is used, then the task becomes somewhat harder. Some sort of connection must be made between a phoneme and an animation shape. By "phoneme" we mean the smallest unit component of the spoken dialog; we use the word loosely to mean either a true phoneme, one of the classic cartoon mouth shapes or some type of user-defined mouth shape. The animation shape the phoneme refers to is some form of gesture, sculpted to match that phoneme shape.

Getting at the phonemes is the hard part, since ideally you want to automate the process of breaking up the sound into its component parts; the alternative is to perform the process manually (or else use a service bureau to do it for you!).

Having identified the phoneme components of the dialog, an automated mechanism is then required to dynamically match them to the appropriate mouth shape. Interpolation tools should also be available, to smooth transitions from one shape into another. In addition, it is often useful to be able to "emphasize" a particular shape (i.e. extrapolate past its default shape), for dramatic purposes.
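
The matching, interpolation and emphasis steps can be sketched as follows. The mouth shapes are reduced to two invented parameters (jaw opening and lip width), and the phoneme labels and values are assumptions made for the example.

```python
# Hypothetical mouth shapes as (jaw_open, lip_width) pairs; values are illustrative.
REST = (0.0, 0.5)
SHAPES = {"AA": (1.0, 0.5), "OO": (0.4, 0.1), "MM": (0.0, 0.4)}

def lerp(a, b, t):
    """Blend two shapes; t outside [0, 1] extrapolates past them."""
    return tuple(x + (y - x) * t for x, y in zip(a, b))

def mouth_at(time, track, emphasis=1.0):
    """Sample a track of (start_time, phoneme) keys with strictly increasing times."""
    # Emphasis scales each shape away from the rest pose before interpolating.
    keys = [(t, lerp(REST, SHAPES[p], emphasis)) for t, p in track]
    if time <= keys[0][0]:
        return keys[0][1]
    for (t0, s0), (t1, s1) in zip(keys, keys[1:]):
        if t0 <= time <= t1:
            return lerp(s0, s1, (time - t0) / (t1 - t0))
    return keys[-1][1]
```

With `emphasis` above 1.0 every shape is pushed past its default, which is one simple way to exaggerate a line of dialog for dramatic effect.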

Again, there are some examples of software that can automate this process in the commercial field, but there is still a lot of room for simple, flexible and powerful solutions to this problem area.

The Skeleton

There are many issues involved in designing and building skeletons that will successfully animate. This is especially true for single-surface polygonal skins, and in particular for animating with motion created through the Acclaim capture process.

For a discussion of these issues, we include two internal technical memos from our R&D department as appendices at the end of these notes. One is a discussion of our skeleton file format, which is available to system vendors (and which we encourage them to adopt!) to allow motion from our system to be easily input into any animation system. The second discusses the technical issues involved.

Future issues.

These are some areas where we see a need for future development of motion capture systems:

We want to be able to substitute a motion capture actor in a scene with his synthetic character, using a plate, and show this character interacting with live actors. We want to be able to do this under any conditions of lighting or encumbrance.

A full-body, real-time system (30 fps) has applications for VR and program making. The data bandwidth for skeletal movement is quite low and should allow easy interaction between systems. This will enable multi-user, real-time, full-body interaction.
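
To give a feel for how low the bandwidth is, here is a rough estimate. It assumes a 30-joint skeleton sending three rotation values per joint plus a three-value root translation, each stored as a 4-byte float; these figures are illustrative, not measurements from any particular system.

```python
def skeleton_bandwidth(joints=30, fps=30, bytes_per_value=4):
    """Bytes per second for 3 rotations per joint plus a 3-value root translation."""
    values_per_frame = joints * 3 + 3
    return values_per_frame * bytes_per_value * fps
```

Under these assumptions the stream is 11,160 bytes per second for one uncompressed skeleton.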



Last changed March 13, 1999, G. Scott Owen, owen@siggraph.org