MIG '22: Proceedings of the ACM SIGGRAPH Conference on Motion, Interaction and Games

SESSION: Animation Synthesis

Toward Believable Acting for Autonomous Animated Characters

This paper describes design principles and a system, based on reinforcement learning and procedural animation, to create an autonomous character capable of believable acting—exhibiting a responsive and expressive illusion of interactive life, grounded in its subjective experience of its world. The design principles incorporate knowledge from animation, human-computer interaction, and psychology, articulating guidelines that, when followed, support a viewer’s suspension of disbelief. The system’s reinforcement learning brain generates action, emotion, and attention signals based on motivational drives, and its procedural animation system translates those signals into expressive biophysical movement in real time. We demonstrate the system on a stylized quadruped character in a virtual habitat. In a user study, participants rated the character favorably on animacy and ability to experience emotions, which is consistent with finding the character believable.

S2M-Net: Speech Driven Three-party Conversational Motion Synthesis Networks

In this paper we propose a novel conditional generative adversarial network (cGAN) architecture, called S2M-Net, to holistically synthesize realistic three-party conversational animations from acoustic speech input together with speaker marking (i.e., the speaking time of each interlocutor). Specifically, we design and train the S2M-Net for three-party conversational animation synthesis on a pre-collected three-party conversational motion dataset. In this architecture, the generator contains an LSTM encoder that encodes a sequence of acoustic speech features into a latent vector, which a transform unit then maps into a gesture kinematics space. The output of the transform unit is fed into an LSTM decoder to generate the corresponding three-party conversational gesture kinematics. Meanwhile, a discriminator checks whether an input sequence of three-party conversational gesture kinematics is real or fake. To evaluate our method, in addition to quantitative and qualitative evaluations, we conducted paired-comparison user studies against the state of the art.
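The encoder–transform–decoder generator and sequence discriminator described above can be pictured with a short, hypothetical PyTorch sketch; layer sizes, the pose dimensionality, and module names are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an encoder-transform-decoder generator and a sequence
# discriminator in the spirit of the S2M-Net layout. All sizes are assumptions.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, audio_dim=64, latent_dim=128, pose_dim=3 * 165):
        super().__init__()
        self.encoder = nn.LSTM(audio_dim, latent_dim, batch_first=True)
        # "Transform unit": maps the speech latent into a gesture-kinematics space.
        self.transform = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.ReLU())
        self.decoder = nn.LSTM(latent_dim, latent_dim, batch_first=True)
        self.out = nn.Linear(latent_dim, pose_dim)  # poses for all three interlocutors

    def forward(self, speech_feats):              # (batch, frames, audio_dim)
        z, _ = self.encoder(speech_feats)         # per-frame latent codes
        z = self.transform(z)
        h, _ = self.decoder(z)
        return self.out(h)                        # (batch, frames, pose_dim)

class Discriminator(nn.Module):
    def __init__(self, pose_dim=3 * 165, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(pose_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)         # real/fake logit per sequence

    def forward(self, gesture_seq):
        h, _ = self.rnn(gesture_seq)
        return self.score(h[:, -1])               # judge the whole sequence
```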

Simulating Fracture in Anisotropic Materials Containing Impurities

Fracture simulation of real-world materials is an exceptionally challenging problem due to complex material properties such as anisotropic elasticity and the presence of material impurities. We present a graph-based finite element method to simulate dynamic fracture in anisotropic materials. We further enhance this model with a novel probabilistic damage mechanics formulation, based on random graphs, for modelling materials with impurities. We demonstrate how this formulation can be used by artists to direct and control fracture. We simulate and render fractures for a diverse set of materials to demonstrate the potency and robustness of our methods.
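One way to picture a probabilistic damage formulation of this kind is to draw per-element fracture thresholds at random and weaken a randomly chosen subset of elements that act as impurities. The sketch below is only an assumed illustration of that idea (the distributions, impurity fraction, and weakening factor are invented), not the paper's random graph formulation.

```python
# Assumed illustration: random per-element fracture thresholds with weakened
# "impurity" sites. Not the paper's formulation; numbers are placeholders.
import numpy as np

def sample_fracture_thresholds(n_elements, base_strength=1.0, spread=0.1,
                               impurity_fraction=0.02, weakening=0.4, seed=0):
    rng = np.random.default_rng(seed)
    thresholds = rng.normal(base_strength, spread, n_elements)
    impurities = rng.random(n_elements) < impurity_fraction   # random impurity sites
    thresholds[impurities] *= weakening                        # impurities fracture sooner
    return np.clip(thresholds, 1e-3, None), impurities
```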

SESSION: Motion Analysis and Perception

A new framework for the evaluation of locomotive motion datasets through motion matching techniques

Analyzing motion data is a critical step when building meaningful locomotive motion datasets. This can be done by labeling and inspecting motion capture data, through a planned motion capture session, or by carefully selecting locomotion clips from a public dataset. These analyses, however, lack a clear definition of coverage, making it harder to diagnose problems such as a virtual character being unable to perform an action or to move at a given speed. The issue is compounded by the large amount of information in motion capture data, which is challenging to interpret. This work provides a visualization and an optimization method to streamline the process of crafting locomotive motion datasets. It offers a more grounded approach to locomotive motion analysis by calculating quality metrics such as coverage in terms of linear and angular speeds, frame use frequency in each animation clip, deviation from the planned path, number of transitions, number of used vs. unused animations, and transition cost.

By using these metrics to compare different motion datasets, our approach provides a less subjective alternative for modifying and analyzing motion datasets, while improving interpretability.
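As a rough illustration of what a speed-coverage metric might look like, the sketch below bins linear and angular root speeds and reports the fraction of bins the data reaches; the bin edges and the assumed input format (per-frame 2D root position and heading) are my assumptions, not the paper's definition.

```python
# Minimal sketch of a linear/angular speed coverage metric for a locomotion
# dataset. Bin edges and pose format are assumptions for illustration.
import numpy as np

def speed_coverage(root_pos, heading, fps=30,
                   lin_bins=np.linspace(0.0, 5.0, 21),
                   ang_bins=np.linspace(-np.pi, np.pi, 21)):
    """root_pos: (frames, 2) in metres; heading: (frames,) in radians."""
    lin_speed = np.linalg.norm(np.diff(root_pos, axis=0), axis=1) * fps
    ang_speed = np.diff(np.unwrap(heading)) * fps
    hist, _, _ = np.histogram2d(lin_speed, ang_speed, bins=[lin_bins, ang_bins])
    covered = np.count_nonzero(hist)
    return covered / hist.size, hist   # fraction of speed bins the data reaches
```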

Learning Gait Emotions Using Affective and Deep Features

We present a novel data-driven algorithm to learn the perceived emotions of individuals from their walking motion, or gait. Given an RGB video of an individual walking, we extract their gait as a sequence of 3D poses. Our goal is to exploit the gait features to classify the individual's perceived emotional state as one of four categorical emotions: happy, sad, angry, or neutral. Our perceived emotion identification approach uses deep features learned with long short-term memory networks (LSTMs) on datasets of labeled emotive gaits, combined with gait-based affective features consisting of posture and movement measures. Our algorithm identifies both the categorical emotion and the corresponding values of the dimensional emotion components, valence and arousal. We also introduce and benchmark Emotion Walk (EWalk), a dataset of gait videos of individuals annotated with emotions. We show that mapping the combined feature space to the perceived emotional state yields an accuracy of 80.07% on the EWalk dataset, outperforming current baselines by an absolute 13–24%.
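The fusion of learned temporal gait features with hand-crafted affective features can be sketched as follows in PyTorch; the feature sizes and fusion layout are assumptions for illustration, not the authors' network.

```python
# Illustrative sketch: LSTM gait features concatenated with affective
# (posture/movement) features for 4-way emotion classification. Sizes assumed.
import torch
import torch.nn as nn

class GaitEmotionNet(nn.Module):
    def __init__(self, joint_dim=16 * 3, affect_dim=29, hidden=128, classes=4):
        super().__init__()
        self.lstm = nn.LSTM(joint_dim, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden + affect_dim, 64), nn.ReLU(),
            nn.Linear(64, classes))               # happy / sad / angry / neutral

    def forward(self, pose_seq, affective_feats):
        _, (h, _) = self.lstm(pose_seq)           # deep temporal gait features
        fused = torch.cat([h[-1], affective_feats], dim=-1)
        return self.head(fused)                   # class logits
```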

Impact of Self-Contacts on Perceived Pose Equivalences

Defining equivalences between poses of different human characters is an important problem for imitation research, human pose recognition, and deformation transfer. However, pose equivalence is subjective: it depends on context and on the morphology of the characters. A common hypothesis is that interactions between body surfaces, such as self-contacts, are important attributes of human poses, and they are therefore consistently preserved by animation approaches that retarget human motions. However, some self-contacts are only present because of the morphology of the character and are not important to the pose, e.g., contacts between the upper arms and the torso in a standing A-pose. In this paper, we conduct a first study towards understanding the impact of self-contacts between body surfaces on perceived pose equivalence. More specifically, we focus on contacts between the arms or hands and the upper body, which are frequent in everyday human poses. We present observers with two models of a character mimicking the pose of a source character, one with the same self-contacts as the source and one with a single self-contact removed, and ask them to select which model best mimics the source pose. We show that, while poses with different self-contacts are considered different by observers in most cases, the effect is stronger for self-contacts involving the hands than for those involving the arms.

SESSION: Motion Capture and Extraction

A Tool for Extracting 3D Avatar-Ready Gesture Animations from Monocular Videos

Modeling and generating realistic human gesture animation from speech audio is key to creating a believable virtual human that can interact with human users and mimic real-world face-to-face communication. Large-scale datasets are essential for data-driven research, but creating multi-modal gesture datasets with 3D gesture motions and corresponding speech audio is either expensive via traditional workflows such as motion capture, or yields subpar results via pose estimation from in-the-wild videos. As a result, existing gesture datasets suffer from either short duration or low animation quality, making them less than ideal for training gesture synthesis models.

Motivated by the key limitations from previous datasets and recent progress in human mesh recovery (HMR), we developed a tool for extracting avatar-ready gesture motions from monocular videos with improved animation quality. The tool utilizes a variational autoencoder (VAE) to refine raw gesture motions. The resulting gestures are in a unified pose representation that includes both body and finger motions and can be readily applied to a virtual avatar via online motion retargeting. We validated the proposed tool on existing datasets and created the refined dataset TED-SMPLX by re-processing videos from the original TED dataset. The new dataset is available at https://andrewfengusa.github.io/TED_SMPLX_Dataset.
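A minimal sketch of the VAE-based refinement idea, assuming a sequence autoencoder that maps noisy HMR pose sequences back onto a learned manifold of plausible gestures; the architecture and dimensions are illustrative assumptions, not the tool's actual model.

```python
# Assumed sketch: a motion VAE that reconstructs raw pose sequences, keeping
# outputs on a learned manifold of plausible gestures. Sizes are placeholders.
import torch
import torch.nn as nn

class MotionVAE(nn.Module):
    def __init__(self, pose_dim=165, latent=32, hidden=256):
        super().__init__()
        self.enc = nn.GRU(pose_dim, hidden, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)
        self.dec = nn.GRU(latent, hidden, batch_first=True)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, raw_poses):                  # (batch, frames, pose_dim)
        h, _ = self.enc(raw_poses)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        d, _ = self.dec(z)
        return self.out(d), mu, logvar             # refined poses + terms for a KL loss
```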

A Practical Method for Butterfly Motion Capture

Simulating realistic butterfly motion is a well-known challenging problem in computer animation. Arguably, one of the main reasons is the difficulty of acquiring accurate flight motion from real butterflies. In this paper we propose a practical yet effective, optical marker-based approach to capture and process the detailed motion of a flying butterfly. Specifically, we first capture the trajectories of the wings and thorax of a flying butterfly using optical marker-based motion tracking. Our method then automatically fills in the positions of missing markers by exploiting the continuity and relevance of neighboring frames, and improves the quality of the captured motion via noise filtering with optimized parameter settings. Through comparisons with existing motion processing methods, we demonstrate the effectiveness of our approach in obtaining accurate flight motions of butterflies. Furthermore, we created, and will release to the research community, a first-of-its-kind butterfly motion capture dataset.
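A minimal sketch of the gap-filling and filtering steps, assuming per-channel interpolation from neighboring frames followed by a smoothing filter; the filter choice and fixed parameters below are assumptions, whereas the paper optimizes its filter settings.

```python
# Sketch of marker gap filling from neighbouring frames plus smoothing.
# Linear interpolation and a fixed Savitzky-Golay filter are stand-ins for
# the paper's method and its optimized parameters.
import numpy as np
from scipy.signal import savgol_filter

def fill_and_smooth(markers, window=9, order=3):
    """markers: (frames, n_markers, 3), NaN where a marker is occluded."""
    frames = np.arange(markers.shape[0])
    filled = markers.copy()
    for m in range(markers.shape[1]):
        for c in range(3):
            ch = filled[:, m, c]                       # view into `filled`
            missing = np.isnan(ch)
            if missing.any() and (~missing).sum() >= 2:
                ch[missing] = np.interp(frames[missing], frames[~missing], ch[~missing])
    return savgol_filter(filled, window, order, axis=0)  # noise filtering over time
```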

SESSION: Motion Control and Planning

Learning High-Risk High-Precision Motion Control

Deep reinforcement learning (DRL) algorithms for movement control are typically evaluated and benchmarked on sequential decision tasks in which imprecise actions can be corrected by later actions, allowing high returns even with noisy actions. In contrast, we focus on an under-researched class of high-risk, high-precision motion control problems in which actions have irreversible outcomes, so that sharp peaks and ridges plague the state-action reward landscape. Using computational pool as a representative example of such problems, we propose and evaluate State-Conditioned Shooting (SCOOT), a novel DRL algorithm that builds on advantage-weighted regression (AWR) with three key modifications: 1) performing policy optimization using only elite samples, allowing the policy to better latch onto the rare high-reward action samples; 2) using a mixture-of-experts (MoE) policy, allowing the policy to switch between reward landscape modes depending on the state; 3) adding a distance regularization term and a learning curriculum to encourage exploring diverse strategies before adapting to the most advantageous samples. We showcase SCOOT by learning physically based billiard shots, demonstrating high action precision and the discovery of multiple shot strategies for a given ball configuration.
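Modification (1), elite-filtered advantage-weighted regression, might look roughly like the following; the elite fraction, temperature, and policy interface are assumptions rather than the paper's exact formulation.

```python
# Assumed sketch of an elite-filtered AWR update: only the top fraction of
# samples (by return) contribute to the weighted policy regression.
import torch

def elite_awr_update(policy, optimizer, states, actions, returns,
                     elite_frac=0.1, temperature=1.0):
    k = max(1, int(elite_frac * returns.numel()))
    elite = torch.topk(returns, k).indices               # keep only elite samples
    s, a, r = states[elite], actions[elite], returns[elite]
    advantage = r - r.mean()
    weights = torch.exp(advantage / temperature).clamp(max=20.0)
    log_prob = policy.log_prob(s, a)                      # assumed policy API
    loss = -(weights.detach() * log_prob).mean()          # advantage-weighted regression
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```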

Time Reversal and Simulation Merging for Target-Driven Fluid Animation

We present an approach to control the animation of liquids. The user influences the simulation by providing a target surface which will be matched by a portion of the liquid at a specific frame of the animation; our approach is also effective for multiple target surfaces forming an animated sequence. A source simulation provides the context liquid animation with which we integrate the controlled target elements. From each target frame, we compute a target simulation in two parts, one forward and one backward, which are then joined together. The particles for the two simulations are initially placed on the target shape, with velocities sampled from the source simulation. The backward particles use velocities opposite to those of the forward simulation, so that the two halves join seamlessly. When there are multiple target frames, each target frame simulation is computed independently, and the particles from these target simulations are later combined. In turn, the target simulation is joined to the source simulation. Appropriate steps are taken to select which particles to keep when joining the forward, backward, and source simulations. As a result, only a small fraction of the computation time is devoted to the target simulation, allowing faster computation as well as good turnaround times when designing the full animation. Source and target simulations are computed using an off-the-shelf Lagrangian simulator, making it easy to integrate our approach with many existing animation pipelines. We present test scenarios demonstrating the effectiveness of the approach in achieving a well-formed target shape, while still depicting a convincing liquid look and feel.
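The forward/backward split can be sketched as below, assuming an off-the-shelf Lagrangian step() function and a sampler that draws velocities from the source simulation; the join is simplified for illustration and is not the paper's particle-selection scheme.

```python
# Conceptual sketch of the forward/backward target simulation. `step` is an
# assumed simulator callback taking (positions, velocities) and returning the
# next (positions, velocities); the sampler draws velocities from the source.
def target_simulation(target_positions, source_velocity_sampler, step, n_frames):
    """Simulate outward from the target frame in both time directions."""
    v0 = source_velocity_sampler(target_positions)   # velocities taken from the source sim

    # Forward half: particles start on the target shape, integrate forward in time.
    fwd = [(target_positions.copy(), v0.copy())]
    for _ in range(n_frames):
        fwd.append(step(*fwd[-1]))

    # Backward half: opposite velocities, integrated, then time-reversed so the
    # two halves meet seamlessly on the target shape.
    bwd = [(target_positions.copy(), -v0.copy())]
    for _ in range(n_frames):
        bwd.append(step(*bwd[-1]))
    bwd = [(p, -v) for p, v in reversed(bwd)]

    return bwd[:-1] + fwd    # backward part, then forward part, sharing the target frame
```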

Stealthy path planning against dynamic observers

In virtual environments, research into the problem of stealthy or covert path planning has either assumed fixed and static motion of observers or has used relatively simple probabilistic models that statically summarize potential behavior. In this paper, we introduce a method that dynamically estimates enemy motion in order to plan covert paths in a prototype game environment. We compare our results to other baseline pathfinding methods and conduct an extensive exploration of the many parameters and design choices involved to better understand the impact of different settings on the success of covert path planning in virtual environments. Our design provides a more flexible approach to covert pathfinding problems, and our analysis provides useful insights into the relative weighting of the different factors that can improve design choices in building stealth scenarios.