SA Technical Communications '23: SIGGRAPH Asia 2023 Technical Communications

SESSION: Rendering

Monte Carlo Denoising via Multi-scale Auxiliary Feature Fusion Guided Transformer

Deep learning-based single-frame Monte Carlo denoising techniques have demonstrated remarkable results in photo-realistic rendering research. However, the current state-of-the-art methods relying on self-attention mechanisms underutilize auxiliary features and struggle to preserve intricate high-frequency details in complex scenes. Employing a generative adversarial architecture, we present a transformer-based denoising network guided by multi-scale auxiliary features. The proposed U-shaped denoising network extracts multi-scale texture and geometric features from auxiliaries, modulating them to guide the improved transformer module’s denoising process. The improved transformer module employs cross-channel self-attention to capture non-local relationships with near-linear computational complexity. Additionally, a gating mechanism is introduced in the transformer module’s feed-forward network, enhancing information flow. Extensive experiments on noisy images with varied per-pixel sampling rates demonstrate the method’s superiority in quantitative metrics and visual perception compared with state-of-the-art methods. Our method excels notably in intricate scenes with complex hair and texture details, which are historically challenging to denoise.

SESSION: See Details

Combining Resampled Importance and Projected Solid Angle Samplings for Many Area Light Rendering

Direct lighting from many area light sources is challenging due to variance from both choosing an important light and then a point on it. Resampled Importance Sampling (RIS) achieves low variance in such situations. However, it is limited to simple sampling strategies for its candidates. Specifically for area lights, we can improve the convergence of RIS by incorporating a better sampling strategy: Projected Solid Angle Sampling (ProjLTC). Naively combining RIS and ProjLTC improves equal sample convergence. However, it achieves little to no gain in equal time. We identify the core issue for the high run times and reformulate RIS for better integration with ProjLTC. Our method achieves better convergence and results in both equal sample and equal time. We evaluate our method on challenging scenes with varying numbers of area light sources and compare it to uniform sampling, RIS, and ProjLTC. In all cases, our method seldom performs worse than RIS and often performs better.
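
As an illustration of the basic RIS estimator that the abstract builds on, here is a minimal one-dimensional sketch. It is not the paper's reformulation and does not involve ProjLTC; the integrand `f`, the target function, the uniform candidate source, and all function names are assumptions chosen for the example.

```python
import random


def ris_estimate(f, target, sample_source, source_pdf, m, rng):
    """Return one RIS estimate of the integral of f, using m candidates."""
    # Draw m candidates from the simple source distribution.
    candidates = [sample_source(rng) for _ in range(m)]
    # Resampling weight of each candidate: target value over source density.
    weights = [target(x) / source_pdf(x) for x in candidates]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # no candidate carried any target mass
    # Select one candidate with probability proportional to its weight.
    y = rng.choices(candidates, weights=weights, k=1)[0]
    # Evaluate f at the selected sample, corrected by the average weight.
    return f(y) / target(y) * (total / m)


def demo(runs=5000, m=8, seed=1):
    """Average many RIS estimates of the integral of x^2 over [0, 1]."""
    rng = random.Random(seed)
    estimates = [
        ris_estimate(
            f=lambda x: x * x,                 # integrand (exact integral is 1/3)
            target=lambda x: x,                # cheap approximation of f
            sample_source=lambda r: r.random(),  # uniform candidates on [0, 1)
            source_pdf=lambda x: 1.0,
            m=m,
            rng=rng,
        )
        for _ in range(runs)
    ]
    return sum(estimates) / runs
```

In a many-light renderer the candidates would be points on light sources and the target an unshadowed contribution, so the quality of the candidate source directly affects how many candidates are needed.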

Focus Range: Production Ray Tracing of Depth of Field

Defocus blur adds realism to computer generated images by modelling the restricted focus range of physical camera lenses. It allows artists to draw the viewer’s attention to in-focus regions of the image by separating them from the background. The widely used thin lens model provides a simple approach to achieve this effect, allowing artists to control the frame’s appearance through the distance to the focus plane and the aperture size. Our proposed focus range model extends the focus distance to a range. This allows artists to keep characters fully in focus and independently define the out-of-focus blur. We demonstrate how to achieve this robustly in the presence of specular BSDFs, ray-oriented geometry and multiple light bounces. Additionally, we share our practical experience of integrating this model into our production renderer Glimpse.
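
For context, the thin lens model mentioned above gives a closed-form circle of confusion. The sketch below computes it and then shows one hypothetical way a focus range could be layered on top, by treating everything between two distances as sharp and measuring blur from the nearest end of the range. The range variant is an assumption made for illustration and is not taken from the paper or from Glimpse; all function and parameter names are invented here.

```python
def thin_lens_coc(distance, focus_distance, focal_length, aperture_diameter):
    """Circle-of-confusion diameter on the sensor for a thin lens.

    All arguments use the same length unit. The focus distance must lie
    beyond the focal length for the lens to form a real image.
    """
    if distance <= 0.0 or focus_distance <= focal_length:
        raise ValueError("distances must be positive and beyond the focal length")
    return (aperture_diameter * focal_length * abs(distance - focus_distance)
            / (distance * (focus_distance - focal_length)))


def focus_range_coc(distance, near, far, focal_length, aperture_diameter):
    """Hypothetical range variant: zero blur inside [near, far]."""
    if near <= distance <= far:
        return 0.0
    # Outside the range, measure blur from the closest end of the range.
    edge = near if distance < near else far
    return thin_lens_coc(distance, edge, focal_length, aperture_diameter)
```

With this reading, a character standing anywhere between `near` and `far` stays sharp, while the blur of the background is still governed by the aperture and focal length.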

Real-time Rendering of Glossy Reflections using Ray Tracing and Two-level Radiance Caching

Estimation of glossy reflections remains a challenging topic for real-time renderers. Ray tracing is a robust solution for evaluating the specular lobe of a given BRDF; however, it is computationally expensive and introduces noise that requires filtering. Other solutions, such as light probe systems, offer to approximate the signal with little to no noise and better performance but tend to introduce additional bias in the form of overly blurred visuals. This paper introduces a novel approach to rendering reflections in real time that combines the radiance probes of an existing diffuse global illumination framework with denoised ray-traced reflections calculated at a low sampling rate. We will show how combining these two sources allows producing an efficient and high-quality estimation of glossy reflections that is suitable for real-time applications such as games.

SESSION: Motion Synthesis with Awareness

Motion to Dance Music Generation using Latent Diffusion Model

The role of music in games and animation, particularly in dance content, is essential for creating immersive and entertaining experiences. Although recent studies have made strides in generating dance music from videos, their practicality in integrating music into games and animation remains limited. In this context, we present a method capable of generating plausible dance music from 3D motion data and genre labels. Our approach leverages a combination of a UNET-based latent diffusion model and a pre-trained VAE model. To evaluate the performance of the proposed model, we employ evaluation metrics to assess various audio properties, including beat alignment, audio quality, motion-music correlation, and genre score. The quantitative results show that our approach outperforms previous methods. Furthermore, we demonstrate that our model can generate audio that seamlessly fits to in-the-wild motion data. This capability enables us to create plausible dance music that complements dynamic movements of characters and enhances overall audiovisual experience in interactive media. Examples from our proposed model are available at this link:

SynthDa: Exploiting Existing Real-World Data for Usable and Accessible Synthetic Data Generation

Acquiring real-world data for computer vision presents challenges such as data scarcity, high costs, and privacy concerns. We introduce SynthDa, an automated approach for usable synthetic data generation (SDG) that empowers users with varying expertise to create diverse synthetic data from existing real-world datasets. It combines pose estimation, synthetic scene creation, and domain randomization to offer data variants. Ease of SDG through SynthDa enables different permutations and combinations of synthetic data that allow users to explore efficacy of various data configurations in relation to their specific AI tasks. Our experiments across multiple existing datasets and models demonstrate the utility of SynthDa in challenging nuances such as the “more data, the better” paradigm; revealing that excessive synthetic data may degrade performance and vice versa. In a pilot user study with 24 participants, we show the perceived usefulness of SynthDa as a promising SDG tool for overcoming challenges related to real-world data acquisition.

SESSION: Designer Assistant

PerfectDart: Automatic Dart Design for Garment Fitting

A dart, a triangle-shaped folded and stitched tuck in a garment, is a common sewing technique used to provide custom-fit garments. Unfortunately, designing and optimally placing these darts requires knowledge and practice, making it challenging for novice users. We propose a novel computational dart design framework that takes rough user cues (the region where the dart will be inserted) and computes the optimal dart configurations to improve fitness. To be more specific, our framework utilizes the body-garment relationship to quantify the fitting using a novel energy composed of three geometric terms: 1) a closeness term encoding the proximity between the garment and the target body, 2) a stretchability term favouring area-preserving cloth deformation, and 3) a smoothness term promoting an unwrinkled and unfolded garment. We evaluate these three geometric terms via off-the-shelf cloth simulation and use them to optimize the dart configuration by minimizing the energy. As demonstrated by our results, our method is able to automatically generate darts to improve fitness for various garment designs and a wide range of body shapes, including animals.

SESSION: Holography

High-quality Color-animated CGH Using a Motor-driven Photomask

We propose a novel display system for a color-animated computer-generated hologram (CGH) with the world’s highest level of image quality, screen size, and viewing angle. The conventional method uses a static CGH and pattern illumination to achieve color-animated CGH with large screens and wide viewing angles, but non-illuminated areas result in degraded image quality. Our system solves this problem by using a photomask with a motorized stage, which enables the projection of much finer patterns than with the conventional method. We demonstrated that our method achieves a significant quality improvement compared to the conventional method by developing a color animation prototype with 4 reconstructed images.

SESSION: TechScape

The Effects of Avatar Voice and Facial Expression Intensity on Emotional Recognition and User Perception

The use of avatars of various rendering styles (e.g., abstract, cartoon, realistic) in virtual reality is ever-increasing. However, little is known about the effects of auditory stimuli, specifically avatar voices, on users’ perceived realism. This paper aims to investigate and better understand the role of a look-alike avatar’s vocal and facial expression intensity on users’ perceived realism and emotional recognition using a virtual bystander scenario. Results show that avatars’ vocal intensity generally affected study participants’ emotional recognition while facial expression intensity affected their perceived realism. The results have implications for the perception and effectiveness of look-alike avatars in virtual environments, specifically industry training for dangerous or non-replicable situations, such as school shootings and exposure therapy.

Comparing Cinematic Conventions through Emotional Responses in Cinematic VR and Traditional Mediums

This paper compares emotional responses to cinematic conventions in cinematic virtual reality (CVR) versus traditional mediums. We conducted a between-subjects experiment to evaluate viewer experiences across 10 cinematic shot types displayed through three viewing mediums: computer screen, cinema projector, and VR headset (N=15 per condition). Using the Self-Assessment Manikin scale (SAM), we measured pleasure, arousal, and dominance reactions. The statistical analysis showed that CVR elicited significantly higher arousal and lower dominance than traditional formats, indicating enhanced emotional engagement. Overall patterns were found to be similar, but the viewers’ emotions were intensified in CVR for shots like close-up and ground-level. Our findings suggest CVR's embodied, interactive qualities altered the impact of cinematic techniques. This comparative study reveals CVR invokes different emotional resonance versus traditional mediums, contributing insights on adapting storytelling strategies for the VR cinematographic experience.

SESSION: Anything Can be Neural

Aerial Diffusion: Text Guided Ground-to-Aerial View Synthesis from a Single Image using Diffusion Models

We present a novel method, Aerial Diffusion, for generating aerial views from a single ground-view image using text guidance. Aerial Diffusion leverages a pretrained text-image diffusion model for prior knowledge. We address two main challenges: the domain gap between the ground view and the aerial view, and the two views being far apart in the text-image embedding manifold. Our approach uses a homography inspired by inverse perspective mapping prior to finetuning the pretrained diffusion model. Aerial Diffusion uses an alternating sampling strategy to compute the optimal solution on a complex high-dimensional manifold and generate a high-fidelity (w.r.t. ground view) aerial image. We demonstrate the quality and versatility of Aerial Diffusion on a plethora of images and prove the effectiveness of our method with extensive ablations and comparisons. To the best of our knowledge, Aerial Diffusion is the first approach that performs single image ground-to-aerial translation in an unsupervised manner. The full paper and code can be found at

LayerDiffusion: Layered Controlled Image Editing with Diffusion Models

Text-guided image editing has recently experienced rapid development. However, simultaneously performing multiple editing actions on a single image, such as background replacement and specific subject attribute changes, while maintaining consistency between the subject and the background remains challenging. In this paper, we propose LayerDiffusion, a semantic-based layered controlled image editing method. Our method enables non-rigid editing and attribute modification of specific subjects while preserving their unique characteristics and seamlessly integrating them into new backgrounds. We leverage a large-scale text-to-image model and employ a layered controlled optimization strategy combined with layered diffusion training. During the diffusion process, an iterative guidance strategy is used to generate a final image that aligns with the textual description. Experimental results demonstrate the effectiveness of our method in generating highly coherent images that closely align with the given textual description. The edited images maintain a high similarity to the features of the input image and surpass the performance of current leading image editing methods. LayerDiffusion opens up new possibilities for controllable image editing.

SESSION: Materials

Bounded VNDF Sampling for Smith–GGX Reflections

Sampling according to a visible normal distribution function (VNDF) is often used to sample rays scattered by glossy surfaces, such as the Smith–GGX microfacet model. However, for rough reflections, existing VNDF sampling methods can generate undesirable reflection vectors occluded by the surface. Since these occluded reflection vectors must be rejected, VNDF sampling is inefficient for rough reflections. This paper introduces an unbiased method to reduce the number of rejected samples for Smith–GGX VNDF sampling. Our method limits the sampling range for a state-of-the-art VNDF sampling method that uses a spherical cap-based sampling range. By using our method, we can reduce the variance for highly rough and low-anisotropy surfaces. Since our method only modifies the spherical cap range in the existing sampling routine, it is simple and easy to implement.
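
To make the rejection problem concrete, the following sketch implements a spherical cap-based GGX VNDF sampling routine of the kind the abstract refers to, and counts how often the resulting reflection vector falls below the surface. It does not include the paper's tighter bound on the cap; the function names and the test configuration are assumptions for illustration.

```python
import math
import random


def normalize(v):
    length = math.sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2])
    return (v[0] / length, v[1] / length, v[2] / length)


def sample_ggx_vndf(wi, alpha_x, alpha_y, u1, u2):
    """Sample a microfacet normal visible from wi (surface normal is +z)."""
    # Stretch the view direction so the GGX surface becomes a unit hemisphere.
    wi_std = normalize((wi[0] * alpha_x, wi[1] * alpha_y, wi[2]))
    # Sample a point uniformly on the spherical cap with z in (-wi_std.z, 1].
    phi = 2.0 * math.pi * u1
    z = (1.0 - u2) * (1.0 + wi_std[2]) - wi_std[2]
    sin_theta = math.sqrt(max(0.0, 1.0 - z * z))
    cap = (sin_theta * math.cos(phi), sin_theta * math.sin(phi), z)
    # The halfway vector in stretched space is the (unnormalized) normal.
    h = (cap[0] + wi_std[0], cap[1] + wi_std[1], cap[2] + wi_std[2])
    # Unstretch and normalize to obtain the microfacet normal.
    return normalize((h[0] * alpha_x, h[1] * alpha_y, h[2]))


def reflect(wi, m):
    d = wi[0] * m[0] + wi[1] * m[1] + wi[2] * m[2]
    return (2.0 * d * m[0] - wi[0], 2.0 * d * m[1] - wi[1], 2.0 * d * m[2] - wi[2])


def rejection_rate(alpha, theta_deg, n=20000, seed=7):
    """Fraction of sampled reflection vectors that point below the surface."""
    rng = random.Random(seed)
    theta = math.radians(theta_deg)
    wi = (math.sin(theta), 0.0, math.cos(theta))
    rejected = 0
    for _ in range(n):
        m = sample_ggx_vndf(wi, alpha, alpha, rng.random(), rng.random())
        if reflect(wi, m)[2] <= 0.0:
            rejected += 1
    return rejected / n
```

Calling `rejection_rate` for different roughness values and view angles gives a rough sense of how many samples an unbounded cap wastes; narrowing the range of `z` is where a bounded variant would intervene.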

Standard Shader Ball: A Modern and Feature-Rich Render Test Scene

We discuss our recent material test scene contribution to the USD Assets Working Group [2022] and enumerate some of its desirable test qualities. The scene employs modern geometry and material standards and has been released to the rendering community under a permissive licence, so that others may benefit from its useful properties, to encourage standardisation, and to remove the need for continuing reinvention.

SESSION: Beyond Skin Deep

Mapping and Recognition of Facial Expressions on Another Person's Look-Alike Avatars

As Virtual Reality (VR) continues to advance and gain popularity, one of the persisting concerns revolves around the level of fidelity achievable in creating lifelike avatars. This study aims to explore the possibility, feasibility, and effects of controlling avatars in Virtual Reality using actual facial expressions and eye movements from three different actors mapped onto one actor’s look-alike avatar. The objective is to explore the authenticity and appeal of avatars, particularly when they accurately portray facial expressions of their respective users compared to when they display facial expressions from other individuals. By properly mapping facial expressions and eye movements onto the avatar, we seek to aid with the development of a more realistic and captivating virtual experience that closely mirrors real-life interactions. Furthermore, we investigate whether mapping one’s facial expressions to another person’s look-alike avatar affects identification and recognition.

Efficient Incremental Potential Contact for Actuated Face Simulation

We present a quasi-static finite element simulator for human face animation. We model the face as an actuated soft body, which can be efficiently simulated using Projective Dynamics (PD). We adopt Incremental Potential Contact (IPC) to handle self-intersection. However, directly integrating IPC into the simulation would impede the high efficiency of the PD solver, since the stiffness matrix in the global step is no longer constant and cannot be pre-factorized. We notice that the actual number of vertices affected by the collision is only a small fraction of the whole model, and by utilizing this fact we effectively decrease the scale of the linear system to be solved. With the proposed optimization method for collision, we achieve high visual fidelity at a relatively low performance overhead.

Portrait Expression Editing With Mobile Photo Sequence

Mobile cameras have revolutionized content creation, allowing casual users to capture professional-looking photos. However, capturing the perfect moment can still be challenging, making post-capture editing desirable. In this work, we introduce ExShot, a mobile-oriented expression editing system that delivers high-quality, fast, and interactive editing experiences. Unlike existing methods that rely on learning expression priors, we leverage mobile photo sequences to extract expression information on demand. This design insight enables ExShot to address challenges related to diverse expressions, facial details, environment entanglement, and interactive editing. At the core lies ExprNet, a lightweight deep learning model that extracts and refines expression features. To train our model, we captured portrait images with diverse expressions, incorporating pre-processing and lighting augmentation techniques to ensure data quality. Our comprehensive evaluation results demonstrate that ExShot outperforms other editing approaches by up to 29.02% in PSNR. Ablation studies validate the effectiveness of our design choices, and user studies with 28 participants confirm the strong desire for expression editing and the superior synthesis quality of ExShot, while also identifying areas for further investigation.

SESSION: TechnoScape

Interactive Material Annotation on 3D Scanned Models leveraging Color-Material Correlation

3D scanning has made it possible to generate 3D models from real objects. Although 3D scanning can capture an object’s shape and color texture, it is still technically difficult to analyze and reproduce material properties such as metalness, roughness, and transparency. Therefore, they need to be explicitly annotated after the scanning process. However, existing methods, such as simple brush painting, are highly labor-intensive and require delicate, inefficient handwork. To make this process more efficient and accurate, we propose a system that mitigates the costs by introducing a texture-aware annotation pipeline. This method is based on the observation that material distribution is correlated to color distribution. We segment the 3D surface into areas based on color similarity and let users annotate materials using the segmentations as masks. In an empirical user study, the participants could make quality annotations in a short time.

Footstep Detection for Film Sound Production

In this paper, we presented a footstep detection method, which could assist sound editors in positioning the character’s footsteps on the timeline of the film. Based on it, a footstep detection system was designed for film sound production. Considering the characteristics of human motion and the needs of film sound production, our method included two parts: data preprocessing and footstep detection modeling. Experiments on various types of shots showed the good generalization and high accuracy of our method. The application evaluation demonstrated the high efficiency of the system.

Training Orchestral Conductors in Beating Time

Orchestral conducting involves a rich vocabulary of gestures and so training conductors is challenging. We discuss how virtual reality and gesture detection could be used to aid this process. We describe our pilot interface for training conductors in basic beat patterns, using gestural input and virtual reality output. We investigated both positional and acceleration-based detection of beats, concluding that the best way to detect beats reliably is to identify maxima in acceleration, that is, those moments in a beat that would appear as a flick to a human player. This practical evidence supports Wöllner’s theory of how human players detect beats. We trialled our system with novices and experts, with a range of beating styles.
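
The idea of detecting beats as maxima in acceleration can be sketched as follows. This is a minimal illustration, not the authors' detector: it assumes evenly sampled 3D hand positions, estimates acceleration with central second differences, and uses a hypothetical fixed threshold to ignore small peaks.

```python
import math


def acceleration_magnitudes(positions, dt):
    """Acceleration magnitude at each interior sample (central differences)."""
    magnitudes = []
    for i in range(1, len(positions) - 1):
        accel = [
            (positions[i + 1][k] - 2.0 * positions[i][k] + positions[i - 1][k]) / (dt * dt)
            for k in range(3)
        ]
        magnitudes.append(math.sqrt(sum(c * c for c in accel)))
    return magnitudes


def detect_beats(positions, dt, threshold):
    """Return the times of local acceleration maxima above the threshold."""
    acc = acceleration_magnitudes(positions, dt)
    beats = []
    for i in range(1, len(acc) - 1):
        is_peak = acc[i] > acc[i - 1] and acc[i] >= acc[i + 1]
        if is_peak and acc[i] >= threshold:
            # acc[i] belongs to positions[i + 1], hence the index shift.
            beats.append((i + 1) * dt)
    return beats


def bouncing_hand(duration=3.5, rate=100):
    """Synthetic test motion: the hand bounces sharply once per second."""
    return [(0.0, abs(math.sin(math.pi * i / rate)), 0.0)
            for i in range(int(duration * rate) + 1)]
```

For the synthetic motion, `detect_beats(bouncing_hand(), 0.01, 100.0)` picks out the sharp reversals at the bottom of each bounce, which play the role of the flick described in the abstract. Real tracking data would likely need smoothing before differentiation.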

A Motion-Simulation Platform to Generate Synthetic Motion Data for Computer Vision Tasks

We developed the Motion-Simulation Platform, a platform running within a game engine that is able to extract both RGB imagery and the corresponding intrinsic motion data (i.e., motion field). This is useful for motion-related computer vision tasks where large amounts of intrinsic motion data are required to train a model. We describe the implementation and design details of the Motion-Simulation Platform. The platform is extendable, such that any scene developed within the game engine is able to take advantage of the motion data extraction tools. We also provide both user and AI-bot controlled navigation, enabling user-driven input and mass automation of motion data collection.

SESSION: Motion Awareness with Synthesis, Part II

EasyVolcap: Accelerating Neural Volumetric Video Research

Live4D: A Real-time Capture System for Streamable Volumetric Video

Volumetric video holds promise for virtual and augmented reality (VR/AR) applications but faces challenges in interactive scenarios due to high hardware costs, complex processing and substantial data streams. In this paper, we introduce Live4D, a cost-effective, real-time volumetric video generation and streaming system using an RGB-only camera setup. We propose a novel deep implicit surface reconstruction algorithm that combines a neural signed distance field with an observed truncated signed distance field to generate watertight meshes with low latency. Moreover, we achieve a robust non-rigid tracking method that provides temporal stability to the meshes while resisting tracking failure cases. Experimental results show that Live4D achieves a performance of 24fps using mid-range graphics cards and exhibits an end-to-end latency of 95ms. The system enables live streaming of volumetric video within a 20Mbps bandwidth requirement, positioning Live4D as a promising solution for real-time 3D vision content creation in the growing VR/AR industry.

SESSION: Modeling and Geometry

Distributed Solution of the Blendshape Rig Inversion Problem

The problem of rig inversion is central in facial animation, but with the increasing complexity of modern blendshape models, execution times increase beyond practically feasible solutions. A possible approach towards a faster solution is clustering, which exploits the spatial nature of the face, leading to a distributed method. In this paper, we go a step further, involving cluster coupling to get more confident estimates of the overlapping components. Our algorithm applies the Alternating Direction Method of Multipliers, sharing the overlapping weights between the subproblems, and shows a clear advantage over the naive clustered approach. The method applies to an arbitrary clustering of the face. We also introduce a novel method for choosing the number of clusters in a data-free manner, resulting in a sparse clustering graph without losing essential information. Finally, we give a new variant of a data-free clustering algorithm that produces good scores with respect to the mentioned strategy for choosing the optimal clustering.
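
How ADMM lets subproblems agree on a shared weight can be shown with a deliberately tiny example. The sketch below is a generic consensus ADMM on scalar least-squares terms, not the paper's rig inversion solver: each "cluster" `i` is assumed to contribute a cost `(a[i] * w - b[i])**2` on one overlapping weight `w`, and the names and penalty parameter `rho` are illustrative.

```python
def consensus_admm(a, b, rho=1.0, iterations=200):
    """Minimise sum_i (a[i] * w - b[i])**2 over a single shared weight w."""
    n = len(a)
    z = 0.0            # consensus value of the shared weight
    u = [0.0] * n      # scaled dual variable of each subproblem
    for _ in range(iterations):
        # Local step: each subproblem solves its own small regularised fit.
        w = [
            (2.0 * a[i] * b[i] + rho * (z - u[i])) / (2.0 * a[i] * a[i] + rho)
            for i in range(n)
        ]
        # Consensus step: average the local proposals.
        z = sum(w[i] + u[i] for i in range(n)) / n
        # Dual step: accumulate each subproblem's disagreement with z.
        u = [u[i] + w[i] - z for i in range(n)]
    return z


def closed_form(a, b):
    """Direct least-squares solution, for comparison."""
    return sum(x * y for x, y in zip(a, b)) / sum(x * x for x in a)
```

For example, `consensus_admm([1.0, 2.0], [1.0, 3.0])` approaches the value of `closed_form([1.0, 2.0], [1.0, 3.0])`. The local steps are independent of one another, which is what makes the scheme suitable for distribution across clusters.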

SESSION: Flesh & Bones

Robust Skin Weights Transfer via Weight Inpainting

We present a new method for the robust transfer of skin weights from a source mesh to a target mesh with significantly different geometric shapes. Rigging garments is a typical application of skin weight transfer where weights are copied from a source body mesh to avoid tedious weight painting from scratch. However, existing techniques struggle with non-skin-tight garments and require additional manual weight painting. We introduce a fully automatic two-stage skin weight transfer process. First, an initial transfer is performed by copying weights from the source mesh only for those vertices on the target mesh where we have high confidence in obtaining the ground truth weights from the source. Then, we automatically compute weights for all other vertices by interpolating the weights computed in stage one. This approach is robust and easy to implement in practice, yet it far outperforms the methods used in existing commercial software and previous research works.
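
The second stage described above, filling in weights for vertices that received none in stage one, can be sketched with a simple neighbour-averaging fill over the mesh connectivity. This is an assumed stand-in for the paper's interpolation, chosen for brevity; the data layout (`neighbors`, `known`) and the function name are hypothetical.

```python
def inpaint_weights(neighbors, known, num_bones, iterations=500):
    """Fill missing skin weights by repeatedly averaging neighbour weights.

    neighbors: list where neighbors[v] holds the vertices adjacent to v.
    known:     dict mapping a vertex to the bone weights copied in stage one.
    """
    n = len(neighbors)
    uniform = [1.0 / num_bones] * num_bones
    weights = [list(known[v]) if v in known else list(uniform) for v in range(n)]
    for _ in range(iterations):
        for v in range(n):
            if v in known or not neighbors[v]:
                continue  # keep confident weights fixed; skip isolated vertices
            for bone in range(num_bones):
                total = sum(weights[j][bone] for j in neighbors[v])
                weights[v][bone] = total / len(neighbors[v])
    # Renormalise so each vertex's weights sum to one.
    for v in range(n):
        total = sum(weights[v])
        if total > 0.0:
            weights[v] = [w / total for w in weights[v]]
    return weights
```

On a chain of five vertices whose two ends are bound to different bones, the fill blends linearly between them, so the middle vertex ends up with equal weights. A production implementation would more likely solve a sparse linear system than iterate.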

Learning multivariate empirical mode decomposition for spectral motion editing

Spectral motion editing provides an important perspective for character animation synthesis since biomechanical features are usually embedded in the frequency domain. However, heavy manual postprocessing and high computational cost were required to achieve high-quality character animations. In this paper, first, we propose a novel architecture for neural networks to learn multivariate empirical mode decomposition that can decompose motion into non-linear frequency components corresponding to their biomechanical features. Next, we demonstrate a spectral motion editing technique based on our proposed architecture. The results revealed that high-quality character animation synthesis could be achieved by editing these decomposed non-linear frequency components, providing novel tasks for character animation design.

SESSION: Visualizing the Future

MicroGlam: Microscopic Skin Image Dataset with Cosmetics

In this paper, we present a cosmetic-specific skin image dataset. It consists of skin images from 45 patches (5 skin patches each from 9 participants) of size 8 mm × 8 mm under three cosmetic products (i.e., foundation, blusher, and highlighter). We designed a novel capturing device inspired by LightStage [Debevec et al. 2000]. Using the device, we captured over 600 images of each skin patch under diverse lighting conditions in 30 seconds. We repeated the process for the same skin patch under three cosmetic products. Finally, we demonstrate the viability of the dataset with an image-to-image translation-based pipeline for cosmetic rendering and compared our data-driven approach to an existing cosmetic rendering method [Kim and Ko 2018].

SESSION: Humans & Characters

Story-to-Motion: Synthesizing Infinite and Controllable Character Animation from Long Text

Generating natural human motion from a story has the potential to transform the landscape of animation, gaming, and film industries. A new and challenging task, Story-to-Motion, arises when characters are required to move to various locations and perform specific motions based on a long text description. This task demands a fusion of low-level control (trajectories) and high-level control (motion semantics). Previous works in character control and text-to-motion have addressed related aspects, yet a comprehensive solution remains elusive: character control methods do not handle text description, whereas text-to-motion methods lack position constraints and often produce unstable motions. In light of these limitations, we propose a novel system that generates controllable, infinitely long motions and trajectories aligned with the input text. 1) We leverage contemporary Large Language Models to act as a text-driven motion scheduler to extract a series of (text, position, duration) pairs from long text. 2) We develop a text-driven motion retrieval scheme that incorporates motion matching with motion semantic and trajectory constraints. 3) We design a progressive mask transformer that addresses common artifacts in the transition motion such as unnatural pose and foot sliding. Beyond its pioneering role as the first comprehensive solution for Story-to-Motion, our system undergoes evaluation across three distinct sub-tasks: trajectory following, temporal action composition, and motion blending, where it outperforms previous state-of-the-art (SOTA) motion synthesis methods across the board. Homepage:

SESSION: Head & Face

CLIP-Head: Text-Guided Generation of Textured Neural Parametric 3D Head Models

We propose CLIP-Head, a novel approach towards text-driven neural parametric 3D head model generation. Our method takes simple text prompts in natural language, describing the appearance & facial expressions, and generates 3D neural head avatars with accurate geometry and high-quality texture maps. Unlike existing approaches, which use conventional parametric head models with limited control and expressiveness, we leverage Neural Parametric Head Models (NPHM), offering disjoint latent codes for the disentangled encoding of identities and expressions. To facilitate the text-driven generation, we propose two weakly-supervised mapping networks to map the CLIP encoding of the input text prompt to NPHM’s disjoint identity and expression vectors. The predicted latent codes are then fed to a pre-trained NPHM network to generate 3D head geometry. Since NPHM meshes do not support textures, we propose a novel aligned parametrization technique, followed by text-driven generation of texture maps by leveraging a recently proposed controllable diffusion model for the task of text-to-image synthesis. Our method is capable of generating 3D head meshes with arbitrary appearances and a variety of facial expressions, along with photoreal texture details. We show superior performance compared with existing state-of-the-art methods, both qualitatively & quantitatively, and demonstrate potentially useful applications of our method. We have released our code at

Hair Tubes: Stylized Hair from Polygonal Meshes of Arbitrary Topology

In this paper, we describe a fast, topology-independent method to generate bundles of hair from a mesh defining the outward shape of the hair. This allows artists to focus on the outward appearance and create stylized painterly hairstyles. We describe a novel approach to parameterize a hair mesh using ideas from discrete differential geometry and offer simple controls to distribute hair within the volume of the mesh. We present real-world production examples of various hairstyles created using our proposed method.