SA '23: SIGGRAPH Asia 2023 Conference Papers

SESSION: Shells

Subspace-Preconditioned GPU Projective Dynamics with Contact for Cloth Simulation

We propose an efficient cloth simulation method that combines the merits of two drastically different numerical procedures: subspace integration and parallelizable iterative relaxation. We show that these two methods can be organically coupled within the framework of projective dynamics (PD), where both low- and high-frequency cloth motions are computed effectively and efficiently. Our method works seamlessly with the state-of-the-art contact handling algorithm, incremental potential contact (IPC), to offer a non-penetration guarantee for the resulting animation. Our core ingredient is the use of a subspace to expedite the convergence of Jacobi-PD: we solve the reduced global system and smartly employ its precomputed factorization. In addition, we incorporate a time-splitting strategy to handle frictional self-contacts.

Specifically, during the PD solve, we employ a quadratic proxy to approximate the contact barrier. The prefactorized subspace system matrix is exploited in a reduced-space LBFGS. The LBFGS method starts with the reduced system matrix of the rest shape as the initial Hessian approximation, incorporating contact information into the reduced system progressively, while the full-space Jacobi iteration captures high-frequency details. Furthermore, we address penetration issues through a penetration correction step. It minimizes an incremental potential without elasticity using Newton-PCG. Our method can be efficiently executed on modern GPUs. Experiments show significant performance improvements over existing GPU solvers for high-resolution cloth simulation.
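The interplay between the reduced global solve and the full-space Jacobi relaxation can be illustrated with a small linear-algebra sketch. This is a minimal illustration of the general idea under assumed inputs (the subspace basis `U`, system matrix `A`, and right-hand side `b` are hypothetical placeholders), not the paper's implementation.

```python
import numpy as np

def pd_global_step(A, b, U, R_chol, x0, jacobi_iters=8):
    """One simplified PD global step: a subspace solve provides a coarse
    low-frequency update, then full-space Jacobi iterations recover
    high-frequency detail.

    A       -- full-space PD system matrix (n x n, SPD)
    b       -- right-hand side assembled from the local projections
    U       -- subspace basis (n x r)
    R_chol  -- Cholesky factor of the reduced matrix U^T A U (precomputed)
    x0      -- warm start (e.g., the previous iterate)
    """
    # Coarse correction from the prefactorized reduced system.
    r = b - A @ x0
    y = np.linalg.solve(R_chol.T, np.linalg.solve(R_chol, U.T @ r))
    x = x0 + U @ y

    # Full-space Jacobi relaxation recovers high-frequency residual detail.
    D_inv = 1.0 / np.diag(A)
    for _ in range(jacobi_iters):
        x = x + D_inv * (b - A @ x)
    return x
```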

SESSION: Character and Rigid Body Control

C·ASE: Learning Conditional Adversarial Skill Embeddings for Physics-based Characters

We present C·ASE, an efficient and effective framework that learns Conditional Adversarial Skill Embeddings for physics-based characters. C·ASE enables the physically simulated character to learn a diverse repertoire of skills while providing controllability in the form of direct manipulation of the skills to be performed. This is achieved by dividing the heterogeneous skill motions into distinct subsets containing homogeneous samples for training a low-level conditional model to learn the conditional behavior distribution. The skill-conditioned imitation learning naturally offers explicit control over the character’s skills after training. The training course incorporates focal skill sampling, skeletal residual forces, and element-wise feature masking to balance diverse skills of varying complexities, mitigate dynamics mismatch to master agile motions, and capture more general behavior characteristics, respectively. Once trained, the conditional model can produce highly diverse and realistic skills, outperforming state-of-the-art models, and can be repurposed for various downstream tasks. In particular, the explicit skill control handle allows a high-level policy or a user to direct the character with desired skill specifications, which we demonstrate is advantageous for interactive character animation.

MuscleVAE: Model-Based Controllers of Muscle-Actuated Characters

In this paper, we present a simulation and control framework for generating biomechanically plausible motion for muscle-actuated characters. We incorporate a fatigue dynamics model, the 3CC-r model, into the widely-adopted Hill-type muscle model to simulate the development and recovery of fatigue in muscles, which creates a natural evolution of motion style caused by the accumulation of fatigue from prolonged activities. To address the challenging problem of controlling a musculoskeletal system with high degrees of freedom, we propose a novel muscle-space control strategy based on PD control. Our simulation and control framework facilitates the training of a generative model for muscle-based motion control, which we refer to as MuscleVAE. By leveraging the variational autoencoders (VAEs), MuscleVAE is capable of learning a rich and flexible latent representation of skills from a large unstructured motion dataset, encoding not only motion features but also muscle control and fatigue properties. We demonstrate that the MuscleVAE model can be efficiently trained using a model-based approach, resulting in the production of high-fidelity motions and enabling a variety of downstream tasks.
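As a point of reference for the muscle-space PD idea, the sketch below shows a generic PD control law mapped to clamped muscle activations; the gains, the least-squares mapping, and the Jacobian interface are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def muscle_space_pd(q_target, q, qdot, J_muscle, kp=200.0, kd=20.0):
    """Generic PD sketch: compute a joint-space PD torque, then map it to
    non-negative muscle activations via a (hypothetical) moment-arm Jacobian.

    q_target, q, qdot -- desired pose, current pose, current velocity
    J_muscle          -- maps muscle activations to joint torques (n_dof x n_muscle)
    """
    tau = kp * (q_target - q) - kd * qdot        # joint-space PD torque
    # Least-squares mapping into muscle space, clamped to valid activations [0, 1].
    a, *_ = np.linalg.lstsq(J_muscle, tau, rcond=None)
    return np.clip(a, 0.0, 1.0)
```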

ViCMA: Visual Control of Multibody Animations

Motion control of large-scale, multibody physics animations with contact is difficult. Existing approaches, such as those based on optimization, are computationally daunting, and, as the number of interacting objects increases, can fail to find satisfactory solutions. We present a new, complementary method for the visual control of multibody animations that exploits object motion and visibility, and has overall cost comparable to a single simulation. Our method is highly practical, and is demonstrated on numerous large-scale, contact-rich examples involving both rigid and deformable bodies.

SESSION: Computational Design

Sparse Stress Structures from Optimal Geometric Measures

Identifying optimal structural designs given loads and constraints is a primary challenge in topology optimization and shape optimization. We propose a novel approach to this problem by finding a minimal tensegrity structure—a network of cables and struts in equilibrium with a given loading force. Through the application of geometric measure theory and compressive sensing techniques, we show that this seemingly difficult graph-theoretic problem can be reduced to a numerically tractable continuous optimization problem. With a light-weight iterative algorithm involving only Fast Fourier Transforms and local algebraic computations, we can generate sparse supporting structures featuring detailed branches, arches, and reinforcement structures that respect the prescribed loading forces and obstacles.

SESSION: Fluid Simulation

An Implicitly Stable Mixture Model for Dynamic Multi-fluid Simulations

Particle-based simulations have become increasingly popular in real-time applications due to their efficiency and adaptability, especially for generating highly dynamic fluid effects. However, the swift and stable simulation of interactions among distinct fluids continues to pose challenges for current mixture model techniques. When using a single-mixture flow field to represent all fluid phases, numerical discontinuities in phase fields can result in significant losses of dynamic effects and unstable conservation of mass and momentum. To tackle these issues, we present an advanced implicit mixture model for smoothed particle hydrodynamics. Instead of relying on an explicit mixture field for all dynamic computations and phase transfers between particles, our approach calculates phase momentum sources from the mixture model to derive explicit and continuous velocity phase fields. We then implicitly obtain the mixture field using a phase-mixture momentum-mapping mechanism that ensures incompressibility as well as conservation of mass and momentum. In addition, we propose a mixture viscosity model and establish viscous effects between the mixture and individual fluid phases to avoid instability under extreme inertia conditions. Through a series of experiments, we show that, compared to existing mixture models, our method effectively improves dynamic effects while reducing critical instability factors. This makes our approach especially well-suited for long-duration, efficiency-oriented virtual reality scenarios.

SESSION: Robots & Characters

MOCHA: Real-Time Motion Characterization via Context Matching

Transforming neutral, characterless input motions to embody the distinct style of a notable character in real time is highly compelling for character animation. This paper introduces MOCHA, a novel online motion characterization framework that transfers both motion styles and body proportions from a target character to an input source motion. MOCHA begins by encoding the input motion into a motion feature that structures the body part topology and captures motion dependencies for effective characterization. Central to our framework is the Neural Context Matcher, which generates a motion feature for the target character with the most similar context to the input motion feature. The conditioned autoregressive model of the Neural Context Matcher can produce temporally coherent character features in each time frame. To generate the final characterized pose, our Characterizer network incorporates the characteristic aspects of the target motion feature into the input motion feature while preserving its context. This is achieved through a transformer model that introduces the adaptive instance normalization and context mapping-based cross-attention, effectively injecting the character feature into the source feature. We validate the performance of our framework through comparisons with prior work and an ablation study. Our framework can easily accommodate various applications, including characterization with only sparse input and real-time characterization. Additionally, we contribute a high-quality motion dataset comprising six different characters performing a range of motions, which can serve as a valuable resource for future research.
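Adaptive instance normalization, one of the two injection mechanisms named above, can be sketched as follows; the tensor shapes and the way MOCHA derives the scale and shift from the character feature are assumptions made for illustration.

```python
import numpy as np

def adaptive_instance_norm(content, style_mean, style_std, eps=1e-5):
    """AdaIN: re-normalize content features, then apply style statistics.

    content    -- source-motion features, shape (channels, time)
    style_mean -- per-channel mean derived from the character feature
    style_std  -- per-channel std derived from the character feature
    """
    mu = content.mean(axis=-1, keepdims=True)
    sigma = content.std(axis=-1, keepdims=True)
    normalized = (content - mu) / (sigma + eps)
    return style_std[:, None] * normalized + style_mean[:, None]
```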

SESSION: Rendering

FuseSR: Super Resolution for Real-time Rendering through Efficient Multi-resolution Fusion

The workload of real-time rendering is steeply increasing as the demand for high resolution, high refresh rates, and high realism rises, overwhelming most graphics cards. To mitigate this problem, one of the most popular solutions is to render images at a low resolution to reduce rendering overhead, and then accurately upsample the low-resolution rendered image to the target resolution, a.k.a. super-resolution techniques. Most existing methods focus on exploiting information from low-resolution inputs, such as historical frames. The absence of high-frequency details in those LR inputs makes it hard to recover fine details in the high-resolution predictions. In this paper, we propose an efficient and effective super-resolution method that predicts high-quality upsampled reconstructions utilizing low-cost high-resolution auxiliary G-Buffers as additional input. With LR images and HR G-buffers as input, the network must align and fuse features at multiple resolution levels. We introduce an efficient and effective H-Net architecture to solve this problem and significantly reduce rendering overhead without noticeable quality deterioration. Experiments show that our method is able to produce temporally consistent reconstructions in 4 × 4 and even challenging 8 × 8 upsampling cases at 4K resolution with real-time performance, with substantially improved quality and a significant performance boost compared to existing works. Project page: https://isaac-paradox.github.io/FuseSR/

Input-Dependent Uncorrelated Weighting for Monte Carlo Denoising

Image-space denoising techniques have been widely employed in Monte Carlo rendering, typically blending neighboring pixel estimates using a denoising kernel. It is widely recognized that a kernel should be adapted to characteristics of the input pixel estimates in order to ensure robustness to diverse image features and amounts of noise. Denoising with such an input-dependent kernel, however, can introduce a bias that makes the denoised estimate even less accurate than the noisy input estimate. Consequently, it has been considered essential to balance the bias introduced by denoising and the reduction of noise. We propose a new framework to define an input-dependent kernel that departs from the existing approaches based on error estimation or supervised learning. Rather than seeking an optimal bias-noise balance as in those existing approaches, we propose to constrain the amount of bias introduced by denoising. Such a constraint is made possible by the concept of uncorrelated statistics, which has not previously been applied to denoising. By designing an input-dependent kernel whose weights are uncorrelated with the input pixel estimates, our denoising kernel can reduce data-dependent noise with a negligible amount of bias in most cases. We demonstrate the effectiveness of our method for various scenes.

Adaptive Recurrent Frame Prediction with Learnable Motion Vectors

The utilization of dedicated ray tracing graphics cards has revolutionized the production of stunning visual effects in real-time rendering. However, the demand for high frame rates and high resolutions remains a challenge. The pixel warping approach is a crucial technique for increasing frame rate and resolution by exploiting the spatio-temporal coherence. To this end, existing super-resolution and frame prediction methods rely heavily on motion vectors from rendering engine pipelines to track object movements. This work builds upon state-of-the-art heuristic approaches by exploring a novel adaptive recurrent frame prediction framework that integrates learnable motion vectors. Our framework supports the prediction of transparency, particles, and texture animations, with improved motion vectors that capture shading, reflections, and occlusions, in addition to geometry movements. In addition, we introduce a feature streaming neural network, dubbed FSNet, that allows for the adaptive prediction of one or multiple sequential frames. Extensive experiments against state-of-the-art methods demonstrate that FSNet can operate at lower latency with significant visual enhancements and can upscale frame rates by at least two times. This approach offers a flexible pipeline to improve the rendering frame rates of various graphics applications and devices.
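To make the role of motion vectors concrete, the following sketch performs backward warping of a previous frame into the target frame. It is a standard warping routine under assumed conventions (per-pixel offsets that point from target pixels back to their source locations), not FSNet's learned network.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def backward_warp(prev_frame, motion_vectors):
    """Warp prev_frame into the target frame using per-pixel motion vectors.

    prev_frame     -- (H, W) or (H, W, C) image from the previous frame
    motion_vectors -- (H, W, 2) offsets (dy, dx) from each target pixel to its
                      source location in prev_frame
    """
    H, W = motion_vectors.shape[:2]
    yy, xx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_y = yy + motion_vectors[..., 0]
    src_x = xx + motion_vectors[..., 1]
    if prev_frame.ndim == 2:
        return map_coordinates(prev_frame, [src_y, src_x], order=1, mode="nearest")
    # Warp each channel independently with bilinear interpolation.
    return np.stack(
        [map_coordinates(prev_frame[..., c], [src_y, src_x], order=1, mode="nearest")
         for c in range(prev_frame.shape[-1])], axis=-1)
```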

Fast-MSX: Fast Multiple Scattering Approximation

Classical microfacet theory suffers from energy loss on materials with high roughness due to the single-bounce assumption of most microfacet models. When roughness is high, there is a large chance of multiple scattering occurring among the microfacets of the surface. Without explicitly modelling this behaviour, rough surfaces appear darker than they should. To address this issue, we present a novel method to estimate the multiple scattering contribution from a second light bounce. Our method is inspired by Zipin’s geometric construction approach, which simplifies the calculation of the light transport inside a V-groove cavity. Our experimental results demonstrate that our method is visually pleasing, physically plausible, and artifact-free compared to recent multiple scattering works. Additionally, the low computational cost makes our model suitable for real-time rendering.

SESSION: See Details

RMIP: Displacement ray tracing via inversion and oblong bounding

High-performance ray tracing of triangle meshes equipped with displacement maps is a challenging task. Existing methods either rely on pre-tessellation, taking full advantage of the hardware but with a poor memory/quality tradeoff, or use custom displacement-centric acceleration structures, preserving all the geometric details but being orders of magnitude slower. We introduce a method that efficiently probes the displacement-map space to find ray-surface intersections without relying on pre-tessellation. Our method combines inverse displacement mapping and on-the-fly surface-bound computation. It employs a novel data structure that provides tight displacement bounds over rectangular regions in the displacement-map space. We demonstrate the effectiveness of our approach in a production GPU path tracer. It can achieve over an order of magnitude speed-up in render time compared to state of the art in the most challenging real-time path-tracing scenarios, while maintaining a low memory footprint.

SESSION: View Synthesis

SinMPI: Novel View Synthesis from a Single Image with Expanded Multiplane Images

Single-image novel view synthesis is a challenging and ongoing problem that aims to generate an infinite number of consistent views from a single input image. Although significant efforts have been made to advance the quality of generated novel views, less attention has been paid to the expansion of the underlying scene representation, which is crucial to the generation of realistic novel view images. This paper proposes SinMPI, a novel method that uses an expanded multiplane image (MPI) as the 3D scene representation to significantly expand the perspective range of MPI and generate high-quality novel views from a large multiplane space. The key idea of our method is to use Stable Diffusion [Rombach et al. 2021] to generate out-of-view contents, project all scene contents into an expanded multiplane image according to depths predicted by monocular depth estimators, and then optimize the multiplane image under the supervision of pseudo multi-view data generated by a depth-aware warping and inpainting module. Both qualitative and quantitative experiments have been conducted to validate the superiority of our method to the state of the art. Our code and data are available at https://github.com/TrickyGo/SinMPI.
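For readers unfamiliar with the underlying representation, a multiplane image is rendered by alpha-compositing fronto-parallel RGBA planes front to back. The sketch below shows this standard compositing step only; the plane warping into the novel view is omitted and the array layout is an assumption.

```python
import numpy as np

def composite_mpi(planes_rgb, planes_alpha):
    """Front-to-back over-compositing of multiplane-image layers.

    planes_rgb   -- (D, H, W, 3) color planes ordered from near to far
    planes_alpha -- (D, H, W, 1) per-plane alpha in [0, 1]
    """
    out = np.zeros(planes_rgb.shape[1:], dtype=np.float32)
    transmittance = np.ones(planes_alpha.shape[1:], dtype=np.float32)
    for rgb, alpha in zip(planes_rgb, planes_alpha):
        out += transmittance * alpha * rgb      # add this plane's contribution
        transmittance *= (1.0 - alpha)          # attenuate what lies behind it
    return out
```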

Inovis: Instant Novel-View Synthesis

Novel-view synthesis is an ill-posed problem in that it requires inference of previously unseen information. Recently, reviving the traditional field of image-based rendering, neural methods proved particularly suitable for this interpolation/extrapolation task; however, they often require a-priori scene-completeness or costly preprocessing steps and generally suffer from long (scene-specific) training times. Our work draws from recent progress in neural spatio-temporal supersampling to enhance a state-of-the-art neural renderer’s ability to infer novel-view information at inference time. We adapt a supersampling architecture [Xiao et al. 2020], which resamples previously rendered frames, to instead recombine nearby camera images in a multi-view dataset. These input frames are warped into a joint target frame, guided by the most recent (point-based) scene representation, followed by neural interpolation. The resulting architecture gains sufficient robustness to significantly improve transferability to previously unseen datasets. In particular, this enables novel applications for neural rendering where dynamically streamed content is directly incorporated in a (neural) image-based reconstruction of a scene. As we will show, our method reaches state-of-the-art performance when compared to previous works that rely on static and sufficiently densely sampled scenes; in addition, we demonstrate our system’s particular suitability for dynamically streamed content, where our approach is able to produce high-fidelity novel-view synthesis even with significantly fewer available frames than competing neural methods.

High-Fidelity and Real-Time Novel View Synthesis for Dynamic Scenes

This paper aims to tackle the challenge of dynamic view synthesis from multi-view videos. The key observation is that while previous grid-based methods offer consistent rendering, they fall short in capturing appearance details of a complex dynamic scene, a domain where multi-view image-based rendering methods demonstrate the opposite properties. To combine the best of both worlds, we introduce Im4D, a hybrid scene representation that consists of a grid-based geometry representation and a multi-view image-based appearance representation. Specifically, the dynamic geometry is encoded as a 4D density function composed of spatiotemporal feature planes and a small MLP network, which globally models the scene structure and facilitates the rendering consistency. We represent the scene appearance with the original multi-view videos and a network that learns to predict the color of a 3D point from image features, rather than memorizing detailed appearance entirely within network weights, which naturally makes the networks easier to learn. Our method is evaluated on five dynamic view synthesis datasets: DyNeRF, ZJU-MoCap, NHR, DNA-Rendering, and ENeRF-Outdoor. The results show that Im4D exhibits state-of-the-art performance in rendering quality and can be trained efficiently, while realizing real-time rendering with a speed of 79.8 FPS for 512×512 images, on a single RTX 3090 GPU. The code is available at https://zju3dv.github.io/im4d.

Repurposing Diffusion Inpainters for Novel View Synthesis

In this paper, we present a method for generating consistent novel views from a single source image. Our approach focuses on maximizing the reuse of visible pixels from the source image. To achieve this, we use a monocular depth estimator that transfers visible pixels from the source view to the target view. Starting from a pre-trained 2D inpainting diffusion model, we train our method on the large-scale Objaverse dataset to learn 3D object priors. While training we use a novel masking mechanism based on epipolar lines to further improve the quality of our approach. This allows our framework to perform zero-shot novel view synthesis on a variety of objects. We evaluate the zero-shot abilities of our framework on three challenging datasets: Google Scanned Objects, Ray Traced Multiview, and Common Objects in 3D.

VMesh: Hybrid Volume-Mesh Representation for Efficient View Synthesis

With the emergence of neural radiance fields (NeRFs), view synthesis quality has reached an unprecedented level. Compared to traditional mesh-based assets, this volumetric representation is more powerful in expressing scene geometry but inevitably suffers from high rendering costs and can hardly be involved in further processes like editing, posing significant difficulties in combination with the existing graphics pipeline. In this paper, we present a hybrid volume-mesh representation, VMesh, which depicts an object with a textured mesh along with an auxiliary sparse volume. VMesh retains the advantages of mesh-based assets, such as efficient rendering and compact storage, while also incorporating the ability to represent subtle geometric structures provided by the volumetric counterpart. VMesh can be obtained from multi-view images of an object and renders at 2K 60FPS on common consumer devices with high fidelity, unleashing new opportunities for real-time immersive applications.

SESSION: Motion Synthesis with Awareness

DROP: Dynamics Responses from Human Motion Prior and Projective Dynamics

Synthesizing realistic human movements, dynamically responsive to the environment, is a long-standing objective in character animation, with applications in computer vision, sports, and healthcare, for motion prediction and data augmentation. Recent kinematics-based generative motion models offer impressive scalability in modeling extensive motion data, albeit without an interface to reason about and interact with physics. While simulator-in-the-loop learning approaches enable highly physically realistic behaviors, the challenges in training often affect scalability and adoption. We introduce DROP, a novel framework for modeling Dynamics Responses of humans using generative mOtion prior and Projective dynamics. DROP can be viewed as a highly stable, minimalist physics-based human simulator that interfaces with a kinematics-based generative motion prior. Utilizing projective dynamics, DROP allows flexible and simple integration of the learned motion prior as one of the projective energies, seamlessly incorporating control provided by the motion prior with Newtonian dynamics. Serving as a model-agnostic plug-in, DROP enables us to fully leverage recent advances in generative motion models for physics-based motion synthesis. We conduct extensive evaluations of our model across different motion tasks and various physical perturbations, demonstrating the scalability and diversity of responses.

Computational Design of Wiring Layout on Tight Suits with Minimal Motion Resistance

An increasing number of electronic components are embedded directly in clothing to monitor human status (e.g., skeletal motion) or provide haptic feedback. A specific challenge in prototyping and fabricating such clothing is designing the wiring layout while minimizing interference with human motion. We address this challenge by formulating the topological optimization problem on the clothing surface as a deformation-weighted Steiner tree problem on a 3D clothing mesh. Our method proposes an energy function that minimizes the strain energy in the wiring area under different motions, regularized by the total wire length. We built physical prototypes to verify the effectiveness of our method and conducted a user study with participants including both design experts and smart-clothing users. On three types of commercial smart clothing, the optimized layout design reduced wire strain energy by an average of 77% across 248 actions compared to the baseline design, and by 18% compared to the expert design.
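A minimal sketch of the optimization target described above, under assumed inputs: each mesh edge is weighted by its motion-averaged strain energy plus a length regularizer, and a standard Steiner-tree approximation connects the electronics terminals. The weighting scheme and the use of networkx's 2-approximation are illustrative stand-ins for the paper's formulation.

```python
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

def wiring_layout(edges, strain_energy, length, terminals, lam=0.1):
    """Approximate a deformation-weighted Steiner tree on a clothing mesh.

    edges         -- list of (vertex_i, vertex_j) mesh edges
    strain_energy -- dict edge -> strain energy averaged over the motion set
    length        -- dict edge -> rest-state edge length
    terminals     -- mesh vertices that must be connected (sensor/electronics sites)
    lam           -- weight of the total-length regularizer
    """
    G = nx.Graph()
    for e in edges:
        G.add_edge(*e, weight=strain_energy[e] + lam * length[e])
    # Classical metric-closure 2-approximation of the minimum Steiner tree.
    return steiner_tree(G, terminals, weight="weight")
```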

SESSION: Holography

Multi-color Holograms Improve Brightness in Holographic Displays

Holographic displays generate Three-Dimensional (3D) images by displaying single-color holograms time-sequentially, each lit by a single-color light source. However, representing each color one by one limits brightness in holographic displays. This paper introduces a new driving scheme for realizing brighter images in holographic displays. Unlike the conventional driving scheme, our method utilizes three light sources to illuminate each displayed hologram simultaneously at various intensity levels. In this way, our method reconstructs a multiplanar three-dimensional target scene using consecutive multi-color holograms and persistence of vision. We co-optimize multi-color holograms and required intensity levels from each light source using a gradient descent-based optimizer with a combination of application-specific loss terms. We experimentally demonstrate that our method can increase the intensity levels in holographic displays up to three times, reaching a broader range and unlocking new potentials for perceptual realism in holographic displays.

Holographic Near-eye Display with Real-time Embedded Rendering

We present a wearable full-color holographic augmented reality headset with binocular vision support and real-time embedded hologram calculation. In contrast to most previously proposed prototypes, our headset employs high-speed amplitude-only microdisplays and embeds a compact and lightweight electronic board to drive and synchronize the microdisplays and light source engines. In addition, to enable standalone usage of the headset, we developed a real-time hologram rendering engine capable of computing full-color binocular holograms at over 35 frames per second on an NVIDIA Jetson AGX Orin embedded platform. Finally, we provide a comparison of the efficiency of laser diodes and superluminescent diodes for the reduction of speckle noise, which greatly affects the reconstructed image’s quality. Experimental results show that our prototype enables full-color holographic images to be reconstructed with accurate focus cues and reduced speckle noise in real time.

Simultaneous Color Computer Generated Holography

Computer generated holography has long been touted as the future of augmented and virtual reality (AR/VR) displays, but has yet to be realized in practice. Previous high-quality color holographic displays have either incurred a 3× frame-rate penalty by using a sequential color illumination scheme or relied on more than one spatial light modulator (SLM) and/or bulky, complex optical setups. The reduced frame rate of sequential color introduces distracting judder and color fringing in the presence of head motion, while the form factor of current simultaneous color systems is incompatible with a head-mounted display. In this work, we propose a framework for simultaneous color holography that allows the use of the full SLM frame rate while maintaining a compact and simple optical setup. Simultaneous color holograms are optimized through the use of a perceptual loss function, a physics-based neural network wavefront propagator, and a camera-calibrated forward model. We measurably improve hologram quality compared to other simultaneous color methods and move one step closer to the realization of color holographic displays for AR/VR.

SESSION: Full-Body Avatar

Towards Practical Capture of High-Fidelity Relightable Avatars

In this paper, we propose a novel framework, Tracking-free Relightable Avatar (TRAvatar), for capturing and reconstructing high-fidelity 3D avatars. Compared to previous methods, TRAvatar works in a more practical and efficient setting. Specifically, TRAvatar is trained with dynamic image sequences captured in a Light Stage under varying lighting conditions, enabling realistic relighting and real-time animation for avatars in diverse scenes. Additionally, TRAvatar allows for tracking-free avatar capture and obviates the need for accurate surface tracking under varying illumination conditions. Our contributions are two-fold: First, we propose a novel network architecture that explicitly builds in the linear nature of lighting and ensures that it is satisfied. Trained on simple group light captures, TRAvatar can predict the appearance in real time with a single forward pass, achieving high-quality relighting effects under illuminations of arbitrary environment maps. Second, we jointly optimize the facial geometry and relightable appearance from scratch based on image sequences, where the tracking is implicitly learned. This tracking-free approach brings robustness for establishing temporal correspondences between frames under different lighting conditions. Extensive qualitative and quantitative experiments demonstrate that our framework achieves superior performance for photorealistic avatar animation and relighting.

Drivable Avatar Clothing: Faithful Full-Body Telepresence with Dynamic Clothing Driven by Sparse RGB-D Input

Clothing is an important part of human appearance but challenging to model in photorealistic avatars. In this work we present avatars with dynamically moving loose clothing that can be faithfully driven by sparse RGB-D inputs as well as body and face motion. We propose a Neural Iterative Closest Point (N-ICP) algorithm that can efficiently track the coarse garment shape given sparse depth input. Given the coarse tracking results, the input RGB-D images are then remapped to texel-aligned features, which are fed into the drivable avatar models to faithfully reconstruct appearance details. We evaluate our method against recent image-driven synthesis baselines, and conduct a comprehensive analysis of the N-ICP algorithm. We demonstrate that our method can generalize to a novel testing environment, while preserving the ability to produce high-fidelity and faithful clothing dynamics and appearance.

SESSION: How To Deal With NERF?

SimpleNeRF: Regularizing Sparse Input Neural Radiance Fields with Simpler Solutions

Neural Radiance Fields (NeRF) show impressive performance for the photo-realistic free-view rendering of scenes. However, NeRFs require dense sampling of images in the given scene, and their performance degrades significantly when only a sparse set of views are available. Researchers have found that supervising the depth estimated by the NeRF helps train it effectively with fewer views. The depth supervision is obtained either using classical approaches or neural networks pre-trained on a large dataset. While the former may provide only sparse supervision, the latter may suffer from generalization issues. As opposed to the earlier approaches, we seek to learn the depth supervision by designing augmented models and training them along with the NeRF. We design augmented models that encourage simpler solutions by exploring the role of positional encoding and view-dependent radiance in training the few-shot NeRF. The depth estimated by these simpler models is used to supervise the NeRF depth estimates. Since the augmented models can be inaccurate in certain regions, we design a mechanism to choose only reliable depth estimates for supervision. Finally, we add a consistency loss between the coarse and fine multi-layer perceptrons of the NeRF to ensure better utilization of hierarchical sampling. We achieve state-of-the-art view-synthesis performance on two popular datasets by employing the above regularizations. The source code for our model can be found on our project page: https://nagabhushansn95.github.io/publications/2023/SimpleNeRF.html
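The two regularizers described above reduce to simple per-ray penalties; the sketch below shows one hedged way to write them, where the reliability mask and the treatment of the augmented depth as fixed supervision are assumptions rather than the paper's exact mechanism.

```python
import numpy as np

def simplenerf_regularizers(depth_main, depth_aug, reliable_mask,
                            rgb_coarse, rgb_fine, w_depth=0.1, w_cons=0.1):
    """Depth supervision from augmented models plus coarse/fine consistency.

    depth_main          -- per-ray depth from the main NeRF
    depth_aug           -- per-ray depth from a simpler augmented model (fixed target)
    reliable_mask       -- boolean mask of rays where the augmented depth is trusted
    rgb_coarse/rgb_fine -- per-ray colors from the coarse and fine MLPs
    """
    depth_loss = np.mean(reliable_mask * np.abs(depth_main - depth_aug))
    consistency_loss = np.mean((rgb_coarse - rgb_fine) ** 2)
    return w_depth * depth_loss + w_cons * consistency_loss
```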

DreamEditor: Text-Driven 3D Scene Editing with Neural Fields

Neural fields have achieved impressive advancements in view synthesis and scene reconstruction. However, editing these neural fields remains challenging due to the implicit encoding of geometry and texture information. In this paper, we propose DreamEditor, a novel framework that enables users to perform controlled editing of neural fields using text prompts. By representing scenes as mesh-based neural fields, DreamEditor allows localized editing within specific regions. DreamEditor utilizes the text encoder of a pretrained text-to-Image diffusion model to automatically identify the regions to be edited based on the semantics of the text prompts. Subsequently, DreamEditor optimizes the editing region and aligns its geometry and texture with the text prompts through score distillation sampling [Poole et al. 2022]. Extensive experiments have demonstrated that DreamEditor can accurately edit neural fields of real-world scenes according to the given text prompts while ensuring consistency in irrelevant areas. DreamEditor generates highly realistic textures and geometry, significantly surpassing previous works in both quantitative and qualitative evaluations.
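Score distillation sampling [Poole et al. 2022], which the abstract references, can be summarized in a few lines. The denoiser interface below is a hypothetical placeholder, and the weighting is one common choice; this is a sketch of the general SDS recipe, not DreamEditor's training loop.

```python
import torch

def sds_step(rendered, text_embedding, denoiser, alphas_cumprod, optimizer):
    """One score-distillation step: perturb the rendering, query the diffusion
    model's predicted noise, and push the residual back into the parameters
    that produced `rendered`.

    denoiser(x_t, t, text_embedding) -> predicted noise   (hypothetical interface)
    """
    t = torch.randint(20, 980, (1,))                  # random diffusion timestep
    alpha_bar = alphas_cumprod[t]
    noise = torch.randn_like(rendered)
    x_t = alpha_bar.sqrt() * rendered + (1 - alpha_bar).sqrt() * noise
    with torch.no_grad():
        noise_pred = denoiser(x_t, t, text_embedding)
    w = 1.0 - alpha_bar                               # common SDS weighting choice
    grad = w * (noise_pred - noise)
    optimizer.zero_grad()
    rendered.backward(gradient=grad)                  # treat grad as d(loss)/d(rendered)
    optimizer.step()
```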

SESSION: Smooth-Parametric-LINE

Bézier Spline Simplification Using Locally Integrated Error Metrics

Inspired by surface mesh simplification methods, we present a technique for reducing the number of Bézier curves in a vector graphics image while maintaining high fidelity. We propose a curve-to-curve distance metric to repeatedly conduct local segment removal operations. By construction, we identify all possible lossless removal operations, ensuring the smallest possible zero-error representation of a given design. Subsequent lossy operations are computed via local Gauss-Newton optimization and processed in a priority queue. We tested our method on the OpenClipArts dataset of 20,000 real-world vector graphics images and show significant improvements over representative previous methods. The generality of our method allows us to show results for curves with varying thickness and for vector graphics animations.
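The priority-queue structure of the simplification is a familiar pattern from mesh decimation. The sketch below shows a generic lazy-deletion loop under assumed helper functions (`removal_cost`, `apply_removal`, `neighbors`), which stand in for the paper's curve-to-curve metric and Gauss-Newton refits.

```python
import heapq
import itertools

def simplify(segments, removal_cost, apply_removal, neighbors, max_error):
    """Greedy segment-removal loop driven by a lazy-deletion priority queue.

    removal_cost(seg)  -- error of removing `seg` and refitting its neighborhood
    apply_removal(seg) -- perform the removal; returns newly refit segment objects
                          that replace `seg`'s old neighbors
    neighbors(seg)     -- old segments whose cached costs become stale afterwards
    """
    counter = itertools.count()        # tie-breaker so segments are never compared
    heap = [(removal_cost(s), next(counter), s) for s in segments]
    heapq.heapify(heap)
    stale = set()
    while heap:
        cost, _, seg = heapq.heappop(heap)
        if seg in stale:
            continue                   # cost was invalidated by an earlier removal
        if cost > max_error:
            break                      # cheapest remaining removal exceeds the budget
        for n in neighbors(seg):
            stale.add(n)               # their cached costs are no longer valid
        for refit in apply_removal(seg):
            heapq.heappush(heap, (removal_cost(refit), next(counter), refit))
```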

PSDR-Room: Single Photo to Scene using Differentiable Rendering

A 3D digital scene contains many components: lights, materials and geometries, interacting to reach the desired appearance. Staging such a scene is time-consuming and requires both artistic and technical skills. In this work, we propose PSDR-Room, a system that optimizes lighting as well as the pose and materials of individual objects to match a target image of a room scene, with minimal user input. To this end, we leverage a recent path-space differentiable rendering approach that provides unbiased gradients of the rendering with respect to geometry, lighting, and procedural materials, allowing us to optimize all of these components using gradient descent to visually match the input photo appearance. We use recent single-image scene understanding methods to initialize the optimization and search for appropriate 3D models and materials. We evaluate our method on real photographs of indoor scenes and demonstrate the editability of the resulting scene components.

SESSION: Applications & Innovations

Joint Sampling and Optimisation for Inverse Rendering

When dealing with difficult inverse problems such as inverse rendering, using Monte Carlo estimated gradients to optimise parameters can slow down convergence due to variance. Averaging many gradient samples in each iteration reduces this variance trivially. However, for problems that require thousands of optimisation iterations, the computational cost of this approach rises quickly.

We derive a theoretical framework for interleaving sampling and optimisation. We update and reuse past samples with low-variance finite-difference estimators that describe the change in the estimated gradients between each iteration. By combining proportional and finite-difference samples, we continuously reduce the variance of our novel gradient meta-estimators throughout the optimisation process. We investigate how our estimator interlinks with Adam and derive a stable combination.

We implement our method for inverse path tracing and demonstrate how our estimator speeds up convergence on difficult optimisation tasks.
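The idea of reusing past samples through change estimates can be sketched with a simple recursive estimator. This is a paraphrase of the general concept under assumed interfaces, not the paper's derivation or its combination with Adam.

```python
def meta_gradient(prev_estimate, sample_gradient, sample_change, beta=0.2):
    """Recursive gradient meta-estimator (conceptual sketch).

    prev_estimate   -- gradient estimate carried over from the previous iteration
    sample_gradient -- fresh Monte Carlo gradient estimate at the current parameters
    sample_change   -- finite-difference estimate of how the gradient changed
                       between the previous and current parameters
    beta            -- blending factor between reused and fresh information
    """
    carried = prev_estimate + sample_change       # reuse the old estimate, corrected
    return (1.0 - beta) * carried + beta * sample_gradient
```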

Extended Path Space Manifolds for Physically Based Differentiable Rendering

Physically based differentiable rendering has become an increasingly important topic in recent years. A common pipeline computes local color derivatives of light paths or pixels with respect to arbitrary scene parameters, and enables optimizing or recovering the scene parameters through iterative gradient descent by minimizing the difference between rendered and target images. However, existing approaches cannot robustly handle complex illumination effects including reflections, refractions, caustics, shadows, and highlights, especially when the initial and target locations of such illumination effects are not close to each other in the image space.

To address this problem, we propose a novel data structure named extended path space manifolds. The manifolds are defined in the combined space of path vertices and scene parameters. By enforcing geometric constraints, the path vertices could be implicitly and uniquely determined by perturbed scene parameters. This enables the manifold to track specific illumination effects and the corresponding paths, i.e., specular paths will still be specular paths after scene parameters are perturbed. Besides, the path derivatives with respect to scene parameters could be computed by solving small linear systems.

We further propose a physically based differentiable rendering method built upon the theoretical results of extended path space manifolds. By incorporating the path derivatives computed from the manifolds and an optimal transport based loss function, our method is demonstrated to be more effective and robust than state-of-the-art approaches in inverse rendering applications involving complex illumination effects.

SESSION: Light, Shadows & Curves

SOL-NeRF: Sunlight Modeling for Outdoor Scene Decomposition and Relighting

Outdoor scenes often involve large-scale geometry and complex unknown lighting conditions, making it difficult to decompose them into geometry, reflectance and illumination. Recently, researchers have attempted to decompose outdoor scenes using Neural Radiance Fields (NeRF) and learning-based lighting and shadow representations. However, diverse lighting conditions and shadows in outdoor scenes are challenging for learning-based models. Moreover, existing methods may produce rough geometry and normal reconstructions and introduce notable shading artifacts when the scene is rendered under a novel illumination. To solve the above problems, we propose SOL-NeRF to decompose outdoor scenes with the help of a hybrid lighting representation and a signed distance field geometry reconstruction. We use a single Spherical Gaussian (SG) lobe to approximate the sun lighting, and a first-order Spherical Harmonic (SH) mixture to resemble the sky lighting. This hybrid representation is specifically designed for outdoor settings, and compactly models the outdoor lighting, ensuring robustness and efficiency. The shadow of the direct sun lighting can be obtained by casting a ray against the mesh extracted from the signed distance field, and the remaining shadow can be approximated by Ambient Occlusion (AO). Additionally, a sun lighting color prior and a relaxed Manhattan-world assumption can be further applied to boost decomposition and relighting performance. When changing the lighting condition, our method can produce consistent relighting results with correct shadow effects. Experiments conducted on our hybrid lighting scheme and the entire decomposition pipeline show that our method achieves better reconstruction, decomposition, and relighting performance compared to previous methods both quantitatively and qualitatively.
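The hybrid sun/sky model evaluates to a compact closed form. The sketch below shows a single spherical Gaussian lobe plus a first-order spherical-harmonic mixture, with all coefficients treated as hypothetical inputs rather than values from the paper.

```python
import numpy as np

def sun_sky_radiance(d, sg_axis, sg_sharpness, sg_amplitude, sh_coeffs):
    """Evaluate incoming radiance from a unit direction d.

    Sun: single spherical Gaussian  a * exp(lambda * (d . axis - 1)).
    Sky: first-order SH mixture with coefficients ordered (l=0), (l=1, m=-1,0,1).
    """
    sun = sg_amplitude * np.exp(sg_sharpness * (np.dot(d, sg_axis) - 1.0))
    x, y, z = d
    sh_basis = np.array([0.282095, 0.488603 * y, 0.488603 * z, 0.488603 * x])
    sky = sh_basis @ sh_coeffs        # sh_coeffs has shape (4,) or (4, 3) for RGB
    return sun + sky
```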

Shadow Harmonization for Realistic Compositing

Compositing virtual objects into real background images requires one to carefully match the scene’s camera parameters, surface geometry, textures, and lighting to obtain plausible renderings. Recent learning approaches have shown many scene properties can be estimated from images, resulting in robust automatic single-image compositing systems, but many challenges remain. In particular, interactions between real and synthetic shadows are not handled gracefully by existing methods, which typically assume a shadow-free background. As a result, they tend to generate double shadows when the synthetic object’s cast shadow overlaps a background shadow, and ignore shadows from the background that should be cast onto the synthetic object. In this paper, we present a compositing method for outdoor scenes that addresses these issues and produces realistic cast shadows. This requires identifying existing shadows, including soft shadow boundaries, then reasoning about the ambiguity of unknown ground albedo and scene lighting to match the color and intensity of shaded areas. Using supervision from shadow removal and detection datasets, we propose a generative adversarial pipeline and improved composition equations that simultaneously handle both shadow interaction scenarios. We evaluate our method on challenging, real outdoor images from multiple distributions and datasets. Quantitative and qualitative comparisons show our approach produces more realistic results than existing alternatives. Our code, datasets, and trained models are publicly available at https://lvsn.github.io/shadowcompositing.

SESSION: Rendering, Neural Fields & Neural Caches

SeamlessNeRF: Stitching Part NeRFs with Gradient Propagation

Neural Radiance Fields (NeRFs) have emerged as promising digital media for 3D objects and scenes, sparking a surge in research to extend the editing capabilities in this domain. The task of seamless editing and merging of multiple NeRFs, resembling the “Poisson blending” in 2D image editing, remains a critical operation that is under-explored by existing work. To fill this gap, we propose SeamlessNeRF, a novel approach for seamless appearance blending of multiple NeRFs. Specifically, we aim to optimize the appearance of a target radiance field in order to harmonize its merge with a source field. We propose a well-tailored optimization procedure for blending, which is constrained by 1) pinning the radiance color in the intersecting boundary area between the source and target fields and 2) maintaining the original gradient of the target. Extensive experiments validate that our approach can effectively propagate the source appearance from the boundary area to the entire target field through the gradients. To the best of our knowledge, SeamlessNeRF is the first work that introduces gradient-guided appearance editing to radiance fields, offering solutions for seamless stitching of 3D objects represented in NeRFs. Our code and more results are available at https://sites.google.com/view/seamlessnerf.

Neural Caches for Monte Carlo Partial Differential Equation Solvers

This paper presents a method that uses neural networks as a caching mechanism to reduce the variance of Monte Carlo Partial Differential Equation solvers, such as the Walk-on-Spheres algorithm [Sawhney and Crane 2020]. While these Monte Carlo PDE solvers have the merits of being unbiased and discretization-free, their high variance often hinders real-time applications. On the other hand, neural networks can approximate the PDE solution, and evaluating these networks at inference time can be very fast. However, neural-network-based solutions may suffer from convergence difficulties and high bias. Our hybrid system aims to combine these two potentially complementary solutions by training a neural field to approximate the PDE solution using supervision from a WoS solver. This neural field is then used as a cache in the WoS solver to reduce variance during inference. We demonstrate that our neural field training procedure is better than the commonly used self-supervised objectives in the literature. We also show that our hybrid solver exhibits lower variance than WoS with the same computational budget: it is significantly better for small compute budgets and provides smaller improvements for larger budgets, reaching the same performance as WoS in the limit.
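The hybrid of walk-on-spheres and a learned cache can be illustrated on a 2D Laplace problem: a standard WoS walk runs for a few jumps, then terminates early by querying the cache. The domain distance function, boundary data, cache interface, and termination heuristic below are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def wos_with_cache(x, sdf, boundary_value, cache,
                   eps=1e-3, cache_after=4, max_steps=64):
    """Estimate a harmonic function u(x) with Walk-on-Spheres plus a cache.

    sdf(x)            -- distance from x to the domain boundary (> 0 inside)
    boundary_value(x) -- Dirichlet data g evaluated near the boundary
    cache(x)          -- learned approximation of u (hypothetical interface)
    """
    for step in range(max_steps):
        r = sdf(x)
        if r < eps:                    # reached the boundary shell: use g
            return boundary_value(x)
        if step >= cache_after:        # deep in the walk: trust the cache instead
            return cache(x)
        theta = np.random.uniform(0.0, 2.0 * np.pi)
        x = x + r * np.array([np.cos(theta), np.sin(theta)])  # jump to the sphere (2D)
    return cache(x)
```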

SESSION: TechScape

Perceptual Requirements for World-Locked Rendering in AR and VR

Stereoscopic, head-tracked display systems can show users realistic, world-locked virtual objects and environments. However, discrepancies between the rendering pipeline and physical viewing conditions can lead to perceived instability in the rendered content resulting in reduced immersion and, potentially, visually-induced motion sickness. Precise requirements to achieve perceptually stable world-locked rendering (WLR) are unknown due to the challenge of constructing a wide field of view, distortion-free display with highly accurate head and eye tracking. We present a system capable of rendering virtual objects over real-world references without perceivable drift under such constraints. This platform is used to study acceptable errors in render camera position for WLR in augmented and virtual reality scenarios, where we find an order of magnitude difference in perceptual sensitivity. We conclude with an analytic model which examines changes to apparent depth and visual direction in response to camera displacement errors.

Perceptually Adaptive Real-Time Tone Mapping

Tone mapping operators aim to remap content to a display’s dynamic range. Virtual reality is a popular new display modality that has significant differences from other media, making the use of traditional tone mapping techniques difficult. Moreover, real-time adaptive estimation of tone curves that faithfully maintain appearance remains a significant challenge. In this work, we propose a real-time perceptual contrast-matching framework, that allows us to optimally remap scenes for target displays. Our framework is optimized for efficiency and runs on a mobile Quest 2 headset in under 1ms per frame. A subjective study on an HDR-VR prototype demonstrates our method’s effectiveness across a wide range of display luminances, producing imagery that is preferred to alternatives tone mapped at peak luminances an order of magnitude higher. This result highlights the importance of good tone mapping for visual quality in VR.

SESSION: Anything Can be Neural

LiveNVS: Neural View Synthesis on Live RGB-D Streams

Existing real-time RGB-D reconstruction approaches, like Kinect Fusion, lack real-time photo-realistic visualization. This is due to noisy, oversmoothed or incomplete geometry and blurry textures which are fused from imperfect depth maps and camera poses. Recent neural rendering methods can overcome many of such artifacts but are mostly optimized for offline usage, hindering the integration into a live reconstruction pipeline.

In this paper, we present LiveNVS, a system that allows for neural novel view synthesis on a live RGB-D input stream with very low latency and real-time rendering. Based on the RGB-D input stream, novel views are rendered by projecting neural features into the target view via a densely fused depth map and aggregating the features in image-space to a target feature map. A generalizable neural network then translates the target feature map into a high-quality RGB image. LiveNVS achieves state-of-the-art neural rendering quality of unknown scenes during capturing, allowing users to virtually explore the scene and assess reconstruction quality in real-time.

VET: Visual Error Tomography for Point Cloud Completion and High-Quality Neural Rendering

In the last few years, deep neural networks opened the doors for big advances in novel view synthesis. Many of these approaches are based on a (coarse) proxy geometry obtained by structure from motion algorithms. Small deficiencies in this proxy can be fixed by neural rendering, but larger holes or missing parts, as they commonly appear for thin structures or for glossy regions, still lead to distracting artifacts and temporal instability. In this paper, we present a novel neural-rendering-based approach to detect and fix such deficiencies. As a proxy, we use a point cloud, which allows us to easily remove outlier geometry and to fill in missing geometry without complicated topological operations. Keys to our approach are (i) a differentiable, blending point-based renderer that can blend out redundant points, as well as (ii) the concept of Visual Error Tomography (VET), which allows us to lift 2D error maps to identify 3D-regions lacking geometry and to spawn novel points accordingly. Furthermore, (iii) by adding points as nested environment maps, our approach allows us to generate high-quality renderings of the surroundings in the same pipeline. In our results, we show that our approach can improve the quality of a point cloud obtained by structure from motion and thus increase novel view synthesis quality significantly. In contrast to point growing techniques, the approach can also fix large-scale holes and missing thin structures effectively. Rendering quality outperforms state-of-the-art methods and temporal stability is significantly improved, while rendering is possible at real-time frame rates.

SESSION: Materials

Multiple-bounce Smith Microfacet BRDFs using the Invariance Principle

Smith microfacet models are widely used in computer graphics to represent materials. Traditional microfacet models do not consider multiple bounces on the microgeometry, leading to visible energy loss, especially on rough surfaces. Later, after the equivalence between microfacets and volumes was revealed, random-walk solutions were proposed to introduce multiple bounces, but at the cost of high variance. Recently, the position-free property has been introduced into the multiple-bounce model, resulting in much less noise, but also bias or a complex derivation. In this paper, we propose a simple way to derive the multiple-bounce Smith microfacet bidirectional reflectance distribution functions (BRDFs) using the invariance principle. At the core of our model is a shadowing-masking function for a path consisting of direction collections, rather than separated bounces. Our model is unbiased and, thanks to its simple formulation, produces less noise than previous work in equal time. Furthermore, we also propose a novel probability density function (PDF) for BRDF multiple importance sampling, which better matches the multiple-bounce BRDFs, producing less noise than previous naive approximations.
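For context, the single-scattering Smith GGX terms that multiple-bounce models extend take the standard form below. This is the classical single-bounce baseline with the Fresnel term omitted, not the paper's path-space shadowing-masking function or its sampling PDF.

```python
import numpy as np

def ggx_ndf(n_dot_h, alpha):
    """GGX / Trowbridge-Reitz normal distribution function."""
    a2 = alpha * alpha
    denom = n_dot_h * n_dot_h * (a2 - 1.0) + 1.0
    return a2 / (np.pi * denom * denom)

def smith_g1(n_dot_v, alpha):
    """Smith masking term for a single direction (separable form)."""
    a2 = alpha * alpha
    return 2.0 * n_dot_v / (n_dot_v + np.sqrt(a2 + (1.0 - a2) * n_dot_v * n_dot_v))

def single_scatter_ggx(n_dot_l, n_dot_v, n_dot_h, alpha):
    """Single-bounce Smith microfacet reflectance (Fresnel omitted for brevity)."""
    D = ggx_ndf(n_dot_h, alpha)
    G = smith_g1(n_dot_l, alpha) * smith_g1(n_dot_v, alpha)
    return D * G / (4.0 * n_dot_l * n_dot_v)
```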

A Micrograin BSDF Model for the Rendering of Porous Layers

We introduce a new BSDF model for the rendering of porous layers, as found on surfaces covered by dust, rust, dirt, or sprayed paint. Our approach is based on a distribution of elliptical opaque micrograins, extending the Trowbridge-Reitz (GGX) distribution [Trowbridge and Reitz 1975; Walter et al. 2007] to handle pores (i.e., spaces between micrograins). We use distance field statistics to derive the corresponding Normal Distribution Function (NDF) and Geometric Attenuation Factor (GAF), as well as a view- and light-dependent filling factor to blend between the porous and base layers. All the derived terms show excellent agreement when compared against numerical simulations.

Our approach has several advantages compared to previous work [d’Eon et al. 2023; Merillou et al. 2000; Wang et al. 2022]. First, it decouples structural and reflectance parameters, leading to an analytical single-scattering formula regardless of the choice of micrograin reflectance. Second, we show that the classical texture maps (albedo, roughness, etc) used for spatially-varying material parameters are easily retargeted to work with our model. Finally, the BRDF parameters of our model behave linearly, granting direct multi-scale rendering using classical mip mapping.
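The layer-blending structure described above can be written compactly. The filling-factor function is an assumed callable standing in for the derived view- and light-dependent term, and the BRDF interfaces are placeholders.

```python
def porous_layer_bsdf(wo, wi, f_porous, f_base, filling_factor):
    """Blend porous and base BRDFs with a view/light-dependent filling factor.

    filling_factor(wo, wi) -- fraction of the footprint covered by micrograins,
                              in [0, 1]; derived analytically in the paper but
                              treated here as a given callable
    f_porous, f_base       -- BRDF callables for the micrograin and base layers
    """
    w = filling_factor(wo, wi)
    return w * f_porous(wo, wi) + (1.0 - w) * f_base(wo, wi)
```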

SESSION: Beyond Skin Deep

Emotional Speech-Driven Animation with Content-Emotion Disentanglement

To be widely adopted, 3D facial avatars must be animated easily, realistically, and directly from speech signals. While the best recent methods generate 3D animations that are synchronized with the input audio, they largely ignore the impact of emotions on facial expressions. Realistic facial animation requires lip-sync together with the natural expression of emotion. To that end, we propose EMOTE  (Expressive Model Optimized for Talking with Emotion), which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion. To achieve this, we supervise EMOTE with decoupled losses for speech (i.e., lip-sync) and emotion. These losses are based on two key observations: (1) deformations of the face due to speech are spatially localized around the mouth and have high temporal frequency, whereas (2) facial expressions may deform the whole face and occur over longer intervals. Thus we train EMOTE with a per-frame lip-reading loss to preserve the speech-dependent content, while supervising emotion at the sequence level. Furthermore, we employ a content-emotion exchange mechanism in order to supervise different emotions on the same audio, while maintaining the lip motion synchronized with the speech. To employ deep perceptual losses without getting undesirable artifacts, we devise a motion prior in the form of a temporal VAE. Due to the absence of high-quality aligned emotional 3D face datasets with speech, EMOTE is trained with 3D pseudo-ground-truth extracted from an emotional video dataset (i.e., MEAD). Extensive qualitative and perceptual evaluations demonstrate that EMOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data, while offering additional, high-quality emotional control.

LitNeRF: Intrinsic Radiance Decomposition for High-Quality View Synthesis and Relighting of Faces

High-fidelity, photorealistic 3D capture of a human face is a long-standing problem in computer graphics – the complex material of skin, intricate geometry of hair, and fine scale textural details make it challenging. Traditional techniques rely on very large and expensive capture rigs to reconstruct explicit mesh geometry and appearance maps, and are limited by the accuracy of hand-crafted reflectance models. More recent volumetric methods (e.g., NeRFs) have enabled view-synthesis and sometimes relighting by learning an implicit representation of the density and reflectance basis, but suffer from artifacts and blurriness due to the inherent ambiguities in volumetric modeling. These problems are further exacerbated when capturing with few cameras and light sources. We present a novel technique for high-quality capture of a human face for 3D view synthesis and relighting using a sparse, compact capture rig consisting of 15 cameras and 15 lights. Our method combines a neural volumetric representation with traditional mesh reconstruction from multiview stereo. The proxy geometry allows us to anchor the 3D density field to prevent artifacts and guide the disentanglement of intrinsic radiance components of the face appearance such as diffuse and specular reflectance, and incident radiance (shadowing) fields. Our hybrid representation significantly improves the state-of-the-art quality for arbitrarily dense renders of a face from desired camera viewpoint as well as environmental, directional, and near-field lighting.

SESSION: Technoscape

VR-NeRF: High-Fidelity Virtualized Walkable Spaces

We present an end-to-end system for the high-fidelity capture, model reconstruction, and real-time rendering of walkable spaces in virtual reality using neural radiance fields. To this end, we designed and built a custom multi-camera rig to densely capture walkable spaces in high fidelity and with multi-view high dynamic range images in unprecedented quality and density. We extend instant neural graphics primitives with a novel perceptual color space for learning accurate HDR appearance, and an efficient mip-mapping mechanism for level-of-detail rendering with anti-aliasing, while carefully optimizing the trade-off between quality and speed. Our multi-GPU renderer enables high-fidelity volume rendering of our neural radiance field model at the full VR resolution of dual 2K × 2K at 36 Hz on our custom demo machine. We demonstrate the quality of our results on our challenging high-fidelity datasets, and compare our method and datasets to existing baselines. We release our dataset on our project website: https://vr-nerf.github.io.

What is the Best Automated Metric for Text to Motion Generation?

There is growing interest in generating skeleton-based human motions from natural language descriptions. While most efforts have focused on developing better neural architectures for this task, there has been no significant work on determining the proper evaluation metric. Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments. Since descriptions are compatible with many motions, determining the right metric is critical for evaluating and designing effective generative models. This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better. Our findings indicate that none of the metrics currently used for this task show even a moderate correlation with human judgments on a sample level. However, for assessing average model performance, commonly used metrics such as R-Precision and less-used coordinate errors show strong correlations. Additionally, several recently developed metrics are not recommended due to their low correlation compared to alternatives. We also introduce a novel metric based on a multimodal BERT-like model, MoBERT, which offers strongly human-correlated sample-level evaluations while maintaining near-perfect model-level correlation. Our results demonstrate that this new metric exhibits extensive benefits over all current alternatives.
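
As an illustration of the two evaluation protocols this abstract distinguishes, the following sketch (hypothetical data, not the paper's code) computes a metric's sample-level correlation with human judgments and its model-level correlation after averaging per model, using SciPy's Spearman correlation.

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)
    n_models, n_samples = 5, 200

    # Hypothetical per-sample scores: human ratings and an automated metric
    # for each (model, generated motion) pair.
    human = rng.uniform(1, 5, size=(n_models, n_samples))
    metric = human + rng.normal(0, 1.0, size=(n_models, n_samples))  # noisy proxy

    # Sample-level: does the metric rank individual generations as humans do?
    sample_rho, _ = spearmanr(human.ravel(), metric.ravel())

    # Model-level: average per model first, then correlate the model scores.
    model_rho, _ = spearmanr(human.mean(axis=1), metric.mean(axis=1))

    print(f"sample-level Spearman rho: {sample_rho:.2f}")
    print(f"model-level Spearman rho:  {model_rho:.2f}")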

SESSION: Motion Synthesis With Awareness, Part II

SAME: Skeleton-Agnostic Motion Embedding for Character Animation

Learning deep neural networks on human motion data has become common in computer graphics research, but the heterogeneity of available datasets poses challenges for training large-scale networks. This paper presents a framework that allows us to solve various animation tasks in a skeleton-agnostic manner. The core of our framework is to learn an embedding space to disentangle skeleton-related information from input motion while preserving semantics, which we call Skeleton-Agnostic Motion Embedding (SAME). To efficiently learn the embedding space, we develop a novel autoencoder with graph convolution networks and provide new formulations of various animation tasks operating in the SAME space. We showcase various examples, including retargeting, reconstruction, and interactive character control, and conduct an ablation study to validate design choices made during development.

ACE: Adversarial Correspondence Embedding for Cross Morphology Motion Retargeting from Human to Nonhuman Characters

Motion retargeting is a promising approach for generating natural and compelling animations for nonhuman characters. However, it is challenging to translate human movements into semantically equivalent motions for target characters with different morphologies due to the ambiguous nature of the problem. This work presents a novel learning-based motion retargeting framework, Adversarial Correspondence Embedding (ACE), to retarget human motions onto target characters with different body dimensions and structures. Our framework is designed to produce natural and feasible character motions by leveraging generative-adversarial networks (GANs) while preserving high-level motion semantics by introducing an additional feature loss. In addition, we pretrain a character motion prior that can be controlled in a latent embedding space and seek to establish a compact correspondence. We demonstrate that the proposed framework can produce retargeted motions for three different characters – a quadrupedal robot with a manipulator, a crab character, and a wheeled manipulator. We further validate the design choices of our framework by conducting baseline comparisons and a user study. We also showcase sim-to-real transfer of the retargeted motions by transferring them to a real Spot robot.

Discovering Fatigued Movements for Virtual Character Animation

Virtual character animation and movement synthesis have advanced rapidly during recent years, especially through a combination of extensive motion capture datasets and machine learning. A remaining challenge is interactively simulating characters that fatigue when performing extended motions, which is indispensable for the realism of generated animations. However, capturing such movements is problematic, as performing movements like backflips with fatigued variations up to exhaustion raises capture cost and risk of injury. Surprisingly, little research has been done on faithful fatigue modeling. To address this, we propose a deep reinforcement learning-based approach, which—for the first time in the literature—generates control policies for full-body physically simulated agents aware of cumulative fatigue. For this, we first leverage Generative Adversarial Imitation Learning (GAIL) to learn an expert policy for the skill; second, we learn a fatigue policy by replacing the constant, endurance-time-based torque bounds with non-linear, state- and time-dependent limits in the joint-actuation space using a Three-Compartment Controller (3CC) model. Our results demonstrate that agents can adapt to different fatigue and rest rates interactively, and discover realistic recovery strategies without the need for any captured data of fatigued movement.
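
For readers unfamiliar with three-compartment fatigue models, the sketch below integrates a simplified 3CC-style system (active, fatigued, and resting capacities in percent of maximum). The drive term and parameter values are illustrative assumptions in the spirit of Xia and Frey Law's model; the paper's joint-space variant differs in detail.

    import numpy as np

    def step_3cc(MA, MF, MR, TL, dt, F=0.01, R=0.002, LD=10.0, LR=10.0):
        """Advance active (MA), fatigued (MF), and resting (MR) compartments by dt.
        TL is the target load; the drive C is a simplified case split (assumed)."""
        if MA < TL:
            C = LD * min(TL - MA, MR)   # recruit from the resting pool
        else:
            C = LR * (TL - MA)          # relax surplus activation
        dMA = C - F * MA
        dMF = F * MA - R * MF
        dMR = -C + R * MF
        return MA + dt * dMA, MF + dt * dMF, MR + dt * dMR

    MA, MF, MR = 0.0, 0.0, 100.0
    for _ in np.arange(0.0, 60.0, 0.01):   # one minute at 50% target load
        MA, MF, MR = step_3cc(MA, MF, MR, TL=50.0, dt=0.01)
    print(f"active {MA:.1f}%, fatigued {MF:.1f}%, resting {MR:.1f}%")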

SESSION: All About Animation

GroundLink: A Dataset Unifying Human Body Movement and Ground Reaction Dynamics

The physical plausibility of human motions is vital to various applications in fields including but not limited to graphics, animation, robotics, vision, biomechanics, and sports science. While fully simulating human motions with physics is an extreme challenge, we hypothesize that we can treat this complexity as a black box in a data-driven manner if we focus on the ground contact, and have sufficient observations of physics and human activities in the real world. To prove our hypothesis, we present GroundLink, a unified dataset comprised of captured ground reaction force (GRF) and center of pressure (CoP) synchronized to standard kinematic motion captures. GRF and CoP of GroundLink are not simulated but captured at high temporal resolution using force platforms embedded in the ground for uncompromising measurement accuracy. This dataset contains 368 processed motion trials (∼ 1.59M recorded frames) with 19 different movements including locomotion and weight-shifting actions such as tennis swings to signify the importance of capturing physics paired with kinematics. GroundLinkNet, our benchmark neural network model trained with GroundLink, supports our hypothesis by predicting GRFs and CoPs accurately and plausibly on unseen motions from various sources. The dataset, code, and benchmark models are made public for further research on various downstream tasks leveraging the rich physics information at https://csr.bu.edu/groundlink/.

Pose and Skeleton-aware Neural IK for Pose and Motion Editing

Posing a 3D character for film or games is an iterative and laborious process in which many control handles (e.g., joints) need to be manipulated to achieve a compelling result. Neural Inverse Kinematics (IK) is a new type of IK that enables sparse control over a 3D character pose and leverages full-body correlations to complete the un-manipulated joints of the body. While neural IK is promising, current methods are not designed to preserve previous edits in posing workflows. Current models generate a single pose from the handles only—regardless of what was there previously—making it difficult to preserve any variations and hindering tasks such as pose and motion editing.

In this paper, we introduce SKEL-IK, a novel architecture and training scheme that is conditioned on a base pose, and designed to flow information directly onto the skeletal graph structure, such that hard constraints can be enforced by blocking information flows at certain joints. As a result, we are able to satisfy both hard and soft constraints, as well as preserve un-manipulated parts of the body when desired. Finally, by controlling the base pose in different ways, we demonstrate the ability of our model to perform tasks such as generating variations and quickly editing poses and motions; with less erosion of the base poses compared to the current state-of-the-art.

SESSION: Avatar Portrait

Neural Point-based Volumetric Avatar: Surface-guided Neural Points for Efficient and Photorealistic Volumetric Head Avatar

Rendering photorealistic and dynamically moving human heads is crucial for ensuring a pleasant and immersive experience in AR/VR and video conferencing applications. However, existing methods often struggle to model challenging facial regions (e.g., mouth interior, eyes, and beard), resulting in unrealistic and blurry results. In this paper, we propose Neural Point-based Volumetric Avatar (NPVA), a method that adopts the neural point representation as well as the neural volume rendering process and discards the predefined connectivity and hard correspondence imposed by mesh-based approaches. Specifically, the neural points are strategically constrained around the surface of the target expression via a high-resolution UV displacement map, achieving increased modeling capacity and more accurate control. We introduce three technical innovations to improve the rendering and training efficiency: a patch-wise depth-guided (shading point) sampling strategy, a lightweight radiance decoding process, and a Grid-Error-Patch (GEP) ray sampling strategy during training. By design, our NPVA is better equipped to handle topologically changing regions and thin structures while also ensuring accurate expression control when animating avatars. Experiments conducted on three subjects from the Multiface dataset demonstrate the effectiveness of our designs, outperforming previous state-of-the-art methods, especially in handling challenging facial regions.

AniPortraitGAN: Animatable 3D Portrait Generation from 2D Image Collections

Previous animatable 3D-aware GANs for human generation have primarily focused on either the human head or full body. However, head-only videos are relatively uncommon in real life, and full body generation typically does not deal with facial expression control and still has challenges in generating high-quality results. Towards applicable video avatars, we present an animatable 3D-aware GAN that generates portrait images with controllable facial expression, head pose, and shoulder movements. It is a generative model trained on unstructured 2D image collections without using 3D or video data. For the new task, we base our method on the generative radiance manifold representation and equip it with learnable facial and head-shoulder deformations. A dual-camera rendering and adversarial learning scheme is proposed to improve the quality of the generated faces, which is critical for portrait images. A pose deformation processing network is developed to generate plausible deformations for challenging regions such as long hair. Experiments show that our method, trained on unstructured 2D images, can generate diverse and high-quality 3D portraits with desired control over different properties.

SESSION: Texture Magic

Texture Atlas Compression Based on Repeated Content Removal

Optimizing the memory footprint of 3D models can have a major impact on the user experience during real-time rendering and streaming visualization, where the major memory overhead lies in the high-resolution texture data. In this work, we propose a robust and automatic pipeline for content-aware, lossy compression of texture atlases. The design of our solution rests on two observations: 1) mapping multiple surface patches to the same texture region is seamlessly compatible with the standard rendering pipeline, requiring no decompression before any usage; 2) a texture image has background regions and salient structural features, which can be handled separately to achieve a high compression rate. Accordingly, our method combines joint operations of image segmentation, re-meshing, UV unwrapping, and texture baking. To evaluate the efficacy of our approach, we batch-processed a dataset containing 100 models collected online. On average, our method achieves a texture atlas compression ratio of 81.41% with average PSNR and MS-SSIM scores of 40.90 and 0.98, i.e., only a marginal error in visual appearance.

HyperDreamer: Hyper-Realistic 3D Content Generation and Editing from a Single Image

3D content creation from a single image is a long-standing yet highly desirable task. Recent advances introduce 2D diffusion priors, yielding reasonable results. However, existing methods are not hyper-realistic enough for post-generation usage, as users cannot view, render and edit the resulting 3D content from a full range. To address these challenges, we introduce HyperDreamer with several key designs and appealing properties: 1) Full-range viewable: 360° mesh modeling with high-resolution textures enables the creation of visually compelling 3D models from a full range of observation points. 2) Full-range renderable: Fine-grained semantic segmentation and data-driven priors are incorporated as guidance to learn reasonable albedo, roughness, and specular properties of the materials, enabling semantic-aware arbitrary material estimation. 3) Full-range editable: For a generated model or their own data, users can interactively select any region via a few clicks and efficiently edit the texture with text-based guidance. Extensive experiments demonstrate the effectiveness of HyperDreamer in modeling region-aware materials with high-resolution textures and enabling user-friendly editing. We believe that HyperDreamer holds promise for advancing 3D content creation and finding applications in various domains.

SESSION: Creative Expression

Text-Guided Vector Graphics Customization

Vector graphics are widely used in digital art and valued by designers for their scalability and layer-wise topological properties. However, the creation and editing of vector graphics necessitate creativity and design expertise, leading to a time-consuming process. In this paper, we propose a novel pipeline that generates high-quality customized vector graphics based on textual prompts while preserving the properties and layer-wise information of a given exemplar SVG. Our method harnesses the capabilities of large pre-trained text-to-image models. By fine-tuning the cross-attention layers of the model, we generate customized raster images guided by textual prompts. To initialize the SVG, we introduce a semantic-based path alignment method that preserves and transforms crucial paths from the exemplar SVG. Additionally, we optimize path parameters using both image-level and vector-level losses, ensuring smooth shape deformation while aligning with the customized raster image. We extensively evaluate our method using multiple metrics from vector-level, image-level, and text-level perspectives. The evaluation results demonstrate the effectiveness of our pipeline in generating diverse customizations of vector graphics with exceptional quality. The project page is https://intchous.github.io/SVGCustomization.

Anything to Glyph: Artistic Font Synthesis via Text-to-Image Diffusion Model

The automatic generation of artistic fonts is a challenging task that has attracted considerable research interest. Previous methods specifically focus on glyph or texture style transfer. However, we often come across creative fonts composed of objects in posters or logos. These fonts have proven to be a challenge for existing methods, which struggle to generate similar designs. This paper proposes a novel method for generating creative artistic fonts using a pre-trained text-to-image diffusion model. Our model takes a shape image and a prompt describing an object as input and generates an artistic glyph image consisting of such objects. Specifically, we introduce a novel heatmap-based weak position constraint method to guide the positioning of objects in the generated image, and we also propose the Latent Space Semantic Augmentation Module, which preserves other semantic information while constraining object position. Our approach is unique in that it can preserve the object’s original shape while constraining its position. Moreover, our training method requires only a small quantity of generated data, making it an efficient unsupervised learning approach. Experimental results demonstrate that our method can generate various glyphs, including Chinese, English, Japanese, and symbols, using different objects. We also conducted qualitative and quantitative comparisons with various position control methods for the diffusion model. The results indicate that our approach outperforms other methods in terms of visual quality, innovation, and user evaluation.

Close the Design-to-Manufacturing Gap in Computational Optics with a 'Real2Sim' Learned Two-Photon Neural Lithography Simulator

We introduce neural lithography to address the ‘design-to-manufacturing’ gap in computational optics. Computational optics with large design degrees of freedom enable advanced functionalities and performance beyond traditional optics. However, the existing design approaches often overlook the numerical modeling of the manufacturing process, which can result in significant performance deviation between the design and the fabricated optics. To bridge this gap, we, for the first time, propose a fully differentiable design framework that integrates a pre-trained photolithography simulator into the model-based optical design loop. Leveraging a blend of physics-informed modeling and data-driven training using experimentally collected datasets, our photolithography simulator serves as a regularizer on fabrication feasibility during design, compensating for structure discrepancies introduced in the lithography process. We demonstrate the effectiveness of our approach through two typical tasks in computational optics, where we design and fabricate a holographic optical element (HOE) and a multi-level diffractive lens (MDL) using a two-photon lithography system, showcasing improved optical performance on the task-specific metrics. The source code for this work is available on the project page: https://neural-litho.github.io.

SESSION: From Pixels to Gradients

Transparent Object Reconstruction via Implicit Differentiable Refraction Rendering

Reconstructing the geometry of transparent objects has been a long-standing challenge. Existing methods rely on complex setups, such as manual annotation or darkroom conditions, to obtain object silhouettes and usually require controlled environments with designed patterns to infer ray-background correspondence. However, these intricate arrangements limit the practical application for common users. In this paper, we significantly simplify the setups and present a novel method that reconstructs transparent objects in unknown natural scenes without manual assistance. Our method incorporates two key technologies. Firstly, we introduce a volume rendering-based method that estimates object silhouettes by projecting the 3D neural field onto 2D images. This automated process yields highly accurate multi-view object silhouettes from images captured in natural scenes. Secondly, we propose transparent object optimization through differentiable refraction rendering with the neural SDF field, enabling us to optimize the refraction ray based on color rather than explicit ray-background correspondence. Additionally, our optimization includes a ray sampling method to supervise the object silhouette at a low computational cost. Extensive experiments and comparisons demonstrate that our method produces high-quality results while offering much more convenient setups.

ShaDDR: Interactive Example-Based Geometry and Texture Generation via 3D Shape Detailization and Differentiable Rendering

We present ShaDDR, an example-based deep generative neural network which produces a high-resolution textured 3D shape through geometry detailization and conditional texture generation applied to an input coarse voxel shape. Trained on a small set of detailed and textured exemplar shapes, our method learns to detailize the geometry via multi-resolution voxel upsampling and generate textures on voxel surfaces via differentiable rendering against exemplar texture images from a few views. The generation is interactive, taking less than 1 second to produce a 3D model with voxel resolutions up to 512³. The generated shape preserves the overall structure of the input coarse voxel model, while the style of the generated geometric details and textures can be manipulated through learned latent codes. In the experiments, we show that our method can generate higher-resolution shapes with plausible and improved geometric details and clean textures compared to prior works. Furthermore, we showcase the ability of our method to learn geometric details and textures from shapes reconstructed from real-world photos. In addition, we have developed an interactive modeling application to demonstrate the generalizability of our method to various user inputs and the controllability it offers, allowing users to interactively sculpt a coarse voxel shape to define the overall structure of the detailized 3D shape. Code and data are available at https://github.com/qiminchen/ShaDDR.

SESSION: Embed to a Different Space

Zero-Shot 3D Shape Correspondence

We propose a novel zero-shot approach to computing correspondences between 3D shapes. Existing approaches mainly focus on isometric and near-isometric shape pairs (e.g., human vs. human), but less attention has been given to strongly non-isometric and inter-class shape matching (e.g., human vs. cow). To this end, we introduce a fully automatic method that exploits the exceptional reasoning capabilities of recent foundation models in language and vision to tackle difficult shape correspondence problems. Our approach comprises multiple stages. First, we classify the 3D shapes in a zero-shot manner by feeding rendered shape views to a language-vision model (e.g., BLIP2) to generate a list of class proposals per shape. These proposals are unified into a single class per shape by employing the reasoning capabilities of ChatGPT. Second, we attempt to segment the two shapes in a zero-shot manner, but in contrast to the co-segmentation problem, we do not require a mutual set of semantic regions. Instead, we propose to exploit the in-context learning capabilities of ChatGPT to generate two different sets of semantic regions for each shape and a semantic mapping between them. This enables our approach to match strongly non-isometric shapes with significant differences in geometric structure. Finally, we employ the generated semantic mapping to produce coarse correspondences that can further be refined by the functional maps framework to produce dense point-to-point maps. Our approach, despite its simplicity, produces highly plausible results in a zero-shot manner, especially between strongly non-isometric shapes.

Lock-free Vertex Clustering for Multicore Mesh Reduction

Modern data collection methods can capture representations of 3D objects at resolutions much greater than they can be discretely rendered as an image. To improve the efficiency of storage, transmission, rendering, and editing of 3D models constructed from such data, it is beneficial to first employ a mesh reduction technique to reduce the size of a mesh. Vertex clustering, a technique that merges close vertices together, has particularly wide applicability, because it operates only on vertices and their spatial proximity. However, it is also very difficult to accelerate with parallelisation in a deterministic manner because it contains extensive algorithmic dependencies.

Prior work treats the non-trivial clustering step of this process serially to preserve vertex priorities, which fundamentally limits the overall acceleration rate to the mid-single digits. This paper introduces a novel lock-free parallel algorithm, P-Weld, that exposes parallelism with a graph-theoretic lens that iteratively peels away layers of a mesh that have no remaining dependencies. Concurrent updates to shared data are managed with a linearisable sequence of atomic instructions that exactly reproduces the serial clustering. The resulting parallelism and improved spatial locality yield a 3.86 × speedup on a standard 14-million vertex mesh and a 2.93 × speedup on a 400-million vertex LiDAR point cloud covering the city of Vancouver, Canada, relative to a popular open source library.
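
As a point of reference for the clustering step being parallelized here, the following is a minimal serial sketch of grid-based vertex clustering (not the paper's P-Weld algorithm): vertices falling in the same grid cell are welded into one representative and faces are re-indexed. The centroid representative is a simplifying assumption; priority-based representatives, as in the setting the paper preserves, would replace it.

    import numpy as np

    def cluster_vertices(vertices, faces, cell_size):
        """vertices: (V, 3) float array; faces: (F, 3) int array."""
        cells = np.floor(vertices / cell_size).astype(np.int64)   # cell id per vertex
        _, cluster_id, counts = np.unique(cells, axis=0,
                                          return_inverse=True, return_counts=True)
        # Representative position: centroid of each cluster.
        new_vertices = np.zeros((counts.size, 3))
        np.add.at(new_vertices, cluster_id, vertices)
        new_vertices /= counts[:, None]
        # Re-index faces and drop those that collapsed to a degenerate triangle.
        new_faces = cluster_id[faces]
        keep = (new_faces[:, 0] != new_faces[:, 1]) & \
               (new_faces[:, 1] != new_faces[:, 2]) & \
               (new_faces[:, 2] != new_faces[:, 0])
        return new_vertices, new_faces[keep]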

SESSION: Magic Diffusion Model

Diffusing Colors: Image Colorization with Text Guided Diffusion

The colorization of grayscale images is a complex and subjective task with significant challenges. Despite recent progress in employing large-scale datasets with deep neural networks, difficulties with controllability and visual quality persist. To tackle these issues, we present a novel image colorization framework that utilizes image diffusion techniques with granular text prompts. This integration not only produces colorization outputs that are semantically appropriate but also greatly improves the level of control users have over the colorization process. Our method provides a balance between automation and control, outperforming existing techniques in terms of visual quality and semantic coherence. We leverage a pretrained generative Diffusion Model, and show that we can finetune it for the colorization task without losing its generative power or attention to text prompts. Moreover, we present a novel CLIP-based ranking model that evaluates color vividness, enabling automatic selection of the most suitable level of vividness based on the specific scene semantics. Our approach holds potential particularly for color enhancement and historical image colorization.

Single-Image 3D Human Digitization with Shape-guided Diffusion

We present an approach to generate a 360-degree view of a person with a consistent, high-resolution appearance from a single input image. NeRF and its variants typically require videos or images from different viewpoints. Most existing approaches taking monocular input either rely on ground-truth 3D scans for supervision or lack 3D consistency. While recent 3D generative models show promise of 3D consistent human digitization, these approaches do not generalize well to diverse clothing appearances, and the results lack photorealism. Unlike existing work, we utilize high-capacity 2D diffusion models pretrained for general image synthesis tasks as an appearance prior of clothed humans. To achieve better 3D consistency while retaining the input identity, we progressively synthesize multiple views of the human in the input image by inpainting missing regions with shape-guided diffusion conditioned on silhouette and surface normal. We then fuse these synthesized multi-view images via inverse rendering to obtain a fully textured high-resolution 3D mesh of the given person. Experiments show that our approach outperforms prior methods and achieves photorealistic 360-degree synthesis of a wide range of clothed humans with complex textures from a single image.

Example-Based Sampling with Diffusion Models

Much effort has been put into developing samplers with specific properties, such as producing blue noise, low-discrepancy, lattice or Poisson disk samples. These samplers can be slow if they rely on optimization processes, may depend on a wide range of numerical methods, and are not always differentiable. The success of recent diffusion models for image generation suggests that these models could be appropriate for learning how to generate point sets from examples. However, their convolutional nature makes these methods impractical for dealing with scattered data such as point sets. We propose a generic way to produce 2D point sets imitating existing samplers from observed point sets using a diffusion model. We address the problem of convolutional layers by leveraging neighborhood information from an optimal transport matching to a uniform grid, which allows us to benefit from fast convolutions on grids and to support example-based learning of non-uniform sampling patterns. We demonstrate how the differentiability of our approach can be used to optimize point sets to enforce properties.

SESSION: Simulation and Animation of Natural Phenomena

Animating Street View

We present a system that automatically brings street view imagery to life by populating it with naturally behaving, animated pedestrians and vehicles. Our approach is to remove existing people and vehicles from the input image, insert moving objects with proper scale, angle, motion and appearance, plan paths and traffic behavior, as well as render the scene with plausible occlusion and shadowing effects. The system achieves these by reconstructing the still image street scene, simulating crowd behavior, and rendering with consistent lighting, visibility, occlusions, and shadows. We demonstrate results on a diverse range of street scenes including regular still images and panoramas.

Real-time Height-field Simulation of Sand and Water Mixtures

We propose a height-field-based real-time simulation method for sand and water mixtures. Inspired by the shallow-water assumption, our approach extends the governing equations to handle two-phase flows of sand and water using height fields. Our depth-integrated governing equations can model the elastoplastic behavior of sand, as well as sand-water-mixing phenomena such as friction, diffusion, saturation, and momentum exchange. We further propose an operator-splitting time integrator that is both GPU-friendly and stable under moderate time step sizes. We have evaluated our method on a set of benchmark scenarios involving large bodies of heterogeneous materials, where our GPU-based algorithm runs at real-time frame rates. Our method achieves a desirable trade-off between fidelity and performance, bringing an unprecedentedly immersive experience for real-time applications.
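
To make the depth-integrated setting concrete, here is a minimal single-material height-field shallow-water step (a linearized, grid-based sketch under simplifying assumptions); the paper's two-phase sand-water model, with elastoplasticity, friction, diffusion, saturation, and momentum exchange, is far richer.

    import numpy as np

    def shallow_water_step(h, u, v, dt, dx, g=9.81, damping=0.999):
        """h: (N, M) water height; u, v: (N, M) depth-averaged velocities."""
        # Accelerate the flow down the height gradient.
        dhdx = (np.roll(h, -1, axis=1) - np.roll(h, 1, axis=1)) / (2 * dx)
        dhdy = (np.roll(h, -1, axis=0) - np.roll(h, 1, axis=0)) / (2 * dx)
        u = (u - dt * g * dhdx) * damping
        v = (v - dt * g * dhdy) * damping
        # Update the height from the divergence of the flux h*u (continuity equation).
        div = (np.roll(h * u, -1, axis=1) - np.roll(h * u, 1, axis=1)) / (2 * dx) \
            + (np.roll(h * v, -1, axis=0) - np.roll(h * v, 1, axis=0)) / (2 * dx)
        return np.maximum(h - dt * div, 0.0), u, v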

A Physically-inspired Approach to the Simulation of Plant Wilting

Plants are among the most complex objects to be modeled in computer graphics. While a large body of work is concerned with structural modeling and the dynamic reaction to external forces, our work focuses on the dynamic deformation caused by plant-internal wilting processes. To this end, we motivate the simulation of water transport inside the plant, which is a key driver of the wilting process. We then map the change of water content in individual plant parts to branch stiffness values and obtain the wilted plant shape through a position-based dynamics simulation. We show that our approach can recreate measured wilting processes and does so with a higher fidelity than approaches ignoring the internal water flow. Realistic plant wilting is not only important in a computer graphics context but can also aid the development of machine learning algorithms in agricultural applications through the generation of synthetic training data.

SESSION: Navigating Shape Spaces

CLIPXPlore: Coupled CLIP and Shape Spaces for 3D Shape Exploration

This paper presents CLIPXPlore, a new framework that leverages a vision-language model to guide the exploration of the 3D shape space. Many recent methods have been developed to encode 3D shapes into a learned latent shape space to enable generative design and modeling. Yet, existing methods lack effective exploration mechanisms, despite the rich information encoded in the learned latent space. To this end, we propose to leverage CLIP, a powerful pre-trained vision-language model, to aid the shape-space exploration. Our idea is threefold. First, we couple the CLIP and shape spaces by generating paired CLIP and shape codes through sketch images and training a mapper network to connect the two spaces. Second, to explore the space around a given shape, we formulate a co-optimization strategy to search for the CLIP code that better matches the geometry of the shape. Third, we design three exploration modes, binary-attribute-guided, text-guided, and sketch-guided, to locate suitable exploration trajectories in shape space and induce meaningful changes to the shape. We perform a series of experiments to quantitatively and visually compare CLIPXPlore with different baselines in each of the three exploration modes, showing that CLIPXPlore can produce many meaningful exploration results that cannot be achieved by the existing solutions.

Explorable Mesh Deformation Subspaces from Unstructured 3D Generative Models

Exploring variations of 3D shapes is a time-consuming process in traditional 3D modeling tools. Deep generative models of 3D shapes often feature continuous latent spaces that can, in principle, be used to explore potential variations starting from a set of input shapes; in practice, doing so can be problematic—latent spaces are high dimensional and hard to visualize, contain shapes that are not relevant to the input shapes, and linear paths through them often lead to sub-optimal shape transitions. Furthermore, one would ideally be able to explore variations in the original high-quality meshes used to train the generative model, not its lower-quality output geometry. In this paper, we present a method to explore variations among a given set of landmark shapes by constructing a mapping from an easily-navigable 2D exploration space to a subspace of a pre-trained generative model. We first describe how to find a mapping that spans the set of input landmark shapes and exhibits smooth variations between them. We then show how to turn the variations in this subspace into deformation fields, to transfer those variations to high-quality meshes for the landmark shapes. Our results show that our method can produce visually-pleasing and easily-navigable 2D exploration spaces for several different shape categories, especially as compared to prior work on learning deformation spaces for 3D shapes.

https://github.com/ArmanMaesumi/generative-mesh-subspaces

ReparamCAD: Zero-shot CAD Re-Parameterization for Interactive Manipulation

Parametric CAD models encode entire families of shapes that should, in principle, be easy for designers to explore. However, in practice, parametric CAD models can be difficult to manipulate due to implicit semantic constraints among parameter values. Finding and enforcing these semantic constraints solely from geometry or programmatic shape representations is not possible because these constraints ultimately reflect design intent. They are informed by the designer’s experience and semantics in the real world. To address this challenge, we introduce ReparamCAD, a zero-shot pipeline that leverages pre-trained large language and image models to infer a meaningful space of variations for a shape. We then re-parameterize the shape into a new constrained parametric CAD program that captures these variations, enabling effortless exploration of the design space along meaningful design axes. We evaluated our approach through five examples and a user study. The results showed that the inferred spaces are meaningful and comparable to those defined by experts. Code and data are at: https://github.com/milmillin/ReparamCAD.

SESSION: Personalized Generative Models

MyStyle++: A Controllable Personalized Generative Prior

In this paper, we propose an approach to obtain a personalized generative prior with explicit control over a set of attributes. We build upon MyStyle, a recently introduced method, that tunes the weights of a pre-trained StyleGAN face generator on a few images of an individual. This system allows synthesizing, editing, and enhancing images of the target individual with high fidelity to their facial features. However, MyStyle does not demonstrate precise control over the attributes of the generated images. We propose to address this problem through a novel optimization system that organizes the latent space in addition to tuning the generator. Our key contribution is to formulate a loss that arranges the latent codes, corresponding to the input images, along a set of specific directions according to their attributes. We demonstrate that our approach, dubbed MyStyle++, is able to synthesize, edit, and enhance images of an individual with great control over the attributes, while preserving the unique facial characteristics of that individual.

Content-based Search for Deep Generative Models

The growing proliferation of customized and pretrained generative models has made it infeasible for a user to be fully cognizant of every model in existence. To address this need, we introduce the task of content-based model search: given a query and a large set of generative models, finding the models that best match the query. As each generative model produces a distribution of images, we formulate the search task as an optimization problem to select the model with the highest probability of generating similar content as the query. We introduce a formulation to approximate this probability given the query from different modalities, e.g., image, sketch, and text. Furthermore, we propose a contrastive learning framework for model retrieval, which learns to adapt features for various query modalities. We demonstrate that our method outperforms several baselines on Generative Model Zoo, a new benchmark we create for the model retrieval task.

Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models

Text-to-image (T2I) personalization allows users to guide the creative image generation process by combining their own visual concepts in natural language prompts. Recently, encoder-based techniques have emerged as a new effective approach for T2I personalization, reducing the need for multiple images and long training times. However, most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts. In this work, we propose a domain-agnostic method that does not require any specialized dataset or prior information about the personalized concepts. We introduce a novel contrastive-based regularization technique to maintain high fidelity to the target concept characteristics while keeping the predicted embeddings close to editable regions of the latent space, by pushing the predicted tokens toward their nearest existing CLIP tokens. Our experimental results demonstrate the effectiveness of our approach and show how the learned tokens are more semantic than tokens predicted by unregularized models. This leads to a better representation that achieves state-of-the-art performance while being more flexible than previous methods.
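
The regularization idea stated here, keeping predicted embeddings near existing CLIP tokens, can be sketched as follows; tensor names, the cosine nearest-neighbor choice, and the squared penalty are illustrative assumptions rather than the paper's exact loss.

    import torch
    import torch.nn.functional as F

    def nearest_token_regularizer(pred_tokens, vocab_embeddings):
        """pred_tokens: (B, D) embeddings predicted by the encoder;
        vocab_embeddings: (V, D) frozen CLIP token embedding table."""
        pred = F.normalize(pred_tokens, dim=-1)
        vocab = F.normalize(vocab_embeddings, dim=-1)
        sims = pred @ vocab.t()                             # (B, V) cosine similarities
        nearest = vocab_embeddings[sims.argmax(dim=-1)].detach()
        # Penalize the distance to the nearest existing token,
        # keeping predicted embeddings in editable regions of the space.
        return ((pred_tokens - nearest) ** 2).mean()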

SESSION: Reconstruction

Reach For the Spheres: Tangency-aware surface reconstruction of SDFs

Signed distance fields (SDFs) are a widely used implicit surface representation, with broad applications in computer graphics, computer vision, and applied mathematics. To reconstruct an explicit triangle mesh surface corresponding to an SDF, traditional isosurfacing methods, such as Marching Cubes and its variants, are typically used. However, these methods overlook fundamental properties of SDFs, resulting in reconstructions that exhibit severe oversmoothing and feature loss. To address this shortcoming, we propose a novel method based on a key insight: each SDF sample corresponds to a spherical region that must lie fully inside or outside the surface, depending on its sign, and that must be tangent to the surface at some point. Leveraging this understanding, we formulate an energy that gauges the degree of violation of tangency constraints by a proposed surface. We then employ a gradient flow that minimizes our energy, starting from an initial triangle mesh that encapsulates the surface. This algorithm yields superior reconstructions to previous methods, even with sparsely sampled SDFs. Our approach provides a more nuanced understanding of SDFs and offers significant improvements in surface reconstruction.
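
A hedged reading of the stated insight in code form: every SDF sample at point c with value s defines a sphere of radius |s| that should just touch the reconstructed surface, so one can penalize the gap between the sample's distance to the mesh and |s|. The nearest-neighbor distance approximation below is an assumption for brevity; the paper's energy and gradient flow are more involved.

    import numpy as np
    from scipy.spatial import cKDTree

    def tangency_violation(sample_points, sample_sdf, surface_points):
        """sample_points: (N, 3) SDF sample locations; sample_sdf: (N,) signed values;
        surface_points: (M, 3) points densely sampled on the candidate mesh."""
        tree = cKDTree(surface_points)
        dist_to_surface, _ = tree.query(sample_points)      # approx. unsigned distance
        # Tangency: the distance from each sample to the surface should equal |s|.
        residual = dist_to_surface - np.abs(sample_sdf)
        return np.sum(residual ** 2)                        # energy to be minimized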

Neural Stochastic Poisson Surface Reconstruction

Reconstructing a surface from a point cloud is an underdetermined problem. We use a neural network to study and quantify this reconstruction uncertainty under a Poisson smoothness prior. Our algorithm addresses the main limitations of existing work and can be fully integrated into the 3D scanning pipeline, from obtaining an initial reconstruction to deciding on the next best sensor position and updating the reconstruction upon capturing more data.

360° Reconstruction From a Single Image Using Space Carved Outpainting

We introduce POP3D, a novel framework that creates a full 360°-view 3D model from a single image. POP3D resolves two prominent issues that limit single-view reconstruction. Firstly, POP3D offers substantial generalizability to arbitrary categories, a trait that previous methods struggle to achieve. Secondly, POP3D further improves reconstruction fidelity and naturalness, a crucial aspect that concurrent works fall short of. Our approach marries the strengths of four primary components: (1) a monocular depth and normal predictor that serves to predict crucial geometric cues, (2) a space carving method capable of demarcating the potentially unseen portions of the target object, (3) a generative model pre-trained on a large-scale image dataset that can complete unseen regions of the target, and (4) a neural implicit surface reconstruction method tailored to reconstructing objects from RGB images along with monocular geometric cues. The combination of these components enables POP3D to readily generalize across various in-the-wild images and generate state-of-the-art reconstructions, outperforming similar works by a significant margin. Project page: http://cg.postech.ac.kr/research/POP3D.

SESSION: Neural Physics

Neural Collision Fields for Triangle Primitives

We present neural collision fields as an alternative to contact point sampling in physics simulations. Our approach is built on top of a novel smoothed integral formulation for the contact surface patches between two triangle meshes. By reformulating collisions as an integral, we avoid issues of sampling common to many collision-handling algorithms. Because the resulting integral is difficult to evaluate numerically, we store its solution in an integrated neural collision field — a 6D neural field in the space of triangle pair vertex coordinates. Our network generalizes well to new triangle meshes without retraining. We demonstrate the effectiveness of our method by implementing it as a constraint in a position-based dynamics framework and show that our neural formulation successfully handles collisions in practical simulations involving both volumetric and thin-shell geometries.

Learning Contact Deformations with General Collider Descriptors

This paper presents a learning-based method for the simulation of rich contact deformations on reduced deformation models. Previous works learn deformation models for specific pairs of objects; we lift this limitation by designing a neural model that supports general rigid collider shapes. We do this by formulating a novel collider descriptor that characterizes local geometry in a region of interest. The paper shows that the learning-based deformation model can be trained on a library of colliders, but it accurately supports unseen collider shapes at runtime. We showcase our method on interactive dynamic simulations with animation of rich deformation detail, manipulation and exploration of untrained objects, and augmentation of contact information suitable for high-fidelity haptics.

Neural Stress Fields for Reduced-order Elastoplasticity and Fracture

We propose a hybrid neural network and physics framework for reduced-order modeling of elastoplasticity and fracture. State-of-the-art scientific computing models like the Material Point Method (MPM) faithfully simulate large-deformation elastoplasticity and fracture mechanics. However, their long runtime and large memory consumption render them unsuitable for applications constrained by computation time and memory usage, e.g., virtual reality. To overcome these barriers, we propose a reduced-order framework. Our key innovation is training a low-dimensional manifold for the Kirchhoff stress field via an implicit neural representation. This low-dimensional neural stress field (NSF) enables efficient evaluations of stress values and, correspondingly, internal forces at arbitrary spatial locations. In addition, we also train neural deformation and affine fields to build low-dimensional manifolds for the deformation and affine momentum fields. These neural stress, deformation, and affine fields share the same low-dimensional latent space, which uniquely embeds the high-dimensional simulation state. After training, we run new simulations by evolving in this single latent space, which drastically reduces the computation time and memory consumption. Our general continuum-mechanics-based reduced-order framework is applicable to any phenomena governed by the elastodynamics equation. To showcase the versatility of our framework, we simulate a wide range of material behaviors, including elastica, sand, metal, non-Newtonian fluids, fracture, contact, and collision. We demonstrate dimension reduction by up to 100,000 × and time savings by up to 10 ×.

SESSION: Multidisciplinary Fusion

Hand Pose Estimation with Mems-Ultrasonic Sensors

Hand tracking is an important aspect of human-computer interaction and has a wide range of applications in extended reality devices. However, current hand motion capture methods suffer from various limitations. For instance, visual hand pose estimation is susceptible to self-occlusion and changes in lighting conditions, while IMU-based tracking gloves experience significant drift and are not resistant to external magnetic field interference. To address these issues, we propose a novel and low-cost hand-tracking glove that utilizes several MEMS-ultrasonic sensors attached to the fingers to measure the distance matrix among the sensors. Our lightweight deep network then reconstructs the hand pose from the distance matrix. Our experimental results demonstrate that this approach is accurate, size-agnostic, and robust to external interference. We also present the design rationale behind the sensor selection, sensor configuration, circuit design, and model architecture.

ShapeSonic: Sonifying Fingertip Interactions for Non-Visual Virtual Shape Perception

For sighted users, computer graphics and virtual reality allow them to model and perceive imaginary objects and worlds. However, these approaches are inaccessible to blind and visually impaired (BVI) users, since they primarily rely on visual feedback. To this end, we introduce ShapeSonic, a system designed to convey vivid 3D shape perception using purely audio feedback or sonification. ShapeSonic tracks users’ fingertips in 3D and provides real-time sound feedback (sonification). The shape’s geometry and sharp features (edges and corners) are expressed as sounds whose volumes modulate according to fingertip distance. ShapeSonic is based on a mass-produced, commodity hardware platform (Oculus Quest). In a study with 15 sighted and 6 BVI users, we demonstrate the value of ShapeSonic in shape landmark localization and recognition. ShapeSonic users were able to quickly and relatively accurately “touch” points on virtual 3D shapes in the air.

An Architecture and Implementation of Real-Time Sound Propagation Hardware for Mobile Devices

This paper presents a high-performance and low-power hardware architecture for real-time sound rendering on mobile devices. Traditional sound rendering algorithms require high-performance CPUs or GPUs because of their high computational complexity in realizing ultra-realistic 3D audio. Thus, it has been hard to achieve real-time rates on low-power mobile devices. To overcome this limitation, we propose a hardware architecture that adopts hardware-friendly sound-propagation-path calculation algorithms. We verified the function and performance of our architecture through its implementation on an FPGA board. According to ASIC evaluation with the 8-nm process technology, it achieves high performance at 120 FPS, low power consumption of 50 mW, and a small silicon area of 0.31 mm², allowing real-time sound rendering on mobile devices.

Ego3DPose: Capturing 3D Cues from Binocular Egocentric Views

We present Ego3DPose, a highly accurate binocular egocentric 3D pose reconstruction system. The binocular egocentric setup offers practicality and usefulness in various applications; however, it remains largely under-explored. It suffers from low pose estimation accuracy due to viewing distortion, severe self-occlusion, and the limited field-of-view of the joints in egocentric 2D images. Here, we notice that two important 3D cues contained in the egocentric binocular input, stereo correspondence and perspective, are neglected. Current methods heavily rely on 2D image features, implicitly learning 3D information, which introduces biases towards commonly observed motions and leads to low overall accuracy. We observe that they not only fail in challenging occlusion cases but also in estimating visible joint positions. To address these challenges, we propose two novel approaches. First, we design a two-path network architecture with a path that estimates the pose of each limb independently from its binocular heatmaps. Without full-body information provided, this alleviates bias toward the trained full-body distribution. Second, we leverage the egocentric view of body limbs, which exhibits strong perspective variance (e.g., a hand appears significantly larger when it is close to the camera). We propose a new perspective-aware representation using trigonometry, enabling the network to estimate the 3D orientation of limbs. Finally, we develop an end-to-end pose reconstruction network that synergizes both techniques. Our comprehensive evaluations demonstrate that Ego3DPose outperforms state-of-the-art models with a pose estimation error (i.e., MPJPE) reduction of 23.1% on the UnrealEgo dataset. Our qualitative results highlight the superiority of our approach across a range of scenarios and challenges.

SESSION: Flesh & Bones

SFLSH: Shape-Dependent Soft-Flesh Avatars

We present a multi-person soft-tissue avatar model. This model maps a body shape descriptor to heterogeneous geometric and mechanical parameters of a soft-tissue model across the body, effectively producing a shape-dependent parametric soft avatar model. The design of the model overcomes two major challenges: the potential redundancy of geometric and mechanical parameters, and the difficulty of obtaining abundant subject data, which together pose a major risk of overfitting the resulting model. To overcome these challenges, we introduce a local shape-dependent regularization of the model. We demonstrate accurate results, on par with independent per-subject estimation, accurate interpolation within the range of body shapes of the training subjects, and good generalization to unseen body shapes. As a result, we obtain a parametric soft-flesh avatar model that is easy to integrate into many existing applications.

Neural Motion Graph

Deep learning techniques have been employed to design controllable human motion synthesizers. Despite their potential, however, designing a neural-network-based motion synthesizer that enables flexible user interaction, fine-grained controllability, and support for new types of motion at reduced time and space costs remains a challenge. In this paper, we propose a novel approach, the neural motion graph, that addresses this challenge by enabling scalability to new motions while using compact neural networks. Our approach represents each type of motion with a separate neural node to reduce the cost of adding new motion types. In addition, designing a separate neural node for each motion type enables task-specific control strategies and has greater potential to achieve a high-quality synthesis of complex motions, such as the Mongolian dance. Furthermore, a single transition network, which acts as the neural edges, is used to model the transition between two motion nodes. The transition network is designed with a lightweight control module to achieve a fine-grained response to user control signals. Overall, these design choices make the neural motion graph highly controllable and scalable. Our approach is fully flexible to user interaction through both high-level and fine-grained control signals, and our experimental and subjective evaluations demonstrate that the neural motion graph outperforms state-of-the-art human motion synthesis methods in the quality of controlled motion generation.

SESSION: Visualizing the Future

DeepBasis: Hand-Held Single-Image SVBRDF Capture via Two-Level Basis Material Model

Recovering the spatially-varying bi-directional reflectance distribution function (SVBRDF) from a single hand-held captured image has been a meaningful but challenging task in computer graphics. Benefiting from learned data priors, some previous methods can utilize potential material correlations between image pixels to aid SVBRDF estimation. To further reduce the ambiguity of single-image estimation, it is necessary to integrate additional explicit material correlations. Given the flexible expressive ability of the basis-material assumption, we propose DeepBasis, a deep-learning-based method built on this assumption. It jointly predicts basis materials and their blending weights, and the estimated SVBRDF is their linear combination. To facilitate the extraction of data priors, we introduce a two-level basis model to retain sufficient representational power while using a fixed number of basis materials. Moreover, considering the absence of ground-truth basis materials and weights during network training, we propose a variance-consistency loss and adopt a joint prediction strategy, thereby making the existing SVBRDF dataset available for training. Additionally, due to the hand-held capture setting, the exact lighting directions are unknown. We model the lighting direction estimation as a sampling problem and propose an optimization-based algorithm to find the optimal estimate. Quantitative evaluation and qualitative analysis demonstrate that DeepBasis produces higher-quality SVBRDF estimates than previous methods. All source code will be publicly released.
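
The basis-material assumption stated above, the SVBRDF as a linear combination of a few basis materials with per-pixel blending weights, can be written compactly; array shapes and the simple weight normalization are illustrative assumptions.

    import numpy as np

    def blend_basis_materials(basis, weights):
        """basis:   (K, C) array of K basis materials with C packed parameters each
                    (e.g., diffuse RGB, normal, roughness, specular).
           weights: (H, W, K) non-negative per-pixel blending weights."""
        # Normalize so each pixel's weights form a convex combination.
        w = weights / np.clip(weights.sum(axis=-1, keepdims=True), 1e-8, None)
        # Per-pixel SVBRDF parameters: (H, W, C).
        return np.einsum('hwk,kc->hwc', w, basis)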

MatFusion: A Generative Diffusion Model for SVBRDF Capture

We formulate SVBRDF estimation from photographs as a diffusion task. To model the distribution of spatially varying materials, we first train a novel unconditional SVBRDF diffusion backbone model on a large set of 312,165 synthetic spatially varying material exemplars. This SVBRDF diffusion backbone model, named MatFusion, can then serve as a basis for refining a conditional diffusion model to estimate the material properties from a photograph under controlled or uncontrolled lighting. Our backbone MatFusion model is trained using only a loss on the reflectance properties, and therefore refinement can be paired with more expensive rendering methods without the need for backpropagation during training. Because the conditional SVBRDF diffusion models are generative, we can synthesize multiple SVBRDF estimates from the same input photograph, from which the user can select the one that best matches their expectations. We demonstrate the flexibility of our method by refining different SVBRDF diffusion models conditioned on different types of incident lighting, and show that for a single photograph under colocated flash lighting our method achieves equal or better accuracy than existing SVBRDF estimation methods.

Diffusion-based Holistic Texture Rectification and Synthesis

We present a novel framework for rectifying occlusions and distortions in degraded texture samples from natural images. Traditional texture synthesis approaches focus on generating textures from pristine samples, which necessitate meticulous preparation by humans and are often unattainable in most natural images. These challenges stem from the frequent occlusions and distortions of texture samples in natural images due to obstructions and variations in object surface geometry. To address these issues, we propose a framework that synthesizes holistic textures from degraded samples in natural images, extending the applicability of exemplar-based texture synthesis techniques. Our framework utilizes a conditional Latent Diffusion Model (LDM) with a novel occlusion-aware latent transformer. This latent transformer not only effectively encodes texture features from partially-observed samples necessary for the generation process of the LDM, but also explicitly captures long-range dependencies in samples with large occlusions. To train our model, we introduce a method for generating synthetic data by applying geometric transformations and free-form mask generation to clean textures. Experimental results demonstrate that our framework significantly outperforms existing methods both quantitatively and qualitatively. Furthermore, we conduct comprehensive ablation studies to validate the different components of our proposed framework. Results are corroborated by a perceptual user study which highlights the effectiveness of our proposed approach.

SESSION: Visual Perception

Curl Noise Jittering

We propose a method for implicitly generating blue noise point sets. Our method is based on the observations that curl noise vector fields are volume-preserving and that jittering can be construed as moving points along the streamlines of a vector field. We demonstrate that the volume preservation keeps the points well separated when jittered using a curl noise vector field. At the same time, the anisotropy that stems from regular lattices is significantly reduced by such jittering. In combination, these properties entail that jittering by curl noise effectively transforms a regular lattice into a point set with blue noise properties. Our implicit method does not require computing the point set in advance. This makes our technique valuable when an arbitrarily large set of points with blue noise properties is needed. We compare our method to several other methods based on jittering as well as other methods for blue noise point set generation. Finally, we show several applications of curl noise jittering in two and three dimensions.
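The following sketch, under simplifying assumptions, jitters a regular 2D lattice along the streamlines of a divergence-free field; the scalar potential stands in for curl noise, and the step count and step size are arbitrary.

```python
import numpy as np

# A 2D sketch of jittering a regular lattice along the streamlines of a
# divergence-free (volume-preserving) vector field. The potential psi below is
# a smooth stand-in; the paper builds the field from curl noise, which this
# sketch does not reproduce. Step count and step size are arbitrary choices.

def psi(p):                                   # scalar potential psi(x, y)
    x, y = p[..., 0], p[..., 1]
    return np.sin(3.1 * x + 1.7) * np.cos(2.3 * y - 0.4)

def curl_velocity(p, eps=1e-4):
    # v = (d psi / dy, -d psi / dx): the 2D curl of psi, hence divergence-free.
    dx = np.array([eps, 0.0]); dy = np.array([0.0, eps])
    dpsi_dx = (psi(p + dx) - psi(p - dx)) / (2 * eps)
    dpsi_dy = (psi(p + dy) - psi(p - dy)) / (2 * eps)
    return np.stack([dpsi_dy, -dpsi_dx], axis=-1)

# Regular lattice in the unit square.
n = 16
grid = np.stack(np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n)), -1).reshape(-1, 2)

# Jitter: advect each lattice point a short distance along the streamlines.
points = grid.copy()
for _ in range(8):                            # a few explicit Euler steps
    points += 0.02 * curl_velocity(points)

print(points.shape)                           # (256, 2) jittered point set
```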

Perceptual error optimization for Monte Carlo animation rendering

Independently estimating pixel values in Monte Carlo rendering results in a perceptually sub-optimal white-noise distribution of error in image space. Recent works have shown that perceptual fidelity can be improved significantly by distributing pixel error as blue noise instead. Most such works have focused on static images, ignoring the temporal perceptual effects of animation display. We extend prior formulations to simultaneously consider the spatial and temporal domains, and perform an analysis to motivate a perceptually better spatio-temporal error distribution. We then propose a practical error optimization algorithm for spatio-temporal rendering and demonstrate its effectiveness in various configurations.

The effect of display capabilities on the gloss consistency between real and virtual objects

A faithful reproduction of gloss is inherently difficult because of the limited dynamic range, peak luminance, and 3D capabilities of display devices. This work investigates how the display capabilities affect gloss appearance with respect to a real-world reference object. To this end, we employ an accurate imaging pipeline to achieve a perceptual gloss match between a virtual and real object presented side-by-side on an augmented-reality high-dynamic-range (HDR) stereoscopic display, which has not been previously attained to this extent. Based on this precise gloss reproduction, we conduct a series of gloss matching experiments to study how gloss perception degrades based on individual factors: object albedo, display luminance, dynamic range, stereopsis, and tone mapping. We support the study with a detailed analysis of individual factors, followed by an in-depth discussion on the observed perceptual effects. Our experiments demonstrate that stereoscopic presentation has a limited effect on the gloss matching task on our HDR display. However, both reduced luminance and dynamic range of the display reduce the perceived gloss. This means that the visual system cannot compensate for the changes in gloss appearance across luminance (lack of gloss constancy), and the tone mapping operator should be carefully selected when reproducing gloss on a low dynamic range (LDR) display.

SESSION: What’re Your Points?

Conditional Resampled Importance Sampling and ReSTIR

Recent work on generalized resampled importance sampling (GRIS) enables importance-sampled Monte Carlo integration with random variable weights replacing the usual division by probability density. This enables very flexible spatiotemporal sample reuse, even if neighboring samples (e.g., light paths) have intractable probability densities. Unlike typical Monte Carlo integration, which samples according to some PDF, GRIS instead resamples existing samples. But resampling with GRIS assumes samples have tractable marginal contribution weights, which is problematic if reusing, for example, light subpaths from unidirectionally-sampled paths. Reusing such subpaths requires conditioning by (non-reused) segments of the path prefixes.

In this paper, we extend GRIS to conditional probability spaces, showing correctness given certain conditional independence between integration variables and their unbiased contribution weights. We show proper conditioning when using GRIS over randomized conditional domains and how to formulate a joint unbiased contribution weight for unbiased integration.

To show our theory has practical impact, we prototype a modified ReSTIR PT with a final gather pass. This reuses subpaths, postponing reuse at least one bounce along each light path. As in photon mapping, such a final gather reduces blotchy artifacts from sample correlation, and the reduced correlation improves the behavior of modern denoisers on ReSTIR PT signals.
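For orientation, the sketch below implements basic (unconditional) resampled importance sampling, the estimator family that GRIS generalizes; the toy integrand, target function, and candidate count are illustrative, and the paper's conditional and spatiotemporal reuse is not reproduced.

```python
import numpy as np

# Minimal resampled importance sampling (RIS): draw M candidates from an easy
# source PDF p, pick one proportionally to w_i = p_hat(x_i) / p(x_i) for a
# target p_hat, and carry an unbiased contribution weight W. The target,
# source, and integrand here are toy choices.

rng = np.random.default_rng(0)

def f(x):        return np.exp(-x) * np.sin(x) ** 2     # integrand on [0, pi]
def p_hat(x):    return np.sin(x) ** 2                  # unnormalized target
def p(x):        return np.full_like(x, 1.0 / np.pi)    # uniform source PDF

M, N = 32, 20000
estimates = np.empty(N)
for k in range(N):
    x = rng.uniform(0.0, np.pi, M)          # candidate samples from p
    w = p_hat(x) / p(x)                     # resampling weights
    y = x[rng.choice(M, p=w / w.sum())]     # resample one candidate
    W = w.sum() / (M * p_hat(y))            # unbiased contribution weight
    estimates[k] = f(y) * W                 # f(y) * W estimates the integral

xs = np.linspace(0.0, np.pi, 200001)
print("RIS estimate:", estimates.mean())
print("reference:   ", np.mean(f(xs)) * np.pi)   # dense quadrature check
```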

ExtraSS: A Framework for Joint Spatial Super Sampling and Frame Extrapolation

We introduce ExtraSS, a novel framework that combines spatial super sampling and frame extrapolation to enhance real-time rendering performance. By integrating these techniques, our approach achieves a balance between performance and quality, generating temporally stable and high-quality, high-resolution results. Leveraging lightweight modules on warping and the ExtraSSNet for refinement, we exploit spatial-temporal information, improve rendering sharpness, handle moving shadings accurately, and generate temporally stable results. Computational costs are significantly reduced compared to traditional rendering methods, enabling higher frame rates and alias-free high resolution results. Evaluation using Unreal Engine demonstrates the benefits of our framework over conventional individual spatial or temporal super sampling methods, delivering improved rendering speed and visual quality. With its ability to generate temporally stable high-quality results, our framework creates new possibilities for real-time rendering applications, advancing the boundaries of performance and photo-realistic rendering in various domains.

Nonlinear Ray Tracing for Displacement and Shell Mapping

Displacement mapping and shell mapping add fine-scale geometric features to meshes and can significantly enhance the realism of an object’s surface representation. Both methods generate geometry within a layer between the base mesh and its offset mesh called a shell. It is not easy to simultaneously achieve high ray tracing performance, low memory consumption, interactive feedback, and ease of implementation, partly because the mapping between shell and texture space is nonlinear. This paper introduces a new efficient approach to perform acceleration structure traversal and intersection tests against microtriangles entirely in texture space by formulating nonlinear rays as degree-2 rational functions. Our method simplifies the implementation of tessellation-free displacement mapping and smooth shell mapping and works even if base mesh triangles are degenerated in uv space.
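As a minimal sketch of the degree-2 rational-ray idea, the code below evaluates P(t) = (At^2 + Bt + C)/(at^2 + bt + c) and intersects it with a plane, which reduces to a scalar quadratic; the coefficients are arbitrary, and how the paper derives them from the shell mapping is not reproduced.

```python
import numpy as np

# Sketch: a "nonlinear ray" represented as a degree-2 rational function
# P(t) = (A t^2 + B t + C) / (a t^2 + b t + c), intersected with a plane
# n . x + d = 0, which reduces to a scalar quadratic in t. The coefficients
# below are arbitrary placeholders.

A = np.array([0.1, 0.0, 0.2]); B = np.array([1.0, 0.5, 0.0]); C = np.array([0.0, 0.0, 1.0])
a, b, c = 0.05, 0.0, 1.0

def P(t):
    return (A * t**2 + B * t + C) / (a * t**2 + b * t + c)

# Plane n . x + d = 0. Substituting P(t) and clearing the denominator gives
# (n.A + d*a) t^2 + (n.B + d*b) t + (n.C + d*c) = 0.
n = np.array([0.0, 0.0, 1.0]); d = -1.5
qa = n @ A + d * a
qb = n @ B + d * b
qc = n @ C + d * c

disc = qb * qb - 4 * qa * qc
if disc >= 0:
    for t in [(-qb - np.sqrt(disc)) / (2 * qa), (-qb + np.sqrt(disc)) / (2 * qa)]:
        print(t, P(t), n @ P(t) + d)   # plane residual should be ~0
```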

SESSION: Text To Anything

Face0: Instantaneously Conditioning a Text-to-Image Model on a Face

We present Face0, a novel way to instantaneously condition a text-to-image generation model on a face without any optimization procedures such as fine-tuning or inversions. We augment a dataset of annotated images with embeddings of the included faces and train an image generation model on the augmented dataset. Once trained, our system is practically identical at inference time to the underlying base model, and is therefore able to generate face-conditioned images in just a couple of seconds. Our method achieves pleasing results, is remarkably simple and extremely fast, and equips the underlying model with new capabilities, like controlling the generated images both via text and via direct manipulation of the input face embeddings. In addition, when using a fixed random vector instead of a face embedding from a user-supplied image, our method essentially solves the problem of consistent character generation across images. Finally, our method decouples the model’s textual biases from its biases on faces. While requiring further research, we hope that this may help reduce biases in future text-to-image models.

Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation

Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to the video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos. Code is available at our project page: https://www.mmlab-ntu.com/project/rerender/

Break-A-Scene: Extracting Multiple Concepts from a Single Image

Text-to-image model personalization aims to introduce a user-provided concept to the model, allowing its synthesis in diverse contexts. However, current methods primarily focus on the case of learning a single concept from multiple images with variations in backgrounds and poses, and struggle when adapted to a different scenario. In this work, we introduce the task of textual scene decomposition: given a single image of a scene that may contain several concepts, we aim to extract a distinct text token for each concept, enabling fine-grained control over the generated scenes. To this end, we propose augmenting the input image with masks that indicate the presence of target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. We then present a novel two-phase customization process that optimizes a set of dedicated textual embeddings (handles), as well as the model weights, striking a delicate balance between accurately capturing the concepts and avoiding overfitting. We employ a masked diffusion loss to enable handles to generate their assigned concepts, complemented by a novel loss on cross-attention maps to prevent entanglement. We also introduce union-sampling, a training strategy aimed at improving the ability to combine multiple concepts in generated images. We use several automatic metrics to quantitatively compare our method against several baselines, and further affirm the results using a user study. Finally, we showcase several applications of our method.

SESSION: See Through The Field

ActRay: Online Active Ray Sampling for Radiance Fields

Thanks to its high-quality reconstruction and photorealistic rendering, the Neural Radiance Field (NeRF) has garnered extensive attention and has been continuously improved. Despite its high visual quality, the prohibitive training time limits its practical application. Although significant acceleration has been achieved, training is still far from real-time due to the need for tens of thousands of iterations. A feasible solution is to reduce the number of required iterations by always training the rays with the highest loss values, instead of the traditional approach of training each ray with uniform probability. To this end, we propose an online active ray sampling strategy, ActRay. Specifically, to avoid the substantial overhead of calculating the actual loss values for all rays in each iteration, a rendering-gradient-based loss propagation algorithm is presented to efficiently estimate the loss values. To further narrow the gap between the estimated loss and the actual loss, an online learning algorithm based on the Upper Confidence Bound (UCB) is proposed to control the sampling probability of the rays, thereby compensating for the bias in loss estimation. We evaluate ActRay on both real-world and synthetic scenes, and the promising results show that it accelerates radiance field training by 6.5x. Besides, we test ActRay under all kinds of radiance field representations (implicit, explicit, and hybrid), showing that it is general and effective across different representations. We believe this work will contribute to the practical application of radiance fields, as it takes a step closer to real-time radiance field training. ActRay is open-source at: https://pku-netvideo.github.io/actray/.
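A minimal sketch of UCB-style ray selection appears below; the bonus term and constant follow the textbook UCB1 form and stand in for ActRay's actual scoring, and the loss estimates are random placeholders rather than the paper's propagated values.

```python
import numpy as np

# Sketch of Upper-Confidence-Bound (UCB) style ray selection: prefer rays with
# high estimated loss, plus an exploration bonus for rarely-sampled rays. The
# bonus form and constant are the textbook UCB1 choice, used here as a
# stand-in; ActRay's loss-propagation estimates and exact scoring are not
# reproduced.

rng = np.random.default_rng(0)
num_rays, batch, c = 10000, 1024, 0.5

est_loss = rng.random(num_rays)          # stand-in for propagated loss estimates
counts = np.ones(num_rays)               # how often each ray has been trained
t = counts.sum()

score = est_loss + c * np.sqrt(np.log(t) / counts)    # UCB score per ray
prob = score / score.sum()                            # sampling probabilities

picked = rng.choice(num_rays, size=batch, replace=False, p=prob)
counts[picked] += 1                      # update visit counts after a training step
print(picked[:10])
```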

MCNeRF: Monte Carlo Rendering and Denoising for Real-Time NeRFs

The volume rendering step used in Neural Radiance Fields (NeRFs) produces highly photorealistic results, but is inherently slow because it evaluates an MLP at a large number of sample points per ray. Previous work has addressed this by either proposing neural scene representations that are faster to evaluate or by pre-computing (and approximating) scene properties to reduce render times. In this work, we propose MCNeRF, a general Monte Carlo-based rendering algorithm that can speed up any NeRF representation. We show that the NeRF volume rendering integral can be efficiently computed via Monte Carlo integration using an importance sampling scheme based on ray density distributions. This allows us to use a small number of MLP evaluations to estimate pixel radiance. These noisy Monte Carlo estimates can be further denoised using an inexpensive image-space denoiser trained per-scene. We demonstrate that MCNeRF can be used to speed up NeRF representations like TensoRF by 7× while closely matching their visual quality and without making the scene approximations that real-time NeRF rendering methods usually make.
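The sketch below estimates the volume rendering integral C = ∫ T(t) σ(t) c(t) dt by Monte Carlo averaging of T σ c / pdf along one ray; the analytic density, color, and uniform sampling PDF are toy stand-ins for the paper's ray-density-based importance sampling.

```python
import numpy as np

# Monte Carlo estimate of the volume rendering integral
#   C = ∫ T(t) σ(t) c(t) dt,  T(t) = exp(-∫_0^t σ(s) ds),
# by sampling distances t from a PDF and averaging T σ c / pdf. The analytic
# density/color and the uniform sampling PDF are toy stand-ins.

rng = np.random.default_rng(0)
t_near, t_far, n_samples = 0.0, 4.0, 16

def sigma(t):  return 0.6 + 0.4 * np.sin(2.0 * t) ** 2   # toy density along the ray
def color(t):  return np.array([0.8, 0.3, 0.1 * t])      # toy RGB as a function of depth

def transmittance(t, n_quad=256):
    # T(t) = exp(-∫_0^t sigma(s) ds), approximated with a dense Riemann sum.
    s = np.linspace(0.0, t, n_quad)
    return np.exp(-np.mean(sigma(s)) * t)

t = rng.uniform(t_near, t_far, n_samples)                # samples from a uniform PDF
pdf = 1.0 / (t_far - t_near)
contrib = np.stack([transmittance(ti) * sigma(ti) * color(ti) / pdf for ti in t])
C = contrib.mean(axis=0)                                 # Monte Carlo pixel radiance
print(C)
```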

RT-Octree: Accelerate PlenOctree Rendering with Batched Regular Tracking and Neural Denoising for Real-time Neural Radiance Fields

Neural Radiance Fields (NeRF) has demonstrated its ability to generate high-quality synthesized views. Nonetheless, due to its slow inference speed, there is a need to explore faster inference methods. In this paper, we propose RT-Octree, which uses batched regular tracking based on PlenOctree with neural denoising to achieve better real-time performance. We achieve this by modifying the volume rendering algorithm to regular tracking. We batch all samples for each pixel in one single ray-voxel intersection process to further improve the real-time performance. To reduce the variance caused by insufficient samples while ensuring real-time speed, we propose a lightweight neural network named GuidanceNet, which predicts the guidance map and weight maps utilized for the subsequent multi-layer denoising module. We evaluate our method on both synthetic and real-world datasets, obtaining a speed of 100+ frames per second (FPS) with a resolution of 1920 × 1080. Compared to PlenOctree, our method is 1.5 to 2 times faster in inference time and significantly outperforms NeRF by several orders of magnitude. The experimental results demonstrate the effectiveness of our approach in achieving real-time performance while maintaining similar rendering quality.

SESSION: Humans & Characters

Intrinsic Harmonization for Illumination-Aware Image Compositing

Despite significant advancements in network-based image harmonization techniques, there still exists a domain disparity between typical training pairs and real-world composites encountered during inference. Most existing methods are trained to reverse global edits made on segmented image regions, which fail to accurately capture the lighting inconsistencies between the foreground and background found in composited images. In this work, we introduce a self-supervised illumination harmonization approach formulated in the intrinsic image domain. First, we estimate a simple global lighting model from mid-level vision representations to generate a rough shading for the foreground region. A network then refines this inferred shading to generate a harmonious re-shading that aligns with the background scene. In order to match the color appearance of the foreground and background, we utilize ideas from prior harmonization approaches to perform parameterized image edits in the albedo domain. To validate the effectiveness of our approach, we present results from challenging real-world composites and conduct a user study to objectively measure the enhanced realism achieved compared to state-of-the-art harmonization methods.

Interactive Story Visualization with Multiple Characters

Accurate story visualization requires several necessary elements, such as identity consistency across frames, the alignment between plain text and visual content, and a reasonable layout of objects in images. Most previous works endeavor to meet these requirements by fitting a text-to-image (T2I) model on a set of videos in the same style and with the same characters, e.g., the FlintstonesSV dataset. However, the learned T2I models typically struggle to adapt to new characters, scenes, and styles, and often lack the flexibility to revise the layout of the synthesized images. This paper proposes a system for generic interactive story visualization, capable of handling multiple novel characters and supporting the editing of layout and local structure. It is developed by leveraging the prior knowledge of large language and T2I models, trained on massive corpora. The system comprises four interconnected components: story-to-prompt generation (S2P), text-to-layout generation (T2L), controllable text-to-image generation (C-T2I), and image-to-video animation (I2V). First, the S2P module converts concise story information into detailed prompts required for subsequent stages. Next, T2L generates diverse and reasonable layouts based on the prompts, offering users the ability to adjust and refine the layout to their preferences. The core component, C-T2I, enables the creation of images guided by layouts, sketches, and actor-specific identifiers to maintain consistency and detail across visualizations. Finally, I2V enriches the visualization process by animating the generated images. Extensive experiments and a user study are conducted to validate the effectiveness and flexibility of interactive editing of the proposed system.

SESSION: ∇f = ?

Differentiable Dynamic Visible-Light Tomography

We propose the first visible-light tomography system for real-time acquisition and reconstruction of general temporally-varying 3D phenomena. Using a single high-speed camera, a high-performance LED array and optical fibers with a total length of 5 km, we build a novel acquisition setup with no mechanical movements to simultaneously sample using 1,920 interleaved sources and detectors with complete 360° coverage. Next, we introduce a novel differentiable framework to map both tomography acquisition and reconstruction to a carefully designed autoencoder. This allows the joint and automatic optimization of both processes in an end-to-end fashion, essentially learning to physically compress and computationally decompress the target information. Our framework can adapt to various factors, and trade between capture speed and reconstruction quality. We achieve an acquisition speed of up to 36.8 volumes per second at a spatial resolution of 32 × 128 × 128; each volume is captured with as few as 8 images. The effectiveness of the system is demonstrated on acquiring various dynamic scenes. Our results are also validated with the reconstructions computed from the measurements with one source on at a time, and compare favorably with state-of-the-art techniques.

Quantum Ray Marching: Reformulating Light Transport for Quantum Computers

The use of quantum computers in computer graphics has gained interest in recent years, especially for the application to rendering. The current state of the art in quantum rendering relies on Grover’s search to find ray intersections in O(√M) for M primitives. This quantum approach is faster than the naive O(M) approach but slower than the O(log M) of modern ray tracing with an acceleration data structure. Furthermore, this quantum ray tracing method is fundamentally limited to casting one ray at a time, so quantum rendering scales in the number of rays just like non-quantum algorithms. We present a new quantum rendering method, quantum ray marching, based on reformulating ray marching as a quantum random walk. Our work is the first complete quantum rendering pipeline capable of light transport simulation and remains asymptotically faster than non-quantum counterparts. Our quantum ray marching can trace an exponential number of paths with polynomial cost, and it leverages quantum numerical integration to converge in O(1/N) for N estimates as opposed to the non-quantum O(1/√N). These properties lead to the first quantum rendering that is asymptotically faster than non-quantum Monte Carlo rendering. We numerically tested our algorithm by rendering 2D and 3D scenes.

Efficient Graphics Representation with Differentiable Indirection

We introduce differentiable indirection – a novel learned primitive that employs differentiable multi-scale lookup tables as an effective substitute for traditional compute and data operations across the graphics pipeline. We demonstrate its flexibility on a number of graphics tasks, i.e., geometric and image representation, texture mapping, shading, and radiance field representation. In all cases, differentiable indirection seamlessly integrates into existing architectures, trains rapidly, and yields both versatile and efficient results.
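A toy 1D version of differentiable indirection is sketched below: a learned index table produces a coordinate that is then used to interpolate into a learned value table, with linear interpolation keeping both lookups differentiable; the table sizes, 1D setting, and fitted target are illustrative assumptions only.

```python
import torch

# Toy 1D "differentiable indirection": a learned index array maps an input
# coordinate to a position, which then linearly interpolates into a learned
# value array. Both arrays receive gradients through the interpolation.

def lerp_lookup(table, u):
    # Linearly interpolate a 1D table at normalized coordinates u in [0, 1].
    pos = u.clamp(0, 1) * (table.shape[0] - 1)
    i0 = pos.floor().long().clamp(max=table.shape[0] - 2)
    frac = pos - i0.float()
    return torch.lerp(table[i0], table[i0 + 1], frac)

index_table = torch.nn.Parameter(torch.linspace(0, 1, 32))   # learned indirection
value_table = torch.nn.Parameter(torch.zeros(64))             # learned values

opt = torch.optim.Adam([index_table, value_table], lr=1e-2)
x = torch.rand(4096)
target = torch.sin(6.28 * x)                                   # function to represent

for step in range(2000):
    u = lerp_lookup(index_table, x)        # first lookup: produces a coordinate
    y = lerp_lookup(value_table, u)        # second lookup: indirected value
    loss = torch.mean((y - target) ** 2)
    opt.zero_grad(); loss.backward(); opt.step()

print(float(loss))
```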

SESSION: Put Things Together

Learning Gradient Fields for Scalable and Generalizable Irregular Packing

The packing problem, also known as cutting or nesting, has diverse applications in logistics, manufacturing, layout design, and atlas generation. It involves arranging irregularly shaped pieces to minimize waste while avoiding overlap. Recent advances in machine learning, particularly reinforcement learning, have shown promise in addressing the packing problem. In this work, we delve deeper into a novel machine learning-based approach that formulates the packing problem as conditional generative modeling. To tackle the challenges of irregular packing, including object validity constraints and collision avoidance, our method employs the score-based diffusion model to learn a series of gradient fields. These gradient fields encode the correlations between constraint satisfaction and the spatial relationships of polygons, learned from teacher examples. During the testing phase, packing solutions are generated using a coarse-to-fine refinement mechanism guided by the learned gradient fields. To enhance packing feasibility and optimality, we introduce two key architectural designs: multi-scale feature extraction and coarse-to-fine relation extraction. We conduct experiments on two typical industrial packing domains, considering translations only. Empirically, our approach demonstrates spatial utilization rates comparable to, or even surpassing, those achieved by the teacher algorithm responsible for training data generation. Additionally, it exhibits some level of generalization to shape variations. We are hopeful that this method could pave the way for new possibilities in solving the packing problem.

SESSION: Head & Face

An Implicit Physical Face Model Driven by Expression and Style

3D facial animation is often produced by manipulating facial deformation models (or rigs) that are traditionally parameterized by expression controls. A key component that is usually overlooked is expression “style”, as in, how a particular expression is performed. Although it is common to define a semantic basis of expressions that characters can perform, most characters perform each expression in their own style. To date, style is usually entangled with the expression, and it is not possible to transfer the style of one character to another when considering facial animation. We present a new face model, based on a data-driven implicit neural physics model, that can be driven by both expression and style separately. At the core, we present a framework for learning implicit physics-based actuations for multiple subjects simultaneously, trained on a few arbitrary performance capture sequences from a small set of identities. Once trained, our method allows generalized physics-based facial animation for any of the trained identities, extending to unseen performances. Furthermore, it grants control over the animation style, enabling style transfer from one character to another or blending styles of different characters. Lastly, as a physics-based model, it is capable of synthesizing physical effects, such as collision handling, setting our method apart from conventional approaches.

SESSION: Computer Vision

Light-Efficient Holographic Illumination for Continuous-Wave Time-of-Flight Imaging

Time-of-flight (TOF) cameras have seen widespread adoption in recent years across the entire spectrum of commodity devices. However, these devices are fundamentally limited by their dynamic range, struggling with saturation from nearby, brighter objects and noisy depth from farther, darker objects. In this work, we explore overcoming these limitations in the context of continuous-wave time-of-flight (CWTOF) devices, by using a holographic light source capable of redistributing light according to arbitrary patterns. In particular, we propose using such a system to move light from overexposed to underexposed regions of a scene, such that the entire scene is well exposed. Such a methodology can be easily integrated with existing illumination schemes for TOF. Our proof-of-concept prototype is constructed from off-the-shelf optical components, and demonstrated on a number of lab scenes.

Self-Calibrating, Fully Differentiable NLOS Inverse Rendering

Existing time-resolved non-line-of-sight (NLOS) imaging methods reconstruct hidden scenes by inverting the optical paths of indirect illumination measured at visible relay surfaces. These methods are prone to reconstruction artifacts due to inversion ambiguities and capture noise, which are typically mitigated through the manual selection of filtering functions and parameters. We introduce a fully-differentiable end-to-end NLOS inverse rendering pipeline that self-calibrates the imaging parameters during the reconstruction of hidden scenes, using as input only the measured illumination while working both in the time and frequency domains. Our pipeline extracts a geometric representation of the hidden scene from NLOS volumetric intensities and estimates the time-resolved illumination at the relay wall produced by such geometric information using differentiable transient rendering. We then use gradient descent to optimize imaging parameters by minimizing the error between our simulated time-resolved illumination and the measured illumination. Our end-to-end differentiable pipeline couples diffraction-based volumetric NLOS reconstruction with path-space light transport and a simple ray marching technique to extract detailed, dense sets of surface points and normals of hidden scenes. We demonstrate the robustness of our method to consistently reconstruct geometry and albedo, even under significant noise levels.

Neural Spectro-polarimetric Fields

Modeling the spatial radiance distribution of light rays in a scene has been extensively explored for applications, including view synthesis. Spectrum and polarization, the wave properties of light, are often neglected due to their integration into three RGB spectral bands and their non-perceptibility to human vision. However, these properties are known to encompass substantial material and geometric information about a scene. Here, we propose to model spectro-polarimetric fields, the spatial Stokes-vector distribution of any light ray at an arbitrary wavelength. We present Neural Spectro-polarimetric Fields (NeSpoF), a neural representation that models the physically-valid Stokes vector at given continuous variables of position, direction, and wavelength. NeSpoF manages inherently noisy raw measurements, showcases memory efficiency, and preserves physically vital signals — factors that are crucial for representing the high-dimensional signal of a spectro-polarimetric field. To validate NeSpoF, we introduce the first multi-view hyperspectral-polarimetric image dataset, comprised of both synthetic and real-world scenes. These were captured using our compact hyperspectral-polarimetric imaging system, which has been calibrated for robustness against system imperfections. We demonstrate the capabilities of NeSpoF on diverse scenes.
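A hedged sketch of such a field is given below: a small MLP maps (position, direction, wavelength) to a Stokes vector, with validity enforced by predicting an intensity, a degree of polarization, and a unit polarization direction; the network size, encoding, and this particular parameterization are assumptions, not NeSpoF's actual design.

```python
import torch

# Toy field mapping (position, direction, wavelength) -> a Stokes vector
# (s0, s1, s2, s3). Physical validity requires s0 >= 0 and
# sqrt(s1^2 + s2^2 + s3^2) <= s0; here we enforce it by predicting an
# intensity, a degree of polarization in [0, 1], and a unit polarization
# direction.

class StokesField(torch.nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(7, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 5),   # raw: intensity, dop, 3D direction
        )

    def forward(self, pos, direction, wavelength):
        x = torch.cat([pos, direction, wavelength], dim=-1)        # (N, 7)
        raw = self.net(x)
        s0 = torch.nn.functional.softplus(raw[..., :1])            # s0 >= 0
        dop = torch.sigmoid(raw[..., 1:2])                         # in [0, 1]
        pol = torch.nn.functional.normalize(raw[..., 2:], dim=-1)  # unit vector
        return torch.cat([s0, s0 * dop * pol], dim=-1)             # valid Stokes

field = StokesField()
pos = torch.rand(8, 3)
direction = torch.nn.functional.normalize(torch.randn(8, 3), dim=-1)
wavelength = torch.full((8, 1), 550e-9)
print(field(pos, direction, wavelength).shape)   # (8, 4)
```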

UVDoc: Neural Grid-based Document Unwarping

Restoring the original, flat appearance of a printed document from casual photographs of bent and wrinkled pages is a common everyday problem. In this paper we propose a novel method for grid-based single-image document unwarping. Our method performs geometric distortion correction via a fully convolutional deep neural network that learns to predict the 3D grid mesh of the document and the corresponding 2D unwarping grid in a dual-task fashion, implicitly encoding the coupling between the shape of a 3D piece of paper and its 2D image. In order to allow unwarping models to train on data that is more realistic in appearance than the commonly used synthetic Doc3D dataset, we create and publish our own dataset, called UVDoc, which combines pseudo-photorealistic document images with physically accurate 3D shape and unwarping function annotations. Our dataset is labeled with all the information necessary to train our unwarping network, without having to engineer separate loss functions that can deal with the lack of ground-truth typically found in document-in-the-wild datasets. We perform an in-depth evaluation that demonstrates that with the inclusion of our novel pseudo-photorealistic dataset, our relatively small network architecture achieves state-of-the-art results on the DocUNet benchmark. We show that the pseudo-photorealistic nature of our UVDoc dataset allows for new and better evaluation methods, such as lighting-corrected MS-SSIM. We provide a novel benchmark dataset that facilitates such evaluations, and propose a metric that quantifies line straightness after unwarping. Our code, results and UVDoc dataset are available at https://github.com/tanguymagne/UVDoc.

SESSION: Deformable Solids

LiCROM: Linear-Subspace Continuous Reduced Order Modeling with Neural Fields

Linear reduced-order modeling (ROM) simplifies complex simulations by approximating the behavior of a system using a simplified kinematic representation. Typically, ROM is trained on input simulations created with a specific spatial discretization, and then serves to accelerate simulations with the same discretization. This discretization-dependence is restrictive.

Becoming independent of a specific discretization would provide flexibility to mix and match mesh resolutions, connectivity, and type (tetrahedral, hexahedral) in training data; to accelerate simulations with novel discretizations unseen during training; and to accelerate adaptive simulations that temporally or parametrically change the discretization.

We present a flexible, discretization-independent approach to reduced-order modeling. Like traditional ROM, we represent the configuration as a linear combination of displacement fields. Unlike traditional ROM, our displacement fields are continuous maps from every point on the reference domain to a corresponding displacement vector; these maps are represented as implicit neural fields.

With linear continuous ROM (LiCROM), our training set can include multiple geometries undergoing multiple loading conditions, independent of their discretization. This opens the door to novel applications of reduced order modeling. We can now accelerate simulations that modify the geometry at runtime, for instance via cutting, hole punching, and even swapping the entire mesh. We can also accelerate simulations of geometries unseen during training. We demonstrate one-shot generalization, training on a single geometry and subsequently simulating various unseen geometries.
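The sketch below illustrates the continuous reduced kinematics u(X) = sum_i q_i * phi_i(X), with random-feature fields standing in for the implicit neural basis fields; the basis construction and mode count are illustrative only.

```python
import numpy as np

# Sketch: a discretization-independent reduced-order displacement
#   u(X) = sum_i q_i * phi_i(X),
# where each basis field phi_i is a continuous map from reference position X
# to a displacement vector (here a random-feature stand-in for an implicit
# neural field), and q are the reduced coordinates evolved by the simulation.

rng = np.random.default_rng(0)
r = 8                                   # number of reduced modes

W = rng.normal(size=(r, 3, 3))          # random-feature "neural field" weights
b = rng.normal(size=(r, 3))

def basis_fields(X):
    # phi_i(X): evaluate all r continuous displacement basis fields at points X.
    # Returns an array of shape (num_points, r, 3).
    return np.sin(np.einsum("rij,nj->nri", W, X) + b)

def displacement(X, q):
    # u(X) = sum_i q_i * phi_i(X): linear combination in the reduced coordinates.
    return np.einsum("r,nri->ni", q, basis_fields(X))

X = rng.uniform(-1, 1, size=(1000, 3))  # arbitrary sample points on any mesh
q = rng.normal(size=r)                  # reduced configuration
print(displacement(X, q).shape)         # (1000, 3), independent of discretization
```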

Subspace Mixed Finite Elements for Real-Time Heterogeneous Elastodynamics

Real-time elastodynamic solvers are well-suited for the rapid simulation of homogeneous elastic materials, with high rates generally enabled by aggressive early termination of timestep solves. Unfortunately, the introduction of strong domain heterogeneities can make these solvers slow to converge. Stopping the solve short creates visible damping artifacts and rotational errors. To address these challenges we develop a reduced mixed finite element solver that preserves rich rotational motion, even at low-iteration regimes. Specifically, this solver augments time-step solve optimizations with auxiliary stretch degrees of freedom at mesh elements, and maintains consistency with the primary positional degrees of freedom at mesh nodes via explicit constraints. We make use of a Skinning Eigenmode subspace for our positional degrees of freedom. We accelerate integration of non-linear elastic energies with a cubature approximation, placing stretch degrees of freedom at cubature points. Across a wide range of examples we demonstrate that this subspace is particularly well suited for heterogeneous material simulation. Our resulting method is a subspace mixed finite element method completely decoupled from the resolution of the mesh that is well-suited for real-time simulation of heterogeneous domains.

Second-Order Finite Elements for Deformable Surfaces

We present a computational framework for simulating deformable surfaces from planar rest shape with second-order triangular finite elements. Our method develops numerical schemes for discretizing stretching, shearing, and bending energies of deformable surfaces in a second-order finite-element setting. In particular, we introduce a novel discretization scheme for approximating mean curvatures on a curved triangle mesh. Our framework also integrates a virtual-node finite-element scheme that supports two-way coupling between cut-cell rods without expensive remeshing. We compare our approach with traditional simulation methods using linear and higher-order finite elements and demonstrate its advantages in several challenging settings, such as low-resolution meshes, anisotropic triangulation, and stiff materials. Finally, we showcase several applications of our framework, including cloth simulation, mixed Origami and Kirigami, and biologically-inspired soft wing simulation.

A Hessian-Based Field Deformer for Real-Time Topology-Aware Shape Editing

Shape manipulation is a central research topic in computer graphics. Topology editing, such as breaking apart connections, joining disconnected ends, and filling/opening a topological hole, is generally more challenging than geometry editing. In this paper, we observe that the saddle points of the signed distance function (SDF) provide useful hints for altering surface topology deliberately. Based on this key observation, we parameterize the SDF into a cubic trivariate tensor-product B-spline function F whose saddle points {si} can be quickly exhausted based on a subdivision-based root-finding technique coupled with Newton’s method. Users can select one of the candidate points, say si, to edit the topology in real time. In implementation, we add a compactly supported B-spline function rooted at si, which we call a deformer in this paper, to F, with its local coordinate system aligning with the three eigenvectors of the Hessian. Combined with a ray marching technique, our interactive system operates at 30 FPS. Additionally, our system empowers users to create desired bulges or concavities on the surface. An extensive user study indicates that our system is user-friendly and intuitive to operate. We demonstrate the effectiveness and usefulness of our system in a range of applications, including fixing surface reconstruction errors, artistic work design, 3D medical imaging and simulation, and antiquity restoration. Please refer to the attached video for a demonstration.
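As a small illustration of the saddle-point machinery, the sketch below runs Newton's method on the gradient of an analytic trivariate function and classifies the critical point via Hessian eigenvalues; the analytic F stands in for the paper's cubic B-spline SDF fit, and the subdivision-based root bracketing is omitted.

```python
import numpy as np

# Sketch: locate a critical point of a smooth trivariate function F with
# Newton's method on its gradient, then classify it via the Hessian
# eigenvalues (mixed signs = saddle). The analytic F below is a stand-in.

def F(p):
    x, y, z = p
    return x**2 - y**2 + 0.5 * z**2 + 0.3 * x * y

def grad(p, eps=1e-5):
    e = np.eye(3) * eps
    return np.array([(F(p + e[i]) - F(p - e[i])) / (2 * eps) for i in range(3)])

def hessian(p, eps=1e-4):
    e = np.eye(3) * eps
    return np.array([(grad(p + e[i]) - grad(p - e[i])) / (2 * eps) for i in range(3)])

p = np.array([0.4, -0.3, 0.2])                 # starting guess near the critical point
for _ in range(20):                            # Newton iterations on grad F = 0
    p = p - np.linalg.solve(hessian(p), grad(p))

eigvals, eigvecs = np.linalg.eigh(hessian(p))  # Hessian eigen-decomposition
print("critical point:", p)
print("saddle:", (eigvals > 0).any() and (eigvals < 0).any())
# The eigenvectors give the local frame in which a compactly supported
# deformer could be aligned, as described above.
```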

SESSION: Motion Capture and Reconstruction

Efficient Human Motion Reconstruction from Monocular Videos with Physical Consistency Loss

Vision-only motion reconstruction from monocular videos often produces artifacts such as foot sliding and jittering. Existing physics-based methods typically either simplify the problem to focus solely on foot-ground contacts, or they reconstruct full-body contacts within a physics simulator, necessitating the solution of a time-consuming bilevel optimization problem. To overcome these limitations, we present an efficient gradient-based method for reconstructing complex human motions (including highly dynamic and acrobatic movements) with physical constraints. Our approach reformulates human motion dynamics through a differentiable physical consistency loss within an augmented search space that accounts for both contacts and camera alignment. This enables us to transform the motion reconstruction task into a single-level trajectory optimization problem. Experimental results demonstrate that our method can reconstruct complex human motions from real-world videos in minutes, which is substantially faster than previous approaches. Additionally, the reconstructed results show enhanced physical realism compared to existing methods.

Fusing Monocular Images and Sparse IMU Signals for Real-time Human Motion Capture

Either RGB images or inertial signals have been used for the task of motion capture (mocap), but combining them is a new and interesting topic. We believe that the combination is complementary and able to solve the inherent difficulties of using one modality input, including occlusions, extreme lighting/texture, and out-of-view for visual mocap and global drifts for inertial mocap. To this end, we propose a method that fuses monocular images and sparse IMUs for real-time human motion capture. Our method contains a dual coordinate strategy to fully explore the IMU signals with different goals in motion capture. To be specific, besides one branch transforming the IMU signals to the camera coordinate system to combine with the image information, there is another branch to learn from the IMU signals in the body root coordinate system to better estimate body poses. Furthermore, a hidden state feedback mechanism is proposed for both branches to compensate for their own drawbacks in extreme input cases. Thus our method can easily switch between the two kinds of signals or combine them in different cases to achieve a robust mocap. Quantitative and qualitative results demonstrate that by delicately designing the fusion method, our technique significantly outperforms the state-of-the-art vision, IMU, and combined methods on both global orientation and local pose estimation. Our codes are available for research at https://shaohua-pan.github.io/robustcap-page/.

A Locality-based Neural Solver for Optical Motion Capture

We present a novel locality-based learning method for cleaning and solving optical motion capture data. Given noisy marker data, we propose a new heterogeneous graph neural network which treats markers and joints as different types of nodes, and uses graph convolution operations to extract the local features of markers and joints and transform them into clean motions. To deal with anomalous markers (e.g. occluded or with big tracking errors), the key insight is that a marker’s motion shows strong correlations with the motions of its immediate neighboring markers but less so with other markers, a.k.a. locality, which enables us to efficiently fill in missing markers (e.g. due to occlusion). Additionally, we identify marker outliers due to tracking errors by investigating their acceleration profiles. We further propose a training regime based on representation learning and data augmentation, training the model on data with masking; the masking schemes aim to mimic the occluded and noisy markers often observed in real data. Finally, we show that our method achieves high accuracy on multiple metrics across various datasets. Extensive comparisons show our method outperforms state-of-the-art methods, reducing the position error of occluded markers by approximately 20%, which leads to a further 30% error reduction on the reconstructed joint rotations and positions. The code and data for this paper are available at https://github.com/non-void/LocalMoCap.

Adaptive Tracking of a Single-Rigid-Body Character in Various Environments

Since the introduction of DeepMimic [Peng et al. 2018a], subsequent research has focused on expanding the repertoire of simulated motions across various scenarios. In this study, we propose an alternative approach for this goal, a deep reinforcement learning method based on the simulation of a single-rigid-body character. Using the centroidal dynamics model (CDM) to express the full-body character as a single rigid body (SRB) and training a policy to track a reference motion, we can obtain a policy that is capable of adapting to various unobserved environmental changes and controller transitions without requiring any additional learning. Due to the reduced dimension of state and action space, the learning process is sample-efficient. The final full-body motion is kinematically generated in a physically plausible way, based on the state of the simulated SRB character. The SRB simulation is formulated as a quadratic programming (QP) problem, and the policy outputs an action that allows the SRB character to follow the reference motion. We demonstrate that our policy, efficiently trained within 30 minutes on an ultraportable laptop, has the ability to cope with environments that have not been experienced during learning, such as running on uneven terrain or pushing a box, and transitions between learned policies, without any additional learning.

SESSION: Neural Shape Representation

Anti-Aliased Neural Implicit Surfaces with Encoding Level of Detail

We present LoD-NeuS, an efficient neural representation for high-frequency geometry detail recovery and anti-aliased novel view rendering. Drawing inspiration from voxel-based representations with the level of detail (LoD), we introduce a multi-scale tri-plane-based scene representation that is capable of capturing the LoD of the signed distance function (SDF) and the space radiance. Our representation aggregates space features from a multi-convolved featurization within a conical frustum along a ray and optimizes the LoD feature volume through differentiable rendering. Additionally, we propose an error-guided sampling strategy to guide the growth of the SDF during the optimization. Both qualitative and quantitative evaluations demonstrate that our method achieves superior surface reconstruction and photorealistic view synthesis compared to state-of-the-art approaches.

Compact Neural Graphics Primitives with Learned Hash Probing

Neural graphics primitives are faster and achieve higher quality when their neural networks are augmented by spatial data structures that hold trainable features arranged in a grid. However, existing feature grids either come with a large memory footprint (dense or factorized grids, trees, and hash tables) or slow performance (index learning and vector quantization). In this paper, we show that a hash table with learned probes has neither disadvantage, resulting in a favorable combination of size and speed. Inference is faster than unprobed hash tables at equal quality while training is only 1.2–2.6× slower, significantly outperforming prior index learning approaches. We arrive at this formulation by casting all feature grids into a common framework: they each correspond to a lookup function that indexes into a table of feature vectors. In this framework, the lookup functions of existing data structures can be combined by simple arithmetic combinations of their indices, resulting in Pareto optimal compression and speed.
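The sketch below shows the shared lookup-function framing: hash a grid coordinate into a feature table and offset the slot by a probe index. Here the probe choice is random rather than learned, so it illustrates only the structure of the lookup, not the paper's learned probing, and the constants are generic spatial-hash choices.

```python
import numpy as np

# Sketch of the "lookup function" framing: a spatial grid coordinate is hashed
# into a feature table, and a small set of probe offsets widens the choice of
# slots. In the paper the probe index is learned per entry; here it is fixed
# at random, so only the lookup structure is illustrated.

T = 2**14                 # feature table size
F = 2                     # features per entry
P = 8                     # number of candidate probe offsets

rng = np.random.default_rng(0)
feature_table = rng.normal(size=(T + P, F)).astype(np.float32)
probe_choice = rng.integers(0, P, size=T, dtype=np.uint64)   # stand-in for a learned probe index

PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def spatial_hash(coords):
    # coords: (N, 3) integer grid coordinates -> base slot in [0, T).
    h = coords.astype(np.uint64) * PRIMES
    return (h[:, 0] ^ h[:, 1] ^ h[:, 2]) % np.uint64(T)

def lookup(coords):
    base = spatial_hash(coords)
    slot = base + probe_choice[base]         # probed slot in [0, T + P)
    return feature_table[slot]               # (N, F) feature vectors

coords = rng.integers(0, 512, size=(5, 3))
print(lookup(coords))
```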

Constructive Solid Geometry on Neural Signed Distance Fields

Signed Distance Fields (SDFs) parameterized by neural networks have recently gained popularity as a fundamental geometric representation. However, editing the shape encoded by a neural SDF remains an open challenge. A tempting approach is to leverage common geometric operators (e.g., boolean operations), but such edits often lead to incorrect non-SDF outputs (which we call Pseudo-SDFs), preventing them from being used for downstream tasks. In this paper, we characterize the space of Pseudo-SDFs, which are eikonal yet not true distance functions, and derive the closest point loss, a novel regularizer that encourages the output to be an exact SDF. We demonstrate the applicability of our regularization to many operations in which traditional methods cause a Pseudo-SDF to arise, such as CSG and swept volumes, and produce a true (neural) SDF for the result of these operations.
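The sketch below composes two exact sphere SDFs with min/max booleans and checks numerically that the intersection's max() value underestimates the true distance away from the surface, i.e., yields a Pseudo-SDF; the closest point loss itself is not reproduced.

```python
import numpy as np

# Sketch: boolean CSG on two exact SDFs via min/max, plus a numeric check that
# the result is generally *not* an exact distance function away from the
# surface (a Pseudo-SDF in the paper's terminology). This only illustrates why
# such edits need a regularizer before the output can be reused as a true SDF.

def sdf_sphere(p, center, radius):
    return np.linalg.norm(p - center, axis=-1) - radius

def csg_union(d1, d2):        return np.minimum(d1, d2)
def csg_intersection(d1, d2): return np.maximum(d1, d2)
def csg_difference(d1, d2):   return np.maximum(d1, -d2)

c1, c2, r = np.array([-0.6, 0.0, 0.0]), np.array([0.6, 0.0, 0.0]), 1.0

# Query point above the lens-shaped intersection of the two unit spheres.
p = np.array([0.0, 3.0, 0.0])
d_max = csg_intersection(sdf_sphere(p, c1, r), sdf_sphere(p, c2, r))

# The intersection's boundary circle is x = 0, y^2 + z^2 = 0.64, so its closest
# point to p is (0, 0.8, 0), at true distance 3.0 - 0.8 = 2.2. The max() value
# is only the distance to each full sphere's surface (~2.06): a Pseudo-SDF.
print("max() value:  ", float(d_max))   # ~2.06, underestimates
print("true distance:", 3.0 - 0.8)      # 2.2
```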

Neural Gradient Learning and Optimization for Oriented Point Normal Estimation

We propose Neural Gradient Learning (NGL), a deep learning approach to learn gradient vectors with consistent orientation from 3D point clouds for normal estimation. It has excellent gradient approximation properties for the underlying geometry of the data. We utilize a simple neural network to parameterize the objective function to produce gradients at points using a global implicit representation. However, the derived gradients usually drift away from the ground-truth oriented normals due to the lack of local detail descriptions. Therefore, we introduce Gradient Vector Optimization (GVO) to learn an angular distance field based on local plane geometry to refine the coarse gradient vectors. Finally, we formulate our method with a two-phase pipeline of coarse estimation followed by refinement. Moreover, we integrate two weighting functions, i.e., an anisotropic kernel and an inlier score, into the optimization to improve robustness and detail preservation. Our method efficiently conducts global gradient approximation while achieving better accuracy and generalization in local feature description. This leads to a state-of-the-art normal estimator that is robust to noise, outliers and point density variations. Extensive evaluations show that our method outperforms previous works in both unoriented and oriented normal estimation on widely used benchmarks. The source code and pre-trained models are available at https://github.com/LeoQLi/NGLO.