SA '22 Conference Papers: SIGGRAPH Asia 2022 Conference Papers

Full Citation in the ACM Digital Library

SESSION: Character Animation

Morig: Motion-Aware Rigging of Character Meshes from Point Clouds

We present MoRig, a method that automatically rigs character meshes driven by single-view point cloud streams capturing the motion of performing characters. Our method is also able to animate the 3D meshes according to the captured point cloud motion. MoRig’s neural network encodes motion cues from the point clouds into features that are informative about the articulated parts of the performing character. These motion-aware features guide the inference of an appropriate skeletal rig for the input mesh, which is then animated based on the point cloud motion. Our method can rig and animate diverse characters, including humanoids, quadrupeds, and toys with varying articulation. It accounts for occluded regions in the point clouds and mismatches in the part proportions between the input mesh and captured character. Compared to other rigging approaches that ignore motion cues, MoRig produces more accurate rigs, well-suited for re-targeting motion from captured characters.

QuestSim: Human Motion Tracking from Sparse Sensors with Simulated Avatars

Real-time tracking of human body motion is crucial for interactive and immersive experiences in AR/VR. However, very limited sensor data about the body is available from standalone wearable devices such as HMDs (Head Mounted Devices) or AR glasses. In this work, we present a reinforcement learning framework that takes in sparse signals from an HMD and two controllers, and simulates plausible and physically valid full body motions. Using high quality full body motion as dense supervision during training, a simple policy network can learn to output appropriate torques for the character to balance, walk, and jog, while closely following the input signals. Our results demonstrate surprisingly similar leg motions to ground truth without any observations of the lower body, even when the input is only the 6D transformations of the HMD. We also show that a single policy can be robust to diverse locomotion styles, different body sizes, and novel environments.

Transformer Inertial Poser: Real-time Human Motion Reconstruction from Sparse IMUs with Simultaneous Terrain Generation

Real-time human motion reconstruction from a sparse set of (e.g. six) wearable IMUs provides a non-intrusive and economic approach to motion capture. Without the ability to acquire position information directly from IMUs, recent works took data-driven approaches that utilize large human motion datasets to tackle this under-determined problem. Still, challenges remain such as temporal consistency, drifting of global and joint motions, and diverse coverage of motion types on various terrains. We propose a novel method to simultaneously estimate full-body motion and generate plausible visited terrain from only six IMU sensors in real-time. Our method incorporates 1. a conditional Transformer decoder model giving consistent predictions by explicitly reasoning prediction history, 2. a simple yet general learning target named "stationary body points” (SBPs) which can be stably predicted by the Transformer model and utilized by analytical routines to correct joint and global drifting, and 3. an algorithm to generate regularized terrain height maps from noisy SBP predictions which can in turn correct noisy global motion estimation. We evaluate our framework extensively on synthesized and real IMU data, and with real-time live demos, and show superior performance over strong baseline methods.

SESSION: Distances and Matching

Compressing Geodesic Information for Fast Point-to-Point Geodesic Distance Queries

Geodesic distances between pairs of points on a 3D mesh surface are a crucial ingredient of many geometry processing tasks, but are notoriously difficult to compute efficiently on demand. We propose a novel method for the compact storage of geodesic distance information, which enables answering point-to-point geodesic distance queries very efficiently. For a triangle mesh with n vertices, if computing the geodesic distance to all vertices from a single source vertex costs O(f(n)) time, then we generate a database of size O(mnlog n) in time in a preprocessing stage, where m is a constant that depends on the geometric complexity of the surface. We achieve this by computing a nested bisection of the mesh surface using separator curves and storing compactly-described functions approximating the distances between each mesh vertex and a small relevant subset of these curves. Using this database, the geodesic distance between two mesh vertices can then be approximated well by solving a small number of simple univariate minimization problems in O(mlog n) worst case time and O(m) average case time. Our method provides an excellent tradeoff between the size of the database, query runtime, and accuracy of the result. It can be used to compress exact or approximate geodesic distances, e.g., those obtained by VTP (exact), fast DGG, fast marching, or the heat method (approximate) and is very efficient if f(n) = n, as for the fast DGG method.

SESSION: Differentiable Rendering

Differentiable Rendering of Neural SDFs through Reparameterization

We present a method to automatically compute correct gradients with respect to geometric scene parameters in neural SDF renderers. Recent physically-based differentiable rendering techniques for meshes have used edge-sampling to handle discontinuities, particularly at object silhouettes, but SDFs do not have a simple parametric form amenable to sampling. Instead, our approach builds on area-sampling techniques and develops a continuous warping function for SDFs to account for these discontinuities. Our method leverages the distance to surface encoded in an SDF and uses quadrature on sphere tracer points to compute this warping function. We further show that this can be done by subsampling the points to make the method tractable for neural SDFs. Our differentiable renderer can be used to optimize neural shapes from multi-view images and produces comparable 3D reconstructions to recent SDF-based inverse rendering methods, without the need for 2D segmentation masks to guide the geometry optimization and no volumetric approximations to the geometry.

Learning-based Inverse Rendering of Complex Indoor Scenes with Differentiable Monte Carlo Raytracing

Indoor scenes typically exhibit complex, spatially-varying appearance from global illumination, making inverse rendering a challenging ill-posed problem. This work presents an end-to-end, learning-based inverse rendering framework incorporating differentiable Monte Carlo raytracing with importance sampling. The framework takes a single image as input to jointly recover the underlying geometry, spatially-varying lighting, and photorealistic materials. Specifically, we introduce a physically-based differentiable rendering layer with screen-space ray tracing, resulting in more realistic specular reflections that match the input photo. In addition, we create a large-scale, photorealistic indoor scene dataset with significantly richer details like complex furniture and dedicated decorations. Further, we design a novel out-of-view lighting network with uncertainty-aware refinement leveraging hypernetwork-based neural radiance fields to predict lighting outside the view of the input photo. Through extensive evaluations on common benchmark datasets, we demonstrate superior inverse rendering quality of our method compared to state-of-the-art baselines, enabling various applications such as complex object insertion and material editing with high fidelity. Code and data will be made available at

Differentiable Point-Based Radiance Fields for Efficient View Synthesis

We propose a differentiable rendering algorithm for efficient novel view synthesis. By departing from volume-based representations in favor of a learned point representation, we improve on existing methods more than an order of magnitude in memory and runtime, both in training and inference. The method begins with a uniformly-sampled random point cloud and learns per-point position and view-dependent appearance, using a differentiable splat-based renderer to train the model to reproduce a set of input training images with the given pose. Our method is up to 300 × faster than NeRF in both training and inference, with only a marginal sacrifice in quality, while using less than 10 MB of memory for a static scene. For dynamic scenes, our method trains two orders of magnitude faster than STNeRF and renders at a near interactive rate, while maintaining high image quality and temporal coherence even without imposing any temporal-coherency regularizers.

SESSION: Image Generation

DynaGAN: Dynamic Few-shot Adaptation of GANs to Multiple Domains

Few-shot domain adaptation to multiple domains aims to learn a complex image distribution across multiple domains from a few training images. A naïve solution here is to train a separate model for each domain using few-shot domain adaptation methods. Unfortunately, this approach mandates linearly-scaled computational resources both in memory and computation time and, more importantly, such separate models cannot exploit the shared knowledge between target domains. In this paper, we propose DynaGAN, a novel few-shot domain-adaptation method for multiple target domains. DynaGAN has an adaptation module, which is a hyper-network that dynamically adapts a pretrained GAN model into the multiple target domains. Hence, we can fully exploit the shared knowledge across target domains and avoid the linearly-scaled computational requirements. As it is still computationally challenging to adapt a large-size GAN model, we design our adaptation module to be lightweight using the rank-1 tensor decomposition. Lastly, we propose a contrastive-adaptation loss suitable for multi-domain few-shot adaptation. We validate the effectiveness of our method through extensive qualitative and quantitative evaluations.

Dr.3D: Adapting 3D GANs to Artistic Drawings

While 3D GANs have recently demonstrated the high-quality synthesis of multi-view consistent images and 3D shapes, they are mainly restricted to photo-realistic human portraits. This paper aims to extend 3D GANs to a different, but meaningful visual form: artistic portrait drawings. However, extending existing 3D GANs to drawings is challenging due to the inevitable geometric ambiguity present in drawings. To tackle this, we present Dr.3D, a novel adaptation approach that adapts an existing 3D GAN to artistic drawings. Dr.3D is equipped with three novel components to handle the geometric ambiguity: a deformation-aware 3D synthesis network, an alternating adaptation of pose estimation and image synthesis, and geometric priors. Experiments show that our approach can successfully adapt 3D GANs to drawings and enable multi-view consistent semantic editing of drawings.

SESSION: Acquisition

DeepMVSHair: Deep Hair Modeling from Sparse Views

We present DeepMVSHair, the first deep learning-based method for multi-view hair strand reconstruction. The key component of our pipeline is HairMVSNet, a differentiable neural architecture which represents a spatial hair structure as a continuous 3D hair growing direction field implicitly. Specifically, given a 3D query point, we decide its occupancy value and direction from observed 2D structure features. With the query point’s pixel-aligned features from each input view, we utilize a view-aware transformer encoder to aggregate anisotropic structure features to an integrated representation, which is decoded to yield 3D occupancy and direction at the query point. HairMVSNet effectively gathers multi-view hair structure features and preserves high-frequency details based on this implicit representation. Guided by HairMVSNet, our hair-growing algorithm produces results faithful to input multi-view images. We propose a novel image-guided multi-view strand deformation algorithm to enrich modeling details further. Extensive experiments show that the results by our sparse-view method are comparable to those by state-of-the-art dense multi-view methods and significantly better than those by single-view and sparse-view methods. In addition, our method is an order of magnitude faster than previous multi-view hair modeling methods.

SESSION: Radiance Fields, Bases, and Probes

Fast Dynamic Radiance Fields with Time-Aware Neural Voxels

Neural radiance fields (NeRF) have shown great success in modeling 3D scenes and synthesizing novel-view images. However, most previous NeRF methods take much time to optimize one single scene. Explicit data structures, e.g. voxel features, show great potential to accelerate the training process. However, voxel features face two big challenges to be applied to dynamic scenes, i.e. modeling temporal information and capturing different scales of point motions. We propose a radiance field framework by representing scenes with time-aware voxel features, named as TiNeuVox. A tiny coordinate deformation network is introduced to model coarse motion trajectories and temporal information is further enhanced in the radiance network. A multi-distance interpolation method is proposed and applied on voxel features to model both small and large motions. Our framework significantly accelerates the optimization of dynamic radiance fields while maintaining high rendering quality. Empirical evaluation is performed on both synthetic and real scenes. Our TiNeuVox completes training with only 8 minutes and 8-MB storage cost while showing similar or even better rendering performance than previous dynamic NeRF methods. Code is available at

FDNeRF: Few-shot Dynamic Neural Radiance Fields for Face Reconstruction and Expression Editing

We propose a Few-shot Dynamic Neural Radiance Field (FDNeRF), the first NeRF-based method capable of reconstruction and expression editing of 3D faces based on a small number of dynamic images. Unlike existing dynamic NeRFs that require dense images as input and can only be modeled for a single identity, our method enables face reconstruction across different persons with few-shot inputs. Compared to state-of-the-art few-shot NeRFs designed for modeling static scenes, the proposed FDNeRF accepts view-inconsistent dynamic inputs and supports arbitrary facial expression editing, i.e., producing faces with novel expressions beyond the input ones. To handle the inconsistencies between dynamic inputs, we introduce a well-designed conditional feature warping (CFW) module to perform expression conditioned warping in 2D feature space, which is also identity adaptive and 3D constrained. As a result, features of different expressions are transformed into the target ones. We then construct a radiance field based on these view-consistent features and use volumetric rendering to synthesize novel views of the modeled faces. Extensive experiments with quantitative and qualitative evaluation demonstrate that our method outperforms existing dynamic and few-shot NeRFs on both 3D face reconstruction and expression editing tasks. Code is available at .

NeuLighting: Neural Lighting for Free Viewpoint Outdoor Scene Relighting with Unconstrained Photo Collections

We propose NeuLighting, a new framework for free viewpoint outdoor scene relighting from a sparse set of unconstrained in-the-wild photo collections. Our framework represents all the scene components as continuous functions parameterized by MLPs that take a 3D location and the lighting condition as input and output reflectance and necessary outdoor illumination properties. Unlike object-level relighting methods which often leverage training images with controllable and consistent indoor illumination, we concentrate on the more challenging outdoor situation where all the images are captured under arbitrary unknown illumination. The key to our method includes a neural lighting representation that compresses the per-image illumination into a disentangled latent vector, and a new free viewpoint relighting scheme that is robust to arbitrary lighting variations across images. The lighting representation is compressive to explain a wide range of illumination and can be easily fed into the query-based NeuLighting framework, enabling efficient shading effect evaluation under any kind of novel illumination. Furthermore, to produce high-quality cast shadows, we estimate the sun visibility map to indicate the shadow regions according to the scene geometry and the sun direction. Thanks to the flexible and explainable neural lighting representation, our system supports outdoor relighting with many different illumination sources, including natural images, environment maps, and time-lapse videos. The high-fidelity renderings under novel views and illumination prove the superiority of our method against state-of-the-art relighting solutions.

Lightweight Neural Basis Functions for All-Frequency Shading

Basis functions provide both the abilities for compact representation and the properties for efficient computation. Therefore, they are pervasively used in rendering to perform all-frequency shading. However, common basis functions, including spherical harmonics (SH), wavelets, and spherical Gaussians (SG) all have their own limitations, such as low-frequency for SH, not rotationally invariant for wavelets, and no multiple product support for SG. In this paper, we present neural basis functions, an implicit and data-driven set of basis functions that circumvents the limitations with all desired properties. We first introduce a representation neural network that takes any general 2D spherical function (e.g. environment lighting, BRDF, and visibility) as input and projects it onto the latent space as coefficients of our neural basis functions. Then, we design several lightweight neural networks that perform different types of computation, giving our basis functions different computational properties such as double/triple product integrals and rotations. We demonstrate the practicality of our neural basis functions by integrating them into all-frequency shading applications, showing that our method not only achieves a compression rate of and 10 × -40 × better performance than wavelets at equal quality, but also renders all-frequency lighting effects in real-time without the aforementioned limitations from classic basis functions.

SESSION: Stylization and Colorization

StyleBin: Stylizing Video by Example in Stereo

In this paper we present StyleBin—an approach to example-based stylization of videos that can produce consistent binocular depiction of stylized content on stereoscopic displays. Given the target sequence and a set of stylized keyframes accompanied by information about depth in the scene, we formulate an optimization problem that converts the target video into a pair of stylized sequences, in which each frame consists of a set of seamlessly stitched patches taken from the original stylized keyframe. The aim of the optimization process is to align the individual patches so that they respect the semantics of the given target scene, while at the same time also following the prescribed local disparity in the corresponding viewpoints and being consistent in time. In contrast to previous depth-aware style transfer techniques, our approach is the first that can deliver semantically meaningful stylization and preserve essential visual characteristics of the given artistic media. We demonstrate the practical utility of the proposed method in various stylization use cases.

SESSION: Face, Speech, and Gesture

Animatomy: an Animator-centric, Anatomically Inspired System for 3D Facial Modeling, Animation and Transfer

We present Animatomy, a novel anatomic+animator centric representation of the human face. Present FACS-based systems are plagued with problems of face muscle separation, coverage, opposition, and redundancy. We, therefore, propose a collection of muscle fiber curves as an anatomic basis, whose contraction and relaxation provide us with a fine-grained parameterization of human facial expression. We build an end-to-end modular deformation architecture using this representation that enables: automatic optimization of the parameters of a specific face from high-quality dynamic facial scans; face animation driven by performance capture, keyframes, or dynamic simulation; interactive and direct manipulation of facial expression; and animation transfer from an actor to a character. We validate our facial system by showing compelling animated results, applications, and a quantitative comparison of our facial reconstruction to ground truth performance capture. Our system is being intensively used by a large creative team on Avatar: The Way of Water. We report feedback from these users as qualitative evaluation of our system.

Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers

Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions. However, most of them deform or generate the whole facial area, leading to non-realistic results. In this work, we delve into the formulation of altering only the mouth shapes of the target person. This requires masking a large percentage of the original image and seamlessly inpainting it with the aid of audio and reference frames. To this end, we propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality by predicting the masked mouth shapes. Our key insight is to exploit desired contextual information provided in audio and visual modalities thoroughly with delicately designed Transformers. Specifically, we propose a convolution-Transformer hybrid backbone and design an attention-based fusion strategy for filling the masked parts. It uniformly attends to the textural information on the unmasked regions and the reference frame. Then the semantic audio information is involved in enhancing the self-attention computation. Additionally, a refinement network with audio injection improves both image and lip-sync quality. Extensive experiments validate that our model can generate high-fidelity lip-synced results for arbitrary subjects.

VOCAL: Vowel and Consonant Layering for Expressive Animator-Centric Singing Animation

Singing and speaking are two fundamental forms of human communication. From a modeling perspective however, speaking can be seen as a subset of singing. We present VOCAL, a system that automatically generates expressive, animator-centric lower face animation from singing audio input. Articulatory phonetics and voice instruction ascribe additional roles to vowels (projecting melody and volume) and consonants (lyrical clarity and rhythmic emphasis) in song. Our approach directly uses these insights to define axes for Melodic-accent and Pitch-sensitivity (Ma-Ps), which together provide an abstract space to visually represent various singing styles. In our system. vowels are processed first. A lyrical vowel is often sung tonally as one or more different vowels. We perform any such vowel modifications using a neural network trained on input audio. These vowels are then dilated from their spoken behaviour to bleed into each other based on Melodic-accent (Ma), with Pitch-sensitivity (Ps) modeling visual vibrato. Consonant animation curves are then layered in, with viseme intensity modeling rhythmic emphasis (inverse to Ma). Our evaluation is fourfold: we show the impact of our design parameters; we compare our results to ground truth and prior art; we present compelling results on a variety of voices and singing styles; and we validate these results with professional singers and animators.

PADL: Language-Directed Physics-Based Character Control

Developing systems that can synthesize natural and life-like motions for simulated characters has long been a focus for computer animation. But in order for these systems to be useful for downstream applications, they need not only produce high-quality motions, but must also provide an accessible and versatile interface through which users can direct a character’s behaviors. Natural language provides a simple-to-use and expressive medium for specifying a user’s intent. Recent breakthroughs in natural language processing (NLP) have demonstrated effective use of language-based interfaces for applications such as image generation and program synthesis. In this work, we present PADL, which leverages recent innovations in NLP in order to take steps towards developing language-directed controllers for physics-based character animation. PADL allows users to issue natural language commands for specifying both high-level tasks and low-level skills that a character should perform. We present an adversarial imitation learning approach for training policies to map high-level language commands to low-level controls that enable a character to perform the desired task and skill specified by a user’s commands. Furthermore, we propose a multi-task aggregation method that leverages a language-based multiple-choice question-answering approach to determine high-level task objectives from language commands. We show that our framework can be applied to effectively direct a simulated humanoid character to perform a diverse array of complex motor skills.

SESSION: Perception in VR and AR

Display Size and Targeting Performance: Small Hurts, Large May Help

Which display size helps gamers win? Recommendations from the research and PC gaming communities are contradictory. We find that as display size grows, targeting performance improves. When size increases from 13′′ to 26′′, targeting time drops by over 3%. Further size increases from 26′′ through 39′′, 52′′ and 65′′, bring more modest improvements, with targeting time dropping a further 1%. While such improvements may not be meaningful for novice gamers, they are extremely important to skilled and competitive players. To produce these results, 30 gamers participated in a targeting task as we varied display size by placing a display at varying distances. We held field of view constant by varying viewport size, and resolution constant by rendering to a fixed-size off-screen buffer. This paper offers further experimental detail, and examines likely explanations for the effects of display size.

Realistic Luminance in VR

As virtual reality (VR) headsets continue to achieve ever more immersive visuals along the axes of resolution, field of view, focal cues, distortion mitigation, and so on, the luminance and dynamic range of these devices falls far short of widely available consumer televisions. While work remains to be done on the display architecture side, power and weight limitations in head-mounted displays pose a challenge for designs aiming for high luminance. In this paper, we seek to gain a basic understanding of VR user preferences for display luminance values in relation to known, real-world luminances for immersive, natural scenes. To do so, we analyze the luminance characteristics of an existing high-dynamic-range (HDR) panoramic image dataset, build an HDR VR headset capable of reproducing over 20,000 nits peak luminance, and conduct a first-of-its-kind study on user brightness preferences in VR. We conclude that current commercial VR headsets do not meet user preferences for display luminance, even for indoor scenes.

Impact of correct and simulated focus cues on perceived realism

The natural accommodation of the human eye to different distances results in focus cues, which contribute to depth perception and appearance. Since focus cues are very difficult to reproduce in an electronic display, it is desirable to know how much they contribute to realistic image appearance. In this work we quantify the potential benefit of focus cues in terms of increased realism compared to regular stereo image presentation. As a secondary goal, we evaluate whether three depth-of-field rendering techniques, which reproduce defocus blur at three different degrees of accuracy, can reintroduce the benefits of focus cues. Our findings confirm the importance of focus cues for realistic image appearance, and also show that they cannot easily be substituted by depth-of-field rendering.

SESSION: Faces and Avatars

AgileAvatar: Stylized 3D Avatar Creation via Cascaded Domain Bridging

Stylized 3D avatars have become increasingly prominent in our modern life. Creating these avatars manually usually involves laborious selection and adjustment of continuous and discrete parameters and is time-consuming for average users. Self-supervised approaches to automatically create 3D avatars from user selfies promise high quality with little annotation cost but fall short in application to stylized avatars due to a large style domain gap. We propose a novel self-supervised learning framework to create high-quality stylized 3D avatars with a mix of continuous and discrete parameters. Our cascaded domain bridging framework first leverages a modified portrait stylization approach to translate input selfies into stylized avatar renderings as the targets for desired 3D avatars. Next, we find the best parameters of the avatars to match the stylized avatar renderings through a differentiable imitator we train to mimic the avatar graphics engine. To ensure we can effectively optimize the discrete parameters, we adopt a cascaded relaxation-and-search pipeline. We use a human preference study to evaluate how well our method preserves user identity compared to previous work as well as manual creation. Our results achieve much higher preference scores than previous work and close to those of manual creation. We also provide an ablation study to justify the design choices in our pipeline.

SESSION: Shape Generation

Neural Wavelet-domain Diffusion for 3D Shape Generation

This paper presents a new approach for 3D shape generation, enabling direct generative modeling on a continuous implicit representation in wavelet domain. Specifically, we propose a compact wavelet representation with a pair of coarse and detail coefficient volumes to implicitly represent 3D shapes via truncated signed distance functions and multi-scale biorthogonal wavelets, and formulate a pair of neural networks: a generator based on the diffusion model to produce diverse shapes in the form of coarse coefficient volumes; and a detail predictor to further produce compatible detail coefficient volumes for enriching the generated shapes with fine structures and details. Both quantitative and qualitative experimental results manifest the superiority of our approach in generating diverse and high-quality shapes with complex topology and structures, clean surfaces, and fine details, exceeding the 3D generation capabilities of the state-of-the-art models.

CLIP-Mesh: Generating textured meshes from text using pretrained image-text models

We present a technique for zero-shot generation of a 3D model using only a target text prompt. Without any 3D supervision our method deforms the control shape of a limit subdivided surface along with its texture map and normal map to obtain a 3D asset that corresponds to the input text prompt and can be easily deployed into games or modeling applications. We rely only on a pre-trained CLIP model that compares the input text prompt with differentiably rendered images of our 3D model. While previous works have focused on stylization or required training of generative models we perform optimization on mesh parameters directly to generate shape, texture or both. To constrain the optimization to produce plausible meshes and textures we introduce a number of techniques using image augmentations and the use of a pretrained prior that generates CLIP image embeddings given a text embedding.

Scene Synthesis from Human Motion

Large-scale capture of human motion with diverse, complex scenes, while immensely useful, is often considered prohibitively costly. Meanwhile, human motion alone contains rich information about the scene they reside in and interact with. For example, a sitting human suggests the existence of a chair, and their leg position further implies the chair’s pose. In this paper, we propose to synthesize diverse, semantically reasonable, and physically plausible scenes based on human motion. Our framework, Scene Synthesis from HUMan MotiON (SUMMON), includes two steps. It first uses ContactFormer, our newly introduced contact predictor, to obtain temporally consistent contact labels from human motion. Based on these predictions, SUMMON then chooses interacting objects and optimizes physical plausibility losses; it further populates the scene with objects that do not interact with humans. Experimental results demonstrate that SUMMON synthesizes feasible, plausible, and diverse scenes and has the potential to generate extensive human-scene interaction data for the community.

LayoutEnhancer: Generating Good Indoor Layouts from Imperfect Data

We address the problem of indoor layout synthesis, which is a topic of continuing research interest in computer graphics. The newest works made significant progress using data-driven generative methods; however, these approaches rely on suitable datasets. In practice, desirable layout properties may not exist in a dataset, for instance, specific expert knowledge can be missing in the data. We propose a method that combines expert knowledge, for example, knowledge about ergonomics, with a data-driven generator based on the popular Transformer architecture. The knowledge is given as differentiable scalar functions, which can be used both as weights or as additional terms in the loss function. Using this knowledge, the synthesized layouts can be biased to exhibit desirable properties, even if these properties are not present in the dataset. Our approach can also alleviate problems of lack of data and imperfections in the data. Our work aims to improve generative machine learning for modeling and provide novel tools for designers and amateurs for the problem of interior layout creation.

SESSION: Reconstruction and Repair

Shape Completion with Points in the Shadow

Single-view point cloud completion aims to recover the full geometry of an object based on only limited observation, which is extremely hard due to the data sparsity and occlusion. The core challenge is to generate plausible geometries to fill the unobserved part of the object based on a partial scan, which is under-constrained and suffers from a huge solution space. Inspired by the classic shadow volume technique in computer graphics, we propose a new method to reduce the solution space effectively. Our method considers the camera a light source that casts rays toward the object. Such light rays build a reasonably constrained but sufficiently expressive basis for completion. The completion process is then formulated as a point displacement optimization problem. Points are initialized at the partial scan and then moved to their goal locations with two types of movements for each point: directional movements along the light rays and constrained local movement for shape refinement. We design neural networks to predict the ideal point movements to get the completion results. We demonstrate that our method is accurate, robust, and generalizable through exhaustive evaluation and comparison. Moreover, it outperforms state-of-the-art methods qualitatively and quantitatively on MVP datasets.

SESSION: Image Editing and Manipulation

Stitch it in Time: GAN-Based Facial Editing of Real Videos

The ability of Generative Adversarial Networks to encode rich semantics within their latent space has been widely adopted for facial image editing. However, replicating their success with videos has proven challenging. Applying StyleGAN editing to real videos introduces two main challenges: (i) StyleGAN operates over aligned crops. When editing videos, these crops need to be pasted back into the frame, resulting in a spatial inconsistency. (ii) Videos introduce a fundamental barrier to overcome — temporal coherency. To address the first challenge, we propose a novel stitching-tuning procedure. The generator is carefully tuned to overcome the spatial artifacts at crop borders, resulting in smooth transitions even when difficult backgrounds are involved. Turning to temporal coherence, we propose that this challenge is largely artificial. The source video is already temporally coherent, and deviations arise in part due to careless treatment of individual components in the editing pipeline. We leverage the natural alignment of StyleGAN and the tendency of neural networks to learn low-frequency functions, and demonstrate that they provide a strongly consistent prior. These components are combined in an end-to-end framework for semantic editing of facial videos. We compare our pipeline to the current state-of-the-art and demonstrate significant improvements. Our method produces meaningful manipulations and maintains greater spatial and temporal consistency, even on challenging talking head videos which current methods struggle with. Our code and videos are available at

VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild

We present VideoReTalking, a new system to edit the faces of a real-world talking head video according to input audio, producing a high-quality and lip-syncing output video even with a different emotion. Our system disentangles this objective into three sequential tasks: (1) face video generation with a canonical expression; (2) audio-driven lip-sync; and (3) face enhancement for improving photo-realism. Given a talking-head video, we first modify the expression of each frame according to the same expression template using the expression editing network, resulting in a video with the canonical expression. This video, together with the given audio, is then fed into the lip-sync network to generate a lip-syncing video. Finally, we improve the photo-realism of the synthesized faces through an identity-aware face enhancement network and post-processing. We use learning-based approaches for all three steps and all our modules can be tackled in a sequential pipeline without any user intervention. Furthermore, our system is a generic approach that does not need to be retrained to a specific person. Evaluations on two widely-used datasets and in-the-wild examples demonstrate the superiority of our framework over other state-of-the-art methods in terms of lip-sync accuracy and visual quality.

NeRFFaceEditing: Disentangled Face Editing in Neural Radiance Fields

Recent methods for synthesizing 3D-aware face images have achieved rapid development thanks to neural radiance fields, allowing for high quality and fast inference speed. However, existing solutions for editing facial geometry and appearance independently usually require retraining and are not optimized for the recent work of generation, thus tending to lag behind the generation process. To address these issues, we introduce NeRFFaceEditing, which enables editing and decoupling geometry and appearance in the pretrained tri-plane-based neural radiance field while retaining its high quality and fast inference speed. Our key idea for disentanglement is to use the statistics of the tri-plane to represent the high-level appearance of its corresponding facial volume. Moreover, we leverage a generated 3D-continuous semantic mask as an intermediary for geometry editing. We devise a geometry decoder (whose output is unchanged when the appearance changes) and an appearance decoder. The geometry decoder aligns the original facial volume with the semantic mask volume. We also enhance the disentanglement by explicitly regularizing rendered images with the same appearance but different geometry to be similar in terms of color distribution for each facial component separately. Our method allows users to edit via semantic masks with decoupled control of geometry and appearance. Both qualitative and quantitative evaluations show the superior geometry and appearance control abilities of our method compared to existing and alternative solutions.

Water Simulation and Rendering from a Still Photograph

We propose an approach to simulate and render realistic water animation from a single still input photograph. We first segment the water surface, estimate rendering parameters, and compute water reflection textures with a combination of neural networks and traditional optimization techniques. Then we propose an image-based screen space local reflection model to render the water surface overlaid on the input image and generate real-time water animation. Our approach creates realistic results with no user intervention for a wide variety of natural scenes containing large bodies of water with different lighting and water surface conditions. Since our method provides a 3D representation of the water surface, it naturally enables direct editing of water parameters and also supports interactive applications like adding synthetic objects to the scene.

SESSION: Appearance Modeling and Capture

Woven Fabric Capture from a Single Photo

Digitally reproducing the appearance of woven fabrics is important in many applications of realistic rendering, from interior scenes to virtual characters. However, designing realistic shading models and capturing real fabric samples are both challenging tasks. Previous work ranges from applying generic shading models not meant for fabrics, to data-driven approaches scanning fabrics requiring expensive setups and large data. In this paper, we propose a woven fabric material model and a parameter estimation approach for it. Our lightweight forward shading model treats yarns as bent and twisted cylinders, shading these using a microflake-based bidirectional reflectance distribution function (BRDF) model. We propose a simple fabric capture configuration, wrapping the fabric sample on a cylinder of known radius and capturing a single image under known camera and light positions. Our inverse rendering pipeline consists of a neural network to estimate initial fabric parameters and an optimization based on differentiable rendering to refine the results. Our fabric parameter estimation achieves high-quality recovery of measured woven fabric samples, which can be used for efficient rendering and further edited.

TileGen: Tileable, Controllable Material Generation and Capture

Recent methods (e.g. MaterialGAN) have used unconditional GANs to generate per-pixel material maps, or as a prior to reconstruct materials from input photographs. These models can generate varied random material appearance, but do not have any mechanism to constrain the generated material to a specific category or to control the coarse structure of the generated material, such as the exact brick layout on a brick wall. Furthermore, materials reconstructed from a single input photo commonly have artifacts and are generally not tileable, which limits their use in practical content creation pipelines. We propose TileGen, a generative model for SVBRDFs that is specific to a material category, always tileable, and optionally conditional on a provided input structure pattern. TileGen is a variant of StyleGAN whose architecture is modified to always produce tileable (periodic) material maps. In addition to the standard “style” latent code, TileGen can optionally take a condition image, giving a user direct control over the dominant spatial (and optionally color) features of the material. For example, in brick materials, the user can specify a brick layout and the brick color, or in leather materials, the locations of wrinkles and folds. Our inverse rendering approach can find a material perceptually matching a single target photograph by optimization. This reconstruction can also be conditional on a user-provided pattern. The resulting materials are tileable, can be larger than the target image, and are editable by varying the condition.

Gloss management for consistent reproduction of real and virtual objects

A good match of material appearance between real-world objects and their digital on-screen representations is critical for many applications such as fabrication, design, and e-commerce. However, faithful appearance reproduction is challenging, especially for complex phenomena, such as gloss. In most cases, the view-dependent nature of gloss and the range of luminance values required for reproducing glossy materials exceeds the current capabilities of display devices. As a result, appearance reproduction poses significant problems even with accurately rendered images. This paper studies the gap between the gloss perceived from real-world objects and their digital counterparts. Based on our psychophysical experiments on a wide range of 3D printed samples and their corresponding photographs, we derive insights on the influence of geometry, illumination, and the display’s brightness and measure the change in gloss appearance due to the display limitations. Our evaluation experiments demonstrate that using the prediction to correct material parameters in a rendering system improves the match of gloss appearance between real objects and their visualization on a display device.

SESSION: Maps and Fields

Isometric Energies for Recovering Injectivity in Constrained Mapping

Computing injective maps with low distortions is a long-standing problem in computer graphics. Such maps are particularly challenging to obtain in the presence of positional constraints, because an injective initial map is often not available. Recently, several energies were proposed and shown to be highly successful in optimizing injectivity from non-injective initial maps while satisfying positional constraints. However, minimizing these energies tends to produce elements with significant isometric distortions. This paper presents simple variants of these energies that retain their desirable traits while promoting isometry. While our method is not guaranteed to provide an injective map, we observe that, on large-scale 2D and 3D data sets, minimizing the proposed isometric variants results in a similar level of success in recovering injectivity as the original energies but a significantly lower isometric distortion.

Fast Editing of Singularities in Field-Aligned Stripe Patterns

Field-aligned parametrization is a method that maps a scalar function onto a surface, such that the gradient vector of the scalar function matches the input vector field. Using this idea, one can produce a stripe pattern that is convenient for various purposes such as remeshing, texture synthesis, and computational fabrication. In the final outcome, the positions of singularities (i.e., bifurcations of the stripe pattern) are essential for functionalities, manufacturability, or aesthetics. In this paper, we propose an algorithm to allow users to interactively edit the singularity positions of field-aligned stripe patterns. The algorithm computes a stripe pattern from a prescribed set of singularities, without generating any unwanted singularities. The solution of the algorithm is formulated as the global minima of a constrained quadratic optimization, whose computation speed is dominated by solving only two sparse linear systems. Furthermore, once the two matrices in the two linear systems are factorized, any update on singularity positions operates in linear time. We showcase several applications feasible with our fast yet simple algorithm.

Green Coordinates for Triquad Cages in 3D

We introduce Green coordinates for triquad cages in 3D. Based on Green’s third identity, Green coordinates allow defining the harmonic deformation of a 3D point inside a cage as a linear combination of its vertices and face normals. Using appropriate Neumann boundary conditions, the resulting deformations are quasi-conformal in 3D, and thus best-preserve the local deformed geometry, in that volumetric conformal 3D deformations do not exist unless rigid. Most coordinate systems use cages made of triangles, yet quads are in general favored by artists as those align naturally onto important geometric features of the 3D shapes, such as the limbs of a character, without introducing arbitrary asymmetric deformations and representation. While triangle cages admit per-face constant normals and result in a single Green normal-coordinate per triangle, the case of quad cages is at the same time more involved (as the normal varies along non-planar quads) and more flexible (as many different mathematical models allow defining the smooth geometry of a quad interpolating its four edges). We consider bilinear quads, and we introduce a new Neumann boundary condition resulting in a simple set of four additional normal-coordinates per quad. Our coordinates remain quasi-conformal in 3D, and we demonstrate their superior behavior under non-trivial deformations of realistic triquad cages.

Efficient Neural Radiance Fields for Interactive Free-viewpoint Video

This paper aims to tackle the challenge of efficiently producing interactive free-viewpoint videos. Some recent works equip neural radiance fields with image encoders, enabling them to generalize across scenes. When processing dynamic scenes, they can simply treat each video frame as an individual scene and perform novel view synthesis to generate free-viewpoint videos. However, their rendering process is slow and cannot support interactive applications. A major factor is that they sample lots of points in empty space when inferring radiance fields. We propose a novel scene representation, called ENeRF, for the fast creation of interactive free-viewpoint videos. Specifically, given multi-view images at one frame, we first build the cascade cost volume to predict the coarse geometry of the scene. The coarse geometry allows us to sample few points near the scene surface, thereby significantly improving the rendering speed. This process is fully differentiable, enabling us to jointly learn the depth prediction and radiance field networks from RGB images. Experiments on multiple benchmarks show that our approach exhibits competitive performance while being at least 60 times faster than previous generalizable radiance field methods.

SESSION: Solids and Fluids

Mixed Variational Finite Elements for Implicit Simulation of Deformables

We propose and explore a new method for the implicit time integration of elastica. Key to our approach is the use of a mixed variational principle. In turn, its finite element discretization leads to an efficient and accurate sequential quadratic programming solver with a superset of the desirable properties of many previous integration strategies. This framework fits a range of elastic constitutive models and remains stable across a wide span of time step sizes and material parameters (including problems that are approximately rigid). Our method exhibits convergence on par with full Newton type solvers and also generates visually plausible results in just a few iterations comparable to recent fast simulation methods that do not converge. These properties make it suitable for both offline accurate simulation and performant applications with expressive physics. We demonstrate the efficacy of our approach on a number of simulated examples.

SESSION: Sampling and Reconstruction

Unbiased Caustics Rendering Guided by Representative Specular Paths

Caustics are interesting patterns caused by the light being focused when reflecting off glossy materials. Rendering them in computer graphics is still challenging: they correspond to high luminous intensity focused over a small area. Finding the paths that contribute to this small area is difficult, and even more difficult when using camera-based path tracing instead of bidirectional approaches. Recent improvements in path guiding are still unable to compute efficiently the light paths that contribute to a caustic. In this paper, we present a novel path guiding approach to enable reliable rendering of caustics. Our approach relies on computing representative specular paths, then extending them using a chain of spherical Gaussians. We use these extended paths to estimate the incident radiance distribution and guide path tracing. We combine this approach with several practical strategies, such as spatial reusing and parallax-aware representation for arbitrarily curved reflectors. Our path-guided algorithm using extended specular paths outperforms current state-of-the-art methods and handles multiple bounces of light and a variety of scenes.

Marginal Multiple Importance Sampling

Multiple importance sampling (MIS) is a powerful tool to combine different sampling techniques in a provably good manner. MIS requires that the techniques’ probability density functions (PDFs) are readily evaluable point-wise. However, this requirement may not be satisfied when (some of) those PDFs are marginals, i.e., integrals of other PDFs. We generalize MIS to combine samples from such marginal PDFs. The key idea is to consider each marginalization domain as a continuous space of sampling techniques with readily evaluable (conditional) PDFs. We stochastically select techniques from these spaces and combine the samples drawn from them into an unbiased estimator. Prior work has dealt with the special cases of multiple classical techniques or a single marginal one. Our formulation can handle mixtures of those.

We apply our marginal MIS formulation to light-transport simulation to demonstrate its utility. We devise a marginal path sampling framework that makes previously intractable sampling techniques practical and significantly broadens the path-sampling choices beyond what is presently possible. We highlight results from two algorithms based on marginal MIS: a novel formulation of path-space filtering at multiple vertices along a camera path and a similar filtering method for photon-density estimation.

SESSION: Everything Interactive and Dynamic

Physical Interaction: Reconstructing Hand-object Interactions with Physics

Single view-based reconstruction of hand-object interaction is challenging due to the severe observation missing caused by occlusions. This paper proposes a physics-based method to better solve the ambiguities in the reconstruction. It first proposes a force-based dynamic model of the in-hand object, which not only recovers the unobserved contacts but also solves for plausible contact forces. Next, a confidence-based slide prevention scheme is proposed, which combines both the kinematic confidences and the contact forces to jointly model static and sliding contact motion. Qualitative and quantitative experiments show that the proposed technique reconstructs both physically plausible and more accurate hand-object interaction and estimates plausible contact forces in real-time with a single RGBD sensor.

Continuous deformation based panelization for design rationalization

Design rationalization is the process of simplifying a 3D shape to enable cost-efficient manufacturing. A common approach is to approximate the input shape by a collection of simple units, such as flat or spherical panels, that are easy to manufacture and simple to assemble. This panelization process typically involves a segmentation step, with each surface patch intended to be replaced by a single unit, followed by an approximation stage, where the final shapes and locations of the units are determined. While optimal panel parameters for given segments are readily determined, the discrete nature of segmentation—assigning surface elements to segments—prevents a continuous design optimization workflow. In this work, we propose a differentiable reformulation of panelization that enables its use in gradient-based design optimization. Our approach is to treat panelization as a smooth optimization problem, whose objective function encourages the surface to locally deform towards best-matching units. This formulation enables a fully-differentiable rationalization process with implicit segmentation in which panels emerge automatically. We integrate our rationalization process in a simple user interface allowing the designer to guide the optimization towards desired panelizations. We demonstrate the potential of our approach on a diverse set of complex shapes and different panel types.

Capturing and Animation of Body and Clothing from Monocular Video

While recent work has shown progress on extracting clothed 3D human avatars from a single image, video, or a set of 3D scans, several limitations remain. Most methods use a holistic representation to jointly model the body and clothing, which means that the clothing and body cannot be separated for applications like virtual try-on. Other methods separately model the body and clothing, but they require training from a large set of 3D clothed human meshes obtained from 3D/4D scanners or physics simulations. Our insight is that the body and clothing have different modeling requirements. While the body is well represented by a mesh-based parametric 3D model, implicit representations and neural radiance fields are better suited to capturing the large variety in shape and appearance present in clothing. Building on this insight, we propose SCARF (Segmented Clothed Avatar Radiance Field), a hybrid model combining a mesh-based body with a neural radiance field. Integrating the mesh into the volumetric rendering in combination with a differentiable rasterizer enables us to optimize SCARF directly from monocular videos, without any 3D supervision. The hybrid modeling enables SCARF to (i) animate the clothed body avatar by changing body poses (including hand articulation and facial expressions), (ii) synthesize novel views of the avatar, and (iii) transfer clothing between avatars in virtual try-on applications. We demonstrate that SCARF reconstructs clothing with higher visual quality than existing methods, that the clothing deforms with changing body pose and body shape, and that clothing can be successfully transferred between avatars of different subjects. The code and models are available at

Reconstructing Hand-Held Objects from Monocular Video

This paper presents an approach that reconstructs a hand-held object from a monocular video. In contrast to many recent methods that directly predict object geometry by a trained network, the proposed approach does not require any learned prior about the object and is able to recover more accurate and detailed object geometry. The key idea is that the hand motion naturally provides multiple views of the object and the motion can be reliably estimated by a hand pose tracker. Then, the object geometry can be recovered by solving a multi-view reconstruction problem. We devise an implicit neural representation-based method to solve the reconstruction problem and address the issues of imprecise hand pose estimation, relative hand-object motion, and insufficient geometry optimization for small objects. We also provide a newly collected dataset with 3D ground truth to validate the proposed approach. The dataset and code will be released at

SESSION: Material and Rendering

Direct acquisition of volumetric scattering phase function using speckle correlations

In material acquisition we want to infer the internal properties of materials from the way they scatter light. In particular, we are interested in measuring the phase function of the material, governing the amount of energy scattered towards different directions. This phase function has been shown to carry a lot of information about the type and size of particles dispersed in the medium, and is therefore essential for its characterization.

Previous approaches to this task have relied on computationally costly inverse rendering optimization. Alternatively, if the material can be made optically thin enough so that most light paths scatter only once, this optimization can be avoided and the phase function can be directly read from the profile of light scattering at different angles. However, in many realistic applications, it is not easy to slice or dilute the material so that it is thin enough for such a single scattering model to hold.

In this work we suggest a simple closed-form approach for acquiring material parameters from thick samples, avoiding costly optimization. Our approach is based on imaging the material of interest under coherent laser light and capturing speckle patterns. We show that memory-effect correlations between speckle patterns produced under nearby illumination directions provide a gating mechanism, allowing us to measure the singly scattered component of the light, even when observing thick samples where most light is scattered multiple times.

We have built an experimental prototype capable of measuring phase functions over a narrow angular cone. We test the accuracy of our approach using validation materials whose ground truth phase function is known; and we use it to capture a set of everyday materials.

FloRen: Real-time High-quality Human Performance Rendering via Appearance Flow Using Sparse RGB Cameras

We propose FloRen, a novel system for real-time, high-resolution free-view human synthesis. Our system runs at 15fps in 1K resolution with very sparse RGB cameras. In FloRen, a coarse-level implicit geometry is recovered at first as initialization, and then processed by a neural rendering framework based on appearance flow. Our appearance flow-based rendering framework consists of three steps, namely view-dependent depth refinement, appearance flow estimation and occlusion-aware color rendering. In this way, we resolve the view synthesis problem in the image plane, where 2D convolutional neural networks can be efficiently applied, contributing to high speed performance. For robust appearance flow estimation, we explicitly combine data-driven human prior knowledge with multiview geometric constraints. The accurate appearance flow enables precise color mapping from input view to novel view, which greatly facilitates high-resolution novel view generation. We demonstrate that our system achieves state-of-the-art performance and even outperforms many offline methods.

VIINTER: View Interpolation with Implicit Neural Representations of Images

We present VIINTER, a method for view interpolation by interpolating the implicit neural representation (INR) of the captured images. We leverage the learned code vector associated with each image and interpolate between these codes to achieve viewpoint transitions. We propose several techniques that significantly enhance the interpolation quality. VIINTER signifies a new way to achieve view interpolation without constructing 3D structure, estimating camera poses, or computing pixel correspondence. We validate the effectiveness of VIINTER on several multi-view scenes with different types of camera layout and scene composition. As the development of INR of images (as opposed to surface or volume) has centered around tasks like image fitting and super-resolution, with VIINTER, we show its capability for view interpolation and offer a promising outlook on using INR for image manipulation tasks.

SESSION: VR and Interaction

UmeTrack: Unified multi-view end-to-end hand tracking for VR

Real-time tracking of 3D hand pose in world space is a challenging problem and plays an important role in VR interaction. Existing work in this space are limited to either producing root-relative (versus world space) 3D pose or rely on multiple stages such as generating heatmaps and kinematic optimization to obtain 3D pose. Moreover, the typical VR scenario, which involves multi-view tracking from wide field of view (FOV) cameras is seldom addressed by these methods. In this paper, we present a unified end-to-end differentiable framework for multi-view, multi-frame hand tracking that directly predicts 3D hand pose in world space. We demonstrate the benefits of end-to-end differentiabilty by extending our framework with downstream tasks such as jitter reduction and pinch prediction. To demonstrate the efficacy of our model, we further present a new large-scale egocentric hand pose dataset that consists of both real and synthetic data. Experiments show that our system trained on this dataset handles various challenging interactive motions, and has been successfully applied to real-time VR applications.

H rtDown: Document Processor for Executable Linear Algebra Papers

Scientific documents describe a topic in a mix of prose and mathematical expressions. The prose refers to those expressions, which themselves must be encoded in, e.g., LaTeX. The resulting documents are static, even though most documents are now read digitally. Moreover, formulas must be implemented or re-implemented separately in a programming language in order to create executable research artifacts. Literate environments allow executable code to be added in addition to the prose and math. The code is yet another encoding of the same mathematical expressions.

We introduce H rtDown, a document processor, authoring environment, and paper reading environment for scientific documents. Prose is written in Markdown, linear algebra formulas in an enhanced version of I LA, derivations in LaTeX, and dynamic figures in Python. H rtDown is designed to support existing scientific writing practices: editing in plain text, using and defining symbols in prose-determined order, and context-dependent symbol re-use. H rtDown’s authoring environment assists authors by identifying incorrect formulas and highlighting symbols not yet described in the prose. H rtDown outputs a dynamic paper reader with math augmentations to aid in comprehension, and code libraries for experimenting with the executable formulas. H rtDown supports dynamic figures generated by inline Python code. This enables a new approach to scientific experimentation, where editing the mathematical formulas directly updates the figures. We evaluate H rtDown with an expert study and by re-implementing SIGGRAPH papers.

SESSION: Simulation of Everything

Fast Stabilization of Inducible Magnet Simulation

This paper presents a novel method for simulating inducible rigid magnets efficiently and stably. In the proposed method, inducible magnets are magnetized by a modified magnetization dynamics, so that the magnetic equilibrium can be obtained in a computationally efficient manner. Furthermore, our model of magnetic forces takes magnetization change into account to produce stable motions of inducible magnets. The experiments show that the proposed method enables a large-scale simulation involving a huge number of inducible magnets.


Reconstructing editable prismatic CAD from rounded voxel models

Reverse Engineering a CAD shape from other representations is an important geometric processing step for many downstream applications. In this work, we introduce a novel neural network architecture to solve this challenging task and approximate a smoothed signed distance function with an editable, constrained, prismatic CAD model. During training, our method reconstructs the input geometry in the voxel space by decomposing the shape into a series of 2D profile images and 1D envelope functions. These can then be recombined in a differentiable way allowing a geometric loss function to be defined. During inference, we obtain the CAD data by first searching a database of 2D constrained sketches to find curves which approximate the profile images, then extrude them and use Boolean operations to build the final CAD model. Our method approximates the target shape more closely than other methods and outputs highly editable constrained parametric sketches which are compatible with existing CAD software.