As the problem of reconstructing high dynamic range (HDR) images from a single exposure has attracted much research effort, it is essential to provide a robust protocol and clear guidelines on how to evaluate and compare new methods. In this work, we compared six recent single image HDR reconstruction (SI-HDR) methods in a subjective image quality experiment on an HDR display. We found that only two methods produced results that are, on average, more preferred than the unprocessed single exposure images. When the same methods are evaluated using image quality metrics, as typically done in papers, the metric predictions correlate poorly with subjective quality scores. The main reason is a significant tone and color difference between the reference and reconstructed HDR images. To improve the predictions of image quality metrics, we propose correcting for the inaccuracies of the estimated camera response curve before computing quality values. We further analyze the sources of prediction noise when evaluating SI-HDR methods and demonstrate that existing metrics can reliably predict only large quality differences.
3D models of manufactured objects are important for populating virtual worlds and for synthetic data generation for vision and robotics. To be most useful, such objects should be articulated: their parts should move when interacted with. While articulated object datasets exist, creating them is labor-intensive. Learning-based prediction of part motions can help, but all existing methods require annotated training data. In this paper, we present an unsupervised approach for discovering articulated motions in a part-segmented 3D shape collection. Our approach is based on a concept we call category closure: any valid articulation of an object’s parts should keep the object in the same semantic category (e.g. a chair stays a chair). We operationalize this concept with an algorithm that optimizes a shape’s part motion parameters such that it can transform into other shapes in the collection. We evaluate our approach by using it to re-discover part motions from the PartNet-Mobility dataset. For almost all shape categories, our method’s predicted motion parameters have low error with respect to ground truth annotations, outperforming two supervised motion prediction methods.
As a common practice, game modelers manually craft low-poly meshes for given 3D building models in order to achieve the ideal balance between the small element count and the visual similarity. This can take hours and involve tedious trial and error. We propose a novel and simple algorithm to automate this process by converting high-poly 3D building models into both simple and visually preserving low-poly meshes. Our algorithm has three stages: First, a watertight, self-collision-free visual hull is generated via Boolean intersecting 3D extrusions of input’s silhouettes; We then carve out notable but redundant structures from the visual hull via Boolean subtracting 3D primitives derived from parts of the input; Finally, we generate a progressively simplified low-poly mesh sequence from the carved mesh and extract the Pareto front for users to select the desired output. Each stage of our approach is guided by visual metrics, aiming to preserve the visual similarity to the input. We have tested our method on a dataset containing 100 building models with different styles, most of which are used in popular digital games. We highlight the superior robustness and quality by comparing with state-of-the-art competing techniques. Executable program for this paper is at lowpoly-modeling.github.io.
Bidirectional reflectance distribution functions (BRDFs) are pervasively used in computer graphics to produce realistic physically-based appearance. Many common materials in the real world have more than one layer, like wood, skin, car paint, and many decorative materials. However, precise simulation of layered material optics is non-trivial. The most accurate approaches rely on Monte Carlo random walks to simulate the light transport within the layers, leading to high variance and cost. Other approaches are efficient, but less accurate. In this paper, we propose to perform layering in the neural space, by compressing BRDFs into latent codes via a proposed representation neural network, and performing a learned layering operation on these latent vectors via a layering network. Our BRDF evaluation is noise-free and computationally efficient, compared to the state-of-the-art approach; it is also a first step towards a “neural algebra” of operations on BRDFs in a latent space.
Graph-based procedural materials are ubiquitous in content production industries. Procedural models allow the creation of photo-realistic materials with parametric control for flexible editing of appearance. However, designing a specific material is a time-consuming process in terms of building a model and fine-tuning parameters. Previous work [Hu et al. 2022; Shi et al. 2020] introduced material graph optimization frameworks for matching target material samples. However, these previous methods were limited to optimizing differentiable functions in the graphs. In this paper, we propose a fully differentiable framework which enables end-to-end gradient-based optimization of material graphs, even if some functions of the graph are non-differentiable. We leverage the Differentiable Proxy, a differentiable approximator of a non-differentiable black-box function. We use our framework to match structure and appearance of an output material to a target material, through a multi-stage differentiable optimization. Differentiable Proxies offer a more general optimization solution to material appearance matching than previous work.
The fundamental solutions (Green’s functions) of linear elasticity for an infinite and isotropic media are ubiquitous in interactive graphics applications that cannot afford the computational costs of volumetric meshing and finite-element simulation. For instance, the recent work of de Goes and James  leveraged these Green’s functions to formulate sculpting tools capturing in real-time broad and physically-plausible deformations more intuitively and realistically than traditional editing brushes. In this paper, we extend this family of Green’s functions by exploiting the anisotropic behavior of general linear elastic materials, where the relationship between stress and strain in the material depends on its orientation. While this more general framework prevents the existence of analytical expressions for its fundamental solutions, we show that a finite sum of spherical harmonics can be used to decompose a Green’s function, which can be further factorized into directional, radial, and material-dependent terms. From such a decoupling, we show how to numerically derive sculpting brushes to generate anisotropic deformation and finely control their falloff profiles in real-time.
This paper proposes a simple method which solves the problem of multi-view 3D reconstruction for objects with unknown and generic surface materials, imaged by a freely moving camera and lit by a freely moving point light source. The object can have arbitrary (diffuse or specular) and spatially-varying surface reflectances. Our solution consists of two small-sized neural networks (dubbed the ‘Shape-Net’ and ‘BRDF-Net’), used to parameterize the unknown shape and material map as functions on a canonical surface (e.g. unit sphere). Key to our method is a velocity field shape representation that drives the canonical surface to target shape through time. We show this parameterization can be implemented as a recurrent residual network that is guaranteed to be diffeomorphic and orientation-preserving. Our method yields an exceptionally clean formulation that can be optimized by standard gradient descent without initialization, and works with both near-field and distant light source. Synthetic and real experiments demonstrate the reliability and accuracy of our reconstructions, with extensions including novel-view-synthesis, relighting and material retouching done with ease. Our source codes are available at https://github.com/za-cheng/DNS.
We present a neural extension of basic shadow mapping for fast, high quality hard and soft shadows. We compare favorably to fast pre-filtering shadow mapping, all while producing visual results on par with ray traced hard and soft shadows. We show that combining memory bandwidth-aware architecture specialization and careful temporal-window training leads to a fast, compact and easy-to-train neural shadowing method. Our technique is memory bandwidth conscious, eliminates the need for post-process temporal anti-aliasing or denoising, and supports scenes with dynamic view, emitters and geometry while remaining robust to unseen objects.
Neural material reflectance representations address some limitations of traditional analytic BRDFs with parameter textures; they can theoretically represent any material data, whether a complex synthetic microgeometry with displacements, shadows and inter-reflections, or real measured reflectance. However, they still approximate the material on an infinite plane, which prevents them from correctly handling silhouette and parallax effects for viewing directions close to grazing. The goal of this paper is to design a neural material representation capable of correctly handling such silhouette effects. We extend the neural network query to take surface curvature information as input, while the query output is extended to return a transparency value in addition to reflectance. We train the new neural representation on synthetic data that contains queries spanning a variety of surface curvatures. We show an ability to accurately represent complex silhouette behavior that would traditionally require more expensive and less flexible techniques, such as on-the-fly geometry displacement or ray marching.
We propose a 3D object construction methodology built on face-loop modeling operations. Our Face Extrusion Quad (FEQ) meshes, have a well designed face-loop structure similar to artist crafted 3D models. Furthermore, we define a construction graph which encodes a sequence of primitive extrude/collapse and bridge/separate operations that operate on admissible face-loops. We show that FEQs are imbued with a meaningful face-loop induced shape skeleton, part segmentation, plausible construction history, and possess the many advantages of extrusion-based 3D modeling. Our evaluation is threefold: we show a gallery of challenging 3D models transformed to FEQs with compelling face-loop structure; we showcase the potential of an inherent construction graph, using FEQ-based cut-paste and inverse modeling applications; and we demonstrate the impact of various algorithmic and parameter related choices for FEQ modeling and application.
We present a learning algorithm that uses bone-driven motion networks to predict the deformation of loose-fitting garment meshes at interactive rates. Given a garment, we generate a simulation database and extract virtual bones from simulated mesh sequences using skin decomposition. At runtime, we separately compute low- and high-frequency deformations in a sequential manner. The low-frequency deformations are predicted by transferring body motions to virtual bones’ motions, and the high-frequency deformations are estimated leveraging the global information of virtual bones’ motions and local information extracted from low-frequency meshes. In addition, our method can estimate garment deformations caused by variations of the simulation parameters (e.g., fabric’s bending stiffness) using an RBF kernel ensembling trained networks for different sets of simulation parameters. Through extensive comparisons, we show that our method outperforms state-of-the-art methods in terms of prediction accuracy of mesh deformations by about 20% in RMSE and 10% in Hausdorff distance and STED. The code and data are available at https://github.com/non-void/VirtualBones.
In this work, we tackle the challenging problem of arbitrary image style transfer using a novel style feature representation learning method. A suitable style representation, as a key component in image stylization tasks, is essential to achieve satisfactory results. Existing deep neural network based approaches achieve reasonable results with the guidance from second-order statistics such as Gram matrix of content features. However, they do not leverage sufficient style information, which results in artifacts such as local distortions and style inconsistency. To address these issues, we propose to learn style representation directly from image features instead of their second-order statistics, by analyzing the similarities and differences between multiple styles and considering the style distribution. Specifically, we present Contrastive Arbitrary Style Transfer (CAST), which is a new style representation learning and style transfer method via contrastive learning. Our framework consists of three key components, i.e., a multi-layer style projector for style code encoding, a domain enhancement module for effective learning of style distribution, and a generative network for image style transfer. We conduct qualitative and quantitative evaluations comprehensively to demonstrate that our approach achieves significantly better results compared to those obtained via state-of-the-art methods. Code and models are available at https://github.com/zyxElsa/CAST_pytorch.
We present Shoot360, a system that efficiently generates multi-shot normal view videos with desired content presentation and various cinematic styles, given a collection of 360 video recordings on different environments. The core of our system is a three-step decision process: 1) It firstly semantically analyzes the contents of interest from each panorama environment based on shot units, and produces a guidance that specifies the semantic focus and movement type of its output shot according to the user specification on content presentation and cinematic styles. 2) Based on the obtained guidance, it generates video candidates for each shot with shot-level control parameters for view projections following the filming rules. 3) The system further aggregates the projected normal view shots with the imposed local and global constraints, which incorporates the external knowledge learned from exemplar videos and professional filming rules. Extensive experiments verify the effectiveness of our system design, and we conclude with promising extensions for applying it to more generalized scenarios.
This paper deals with the challenging task of synthesizing novel views for in-the-wild photographs. Existing methods have shown promising results leveraging monocular depth estimation and color inpainting with layered depth representations. However, these methods still have limited capability to handle scenes with complex 3D geometry. We propose a new method based on the multiplane image (MPI) representation. To accommodate diverse scene layouts in the wild and tackle the difficulty in producing high-dimensional MPI contents, we design a network structure that consists of two novel modules, one for plane depth adjustment and another for depth-aware color prediction. The former adjusts the initial plane positions using the RGBD context feature and an attention mechanism. Given adjusted depth values, the latter predicts the color and density for each plane separately with proper inter-plane interactions achieved via a feature masking strategy. To train our method, we construct large-scale stereo training data using only unconstrained single-view image collections by a simple yet effective warp-back strategy. The experiments on both synthetic and real datasets demonstrate that our trained model works remarkably well and achieves state-of-the-art results.
This paper develops a unified framework for image-to-image translation based on conditional diffusion models and evaluates this framework on four challenging image-to-image translation tasks, namely colorization, inpainting, uncropping, and JPEG restoration. Our simple implementation of image-to-image diffusion models outperforms strong GAN and regression baselines on all tasks, without task-specific hyper-parameter tuning, architecture customization, or any auxiliary loss or sophisticated new techniques needed. We uncover the impact of an L2 vs. L1 loss in the denoising diffusion objective on sample diversity, and demonstrate the importance of self-attention in the neural architecture through empirical studies. Importantly, we advocate a unified evaluation protocol based on ImageNet, with human evaluation and sample quality scores (FID, Inception Score, Classification Accuracy of a pre-trained ResNet-50, and Perceptual Distance against original images). We expect this standardized evaluation protocol to play a role in advancing image-to-image translation research. Finally, we show that a generalist, multi-task diffusion model performs as well or better than task-specific specialist counterparts. Check out https://diffusion-palette.github.io/ for an overview of the results and code.
Generative Adversarial Networks (GANs) are susceptible to bias, learned from either the unbalanced data, or through mode collapse. The networks focus on the core of the data distribution, leaving the tails — or the edges of the distribution — behind. We argue that this bias is responsible not only for fairness concerns, but that it plays a key role in the collapse of latent-traversal editing methods when deviating away from the distribution’s core. Building on this observation, we outline a method for mitigating generative bias through a self-conditioning process, where distances in the latent-space of a pre-trained generator are used to provide initial labels for the data. By fine-tuning the generator on a re-sampled distribution drawn from these self-labeled data, we force the generator to better contend with rare semantic attributes and enable more realistic generation of these properties. We compare our models to a wide range of latent editing methods, and show that by alleviating the bias they achieve finer semantic control and better identity preservation through a wider range of transformations. Our code and models will be available at https://github.com/yzliu567/sc-gan
Rendering photorealistic visuals of virtual scenes requires tractable models for the simulation of light. The rendering equation describes one such model using an integral equation, the crux of which is a continuous integral operator. A majority of rendering algorithms aim to approximate the effect of this light transport operator via discretization (using rays, particles, patches, etc.). Research spanning four decades has uncovered interesting properties and intuition surrounding this operator. In this paper we analyze compactness, a key property that is independent of its discretization and which characterizes the ability to approximate the operator uniformly by a sequence of finite rank operators. We conclusively prove lingering suspicions that this operator is not compact and therefore that any discretization that relies on a finite-rank or nonadaptive finite-bases is susceptible to unbounded error over arbitrary light distributions. Our result justifies the expectation for rendering algorithms to be evaluated using a variety of scenes and illumination conditions. We also discover that its lower dimensional counterpart (over purely diffuse scenes) is not compact except in special cases, and uncover connections with it being noninvertible and acting as a low-pass filter. We explain the relevance of our results in the context of previous work. We believe that our theoretical results will inform future rendering algorithms regarding practical choices.
Using a network trained by a large dataset is becoming popular for denoising Monte Carlo rendering. Such a denoising approach based on supervised learning is currently considered the best approach in terms of quality. Nevertheless, this approach may fail when the image to be rendered (i.e., the test data) has very different characteristics than the images included in the training dataset. A pre-trained network may not properly denoise such an image since it is unseen data from a supervised learning perspective. To address this fundamental issue, we introduce a post-processing network that improves the performance of supervised learning denoisers. The key idea behind our approach is to train this post-processing network with self-supervised learning. In contrast to supervised learning, our self-supervised model does not need a reference image in its training process. We can thus use a noisy test image and self-correct the model on the fly to improve denoising performance. Our main contribution is a self-supervised loss that can guide the post-correction network to optimize its parameters without relying on the reference. Our work is the first to apply this self-supervised learning concept in denoising Monte Carlo rendered estimates. We demonstrate that our post-correction framework can boost supervised denoising via our self-supervised optimization. Our implementation is available at https://github.com/CGLab-GIST/self-supervised-post-corr.
Concept sketches, ubiquitously used in industrial design, are inherently imprecise yet highly effective at communicating 3D shape to human observers. We present a new symmetry-driven algorithm for recovering designer-intended 3D geometry from concept sketches. We observe that most concept sketches of human-made shapes are structured around locally symmetric building blocks, defined by triplets of orthogonal symmetry planes. We identify potential building blocks using a combination of 2D symmetries and drawing order. We reconstruct each such building block by leveraging a combination of perceptual cues and observations about designer drawing choices. We cast this reconstruction as an integer programming problem where we seek to identify, among the large set of candidate symmetry correspondences formed by approximate pen strokes, the subset that results in the most symmetric and well-connected shape. We demonstrate the robustness of our approach by reconstructing 82 sketches, which exhibit significant over-sketching, inaccurate perspective, partial symmetry, and other imperfections. In a comparative study, participants judged our results as superior to the state-of-the-art by a ratio of 2:1.
Designing curve networks for fabrication requires simultaneous consideration of structural stability, cost effectiveness, and visual appeal—complex, interrelated objectives that make manual design a difficult and tedious task. We present a novel method for fabrication-aware simplification of curve networks, algorithmically selecting a stable subset of given 3D curves. While traditionally stability is measured as magnitude of deformation induced by a set of pre-defined loads, predicting applied forces for common day objects can be challenging. Instead, we directly optimize for minimal deformation under the worst-case load.
Our technical contribution is a novel formulation of 3D curve network simplification for worst-case stability, leading to a mixed-integer semi-definite programming problem (MI-SDP). We show that while solving MI-SDP directly is infeasible, a physical insight suggests an efficient greedy approximation algorithm. We demonstrate the potential of our approach on a variety of curve network designs and validate its effectiveness compared to simpler alternatives using numerical experiments.
We design new visual illusions by finding “adversarial examples” for principled models of human perception — specifically, for probabilistic models, which treat vision as Bayesian inference. To perform this search efficiently, we design a differentiable probabilistic programming language, whose API exposes MCMC inference as a first-class differentiable function. We demonstrate our method by automatically creating illusions for three features of human vision: color constancy, size constancy, and face perception.
Understanding the relation between anatomy and gait is key to successful predictive gait simulation. In this paper, we present Generative GaitNet, which is a novel network architecture based on deep reinforcement learning for controlling a comprehensive, full-body, musculoskeletal model with 304 Hill-type musculotendons. The Generative GaitNet is a pre-trained, integrated system of artificial neural networks learned in a 618-dimensional continuous domain of anatomy conditions (e.g., mass distribution, body proportion, bone deformity, and muscle deficits) and gait conditions (e.g., stride and cadence). The pre-trained GaitNet takes anatomy and gait conditions as input and generates a series of gait cycles appropriate to the conditions through physics-based simulation. We will demonstrate the efficacy and expressive power of Generative GaitNet to generate a variety of healthy and pathological human gaits in real-time physics-based simulation.
In many physical interactions such as opening doors and playing sports, humans act compliantly to move in various ways to avoid large impacts or to manipulate objects. This paper aims to build a framework for simulation and control of humanoids that creates physically compliant interactions with surroundings. We can generate a broad spectrum of movements ranging from passive reactions to external physical perturbations, to active manipulations with clear intentions. Technical challenges include defining compliance, reproducing physically reliable movements, and robustly controlling under-actuated dynamical systems. The key technical contribution is a two-level control architecture based on deep reinforcement learning that imitates human movements while adjusting their bodies to external perturbations. The controller minimizes the interaction forces and the control torques for imitation, and we demonstrate the effectiveness of the controller with various motor skills including opening doors, balancing a ball, and running hand in hand.
Brachiation is the primary form of locomotion for gibbons and siamangs, in which these primates swing from tree limb to tree limb using only their arms. It is challenging to control because of the limited control authority, the required advance planning, and the precision of the required grasps. We present a novel approach to this problem using reinforcement learning, and as demonstrated on a finger-less 14-link planar model that learns to brachiate across challenging handhold sequences. Key to our method is the use of a simplified model, a point mass with a virtual arm, for which we first learn a policy that can brachiate across handhold sequences with a prescribed order. This facilitates the learning of the policy for the full model, for which it provides guidance by providing an overall center-of-mass trajectory to imitate, as well as for the timing of the holds. Lastly, the simplified model can also readily be used for planning suitable sequences of handholds in a given environment. Our results demonstrate brachiation motions with a variety of durations for the flight and hold phases, as well as emergent extra back-and-forth swings when this proves useful. The system is evaluated with a variety of ablations. The method enables future work towards more general 3D brachiation, as well as using simplified model imitation in other settings. For videos, supplementary material and code, visit: https://brachiation-rl.github.io/brachiation.
Learning physics-based character controllers that can successfully integrate diverse motor skills using a single policy remains a challenging problem. We present a system to learn control policies for multiple soccer juggling skills, based on deep reinforcement learning. We introduce a task-description framework for these skills which facilitates the specification of individual soccer juggling tasks and the transitions between them. Desired motions can be authored using interpolation of crude reference poses or based on motion capture data. We show that a layer-wise mixture-of-experts architecture offers significant benefits. During training, transitions are chosen with the help of an adaptive random walk, in support of efficient learning. We demonstrate foot, head, knee, and chest juggles, foot stalls, the challenging around-the-world trick, as well as robust transitions. Our work provides a significant step towards realizing physics-based characters capable of the precision-based motor skills of human athletes. Code is available at https://github.com/ZhaomingXie/soccer_juggle_release.
We are witnessing an explosion of neural implicit representations in computer vision and graphics. Their applicability has recently expanded beyond tasks such as shape generation and image-based rendering to the fundamental problem of image-based 3D reconstruction. However, existing methods typically assume constrained 3D environments with constant illumination captured by a small set of roughly uniformly distributed cameras. We introduce a new method that enables efficient and accurate surface reconstruction from Internet photo collections in the presence of varying illumination. To achieve this, we propose a hybrid voxel- and surface-guided sampling technique that allows for more efficient ray sampling around surfaces and leads to significant improvements in reconstruction quality. Further, we present a new benchmark and protocol for evaluating reconstruction performance on such in-the-wild scenes. We perform extensive experiments, demonstrating that our approach surpasses both classical and neural reconstruction methods on a wide variety of metrics. Code and data will be made available at https://zju3dv.github.io/neuralrecon-w.
In many recent works, multi-layer perceptions (MLPs) have been shown to be suitable for modeling complex spatially-varying functions including images and 3D scenes. Although the MLPs are able to represent complex scenes with unprecedented quality and memory footprint, this expressive power of the MLPs, however, comes at the cost of long training and inference times. On the other hand, bilinear/trilinear interpolation on regular grid-based representations can give fast training and inference times, but cannot match the quality of MLPs without requiring significant additional memory. Hence, in this work, we investigate what is the smallest change to grid-based representations that allows for retaining the high fidelity result of MLPs while enabling fast reconstruction and rendering times. We introduce a surprisingly simple change that achieves this task – simply allowing a fixed non-linearity (ReLU) on interpolated grid values. When combined with coarse-to-fine optimization, we show that such an approach becomes competitive with the state-of-the-art. We report results on radiance fields, and occupancy fields, and compare against multiple existing alternatives. Code and data for the paper are available at https://geometry.cs.ucl.ac.uk/projects/2022/relu_fields.
A polygonal mesh is the most-commonly used representation of surfaces in computer graphics. Therefore, it is not surprising that a number of mesh classification networks have recently been proposed. However, while adversarial attacks are wildly researched in 2D, the field of adversarial meshes is under explored. This paper proposes a novel, unified, and general adversarial attack, which leads to misclassification of several state-of-the-art mesh classification neural networks. Our attack approach is black-box, i.e. it has access only to the network’s predictions, but not to the network’s full architecture or gradients. The key idea is to train a network to imitate a given classification network. This is done by utilizing random walks along the mesh surface, which gather geometric information. These walks provide insight onto the regions of the mesh that are important for the correct prediction of the given classification network. These mesh regions are then modified more than other regions in order to attack the network in a manner that is barely visible to the naked eye.
Low-overlap regions between paired point clouds make the captured features very low-confidence, leading cutting edge models to point cloud registration with poor quality. Beyond the traditional wisdom, we raise an intriguing question: Is it possible to exploit an intermediate yet misaligned image between two low-overlap point clouds to enhance the performance of cutting-edge registration models? To answer it, we propose a misaligned image supported registration network for low-overlap point cloud pairs, dubbed ImLoveNet. ImLoveNet first learns triple deep features across different modalities and then exports these features to a two-stage classifier, for progressively obtaining the high-confidence overlap region between the two point clouds. Therefore, soft correspondences are well established on the predicted overlap region, resulting in accurate rigid transformations for registration. ImLoveNet is simple to implement yet effective, since 1) the misaligned image provides clearer overlap information for the two low-overlap point clouds to better locate overlap parts; 2) it contains certain geometry knowledge to extract better deep features; and 3) it does not require the extrinsic parameters of the imaging device with respect to the reference frame of the 3D point cloud. Extensive qualitative and quantitative evaluations on different kinds of benchmarks demonstrate the effectiveness and superiority of our ImLoveNet over state-of-the-art approaches.
Möbius transformations play an important role in both geometry and spherical image processing – they are the group of conformal automorphisms of 2D surfaces and the spherical equivalent of homographies. Here we present a novel, Möbius-equivariant spherical convolution operator which we call Möbius convolution; with it, we develop the foundations for Möbius-equivariant spherical CNNs. Our approach is based on the following observation: to achieve equivariance, we only need to consider the lower-dimensional subgroup which transforms the positions of points as seen in the frames of their neighbors. To efficiently compute Möbius convolutions at scale we derive an approximation of the action of the transformations on spherical filters, allowing us to compute our convolutions in the spectral domain with the fast Spherical Harmonic Transform. The resulting framework is flexible and descriptive, and we demonstrate its utility by achieving promising results in both shape classification and image segmentation tasks.
Neural implicit fields have recently emerged as a useful representation for 3D shapes. These fields are commonly represented as neural networks which map latent descriptors and 3D coordinates to implicit function values. The latent descriptor of a neural field acts as a deformation handle for the 3D shape it represents. Thus, smoothness with respect to this descriptor is paramount for performing shape-editing operations. In this work, we introduce a novel regularization designed to encourage smooth latent spaces in neural fields by penalizing the upper bound on the field’s Lipschitz constant. Compared with prior Lipschitz regularized networks, ours is computationally fast, can be implemented in four lines of code, and requires minimal hyperparameter tuning for geometric applications. We demonstrate the effectiveness of our approach on shape interpolation and extrapolation as well as partial shape reconstruction from 3D point clouds, showing both qualitative and quantitative improvements over existing state-of-the-art and non-regularized baselines.
Holographic near-eye displays offer unprecedented capabilities for virtual and augmented reality systems, including perceptually important focus cues. Although artificial intelligence–driven algorithms for computer-generated holography (CGH) have recently made much progress in improving the image quality and synthesis efficiency of holograms, these algorithms are not directly applicable to emerging phase-only spatial light modulators (SLM) that are extremely fast but offer phase control with very limited precision. The speed of these SLMs offers time multiplexing capabilities, essentially enabling partially-coherent holographic display modes. Here we report advances in camera-calibrated wave propagation models for these types of holographic near-eye displays and we develop a CGH framework that robustly optimizes the heavily quantized phase patterns of fast SLMs. Our framework is flexible in supporting runtime supervision with different types of content, including 2D and 2.5D RGBD images, 3D focal stacks, and 4D light fields. Using our framework, we demonstrate state-of-the-art results for all of these scenarios in simulation and experiment.
We present Holographic Glasses, a holographic near-eye display system with an eyeglasses-like form factor for virtual reality. Holographic Glasses are composed of a pupil-replicating waveguide, a spatial light modulator, and a geometric phase lens to create holographic images in a lightweight and thin form factor. The proposed design can deliver full-color 3D holographic images using an optical stack of 2.5 mm thickness. A novel pupil-high-order gradient descent algorithm is presented for the correct phase calculation with the user’s varying pupil size. We implement benchtop and wearable prototypes for testing. Our binocular wearable prototype supports 3D focus cues and provides a diagonal field of view of 22.8° with a 2.3 mm static eye box and additional capabilities of dynamic eye box with beam steering, while weighing only 60 g excluding the driving board.
Document image unwarping is important for document digitization and analysis. The state-of-the-art approach relies on purely synthetic data to train deep networks for unwarping. As a result, the trained networks have generalization limitations when testing on real-world images, often yielding unsatisfying results. In this work, we propose to improve document unwarping performance by incorporating real-world images in training. We collected Document-in-the-Wild (DIW) dataset contains 5000 captured document images with large diversities in content, shape, and capturing environment. We annotate the boundaries of all DIW images and use them for weakly supervised learning. We propose a novel network architecture, PaperEdge, to train with a hybrid of synthetic and real document images. Additionally, we identify and analyze the flaws of popular evaluation metrics, e.g., MS-SSIM and Local Distortion (LD), for document unwarping and propose a more robust and reliable error metric called Aligned Distortion (AD). Training with a combination of synthetic and real-world document images, we demonstrate state-of-the-art performance on popular benchmarks with comprehensive quantitative evaluations and ablation studies. Code and data are available at https://github.com/cvlab-stonybrook/PaperEdge.
Poisson equations appear in many graphics settings including, but not limited to, physics-based fluid simulation. Numerical solvers for such problems strike context-specific memory, performance, stability and accuracy trade-offs. We propose a new Poisson filter-based solver that balances between the strengths of spectral and iterative methods. We derive universal Poisson kernels for forward and inverse Poisson problems, leveraging careful adaptive filter truncation to localize their extent, all while maintaining stability and accuracy. Iterative composition of our compact filters improves solver iteration time by orders-of-magnitude compared to optimized linear methods. While motivated by spectral formulations, we overcome important limitations of spectral methods while retaining many of their desirable properties. We focus on the application of our method to high-performance and high-fidelity fluid simulation, but we also demonstrate its broader applicability. We release our source code at https://github.com/Ubisoft-LaForge/CompactPoissonFilters .
We present the Geometric-Wave Acoustic (GWA) dataset, a large-scale audio dataset of about 2 million synthetic room impulse responses (IRs) and their corresponding detailed geometric and simulation configurations. Our dataset samples acoustic environments from over 6.8K high-quality diverse and professionally designed houses represented as semantically labeled 3D meshes. We also present a novel real-world acoustic materials assignment scheme based on semantic matching that uses a sentence transformer model. We compute high-quality impulse responses corresponding to accurate low-frequency and high-frequency wave effects by automatically calibrating geometric acoustic ray-tracing with a finite-difference time-domain wave solver. We demonstrate the higher accuracy of our IRs by comparing with recorded IRs from complex real-world environments. Moreover, we highlight the benefits of GWA on audio deep learning tasks such as automated speech recognition, speech enhancement, and speech separation. This dataset is the first data with accurate wave acoustic simulations in complex scenes. Codes and data are available at https://gamma.umd.edu/pro/sound/gwa.
We present a novel paradigm for modeling certain types of dynamic simulation in real-time with the aid of neural networks. In order to significantly reduce the requirements on data (especially time-dependent data), as well as decrease generalization error, our approach utilizes a data-driven neural network only to capture quasistatic information (instead of dynamic or time-dependent information). Subsequently, we augment our quasistatic neural network (QNN) inference with a (real-time) dynamic simulation layer. Our key insight is that the dynamic modes lost when using a QNN approximation can be captured with a quite simple (and decoupled) zero-restlength spring model, which can be integrated analytically (as opposed to numerically) and thus has no time-step stability restrictions. Additionally, we demonstrate that the spring constitutive parameters can be robustly learned from a surprisingly small amount of dynamic simulation data. Although we illustrate the efficacy of our approach by considering soft-tissue dynamics on animated human bodies, the paradigm is extensible to many different simulation frameworks.
Inverse rendering is a powerful approach to modeling objects from photographs, and we extend previous techniques to handle translucent materials that exhibit subsurface scattering. Representing translucency using a heterogeneous bidirectional scattering-surface reflectance distribution function (BSSRDF), we extend the framework of path-space differentiable rendering to accommodate both surface and subsurface reflection. This introduces new types of paths requiring new methods for sampling moving discontinuities in material space that arise from visibility and moving geometry. We use this differentiable rendering method in an end-to-end approach that jointly recovers heterogeneous translucent materials (represented by a BSSRDF) and detailed geometry of an object (represented by a mesh) from a sparse set of measured 2D images in a coarse-to-fine framework incorporating Laplacian preconditioning for the geometry. To efficiently optimize our models in the presence of the Monte Carlo noise introduced by the BSSRDF integral, we introduce a dual-buffer method for evaluating the L2 image loss. This efficiently avoids potential bias in gradient estimation due to the correlation of estimates for image pixels and their derivatives and enables correct convergence of the optimizer even when using low sample counts in the renderer. We validate our derivatives by comparing against finite differences and demonstrate the effectiveness of our technique by comparing inverse-rendering performance with previous methods. We show superior reconstruction quality on a set of synthetic and real-world translucent objects as compared to previous methods that model only surface reflection.
We tackle the problem of generating novel-view images from collections of 2D images showing refractive and reflective objects. Current solutions assume opaque or transparent light transport along straight paths following the emission-absorption model. Instead, we optimize for a field of 3D-varying index of refraction (IoR) and trace light through it that bends toward the spatial gradients of said IoR according to the laws of eikonal light transport.
Virtual reality (VR) headsets provide an immersive, stereoscopic visual experience, but at the cost of blocking users from directly observing their physical environment. Passthrough techniques are intended to address this limitation by leveraging outward-facing cameras to reconstruct the images that would otherwise be seen by the user without the headset. This is inherently a real-time view synthesis challenge, since passthrough cameras cannot be physically co-located with the user’s eyes. Existing passthrough techniques suffer from distracting reconstruction artifacts, largely due to the lack of accurate depth information (especially for near-field and disoccluded objects), and also exhibit limited image quality (e.g., being low resolution and monochromatic). In this paper, we propose the first learned passthrough method and assess its performance using a custom VR headset that contains a stereo pair of RGB cameras. Through both simulations and experiments, we demonstrate that our learned passthrough method delivers superior image quality compared to state-of-the-art methods, while meeting strict VR requirements for real-time, perspective-correct stereoscopic view synthesis over a wide field of view for desktop-connected headsets.
Neural approximations of scalar- and vector fields, such as signed distance functions and radiance fields, have emerged as accurate, high-quality representations. State-of-the-art results are obtained by conditioning a neural approximation with a lookup from trainable feature grids [Liu et al. 2020; Martel et al. 2021; Müller et al. 2022; Takikawa et al. 2021] that take on part of the learning task and allow for smaller, more efficient neural networks. Unfortunately, these feature grids usually come at the cost of significantly increased memory consumption compared to stand-alone neural network models. We present a dictionary method for compressing such feature grids, reducing their memory consumption by up to 100 × and permitting a multiresolution representation which can be useful for out-of-core streaming. We formulate the dictionary optimization as a vector-quantized auto-decoder problem which lets us learn end-to-end discrete neural representations in a space where no direct supervision is available and with dynamic topology and structure. Our source code is available at https://github.com/nv-tlabs/vqad.
We introduce α-functions, providing piecewise linear approximation to given data as the difference of two convex functions. The parameter α controls the shape of a paraboloid that is probing the data and may be used to filter out noise in the data. The use of convex functions enables tools for efficient approximation to the data, adding robustness to outliers, and dealing with gradient information. It also allows using the approach in higher dimension. We show that α-functions can be efficiently computed and demonstrate their versatility at the example of surface reconstruction from noisy surface samples.
Bird feathers exhibit fascinating reflectance, which is governed by fiber-like structures. Unlike hair and fur, the feather geometric structures follow intricate hierarchical patterns that span many orders of magnitude in scale. At the smallest scales, fiber elements have strongly non-cylindrical cross-sections and are often complemented by regular nanostructures, causing rich structural color. Therefore, past attempts to render feathers using fiber- or texture-based appearance models missed characteristic aspects of the visual appearance. We introduce a new feather modeling and rendering framework, which abstracts the microscopic geometry and reflectance into a microfacet-like BSDF. The R, TRT and T lobes, also known from hair and fur, here account for specular reflection off the cortex, diffuse reflection off the medulla, and transmission due to barbule spacing, respectively. Our BSDF, which does not require precomputation or storage, can be efficiently importance-sampled and readily integrated into rendering pipelines that represent feather geometry down to the barb level. We verify our approach using a BSDF-capturing setup for small biological structures, as well as against calibrated photographs of rock dove neck feathers.
Given specific scene configurations and target functions, automatic shader simplification searches for the best simplified shader variant from an optimization space with many candidates. Although various speedup methods have been proposed, there is still a costly render-and-evaluate process to obtain variant’s performance and quality, especially when the scene changes.
In this paper, we present a deep learning-based framework for predicting a shader’s simplification space, where the shader’s variants can be embedded into a metric space all at once for efficient quality evaluation. The novel framework allows the one-shot embedding of a space rather than a single instance. In addition, the simplification errors can be interpreted by mutual attention between shader fragments, presenting an informative focus-aware simplification framework that can assist experts in optimizing the codes. The results show that the new framework achieves significant speedup compared with existing search approaches. The focus-aware simplification framework reveals a new possibility of interpreting shaders for various applications.
This work proposes a real-time algorithm for reconstructing 3D human poses in crowded scenes from multiple calibrated views. The key challenge of this problem is to efficiently match 2D observations across multiple views. Previous methods perform multi-view matching either at the full-body level, which is sensitive to 2D pose estimation error, or at the part level, which ignores 2D constraints between different types of body parts in the same view. Instead, our approach reasons about all plausible skeleton proposals during multi-view matching, where each skeleton may consist of an arbitrary number of parts instead of being a whole body or a single part. To this end, we formulate the multi-view matching problem as mode seeking in the space of skeleton proposals and develop an efficient algorithm named QuickPose to solve the problem, which enables real-time motion capture in crowded scenes. Experiments show that the proposed algorithm achieves the state-of-the-art performance in terms of both speed and accuracy on public datasets.
Recent deep learning-based approaches have shown promising results for synthesizing plausible 3D human gestures from speech input. However, these approaches typically offer limited freedom to incorporate user control. Furthermore, training such models in a supervised manner often does not capture the multi-modal nature of the data, particularly because the same audio input can produce different gesture outputs. To address these problems, we present an approach for generating controllable 3D gestures that combines the advantage of database matching and deep generative modeling. Our method predicts 3D body motion by sequentially searching for the most plausible audio-gesture clips from a database using a k-Nearest Neighbors (k-NN) algorithm that considers the similarity to both the input audio and the previous body pose information. To further improve the synthesis quality, we propose a conditional Generative Adversarial Network (cGAN) model to provide a data-driven refinement to the k-NN result by comparing its plausibility against the ground truth audio-gesture pairs. Our novel approach enables direct and more varied control manipulation that is not possible with prior learning-based counterparts. Our experiments show that our proposed approach outperforms recent models on control-based synthesis tasks using high-level signals such as motion statistics while enabling flexible and effective user control for lower-level signals. 1
Getting up from an arbitrary fallen state is a basic human skill. Existing methods for learning this skill often generate highly dynamic and erratic get-up motions, which do not resemble human get-up strategies, or are based on tracking recorded human get-up motions. In this paper, we present a staged approach using reinforcement learning, without recourse to motion capture data. The method first takes advantage of a strong character model, which facilitates the discovery of solution modes. A second stage then learns to adapt the control policy to work with progressively weaker versions of the character. Finally, a third stage learns control policies that can reproduce the weaker get-up motions at much slower speeds. We show that across multiple runs, the method can discover a diverse variety of get-up strategies, and execute them at a variety of speeds. The results usually produce policies that use a final stand-up strategy that is common to the recovery motions seen from all initial states. However, we also find policies for which different strategies are seen for prone and supine initial fallen states. The learned get-up control strategies often have significant static stability, i.e., they can be paused at a variety of points during the get-up motion. We further test our method on novel constrained scenarios, such as having a leg and an arm in a cast.
The success of StyleGAN has enabled unprecedented semantic editing capabilities, on both synthesized and real images. However, such editing operations are either trained with semantic supervision or annotated manually by users. In another development, the CLIP architecture has been trained with internet-scale loose image and text pairings, and has been shown to be useful in several zero-shot learning settings. In this work, we investigate how to effectively link the pretrained latent spaces of StyleGAN and CLIP, which in turn allows us to automatically extract semantically-labeled edit directions from StyleGAN, finding and naming meaningful edit operations, in a fully unsupervised setup, without additional human guidance. Technically, we propose two novel building blocks; one for discovering interesting CLIP directions and one for semantically labeling arbitrary directions in CLIP latent space. The setup does not assume any pre-determined labels and hence we do not require any additional supervised text/attributes to build the editing framework. We evaluate the effectiveness of the proposed method and demonstrate that extraction of disentangled labeled StyleGAN edit directions is indeed possible, revealing interesting and non-trivial edit directions.
Computer graphics has experienced a recent surge of data-centric approaches for photorealistic and controllable content creation. StyleGAN in particular sets new standards for generative modeling regarding image quality and controllability. However, StyleGAN’s performance severely degrades on large unstructured datasets such as ImageNet. StyleGAN was designed for controllability; hence, prior works suspect its restrictive design to be unsuitable for diverse datasets. In contrast, we find the main limiting factor to be the current training strategy. Following the recently introduced Projected GAN paradigm, we leverage powerful neural network priors and a progressive growing strategy to successfully train the latest StyleGAN3 generator on ImageNet. Our final model, StyleGAN-XL, sets a new state-of-the-art on large-scale image synthesis and is the first to generate images at a resolution of 10242 at such a dataset scale. We demonstrate that this model can invert and edit images beyond the narrow domain of portraits or specific object classes. Code, models, and supplementary videos can be found at https://sites.google.com/view/stylegan-xl/ .
StyleGAN is known to produce high-fidelity images, while also offering unprecedented semantic editing. However, these fascinating abilities have been demonstrated only on a limited set of datasets, which are usually structurally aligned and well curated. In this paper, we show how StyleGAN can be adapted to work on raw uncurated images collected from the Internet. Such image collections impose two main challenges to StyleGAN: they contain many outlier images, and are characterized by a multi-modal distribution. Training StyleGAN on such raw image collections results in degraded image synthesis quality. To meet these challenges, we proposed a StyleGAN-based self-distillation approach, which consists of two main components: (i) A generative-based self-filtering of the dataset to eliminate outlier images, in order to generate an adequate training set, and (ii) Perceptual clustering of the generated images to detect the inherent data modalities, which are then employed to improve StyleGAN’s “truncation trick” in the image synthesis process. The presented technique enables the generation of high-quality images, while minimizing the loss in diversity of the data. Through qualitative and quantitative evaluation, we demonstrate the power of our approach to new challenging and diverse domains collected from the Internet. New datasets and pre-trained models are provided in our project website https://self-distilled-stylegan.github.io/.
We present a virtual reality display system simulator that accurately reproduces gaze-contingent distortions created by any viewing optic. The simulator hardware supports rapid prototyping by presenting stereoscopic distortions on a high-speed television paired with shutter glasses, eliminating the need to fabricate physical optics. We further introduce light field portals as an efficient and general-purpose representation for VR optics, enabling real-time emulation using our simulator. This platform is used to conduct the first user study of perceptual requirements for eye-tracked optical distortion correction. Because our hardware platform facilitates consistent head and eye movements, it enables direct comparison of these requirements across observers, optical designs, and scene content. We conclude by introducing a simple binocular distortion metric, built using light field portals, which agrees with key trends identified in the user study and lays a foundation for the design of perceptually-based distortion metrics and correction schemes.
LeviPrint is a system for assembling objects in a contactless manner using acoustic levitation. We explore a set of optimum acoustic fields that enables full trapping in position and orientation of elongated objects such as sticks. We then evaluate the capabilities of different ultrasonic levitators to dynamically manipulate these elongated objects. The combination of novel optimization algorithms and levitators enable the manipulation of sticks, beads and droplets to fabricate complex objects. A system prototype composed of a robot arm and a levitator is tested for different fabrication processes. We highlight the reduction of cross-contamination and the capability of building on top of objects from different angles as well as inside closed spaces. We hope that this technique inspires novel fabrication techniques and that reaches fields such as microfabrication of electromechanical components or even in-vivo additive manufacturing.
Diversity among agents’ behaviors and heterogeneity in virtual crowds in general, is an important aspect of crowd simulation as it is crucial to the perceived realism and plausibility of the resulting simulations. Heterogeneous crowds constitute the pillar in creating numerous real-life scenarios such as museum exhibitions, which require variety in agent behaviors, from basic collision avoidance to more complex interactions both among agents and with environmental features. Most of the existing systems optimize for specific behaviors such as goal seeking, and neglect to take into account other behaviors and how these interact together to form diverse agent profiles. In this paper, we present a RL-based framework for learning multiple agent behaviors concurrently. We optimize the agent policy by varying the importance of the selected behaviors (goal seeking, collision avoidance, interaction with environment, and grouping) while training; essentially we have a reward function that changes dynamically during training. The importance of each separate sub-behavior is added as input to the policy, resulting in the development of a single model capable of capturing as well as enabling dynamic run-time manipulation of agent profiles; thus allowing configurable profiles. Through a series of experiments, we verify that our system provides users with the ability to design virtual scenes; control and mix agent behaviors thus creating personality profiles, and assign different profiles to groups of agents. Moreover, we demonstrate that interestingly the proposed model generalizes to situations not seen in the training data such as a) crowds with higher density, b) behavior weights that are outside the training intervals and c) to scenes with more intricate environment layouts. Code, data and trained policies for this paper are at https://github.com/veupnea/CCP.
We present stroke transfer, an example-based synthesis method of brushstrokes for animated scenes under changes in viewpoint, lighting conditions, and object shapes. We introduce stroke field for guiding the generation of strokes, consisting of spatially varying attributes of strokes, namely, their orientations, lengths, widths, and colors. Strokes are synthesized as the integral curves of the stroke field. In essence, we separate elements that constitute the artistic stroke into style-specific transferable elements and instance-intrinsic ones. To generate the stroke field, we first compute a set of vector fields that reflect the instance-intrinsic elements and then combine them using style-specific weight functions learned from exemplars, with the weights computed in a proxy feature space shared among a variety of objects. The rendered animation using our method captures time-varying viewpoint, lighting conditions, and object shapes, as well as the artistic style given in the form of exemplars.
Recent research work has developed powerful generative models (e.g., StyleGAN2) that can synthesize complete human head images with impressive photorealism, enabling applications such as photorealistically editing real photographs. While these models can be trained on large collections of unposed images, their lack of explicit 3D knowledge makes it difficult to achieve even basic control over 3D viewpoint without unintentionally altering identity. On the other hand, recent Neural Radiance Field (NeRF) methods have already achieved multiview-consistent, photorealistic renderings but they are so far limited to a single facial identity. In this paper, we propose a new Morphable Radiance Field (MoRF) method that extends a NeRF into a generative neural model that can realistically synthesize multiview-consistent images of complete human heads, with variable and controllable identity. MoRF allows for morphing between particular identities, synthesizing arbitrary new identities, or quickly generating a NeRF from few images of a new subject, all while providing realistic and consistent rendering under novel viewpoints. We train MoRF in a supervised fashion by leveraging a high-quality database of multiview portrait images of several people, captured in studio with polarization-based separation of diffuse and specular reflection. Here, we demonstrate how MoRF is a strong new step forwards towards generative NeRFs for 3D neural head modeling.
Photorealistic telepresence requires both high-fidelity body modeling and faithful driving to enable dynamically synthesized appearance that is indistinguishable from reality. In this work, we propose an end-to-end framework that addresses two core challenges in modeling and driving full-body avatars of real people. One challenge is driving an avatar while staying faithful to details and dynamics that cannot be captured by a global low-dimensional parameterization such as body pose. Our approach supports driving of clothed avatars with wrinkles and motion that a real driving performer exhibits beyond the training corpus. Unlike existing global state representations or non-parametric screen-space approaches, we introduce texel-aligned features—a localised representation which can leverage both the structural prior of a skeleton-based parametric model and observed sparse image signals at the same time. Another challenge is modeling a temporally coherent clothed avatar, which typically requires precise surface tracking. To circumvent this, we propose a novel volumetric avatar representation by extending mixtures of volumetric primitives to articulated objects. By explicitly incorporating articulation, our approach naturally generalizes to unseen poses. We also introduce a localized viewpoint conditioning, which leads to a large improvement in generalization of view-dependent appearance. The proposed volumetric representation does not require high-quality mesh tracking as a prerequisite and brings significant quality improvements compared to mesh-based counterparts. In our experiments, we carefully examine our design choices and demonstrate the efficacy of our approach, outperforming the state-of-the-art methods on challenging driving scenarios.
This paper presents a novel system for generating free-viewpoint videos of multiple human performers from very sparse RGB cameras. The system reconstructs a layered neural representation of the dynamic multi-person scene from multi-view videos with each layer representing a moving instance or static background. Unlike previous work that requires instance segmentation as input, a novel approach is proposed to decompose the multi-person scene into layers and reconstruct neural representations for each layer in a weakly-supervised manner, yielding both high-quality novel view rendering and accurate instance masks. Camera synchronization error is also addressed in the proposed approach. The experiments demonstrate the better view synthesis quality of the proposed system compared to previous ones and the capability of producing an editable free-viewpoint video of a real soccer game using several asynchronous GoPro cameras. The dataset and code are available at https://github.com/zju3dv/EasyMocap .
We propose VoLux-GAN, a generative framework to synthesize 3D-aware faces with convincing relighting. Our main contribution is a volumetric HDRI relighting method that can efficiently accumulate albedo, diffuse and specular lighting contributions along each 3D ray for any desired HDR environmental map. Additionally, we show the importance of supervising the image decomposition process using multiple discriminators. In particular, we propose a data augmentation technique that leverages recent advances in single image portrait relighting to enforce consistent geometry, albedo, diffuse and specular components. Multiple experiments and comparisons with other generative frameworks show how our model is a step forward towards photorealistic relightable 3D generative models. Code and pre-trained models are available at: https://github.com/google/volux-gan.
A 3D caricature is an exaggerated 3D depiction of a human face. The goal of this paper is to model the variations of 3D caricatures in a compact parameter space so that we can provide a useful data-driven toolkit for handling 3D caricature deformations. To achieve the goal, we propose an MLP-based framework for building a deformable surface model, which takes a latent code and produces a 3D surface. In the framework, a SIREN MLP models a function that takes a 3D position on a fixed template surface and returns a 3D displacement vector for the input position. We create variations of 3D surfaces by learning a hypernetwork that takes a latent code and produces the parameters of the MLP. Once learned, our deformable model provides a nice editing space for 3D caricatures, supporting label-based semantic editing and point-handle-based deformation, both of which produce highly exaggerated and natural 3D caricature shapes. We also demonstrate other applications of our deformable model, such as automatic 3D caricature creation. Our code and supplementary materials are available at https://github.com/ycjungSubhuman/DeepDeformable3DCaricatures.
Animating a single face photo is an important research topic which receives considerable attention in computer vision and graphics. Yet line drawings for face portraits, which is a longstanding and popular art form, have not been explored much in this area. Simply concatenating a realistic talking face video generation model with a photo-to-drawing style transfer module suffers from severe inter-frame discontinuity issues. To address this new challenge, we propose a novel framework to generate artistic talking portrait-line-drawing video, given a single face photo and a speech signal. After predicting facial landmark movements from the input speech signal, we propose a novel GAN model to simultaneously handle domain transfer (from photo to drawing) and facial geometry change (according to the predicted facial landmarks). To address the inter-frame discontinuity issues, we propose two novel temporal coherence losses: one based on warping and the other based on a temporal coherence discriminator. Experiments show that our model produces high quality artistic talking portrait-line-drawing videos and outperforms baseline methods. We also show our method can be easily extended to other artistic styles and generate good results. The source code is available at https://github.com/AnimatePortrait/AnimatePortrait .
Although significant progress has been made to audio-driven talking face generation, existing methods either neglect facial emotion or cannot be applied to arbitrary subjects. In this paper, we propose the Emotion-Aware Motion Model (EAMM) to generate one-shot emotional talking faces by involving an emotion source video. Specifically, we first propose an Audio2Facial-Dynamics module, which renders talking faces from audio-driven unsupervised zero- and first-order key-points motion. Then through exploring the motion model’s properties, we further propose an Implicit Emotion Displacement Learner to represent emotion-related facial dynamics as linearly additive displacements to the previously acquired motion representations. Comprehensive experiments demonstrate that by incorporating the results from both modules, our method can generate satisfactory talking face results on arbitrary subjects with realistic emotion patterns.