Recent research on generative models has primarily focused on creating product-ready visual outputs; however, designers often favor access to standardized asset libraries, a domain that has yet to be significantly enhanced by generative capabilities. Although open-world scenes provide ample raw materials for designers, efficiently extracting high-quality, standardized assets remains a challenge. To address this, we introduce AssetDropper, the first generative framework designed to extract any asset from reference images, providing artists with an open-world asset palette. Our model adeptly extracts a front view of selected subjects from input images, effectively handling complex scenarios such as perspective distortion and subject occlusion. We establish a synthetic dataset of more than 200,000 image-subject pairs and a real-world benchmark with thousands more for evaluation, facilitating future research on downstream tasks. Furthermore, to ensure precise asset extraction that aligns well with the image prompts, we employ a pre-trained reward model to close the loop with feedback. We design the reward model to perform the inverse task of pasting the extracted asset back into the reference source, which provides an additional consistency signal during training and mitigates hallucination. Extensive experiments show that, with the aid of reward-driven optimization, AssetDropper achieves state-of-the-art results in asset extraction. Our code and dataset are available at .
Automated tools for video editing and assembly have applications ranging from filmmaking and advertising to content creation for social media. Previous video editing work has mainly focused on either retrieval or user interfaces, leaving the actual editing to the user. In contrast, we propose to automate the core task of video editing, formulating it as a sequential decision-making process. Ours is a multi-agent approach: we design an Editor agent and a Critic agent. The Editor takes as input a collection of video clips together with natural language instructions and uses tools commonly found in video editing software to produce an edited sequence. The Critic, in turn, gives natural language feedback to the Editor based on the produced sequence, or renders it if it is satisfactory. We introduce a learning-based approach for enabling effective communication across specialized agents to address the language-driven video editing task. Finally, we explore an LLM-as-a-judge metric for evaluating the quality of a video editing system and compare it with general human preference. We evaluate our system’s output video sequences qualitatively and quantitatively through a user study and find that our system vastly outperforms existing approaches in terms of coverage, time constraint satisfaction, and human preference. Please see our companion supplemental video for qualitative results.
Walk on stars (WoSt) has proven powerful in Monte Carlo methods for solving partial differential equations, but its sampling techniques remain unsatisfactory, leading to high variance. We propose a guiding-based importance sampling method to reduce the variance of WoSt. Drawing inspiration from path guiding in rendering, we approximate the directional distribution of the recursive term of WoSt using online-learned parametric mixture distributions, decoded by a lightweight neural field. This adaptive approach enables importance sampling of the recursive term, whose shape is unknown before computation. We introduce a reflection technique to represent guiding distributions at Neumann boundaries and incorporate multiple importance sampling with learnable selection probabilities to further reduce variance. We also present a practical GPU implementation of our method. Experiments show that our method effectively reduces variance compared to the original WoSt, given the same time or sample budget. Code and data for this paper are at .
In this work, we present a novel approach for motion customization in video generation, addressing the widespread gap in the exploration of motion representation within video generative models. Recognizing the unique challenges posed by the spatiotemporal nature of video, our method introduces Motion Embeddings, a set of explicit, temporally coherent embeddings derived from a given video. These embeddings are designed to integrate seamlessly with the temporal transformer modules of video diffusion models, modulating self-attention computations across frames without compromising spatial integrity. Our approach provides a compact and efficient solution to motion representation, utilizing two types of embeddings: a Motion Query-Key Embedding to modulate the temporal attention map and a Motion Value Embedding to modulate the attention values. Additionally, we introduce an inference strategy that excludes spatial dimensions from the Motion Query-Key Embedding and applies a debias operation to the Motion Value Embedding, both designed to debias appearance and ensure the embeddings focus solely on motion. Our contributions include the introduction of a tailored motion embedding for customization tasks and a demonstration of the practical advantages and effectiveness of our method through extensive experiments. Project page: https://wileewang.github.io/MotionInversion/
Video stabilization is pivotal for video processing, as it removes unwanted shakiness while preserving the original user motion intent. Existing approaches, depending on the domain in which they operate, suffer from several issues (e.g., geometric distortions, excessive cropping, poor generalization) that degrade the user experience. To address these issues, we introduce GaVS, a novel 3D-grounded approach that reformulates video stabilization as a temporally consistent ‘local reconstruction and rendering’ paradigm. Given 3D camera poses, we augment a reconstruction model to predict Gaussian Splatting primitives and finetune it at test time, with multi-view dynamics-aware photometric supervision and cross-frame regularization, to produce temporally consistent local reconstructions. The model is then used to render each stabilized frame. We utilize a scene extrapolation module to avoid frame cropping. Our method is evaluated on a repurposed dataset, enriched with 3D-grounded information, covering samples with diverse camera motions and scene dynamics. Quantitatively, our method is competitive with or superior to state-of-the-art 2D and 2.5D approaches in terms of conventional task metrics and a new geometry consistency metric. Qualitatively, our method produces noticeably better results compared to alternatives, as validated by a user study. Project Page: sinoyou.github.io/gavs.
Personalizing text-to-image models to generate images of specific subjects across diverse scenes and styles is a rapidly advancing field. Current approaches often face challenges in maintaining a balance between identity preservation and alignment with the input text prompt. Some methods rely on a single textual token to represent a subject, which limits expressiveness, while others employ richer representations but disrupt the model’s prior, diminishing prompt alignment. In this work, we introduce Nested Attention, a novel mechanism that injects a rich and expressive image representation into the model’s existing cross-attention layers. Our key idea is to generate query-dependent subject values, derived from nested attention layers that learn to select relevant subject features for each region in the generated image. We integrate these nested layers into an encoder-based personalization method, and show that they enable high identity preservation while adhering to input text prompts. Our approach is general and can be trained on various domains. Additionally, its prior preservation allows us to combine multiple personalized subjects from different domains in a single image.
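As an illustration only, the query-dependent subject values described above can be sketched as an inner attention over encoded subject features; the shapes and the way the result feeds back into the host cross-attention are simplified assumptions, not the paper's exact architecture.

```python
import torch


def nested_subject_values(q_img, k_subj, v_subj):
    """Sketch of query-dependent subject values: every spatial query of the
    generated image attends over encoded subject-image features and receives
    its own subject value.
    q_img: (B, N_img, d); k_subj, v_subj: (B, N_subj, d). Shapes are illustrative.
    """
    scale = k_subj.shape[-1] ** 0.5
    attn = torch.softmax(q_img @ k_subj.transpose(1, 2) / scale, dim=-1)  # (B, N_img, N_subj)
    return attn @ v_subj  # (B, N_img, d): one subject value per image query

# In the host cross-attention layer, these per-query values could then stand in
# for the value of the single personalization token, so each image region reads
# the subject features most relevant to it (a sketch of the idea only).
```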
Personalizing generative text-to-image models has seen remarkable progress, but extending this personalization to text-to-video models presents unique challenges. Unlike static concepts, personalizing text-to-video models has the potential to capture dynamic concepts – entities defined not only by their appearance but also by their motion. In this paper, we introduce Set-and-Sequence, a novel framework for personalizing Diffusion Transformer (DiT)-based generative video models with dynamic concepts. Our approach imposes a spatio-temporal weight space within an architecture that does not explicitly separate spatial and temporal features. This is achieved in two key stages. First, we fine-tune Low-Rank Adaptation (LoRA) layers using an unordered set of frames from the video to learn an appearance LoRA basis that represents the appearance, free from temporal interference. In the second stage, with the appearance LoRAs frozen, we augment their coefficients with Motion Residuals and fine-tune them on the full video sequence, capturing motion dynamics. The resulting spatio-temporal weight space of our Set-and-Sequence framework effectively embeds dynamic concepts into the video model’s output domain, enabling unprecedented editability and compositionality, and setting a new benchmark for personalizing dynamic concepts.
Action customization involves generating videos in which the subject performs actions dictated by input control signals. Current methods use pose-guided or global motion customization but are limited by strict constraints on spatial structure, such as layout, skeleton, and viewpoint consistency, reducing adaptability across diverse subjects and scenarios. To overcome these limitations, we propose FlexiAct, which transfers actions from a reference video to an arbitrary target image. Unlike existing methods, FlexiAct allows for variations in layout, viewpoint, and skeletal structure between the subject of the reference video and the target image, while maintaining identity consistency. Achieving this requires precise action control, spatial structure adaptation, and consistency preservation. To this end, we introduce RefAdapter, a lightweight image-conditioned adapter that excels in spatial adaptation and consistency preservation, surpassing existing methods in balancing appearance consistency and structural flexibility. Additionally, we observe that the denoising process exhibits varying levels of attention to motion (low frequency) and appearance details (high frequency) at different timesteps. We therefore propose FAE (Frequency-aware Action Extraction), which, unlike existing methods that rely on separate spatial-temporal architectures, achieves action extraction directly during the denoising process. Experiments demonstrate that our method effectively transfers actions to subjects with diverse layouts, skeletons, and viewpoints. We release our code and model weights to support further research at FlexiAct.
We present a grid-free fluid solver featuring a novel Gaussian representation. Drawing inspiration from the expressive capabilities of 3D Gaussian Splatting in multi-view image reconstruction, we model the continuous flow velocity as a weighted sum of multiple Gaussian functions. This representation is continuously differentiable, which enables us to derive spatial differentials directly and solve the time-dependent PDE via a custom first-order optimization tailored to fluid dynamics. Compared to traditional discretizations, which typically adopt Eulerian, Lagrangian, or hybrid perspectives, our approach is inherently memory-efficient and spatially adaptive, enabling it to preserve fine-scale structures and vortices with high fidelity. While these advantages are also sought by implicit neural representations, GSR offers enhanced robustness, accuracy, and generality across diverse fluid phenomena, with improved computational efficiency during temporal evolution. Though our first-order solver does not yet match the speed of fluid solvers using explicit representations, its continuous nature substantially reduces spatial discretization error and opens a new avenue for high-fidelity simulation. We evaluate the proposed solver across a broad range of 2D and 3D fluid phenomena, demonstrating its ability to preserve intricate vortex dynamics, accurately capture boundary-induced effects such as Kármán vortex streets, and remain robust across long time horizons—all without additional parameter tuning. Our results suggest that GSR offers a compelling direction for future research in fluid simulation. The source code for our fluid solver is publicly available at .
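A minimal sketch of such a representation, under an assumed anisotropic-Gaussian parameterization that may differ from the paper's exact kernel: velocity is a weighted sum of Gaussians, and its spatial Jacobian is available in closed form precisely because the representation is smooth.

```python
import numpy as np


def velocity(x, centers, covs_inv, weights):
    """u(x) = sum_i w_i * exp(-0.5 (x - c_i)^T S_i^{-1} (x - c_i)).
    x: (d,), centers: (N, d), covs_inv: (N, d, d), weights: (N, d_out)."""
    diff = x - centers                                   # (N, d)
    q = np.einsum('ni,nij,nj->n', diff, covs_inv, diff)  # Mahalanobis terms
    g = np.exp(-0.5 * q)                                 # (N,)
    return (weights * g[:, None]).sum(axis=0)            # (d_out,)


def velocity_jacobian(x, centers, covs_inv, weights):
    """Closed-form spatial Jacobian du/dx, since
    d/dx exp(-0.5 q_i) = -exp(-0.5 q_i) * S_i^{-1} (x - c_i)."""
    diff = x - centers
    q = np.einsum('ni,nij,nj->n', diff, covs_inv, diff)
    g = np.exp(-0.5 * q)
    grad_g = -g[:, None] * np.einsum('nij,nj->ni', covs_inv, diff)  # (N, d)
    return np.einsum('nk,nd->kd', weights, grad_g)        # (d_out, d)
```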
This paper introduces a novel grid structure that extends tall cell methods for efficient deep water simulation. Unlike previous tall cell methods, which are designed to capture all the fine details around liquid surfaces, our approach subdivides tall cells horizontally, allowing for more aggressive adaptivity and a significant reduction in the number of cells. The foundation of our method lies in a new variational formulation of the pressure Poisson equation tailored for tall-cell grids, which naturally handles transitions between variable-sized cells. This variational view not only permits the use of the well-proven conjugate gradient method but also facilitates monolithic two-way coupling with rigid bodies. The key distinction between our method and previous general adaptive approaches, such as tetrahedral or octree grids, is the simplified construction of the adaptive grid. Our method performs grid subdivision in a quadtree fashion rather than an octree fashion. These 2D cells are then simply extended vertically to complete the tall cell population. We demonstrate that this novel form of adaptivity, which we refer to as quadtree tall cells, delivers superior performance compared to traditional uniform tall cells.
We introduce a novel method for controlling a motion sequence with an arbitrary temporal control sequence via temporal alignment. Temporal alignment of motion has gained significant attention owing to its applications in motion control and retargeting. Traditional methods rely on either learned or hand-crafted cross-domain mappings between frames in the original and control domains, which often require large, paired, or annotated datasets and time-consuming training. Our approach, named Metric-Aligning Motion Matching, achieves alignment by considering only within-domain distances. It computes distances among patches in each domain and seeks a matching that optimally aligns the two sets of within-domain distances. This framework allows a motion sequence to be aligned to various types of control sequences, including sketches, labels, audio, and other motion sequences, all without the need for manually defined mappings or training on annotated data. We demonstrate the effectiveness of our approach through applications in efficient motion control, showcasing its potential in practical scenarios.
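As a toy illustration of the matching objective only: given within-domain distance matrices for the two sequences, a matching is scored by how well it aligns the two distance structures. The brute-force search and squared-difference cost below are illustrative; the paper's optimization is presumably far more efficient.

```python
import itertools
import numpy as np


def align_by_within_domain_distances(D_motion, D_control):
    """Find the assignment of control patches to motion patches whose pairwise
    distance structure best matches the motion's own distances.
    D_motion: (n, n) distances among motion patches.
    D_control: (m, m) distances among control patches, m >= n assumed here.
    Returns (pi, cost) with pi[i] = control patch matched to motion patch i."""
    n, m = D_motion.shape[0], D_control.shape[0]
    best, best_cost = None, np.inf
    for pi in itertools.permutations(range(m), n):   # only feasible for tiny n, m
        idx = np.array(pi)
        cost = np.sum((D_motion - D_control[np.ix_(idx, idx)]) ** 2)
        if cost < best_cost:
            best, best_cost = pi, cost
    return best, best_cost
```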
Keyframing has long been the cornerstone of standard character animation pipelines, offering precise control over detailed postures and dynamics. However, this approach is labor-intensive, necessitating significant manual effort. Automating this process while balancing the trade-off between minimizing manual input and maintaining full motion control has therefore been a central research challenge. In this work, we introduce AutoKeyframe, a novel framework that simultaneously accepts dense and sparse control signals for motion generation by generating keyframes directly. Dense signals govern the overall motion trajectory, while sparse signals define critical key postures at specific timings. This approach substantially reduces manual input requirements while preserving precise control over motion. The generated keyframes can be easily edited to serve as detailed control signals. AutoKeyframe operates by automatically generating keyframes from dense root positions, which can be determined through arc-length parameterization of the trajectory curve. This process is powered by an autoregressive diffusion model, which facilitates keyframe generation and incorporates a skeleton-based gradient guidance technique for sparse spatial constraints and frame editing. Extensive experiments demonstrate the efficacy of AutoKeyframe, achieving high-quality motion synthesis with precise and intuitive control.
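As an illustration of the arc-length step, here is a minimal sketch (names hypothetical, not the paper's implementation) of resampling a dense root trajectory at equal arc-length stations, one way to obtain evenly spaced candidate keyframe roots from a trajectory curve.

```python
import numpy as np


def resample_by_arc_length(root_positions, num_samples):
    """Resample a root trajectory at equal arc-length intervals.
    root_positions: (T, 3) array of root positions over time.
    Returns (num_samples, 3) positions spaced uniformly along the curve."""
    seg = np.linalg.norm(np.diff(root_positions, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])        # cumulative arc length
    targets = np.linspace(0.0, s[-1], num_samples)     # equal arc-length stations
    out = np.empty((num_samples, root_positions.shape[1]))
    for d in range(root_positions.shape[1]):
        out[:, d] = np.interp(targets, s, root_positions[:, d])
    return out
```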
Generating motion for arbitrary skeletons is a longstanding challenge in computer graphics, remaining largely unexplored due to the scarcity of diverse datasets and the irregular nature of the data. In this work, we introduce AnyTop, a diffusion model that generates motions for diverse characters with distinct motion dynamics, using only their skeletal structure as input. Our work features a transformer-based denoising network, tailored for arbitrary skeleton learning, integrating topology information into the traditional attention mechanism. Additionally, by incorporating textual joint descriptions into the latent feature representation, AnyTop learns semantic correspondences between joints across diverse skeletons. Our evaluation demonstrates that AnyTop generalizes well, even with as few as three training examples per topology, and can produce motions for unseen skeletons as well. Furthermore, our model’s latent space is highly informative, enabling downstream tasks such as joint correspondence, temporal segmentation, and motion editing. Our webpage, , includes links to videos and code.
We present a method for simulating deformable bodies in four spatial dimensions. To accomplish this, we generalize several pieces of the traditional simulation pipeline. Starting from the meshing stage, we propose a simple method for generating a pentachoral mesh, the 4D analog of a tetrahedral mesh. Next, we show how to generalize the deformation invariants, allowing us to construct 4D hyperelastic energies that lead directly to hyper-dimensional deformation forces. Finally, we formulate collision detection and response in 4D. Our eigenanalyses of the resulting deformation and collision energies generalize to arbitrarily higher dimensions. The resulting simulations display a variety of previously unseen visual phenomena.
From movie characters to modern science fiction — bringing characters into interactive, story-driven conversations has captured imaginations across generations. Achieving this vision is highly challenging and requires much more than just language modeling. It involves numerous complex AI challenges, such as conversational AI, maintaining character integrity, managing personality and emotions, handling knowledge and memory, synthesizing voice, generating animations, enabling real-world interactions, and integration with physical environments. Recent advancements in the development of foundation models, prompt engineering, and fine-tuning for downstream tasks have enabled researchers to address these individual challenges. However, combining these technologies for interactive characters remains an open problem. We present a system and platform for conveniently designing believable digital characters, enabling a conversational and story-driven experience while providing solutions to all of the technical challenges. As a proof-of-concept, we introduce Digital Einstein, which allows users to engage in conversations with a digital representation of Albert Einstein about his life, research, and persona. While Digital Einstein exemplifies our methods for a specific character, our system is flexible and generalizes to any story-driven or conversational character. By unifying these diverse AI components into a single, easy-to-adapt platform, our work paves the way for immersive character experiences, turning the dream of lifelike, story-based interactions into a reality.
Drawing is an artistic process involving extensive observation. Understanding how professional artists observe as they draw has significant value because it offers insight into their perception patterns and acquired skills. While previous studies used eye tracking to analyze the drawing process, they fell short in aligning gaze data with drawing actions due to the spatial and temporal gaps between observation and drawing in a model-to-paper setup. This paper presents a study in an image-to-image setup, in which artists observe a reference image and draw on a blank canvas on a tablet, capturing a clearer mapping between eye movements and drawn strokes. Our analysis demonstrates a strong spatial correlation between observed regions and corresponding strokes. We further find that artists initially follow a more structured region-by-region approach and then switch to a less constrained sequence for details. Based on these findings, we develop an assistive interface that integrates real-time visual guidance from professional artists’ eye tracking data, enabling novices to emulate their observation and drawing strategies. A user study shows that novices can draw significantly more accurate shapes using our assistive interface, highlighting the importance of modeling observation and the potential of leveraging eye tracking data in future educational and creativity support tools. Our datasets, analysis code, and assistive interface are available at .
We present HOIGaze – a novel learning-based approach for gaze estimation during hand-object interactions (HOI) in extended reality (XR). HOIGaze addresses the challenging HOI setting by building on one key insight: Eye, hand, and head movements are closely coordinated during HOIs and this coordination can be exploited to identify samples that are most useful for gaze estimator training – as such, effectively denoising the training data. This denoising approach is in stark contrast to previous gaze estimation methods that treated all training samples as equal. Specifically, we propose: 1) a novel hierarchical framework that first recognises the hand currently visually attended to and then estimates gaze direction based on the attended hand; 2) a new gaze estimator that uses cross-modal Transformers to fuse head and hand-object features extracted using a convolutional neural network and a spatio-temporal graph convolutional network; and 3) a novel eye-head coordination loss that upgrades training samples belonging to the coordinated eye-head movements. We evaluate HOIGaze on the HOT3D and Aria digital twin (ADT) datasets and show that it significantly outperforms state-of-the-art methods, achieving an average improvement of 15.6% on HOT3D and 6.0% on ADT in mean angular error. To demonstrate the potential of our method, we further report significant performance improvements for the sample downstream task of eye-based activity recognition on ADT. Taken together, our results underline the significant information content available in eye-hand-head coordination and, as such, open up an exciting new direction for learning-based gaze estimation.
Realistic hand manipulation is a key component of immersive virtual reality (VR), yet existing methods often rely on a kinematic approach or motion-capture datasets that omit crucial physical attributes such as contact forces and finger torques. Consequently, these approaches prioritize tight, one-size-fits-all grips rather than reflecting users’ intended force levels. We present ForceGrip, a deep learning agent that synthesizes realistic hand manipulation motions, faithfully reflecting the user’s grip force intention. Instead of mimicking predefined motion datasets, ForceGrip uses generated training scenarios—randomizing object shapes, wrist movements, and trigger input flows—to challenge the agent with a broad spectrum of physical interactions. To effectively learn from these complex tasks, we employ a three-phase curriculum learning framework comprising Finger Positioning, Intention Adaptation, and Dynamic Stabilization. This progressive strategy ensures stable hand-object contact, adaptive force control based on user inputs, and robust handling under dynamic conditions. Additionally, a proximity reward function enhances natural finger motions and accelerates training convergence. Quantitative and qualitative evaluations reveal ForceGrip’s superior force controllability and plausibility compared to state-of-the-art methods. Demo videos are available as supplementary material and the code is provided at .
While haptic interfaces for virtual reality (VR) have received extensive research attention, on-face haptics in VR remains less explored, especially for virtual food intake. In this paper, we introduce VirCHEW Reality, a face-worn haptic device designed to provide on-face kinesthetic force feedback to enhance the virtual food-chewing experience in VR. Leveraging a pneumatic actuation system, VirCHEW Reality controls air inflation and deflation to simulate the mechanical properties of food textures, such as hardness, cohesiveness, and stickiness. We evaluated the system through three user studies. First, a just-noticeable difference (JND) study examined users’ sensitivity to, and the system’s capability of rendering, different levels of on-face pneumatic kinesthetic feedback while users performed chewing actions. Building on the user-distinguishable signal ranges found in the first study, we then conducted a matching study to explore the correspondence between the kinesthetic stimuli provided by our device and user-perceived food textures, revealing its capability to simulate food texture properties during chewing (e.g., hardness, cohesiveness, stickiness). Finally, a user study in a VR eating scenario showed that VirCHEW Reality significantly improved users’ ratings of presence compared to the condition without haptic feedback. Our findings further highlight possible applications in virtual/remote dining, healthcare, and immersive entertainment.
We leverage repetitive elements in 3D scenes to improve novel view synthesis. Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have greatly improved novel view synthesis but renderings of unseen and occluded parts remain low-quality if the training views are not exhaustive enough. Our key observation is that our environment is often full of repetitive elements. We propose to leverage those repetitions to improve the reconstruction of low-quality parts of the scene due to poor coverage and occlusions. We propose a method that segments each repeated instance in a 3DGS reconstruction, registers them together, and allows information to be shared among instances. Our method improves the geometry while also accounting for appearance variations across instances. We demonstrate our method on a variety of synthetic and real scenes with typical repetitive elements, leading to a substantial improvement in the quality of novel view synthesis.
We present a fast and simple technique to convert images into a radiance surface-based scene representation. Building on existing radiance volume reconstruction algorithms, we introduce a subtle yet impactful modification of the loss function requiring changes to only a few lines of code: instead of integrating the radiance field along rays and supervising the resulting images, we project the training images into the scene to directly supervise the spatio-directional radiance field.
The primary outcome of this change is the complete removal of alpha blending and ray marching from the image formation model, instead moving these steps into the loss computation. In addition to promoting convergence to surfaces, this formulation assigns explicit semantic meaning to 2D subsets of the radiance field, turning them into well-defined radiance surfaces. We finally extract a level set from this representation, which results in a high-quality radiance surface model.
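A minimal sketch of what moving alpha blending into the loss can look like, assuming NeRF-style per-ray samples with densities; the exact formulation in the paper may differ in details.

```python
import torch


def radiance_surface_loss(sigma, color, deltas, pixel_rgb):
    """Per-ray loss sketch. sigma: (S,) densities, color: (S, 3) radiance,
    deltas: (S,) step sizes, pixel_rgb: (3,) ground-truth pixel color.
    Standard volume rendering composites first and then compares:
        || sum_i w_i c_i - pixel ||^2
    Here each sample is instead supervised directly by the projected pixel
    color, weighted by its compositing weight:
        sum_i w_i || c_i - pixel ||^2
    """
    alpha = 1.0 - torch.exp(-sigma * deltas)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha + 1e-10])[:-1], dim=0)
    w = alpha * trans                                   # compositing weights
    return (w * ((color - pixel_rgb) ** 2).sum(dim=-1)).sum()
```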
Our method retains much of the speed and quality of the baseline algorithm. For instance, a suitably modified variant of Instant NGP maintains comparable computational efficiency, while achieving an average PSNR that is only 0.1 dB lower. Most importantly, our method generates explicit surfaces in place of an exponential volume, doing so with a level of simplicity not seen in prior work.
Lie sphere geometry provides a unified representation of points, oriented spheres, and hyperplanes in Euclidean d-space as the subset of lines in \( \mathbb{R}^{d+3} \) that are contained in a certain quadric. The natural scalar product in this construction is zero if two elements are in oriented contact. We show how the sign of this product can be used to decide whether spheres are disjoint. This allows us to model the space of spheres that do not intersect a given union of spheres as the intersection of half-spaces (and the quadric). The maximal spheres lie on the boundary of this set and can be computed by first constructing the intersection of half-spaces, which is a convex hull problem, and then intersecting the edges of the hull against the quadric, which amounts to finding the roots of univariate quadratics. We demonstrate the method on the example of contouring a discrete signed distance field: every sample of the signed distance field represents an empty sphere, and the zero-level contour has to be disjoint from the union of these spheres. Maximal spheres outside the empty spheres provide samples on the zero-level contour. The quality of this sample set is comparable to existing methods relying on optimization, while being deterministic and faster in practice.
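A hedged sketch of the underlying sign test; exact Lie coordinates and sign conventions vary by reference, so the expressions below are stated only up to a nonzero factor.

```latex
% For oriented spheres S_i with centers c_i and signed radii r_i,
\langle S_1, S_2 \rangle \;\propto\; \lVert c_1 - c_2 \rVert^2 - (r_1 - r_2)^2 ,
% which vanishes exactly at oriented contact. Flipping the orientation of one
% sphere (r_2 \mapsto -r_2) turns the same expression into
\lVert c_1 - c_2 \rVert^2 - (r_1 + r_2)^2 ,
% which, for positive radii, is positive precisely when the two spheres lie
% strictly outside one another, i.e. are disjoint.
```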
In this paper, we tackle an important yet often overlooked question: What is the optimal mesh resolution for cloth simulation, without relying on preliminary simulations? The optimal resolution should be sufficient to capture fine details of all potential wrinkles, while avoiding an unnecessarily high resolution that wastes computational time and memory on excessive vertices. This challenge stems from the complex nature of wrinkle distribution, which varies spatially, temporally, and anisotropically across different orientations. To address this, we propose a method to estimate the optimal cloth mesh resolution, based on two key factors: material stiffness and boundary conditions.
To determine the influence of material stiffness on wrinkle wavelength and amplitude, we apply the experimental theory presented by Cerda and Mahadevan [2003] to calculate the optimal mesh resolution for cloth fabrics. Similarly, for boundary conditions influencing local wrinkle formation, we use the same scaling law to determine the source resolution for stationary boundary conditions introduced by garment-making techniques such as shirring, folding, stitching, and down-filling, as well as for predicted regions where dynamic wrinkles arise from collision-induced compression during human motion. To ensure a smooth transition between different source resolutions, we apply another experimental theory from [Vandeparre et al. 2011] to compute the transition distance. A mesh sizing map is introduced to facilitate smooth transitions, ensuring precision in critical areas while maintaining efficiency in less important regions. Based on these sizing maps, triangular meshes with optimal resolution distribution are generated using Poisson sampling and Delaunay triangulation. The resulting method can not only enhance the realism and precision of cloth simulations but also support diverse application scenarios, making it a versatile solution for complex garment design.
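As a rough guide to how such a scaling law turns into a mesh-sizing rule (the constants and the effective stiffness depend on the loading case, so this is an order-of-magnitude sketch rather than the paper's exact formula):

```latex
% B = bending stiffness of the fabric, K = effective foundation/tension
% stiffness of the particular loading case, h = target mesh edge length.
\lambda \;\sim\; \left(\frac{B}{K}\right)^{1/4},
\qquad
h \;\lesssim\; \frac{\lambda}{n} \quad (n \approx 4\text{--}8 \text{ elements per wavelength}).
```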
The concept of the Internet of Things (IoT) has driven the development of system-on-a-chip (SoC) technology for embedded and mobile systems, which may define the future of next-generation computation. In SoC devices, efficient cloth and deformable body simulations require parallelized, heterogeneous computation across multiple processing units. The key challenge in heterogeneous computation lies in task distribution, which must account for varying inter-task dependencies and communication costs. This paper proposes a novel framework for automated task scheduling to optimize simulation performance by minimizing communication overhead and aligning tasks with the specific strengths of each device. To achieve this, we introduce an efficient scheduling method based on the Heterogeneous Earliest Finish Time (HEFT) algorithm, adapted for hybrid systems. We model simulation tasks—such as those in iterative methods like Jacobi and Gauss-Seidel—as a Directed Acyclic Graph (DAG). To maximize the parallelism of nonlinear Gauss-Seidel simulation tasks, we present an innovative asynchronous Gauss-Seidel method with specialized data synchronization across units. Additionally, we employ task merging and tailored task-sorting strategies for Gauss-Seidel tasks to achieve an optimal balance between convergence and efficiency. We validate the effectiveness of our framework across various simulations, including XPBD, vertex block descent, and second-order stencil descent, using Apple M-series processors with both CPU and GPU cores. By maximizing computational efficiency and reducing processing times, our method achieves superior simulation frame rates compared to approaches that rely on individual devices in isolation. The source code with hybrid Metal/C++ implementation is available at https://github.com/ChengzhuUwU/libAtsSim.
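For reference, the flavor of the underlying HEFT heuristic can be sketched as follows; this is a generic textbook-style version with hypothetical names, not the paper's adapted hybrid-system implementation (which additionally handles insertion slots, task merging, and synchronization).

```python
def heft_schedule(tasks, succ, comp_cost, comm_cost, procs):
    """Minimal HEFT sketch. tasks: ids of a DAG; succ[t]: successors of t;
    comp_cost[t][p]: cost of t on processor p; comm_cost[(t, s)]: transfer cost
    if t and s run on different processors; procs: processor ids.
    Assumes positive computation costs so predecessors rank strictly higher."""
    # 1) Upward rank: average computation cost plus the costliest path to an exit task.
    rank = {}
    def upward_rank(t):
        if t in rank:
            return rank[t]
        avg = sum(comp_cost[t][p] for p in procs) / len(procs)
        rank[t] = avg + max((comm_cost.get((t, s), 0.0) + upward_rank(s)
                             for s in succ[t]), default=0.0)
        return rank[t]
    for t in tasks:
        upward_rank(t)

    # 2) Greedy assignment in decreasing rank order, picking the processor with
    #    the earliest finish time (insertion-based slot search omitted for brevity).
    preds = {t: [d for d in tasks if t in succ[d]] for t in tasks}
    proc_free = {p: 0.0 for p in procs}
    finish, placed_on, schedule = {}, {}, []
    for t in sorted(tasks, key=lambda u: -rank[u]):
        best = None
        for p in procs:
            ready = max((finish[d] + (0.0 if placed_on[d] == p
                                      else comm_cost.get((d, t), 0.0))
                         for d in preds[t]), default=0.0)
            start = max(ready, proc_free[p])
            eft = start + comp_cost[t][p]
            if best is None or eft < best[0]:
                best = (eft, p, start)
        eft, p, start = best
        proc_free[p], finish[t], placed_on[t] = eft, eft, p
        schedule.append((t, p, start, eft))
    return schedule
```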
Cutting thin-walled deformable structures is common in daily life, but poses significant challenges for simulation due to the introduced spatial discontinuities. Traditional methods rely on mesh-based domain representations, which require frequent remeshing and refinement to accurately capture evolving discontinuities. These challenges are further compounded in reduced-space simulations, where the basis functions are inherently geometry- and mesh-dependent, making it difficult or even impossible for the basis to represent the diverse family of discontinuities introduced by cuts.
Recent advances in representing basis functions with neural fields offer a promising alternative, leveraging their discretization-agnostic nature to represent deformations across varying geometries. However, the inherent continuity of neural fields is an obstruction to generalization, particularly if discontinuities are encoded in neural network weights.
We present Wind Lifter, a novel neural representation designed to accurately model complex cuts in thin-walled deformable structures. Our approach constructs neural fields that reproduce discontinuities precisely at specified locations, without “baking in” the position of the cut line. To achieve this, we augment the input coordinates of the neural field with the generalized winding number of any given cut line, effectively lifting the input from two to three dimensions. Lifting allows the network to focus on the easier problem of learning a 3D everywhere-continuous volumetric field, while a corresponding restriction operator enables the final output field to precisely resolve strict discontinuities. Crucially, our approach does not embed the discontinuity in the neural network’s weights, opening avenues to generalization of cut placement.
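A minimal 2D sketch of the lifting idea, with hypothetical names: append the generalized winding number of the cut polyline to the query coordinates before feeding them to the neural field, so a continuous field over the lifted input can still be discontinuous across the cut when restricted back to the original coordinates.

```python
import numpy as np


def winding_number_2d(query, polyline):
    """Generalized winding number of an open 2D cut polyline at a query point:
    the sum of signed angles subtended by each segment, normalized by 2*pi.
    It is smooth away from the cut and jumps across it, which is what restores
    the discontinuity after lifting."""
    a = polyline[:-1] - query                      # (S, 2) segment starts
    b = polyline[1:] - query                       # (S, 2) segment ends
    angles = np.arctan2(a[:, 0] * b[:, 1] - a[:, 1] * b[:, 0],  # cross
                        (a * b).sum(axis=1))                    # dot
    return angles.sum() / (2.0 * np.pi)


def lifted_input(query, cut_polyline):
    """Lift a 2D material coordinate to 3D by appending its winding number;
    the lifted coordinate is what the everywhere-continuous neural field sees."""
    w = winding_number_2d(query, cut_polyline)
    return np.array([query[0], query[1], w])
```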
Our method achieves real-time simulation speeds and supports dynamic updates to cut line geometry during the simulation. Moreover, the explicit representation of discontinuities makes our neural field intuitive to control and edit, offering a significant advantage over traditional neural fields, where discontinuities are embedded within the network’s weights, and enabling new applications that rely on general cut placement.
The generation of high-quality, animatable 3D head avatars from text has enormous potential in content creation applications such as games, movies, and embodied virtual assistants. Current text-to-3D generation methods typically combine parametric head models with 2D diffusion models using score distillation sampling to produce 3D-consistent results. However, they struggle to synthesize realistic details and suffer from misalignments between the appearance and the driving parametric model, resulting in unnatural animation results. We discovered that these limitations stem from ambiguities in the 2D diffusion predictions during 3D avatar distillation, specifically: i) the avatar’s appearance and geometry are underconstrained by the text input, and ii) the semantic alignment between the predictions and the parametric head model is insufficient because the diffusion model alone cannot incorporate information from the parametric model. In this work, we propose a novel framework, AnimPortrait3D, for text-based realistic animatable 3DGS avatar generation with morphable model alignment, and introduce two key strategies to address these challenges. First, we tackle appearance and geometry ambiguities by utilizing prior information from a pretrained text-to-3D model to initialize a 3D avatar with robust appearance, geometry, and rigging relationships to the morphable model. Second, we refine the initial 3D avatar for dynamic expressions using a ControlNet that is conditioned on semantic and normal maps of the morphable model to ensure accurate alignment. As a result, our method outperforms existing approaches in terms of synthesis quality, alignment, and animation fidelity. Our experiments show that the proposed method advances the state of the art in text-based, animatable 3D head avatar generation. Code and model for this paper are at AnimPortrait3D.
We present LAM, an innovative Large Avatar Model for animatable Gaussian head reconstruction from a single image. Unlike previous methods that require extensive training on captured video sequences or rely on auxiliary neural networks for animation and rendering during inference, our approach generates Gaussian heads that are immediately animatable and renderable. Specifically, LAM creates an animatable Gaussian head in a single forward pass in seconds, enabling reenactment and rendering without additional networks or post-processing steps. This capability allows for seamless integration into existing rendering pipelines, ensuring real-time animation and rendering across a wide range of platforms, including mobile phones. The centerpiece of our framework is the canonical Gaussian attributes generator, which utilizes FLAME canonical points as queries. These points interact with multi-scale image features through a Transformer to accurately predict Gaussian attributes in the canonical space. The reconstructed canonical Gaussian avatar can then be animated using standard linear blend skinning (LBS) with corrective blendshapes, as in the FLAME model, and rendered in real time on various platforms. Experiments demonstrate that LAM outperforms state-of-the-art methods on existing benchmarks.
Creating animatable 3D avatars from a single image remains challenging due to style limitations (realistic, cartoon, anime) and difficulties in handling accessories or hairstyles. While 3D diffusion models advance single-view reconstruction for general objects, outputs often lack animation controls or suffer from artifacts because of the domain gap. We propose SOAP, a style-omniscient framework to generate rigged, topology-consistent avatars from any portrait. Our method leverages a multiview diffusion model trained on 24K 3D heads with multiple styles and an adaptive optimization pipeline to deform the FLAME mesh while maintaining topology and rigging via differentiable rendering. The resulting textured avatars support FACS-based animation, integrate eyeballs and teeth, and preserve details like braided hair or accessories. Extensive experiments demonstrate the superiority of our method over state-of-the-art techniques for both single-view head modeling and diffusion-based image-to-3D generation. Our code and data are publicly available for research purposes at github.com/TingtingLiao/soap.
Obtaining 3D faces with microscopic structures from a single unconstrained image is challenging. The complexity of wrinkles and pores at the microscopic level, coupled with the blurriness of the input image, compounds the difficulty. However, the distribution of wrinkles and pores tends to follow a specialized pattern, which can provide a strong prior for synthesizing them. A key to microstructure synthesis is therefore a parametric wrinkle and pore model with controllable semantic parameters. Additionally, ensuring differentiability is essential for enabling optimization through gradient descent. To this end, we propose a novel framework designed to efficiently reconstruct facial micro-wrinkles and pores from naturally captured images. At the core of our framework is a differentiable representation of wrinkles and pores via a graph neural network (GNN), which can model the complex interactions between adjacent wrinkles through multiple graph convolutions. Furthermore, to overcome the inconsistency between the blurry input and clear wrinkles during optimization, we propose a Direction Distribution Similarity measure that keeps the wrinkle-directional features consistent. Consequently, our framework can synthesize facial microstructures from a blurry skin-image patch, cropped from a naturally captured facial image, in about 2 seconds on average. Our framework can seamlessly integrate with existing macroscopic facial detail reconstruction methods to enhance their detailed appearance. We showcase this capability on several works, including DECA, HRN, and FaceScape.
We introduce Masked Anchored SpHerical Distances (MASH), a novel multi-view and parametrized representation of 3D shapes. Inspired by multi-view geometry and motivated by the importance of perceptual shape understanding for learning 3D shapes, MASH represents a 3D shape as a collection of observable local surface patches, each defined by a spherical distance function emanating from an anchor point. We further leverage the compactness of spherical harmonics to encode the MASH functions, combined with a generalized view cone with a parameterized base that masks the spatial extent of the spherical function to attain locality. We develop a differentiable optimization algorithm capable of converting any point cloud into a MASH representation accurately approximating ground-truth surfaces with arbitrary geometry and topology. Extensive experiments demonstrate that MASH is versatile for multiple applications including surface reconstruction, shape generation, completion, and blending, achieving superior performance thanks to its unique representation encompassing both implicit and explicit features. More information and resources can be found at: https://chli.top/MASH.
Transforming 3D shapes into representations that support part-level editing, flexible redesign, and efficient compression is vital for asset customization, content creation, and optimization in digital design. Despite its importance, achieving a representation that balances expressivity, editability, compactness, and interpretability remains a challenge. We introduce 3D2EP, a novel method for 3D shape decomposition that represents objects as a collection of differentiable, parametric primitives. Given a 3D shape represented by a voxel grid, 3D2EP decomposes this into a set of primitive parts, each generated by extruding a scaled 2D profile along a 3D curve, with the requisite components being predicted in a feedforward manner. That is, each primitive is constrained to have a single cross-section profile, up to scale. This enables the primitives to adapt to the data, capturing the geometry with precision but without excess degrees-of-freedom that would stymie editability.
Extensive evaluations highlight 3D2EP’s ability to reconstruct complex shapes with a compact and interpretable representation, emphasizing its suitability for a wide range of 3D modeling applications.
Autoregressive models have achieved remarkable success across various domains, yet their performance in 3D shape generation lags significantly behind that of diffusion models. In this paper, we introduce OctGPT, a novel multiscale autoregressive model for 3D shape generation that dramatically improves the efficiency and performance of prior 3D autoregressive approaches, while rivaling or surpassing state-of-the-art diffusion models. Our method employs a serialized octree representation to efficiently capture the hierarchical and spatial structures of 3D shapes. Coarse geometry is encoded via octree structures, while fine-grained details are represented by binary tokens generated using a vector quantized variational autoencoder (VQVAE), transforming 3D shapes into compact multiscale binary sequences suitable for autoregressive prediction. To address the computational challenges of handling long sequences, we incorporate octree-based transformers enhanced with 3D rotary positional encodings, scale-specific embeddings, and token-parallel generation schemes. These innovations reduce training time 13-fold and generation time 69-fold, enabling the efficient training of high-resolution 3D shapes, e.g., 1024³, on just four NVIDIA 4090 GPUs within days. OctGPT showcases exceptional versatility across various tasks, including text-, sketch-, and image-conditioned generation, as well as scene-level synthesis involving multiple objects. Extensive experiments demonstrate that OctGPT accelerates convergence and improves generation quality over prior autoregressive methods, offering a new paradigm for high-quality, scalable 3D content creation. Our code and trained models are available at .
In this paper, we introduce a second-order derivative of spherical harmonics, the spherical harmonics Hessian, and solid spherical harmonics, a variant of spherical harmonics, to the computer graphics community. These mathematical tools are used to develop an analytical representation of the Hessian matrix of spherical harmonics coefficients for spherical lights. We apply our analytic representation of the Hessian matrix to grid-based SH lighting rendering applications with many spherical lights, which store the incident light field as spherical harmonics coefficients and their spatial gradients at sparse grid points. We develop a Hessian-based error metric with which our method automatically and adaptively subdivides the grid according to whether interpolation using the spatial gradient is appropriate. Our method can be easily incorporated into the grid-based precomputed radiance transfer (PRT) framework with small additional storage. We demonstrate that our adaptive grid, subdivided using the Hessian-based error metric, can substantially improve rendering quality under equal-time grid construction.
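For intuition, the role of the Hessian in such an error metric can be sketched via a Taylor expansion of a single SH coefficient c around a grid point; the paper's metric may aggregate coefficients and bound the error differently.

```latex
% Taylor expansion of a single SH coefficient c around a grid point x_0:
c(x_0 + \mathbf{d})
  \;=\; c(x_0) + \nabla c(x_0)\cdot\mathbf{d}
  + \tfrac{1}{2}\,\mathbf{d}^{\top} H_c(x_0)\,\mathbf{d}
  + O(\lVert\mathbf{d}\rVert^{3}),
% so the leading error of gradient-based interpolation within a cell of size h is
\varepsilon \;\approx\; \max_{\lVert\mathbf{d}\rVert \le h}
  \tfrac{1}{2}\,\bigl|\mathbf{d}^{\top} H_c(x_0)\,\mathbf{d}\bigr|
  \;=\; \tfrac{1}{2}\,h^{2}\,\rho\bigl(H_c(x_0)\bigr),
% with \rho the spectral radius; a cell is subdivided whenever \varepsilon
% exceeds a user tolerance.
```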
We introduce a guided lens sampling method for efficient rendering of circles of confusion (CoCs). While traditional Monte Carlo techniques simulate depth-of-field (DoF) effects by perturbing camera rays at the lens, uniform lens sampling often results in significant noise by failing to prioritize rays toward highlight regions in the scene. Although path guiding has proven effective for global illumination by learning importance distributions for incoming radiance, no comparable guiding technique for CoCs exists, primarily due to the strong parallax between adjacent pixels. We model highlight spots in world space using a globally shared radiance field, which is then transformed into lens space through a bipolar-cone projection to guide camera ray generation. We implement this theory using 3D Gaussians, achieving fast, robust guiding with minimal computational and storage overhead, making it suitable for production rendering. We also propose two extensions to further enhance local adaptation. Our experiments show that this approach significantly improves the sampling efficiency for CoC rendering.
Recent extensions to spatiotemporal path reuse, or ReSTIR, improve rendering efficiency in the presence of high-frequency content by augmenting path reservoirs to represent contributions over full pixel footprints. Still, if historical paths fail to contribute to future frames, these benefits disappear. Prior ReSTIR work backprojects to the prior frame to identify paths for reuse. Backprojection can fail to find relevant paths for many reasons, including moving cameras or subpixel geometry with differing motion.
We introduce reservoir splatting to reduce these failures. Splatting forward-projects the primary hits of prior-frame paths. Unlike backprojection, forward-projected path samples fall into the current-frame pixel relevant to their exact primary hits, making successful reuse more likely. This also enables motion blur for ReSTIR, by splatting at multiple time steps, and supports depth of field without the specialized shift maps needed previously.
Beyond enabling motion blur, splatting improves resampling quality over Zhang et al.’s [2024] Area ReSTIR at up to 10% lower cost. To improve robustness, we show how to MIS splatted and backprojected samples to help every current-frame pixel get at least one historical path proposed for reuse.
Existing neural shadow mapping methods [Datta et al. 2022] have shown promise in generating high-quality soft shadows. However, they exhibit limited generalizability to new scenes. In this paper, we present a novel neural method, named kernel-predicting neural shadow mapping, to address this issue. Specifically, we explicitly model soft shadow values as pixelwise local filtering of nearby base shadow values (i.e., classic hard shadow values) in screen space, where the local filter weights are predicted by a trained neural network. We use dilated filters as the representation of our local filters to balance computational efficiency against the receptive field of the local filter. We further enhance shadow quality by replacing the classic shadow map algorithm [Williams 1978] with moment shadow maps [Peters and Klein 2015] to generate the base shadow values. With carefully designed filters, input features, and loss functions with temporal regularization, our method runs at real-time framerates (i.e., >100 fps at 2048 × 1024 resolution), produces temporally stable soft shadows with good generalizability, and consistently beats state-of-the-art methods in both visual quality and numerical measures. Code and model weights are available at .
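A minimal sketch of the kernel-prediction step; kernel size, dilation, and weight normalization are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F


def apply_predicted_dilated_filter(base_shadow, weights, kernel=3, dilation=4):
    """Apply per-pixel predicted filter weights to a base (hard) shadow image.
    base_shadow: (B, 1, H, W); weights: (B, kernel*kernel, H, W), e.g. the
    softmax-normalized output of a kernel-predicting network."""
    pad = dilation * (kernel // 2)
    # Gather the dilated neighborhood of every pixel: (B, kernel*kernel, H*W).
    patches = F.unfold(base_shadow, kernel_size=kernel, dilation=dilation, padding=pad)
    patches = patches.view(base_shadow.shape[0], kernel * kernel,
                           *base_shadow.shape[-2:])
    # Weighted sum over the neighborhood gives the locally filtered soft shadow.
    return (patches * weights).sum(dim=1, keepdim=True)
```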
We present a method for generating stroke-based painterly drawings of participating media, such as smoke, fire, and clouds, by transferring stroke attributes—color, width, length, and orientation—from exemplar to animation frames. Building on the stroke transfer framework, we introduce features and basis fields capturing lighting-, view-, and geometry-dependent information, extending surface-based ones (e.g., intensity, apparent normals and curvatures, and distance from silhouettes) to volumetric scenes while supporting traditional surface objects. Novel features, including apparent relative velocity and mean free-path, address non-rigid flow and dynamic scenes. Our system combines automated exemplar selection, user-guided style learning, and temporally coherent stroke generation, enabling artistic and expressive visualizations of dynamic media.
Previous research in material models for surface and volume scattering has enabled highly realistic scenes in modern rendering systems. However, there has been comparatively little study of light sources in computer graphics despite their critical importance in illuminating and bringing life into these scenes. In the real world, photons are emitted through numerous physical processes including combustion, incandescence, and fluorescence. The qualities of light produced in each of these processes are unique to their physics, making them interesting to study individually.
In this work, we propose a model for glow discharge, a form of light-emitting electrostatic discharge commonly found in neon lights and gas discharge lamps. We take inspiration from works in computational physics and develop an efficient point-wise solver for the emission due to glow discharge suitable for traditional volume rendering systems. Our model distills the complex mechanics of this process into a set of flexible and interpretable parameters. We demonstrate that our model can replicate the visual qualities of glow discharge under varying gases.
We explore interactive painting on 3D Gaussian splat scenes and other surfaces using 3D Gaussian splat brushes, each containing a chunk of the realistic texture-geometry that makes captured representations so appealing. The suite of brush capabilities we propose enables 3D artists to capture and then remix real-world imagery and geometry with direct interactive control. In particular, we propose a set of algorithms for 1) interactively selecting subsets of Gaussians as a brush pattern, 2) interactively applying the brush to the same or other 3DGS scenes or other 3D surfaces using stamp-based painting, and 3) using an inpainting diffusion model to adjust stamp seams for a seamless and realistic appearance. We also present an ensemble of artistic brush parameters, resulting in a wide range of appearance options for the same brush. Our contribution is a judicious combination of algorithms, design features, and creative affordances that together enable the first prototype implementation of interactive brush-based painting with 3D Gaussian splats. We evaluate our system by showing compelling results on a diverse set of 3D scenes and through a user study with VFX/animation professionals that validates system features, workflow, and potential for creative impact. Code and data for this paper can be accessed from splatpainting.github.io.
The contrast and luminance capabilities of a display are central to the quality of the image. High dynamic range (HDR) displays have high luminance and contrast, but it can be difficult to ascertain whether a given set of characteristics qualifies for this label. This is especially unclear for new display modes, such as virtual reality (VR). This paper studies the perceptual impact of the peak luminance and contrast of a display, including characteristics and use cases representative of VR. To achieve this goal, we first developed a haploscope testbed prototype display capable of achieving 1,000 nits peak luminance and 1,000,000:1 contrast with high precision. We then collected a novel HDR video dataset targeting VR-relevant content types. We also implemented custom tone mapping operators to map between display parameter sets. Finally, we collected subjective preference data spanning 3 orders of magnitude in each dimension. Our data was used to fit a model, which was validated through a subjective study on an HDR VR prototype head-mounted display (HMD). Our model helps provide guidance for future display design and helps standardize the understanding of HDR.
Facial expressions are key to human communication, conveying emotions and intentions. Given the rising popularity of digital humans and avatars, the ability to accurately represent facial expressions in real time has become an important topic. However, quantifying perceived differences between pairs of expressions is difficult, and no comprehensive subjective datasets are available for testing. This work introduces a new dataset targeting this problem: FaceExpressions-70k. Obtained via crowdsourcing, our dataset contains 70,500 subjective expression comparisons rated by over 1,000 study participants. We demonstrate the applicability of the dataset for training perceptual expression difference models and for guiding decisions on acceptable latency and sampling rates for facial expressions when driving a face avatar.
The true vision for real-time virtual and augmented reality is reproducing our visual reality in its entirety on immersive displays. To this end, foveated rendering leverages the limitations of spatial acuity in human peripheral vision to allocate computational resources to the fovea while reducing quality in the periphery. Such methods are often derived from studies on the spatial resolution of the human visual system and its ability to perceive blur in the periphery, enabling the potential for high spatial quality in real time. However, the effects of blur on other visual cues that depend on luminance contrast, such as depth, remain largely unexplored. It is critical to understand this interplay, as accurate depth representation is a fundamental aspect of visual realism. In this paper, we present the first evaluation exploring the effects of foveated rendering on stereoscopic depth perception. We design a psychovisual experiment to quantitatively study the effects of peripheral blur on depth perception. Our analysis demonstrates that stereoscopic acuity is unaffected (or even improved) by high levels of peripheral blur. Based on our studies, we derive a simple perceptual model that determines the amount of foveation that does not affect stereoacuity. Furthermore, we analyze the model in the context of common foveation practices reported in the literature. The findings indicate that foveated rendering does not impact stereoscopic depth perception, and stereoacuity remains unaffected with up to 2× stronger foveation than commonly used. Finally, we conduct a validation experiment and show that our findings hold for complex natural stimuli.
Knots and ties are captivating elements of digital garments and accessories, but they have been notoriously challenging and computationally expensive to model manually. In this paper, we propose a physics-based modeling system for knots and ties using templates. The primary challenge lies in transforming cloth pieces into desired knot and tie configurations in a controllable, penetration-free manner, particularly when interacting with surrounding meshes. To address this, we introduce a pipe-like parametric knot template representation, defined by a Bézier curve as its medial axis and an adaptively adjustable radius for enhanced flexibility and variation. This representation enables customizable knot sizes, shapes, and styles while ensuring intersection-free results through robust collision detection techniques. Using the defined knot template, we present a mapping and penetration-free initialization method to transform selected cloth regions from UV space into the initial 3D knot shape. We further enable quasistatic simulation of knots and their surrounding meshes through a fast and reliable collision handling and simulation scheme. Our experiments demonstrate the system’s effectiveness and efficiency in modeling a wide range of digital knots and ties with diverse styles and shapes, including configurations that were previously impractical to create manually.
Manual design of garments for avatars requires substantial effort. Garment retargeting methods can save this effort by automatically deforming an existing garment design from one avatar to another. Previous methods are limited to human avatars with small variations in body shape, while non-human avatars with unrealistic characteristics appear widely in games and animations. In this paper, our goal is to retarget artist-designed garments from a standard mannequin to a more general class of avatars. Since training data of diverse avatars wearing garments is scarce, we propose a training-free method that optimizes the mesh representation of the garments with a combination of loss functions that preserve the geometric features of the original design, guarantee intersection-free results, and fit the garment adaptively to the avatar. Our method produces simulation-ready garment models that can be used later in avatar animations.
We present the first algorithm to automatically compute sewing patterns for upcycling existing garments into new designs. Our algorithm takes as input two garment designs along with their corresponding sewing patterns and determines how to cut one of them to match the other by following garment reuse principles. Specifically, our algorithm favors the reuse of seams and hems present in the existing garment, thereby preserving the embedded value of these structural components and simplifying the fabrication of the new garment. Finding an optimal reused pattern is computationally challenging because it involves both discrete and continuous quantities. Discrete decisions include the choice of existing panels to cut from and the choice of seams and hems to reuse. Continuous variables include the precise placement of the new panels along seams and hems, and potential deformations of these panels to maximize reuse. Our key idea for making this optimization tractable is quantizing the shape of garment panels. This allows us to frame the search for an optimal reused pattern as a discrete assignment problem, which we solve efficiently with an ILP solver. We showcase our proposed pipeline on several reuse examples, including comparisons with reused patterns crafted by a professional garment designer. Additionally, we manufacture a physical reused garment to demonstrate the practical effectiveness of our approach.
Garment sewing patterns are the design language behind clothing, yet their current vector-based digital representations were not built with machine learning in mind. The vector-based representation encodes a sewing pattern as a discrete set of panels, each defined as a sequence of lines and curves, together with stitching information between panels and the placement of each panel around a body. However, this representation causes two major challenges for neural networks: discontinuity in latent space between patterns with different topologies and limited generalization to garments with unseen topologies in the training data. In this work, we introduce GarmentImage, a unified raster-based sewing pattern representation. GarmentImage encodes a garment sewing pattern’s geometry, topology and placement into multi-channel regular grids. Machine learning models trained on GarmentImage achieve seamless transitions between patterns with different topologies and show better generalization capabilities compared to models trained on the vector-based representation. We demonstrate the effectiveness of GarmentImage across three applications: pattern exploration in latent space, text-based pattern editing, and image-to-pattern prediction. The results show that GarmentImage achieves superior performance on these applications using only simple convolutional networks.
Reconstructing 3D clothed humans from images is fundamental to applications like virtual try-on, avatar creation, and mixed reality. While recent advances have enhanced human body recovery, accurate reconstruction of garment geometry—especially for loose-fitting clothing—remains an open challenge. We present a novel method for high-fidelity 3D garment reconstruction from single images that bridges 2D and 3D representations. Our approach combines Implicit Sewing Patterns (ISP) with a generative diffusion model to learn rich garment shape priors in a 2D UV space. A key innovation is our mapping model that establishes correspondences between 2D image pixels, UV pattern coordinates, and 3D geometry, enabling joint optimization of both 3D garment meshes and the corresponding 2D patterns by aligning learned priors with image observations. Despite training exclusively on synthetically simulated cloth data, our method generalizes effectively to real-world images, outperforming existing approaches on both tight- and loose-fitting garments. The reconstructed garments maintain physical plausibility while capturing fine geometric details, enabling downstream applications including garment retargeting and texture manipulation.
We present RenderFormer, a neural rendering pipeline that directly renders an image from a triangle-based representation of a scene with full global illumination effects and that does not require per-scene training or fine-tuning. Instead of taking a physics-centric approach to rendering, we formulate rendering as a sequence-to-sequence transformation where a sequence of tokens representing triangles with reflectance properties is converted to a sequence of output tokens representing small patches of pixels. RenderFormer follows a two-stage pipeline: a view-independent stage that models triangle-to-triangle light transport, and a view-dependent stage that transforms a token representing a bundle of rays to the corresponding pixel values guided by the triangle sequence from the view-independent stage. Both stages are based on the transformer architecture and are learned with minimal prior constraints. We demonstrate and evaluate RenderFormer on scenes with varying complexity in shape and light transport.
In this paper, we present a novel neural global illumination approach that enables multi-frequency reflections in dynamic scenes. Our method utilizes object-centric, spatial feature grids as the core framework to model rendering effects implicitly. A lightweight scene query, based on single-bounce ray tracing, is then performed on these feature grids to extract principal and secondary features separately. The principal features can capture a wide range of relatively low-frequency global illumination effects, such as diffuse indirect lighting and reflections on rough surfaces. In contrast, the secondary features can provide sparse scene-specific reflection details, typically with much higher frequencies than the final observed radiance. Inspired by the physical processes of light propagation, we introduce a novel dual-band feature fusion module that seamlessly blends these two types of features, generating fused features capable of modeling multi-frequency rendering effects. Additionally, we propose a two-stage training strategy tailored to accommodate the distinct characteristics of each feature type, significantly enhancing the overall quality and reducing artifacts in the rendered results. Experimental results demonstrate that our method delivers high-quality, multi-frequency dynamic reflections, outperforming state-of-the-art baselines, including path tracing with screen-space neural denoising and other neural global illumination methods.
Precomputed global illumination (GI) techniques, such as light probes, particularly focus on capturing indirect illumination and have gained widespread adoption. However, as the scale of the scenes continues to expand, the demand for storage space and runtime memory for light probes also increases substantially. To address this issue, we propose a novel Gaussian fitting compression technique specifically designed for light field probes, which enables the use of denser samples to describe illumination in complex scenes. The core idea of our method is to utilize low-bit adaptive Gaussian functions to store the latent representation of light probes, enabling parallel and high-speed decompression on the GPU. Additionally, we implement a custom gradient propagation process to replace conventional inference frameworks such as PyTorch, ensuring exceptional compression speed.
At the same time, by constructing a cascaded light field texture in real time, we avoid the need for baking and storing a large number of redundant light field probes arranged in the form of 3D textures. This approach allows us to further compress memory usage while maintaining high visual quality and rendering speed. Compared to traditional methods based on Principal Component Analysis (PCA), our approach consistently yields superior results across various test scenarios, achieving compression ratios of up to 1:50.
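As a rough, illustrative sketch of the basic idea of fitting a probe's latent vector with a small set of Gaussians so that only a few (amplitude, mean, width) triplets need to be stored; the low-bit quantization, cascaded textures, and custom gradient pipeline described above are omitted, and the signal, number of Gaussians, and optimizer settings are stand-ins rather than details from the paper:

```python
# Minimal sketch: approximate one probe's latent vector with K adaptive Gaussians.
import torch

signal = torch.rand(256)                 # stand-in for one probe's latent representation
x = torch.linspace(0, 1, 256)
K = 8                                    # number of Gaussians (assumption)
amp = torch.randn(K, requires_grad=True)
mu = torch.rand(K, requires_grad=True)
sigma = torch.full((K,), 0.1, requires_grad=True)
opt = torch.optim.Adam([amp, mu, sigma], lr=1e-2)

for _ in range(500):
    # Sum of Gaussians evaluated over the latent axis.
    recon = (amp[:, None] * torch.exp(-0.5 * ((x[None] - mu[:, None]) / sigma[:, None]) ** 2)).sum(0)
    loss = ((recon - signal) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```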
We propose a neural approach for estimating spatially varying light selection distributions to improve importance sampling in Monte Carlo rendering, particularly for complex scenes with many light sources. Our method uses a neural network to predict the light selection distribution at each shading point based on local information, trained by minimizing the KL-divergence between the learned and target distributions in an online manner. To efficiently manage hundreds or thousands of lights, we integrate our neural approach with light hierarchy techniques, where the network predicts cluster-level distributions and existing methods sample lights within clusters. Additionally, we introduce a residual learning strategy that leverages initial distributions from existing techniques, accelerating convergence during training. Our method achieves superior performance across diverse and challenging scenes.
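To make the training objective concrete, the following minimal sketch trains a small network to predict a per-shading-point distribution over light clusters by minimizing the KL divergence to a target distribution; the feature set, network size, and the use of normalized cluster contributions as the target are illustrative assumptions, not the paper's exact design:

```python
# Minimal sketch: online KL-divergence training of a light-cluster selection network.
import torch
import torch.nn as nn

class LightSelectionNet(nn.Module):
    def __init__(self, n_clusters, feat_dim=9):
        super().__init__()
        # Local features: position (3), normal (3), diffuse albedo (3) -- an assumption.
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_clusters),
        )

    def forward(self, feats):
        return torch.log_softmax(self.mlp(feats), dim=-1)  # log-probabilities per cluster

def kl_loss(log_pred, cluster_contrib):
    # Target distribution ~ normalized estimated contribution per cluster (assumption).
    target = cluster_contrib / cluster_contrib.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return torch.sum(target * (torch.log(target.clamp_min(1e-8)) - log_pred), dim=-1).mean()

net = LightSelectionNet(n_clusters=32)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
feats = torch.rand(4096, 9)        # stand-in for shading-point features from the renderer
contrib = torch.rand(4096, 32)     # stand-in for estimated cluster contributions
loss = kl_loss(net(feats), contrib)
opt.zero_grad(); loss.backward(); opt.step()
```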
We introduce xADA, a generative model for creating expressive and realistic animation of the face, tongue, and head directly from speech audio. Our approach leverages the pretrained Whisper audio encoder to extract rich speech features which are decoded into face and head animation using a series of gated recurrent unit (GRU) networks. The generated animation maps directly onto MetaHuman-compatible rig controls, enabling seamless integration into industry-standard content creation pipelines. xADA operates fully automatically, with an option for users to override the detected emotion and/or blink timings. xADA generalizes across languages and voice styles, and can animate non-verbal sounds. Quantitative evaluation and a user study demonstrate that xADA produces state-of-the-art animation with high realism, frequently indistinguishable from ground-truth performance. Additionally, we outline a comprehensive data capture protocol designed to collect an extensive range of speech and non-verbal sounds for training animation models.
We present DuetGen, a novel framework for generating interactive two-person dances from music. The key challenge of this task lies in the inherent complexities of two-person dance interactions, where the partners need to synchronize both with each other and with the music. Inspired by the recent advances in motion synthesis, we propose a two-stage solution: encoding two-person motions into discrete tokens and then generating these tokens from music. To effectively capture intricate interactions, we represent both dancers’ motions as a unified whole to learn the necessary motion tokens, and adopt a coarse-to-fine learning strategy in both stages. Our first stage utilizes a VQ-VAE that hierarchically separates high-level semantic features at a coarse temporal resolution from low-level details at a finer resolution, producing two discrete token sequences at different abstraction levels. Subsequently, in the second stage, two generative masked transformers learn to map music signals to these dance tokens: the first producing high-level semantic tokens, and the second, conditioned on music and these semantic tokens, producing the low-level tokens. We train both transformers to learn to predict randomly masked tokens within the sequence, enabling them to iteratively generate motion tokens by filling an empty token sequence during inference. Through the hierarchical masked modeling and dedicated interaction representation, DuetGen achieves the generation of synchronized and interactive two-person dances across various genres. Extensive experiments and user studies on a benchmark duet dance dataset demonstrate state-of-the-art performance of DuetGen in motion realism, music-dance alignment, and partner coordination. Code and model weights are available at .
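As a rough illustration of the generative masked-token objective described above (not the paper's actual architecture), the sketch below randomly masks motion tokens and trains a transformer to predict them from the unmasked context plus per-frame music features; the vocabulary size, model dimensions, and the additive music conditioning are placeholders:

```python
# Minimal sketch: masked motion-token modeling conditioned on music features.
import torch
import torch.nn as nn

VOCAB, MASK_ID, T = 1024, 1024, 120
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), num_layers=4)
tok_emb = nn.Embedding(VOCAB + 1, 256)          # +1 slot for the [MASK] token
head = nn.Linear(256, VOCAB)

def masked_token_loss(tokens, music_feat, mask_ratio=0.5):
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_ratio
    inp = tokens.masked_fill(mask, MASK_ID)     # replace a random subset with [MASK]
    h = model(tok_emb(inp) + music_feat)        # music conditioning added per frame (assumption)
    logits = head(h)
    return nn.functional.cross_entropy(logits[mask], tokens[mask])

tokens = torch.randint(0, VOCAB, (8, T))        # stand-in for VQ-VAE motion tokens
music_feat = torch.randn(8, T, 256)             # stand-in for encoded music features
loss = masked_token_loss(tokens, music_feat)
loss.backward()
```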
The art of instrument performance stands as a vivid manifestation of human creativity and emotion. Nonetheless, generating instrument performance motions is a highly challenging task, as it requires not only capturing intricate movements but also reconstructing the complex dynamics of the performer-instrument interaction. While existing works primarily focus on modeling partial body motions, we propose Expressive ceLlo performance motion Generation for Audio Rendition (ELGAR), a state-of-the-art diffusion-based framework for whole-body fine-grained instrument performance motion generation solely from audio. To emphasize the interactive nature of the instrument performance, we introduce Hand Interactive Contact Loss (HICL) and Bow Interactive Contact Loss (BICL), which effectively guarantee the authenticity of the interplay. Moreover, to better evaluate whether the generated motions align with the semantic context of the music audio, we design novel metrics specifically for string instrument performance motion generation, including finger-contact distance, bow-string distance, and bowing score. Extensive evaluations and ablation studies are conducted to validate the efficacy of the proposed methods. In addition, we put forward a motion generation dataset SPD-GEN, collated and normalized from the MoCap dataset SPD. As demonstrated, ELGAR has shown great potential in generating instrument performance motions with complicated and fast interactions, which will promote further development in areas such as animation, music education, interactive art creation, etc. Our code and SPD-GEN dataset are available at .
The automatic generation of controllable co-speech gestures has recently gained growing attention. While existing systems typically achieve gesture control through predefined categorical labels or implicit pseudo-labels derived from motion examples, these approaches often compromise the rich details present in the original motion examples. We present MECo, a framework for motion-example-controlled co-speech gesture generation by leveraging large language models (LLMs). Our method capitalizes on LLMs’ comprehension capabilities through fine-tuning to simultaneously interpret speech audio and motion examples, enabling the synthesis of gestures that preserve example-specific characteristics while maintaining speech congruence. Departing from conventional pseudo-labeling paradigms, we position motion examples as explicit query contexts within the prompt structure to guide gesture generation. Experimental results demonstrate state-of-the-art performance across three metrics: Fréchet Gesture Distance (FGD), motion diversity, and example-gesture similarity. Furthermore, our framework enables granular control of individual body parts and accommodates diverse input modalities including motion clips, static poses, human video sequences, and textual descriptions.
Text-to-stylized human motion generation uses text descriptions to drive motion synthesis with fine-grained style control with respect to a reference motion. However, existing approaches typically rely on supervised style learning with labeled datasets, constraining their adaptability and generalization for effective diverse style control. Additionally, they have not fully explored the temporal correlations between motion, textual descriptions, and style, making it challenging to generate semantically consistent motion with precise style alignment. To address these limitations, we introduce a novel method that integrates unsupervised style from arbitrary references into a text-driven diffusion model to generate semantically consistent stylized human motion. The core innovation lies in leveraging text as a mediator to capture the temporal correspondences between motion and style, enabling the seamless integration of temporally dynamic style into motion features. Specifically, we first train a diffusion model on a text-motion dataset to capture the correlation between motion and text semantics. A style adapter then extracts temporally dynamic style features from reference motions and integrates a novel Semantic-Aware Style Injection (SASI) module to infuse these features into the diffusion model. The SASI module computes the semantic correlation between motion and style features based on text, selectively incorporating style features that align with motion content, ensuring semantic consistency and precise style alignment. Our style adapter does not require a labeled style dataset for training, enhancing adaptability and generalization of style control. Extensive evaluations show that our method outperforms previous approaches in terms of semantic consistency and style expressivity. Our webpage, , includes links to the supplementary video and code.
We present an interpolation method for planar shapes using logarithmic metric blending. Our approach generalizes prior work on pullback metrics into a broader framework, allowing us to employ different techniques, such as logarithmic blending of symmetric positive definite matrices, to gain precise control over both conformal and area distortions. Key contributions include generalizing the continuous blending scheme and its adaptation to discrete mesh interpolation through different conformal and isometric parameterizations. Experimental results demonstrate that our method outperforms existing techniques in achieving bounded distortions, making it a compelling choice for applications in animation and morphing.
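As a rough, self-contained illustration of the core operation named above, logarithmic blending of two symmetric positive definite (SPD) metrics via their matrix logarithms; the full pipeline over mesh parameterizations is not reproduced here, and the example matrices are arbitrary:

```python
# Minimal sketch: log-blend two SPD metrics, M(t) = exp((1-t) log M0 + t log M1).
import numpy as np

def spd_log(M):
    w, V = np.linalg.eigh(M)               # eigendecomposition of a symmetric matrix
    return V @ np.diag(np.log(w)) @ V.T

def spd_exp(S):
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.exp(w)) @ V.T

def log_blend(M0, M1, t):
    """Interpolated metric stays SPD for every t in [0, 1]."""
    return spd_exp((1.0 - t) * spd_log(M0) + t * spd_log(M1))

M0 = np.array([[2.0, 0.3], [0.3, 1.0]])    # example 2x2 metrics on a planar element
M1 = np.array([[1.0, -0.2], [-0.2, 3.0]])
print(log_blend(M0, M1, 0.5))
```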
Cage-based deformation is a fundamental problem in geometry processing, where a cage, a user-specified boundary of a region, is used to deform the ambient space of a given mesh. Traditional 3D cages are typically composed of triangles and quads. While quads can represent non-planar regions when their four corners are not coplanar, they form ruled surfaces with straight isoparametric curves, which limits their ability to handle curved and high-curvature deformations. In this work, we extend the cage for curved boundaries using Bézier patches, enabling flexible and high-curvature deformations with only a few control points. The higher-order structure of the Bézier patch also allows for the creation of a more compact and precise curved cage for the input model. Based on Green’s third identity, we derive the Green coordinates for the Bézier cage, achieving shape-preserving deformation with smooth surface boundaries. These coordinates are defined based on the vertex positions and normals of the Bézier control net. Given that the coordinates are approximately calculated through the Riemann summation, we propose a global projection technique to ensure that the coordinates accurately conform to the linear reproduction property. Experimental results show that our method achieves high performance in handling curved and high-curvature deformations.
Learning on triangle meshes has recently proven to be instrumental to a myriad of tasks, from shape classification, to segmentation, to deformation and animation, to mention just a few. While some of these applications are tackled through neural network architectures which are tailored to the application at hand, many others use generic frameworks for triangle meshes where the only customization required is the modification of the input features and the loss function. Our goal in this paper is to broaden the applicability of these generic frameworks to “wild” meshes, i.e. meshes in-the-wild which often have multiple components, non-manifold elements, disrupted connectivity, or a combination of these. We propose a configurable meta-framework based on the concept of caged geometry: Given a mesh, a cage is a single-component manifold triangle mesh that envelopes it closely. Generalized barycentric coordinates map between functions on the cage and functions on the mesh, allowing us to learn and test on a variety of data in different applications. We demonstrate this concept by learning segmentation and skinning weights on difficult data, achieving better performance than state-of-the-art techniques on wild meshes.
In this work, we present CineMaster, a novel framework for 3D-aware and controllable text-to-video generation. Our goal is to empower users with controllability comparable to that of professional film directors: precise placement of objects within the scene, flexible manipulation of both objects and camera in 3D space, and intuitive layout control over the rendered frames. To achieve this, CineMaster operates in two stages. In the first stage, we design an interactive workflow that allows users to intuitively construct 3D-aware control signals by positioning object bounding boxes and defining camera movements within the 3D space. In the second stage, these control signals—comprising rendered depth maps, camera trajectories, and object class labels—serve as the guidance for a text-to-video diffusion model, ensuring that the generated video content matches the user’s intent. Furthermore, to overcome the scarcity of in-the-wild datasets with 3D object motion and camera pose annotations, we carefully establish an automated data annotation pipeline that extracts 3D bounding boxes and camera trajectories from large-scale video data. Extensive qualitative and quantitative experiments demonstrate that CineMaster significantly outperforms existing methods and achieves strong 3D-aware text-to-video generation.
This paper presents a method that allows users to design cinematic video shots in the context of image-to-video generation. Shot design, a critical aspect of filmmaking, involves meticulously planning both camera movements and object motions in a scene. However, enabling intuitive shot design in modern image-to-video generation systems presents two main challenges: first, effectively capturing user intentions on the motion design, where both camera movements and scene-space object motions must be specified jointly; and second, representing motion information that can be effectively utilized by a video diffusion model to synthesize the image animations. To address these challenges, we introduce MotionCanvas, a method that integrates user-driven controls into image-to-video (I2V) generation models, allowing users to control both object and camera motions in a scene-aware manner. By connecting insights from classical computer graphics and contemporary video generation techniques, we demonstrate the ability to achieve 3D-aware motion control in I2V synthesis without requiring costly 3D-related training data. MotionCanvas enables users to intuitively depict scene-space motion intentions, and translates them into spatiotemporal motion-conditioning signals for video diffusion models. We demonstrate the effectiveness of our method on a wide range of real-world image content and shot-design scenarios, highlighting its potential to enhance the creative workflows in digital content creation and adapt to various image and video editing applications. Code and model weights are at https://motion-canvas25.github.io
Diffusion models have demonstrated impressive performance in generating high-quality videos from text prompts or images. However, precise control over the video generation process—such as camera manipulation or content editing—remains a significant challenge. Existing methods for controlled video generation are typically limited to a single control type, lacking the flexibility to handle diverse control demands. In this paper, we introduce Diffusion as Shader (DaS), a novel approach that supports multiple video control tasks within a unified architecture. Our key insight is that achieving versatile video control necessitates leveraging 3D control signals, as videos are fundamentally 2D renderings of dynamic 3D content. Unlike prior methods limited to 2D control signals, DaS leverages 3D tracking videos as control inputs, making the video diffusion process inherently 3D-aware. This innovation allows DaS to achieve a wide range of video controls by simply manipulating the 3D tracking videos. A further advantage of using 3D tracking videos is their ability to effectively link frames, significantly enhancing the temporal consistency of the generated videos. With just 3 days of fine-tuning on 8 H800 GPUs using less than 10k videos, DaS demonstrates strong control capabilities across diverse tasks, including mesh-to-video generation, camera control, motion transfer, and object manipulation. More results are available in the supplementary materials.
We present Mobius, a novel method to generate seamlessly looping videos from text descriptions directly without any user annotations, thereby creating new visual materials for multimedia presentations. Our method repurposes a pre-trained video latent diffusion model for generating looping videos from text prompts without any training. During inference, we first construct a latent cycle by connecting the starting and ending noise of the videos. Given that temporal consistency can be maintained by the context of the video diffusion model, we perform multi-frame latent denoising by gradually shifting the first-frame latent to the end in each step. As a result, the denoising context varies in each step while maintaining consistency throughout the inference process. Moreover, the latent cycle in our method can be of any length. This extends our latent-shifting approach to generate seamless looping videos beyond the scope of the video diffusion model’s context. Unlike previous cinemagraph methods, our approach does not require an image as an appearance reference, which would restrict the motion of the generated results. Instead, our method can produce more dynamic motion and better visual quality. We conduct multiple experiments and comparisons to verify the effectiveness of the proposed method, demonstrating its efficacy in different scenarios. All the code will be made available.
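As a rough sketch of the latent-shifting idea described above, the loop below treats the frame latents as a closed cycle and rotates them by one position at every denoising step, so each step sees a different temporal context; the denoiser call is a placeholder function, not a real diffusion-model API:

```python
# Minimal sketch: rotate a closed cycle of frame latents during multi-frame denoising.
import numpy as np

def denoise_looping(latents, n_steps, denoise_step):
    # latents: (n_frames, ...) initial noise forming a closed latent cycle.
    for step in range(n_steps):
        latents = denoise_step(latents, step)         # multi-frame denoising with temporal context
        latents = np.roll(latents, shift=-1, axis=0)  # move the first-frame latent to the end
    return latents

fake_denoiser = lambda z, s: z * 0.98                 # stand-in for one video-diffusion step
loop_latents = denoise_looping(np.random.randn(48, 16, 32, 32), 50, fake_denoiser)
```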
We present LayerFlow, a unified solution for layer-aware video generation. Given per-layer prompts, LayerFlow generates videos for the transparent foreground, clean background, and blended scene. It also supports versatile variants like decomposing a blended video or generating the background for the given foreground and vice versa. Starting from a text-to-video diffusion transformer, we organize the videos of different layers as sub-clips, and leverage layer embeddings to distinguish each clip and the corresponding layer-wise prompts. In this way, we seamlessly support the aforementioned variants in one unified framework. To address the lack of high-quality layer-wise training videos, we design a multi-stage training strategy that accommodates static images with high-quality layer annotations. Specifically, we first train the model with low-quality video data. Then, we tune a motion LoRA to make the model compatible with static frames. Afterward, we train a content LoRA on a mixture of high-quality layered images and copy-pasted video data. During inference, we remove the motion LoRA, thus generating smooth videos with the desired layers.
Video matting has traditionally been limited by the lack of high-quality ground-truth data. Most existing video matting datasets provide only human-annotated imperfect alpha and foreground annotations, which must be composited to background images or videos during the training stage. Thus, the generalization capability of previous methods in real-world scenarios is typically poor. In this work, we propose to solve the problem from two perspectives. First, we emphasize the importance of large-scale pre-training by pursuing diverse synthetic and pseudo-labeled segmentation datasets. We also develop a scalable synthetic data generation pipeline that can render diverse human bodies and fine-grained hairs, yielding around 200 video clips with a 3-second duration for fine-tuning. Second, we introduce a novel video matting approach that can effectively leverage the rich priors from pre-trained video diffusion models. This architecture offers two key advantages. First, strong priors play a critical role in bridging the domain gap between synthetic and real-world scenes. Second, unlike most existing methods that process video matting frame-by-frame and use an independent decoder to aggregate temporal information, our model is inherently designed for video, ensuring strong temporal consistency. We provide a comprehensive quantitative evaluation across three benchmark datasets, demonstrating our approach’s superior performance, and present comprehensive qualitative results in diverse real-world scenes, illustrating the strong generalization capability of our method. Code is available at https://github.com/aim-uofa/GVM
Inverse rendering is crucial for many scientific and engineering disciplines. Recent progress in differentiable rendering has led to efficient differentiation of the full image formation process with respect to scene parameters, enabling gradient-based optimization.
However, computational demands pose a significant challenge for differentiable rendering, particularly when rendering all pixels during inverse rendering from high-resolution/multi-view images. This computational cost leads to slow performance in each iteration of inverse rendering. Meanwhile, naively reducing the sampling budget by uniformly sampling pixels to render in each iteration can result in high gradient variance during inverse rendering, ultimately degrading overall performance.
Our goal is to accelerate inverse rendering by reducing the sampling budget without sacrificing overall performance. In this paper, we introduce a novel image-space adaptive sampling framework to accelerate inverse rendering by dynamically adjusting pixel sampling probabilities based on gradient variance and contribution to the loss function. Our approach efficiently handles high-resolution images and complex scenes, with faster convergence and improved performance compared to uniform sampling, making it a robust solution for efficient inverse rendering.
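As a rough illustration of image-space adaptive pixel sampling, the sketch below builds per-pixel sampling probabilities from a mix of tracked gradient variance and per-pixel loss, then draws a fixed pixel budget from that distribution; the specific weighting rule and statistics are illustrative assumptions, not the paper's exact formulation:

```python
# Minimal sketch: adaptive pixel selection for one inverse-rendering iteration.
import numpy as np

def pixel_sampling_probs(grad_var, pixel_loss, alpha=0.5, eps=1e-6):
    score = alpha * grad_var + (1.0 - alpha) * pixel_loss
    score = score + eps                       # keep every pixel reachable
    return score / score.sum()

def sample_pixels(probs, budget, rng):
    idx = rng.choice(probs.size, size=budget, replace=False, p=probs.ravel())
    return np.unravel_index(idx, probs.shape)

rng = np.random.default_rng(0)
H, W = 256, 256
grad_var = rng.random((H, W))                 # stand-in for running per-pixel gradient variance
pixel_loss = rng.random((H, W))               # stand-in for per-pixel loss contribution
probs = pixel_sampling_probs(grad_var, pixel_loss)
ys, xs = sample_pixels(probs, budget=4096, rng=rng)   # pixels to render this iteration
```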
Inferring scene parameters such as BSDFs and volume densities from user-provided target images has been achieved using a gradient-based optimization framework, which iteratively updates the parameters using the gradient of a loss function defined by the differences between rendered and target images. The gradient can be unbiasedly estimated via a physics-based rendering, i.e., differentiable Monte Carlo rendering. However, the estimated gradient can become noisy unless a large number of samples are used for gradient estimation, and relying on this noisy gradient often slows optimization convergence. An alternative is to exploit a biased version of the gradient, e.g., a filtered gradient, to achieve faster optimization convergence. Unfortunately, this can result in less noisy but overly blurred scene parameters compared to those obtained using unbiased gradients. This paper proposes a gradient combiner that blends unbiased and biased gradients in parameter space instead of relying solely on one gradient type (i.e., unbiased or biased). We demonstrate that optimization with our combined gradient enables more accurate inference of scene parameters than using unbiased or biased gradients alone.
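To make the idea of a gradient combiner concrete, the sketch below blends an unbiased (noisy) and a biased (filtered) gradient per parameter, leaning on the biased one where the unbiased estimate is noisy; the variance-versus-bias weighting used here is an illustrative stand-in, not the paper's combiner:

```python
# Minimal sketch: per-parameter blend of unbiased and biased gradient estimates.
import numpy as np

def combine_gradients(g_unbiased, g_biased, var_unbiased, eps=1e-8):
    bias_sq = (g_biased - g_unbiased) ** 2            # crude bias proxy (assumption)
    w = var_unbiased / (var_unbiased + bias_sq + eps) # more weight on biased grad where noise dominates
    return (1.0 - w) * g_unbiased + w * g_biased

g_u = np.random.randn(1000) * 0.5 + 1.0   # noisy unbiased Monte Carlo gradient
g_b = np.full(1000, 0.9)                  # smoother, biased (filtered) gradient
var_u = np.full(1000, 0.25)               # tracked variance of the unbiased estimator
g = combine_gradients(g_u, g_b, var_u)    # gradient used for the optimizer step
```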
Current makeup transfer methods are limited to simple makeup styles, making them difficult to apply in real-world scenarios. In this paper, we introduce Stable-Makeup, a novel diffusion-based makeup transfer method capable of robustly transferring a wide range of real-world makeup onto user-provided faces. Stable-Makeup is based on a pre-trained diffusion model and utilizes a Detail-Preserving (D-P) makeup encoder to encode makeup details. It also employs content and structural control modules to preserve the content and structural information of the source image. With the aid of our newly added makeup cross-attention layers in U-Net, we can accurately transfer the detailed makeup to the corresponding position in the source image. After content-structure decoupling training, Stable-Makeup can maintain the content and the facial structure of the source image. Moreover, our method has demonstrated strong robustness and generalizability, making it applicable to various tasks such as cross-domain makeup transfer, makeup-guided text-to-image generation, and so on. Extensive experiments have demonstrated that our approach delivers state-of-the-art results among existing makeup transfer methods and exhibits highly promising potential for broad applications in various related fields.
We present FashionComposer for compositional fashion image generation. Unlike previous methods, FashionComposer is highly flexible. It takes multi-modal input (i.e., text prompt, parametric human model, garment image, and face image) and supports personalizing the appearance, pose, and figure of the human and assigning multiple garments in one pass. To achieve this, we first develop a universal framework capable of handling diverse input modalities. We construct scaled training data to enhance the model’s robust compositional capabilities. To accommodate multiple reference images (garments and faces) seamlessly, we organize these references in a single image as an “asset library” and employ a reference UNet [Hu et al. 2023] to extract appearance features. To inject the appearance features into the correct pixels in the generated result, we propose subject-binding attention. It binds the appearance features from different “assets” with the corresponding text features. In this way, the model can understand each asset according to its semantics, supporting arbitrary numbers and types of reference images. As a comprehensive solution, FashionComposer also supports many other applications like human album generation, diverse virtual try-on tasks, etc.
We introduce ReStyle3D, a novel framework for scene-level appearance transfer from a single style image to a real-world scene represented by multiple views. The method combines explicit semantic correspondences with multi-view consistency to achieve precise and coherent stylization. Unlike conventional stylization methods that apply a reference style globally, ReStyle3D uses open-vocabulary segmentation to establish dense, instance-level correspondences between the style and real-world images. This ensures that each object is stylized with semantically matched textures. ReStyle3D first transfers the style to a single view using a training-free semantic-attention mechanism in a diffusion model. It then lifts the stylization to additional views via a learned warp-and-refine network guided by monocular depth and pixel-wise correspondences. Experiments show that ReStyle3D consistently outperforms prior methods in structure preservation, perceptual style similarity, and multi-view coherence. User studies further validate its ability to produce photo-realistic, semantically faithful results. Our code, pretrained models, and dataset will be publicly released, to support new applications in interior design, virtual staging, and 3D-consistent stylization. Project page and code at .
With the growing success of text- or image-guided 3D generators, users demand more control over the generation process, appearance stylization being one of them. Given a reference image, this requires adapting the appearance of a generated 3D asset to reflect the visual style of the reference while maintaining visual consistency from multiple viewpoints. To tackle this problem, we draw inspiration from the success of 2D stylization methods that leverage the attention mechanisms in large image generation models to capture and transfer visual style. In particular, we probe whether large reconstruction models, commonly used in the context of 3D generation, have a similar capability. We discover that certain attention blocks in these models capture appearance-specific features. By injecting features from a visual style image into such blocks, we develop a simple yet effective 3D appearance stylization method. Our method does not require training or test-time optimization. Through both quantitative and qualitative evaluations, we demonstrate that our approach achieves superior results in terms of 3D appearance stylization, significantly improving efficiency while maintaining high-quality visual outcomes. Code and models are available via our project website: https://github.com/ipekoztas/3D-Stylization-LRM.
Scalable Vector Graphics (SVGs) are highly favored by designers due to their resolution independence and well-organized layer structure. Although existing text-to-vector (T2V) generation methods can create SVGs from text prompts, they often overlook an important need in practical applications: style customization, which is vital for producing a collection of vector graphics with consistent visual appearance and coherent aesthetics.
Extending existing T2V methods for style customization poses certain challenges. Optimization-based T2V models can utilize the priors of text-to-image (T2I) models for customization, but struggle with maintaining structural regularity. On the other hand, feed-forward T2V models can ensure structural regularity, yet they encounter difficulties in disentangling content and style due to limited SVG training data.
To address these challenges, we propose a novel two-stage style customization pipeline for SVG generation, making use of the advantages of both feed-forward T2V models and T2I image priors. In the first stage, we train a T2V diffusion model with a path-level representation to ensure the structural regularity of SVGs while preserving diverse expressive capabilities. In the second stage, we customize the T2V diffusion model to different styles by distilling customized T2I models. By integrating these techniques, our pipeline can generate high-quality and diverse SVGs in custom styles based on text prompts in an efficient feed-forward manner. The effectiveness of our method has been validated through extensive experiments. The project page is .
The comic production industry requires reference-based line art colorization with high accuracy, efficiency, contextual consistency, and flexible control. A comic page often involves diverse characters, objects, and backgrounds, which complicates the coloring process. Despite advancements in diffusion models for image generation, their application in line art colorization remains limited, facing challenges related to handling extensive reference images, time-consuming inference, and flexible control. We investigate how extensive contextual image guidance affects the quality of line art colorization. To address these challenges, we introduce Cobra, an efficient and versatile method that supports color hints and utilizes over 200 reference images while maintaining low latency. Central to Cobra is a Causal Sparse DiT architecture, which leverages specially designed positional encodings, causal sparse attention, and Key-Value Cache to effectively manage long-context references and ensure color identity consistency. Results demonstrate that Cobra achieves accurate line art colorization through extensive contextual reference, significantly enhancing inference speed and interactivity, thereby meeting critical industrial demands. We release our codes and models on our project page: .
Trajectory modeling of dense points usually employs implicit deformation fields, represented as neural networks that map coordinates to relate canonical spatial positions to temporal offsets. However, the inductive biases inherent in neural networks can hinder spatial coherence in ill-posed scenarios. Current methods either focus on enhancing encoding strategies for deformation fields, often resulting in opaque and less intuitive models, or adopt explicit techniques like linear blend skinning, which rely on heuristic-based node initialization. Additionally, the potential of implicit representations for interpolating sparse temporal signals remains under-explored. To address these challenges, we propose a spline-based trajectory representation, where the number of knots explicitly determines the degrees of freedom. This approach enables efficient analytical derivation of velocities and accelerations, preserving spatial coherence while mitigating temporal fluctuations. To model knot characteristics in both spatial and temporal domains, we introduce a novel low-rank time-variant spatial encoding, replacing conventional coupled spatiotemporal techniques. Our method demonstrates superior performance in temporal interpolation for fitting continuous fields with sparse inputs. Furthermore, it achieves competitive dynamic scene reconstruction quality compared to state-of-the-art methods while enhancing motion coherence without relying on linear blend skinning or as-rigid-as-possible constraints.
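As a rough illustration of a knot-based spline trajectory with analytic derivatives, the sketch below evaluates position and velocity of a uniform Catmull-Rom segment; the paper's specific spline basis and knot parameterization may differ, and the knot values are random stand-ins:

```python
# Minimal sketch: one point's trajectory as a cubic spline over a fixed number of knots.
import numpy as np

def catmull_rom(p0, p1, p2, p3, t):
    """Position and analytic velocity of a uniform Catmull-Rom segment at t in [0, 1]."""
    m1, m2 = 0.5 * (p2 - p0), 0.5 * (p3 - p1)                 # tangents at the segment ends
    h00, h10 = 2*t**3 - 3*t**2 + 1, t**3 - 2*t**2 + t          # Hermite basis
    h01, h11 = -2*t**3 + 3*t**2, t**3 - t**2
    pos = h00*p1 + h10*m1 + h01*p2 + h11*m2
    dh00, dh10 = 6*t**2 - 6*t, 3*t**2 - 4*t + 1                # basis derivatives
    dh01, dh11 = -6*t**2 + 6*t, 3*t**2 - 2*t
    vel = dh00*p1 + dh10*m1 + dh01*p2 + dh11*m2                # velocity w.r.t. segment parameter
    return pos, vel

knots = np.cumsum(np.random.randn(8, 3) * 0.1, axis=0)         # 8 knots bound the degrees of freedom
pos, vel = catmull_rom(knots[0], knots[1], knots[2], knots[3], t=0.3)
```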
Cosserat rods have become an increasingly popular framework for simulating complex bending and twisting in thin elastic rods, used for hair, tree, and yarn-level cloth models. However, traditional approaches often encounter significant challenges in robustly and efficiently solving for valid quaternion orientations, even when employing small time steps or computationally expensive global solvers. We introduce stable Cosserat rods, a new solver that can achieve high accuracy with high stiffness levels and maintain stability under large time steps. It is also inherently suitable for parallelization. Our key contribution is a split position and rotation optimization scheme with a closed-form Gauss-Seidel quasi-static orientation update. This solver significantly improves the numerical stability with Cosserat rods, allowing faster computation and larger time steps. We validate our method across a wide range of applications, including simulations of hair, trees, yarn-level cloth, slingshots, and bridges, demonstrating its ability to handle diverse material behaviors and complex geometries. Furthermore, we show that our method is orders of magnitude faster and more stable than alternative rod solvers, such as extended position-based dynamics and discrete elastic rods.
Numerical schemes for time integration are the cornerstone of dynamical simulations for deformable solids. The most popular time integrators for isotropic distortion energies rely on nonlinear root-finding solvers, most prominently, Newton’s method. These solvers are computationally expensive and sensitive to ill-conditioned Hessians and poor initial guesses; these difficulties can particularly hamper the effectiveness of variational integrators, whose momentum conservation properties require reliable root-finding. To tackle these difficulties, this paper shows how to express variational time integration for a large class of elastic energies as an optimization problem with a “hidden” convex substructure. This hidden convexity suggests uses of optimization techniques with rigorous convergence analysis, guaranteed inversion-free elements, and conservation of physical invariants up to tolerance/numerical precision. In particular, we propose an Alternating Direction Method of Multipliers (ADMM) algorithm combined with a proximal operator step to solve our formulation. Empirically, our integrator improves the performance of elastic simulation tasks, as we demonstrate in a number of examples.
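As a rough illustration of the splitting structure ADMM exploits, the sketch below runs the scaled-form ADMM loop with a proximal step on a toy problem; the paper's actual subproblems involve the elastic energy and inertia terms rather than the quadratic-plus-L1 objective used here:

```python
# Minimal sketch: scaled-form ADMM with a proximal operator step (toy lasso problem).
import numpy as np

def prox_l1(v, tau):
    # Proximal operator of tau * ||.||_1 (soft-thresholding), a stand-in g-step.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def admm(A, b, rho=1.0, lam=0.1, iters=100):
    # Solves min_x 0.5*||A x - b||^2 + lam*||z||_1  subject to  x = z.
    n = A.shape[1]
    x = z = u = np.zeros(n)
    AtA, Atb = A.T @ A, A.T @ b
    L = np.linalg.cholesky(AtA + rho * np.eye(n))              # factor once, reuse each iteration
    for _ in range(iters):
        rhs = Atb + rho * (z - u)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))      # x-update (linear solve)
        z = prox_l1(x + u, lam / rho)                          # proximal z-update
        u = u + x - z                                          # scaled dual update
    return x

A = np.random.randn(50, 20); b = np.random.randn(50)
x = admm(A, b)
```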
We introduce CLR-Wire, a novel framework for 3D curve-based wireframe generation that integrates geometry and topology into a unified Continuous Latent Representation. Unlike conventional methods that decouple vertices, edges, and faces, CLR-Wire encodes curves as Neural Parametric Curves along with their topological connectivity into a continuous and fixed-length latent space using an attention-driven variational autoencoder (VAE). This unified approach facilitates joint learning and generation of both geometry and topology. To generate wireframes, we employ a flow matching model to progressively map Gaussian noise to these latents, which are subsequently decoded into complete 3D wireframes. Our method provides fine-grained modeling of complex shapes and irregular topologies, and supports both unconditional generation and generation conditioned on point cloud or image inputs. Experimental results demonstrate that, compared with state-of-the-art generative approaches, our method achieves substantial improvements in accuracy, novelty, and diversity, offering an efficient and comprehensive solution for CAD design, geometric reconstruction, and 3D content creation.
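To make the flow-matching step concrete, the sketch below shows the standard conditional flow-matching objective for mapping Gaussian noise to fixed-length latents along a linear interpolation path; the velocity network is a placeholder MLP and the latent batch is random, standing in for codes produced by the wireframe VAE encoder:

```python
# Minimal sketch: flow-matching loss for generating fixed-length latent codes.
import torch
import torch.nn as nn

v_net = nn.Sequential(nn.Linear(256 + 1, 512), nn.SiLU(), nn.Linear(512, 256))

def flow_matching_loss(x1):
    # x1: a batch of latent codes (assumed to come from the VAE encoder).
    x0 = torch.randn_like(x1)                           # Gaussian source samples
    t = torch.rand(x1.shape[0], 1)
    xt = (1.0 - t) * x0 + t * x1                        # linear interpolation path
    target_v = x1 - x0                                  # constant velocity along that path
    pred_v = v_net(torch.cat([xt, t], dim=-1))
    return ((pred_v - target_v) ** 2).mean()

loss = flow_matching_loss(torch.randn(64, 256))
loss.backward()
```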
Point-based representations have consistently played a vital role in geometric data structures. Most point cloud learning and processing methods typically leverage the unordered and unconstrained nature to represent the underlying geometry of 3D shapes. However, how to extract meaningful structural information from unstructured point cloud distributions and transform them into semantically meaningful point distributions remains an under-explored problem. We present PDT, a novel framework for point distribution transformation with diffusion models. Given a set of input points, PDT learns to transform the point set from its original geometric distribution into a target distribution that is semantically meaningful. Our method utilizes diffusion models with a novel architecture and learning strategy, which effectively correlates the source and the target distribution through a denoising process. Through extensive experiments, we show that our method successfully transforms input point clouds into various forms of structured outputs, ranging from surface-aligned keypoints and inner sparse joints to continuous feature lines. The results showcase our framework’s ability to capture both geometric and semantic features, offering a powerful tool for various 3D geometry processing tasks where structured point distributions are desired. Code will be available at .
Recent advancements in 3D generation techniques have simplified the tedious manual process of 3D asset production. Among these methods, 3D native latent diffusion models are particularly effective in generating high-quality geometric details. However, achieving local enhancement and editing of the generated 3D models remains a challenge due to the limited understanding of the relationship between text, images, and 3D in terms of local semantics and feature space. We explore and reveal the characteristics of the native 3D latent space, making it decomposable and low-rank, thereby enabling efficient and effective learning for multimodal local alignment. Based on this, we introduce RELATE3D, a novel approach that combines a Refocusing Adapter with part-to-latent correspondence guided training for precise local enhancement and part-level editing of 3D geometry. The Refocusing Adapter incorporates partial image and caption signals, and, combined with part-to-latent mapping, directs modifications to the relevant latent dimensions during the latent diffusion process. We validate the effectiveness of our approach through extensive experiments and ablation studies, showcasing the capabilities of our generative local enhancement and editing process, as well as global refinement.
Recent advancements in generative models have significantly propelled 3D scene editing. While existing methods excel at text-guided texture modifications for 3D representations like 3D Gaussian Splatting (3DGS), they struggle with geometric transformations (e.g., rotating a character’s head) and lack precise spatial control over edits due to the inherent ambiguity of language-driven guidance. To address these limitations, we introduce DYG, a 3D drag-based editing framework for 3DGS. Users intuitively define editing regions using 3D masks and specify desired transformations through pairs of control points. DYG integrates the implicit triplane representation to establish the geometric scaffold of editing results, effectively overcoming suboptimal editing outcomes caused by the sparsity of 3DGS in the desired editing regions. Additionally, we incorporate a drag-based Latent Diffusion Model through the proposed Drag-SDS loss, enabling flexible, multi-view consistent, and fine-grained editing. Extensive experiments demonstrate that DYG enables effective drag-based editing, outperforming other baselines in terms of editing effect and quality. Additional results are available on our project page: .
Recently, 3D generation methods have shown their powerful ability to automate 3D model creation. However, most 3D generation methods rely only on an input image or a text prompt to generate a 3D model, which lacks control over individual components of the generated 3D model. Any modification of the input image leads to regeneration of the entire 3D model. In this paper, we introduce a new method called CMD that generates a 3D model from an input image while enabling flexible local editing of each component of the 3D model. In CMD, we formulate the 3D generation as a conditional multiview diffusion model, which takes the existing or known parts as conditions and generates the edited or added components. This conditional multiview diffusion model not only allows the generation of 3D models part by part but also enables local editing of 3D models according to the local revision of the input image without changing other 3D parts. Extensive experiments demonstrate that CMD decomposes a complex 3D generation task into multiple components, improving the generation quality. Meanwhile, CMD enables efficient and flexible local editing of a 3D model by just editing one rendered image.
Recent advancements in large vision-language models have enabled highly expressive and diverse vector sketch generation. However, state-of-the-art methods rely on a time-consuming optimization process involving repeated feedback from a pretrained model to determine stroke placement. Consequently, despite producing impressive sketches, these methods are limited in practical applications. In this work, we introduce SwiftSketch, a diffusion model for image-conditioned vector sketch generation that can produce high-quality sketches in less than a second. SwiftSketch operates by progressively denoising stroke control points sampled from a Gaussian distribution. Its transformer-decoder architecture is designed to effectively handle the discrete nature of vector representation and capture the inherent global dependencies between strokes. To train SwiftSketch, we construct a synthetic dataset of image-sketch pairs, addressing the limitations of existing sketch datasets, which are often created by non-artists and lack professional quality. For generating these synthetic sketches, we introduce ControlSketch, a method that enhances SDS-based techniques by incorporating precise spatial control through a depth-aware ControlNet. We demonstrate that SwiftSketch generalizes across diverse concepts, efficiently producing sketches that combine high fidelity with a natural and visually appealing style.
Monte Carlo (MC) rendering is a widely used approach for photorealistic image synthesis, yet real-time applications often limit sampling to one path per pixel, resulting in high noise levels. To mitigate this, resampled importance sampling (RIS) has shown promise by approximating ideal sample distributions through a discrete set of candidates, avoiding the complexity of neural models or data-intensive structures. However, current RIS techniques often rely on random sampling, which fails to maximize the potential of the candidate pool. We propose a two-step approach that first organizes candidate samples into local histograms and then samples the histograms using quasi-Monte Carlo and antithetic patterns. This can be done with minimal overhead and allows us to reduce rendering error and increase visual quality. Additionally, we show how it can be combined with blue-noise error distribution to perceptually reduce noise artifacts. Our approach yields a higher-quality resampling estimator with enhanced noise reduction, demonstrating significant improvements in real-time rendering tasks.
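As a rough illustration of the resampling step described above, the sketch below builds a cumulative histogram over candidate weights and draws from it with stratified, antithetically mirrored points instead of independent uniforms; the binning scheme, candidate weights, and low-discrepancy pattern are illustrative assumptions, not the paper's exact choices:

```python
# Minimal sketch: resample candidates through a local histogram (CDF) with
# stratified, antithetic sample points.
import numpy as np

def resample_from_histogram(weights, n_draws, rng):
    cdf = np.cumsum(weights) / weights.sum()
    # Stratified points sharing one random shift, mirrored to form antithetic pairs.
    base = (np.arange(n_draws // 2) + rng.random()) / (n_draws // 2)
    u = np.concatenate([base, 1.0 - base]) % 1.0
    return np.searchsorted(cdf, u)           # indices of the selected candidates

rng = np.random.default_rng(1)
candidate_weights = rng.random(64)           # e.g. unshadowed contributions of RIS candidates
chosen = resample_from_histogram(candidate_weights, n_draws=8, rng=rng)
```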
The recent formulation of the stylized rendering equation (SRE) models stylization by applying nonlinear functions to reflected radiance recursively at each bounce, allowing a seamless blend between stylized and physically based light transport. A naive estimator has to branch at each stylized surface, resulting in exponential computation and storage cost. We propose a practical approach for rendering scenes with SRE at a tractable cost. We first propose nonlinear path filtering (NL-PF) that caches the radiance evaluations at intermediate bounces, reducing the exponential sampling cost of the branching estimator of SRE to polynomial. Despite the effectiveness of NL-PF, its high memory cost makes it less scalable. To further improve efficiency, we propose nonlinear radiance caching (NL-NRC) where we apply a compact neural network to store radiance fields. Our NL-NRC has the same linear time sampling cost as a non-branching path tracer and can solve SRE with a high number of bounces and recursive stylization. Our key insight is that, by allowing the network to learn outgoing radiance prior to applying any nonlinear function, the network converges to the correct solution, even when we only have access to biased gradients due to nonlinearity. Our NL-NRC enables rendering scenes with arbitrary, highly nonlinear stylization while achieving significant speedup over branching estimators.
Reinforcement learning (RL) has significantly advanced the control of physics-based and robotic characters that track kinematic reference motion. However, methods typically rely on a weighted sum of conflicting reward functions, requiring extensive tuning to achieve a desired behavior. Due to the computational cost of RL, this iterative process is a tedious, time-intensive task. Furthermore, for robotics applications, the weights need to be chosen such that the policy performs well in the real world, despite inevitable sim-to-real gaps. To address these challenges, we propose a multi-objective reinforcement learning framework that trains a single policy conditioned on a set of weights, spanning the Pareto front of reward trade-offs. Within this framework, weights can be selected and tuned after training, significantly speeding up iteration time. We demonstrate how this improved workflow can be used to perform highly dynamic motions with a robot character. Moreover, we explore how weight-conditioned policies can be leveraged in hierarchical settings, using a high-level policy to dynamically select weights according to the current task. We show that the multi-objective policy encodes a diverse spectrum of behaviors, facilitating efficient adaptation to novel tasks.
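As a rough illustration of a weight-conditioned policy, the sketch below concatenates sampled reward weights to the observation and scalarizes per-objective rewards with those weights; the network sizes, observation and action dimensions, and the Dirichlet weight sampling are illustrative assumptions rather than details from the paper:

```python
# Minimal sketch: a policy conditioned on reward weights plus scalarized rewards.
import torch
import torch.nn as nn

class WeightConditionedPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, n_rewards):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_rewards, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs, w):
        return torch.tanh(self.net(torch.cat([obs, w], dim=-1)))   # weights as extra observations

def scalarize(reward_terms, w):
    # reward_terms: per-objective rewards, e.g. [tracking, energy, smoothness].
    return (reward_terms * w).sum(dim=-1)

policy = WeightConditionedPolicy(obs_dim=60, act_dim=20, n_rewards=3)
obs = torch.randn(32, 60)
w = torch.distributions.Dirichlet(torch.ones(3)).sample((32,))     # weights sampled per episode
act = policy(obs, w)
r = scalarize(torch.rand(32, 3), w)                                # reward used for the RL update
```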
We present an interactive and explainable automated coaching assistant called CueTip for a variant of pool/billiards. CueTip’s novelty lies in its combination of three features: a natural-language interface, an ability to perform contextual, physics-aware reasoning, and that its explanations are rooted in a set of predetermined guidelines developed by domain experts. We instrument a physics simulator so that it generates event traces in natural language alongside traditional state traces. Event traces lend themselves to interpretation by language models, which serve as the interface to our assistant. We design and train a neural adaptor that decouples tactical choices made by CueTip from its interactivity and explainability allowing it to be reconfigured to mimic any pool playing agent. Our experiments show that CueTip enables contextual query-based assistance and explanations while maintaining the strength of the agent in terms of win rate (improving it in some situations). The explanations generated by CueTip are physically-aware and grounded in the expert rules and are therefore more reliable.
We present the first scene-update aerial path planning algorithm specifically designed for detecting and updating change areas in urban environments. While existing methods for large-scale 3D urban scene reconstruction focus on achieving high accuracy and completeness, they are inefficient for scenarios requiring periodic updates, as they often re-explore and reconstruct entire scenes, wasting significant time and resources on unchanged areas. To address this limitation, our method leverages prior reconstructions and change probability statistics to guide UAVs in detecting and focusing on areas likely to have changed. Our approach introduces a novel changeability heuristic to evaluate the likelihood of scene changes, driving the planning of two flight paths: a prior path informed by static priors and a dynamic real-time path that adapts to newly detected changes. Extensive experiments on real-world urban datasets demonstrate that our method significantly reduces flight time and computational overhead while maintaining high-quality updates comparable to full-scene re-exploration and reconstruction. These contributions pave the way for efficient, scalable, and adaptive UAV-based scene updates in complex urban environments.
Speech-driven 3D facial animation plays a key role in applications such as virtual avatars, gaming, and digital content creation. While existing methods have made significant progress in achieving accurate lip synchronization and generating basic emotional expressions, they often struggle to capture and effectively transfer nuanced performance styles. We propose a novel example-based generation framework that conditions a latent diffusion model on a reference style clip to produce highly expressive and temporally coherent facial animations. To address the challenge of accurately adhering to the style reference, we introduce a novel conditioning mechanism called style basis, which extracts key poses from the reference and additively guides the diffusion generation process to fit the style without compromising lip synchronization quality. This approach enables the model to capture subtle stylistic cues while ensuring that the generated animations align closely with the input speech. Extensive qualitative, quantitative, and perceptual evaluations demonstrate the effectiveness of our method in faithfully reproducing the desired style while achieving superior lip synchronization across various speech scenarios.
We propose StreamME, a method that focuses on fast 3D avatar reconstruction. StreamME synchronously records and reconstructs a head avatar from live video streams without any pre-cached data, enabling seamless integration of the reconstructed appearance into downstream applications. This exceptionally fast training strategy, which we refer to as on-the-fly training, is central to our approach. Our method is built upon 3D Gaussian Splatting (3DGS), eliminating the reliance on MLPs in deformable 3DGS and relying solely on geometry, which significantly improves the adaptation speed to facial expressions. To further ensure high efficiency in on-the-fly training, we introduce a simplification strategy based on primary points, which distributes the point cloud more sparsely across the facial surface, optimizing the number of points while maintaining rendering quality. Leveraging the on-the-fly training capabilities, our method protects facial privacy and reduces communication bandwidth in VR systems or online conferencing. Additionally, it can be directly applied to downstream applications such as animation, toonification, and relighting. Please refer to our project page for more details: .
Generating high-fidelity real-time animated sequences of photorealistic 3D head avatars is important for many graphics applications, including immersive telepresence and movies. This is a challenging problem, particularly when rendering digital avatar close-ups that show a character’s facial microfeatures and expressions. To capture the expressive, detailed nature of human heads, including skin furrowing and finer-scale facial movements, we propose to couple locally-defined facial expressions with 3D Gaussian splatting to enable creating ultra-high fidelity, expressive and photorealistic 3D head avatars. In contrast to previous works that operate on a global expression space, we condition our avatar’s dynamics on patch-based local expression features and synthesize 3D Gaussians at a patch level. In particular, we leverage a patch-based geometric 3D face model to extract patch expressions and learn how to translate these into local dynamic skin appearance and motion by coupling the patches with anchor points of Scaffold-GS, a recent hierarchical scene representation. These anchors are then used to synthesize 3D Gaussians on-the-fly, conditioned on patch expressions and viewing direction. We employ color-based densification and progressive training to obtain high-quality results and faster convergence for high-resolution 3K training images. By leveraging patch-level expressions, ScaffoldAvatar consistently achieves state-of-the-art performance with visually natural motion, while encompassing diverse facial expressions and styles in real time.
Gaussian-based human avatars have achieved an unprecedented level of visual fidelity. However, existing approaches based on high-capacity neural networks typically require a desktop GPU to achieve real-time performance for a single avatar, and it remains non-trivial to animate and render such avatars on mobile devices including a standalone VR headset due to substantially limited memory and computational bandwidth. In this paper, we present SqueezeMe, a simple and highly effective framework to convert high-fidelity 3D Gaussian full-body avatars into a lightweight representation that supports both animation and rendering with mobile-grade compute. Our key observation is that the decoding of pose-dependent Gaussian attributes from a neural network creates non-negligible memory and computational overhead. Inspired by blendshapes and linear pose correctives widely used in Computer Graphics, we address this by distilling the pose correctives learned with neural networks into linear layers. Moreover, we further reduce the parameters by sharing the correctives among nearby Gaussians. Combining them with a custom splatting pipeline based on Vulkan, we achieve, for the first time, simultaneous animation and rendering of 3 Gaussian avatars in real-time (72 FPS) on a Meta Quest 3 VR headset.
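To make the distillation idea above concrete, here is a hypothetical, simplified sketch (dimensions and attribute layout are illustrative): pose-dependent Gaussian correctives become a per-group linear map shared by nearby Gaussians, so animation reduces to one small matrix-vector product per group, which is cheap enough for mobile-grade compute.

import numpy as np

# Hypothetical sketch: distill pose-dependent Gaussian correctives into a linear
# map (blendshape-style), with one corrective basis shared by a group of nearby
# Gaussians to cut parameters and per-frame compute.
rng = np.random.default_rng(0)
n_groups, n_gaussians, pose_dim, attr_dim = 128, 100_000, 48, 11  # attrs: position, rotation, scale, opacity, ...

group_of = rng.integers(0, n_groups, size=n_gaussians)           # Gaussian -> shared group
basis = rng.normal(size=(n_groups, attr_dim, pose_dim)) * 0.01   # distilled linear correctives
rest_attrs = rng.normal(size=(n_gaussians, attr_dim))

def animate(pose):
    group_corr = basis @ pose                 # one matrix-vector product per group
    return rest_attrs + group_corr[group_of]  # broadcast shared correctives to Gaussians

pose = rng.normal(size=pose_dim)
attrs = animate(pose)                         # ready for the splatting pipeline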
Generating multiple distinct subjects remains a challenge for existing text-to-image diffusion models. Complex prompts often lead to subject leakage, causing inaccuracies in quantities, attributes, and visual features. Preventing leakage among subjects necessitates knowledge of each subject’s spatial location. Recent methods provide these spatial locations via an external layout control. However, enforcing such a prescribed layout often conflicts with the innate layout dictated by the sampled initial noise, leading to misalignment with the model’s prior. In this work, we introduce a new approach that predicts a spatial layout aligned with the prompt, derived from the initial noise, and refines it throughout the denoising process. By relying on this noise-induced layout, we avoid conflicts with externally imposed layouts and better preserve the model’s prior. Our method employs a small neural network to predict and refine the evolving noise-induced layout at each denoising step, ensuring clear boundaries between subjects while maintaining consistency. Experimental results show that this noise-aligned strategy achieves improved text-image alignment and more stable multi-subject generation compared to existing layout-guided techniques, while preserving the rich diversity of the model’s original distribution.
Image editing is an important task in computer graphics, vision, and VFX, with recent diffusion-based methods achieving fast and high-quality results. However, edits requiring significant structural changes, such as non-rigid deformations, object modifications, or content generation, remain challenging. Existing few-step editing approaches produce artifacts such as irrelevant texture or struggle to preserve key attributes of the source image (e.g., pose). We introduce Cora, a novel editing framework that addresses these limitations by introducing correspondence-aware noise correction and interpolated attention maps. Our method aligns textures and structures between the source and target images through semantic correspondence, enabling accurate texture transfer while generating new content when necessary. Cora offers control over the balance between content generation and preservation. Extensive experiments demonstrate that, quantitatively and qualitatively, Cora excels in maintaining structure, textures, and identity across diverse edits, including pose changes, object addition, and texture refinements. User studies confirm that Cora delivers superior results, outperforming alternatives.
Project Page & source code: cora-edit.github.io
Despite significant advances in modeling image priors via diffusion models, 3D-aware image editing remains challenging, in part because the object is only specified via a single image. To tackle this challenge, we propose 3D-Fixup, a new framework for editing 2D images guided by learned 3D priors. The framework supports difficult editing situations such as object translation and 3D rotation. To achieve this, we leverage a training-based approach that harnesses the generative power of diffusion models. As video data naturally encodes real-world physical dynamics, we turn to video data for generating training data pairs, i.e., a source and a target frame. Rather than relying solely on a single trained model to infer transformations between source and target frames, we incorporate 3D guidance from an Image-to-3D model, which bridges this challenging task by explicitly projecting 2D information into 3D space. We design a data generation pipeline to ensure high-quality 3D guidance throughout training. Results show that by integrating these 3D priors, 3D-Fixup effectively supports complex, identity coherent 3D-aware edits, achieving high-quality results and advancing the application of diffusion models in realistic image manipulation. The code is provided at .
We present the first text-based image editing approach for object parts based on pre-trained diffusion models. Diffusion-based image editing approaches capitalized on the deep understanding of diffusion models of image semantics to perform a variety of edits. However, existing diffusion models lack sufficient understanding of many object parts, hindering fine-grained edits requested by users. To address this, we propose to expand the knowledge of pre-trained diffusion models to allow them to understand various object parts, enabling them to perform fine-grained edits. We achieve this by learning special textual tokens that correspond to different object parts through an efficient token optimization process. These tokens are optimized to produce reliable localization masks at each inference step to localize the editing region. Leveraging these masks, we design feature-blending and adaptive thresholding strategies to execute the edits seamlessly. To evaluate our approach, we establish a benchmark and an evaluation protocol for part editing. Experiments show that our approach outperforms existing editing methods on all metrics and is preferred by users 66–90% of the time in conducted user studies.
Sketch segmentation involves grouping pixels within a sketch that belong to the same object or instance. It serves as a valuable tool for sketch editing tasks, such as moving, scaling, or removing specific components. While image segmentation models have demonstrated remarkable capabilities in recent years, sketches present unique challenges for these models due to their sparse nature and wide variation in styles. We introduce InkLayer, a method for instance segmentation of raster scene sketches. Our approach adapts state-of-the-art image segmentation and object detection models to the sketch domain by employing class-agnostic fine-tuning and refining segmentation masks using depth cues. Furthermore, our method organizes sketches into sorted layers, where occluded instances are inpainted, enabling advanced sketch editing applications. As existing datasets in this domain lack variation in sketch styles, we construct a synthetic scene sketch segmentation dataset, InkScenes, featuring sketches with diverse brush strokes and varying levels of detail. We use this dataset to demonstrate the robustness of our approach. Code and data for this paper are released at project page: .
We introduce a novel method for directional-field design on meshes, enabling users to specify singularities at any location on a mesh. Our method uses a piecewise power-linear representation for phase and scale, offering precise control over field topology. The resulting fields are smooth and accommodate any singularity index and field symmetry. With this representation, we mitigate the artifacts caused by coarse or uneven meshes. We showcase our approach on meshes with diverse topologies and triangle qualities.
We propose an exact method for embedding a disk-topology triangular mesh onto any convex polygon. The method employs a divide-and-conquer approach, iteratively decomposing the embedding problem into smaller sub-problems that map sub-meshes to convex sub-polygons. The process continues until each triangle in the mesh is naturally embedded into a corresponding 3-sided polygon. The approach is supported by a constructive proof, ensuring its theoretical validity. We translate this proof into a practical algorithm, incorporating various dividing strategies and interpolation weights. Unlike previous methods, our approach preserves the connectivity of the input mesh throughout the embedding process. Extensive experiments demonstrate the efficiency and effectiveness of the proposed method.
3D Gaussian Splatting (3DGS) has emerged as a mainstream solution for novel view synthesis and 3D reconstruction. By explicitly encoding a 3D scene using a collection of Gaussian kernels, 3DGS achieves high-quality rendering with superior efficiency. As a learning-based approach, 3DGS training has typically been handled with the standard stochastic gradient descent (SGD) method, which offers at most linear convergence. Consequently, training often requires tens of minutes, even with GPU acceleration. This paper introduces a (near) second-order convergent training algorithm for 3DGS, leveraging its unique properties. Our approach is inspired by two key observations. First, the attributes of a Gaussian kernel contribute independently to the image-space loss, which endorses isolated and local optimization algorithms. We exploit this by splitting the optimization at the level of individual kernel attributes, analytically constructing small-size Newton systems for each parameter group, and efficiently solving these systems on GPU threads. This achieves Newton-like convergence per training image without relying on the global Hessian. Second, kernels exhibit sparse and structured coupling across input images. This property allows us to effectively utilize spatial information to mitigate overshoot during stochastic training. Our method converges an order of magnitude faster than standard GPU-based 3DGS training, requiring over 10× fewer iterations while maintaining or surpassing the quality of SGD-based 3DGS reconstructions.
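A minimal sketch of the per-attribute Newton idea described above (hypothetical and CPU-side for clarity; the actual GPU implementation would differ): each parameter group gets its own small damped Newton system, solved independently of all others.

import numpy as np

# Hypothetical sketch: instead of one global Hessian, assemble a small Newton
# system per kernel attribute group (e.g., the 3 position components of one
# Gaussian) and solve each independently, which maps well to one GPU thread each.
def newton_step_per_group(grads, hessians, damping=1e-4):
    # grads: (n_groups, d), hessians: (n_groups, d, d), accumulated from the
    # image-space loss for the current training view.
    d = grads.shape[1]
    steps = np.empty_like(grads)
    for i in range(grads.shape[0]):           # trivially parallel in practice
        H = hessians[i] + damping * np.eye(d) # damping limits overshoot
        steps[i] = np.linalg.solve(H, -grads[i])
    return steps

rng = np.random.default_rng(0)
g = rng.normal(size=(1000, 3))
A = rng.normal(size=(1000, 3, 3))
H = A @ A.transpose(0, 2, 1)                  # symmetric PSD stand-ins
params = rng.normal(size=(1000, 3))
params += newton_step_per_group(g, H)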
3D Gaussian Splatting (3DGS) enables the reconstruction of intricate digital 3D assets from multi-view images by leveraging a set of 3D Gaussian primitives for rendering. Its explicit and discrete representation facilitates the seamless composition of complex digital worlds, offering significant advantages over previous neural implicit methods. However, when applied to large-scale compositions, such as crowd-level scenes, it can encompass numerous 3D Gaussians, posing substantial challenges for real-time rendering. To address this, inspired by Unreal Engine 5’s Nanite system, we propose Virtualized 3D Gaussians (V3DG), a cluster-based LOD solution that constructs hierarchical 3D Gaussian clusters and dynamically selects only the necessary ones to accelerate rendering speed. Our approach consists of two stages: (1) Offline Build, where hierarchical clusters are generated using a local splatting method to minimize visual differences across granularities, and (2) Online Selection, where footprint evaluation determines perceptible clusters for efficient rasterization during rendering. We curate a dataset of synthetic and real-world scenes, including objects, trees, people, and buildings, each requiring 0.1 billion 3D Gaussians to capture fine details. Experiments show that our solution balances rendering efficiency and visual quality across user-defined tolerances, facilitating downstream interactive applications that compose extensive 3DGS assets for consistent rendering performance.
3D Gaussian Splatting (3DGS) has advanced radiance field reconstruction by enabling real-time rendering. However, its reliance on Gaussian kernels for geometry and low-order Spherical Harmonics (SH) for color encoding limits its ability to capture complex geometries and diverse colors. We introduce Deformable Beta Splatting (DBS), a deformable and compact approach that enhances both geometry and color representation. DBS replaces Gaussian kernels with deformable Beta Kernels, which offer bounded support and adaptive frequency control to capture fine geometric details with higher fidelity while achieving better memory efficiency. In addition, we extend the Beta Kernel to color encoding, which facilitates improved representation of diffuse and specular components, yielding superior results compared to SH-based methods. Furthermore, unlike prior densification techniques that depend on Gaussian properties, we mathematically prove that adjusting regularized opacity alone ensures distribution-preserved Markov chain Monte Carlo (MCMC), independent of the splatting kernel type. Experimental results demonstrate that DBS achieves state-of-the-art visual quality while utilizing only 45% of the parameters and rendering 1.5× faster than 3DGS-MCMC, highlighting the superior performance of DBS for real-time radiance field rendering. Interactive demonstrations and source code are available on our project website: .
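As a rough, hypothetical illustration of the bounded-support kernel idea above (the actual DBS parameterization may differ), compare a Gaussian falloff with a beta-style kernel whose exponent adapts sharpness:

import numpy as np

# Hypothetical sketch: a bounded-support "beta-style" kernel versus a Gaussian.
# A shape parameter b trades off flat-top vs. peaked falloff, giving adaptive
# frequency control; the paper's exact parameterization may differ.
def gaussian_kernel(r):
    return np.exp(-0.5 * r**2)

def beta_kernel(r, b=2.0):
    t = np.clip(1.0 - r**2, 0.0, None)        # support limited to r <= 1
    return t**b

r = np.linspace(0.0, 2.0, 5)
print(gaussian_kernel(r))   # never exactly zero
print(beta_kernel(r, b=4))  # exactly zero outside the unit support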
Neural image representations have emerged as a promising approach for encoding and rendering visual data. Combined with learning-based workflows, they demonstrate impressive trade-offs between visual fidelity and memory footprint. Existing methods in this domain, however, often rely on fixed data structures that suboptimally allocate memory or compute-intensive implicit models, hindering their practicality for real-time graphics applications.
Inspired by recent advancements in radiance field rendering, we introduce Image-GS, a content-adaptive image representation based on 2D Gaussians. Leveraging a custom differentiable renderer, Image-GS reconstructs images by adaptively allocating and progressively optimizing a group of anisotropic, colored 2D Gaussians. It achieves a favorable balance between visual fidelity and memory efficiency across a variety of stylized images frequently seen in graphics workflows, especially for those showing non-uniformly distributed features and in low-bitrate regimes. Moreover, it supports hardware-friendly rapid random access for real-time usage, requiring only 0.3K MACs to decode a pixel. Through error-guided progressive optimization, Image-GS naturally constructs a smooth level-of-detail hierarchy. We demonstrate its versatility with several applications, including texture compression, semantics-aware compression, and joint image compression and restoration.
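A hypothetical, unoptimized sketch of the random-access decoding described above: a pixel's color is a normalized, weighted sum of the colored anisotropic 2D Gaussians covering it (a real implementation would cull by footprint and use far fewer operations per pixel).

import numpy as np

# Hypothetical sketch: decode one pixel from a set of anisotropic, colored 2D
# Gaussians by accumulating their weighted colors.
rng = np.random.default_rng(0)
N = 256
means = rng.uniform(0, 64, size=(N, 2))              # pixel-space centers
L = rng.normal(size=(N, 2, 2)) * 2.0                 # Cholesky-like factors
covs = L @ L.transpose(0, 2, 1) + 0.5 * np.eye(2)    # anisotropic covariances
colors = rng.uniform(0, 1, size=(N, 3))
weights = rng.uniform(0.1, 1.0, size=N)

def decode_pixel(p):
    d = p - means                                     # (N, 2) offsets to centers
    inv = np.linalg.inv(covs)                         # precomputable
    md = np.einsum('ni,nij,nj->n', d, inv, d)         # Mahalanobis distances
    w = weights * np.exp(-0.5 * md)
    return (w[:, None] * colors).sum(0) / (w.sum() + 1e-8)

print(decode_pixel(np.array([32.0, 32.0])))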
The Boundary Representation (B-rep) is a widely used 3D model representation of most consumer products designed with CAD software. However, its highly irregular and sparse set of relationships poses significant challenges for designing a generative model tailored to B-reps. Existing approaches use multi-stage approaches to satisfy the complex constraints sequentially. As a result, the final geometry cannot incorporate user edits due to the non-deterministic dependencies between cascaded stages. In contrast, we propose BrepDiff, a single-stage diffusion model for B-rep generation. We present a masked UV grid representation consisting of structured point samples from faces, serving as input for a diffusion transformer. By introducing an asynchronous and shifted noise schedule, we improve the training signal, enabling the diffusion model to better capture the distribution of UV grids. The explicitness of our masked UV grid representation enables users to intuitively understand and freely design surface geometry without being constrained by topological validity. The interconnectivity can be derived from the face layout, which is later processed into a valid solid volume during post-processing. Our approach achieves performance on par with state-of-the-art cascaded models while offering complex and diverse manipulations of geometry and topology, such as shape completion, merging, and interpolation.
Boundary representation (B-Rep) models serve as the primary representation format in modern CAD systems for describing 3D shapes. While deep learning has achieved success with various geometric representations, B-Reps remain challenging due to their hybrid nature of combining continuous geometry with discrete topological relationships. In this paper, we present Stitch-A-Shape, a B-Rep generation framework that directly models both topology and geometry. This strategy departs from prior work that focuses on either topology or geometry while recovering the other through post-processing. Our method consists of a geometry module that determines the spatial configuration of geometric elements (vertices, curves, and surface control points) and a topology module that establishes connectivity relationships and identifies boundary structures, including outer and inner loops. Our approach leverages a sequential "stitching" representation that mirrors the native data structure and inherent bottom-up organization of B-Rep, assembling geometric entities from vertices through curves to faces. We validate that our framework can handle topological and geometric ambiguities, as well as open surfaces and compound solids. Experiments show that Stitch-A-Shape achieves superior generation quality and computational efficiency compared to existing approaches in unconditional generation tasks, while exhibiting effective capabilities in class-conditional generation and B-Rep autocompletion applications.
We present a self-supervised approach to in-the-wild image relighting that enables fully controllable, physically based illumination editing. We achieve this by combining the physical accuracy of traditional rendering with the photorealistic appearance made possible by neural rendering. Our pipeline works by inferring a colored mesh representation of a given scene using monocular estimates of geometry and intrinsic components. This representation allows users to define their desired illumination configuration in 3D. The scene under the new lighting can then be rendered using a path-tracing engine. We send this approximate rendering of the scene through a feed-forward neural renderer to predict the final photorealistic relighting result. We develop a differentiable rendering process to reconstruct in-the-wild scene illumination, enabling self-supervised training of our neural renderer on raw image collections. Our method represents a significant step in bringing the explicit physical control over lights available in typical 3D computer graphics tools, such as Blender, to in-the-wild relighting.
We present a simple, yet effective diffusion-based method for fine-grained, parametric control over light sources in an image. Existing relighting methods either rely on multiple input views to perform inverse rendering at inference time, or fail to provide explicit control over light changes. Our method fine-tunes a diffusion model on a small set of real raw photograph pairs, supplemented by synthetically rendered images at scale, to elicit its photorealistic prior for the relighting task. We leverage the linearity of light to synthesize image pairs depicting controlled light changes of either a target light source or ambient illumination. Using this data and an appropriate fine-tuning scheme, we train a model for precise illumination changes with explicit control over light intensity and color. Lastly, we show how our method can achieve compelling light editing results, and outperforms existing methods based on user preference.
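The linearity-of-light argument above can be illustrated with a tiny hypothetical sketch (array shapes and names are placeholders): in linear raw space the image decomposes into per-light contributions, so a controlled relighting pair is obtained by rescaling and tinting only the target light's term.

import numpy as np

# Hypothetical sketch: in linear (raw) space, the image is the sum of per-light
# contributions, so a relighting training pair with a controlled intensity and
# color change can be composited from the ambient-only and light-only terms.
rng = np.random.default_rng(0)
H, W = 4, 6
ambient = rng.uniform(0, 0.5, size=(H, W, 3))        # scene lit by ambient only
target_light = rng.uniform(0, 0.5, size=(H, W, 3))   # contribution of one lamp

def compose(intensity, color):
    return ambient + intensity * np.asarray(color) * target_light

source = compose(1.0, [1.0, 1.0, 1.0])               # lamp on, neutral
edited = compose(2.5, [1.0, 0.6, 0.3])               # brighter, warmer lamp
# (source, edited, intensity, color) forms one supervised training pair.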
Indoor lighting estimation from a single image or video remains a challenge due to its highly ill-posed nature, especially when the lighting condition of the scene varies spatially and temporally. We propose a method that estimates from an input video a continuous light field describing the spatiotemporally varying lighting of the scene. We leverage 2D diffusion priors for optimizing such a light field, represented as an MLP. To enable zero-shot generalization to in-the-wild scenes, we fine-tune a pre-trained image diffusion model to predict lighting at multiple locations by jointly inpainting multiple chrome balls as light probes. We evaluate our method on indoor lighting estimation from a single image or video and show superior performance over compared baselines. Most importantly, we highlight results on spatiotemporally consistent lighting estimation from in-the-wild videos, which is rarely demonstrated in previous works.
Relighting and novel view synthesis of human portraits are essential in applications such as portrait photography, virtual reality (VR), and augmented reality (AR). Despite recent progress, 3D-aware portrait relighting remains challenging due to the demands for photorealistic rendering, real-time performance, and generalization to unseen subjects. Existing works either rely on supervision from limited and expensive light stage captured data or produce suboptimal results. Moreover, many works are based on generative NeRFs, which suffer from poor 3D consistency and low real-time performance. We resort to recent progress on generative 3D Gaussians and design a lighting model based on a unified neural radiance transfer representation, which responds linearly to incident light. Using only in-the-wild images, our method achieves state-of-the-art relighting results and a significantly faster rendering speed (12×) compared to previous 3D-aware portrait relighting research.
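As a hedged sketch of the linear radiance-transfer idea above (not the paper's exact model), each Gaussian can carry a small transfer matrix so that relighting reduces to a matrix-vector product with new lighting coefficients:

import numpy as np

# Hypothetical sketch of a radiance transfer that responds linearly to incident
# light: each Gaussian carries a (3 x n_light) transfer matrix, and relighting
# is a single matrix-vector product with the lighting coefficients (e.g.,
# spherical-harmonic environment coefficients).
rng = np.random.default_rng(0)
n_gaussians, n_light = 50_000, 9                     # 2nd-order SH lighting
transfer = rng.normal(size=(n_gaussians, 3, n_light)) * 0.1

def shade(light_coeffs):
    return np.clip(transfer @ light_coeffs, 0.0, None)   # (n_gaussians, 3) colors

studio = rng.normal(size=n_light)
sunset = rng.normal(size=n_light)
colors_a = shade(studio)      # relighting = swapping the lighting vector
colors_b = shade(sunset)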
3D immersive scene generation is a challenging yet critical task in computer vision and graphics. A desired virtual 3D scene should 1) exhibit omnidirectional view consistency, and 2) allow for large-range exploration in complex scene hierarchies. Existing methods either rely on successive scene expansion via inpainting or employ panorama representations to capture large-FOV scene environments. However, the generated scenes suffer from semantic drift during expansion and cannot handle occlusion among scene hierarchies. To tackle these challenges, we introduce LayerPano3D, a novel framework for full-view, explorable panoramic 3D scene generation from a single text prompt. Our key insight is to decompose a reference 2D panorama into multiple layers at different depth levels, where each layer reveals the unseen space from the reference views via a diffusion prior. LayerPano3D comprises multiple dedicated designs: 1) We introduce a new panorama dataset, Upright360, comprising 9k high-quality and upright panorama images, and fine-tune the advanced Flux model on Upright360 for high-quality, upright, and consistent panorama generation and related tasks. 2) We pioneer the Layered 3D Panorama as the underlying representation to manage complex scene hierarchies and lift it into 3D Gaussians to splat detailed 360-degree omnidirectional scenes with unconstrained viewing paths. Extensive experiments demonstrate that our framework generates state-of-the-art 3D panoramic scenes in terms of both full-view consistency and immersive exploratory experience. We believe that LayerPano3D holds promise for advancing 3D panoramic scene creation with numerous applications. For more examples, please visit our webpage: ys-imtech.github.io/projects/LayerPano3D/
This paper presents a scalable framework for efficiently discovering the performance gamut of different processes. Gamut boundaries comprise the set of highest-performing solutions within a design space. While sampling methods are often inefficient or prone to premature convergence, Bayesian optimization struggles with taking advantage of existing large-scale parallel computation or experimentation. To address these challenges, we utilize Bayesian neural networks as scalable surrogates for performance prediction and uncertainty estimation. We further introduce a novel acquisition function that combines the diversity-driven exploration of stochastic optimization with the information-efficient exploitation of Bayesian optimization. This enables generating large, high-quality batches of samples. Our approach leverages large batch sizes to reduce the number of iterations needed for optimization. We demonstrate its effectiveness on real-world engineering and robotic problems, achieving faster and more extensive discovery of the performance gamut. Code and data are available at .
We introduce a novel Unsmoothed Aggregation (UA) Algebraic Multigrid (AMG) method combined with Preconditioned Conjugate Gradient (PCG) to overcome the limitations of Extended Position-Based Dynamics (XPBD) in high-resolution and high-stiffness simulations. While XPBD excels in simulating deformable objects due to its speed and simplicity, its nonlinear Gauss-Seidel (GS) solver often struggles with low-frequency errors, leading to instability and stalling issues, especially in high-resolution, high-stiffness simulations. Our multigrid approach addresses these issues efficiently by leveraging AMG. To reduce the computational overhead of traditional AMG, where prolongator construction can consume up to two-thirds of the runtime, we propose a lazy setup strategy that reuses prolongators across iterations based on matrix structure and physical significance. Furthermore, we introduce a simplified method for constructing near-kernel components by applying a few sweeps of iterative methods to the homogeneous equation, achieving convergence rates comparable to adaptive smoothed aggregation (adaptive-SA) at a lower computational cost. Experimental results demonstrate that our method significantly improves convergence rates and numerical stability, enabling efficient and stable high-resolution simulations of deformable objects.
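For readers unfamiliar with preconditioned conjugate gradient, the following hypothetical sketch shows where a multigrid preconditioner plugs in; a Jacobi (diagonal) stand-in replaces the UA-AMG V-cycle, whose lazily rebuilt prolongators are the paper's contribution and are not reproduced here.

import numpy as np

# Hypothetical sketch of preconditioned conjugate gradient (PCG). In the paper's
# setting, apply_precond would be an AMG V-cycle; here a Jacobi stand-in keeps
# the sketch self-contained.
def pcg(A, b, apply_precond, tol=1e-8, max_iter=200):
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_precond(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = apply_precond(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

rng = np.random.default_rng(0)
M = rng.normal(size=(200, 200))
A = M @ M.T + 200 * np.eye(200)                 # SPD system (stand-in for the constraint system)
b = rng.normal(size=200)
jacobi = 1.0 / np.diag(A)
x = pcg(A, b, lambda r: jacobi * r)
print(np.linalg.norm(A @ x - b))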
Rigging and skinning are essential steps to create realistic 3D animations, often requiring significant expertise and manual effort. Traditional attempts at automating these processes rely heavily on geometric heuristics and often struggle with objects of complex geometry. Recent data-driven approaches show potential for better generality, but are often constrained by limited training data. We present the Anymate Dataset, a large-scale dataset of 230K 3D assets paired with expert-crafted rigging and skinning information—70 times larger than existing datasets. Using this dataset, we propose a learning-based auto-rigging framework with three sequential modules for joint, connectivity, and skinning weight prediction. We systematically design and experiment with various architectures as baselines for each module and conduct comprehensive evaluations on our dataset to compare their performance. Our models significantly outperform existing methods, providing a foundation for comparing future methods in automated rigging and skinning. Code and dataset can be found at .
Generating realistic human motion with high-level controls is a crucial task for social understanding, robotics, and animation. With high-quality MOCAP data becoming more available recently, a wide range of data-driven approaches have been presented. However, modelling multi-person interactions still remains a less explored area. In this paper, we present Graph-driven Interaction Sampling, a method that can generate realistic and diverse multi-person interactions by leveraging existing two-person motion diffusion models as motion priors. Instead of training a new model specific to multi-person interaction synthesis, our key insight is to spatially and temporally separate complex multi-person interactions into a graph structure of two-person interactions, which we name the Pairwise Interaction Graph. We thus decompose the generation task into simultaneously generating each person’s motion conditioned on the other’s motion. In addition, to reduce artifacts such as interpenetrations of body parts in generated multi-person interactions, we introduce two graph-dependent guidance terms into the diffusion sampling scheme. Unlike previous work, our method can produce various high-quality multi-person interactions without having repetitive individual motions. Extensive experiments demonstrate that our approach consistently outperforms existing methods in reducing artifacts when generating a wide range of two-person and multi-person interactions.
Generating large-scale multi-character interactions is a challenging and important task in character animation. Multi-character interactions involve not only natural interactive motions but also characters coordinating with each other for transitions. For example, a dance scenario involves characters dancing with partners and also characters being coordinated to new partners based on spatial and temporal observations. We term such transitions coordinated interactions and decompose them into interaction synthesis and transition planning. Previous methods for single-character animation do not consider interactions, which are critical for multiple characters. Deep-learning-based interaction synthesis usually focuses on two characters and does not consider transition planning. Optimization-based interaction synthesis relies on manually designed objective functions that may not generalize well. While crowd simulation involves more characters, their interactions are sparse and passive. We identify two challenges in multi-character interaction synthesis: the lack of data and the planning of transitions among close and dense interactions. Existing datasets either do not contain multiple characters or do not contain close and dense interactions. Planning transitions for close and dense multi-character interactions requires both spatial and temporal considerations. We propose a conditional generative pipeline comprising a coordinatable multi-character interaction space for interaction synthesis and a transition planning network for coordination. Our experiments demonstrate the effectiveness of our proposed pipeline for multi-character interaction synthesis, and the applications facilitated by our method show its scalability and transferability.
We address a fundamental challenge in Reinforcement Learning from Interaction Demonstration (RLID): demonstration noise and coverage limitations. While existing data collection approaches provide valuable interaction demonstrations, they often yield sparse, disconnected, and noisy trajectories that fail to capture the full spectrum of possible skill variations and transitions. Our key insight is that despite noisy and sparse demonstrations, there exist infinite physically feasible trajectories that naturally bridge between demonstrated skills or emerge from their neighboring states, forming a continuous space of possible skill variations and transitions. Building upon this insight, we present two data augmentation techniques: a Stitched Trajectory Graph (STG) that discovers potential transitions between demonstration skills, and a State Transition Field (STF) that establishes unique connections for arbitrary states within the demonstration neighborhood. To enable effective RLID with augmented data, we develop an Adaptive Trajectory Sampling (ATS) strategy for dynamic curriculum generation and a historical encoding mechanism for memory-dependent skill learning. Our approach enables robust skill acquisition that significantly generalizes beyond the reference demonstrations. Extensive experiments across diverse interaction tasks demonstrate substantial improvements over state-of-the-art methods in terms of convergence stability, generalization capability, and recovery robustness.
Generating high-quality 4D content from monocular videos—for applications such as digital humans and AR/VR—poses challenges in ensuring temporal and spatial consistency, preserving intricate details, and incorporating user guidance effectively. To overcome these challenges, we introduce Splat4D, a novel framework enabling high-fidelity 4D content generation from a monocular video. Splat4D achieves superior performance while maintaining faithful spatial-temporal coherence, by leveraging multi-view rendering, inconsistency identification, a video diffusion model, and an asymmetric U-Net for refinement. Through extensive evaluations on public benchmarks, Splat4D consistently demonstrates state-of-the-art performance across various metrics, underscoring the efficacy of our approach. Additionally, the versatility of Splat4D is validated in various applications such as text/image conditioned 4D generation, 4D human generation, and text-guided content editing, producing coherent outcomes following user instructions. Project page: https://visual-ai.github.io/splat4d
In this paper, we investigate the challenges of using egocentric devices to photorealistically reconstruct scenes in high dynamic range. Existing methodologies typically rely on frame-rate 6DoF poses estimated from the device’s visual-inertial odometry system, which may neglect crucial details necessary for pixel-accurate reconstruction. This study presents two significant findings. First, in contrast to mainstream work that treats the RGB camera as a global-shutter, frame-rate camera, we emphasize the importance of employing visual-inertial bundle adjustment (VIBA) to calibrate the precise timestamps and movement of the rolling-shutter RGB camera in a high-frequency trajectory format, which ensures an accurate calibration of the physical properties of the rolling-shutter camera. Second, we incorporate a physically based image formation model into Gaussian Splatting, which effectively addresses the sensor characteristics, including the rolling-shutter effect of RGB cameras and the dynamic ranges measured by the sensors. Our proposed formulation is applicable to the widely used variants of the Gaussian Splatting representation. We conduct a comprehensive evaluation of our pipeline using the open-source Project Aria device under diverse indoor and outdoor lighting conditions, and further validate it on a Meta Quest 3 device. Across all experiments, we observe a consistent visual enhancement of +1 dB in PSNR by incorporating VIBA, with an additional +1 dB achieved through our proposed image formation model. Our complete implementation, evaluation datasets, and recording profile are available at https://www.projectaria.com/photoreal-reconstruction/
We propose an online 3D Gaussian-based dense mapping framework for photorealistic details reconstruction from a monocular image stream. Our approach addresses two key challenges in monocular online reconstruction: distributing Gaussians without relying on depth maps and ensuring both local and global consistency in the reconstructed maps. To achieve this, we introduce two key modules: the Hierarchical Gaussian Management Module for effective Gaussian distribution and the Global Consistency Optimization Module for maintaining alignment and coherence at all scales. In addition, we present the Multi-level Occupancy Hash Voxels (MOHV), a structure that regularizes Gaussians for capturing details across multiple levels of granularity. MOHV ensures accurate reconstruction of both fine and coarse geometries and textures, preserving intricate details while maintaining overall structural integrity. Compared to state-of-the-art RGB-only and even RGB-D methods, our framework achieves superior reconstruction quality with high computational efficiency. Moreover, it integrates seamlessly with various tracking systems, ensuring generality and scalability. Project page: .
Text-guided image generation enables the creation of visual content from textual descriptions. However, certain visual concepts cannot be effectively conveyed through language alone. This has sparked a renewed interest in utilizing the CLIP image embedding space for more visually-oriented tasks through methods such as IP-Adapter. Interestingly, the CLIP image embedding space has been shown to be semantically meaningful, where linear operations within this space yield semantically meaningful results. Yet, the specific meaning of these operations can vary unpredictably across different images. To harness this potential, we introduce pOps, a framework that trains specific semantic operators directly on CLIP image embeddings. Each pOps operator is built upon a pretrained Diffusion Prior model. While the Diffusion Prior model was originally trained to map between text embeddings and image embeddings, we demonstrate that it can be tuned to accommodate new input conditions, resulting in a diffusion operator. Working directly over image embeddings not only improves our ability to learn semantic operations but also allows us to directly use a textual CLIP loss as an additional supervision when needed. We show that pOps can be used to learn a variety of photo-inspired operators with distinct semantic meanings. These operators can then serve as creative tools within a design process, enabling artists to semantically manipulate visual concepts as part of their generative workflow. Finally, we show that pOps can be easily plugged into pretrained image diffusion models alongside existing spatial adapters, offering control over both semantics and structure.
This paper explores how to record, explore, and visualize long-term changes in an environment—at the scale of days, months, and even years—based on data that a single user can conveniently capture using the mobile phone they already carry. Our strategy involves making the data capture process as quick and convenient as possible so that it is easy to integrate into daily routines. This strategy yields large unstructured panoramic image datasets, which we process using novel registration and scene reconstruction approaches. Our central contribution lies in demonstrating pocket time-lapse as a novel application, made possible through several key technical contributions. These include a novel method for quickly and robustly registering thousands of unstructured panoramic images, a novel reconstruction technique for rendering time-lapse and performing state-of-the-art intrinsic image decomposition, and several large hand-captured datasets that span multiple years of data collection, totaling over 6k separate capture sessions and 50k images.
Despite rapid advancements in the capabilities of generative models, pretrained text-to-image models still struggle in capturing the semantics conveyed by complex prompts that compound multiple objects and instance-level attributes. Consequently, we are witnessing growing interests in integrating additional structural constraints, typically in the form of coarse bounding boxes, to better guide the generation process in such challenging cases. In this work, we take the idea of structural guidance a step further by making the observation that contemporary image generation models can directly provide a plausible fine-grained structural initialization. We propose a technique that couples this image-based structural guidance with LLM-based instance-level instructions, yielding output images that adhere to all parts of the text prompt, including object counts, instance-level attributes, and spatial relations between instances. Additionally, we contribute CompoundPrompts, a benchmark composed of complex prompts with three difficulty levels in which object instances are progressively compounded with attribute descriptions and spatial relations. Extensive experiments demonstrate that our method significantly surpasses the performance of prior models, particularly over complex multi-object and multi-attribute use cases.
The stories and characters that captivate us as we grow up shape unique fantasy worlds, with images serving as the primary medium for visually experiencing these realms. Personalizing generative models through fine-tuning with theme-specific data has become a prevalent approach in text-to-image generation. However, unlike object customization, which focuses on learning specific objects, theme-specific generation encompasses diverse elements such as characters, scenes, and objects. Such diversity also introduces a key challenge: how to adaptively generate multi-character, multi-concept, and continuous theme-specific images (TSI). Moreover, fine-tuning approaches often come with significant computational overhead, time costs, and risks of overfitting. This paper explores a fundamental question: Can image generation models directly leverage images as contextual input, similarly to how large language models use text as context? To address this, we present IP-Prompter, a novel training-free TSI generation method. IP-Prompter introduces visual prompting, a mechanism that integrates reference images into generative models, allowing users to seamlessly specify the target theme without requiring additional training. To further enhance this process, we propose a Dynamic Visual Prompting (DVP) mechanism, which iteratively optimizes visual prompts to improve the accuracy and quality of generated images. Our approach enables diverse applications, including consistent story generation, character design, realistic character generation, and style-guided image generation. Comparative evaluations against state-of-the-art personalization methods demonstrate that IP-Prompter achieves significantly better results and excels at preserving character identity, style consistency, and text alignment, offering a robust and flexible solution for theme-specific image generation. Our project page: .
Open-vocabulary panoptic segmentation has received significant attention due to its applicability in the real world. Despite claims of robust generalization, we find that the advancements of previous works are attributed mainly to trained categories, exposing a lack of generalization to novel classes. In this paper, we explore boosting existing models from a data-centric perspective. We propose DreamMask, which systematically explores how to generate training data in the open-vocabulary setting, and how to train the model with both real and synthetic data. For the first part, we propose an automatic data generation pipeline with off-the-shelf models. We propose crucial designs for vocabulary expansion, layout arrangement, data filtering, etc. Equipped with these techniques, our generated data could significantly outperform the manually collected web data. To train the model with generated data, a synthetic-real alignment loss is designed to bridge the representation gap, bringing noticeable improvements across multiple benchmarks. In general, DreamMask significantly simplifies the collection of large-scale training data, serving as a plug-and-play enhancement for existing methods. For instance, when trained on COCO and tested on ADE20K, the model equipped with DreamMask outperforms the previous state-of-the-art by a substantial margin of 2.1% mIoU.
Content creators often draw inspiration from multiple visual sources, combining distinct elements to craft new compositions. Modern computational approaches now aim to emulate this fundamental creative process. Although recent diffusion models excel at text-guided compositional synthesis, text as a medium often lacks precise control over visual details. Image-based composition approaches can capture more nuanced features, but existing methods are typically limited in the range of concepts they can capture, and require expensive training procedures or specialized data. We present IP-Composer, a novel training-free approach for compositional image generation that leverages multiple image references simultaneously, while using natural language to describe the concept to be extracted from each image. Our method builds on IP-Adapter, which synthesizes novel images conditioned on an input image’s CLIP embedding. We extend this approach to multiple visual inputs by crafting composite embeddings, stitched from the projections of multiple input images onto concept-specific CLIP-subspaces identified through text. Through comprehensive evaluation, we show that our approach enables more precise control over a larger range of visual concept compositions.
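A hypothetical sketch of the composite-embedding construction described above (dimensions, names, and the way concept directions are obtained are illustrative, not the paper's exact pipeline): project out the base image's component in a text-identified concept subspace and swap in the reference image's component.

import numpy as np

# Hypothetical sketch: a concept subspace is spanned by text-derived directions;
# the base image's component in that subspace is replaced by the reference
# image's component, and the result conditions the image generator.
def subspace_projector(directions):           # directions: (k, d) text embeddings
    Q, _ = np.linalg.qr(directions.T)         # orthonormal basis of the subspace
    return Q @ Q.T                            # (d, d) projection matrix

def compose(base_emb, ref_emb, P):
    return base_emb - P @ base_emb + P @ ref_emb

rng = np.random.default_rng(0)
d = 768
base_img_emb = rng.normal(size=d)             # e.g., CLIP embedding of a scene image
ref_img_emb = rng.normal(size=d)              # image supplying the target concept
concept_dirs = rng.normal(size=(16, d))       # e.g., embeddings of concept prompts
P = subspace_projector(concept_dirs)
composite = compose(base_img_emb, ref_img_emb, P)   # fed to IP-Adapter-style conditioning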
The inverse design of microstructures plays a pivotal role in optimizing metamaterials with specific, targeted physical properties. While traditional forward design methods are constrained by their inability to explore the vast combinatorial design space, inverse design offers a compelling alternative by directly generating structures that fulfill predefined performance criteria. However, achieving precise control over both geometry and material properties remains a significant challenge due to their intricate interdependence. Existing approaches, which typically rely on voxel or parametric representations, often limit design flexibility and structural diversity.
In this work, we present a novel generative model that integrates latent diffusion with Holoplane, an advanced hybrid neural representation that simultaneously encodes both geometric and physical properties. This combination ensures superior alignment between geometry and properties. Our approach generalizes across multiple microstructure classes, enabling the generation of diverse, tileable microstructures with significantly improved property accuracy and enhanced control over geometric validity, surpassing the performance of existing methods. We introduce a multi-class dataset encompassing a variety of geometric morphologies, including truss, shell, tube, and plate structures, to train and validate our model. Experimental results demonstrate the model’s ability to generate microstructures that meet target properties, maintain geometric validity, and integrate seamlessly into complex assemblies. Additionally, we explore the potential of our framework through the generation of new microstructures, cross-class interpolation, and the infilling of heterogeneous microstructures. Code and data for this paper are at .
Triply periodic minimal surfaces (TPMS) are a class of metamaterials with a variety of applications and well-known primitive morphologies. We present a new method for discovering novel microscale TPMS structures with exceptional energy-dissipation capabilities, achieving double the energy absorption of the best existing TPMS primitive structure. Our approach employs a parametric representation, allowing seamless interpolation between structures and representing a rich TPMS design space. As simulations are intractable for efficiently optimizing microscale hyperelastic structures, we propose a sample-efficient computational strategy for rapid discovery with limited empirical data from 3D-printed and tested samples that ensures high-fidelity results. We achieve this by leveraging a predictive uncertainty-aware Deep Ensembles model to identify which structures to fabricate and test next. We iteratively refine our model through batch Bayesian optimization, selecting structures for fabrication that maximize exploration of the performance space and exploitation of our energy-dissipation objective. Using our method, we produce the first open-source dataset of hyperelastic microscale TPMS structures, including a set of novel structures that demonstrate extreme energy dissipation capabilities, and show several potential applications of these structures.
We present a computational approach for designing Discrete Interlocking Materials (DIMs) with desired mechanical properties. Unlike conventional elastic materials, DIMs are kinematic materials governed by internal contacts among elements. These contacts induce anisotropic deformation limits that depend on the shape and topology of the elements. To enable gradient-based design optimization of DIMs with desired deformation limits, we introduce an implicit representation of interlocking elements based on unions of tori. Using this low-dimensional representation, we simulate DIMs with smoothly evolving contacts, allowing us to predict changes in deformation limits as a function of shape parameters. With this toolset in hand, we optimize for element shape parameters to design heterogeneous DIMs that best approximate prescribed limits. We demonstrate the effectiveness of our method by designing discrete interlocking materials with diverse limit profiles for in- and out-of-plane deformation and validate our method on fabricated physical prototypes.
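As a hedged illustration of the implicit element representation above (not the authors' exact formulation), an interlocking element can be written as the union (pointwise minimum) of torus signed distance functions, with radii and centers acting as smooth shape parameters:

import numpy as np

# Hypothetical sketch: an interlocking element represented implicitly as a union
# of tori. The signed distance of a union is the minimum of the parts' SDFs, and
# the major/minor radii act as smooth, differentiable shape parameters.
def torus_sdf(p, center, R, r):
    q = p - center
    ring = np.sqrt(q[..., 0]**2 + q[..., 2]**2) - R   # distance to the ring (y-up torus)
    return np.sqrt(ring**2 + q[..., 1]**2) - r

def element_sdf(p, tori):
    return np.min([torus_sdf(p, c, R, r) for (c, R, r) in tori], axis=0)

tori = [(np.array([0.0, 0.0, 0.0]), 1.0, 0.25),
        (np.array([1.0, 0.0, 0.0]), 1.0, 0.25)]       # two overlapping ring parts
pts = np.random.default_rng(0).uniform(-2, 2, size=(5, 3))
print(element_sdf(pts, tori))                          # negative inside, positive outside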
Gothic microarchitecture is a design phenomenon widely observed in late medieval European art, comprising sculptural works that emulate the forms and structural composition of monumental Gothic architecture. Despite its prevalence in preserved artifacts, the design and construction methods of Gothic microarchitecture used by artisans remain a mystery, as these processes were orally transmitted and rarely documented. The Basel goldsmith drawings (“Basler Goldschmiedrisse”), a collection of over 200 late 15th-century design drawings from the Upper Rhine region, provide a rare glimpse into the workshop practices of Gothic artisans. This collection consists of unpaired 2D drawings, including top-view and side-view projections of Gothic microarchitecture, featuring nested curve networks without annotations or explicitly articulated design principles. Understanding these 2D drawings and reconstructing the 3D objects they represent has long posed a significant challenge due to the lack of documentation and the complexity of the designs. In this work, we propose a framework of simple yet expressive geometric principles to model Gothic microarchitecture as 3D curve networks, using limited input such as historical 2D drawings. Our approach formalizes a historically informed design space, constrained to tools traditionally available to artisans–namely compass and straightedge–and enables faithful reproduction of Gothic microarchitecture that conforms to physical artifacts. Our framework is intuitive and efficient, allowing users to interactively create 3D Gothic microarchitecture with minimal effort. It bridges the gap between historical artistry and modern computational design, while also shedding light on a lost chapter of Gothic craftsmanship.
The growing global demand for removable partial and full dentures, driven by an aging population and the high prevalence of edentulism, emphasizes the importance of advancing manufacturing solutions. Multi-material jetting, with newly regulatory-approved dental resins, facilitates the production of monolithic, full-color dentures, reducing manual labor and enabling advanced aesthetic customization.
We propose a practical method for dental layer biomimicry and multi-spot shade matching using multi-material 3D printing. It integrates seamlessly into workflows combining dental CAD tools, which compute the outer shape of dentures, and industrial multi-material slicers.
We introduce a morphable star-shape descriptor to embed enamel, dentin, and root layers with controlled thicknesses into outer tooth geometries. To compute per-layer material ratios, we first present a forward model to predict the color of printed layered slabs mimicking inner tooth structures with variable translucencies. The slab design enables reasonable color predictions on layered tooth shapes. Based on the forward model, we propose a backward model to compute per-layer material ratios for given layer-translucencies to achieve color matches at multiple points on the tooth surface.
The method’s effectiveness is demonstrated by fabricating various dentures with custom layers that accurately replicate VITA classical shades, showcasing its practical and versatile application in denture manufacturing.
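The forward/backward model in the preceding abstract can be illustrated with a deliberately toy sketch (illustrative resin colors, a naive translucency-weighted blend in place of the calibrated slab-based prediction model, and brute-force search in place of the actual inversion):

import numpy as np

# Hypothetical toy stand-in for the forward/backward shade model: the perceived
# color is a translucency-weighted blend of the enamel, dentin, and root layer
# colors, each a convex mix of two printable resins.
resin_a = np.array([0.92, 0.90, 0.85])      # lighter resin color (linear RGB)
resin_b = np.array([0.80, 0.70, 0.55])      # darker resin color

def forward(ratios, translucencies):
    # ratios: per-layer fraction of resin_b in [0, 1]
    layer_colors = (1 - ratios)[:, None] * resin_a + ratios[:, None] * resin_b
    w = np.asarray(translucencies) / np.sum(translucencies)
    return w @ layer_colors                 # blended surface color

def backward(target, translucencies, steps=21):
    # brute-force search over per-layer ratios for the best color match
    grid = np.linspace(0.0, 1.0, steps)
    best, best_err = None, np.inf
    for r_enamel in grid:
        for r_dentin in grid:
            for r_root in grid:
                ratios = np.array([r_enamel, r_dentin, r_root])
                err = np.sum((forward(ratios, translucencies) - target) ** 2)
                if err < best_err:
                    best, best_err = ratios, err
    return best

translucencies = [0.5, 0.35, 0.15]          # enamel, dentin, root
target_shade = np.array([0.87, 0.82, 0.72]) # e.g., a measured shade patch
print(backward(target_shade, translucencies))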
We introduce MAGNET (Muscle Activation Generation Networks), a scalable framework for reconstructing full-body muscle activations across diverse human movements. Our approach employs musculoskeletal simulation with a novel two-level controller architecture trained using three-stage learning methods. Additionally, we develop distilled models tailored for solving downstream tasks or generating real-time muscle activations, even on edge devices. The efficacy of our framework is demonstrated through examples of daily life and challenging behaviors, as well as comprehensive evaluations.
Humans excel in navigating diverse, complex environments with agile motor skills, exemplified by parkour practitioners performing dynamic maneuvers, such as climbing up walls and jumping across gaps. Reproducing these agile movements with simulated characters remains challenging, in part due to the scarcity of motion capture data for agile terrain traversal behaviors and the high cost of acquiring such data. In this work, we introduce PARC (Physics-based Augmentation with Reinforcement Learning for Character Controllers), a framework that leverages machine learning and physics-based simulation to iteratively augment motion datasets and expand the capabilities of terrain traversal controllers. PARC begins by training a motion generator on a small dataset consisting of core terrain traversal skills. The motion generator is then used to produce synthetic data for traversing new terrains. However, these generated motions often exhibit artifacts, such as incorrect contacts or discontinuities. To correct these artifacts, we train a physics-based tracking controller to imitate the motions in simulation. The corrected motions are then added to the dataset, which is used to continue training the motion generator in the next iteration. PARC’s iterative process jointly expands the capabilities of the motion generator and tracker, creating agile and versatile models for interacting with complex environments. PARC provides an effective approach to develop controllers for agile terrain traversal, which bridges the gap between the scarcity of motion data and the need for versatile character controllers.
Physically simulated characters can learn highly natural full-body motion guided by motion capture datasets. However, the range of motion is limited to the existing high-quality datasets, and such characters cannot effectively adapt to challenging scenarios. We propose a novel policy architecture that learns part-wise motion skills, where individual parts can be separately extended and combined for unobserved settings. Our method employs a set of part-specific codebooks, which robustly capture motion dynamics without catastrophic collapse or forgetting. This structured decomposition allows intuitive control over the character’s behavior and dynamic exploration of novel combinations of part-wise motion. We further incorporate a refinement network that compensates for subtle discrepancies in the disjoint discrete tokens, thus improving motion quality and stability. Our extensive evaluations show that our part-wise latent tokens achieve superior performance in imitating motions, even those from unseen distributions. We also validate our method on challenging tasks, including body tracking, navigation on complex terrains, and point-goal navigation with damaged body parts. Finally, we introduce a part-wise expansion of motion priors, where the physically simulated character incrementally adapts partial motions and produces unique combinations of whole-body motion, significantly diversifying the resulting motions.
Assigning realistic materials to 3D models remains a significant challenge in computer graphics. We propose MatCLIP, a novel method that extracts shape- and lighting-insensitive descriptors of Physically Based Rendering (PBR) materials to assign plausible textures to 3D objects based on images, such as the output of Latent Diffusion Models (LDMs) or photographs. Matching PBR materials to static images is challenging because the PBR representation captures the dynamic appearance of materials under varying viewing angles, shapes, and lighting conditions. By extending an Alpha-CLIP-based model on material renderings across diverse shapes and lighting, and encoding multiple viewing conditions for PBR materials, our approach generates descriptors that bridge the domain of PBR representations and that of photographs or renderings, including LDM outputs. This enables consistent material assignments without requiring explicit knowledge of material relationships between different parts of an object. MatCLIP achieves a top-1 classification accuracy of 76.6%, outperforming state-of-the-art methods such as PhotoShape and MatAtlas by over 15 percentage points on publicly available datasets. Our method can be used to construct material assignments for 3D shape datasets such as ShapeNet, 3DCoMPaT++, and Objaverse. All code and data will be released at .
Detailed microstructures on specular objects often exhibit intriguing glinty patterns under high-frequency lighting, which are challenging to render using a conventional normal-mapped BRDF. In this paper, we present a manifold-based formulation of the glint normal distribution functions (NDF) that precisely captures the surface normal distributions over queried footprints. The manifold-based formulation transfers the integration for the glint NDF construction to a problem of mesh intersections. Compared to previous works that rely on complex numerical approximations, our integral solution is exact and much simpler to compute, which also allows an easy adaptation of a mesh clustering hierarchy to accelerate the NDF evaluation of large footprints. Our performance and quality analysis shows that our NDF formulation achieves glinty appearance similar to that of the baselines but is an order of magnitude faster. Within this framework, we further present a novel derivation of analytical shadow-masking for normal-mapped diffuse surfaces, a component that is often ignored in previous works.
Fluorescent materials are characterized by a spectral reradiation toward longer wavelengths. Recent work [Fichet et al. 2024] has shown that the rendering of fluorescence in a non-spectral engine is possible through the use of appropriate reduced reradiation matrices. But the approach has limited expressivity, as it requires the storage of one reduced matrix per fluorescent material, and only works with measured fluorescent assets.
In this work, we introduce an analytical approach to the editing and rendering of fluorescence in a non-spectral engine. It is based on a decomposition of the reduced reradiation matrix, and an analytically-integrable Gaussian-based model of the fluorescent component. The model reproduces the appearance of fluorescent materials accurately, especially with the addition of a UV basis. Most importantly, it enables real-time variation of fluorescent material parameters, either for editing fluorescent materials or for dynamically varying fluorescence properties across object surfaces. A simplified one-Gaussian fluorescence model even allows for the artist-friendly creation of plausible fluorescent materials from scratch, requiring only a few reflectance colors as input.
Median filtering is a non-linear smoothing technique widely used in digital image processing to remove noise while retaining sharp edges. It is particularly well suited to removing outliers (impulse noise) or granular artifacts (speckle noise). However, the high computational cost of median filtering can be prohibitive. Sorting-based algorithms excel with small kernels but scale poorly with increasing kernel diameter, in contrast to constant-time methods characterized by higher constant factors but better scalability, such as histogram-based approaches or the 2D wavelet matrix.
This paper introduces a novel algorithm, leveraging the separability of the sorting problem through hierarchical tiling to minimize redundant computations. We propose two variants: a data-oblivious selection network that can operate entirely within registers, and a data-aware version utilizing random-access memory. These achieve per-pixel complexities of O(k log k) and O(k), respectively, for a k × k kernel, which is unprecedented for sorting-based methods. Our CUDA implementation is up to 5 times faster than the current state of the art on a modern GPU and is the fastest median filter in most cases for 8-, 16-, and 32-bit data types and kernels from 3 × 3 to 75 × 75.
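To illustrate the data-oblivious idea, the sketch below selects the median of a k × k window with an odd-even transposition network: a fixed sequence of compare-exchange operations runs regardless of the data, which is what allows a register-only implementation. This is only a toy with O(k^4) work per pixel, not the paper's O(k log k) hierarchical-tiling algorithm; median_network and compare_exchange are made-up helpers.

```python
import numpy as np

def compare_exchange(v, i, j):
    # Fixed min/max pair: the same operations execute regardless of the values,
    # which is what makes the network data-oblivious (and register-friendly on GPUs).
    lo, hi = min(v[i], v[j]), max(v[i], v[j])
    v[i], v[j] = lo, hi

def median_network(window):
    """Median of a flattened k*k window via an odd-even transposition sorting network."""
    v = list(window)
    n = len(v)
    for rnd in range(n):                      # n rounds guarantee a fully sorted sequence
        for i in range(rnd % 2, n - 1, 2):
            compare_exchange(v, i, i + 1)
    return v[n // 2]

img = np.random.rand(64, 64).astype(np.float32)
k = 3
pad = k // 2
padded = np.pad(img, pad, mode='edge')
out = np.empty_like(img)
for y in range(img.shape[0]):
    for x in range(img.shape[1]):
        out[y, x] = median_network(padded[y:y + k, x:x + k].ravel())
```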
Median filtering is a cornerstone of computational image processing. It provides an effective means of image smoothing, with minimal blurring or softening of edges, invariance to monotonic transformations such as gamma adjustment, and robustness to noise and outliers. However, known algorithms have all suffered from practical limitations: the bit depth of the image data, the size of the filter kernel, or the kernel shape itself. Square-kernel implementations tend to produce streaky cross-hatching artifacts, and nearly all known efficient algorithms are in practice limited to square kernels. We present for the first time a method that overcomes all of these limitations. Our method operates efficiently on arbitrary bit-depth data, arbitrary kernel sizes, and arbitrary convex kernel shapes, including circular shapes.
Despite recent advances in Novel View Synthesis (NVS), generating high-fidelity views from single or sparse observations remains challenging. Existing splatting-based approaches often produce distorted geometry due to splatting errors. While diffusion-based methods leverage rich 3D priors to achieve improved geometry, they often suffer from texture hallucination. In this paper, we introduce SplatDiff, a pixel-splatting-guided video diffusion model designed to synthesize high-fidelity novel views from a single image. Specifically, we propose an aligned synthesis strategy for precise control of target viewpoints and geometry-consistent view synthesis. To mitigate texture hallucination, we design a texture bridge module that enables high-fidelity texture generation through adaptive feature fusion. In this manner, SplatDiff leverages the strengths of splatting and diffusion for geometrically consistent, high-fidelity view synthesis. Extensive experiments verify the state-of-the-art performance of SplatDiff in single-view NVS. Additionally, without extra training, SplatDiff shows remarkable zero-shot performance across diverse tasks, including sparse-view NVS and stereo video conversion.
Online reconstruction of dynamic scenes is significant because it enables learning scenes from live-streaming video inputs, whereas existing offline dynamic reconstruction methods rely on recorded video. However, previous online reconstruction approaches have primarily focused on efficiency and rendering quality, overlooking the temporal consistency of their results, which often contain noticeable artifacts in static regions. This paper identifies errors such as noise in real-world recordings as a cause of temporal inconsistency in online reconstruction. We propose a method that enhances temporal consistency in online reconstruction from observations whose temporal inconsistency is unavoidable with real cameras. We show that our method restores the ideal observation by subtracting the learned error. We demonstrate that applying our method to various baselines significantly enhances both temporal consistency and rendering quality across datasets. Code, video results, and checkpoints are available at .
Real-world video super-resolution (VSR) presents significant challenges due to complex and unpredictable degradations. Although some recent methods utilize image diffusion models for VSR and have shown improved detail generation capabilities, they still struggle to produce temporally consistent frames. We attempt to use Stable Video Diffusion (SVD) combined with ControlNet to address this issue. However, due to the intrinsic image-animation characteristics of SVD, it is challenging to generate fine details using only low-quality videos. To tackle this problem, we propose DAM-VSR, an appearance and motion disentanglement framework for VSR. This framework disentangles VSR into appearance enhancement and motion control problems. Specifically, appearance enhancement is achieved through reference image super-resolution, while motion control is achieved through video ControlNet. This disentanglement fully leverages the generative prior of video diffusion models and the detail generation capabilities of image super-resolution models. Furthermore, equipped with the proposed motion-aligned bidirectional sampling strategy, DAM-VSR can conduct VSR on longer input videos. DAM-VSR achieves state-of-the-art performance on real-world data and AIGC data, demonstrating its powerful detail generation capabilities.
Video super-resolution (VSR) aims to reconstruct a high-resolution (HR) video from a low-resolution (LR) counterpart. Achieving successful VSR requires producing realistic HR details and ensuring both spatial and temporal consistency. To restore realistic details, diffusion-based VSR approaches have recently been proposed. However, the inherent randomness of diffusion, combined with their tile-based approach, often leads to spatio-temporal inconsistencies. In this paper, we propose DC-VSR, a novel VSR approach to produce spatially and temporally consistent VSR results with realistic textures. To achieve spatial and temporal consistency, DC-VSR adopts a novel Spatial Attention Propagation (SAP) scheme and a Temporal Attention Propagation (TAP) scheme that propagate information across spatio-temporal tiles based on the self-attention mechanism. To enhance high-frequency details, we also introduce Detail-Suppression Self-Attention Guidance (DSSAG), a novel diffusion guidance scheme. Comprehensive experiments demonstrate that DC-VSR achieves spatially and temporally consistent, high-quality VSR results, outperforming previous approaches.
We present a novel stochastic version of the Barnes-Hut approximation. Regarding the level-of-detail (LOD) family of approximations as control variates, we construct an unbiased estimator of the kernel sum being approximated. Through several examples in graphics applications such as winding number computation and smooth distance evaluation, we demonstrate that our method is well-suited for GPU computation, capable of outperforming a GPU-optimized implementation of the deterministic Barnes-Hut approximation by achieving equal median error in up to 9.4x less time.
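As a rough illustration of the control-variate construction above, the sketch below uses a single flat clustering instead of the full Barnes-Hut hierarchy: the cluster-level approximation is summed exactly, and the residual against the exact per-point kernel is estimated by uniform sampling, which keeps the estimate of the full kernel sum unbiased. The inverse-distance kernel, the clustering, and stochastic_sum are illustrative assumptions, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel(q, p):
    # Example kernel: inverse distance, as in winding-number-style sums.
    return 1.0 / (np.linalg.norm(q - p, axis=-1) + 1e-6)

points = rng.normal(size=(100_000, 3))
labels = rng.integers(0, 64, size=len(points))          # a fixed, flat clustering
centroids = np.stack([points[labels == c].mean(axis=0) for c in range(64)])
counts = np.bincount(labels, minlength=64)

def stochastic_sum(q, n_samples=256):
    # Control variate: pretend each point sits at its cluster centroid (cheap, exact sum).
    coarse = np.sum(counts * kernel(q, centroids))
    # Unbiased Monte Carlo estimate of the residual sum_i [k(q, p_i) - k(q, c(p_i))].
    idx = rng.integers(0, len(points), size=n_samples)
    residual = kernel(q, points[idx]) - kernel(q, centroids[labels[idx]])
    return coarse + residual.mean() * len(points)

q = np.array([0.3, -0.2, 1.0])
exact = np.sum(kernel(q, points))
print(exact, stochastic_sum(q))
```

Because the expectation of the residual term equals the true correction, the estimator is unbiased; the paper replaces the flat clustering with the Barnes-Hut LOD family to make the correction cheap and low variance.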
3D generative models of faces trained on in-the-wild image collections have improved greatly in recent years, offering better visual fidelity and view consistency. Making such generative models animatable is a hard yet rewarding task, with applications in virtual AI agents, character animation, and telepresence. However, it is not trivial to learn a well-behaved animation model in the generative setting, as the learned latent space aims to best capture the data distribution, often omitting details such as dynamic appearance and entangling animation with other factors that affect controllability. We present GAIA: Generative Animatable Interactive Avatars, which is able to generate high-fidelity 3D head avatars for both realistic animation and rendering. To achieve consistency during animation, we learn to generate Gaussians embedded in an underlying morphable model for human heads via a shared UV parameterization. For modeling realistic animation, we further design the generator to learn expression-conditioned details for both geometric deformation and dynamic appearance. Finally, facing an inevitable entanglement problem between facial identity and expression, we propose a novel two-branch architecture that encourages the generator to disentangle identity and expression. On existing benchmarks, GAIA achieves state-of-the-art performance in visual quality as well as realistic animation. The generated Gaussian-based avatar supports highly efficient animation and rendering, making it readily available for interactive animation and appearance editing.
Sparse volumetric reconstruction and rendering via 3D Gaussian splatting have recently enabled animatable 3D head avatars that are rendered under arbitrary viewpoints with impressive photorealism. Today, such photoreal avatars are seen as a key component in emerging applications in telepresence, extended reality, and entertainment. Building a photoreal avatar requires estimating the complex non-rigid motion of different facial components as seen in input video images; due to inaccurate motion estimation, animatable models typically present a loss of fidelity and detail when compared to their non-animatable counterparts, built from an individual facial expression. Also, recent state-of-the-art models are often affected by memory limitations that reduce the number of 3D Gaussians used for modeling, leading to lower detail and quality. To address these problems, we present a new high-detail 3D head avatar model that improves upon the state of the art, largely increasing the number of 3D Gaussians and modeling quality for rendering at 4K resolution. Our high-quality model is reconstructed from multiview input video and builds on top of a mesh-based 3D morphable model, which provides a coarse deformation layer for the head. Photoreal appearance is modelled by 3D Gaussians embedded within the continuous UVD tangent space of this mesh, allowing for more effective densification where most needed. Additionally, these Gaussians are warped by a novel UVD deformation field to capture subtle, localized motion. Our key contribution is the novel deformable Gaussian encoding and overall fitting procedure that allows our head model to preserve appearance detail, while capturing facial motion and other transient high-frequency features such as skin wrinkling.
Face image restoration aims to enhance degraded facial images while addressing challenges such as diverse degradation types, real-time processing demands, and, most crucially, the preservation of identity-specific features. Existing methods often struggle with slow processing times and suboptimal restoration, especially under severe degradation, failing to accurately reconstruct finer-level identity details. To address these issues, we introduce InstantRestore, a novel framework that leverages a single-step image diffusion model and an attention-sharing mechanism for fast and personalized face restoration. Additionally, InstantRestore incorporates a novel landmark attention loss, aligning key facial landmarks to refine the attention maps, enhancing identity preservation. At inference time, given a degraded input and a small (∼ 4) set of reference images, InstantRestore performs a single forward pass through the network to achieve near real-time performance. Unlike prior approaches that rely on full diffusion processes or per-identity model tuning, InstantRestore offers a scalable solution suitable for large-scale applications. Extensive experiments demonstrate that InstantRestore outperforms existing methods in quality and speed, making it an appealing choice for identity-preserving face restoration.
We propose Relightable Full-Body Gaussian Codec Avatars, a new approach for modeling relightable full-body avatars with fine-grained details including face and hands. The unique challenge for relighting full-body avatars lies in the large deformations caused by body articulation and the resulting impact on appearance caused by light transport. Changes in body pose can dramatically change the orientation of body surfaces with respect to lights, resulting in both local appearance changes due to changes in local light transport functions and non-local changes due to occlusion between body parts. To address this, we decompose the light transport into local and non-local effects. Local appearance changes are modeled using learnable zonal harmonics for diffuse radiance transfer. Unlike spherical harmonics, zonal harmonics are highly efficient to rotate under articulation. This allows us to learn diffuse radiance transfer in a local coordinate frame, which disentangles the local radiance transfer from the articulation of the body. To account for non-local appearance changes, we introduce a shadow network that predicts shadows given precomputed incoming irradiance on a base mesh. This facilitates the learning of non-local shadowing between the body parts. Finally, we use a deferred shading approach to model specular radiance transfer and better capture reflections and highlights such as eye glints. We demonstrate that our approach successfully models both the local and non-local light transport required for relightable full-body avatars, with a superior generalization ability under novel illumination conditions and unseen poses.
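The efficiency claim about zonal harmonics can be seen in a few lines: a zonal lobe is a Legendre series in the cosine of the angle to its axis, so rotating the transfer under articulation only requires rotating that axis, not building an order-dependent spherical-harmonic rotation matrix. The coefficients and the zh_transfer helper below are illustrative assumptions, not the learned transfer from the paper.

```python
import numpy as np
from numpy.polynomial.legendre import legval

# A zonal-harmonic transfer is a Legendre series in cos(theta) around a lobe axis:
#   t(w) = sum_l z_l * P_l(dot(w, axis))          (normalization folded into z_l here)
zh_coeffs = np.array([0.8, 0.5, 0.15])            # illustrative z_0, z_1, z_2

def zh_transfer(direction, axis):
    return legval(np.dot(direction, axis), zh_coeffs)

# Under articulation, only the lobe axis is rotated by the bone/frame rotation R;
# the coefficients stay fixed.
axis = np.array([0.0, 0.0, 1.0])
theta = np.deg2rad(40.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])

w = np.array([0.0, 0.0, 1.0])
# Evaluating the rotated transfer at w equals evaluating the original transfer at R^T w.
print(zh_transfer(w, R @ axis), zh_transfer(R.T @ w, axis))
```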
With recent advancements in neural rendering and motion capture algorithms, remarkable progress has been made in photorealistic human avatar modeling, unlocking immense potential for applications in virtual reality, augmented reality, remote communication, and industries such as gaming, film, and medicine. However, existing methods fail to provide complete, faithful, and expressive control over human avatars due to their entangled representation of facial expressions and body movements. In this work, we introduce Expressive Virtual Avatars (EVA), an actor-specific, fully controllable, and expressive human avatar framework that achieves high-fidelity, lifelike renderings in real time while enabling independent control of facial expressions, body movements, and hand gestures. Specifically, our approach designs the human avatar as a two-layer model: an expressive template geometry layer and a 3D Gaussian appearance layer. First, we present an expressive template tracking algorithm that leverages coarse-to-fine optimization to accurately recover body motions, facial expressions, and non-rigid deformation parameters from multi-view videos. Next, we propose a novel decoupled 3D Gaussian appearance model designed to effectively disentangle body and facial appearance. Unlike unified Gaussian estimation approaches, our method employs two specialized and independent modules to model the body and face separately. Experimental results demonstrate that EVA surpasses state-of-the-art methods in terms of rendering quality and expressiveness, validating its effectiveness in creating full-body avatars. This work represents a significant advancement towards fully drivable digital human models, enabling the creation of lifelike digital avatars that faithfully replicate human geometry and appearance.
We present a framework of elastic locomotion, which allows users to enliven an elastic body to produce interesting locomotion by prescribing its high-level kinematics. We formulate this problem as an inverse simulation problem and seek the optimal muscle activations to drive the body to complete the desired actions. We employ the interior-point method to model wide-area contacts between the body and the environment with logarithmic barrier penalties. The core of our framework is a mixed second-order differentiation algorithm. By combining both analytic differentiation and numerical differentiation modalities, a general-purpose second-order differentiation scheme is made possible. Specifically, we augment complex-step finite difference (CSFD) with reverse automatic differentiation (AD). We treat AD as a generic function, mapping a computing procedure to its derivative w.r.t. output loss, and promote CSFD along the AD computation. To this end, we carefully implement all the arithmetic used in elastic locomotion, from elementary functions to linear algebra and matrix operations, for CSFD promotion. With this novel differentiation tool, elastic locomotion can directly exploit Newton’s method and use its strong second-order convergence to find the needed activations at muscle fibers. This is not possible with existing first-order inverse or differentiable simulation techniques. We showcase a wide range of interesting locomotions of soft bodies and creatures to validate our method.
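The mixed second-order trick can be sketched compactly: applying a complex-step perturbation to a first-derivative routine yields Hessian-vector products without subtractive cancellation, so the step can be made arbitrarily small. In the toy below an analytic gradient stands in for the reverse-mode AD pass, and the objective is illustrative rather than the paper's elastic energy.

```python
import numpy as np

def loss(x):
    # Illustrative smooth objective (stand-in for the elastic/contact energy).
    return np.sum(x**4) + np.sum(np.sin(x))

def grad(x):
    # Stand-in for reverse-mode AD: an analytic gradient that also accepts complex
    # inputs, which is what "promoting CSFD along the AD computation" requires.
    return 4.0 * x**3 + np.cos(x)

def hessian_vector(x, v, h=1e-30):
    # Complex-step finite difference of the gradient:
    #   H(x) v  ~=  Im( grad(x + i*h*v) ) / h
    # No subtraction of nearly equal numbers, so h can be tiny without losing precision.
    return np.imag(grad(x + 1j * h * v)) / h

x = np.array([0.3, -1.2, 0.7])
v = np.array([1.0, 0.0, -2.0])

# Reference Hessian for this toy objective: diag(12 x^2 - sin(x)).
H = np.diag(12.0 * x**2 - np.sin(x))
print(hessian_vector(x, v), H @ v)
```

These Hessian-vector products are exactly what a Newton-type solver consumes, which is why the combination enables second-order convergence where first-order differentiable simulation does not.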
3D human reconstruction and animation are long-standing topics in computer graphics and vision. However, existing methods typically rely on sophisticated dense-view capture and/or time-consuming per-subject optimization procedures. To address these limitations, we propose HumanRAM, a novel feed-forward approach for generalizable human reconstruction and animation from monocular or sparse human images. Our approach integrates human reconstruction and animation into a unified framework by introducing explicit pose conditions, parameterized by a shared SMPL-X neural texture, into transformer-based large reconstruction models (LRM). Given monocular or sparse input images with associated camera parameters and SMPL-X poses, our model employs scalable transformers and a DPT-based decoder to synthesize realistic human renderings under novel viewpoints and novel poses. By leveraging the explicit pose conditions, our model simultaneously enables high-quality human reconstruction and high-fidelity pose-controlled animation. Experiments show that HumanRAM significantly surpasses previous methods in terms of reconstruction accuracy, animation fidelity, and generalization performance on real-world datasets.
Mesh repair is a critical process in 3D geometry processing aimed at correcting errors and imperfections in polygonal meshes to produce watertight, manifold, and feature-preserving meshes suitable for downstream tasks. While errors such as degeneracies, duplication, holes, and overlaps can be addressed through standard repair processes, cracks along trimmed curves require special attention and should ideally be repaired to align with sharp feature lines.
In this paper, we present a unified framework for repairing diverse mesh imperfections by leveraging a manifold wrap surface as a mediating agent. The primary role of the wrap surface is to define spatial connections between points on the original surface, thereby decoupling the challenges of edge connectivity and point relocation during repair. Throughout the process, our algorithm operates on the dual objects: the original defective mesh and the manifold wrap surface. The implementation begins by extracting a set of samples from the wrap surface and projecting them onto the original surface. These projected samples are optimized by minimizing the quadratic error relative to the tangent planes of neighboring points on the original surface. Notably, samples far from feature lines remain unchanged, while samples near feature lines converge to those lines even when the input surface lacks correct mesh topology. We then assign an adaptive weight to each sample based on the squared moving distance. By introducing this weight setting, we observe that the restricted power diagram prioritizes connectivity along feature lines, thereby effectively preserving sharp features. Through extensive experiments, we demonstrate the superiority of our proposed algorithm over existing methodologies in terms of manifoldness, watertightness, topological correctness, triangle quality, and feature preservation.
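The relocation step described above boils down to a small quadric minimization per sample: the optimal position minimizes the summed squared distances to the neighbors' tangent planes, a 3x3 linear solve. The sketch below adds a damping term that pulls degenerate (flat) cases back toward the original sample; this weighting and the relocate_sample helper are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def relocate_sample(sample, neighbor_points, neighbor_normals, damping=1e-3):
    """Minimize sum_i (n_i . (x - p_i))^2 + damping * ||x - sample||^2 over x."""
    A = damping * np.eye(3)
    b = damping * sample
    for p, n in zip(neighbor_points, neighbor_normals):
        n = n / np.linalg.norm(n)
        A += np.outer(n, n)          # accumulate the plane quadric n n^T
        b += n * np.dot(n, p)        # and its right-hand side n (n . p)
    return np.linalg.solve(A, b)

# On a flat region all tangent planes agree, so the sample barely moves; near a
# crease the two plane families intersect and the solution snaps toward the
# feature line, matching the behavior described above.
sample = np.array([0.1, 0.05, 0.2])
pts = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.3]])
nrm = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
print(relocate_sample(sample, pts, nrm))
```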
Despite significant advancements in video generation, inserting a given object into videos remains a challenging task. The difficulty lies in preserving the appearance details of the reference object and accurately modeling coherent motion at the same time. In this paper, we propose VideoAnydoor, a zero-shot video object insertion framework with high-fidelity detail preservation and precise motion control. Starting from a text-to-video model, we utilize an ID extractor to inject the global identity and leverage a box sequence to control the overall motion. To preserve the detailed appearance while also supporting fine-grained motion control, we design a pixel warper. It takes the reference image with arbitrary key-points and the corresponding key-point trajectories as inputs. It warps the pixel details according to the trajectories and fuses the warped features with the diffusion U-Net, thus improving detail preservation and supporting users in manipulating the motion trajectories. In addition, we propose a training strategy involving both videos and static images with a reweighted reconstruction loss to enhance insertion quality. VideoAnydoor demonstrates significant superiority over existing methods and naturally supports various downstream applications (e.g., talking head generation, video virtual try-on, multi-region editing) without task-specific fine-tuning.
Recent video editing methods achieve attractive results in style transfer or appearance modification. However, editing the structural content of 3D scenes in videos remains challenging, particularly when dealing with significant viewpoint changes, such as large camera rotations or zooms. Key challenges include generating novel view content that remains consistent with the original video, preserving unedited regions, and translating sparse 2D inputs into realistic 3D video outputs. To address these issues, we propose Sketch3DVE, a sketch-based 3D-aware video editing method to enable detailed local manipulation of videos with significant viewpoint changes. To solve the challenge posed by sparse inputs, we employ image editing methods to generate edited results for the first frame, which are then propagated to the remaining frames of the video. We utilize sketching as an interaction tool for precise geometry control, while other mask-based image editing methods are also supported. To handle viewpoint changes, we perform a detailed analysis and manipulation of the 3D information in the video. Specifically, we utilize a dense stereo method to estimate a point cloud and the camera parameters of the input video. We then propose a point cloud editing approach that uses depth maps to represent the 3D geometry of newly edited components, aligning them effectively with the original 3D scene. To seamlessly merge the newly edited content with the original video while preserving the features of unedited regions, we introduce a 3D-aware mask propagation strategy and employ a video diffusion model to produce realistic edited videos. Extensive experiments demonstrate the superiority of Sketch3DVE in video editing.
Video inpainting, crucial for the media industry, aims to restore corrupted content. However, current methods relying on limited pixel propagation or single-branch image inpainting architectures face challenges with generating fully masked objects, balancing background preservation with foreground generation, and maintaining ID consistency over long videos. To address these issues, we propose VideoPainter, an efficient dual-branch framework featuring a lightweight context encoder. This plug-and-play encoder processes masked videos and injects background guidance into any pre-trained video diffusion transformer, generalizing across arbitrary mask types, enhancing background integration and foreground generation, and enabling user-customized control. We further introduce a strategy to resample inpainting regions for maintaining ID consistency in any-length video inpainting. Additionally, we develop a scalable dataset pipeline using advanced vision models and construct VPData and VPBench, the largest video inpainting dataset with segmentation masks and dense captions (>390K clips), to support large-scale training and evaluation. We also show VideoPainter’s promising potential in downstream applications such as video editing. Extensive experiments demonstrate VideoPainter’s state-of-the-art performance in any-length video inpainting and editing across 8 key metrics, including video quality, mask region preservation, and textual coherence.
Automatic video colorization poses challenges, requiring efficient generation of results that ensure frame and color consistency. Previous video colorization works often suffer from issues such as color flickering, bleeding, artifacts, and low color richness due to the inherent ambiguity and limitations of the models. While diffusion-based video-to-video approaches can produce customized colorization models through fine-tuning, their high inference costs limit their suitability for real-time scenarios. To address these challenges, we propose ColorSurge, a lightweight network for efficient end-to-end video colorization. ColorSurge employs a dual-branch structure, consisting of a grayscale branch and a color branch. In the grayscale branch, we extract the semantic content of grayscale videos and reconstruct and output features at different spatial scales. In the color branch, we introduce learnable color tokens and fuse these multi-scale semantic features through stacked Color Alchemy Blocks (CABs). Within each CAB, we incorporate Color Spatial Transformer Blocks (CSTB) and Color Temporal Transformer Blocks (CTTB) to constrain the spatial harmony and temporal consistency of colors. Finally, we use a Color Mapper to unify the grayscale and color features, mapping them to obtain the final colorized video result. Extensive experiments demonstrate that our method significantly outperforms previous state-of-the-art models in both qualitative and quantitative evaluations. Our code and model are available at .
Temporal video frame interpolation has been an active area of research in recent years, with a primary focus on motion estimation, compensation, and synthesis of the final frame. While recent methods have shown good quality results in many cases, they can still fail in challenging scenarios. Moreover, they typically produce fixed outputs with no means of control, further limiting their application in film production pipelines. In this work, we address the less explored problem of user-assisted frame interpolation to improve quality and enable control over the appearance and motion of interpolated frames. To this end, we introduce a tracking-based video frame interpolation method that utilizes sparse point tracks, first estimated and interpolated with existing point tracking methods and then optionally refined by the user. Additionally, we propose a mechanism for controlling the levels of hallucination in interpolated frames through inference-time model weight adaptation, allowing a continuous trade-off between hallucination and blurriness.
Even without any user input, our model achieves state-of-the-art results in challenging test cases. By using points tracked over the whole sequence, we can apply better motion-trajectory interpolation methods, such as cubic splines, to more accurately represent the true motion and achieve significant improvements in results. Our experiments demonstrate that refining tracks and their trajectories through user interactions significantly improves the quality of interpolated frames.
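A minimal sketch of the trajectory-interpolation step: with a point tracked over several frames, a Catmull-Rom cubic positions it at a fractional time far more accurately than linear blending whenever the motion curves. The synthetic circular track below is only for illustration.

```python
import numpy as np

def catmull_rom(p0, p1, p2, p3, t):
    # Cubic Catmull-Rom interpolation between p1 and p2, with t in [0, 1].
    return 0.5 * ((2.0 * p1)
                  + (-p0 + p2) * t
                  + (2.0 * p0 - 5.0 * p1 + 4.0 * p2 - p3) * t**2
                  + (-p0 + 3.0 * p1 - 3.0 * p2 + p3) * t**3)

# A synthetic track following a circular arc, sampled at integer frames.
frames = np.arange(6)
track = np.stack([np.cos(0.4 * frames), np.sin(0.4 * frames)], axis=1)

# Interpolate the position at frame 2.5 and compare against the true arc position.
t = 0.5
cubic = catmull_rom(track[1], track[2], track[3], track[4], t)
linear = (1.0 - t) * track[2] + t * track[3]
truth = np.array([np.cos(0.4 * 2.5), np.sin(0.4 * 2.5)])
print(np.linalg.norm(cubic - truth), np.linalg.norm(linear - truth))
```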
Recent years have seen a tremendous improvement in the quality of video generation and editing approaches. While several techniques focus on editing appearance, few address motion. Current approaches using text, trajectories, or bounding boxes are limited to simple motions, so we specify motions with a single motion reference video instead. We further propose to use a pre-trained image-to-video model rather than a text-to-video model. This approach allows us to preserve the exact appearance and position of a target object or scene and helps disentangle appearance from motion.
Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input, while the text/image embedding injected via cross-attention predominantly controls motion. We thus represent motion using text/image embedding tokens. By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve a high temporal motion granularity. Once optimized on the motion reference video, this embedding can be applied to various target images to generate videos with semantically similar motions.
Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks such as full-body and face reenactment, as well as controlling the motion of inanimate objects and the camera. We empirically demonstrate the effectiveness of our method in the semantic video motion transfer task, significantly outperforming existing methods in this context.
Project website: https://mkansy.github.io/reenact-anything/
We propose a novel approach for the computational modeling of lignified tissues, such as those found in tree branches and timber. We leverage a state-of-the-art strand-based representation for tree form, which we extend to describe biophysical processes at short and long time scales. Simulations at short time scales enable us to model different breaking patterns caused by branch bending and twisting. On long time scales, our method enables the simulation of realistic branch shapes under the influence of plausible biophysical processes, such as the development of compression and tension wood. We specifically focus on computationally fast simulations of woody material, enabling the interactive exploration of branches and wood breaking. By leveraging Cosserat rod physics, our method enables the generation of a wide variety of breaking patterns. We showcase the capabilities of our method by performing and visualizing numerous experiments.
We develop a method for automatic placement of knit singularities based on curl quantization, extending the knit-planning frameworks of Mitra et al. [2024; 2023]. Stripe patterns are generated that closely follow the isolines of an underlying knitting time function and have course and wale singularities in regions of high curl of the normalized time-function gradient and its 90°-rotated field, respectively. Singularities are placed in an iterative fashion, and we show that this strategy allows us to easily maintain the structural constraints necessary for machine-knitting, e.g., the helix-free constraint, and to satisfy user constraints such as stripe alignment and singularity placement. Our more performant approach obviates the need for a mixed-integer solve [Mitra et al. 2023], manual fixing of singularity positions, or the running of a singularity matching procedure in post-processing [Mitra et al. 2024]. Our global optimization also produces smooth knit graphs that provide quick simulation-free previews of rendered knits without the surface artifacts of competing methods. Furthermore, we extend our method to the popular cut-and-sew garment design paradigm. We validate our method by machine-knitting and rendering yarn-based visualizations of prototypical models in the 3D and cut-and-sew settings.
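A minimal sketch of the quantity that drives singularity placement, computed on a regular grid rather than a mesh: normalize the gradient of a knitting time function, measure the curl of the normalized field with finite differences, and flag the cells of largest |curl| as candidate course-singularity sites (wale singularities would use the 90-degree-rotated field). The synthetic time function is an assumption made only for illustration.

```python
import numpy as np

# Synthetic knitting time function on a grid (the real method works on a mesh).
n = 128
y, x = np.meshgrid(np.linspace(-1, 1, n), np.linspace(-1, 1, n), indexing='ij')
time = np.sqrt(x**2 + y**2) + 0.15 * x          # isolines bunch up unevenly

# Normalized gradient of the time function.
gy, gx = np.gradient(time)
norm = np.sqrt(gx**2 + gy**2) + 1e-12
ux, uy = gx / norm, gy / norm

# Scalar 2D curl of the normalized field: d(uy)/dx - d(ux)/dy.
duy_dy, duy_dx = np.gradient(uy)
dux_dy, dux_dx = np.gradient(ux)
curl = duy_dx - dux_dy

# Candidate course singularities: the cells where |curl| is largest.
flat = np.argsort(np.abs(curl).ravel())[::-1][:5]
print(np.column_stack(np.unravel_index(flat, curl.shape)))
```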
Manufacturability is vital for product design and production, with accessibility being a key element, especially in subtractive manufacturing. Traditional methods for geometric accessibility analysis are time-consuming and struggle with scalability, while existing deep learning approaches in manufacturability analysis often neglect geometric challenges in accessibility and are limited to specific model types. In this paper, we introduce DeepMill, the first neural framework designed to accurately and efficiently predict inaccessible and occlusion regions under varying machining tool parameters, applicable to both CAD and freeform models. To address the challenges posed by cutter collisions and the lack of extensive training datasets, we construct a cutter-aware dual-head octree-based convolutional neural network (O-CNN) and generate a dataset for inaccessible- and occlusion-region analysis with a variety of cutter sizes for network training. Experiments demonstrate that DeepMill achieves 94.7% accuracy in predicting inaccessible regions and 88.7% accuracy in identifying occlusion regions, with an average processing time of 0.04 seconds for finely-tessellated geometries. These results indicate that DeepMill implicitly captures both local and global geometric features, as well as the complex interactions between cutters and intricate 3D models. Code is publicly available at .
Heat exchangers are critical components in a wide range of engineering applications, from energy systems to chemical processing, where efficient thermal management is essential. The design objectives for heat exchangers include maximizing the heat exchange rate while minimizing the pressure drop, requiring both a large interface area and a smooth internal structure. State-of-the-art designs, such as triply periodic minimal surfaces (TPMS), have proven effective in optimizing heat exchange efficiency. However, TPMS designs are constrained by predefined mathematical equations, limiting their adaptability to freeform boundary shapes. Additionally, TPMS structures do not inherently control flow directions, which can lead to flow stagnation and undesirable pressure drops.
This paper presents DualMS, a novel computational framework for optimizing dual-channel minimal surfaces specifically for heat exchanger designs in freeform shapes. To the best of our knowledge, this is the first attempt to directly optimize minimal surfaces for two-fluid heat exchangers, rather than relying on TPMS. Our approach formulates the heat exchange maximization problem as a constrained connected maximum cut problem on a graph, with flow constraints guiding the optimization process. To address undesirable pressure drops, we model the minimal surface as a classification boundary separating the two fluids, incorporating an additional regularization term for area minimization. We employ a neural network that maps spatial points to binary flow types, enabling it to classify flow skeletons and automatically determine the surface boundary. DualMS demonstrates greater flexibility in surface topology compared to TPMS and achieves superior thermal performance, with lower pressure drops while maintaining a similar heat exchange rate under the same material cost. The project is open-sourced at .
Photorealistic rendering aims to accurately replicate real-world appearances. Traditional methods, like microfacet-based models, often struggle with complex visuals. Consequently, neural material techniques have emerged, typically offering improved performance over traditional approaches. However, these neural material approaches only attempt to address one or a few essential aspects of the complete appearance while neglecting others (quality, parallax & silhouette, synthesis, performance). Although these aspects may seem separate, they are inherently intertwined parts of the complete appearance and cannot be isolated. In this paper, we take on the challenge of a comprehensive neural material representation by thoroughly considering these essential aspects of the complete appearance. We introduce an int8-quantized neural network that maintains high fidelity (quality) while achieving an order-of-magnitude speedup (performance) compared to previous methods. We also present a controllable structure-preserving synthesis strategy (synthesis), along with accurate displacement effects (parallax & silhouette) through a dynamic two-step displacement tracing technique.
Advancements in neural rendering techniques have sparked renewed interest in neural materials, which are capable of representing bidirectional texture functions (BTFs) cheaply and with high quality. However, content creation in the neural material format is not straightforward. To address this limitation, we present the first image-conditioned diffusion model for neural materials, and show an extension to text conditioning. To achieve this, we make two main contributions: (1) we introduce a universal MLP variant of the NeuMIP architecture, defining a universal basis for neural materials as 16-channel feature textures, and (2) we train a conditional diffusion model for generating neural materials in this basis from flash images, natural images and text prompts. To train this model, we also construct a new dataset of 150k neural materials in 16 categories, since no large-scale neural material dataset exists. To our knowledge, our work is the first to enable single-shot neural material generation from arbitrary text or image prompts.
Neural bidirectional reflectance distribution functions (BRDFs) have emerged as popular material representations for enhancing realism in physically-based rendering. Yet their importance sampling remains a significant challenge. In this paper, we introduce a reparameterization-based formulation of neural BRDF importance sampling that seamlessly integrates into the standard rendering pipeline with precise generation of BRDF samples. The reparameterization-based formulation transfers the distribution learning task to a problem of identifying BRDF integral substitutions. In contrast to previous methods that rely on invertible networks and multi-step inference to reconstruct BRDF distributions, our model removes these constraints, which offers greater flexibility and efficiency. Our variance and performance analysis demonstrates that our reparameterization method achieves the best variance reduction in neural BRDF renderings while maintaining high inference speeds compared to existing baselines.
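The reparameterization viewpoint can be illustrated with an analytic warp standing in for the learned one: sampling is a fixed smooth map from uniform random numbers to directions, the sample density is the warp's Jacobian, and the estimator divides by that density as usual. The cosine-weighted warp, Lambertian lobe, and radiance field below are illustrative assumptions, not the neural BRDF sampler.

```python
import numpy as np

rng = np.random.default_rng(1)

def warp_cosine(u):
    """Map uniform samples u in [0,1]^2 to cosine-distributed directions on the z-up hemisphere."""
    r = np.sqrt(u[:, 0])
    phi = 2.0 * np.pi * u[:, 1]
    x, y = r * np.cos(phi), r * np.sin(phi)
    z = np.sqrt(np.maximum(0.0, 1.0 - u[:, 0]))
    pdf = z / np.pi                          # cos(theta) / pi: the density induced by this warp
    return np.stack([x, y, z], axis=1), pdf

def brdf_times_cos(wi):
    # Lambertian stand-in for the neural BRDF: (albedo / pi) * cos(theta_i).
    albedo = 0.7
    return (albedo / np.pi) * np.maximum(wi[:, 2], 0.0)

def incident_radiance(wi):
    # Some illustrative incident radiance field.
    return 1.0 + 0.5 * wi[:, 0]

# Monte Carlo estimate of the reflection integral, dividing by the warp's density.
n = 4096
dirs, pdf = warp_cosine(rng.random((n, 2)))
estimate = np.mean(brdf_times_cos(dirs) * incident_radiance(dirs) / pdf)
# Analytic reference for this toy setup is exactly the albedo (the x-term integrates to zero).
print(estimate, 0.7)
```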
We present a tool for enhancing the detail of physically based materials using an off-the-shelf diffusion model and inverse rendering. Our goal is to increase the visual fidelity of existing materials by adding, for instance, signs of wear, aging, and weathering that are tedious to author. To obtain realistic appearance with minimal user effort, we leverage a generative image model trained on a large dataset of natural images. Given the geometry, UV mapping, and basic appearance of an object, we proceed as follows: We render multiple views of the object and use them, together with an appearance-defining text prompt, to condition a diffusion model. The generated details are then backpropagated from the enhanced images to the material parameters via inverse rendering. For inverse rendering to be successful, the generated appearance has to be consistent across all the images. We propose two priors to address the multi-view consistency of the diffusion model. First, we ensure that the noise that seeds the diffusion process is itself consistent across views by integrating it from a view-independent UV space. Second, we enforce spatial consistency by biasing the attention mechanism via a projective constraint so that pixels attend strongly to their corresponding pixel locations in other views. Our approach does not require any training or finetuning of the diffusion model, is agnostic to the used material model, and the enhanced material properties, i.e., 2D PBR textures, can be further edited by artists. We demonstrate prompt-based material edits exhibiting high levels of realism and detail. This project is available at https://generative-detail.github.io.
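A minimal sketch of the first prior: the diffusion seed noise is drawn once in a view-independent UV texture and gathered per view through each view's UV map, so pixels that see the same surface point share the same seed. The paper integrates the noise from UV space to preserve its statistics; the nearest-texel lookup and synthetic UV maps below ignore that and only demonstrate the shared-seed idea.

```python
import numpy as np

rng = np.random.default_rng(2)

# Seed noise lives in a view-independent UV texture.
uv_res = 256
noise_uv = rng.standard_normal((uv_res, uv_res))

def view_noise(uv_map):
    """Gather per-pixel seed noise for one rendered view from its UV coordinates."""
    u = np.clip((uv_map[..., 0] * (uv_res - 1)).astype(int), 0, uv_res - 1)
    v = np.clip((uv_map[..., 1] * (uv_res - 1)).astype(int), 0, uv_res - 1)
    return noise_uv[v, u]

# Two synthetic views that see overlapping parts of the UV chart.
h, w = 64, 64
vv, uu = np.meshgrid(np.linspace(0.2, 0.8, h), np.linspace(0.2, 0.8, w), indexing='ij')
uv_view_a = np.stack([uu, vv], axis=-1)
uv_view_b = np.stack([uu * 0.9 + 0.05, vv], axis=-1)   # a slightly shifted viewpoint

noise_a, noise_b = view_noise(uv_view_a), view_noise(uv_view_b)
# Pixels that map to the same texel receive the same seed noise in both views,
# which is what keeps the generated detail consistent across views.
```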
High-quality 3D assets are essential for various applications in computer graphics and 3D vision but remain scarce due to significant acquisition costs. To address this shortage, we introduce Elevate3D, a novel framework that transforms readily accessible low-quality 3D assets into higher quality. At the core of Elevate3D is HFS-SDEdit, a specialized texture enhancement method that significantly improves texture quality while preserving the overall appearance and geometry and repairing their degradations. Furthermore, Elevate3D operates in a view-by-view manner, alternating between texture and geometry refinement. Unlike previous methods that have largely overlooked geometry refinement, our framework leverages geometric cues from images refined with HFS-SDEdit by employing state-of-the-art monocular geometry predictors. This approach ensures detailed and accurate geometry that aligns seamlessly with the enhanced texture. Elevate3D outperforms recent competitors by achieving state-of-the-art quality in 3D model refinement, effectively addressing the scarcity of high-quality open-source 3D assets.
We present a real-time deformation method for Escher tiles—interlocking organic forms that seamlessly tessellate the plane following symmetry rules. We formulate the problem as determining a periodic displacement field. The goal is to deform Escher tiles without introducing gaps or overlaps. The resulting displacement field is obtained in closed form by an analytical solution. Our method processes tiles of 17 wallpaper groups across various representations such as images and meshes. Rather than treating tiles as mere boundaries, we consider them as textured shapes, ensuring that both the boundary and interior deform simultaneously. To enable fine-grained artistic input, our interactive tool features a user-controllable adaptive fall-off parameter, allowing precise adjustment of locality and supporting deformations with meaningful semantic control. We demonstrate the effectiveness of our method through various examples, including photo editing and shape sculpting, showing its use in applications such as fabrication and animation.
The advantages of higher-order surfaces, such as their ability to represent complex geometry compactly and smoothly, have led to their increasing use in computer graphics. This trend underscores the importance of developing efficient rendering algorithms tailored for these representations. We introduce PaRas, a highly performant rasterizer for real-time rendering of large-scale parametric surfaces with high precision. Unlike conventional graphics pipelines that rely on hardware tessellation to convert smooth surfaces into numerous flat triangles, our method provides a highly efficient and parallel approach to directly rasterize parametric surfaces. PaRas seamlessly integrates into existing workflows, enabling smooth surfaces to be handled with the same ease as triangle meshes. To accomplish this, we formulate the rasterization of parametric surfaces as a point inversion problem, employing a Newton-type iteration on the GPU to compute precise solutions. The framework’s effectiveness is demonstrated on quartic triangular Bézier patches and rational Bézier patches, both commonly used in high-precision modeling and industrial applications. Experimental results indicate that our rendering pipeline achieves higher efficiency and greater accuracy compared to traditional hardware tessellation techniques.
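A minimal sketch of the per-pixel point-inversion step under an orthographic projection: Newton-iterate on (u, v) until the patch point's screen-space coordinates hit the pixel center. A biquadratic Bézier patch and a finite-difference Jacobian are used for brevity; both are assumptions made for illustration, whereas the actual pipeline evaluates exact derivatives on the GPU.

```python
import numpy as np

def bernstein2(t):
    # Quadratic Bernstein basis.
    return np.array([(1 - t)**2, 2 * t * (1 - t), t**2])

def patch_point(ctrl, u, v):
    """Evaluate a tensor-product quadratic Bezier patch at (u, v)."""
    return np.einsum('i,j,ijk->k', bernstein2(u), bernstein2(v), ctrl)

def invert_pixel(ctrl, pixel_xy, uv0=(0.5, 0.5), iters=10):
    """Newton iteration solving patch_point(u, v).xy == pixel_xy (orthographic view)."""
    uv = np.array(uv0, dtype=float)
    eps = 1e-5
    for _ in range(iters):
        f = patch_point(ctrl, *uv)[:2] - pixel_xy
        # Finite-difference 2x2 Jacobian of the screen-space residual.
        J = np.column_stack([
            (patch_point(ctrl, uv[0] + eps, uv[1])[:2] - patch_point(ctrl, uv[0] - eps, uv[1])[:2]) / (2 * eps),
            (patch_point(ctrl, uv[0], uv[1] + eps)[:2] - patch_point(ctrl, uv[0], uv[1] - eps)[:2]) / (2 * eps),
        ])
        uv -= np.linalg.solve(J, f)
        uv = np.clip(uv, 0.0, 1.0)
    return uv

# A gently curved 3x3 control net; z is the height available for depth after inversion.
ctrl = np.array([[[x, y, 0.3 * np.sin(x + y)] for y in (0.0, 0.5, 1.0)] for x in (0.0, 0.5, 1.0)])
uv = invert_pixel(ctrl, pixel_xy=np.array([0.37, 0.62]))
print(uv, patch_point(ctrl, *uv)[:2])
```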
Recent advancements in deep learning have revolutionized the reconstruction of spatially-varying surface reflectance of real-world objects. Many existing methods have successfully recovered high-quality reflectance maps using a remarkably limited number of images captured by a lightweight handheld camera and a flash-like light source. As the samples become sparse, the choice of the sampling set has a significant impact on the results. To determine the best sampling set for each material while ensuring minimal capture costs, we introduce an appearance-aware adaptive sampling method in this paper. We model the sampling process as a sequential decision-making problem, and employ a deep reinforcement learning (DRL) framework to solve it. At each step, an agent (NBVL Planner), after being trained on a specially designed dataset, plans the next best view-lighting (NBVL) pair based on the appearance of the material recognized so far. Once stopped, the sequence of the NBVLs constitutes the best sampling set for the material. We show, through extensive experiments on both synthetic materials and real-world cases, that the best sampling set extracted by our method outperforms other sampling sets, especially for challenging materials featuring globally-varying specular reflectance.
This paper aims to quantify uncertainty for SVBRDF acquisition in multi-view captures. Under uncontrolled illumination and unstructured viewpoints, there is no guarantee that the observations contain enough information to reconstruct the appearance properties of a captured object. We study this ambiguity, or uncertainty, using entropy and accelerate the analysis by using the frequency domain, rather than the domain of incoming and outgoing viewing angles. The result is a method that computes a map of uncertainty over an entire object within a millisecond. We find that the frequency model allows us to recover SVBRDF parameters with competitive performance, that the accelerated entropy computation matches results with a physically-based path tracer, and that there is a positive correlation between error and uncertainty. We then show that the uncertainty map can be applied to improve SVBRDF acquisition using capture guidance, sharing information on the surface, and using a diffusion model to inpaint uncertain regions. Our code is available at .
Cinematic lighting is a powerful tool used in film-making to create a mood or atmosphere and to influence the audience’s perception and emotional response to a scene. For example, red can be used to increase feelings of anxiety or excitement, while blue might have a more calming effect. These responses can be harnessed to enhance the storytelling. Previous studies in Psychology have shown that light color has a direct impact on the perception of emotions and feelings. However, there is a lack of controlled empirical studies for understanding if lighting alone can alter the interpretation of emotion. Realistic virtual humans are an underused tool to study these effects in a controlled manner as they retain the same emotional expression across lighting conditions, and can display the same emotion across different genders and races. In this paper, we focus on studying the effect of light temperature, color, and shadow on the interpretation of emotions of realistic virtual humans, and compare to a human photo baseline. We are particularly interested in recognition of emotion, emotion intensity, and genuineness of the emotion. Our findings can be used by developers to increase the emotional intensity and genuineness of their virtual humans.
Shape primitive abstraction, which decomposes complex 3D shapes into simple geometric elements, plays a crucial role in human visual cognition and has broad applications in computer vision and graphics. While recent advances in 3D content generation have shown remarkable progress, existing primitive abstraction methods either rely on geometric optimization with limited semantic understanding or learn from small-scale, category-specific datasets, struggling to generalize across diverse shape categories. We present PrimitiveAnything, a novel framework that reformulates shape primitive abstraction as a primitive assembly generation task. PrimitiveAnything includes a shape-conditioned primitive transformer for auto-regressive generation and an ambiguity-free parameterization scheme to represent multiple types of primitives in a unified manner. The proposed framework directly learns the process of primitive assembly from large-scale human-crafted abstractions, enabling it to capture how humans decompose complex shapes into primitive elements. Through extensive experiments, we demonstrate that PrimitiveAnything can generate high-quality primitive assemblies that better align with human perception while maintaining geometric fidelity across diverse shape categories. It benefits various 3D applications and shows potential for enabling primitive-based user-generated content (UGC) in games. Project page: https://primitiveanything.github.io
Collage and packing techniques are widely used to organize geometric shapes into cohesive visual representations, facilitating the representation of visual features holistically, as seen in image collages and word clouds. Traditional methods often rely on object-space optimization, requiring intricate geometric descriptors and energy functions to handle complex shapes. In this paper, we introduce a versatile image-space collage technique. Leveraging a differentiable renderer, our method effectively optimizes the object layout with image-space losses, bringing the benefits of fixed complexity and easy accommodation of various shapes. Applying a hierarchical resolution strategy in image space, our method optimizes the collage with fast convergence, taking large coarse steps first and then small precise steps. The diverse visual expressiveness of our approach is demonstrated through various examples. Experimental results show that our method achieves an order-of-magnitude speedup compared to state-of-the-art techniques.
Three-dimensional building generation is vital for applications in gaming, virtual reality, and digital twins, yet current methods face challenges in producing diverse, structured, and hierarchically coherent buildings. We propose BuildingBlock, a hybrid approach that integrates generative models, procedural content generation (PCG), and large language models (LLMs) to address these limitations. Specifically, our method introduces a two-phase pipeline: the Layout Generation Phase (LGP) and the Building Construction Phase (BCP). The LGP reframes box-based layout generation as a point-cloud generation task, utilizing a newly constructed architectural dataset and a Transformer-based diffusion model to create globally consistent layouts. With LLMs, these layouts are extended into rule-based hierarchical designs, seamlessly incorporating component styles and spatial structures. The BCP leverages these layouts to guide PCG, enabling locally customizable, high-quality structured building generation. Experimental results demonstrate BuildingBlock’s effectiveness in generating diverse and hierarchically structured buildings, achieving state-of-the-art results on multiple benchmarks, and paving the way for scalable and intuitive architectural workflows.
A procedural program is the representation of a family of assets that share the same structural or semantic properties, whose final appearance is determined by different parameter assignments. Identifying the parameter values that define a desired asset is usually a time-consuming operation, since it requires manually tuning parameters separately and in a non-intuitive manner. In the domain of procedural patterns, recent works focused on estimating parameter values to match a target render or sketch, using parameter optimization or inference via neural networks. However, these approaches are neither fast enough for interactive design nor precise enough to give direct control. In this work, we propose an interactive method for procedural parameter estimation based on the idea of scaffolded procedural patterns. A scaffolded procedural pattern is a sequence of procedural programs that model a pattern in a coarse-to-fine manner, in which the desired pattern appearance is reached step-by-step by inheriting previously optimized parameters. Through scaffolding, patterns are more straightforward to sketch for users and easier to optimize for most algorithms. In our implementation, patterns are represented as procedural signed distance functions whose parameters are estimated with a gradient-free optimization method that runs in real-time on the GPU. We show that scaffolded patterns can be created with a node-based interface familiar to artists. We validate our approach by creating and interactively editing several scaffolded patterns. We show the effectiveness of scaffolding through a user study, where scaffolding enhances both the output quality and the editing experience with respect to approaches that optimize the procedural parameters all at once. We also perform a comparison with previous strategies and provide several recordings of real-time editing sessions in the accompanying materials.
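A rough sketch of the scaffolding idea on a toy dot pattern: a coarse program exposing only the spacing is matched to the target first, then a finer program inherits that spacing and refines radius and softness, each stage using a gradient-free search. The SDF, the image-space loss, and the random_search helper are illustrative stand-ins for the paper's procedural signed distance functions and GPU optimizer.

```python
import numpy as np

rng = np.random.default_rng(3)
res = 96
yy, xx = np.meshgrid(np.linspace(0, 1, res), np.linspace(0, 1, res), indexing='ij')

def render_dots(spacing, radius, softness):
    """Procedural SDF of a dot grid, rendered to a soft coverage image."""
    cx = (xx % spacing) - spacing / 2.0
    cy = (yy % spacing) - spacing / 2.0
    sdf = np.sqrt(cx**2 + cy**2) - radius
    return 1.0 / (1.0 + np.exp(sdf / softness))     # soft inside/outside mask

def random_search(loss, x0, lo, hi, iters=300, sigma=0.05):
    """Gradient-free local search: keep the best random perturbation seen so far."""
    best = np.array(x0, dtype=float)
    best_loss = loss(best)
    for _ in range(iters):
        cand = np.clip(best + sigma * rng.standard_normal(len(best)), lo, hi)
        l = loss(cand)
        if l < best_loss:
            best, best_loss = cand, l
    return best

# The "target sketch" the user is trying to match.
target = render_dots(spacing=0.21, radius=0.06, softness=0.01)

# Stage 1 (coarse program): only the spacing is free; radius/softness use defaults.
coarse = random_search(lambda p: np.mean((render_dots(p[0], 0.05, 0.02) - target)**2),
                       x0=[0.3], lo=0.1, hi=0.5)

# Stage 2 (fine program): inherit the optimized spacing, refine radius and softness.
fine = random_search(lambda p: np.mean((render_dots(coarse[0], p[0], p[1]) - target)**2),
                     x0=[0.05, 0.02], lo=[0.01, 0.003], hi=[0.15, 0.05])
print(coarse, fine)
```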