CVMP '20: Proceedings of the 17th ACM SIGGRAPH European Conference on Visual Media Production

SECTION: Session 1: Production

High Fidelity Interactive Video Segmentation Using Tensor Decomposition, Boundary Loss, Convolutional Tessellations, and Context-Aware Skip Connections

We provide a high fidelity deep learning algorithm (HyperSeg) for interactive video segmentation tasks using a convolutional network with context-aware skip connections and compressed "hypercolumn" image features combined with a convolutional tessellation procedure. In order to maintain high output fidelity, our model crucially processes and renders all image features in high resolution, without utilizing downsampling or pooling procedures. We maintain this consistent, high-grade fidelity efficiently in our model chiefly through two means: (1) we use a statistically-principled tensor decomposition procedure to modulate the number of hypercolumn features, and (2) we render these features in their native resolution using a convolutional tessellation technique. For improved pixel-level segmentation results, we introduce a boundary loss function; for improved temporal coherence in video data, we include temporal image information in our model. Through experiments, we demonstrate the improved accuracy of our model against baseline models for interactive segmentation tasks using high resolution video data. We also introduce a benchmark video segmentation dataset, the VFX Segmentation Dataset, which contains over 27,046 high resolution video frames, including green screen and various composited scenes with corresponding, hand-crafted, pixel-level segmentations. Our work improves on state-of-the-art segmentation fidelity for high resolution data and can be used across a broad range of application domains, including VFX pipelines and medical imaging disciplines.
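As a rough illustration of the feature-compression step only (the abstract does not specify which tensor decomposition is used, so the PCA-style truncated SVD, array shapes, and choice of k below are illustrative assumptions, not the authors' implementation), the following sketch reduces a stack of per-pixel hypercolumn features to a smaller number of channels:

```python
import numpy as np

def compress_hypercolumns(features, k):
    """Reduce an (H, W, C) hypercolumn stack to k channels.

    Illustrative sketch only: a PCA-style truncated SVD stands in for the
    paper's (unspecified) statistically-principled tensor decomposition.
    """
    H, W, C = features.shape
    X = features.reshape(-1, C)                    # one hypercolumn per pixel
    X_centred = X - X.mean(axis=0, keepdims=True)
    # Principal directions of the per-pixel feature vectors.
    _, _, Vt = np.linalg.svd(X_centred, full_matrices=False)
    X_reduced = X_centred @ Vt[:k].T               # project onto top-k components
    return X_reduced.reshape(H, W, k)

# Example: compress a 256-channel hypercolumn stack to 32 channels.
hypercols = np.random.rand(64, 64, 256).astype(np.float32)
print(compress_hypercolumns(hypercols, k=32).shape)  # (64, 64, 32)
```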

Image Decomposition using Geometric Region Colour Unmixing

In this paper, we propose a new geometric approach for image decomposition which aims to combine the advantages of two state-of-the-art techniques. Given an input image, we first compute a palette of colours from the image and use it to split the RGB space into a number of different regions. Depending on which region a given pixel lies in, different geometric methods are used to unmix the pixel’s colour into a number of colours, where each colour is associated with a different layer. The layers created are smooth and homogeneous in colour, and have no reconstruction error when recombined. Our layer decomposition technique is fast to compute and the layers created can be used successfully in several applications, including layer compositing and recolouring.
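The abstract does not spell out the geometric unmixing itself; as a hedged sketch of the general idea of writing a pixel as a non-negative, sum-to-one combination of palette colours with no reconstruction error, the snippet below solves a small constrained least-squares problem (the palette, pixel values, and constraint weighting are made up for illustration):

```python
import numpy as np
from scipy.optimize import nnls

def unmix_pixel(pixel, palette):
    """Unmix an RGB pixel into non-negative weights over a colour palette.

    Hypothetical sketch: the paper's region-dependent geometric unmixing is
    approximated by non-negative least squares with a sum-to-one constraint
    appended as an extra, heavily weighted row.
    """
    A = np.vstack([palette.T, 100.0 * np.ones(len(palette))])  # 3 colour rows + 1 constraint row
    b = np.append(pixel, 100.0)
    weights, _ = nnls(A, b)
    return weights

palette = np.array([[1.0, 0.0, 0.0],   # red
                    [0.0, 0.0, 1.0],   # blue
                    [1.0, 1.0, 1.0]])  # white
print(unmix_pixel(np.array([0.6, 0.1, 0.5]), palette))  # ~[0.5, 0.4, 0.1]
```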

Exploring the Use of Skeletal Tracking for Cheaper Motion Graphs and On-Set Decision Making in Free-Viewpoint Video Production

In free-viewpoint video (FVV), the motion and surface appearance of a real-world performance are captured as an animated mesh. While this technology can produce high-fidelity recreations of actors, the required 3D reconstruction step has substantial processing demands. FVV experiences are therefore currently expensive to produce, and the processing delay leaves on-set decisions hampered by a lack of feedback. This work explores the possibility of using RGB-camera-based skeletal tracking to reduce the amount of content that must be 3D reconstructed, as well as aiding on-set decision making. One particularly relevant application is in the construction of Motion Graphs, where state-of-the-art techniques require large amounts of content to be 3D reconstructed before a graph can be built, resulting in large amounts of wasted processing effort. Here, we propose the use of skeletons to assess which clips of FVV content to process, resulting in substantial cost savings with a limited impact on performance accuracy. Additionally, we explore how this technique could be utilised on set to reduce the possibility of requiring expensive reshoots.
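A minimal sketch of how skeletal poses might be used to pre-select clips before 3D reconstruction is given below; the mean joint-distance metric, the threshold, and the clip data structure are hypothetical stand-ins rather than the paper's actual criteria:

```python
import numpy as np

def pose_distance(pose_a, pose_b):
    """Mean joint-to-joint Euclidean distance between two (J, 3) skeleton poses.

    A hypothetical similarity measure, not necessarily the one used in the paper.
    """
    return np.linalg.norm(pose_a - pose_b, axis=1).mean()

def select_clips_to_reconstruct(clips, threshold=0.05):
    """Keep only clips whose end pose lies close to some other clip's start pose,
    i.e. clips likely to yield usable motion-graph transitions."""
    selected = set()
    for i, a in enumerate(clips):
        for j, b in enumerate(clips):
            if i != j and pose_distance(a["end_pose"], b["start_pose"]) < threshold:
                selected.update({i, j})
    return sorted(selected)

# Toy example with two 17-joint clips whose boundary poses nearly match.
rng = np.random.default_rng(0)
pose = rng.normal(size=(17, 3))
clips = [{"start_pose": pose, "end_pose": pose + 0.01},
         {"start_pose": pose + 0.02, "end_pose": pose + 1.0}]
print(select_clips_to_reconstruct(clips))  # [0, 1]
```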

SECTION: Session 2: Learning

Neural Face Models for Example-Based Visual Speech Synthesis

Creating realistic animations of human faces with computer graphics models is still a challenging task. It is often solved either with tedious manual work or motion capture based techniques that require specialised and costly hardware. Example-based animation approaches circumvent these problems by re-using captured data of real people. This data is split into short motion samples that can be looped or concatenated in order to create novel motion sequences. The obvious advantages of this approach are the simplicity of use and the high realism, since the data exhibits only real deformations. Rather than tuning weights of a complex face rig, the animation task is performed on a higher level by arranging typical motion samples such that the desired facial performance is achieved. Two difficulties with example-based approaches, however, are high memory requirements as well as the creation of artefact-free and realistic transitions between motion samples. We solve these problems by combining the realism and simplicity of example-based animations with the advantages of neural face models. Our neural face model is capable of synthesising high quality 3D face geometry and texture according to a compact latent parameter vector. This latent representation reduces memory requirements by a factor of 100 and helps create seamless transitions between concatenated motion samples. In this paper, we present a marker-less approach for facial motion capture based on multi-view video. Based on the captured data, we learn a neural representation of facial expressions, which is used to seamlessly concatenate facial performances during the animation procedure. We demonstrate the effectiveness of our approach by synthesising mouthings for Swiss-German sign language based on viseme query sequences.
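One way such a compact latent representation can help with seamless concatenation is to blend latent codes across a short transition window before decoding; the sketch below assumes a toy decoder architecture, latent size, and vertex count purely for illustration and is not the paper's model:

```python
import torch

# Hypothetical stand-in for a neural face model: a decoder mapping a compact
# latent vector to 3D face geometry (a flat vertex buffer here).
decoder = torch.nn.Sequential(
    torch.nn.Linear(64, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 5023 * 3),   # assumed vertex count, for illustration
)

def blend_transition(z_end, z_start, n_frames=10):
    """Make a transition by interpolating the latent code of the last frame of
    one motion sample with that of the first frame of the next, then decoding."""
    frames = []
    for t in torch.linspace(0.0, 1.0, n_frames):
        z = (1.0 - t) * z_end + t * z_start       # linear blend in latent space
        frames.append(decoder(z).reshape(-1, 3))  # decode to vertex positions
    return torch.stack(frames)

transition = blend_transition(torch.randn(64), torch.randn(64))
print(transition.shape)  # torch.Size([10, 5023, 3])
```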

Constant Velocity Constraints for Self-Supervised Monocular Depth Estimation

We present a new method for self-supervised monocular depth estimation. Contemporary monocular depth estimation methods use a triplet of consecutive video frames to estimate the central depth image. We make the assumption that the ego-centric view progresses linearly in the scene, based on the kinematic and physical properties of the camera. During the training phase, we can exploit this assumption to create a depth estimation for each image in the triplet. We then apply a new geometry constraint that supports novel synthetic views, thus providing a strong supervisory signal. Our contribution is simple to implement, requires no additional trainable parameters, and produces competitive results when compared with other state-of-the-art methods on the popular KITTI corpus.
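Under the constant-velocity assumption, the relative camera pose from frame t-1 to t should match the pose from t to t+1; a minimal sketch of a penalty expressing that constraint follows (the specific loss form, using Frobenius and Euclidean norms, is an illustrative assumption rather than the paper's exact formulation):

```python
import numpy as np

def constant_velocity_penalty(T_prev, T_next):
    """Penalty encouraging the two relative camera poses inside a frame triplet
    to agree, i.e. constant velocity between t-1 -> t and t -> t+1.

    T_prev and T_next are 4x4 homogeneous transforms, e.g. predicted by a pose
    network; the loss form here is an illustrative assumption.
    """
    rot_err = np.linalg.norm(T_next[:3, :3] - T_prev[:3, :3])   # rotation difference
    trans_err = np.linalg.norm(T_next[:3, 3] - T_prev[:3, 3])   # translation difference
    return rot_err + trans_err

# Toy check: a camera translating 0.1 m forward per frame incurs zero penalty.
T = np.eye(4)
T[2, 3] = 0.1
print(constant_velocity_penalty(T, T))  # 0.0
```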