CVMP '22: European Conference on Visual Media Production

SESSION: Paper Session 1

Neural apparent BRDF fields for multiview photometric stereo

We propose to tackle the multiview photometric stereo problem using an extension of Neural Radiance Fields (NeRFs), conditioned on light source direction. The geometric part of our neural representation predicts surface normal direction, allowing us to reason about local surface reflectance. The appearance part of our neural representation is decomposed into a neural bidirectional reflectance distribution function (BRDF), learnt as part of the fitting process, and a shadow prediction network (conditioned on light source direction), allowing us to model the apparent BRDF. This balance of learnt components with inductive biases based on physical image formation models allows us to extrapolate far from the light source and viewer directions observed during training. We demonstrate our approach on a multiview photometric stereo benchmark and show that competitive performance can be obtained with the neural density representation of a NeRF.
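
As a rough illustration of the representation described in this abstract, the PyTorch sketch below shows a field whose geometry branch predicts density and a surface normal, and whose appearance branch combines a learnt neural BRDF with a light-direction-conditioned shadow network. Layer sizes, input parameterisation, and the Lambertian-style shading term are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of a NeRF-style field conditioned on
# light source direction: a geometry branch predicts density and a surface
# normal, and an appearance branch combines a learnt neural BRDF with a shadow
# prediction network to form the "apparent" BRDF.
import torch
import torch.nn as nn

class ApparentBRDFField(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        # Geometry: 3D position -> density + surface normal (assumed widths).
        self.geometry = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + 3),            # (sigma, normal)
        )
        # Neural BRDF: position, view dir, light dir, normal -> RGB reflectance.
        self.brdf = nn.Sequential(
            nn.Linear(3 + 3 + 3 + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )
        # Shadow prediction, conditioned on position and light direction.
        self.shadow = nn.Sequential(
            nn.Linear(3 + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, x, view_dir, light_dir):
        g = self.geometry(x)
        sigma = torch.relu(g[..., :1])
        normal = nn.functional.normalize(g[..., 1:], dim=-1)
        rho = self.brdf(torch.cat([x, view_dir, light_dir, normal], dim=-1))
        s = self.shadow(torch.cat([x, light_dir], dim=-1))
        # Apparent BRDF: simple foreshortened shading modulated by the shadow term.
        cos = torch.clamp((normal * light_dir).sum(-1, keepdim=True), min=0.0)
        rgb = s * rho * cos
        return sigma, rgb
```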

A wide-baseline multiview system for indoor scene capture

We present a complete multiview acquisition system: a camera array enabling depth reconstruction based on disparity. We built a new wide-baseline camera grid supported by an interactive camera controller purpose-built for capturing large indoor scenes. It is composed of 16 cameras arranged in a 4 × 4 grid, synchronized, and characterized. The design of the camera system manages storage and real-time viewing of the capture. We also propose a DNN-based approach to estimate floating-point disparity values, which is adapted to wide-baseline configurations while providing high precision, even for sharp and concave objects. The ultimate result is a dense 3D point cloud which offers versatile viewing possibilities.

Light Field GAN-based View Synthesis using full 4D information

Light Field (LF) technology offers a truly immersive experience and has the potential to revolutionize entertainment, training, education, virtual and augmented reality, gaming, autonomous driving, and digital health. However, one of the main issues when working with LF is the amount of data needed to create a mesmerizing experience with realistic disparity and smooth motion parallax between views. In this paper, we introduce a learning-based LF angular super-resolution approach for efficient view synthesis of novel virtual images. This is achieved by taking four corner views and then generating up to five in-between views. Our generative adversarial network approach uses LF spatial and angular information to ensure smooth disparity between the generated and original views. We consider plenoptic and synthetic LF content as well as camera-array implementations, which support different baseline settings. Experimental results show that our proposed method outperforms state-of-the-art light field view synthesis techniques, offering novel generated views with high visual quality.
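
A deliberately minimal sketch of the corner-view input arrangement such an approach implies: the four corner views are stacked along the channel axis and a convolutional generator produces one in-between view. The layer structure below is an assumption for illustration only, not the paper's GAN architecture.

```python
# Illustrative only: a generator taking the four LF corner views, stacked as
# channels, and synthesizing one in-between view. Widths/depths are assumed.
import torch
import torch.nn as nn

class CornerViewGenerator(nn.Module):
    def __init__(self, feat=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4 * 3, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, corners):
        # corners: (B, 4, 3, H, W) -> stack the four views along the channel axis.
        b, v, c, h, w = corners.shape
        return self.net(corners.reshape(b, v * c, h, w))
```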

SESSION: Paper Session 2

Semantic Segmentation for Multi-Contour Estimation in Maritime Scenes

In the maritime environment, navigation and localisation are primarily driven by systems such as GPS. However, in a scenario where GPS is not available, e.g., it is jammed or the satellite connection is lost, navigators can use visual methods derived from surrounding land masses and other permanent features in the perceptual range. To enable autonomous navigation, specifically localisation, a vessel must determine its position by extracting and matching the contours of its surrounding environment against an elevation model. The contours of interest are the true horizon line, the visible horizon line, and the shoreline. Extracting these contours is commonly approached with computational methods such as edge detection or pixel clustering techniques, which are not robust and build on weak priors. To this end, we propose the first learning-based framework that explores the fusion of inertial data into an encoder-decoder model to extract these contours. In addition, extensive data augmentation methods are used to extend the MaSTr1325 dataset, introducing further robustness to the common environmental challenges faced by the sensors of unmanned surface vessels. For evaluation, we form a small curated dataset containing 300 images, composed of six component segmentation masks and three further masks describing the true horizon and visible horizon contours and the shoreline. We experimented extensively with popular segmentation models such as UNet, SegNet, DeepLabV3+ and TransUNet with various backbones for a quantitative comparison. The results show that, within a small margin of ten pixels in a high-resolution image, our system detects the three key contours used in navigation, namely the shoreline and the true and visible horizon contours, with accuracies of 63.79%, 68.94% and 89.75%, respectively.
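
One way inertial data might be fused into an encoder-decoder segmentation model is to embed the IMU reading and concatenate it with the bottleneck features, as in the hedged PyTorch sketch below; the paper's exact fusion strategy, backbone, and layer sizes may differ.

```python
# A hedged sketch of IMU fusion at the bottleneck of an encoder-decoder
# segmentation network (an assumption for illustration, not the paper's model).
import torch
import torch.nn as nn

class InertialFusionSegNet(nn.Module):
    def __init__(self, n_classes=6, imu_dim=6, feat=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, feat, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.imu_embed = nn.Sequential(nn.Linear(imu_dim, feat), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(2 * feat, feat, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(feat, n_classes, 4, stride=2, padding=1),
        )

    def forward(self, image, imu):
        f = self.encoder(image)                        # (B, feat, H/4, W/4)
        g = self.imu_embed(imu)[:, :, None, None]      # (B, feat, 1, 1)
        g = g.expand(-1, -1, f.shape[2], f.shape[3])   # broadcast over space
        return self.decoder(torch.cat([f, g], dim=1))  # per-pixel class logits
```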

Tragic Talkers: A Shakespearean Sound- and Light-Field Dataset for Audio-Visual Machine Learning Research

3D audio-visual production aims to deliver immersive and interactive experiences to the consumer. Yet, faithfully reproducing real-world 3D scenes remains a challenging task. This is partly due to the lack of available datasets enabling audio-visual research in this direction. In most of the existing multi-view datasets, the accompanying audio is neglected. Similarly, datasets for spatial audio research primarily offer unimodal content, and when visual data is included, its quality falls far short of standard production needs. We present “Tragic Talkers”, an audio-visual dataset consisting of excerpts from the “Romeo and Juliet” drama captured with microphone arrays and multiple co-located cameras for light-field video. Tragic Talkers provides ideal content for object-based media (OBM) production. It is designed to cover various conventional talking scenarios, such as monologues, two-person conversations, and interactions with considerable movement and occlusion, yielding 30 sequences captured from a total of 22 different points of view and two 16-element microphone arrays. Additionally, we provide voice activity labels, 2D face bounding boxes for each camera view, 2D pose detection keypoints, 3D tracking data of the actors’ mouths, and dialogue transcriptions. We believe the community will benefit from this dataset as it can assist multidisciplinary research. Possible uses of the dataset are discussed.

SESSION: Paper Session 3

The Colour of Horror

In this paper, we present a simple method to produce a colour palette for film trailers. Our method uses k-means clustering with a saturation-based weighting to extract the dominant colours from the frames of the trailer. We use our method to generate the palettes of 29 thousand film trailers from 1960 to 2019. We aggregate these palettes by era, genre, and director by re-applying our clustering method, and we note various trends in the use of colour over time and between genres. We also show that our generated palettes reflect changes in mood and theme across films in a series, and we demonstrate the palettes of notable directors.
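
A minimal sketch of saturation-weighted palette extraction of the kind described above, using scikit-learn's k-means; the weighting scheme, frame sampling, and number of clusters here are assumptions for illustration, not the authors' exact settings.

```python
# Illustrative palette extraction: k-means over trailer pixels, with each pixel
# weighted by its saturation so that vivid colours dominate over near-greys.
import numpy as np
from sklearn.cluster import KMeans

def trailer_palette(frames, n_colours=5):
    """frames: iterable of HxWx3 uint8 RGB arrays sampled from the trailer."""
    pixels = np.concatenate([f.reshape(-1, 3) for f in frames]).astype(np.float64) / 255.0
    # HSV-style saturation per pixel: (max - min) / max, with 0 for black pixels.
    mx, mn = pixels.max(axis=1), pixels.min(axis=1)
    saturation = np.where(mx > 0, (mx - mn) / np.maximum(mx, 1e-8), 0.0)
    km = KMeans(n_clusters=n_colours, n_init=10, random_state=0)
    km.fit(pixels, sample_weight=saturation + 1e-3)   # small floor keeps greys in play
    return (km.cluster_centers_ * 255).astype(np.uint8)  # the dominant colours
```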

Model-Based Deep Portrait Relighting

Like most computer vision problems, the relighting of portrait face images is increasingly being formulated entirely as a deep learning problem. However, data-driven approaches need a detailed and exhaustive database to work on, and the creation of ground truth data is tedious and oftentimes technically complex. At the same time, networks get bigger and deeper, while knowledge about the problem statement, scene structure, and physical laws is often neglected. In this paper, we propose to incorporate prior knowledge about relighting directly into the network learning process, adding model-based building blocks to the training. Thereby, we improve the learning speed and effectiveness of the network, which thus performs better even with a restricted dataset. We demonstrate through an ablation study that the proposed model-based building blocks improve the network’s training and enhance the generated images compared with the naive approach.

SESSION: Paper Session 4

Assessing Advances in Real Noise Image Denoisers

Recently, image denoising networks have made a number of advances to go beyond additive white Gaussian noise and deal with real noise, such as that produced by digital cameras. We note that some of the performance gains reported in the state of the art could potentially be explained by an increase in network size. In this paper, we propose to revisit some of these advances, including the synthetic noise generator and noise maps proposed in CBDNet, and re-assess them using a simple DnCNN baseline network, thus attempting to measure how much of the gains can be attributed to more modern architectures. In this work, we observe an increase of over +2 dB in denoising performance over our baseline network on the DND real-world benchmark. Through this observation, we demonstrate that a smaller network can offer competitive denoising results when correctly optimised for real-world denoising.
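
For reference, a DnCNN-style residual denoiser extended with a per-pixel noise map as an extra input channel (in the spirit of the CBDNet noise maps mentioned above) can be sketched as follows; the depth and width are assumptions, not the configuration evaluated in the paper.

```python
# Illustrative DnCNN-style baseline with a noise map concatenated to the input;
# the network predicts the noise residual, which is subtracted from the input.
import torch
import torch.nn as nn

class DnCNNWithNoiseMap(nn.Module):
    def __init__(self, depth=17, feat=64):
        super().__init__()
        layers = [nn.Conv2d(3 + 1, feat, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(feat, feat, 3, padding=1),
                       nn.BatchNorm2d(feat),
                       nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(feat, 3, 3, padding=1))
        self.body = nn.Sequential(*layers)

    def forward(self, noisy, noise_map):
        residual = self.body(torch.cat([noisy, noise_map], dim=1))
        return noisy - residual  # residual learning: remove the predicted noise
```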

U-Attention to Textures: Hierarchical Hourglass Vision Transformer for Universal Texture Synthesis

We present a novel U-Attention vision Transformer for universal texture synthesis. We exploit the natural long-range dependencies enabled by the attention mechanism to allow our approach to synthesize diverse textures while preserving their structures in a single inference. We propose a hierarchical hourglass backbone that attends to the global structure and performs patch mapping at varying scales in a coarse-to-fine-to-coarse stream. Completed by skip-connection and convolution designs that propagate and fuse information at different scales, our hierarchical U-Attention architecture unifies attention to features from macro structures to micro details, and progressively refines synthesis results at successive stages. Our method achieves stronger 2× synthesis than previous work on both stochastic and structured textures while generalizing to unseen textures without fine-tuning. Ablation studies demonstrate the effectiveness of each component of our architecture.

Distilling Style from Image Pairs for Global Forward and Inverse Tone Mapping

Many image enhancement or editing operations, such as forward and inverse tone mapping or color grading, do not have a unique solution but rather a range of solutions, each representing a different style. Despite this, existing learning-based methods attempt to learn a unique mapping, disregarding this style. In this work, we show that information about the style can be distilled from collections of image pairs and encoded into a 2- or 3-dimensional vector. This gives us not only an efficient representation but also an interpretable latent space for editing the image style. We represent the global color mapping between a pair of images as a custom normalizing flow, conditioned on a polynomial basis of the pixel color. We show that such a network is more effective than PCA or VAE at encoding image style in a low-dimensional space and lets us obtain an accuracy close to 40 dB, which is about a 7-10 dB improvement over state-of-the-art methods.
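
As a much simplified stand-in for the conditional normalizing flow described above, the sketch below expresses a global colour mapping as a polynomial in the input RGB whose coefficients are predicted from a low-dimensional style vector; it illustrates the conditioning idea only and is not the paper's formulation.

```python
# Simplified illustration: a global colour mapping as a degree-2 polynomial of
# the pixel colour, with coefficients driven by a 2-3D style code. This is a
# plain regression stand-in, not the paper's conditional normalizing flow.
import torch
import torch.nn as nn

def poly_basis(rgb):
    """Degree-2 polynomial basis of the pixel colour: (N, 3) -> (N, 10)."""
    r, g, b = rgb[:, 0:1], rgb[:, 1:2], rgb[:, 2:3]
    ones = torch.ones_like(r)
    return torch.cat([ones, r, g, b, r * r, g * g, b * b, r * g, r * b, g * b], dim=1)

class StyleColourMap(nn.Module):
    def __init__(self, style_dim=3, basis_dim=10):
        super().__init__()
        # Map one low-dimensional style vector to a coefficient matrix.
        self.to_coeffs = nn.Linear(style_dim, basis_dim * 3)

    def forward(self, rgb, style):
        # rgb: (N, 3) pixel colours in [0, 1]; style: (style_dim,) for one image.
        coeffs = self.to_coeffs(style).view(-1, 3)   # (basis_dim, 3)
        return poly_basis(rgb) @ coeffs              # mapped colours, (N, 3)
```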