AV-Cloud is an audio rendering framework that stays synchronized with the visual perspective. Given a collection of videos, it constructs Audio-Visual Anchors to represent the scene and transforms a monaural reference sound into spatial audio.

Abstract

We aim to enable real-time audio-visual navigation within 3D scenes by rendering high-quality spatial audio in sync with the visuals, learned from a set of training videos. Traditional audio-visual rendering methods often rely on visual cues, which can introduce quality inconsistencies caused by visual artifacts as well as audio delays, both of which disrupt the seamless integration of audio and visual stimuli in real-time perception. To overcome these challenges, we introduce a novel framework, Audio-Visual Splatting, which efficiently renders spatial audio aligned with any visual viewpoint, eliminating the need for pre-rendered images or defined sound source locations. Our approach first learns the audio-visual scene representation based on a sparse point set derived from camera calibration. We then propose an Audio-Visual Splatting module that efficiently decodes audio features into a spatial audio transfer function for an arbitrary listener viewpoint. This function, applied through the Spatial Audio Render Head, transforms monaural input into viewpoint-specific spatial audio. Our pipeline achieves state-of-the-art accuracy in spatial audio reconstruction, perceptual quality, and acoustic effects on two challenging real-world datasets.

Audio-Visual Cloud Splatting (AVCS)

AVCS consists of two components: Anchor Projection (left) and Visual-to-Audio Splatting Transformer (right). Audio-Visual Anchors are projected into the coordinate system of the listener head, and the transformer decodes features for each audio frequency band, outputting two acoustic masks to convert the monaural reference sound into stereo audio at the target viewpoint.
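
The two ends of this pipeline can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `project_anchors` and `apply_masks` are hypothetical, the listener pose is assumed to be given as a position plus a head-to-world rotation matrix, and the two masks are assumed to be one time-frequency mask per output channel, which may differ from how AVCS actually defines them. The transformer that predicts the masks is omitted entirely.

```python
import numpy as np

def project_anchors(anchors_world, listener_pos, listener_rot):
    """Express world-space anchor positions in the listener's head frame.

    anchors_world: (N, 3) anchor positions in world coordinates.
    listener_pos:  (3,) listener position in world coordinates.
    listener_rot:  (3, 3) rotation whose columns are the head axes written
                   in world coordinates (head-to-world). Assumed convention.
    """
    # Row-vector form of R^T (p - t): translate to the listener, then rotate.
    return (anchors_world - listener_pos) @ listener_rot

def apply_masks(mono_spec, mask_left, mask_right):
    """Turn a mono spectrogram into a two-channel one via per-channel masks.

    mono_spec:  (F, T) spectrogram of the monaural reference sound.
    mask_left, mask_right: (F, T) masks, assumed here to be one per channel.
    Returns an array of shape (2, F, T).
    """
    return np.stack([mono_spec * mask_left, mono_spec * mask_right])

if __name__ == "__main__":
    # Listener at the origin, turned so the head x-axis points along world +y.
    rot = np.array([[0.0, -1.0, 0.0],
                    [1.0,  0.0, 0.0],
                    [0.0,  0.0, 1.0]])
    anchors = np.array([[0.0, 2.0, 0.0]])
    print(project_anchors(anchors, np.zeros(3), rot))

    spec = np.ones((4, 3), dtype=complex)
    stereo = apply_masks(spec, np.full((4, 3), 0.5), np.full((4, 3), 2.0))
    print(stereo.shape)
```

In this sketch the projection makes the anchor coordinates depend only on the pose of the anchors relative to the listener, so the same decoder can in principle be reused for any viewpoint.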