We aim to enable real-time audio-visual navigation in 3D scenes by rendering high-quality spatial audio in sync with visuals, learned from a set of training videos. Traditional audio-visual rendering methods often rely on visual cues, which introduce challenges such as quality degradation from visual artifacts and audio latency, hindering the seamless integration of audio and visual stimuli in real-time perception. To overcome these challenges, we introduce a novel framework, Audio-Visual Splatting, which efficiently renders spatial audio aligned with an arbitrary visual viewpoint, eliminating the need for pre-rendered images or pre-defined sound source locations. Our approach first learns an audio-visual scene representation from a sparse point set derived from camera calibration. We then propose an Audio-Visual Splatting module that efficiently decodes audio features into a spatial audio transfer function for an arbitrary listener viewpoint. This function, applied by the Spatial Audio Render Head, transforms monaural input into viewpoint-specific spatial audio. Our pipeline achieves state-of-the-art performance in spatial audio reconstruction, perceptual quality, and acoustic effects on two challenging real-world datasets.
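To make the described pipeline concrete, the sketch below illustrates one plausible form of a render head that applies a viewpoint-conditioned transfer function, predicted as a complex per-frequency mask, to a monaural STFT to produce two-channel spatial audio. The class name, the mask-based formulation, and all dimensions are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class SpatialAudioRenderHead(nn.Module):
    """Hypothetical render head: maps a decoded viewpoint audio feature to a
    complex per-frequency transfer function (mask) for left/right channels and
    applies it to a monaural STFT to obtain a two-channel spectrogram."""

    def __init__(self, feat_dim: int, n_freq: int):
        super().__init__()
        # Predict real and imaginary mask parts for 2 output channels.
        self.to_mask = nn.Linear(feat_dim, 2 * 2 * n_freq)
        self.n_freq = n_freq

    def forward(self, mono_spec: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        # mono_spec: (B, n_freq, T) complex STFT of the monaural input
        # audio_feat: (B, feat_dim) viewpoint-specific decoded audio feature
        B = mono_spec.shape[0]
        mask = self.to_mask(audio_feat).view(B, 2, 2, self.n_freq, 1)
        mask = torch.complex(mask[:, :, 0], mask[:, :, 1])   # (B, 2, n_freq, 1)
        return mask * mono_spec.unsqueeze(1)                  # (B, 2, n_freq, T)


if __name__ == "__main__":
    B, F, T, D = 4, 257, 128, 256
    mono = torch.randn(B, F, T, dtype=torch.complex64)
    feat = torch.randn(B, D)
    head = SpatialAudioRenderHead(feat_dim=D, n_freq=F)
    print(head(mono, feat).shape)  # torch.Size([4, 2, 257, 128])
```

In this reading, all viewpoint dependence is carried by the predicted transfer function, so the mono source audio and the visual rendering path remain decoupled at inference time.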