-
SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning
Authors:
Changan Chen,
Carl Schissler,
Sanchit Garg,
Philip Kobernik,
Alexander Clegg,
Paul Calamia,
Dhruv Batra,
Philip W Robinson,
Kristen Grauman
Abstract:
We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio rendering for 3D environments. Given a 3D mesh of a real-world environment, SoundSpaces can generate highly realistic acoustics for arbitrary sounds captured from arbitrary microphone locations. Together with existing 3D visual assets, it supports an array of audio-visual research tasks, such as audio-visual navigation, m…
▽ More
We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio rendering for 3D environments. Given a 3D mesh of a real-world environment, SoundSpaces can generate highly realistic acoustics for arbitrary sounds captured from arbitrary microphone locations. Together with existing 3D visual assets, it supports an array of audio-visual research tasks, such as audio-visual navigation, map**, source localization and separation, and acoustic matching. Compared to existing resources, SoundSpaces 2.0 has the advantages of allowing continuous spatial sampling, generalization to novel environments, and configurable microphone and material properties. To our knowledge, this is the first geometry-based acoustic simulation that offers high fidelity and realism while also being fast enough to use for embodied learning. We showcase the simulator's properties and benchmark its performance against real-world audio measurements. In addition, we demonstrate two downstream tasks -- embodied navigation and far-field automatic speech recognition -- and highlight sim2real performance for the latter. SoundSpaces 2.0 is publicly available to facilitate wider research for perceptual systems that can both see and hear.
△ Less
Submitted 23 January, 2023; v1 submitted 16 June, 2022;
originally announced June 2022.
-
Audio-Visual Floorplan Reconstruction
Authors:
Senthil Purushwalkam,
Sebastian Vicenc Amengual Gari,
Vamsi Krishna Ithapu,
Carl Schissler,
Philip Robinson,
Abhinav Gupta,
Kristen Grauman
Abstract:
Given only a few glimpses of an environment, how much can we infer about its entire floorplan? Existing methods can map only what is visible or immediately apparent from context, and thus require substantial movements through a space to fully map it. We explore how both audio and visual sensing together can provide rapid floorplan reconstruction from limited viewpoints. Audio not only helps sense…
▽ More
Given only a few glimpses of an environment, how much can we infer about its entire floorplan? Existing methods can map only what is visible or immediately apparent from context, and thus require substantial movements through a space to fully map it. We explore how both audio and visual sensing together can provide rapid floorplan reconstruction from limited viewpoints. Audio not only helps sense geometry outside the camera's field of view, but it also reveals the existence of distant freespace (e.g., a dog barking in another room) and suggests the presence of rooms not visible to the camera (e.g., a dishwasher humming in what must be the kitchen to the left). We introduce AV-Map, a novel multi-modal encoder-decoder framework that reasons jointly about audio and vision to reconstruct a floorplan from a short input video sequence. We train our model to predict both the interior structure of the environment and the associated rooms' semantic labels. Our results on 85 large real-world environments show the impact: with just a few glimpses spanning 26% of an area, we can estimate the whole area with 66% accuracy -- substantially better than the state of the art approach for extrapolating visual maps.
△ Less
Submitted 31 December, 2020;
originally announced December 2020.
-
VisualEchoes: Spatial Image Representation Learning through Echolocation
Authors:
Ruohan Gao,
Changan Chen,
Ziad Al-Halah,
Carl Schissler,
Kristen Grauman
Abstract:
Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation: a biological sonar used to perceive spatial layout and locate objects in the world. We explore the spatial cues contained in echoes and how they can benefit vision tasks that require spatial reasoning. First we capture echo responses in photo-realistic 3D…
▽ More
Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation: a biological sonar used to perceive spatial layout and locate objects in the world. We explore the spatial cues contained in echoes and how they can benefit vision tasks that require spatial reasoning. First we capture echo responses in photo-realistic 3D indoor scene environments. Then we propose a novel interaction-based representation learning framework that learns useful visual features via echolocation. We show that the learned image features are useful for multiple downstream vision tasks requiring spatial reasoning---monocular depth estimation, surface normal estimation, and visual navigation---with results comparable or even better than heavily supervised pre-training. Our work opens a new path for representation learning for embodied agents, where supervision comes from interacting with the physical world.
△ Less
Submitted 17 July, 2020; v1 submitted 4 May, 2020;
originally announced May 2020.
-
SoundSpaces: Audio-Visual Navigation in 3D Environments
Authors:
Changan Chen,
Unnat Jain,
Carl Schissler,
Sebastia Vicenc Amengual Gari,
Ziad Al-Halah,
Vamsi Krishna Ithapu,
Philip Robinson,
Kristen Grauman
Abstract:
Moving around in the world is naturally a multisensory experience, but today's embodied agents are deaf---restricted to solely their visual perception of the environment. We introduce audio-visual navigation for complex, acoustically and visually realistic 3D environments. By both seeing and hearing, the agent must learn to navigate to a sounding object. We propose a multi-modal deep reinforcement…
▽ More
Moving around in the world is naturally a multisensory experience, but today's embodied agents are deaf---restricted to solely their visual perception of the environment. We introduce audio-visual navigation for complex, acoustically and visually realistic 3D environments. By both seeing and hearing, the agent must learn to navigate to a sounding object. We propose a multi-modal deep reinforcement learning approach to train navigation policies end-to-end from a stream of egocentric audio-visual observations, allowing the agent to (1) discover elements of the geometry of the physical space indicated by the reverberating audio and (2) detect and follow sound-emitting targets. We further introduce SoundSpaces: a first-of-its-kind dataset of audio renderings based on geometrical acoustic simulations for two sets of publicly available 3D environments (Matterport3D and Replica), and we instrument Habitat to support the new sensor, making it possible to insert arbitrary sound sources in an array of real-world scanned environments. Our results show that audio greatly benefits embodied visual navigation in 3D spaces, and our work lays groundwork for new research in embodied AI with audio-visual perception.
△ Less
Submitted 21 August, 2020; v1 submitted 24 December, 2019;
originally announced December 2019.
-
Interactive Sound Rendering on Mobile Devices using Ray-Parameterized Reverberation Filters
Authors:
Carl Schissler,
Dinesh Manocha
Abstract:
We present a new sound rendering pipeline that is able to generate plausible sound propagation effects for interactive dynamic scenes. Our approach combines ray-tracing-based sound propagation with reverberation filters using robust automatic reverb parameter estimation that is driven by impulse responses computed at a low sampling rate.We propose a unified spherical harmonic representation of dir…
▽ More
We present a new sound rendering pipeline that is able to generate plausible sound propagation effects for interactive dynamic scenes. Our approach combines ray-tracing-based sound propagation with reverberation filters using robust automatic reverb parameter estimation that is driven by impulse responses computed at a low sampling rate.We propose a unified spherical harmonic representation of directional sound in both the propagation and auralization modules and use this formulation to perform a constant number of convolution operations for any number of sound sources while rendering spatial audio. In comparison to previous geometric acoustic methods, we achieve a speedup of over an order of magnitude while delivering similar audio to high-quality convolution rendering algorithms. As a result, our approach is the first capable of rendering plausible dynamic sound propagation effects on commodity smartphones.
△ Less
Submitted 1 March, 2018;
originally announced March 2018.