
Showing 1–22 of 22 results for author: Grauman, K

Searching in archive eess.
  1. arXiv:2406.09272  [pdf, other]

    cs.CV cs.AI cs.SD eess.AS

    Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

    Authors: Changan Chen, Puyuan Peng, Ami Baid, Zihui Xue, Wei-Ning Hsu, David Harwath, Kristen Grauman

    Abstract: Generating realistic audio for human interactions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals -- resulting in uncontrolled ambient sounds or hallucinat…

    Submitted 20 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: Project page: https://vision.cs.utexas.edu/projects/action2sound

  2. arXiv:2405.02821  [pdf, other]

    cs.SD cs.AI cs.LG cs.RO eess.AS

    Sim2Real Transfer for Audio-Visual Navigation with Frequency-Adaptive Acoustic Field Prediction

    Authors: Changan Chen, Jordi Ramos, Anshul Tomar, Kristen Grauman

    Abstract: Sim2real transfer has received increasing attention lately due to the success of learning robotic tasks in simulation end-to-end. While there has been a lot of progress in transferring vision-based navigation policies, the existing sim2real strategy for audio-visual navigation performs data augmentation empirically without measuring the acoustic gap. The sound differs from light in that it spans a…

    Submitted 5 May, 2024; originally announced May 2024.

  3. arXiv:2404.16216  [pdf, other]

    cs.CV cs.RO cs.SD eess.AS

    ActiveRIR: Active Audio-Visual Exploration for Acoustic Environment Modeling

    Authors: Arjun Somayazulu, Sagnik Majumder, Changan Chen, Kristen Grauman

    Abstract: An environment acoustic model represents how sound is transformed by the physical characteristics of an indoor environment, for any given source/receiver location. Traditional methods for constructing acoustic models involve expensive and time-consuming collection of large quantities of acoustic data at dense spatial locations in the space, or rely on privileged knowledge of scene geometry to inte…

    Submitted 24 April, 2024; originally announced April 2024.

    Comments: Project page: https://vision.cs.utexas.edu/projects/active_rir/

  4. arXiv:2404.05206  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

    Authors: Changan Chen, Kumar Ashutosh, Rohit Girdhar, David Harwath, Kristen Grauman

    Abstract: We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos. Whereas existing methods rely on curated data with known audio-visual correspondence, our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree, while diminishing those associations when…

    Submitted 8 April, 2024; originally announced April 2024.

    Comments: Accepted at CVPR 2024. Project page: https://vision.cs.utexas.edu/projects/soundingactions

  5. arXiv:2307.15064  [pdf, other]

    cs.MM cs.CV cs.SD eess.AS

    Self-Supervised Visual Acoustic Matching

    Authors: Arjun Somayazulu, Changan Chen, Kristen Grauman

    Abstract: Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment. Existing methods assume access to paired training data, where the audio is observed in both source and target environments, but this limits the diversity of training data or requires the use of simulated data or heuristics to create paired samples. We propose a self-supervised ap…

    Submitted 23 November, 2023; v1 submitted 27 July, 2023; originally announced July 2023.

    Comments: Project page: https://vision.cs.utexas.edu/projects/ss_vam/ . Accepted at NeurIPS 2023

  6. arXiv:2307.04760  [pdf, other]

    cs.CV cs.SD eess.AS

    Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos

    Authors: Sagnik Majumder, Ziad Al-Halah, Kristen Grauman

    Abstract: We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos. Our method uses a masked auto-encoding framework to synthesize masked binaural (multi-channel) audio through the synergy of audio and vision, thereby learning useful spatial relationships between the two modalities. We use our pretrained features to tackle two downst…

    Submitted 5 May, 2024; v1 submitted 10 July, 2023; originally announced July 2023.

    Comments: Accepted to CVPR 2024

  7. arXiv:2301.08730  [pdf, other]

    cs.CV cs.SD eess.AS

    Novel-View Acoustic Synthesis

    Authors: Changan Chen, Alexander Richard, Roman Shapovalov, Vamsi Krishna Ithapu, Natalia Neverova, Kristen Grauman, Andrea Vedaldi

    Abstract: We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint? We propose a neural rendering approach: Visually-Guided Acoustic Synthesis (ViGAS) network that learns to synthesize the sound of an arbitrary point in space by analyzing the input audio-visual cues. To benc…

    Submitted 24 October, 2023; v1 submitted 20 January, 2023; originally announced January 2023.

    Comments: Accepted at CVPR 2023. Project page: https://vision.cs.utexas.edu/projects/nvas

  8. arXiv:2301.02184  [pdf, other]

    cs.CV cs.LG cs.SD eess.AS

    Chat2Map: Efficient Scene Mapping from Multi-Ego Conversations

    Authors: Sagnik Majumder, Hao Jiang, Pierre Moulon, Ethan Henderson, Paul Calamia, Kristen Grauman, Vamsi Krishna Ithapu

    Abstract: Can conversational videos captured from multiple egocentric viewpoints reveal the map of a scene in a cost-efficient way? We seek to answer this question by proposing a new problem: efficiently building the map of a previously unseen 3D environment by exploiting shared information in the egocentric audio-visual observations of participants in a natural conversation. Our hypothesis is that as multi…

    Submitted 20 April, 2023; v1 submitted 4 January, 2023; originally announced January 2023.

    Comments: Accepted to CVPR 2023

  9. arXiv:2206.08312  [pdf, other]

    cs.SD cs.CV cs.MM eess.AS

    SoundSpaces 2.0: A Simulation Platform for Visual-Acoustic Learning

    Authors: Changan Chen, Carl Schissler, Sanchit Garg, Philip Kobernik, Alexander Clegg, Paul Calamia, Dhruv Batra, Philip W Robinson, Kristen Grauman

    Abstract: We introduce SoundSpaces 2.0, a platform for on-the-fly geometry-based audio rendering for 3D environments. Given a 3D mesh of a real-world environment, SoundSpaces can generate highly realistic acoustics for arbitrary sounds captured from arbitrary microphone locations. Together with existing 3D visual assets, it supports an array of audio-visual research tasks, such as audio-visual navigation, m…

    Submitted 23 January, 2023; v1 submitted 16 June, 2022; originally announced June 2022.

    Comments: Camera-ready version. Website: https://soundspaces.org. Project page: https://vision.cs.utexas.edu/projects/soundspaces2

  10. arXiv:2206.04006  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS

    Few-Shot Audio-Visual Learning of Environment Acoustics

    Authors: Sagnik Majumder, Changan Chen, Ziad Al-Halah, Kristen Grauman

    Abstract: Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener, with implications for various applications in AR, VR, and robotics. Whereas traditional methods to estimate RIRs assume dense geometry and/or sound measurements throughout the environment, we explore how to infer RIRs based on a sparse set of images and echoes observed…

    Submitted 24 November, 2022; v1 submitted 8 June, 2022; originally announced June 2022.

    Comments: Accepted to NeurIPS 2022

  11. arXiv:2202.06875  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Visual Acoustic Matching

    Authors: Changan Chen, Ruohan Gao, Paul Calamia, Kristen Grauman

    Abstract: We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment. Given an image of the target environment and a waveform for the source audio, the goal is to re-synthesize the audio to match the target room acoustics as suggested by its visible geometry and materials. To address this novel task, we propose a cross-modal tr…

    Submitted 13 June, 2022; v1 submitted 14 February, 2022; originally announced February 2022.

    Comments: Project page: https://vision.cs.utexas.edu/projects/visual-acoustic-matching. Accepted at CVPR 2022

  12. arXiv:2202.00850  [pdf, other]

    cs.CV cs.LG cs.SD eess.AS eess.IV

    Active Audio-Visual Separation of Dynamic Sound Sources

    Authors: Sagnik Majumder, Kristen Grauman

    Abstract: We explore active audio-visual separation for dynamic sound sources, where an embodied agent moves intelligently in a 3D environment to continuously isolate the time-varying audio stream being emitted by an object of interest. The agent hears a mixed stream of multiple audio sources (e.g., multiple people conversing and a band playing music at a noisy party). Given a limited time budget, it needs…

    Submitted 25 July, 2022; v1 submitted 1 February, 2022; originally announced February 2022.

    Comments: Accepted to ECCV 2022

  13. arXiv:2111.10882  [pdf, other]

    cs.CV cs.SD eess.AS

    Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video

    Authors: Rishabh Garg, Ruohan Gao, Kristen Grauman

    Abstract: Binaural audio provides human listeners with an immersive spatial sound experience, but most existing videos lack binaural audio recordings. We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio. Whereas existing approaches leverage visual features extracted directly from video frames, our approach ex…

    Submitted 21 November, 2021; originally announced November 2021.

    Comments: Published in BMVC 2021, project page: http://vision.cs.utexas.edu/projects/geometry-aware-binaural/

  14. arXiv:2106.07732  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS

    Learning Audio-Visual Dereverberation

    Authors: Changan Chen, Wei Sun, David Harwath, Kristen Grauman

    Abstract: Reverberation not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition. Prior work attempts to remove reverberation based on the audio modality only. Our idea is to learn to dereverberate speech from audio-visual observations. The visual environment surrounding a human speaker reveals important cues about the room geometry…

    Submitted 13 March, 2023; v1 submitted 14 June, 2021; originally announced June 2021.

    Comments: Accepted at ICASSP 2023. This is the longer version of the five-page camera-ready paper. Project page: https://vision.cs.utexas.edu/projects/learning-audio-visual-dereverberation

  15. arXiv:2105.07142  [pdf, other]

    cs.CV cs.LG cs.RO cs.SD eess.AS

    Move2Hear: Active Audio-Visual Source Separation

    Authors: Sagnik Majumder, Ziad Al-Halah, Kristen Grauman

    Abstract: We introduce the active audio-visual source separation problem, where an agent must move intelligently in order to better isolate the sounds coming from an object of interest in its environment. The agent hears multiple audio sources simultaneously (e.g., a person speaking down the hall in a noisy household) and it must use its eyes and ears to automatically separate out the sounds originating fro…

    Submitted 25 August, 2021; v1 submitted 15 May, 2021; originally announced May 2021.

    Comments: Accepted to ICCV 2021

  16. arXiv:2101.03149  [pdf, other]

    cs.CV cs.SD eess.IV

    VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency

    Authors: Ruohan Gao, Kristen Grauman

    Abstract: We introduce a new approach for audio-visual speech separation. Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers. Whereas existing methods focus on learning the alignment between the speaker's lip movements and the sounds they generate, we propose to leverage the speaker's face appearance as an additional…

    Submitted 6 April, 2021; v1 submitted 8 January, 2021; originally announced January 2021.

    Comments: In CVPR 2021. Project page: http://vision.cs.utexas.edu/projects/VisualVoice/

  17. arXiv:2012.11583  [pdf, other]

    cs.CV cs.LG cs.RO cs.SD eess.AS

    Semantic Audio-Visual Navigation

    Authors: Changan Chen, Ziad Al-Halah, Kristen Grauman

    Abstract: Recent work on audio-visual navigation assumes a constantly-sounding target and restricts the role of audio to signaling the target's position. We introduce semantic audio-visual navigation, where objects in the environment make sounds consistent with their semantic meaning (e.g., toilet flushing, door creaking) and acoustic events are sporadic or short in duration. We propose a transformer-based…

    Submitted 6 April, 2021; v1 submitted 21 December, 2020; originally announced December 2020.

    Comments: Project page: http://vision.cs.utexas.edu/projects/semantic-audio-visual-navigation

  18. arXiv:2005.01616  [pdf, other]

    cs.CV cs.SD eess.AS

    VisualEchoes: Spatial Image Representation Learning through Echolocation

    Authors: Ruohan Gao, Changan Chen, Ziad Al-Halah, Carl Schissler, Kristen Grauman

    Abstract: Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation: a biological sonar used to perceive spatial layout and locate objects in the world. We explore the spatial cues contained in echoes and how they can benefit vision tasks that require spatial reasoning. First we capture echo responses in photo-realistic 3D…

    Submitted 17 July, 2020; v1 submitted 4 May, 2020; originally announced May 2020.

    Comments: Appears in ECCV 2020

  19. arXiv:1912.11474  [pdf, other]

    cs.CV cs.HC cs.SD eess.AS

    SoundSpaces: Audio-Visual Navigation in 3D Environments

    Authors: Changan Chen, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, Kristen Grauman

    Abstract: Moving around in the world is naturally a multisensory experience, but today's embodied agents are deaf---restricted to solely their visual perception of the environment. We introduce audio-visual navigation for complex, acoustically and visually realistic 3D environments. By both seeing and hearing, the agent must learn to navigate to a sounding object. We propose a multi-modal deep reinforcement…

    Submitted 21 August, 2020; v1 submitted 24 December, 2019; originally announced December 2019.

    Comments: Accepted to ECCV 2020 (Spotlight). Project page: http://vision.cs.utexas.edu/projects/audio_visual_navigation/

  20. arXiv:1912.04487  [pdf, other]

    cs.CV cs.LG cs.SD eess.AS

    Listen to Look: Action Recognition by Previewing Audio

    Authors: Ruohan Gao, Tae-Hyun Oh, Kristen Grauman, Lorenzo Torresani

    Abstract: In the face of the video data deluge, today's expensive clip-level classifiers are increasingly impractical. We propose a framework for efficient action recognition in untrimmed video that uses audio as a preview mechanism to eliminate both short-term and long-term visual redundancies. First, we devise an ImgAud2Vid framework that hallucinates clip-level features by distilling from lighter modalit…

    Submitted 28 March, 2020; v1 submitted 9 December, 2019; originally announced December 2019.

    Comments: Appears in CVPR 2020; Project page: http://vision.cs.utexas.edu/projects/listen_to_look/

  21. arXiv:1904.07750  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Co-Separating Sounds of Visual Objects

    Authors: Ruohan Gao, Kristen Grauman

    Abstract: Learning how objects sound from video is challenging, since they often heavily overlap in a single audio channel. Current methods for visually-guided audio source separation sidestep the issue by training with artificially mixed video clips, but this puts unwieldy restrictions on training data collection and may even prevent learning the properties of "true" mixed sounds. We introduce a co-separat…

    Submitted 20 August, 2019; v1 submitted 16 April, 2019; originally announced April 2019.

    Comments: ICCV 2019, Project page: http://vision.cs.utexas.edu/projects/coseparation/

  22. arXiv:1804.01665  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Learning to Separate Object Sounds by Watching Unlabeled Video

    Authors: Ruohan Gao, Rogerio Feris, Kristen Grauman

    Abstract: Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to learn audio-visual object models from unlabeled video, then exploit the visual context to perform audio source separation in novel videos. Our approach relies o…

    Submitted 26 July, 2018; v1 submitted 5 April, 2018; originally announced April 2018.

    Comments: Published in ECCV 2018; Project Page: http://vision.cs.utexas.edu/projects/separating_object_sounds/