Showing 1–6 of 6 results for author: Al-Halah, Z

Searching in archive eess.
  1. arXiv:2307.04760

    cs.CV cs.SD eess.AS

    Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos

    Authors: Sagnik Majumder, Ziad Al-Halah, Kristen Grauman

    Abstract: We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos. Our method uses a masked auto-encoding framework to synthesize masked binaural (multi-channel) audio through the synergy of audio and vision, thereby learning useful spatial relationships between the two modalities. We use our pretrained features to tackle two downst…

    Submitted 5 May, 2024; v1 submitted 10 July, 2023; originally announced July 2023.

    Comments: Accepted to CVPR 2024

  2. arXiv:2206.04006

    cs.SD cs.CV cs.LG eess.AS

    Few-Shot Audio-Visual Learning of Environment Acoustics

    Authors: Sagnik Majumder, Changan Chen, Ziad Al-Halah, Kristen Grauman

    Abstract: Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener, with implications for various applications in AR, VR, and robotics. Whereas traditional methods to estimate RIRs assume dense geometry and/or sound measurements throughout the environment, we explore how to infer RIRs based on a sparse set of images and echoes observed…

    Submitted 24 November, 2022; v1 submitted 8 June, 2022; originally announced June 2022.

    Comments: Accepted to NeurIPS 2022

  3. arXiv:2105.07142

    cs.CV cs.LG cs.RO cs.SD eess.AS

    Move2Hear: Active Audio-Visual Source Separation

    Authors: Sagnik Majumder, Ziad Al-Halah, Kristen Grauman

    Abstract: We introduce the active audio-visual source separation problem, where an agent must move intelligently in order to better isolate the sounds coming from an object of interest in its environment. The agent hears multiple audio sources simultaneously (e.g., a person speaking down the hall in a noisy household) and it must use its eyes and ears to automatically separate out the sounds originating fro…

    Submitted 25 August, 2021; v1 submitted 15 May, 2021; originally announced May 2021.

    Comments: Accepted to ICCV 2021

  4. arXiv:2012.11583

    cs.CV cs.LG cs.RO cs.SD eess.AS

    Semantic Audio-Visual Navigation

    Authors: Changan Chen, Ziad Al-Halah, Kristen Grauman

    Abstract: Recent work on audio-visual navigation assumes a constantly-sounding target and restricts the role of audio to signaling the target's position. We introduce semantic audio-visual navigation, where objects in the environment make sounds consistent with their semantic meaning (e.g., toilet flushing, door creaking) and acoustic events are sporadic or short in duration. We propose a transformer-based…

    Submitted 6 April, 2021; v1 submitted 21 December, 2020; originally announced December 2020.

    Comments: Project page: http://vision.cs.utexas.edu/projects/semantic-audio-visual-navigation

  5. arXiv:2005.01616

    cs.CV cs.SD eess.AS

    VisualEchoes: Spatial Image Representation Learning through Echolocation

    Authors: Ruohan Gao, Changan Chen, Ziad Al-Halah, Carl Schissler, Kristen Grauman

    Abstract: Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation: a biological sonar used to perceive spatial layout and locate objects in the world. We explore the spatial cues contained in echoes and how they can benefit vision tasks that require spatial reasoning. First we capture echo responses in photo-realistic 3D…

    Submitted 17 July, 2020; v1 submitted 4 May, 2020; originally announced May 2020.

    Comments: Appears in ECCV 2020

  6. arXiv:1912.11474

    cs.CV cs.HC cs.SD eess.AS

    SoundSpaces: Audio-Visual Navigation in 3D Environments

    Authors: Changan Chen, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, Kristen Grauman

    Abstract: Moving around in the world is naturally a multisensory experience, but today's embodied agents are deaf, restricted to solely their visual perception of the environment. We introduce audio-visual navigation for complex, acoustically and visually realistic 3D environments. By both seeing and hearing, the agent must learn to navigate to a sounding object. We propose a multi-modal deep reinforcement…

    Submitted 21 August, 2020; v1 submitted 24 December, 2019; originally announced December 2019.

    Comments: Accepted to ECCV 2020 (Spotlight). Project page: http://vision.cs.utexas.edu/projects/audio_visual_navigation/