Showing 1–12 of 12 results for author: Owens, A

Searching in archive eess.
  1. arXiv:2405.12221  [pdf, other]

    cs.CV cs.LG cs.MM cs.SD eess.AS

    Images that Sound: Composing Images and Sounds on a Single Canvas

    Authors: Ziyang Chen, Daniel Geng, Andrew Owens

    Abstract: Spectrograms are 2D representations of sound that look very different from the images found in our visual world. And natural images, when played as spectrograms, make unnatural sounds. In this paper, we show that it is possible to synthesize spectrograms that simultaneously look like natural images and sound like natural audio. We call these spectrograms images that sound. Our approach is simple a…

    Submitted 20 May, 2024; originally announced May 2024.

    Comments: Project site: https://ificl.github.io/images-that-sound/

  2. arXiv:2403.18821  [pdf, other]

    cs.SD cs.CV cs.MM eess.AS

    Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark

    Authors: Ziyang Chen, Israel D. Gebru, Christian Richardt, Anurag Kumar, William Laney, Andrew Owens, Alexander Richard

    Abstract: We present a new dataset called Real Acoustic Fields (RAF) that captures real acoustic room data from multiple modalities. The dataset includes high-quality and densely captured room impulse response data paired with multi-view images, and precise 6DoF pose tracking data for sound emitters and listeners in the rooms. We used this dataset to evaluate existing methods for novel-view acoustic synthes…

    Submitted 27 March, 2024; originally announced March 2024.

    Comments: Accepted to CVPR 2024. Project site: https://facebookresearch.github.io/real-acoustic-fields/

  3. arXiv:2304.08490  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Conditional Generation of Audio from Video via Foley Analogies

    Authors: Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, Andrew Owens

    Abstract: The sound effects that designers add to videos are designed to convey a particular artistic effect and, thus, may be quite different from a scene's true sound. Inspired by the challenges of creating a soundtrack for a video that differs from its true sound, but that nonetheless matches the actions occurring on screen, we propose the problem of conditional Foley. We present the following contributi…

    Submitted 17 April, 2023; originally announced April 2023.

    Comments: CVPR 2023

  4. arXiv:2303.17490  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS eess.IV

    Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment

    Authors: Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Andrew Owens, Tae-Hyun Oh

    Abstract: How does audio describe the world around us? In this paper, we propose a method for generating an image of a scene from sound. Our method addresses the challenges of dealing with the large gaps that often exist between sight and sound. We design a model that works by scheduling the learning procedure of each model component to associate audio-visual modalities despite their information gaps. The k…

    Submitted 30 March, 2023; originally announced March 2023.

    Comments: CVPR 2023

  5. arXiv:2303.11329  [pdf, other]

    cs.CV cs.SD eess.AS

    Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation

    Authors: Ziyang Chen, Shengyi Qian, Andrew Owens

    Abstract: The images and sounds that we perceive undergo subtle but geometrically consistent changes as we rotate our heads. In this paper, we use these cues to solve a problem we call Sound Localization from Motion (SLfM): jointly estimating camera rotation and localizing sound sources. We learn to solve these tasks solely through self-supervision. A visual model predicts camera rotation from a pair of ima…

    Submitted 21 August, 2023; v1 submitted 20 March, 2023; originally announced March 2023.

    Comments: ICCV 2023. Project site: https://ificl.github.io/SLfM/

  6. arXiv:2205.05072  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Learning Visual Styles from Audio-Visual Associations

    Authors: Tingle Li, Yichen Liu, Andrew Owens, Hang Zhao

    Abstract: From the patter of rain to the crunch of snow, the sounds we hear often convey the visual textures that appear within a scene. In this paper, we present a method for learning visual styles from unlabeled audio-visual data. Our model learns to manipulate the texture of a scene to match a sound, a problem we term audio-driven image stylization. Given a dataset of paired audio-visual data, we learn t…

    Submitted 10 May, 2022; originally announced May 2022.

  7. arXiv:2204.12489  [pdf, other]

    cs.CV cs.SD eess.AS

    Sound Localization by Self-Supervised Time Delay Estimation

    Authors: Ziyang Chen, David F. Fouhey, Andrew Owens

    Abstract: Sounds reach one microphone in a stereo pair sooner than the other, resulting in an interaural time delay that conveys their directions. Estimating a sound's time delay requires finding correspondences between the signals recorded by each microphone. We propose to learn these correspondences through self-supervision, drawing on recent techniques from visual tracking. We adapt the contrastive rando…

    Submitted 28 January, 2023; v1 submitted 26 April, 2022; originally announced April 2022.

    Comments: ECCV 2022

  8. arXiv:2111.05846  [pdf, other]

    cs.SD cs.CV cs.MM cs.RO eess.AS

    Structure from Silence: Learning Scene Structure from Ambient Sound

    Authors: Ziyang Chen, Xixi Hu, Andrew Owens

    Abstract: From whirling ceiling fans to ticking clocks, the sounds that we hear subtly vary as we move through a scene. We ask whether these ambient sounds convey information about 3D scene structure and, if so, whether they provide a useful learning signal for multimodal models. To study this, we collect a dataset of paired audio and RGB-D recordings from a variety of quiet indoor scenes. We then train mod…

    Submitted 10 November, 2021; originally announced November 2021.

    Comments: Accepted to CoRL 2021 (Oral Presentation)

  9. arXiv:2008.04237  [pdf, other]

    cs.CV cs.SD eess.AS

    Self-Supervised Learning of Audio-Visual Objects from Video

    Authors: Triantafyllos Afouras, Andrew Owens, Joon Son Chung, Andrew Zisserman

    Abstract: Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time. We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented…

    Submitted 10 August, 2020; originally announced August 2020.

    Comments: ECCV 2020

  10. arXiv:2006.14613  [pdf, other]

    cs.CV cs.LG eess.IV

    Space-Time Correspondence as a Contrastive Random Walk

    Authors: Allan Jabri, Andrew Owens, Alexei A. Efros

    Abstract: This paper proposes a simple self-supervised approach for learning a representation for visual correspondence from raw video. We cast correspondence as prediction of links in a space-time graph constructed from video. In this graph, the nodes are patches sampled from each frame, and nodes adjacent in time can share a directed edge. We learn a representation in which pairwise similarity defines tra…

    Submitted 3 December, 2020; v1 submitted 25 June, 2020; originally announced June 2020.

    Comments: NeurIPS 2020 camera ready version -- Code at github.com/ajabri/videowalk

  11. arXiv:1906.04160  [pdf, other]

    cs.CV cs.LG eess.AS

    Learning Individual Styles of Conversational Gesture

    Authors: Shiry Ginosar, Amir Bar, Gefen Kohavi, Caroline Chan, Andrew Owens, Jitendra Malik

    Abstract: Human speech is often accompanied by hand and arm gestures. Given audio speech input, we generate plausible gestures to go along with the sound. Specifically, we perform cross-modal translation from "in-the-wild" monologue speech of a single speaker to their hand and arm motion. We train on unlabeled videos for which we only have noisy pseudo ground truth from an automatic pose detection system.…

    Submitted 10 June, 2019; originally announced June 2019.

    Comments: CVPR 2019

  12. arXiv:1804.03641  [pdf, other]

    cs.CV cs.SD eess.AS

    Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

    Authors: Andrew Owens, Alexei A. Efros

    Abstract: The thud of a bouncing ball, the onset of speech as lips open -- when visual and audio events occur together, it suggests that there might be a common, underlying event that produced both signals. In this paper, we argue that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation. We propose to learn such a representation in a self-sup…

    Submitted 9 October, 2018; v1 submitted 10 April, 2018; originally announced April 2018.