Showing 1–10 of 10 results for author: Senocak, A

Searching in archive eess.
  1. arXiv:2406.03344  [pdf, other]

    cs.SD cs.AI eess.AS

    Audio Mamba: Bidirectional State Space Model for Audio Representation Learning

    Authors: Mehmet Hamza Erol, Arda Senocak, Jiu Feng, Joon Son Chung

    Abstract: Transformers have rapidly become the preferred choice for audio classification, surpassing methods based on CNNs. However, Audio Spectrogram Transformers (ASTs) exhibit quadratic scaling due to self-attention. The removal of this quadratic self-attention cost presents an appealing direction. Recently, state space models (SSMs), such as Mamba, have demonstrated potential in language and vision task…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: Code is available at https://github.com/mhamzaerol/Audio-Mamba-AuM

  2. arXiv:2401.08415  [pdf, other]

    cs.SD cs.LG eess.AS

    From Coarse to Fine: Efficient Training for Audio Spectrogram Transformers

    Authors: Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

    Abstract: Transformers have become central to recent advances in audio classification. However, training an audio spectrogram transformer, e.g. AST, from scratch can be resource and time-intensive. Furthermore, the complexity of transformers heavily depends on the input audio spectrogram size. In this work, we aim to optimize AST training by linking to the resolution in the time-axis. We introduce multi-pha…

    Submitted 16 January, 2024; originally announced January 2024.

    Comments: ICASSP 2024

  3. arXiv:2311.04066  [pdf, other]

    cs.CV cs.AI cs.MM cs.SD eess.AS

    Can CLIP Help Sound Source Localization?

    Authors: Sooyoung Park, Arda Senocak, Joon Son Chung

    Abstract: Large-scale pre-trained image-text models demonstrate remarkable versatility across diverse tasks, benefiting from their robust representational capabilities and effective multimodal alignment. We extend the application of these models, specifically CLIP, to the domain of sound source localization. Unlike conventional approaches, we employ the pre-trained CLIP model without explicit text input, re…

    Submitted 7 November, 2023; originally announced November 2023.

    Comments: WACV 2024

  4. arXiv:2309.10724  [pdf, other]

    cs.CV cs.AI cs.MM cs.SD eess.AS

    Sound Source Localization is All about Cross-Modal Alignment

    Authors: Arda Senocak, Hyeonggon Ryu, Junsik Kim, Tae-Hyun Oh, Hanspeter Pfister, Joon Son Chung

    Abstract: Humans can easily perceive the direction of sound sources in a visual scene, termed sound source localization. Recent studies on learning-based sound source localization have mainly explored the problem from a localization perspective. However, prior arts and existing benchmarks do not account for a more important aspect of the problem, cross-modal semantic understanding, which is essential for ge…

    Submitted 19 September, 2023; originally announced September 2023.

    Comments: ICCV 2023

  5. arXiv:2307.09286  [pdf, other]

    cs.SD cs.LG eess.AS

    FlexiAST: Flexibility is What AST Needs

    Authors: Jiu Feng, Mehmet Hamza Erol, Joon Son Chung, Arda Senocak

    Abstract: The objective of this work is to give patch-size flexibility to Audio Spectrogram Transformers (AST). Recent advancements in ASTs have shown superior performance in various audio-based tasks. However, the performance of standard ASTs degrades drastically when evaluated using different patch sizes from that used during training. As a result, AST models are typically re-trained to accommodate change…

    Submitted 18 July, 2023; originally announced July 2023.

    Comments: Interspeech 2023

  6. arXiv:2303.17517  [pdf, other]

    cs.CL cs.CV cs.SD eess.AS

    Hindi as a Second Language: Improving Visually Grounded Speech with Semantically Similar Samples

    Authors: Hyeonggon Ryu, Arda Senocak, In So Kweon, Joon Son Chung

    Abstract: The objective of this work is to explore the learning of visually grounded speech models (VGS) from multilingual perspective. Bilingual VGS models are generally trained with an equal number of spoken captions from both languages. However, in reality, there can be an imbalance among the languages for the available spoken captions. Our key contribution in this work is to leverage the power of a high…

    Submitted 30 March, 2023; originally announced March 2023.

    Comments: ICASSP 2023

  7. arXiv:2303.17490  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS eess.IV

    Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment

    Authors: Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Andrew Owens, Tae-Hyun Oh

    Abstract: How does audio describe the world around us? In this paper, we propose a method for generating an image of a scene from sound. Our method addresses the challenges of dealing with the large gaps that often exist between sight and sound. We design a model that works by scheduling the learning procedure of each model component to associate audio-visual modalities despite their information gaps. The k…

    Submitted 30 March, 2023; originally announced March 2023.

    Comments: CVPR 2023

  8. arXiv:2211.01966  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS eess.IV

    MarginNCE: Robust Sound Localization with a Negative Margin

    Authors: Sooyoung Park, Arda Senocak, Joon Son Chung

    Abstract: The goal of this work is to localize sound sources in visual scenes with a self-supervised approach. Contrastive learning in the context of sound source localization leverages the natural correspondence between audio and visual signals where the audio-visual pairs from the same source are assumed as positive, while randomly selected pairs are negatives. However, this approach brings in noisy corre…

    Submitted 3 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023. SOTA performance in Audio-Visual Sound Localization. 5 Pages

  9. arXiv:2202.05961  [pdf, other]

    cs.CV eess.IV

    Audio-Visual Fusion Layers for Event Type Aware Video Recognition

    Authors: Arda Senocak, Junsik Kim, Tae-Hyun Oh, Hyeonggon Ryu, Dingzeyu Li, In So Kweon

    Abstract: Human brain is continuously inundated with the multisensory information and their complex interactions coming from the outside world at any given moment. Such information is automatically analyzed by binding or segregating in our brain. While this task might seem effortless for human brains, it is extremely challenging to build a machine that can perform similar tasks since complex interactions ca…

    Submitted 11 February, 2022; originally announced February 2022.

  10. arXiv:2202.03007  [pdf, other]

    cs.CV cs.SD eess.AS eess.IV

    Learning Sound Localization Better From Semantically Similar Samples

    Authors: Arda Senocak, Hyeonggon Ryu, Junsik Kim, In So Kweon

    Abstract: The objective of this work is to localize the sound sources in visual scenes. Existing audio-visual works employ contrastive learning by assigning corresponding audio-visual pairs from the same source as positives while randomly mismatched pairs as negatives. However, these negative pairs may contain semantically matched audio-visual information. Thus, these semantically correlated pairs, "hard po…

    Submitted 7 February, 2022; originally announced February 2022.

    Comments: Accepted to ICASSP 2022. SOTA performance in Audio-Visual Sound Localization. 5 Pages