Showing 1–22 of 22 results for author: Koepke, A S

  1. arXiv:2404.06309  [pdf, other]

    cs.CV

    Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models

    Authors: David Kurzendörfer, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata

    Abstract: Audio-visual zero-shot learning methods commonly build on features extracted from pre-trained models, e.g. video or audio classification models. However, existing benchmarks predate the popularization of large multi-modal models, such as CLIP and CLAP. In this work, we explore such large pre-trained models to obtain features, i.e. CLIP for visual features, and CLAP for audio features. Furthermore,…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: CVPRw 2024 (L3D-IVU)

  2. arXiv:2402.19106  [pdf, other]

    eess.AS cs.IR cs.SD

    A SOUND APPROACH: Using Large Language Models to generate audio descriptions for egocentric text-audio retrieval

    Authors: Andreea-Maria Oncescu, João F. Henriques, Andrew Zisserman, Samuel Albanie, A. Sophia Koepke

    Abstract: Video databases from the internet are a valuable source of text-audio retrieval datasets. However, given that sound and vision streams represent different "views" of the data, treating visual descriptions as audio descriptions is far from optimal. Even if audio class labels are present, they commonly are not very detailed, making them unsuited for text-audio retrieval. To exploit relevant audio in…

    Submitted 29 February, 2024; originally announced February 2024.

    Comments: 9 pages, 2 figures, 9 tables, Accepted at ICASSP 2024

  3. arXiv:2311.08396  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Zero-shot audio captioning with audio-language model guidance and audio context keywords

    Authors: Leonard Salewski, Stefan Fauth, A. Sophia Koepke, Zeynep Akata

    Abstract: Zero-shot audio captioning aims at automatically generating descriptive textual captions for audio content without prior training for this task. Different from speech recognition which translates audio content that contains spoken language into text, audio captioning is commonly concerned with ambient sounds, or sounds produced by a human performing an action. Inspired by zero-shot image captionin…

    Submitted 14 November, 2023; originally announced November 2023.

    Comments: NeurIPS 2023 - Machine Learning for Audio Workshop (Oral)

  4. arXiv:2311.05043  [pdf, other]

    cs.CV cs.AI cs.CL

    Zero-shot Translation of Attention Patterns in VQA Models to Natural Language

    Authors: Leonard Salewski, A. Sophia Koepke, Hendrik P. A. Lensch, Zeynep Akata

    Abstract: Converting a model's internals to text can yield human-understandable insights about the model. Inspired by the recent success of training-free approaches for image captioning, we propose ZS-A2T, a zero-shot framework that translates the transformer attention of a given model into natural language without requiring any training. We consider this in the context of Visual Question Answering (VQA). Z…

    Submitted 8 November, 2023; originally announced November 2023.

    Comments: Published in GCPR 2023

  5. arXiv:2310.17653  [pdf, other]

    cs.LG cs.CV

    Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model

    Authors: Karsten Roth, Lukas Thede, Almut Sophia Koepke, Oriol Vinyals, Olivier Hénaff, Zeynep Akata

    Abstract: Training deep networks requires various design decisions regarding for instance their architecture, data augmentation, or optimization. In this work, we find these training variations to result in networks learning unique feature sets from the data. Using public model libraries comprising thousands of models trained on canonical datasets like ImageNet, we observe that for arbitrary pairings of pre…

    Submitted 26 February, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

    Comments: ICLR 2024 (spotlight)

  6. arXiv:2309.15086  [pdf, other]

    cs.CV

    Video-adverb retrieval with compositional adverb-action embeddings

    Authors: Thomas Hummel, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata

    Abstract: Retrieving adverbs that describe an action in a video poses a crucial step towards fine-grained video understanding. We propose a framework for video-to-adverb retrieval (and vice versa) that aligns video embeddings with their matching compositional adverb-action text embedding in a joint embedding space. The compositional adverb-action text embedding is learned using a residual gating mechanism,…

    Submitted 26 September, 2023; originally announced September 2023.

    Comments: BMVC 2023 (Oral)

  7. arXiv:2309.03869  [pdf, other]

    cs.CV

    Text-to-feature diffusion for audio-visual few-shot learning

    Authors: Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata

    Abstract: Training deep learning models for video classification from audio-visual data commonly requires immense amounts of labeled training data collected via a costly process. A challenging and underexplored, yet much cheaper, setup is few-shot learning from video data. In particular, the inherently multi-modal nature of video data with sound and visual information has not been leveraged extensively for…

    Submitted 7 September, 2023; originally announced September 2023.

    Comments: DAGM GCPR 2023

  8. arXiv:2308.10599  [pdf, other]

    cs.CV cs.LG

    Image-free Classifier Injection for Zero-Shot Classification

    Authors: Anders Christensen, Massimiliano Mancini, A. Sophia Koepke, Ole Winther, Zeynep Akata

    Abstract: Zero-shot learning models achieve remarkable results on image classification for samples from classes that were not seen during training. However, such models must be trained from scratch with specialised methods: therefore, access to a training dataset is required when the need for zero-shot classification arises. In this paper, we aim to equip pre-trained models with zero-shot classification cap…

    Submitted 21 August, 2023; originally announced August 2023.

    Comments: Accepted at ICCV 2023

  9. arXiv:2307.10865  [pdf, other]

    cs.LG stat.ML

    Addressing caveats of neural persistence with deep graph persistence

    Authors: Leander Girrbach, Anders Christensen, Ole Winther, Zeynep Akata, A. Sophia Koepke

    Abstract: Neural Persistence is a prominent measure for quantifying neural network complexity, proposed in the emerging field of topological data analysis in deep learning. In this work, however, we find both theoretically and empirically that the variance of network weights and spatial concentration of large weights are the main factors that impact neural persistence. Whilst this captures useful informatio…

    Submitted 20 November, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

    Comments: Transactions on Machine Learning Research (TMLR), 2023

  10. arXiv:2306.07282  [pdf, other]

    cs.CV cs.LG

    Waffling around for Performance: Visual Classification with Random Words and Broad Concepts

    Authors: Karsten Roth, Jae Myung Kim, A. Sophia Koepke, Oriol Vinyals, Cordelia Schmid, Zeynep Akata

    Abstract: The visual classification performance of vision-language models such as CLIP has been shown to benefit from additional semantic knowledge from large language models (LLMs) such as GPT-3. In particular, averaging over LLM-generated class descriptors, e.g. "waffle, which has a round shape", can notably improve generalization performance. In this work, we critically study this behavior and propose Wa…

    Submitted 16 August, 2023; v1 submitted 12 June, 2023; originally announced June 2023.

    Comments: Accepted to ICCV 2023. Main paper with 9 pages

  11. arXiv:2304.03391  [pdf, other]

    cs.CV

    Exposing and Mitigating Spurious Correlations for Cross-Modal Retrieval

    Authors: Jae Myung Kim, A. Sophia Koepke, Cordelia Schmid, Zeynep Akata

    Abstract: Cross-modal retrieval methods are the preferred tool to search databases for the text that best matches a query image and vice versa. However, image-text retrieval models commonly learn to memorize spurious correlations in the training data, such as frequent object co-occurrence, instead of looking at the actual underlying reasons for the prediction in the image. For image-text retrieval, this man…

    Submitted 6 April, 2023; originally announced April 2023.

    Comments: CVPR'23 MULA Workshop

  12. arXiv:2210.14222  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    PlanT: Explainable Planning Transformers via Object-Level Representations

    Authors: Katrin Renz, Kashyap Chitta, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata, Andreas Geiger

    Abstract: Planning an optimal route in a complex environment requires efficient reasoning about the surrounding scene. While human drivers prioritize important objects and ignore details not relevant to the decision, learning-based planners typically extract features from dense, high-dimensional grid representations containing all vehicle and road context information. In this paper, we propose PlanT, a nove…

    Submitted 25 October, 2022; originally announced October 2022.

    Comments: CoRL 2022. Project Page: https://www.katrinrenz.de/plant/

  13. arXiv:2207.09966  [pdf, other]

    cs.CV

    Temporal and cross-modal attention for audio-visual zero-shot learning

    Authors: Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata

    Abstract: Audio-visual generalised zero-shot learning for video classification requires understanding the relations between the audio and visual information in order to be able to recognise samples from novel, previously unseen classes at test time. The natural semantic and temporal alignment between audio and visual data in video data can be exploited to learn powerful representations that generalise to un…

    Submitted 20 July, 2022; originally announced July 2022.

    Comments: ECCV 2022

  14. CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations

    Authors: Leonard Salewski, A. Sophia Koepke, Hendrik P. A. Lensch, Zeynep Akata

    Abstract: Providing explanations in the context of Visual Question Answering (VQA) presents a fundamental problem in machine learning. To obtain detailed insights into the process of generating natural language explanations for VQA, we introduce the large-scale CLEVR-X dataset that extends the CLEVR dataset with natural language explanations. For each image-question pair in the CLEVR dataset, CLEVR-X contai…

    Submitted 5 April, 2022; originally announced April 2022.

  15. arXiv:2203.03598  [pdf, other]

    cs.CV cs.CL eess.AS

    Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language

    Authors: Otniel-Bogdan Mercea, Lukas Riesch, A. Sophia Koepke, Zeynep Akata

    Abstract: Learning to classify video data from classes not included in the training data, i.e. video-based zero-shot learning, is challenging. We conjecture that the natural alignment between the audio and visual modalities in video data provides a rich training signal for learning discriminative multi-modal representations. Focusing on the relatively underexplored task of audio-visual zero-shot learning, w…

    Submitted 4 April, 2022; v1 submitted 7 March, 2022; originally announced March 2022.

    Comments: CVPR 2022

  16. arXiv:2112.09418  [pdf, other]

    eess.AS cs.IR cs.SD

    Audio Retrieval with Natural Language Queries: A Benchmark Study

    Authors: A. Sophia Koepke, Andreea-Maria Oncescu, João F. Henriques, Zeynep Akata, Samuel Albanie

    Abstract: The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the goal is to retrieve the audio content from a pool of candidates that best matches a given written description and vice versa. Text-audio retrieval enables users to search large databases through an intuitive interface: they simply issue free-form natural language descriptions of the sound they would like…

    Submitted 27 January, 2022; v1 submitted 17 December, 2021; originally announced December 2021.

    Comments: Submitted to Transactions on Multimedia. arXiv admin note: substantial text overlap with arXiv:2105.02192

    Journal ref: IEEE Transactions on Multimedia 2022

  17. arXiv:2105.02192  [pdf, other]

    cs.IR cs.SD eess.AS

    Audio Retrieval with Natural Language Queries

    Authors: Andreea-Maria Oncescu, A. Sophia Koepke, João F. Henriques, Zeynep Akata, Samuel Albanie

    Abstract: We consider the task of retrieving audio using free-form natural language queries. To study this problem, which has received limited attention in the existing literature, we introduce challenging new benchmarks for text-based audio retrieval using text annotations sourced from the Audiocaps and Clotho datasets. We then employ these benchmarks to establish baselines for cross-modal audio retrieval,…

    Submitted 22 July, 2021; v1 submitted 5 May, 2021; originally announced May 2021.

    Comments: Accepted at INTERSPEECH 2021

  18. arXiv:2105.01517  [pdf, other]

    cs.CV cs.AI cs.LG

    Where and When: Space-Time Attention for Audio-Visual Explanations

    Authors: Yanbei Chen, Thomas Hummel, A. Sophia Koepke, Zeynep Akata

    Abstract: Explaining the decision of a multi-modal decision-maker requires to determine the evidence from both modalities. Recent advances in XAI provide explanations for models trained on still images. However, when it comes to modeling multiple sensory modalities in a dynamic world, it remains underexplored how to demystify the mysterious dynamics of a complex multi-modal model. In this work, we take a cr…

    Submitted 4 May, 2021; originally announced May 2021.

  19. arXiv:2104.10955  [pdf, other]

    cs.CV cs.AI cs.LG

    Distilling Audio-Visual Knowledge by Compositional Contrastive Learning

    Authors: Yanbei Chen, Yongqin Xian, A. Sophia Koepke, Ying Shan, Zeynep Akata

    Abstract: Having access to multi-modal cues (e.g. vision and audio) empowers some cognitive tasks to be done faster compared to learning from a single modality. In this work, we propose to transfer knowledge across heterogeneous modalities, even though these data modalities may not be semantically correlated. Rather than directly aligning the representations of different modalities, we compose audio, image,…

    Submitted 22 April, 2021; originally announced April 2021.

    Comments: Accepted to CVPR2021

  20. arXiv:1910.12699  [pdf, other]

    cs.CV

    Self-supervised learning of class embeddings from video

    Authors: Olivia Wiles, A. Sophia Koepke, Andrew Zisserman

    Abstract: This work explores how to use self-supervised learning on videos to learn a class-specific image embedding that encodes pose and shape information. At train time, two frames of the same video of an object class (e.g. human upper body) are extracted and each encoded to an embedding. Conditioned on these embeddings, the decoder network is tasked to transform one frame into another. To successfully p…

    Submitted 28 October, 2019; originally announced October 2019.

    Comments: 4th International Workshop on Compact and Efficient Feature Representation and Learning in Computer Vision 2019

  21. arXiv:1808.06882  [pdf, other]

    cs.CV

    Self-supervised learning of a facial attribute embedding from video

    Authors: Olivia Wiles, A. Sophia Koepke, Andrew Zisserman

    Abstract: We propose a self-supervised framework for learning facial attributes by simply watching videos of a human face speaking, laughing, and moving over time. To perform this task, we introduce a network, Facial Attributes-Net (FAb-Net), that is trained to embed multiple frames from the same video face-track into a common low-dimensional space. With this approach, we make three contributions: first, we…

    Submitted 21 August, 2018; originally announced August 2018.

    Comments: To appear in BMVC 2018. Supplementary material can be found at http://www.robots.ox.ac.uk/~vgg/research/unsup_learn_watch_faces/fabnet.html

  22. arXiv:1807.10550  [pdf, other]

    cs.CV

    X2Face: A network for controlling face generation by using images, audio, and pose codes

    Authors: Olivia Wiles, A. Sophia Koepke, Andrew Zisserman

    Abstract: The objective of this paper is a neural network model that controls the pose and expression of a given face, using another face or modality (e.g. audio). This model can then be used for lightweight, sophisticated video and image editing. We make the following three contributions. First, we introduce a network, X2Face, that can control a source face (specified by one or more frames) using another…

    Submitted 27 July, 2018; originally announced July 2018.

    Comments: To appear in ECCV 2018. Accompanying video: http://www.robots.ox.ac.uk/~vgg/research/unsup_learn_watch_faces/x2face.html