Showing 1–28 of 28 results for author: Koepke, A

  1. arXiv:2404.06309  [pdf, other]

    cs.CV

    Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models

    Authors: David Kurzendörfer, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata

    Abstract: Audio-visual zero-shot learning methods commonly build on features extracted from pre-trained models, e.g. video or audio classification models. However, existing benchmarks predate the popularization of large multi-modal models, such as CLIP and CLAP. In this work, we explore such large pre-trained models to obtain features, i.e. CLIP for visual features, and CLAP for audio features. Furthermore,…

    Submitted 9 April, 2024; originally announced April 2024.

    Comments: CVPRw 2024 (L3D-IVU)

  2. arXiv:2402.19106  [pdf, other]

    eess.AS cs.IR cs.SD

    A SOUND APPROACH: Using Large Language Models to generate audio descriptions for egocentric text-audio retrieval

    Authors: Andreea-Maria Oncescu, João F. Henriques, Andrew Zisserman, Samuel Albanie, A. Sophia Koepke

    Abstract: Video databases from the internet are a valuable source of text-audio retrieval datasets. However, given that sound and vision streams represent different "views" of the data, treating visual descriptions as audio descriptions is far from optimal. Even if audio class labels are present, they commonly are not very detailed, making them unsuited for text-audio retrieval. To exploit relevant audio in…

    Submitted 29 February, 2024; originally announced February 2024.

    Comments: 9 pages, 2 figures, 9 tables, Accepted at ICASSP 2024

  3. arXiv:2311.08396  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Zero-shot audio captioning with audio-language model guidance and audio context keywords

    Authors: Leonard Salewski, Stefan Fauth, A. Sophia Koepke, Zeynep Akata

    Abstract: Zero-shot audio captioning aims at automatically generating descriptive textual captions for audio content without prior training for this task. Different from speech recognition which translates audio content that contains spoken language into text, audio captioning is commonly concerned with ambient sounds, or sounds produced by a human performing an action. Inspired by zero-shot image captionin…

    Submitted 14 November, 2023; originally announced November 2023.

    Comments: NeurIPS 2023 - Machine Learning for Audio Workshop (Oral)

  4. arXiv:2311.05043  [pdf, other]

    cs.CV cs.AI cs.CL

    Zero-shot Translation of Attention Patterns in VQA Models to Natural Language

    Authors: Leonard Salewski, A. Sophia Koepke, Hendrik P. A. Lensch, Zeynep Akata

    Abstract: Converting a model's internals to text can yield human-understandable insights about the model. Inspired by the recent success of training-free approaches for image captioning, we propose ZS-A2T, a zero-shot framework that translates the transformer attention of a given model into natural language without requiring any training. We consider this in the context of Visual Question Answering (VQA). Z…

    Submitted 8 November, 2023; originally announced November 2023.

    Comments: Published in GCPR 2023

  5. arXiv:2310.17653  [pdf, other]

    cs.LG cs.CV

    Fantastic Gains and Where to Find Them: On the Existence and Prospect of General Knowledge Transfer between Any Pretrained Model

    Authors: Karsten Roth, Lukas Thede, Almut Sophia Koepke, Oriol Vinyals, Olivier Hénaff, Zeynep Akata

    Abstract: Training deep networks requires various design decisions regarding for instance their architecture, data augmentation, or optimization. In this work, we find these training variations to result in networks learning unique feature sets from the data. Using public model libraries comprising thousands of models trained on canonical datasets like ImageNet, we observe that for arbitrary pairings of pre…

    Submitted 26 February, 2024; v1 submitted 26 October, 2023; originally announced October 2023.

    Comments: ICLR 2024 (spotlight)

  6. arXiv:2309.15086  [pdf, other]

    cs.CV

    Video-adverb retrieval with compositional adverb-action embeddings

    Authors: Thomas Hummel, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata

    Abstract: Retrieving adverbs that describe an action in a video poses a crucial step towards fine-grained video understanding. We propose a framework for video-to-adverb retrieval (and vice versa) that aligns video embeddings with their matching compositional adverb-action text embedding in a joint embedding space. The compositional adverb-action text embedding is learned using a residual gating mechanism,…

    Submitted 26 September, 2023; originally announced September 2023.

    Comments: BMVC 2023 (Oral)

  7. arXiv:2309.03869  [pdf, other]

    cs.CV

    Text-to-feature diffusion for audio-visual few-shot learning

    Authors: Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata

    Abstract: Training deep learning models for video classification from audio-visual data commonly requires immense amounts of labeled training data collected via a costly process. A challenging and underexplored, yet much cheaper, setup is few-shot learning from video data. In particular, the inherently multi-modal nature of video data with sound and visual information has not been leveraged extensively for…

    Submitted 7 September, 2023; originally announced September 2023.

    Comments: DAGM GCPR 2023

  8. arXiv:2308.10599  [pdf, other]

    cs.CV cs.LG

    Image-free Classifier Injection for Zero-Shot Classification

    Authors: Anders Christensen, Massimiliano Mancini, A. Sophia Koepke, Ole Winther, Zeynep Akata

    Abstract: Zero-shot learning models achieve remarkable results on image classification for samples from classes that were not seen during training. However, such models must be trained from scratch with specialised methods: therefore, access to a training dataset is required when the need for zero-shot classification arises. In this paper, we aim to equip pre-trained models with zero-shot classification cap…

    Submitted 21 August, 2023; originally announced August 2023.

    Comments: Accepted at ICCV 2023

  9. arXiv:2307.10865  [pdf, other]

    cs.LG stat.ML

    Addressing caveats of neural persistence with deep graph persistence

    Authors: Leander Girrbach, Anders Christensen, Ole Winther, Zeynep Akata, A. Sophia Koepke

    Abstract: Neural Persistence is a prominent measure for quantifying neural network complexity, proposed in the emerging field of topological data analysis in deep learning. In this work, however, we find both theoretically and empirically that the variance of network weights and spatial concentration of large weights are the main factors that impact neural persistence. Whilst this captures useful informatio…

    Submitted 20 November, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

    Comments: Transactions on Machine Learning Research (TMLR), 2023

  10. arXiv:2306.07282  [pdf, other]

    cs.CV cs.LG

    Waffling around for Performance: Visual Classification with Random Words and Broad Concepts

    Authors: Karsten Roth, Jae Myung Kim, A. Sophia Koepke, Oriol Vinyals, Cordelia Schmid, Zeynep Akata

    Abstract: The visual classification performance of vision-language models such as CLIP has been shown to benefit from additional semantic knowledge from large language models (LLMs) such as GPT-3. In particular, averaging over LLM-generated class descriptors, e.g. "waffle, which has a round shape", can notably improve generalization performance. In this work, we critically study this behavior and propose Wa…

    Submitted 16 August, 2023; v1 submitted 12 June, 2023; originally announced June 2023.

    Comments: Accepted to ICCV 2023. Main paper with 9 pages

  11. arXiv:2304.03391  [pdf, other]

    cs.CV

    Exposing and Mitigating Spurious Correlations for Cross-Modal Retrieval

    Authors: Jae Myung Kim, A. Sophia Koepke, Cordelia Schmid, Zeynep Akata

    Abstract: Cross-modal retrieval methods are the preferred tool to search databases for the text that best matches a query image and vice versa. However, image-text retrieval models commonly learn to memorize spurious correlations in the training data, such as frequent object co-occurrence, instead of looking at the actual underlying reasons for the prediction in the image. For image-text retrieval, this man…

    Submitted 6 April, 2023; originally announced April 2023.

    Comments: CVPR'23 MULA Workshop

  12. arXiv:2210.14222  [pdf, other]

    cs.RO cs.AI cs.CV cs.LG

    PlanT: Explainable Planning Transformers via Object-Level Representations

    Authors: Katrin Renz, Kashyap Chitta, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata, Andreas Geiger

    Abstract: Planning an optimal route in a complex environment requires efficient reasoning about the surrounding scene. While human drivers prioritize important objects and ignore details not relevant to the decision, learning-based planners typically extract features from dense, high-dimensional grid representations containing all vehicle and road context information. In this paper, we propose PlanT, a nove…

    Submitted 25 October, 2022; originally announced October 2022.

    Comments: CoRL 2022. Project Page: https://www.katrinrenz.de/plant/

  13. arXiv:2207.09966  [pdf, other]

    cs.CV

    Temporal and cross-modal attention for audio-visual zero-shot learning

    Authors: Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata

    Abstract: Audio-visual generalised zero-shot learning for video classification requires understanding the relations between the audio and visual information in order to be able to recognise samples from novel, previously unseen classes at test time. The natural semantic and temporal alignment between audio and visual data in video data can be exploited to learn powerful representations that generalise to un…

    Submitted 20 July, 2022; originally announced July 2022.

    Comments: ECCV 2022

  14. CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations

    Authors: Leonard Salewski, A. Sophia Koepke, Hendrik P. A. Lensch, Zeynep Akata

    Abstract: Providing explanations in the context of Visual Question Answering (VQA) presents a fundamental problem in machine learning. To obtain detailed insights into the process of generating natural language explanations for VQA, we introduce the large-scale CLEVR-X dataset that extends the CLEVR dataset with natural language explanations. For each image-question pair in the CLEVR dataset, CLEVR-X contai…

    Submitted 5 April, 2022; originally announced April 2022.

  15. arXiv:2203.03598  [pdf, other]

    cs.CV cs.CL eess.AS

    Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language

    Authors: Otniel-Bogdan Mercea, Lukas Riesch, A. Sophia Koepke, Zeynep Akata

    Abstract: Learning to classify video data from classes not included in the training data, i.e. video-based zero-shot learning, is challenging. We conjecture that the natural alignment between the audio and visual modalities in video data provides a rich training signal for learning discriminative multi-modal representations. Focusing on the relatively underexplored task of audio-visual zero-shot learning, w…

    Submitted 4 April, 2022; v1 submitted 7 March, 2022; originally announced March 2022.

    Comments: CVPR 2022

  16. arXiv:2112.09418  [pdf, other]

    eess.AS cs.IR cs.SD

    Audio Retrieval with Natural Language Queries: A Benchmark Study

    Authors: A. Sophia Koepke, Andreea-Maria Oncescu, João F. Henriques, Zeynep Akata, Samuel Albanie

    Abstract: The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the goal is to retrieve the audio content from a pool of candidates that best matches a given written description and vice versa. Text-audio retrieval enables users to search large databases through an intuitive interface: they simply issue free-form natural language descriptions of the sound they would like…

    Submitted 27 January, 2022; v1 submitted 17 December, 2021; originally announced December 2021.

    Comments: Submitted to Transactions on Multimedia. arXiv admin note: substantial text overlap with arXiv:2105.02192

    Journal ref: IEEE Transactions on Multimedia 2022

  17. arXiv:2105.02192  [pdf, other]

    cs.IR cs.SD eess.AS

    Audio Retrieval with Natural Language Queries

    Authors: Andreea-Maria Oncescu, A. Sophia Koepke, João F. Henriques, Zeynep Akata, Samuel Albanie

    Abstract: We consider the task of retrieving audio using free-form natural language queries. To study this problem, which has received limited attention in the existing literature, we introduce challenging new benchmarks for text-based audio retrieval using text annotations sourced from the Audiocaps and Clotho datasets. We then employ these benchmarks to establish baselines for cross-modal audio retrieval,…

    Submitted 22 July, 2021; v1 submitted 5 May, 2021; originally announced May 2021.

    Comments: Accepted at INTERSPEECH 2021

  18. arXiv:2105.01517  [pdf, other]

    cs.CV cs.AI cs.LG

    Where and When: Space-Time Attention for Audio-Visual Explanations

    Authors: Yanbei Chen, Thomas Hummel, A. Sophia Koepke, Zeynep Akata

    Abstract: Explaining the decision of a multi-modal decision-maker requires to determine the evidence from both modalities. Recent advances in XAI provide explanations for models trained on still images. However, when it comes to modeling multiple sensory modalities in a dynamic world, it remains underexplored how to demystify the mysterious dynamics of a complex multi-modal model. In this work, we take a cr…

    Submitted 4 May, 2021; originally announced May 2021.

  19. arXiv:2104.10955  [pdf, other]

    cs.CV cs.AI cs.LG

    Distilling Audio-Visual Knowledge by Compositional Contrastive Learning

    Authors: Yanbei Chen, Yongqin Xian, A. Sophia Koepke, Ying Shan, Zeynep Akata

    Abstract: Having access to multi-modal cues (e.g. vision and audio) empowers some cognitive tasks to be done faster compared to learning from a single modality. In this work, we propose to transfer knowledge across heterogeneous modalities, even though these data modalities may not be semantically correlated. Rather than directly aligning the representations of different modalities, we compose audio, image,…

    Submitted 22 April, 2021; originally announced April 2021.

    Comments: Accepted to CVPR2021

  20. arXiv:2006.01306  [pdf]

    physics.optics physics.ins-det

    Optical Atomic Clock Comparison through Turbulent Air

    Authors: Martha I. Bodine, Jean-Daniel Deschênes, Isaac H. Khader, William C. Swann, Holly Leopardi, Kyle Beloy, Tobias Bothwell, Samuel M. Brewer, Sarah L. Bromley, Jwo-Sy Chen, Scott A. Diddams, Robert J. Fasano, Tara M. Fortier, Youssef S. Hassan, David B. Hume, Dhruv Kedar, Colin J. Kennedy, Amanda Koepke, David R. Leibrandt, Andrew D. Ludlow, William F. McGrew, William R. Milner, Daniele Nicolodi, Eric Oelker, Thomas E. Parker, et al. (10 additional authors not shown)

    Abstract: We use frequency comb-based optical two-way time-frequency transfer (O-TWTFT) to measure the optical frequency ratio of state-of-the-art ytterbium and strontium optical atomic clocks separated by a 1.5 km open-air link. Our free-space measurement is compared to a simultaneous measurement acquired via a noise-cancelled fiber link. Despite non-stationary, ps-level time-of-flight variations in the fr…

    Submitted 11 September, 2020; v1 submitted 1 June, 2020; originally announced June 2020.

    Journal ref: Phys. Rev. Research 2, 033395 (2020)

  21. arXiv:2005.14694  [pdf, other]

    physics.atom-ph physics.data-an physics.optics

    Frequency Ratio Measurements with 18-digit Accuracy Using a Network of Optical Clocks

    Authors: Boulder Atomic Clock Optical Network Collaboration: Kyle Beloy, Martha I. Bodine, Tobias Bothwell, Samuel M. Brewer, Sarah L. Bromley, Jwo-Sy Chen, Jean-Daniel Deschênes, Scott A. Diddams, Robert J. Fasano, Tara M. Fortier, Youssef S. Hassan, David B. Hume, Dhruv Kedar, Colin J. Kennedy, Isaac Khader, Amanda Koepke, David R. Leibrandt, Holly Leopardi, Andrew D. Ludlow, William F. McGrew, William R. Milner, Nathan R. Newbury, et al. (13 additional authors not shown)

    Abstract: Atomic clocks occupy a unique position in measurement science, exhibiting higher accuracy than any other measurement standard and underpinning six out of seven base units in the SI system. By exploiting higher resonance frequencies, optical atomic clocks now achieve greater stability and lower frequency uncertainty than existing primary standards. Here, we report frequency ratios of the $^{27}$Al…

    Submitted 29 May, 2020; originally announced May 2020.

    Comments: 51 pages, 12 figures, 6 tables

  22. arXiv:1910.12699  [pdf, other]

    cs.CV

    Self-supervised learning of class embeddings from video

    Authors: Olivia Wiles, A. Sophia Koepke, Andrew Zisserman

    Abstract: This work explores how to use self-supervised learning on videos to learn a class-specific image embedding that encodes pose and shape information. At train time, two frames of the same video of an object class (e.g. human upper body) are extracted and each encoded to an embedding. Conditioned on these embeddings, the decoder network is tasked to transform one frame into another. To successfully p…

    Submitted 28 October, 2019; originally announced October 2019.

    Comments: 4th International Workshop on Compact and Efficient Feature Representation and Learning in Computer Vision 2019

  23. arXiv:1908.06139  [pdf]

    physics.ins-det physics.app-ph

    Meta-study of laser power calibrations ranging 20 orders of magnitude with traceability to the kilogram

    Authors: Paul A. Williams, Matthew T. Spidell, Joshua A. Hadler, Thomas Gerrits, Amanda Koepke, David Livigni, Michelle S. Stephens, Nathan A. Tomlin, Gordon A. Shaw, Jolene D. Splett, Igor Vayshenker, Malcolm G. White, Chris Yung, John H. Lehman

    Abstract: Laser power metrology at the National Institute of Standards and Technology (NIST) ranges 20 orders of magnitude from photon-counting (1000 photons/s) to 100 kW (10^23 photons/s at a wavelength of 1070 nm). As a part of routine practices, we perform internal (unpublished) comparisons between our various power meters to verify correct operation.

    Submitted 16 September, 2019; v1 submitted 16 August, 2019; originally announced August 2019.

    Comments: 6 figures, 3 tables, 21 pages

  24. Monte Carlo Sampling Bias in the Microwave Uncertainty Framework

    Authors: Michael Frey, Benjamin F. Jamroz, Amanda Koepke, Jacob D. Rezac, Dylan Williams

    Abstract: Uncertainty propagation software can have unknown, inadvertent biases introduced by various means. This work is a case study in bias identification and reduction in one such software package, the Microwave Uncertainty Framework (MUF). The general purpose of the MUF is to provide automated multivariate statistical uncertainty propagation and analysis on a Monte Carlo (MC) basis. Combine is a key mo…

    Submitted 15 February, 2019; originally announced February 2019.

  25. arXiv:1808.06882  [pdf, other]

    cs.CV

    Self-supervised learning of a facial attribute embedding from video

    Authors: Olivia Wiles, A. Sophia Koepke, Andrew Zisserman

    Abstract: We propose a self-supervised framework for learning facial attributes by simply watching videos of a human face speaking, laughing, and moving over time. To perform this task, we introduce a network, Facial Attributes-Net (FAb-Net), that is trained to embed multiple frames from the same video face-track into a common low-dimensional space. With this approach, we make three contributions: first, we…

    Submitted 21 August, 2018; originally announced August 2018.

    Comments: To appear in BMVC 2018. Supplementary material can be found at http://www.robots.ox.ac.uk/~vgg/research/unsup_learn_watch_faces/fabnet.html

  26. arXiv:1807.10550  [pdf, other]

    cs.CV

    X2Face: A network for controlling face generation by using images, audio, and pose codes

    Authors: Olivia Wiles, A. Sophia Koepke, Andrew Zisserman

    Abstract: The objective of this paper is a neural network model that controls the pose and expression of a given face, using another face or modality (e.g. audio). This model can then be used for lightweight, sophisticated video and image editing. We make the following three contributions. First, we introduce a network, X2Face, that can control a source face (specified by one or more frames) using another…

    Submitted 27 July, 2018; originally announced July 2018.

    Comments: To appear in ECCV 2018. Accompanying video: http://www.robots.ox.ac.uk/~vgg/research/unsup_learn_watch_faces/x2face.html

  27. arXiv:1503.01183  [pdf, other]

    stat.ML cs.LG

    A General Hybrid Clustering Technique

    Authors: Saeid Amiri, Bertrand Clarke, Jennifer Clarke, Hoyt A. Koepke

    Abstract: Here, we propose a clustering technique for general clustering problems including those that have non-convex clusters. For a given desired number of clusters $K$, we use three stages to find a clustering. The first stage uses a hybrid clustering technique to produce a series of clusterings of various sizes (randomly selected). The key steps are to find a $K$-means clustering using $K_\ell$ cluste…

    Submitted 5 March, 2015; v1 submitted 3 March, 2015; originally announced March 2015.

  28. arXiv:1402.0536  [pdf, other]

    stat.AP

    Predictive Modeling of Cholera Outbreaks in Bangladesh

    Authors: Amanda A. Koepke, Ira M. Longini Jr., M. Elizabeth Halloran, Jon Wakefield, Vladimir N. Minin

    Abstract: Despite seasonal cholera outbreaks in Bangladesh, little is known about the relationship between environmental conditions and cholera cases. We seek to develop a predictive model for cholera outbreaks in Bangladesh based on environmental predictors. To do this, we estimate the contribution of environmental variables, such as water depth and water temperature, to cholera outbreaks in the context of…

    Submitted 11 January, 2015; v1 submitted 3 February, 2014; originally announced February 2014.

    Comments: 43 pages, including appendices, 5 figures, 1 table in the main text