Showing 1–6 of 6 results for author: Hummel, T

  1. arXiv:2309.15086  [pdf, other]

    cs.CV

    Video-adverb retrieval with compositional adverb-action embeddings

    Authors: Thomas Hummel, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata

    Abstract: Retrieving adverbs that describe an action in a video poses a crucial step towards fine-grained video understanding. We propose a framework for video-to-adverb retrieval (and vice versa) that aligns video embeddings with their matching compositional adverb-action text embedding in a joint embedding space. The compositional adverb-action text embedding is learned using a residual gating mechanism,…

    Submitted 26 September, 2023; originally announced September 2023.

    Comments: BMVC 2023 (Oral)

  2. arXiv:2309.03869  [pdf, other]

    cs.CV

    Text-to-feature diffusion for audio-visual few-shot learning

    Authors: Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata

    Abstract: Training deep learning models for video classification from audio-visual data commonly requires immense amounts of labeled training data collected via a costly process. A challenging and underexplored, yet much cheaper, setup is few-shot learning from video data. In particular, the inherently multi-modal nature of video data with sound and visual information has not been leveraged extensively for…

    Submitted 7 September, 2023; originally announced September 2023.

    Comments: DAGM GCPR 2023

  3. arXiv:2209.02536  [pdf, other]

    cs.CV cs.AI

    Semantic Image Synthesis with Semantically Coupled VQ-Model

    Authors: Stephan Alaniz, Thomas Hummel, Zeynep Akata

    Abstract: Semantic image synthesis enables control over unconditional image generation by allowing guidance on what is being generated. We conditionally synthesize the latent space from a vector quantized model (VQ-model) pre-trained to autoencode images. Instead of training an autoregressive Transformer on separately learned conditioning latents and image latents, we find that jointly learning the conditio…

    Submitted 6 September, 2022; originally announced September 2022.

    Comments: ICLR 2022 DGM4HSD

  4. arXiv:2207.09966  [pdf, other]

    cs.CV

    Temporal and cross-modal attention for audio-visual zero-shot learning

    Authors: Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata

    Abstract: Audio-visual generalised zero-shot learning for video classification requires understanding the relations between the audio and visual information in order to be able to recognise samples from novel, previously unseen classes at test time. The natural semantic and temporal alignment between audio and visual data in video data can be exploited to learn powerful representations that generalise to un…

    Submitted 20 July, 2022; originally announced July 2022.

    Comments: ECCV 2022

  5. arXiv:2105.01517  [pdf, other]

    cs.CV cs.AI cs.LG

    Where and When: Space-Time Attention for Audio-Visual Explanations

    Authors: Yanbei Chen, Thomas Hummel, A. Sophia Koepke, Zeynep Akata

    Abstract: Explaining the decision of a multi-modal decision-maker requires to determine the evidence from both modalities. Recent advances in XAI provide explanations for models trained on still images. However, when it comes to modeling multiple sensory modalities in a dynamic world, it remains underexplored how to demystify the mysterious dynamics of a complex multi-modal model. In this work, we take a cr…

    Submitted 4 May, 2021; originally announced May 2021.

  6. arXiv:2006.13546  [pdf]

    cs.NE cs.CL cs.LG

    Crossmodal Language Grounding in an Embodied Neurocognitive Model

    Authors: Stefan Heinrich, Yuan Yao, Tobias Hinz, Zhiyuan Liu, Thomas Hummel, Matthias Kerzel, Cornelius Weber, Stefan Wermter

    Abstract: Human infants are able to acquire natural language seemingly easily at an early age. Their language learning seems to occur simultaneously with learning other cognitive functions as well as with playful interactions with the environment and caregivers. From a neuroscientific perspective, natural language is embodied, grounded in most, if not all, sensory and sensorimotor modalities, and acquired b…

    Submitted 16 October, 2020; v1 submitted 24 June, 2020; originally announced June 2020.

    Journal ref: Frontiers in Neurorobotics, vol 14(52), 2020