Showing 1–3 of 3 results for author: Ging, S

Searching in archive cs.
  1. arXiv:2402.07270  [pdf, other]

    cs.CV cs.CL cs.LG

    Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy

    Authors: Simon Ging, María A. Bravo, Thomas Brox

    Abstract: The evaluation of text-generative vision-language models is a challenging yet crucial endeavor. By addressing the limitations of existing Visual Question Answering (VQA) benchmarks and proposing innovative evaluation methodologies, our research seeks to advance our understanding of these models' capabilities. We propose a novel VQA benchmark based on well-known visual classification datasets which…

    Submitted 5 May, 2024; v1 submitted 11 February, 2024; originally announced February 2024.

    Comments: Accepted as Spotlight Paper for ICLR 2024. The first two authors contributed equally to this work

  2. arXiv:2211.12914  [pdf, other]

    cs.CV cs.LG

    Open-vocabulary Attribute Detection

    Authors: María A. Bravo, Sudhanshu Mittal, Simon Ging, Thomas Brox

    Abstract: Vision-language modeling has enabled open-vocabulary tasks where predictions can be queried using any text prompt in a zero-shot manner. Existing open-vocabulary tasks focus on object classes, whereas research on object attributes is limited due to the lack of a reliable attribute-focused evaluation benchmark. This paper introduces the Open-Vocabulary Attribute Detection (OVAD) task and the corres…

    Submitted 8 March, 2023; v1 submitted 23 November, 2022; originally announced November 2022.

    Comments: Accepted at CVPR 2023. https://ovad-benchmark.github.io

  3. arXiv:2011.00597  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning

    Authors: Simon Ging, Mohammadreza Zolfaghari, Hamed Pirsiavash, Thomas Brox

    Abstract: Many real-world video-text tasks involve different levels of granularity, such as frames and words, clip and sentences or videos and paragraphs, each with distinct semantics. In this paper, we propose a Cooperative hierarchical Transformer (COOT) to leverage this hierarchy information and model the interactions between different levels of granularity and different modalities. The method consists o…

    Submitted 1 November, 2020; originally announced November 2020.

    Comments: 27 pages, 5 figures, 19 tables. To be published in the 34th conference on Neural Information Processing Systems (NeurIPS 2020). The first two authors contributed equally to this work

    ACM Class: I.2.7; I.2.10