Showing 1–12 of 12 results for author: Shvetsova, N

  1. arXiv:2310.04900

    cs.CV

    HowToCaption: Prompting LLMs to Transform Video Annotations at Scale

    Authors: Nina Shvetsova, Anna Kukleva, Xudong Hong, Christian Rupprecht, Bernt Schiele, Hilde Kuehne

    Abstract: Instructional videos are an excellent source for learning multimodal representations by leveraging video-subtitle pairs extracted with automatic speech recognition systems (ASR) from the audio signal in the videos. However, in contrast to human-annotated captions, both speech and subtitles naturally differ from the visual content of the videos and thus provide only noisy supervision for multimodal…

    Submitted 7 October, 2023; originally announced October 2023.

    Comments: https://github.com/ninatu/howtocaption

  2. arXiv:2309.08928

    cs.CV

    In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval

    Authors: Nina Shvetsova, Anna Kukleva, Bernt Schiele, Hilde Kuehne

    Abstract: Large-scale noisy web image-text datasets have been proven to be efficient for learning robust vision-language models. However, when transferring them to the task of video retrieval, models still need to be fine-tuned on hand-curated paired text-video data to adapt to the diverse styles of video descriptions. To address this problem without the need for hand-annotated pairs, we propose a new setti…

    Submitted 16 September, 2023; originally announced September 2023.

    Comments: Published at ICCV 2023, code: https://github.com/ninatu/in_style

  3. arXiv:2308.13077

    cs.CV

    Preserving Modality Structure Improves Multi-Modal Learning

    Authors: Swetha Sirnam, Mamshad Nayeem Rizve, Nina Shvetsova, Hilde Kuehne, Mubarak Shah

    Abstract: Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings in a joint multi-modal representation space without relying on human annotations. These joint embeddings enable zero-shot cross-modal tasks like retrieval and classification. However, these methods often struggle to generalize well on out-of-domain data as they ignore the semantic struct…

    Submitted 24 August, 2023; originally announced August 2023.

    Comments: Accepted at ICCV 2023

  4. arXiv:2303.16990

    cs.CV

    What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

    Authors: Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Daniel Kondermann, Samuel Thomas, Shih-Fu Chang, Rogerio Feris, James Glass, Hilde Kuehne

    Abstract: Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only. Models for this task are usually trained with human-annotated sentences and bounding box supervision. This work addresses this task from a multimodal supervision perspective, proposing a framework for spatio-temporal action grounding trained on loose video an…

    Submitted 28 May, 2024; v1 submitted 29 March, 2023; originally announced March 2023.

    Comments: To be presented at CVPR 2024. Project page: https://brian7685.github.io/STG/

  5. arXiv:2303.08914

    cs.CV

    MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge

    Authors: Wei Lin, Leonid Karlinsky, Nina Shvetsova, Horst Possegger, Mateusz Kozinski, Rameswar Panda, Rogerio Feris, Hilde Kuehne, Horst Bischof

    Abstract: Large scale Vision-Language (VL) models have shown tremendous success in aligning representations between visual and text modalities. This enables remarkable progress in zero-shot recognition, image generation & editing, and many other exciting tasks. However, VL models tend to over-represent objects while paying much less attention to verbs, and require additional tuning on video data for best ze…

    Submitted 22 July, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

    Comments: Accepted at ICCV 2023

  6. arXiv:2301.02009

    cs.CV

    Learning by Sorting: Self-supervised Learning with Group Ordering Constraints

    Authors: Nina Shvetsova, Felix Petersen, Anna Kukleva, Bernt Schiele, Hilde Kuehne

    Abstract: Contrastive learning has become an important tool in learning representations from unlabeled data mainly relying on the idea of minimizing distance between positive data pairs, e.g., views from the same images, and maximizing distance between negative data pairs, e.g., views from different images. This paper proposes a new variation of the contrastive learning objective, Group Ordering Constraints…

    Submitted 18 August, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

    Comments: Published at ICCV 2023, Code @ https://github.com/ninatu/learning_by_sorting

  7. arXiv:2210.03625

    cs.CL cs.CV cs.MM

    C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

    Authors: Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass

    Abstract: Multilingual text-video retrieval methods have improved significantly in recent years, but the performance for other languages lags behind English. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in differen…

    Submitted 9 May, 2023; v1 submitted 7 October, 2022; originally announced October 2022.

    Comments: Accepted at ICASSP 2023. The code, models, and dataset are available at https://github.com/roudimit/c2kd

  8. arXiv:2209.06103

    cs.CV cs.AI cs.CL

    VL-Taboo: An Analysis of Attribute-based Zero-shot Capabilities of Vision-Language Models

    Authors: Felix Vogel, Nina Shvetsova, Leonid Karlinsky, Hilde Kuehne

    Abstract: Vision-language models trained on large, randomly collected data had significant impact in many areas since they appeared. But as they show great performance in various fields, such as image-text-retrieval, their inner workings are still not fully understood. The current work analyses the true zero-shot capabilities of those models. We start from the analysis of the training corpus assessing to wh…

    Submitted 12 September, 2022; originally announced September 2022.

  9. arXiv:2208.01956

    cs.CV

    Augmentation Learning for Semi-Supervised Classification

    Authors: Tim Frommknecht, Pedro Alves Zipf, Quanfu Fan, Nina Shvetsova, Hilde Kuehne

    Abstract: Recently, a number of new Semi-Supervised Learning methods have emerged. As the accuracy for ImageNet and similar datasets increased over time, the performance on tasks beyond the classification of natural images is yet to be explored. Most Semi-Supervised Learning methods rely on a carefully manually designed data augmentation pipeline that is not transferable for learning on images of other doma…

    Submitted 3 August, 2022; originally announced August 2022.

    Comments: Accepted to GCPR 2022, 13 pages with 4 figures

  10. arXiv:2112.04446

    cs.CV cs.CL cs.SD eess.AS

    Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

    Authors: Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne

    Abstract: Multi-modal learning from video data has seen increased attention recently as it allows to train semantically meaningful embeddings without human annotation enabling tasks like zero-shot retrieval and classification. In this work, we present a multi-modal, modality agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text,…

    Submitted 18 August, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

    Comments: CVPR 2022. The final published version of the proceedings will be available on IEEE Xplore

  11. arXiv:2112.00775

    cs.CV

    Routing with Self-Attention for Multimodal Capsule Networks

    Authors: Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah

    Abstract: The task of multimodal learning has seen a growing interest recently as it allows for training neural architectures based on different modalities such as vision, text, and audio. One challenge in training such models is that they need to jointly learn semantic concepts and their relationships across different input representations. Capsule networks have been shown to perform well in context of cap…

    Submitted 1 December, 2021; originally announced December 2021.

  12. Anomaly Detection in Medical Imaging with Deep Perceptual Autoencoders

    Authors: Nina Shvetsova, Bart Bakker, Irina Fedulova, Heinrich Schulz, Dmitry V. Dylov

    Abstract: Anomaly detection is the problem of recognizing abnormal inputs based on the seen examples of normal data. Despite recent advances of deep learning in recognizing image anomalies, these methods still prove incapable of handling complex medical images, such as barely visible abnormalities in chest X-rays and metastases in lymph nodes. To address this problem, we introduce a new powerful method of i…

    Submitted 13 September, 2021; v1 submitted 23 June, 2020; originally announced June 2020.

    Comments: The final authenticated publication is available online at https://ieeexplore.ieee.org/abstract/document/9521238

    Journal ref: IEEE Access, vol. 9, pp. 118571-118583, 2021