Showing 1–2 of 2 results for author: Ivaniuta, A

  1. arXiv:2404.01911  [pdf, other]

    cs.CV

    VLRM: Vision-Language Models act as Reward Models for Image Captioning

    Authors: Maksim Dzabraev, Alexander Kunitsyn, Andrei Ivaniuta

    Abstract: In this work, we present an unsupervised method for enhancing an image captioning model (in our case, BLIP2) using reinforcement learning and vision-language models like CLIP and BLIP2-ITM as reward models. The RL-tuned model is able to generate longer and more comprehensive descriptions. Our model reaches an impressive 0.90 R@1 CLIP Recall score on the MS-COCO Karpathy Test Split. Weights are availabl…

    Submitted 2 April, 2024; originally announced April 2024.

  2. arXiv:2203.07086  [pdf, ps, other]

    cs.CV

    MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization

    Authors: Alexander Kunitsyn, Maksim Kalashnikov, Maksim Dzabraev, Andrei Ivaniuta

    Abstract: In this work, we present a new state of the art on the text-to-video retrieval task on MSR-VTT, LSMDC, MSVD, YouCook2 and TGIF, obtained by a single model. Three different data sources are combined: weakly-supervised videos, crowd-labeled text-image pairs and text-video pairs. A careful analysis of available pre-trained networks helps to choose the best prior-knowledge ones. We introduce three-stage…

    Submitted 14 March, 2022; originally announced March 2022.