Showing 1–2 of 2 results for author: Akshulakov, R

  1. arXiv:2401.05224  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Do Vision and Language Encoders Represent the World Similarly?

    Authors: Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Mohamed El Amine Seddik, Karttikeya Mangalam, Noel E. O'Connor

    Abstract: Aligned text-image encoders such as CLIP have become the de facto model for vision-language tasks. Furthermore, modality-specific encoders achieve impressive performances in their respective domains. This raises a central question: does an alignment exist between uni-modal vision and language encoders since they fundamentally represent the same physical world? Analyzing the latent spaces structure…

    Submitted 22 March, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

    Comments: Accepted CVPR 2024

  2. arXiv:2308.09126  [pdf, other]

    cs.CV cs.AI cs.CL

    EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding

    Authors: Karttikeya Mangalam, Raiymbek Akshulakov, Jitendra Malik

    Abstract: We introduce EgoSchema, a very long-form video question-answering dataset, and benchmark to evaluate long video understanding capabilities of modern vision and language systems. Derived from Ego4D, EgoSchema consists of over 5000 human curated multiple choice question answer pairs, spanning over 250 hours of real video data, covering a very broad range of natural human activity and behavior. For e…

    Submitted 17 August, 2023; originally announced August 2023.

    Comments: https://egoschema.github.io/