Showing 1–2 of 2 results for author: Avraham, E B

  1. arXiv:2402.05472  [pdf, other]

    cs.CV

    Question Aware Vision Transformer for Multimodal Reasoning

    Authors: Roy Ganz, Yair Kittenplon, Aviad Aberdam, Elad Ben Avraham, Oren Nuriel, Shai Mazor, Ron Litman

    Abstract: Vision-Language (VL) models have gained significant research focus, enabling remarkable advances in multimodal reasoning. These architectures typically comprise a vision encoder, a Large Language Model (LLM), and a projection module that aligns visual features with the LLM's representation space. Despite their success, a critical limitation persists: the vision encoding process remains decoupled f…

    Submitted 8 February, 2024; originally announced February 2024.

  2. arXiv:2401.03411  [pdf, other]

    cs.CL cs.CV

    GRAM: Global Reasoning for Multi-Page VQA

    Authors: Tsachi Blau, Sharon Fogel, Roi Ronen, Alona Golts, Roy Ganz, Elad Ben Avraham, Aviad Aberdam, Shahar Tsiper, Ron Litman

    Abstract: The increasing use of transformer-based large language models brings forward the challenge of processing long sequences. In document visual question answering (DocVQA), leading methods focus on the single-page setting, while documents can span hundreds of pages. We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting, without requiring computation…

    Submitted 18 March, 2024; v1 submitted 7 January, 2024; originally announced January 2024.