ColPali: Efficient Document Retrieval with Vision Language Models

Manuel Faysse1,3  Hugues Sibille1,4  Tony Wu1
Gautier Viaud1  Céline Hudelot3  Pierre Colombo2,3
1Illuin Technology  2Equall.ai
3CentraleSupélec, Paris-Saclay  4ETH Zürich  
[email protected]
Abstract

Documents are visually rich structures that convey information through text, as well as tables, figures, page layouts, or fonts. While modern document retrieval systems exhibit strong performance on query-to-text matching, they struggle to exploit visual cues efficiently, hindering their performance on practical document retrieval applications such as Retrieval Augmented Generation. To benchmark current systems on visually rich document retrieval, we introduce the Visual Document Retrieval Benchmark ViDoRe, composed of various page-level retrieving tasks spanning multiple domains, languages, and settings. The inherent shortcomings of modern systems motivate the introduction of a new retrieval model architecture, ColPali, which leverages the document understanding capabilities of recent Vision Language Models to produce high-quality contextualized embeddings solely from images of document pages. Combined with a late interaction matching mechanism, ColPali largely outperforms modern document retrieval pipelines while being drastically faster and end-to-end trainable. We release all project artifacts at https://huggingface.co/vidore.


1 Introduction

Figure 1: For each term in a user query, ColPali identifies the most relevant document image patches (highlighted zones) and computes a query-to-page matching score. We can then swiftly retrieve the most relevant documents from a large pre-indexed corpus.
Figure 2: ColPali simplifies document retrieval w.r.t. standard retrieval methods while achieving stronger performances with better latencies. Latencies and results are detailed in section 5 and subsection B.5.

Document Retrieval consists in matching a user query to relevant documents in a given corpus. It is central to many industrial applications, either as a standalone ranking system (search engines) or as part of more complex information extraction or Retrieval Augmented Generation (RAG) pipelines.

Over recent years, pretrained language models have enabled large improvements in text embedding models. In practical industrial settings, however, the main performance bottleneck for efficient document retrieval is not in embedding model performance but in the prior data ingestion pipeline. To index a standard PDF document, many steps are required. First, PDF parsers or Optical Character Recognition (OCR) systems are used to extract words from the pages. Document layout detection models can then be run to segment paragraphs, titles, and other page objects such as tables, figures, and headers. A chunking strategy is then defined to group text passages with some semantic coherence, and modern retrieval setups may even integrate a captioning step to describe visually rich elements in a natural language form, more suitable for embedding models. In our experiments (Table 2), we typically find that optimizing the ingestion pipeline yields much greater performance on visually rich document retrieval than optimizing the text embedding model.

Contribution 1: ViDoRe. In this work, we argue that document retrieval systems should not be evaluated solely on the capabilities of text embedding models (Bajaj et al., 2016; Thakur et al., 2021; Muennighoff et al., 2022), but should also consider the context and visual elements of the documents to be retrieved. To this end, we create and openly release ViDoRe, a comprehensive benchmark to evaluate systems on page-level document retrieval with a wide coverage of domains, visual elements, and languages. ViDoRe targets practical document retrieval settings, in which user queries may require both textual and visual understanding to be correctly matched to relevant documents. We highlight the shortcomings of current text-centric systems in these settings. (Footnote 1: The benchmark leaderboard is hosted publicly at https://huggingface.co/spaces/vidore/vidore-leaderboard to encourage further developments.)

Contribution 2: ColPali. We propose a novel model architecture and training strategy based on Vision Language Models (VLMs) to efficiently index documents purely from their visual features, allowing for subsequent fast query matching with late interaction mechanisms (Khattab and Zaharia, 2020). Our method, ColPali, outperforms all other retrieval systems on ViDoRe while being fast and end-to-end trainable. We release models and code at https://huggingface.co/vidore.

2 Problem Formulation & Related Work

Problem Setting. In our setting, a retrieval system scores how relevant a document $d$ from corpus $\mathcal{D}$ is with respect to a query $q$. Computing the similarity score $s(q,d) \in \mathbb{R}_{+}$ for each of the $|\mathcal{D}|$ documents in the corpus creates a ranking we can use to extract the most relevant documents. In this work, we focus on page-level retrieval: given a query, is the correct document page retrieved by the system? For coherence with existing literature, we further use the term document to refer to individual pages, i.e. the atomic retrieved elements in our setting. As we focus on practical industrial retrieval applications (RAG, search engines) with potentially large corpora sizes, latency constraints are imposed on scoring systems. Most current retrieval systems can be decomposed into (1) an offline indexation phase in which a document index is built and (2) an online querying phase in which a query is matched to documents from the index and where low latency is vital to the user experience.

Efficient document retrieval systems exhibit joint properties of high retrieval performance (R1), low latency during querying (R2), and high throughput during indexation (R3).

2.1 Textual Retrieval Methods

Document Retrieval in Text Space. Statistical methods based on word frequency like TF-IDF (Sparck Jones, 1972) and BM25 (Robertson et al., 1994) are still widely used due to their simplicity and efficiency. More recently, neural embedding models based on fine-tuned large language models display state-of-the-art performance on a variety of text embedding tasks and top the retrieval leaderboards (Muennighoff et al., 2022).

Neural Retrievers. In bi-encoder models (Reimers and Gurevych, 2019; Karpukhin et al., 2020; Wang et al., 2022), documents are independently mapped offline to a dense vector space. Queries are embedded online and matched to documents through a fast cosine distance computation. A slower, but slightly more performant alternative, cross-encoder systems (Wang et al., 2020; Cohere, 2024) concatenate query and document as a single input sequence and iteratively attribute matching scores to each possible combination. This enables full attention computation between query and document terms but comes at the cost of computational efficiency, as $|\mathcal{D}|$ encoding passes must be done online.
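
As an illustration of the bi-encoder matching step described above, the following minimal sketch ranks pre-computed document vectors against a query vector by cosine similarity. The function names and the toy two-dimensional vectors are hypothetical and stand in for real model embeddings; this is not the implementation of any of the cited systems.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Toy, hand-made embeddings (hypothetical values, not model outputs).
doc_index = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]  # computed offline, one vector per document
query = [0.0, 2.0]                                # computed online, one vector per query
ranking = rank_documents(query, doc_index)
```

Only the query needs to be encoded at query time, which is why this scheme is fast; a cross-encoder would instead run one model pass per query-document pair.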

Multi-Vector retrieval via late interaction. In the late interaction paradigm (Khattab and Zaharia, 2020), an embedding is pre-computed and indexed per document token. At runtime, similarity can be computed with individual query token embeddings. The idea is to benefit from the rich interaction between individual query and document terms while taking advantage of the offline computation and fast query matching enabled by bi-encoders.

Retrieval Evaluation. Although benchmarks and leaderboards have been developed to evaluate text embedding models (Thakur et al., 2021; Muennighoff et al., 2022), as previously stated, much of the performance improvement in industrial use cases of embedding models stems from the prior data ingestion pipeline. While documents often rely on visual elements to more efficiently convey information to human readers, text-only systems barely tap into these visual cues.

To our knowledge, no benchmark evaluates document retrieval methods by considering both textual and visual document features like a human would.

2.2 Integrating Visual features

Contrastive Vision Language Models. Mapping latent representations of textual content to corresponding representations of visual content has been done by aligning disjoint visual and text encoders through contrastive losses (Radford et al., 2021; Zhai et al., 2023). While some OCR capabilities exist in these models, the visual component is often not optimized for text understanding. The Fine-grained Interactive Language-Image Pre-training (Yao et al., 2021) framework extends the late interaction mechanism to cross-modal vision-language models, relying on max similarity operations between text tokens and image patches.

Visually Rich Document Understanding. To go beyond text, some document-focused models jointly encode text tokens alongside visual or document layout features (Appalaraju et al., 2021; Kim et al., 2021; Huang et al., 2022; Tang et al., 2022). Large Language transformer Models (LLMs) with strong reasoning capabilities have recently been combined with Vision Transformers (ViTs) (Dosovitskiy et al., 2020) to create VLMs (Alayrac et al., 2022; Liu et al., 2023b; Bai et al., 2023; Laurençon et al., 2024) where image patch vectors from contrastively trained ViT models (Zhai et al., 2023) are fed as input embeddings to the language model and concatenated with the text-token embeddings.

PaliGemma. The PaliGemma-3B model (Lucas Beyer* et al., 2024) extends concepts from Pali3 (Chen et al., 2023), and projects SigLIP-So400m/14 (Alabdulmohsin et al., 2023) patch embeddings into Gemma-2B’s text vector space (Gemma Team et al., 2024). Along with its reasonable size w.r.t. other performant VLMs, an interesting property of PaliGemma’s text model is that it is fine-tuned with full-block attention on the prefix (instruction text and image tokens).

VLMs display enhanced capabilities in Visual Question Answering, captioning, and document understanding (Yue et al., 2023), but are not optimized for retrieval tasks.

3 The ViDoRe Benchmark

Existing benchmarks for contrastive vision-language models primarily evaluate retrieval for natural images (Lin et al., 2014; Borchmann et al., 2021; Thapliyal et al., 2022). On the other hand, textual retrieval benchmarks (Muennighoff et al., 2022) are evaluated at the textual passage level and are not tailored for document retrieval tasks. We fill the gap with ViDoRe, a comprehensive benchmark for document retrieval using visual features.

3.1 Benchmark Design

ViDoRe is designed to comprehensively evaluate retrieval systems on their capacity to match queries to relevant documents at the page level. This benchmark encompasses multiple orthogonal subtasks, with focuses on various modalities - text, figures, infographics, tables; thematic domains - medical, business, scientific, administrative; or languages - English (eng), French (fra).

| Dataset | # Queries | Domain |
|---|---|---|
| Academic Tasks | | |
| DocVQA (eng) | 500 (500) | Industrial |
| InfoVQA (eng) | 500 (500) | Infographics |
| TAT-DQA (eng) | 1600 (1600) | Varied Modalities |
| arXiVQA (eng) | 500 (500) | Scientific Figures |
| TabFQuAD (fra) | 210 (210) | Tables |
| Practical Tasks | | |
| Energy (eng) | 100 (1000) | Scientific |
| Government (eng) | 100 (1000) | Administrative |
| Healthcare (eng) | 100 (1000) | Medical |
| AI (eng) | 100 (1000) | Scientific |
| Shift Project (fra) | 100 (1000) | Environment |
Table 1: ViDoRe comprehensively evaluates multimodal retrieval methods. The size of the document corpus is indicated in parentheses.

Academic Tasks. We repurpose widely used visual question-answering benchmarks for retrieval tasks: for each page-question-answer triplet, we use the question as the query, and the associated page as the gold document (Table 1). These academic datasets either focus on single specific modalities (Mathew et al., 2020, 2021; Li et al., 2024) or target more varied visually rich documents (Zhu et al., 2022). Moreover, we consider TabFQuAD, a human-labeled dataset on tables extracted from French industrial PDF documents released with this work. Details can be found in subsection A.1.

Practical tasks. We construct topic-specific retrieval benchmarks spanning multiple domains to go beyond repurposed QA datasets and evaluate retrieval in more realistic industrial situations (e.g. RAG). To achieve this, we collect publicly accessible PDF documents and generate queries pertaining to document pages using Claude-3 Sonnet, a high-quality proprietary vision-language model (Anthropic, 2024). In total, we collect 1,000 document pages per topic, which we associate with 100 queries extensively filtered for quality and relevance by human annotators. The corpus topics are intentionally specific to maximize syntactic proximity between documents, creating challenging retrieval tasks and covering an array of orthogonal domains (Table 1). Query-page pair examples are shown in Appendix E. (Footnote 2: Answers are generated alongside queries to (1) ground queries and improve their quality and (2) provide resources to foster future work.)

Evaluation Metrics. We evaluate performance on our benchmark (Requirement R1) using standard metrics from the retrieval literature (NDCG, Recall@K, MRR). We report NDCG@5 values as the main performance metric in this work and release the complete sets of results along with the models (Footnote 3: https://huggingface.co/vidore). To validate compliance with practical industrial constraints, we also consider query latencies (R2) and indexing throughputs (R3).
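
For readers unfamiliar with the main metric, the sketch below computes NDCG@k for a single query from the relevance labels of the ranked results. It assumes the common linear-gain, log2-discount formulation and binary relevance labels; the exact variant used in the benchmark code is not specified in this section, and the function names are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=5):
    """NDCG@k for one query; ranked_relevances[i] is the relevance of the i-th result."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant document exists for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg

# One gold page per query, as in the repurposed VQA tasks (toy rankings).
gold_first = ndcg_at_k([1, 0, 0, 0, 0, 0])
gold_third = ndcg_at_k([0, 0, 1, 0, 0, 0])
gold_sixth = ndcg_at_k([0, 0, 0, 0, 0, 1])
```

With a single gold page, NDCG@5 reduces to 1 / log2(rank + 1) when the gold page is in the top 5 and to 0 otherwise, so the metric rewards placing the correct page as high as possible.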

3.2 Assessing Current Systems

Unstructured. We evaluate retrieval systems representative of those found in standard industrial RAG pipelines. As is common practice, we rely on the Unstructured (Footnote 4: www.unstructured.io) off-the-shelf tool in the highest resolution settings to construct high-quality text chunks from PDF documents. Unstructured orchestrates the document parsing pipeline, relying on deep learning vision models to detect titles and document layouts (Ge et al., 2021), OCR engines (Smith, 2007) to extract text in non-native PDFs, specialized methods or models to detect and reconstruct tables, and implements a chunking strategy (by-title) that leverages the detected document structure to preserve section boundaries when concatenating texts. As is common practice, in our simplest Unstructured configuration (text-only), only textual elements are kept, and figures, images, and tables are considered noisy information and are filtered out.

Unstructured + X. While Unstructured is a strong baseline by itself, we further augment Unstructured’s output by integrating the visual elements. In (+ OCR), tables, charts, and images are run through an OCR engine, processed by Unstructured, and chunked independently. In (+ Captioning), we set up a fully-fledged captioning strategy (Zhao et al., 2023), in which we feed visual elements to a strong proprietary Vision Language Model (Claude-3 Sonnet (Anthropic, 2024)) to obtain highly detailed textual descriptions of the elements. Both strategies aim to integrate visual elements in the retrieval pipeline but incur significant latency and resource costs (subsection 5.2).

Embedding Model. To embed textual chunks, we evaluate Okapi BM25, the de facto standard sparse statistical retrieval method, and the dense encoder of BGE-M3 (Chen et al., 2024), a multilingual neural method with SOTA performance in its size category. Chunks are embedded and scored independently, and page-level scores are obtained by max-pooling over the page’s chunk scores. (Footnote 5: We empirically validated the max-pooling strategy over sub-page chunks to be more effective than concatenating all page chunks before embedding pagewise.)
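
The page-level max-pooling step can be sketched as follows. The function name and the page identifiers are hypothetical, and the chunk scores are toy values; the sketch only illustrates the aggregation rule stated above.

```python
def page_scores_from_chunks(chunk_scores, chunk_to_page):
    """Aggregate chunk-level scores into page-level scores by taking the max per page."""
    page_scores = {}
    for score, page in zip(chunk_scores, chunk_to_page):
        if page not in page_scores or score > page_scores[page]:
            page_scores[page] = score
    return page_scores

# Toy example: two chunks come from page "p1", one from page "p2".
scores = page_scores_from_chunks([0.2, 0.9, 0.4], ["p1", "p1", "p2"])
best_page = max(scores, key=scores.get)
```

Under this rule a page is ranked by its single best-matching chunk, so one highly relevant passage is enough to surface the page.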

Contrastive VLMs. We also evaluate the strongest available vision-language embedding models: Jina CLIP (Koukounas et al., 2024), Nomic Embed Vision (Nomic, 2024), and SigLIP-So400m/14 (Alabdulmohsin et al., 2023).

Results. From a performance perspective, best results are obtained by combining the Unstructured parser with visual information, either from captioning strategies or by running OCR on the visual elements (Table 2). Little difference is seen between BM25 and BGE-M3 embeddings, highlighting the visual information bottleneck. Contrastive VLMs lag behind. Beyond retrieval performance (R1), the indexing latencies (R3) reported in Figure 3 illustrate that PDF parsing pipelines can be very lengthy, especially when incorporating OCR or captioning strategies. Querying latencies at runtime (R2) are very good for all evaluated systems ($\leq 22$ ms on NVIDIA L4) due to fast query encoding and cosine similarity matching.

Figure 3: Offline indexing with ColPali is much simpler and faster compared to standard retrieval methods. Indexing speeds reported are computed on Nvidia L4 GPUs and detailed in subsection B.5.

4 Late interaction based Vision Retrieval

4.1 Architecture

Vision-Language Models. Encouraged by their strong document understanding capabilities, we propose adapting recent VLMs for retrieval. The key concept is to leverage the alignment between output embeddings of text and image tokens acquired during multi-modal finetuning. To this end, we introduce ColPali, a PaliGemma-3B extension that is capable of generating ColBERT-style multi-vector representations of text and images (Figure 2). PaliGemma-3B is a strong candidate due to its small size, the many released checkpoints fine-tuned for different image resolutions and tasks, and the promising performances on various document understanding benchmarks. We add a projection layer to map the output language modeling embeddings to a vector space of reduced dimension $D=128$ as used in the ColBERT paper (Khattab and Zaharia, 2020) to keep lightweight bag-of-embedding representations.

Late Interaction. Given query $q$ and document $d$, we denote as $\mathbf{E_q} \in \mathbb{R}^{N_q \times D}$ and $\mathbf{E_d} \in \mathbb{R}^{N_d \times D}$ their respective multi-vector representations in the common embedding space $\mathbb{R}^{D}$. The late interaction operator, $\text{LI}(q,d)$, is the sum over all query vectors $\mathbf{E_q}^{(i)}$ of their maximum dot product $\langle \cdot | \cdot \rangle$ with each of the $N_d$ document embedding vectors $\mathbf{E_d}_{(1:N_d)}$.

$$\text{LI}(q,d) = \sum_{i \in [|1,N_q|]} \max_{j \in [|1,N_d|]} \langle \mathbf{E_q}^{(i)} | \mathbf{E_d}^{(j)} \rangle \qquad (1)$$
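
The late interaction operator of Equation 1 can be written in a few lines. The sketch below is a plain-Python illustration on toy two-dimensional vectors (the real embeddings have dimension D = 128); the function name and the example values are hypothetical, and production systems would use batched tensor operations instead of nested loops.

```python
def late_interaction(query_embs, doc_embs):
    """LI(q, d): for each query vector, take the max dot product over all
    document vectors, then sum these maxima over the query vectors."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q_vec, d_vec) for d_vec in doc_embs) for q_vec in query_embs)

# Toy multi-vector representations: N_q = 2 query vectors, N_d = 3 document vectors.
E_q = [[1.0, 0.0], [0.0, 1.0]]
E_d = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
score = late_interaction(E_q, E_d)  # 1.0 (first query vector) + 2.0 (second) = 3.0
```

Each query vector is matched to its single best document vector, so the document embeddings can be computed offline and only the cheap max and sum operations remain at query time.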

Contrastive Loss. The Late Interaction operation is fully differentiable, enabling backpropagation. Let a batch $\{q_k, d_k\}_{k \in [|1,b|]}$ be composed of $b$ query-page pairs, where for all $k \in [|1,b|]$, the document page $d_k$ is the document corresponding to query $q_k$. Following Khattab and Zaharia (2020), we define our in-batch contrastive loss $\mathcal{L}$ as the softmaxed cross-entropy of the positive scores $s_k^{+} = \text{LI}(q_k, d_k)$ w.r.t. the maximal negative scores $s_k^{-} = \max_{l, l \neq k} \text{LI}(q_k, d_l)$.
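
To make the loss concrete, the sketch below evaluates it from a matrix of late interaction scores, where entry (k, l) is the score of query k against page l. It relies on the identity that the cross-entropy of a two-way softmax over (s+, s-) with the positive as target equals softplus(s- minus s+). Averaging over the batch is an assumption of this sketch, as are the function names; the section above only states the per-pair definition.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def in_batch_contrastive_loss(scores):
    """scores[k][l] holds LI(q_k, d_l); the diagonal holds the positive pairs.
    Requires a batch of at least two pairs so that a negative exists."""
    b = len(scores)
    total = 0.0
    for k in range(b):
        positive = scores[k][k]
        hardest_negative = max(scores[k][l] for l in range(b) if l != k)
        total += softplus(hardest_negative - positive)
    return total / b  # batch mean: an assumption of this sketch

# Toy score matrix for a batch of two query-page pairs.
loss = in_batch_contrastive_loss([[2.0, 0.0], [0.0, 2.0]])
```

The loss shrinks toward zero as each positive score exceeds its hardest in-batch negative by a growing margin, and equals log 2 when the two scores are tied.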

4.2 Model training

Dataset. Our training dataset of 127,460 query-page pairs comprises the train sets of openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents, augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%). Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. (Footnote 6: Multilingual data is present in the pretraining corpus of the language model (Gemma-2B) and potentially occurs during PaliGemma-3B’s multimodal training.) We explicitly verify that no multi-page PDF document is used both in ViDoRe and in the train set to prevent evaluation contamination. A validation set is created with 2% of the samples to tune hyperparameters.

Parameters. All models are trained for 1 epoch on the train set. Unless specified otherwise, we train models in bfloat16 format, use low-rank adapters (LoRA, Hu et al. (2021)) with $\alpha=32$ and $r=32$ on the transformer layers from the language model, as well as the final randomly initialized projection layer, and use a paged_adamw_8bit optimizer. We train on an 8 GPU setup with data parallelism, a learning rate of 5e-5 with linear decay with 2.5% warmup steps, and a batch size of 32.
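
The learning rate schedule described above can be sketched as a simple function of the training step. The peak value (5e-5) and the warmup fraction (2.5%) come from the text; the linear warmup from zero, the decay down to exactly zero, and the total step count used in the example are assumptions of this sketch rather than details confirmed by the paper.

```python
def learning_rate(step, total_steps, peak_lr=5e-5, warmup_fraction=0.025):
    """Linear warmup to peak_lr over the first warmup_fraction of steps,
    followed by linear decay to zero at total_steps (assumed shape)."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / (total_steps - warmup_steps))

# Hypothetical run of 4000 optimizer steps, giving 100 warmup steps.
lr_start = learning_rate(0, 4000)
lr_mid_warmup = learning_rate(50, 4000)
lr_peak = learning_rate(100, 4000)
lr_end = learning_rate(4000, 4000)
```

This mirrors the shape of common "linear schedule with warmup" utilities in training libraries, which could be used directly in practice.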

Query Augmentation. As in Khattab and Zaharia (2020), we append 5 <unused0> tokens to the query tokens to serve as a soft, differentiable query expansion or re-weighting mechanism.

5 Results

| Method | ArxivQ | DocQ | InfoQ | TabF | TATQ | Shift | AI | Energy | Gov. | Health. | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Unstructured Text only | | | | | | | | | | | |
| BM25 | - | 34.1 | - | - | 44.0 | 59.6 | 90.4 | 78.3 | 78.8 | 82.6 | - |
| BGE-M3 | - | 28.4 (↓5.7) | - | - | 36.1 (↓7.9) | 68.5 (↑8.9) | 88.4 (↓2.0) | 76.8 (↓1.5) | 77.7 (↓1.1) | 84.6 (↑2.0) | - |
| Unstructured + OCR | | | | | | | | | | | |
| BM25 | 31.6 | 36.8 | 62.9 | 46.5 | 62.7 | 64.3 | 92.8 | 85.9 | 83.9 | 87.2 | 65.5 |
| BGE-M3 | 31.4 (↓0.2) | 25.7 (↓11.1) | 60.1 (↓2.8) | 70.8 (↑24.3) | 50.5 (↓12.2) | 73.2 (↑8.9) | 90.2 (↓2.6) | 83.6 (↓2.3) | 84.9 (↑1.0) | 91.1 (↑3.9) | 66.1 (↑0.6) |
| Unstructured + Captioning | | | | | | | | | | | |
| BM25 | 40.1 | 38.4 | 70.0 | 35.4 | 61.5 | 60.9 | 88.0 | 84.7 | 82.7 | 89.2 | 65.1 |
| BGE-M3 | 35.7 (↓4.4) | 32.9 (↓5.4) | 71.9 (↑1.9) | 69.1 (↑33.7) | 43.8 (↓17.7) | 73.1 (↑12.2) | 88.8 (↑0.8) | 83.3 (↓1.4) | 80.4 (↓2.3) | 91.3 (↑2.1) | 67.0 (↑1.9) |
| Contrastive VLMs | | | | | | | | | | | |
| Jina-CLIP | 25.4 | 11.9 | 35.5 | 20.2 | 3.3 | 3.8 | 15.2 | 19.7 | 21.4 | 20.8 | 17.7 |
| Nomic-vision | 17.1 | 10.7 | 30.1 | 16.3 | 2.7 | 1.1 | 12.9 | 10.9 | 11.4 | 15.7 | 12.9 |
| SigLIP (Vanilla) | 43.2 | 30.3 | 64.1 | 58.1 | 26.2 | 18.7 | 62.5 | 65.7 | 66.1 | 79.1 | 51.4 |
| Ours | | | | | | | | | | | |
| SigLIP (Vanilla) | 43.2 | 30.3 | 64.1 | 58.1 | 26.2 | 18.7 | 62.5 | 65.7 | 66.1 | 79.1 | 51.4 |
| BiSigLIP (+fine-tuning) | 58.5 (↑15.3) | 32.9 (↑2.6) | 70.5 (↑6.4) | 62.7 (↑4.6) | 30.5 (↑4.3) | 26.5 (↑7.8) | 74.3 (↑11.8) | 73.7 (↑8.0) | 74.2 (↑8.1) | 82.3 (↑3.2) | 58.6 (↑7.2) |
| BiPali (+LLM) | 56.5 (↓2.0) | 30.0 (↓2.9) | 67.4 (↓3.1) | 76.9 (↑14.2) | 33.4 (↑2.9) | 43.7 (↑17.2) | 71.2 (↓3.1) | 61.9 (↓11.7) | 73.8 (↓0.4) | 73.6 (↓8.8) | 58.8 (↑0.2) |
| ColPali (+Late Inter.) | 79.1 (↑22.6) | 54.4 (↑24.5) | 81.8 (↑14.4) | 83.9 (↑7.0) | 65.8 (↑32.4) | 73.2 (↑29.5) | 96.2 (↑25.0) | 91.0 (↑29.1) | 92.7 (↑18.9) | 94.4 (↑20.8) | 81.3 (↑22.5) |
Table 2: Comprehensive evaluation of baseline models and our proposed method on ViDoRe. Results are presented using NDCG@5 metrics, and illustrate the impact of different components. Text-only metrics are not computed for benchmarks with only visual elements.

5.1 Performance (R1)

We iteratively construct ColPali, starting from an off-the-shelf SigLIP model (Table 2).

BiSigLIP: Improving a strong model. SigLIP (Footnote 7: https://huggingface.co/google/siglip-so400m-patch14-384) is a strong vision-language bi-encoder model, pretrained on the English split of WebLI (Chen et al., 2023), a corpus of billions of image-text pairs. We find that SigLIP largely outperforms both Jina CLIP and Nomic-vision on document retrieval tasks. Further fine-tuning the textual component of this model on our document-oriented dataset (BiSigLIP) yields clear improvements across the board, particularly on figure retrieval (ArxivQA) and table retrieval tasks (TabFQuAD).

BiPali: Pairing with a language model. In the PaliGemma model architecture, SigLIP-generated patch embeddings are fed to a text language model to obtain LLM contextualized output patch embeddings. (Footnote 8: Note that the SigLIP model used in PaliGemma slightly differs in terms of number of patches - 1024 patches for PaliGemma’s vision encoder, and 729 for the standalone SigLIP model.) We average pool these representations to obtain a single dense vector, effectively creating a PaliGemma bi-encoder model (BiPali). After fine-tuning on the training dataset, we obtain a model that performs slightly worse in English than the tuned BiSigLIP variant. This can be explained by the fact that contrary to SigLIP, the original PaliGemma is not trained on contrastive matching tasks, but rather on next token prediction. Our contrastive fine-tuning phase on 100K images to transform PaliGemma into a bi-encoder is 5 orders of magnitude smaller than SigLIP’s original contrastive training. However, we see notable improvements in French tasks, indicating that BiPali’s LLM (Gemma 2B) helps multilingual text understanding. This is particularly notable as our training dataset does not contain non-English samples.

ColPali: Adding Late Interaction. One benefit of inputting image patch embeddings through a language model is that they are natively mapped to a latent space similar to textual input (query). This enables leveraging the ColBERT strategy to compute interactions between text tokens and image patches, which enables a step-change improvement in performance compared to BiPali. Results in Table 2 show that our ColPali model also largely outperforms the strong baselines based on Unstructured and captioning, as well as all evaluated text-image embedding models. The difference is particularly stark on the more visually complex benchmark tasks, such as InfographicVQA, ArxivQA, and TabFQuAD representing respectively infographics, figures, and tables. However, text-centric documents are also better retrieved by the ColPali models across all evaluated domains and languages, making our approach the overall best-performing document-retrieval model.

Negative Results. For completeness, we also train ColSigLIP, a late interaction variant of the BiSigLIP model, but obtain abysmal performance. We attribute this to the large gap w.r.t. SigLIP’s pre-training, in which only a pooled latent representation is used in the contrastive loss, which does not optimize the representations of individual patch and token embeddings. Similarly, we train a BiSigLIP$_{\text{PaliGemma}}$ variant, in which we retrieve the image representations from the SigLIP model that has been further updated by PaliGemma fine-tuning, and use the text representations from PaliGemma’s text model. After fine-tuning on our dataset, performance is severely inferior to SigLIP$_{\text{Vanilla}}$, which simply encodes with SigLIP’s original text and vision components. This indicates a logical misalignment between SigLIP embeddings and Gemma embeddings after PaliGemma training. We detail these results in Table 5.

5.2 Latencies & Memory Footprint

Online Querying. (R2) Logically, querying latencies differ between ColPali and a BGE-M3 embedding model. For BGE-M3, encoding takes about 22 ms for 15 tokens, while encoding a query with ColPali’s language model takes about 30 ms. (Footnote 9: Computed for a batch size of 1 (online), and averaged over 1000 queries. See subsection B.5.) For smaller corpus sizes, computing the late interaction operation induces only a marginal overhead ($\approx 1$ ms per 1000 pages in the corpus), and the cosine similarity computation between bi-encoder vectors is even faster. Optimized late interaction engines (Santhanam et al., 2022; Lee et al., 2023) make it easy to scale corpus sizes to millions of documents with reduced latency degradations.

Offline Indexing. (R3) Standard retrieval methods using bi-encoders represent each chunk as a single vector embedding, which is easy to store and fast to compute. However, processing a PDF to get the different chunks is the most time-consuming part (layout detection, OCR, chunking), and using captioning to handle multimodal data will only exacerbate this already lengthy process. On the other hand, ColPali directly encodes pages from their image representation. Although the encoder model is larger than standard retrieval encoders, skipping the preprocessing allows large speedups at indexing (Figure 3). (Footnote 10: Measured on an NVIDIA L4 GPU, averaged over 100 pages, with a batch size of 4 pages for ColPali and 8 text chunks for Bi-Encoders. On average, a page is divided into 2.1 chunks. See subsection B.5.)

Memory Footprint. Our method requires storing a vector per image patch. We project each PaliGemma vector to a lower dimensional space ($D=128$) to maximize efficiency, leading to a memory footprint of 256 KB per page (subsection B.4). Importantly, the memory footprint of the naive ColBERT indexing strategy can be drastically improved through compression and clustering mechanisms as proposed in the Performance-optimized Late Interaction Driver (Santhanam et al., 2022).
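
The per-page figure can be checked with simple arithmetic. The sketch below assumes 1024 patch vectors per page (the default ColPali patch count), D = 128, and 2 bytes per value (16-bit storage); the 2-byte assumption and the omission of any additional non-patch vectors are simplifications of this sketch, with the exact accounting given in subsection B.4.

```python
def index_bytes_per_page(num_vectors=1024, dim=128, bytes_per_value=2):
    """Uncompressed size of one page's multi-vector representation, in bytes."""
    return num_vectors * dim * bytes_per_value

per_page_kib = index_bytes_per_page() / 1024          # 262144 bytes = 256 KiB
corpus_mib = 1000 * index_bytes_per_page() / 1024**2  # a 1000-page corpus, in MiB
```

Under these assumptions, a 1000-page corpus such as those in the practical ViDoRe tasks needs 250 MiB before any compression or clustering is applied.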

5.3 Interpretability

By superimposing the late interaction heatmap on top of the original image, we can visualize the most salient image patches with respect to each term of the query, yielding interpretable insights into model focus zones. As epitomized in Figure 1, we observe ColPali exhibits strong OCR capabilities as both the words "hourly" and "hours" present a high similarity score with the query token <_hour>. We also note particular focus on other non-trivial image features such as the x-axis representing hours being salient. Other visualization examples with similar trends of the model transcending pure OCR are shown in Appendix C.
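
A heatmap of this kind can be sketched by computing, for one query token, its dot product with every patch embedding and arranging the results on the patch grid. The sketch assumes patches are stored in row-major order on a rectangular grid (for example 32 by 32 for 1024 patches); the function names, the grid layout, and the toy values are assumptions for illustration, not details taken from the paper.

```python
def similarity_grid(query_token_emb, patch_embs, grid_width):
    """Dot product of one query-token embedding with each patch embedding,
    reshaped into rows of grid_width (row-major patch order is assumed)."""
    sims = [sum(a * b for a, b in zip(query_token_emb, p)) for p in patch_embs]
    return [sims[r:r + grid_width] for r in range(0, len(sims), grid_width)]

def most_salient_patch(grid):
    """Return (row, column) of the patch with the highest similarity."""
    return max(
        ((r, c) for r, row in enumerate(grid) for c in range(len(row))),
        key=lambda rc: grid[rc[0]][rc[1]],
    )

# Toy 2x2 patch grid with two-dimensional embeddings (hypothetical values).
patches = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, 0.0]]
grid = similarity_grid([1.0, 0.0], patches, grid_width=2)
hotspot = most_salient_patch(grid)
```

Upsampling such a grid to the page resolution and overlaying it on the image gives one heatmap per query token.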

6 Ablation study

Figure 4: Relative NDCG@5 performance gain w.r.t. the default ColPali (1024 patches). TabFQuAD fine-tuning measures the performance difference on the TabFQuAD task after the introduction of targeted data in the training set. All other results refer to performance deltas averaged on all ViDoRe tasks.

Should we scale models or patch numbers?
We train a variant of PaliGemma with half the number of image patches (512). While there is a clear performance degradation w.r.t. the 1024-patch ColPali model (Figure 4), memory usage is much lower. (While another PaliGemma variant exists with 2048 patches, the different training datamix and the large memory requirements make this model impractical for both training and inference.) As an alternative to PaliGemma, we train Idefics2-8B (Laurençon et al., 2024), a VLM with a similar architecture, based on a Mistral-7B (Jiang et al., 2023) language backbone and a SigLIP vision encoder paired with a perceiver resampler. The most notable differences with PaliGemma lie in the size of the language model (2B and 7B resp.) and the number of image patches (between 512 and 2048 for PaliGemma, and 60 post-resampling for Idefics2, with the option of adding 4 sub-image crops of 64 tokens each to the sequence, for a total of 320 tokens). Our results (Figure 4) suggest language model size has a strong impact on performance and, along with the trained resampler, enables more efficient representations for smaller numbers of image embeddings: ColIdefics2 with 60 patches edges out ColPali with 512 patches. Scaling the number of patches of the smaller ColPali model from 512 to 1024 enables largely surpassing the 60-patch ColIdefics2 while being about twice as fast in terms of training and inference latency. These results suggest there are tradeoffs between performance (R1), latencies during online querying (R2) and offline indexation phases (R3), and index memory size.

Should we fine-tune the vision component?
We run our contrastive finetuning on a ColPali model in which we also train the vision encoder and the projection layer. Results in Figure 4 show this leads to no significant improvements.

Do "query augmentation" tokens help?
In ColBERT, special tokens are concatenated to the input query to serve as soft query augmentation buffers. Training without these tokens, we observe no significant performance difference (Figure 4) in the English benchmarks. However, performance on the French tasks seems to improve (Table 5).

Is the Pairwise CE loss best?
Training with an in-batch negative contrastive loss, instead of the pairwise CE loss that only considers the hardest negative sample, leads to a slight performance degradation (−2.4%) on the aggregated benchmark.

Can the model adapt to new tasks?
Contrary to more complex multi-step retrieval pipelines, ColPali can be trained end-to-end, directly optimizing the downstream retrieval task, which greatly facilitates fine-tuning to boost performance on specialized domains, multilingual retrieval, or specific visual elements the model struggles with. To demonstrate, we add 1552 samples representing French tables and associated queries to the training set. This represents the only French data in the training set, with all other examples being kept unchanged. We see significant NDCG@5 improvements (Figure 4) and even starker Recall@1 gains (+6.63%) on the TabFQuAD benchmark, with no performance degradation on the rest of the benchmark tasks (+0.34%).

7 Conclusions

Through the conception of a new benchmark ViDoRe, we established the limits of both modern industrial document retrieval pipelines and off-the-shelf image-text contrastive models for visually rich document retrieval. We introduced ColPali, a novel retrieval model that leverages the latest generative Vision Language models to create highly performing multi-vector embeddings purely from visual document features. ColPali largely outperforms the best existing document retrieval methods while enabling faster corpus indexing time and maintaining low querying latencies, suggesting a very high potential for industrial document retrieval applications. We hope to encourage future work by publicly releasing the ViDoRe benchmark and all models and baselines from our study.

Future Work. Further performance gains could be obtained by exploring sub-image decomposition (Liu et al., 2023a), optimal image patch resampling strategies (Laurençon et al., 2024), or hard-negative mining. Subsequently, our vision is to combine visual retrieval and visually grounded query answering to create RAG systems that purely function from visual features. An interesting line of research could be attempting to generate answers leveraging information stored in the indexed multi-vector patch embeddings.

Limitations

Focus. In this work, we evaluate models on document retrieval tasks, covering several modalities (figures, text, tables, infographics). We however primarily focus on PDF-type documents, and evaluating systems on image retrieval with documents stemming from web page screenshots or hand-written documents might be an interesting generalization. We also focus on high-resource languages (English and French) and although we have shown the capacity of the ColPali model to generalize to languages outside of its fine-tuning set, it is unclear how the model would perform on languages that are not as represented in the model’s language backbone. Finally, our setup assumes relevant documents exist, but abstention methods for Information Retrieval systems might be interesting to explore in more practical settings in which confidence estimation might be important (Gisserot-Boukhlef et al., 2024).

Support. This work relies on multi-vector retrieving derived from the ColBERT late interaction mechanism. Although some vector databases support late interaction engines (Vespa Engine, RAGatouille, QDrant, colbert.ai), many widely used vector retrieval frameworks do not propose native multi-vector support, and some engineering infrastructure efforts may be required to adapt them to work with ColPali (or ColBERT) models.

Data. In the creation of ViDoRe, we partially rely on synthetic query generation based on a commercial large language model, which may induce some amount of bias in the generated queries. To compensate for this, we have iterated on the prompting strategy and given real query examples to the models to help ground generation in realistic settings. We have further manually verified all synthetic queries through a lengthy process to validate their relevance and their quality. Our benchmark also includes many benchmark tasks with no synthetic data, and result trends observed between all tasks are correlated, further confirming the coherence of our benchmark design.

Ethical Considerations

Carbon Footprint. Our work fully leverages prior pretrained models and training is not particularly compute-intensive. Furthermore, we rely on low-rank adapters to further reduce the computational resources needed, both during training and for storage. Overall, a training run represents about 40 hours of Mi250x AMD GPUs. Our experiments, in total, represent 1405 Mi250x GPU hours from highly efficient compute clusters running on low-carbon nuclear energy, representing a total of around 15kg CO2 eq.

Impact. We believe our work could have a strong impact on improving industrial document retrieval systems. Our method is efficient, performs well, and the additional support towards visually rich information from documents could go a long way in unlocking knowledge sources previously difficult to index or query.

Resource Release. For transparency, and to foster future work, we release our comprehensive benchmark under open license and host a public leaderboard (https://huggingface.co/spaces/vidore/vidore-leaderboard). Our models are released under the same usage license as the base model (Gemma Research license for ColPali, Apache2.0 for ColIdefics2) and should be used as intended by the VLM license.

Acknowledgements

This work is partially supported by Illuin Technology, and by a grant from ANRT France. This work was performed using HPC resources from the CINES ADASTRA through Grant 2024-AD011015443. We extend our warm thanks to Jonathan Dong, Caio Corro, Victor Pellegrain and Ender Konukoglu for their valuable feedback on the paper.

References

Appendix A Benchmark Datasets

A.1 Academic Datasets

DocVQA (Mathew et al., 2020) includes collected images from the UCSF Industry Documents Library. Questions and answers were manually annotated.

InfoVQA (Mathew et al., 2021) includes infographics collected from the Internet using the search query “infographics”. Questions and answers were manually annotated.

TAT-DQA (Zhu et al., 2022) is a large-scale Document VQA dataset that was constructed from publicly available real-world financial reports. It focuses on rich tabular and textual content requiring numerical reasoning. Questions and answers were manually annotated by human experts in finance.

arXivQA (Li et al., 2024) is a VQA dataset based on figures extracted from arXiv publications. The questions were generated synthetically using GPT-4 Vision.

TabFQuAD (Table French Question Answering Dataset) is designed to evaluate TableQA models in realistic industry settings. We create additional queries to augment the existing human-annotated ones using the same method described in subsection A.2.

A.2 Practical Datasets

Methodology. Creating a relevant retrieval dataset close to real use cases is a major challenge as the dataset needs to be both sufficiently large for effective fine-tuning and sufficiently diverse to cover a broad range of modalities (full text, tables, charts, …), domains (industry, healthcare, …), and query-document interactions (extractive questions, open-ended questions, …). Our approach to building this dataset involves several steps: (1) we use a web crawler to collect publicly available documents on various themes and sources, (2) we convert these PDFs into a series of images, one per page, and (3) we generate queries related to each image using a VLM.

Web-Crawler. We implemented a web crawler to efficiently collect large volumes of documents related to a given topic. The crawler is seeded with a user-defined query (e.g. "artificial intelligence") and then uses GPT-3.5 Turbo to brainstorm related topics and subtopics. This query augmentation strategy aims at both broadening and deepening the search. GPT-3.5 Turbo is further used to generate diverse search queries from each subtopic. This query set is then consumed by a pool of parallel workers whose job is to fetch the associated most relevant documents. We use SerpAPI (https://serpapi.com/) along with a filetype filter (PDF documents only) to programmatically scrape Google Search rankings. Each file is hashed and stored in a Bloom filter (Bloom, 1970) shared among workers to avoid duplicate documents in the final corpus. Unique scraped files are downloaded, and inserted into a SQLite database along with additional metadata.
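The deduplication step can be sketched as follows. This is a single-process illustration of hashing each file and checking it against a Bloom filter; the class, its bit-array size, the number of hash functions, and the `deduplicate` helper are all assumptions for the example, and sharing the filter across parallel workers is not shown.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over byte strings (illustrative parameters)."""

    def __init__(self, n_bits: int = 1 << 20, n_hashes: int = 5):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item: bytes):
        # Derive n_hashes bit positions from salted SHA-256 digests.
        for i in range(self.n_hashes):
            digest = hashlib.sha256(i.to_bytes(2, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # May return a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def deduplicate(files, seen: BloomFilter):
    """Keep only files whose content hash has not been seen before."""
    unique = []
    for content in files:
        file_hash = hashlib.sha256(content).digest()
        if file_hash in seen:
            continue
        seen.add(file_hash)
        unique.append(content)
    return unique

unique_files = deduplicate([b"report A", b"report B", b"report A"], BloomFilter())
```

A Bloom filter trades a small false-positive rate (a new file occasionally treated as a duplicate) for constant memory, which suits a crawler that only needs to avoid re-downloading.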

Datamix. Using the web crawler, we collected approximately 1,000 documents for each of the following four seeds: "energy", "government reports", "healthcare industry", and "artificial intelligence". These seeds were meticulously hand-picked to align with real-use cases for retrieval models and visually rich pages. We also removed all documents containing any private information. At this stage, we randomly selected 900 files for the training set and 100 files for the test set, ensuring that data leakage into the test set was avoided during subsequent processing steps.
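A document-level split of this kind can be sketched as below. The key point is that the split happens on files, before pages are extracted, so that pages from one document never end up on both sides. The function name, the seed, and the integer file identifiers are illustrative assumptions.

```python
import random

def split_documents(doc_ids, n_test: int = 100, seed: int = 0):
    """Shuffle document identifiers and hold out n_test of them for testing."""
    shuffled = list(doc_ids)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]

# Roughly 1,000 files per seed topic: 900 for training, 100 for testing.
train_docs, test_docs = split_documents(range(1000))
```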

Query Generation. To increase the efficiency of our query generation scheme and to limit API calls, we generate at most 3 questions per image. From all the documents collected, we randomly sample 10,000 images per theme and call Claude-3 Sonnet with the following prompt:

You are an assistant specialized in Multimodal RAG tasks.\n The task is the following: given an image from a pdf page, you will have to generate questions that can be asked by a user to retrieve information from a large documentary corpus. The question should be relevant to the page, and should not be too specific or too general. The question should be about the subject of the page, and the answer needs to be found in the page. \n Remember that the question is asked by a user to get some information from a large documentary corpus that contains multimodal data. Generate a question that could be asked by a user without knowing the existence and the content of the corpus. \n Generate as well the answer to the question, which should be found in the page. And the format of the answer should be a list of words answering the question. \n Generate at most THREE pairs of questions and answers per page in a dictionary with the following format, answer ONLY this dictionary NOTHING ELSE: \n {     "questions": [         {             "question": "XXXXXX",             "answer": ["YYYYYY"]         },         {             "question": "XXXXXX",             "answer": ["YYYYYY"]         },         {             "question": "XXXXXX",             "answer": ["YYYYYY"]         },     ] } where XXXXXX is the question and [’YYYYYY’] is the corresponding list of answers that could be as long as needed. \n Note: If there are no questions to ask about the page, return an empty list. Focus on making relevant questions concerning the page. \n Here is the page: \n

Human Validation. We manually validate every single synthetically created query in ViDoRe to ensure quality, query relevance, and consistency with the benchmark objective of evaluating retrieval in practical industrial settings. During this step, we randomly assign document-pair queries to 4 volunteer annotators and instruct them to filter out queries that do not fit the above-listed criteria. We also instruct annotators to flag any documents they deem to contain PII information or content not suited for an academic benchmark. No flag was raised during the entirety of the process, validating our prior PDF collection strategy. 100 queries per topic are collected in this manner. Annotators are colleagues and collaborators of the authors who volunteered to help. Each annotator spent approximately 3 hours filtering the larger query set down to 100 high-quality queries per topic.

Appendix B Implementation details

B.1 Codebase

The codebase is written in PyTorch (https://pytorch.org/) and leverages HuggingFace tooling for model implementations and trainers (https://huggingface.co).

B.2 Pairwise CE loss

Our in-batch contrastive loss $\mathcal{L}$ is defined as the softmaxed cross-entropy of the positive scores $s_k^{+} = \text{LI}\left(d_k, q_k\right)$ w.r.t. the maximal negative scores $s_k^{-} = \max\limits_{l, l \neq k} \text{LI}\left(q_k, p_l\right)$.

For numerical stability, we reformulate the loss with the softplus function, leading to:

$$\mathcal{L} = \frac{1}{b} \sum_{k=1}^{b} \texttt{softplus}\left(s_k^{-} - s_k^{+}\right) \qquad (2)$$
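The softplus form follows because a softmax cross-entropy over the two logits $(s_k^{+}, s_k^{-})$ equals $-\log\frac{e^{s_k^{+}}}{e^{s_k^{+}} + e^{s_k^{-}}} = \log\left(1 + e^{s_k^{-} - s_k^{+}}\right)$. The NumPy sketch below illustrates the loss on a precomputed score matrix; it assumes that entry (k, l) holds the late interaction score between query k and page l, with positives on the diagonal and at least two pairs in the batch. The function name is hypothetical and this is not the training code.

```python
import numpy as np

def pairwise_ce_loss(scores: np.ndarray) -> float:
    """Pairwise CE loss for a (b, b) matrix of late interaction scores.

    scores[k, l] is the score between query k and page l; the diagonal
    holds the positive pairs, every other entry is an in-batch negative.
    """
    pos = np.diag(scores).astype(float)
    masked = scores.astype(float)          # astype returns a copy
    np.fill_diagonal(masked, -np.inf)      # exclude positives from the max
    neg = masked.max(axis=1)               # hardest negative per query
    # softplus(x) = log(1 + exp(x)), computed stably as logaddexp(0, x)
    return float(np.mean(np.logaddexp(0.0, neg - pos)))

example_scores = np.array([[2.0, 1.0],
                           [0.0, 3.0]])
loss = pairwise_ce_loss(example_scores)
```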

B.3 Hyperparameters

Hyperparameters are tuned on a validation split composed of 2% of the training dataset. We find bi-encoder methods to be more sensitive to learning rate variations than late interaction-based models and achieve the best performance for all models with a learning rate of 5e-5. We experiment with LoRA rank and α values and do not notice particular improvements past r = α = 32. Per-device batch sizes are kept small due to long sequence lengths that complicate scaling past b = 4. Simulating larger batch sizes for in-batch negative sampling should enable even better results. We find the best results with global batch size b = 32 for 1 epoch on our training set.

B.4 Embedding size

Minimizing storage footprint can be essential to industrial retrieval systems if databases contain millions of documents. With this criterion in view, we have compared the embedding sizes of the models in our study. As shown in Table 3, ColPali’s embedding size is an order of magnitude larger than BM25 and two orders of magnitude larger than BGE-M3. However, this study is limited to the naive method of storing ColPali’s multi-vector embeddings. In practical scenarios, using cluster centroids can reduce the size of ColPali multi-vector embeddings by up to an order of magnitude (Santhanam et al., 2022) and make it a competitive retrieval system.

Model Embedding size (KB)
BGE-M3 8.60
BM25 (dense emb.) 3.00
BM25 (sparse emb.) 1.56 ± 0.51
ColPali (float16) 256
Table 3: Comparison of the embedding sizes for the DocVQA test set from ViDoRe w.r.t. different retrieval models. The lower the size the smaller the storage footprint of the model. The mean ± std size is given for the sparse embeddings.

B.5 Latency computations

All latency computations are done on an NVIDIA L4 GPU. Queries are encoded independently (batch size of 1) to simulate online querying, and pages are encoded with a batch size of 4 for PaliGemma-derived models, and 8 for BGE-M3. Reported times include image and text processing time before the model forward pass, as well as query-to-index matching times. We note an interesting feature of ColPali is that all documents have the same sequence length, leading to prior knowledge of runtime and memory consumption. Query latency experiments are averaged over 1000 queries, and indexing times are measured for a 100-page document. Per-page time is obtained by dividing total time by 100, corresponding to inverse page throughput.

B.6 Captioning

Examples of captions generated for visually rich document chunks with Claude-3 Sonnet are shown in Figure 6 and Figure 5. The prompt used for generating the description is the following:

You are an assistant specialized in document analysis. Given a table or a figure, you have to provide a detailed summary of the content in maximum 3000 characters. Your summary should be qualitative and not quantitative. Here is the table/figure to analyze: {image}. Answer ONLY with the caption of the table/figure.
Figure 5: Example from the "Energy" test set.
Caption: The image depicts the hourly energy generation profile, illustrating the contributions of various energy sources over 24 hours. The data is presented as a stacked bar chart, with the x-axis representing the hours of the day from 1 to 2, and the y-axis showing the average hourly generation in MW. The bars are segmented into different colors, each representing a distinct energy source: nuclear, bio, geothermal, solar, wind, hydro, natural gas, and other imports. The chart provides insights into the temporal variations in energy generation across different sources, highlighting the interplay between baseload and intermittent sources throughout the day.
Figure 6: Example from the "Government Reports" test set.
Caption: The image shows a table titled "System of Record" which outlines the different types of documents or records maintained across various systems or departments within an organization related to project management and construction. The rows list documents like project plans, budgets, schedules, contracts, purchase orders, invoices, change requests, bid submissions, drawings, manuals, meeting minutes, and reports. The columns indicate the system or department responsible for maintaining each record, such as County Servers, Project View, OnBase, CGI Advantage Financial System, and Purchasing Department. The table uses "W" and "T" markers to denote which system or department serves as the primary source (writer) or storage location (trailer) for each type of document.

Appendix C More similarity maps

In Figure 7, ColPali assigns a high similarity to all patches with the word "Kazakhstan" when given the token <_Kazakhstan>. Moreover, our model seems to exhibit world knowledge capabilities as the patch around the word "Kashagan" (an offshore oil field in Kazakhstan) also shows a high similarity score. In Figure 8, we observe that ColPali is also capable of complex image understanding. Not only are the patches containing the word "formulations" highly similar to the query token <_formula>, but so is the upper-left molecule structure.

Figure 7: Similarity of the image patches w.r.t. the underlined token in the user query. This example is from the Shift test set.
Figure 8: Similarity of the image patches w.r.t. the underlined token in the user query. This example is from the Healthcare Industry test set.

It is also interesting to highlight that both similarity maps showcase a few white patches with high similarity scores. This behavior might first seem surprising as the white patches should not carry a meaningful signal from the original images. We believe the vectors associated with these patches share a similar role with the ViT registers (Darcet et al., 2023), i.e. these patches were repurposed for internal computations and stored the global information from the whole image.

Appendix D Additional results

D.1 Other Metrics

ArxivQ DocQ InfoQ TabF TATQ Shift AI Energy Gov. Health. Avg.
Unstructured Text only
BM25 - 26.6 - - 34.6 45.0 86.0 70.0 68.0 74.0 -
BGE-M3 - 22.8↓3.8 - - 26.1↓8.5 51.0↑6.0 81.0↓5.0 72.0↑2.0 67.0↓1.0 77.0↑3.0 -
Unstructured + OCR
BM25 26.7 28.9 54.0 30.4 50.0 52.0 86.0 77.0 74.0 80.0 55.9
BGE-M3 28.1↑1.4 22.9↓6.0 53.8↓0.2 55.7↑25.3 38.6↓11.4 56.0↑4.0 82.0↓4.0 79.0↑2.0 76.0↑2.0 83.0↑3.0 57.5↑1.6
Unstructured + Captioning
BM25 35.5 30.2 61.5 24.3 49.0 47.0 79.0 76.0 75.0 81.0 55.9
BGE-M3 29.3↓6.2 26.0↓4.2 62.1↑0.6 58.6↑34.3 30.6↓18.4 55.0↑8.0 80.0↑1.0 78.0↑2.0 69.0↓6.0 83.0↑2.0 57.2↑1.3
Contrastive VLMs
Jina-CLIP 19.4 7.3 26.7 12.5 1.6 2.0 11.0 13.0 15.0 17.0 12.6
Nomic-vision 10.4 6.7 22.1 9.6 1.6 0.0 9.0 9.0 7.0 13.0 8.8
SigLIP (Vanilla) 34.2 21.3 51.8 46.1 17.9 13.0 50.0 51.0 47.0 65.0 39.7
Ours
(Copied) SigLIP (Vanilla) 34.2 21.3 51.8 46.1 17.9 13.0 50.0 51.0 47.0 65.0 39.7
BiSigLIP (+fine-tuning) 49.2↑15.0 23.8↑2.5 59.0↑7.2 52.1↑6.0 20.7↑2.8 16.0↑3.0 62.0↑12.0 61.0↑10.0 55.0↑8.0 72.0↑7.0 47.1↑7.4
BiPali (+LLM) 46.4↓-2.8 20.0↓-3.8 54.6↓-4.4 63.2↑11.1 20.4↓-0.4 34.0↑18.0 59.0↓-3.0 45.0↓-16.0 57.0↑2.0 56.0↓-16.0 45.6↓-1.5
ColPali (+Late Inter.) 72.4↑26.0 45.6↑25.6 74.6↑20.0 75.4↑12.1 53.1↑32.7 55.0↑21.0 93.0↑34.0 85.0↑40.0 85.0↑28.0 88.0↑32.0 72.7↑27.1
Table 4: Comprehensive evaluation of baseline models and our proposed method on ViDoRe. Results are presented using Recall@1 metrics. Text-only metrics are not computed for benchmarks with only visual elements.
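For readers unfamiliar with the two metrics reported in this appendix, the sketch below shows how Recall@k and NDCG@k can be computed for a single query under binary relevance. It assumes each query has at least one relevant page; the function names and page identifiers are illustrative, and the benchmark's own evaluation code may differ in details.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k: int) -> float:
    """Fraction of the relevant pages that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k: int) -> float:
    """Binary-relevance NDCG@k: discounted gain normalised by the ideal ranking."""
    relevant = set(relevant_ids)
    # Rank positions are 0-based here, hence the log2(rank + 2) discount.
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc in enumerate(ranked_ids[:k]) if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg

# Toy example: the single relevant page is retrieved at rank 2.
ranking = ["page_3", "page_1", "page_2"]
relevant = ["page_1"]
r1 = recall_at_k(ranking, relevant, k=1)
n5 = ndcg_at_k(ranking, relevant, k=5)
```

Recall@1 only rewards a relevant page in first position, whereas NDCG@5 gives partial credit for relevant pages further down the top five, which is why the two tables can rank systems slightly differently.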

D.2 Model Variants

ArxivQ DocQ InfoQ TabF TATQ Shift AI Energy Gov. Health. Avg.
ColSigLIP (PaliGemma) 3.1 3.0 5.1 6.2 2.5 1.0 3.4 3.4 2.3 2.2 3.2
BiSigLIP (PaliGemma) 18.5 14.6 33.4 39.5 16.1 5.2 27.6 32.6 36.6 35.7 26.0
ColSigLIP (Original) 2.6 2.2 2.3 5.7 1.8 1.0 2.6 4.1 1.4 1.5 2.5
ColPali (No Mem. Tokens) 80.4 53.2 82.4 77.4 65.7 63.4 97.0 89.9 93.6 92.4 79.6
ColPali (Best) 79.1 54.4 81.8 83.9 65.8 73.2 96.2 91.0 92.7 94.4 81.3
Table 5: Evaluation of some "negative results" and ablations on ViDoRe; ColPali for reference. Results are presented using NDCG@5 metrics. Text-only metrics are not computed for benchmarks with only visual elements.

Appendix E ViDoRe examples

Energy

Query: What types of accounts or products allow investors to defer paying taxes?
Refer to caption
Query: What is the projected peak electricity demand in California for the year 2030?
Refer to caption
Query: What is the estimated total savings for a PV system in Durham under the net metering (flat rate) billing option over the system’s useful life of 25 years?
Refer to caption

Artificial Intelligence

Query: What are some common outcome areas targeted by TAII for different age groups?
Refer to caption
Query: What did the robot monitor to determine when to activate or deactivate the blower motor and blinker?
Refer to caption
Query: What is the key approach used in the PDP architecture?
Refer to caption

Healthcare Industry

Query: What is the chemical formula for the ferroelectric material Lead Zirconium Titanate (PZT)?
Refer to caption
Query: What government entities are involved in public financing for healthcare in the US?
Refer to caption
Query: What does the AVPU scale stand for in assessing the level of consciousness of a seriously ill child?
Refer to caption

Government Reports

Query: What are some mandates for the EPA under the Pollution Prevention Act?
Refer to caption
Query: What is the strategy of KPMG Hazem Hassan?
Refer to caption
Query: What is the trust signal score for the consumer industry best-in-class archetype?
Refer to caption

Shift

Query: Selon le graphique, quelle est la capacité d’import et la consommation réelle de carburants SAF (biocarburants durables pour l’aviation) prévues en 2050 ?
Refer to caption
Query: Quelle partie de la production pétrolière du Kazakhstan provient de champs en mer ?
Refer to caption
Query: Quels sont les pays ayant la plus grande part des découvertes cumulées de pétrole brut en 2020 (en milliers de barils, hors découvertes cumulées) ?
Refer to caption