InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions
Abstract
We study the problem of completing various visual document understanding (VDU) tasks, e.g., question answering and information extraction, on real-world documents through human-written instructions. To this end, we propose InstructDoc, the first large-scale collection of 30 publicly available VDU datasets, each with diverse instructions in a unified format, which covers a wide range of 12 tasks and includes open document types/formats. Furthermore, to enhance the generalization performance on VDU tasks, we design a new instruction-based document reading and understanding model, InstructDr, that connects document images, image encoders, and large language models (LLMs) through a trainable bridging module. Experiments demonstrate that InstructDr can effectively adapt to new VDU datasets, tasks, and domains via given instructions and outperforms existing multimodal LLMs and ChatGPT without specific training.
Introduction
Building document artificial intelligence (Document AI) capable of reading and comprehending real-world documents, including webpages, office documents, mobile UIs, etc., has been a long-cherished goal. Toward this goal, numerous works on visual document understanding (VDU) have addressed a wide range of tasks, such as document question answering (QA) (Mathew, Karatzas, and Jawahar 2021) and information extraction (Jaume, Ekenel, and Thiran 2019). Document data contain both textual and visual objects, with content spread structurally across various locations depending on diverse document types and formats. To address this complexity, previous works have proposed models that aim to improve interactions among text/layout/visual modalities (Xu et al. 2021; Appalaraju et al. 2021). However, the diversity of documents and tasks poses a challenge in developing a unified model that can comprehend intricate relationships between text and visual objects across a wide range of document types, formats, and tasks.
To improve generalization and adaptation to unseen vision-language tasks, visual instruction tuning (Xu, Shen, and Huang 2023; Liu et al. 2023a) has been introduced. This approach involves training multimodal large language models (MLLMs) on a collection of images, task inputs, and instructions. However, according to Liu et al. (2023b), most previous visual instruction tuning datasets have primarily focused on understanding visual (non-textual) objects in scene images, and existing models struggle with tasks that require visual document understanding abilities. While recent works (Zhang et al. 2023; Ye et al. 2023a) attempt to deal with the issue, they still exhibit limitations when generalizing to unseen tasks and documents.
In this paper, we propose InstructDoc, the first large-scale visual instruction tuning dataset that covers a wide range of VDU tasks and datasets (12 diverse tasks created from 30 openly available datasets). Our dataset and code are publicly available at https://github.com/nttmdlab-nlp/InstructDoc. Each dataset has diverse instructions annotated by experts, following a unified instruction schema, composed of user’s intent and answer style, for VDU tasks. As shown in Figure 1, InstructDoc requires a rich set of abilities, including understanding document layout, visual representations of texts, and relation extraction of objects (e.g., graphs and charts) over open document types/formats with handcrafted instructions.
Furthermore, to enhance the generalization performance on VDU tasks, we present an Instruction-based Document reading and understanding model, InstructDr, which unifies the visual, text, and layout modalities of documents by bridging the gap between a vision encoder and a large language model (LLM) through a new bridging module called Document-former. The Document-former converts documents into a useful feature for the LLM. Experiments show that InstructDr achieves the highest zero-shot performance among existing MLLMs and outperforms ChatGPT on a wide range of VDU datasets with instructions.
![Refer to caption](x1.png)
Related Work
Visual document understanding.
Visual documents are ubiquitous and used in diverse applications, including QA on business documents (Mathew, Karatzas, and Jawahar 2021), information extraction on receipts (Jaume, Ekenel, and Thiran 2019), and classification over large document collections (Harley, Ufkes, and Derpanis 2015). Due to this diversity, previous works have generally been domain/task-specific, lacking the sharing of underlying data, model architectures, and objectives (Xu et al. 2020; Appalaraju et al. 2021; Huang et al. 2022). Although pixel-based methods (Kim et al. 2022; Lee et al. 2023) can simplify architectures, these methods have high computational costs (due to the encoding of high-resolution images) and can have degraded performance on new tasks. We leverage the reasoning abilities of LLMs and perform all VDU tasks in a unified sequence-to-sequence format with instructions, resulting in improved generalization performance.
Instruction-following language models.
Training LLMs with instructions on various NLP tasks has proven effective in improving zero-shot performance on unseen tasks (Wei et al. 2021; Iyer et al. 2022). Flan (Wei et al. 2021; Longpre et al. 2023), PromptSource (Bach et al. 2022), and Natural Instructions (Mishra et al. 2022) collected instructions and datasets for a variety of general NLP tasks, such as machine reading comprehension and summarization tasks on plain-text documents. In contrast, we tackle the challenge of understanding real-world documents organized in non-plain text formats (e.g., HTML and PDF).
Visual instruction tuning.
Researchers have recently explored the application of LLMs to vision-language tasks by distilling the output of LLMs (Liu et al. 2023a; Zhu et al. 2023; Ye et al. 2023b) or training with handcrafted instructions (Xu, Shen, and Huang 2023; Dai et al. 2023). However, as pointed out by Liu et al. (2023b), these models struggle with tasks requiring document understanding abilities because they do not assume that text might be contained in images during instruction tuning. To mitigate this issue, LLaVAR (Zhang et al. 2023) and LLMDoc (Ye et al. 2023a) fine-tune MLLMs with instruction tuning on document images. However, these approaches have trouble understanding diverse real-world documents because (i) the datasets provide only a few document and task types, hindering zero-shot generalization; and (ii) the models simply encode documents via vision encoders and cannot explicitly learn document meta-information (e.g., document layout). In contrast, InstructDoc covers diverse VDU tasks and open document types/formats, and InstructDr learns rich representations of the underlying structure of documents with instructions.
InstructDoc Dataset
![Refer to caption](x2.png)
Problem Formulation
All of the tasks in InstructDoc are defined simply: given an instruction and a document image, a model outputs an answer. Each task is composed of one or more datasets, where each dataset is associated with a set of instructions and contains a set of instances. For every instance, we randomly select an instruction from the instruction set of its dataset. Note that we allow the utilization of external OCR engines to derive the answer in our setting, as in the previous VDU benchmark (Borchmann et al. 2021). Our goal is to enable the model to perform a wide range of VDU tasks with instructions rather than improving the accuracy of text recognition (Zhang et al. 2023).
We mainly evaluate the models in zero-shot scenarios. Specifically, we fine-tune a model on a collection of instruction tasks and evaluate it on unseen datasets of three types: (i) Test (Cross-Dataset): datasets not used during training, but whose tasks exist in the training set; (ii) Test (Cross-Task): datasets and associated tasks entirely unseen during training; and (iii) Test (Cross-Domain): datasets, tasks, and document types entirely unseen during training.
Dataset Collection
In this section, we describe the collection process of the InstructDoc dataset. InstructDoc is designed to cover a wide range of VDU tasks with instructions that require reasoning among document layout, images, and text.
Source dataset collection.
Figure 2 shows the source datasets in InstructDoc. We collected 30 publicly available datasets and 12 tasks in VDU areas from DUE (Borchmann et al. 2021) as well as through manual searches. Following the task clusters defined in previous works (Wei et al. 2021; Dai et al. 2023), we divided the QA datasets that require different reasoning abilities into different tasks. As a result, we divided the collected datasets into the following tasks:
- Single-page QA w/ Discrete Reasoning requires various arithmetic abilities, including addition, sorting, or counting (Zhu et al. 2022).
- Document NLI is a task of natural language inference that predicts the entailment relationship between two sentences in a document (Borchmann et al. 2021).
- Dialogue involves a human-agent interaction on the basis of document images (Zhang et al. 2023).
- Classification involves classifying a document from a set of candidate labels (Harley, Ufkes, and Derpanis 2015).
- Image-Text Matching (ITM) requires the model to determine whether a given OCR text and image match.
![Refer to caption](x3.png)
Query rephrasing.
We found that two KIE datasets (FUNSD and CORD) are challenging because they contain abbreviated queries that are difficult for humans to comprehend. To bridge the gap between humans and machines, we replace these queries with complete and more easily understandable phrases (e.g., menu.vatyn → menu_whether_price_tax_included).
Instruction annotation.
For each dataset, we manually crafted five to ten distinct instruction templates in a unified format. For QA tasks, the answers have diverse styles in the original datasets; for example, DocVQA’s answer is extractive, which requires the model to extract a contiguous span of words from the document, but VisualMRC’s answer is generative, which is not limited to the word spans. Hence, an instruction that sufficiently describes an arbitrary VDU task should include intent and answer style or only intent. Specifically, as shown in Figure 1, intent describes how the task can be performed and answer style describes how the model generates the output. If a dataset provides a query and options, we fill them into the annotated instruction templates.
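The template-filling step can be illustrated with a minimal sketch. The template wording, the `{query}` placeholder, and the `build_instruction` helper are hypothetical and are not taken from the released dataset; only the overall idea (several templates per dataset, one picked at random, query and options filled in) follows the description above.

```python
import random

# Hypothetical templates combining an intent and an answer style.
# The wording is illustrative only, not the dataset's actual templates.
TEMPLATES = [
    "Answer the question based on the document. "
    "Extract the answer as a span of words from the document. Question: {query}",
    "Read the document and respond to the query. "
    "The answer should be copied verbatim from the document. Query: {query}",
]

def build_instruction(templates, query, options=None, rng=None):
    """Pick one template at random and fill in the query (and options, if any)."""
    rng = rng or random.Random()
    text = rng.choice(templates).format(query=query)
    if options:
        # Options are appended as a comma-separated list (an assumed format).
        text += " Options: " + ", ".join(options)
    return text

example = build_instruction(
    TEMPLATES, "What is the total amount?", rng=random.Random(0)
)
```

Passing a seeded `random.Random` keeps the template choice reproducible, which is convenient when comparing runs.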
Data split.
We split InstructDoc into 23 held-in and seven held-out datasets. For the held-out evaluation, we aim to understand how instruction tuning on the held-in datasets improves the zero-shot generalization performance on unseen datasets, including (i) Test (Cross-Dataset): FUNSD and CORD datasets, (ii) Test (Cross-Task): ChartQA, InfoVQA, and TabFact datasets, and (iii) Test (Cross-Domain): DUDE and SlideVQA datasets. All other datasets were held-in ones used to train our model. Note that the held-out datasets were carefully selected in order to avoid data contamination.
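As a small sketch, the held-out split described above can be written down as a lookup table. The dictionary keys and the `split_of` helper are hypothetical names; the 23 held-in datasets are not enumerated in this section, so any dataset not listed is simply treated as held-in here.

```python
# Held-out datasets grouped by evaluation setting, as listed in the text.
HELD_OUT = {
    "cross_dataset": ["FUNSD", "CORD"],
    "cross_task": ["ChartQA", "InfoVQA", "TabFact"],
    "cross_domain": ["DUDE", "SlideVQA"],
}

def split_of(dataset_name):
    """Return the held-out setting of a dataset, or 'held_in' if it is not held out."""
    for split, names in HELD_OUT.items():
        if dataset_name in names:
            return split
    return "held_in"
```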
| LLaVAR | DocOwl | InstructDoc |
---|---|---|---|
Both single/multi-page docs | | | ✓ |
Instruction annotation | | ✓ | ✓ |
Answer style annotation | | | ✓ |
#Document types | 8 | 7 | Open |
#Seed datasets | 1 | 8 | 30 |
#Task clusters | 1 | 3 | 12 |
#Avg. IT words | - | 5 | 20.3 |
#Avg. IT | - | 1 | 7.4 |
#Avg. OCR words | 52.5 | 270.1 | 443.2 |
#Avg. Answer words | 34.5 | 1.9 | 5.88 |
Comparison with Related Datasets
Table 1 shows the statistics of InstructDoc and other VDU instruction tuning datasets, including LLaVAR (Zhang et al. 2023) and DocOwl (Ye et al. 2023a). InstructDoc has three unique key properties. First, it is the first dataset to address open document types, including multi-page documents, and has the highest standard deviation in the number of OCR tokens (1442.8) compared with LLaVAR (93.1) and DocOwl (807.2). This implies that our dataset presents a more challenging setting. Second, InstructDoc covers the widest range of tasks, offering four times as many tasks as DocOwl, while LLaVAR provides only a single task. Finally, InstructDoc provides a more extensive set of instructions (20.3 words and 7.4 templates) and annotates various answer styles within the instructions to deal with various VDU tasks that require diverse abilities. In contrast, the instructions in DocOwl are limited (five words and a single template) and LLaVAR has only machine-generated instructions, so models trained on them may not generalize well to reformulations and new tasks.
Our Model
Figure 3 depicts our model, InstructDr (Instruction-based Document reading and understanding model). We use pre-trained BLIP-2 (Li et al. 2023), a state-of-the-art MLLM connected with instruction-tuned FlanT5 (Chung et al. 2022), as the base model. We extend BLIP-2 in three key ways: (i) equipping it with Document-former, an enhanced Q-former module that can capture and convert the visual and textual content/layout of documents into representations for the LLM, (ii) conducting multi-task instruction tuning with unified formats, and (iii) encoding multiple images in parallel to facilitate understanding of multi-page documents.
Model | Modal | #TuP | #ToP | Cross-Dataset: FUNSD (eF1/F1) | Cross-Dataset: CORD (eF1/F1) | Cross-Task: ChartQA (RAcc./F1) | Cross-Task: InfoVQA (ANLS/F1) | Cross-Task: TabFact (Acc./F1) | Cross-Domain: DUDE (ANLS/F1) | Cross-Domain: SlideVQA (EM/F1) | Held-out Avg. |
---|---|---|---|---|---|---|---|---|---|---|---|
LLMDoc | V | 388M | 7B | -/- | -/- | -/- | 38.2†/- | 60.2†/- | -/- | -/- | -/- |
LLaVA | TV | 13B | 13B | 12.0/1.3 | 0.2/5.1 | 0.0/1.7 | 3.4/3.5 | 0.0/0.0 | 6.5/5.9 | 0.0/2.3 | 3.1/2.8 |
LLaVAR | TV | 13B | 13B | 12.0/2.0 | 0.1/10.8 | 0.0/3.0 | 6.2/4.6 | 0.0/2.1 | 8.1/5.1 | 0.0/6.2 | 3.8/4.8 |
MiniGPT-4 | TV | 3.1M | 7B | 12.0/2.2 | 0.2/2.1 | 0.0/0.4 | 4.3/0.5 | 0.3/0.2 | 5.9/1.1 | 0.0/0.4 | 3.2/1.0 |
mPLUG-Owl | TV | 388M | 7B | 12.0/6.7 | 0.2/15.0 | 0.0/0.3 | 5.6/5.3 | 0.0/2.6 | 5.8/5.5 | 0.0/0.4 | 3.4/5.1 |
InstructBLIP | TV | 103M | 3.4B | 16.8/15.0 | 4.9/9.5 | 3.3/7.2 | 8.7/7.3 | 33.6/33.7 | 11.0/8.8 | 5.2/9.0 | 11.9/12.9 |
BLIP-2 | TV | 103M | 3.4B | 19.6/19.6 | 32.0/51.9 | 23.6/21.5 | 48.2/36.7 | 58.6/58.6 | 39.8/35.4 | 28.3/38.8 | 35.7/37.5 |
BLIP-2 trained on IDoc | TV | 103M | 3.4B | 26.0/26.1 | 33.8/54.7 | 24.7/21.2 | 47.8/35.4 | 58.8/58.8 | 43.9/40.4 | 30.1/38.8 | 37.9/39.3 |
InstructDr (Ours) | TLV | 103.1M | 3.4B | 38.2/38.1 | 46.0/62.7 | 29.4/22.3 | 50.9/37.6 | 59.4/59.4 | 45.2/41.6 | 31.9/40.2 | 43.0/43.1 |
Spatial-aware Document Feature Extraction
Document image/OCR and instruction encoding.
To encode a document image, we use a pre-trained CLIP (Radford et al. 2021) vision encoder to extract its visual features. Additionally, we process the document image using an OCR engine and apply a sub-word tokenizer to obtain word tokens and their corresponding bounding boxes, each described by the coordinates of its top-left and bottom-right corners. To learn the visual layout of the image, we construct a spatially aware OCR representation with learnable embedding layers, where OCR text features are calculated from the word tokens and spatial features are calculated from the bounding boxes. Similarly, we encode an instruction and obtain its features.
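One way to picture a spatially aware OCR representation is to add an embedding of each token to embeddings of its bounding-box coordinates. The sketch below does this in NumPy under loud assumptions: the exact combination used by the model is not recoverable from this section, so the plain sum, the coordinate binning into `BINS` integer positions, and all sizes and names (`W_tok`, `W_x`, `W_y`, `spatial_ocr_features`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, BINS, DIM = 100, 1000, 16  # hypothetical vocabulary, coordinate bins, hidden size

# Embedding tables that would be learned in practice; random values here.
W_tok = rng.normal(size=(VOCAB, DIM))  # sub-word token embeddings
W_x = rng.normal(size=(BINS, DIM))     # horizontal coordinate embeddings
W_y = rng.normal(size=(BINS, DIM))     # vertical coordinate embeddings

def spatial_ocr_features(token_ids, boxes):
    """token_ids: (M,) ints; boxes: (M, 4) ints (x1, y1, x2, y2), each in [0, BINS)."""
    token_ids = np.asarray(token_ids)
    boxes = np.asarray(boxes)
    text = W_tok[token_ids]                           # OCR text features, (M, DIM)
    x1, y1, x2, y2 = boxes.T
    spatial = W_x[x1] + W_y[y1] + W_x[x2] + W_y[y2]   # spatial features, (M, DIM)
    return text + spatial                             # assumed combination: a sum

feats = spatial_ocr_features([1, 2, 3], [[0, 0, 10, 10], [5, 5, 20, 20], [0, 0, 10, 10]])
```

With this construction, the same token placed at two different positions receives two different vectors, which is the property a layout-aware representation needs.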
Document-former.
We introduce Document-former, which is a trainable module to bridge the gap between an image encoder and an LLM, enabling extraction of document content/layout that LLMs can understand. The architecture of Document-former is a stack of Transformer blocks with cross-attention layers. To map document features into the LLM’s space, we use a set of learnable tokens whose dimension equals the hidden size. These tokens interact with the visual features through cross-attention layers and with the input sequence, composed of the instruction features and the spatially aware OCR features, through self-attention layers. As a result, we obtain output token representations and transform them via a projection feed-forward network (FFN) layer into document embeddings, which have the same dimension as the LLM’s input embedding.
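The interaction pattern of such a bridging block can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the model's implementation: it uses a single attention head, omits layer normalization and the per-block feed-forward sublayer, and all sizes, parameter names, and the `document_former_block` and `project_to_llm` helpers are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys_values, Wq, Wk, Wv):
    """Single-head scaled dot-product attention."""
    q, k, v = queries @ Wq, keys_values @ Wk, keys_values @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def document_former_block(tokens, text_seq, visual, params):
    m = tokens.shape[0]
    # Self-attention: learnable tokens attend jointly with the instruction/OCR sequence.
    joint = np.concatenate([tokens, text_seq], axis=0)
    joint = joint + attention(joint, joint, *params["self"])
    tokens = joint[:m]
    # Cross-attention: only the learnable tokens attend to the visual features.
    return tokens + attention(tokens, visual, *params["cross"])

def project_to_llm(tokens, W_proj, b_proj):
    """Linear projection to the LLM input embedding size."""
    return tokens @ W_proj + b_proj

rng = np.random.default_rng(0)
D, DV, D_LLM, M = 8, 12, 10, 4  # hypothetical hidden, visual, LLM sizes and token count
params = {
    "self": [rng.normal(size=(D, D)) for _ in range(3)],
    "cross": [rng.normal(size=(D, D)), rng.normal(size=(DV, D)), rng.normal(size=(DV, D))],
}
tokens = rng.normal(size=(M, D))     # learnable tokens
text_seq = rng.normal(size=(7, D))   # instruction + OCR features
visual = rng.normal(size=(20, DV))   # image encoder output
out = document_former_block(tokens, text_seq, visual, params)
doc_emb = project_to_llm(out, rng.normal(size=(D, D_LLM)), np.zeros(D_LLM))
```

Whatever the number of image patches or OCR tokens, the output always has `M` rows, so the LLM receives a fixed-length document prefix.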
Multimodal Document Large Language Model
Connecting document features to LLM.
The LLM receives the document embeddings, the instruction, and the OCR tokens as input and outputs the answer token by token. The parameters of the LLM are initialized from an instruction-tuned FlanT5.
Parameter-efficient multi-task instruction tuning.
To achieve task-agnostic learning, we formulate the process of learning all held-in tasks in a unified sequence-to-sequence abstraction through instructions. To train the LLM efficiently, we update only the parameters of the Document-former and the projection FFN layer, while keeping other parameters frozen during training. We optimize the model by minimizing the negative log-likelihood between the ground-truth and predictions.
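For concreteness, the negative log-likelihood of a ground-truth answer under the model's next-token distributions can be computed as in the following sketch. Averaging over answer tokens is an assumption (a common convention); the function name `sequence_nll` and the toy shapes are illustrative.

```python
import numpy as np

def sequence_nll(logits, target_ids):
    """logits: (T, V) unnormalized scores; target_ids: (T,) ground-truth token ids."""
    logits = np.asarray(logits, dtype=float)
    target_ids = np.asarray(target_ids)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Log-probability assigned to each ground-truth token, averaged over the sequence.
    picked = log_probs[np.arange(len(target_ids)), target_ids]
    return -picked.mean()
```

In a parameter-efficient setup, the gradient of this loss would be applied only to the trainable bridging parameters while the frozen components are left unchanged.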
Multi-page document understanding.
We also support performing reasoning across multiple document pages. As shown in Figure 3b, each image is processed individually by the image encoder and Document-former, and their resulting document embeddings are mean-pooled together before being fed into the LLM. The OCR input to the LLM consists of concatenated tokens extracted from each page.
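The multi-page handling described above can be sketched as follows, assuming each page yields a document embedding of the same shape and that the mean is taken across pages position by position. The helper names `pool_pages` and `concat_ocr` are hypothetical.

```python
import numpy as np

def pool_pages(page_embeddings):
    """page_embeddings: list of (m, d) arrays, one per page -> (m, d) mean over pages."""
    return np.mean(np.stack(page_embeddings, axis=0), axis=0)

def concat_ocr(page_tokens):
    """page_tokens: list of token lists, one per page -> one flat list in page order."""
    return [tok for page in page_tokens for tok in page]

pages = [np.ones((2, 3)), 3 * np.ones((2, 3))]
pooled = pool_pages(pages)                 # every entry equals 2.0
ocr = concat_ocr([["Total", "42"], ["Page", "2"]])
```

Mean pooling keeps the visual prefix length independent of the number of pages, whereas the OCR text grows with the page count.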
Experiments
Model | Modal | Cross-Dataset: FUNSD (eF1/F1) | Cross-Dataset: CORD (eF1/F1) | Cross-Task: ChartQA (RAcc./F1) | Cross-Task: InfoVQA (ANLS/F1) | Cross-Task: TabFact (Acc./F1) | Cross-Domain: DUDE (ANLS/F1) | Cross-Domain: SlideVQA (EM/F1) | Held-out Avg. |
---|---|---|---|---|---|---|---|---|---|
Supervised SOTA models | TLV | 92.1/- | 97.7/- | 72.3/- | 54.8*/- | 83.2*/- | 46.1*/- | 33.5/41.7 | -/- |
ChatGPT | T | 21.8/21.2 | 30.4/49.3 | 16.0/16.8 | 37.8/29.5 | 52.5/52.4 | 34.5/32.3 | 11.7/23.8 | 29.2/32.2 |
GPT-4 | T | 47.5/47.5 | 69.4/81.7 | 20.9/27.6 | 49.9/46.5 | 68.8/68.8 | 46.3/45.1 | 21.0/36.4 | 46.3/50.5 |
InstructDr (Ours) | TLV | 38.2/38.1 | 46.0/62.7 | 29.4/22.3 | 50.9/37.6 | 59.4/59.4 | 45.2/41.6 | 31.9/40.2 | 43.0/43.1 |
Experimental Setup
We mainly conducted evaluations under three zero-shot settings: Test (Cross-Dataset), Test (Cross-Task), and Test (Cross-Domain). Furthermore, we evaluated our model under the task-specific fine-tuning setting.
Baselines.
We compared InstructDr with seven state-of-the-art (SOTA) MLLMs, including LLaVA (Liu et al. 2023a), MiniGPT-4 (Zhu et al. 2023) and mPLUG-Owl (Ye et al. 2023b), which align a CLIP visual encoder with Vicuna (Chiang et al. 2023) trained on dialogues generated by GPT-4 (OpenAI 2023); BLIP-2 (Li et al. 2023), which connects a FlanT5 with a vision encoder; InstructBLIP (Dai et al. 2023), which fine-tunes BLIP-2 with instructions on scene images; and LLMDoc (Ye et al. 2023a) and LLaVAR (Zhang et al. 2023), which fine-tune mPLUG-Owl/LLaVA on the DocOwl/LLaVAR datasets. Additionally, we used supervised SOTA models (Appalaraju et al. 2023; Chen et al. 2023; Huang et al. 2022; Landeghem et al. 2023) on each dataset and two text-based LLMs, ChatGPT (gpt-3.5-turbo-0613) and GPT-4. To control the answer’s length, we added control phrases (e.g., use 1 to 3 words to answer) to the selected instructions.
Evaluation metrics.
Following the evaluation protocol of each dataset, we used ANLS (Biten et al. 2019) for InfoVQA, DUDE, TextVQA, and ST-VQA; EM for SlideVQA; Relaxed Accuracy (RAcc.) for ChartQA; entity F1 (eF1) for FUNSD and CORD; Accuracy (Acc.) for TabFact; and ROUGE-L for VisualMRC. Additionally, we used F1 as an optional metric.
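As an illustration of one of these metrics, the sketch below computes ANLS (Average Normalized Levenshtein Similarity) as it is commonly defined: for each question, the score is one minus the normalized edit distance to the closest accepted answer, set to zero when that distance reaches a threshold of 0.5, and the scores are averaged. Lowercasing and whitespace stripping are assumptions about preprocessing; official evaluation scripts may differ in such details.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def anls(predictions, references, threshold=0.5):
    """predictions: list[str]; references: list[list[str]] of accepted answers."""
    total = 0.0
    for pred, golds in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for gold in golds:
            g = gold.strip().lower()
            denom = max(len(p), len(g))
            nl = levenshtein(p, g) / denom if denom else 0.0
            # Answers that are too far from the reference score zero.
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions) if predictions else 0.0
```

The threshold gives partial credit for small OCR-style character errors while treating unrelated answers as wrong.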
Model | CORD (eF1) | TabFact (Acc.) | DUDE (ANLS) | Held-out Avg. |
---|---|---|---|---|
InstructDr | 46.0 | 59.4 | 45.2 | 43.0 |
w/o Document-former | ||||
w/o Spatially OCR features | ||||
w/o Mean pooling (concat.) | - | - | - | |
w/o Instructions in test | ||||
w/o Instructions in train | ||||
w/o Instructions in both | ||||
w/o Query rephrasing | - | - | - | |
w/o Answer style annotation | - | - | - |
Implementation details.
Following Wei et al. (2021), we balanced the training instances of different tasks by sampling a maximum of 5k instances for each held-in dataset while keeping all evaluation instances. We used the AdamW optimizer (Loshchilov and Hutter 2017) with a weight decay of 0.05. We applied a linear warmup during the initial 1,000 steps and used a cosine learning rate decay with a minimum learning rate of 0. We set the number of learnable tokens to . All images of the model input were resized to . We trained on eight A100 (40G) GPUs for three epochs and completed the training within two hours. If a dataset did not provide OCR, we extracted it via the Google Vision API.
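The learning-rate schedule described above (linear warmup for 1,000 steps, then cosine decay to a minimum of 0) can be sketched as a plain function. The peak learning rate and the total number of steps are not given in this section, so `peak_lr` and `total_steps` are hypothetical parameters, as is the choice to start the warmup at `peak_lr / warmup_steps`.

```python
import math

def learning_rate(step, total_steps, peak_lr, warmup_steps=1000, min_lr=0.0):
    """Linear warmup to peak_lr, followed by cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly; reaches peak_lr at the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine
```

For example, `learning_rate(step, total_steps=10_000, peak_lr=1e-4)` rises during the first 1,000 steps and then decays smoothly toward zero at step 10,000; both numeric values in this call are placeholders.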
Experimental Results and Analysis
Does our model outperform existing MLLMs?
Table 2 shows that our model achieved the highest performance on all datasets compared with other MLLMs. InstructDr consistently outperformed its original backbone, BLIP-2, by a significant margin, indicating that instruction tuning on InstructDoc effectively enhances performance on unseen VDU datasets, tasks, and domains. In contrast, InstructBLIP, which is instruction-tuned BLIP-2 trained on scene images, performed worse than BLIP-2. This is because InstructBLIP does not assume that the images might contain text during instruction tuning. BLIP-2 fine-tuned on InstructDoc falls short of the performance of InstructDr, indicating that InstructDr is better suited for comprehending diverse real-world documents. This conclusion is further supported by the results presented in Table 4, where ablations of Document-former, spatial information, and the strategy of gathering multi-page features have a significant negative impact on performance.
How well does our model perform in comparison with supervised SOTA models and powerful LLMs?
As shown in Table 3, our model outperformed ChatGPT on all datasets. Additionally, InstructDr achieved competitive results with supervised SOTA models and GPT-4 on the DUDE and SlideVQA datasets that require multiple reasoning skills (e.g., discrete, visual, and multi-hop reasoning). This indicates that our model can effectively learn diverse skills through instruction tuning with InstructDoc.
![Refer to caption](x4.png)
![Refer to caption](x5.png)
What is the role of instructions?
As shown in Table 4, removing instructions (i.e., using only the query and options as the model input) at training and/or test time significantly decreased zero-shot performance, indicating the effectiveness of incorporating instructions. This result is consistent with observations on high-quality instruction-tuning datasets (Wei et al. 2021; Xu, Shen, and Huang 2023). Moreover, our instruction annotations, including query rephrasing and answer styles, helped to improve the zero-shot performance.
![Refer to caption](x6.png)
Does our model have robustness towards diverse instructions?
Figure 4 shows the performance variance when the models were given five different instructions; InstructDr exhibited the smallest performance variance and outperformed the other models. This indicates InstructDoc empowers the model with the ability to deal with a variety of instructions. Our results also suggest that using multiple instructions per dataset is important for achieving decent performance.
What is the impact of diverse task clusters?
As shown in Figure 5, as the number of task clusters increases, we can observe an improvement in models’ zero-shot performance.
Model | VisualMRC (ROUGE-L) | DUDE (ANLS) | SlideVQA (EM) | SlideVQA (F1) |
---|---|---|---|---|
Supervised SOTA models | 52.2 | 46.1 | 33.5 | 41.7 |
BLIP-2 | 60.5 | 45.6 | 36.9 | 46.5 |
InstructDr | 61.1 | 46.8 | 37.7 | 47.3 |
Model | Image type in instruction tuning | TextVQA (Acc.) | TextVQA (ANLS) | ST-VQA (ANLS) |
---|---|---|---|---|
BLIP-2 | - | 48.7 | 64.8 | 39.1 |
InstructBLIP | Daily scene | 52.8 | 67.3 | 45.7 |
InstructDr | Documents | 53.8 | 68.1 | 43.3 |
Are our model weights effective for task-specific fine-tuning?
We further fine-tuned InstructDr (only the Document-former module) on a specific dataset to investigate the knowledge and transferability of our instruction-tuned model weights. Table 5 shows the fine-tuning performance on held-in (VisualMRC) and held-out (DUDE, SlideVQA) tasks. InstructDr achieved state-of-the-art fine-tuning performance on VisualMRC, DUDE, and SlideVQA using a unified model. Compared with BLIP-2, InstructDr exhibited superior fine-tuning performance on both held-in and held-out datasets, validating InstructDr as a better weight initialization for task-specific fine-tuning.
Can our model also understand images other than documents?
Table 6 shows the zero-shot performance of scene-text VQA (Singh et al. 2019; Biten et al. 2019) on scene images, which are the unseen image types in InstructDoc but were used for training our base model, BLIP-2. Note that ST-VQA’s images include the part of COCO (Lin et al. 2014) that InstructBLIP was trained on. This result indicates that InstructDr can effectively learn visual reasoning skills without forgetting the abilities of the original backbone.
Qualitative examples.
Figure 6 visualizes output examples, where the left/center/right examples require table/visual/hand-written text understanding skills, respectively. ChatGPT gave incorrect answers because it can only consider text information. Moreover, while BLIP-2 could not follow instructions (e.g., use 5 to 10 words) and extract items from structured text, InstructDr accomplished diverse VDU tasks with instructions. As shown in the right example, all models were affected by OCR quality, which caused incorrect answers.
Limitations
Despite its impressive performance on various VDU tasks with instructions, InstructDr suffers from noisy OCR predictions, and its performance depends highly on the quality of the OCR text (right of Figure 6). We nevertheless argue that our approach is more cost-efficient and accurate than the alternative pixel-based approaches (Kim et al. 2022; Chen et al. 2023), which require a large amount of computation to encode high-resolution images and cannot use document meta-information (e.g., bounding boxes). Moreover, since InstructDoc only contains a single document-text pair per instance, the model cannot learn the correlation among multiple document-text pairs and lacks an in-context learning capability. The same observation has also been reported for Flamingo (Alayrac et al. 2022) and BLIP-2. Finally, while we have constructed diverse VDU tasks, the number of tasks and corresponding instructions is still limited. We plan to consider utilizing automatic generation and augmentation techniques to increase the variety of instructions available.
Conclusion
We introduced a new large-scale instruction-tuning dataset, InstructDoc, to lay the foundation for building general-purpose VDU models that can follow natural language instructions. We also introduced a simple yet effective instruction tuning model, InstructDr, which unifies the vision, text, and layout modalities of documents by bridging the gap between a vision encoder and an LLM with Document-former. We performed a comprehensive study on instruction tuning with InstructDoc and demonstrated its generalization capability to a wide range of VDU datasets, tasks, and domains with instructions. We believe that our dataset will facilitate research on developing general-purpose document artificial intelligence systems.
References
- Alayrac et al. (2022) Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. 2022. Flamingo: a Visual Language Model for Few-Shot Learning. In NeurIPS.
- Appalaraju et al. (2021) Appalaraju, S.; Jasani, B.; Kota, B. U.; Xie, Y.; and Manmatha, R. 2021. Docformer: End-to-End Transformer for Document Understanding. In CVPR, 993–1003.
- Appalaraju et al. (2023) Appalaraju, S.; Tang, P.; Dong, Q.; Sankaran, N.; Zhou, Y.; and Manmatha, R. 2023. DocFormerv2: Local Features for Document Understanding. arXiv:2306.01733.
- Bach et al. (2022) Bach, S.; Sanh, V.; Yong, Z. X.; Webson, A.; Raffel, C.; Nayak, N. V.; Sharma, A.; Kim, T.; Bari, M. S.; Fevry, T.; Alyafeai, Z.; Dey, M.; Santilli, A.; Sun, Z.; Ben-david, S.; Xu, C.; Chhablani, G.; Wang, H.; Fries, J.; Al-shaibani, M.; Sharma, S.; Thakker, U.; Almubarak, K.; Tang, X.; Radev, D.; Jiang, M. T.-j.; and Rush, A. 2022. PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts. In ACL-demo, 93–104.
- Biten et al. (2019) Biten, A. F.; Tito, R.; Mafla, A.; i Bigorda, L. G.; Rusiñol, M.; Jawahar, C. V.; Valveny, E.; and Karatzas, D. 2019. Scene Text Visual Question Answering. In ICCV, 4290–4300.
- Borchmann et al. (2021) Borchmann, Ł.; Pietruszka, M.; Stanislawek, T.; Jurkiewicz, D.; Turski, M.; Szyndler, K.; and Graliński, F. 2021. DUE: End-to-End Document Understanding Benchmark. In NeurIPS.
- Chen et al. (2023) Chen, X.; Djolonga, J.; Padlewski, P.; Mustafa, B.; Changpinyo, S.; Wu, J.; Ruiz, C. R.; Goodman, S.; Wang, X.; Tay, Y.; et al. 2023. PaLI-X: On Scaling up a Multilingual Vision and Language Model. arXiv:2305.18565.
- Chen et al. (2021) Chen, X.; Zhao, Z.; Chen, L.; Ji, J.; Zhang, D.; Luo, A.; Xiong, Y.; and Yu, K. 2021. WebSRC: A Dataset for Web-Based Structural Reading Comprehension. In EMNLP, 4173–4185.
- Chiang et al. (2023) Chiang, W.-L.; Li, Z.; Lin, Z.; Sheng, Y.; Wu, Z.; Zhang, H.; Zheng, L.; Zhuang, S.; Zhuang, Y.; Gonzalez, J. E.; Stoica, I.; and Xing, E. P. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality.
- Chung et al. (2022) Chung, H. W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, E.; Wang, X.; Dehghani, M.; Brahma, S.; et al. 2022. Scaling Instruction-Finetuned Language Models. arXiv:2210.11416.
- Dai et al. (2023) Dai, W.; Li, J.; Li, D.; Tiong, A. M. H.; Zhao, J.; Wang, W.; Li, B.; Fung, P.; and Hoi, S. 2023. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv:2305.06500.
- Harley, Ufkes, and Derpanis (2015) Harley, A. W.; Ufkes, A.; and Derpanis, K. G. 2015. Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval. In ICDAR, 991–995.
- Hsu, Giles, and Huang (2021) Hsu, T.-Y.; Giles, C. L.; and Huang, T.-H. 2021. SciCap: Generating Captions for Scientific Figures. In EMNLP Findings, 3258–3264.
- Huang et al. (2022) Huang, Y.; Lv, T.; Cui, L.; Lu, Y.; and Wei, F. 2022. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. In ACMM, 4083–4091.
- Huang et al. (2019) Huang, Z.; Chen, K.; He, J.; Bai, X.; Karatzas, D.; Lu, S.; and Jawahar, C. 2019. ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction. In ICDAR, 1516–1520.
- Iyer et al. (2022) Iyer, S.; Lin, X. V.; Pasunuru, R.; Mihaylov, T.; Simig, D.; Yu, P.; Shuster, K.; Wang, T.; Liu, Q.; Koura, P. S.; et al. 2022. OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization. arXiv:2212.12017.
- Jaume, Ekenel, and Thiran (2019) Jaume, G.; Ekenel, H. K.; and Thiran, J.-P. 2019. FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents. In ICDARW.
- Kembhavi et al. (2016) Kembhavi, A.; Salvato, M.; Kolve, E.; Seo, M.; Hajishirzi, H.; and Farhadi, A. 2016. A Diagram Is Worth A Dozen Images. In ECCV, 235–251.
- Kembhavi et al. (2017) Kembhavi, A.; Seo, M. J.; Schwenk, D.; Choi, J.; Farhadi, A.; and Hajishirzi, H. 2017. Are You Smarter Than a Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension. In CVPR, 5376–5384.
- Kim et al. (2022) Kim, G.; Hong, T.; Yim, M.; Nam, J.; Park, J.; Yim, J.; Hwang, W.; Yun, S.; Han, D.; and Park, S. 2022. OCR-free Document Understanding Transformer. In ECCV, 498–517.
- Krishna et al. (2017) Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.; Shamma, D. A.; Bernstein, M. S.; and Fei-Fei, L. 2017. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. IJCV, 123(1): 32–73.
- Kuznetsova et al. (2020) Kuznetsova, A.; Rom, H.; Alldrin, N.; Uijlings, J.; Krasin, I.; Pont-Tuset, J.; Kamali, S.; Popov, S.; Malloci, M.; Kolesnikov, A.; et al. 2020. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. IJCV, 1956–1981.
- Landeghem et al. (2023) Landeghem, J.; Tito, R.; Borchmann, Ł.; Pietruszka, M.; Józiak, P.; Powalski, R.; Jurkiewicz, D.; Coustaty, M.; Ackaert, B.; Valveny, E.; et al. 2023. Document Understanding Dataset and Evaluation (DUDE). arXiv:2305.08455.
- Lee et al. (2023) Lee, K.; Joshi, M.; Turc, I. R.; Hu, H.; Liu, F.; Eisenschlos, J. M.; Khandelwal, U.; Shaw, P.; Chang, M.-W.; and Toutanova, K. 2023. Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding. In ICML, 18893–18912.
- Li et al. (2023) Li, J.; Li, D.; Savarese, S.; and Hoi, S. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In ICML.
- Li et al. (2020) Li, M.; Xu, Y.; Cui, L.; Huang, S.; Wei, F.; Li, Z.; and Zhou, M. 2020. DocBank: A Benchmark Dataset for Document Layout Analysis. In COLING, 949–960.
- Lin et al. (2014) Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common Objects in Context. In ECCV, 740–755.
- Liu et al. (2023a) Liu, H.; Li, C.; Wu, Q.; and Lee, Y. J. 2023a. Visual Instruction Tuning. arXiv:2304.08485.
- Liu et al. (2023b) Liu, Y.; Li, Z.; Li, H.; Yu, W.; Huang, M.; Peng, D.; Liu, M.; Chen, M.; Li, C.; Jin, L.; et al. 2023b. On the Hidden Mystery of OCR in Large Multimodal Models. arXiv:2305.07895.
- Longpre et al. (2023) Longpre, S.; Hou, L.; Vu, T.; Webson, A.; Chung, H. W.; Tay, Y.; Zhou, D.; Le, Q. V.; Zoph, B.; Wei, J.; et al. 2023. The FLAN collection: Designing Data and Methods for Effective Instruction Tuning. arXiv:2301.13688.
- Loshchilov and Hutter (2017) Loshchilov, I.; and Hutter, F. 2017. Decoupled Weight Decay Regularization. arXiv:1711.05101.
- Lu et al. (2022) Lu, P.; Mishra, S.; Xia, T.; Qiu, L.; Chang, K.-W.; Zhu, S.-C.; Tafjord, O.; Clark, P.; and Kalyan, A. 2022. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. In NeurIPS.
- Lu et al. (2021) Lu, P.; Qiu, L.; Chen, J.; Xia, T.; Zhao, Y.; Zhang, W.; Yu, Z.; Liang, X.; and Zhu, S.-C. 2021. IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning. In NeurIPS.
- Masry et al. (2022) Masry, A.; Do, X. L.; Tan, J. Q.; Joty, S.; and Hoque, E. 2022. ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. In ACL Findings, 2263–2279.
- Mathew et al. (2022) Mathew, M.; Bagal, V.; Tito, R.; Karatzas, D.; Valveny, E.; and Jawahar, C. 2022. InfographicVQA. In WACV, 1697–1706.
- Mathew, Karatzas, and Jawahar (2021) Mathew, M.; Karatzas, D.; and Jawahar, C. V. 2021. DocVQA: A Dataset for VQA on Document Images. In WACV, 2200–2209.
- Mishra et al. (2019) Mishra, A.; Shekhar, S.; Singh, A. K.; and Chakraborty, A. 2019. OCR-VQA: Visual Question Answering by Reading Text in Images. In ICDAR, 947–952.
- Mishra et al. (2022) Mishra, S.; Khashabi, D.; Baral, C.; and Hajishirzi, H. 2022. Cross-Task Generalization via Natural Language Crowdsourcing Instructions. In ACL, 3470–3487.
- OpenAI (2023) OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774.
- Park et al. (2019) Park, S.; Shin, S.; Lee, B.; Lee, J.; Surh, J.; Seo, M.; and Lee, H. 2019. CORD: A Consolidated Receipt Dataset for Post-OCR Parsing. In Workshop on Document Intelligence at NeurIPS.
- Pfitzmann et al. (2022) Pfitzmann, B.; Auer, C.; Dolfi, M.; Nassar, A. S.; and Staar, P. 2022. DocLayNet: A Large Human-Annotated Dataset for Document-Layout Segmentation. In KDD, 3743–3751.
- Radford et al. (2021) Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning Transferable Visual Models from Natural Language Supervision. In ICML, 8748–8763.
- Šimsa et al. (2023) Šimsa, Š.; Šulc, M.; Uřičář, M.; Patel, Y.; Hamdi, A.; Kocián, M.; Skalický, M.; Matas, J.; Doucet, A.; Coustaty, M.; and Karatzas, D. 2023. DocILE Benchmark for Document Information Localization and Extraction. arXiv:2302.05658.
- Singh et al. (2019) Singh, A.; Natarajan, V.; Shah, M.; Jiang, Y.; Chen, X.; Batra, D.; Parikh, D.; and Rohrbach, M. 2019. Towards VQA Models That Can Read. In CVPR, 8317–8326.
- Sun et al. (2021) Sun, H.; Kuang, Z.; Yue, X.; Lin, C.; and Zhang, W. 2021. Spatial Dual-Modality Graph Reasoning for Key Information Extraction. arXiv:2103.14470.
- Tanaka et al. (2023) Tanaka, R.; Nishida, K.; Nishida, K.; Hasegawa, T.; Saito, I.; and Saito, K. 2023. SlideVQA: A Dataset for Document Visual Question Answering on Multiple Images. In AAAI, 13636–13645.
- Tanaka, Nishida, and Yoshida (2021) Tanaka, R.; Nishida, K.; and Yoshida, S. 2021. VisualMRC: Machine Reading Comprehension on Document Images. In AAAI, 13878–13888.
- Tüselmann et al. (2022) Tüselmann, O.; Müller, F.; Wolf, F.; and Fink, G. A. 2022. Recognition-free Question Answering on Handwritten Document Collections. In ICFHR, 259–273.
- Wang et al. (2021) Wang, B.; Li, G.; Zhou, X.; Chen, Z.; Grossman, T.; and Li, Y. 2021. Screen2words: Automatic mobile UI summarization with multimodal learning. In UIST, 498–510.
- Wei et al. (2021) Wei, J.; Bosma, M.; Zhao, V. Y.; Guu, K.; Yu, A. W.; Lester, B.; Du, N.; Dai, A. M.; and Le, Q. V. 2021. Finetuned language models are zero-shot learners. In ICLR.
- Xu et al. (2020) Xu, Y.; Li, M.; Cui, L.; Huang, S.; Wei, F.; and Zhou, M. 2020. LayoutLM: Pre-training of Text and Layout for Document Image Understanding. In KDD, 1192–1200.
- Xu et al. (2021) Xu, Y.; Xu, Y.; Lv, T.; Cui, L.; Wei, F.; Wang, G.; Lu, Y.; Florêncio, D. A. F.; Zhang, C.; Che, W.; Zhang, M.; and Zhou, L. 2021. LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding. In ACL/IJCNLP, 2579–2591.
- Xu, Shen, and Huang (2023) Xu, Z.; Shen, Y.; and Huang, L. 2023. MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning. In ACL, 11445–11465.
- Ye et al. (2023a) Ye, J.; Hu, A.; Xu, H.; Ye, Q.; Yan, M.; Dan, Y.; Zhao, C.; Xu, G.; Li, C.; Tian, J.; Qi, Q.; Zhang, J.; and Huang, F. 2023a. mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding. arXiv:2307.02499.
- Ye et al. (2023b) Ye, Q.; Xu, H.; Xu, G.; Ye, J.; Yan, M.; Zhou, Y.; Wang, J.; Hu, A.; Shi, P.; Shi, Y.; Jiang, C.; Li, C.; Xu, Y.; Chen, H.; Tian, J.; Qi, Q.; Zhang, J.; and Huang, F. 2023b. mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality. arXiv:2304.14178.
- Zhang et al. (2023) Zhang, Y.; Zhang, R.; Gu, J.; Zhou, Y.; Lipka, N.; Yang, D.; and Sun, T. 2023. LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding. arXiv:2306.17107.
- Zhu et al. (2023) Zhu, D.; Chen, J.; Shen, X.; Li, X.; and Elhoseiny, M. 2023. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv:2304.10592.
- Zhu et al. (2022) Zhu, F.; Lei, W.; Feng, F.; Wang, C.; Zhang, H.; and Chua, T.-S. 2022. Towards Complex Document Understanding by Discrete Reasoning. In ACMM, 4857–4866.
Appendix A Further InstructDoc Details
Dataset | Document type | Held-out (split) | # Held-in instances | # Held-out instances |
---|---|---|---|---|
DocILE (Šimsa et al. 2023) | Invoice | | 56,369 | - |
KLC (Borchmann et al. 2021) | Report | | 13,449 | - |
Deepform (Borchmann et al. 2021) | Form | | 3,409 | - |
FUNSD (Jaume, Ekenel, and Thiran 2019) | Form | ✓(test) | - | 2,270 |
PWC (Borchmann et al. 2021) | Article | | 2,853 | - |
Wildreceipt (Sun et al. 2021) | Receipt | | 16,188 | - |
CORD (Park et al. 2019) | Receipt | ✓(test) | - | 1,336 |
SROIE (Huang et al. 2019) | Receipt | | 2,503 | - |
VisualMRC (Tanaka, Nishida, and Yoshida 2021) | Web-page | | 21,015 | - |
WebSRC (Chen et al. 2021) | Web-page | | 307,315 | - |
OCRVQA (Mishra et al. 2019) | Book cover | | 798,245 | - |
DocVQA (Mathew, Karatzas, and Jawahar 2021) | Business document | | 39,463 | - |
HW-SQuAD (Tüselmann et al. 2022) | Hand-written document | | 67,887 | - |
TAT-DQA (Zhu et al. 2022) | Report | | 13,251 | - |
WTQ (Borchmann et al. 2021) | Table | | 14,224 | - |
IconQA (Lu et al. 2021) | Diagram | | 29,859 | - |
AI2D (Kembhavi et al. 2016) | Diagram | | 12,413 | - |
ScienceQA (Lu et al. 2022) | Diagram | | 12,726 | - |
TextbookQA (Kembhavi et al. 2017) | Diagram | | 6,501 | - |
InfographicVQA (Mathew et al. 2022) | Infographic | ✓(dev) | - | 2,801 |
ChartQA (Masry et al. 2022) | Chart | ✓(test) | - | 2,500 |
SlideVQA (Tanaka et al. 2023) | Slide deck | ✓(test) | - | 2,215 |
DUDE (Landeghem et al. 2023) | Open-domain | ✓(dev) | - | 6,315 |
TabFact (Borchmann et al. 2021) | Table | ✓(dev) | - | 12,740 |
LLaVAR (Zhang et al. 2023) | 8 document types | | 19,800 | - |
RVL-CDIP (Harley, Ufkes, and Derpanis 2015) | Business document | | 320,000 | - |
SciCap (Hsu, Giles, and Huang 2021) | Figure | | 75,494 | - |
Screen2Words (Wang et al. 2021) | Mobile UI | | 15,743 | - |
DocBank (Li et al. 2020) | Article | | 3,999,811 | - |
DocLayNet (Pfitzmann et al. 2022) | Report | | 69,102 | - |
Dataset Collection
Dataset list.
Table 7 lists the details of all datasets used in InstructDoc. In total, it contains 5,917,602 held-in instances and 30,177 held-out instances.
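As a quick consistency check on the held-out total, the seven held-out rows of Table 7 can be tallied directly. The sketch below is only illustrative: the dictionary and function names are hypothetical, and the counts are copied by hand from the table.

```python
# Held-out instance counts copied from the held-out rows of Table 7.
# The names below are illustrative and not part of any released code.
HELD_OUT_COUNTS = {
    "FUNSD": 2_270,
    "CORD": 1_336,
    "InfographicVQA": 2_801,
    "ChartQA": 2_500,
    "SlideVQA": 2_215,
    "DUDE": 6_315,
    "TabFact": 12_740,
}


def total_instances(counts: dict[str, int]) -> int:
    """Sum the per-dataset instance counts."""
    return sum(counts.values())
```

Calling `total_instances(HELD_OUT_COUNTS)` gives 30,177, matching the held-out figure quoted above.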
Query rephrasing.
Table 8 shows the details of the query rephrasing annotation. The rephrased queries are worded to be easier to understand than the original ones.
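The rephrasing can be thought of as a simple lookup from original label names to more readable ones. The following is a minimal sketch under that assumption, using the FUNSD entries and a handful of CORD entries from Table 8. The helper names and the `Options:` prefix of the query are hypothetical; only the closing sentence of the query follows the wording seen in the dataset examples.

```python
# "Before" -> "After" label names copied from Table 8.
# Only the FUNSD labels and a few CORD labels are listed, for brevity.
FUNSD_REPHRASE = {"header": "title", "question": "key", "answer": "value"}
CORD_REPHRASE = {
    "menu.nm": "menu_name",
    "menu.cnt": "menu_quantity",
    "menu.price": "menu_price",
    "sub_total.othersvc_price": "subtotal_chargeprice",
    "total.total_price": "total_price",
}


def rephrase_label(label: str, mapping: dict[str, str]) -> str:
    """Return the rephrased label, or the label unchanged if it has no entry."""
    return mapping.get(label, label)


def build_kie_query(text: str, mapping: dict[str, str]) -> str:
    """Hypothetical helper: list the rephrased labels and ask for the category of `text`."""
    options = ", ".join(f'"{name}"' for name in mapping.values())
    return (
        f"Options: {options}. "
        f'Please output the category corresponding to the text "{text}".'
    )
```

Labels without an entry, such as the FUNSD label "other", pass through unchanged.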
Instruction annotation.
Dataset Analysis
Starting words of the instructions.
![Refer to caption](x7.png)
Figure 7 shows the sunburst pattern of the first three words of the instructions. The instructions take various forms used in real-world situations, such as questions (e.g., “What is the”) and requests (e.g., “I want to”).
Answer styles.
Figure 8 shows that InstructDoc has five diverse answer types.
Word clouds.
Figure 9 shows how diverse the vocabulary space is in InstructDoc.
![Refer to caption](x8.png)
![Refer to caption](x9.png)
![Refer to caption](x10.png)
![Refer to caption](x11.png)
Appendix B Further Evaluation Setup Details
Main Evaluation Datasets Details
FUNSD.
Form Understanding in Noisy Scanned Documents (FUNSD) (Jaume, Ekenel, and Thiran 2019) evaluates the KIE task: predicting the entity type (“title”, “key”, “value”, or “other”) of a given text token.
CORD.
Consolidated Receipt Dataset for Post-OCR Parsing (CORD) (Park et al. 2019) is a KIE dataset with 30 labels under 4 categories, such as “total” or “subtotal”.
InfographicVQA.
This dataset focuses on single-page QA with discrete and visual reasoning on infographics. It requires understanding plots/graphs, text, and layout (Mathew et al. 2022).
ChartQA.
This dataset focuses on single-page QA with discrete and visual reasoning on chart images. We used both subsets: (i) the machine-generated set and (ii) the human-written set (Masry et al. 2022).
TabFact.
This dataset studies the task of Document NLI with semi-structured evidence over tables. The task is to predict the entailment relationship between two sentences in a document (Borchmann et al. 2021).
DUDE.
Document Understanding Dataset and Evaluation (DUDE) (Landeghem et al. 2023) focuses on multi-page QA with discrete, visual, and multi-hop reasoning. It is a multi-page, multi-domain, and multi-industry Document VQA dataset for real-world document understanding.
SlideVQA.
This dataset focuses on multi-page QA with discrete, visual, and multi-hop reasoning on slide decks composed of multiple images. It requires selecting a set of evidence and answering the question (Tanaka et al. 2023).
Other Evaluation Datasets Details
VisualMRC.
Visual Machine Reading Comprehension (VisualMRC) (Tanaka, Nishida, and Yoshida 2021) is the task of abstractive single-page QA on webpage screenshots. We used the end-to-end setting, in which answers are derived from OCR results and images without ROI detection.
TextVQA.
ST-VQA.
It contains scene images from multiple sources, such as Visual Genome (Krishna et al. 2017). We used the Open Dictionary setting where answer candidates and vocabularies are not provided at test time (Biten et al. 2019).
Dataset | Before | After |
---|---|---|
FUNSD | header | title |
FUNSD | question | key |
FUNSD | answer | value |
CORD | menu.nm | menu_name |
CORD | menu.num | menu_id |
CORD | menu.unitprice | menu_unitprice |
CORD | menu.cnt | menu_quantity |
CORD | menu.discountprice | menu_discountprice |
CORD | menu.price | menu_price |
CORD | menu.itemsubtotal | menu_price_discount_applied |
CORD | menu.vatyn | menu_whether_price_tax_includes |
CORD | menu.etc | menu_etc |
CORD | menu.sub_nm | submenu_name |
CORD | menu.sub_unitprice | submenu_unitprice |
CORD | menu.sub_cnt | submenu_quantity |
CORD | menu.sub_price | submenu_price |
CORD | menu.sub_etc | submenu_etc |
CORD | void_menu.nm | voidmenu_name |
CORD | void_menu.price | voidmenu_price |
CORD | sub_total.subtotal_price | subtotal_price |
CORD | sub_total.discount_price | subtotal_discount_price |
CORD | sub_total.service_price | subtotal_service_price |
CORD | sub_total.othersvc_price | subtotal_chargeprice |
CORD | sub_total.tax_price | subtotal_tax_price |
CORD | sub_total.etc | subtotal_etc |
CORD | total.total_price | total_price |
CORD | total.total_etc | total_etc |
CORD | total.cashprice | total_cashprice |
CORD | total.changeprice | total_changeprice |
CORD | total.creditcardprice | total_creditcardprice |
CORD | total.emoneyprice | total_emoneyprice |
CORD | total.menutype_cnt | total_menutype_count |
CORD | total.menuqty_cnt | total_menuquantity_count |
Input | Target |
---|---|
“menu_name”, “menu_id”, “menu_unitprice”, “menu_quantity”, “menu_discountprice”, “menu_price”, “menu_price_discount_applied”, “menu_whether_price_tax_includes”, “menu_etc”, “submenu_name”, “submenu_unitprice”, “submenu_quantity”, “submenu_price”, “submenu_etc”, “voidmenu_name”, “voidmenu_price”, “subtotal_price”, “subtotal_discount_price”, “subtotal_service_price”, “subtotal_chargeprice”, “subtotal_tax_price”, “subtotal_etc”, “total_price”, “total_etc”, “total_cashprice”, “total_changeprice”, “total_creditcardprice”, “total_emoneyprice”, “total_menutype_count”, and “total_menuquantity_count”. Please output the category corresponding to the text “16,500”. | menu_price |
You should answer with a natural language text with the correct information extracted strictly from the document. | It is used to list the directory in long form and see all of its information. |
Answers can be either a specific span of the document or a numerical value (such as “2,” “2.5,” “2%,” etc.). Numerical questions require counting or summing values. | 7824 |
Please select one answer from the following options: “Elliptical”, “Linear”, “Deltoid”, “Oval” | Linear |
Provide your short answer to the question “Which three business types is Pinterest good for?” based on the document. The answer type can be divided following types: - Directly extracting the words from the image - a list of spans inside of document (each span should be separated by “,”) - not exist explicitly as a span of document | Interior design, Restaurants, Wedding venues |
If you need mathematical reasoning, you first extract numbers and then calculate them to answer the question. Answers contain either: - a span inside of document - a list of spans inside of document (each span should be separated by “,”) - not exist explicitly as a span of document (the answer should be freely generated text) Directly answer the question from the document with 1 to 3 words as possible. | 23% |
Can we conclude that “during the third round of the turkish cup , there be no new entry during that stage”, based on the content of the given document? Answers are selected from given options “yes” or “no”, and you should not output other than the selected option. | yes |
“letter”, “form”, “email”, “handwritten”, “advertisement”, “scientific report”, “scientific publication”, “specification”, “file folder”, “news article”, “budget”, “invoice”, “presentation”, “questionnaire”, “resume”, and “memo”. | handwritten |
| | Page with search bar to search hotel. |
“Abstract”, “Author”, “Caption”, “Date”, “Equation”, “Figure”, “Footer”, “List”, “Paragraph”, “Reference”, “Section”, “Table”, and “Title”. | Picture [750, 29, 954, 94] Section-header [209, 120, 376, 139] Table [151, 191, 850, 702] Picture [521, 731, 837, 946] Picture [158, 733, 485, 955] Page-footer [897, 962, 911, 972] |
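Layout-analysis targets such as the last example above are flat strings of label and box pairs, so they need to be parsed back into structured records before scoring. The sketch below shows one way this might be done, assuming each entry is a label followed by four bracketed integers. The meaning of the four coordinates is not stated here, so they are kept as raw integers. `LayoutBox` and `parse_layout_target` are hypothetical names.

```python
import re
from typing import NamedTuple


class LayoutBox(NamedTuple):
    """One predicted layout element: a class label and four raw coordinates."""
    label: str
    box: tuple


# A label (letters and hyphens) followed by four comma-separated integers in brackets.
_ENTRY = re.compile(r"([A-Za-z][A-Za-z-]*)\s*\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]")


def parse_layout_target(target: str) -> list:
    """Split a flat target string into LayoutBox records, in order of appearance."""
    return [
        LayoutBox(m.group(1), tuple(int(m.group(i)) for i in range(2, 6)))
        for m in _ENTRY.finditer(target)
    ]
```

For instance, parsing `"Picture [750, 29, 954, 94] Section-header [209, 120, 376, 139]"` yields two records, the first with label `Picture` and box `(750, 29, 954, 94)`. Text that does not match the pattern is skipped rather than raising an error, which may or may not be the desired behavior when scoring malformed model outputs.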