Cross-Modal Entity Matching for
Visually Rich Documents

Ritesh Sarkhel Amazon
Seattle, USA
Arnab Nandi The Ohio State University
Columbus, USA
Abstract

Visually rich documents (e.g. leaflets, banners, magazine articles) are physical or digital documents that utilize visual cues to augment their semantics. Information contained in these documents is ad-hoc and often incomplete. Existing works that enable structured querying on these documents do not take this into account. This makes it difficult to contextualize the information retrieved from querying these documents and gather actionable insights from them. We propose Juno – a cross-modal entity matching framework to address this limitation. It augments heterogeneous documents with supplementary information by matching a text span in the document with semantically similar tuples from an external database. Our main contribution is a deep neural network with attention that goes beyond traditional keyword-based matching and finds matching tuples by aligning text spans and relational tuples on a multimodal encoding space without any prior knowledge about the document type or the underlying schema. Exhaustive experiments on multiple real-world datasets show that Juno generalizes to heterogeneous documents with diverse layouts and formats. It outperforms state-of-the-art baselines by more than 6 F1 points with up to 60% fewer human-labeled samples. Our experiments further show that Juno is a computationally robust framework. We can train it only once, and then adapt it dynamically for multiple resource-constrained environments without sacrificing its downstream performance. This makes it suitable for on-device deployment on various edge devices. To the best of our knowledge, ours is the first work that investigates the information incompleteness of visually rich documents and proposes a generalizable, performant and computationally robust framework to address it in an end-to-end way.

Index Terms:
visually rich document, entity matching, multimodal data, deep neural network

I Introduction

A visually rich document (VRD) is a physical or digital document that leverages explicit or implicit visual cues (e.g. color, distance, orientation) to augment its semantics. From medical intake forms to invoices, restaurant menus to leaflets, VRDs are pervasive in our everyday lives. Due to their popularity, there is a recent surge in research interest on structured querying [1, 2, 3, 4] of these documents. A data pipeline set up for this task typically works as follows. Given a schema $\mathcal{S}$ and a document $\mathcal{D}$, we extract a structured record $\mathcal{R}^{\mathcal{S}}_{\mathcal{D}}$ with schema $\mathcal{S}$ from $\mathcal{D}$. We clean this record, transform it into a compatible format and load it onto a data-warehouse. This data-warehouse serves as the back-end of an analytical engine that allows us to execute queries and gather actionable insights from $\mathcal{D}$. Unfortunately, most existing methods make a closed-world assumption in setting up this pipeline that leads to limited query coverage. Take the following scenario for example.

Example: Alice wants to place an order from a restaurant she is visiting for the first time. She has a printed menu which contains various items along with their prices. Alice has some food allergies. The allergen information of an item, however, does not appear on the menu. This leads to her poring over the menu and looking up each item in a nutritional table before she can place an order. Existing data pipelines that enable structured querying of visually rich documents cannot help her automate this process as they only support those queries that return a subset of text spans appearing on the document. Due to the ad-hoc and often incomplete nature of these documents, this can be limiting in many real-world scenarios. The information needed to gather insights (e.g. allergen information) may not appear in the document in the first place.

Figure 1: Visually Rich Documents utilize visual and textual cues to highlight the semantics of various entities appearing on them. They can have diverse layouts, formats, and be used for short-form communication such as leaflets, posters and menu-cards.
Figure 2: An overview of Juno’s end-to-end workflow is shown on the right side of this figure. It takes a text span from a visually rich document as input, encodes it as a fixed-length vector on a multimodal embedding space, and aligns it with semantically similar tuples in a relational database on that space. These tuples are then retrieved and returned to the user.

Cross-modal entity matching: A data pipeline augmented with on-device cross-modal entity matching capability (defined in Section II) can help Alice automate this task. Briefly, a cross-modal entity matching (CEM) framework maps a data element $e_1$ in one modality (e.g. text span on a printed menu) to a data element $e_2$ in another modality (e.g. tuple in a nutritional database) if they represent the same real-world object. A typical entity matching framework works in two phases. In the first phase, data elements that are unlikely to match are discarded [5]. Remaining data elements from both sources are then matched by making pairwise comparisons through carefully designed rules in the second phase. What makes this task challenging in a cross-modal setting is matching data elements across modalities in a generalizable way. Contrary to its unimodal counterpart [6], cross-modal entity matching frameworks require an additional relationship modeling step [7] before the matching phase, where data elements from different modalities are represented on a shared embedding space. This is a challenging task for VRDs as text spans appearing on these documents employ both visual and textual cues to highlight their semantics. Diversity in the layout and format of these documents also makes it difficult to implement a generalizable solution for this task (see Fig. 1).

Limitations of existing solutions: Before describing our framework, let’s walk through a naive solution for the previous example first. Using off-the-shelf data pipelines, we can extract structured records corresponding to each item in the printed menu by employing a document understanding model (e.g. LayoutLMv2 [4]). Briefly, it is a Transformer-based [8] model that takes a rendered image of the document as input, encodes both its visual and textual features, and extracts structured records corresponding to each item appearing on this document. We can identify the allergen information of an item by performing a fuzzy-join between its corresponding record and a nutritional database. Unfortunately, this solution does not scale well for a large-scale corpus of heterogeneous documents. Performance of a document understanding model is directly proportional to the number of human-labeled samples used to fine-tune [9] that model. Performing fuzzy-join between a record and a relational database requires significant domain-expertise as well. For example, an item printed as ‘Pasta in Red Tomato Sauce’ on the menu may appear as ‘Pasta in Marinara Sauce’ in the database. Establishing semantic similarity in such scenarios requires carefully designed rules from domain-experts. It is hard to maintain and update these rules for a large-scale corpus. Off-the-shelf tools, including large language models trained on huge amounts of textual data (e.g. GPT-3 [10]), cannot bridge this gap completely as they hallucinate on emergent topics [11, 12] typically covered by these documents (see Section V for experimental results). Fine-tuning these models on a custom domain requires significant effort and computational resources. Vision-language models (e.g. CLIP [13]) trained on huge amounts of {image, text} pairs exhibit good zero-shot generalization capability for cross-modal entity matching and retrieval tasks. These models, however, use pixel-level encodings to represent an image. This does not take the inherently multimodal nature of a visually rich document into account (see Section V for experimental results). Recent works [14, 1] have established the necessity of encoding both textual and visual features pertaining to the document layout to compute fixed-length representations for visually rich documents. Furthermore, the amount of computational resources needed to infer a match using these models makes it difficult to deploy them in resource-constrained environments. This motivates us to formalize the following objectives for our entity matching task.

Problem statement: Given a document $\mathcal{D}$ and a relational table $\mathcal{T}$, our goal is to learn a mapping $f: w \rightarrow t$ between a text span $w \in \mathcal{D}$ and a set of tuples $t \in \mathcal{T}$ if they represent the same real-world object. Our objectives are as follows.

  1. $f$ is performant i.e., its mapping accuracy is high.

  2. $f$ is scalable i.e., the number of human-labeled samples needed to learn $f$ is low.

  3. $f$ is generalizable i.e., it can be adapted for diverse documents without any prior knowledge about the document type or the underlying schema.

  4. $f$ is computationally robust i.e., it can be adapted to resource-constrained environments without any significant degradation in its downstream performance.

Our contributions: We develop a generalizable framework for cross-modal entity matching against visually rich documents in this paper. Our core contribution is a multimodal deep neural network that maps text spans to semantically similar relational tuples by aligning them on a shared embedding space without any prior knowledge about the document type or the underlying schema. Contrary to existing works that use handcrafted rules from domain experts, our framework is more scalable as it leverages a novel attention mechanism [15] (defined in Section II) to reduce the number of pairwise comparisons needed to find a match. Compared to contemporary supervised solutions, our framework uses significantly fewer human-labeled samples in its training. This frees up developers from tedious feature engineering and allows them to focus more on training iterations. Not being bound to layout and/or format-specific rules makes our method more generalizable for diverse document types as well. We refer to this framework in the rest of this paper as Juno.

Summary of results: We evaluate Juno on two real-world datasets – The IMDB Movie Dataset and The NYC Open Event Dataset for separate entity matching tasks. Our results show that Juno is not only more performant than existing methods – outperforming state-of-the-art baselines by more than 6% in F1-score but it is also more scalable – reducing the number of human-labeled samples needed to train the same model for comparable performance using direct supervision by up to 60%. Our experiments further show that Juno is computationally robust – we can reduce its memory footprint by up to 24% using off-the-shelf algorithms without degrading its performance. This makes Juno a suitable candidate for on-device deployment of various cross-modal entity matching tasks. Investigating optimal ways to reduce memory footprint and computational overhead of entity matching frameworks for interactive applications (e.g. augmented/mixed reality settings [16]) is one of our planned future works.

II Background & Related Works

II-A Entity Matching

Let $D_1$ and $D_2$ represent two unique data sources. Entity matching refers to the task of identifying all data element pairs $\langle e_1, e_2 \rangle$, $\forall e_1 \in D_1$, $e_2 \in D_2$ that represent the same real-world objects. A typical entity matching framework works in two phases. In the blocking phase, it filters out obvious non-matches from both data sources to reduce the number of pairwise comparisons, whereas in the matching phase it identifies similar data elements from the remaining set. If $D_1$ and $D_2$ are structured data sources, this can be done by making attribute-wise comparisons between tuples from $D_1$ and $D_2$. If they have the same schema, this is a trivial task. Otherwise, we perform schema-matching [17, 18] to align these two data sources first. Contemporary researchers have investigated various techniques [19, 20] to perform entity matching between a document and a structured data source in recent years. Smith et al. [19] employed carefully designed rules to identify {subject, predicate, object} triplets from the document first. To perform entity matching, they aligned the predicates extracted from these triplets and tuples in a relational database using high-precision rules. Cafarella et al. [20] employed human-experts to modify, extend, and align web-tables extracted from semi-structured web-pages for data integration tasks. Unfortunately, rule-based solutions like these are hard to scale for a large-scale corpus due to the diversity in layout and format of visually rich documents. Contemporary researchers have proposed deep learning based solutions to improve the scalability of this task in recent years.
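
The two-phase structure described above can be illustrated with a minimal sketch. It is not the method of any cited work: the token-overlap blocking rule, the Jaccard similarity, the 0.5 threshold, and the names `block_pairs` and `match_pairs` are all assumptions chosen for illustration.

```python
from itertools import product


def tokens(value: str) -> set:
    """Lower-cased word set shared by both phases."""
    return set(value.lower().split())


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def block_pairs(d1: list, d2: list) -> list:
    """Blocking phase: discard pairs that share no token at all."""
    return [(e1, e2) for e1, e2 in product(d1, d2) if tokens(e1) & tokens(e2)]


def match_pairs(candidates: list, threshold: float = 0.5) -> list:
    """Matching phase: pairwise comparison on the surviving candidates."""
    return [
        (e1, e2)
        for e1, e2 in candidates
        if jaccard(tokens(e1), tokens(e2)) >= threshold
    ]


menu_items = ["Pasta in Red Tomato Sauce", "Caesar Salad"]
db_items = ["Pasta in Marinara Sauce", "Greek Salad", "Lemonade"]

candidates = block_pairs(menu_items, db_items)  # 2 of the 6 possible pairs survive
matches = match_pairs(candidates)
print(matches)
```

Keyword overlap only just links the two pasta entries here, which hints at why hand-tuned similarity rules of this kind are brittle across domains.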

II-B Deep Entity Matching

A deep entity matching framework usually works as follows. After the blocking phase has pruned obvious non-matches, it formulates the matching phase as a binary classification problem, where it represents each data element (from both sources) as a fixed-length vector and categorizes each candidate pair as match (or non-match). Employing deep neural networks for this task obviates the need of careful feature engineering to represent a data element, and frees up developers to focus on training iterations. The choice of architecture usually depends on the dataset and the available computation budget. For instance, Ebraheem et al. [21] developed a Recurrent Neural Network (RNN)-based [22] model to match two relational databases by learning a distributed representation for each tuple. Nie et al. [23] extended their work to develop a label-efficient model that leverages the power of transfer learning [24]. Recent works [25, 26] have established the efficacy of Transformer-based models [8] pretrained on large amounts of textual data for this task, reporting state-of-the-art results on multiple benchmark datasets.

II-C Deep Entity Matching with Attention

Deep neural networks such as RNNs [22] assign equal weight to the entire input sequence when computing a fixed-length representation of a data element. This makes it difficult for an entity matching model to learn a meaningful summarization of long and potentially noisy input sequences (e.g. multi-valued attributes in relational tuples). Recent works [27, 28] have introduced the attention mechanism to overcome this limitation. A deep neural network with attention takes a fixed-length vector representation ($y$) of a data element as input, and produces a summarized version ($z$) of $y$ as output. In doing so, it only retains the information that is relevant to the downstream task and discards the rest, thus allowing an entity matching model to ‘attend’ to the ‘important parts’ of an input sequence during the matching phase. Importantly, the attention mechanism allows an entity matching model to learn which parts of an input sequence to attend to during training. Although not for visually rich documents, Mudgal et al. [29] was one of the early works to establish the efficacy of the attention mechanism for entity matching tasks. Li et al. [25] extended these findings and established the efficacy of the attention mechanism for entity matching tasks with Transformer-based models in recent years. One of the key differences of the attention mechanism employed in Juno compared to existing works is its asymmetric nature. Contrary to existing works that utilize a symmetric attention mechanism due to the homogeneity of both data sources, we employ an asymmetric, bi-directional attention mechanism that adapts to the diverse characteristics of a candidate pair: multimodal text spans from heterogeneous documents, and tuples from relational databases.

II-D Cross-Modal Entity Matching

Contrary to its unimodal counterpart, a cross-modal entity matching (CEM) framework matches data elements across modalities. A traditional CEM framework works in three phases [7]. Fixed-length vectors are computed to represent data elements from both data sources in the first phase. These vectors are then projected onto a shared, multimodal embedding space in the second phase. Finally, correlation between similar data elements is established on this shared embedding space in the third phase. Researchers have investigated various techniques to learn a shared embedding space that aligns matching data elements from different modalities. For example, Rasiwasia et al. [30] and Sharma et al. [31] leveraged canonical correlation analysis and bilinear modeling to learn a common subspace that maximizes the correlation between similar data elements. Wu et al. [32] and Carvalho et al. [33] followed a metric learning approach to learn a common representation and a similarity threshold across modalities. Cao et al. [34] and Lin et al. [35] employed hashing-based techniques to learn common representations on a Hamming space. Radford et al. [13] was one of the early works to train a Transformer-based vision-language model on semantically similar {image, text} pairs collected from the internet. Alayrac et al. [36] extended their work to show state-of-the-art results for vision-language tasks in zero-shot and few-shot settings. Girdhar et al. [37] established the efficacy of learning a shared embedding space that encodes multiple modalities simultaneously for cross-modal retrieval tasks.

Figure 3: An overview of our neural network architecture. In this example, the network maps a visually rich movie poster (A) to tuples in a relational table (B) containing supplementary information about the movie.

These methods, however, do not address some of the key challenges of our task. First, existing works do not address one of the core issues of aligning a visually rich document against an enterprise-scale database, which is scalability. Most existing methods utilize handcrafted rules designed using prior knowledge about the document type, or the underlying schema for this task. This is hard to scale for a large-scale corpus of heterogeneous documents. Second, it has been established [11, 12] that large-scale models that have been trained on huge amounts of textual and/or multimodal data tend to hallucinate on emergent topics that are not sufficiently represented in their training corpus. Fine-tuning these models on a custom domain requires significant effort and computational resources. Third, existing works that learn a shared embedding space for cross-modal entity matching/retrieval tasks leverage pixel-level image encodings. Contemporary researchers [14, 1] have shown the superiority of leveraging layout-based hierarchy to learn a fixed-length representation for visually rich documents in recent years. We address this challenge by developing a data model that represents heterogeneous VRDs and relational tuples in a principled way. We discuss it next.

III Our Data Model

We represent a relational database with schema $\mathcal{S}$ as a set of tuples $\{t_1, t_2, \ldots, t_T\}$. Each tuple $t_i$ is represented as a nested set $\{a_{i,1}, a_{i,2}, \ldots, a_{i,n}\}$ where $a_{i,j}$ is a sequence of tokens denoting the value of the $j$th attribute of schema $\mathcal{S}$ in tuple $t_i$.

Similarly, we represent a VRD as a nested set $\{V, H\}$, where $V$ denotes the set of atomic elements and $H$ denotes their visual organization. We define them as follows.

III-A Atomic Elements

An atomic element represents the smallest visual element in a document. Each visual span in a document comprises one or more atomic elements. We classify each atomic element into one of two major categories: text elements and image elements.

A.1. Text element: A text element is a visual element with a text-attribute. We deem each word in a document as a text element in our data model. We represent each text element $a_{text}$ as a nested set: $a_{text} = \{\textit{text-data}, x, y, w, h\}$, where $\textit{text-data}$ represents the transcription of the span covered by this text element. $h$ & $w$ denote the height and width of the smallest bounding-box enclosing this text span, and $x, y$ represent the coordinates of its top-left corner. We identify text elements in a document using Tesseract [38], a popular open-source OCR engine.

A.2. Image element: An image element, on the other hand, denotes an image-attribute in the document. We represent an image element $e_{img}$ as a nested set: $e_{img} = \{\textit{pixel-data}, x, y, w, h\}$, where $\textit{pixel-data}$ represents the pixel-values in the smallest bounding box enclosing $e_{img}$.
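
As a rough illustration of this data model, the nested sets above could be written as plain record types. This is a sketch only: the class and field names (`TextElement`, `ImageElement`, `RelationalTuple`, `text_data`, `pixel_data`, `attributes`) and the sample coordinates are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextElement:
    """a_text = {text-data, x, y, w, h}: one word and its bounding box."""
    text_data: str  # transcription of the word
    x: float        # top-left corner, horizontal coordinate
    y: float        # top-left corner, vertical coordinate
    w: float        # bounding-box width
    h: float        # bounding-box height


@dataclass(frozen=True)
class ImageElement:
    """e_img = {pixel-data, x, y, w, h}: pixel values and their bounding box."""
    pixel_data: tuple  # rows of pixel values inside the bounding box
    x: float
    y: float
    w: float
    h: float


@dataclass(frozen=True)
class RelationalTuple:
    """t_i = {a_i1, ..., a_in}: attributes[j] is the token sequence a_ij."""
    attributes: tuple


# Illustrative values only; coordinates and pixels are made up.
word = TextElement("Eastwood", x=120, y=40, w=95, h=18)
logo = ImageElement(pixel_data=((0, 255), (255, 0)), x=10, y=10, w=2, h=2)
row = RelationalTuple(attributes=(("Coogan's", "Bluff"), ("Clint", "Eastwood")))
```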

III-B Visual Organization

We represent the visual organization of a document using a tree-like structure $H$. Each node in $H$ represents a text span. We define $H$ using five levels of layout hierarchy following the hOCR specification format [39]. In this format, a document is deemed to be made up of several columns, a column is made up of several paragraphs, a paragraph is composed of several text-lines, and a text-line consists of multiple words. A node in $H$ is a child of another node if its text span is enclosed by the text span represented by its parent node. We use an open-source page segmentation algorithm [38] to construct the layout-tree $H$ in our experiments.
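
The containment rule that defines the layout-tree can be sketched as follows. The `LayoutNode` class, its methods, and the level labels are hypothetical names; the labels follow the wording above rather than the class names used in hOCR files, and the bounding boxes are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class LayoutNode:
    level: str  # "document", "column", "paragraph", "text-line" or "word"
    x: float    # top-left corner of the bounding box
    y: float
    w: float
    h: float
    text: str = ""
    children: list = field(default_factory=list)

    def encloses(self, other: "LayoutNode") -> bool:
        """True if `other`'s bounding box lies inside this node's box."""
        return (
            self.x <= other.x
            and self.y <= other.y
            and other.x + other.w <= self.x + self.w
            and other.y + other.h <= self.y + self.h
        )

    def add_child(self, child: "LayoutNode") -> "LayoutNode":
        """Attach `child` only if its span is enclosed by this node's span."""
        if not self.encloses(child):
            raise ValueError("child span is not enclosed by its parent")
        self.children.append(child)
        return child


def leaf_words(node: LayoutNode) -> list:
    """Collect the words stored at the leaves, in insertion order."""
    if not node.children:
        return [node.text] if node.level == "word" else []
    found = []
    for child in node.children:
        found.extend(leaf_words(child))
    return found


doc = LayoutNode("document", 0, 0, 100, 100)
col = doc.add_child(LayoutNode("column", 0, 0, 50, 100))
par = col.add_child(LayoutNode("paragraph", 5, 5, 40, 20))
line = par.add_child(LayoutNode("text-line", 5, 5, 40, 10))
line.add_child(LayoutNode("word", 5, 5, 18, 10, "Coogan's"))
line.add_child(LayoutNode("word", 25, 5, 15, 10, "Bluff"))

words = leaf_words(doc)
```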

IV Methodology

At the core of our framework is a multimodal neural network that maps a text span from a visually rich document with semantically similar tuples from a relational database without any prior knowledge about the document type or the underlying schema. It works in two phases.

IV-A Phase I: Encoding inputs into fixed-length vectors

A.1. Encoding text spans: The first layer of our network, called the representation layer, computes a distributed representation of a text span in the document. It represents each word $w$, which appears as a leaf node in the document layout-tree, as a vector $\mathcal{F}_w$ of dimensions $4 \times 768$. The first three rows in $\mathcal{F}_w$ represent the embedding vectors encoding $w$, its immediate bi-gram, and tri-gram respectively. We use the publicly available LayoutLMv2 [4] model to compute these vectors. More specifically, we average the output from the last two layers of a LayoutLMv2$_{\text{BASE}}$ model pretrained on the IIT-CDIP dataset [40] to compute these vectors. The last row of $\mathcal{F}_w$ represents an embedding vector computed by the last fully-connected layer of a pretrained MobileNet model [41] from a rendered image of the document (the MobileNet architecture has been shown [14] to be effective in encoding discriminative properties of VRDs). While the first three rows in $\mathcal{F}_w$ encode the local context information of the word $w$, the last row encodes document-level visual cues such as layout, orientation, and formatting of the document.
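
The shape of $\mathcal{F}_w$ can be made concrete with a small sketch. It only shows how the four rows are assembled. The function `placeholder_embedding` is a hypothetical stand-in that returns a deterministic pseudo-random vector where the real pipeline would call the pretrained text and image encoders, and taking the bi-gram and tri-gram as the n-grams that start at $w$ is an assumption, since the text does not fix their direction.

```python
import random

DIM = 768  # row width stated in the text


def placeholder_embedding(key: str, dim: int = DIM) -> list:
    """Stand-in for a pretrained encoder: a deterministic pseudo-random vector."""
    rng = random.Random(key)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def encode_word(words: list, i: int, page_id: str) -> list:
    """Assemble the 4 x 768 representation F_w of the word at position i."""
    unigram = words[i]
    bigram = " ".join(words[i:i + 2])   # assumption: n-grams start at w
    trigram = " ".join(words[i:i + 3])
    return [
        placeholder_embedding("text:" + unigram),   # row 1: the word itself
        placeholder_embedding("text:" + bigram),    # row 2: its bi-gram
        placeholder_embedding("text:" + trigram),   # row 3: its tri-gram
        placeholder_embedding("image:" + page_id),  # row 4: document-level visual cues
    ]


words = ["Clint", "Eastwood", "stars"]
F_w = encode_word(words, 0, "poster-1")
print(len(F_w), len(F_w[0]))
```

Because the last row depends only on the rendered document, every word on the same page shares it, while the first three rows vary with the local context.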

A.2. Encoding relational tuples: The representation layer also computes a two-dimensional vector $\mathcal{F}_t$ ($n \times 768$) to represent a relational tuple $t$. The $i$th row in $\mathcal{F}_t$ denotes the embedding vector of the $i$th attribute in tuple $t$. Following prior works [23], we represent each attribute in $t$ as a sequence of tokens and utilize a pretrained RoBERTa$_{\text{BASE}}$ model to compute an embedding vector for each attribute. We impute missing values in a tuple with a special token [UNK] from the model’s vocabulary. For multi-valued attributes, we linearize the attribute values as a long sequence first, and then compute an embedding vector of that sequence. The main reason behind using pretrained models in the representation layer of our network is to leverage the power of transfer learning [9] and minimize the amount of human-labeled samples needed to learn a fixed-length encoding of a data element. Transformer-based models [8] such as RoBERTa and LayoutLMv2 are good candidates for this task as they are capable of embedding out-of-vocabulary words with a fixed vocabulary size due to their subword encoding capabilities [42]. Other models with such capabilities can also be used instead as they have a transitive effect on the subsequent layers of our network. We establish this flexibility provided by the plug-and-play nature of our architecture through experiments in Section V.
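
The preprocessing applied to a tuple before it reaches the encoder can be sketched as below. The helper names `linearize_attribute` and `linearize_tuple`, the choice of a single space as the separator for multi-valued attributes, and the sample row (including the placeholder co-star name) are assumptions made for illustration; each returned string would then be embedded to form one row of $\mathcal{F}_t$.

```python
UNK = "[UNK]"  # special token named in the text for missing values


def linearize_attribute(value) -> str:
    """Turn one attribute value into the string handed to the encoder."""
    if value is None or value == "" or value == [] or value == ():
        return UNK  # impute a missing value
    if isinstance(value, (list, tuple)):
        # Multi-valued attribute: flatten into one long sequence.
        return " ".join(str(v) for v in value)
    return str(value)


def linearize_tuple(row: dict) -> list:
    """One string per attribute, in schema order; row i feeds F_t[i]."""
    return [linearize_attribute(v) for v in row.values()]


row = {
    "Title": "Coogan's Bluff",
    "Actor": ["Clint Eastwood", "Example Co-Star"],  # multi-valued, illustrative
    "Runtime": None,                                  # missing value
}
sequences = linearize_tuple(row)
print(sequences)
```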

IV-B Phase II: Aligning text spans and relational tuples

The second layer of our network, called the alignment layer, computes pairwise similarities between text spans in the document and relational tuples in the database. It comprises two fully-connected layers, each with a dimension of $768 \times 768$. The alignment layer takes the vectors $\mathcal{F}_w$ and $\mathcal{F}_t$ computed by the representation layer as input, and projects them onto a shared embedding space. Let $\mathcal{F}'_w$ & $\mathcal{F}'_t$ represent these projected vectors. We compute the distance between a text span $w$ in the document and a relational tuple $t$ on this shared space as follows.

$$\mathbf{L}_{align}(w,t) = \arg\min_{i,j} \left| \mathcal{F}'_w[j] - \mathcal{F}'_t[i] \right| \qquad (1)$$

To identify a matching tuple $t^*$ for the text span $w$, we minimize the distance $\mathbf{L}_{align}(w,t)$ between $w$ and all tuples $t$ in the relational database.

$$t^* = \arg\min_{t,i,j} \left| \mathcal{F}'_w[j] - \mathcal{F}'_t[i] \right|, \; \forall t \qquad (2)$$

Unfortunately, this results in a linear scan over the entire database for each text span in the document. This is a major computational bottleneck in terms of scaling our framework up for enterprise-scale databases and verbose documents. This can be mitigated by pruning off unlikely matches from both data sources before any pairwise comparison takes place. Contrary to existing methods that require carefully designed rules from domain-experts for this task, we perform this pruning operation in an end-to-end trainable fashion, with significantly less human-effort. The key enabler in this is the attention layer in our network.
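
The cost of this unpruned search is easy to see in a brute-force sketch of Eqs. 1 and 2. Two assumptions are made here: the norm $|\cdot|$ is taken to be the Euclidean distance, which the text does not specify, and Eq. 1 is read as returning the minimum distance value, since the text uses it as a distance. The function names and the two-dimensional toy vectors are hypothetical.

```python
import math


def span_tuple_distance(F_w: list, F_t: list) -> float:
    """Eq. 1 (read as a distance): smallest gap between any row j of F'_w
    and any row i of F'_t."""
    return min(math.dist(row_w, row_t) for row_w in F_w for row_t in F_t)


def best_match(F_w: list, table: list) -> tuple:
    """Eq. 2: linear scan over every tuple; returns (index of t*, distance)."""
    distances = [span_tuple_distance(F_w, F_t) for F_t in table]
    t_star = min(range(len(table)), key=distances.__getitem__)
    return t_star, distances[t_star]


# Toy projected vectors in 2 dimensions instead of 768.
F_w = [[0.0, 0.0], [1.0, 1.0]]
table = [
    [[5.0, 5.0], [4.0, 4.0]],  # tuple 0
    [[1.0, 2.0], [9.0, 9.0]],  # tuple 1
]
t_star, distance = best_match(F_w, table)
print(t_star, distance)
```

Every call to `best_match` touches all rows of all tuples, so the work grows with the number of text spans times the size of the database, which is the bottleneck the attention layer is meant to remove.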

B.1. Bi-directional attention for faster alignment: Let’s assume that the $i$th attribute of tuple $t$ has the minimum distance from a text span $w$ in Eq. 2. The attention layer indexes this information for each document in the training corpus using two vector-stores $V_D$ and $V_T$. $V_D$ indexes the embedding vector representing the text span $w$ (i.e. $\mathcal{F}_w$) and $V_T$ indexes the embedding vector for the $i$th attribute in $t$ (i.e. $\mathcal{F}_t[i]$). Both vector-stores use the $i$th attribute in $t$ as the indexing attribute. Once the indexes have been constructed, we cluster the embedding vectors against each indexing attribute using the DBSCAN algorithm. We then update both vector stores so that they keep only the cluster centroids for each indexing attribute. This helps us impose an upper bound on the computational cost of this pruning step, making our inference cost more manageable. Once the vector stores have been updated, we are ready to prune off unlikely matches from our search space. We achieve this by computing two vectors – one over the input document using the vector store $V_D$, and another one over the relational database using the vector store $V_T$. We describe how these vectors are computed next.
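
The compression of a vector store to cluster centroids can be sketched as follows. This is an illustration under several assumptions: the small `dbscan` function is a simplified pure-Python version written for this example (a library implementation such as scikit-learn's `DBSCAN` would normally be used), the values of `eps` and `min_pts` are arbitrary, a centroid is taken to be the mean of a cluster's members, and noise points are dropped. None of these details are specified in the text.

```python
import math
from collections import defaultdict


def dbscan(points: list, eps: float, min_pts: int) -> list:
    """Simplified DBSCAN: returns one label per point, -1 marks noise."""
    labels = [None] * len(points)

    def neighbours(i: int) -> list:
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


def centroids(points: list, labels: list) -> list:
    """Mean vector of each cluster; noise points (label -1) are dropped."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        if label != -1:
            groups[label].append(point)
    return [
        [sum(column) / len(members) for column in zip(*members)]
        for _, members in sorted(groups.items())
    ]


def build_store(indexed: dict, eps: float = 0.5, min_pts: int = 2) -> dict:
    """{indexing attribute: [embedding, ...]} -> {attribute: [centroid, ...]}."""
    return {
        attribute: centroids(vectors, dbscan(vectors, eps, min_pts))
        for attribute, vectors in indexed.items()
    }


# Toy 2-dimensional embeddings grouped by indexing attribute.
indexed = {
    "Title": [[0.0, 0.0], [0.2, 0.0], [0.1, 0.1], [5.0, 5.0]],
    "Actor": [[1.0, 1.0], [1.0, 1.2]],
}
store = build_store(indexed)
print(store)
```

After this step the number of vectors compared at inference time depends on the number of clusters per indexing attribute rather than on the size of the training corpus.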

Refer to caption
Figure 4: Visualization of bi-directional attention computed over a text span in a movie poster and a relational tuple containing metadata about the movie. Darker shades denote higher attention scores assigned by our network, i.e. a higher likelihood of finding a match, whereas lighter shades denote lower attention scores, i.e. a lower likelihood of finding a match. In this example, we observe higher probabilities of finding a matching tuple for the text spans “Clint Eastwood” and “Coogan’s Bluff” against the attributes Actor, Director and Title of a relational tuple in the database.

B.2. Pruning off unlikely matches: Let’s assume that the minimum distance between a text span and a cluster centroid indexed in $V_D$ is $c$. We can compute this distance for all text spans appearing in the document to obtain a vector $\vec{C} = \{c_1, c_2, \ldots\}$. The probability of finding a matching tuple for a text span appearing in the document can therefore be defined as follows.

$$\vec{A}_D = 1 - \frac{\vec{C} - \min(\vec{C})}{\max(\vec{C}) - \min(\vec{C})} \quad (3)$$

The $i^{th}$ term in $\vec{A}_D$ represents the probability of finding a matching tuple for the $i^{th}$ text span in the document. To discard those text spans that are unlikely to find a matching tuple in the database, we apply a filter on $\vec{A}_D$ using a mechanism called k-max weighted attention [43]. It sorts the probability terms in $\vec{A}_D$ in descending order, retains the top-k terms (k=25) and sets the rest to zero. The non-zero probability terms correspond to those text spans that are retained after the pruning stage is complete. Similarly, minimizing the distance between the embedding vector of the text span $w$ projected onto the shared embedding space (i.e. $\mathcal{F}'_w$) and the cluster centroids indexed in $V_T$ returns a vector $\vec{A}_T$ over the relational tuples. The $i^{th}$ term in $\vec{A}_T$ represents the maximum probability of a text span being matched against the $i^{th}$ attribute of a tuple in our database. To discard unlikely matches, we apply k-max weighted attention on $\vec{A}_T$, which retains the top-k tuples (k=100) in $\vec{A}_T$ and sets the rest of the probability terms to zero. The non-zero terms in $\vec{A}_T$ correspond to those tuples that are retained after the pruning stage is complete. Fig. 4 shows a visualization of the non-zero probability terms computed over a document and a relational tuple from one of our experimental datasets. Contrary to existing works that leverage carefully designed rules, we follow a principled, unsupervised learning technique to guide our pruning operation. Experiments show that the bi-directional attention scheme employed in our framework reduces the number of pairwise comparisons, which in turn reduces the end-to-end latency of our framework (see Section V.5).
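To make the pruning step concrete, the following sketch computes the scores of Eq. 3 from a list of minimum distances and then applies a top-k filter. The function names and the handling of the degenerate case where all distances are equal are our own assumptions for illustration; only the normalization and the "keep the top-k, zero the rest" behaviour come from the description above.

```python
def attention_scores(distances):
    """Eq. 3: min-max normalise the distances and flip them, so that the
    closest text span (or tuple) receives a score of 1 and the farthest 0."""
    lo, hi = min(distances), max(distances)
    if hi == lo:
        # Assumption: if every distance is identical, treat all as equally likely.
        return [1.0] * len(distances)
    return [1.0 - (c - lo) / (hi - lo) for c in distances]


def k_max_filter(scores, k):
    """Keep the k largest scores in place and set every other score to zero."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [s if i in keep else 0.0 for i, s in enumerate(scores)]


# Toy example: four text spans with their minimum distances to a centroid in V_D.
toy_distances = [0.2, 1.0, 0.6, 0.4]
toy_scores = attention_scores(toy_distances)
toy_pruned = k_max_filter(toy_scores, k=2)
```

With `k=2`, only the first and the last text span of the toy example keep a non-zero score; the other two are excluded from pairwise comparison. The same two functions can be applied to distances computed against $V_T$ to obtain the filtered scores over relational tuples.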

B.3. Aggregation: We only consider those text spans (and relational tuples) that remain after the pruning step for pairwise comparisons (Eq. 1). Once we have identified a matching tuple for each text span in the input document, we can aggregate them from text span-level to document-level. (Document-level aggregation is needed in many real-world applications where the number of matching tuples that can be returned for a document has a strict upper bound.) The aggregation layer executes this operation by performing majority voting amongst all matching tuples identified for all text spans in the document.
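Majority voting over span-level matches can be sketched as follows. The function name, the use of `None` for pruned spans, and the tie-breaking behaviour (the tuple encountered first wins) are assumptions made for this illustration; the text does not specify how ties are resolved.

```python
from collections import Counter


def aggregate_matches(span_matches):
    """Return the tuple id that most text spans voted for.

    span_matches holds one matched tuple id per text span; None marks a span
    that was pruned or had no match. Ties go to the id seen first.
    """
    votes = Counter(tid for tid in span_matches if tid is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy example: three spans vote for tuple "t1", one for "t2", one span was pruned.
toy_document_match = aggregate_matches(["t1", "t2", "t1", None, "t1"])
```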

IV-C Training the Network

We train our network using a learning objective similar to triplet loss [44]. Our goal is to minimize the distance between a text span $w$ and its matching tuple $t$ on their shared embedding space, and maximize the distance between $w$ and a non-matching tuple $t'$ from the database. We formalize this learning objective as follows.

$$\mathbf{L}_{align}(w,t) = \operatorname{argmin}_{i,j} \left| \mathcal{F}'_w[j] - \mathcal{F}'_t[i] \right| - \lambda \cdot \operatorname{argmin}_{i,j} \left| \mathcal{F}'_w[j] - \mathcal{F}'_{t'}[i] \right| \quad (4)$$

The first term in Eq. 4 represents the minimum distance between a text span $w$ and its matching tuple $t$. The second term represents the minimum distance between $w$ and a non-matching tuple $t'$. $\lambda$ is a hyperparameter; we set its value to 0.025. We obtain matching tuples for a text span from human experts during training corpus construction. We train our network on a Tesla P100 GPU for 20 epochs using stochastic gradient descent with a batch size of 4. Each sample in our training corpus consists of a triplet: a text span, its matching tuple, and a non-matching tuple sampled from the database. We use early stopping to prevent overfitting and the Adam optimizer with a learning rate of $1 \times 10^{-4}$, weight decay of $1 \times 10^{-2}$ and $(\beta_1, \beta_2) = (0.9, 0.999)$ to train our network.
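The value of the objective in Eq. 4 for a single triplet can be illustrated numerically. The sketch below assumes that the projected embeddings are available as lists of vectors (one per index $j$ for the text span, one per attribute $i$ for a tuple) and that $|\cdot|$ denotes Euclidean distance; both are assumptions for illustration, as are the function names. It evaluates the loss only and does not perform any gradient update.

```python
import math


def min_pairwise_distance(span_vectors, tuple_vectors):
    """Smallest Euclidean distance over all (span vector, attribute vector) pairs."""
    return min(
        math.dist(f_w, f_t) for f_w in span_vectors for f_t in tuple_vectors
    )


def alignment_loss(span_vectors, matching_tuple, non_matching_tuple, lam=0.025):
    """Eq. 4 for one triplet: distance to the matching tuple minus
    lam times the distance to the non-matching tuple."""
    positive = min_pairwise_distance(span_vectors, matching_tuple)
    negative = min_pairwise_distance(span_vectors, non_matching_tuple)
    return positive - lam * negative


# Toy triplet with 2-d embeddings (values are made up for illustration).
toy_span = [(0.0, 0.0)]
toy_match = [(3.0, 4.0), (1.0, 0.0)]       # closest attribute is at distance 1
toy_non_match = [(0.0, 10.0), (6.0, 8.0)]  # closest attribute is at distance 10
toy_loss = alignment_loss(toy_span, toy_match, toy_non_match)
```

For the toy triplet the loss is $1 - 0.025 \cdot 10 = 0.75$. Lowering it requires pulling the matching tuple closer to the text span or pushing the non-matching tuple farther away.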

IV-D End-to-end workflow

Once our network is trained, we can identify matching tuples for a text span in two steps. In the first step, we filter out those tuples that are unlikely to match by computing their likelihood of finding a match against any text span in the document using the output from the attention layer of our network. We only keep the top-k most likely tuples from this step. The rest are discarded. In the second step, the remaining tuples are compared against the text span using Eq. 2, and the pairwise distance between each tuple and the text span is minimized. To find a matching tuple for an entire document, we follow the same steps as mentioned above, identify matching tuples for each word in the document, and then aggregate those results using the aggregation layer of our network.
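The two inference steps can be summarized in a short sketch. The data layout (a dictionary from tuple id to a list of attribute embeddings, plus a dictionary of attention scores), the function name and the use of Euclidean distance are hypothetical choices made for illustration only.

```python
import math


def match_span(span_vector, tuple_embeddings, tuple_scores, k):
    """Two-step matching for one text span.

    tuple_embeddings: tuple id -> list of attribute embedding vectors.
    tuple_scores:     tuple id -> attention score (higher = more likely match).
    """
    # Step 1: keep only the k tuples with the highest attention scores.
    kept = sorted(tuple_embeddings, key=lambda tid: tuple_scores[tid], reverse=True)[:k]
    # Step 2: among the survivors, return the tuple whose closest attribute
    # embedding has the smallest distance to the text span.
    return min(
        kept,
        key=lambda tid: min(math.dist(span_vector, a) for a in tuple_embeddings[tid]),
    )


# Toy database with three tuples and made-up scores.
toy_embeddings = {"a": [(0.0, 0.0)], "b": [(1.0, 1.0)], "c": [(0.1, 0.1)]}
toy_tuple_scores = {"a": 0.1, "b": 0.9, "c": 0.8}
toy_match_k2 = match_span((0.0, 0.0), toy_embeddings, toy_tuple_scores, k=2)
toy_match_k3 = match_span((0.0, 0.0), toy_embeddings, toy_tuple_scores, k=3)
```

In this toy example the result depends on `k`: with `k=2` the tuple `"a"` is discarded in the first step because of its low attention score, so the nearest surviving tuple `"c"` is returned, whereas with `k=3` no tuple is pruned and `"a"` is returned. The quality of the attention scores therefore bounds the quality of the final match.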

V Experiments

We seek to answer four key questions through our experiments: (a) how do we perform on entity matching tasks against heterogeneous documents? (b) how do we compare against state-of-the-art baselines on each of these tasks? (c) what are the individual contributions of some of the key components in our framework? and (d) is our framework computationally robust? We answer the first question by reporting the F1-score of two entity matching tasks on separate publicly available datasets (see Section V.B.1). To answer the second question, we compare our performance against a number of state-of-the-art baselines in Section V.B.4. We answer the third question through an ablation study in Section V.B.5. Finally, we study the computational robustness of our framework by adapting it to a number of resource-constrained environments and then reporting its performance on our experimental datasets in Section V.B.6. We conduct all of our experiments on a system with 25GB RAM and a Tesla P100 GPU.

V-A Experiment Design

A.1. Datasets: We evaluate our framework on two real-world datasets. Each dataset contains documents with diverse layouts, formats, and textual content. We describe both of them below.

  (i) IMDB Movie Dataset. This dataset consists of approximately 8.4K movie posters from the IMDB website and a relational table collected from the IMDB movie database (https://datasets.imdbws.com/) containing an equal number of tuples. The table contains 12 unique attributes capturing various movie metadata, such as ‘Title’, ‘Directors’, ‘Actors’ and more. The posters are stored as image files. The database is stored as a single JSON file.

  (ii) NYC Open Event Dataset. This dataset consists of approximately 7.9K event publicity images curated by New York City Parks and Recreations Department. It also contains a relational table with 92.5K tuples from the NYC Open data website (https://opendata.cityofnewyork.us/). The table contains 24 unique attributes capturing various important event information, such as ‘Address’, ‘Date’, ‘Description’ and more. The event posters are stored in image format and the database is stored as a JSON file.

Refer to caption
Figure 5: Sample documents from IMDB Movie Dataset (upper row) and NYC Open Event Dataset (bottom row)

For both datasets, our objective is to map a document to its matching tuple in the database. We construct gold-standard labels for both tasks by annotating these matching pairs. Each pair consists of a text span in the document and its corresponding matching tuple in our database. We recruited 3 graduate students for this task and resolved all inter-annotator disagreements by consensus. We split both datasets into training, validation and test sets. Recall that a training sample in our experimental setup consists of a triplet: a text span, its matching tuple, and a non-matching tuple randomly sampled from our database. The training and validation sets consist of 2000 and 500 such triplets respectively from each dataset. The rest of the samples comprise the test set. We have made both datasets available at: https://github.com/anonsig2020/cem_vrd
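The triplet construction described above can be sketched as follows. The function name `make_triplets`, the fixed random seed and the toy identifiers are illustrative assumptions; the text only states that the non-matching tuple is sampled randomly from the database.

```python
import random


def make_triplets(gold_pairs, tuple_ids, seed=0):
    """Turn (text_span, matching_tuple_id) pairs into training triplets
    (text_span, matching_tuple_id, non_matching_tuple_id)."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    triplets = []
    for span, positive in gold_pairs:
        candidates = [tid for tid in tuple_ids if tid != positive]
        if not candidates:
            raise ValueError("need at least one non-matching tuple to sample from")
        triplets.append((span, positive, rng.choice(candidates)))
    return triplets


# Toy gold-standard pairs and tuple ids (made up for illustration).
toy_pairs = [("Clint Eastwood", "t1"), ("Coogan's Bluff", "t1"), ("Central Park", "t7")]
toy_tuple_ids = ["t1", "t2", "t7", "t9"]
toy_triplets = make_triplets(toy_pairs, toy_tuple_ids)
```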

A.2. Evaluation metrics: We compare a tuple inferred as a match by our framework for each document in our test corpus against its groundtruth label. We deem a match to be accurate if it has identified the same tuple as the groundtruth. We report the average precision and F1-score @$k = [1, 5, 20]$ for both datasets in Table I.

A.3. Baselines: We compare the downstream performance of our framework against a number of state-of-the-art baselines using the same experimental setup as described above. They are as follows.

  (i) Fuzzy string matching (M1). In this unsupervised baseline method, we compare each attribute of a tuple in our database with n-grams in a visually rich document using fuzzy-string matching (https://pypi.org/project/fuzzywuzzy/). This pairwise comparison results in a similarity score for every relational tuple in the database. We identify the matching tuple by selecting the tuple with the highest similarity score for the document.

  (ii) Text embeddings (M2). Following [23], in this unsupervised baseline method we encode each word in the document as well as each attribute in a relational tuple using a pretrained RoBERTa-BASE model. We identify a matching tuple in two steps. First, we identify a matching tuple for each word using a k-nearest-neighbor search over the entire database. Second, we perform majority voting amongst all tuples retrieved from the first step to identify the tuple at the document-level.

  (iii) Document IE (M3). In this supervised baseline, we identify a matching tuple by performing a fuzzy-join operation. We extract a structured record from each document by employing the state-of-the-art LayoutLMv2-BASE model [4] that is first pretrained on the IIT-CDIP dataset [40] and then fine-tuned on our training corpus. To identify a matching tuple, we perform a fuzzy outer-join between the record and the relational database.

  (iv) Graph-based embeddings (M4). In this baseline method, we use EMBDI [45], a state-of-the-art, graph-based approach, to compute fixed-length representations of a relational tuple. We encode each word in the document using a pretrained RoBERTa-BASE model. We identify the matching tuple for each document in two steps. First, we identify a matching tuple for each word by solving an orthogonal Procrustes problem following Cappuzzo et al. [45]. Second, we perform majority voting amongst all tuples retrieved from the first step to identify the matching tuple at the document-level.

  (v) Hashing-based approach (M5). Following Lin et al. [35], we construct a neural network that learns 8-bit binary hash codes to represent each word in a document. We represent relational tuples as a sequence of tokens and compute 8-bit hash codes for them in a similar fashion. To identify the matching tuple for a word, we minimize the Hamming distance between two binary vectors, one representing the word itself and another one representing the tuple. We identify the matching tuple at document-level by performing majority voting amongst all tuples identified during word-level matching.

  (vi) Large language model (M6). In this baseline, we employ the publicly available, pretrained GPT-Neo [46] model to identify matching tuples from a relational database. We transcribe each document and feed the extracted text as an input to our model. For each tuple in our database, we prompt the model to output the following: if a tuple is semantically similar to the document, reply with a ‘yes’ or ‘no’. If the answer to the previous question is ‘yes’, assign a number between 1 and 10 that reflects the confidence in the accuracy of this answer. We use this setup to identify the matching tuple at document-level by sorting all the tuples identified from the previous step by their confidence scores and then returning the top-1 tuple.

  (vii) Vision-language model (M7). Following Radford et al. [13], in this baseline method we use a ResNet50-based image encoder model and a Transformer-based text encoder model simultaneously trained on {image, text} pairs collected via web crawling. We compute a fixed-length representation of each document from its rendered image using this image encoder model, and similarly for each relational tuple using the text encoder model. We identify the matching tuple for a document by minimizing the distance between these two vectors within our database.
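As an illustration of the simplest baseline above (M1), the sketch below scores each tuple by fuzzy-matching its attribute values against n-grams of the document text. To stay dependency-free it uses `difflib.SequenceMatcher` from the Python standard library in place of the fuzzywuzzy package; summing the best per-attribute similarity into a tuple-level score, the n-gram length limit and the toy data are our own assumptions.

```python
from difflib import SequenceMatcher


def ngrams(tokens, n_max=3):
    """All word n-grams of length 1..n_max, joined back into strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tuple_score(doc_text, relational_tuple):
    """Sum, over attributes, of the best similarity between the attribute
    value and any n-gram of the document (assumed aggregation scheme)."""
    grams = ngrams(doc_text.lower().split())
    return sum(
        max(
            (SequenceMatcher(None, g, str(value).lower()).ratio() for g in grams),
            default=0.0,
        )
        for value in relational_tuple.values()
    )


def best_tuple(doc_text, table):
    """Return the tuple with the highest similarity score for the document."""
    return max(table, key=lambda t: tuple_score(doc_text, t))


# Toy table with two tuples that share an actor; only the title disambiguates.
toy_table = [
    {"Title": "Coogan's Bluff", "Actor": "Clint Eastwood"},
    {"Title": "Dirty Harry", "Actor": "Clint Eastwood"},
]
toy_best = best_tuple("Clint Eastwood in Coogan's Bluff", toy_table)
```

The toy table also hints at the weakness discussed later for this baseline: when several tuples share attribute values with the document, string similarity alone offers little to disambiguate them.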

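TABLE I: End-to-end performance on experimental datasets

| Dataset | k | Precision (%) | F1 (%) |
|---|---|---|---|
| IMDB Movie Dataset | 1 | 75.05 | 85.74 |
| IMDB Movie Dataset | 5 | 77.25 | 87.16 |
| IMDB Movie Dataset | 20 | 87.90 | 93.56 |
| NYC Open Event Dataset | 1 | 58.80 | 74.05 |
| NYC Open Event Dataset | 5 | 60.10 | 75.07 |
| NYC Open Event Dataset | 20 | 68.06 | 80.99 |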

V-B Experimental Results

B.1. End-to-end performance: We present the precision and F1-score @$k = [1, 5, 20]$ of our framework on both datasets in Table I. On the IMDB Movie Dataset, we obtain a top-1 precision of 75.05% and an F1-score of 85.74%, whereas for the NYC Open Event Dataset we obtain a top-1 precision of 58.80% and an F1-score of 74.05%. We obtain a perfect recall on both datasets as our framework returns a non-empty set of tuples for each document. Taking a closer look at these results reveals that we perform comparatively better on documents that are relatively verbose. The average turnaround latency of our framework is 0.8 secs/document for the IMDB Movie Dataset, and 3.1 secs/document for the NYC Open Event Dataset. We observe relatively higher turnaround latency for the NYC Open Event Dataset because the average number of candidate pairs participating in pairwise comparisons (see Section IV.B) is relatively higher for this dataset.
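Because recall is reported as perfect, the F1-scores in Table I follow from precision alone: with recall $R = 1$, $F1 = 2P / (1 + P)$. The helper below is a small sketch of this relationship (the function name is ours); for instance, the precision of 87.90% reported for the IMDB Movie Dataset at $k = 20$ corresponds to an F1-score of about 93.56%.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision at k=20 on the IMDB Movie Dataset (Table I), with perfect recall.
imdb_f1_at_20 = f1_score(0.8790, 1.0)
```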

B.2. On the diversity of training samples: We investigate the role played by the diversity of samples in our training corpus by varying the number of samples used to train our model. The rest of our experimental settings are kept unchanged. Results show (see Fig. 6) that increasing the number of samples improves the average F1 score for both datasets. However, the improvement in performance starts to plateau as the number of training samples exceeds a threshold.

Refer to caption
Figure 6: F1-score @$k = 1$ with $n = \{500, 1000, 1500, 2000\}$ samples used to train our network

B.3. Label-efficiency: We investigate the label-efficiency of our framework by training a model that has an identical architecture to ours but with one key difference. Recall that our network employs pretrained models (see Section IV.A) in its representation layer to leverage transfer learning [9] and train in a label-efficient way. In this experiment, instead of these pretrained models, we use randomly initialized, trainable weights of the same dimensions to encode each input sequence. We train this model using direct supervision with the same learning objective on the same training corpus. If its performance on the validation set is not comparable with that of our original model, we increase the size of its training corpus by introducing 25 additional samples. We keep increasing the number of samples in the training corpus in this way until this baseline model has obtained an F1 score comparable with that of our original network.
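The incremental procedure described above can be sketched as a simple loop. Here `train_and_eval` is a hypothetical callback that trains the baseline on `n` samples and returns its validation F1-score; the function name, the `max_samples` guard and the toy learning curve are assumptions made for illustration and do not reflect measured results.

```python
def samples_needed(train_and_eval, target_f1, start, step=25, max_samples=100_000):
    """Grow the training corpus by `step` samples until the baseline reaches
    the target F1-score; return the corpus size, or None if never reached."""
    n = start
    while n <= max_samples:
        if train_and_eval(n) >= target_f1:
            return n
        n += step
    return None


def toy_learning_curve(n):
    """Made-up validation F1 (in whole percent) as a function of corpus size."""
    return min(100, n // 40)


toy_n_base = samples_needed(toy_learning_curve, target_f1=80, start=2000)
```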

TABLE II: Number of human-labeled samples needed for comparable downstream performance in a supervised setting

| Dataset | No. of training samples | Saved (%) |
|---|---|---|
| IMDB Movie Dataset | 2000 | 57.05 |
| NYC Open Event Dataset | 2000 | 60.77 |

Let the number of human-labeled samples needed to train the baseline model for comparable performance be $N_{\texttt{base}}$, and let the number of human-labeled samples needed to train our original network be $N$. We define the label-efficiency of our network as $\frac{N_{\texttt{base}} - N}{N}$. We report the number of human-labeled samples used to train our original network in the first column of Table II. The second column in this table denotes the label-efficiency of our network against the supervised baseline on both datasets. Results show that our framework requires up to 60% fewer human-labeled samples than a fully supervised baseline to obtain comparable performance.
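The label-efficiency ratio defined above is straightforward to compute. The sketch below uses made-up sample counts purely for illustration; the function name is ours.

```python
def label_efficiency(n_base, n):
    """(N_base - N) / N, where N_base is the number of labeled samples the
    baseline needs and N the number our network needs."""
    if n <= 0:
        raise ValueError("n must be positive")
    return (n_base - n) / n


# Hypothetical counts: the baseline needs 3000 samples, our network 2000.
toy_efficiency = label_efficiency(3000, 2000)
```

With these hypothetical counts the ratio is 0.5, i.e. the baseline needs 50% more labeled samples than the network with pretrained encoders.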

B.4. Comparison against baselines: We compare our downstream performance against a number of baseline methods in Table III. The best performing models on both datasets are shown in boldface. We outperform the fuzzy-match based baseline (M1) by more than 37 F1 points on the IMDB Movie Dataset and 27 F1 points on the NYC Open Event Dataset. A closer look at these results reveals that this string-matching based approach does not fare well on documents that have multiple potential matches in the database, where the matching tuple can only be disambiguated from contextual encodings. We observe a similar trend against the text embedding-based baseline (M2). Results show that we outperform this baseline by more than 12 F1 points on the IMDB Movie Dataset and 8 F1 points on the NYC Open Event Dataset. This establishes the superiority of the multimodal encoding capability of our representation layer compared to off-the-shelf text-based embedding techniques.

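TABLE III: End-to-end performance of all competing methods on our experimental datasets

| Dataset | Method | Precision (%) | F1 (%) |
|---|---|---|---|
| IMDB Movie Dataset | Fuzzy matching (M1) | 33.33 | 48.72 |
| IMDB Movie Dataset | Text embeddings (M2) | 62.50 | 76.92 |
| IMDB Movie Dataset | Document IE (M3) | 72.33 | 82.32 |
| IMDB Movie Dataset | Graph-based (M4) | 66.50 | 79.87 |
| IMDB Movie Dataset | Hashing-based (M5) | 64.33 | 75.18 |
| IMDB Movie Dataset | Large Language Model (M6) | 73.18 | 82.47 |
| IMDB Movie Dataset | Vision-Language Model (M7) | 72.90 | 82.66 |
| IMDB Movie Dataset | Our method | 75.05 | 85.74 |
| NYC Open Event Dataset | Fuzzy matching (M1) | 45.05 | 56.47 |
| NYC Open Event Dataset | Text embeddings (M2) | 45.95 | 62.54 |
| NYC Open Event Dataset | Document IE (M3) | 57.75 | 59.17 |
| NYC Open Event Dataset | Graph-based (M4) | 50.33 | 66.96 |
| NYC Open Event Dataset | Hashing-based (M5) | 48.60 | 63.25 |
| NYC Open Event Dataset | Large Language Model (M6) | 56.06 | 71.47 |
| NYC Open Event Dataset | Vision-Language Model (M7) | 55.80 | 72.25 |
| NYC Open Event Dataset | Our method | 58.80 | 74.05 |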

Compared to the Document IE-based baseline (M3), we observe an improvement of 2.72 F1 points for the IMDB Movie Dataset and 1.05 F1 points for the NYC Open Event Dataset. Training a document understanding model (e.g. LayoutLMv2) that extracts structured records from a visually rich document, however, requires additional human-labeling effort at the token-level to construct its training corpus. This is cumbersome and often hard to scale. We also outperform the graph-based baseline method (M4) on both datasets. Our results show that the semantic coherence of this graph-based encoding technique does not translate well to cross-modal entity matching tasks. We observe a similar improvement in performance against the hashing-based baseline (M5) as well. Although the turnaround latency of this baseline is lower than ours, we outperform it by more than 10 F1 points on both datasets. Finally, comparing the downstream performance of our framework against a pretrained GPT-Neo model (M6) reveals that we outperform this baseline by 1.87 F1 points on the IMDB Movie Dataset and 2.74 F1 points on the NYC Open Event Dataset. Taking a closer look at these results reveals that this model tends to hallucinate on many documents in both datasets. It also requires additional steps to disambiguate among multiple potential matches for textually sparse documents. We also outperform the vision-language model-based baseline (M7) on both datasets. We hypothesize that this is because of the relatively weaker representation capabilities of this baseline, stemming from its usage of pixel-level abstraction to represent each document using a vision encoder model. This establishes the importance of encoding both visual and textual features of a document for this entity matching task.

TABLE IV: Results from the ablation study on IMDB Movie Dataset

| Ablated component | Removed or replaced? | ΔF1 (%) ↓ |
|---|---|---|
| Attention layer | Removed | 2.24 |
| Visual features | Removed | 4.90 |
| Representation layer | LayoutLMv2 replaced with RoBERTa | 1.06 |
| Representation layer | MobileNet replaced with CLIP | -3.55 |

B.5. Ablation study: To measure individual contributions of some of the key components in our framework, we perform an ablation study. In each of these ablative baselines, we remove or replace a key component in our framework and observe its effect on our downstream performance. We present our findings in Table IV. The first column in this table specifies the component in our network that is being removed (or replaced), and the last column denotes the degradation in F1-score due to this change with respect to our original model. In the first ablative baseline, we remove the attention layer from our network. This results in a 2.24% decrease in F1-score and a $\approx 4\times$ increase in turnaround latency, thus establishing the contribution of the bi-directional attention scheme employed by our network for fast alignment between a text span and a relational tuple. In the second baseline, we remove the visual features from the representation layer of our network. We observe that this results in a 4.90% decrease in F1-score. This establishes the necessity of encoding both visual and textual features to represent a text span in a visually rich document. In our third ablative baseline, we replace the pretrained LayoutLMv2-BASE model, which has been used to encode the textual features of a document in the representation layer of our network, with a RoBERTa-BASE model pretrained on the Google Book corpus. We observe a drop of 1.06 F1 points in performance. This establishes the flexibility offered by our representation layer to plug-and-play stronger foundational models within our network in the future. In our final ablative baseline, we replace the pretrained MobileNet model with the same vision-language model [13] employed by the vision-language baseline (M7) to encode the visual features of the document. Results show that introducing an image encoder model that has been jointly trained on image and text pairs can benefit the representation capability of our model, improving its downstream performance (a negative ΔF1 in Table IV denotes an improvement). Investigating optimal ways to introduce even larger models within our network is one of our planned future works.

B.6. Computational robustness: One of the major challenges of deploying a deep entity matching framework for real-world applications is the amount of computational resources needed to infer a match. For instance, the cost of computing a fixed-length representation of a data element using a Transformer-based model with $d$ layers is $O(n^2 \cdot d + n \cdot d^2 + n \cdot d)$ for an input sequence length of $n$. This quadratic cost makes it difficult to deploy these frameworks in resource-constrained environments without incurring any degradation in downstream performance. In Juno, the computational bottleneck stems from various pretrained models employed in the representation layer of our network (see Section IV.A). Recent advances in controllable model compression [47, 48] make it possible to mitigate this to some extent. They allow us to train a deep neural network only once, and then adapt it at run-time by setting certain model hyperparameters, such as input sequence length and number of layers, based on computational constraints (e.g. working memory size, computational capability, number of cores etc.) imposed by the environment. We use publicly available algorithms officially released by their respective authors to adapt our network for various resource-constrained environments. We point interested readers to the original works by Cai et al. [47] and Kim et al. [48] for more background on this.
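To illustrate why sequence length and depth are natural knobs for run-time adaptation, the sketch below evaluates the cost expression above, $n^2 \cdot d + n \cdot d^2 + n \cdot d$, for a few settings. It counts abstract cost units from the asymptotic expression rather than measured FLOPs, and the function name and the example values of $n$ and $d$ are assumptions chosen for illustration.

```python
def encoding_cost(n, d):
    """Value of n^2*d + n*d^2 + n*d for sequence length n and d layers.
    This is the expression inside the O(.) bound, not a measured FLOP count."""
    return n * n * d + n * d * d + n * d


full_cost = encoding_cost(512, 12)
half_length_cost = encoding_cost(256, 12)  # halve the input sequence length
half_depth_cost = encoding_cost(512, 6)    # halve the number of layers

# How many times cheaper each reduced configuration is than the full one.
length_saving = full_cost / half_length_cost
depth_saving = full_cost / half_depth_cost
```

For these example values, halving the sequence length reduces the cost expression by a factor of roughly 3.9, while halving the number of layers reduces it by a factor of roughly 2, because the $n^2$ term dominates when $n$ is much larger than $d$.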

TABLE V: End-to-end performance of our framework on various resource-constrained environments

| Environment | Δ Model Footprint (%) ↓ | Δ GFlops (%) ↓ | F1 (%) |
|---|---|---|---|
| A12 | 7.77 | 49.25 | 86.66 |
| A13 | 5.89 | 41.90 | 86.95 |
| A14 | 24.36 | 68.75 | 87.0 |
| A15 | 20.74 | 56.33 | 86.95 |
| Original | - | - | 85.74 |

TABLE VI: Environmental constraints of various resource-constrained environments used in our experiments

| Environment | Cores | Threads | Cache | Memory Capacity |
|---|---|---|---|---|
| A12 | 6 | 6 | 8MB | 4GB |
| A13 | 6 | 6 | 8MB | 4GB |
| A14 | 6 | 6 | 8MB | 6GB |
| A15 | 6 | 6 | 12MB | 6GB |

In this section, we simulate such resource-constrained environments, adapt our framework to their computational constraints using [48], and report our downstream performance on the IMDB Movie Dataset (see Table V). Each row in this table represents a computational environment that simulates a recently released iPhone. The left-most column in this table denotes the specific iPhone processor this computational environment simulates. We describe the computational constraints simulated in each of these environments in Table VI. The second and third columns in Table V represent the reduction in memory footprint and computational overhead of the resulting model compared to our original network. The final column in this table represents the downstream performance of the resulting model adapted for that environment. Our results show up to 24% reduction in memory footprint and 68% reduction in computational overhead of the original model without any degradation in downstream performance. This establishes the computational robustness of our framework across various resource-constrained environments, making it a suitable candidate for on-device deployment in edge-devices. Investigating optimal ways to adapt our framework for interactive applications in edge-devices is one of our planned future works.

VI Conclusion

Visually rich documents are great sources of ad-hoc information. The information they contain, however, is often incomplete. This makes it difficult to contextualize the information retrieved from these documents and gather actionable insights from them. We develop Juno – a generalizable framework to address this limitation by augmenting each document with supplementary information from a relational database. To identify matching tuples, we develop a multimodal neural network that maps text spans in the document to tuples in the database. Harnessing the power of pretrained models through transfer leaning, Juno executes this entity matching task with significantly less human-labeled samples. It ensures fast map** against large-scale databases by leveraging a novel bi-directional attention mechanism that allows it to prune unlikely matches from the search space and reduce the number of pairwise comparisons. To the best of our knowledge, this is the first work that investigates the incompleteness of VRDs and proposes a generalizable, performant and computationally robust framework to address it in an end-to-end way. Contrary to existing works, Juno does not utilize any handcrafted rules, or prior knowledge about the document type and/or the underlying schema to achieve its goal. Experiments on two heterogeneous datasets for separate entity matching tasks show that it is not only more performant than state-of-the-art baseline methods – outperforming them by more than 6% in F1-score, but also more scalable – reducing the number of human-labeled samples needed to train a supervised baseline with the same backbone that achieves comparable performance by up to 60%. Our experiments also show that Juno is computationally robust. Contrary to existing vision-language models, we can use off-the-shelf algorithms to adapt it for resource-constrained environments, reducing its memory footprint by up to 24% without any performance degradation on real-world datasets. 
Investigating an optimal way to adapt deep entity matching frameworks for interactive applications on edge devices [16] is part of our planned future work.

References

  • [1] B. P. Majumder, N. Potti et al., “Representation learning for information extraction from form-like documents,” in The ACL, 2020, pp. 6495–6504.
  • [2] R. Sarkhel and A. Nandi, “Visual segmentation for information extraction from heterogeneous visually rich documents,” in The SIGMOD.   ACM, 2019, pp. 247–262.
  • [3] ——, “Improving information extraction from visually rich documents using visual span representations,” The VLDB, vol. 14, no. 5, 2021.
  • [4] Y. Xu, Y. Xu et al., “Layoutlmv2: Multi-modal pre-training for visually-rich document understanding,” arXiv preprint arXiv:2012.14740, 2020.
  • [5] G. Papadakis, D. Skoutas, E. Thanos, and T. Palpanas, “Blocking and filtering techniques for entity resolution: A survey,” ACM Computing Surveys (CSUR), vol. 53, no. 2, pp. 1–42, 2020.
  • [6] V. Christophides, V. Efthymiou, T. Palpanas, G. Papadakis, and K. Stefanidis, “An overview of end-to-end entity resolution for big data,” ACM Computing Surveys (CSUR), vol. 53, no. 6, pp. 1–42, 2020.
  • [7] K. Wang, Q. Yin et al., “A comprehensive survey on cross-modal retrieval,” arXiv preprint arXiv:1607.06215, 2016.
  • [8] A. Vaswani, N. Shazeer et al., “Attention is all you need,” in NeurIPS, 2017.
  • [9] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE TKDE, vol. 22, no. 10, pp. 1345–1359, 2009.
  • [10] L. Floridi and M. Chiriatti, “Gpt-3: Its nature, scope, limits, and consequences,” Minds and Machines, vol. 30, pp. 681–694, 2020.
  • [11] R. Azamfirei, S. R. Kudchadkar, and J. Fackler, “Large language models and the perils of their hallucinations,” Critical Care, vol. 27, no. 1, pp. 1–2, 2023.
  • [12] H. Ye, T. Liu, A. Zhang, W. Hua, and W. Jia, “Cognitive mirage: A review of hallucinations in large language models,” arXiv preprint arXiv:2309.06794, 2023.
  • [13] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning.   PMLR, 2021, pp. 8748–8763.
  • [14] R. Sarkhel and A. Nandi, “Deterministic routing between layout abstractions for multi-scale classification of visually rich documents,” in The IJCAI.   AAAI Press, 2019, pp. 3360–3366.
  • [15] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” Advances in neural information processing systems, vol. 28, 2015.
  • [16] C. Burley and R. Sarkhel, “Quill: A declarative approach for accelerating augmented reality application development,” A Quarterly bulletin of the Computer Society of the IEEE Technical Committee on Data Engineering, vol. 45, no. 3, 2022.
  • [17] E. Rahm and P. A. Bernstein, “A survey of approaches to automatic schema matching,” the VLDB Journal, vol. 10, no. 4, pp. 334–350, 2001.
  • [18] R. Fagin, L. M. Haas et al., “Clio: Schema mapping creation and data exchange,” in Conceptual modeling: foundations and applications. Springer, 2009, pp. 198–236.
  • [19] E. Smith, D. Papadopoulos et al., “Lillie: Information extraction and database integration using linguistics and learning-based algorithms,” Information Systems, vol. 105, p. 101938, 2022.
  • [20] M. J. Cafarella, A. Halevy et al., “Data integration for the relational web,” The VLDB, vol. 2, no. 1, pp. 1090–1101, 2009.
  • [21] M. Ebraheem, S. Thirumuruganathan et al., “Distributed representations of tuples for entity resolution,” The VLDB, vol. 11, no. 11, pp. 1454–1467, 2018.
  • [22] I. Sutskever, Training recurrent neural networks.   University of Toronto Toronto, ON, Canada, 2013.
  • [23] H. Nie, X. Han et al., “Deep sequence-to-sequence entity matching for heterogeneous entity resolution,” The CIKM, 2019.
  • [24] S. Wu and U. Manber, “Fast text searching: allowing errors,” Communications of the ACM, vol. 35, no. 10, pp. 83–91, 1992.
  • [25] Y. Li, J. Li et al., “Deep entity matching with pre-trained language models,” The VLDB Endowment, vol. 14, no. 1, pp. 50–60, 2020.
  • [26] C. Zhao and Y. He, “Auto-em: End-to-end fuzzy entity-matching using pre-trained deep models and transfer learning,” in The World Wide Web Conference, 2019, pp. 2413–2424.
  • [27] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” Advances in neural information processing systems, vol. 27, 2014.
  • [28] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
  • [29] S. Mudgal, H. Li et al., “Deep learning for entity matching: A design space exploration,” in The SIGMOD, 2018, pp. 19–34.
  • [30] N. Rasiwasia, J. Costa Pereira et al., “A new approach to cross-modal multimedia retrieval,” in The ACM MM, 2010, pp. 251–260.
  • [31] A. Sharma, A. Kumar et al., “Generalized multiview analysis: A discriminative latent space,” in IEEE CVPR.   IEEE, 2012, pp. 2160–2167.
  • [32] W. Wu, J. Xu et al., “Learning similarity function between objects in heterogeneous spaces,” Microsoft Research Technique Report, 2010.
  • [33] M. Carvalho, R. Cadène et al., “Cross-modal retrieval in the cooking context: Learning semantic text-image embeddings,” in The SIGIR, 2018, pp. 35–44.
  • [34] Y. Cao, M. Long et al., “Correlation hashing network for efficient cross-modal retrieval,” arXiv preprint arXiv:1602.06697, 2016.
  • [35] K. Lin, H.-F. Yang et al., “Deep learning of binary hash codes for fast image retrieval,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2015, pp. 27–35.
  • [36] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 23 716–23 736, 2022.
  • [37] R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V. Alwala, A. Joulin, and I. Misra, “Imagebind: One embedding space to bind them all,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15 180–15 190.
  • [38] R. Smith, “An overview of the tesseract ocr engine,” in Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on, vol. 2.   IEEE, 2007, pp. 629–633.
  • [39] T. M. Breuel, “The hocr microformat for ocr workflow and results,” in The ICDAR, vol. 2.   IEEE, 2007, pp. 1063–1067.
  • [40] A. W. Harley, A. Ufkes, and K. G. Derpanis, “Evaluation of deep convolutional nets for document image classification and retrieval,” in 2015 13th International Conference on Document Analysis and Recognition (ICDAR).   IEEE, 2015, pp. 991–995.
  • [41] A. G. Howard, M. Zhu et al., “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
  • [42] R. Sennrich, B. Haddow et al., “Neural machine translation of rare words with subword units,” in The ACL, 2016, pp. 1715–1725.
  • [43] W. Bian, S. Li et al., “A compare-aggregate model with dynamic-clip attention for answer selection,” in The CIKM, 2017, pp. 1987–1990.
  • [44] R. C. Fernandez and S. Madden, “Termite: a system for tunneling through heterogeneous data,” in Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, 2019, pp. 1–8.
  • [45] R. Cappuzzo, P. Papotti et al., “Creating embeddings of heterogeneous relational datasets for data integration tasks,” in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020, pp. 1335–1349.
  • [46] E. AI. (2021) The gpt-neo 1.3b model. https://github.com/EleutherAI/gpt-neo. Accessed: 2023-04-05.
  • [47] H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han, “Once-for-all: Train one network and specialize it for efficient deployment,” in International Conference on Learning Representations, 2019.
  • [48] G. Kim and K. Cho, “Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search,” Jun. 2021, arXiv:2010.07003 [cs]. [Online]. Available: http://arxiv.org/abs/2010.07003