SiMa: Effective and Efficient Matching Across Data Silos Using Graph Neural Networks

Christos Koutras Rihan Hai Kyriakos Psarakis Marios Fragkoulis Asterios Katsifodimos Delft University of TechnologyP.O. Box 121243017-6221 [email protected]

(2024)

Abstract.

How can we leverage existing column relationships within silos, to predict similar ones across silos? Can we do this efficiently and effectively? Existing matching approaches do not exploit prior knowledge, relying on prohibitively expensive similarity computations. In this paper we present the first technique for matching columns across data silos, called SiMa, which leverages Graph Neural Networks (GNNs) to learn from existing column relationships within data silos, and dataset-specific profiles. The main novelty of SiMa is its ability to be trained incrementally on column relationships within each silo individually, without requiring the consolidation of all datasets in a single place. Our experiments show that SiMa is more effective than the – otherwise inapplicable to the setting of silos – state-of-the-art matching methods, while requiring orders of magnitude less computational resources. Moreover, we demonstrate that SiMa considerably outperforms other state-of-the-art column representation learning methods.

^†^†copyright: acmcopyright^†^†journalyear: 2024^†^†doi: -^†^†conference: EDBT ’24: International Conference on Extending Database Technology; 2024; ^†^†isbn: 978-1-4503-XXXX-X/18/06

1. Introduction

Given a large set of datasets spread across different data silos (Mansour et al., 2022), as well as example column relationships within those silos, how can we detect pairs of dataset columns, that are joinable or unionable across silos? Can we do this efficiently and effectively?

Refer to caption — Figure 1. Three typical data silos in the banking industry.

Organizations nowadays accumulate large numbers of heterogeneous datasets in data lakes, with the goal of gaining insights by combining those datasets. The structure (e.g., departments, teams, locations) of organizations, but also the sheer scale of their data lakes, force organizations to establish barriers for their data assets, leading to the phenomenon of data silos: disjoint and isolated collections of datasets, belonging to different stakeholders. Interestingly, data silos may even exist within the same organization, as individual teams enforce their own conventions and formats, as well as encapsulate knowledge about their data assets. Silo-ing data impedes collaboration and information sharing among different groups of interest.

Running Example. Consider an organization in the banking industry as depicted in Figure 1. Employees of the banking silo already know the relationships between their datasets (black dotted lines), i.e. columns from tables inside the silo that are semantically related (storing values that refer to the same semantic type). However, the possible relationships between the banking silo and the other two silos (green lines) are missing, i.e. columns of the same semantic type, residing in different silos. Data scientists building ML models can benefit from dataset augmentation in terms of extra data points (by finding other unionable datasets) and/or extra features (by finding other joinable datasets) from other data silos (Chepurko et al., 2020).

Column Matches Within Silos. To enable collaboration across departments and teams, organizations build and maintain dataset metadata catalogs (Fernandez et al., 2018a; Halevy et al., 2016): a graph structure that encapsulates relationships among datasets. Typically, within a given silo, one can enrich a metadata catalog with PK-FK relationships using schema information and automated data profiling techniques (Abedjan et al., 2015) as well as joinability/unionability relationships, using matching techniques(Koutras et al., 2021). Moreover, such relationships can be derived from domain experts, query logs (Nandi and Bernstein, 2009; Bharadwaj et al., 2021), and even data science notebooks (Psallidas et al., 2022). However, discovering relationships among columns across data silos is very challenging (Mansour et al., 2021).

Existing solutions. In the data management research, the problem of finding relationships among datasets has been investigated in three different contexts (more details in Section 3): i) schema matching, with a multitude of automated methods (Rahm and Bernstein, 2001; Zhang et al., 2011; Lehmberg and Bizer, 2017; Fernandez et al., 2018b; Koutras et al., 2020; Cappuzzo et al., 2020); $ii)$ related-dataset search (Fernandez et al., 2018a; Nargesian et al., 2018; Zhu et al., 2019; Bogatu et al., 2020; Zhang and Ives, 2020; Fan et al., 2022; Khatiwada et al., 2022; Bogatu et al., 2022), and $iii)$ column-type detection (Hulsebos et al., 2019; Zhang et al., [n. d.]). In short, traditional schema matching methods are a) computationally and resource expensive; b) they cannot always be employed in the setting of data silos as they require co-locating all datasets to calculate similarities; c) they do not leverage existing knowledge within silos. Related-dataset search methods are not applicable to the matching problem as their goal is to search top-k related datasets to a given dataset, sacrificing recall for precision. To tune such methods for discovering more column matches (increase the recall), we would need to set k to a large value, which could dramatically affect the quality of the results (high false positive rates, thus lower precision). Finally, column-type detection requires knowing the types of all columns in advance, alongside massive training data.

SiMa: an efficient & effective silo matcher¹¹1An early version of this work has been presented in a non-archival workshop (Authors, 2022).. In this paper we propose SiMa, a novel approach to the problem of discovering relationships between tabular columns across data silos (Figure 2). SiMa is based on the observation that within silos we can find existing matches among columns and train a ML model that learns to predict column relationships across silos: i) equi-joinable, ii) fuzzily-joinable, iii) unionable columns of the same domain.

SiMa leverages the representational power of Graph Neural Networks (GNNs). However, employing GNNs for the purposes of matching across data silos is far from straightforward, as we need to: i) transform tabular data to information-preserving graphs, ii) initialize nodes with suitable features, iii) introduce non-trivial negative-sampling techniques and training schemes to optimize the learning process. SiMa provides with effective and efficient solutions to each of these problems, proceeding as shown in Figure 2.

In short this paper makes the following contributions:

•

We define the problem of matching across data silos (§ 4).
•

We propose SiMa, a generic and inductive GNN-based learning framework, which discovers relatedness across different data silos. To the best of our knowledge, our work is the first to generalize local matches within a silo, to links across silos.
•

We show how to represent data silos, and the knowledge about matches among datasets inside those silos as graphs, turning the problem matching across data silos into a link prediction task (§ 5).
•

We propose two optimization techniques, negative edge sampling and incremental model training, which improve the training efficiency and effectiveness of our GNN for the purposes of matching across silos (§ 7).
•

With experiments (§ 8) over real-world data from several domains and open datasets, SiMa demonstrates significant effectiveness gains with orders of magnitude run-time performance savings (up to 600x) compared to traditional (and inapplicable) schema matching methods and column representation techniques.

The datasets, ground truth and code of this work, are available at https://github.com/delftdata/SiMa.

2. Approach Overview

Five aspects comprise SiMa’s approach: a relatedness graph, data profiles, a learning method, a prediction method, and an optimization process.

– Relatedness Graph (Section 5.2). As shown in Figure 2b, SiMa transforms each silo’s set of columns and the respective matches among them (Figure 2a) into nodes and corresponding edges, thus creating as many graphs as data silos.

– Data Profiles (Section 5.3). For each column, SiMa builds a profile of 987 features (Figure 2b), such as the number of numerical values among the instances or character-level aggregates (Abedjan et al., 2015; Hulsebos et al., 2019). These column profiles facilitate the training process of a Graph Neural Network (GNN).

– Learning from profiles and graph information (Section 6). Each data profile is used as a feature vector of each node in the relatedness graph (Figure 2b). Using GNNs, SiMa takes into consideration the profiles and the graph edges in order to learn how to incorporate the graph’s neighborhood information together with the features of each node.

– Predicting matches across silos (Section 6.1). Finally, as depicted in Figure 2e, SiMa uses the learned graph embeddings from the GNNs to capture similarity among columns, and discover matches across data silos. We do this by fine tuning a link prediction model, enabling SiMa to decide whether there could be a match between a pair of columns or not.

– Optimizing the GNN learning process (Section 7). SiMa applies sophisticated negative edge sampling techniques on the graphs (Figure 2c) to fine tune the prediction ability of the GNNs by leveraging the knowledge inside each data silo (Section 7.1). Moreover, as shown in Figure 2d, with incremental training (Section 7.2) SiMa not only improves its match prediction ability, but also allows silos to train a GNN individually, without having to consolidate all data profiles in one place.

3. Related Work

The closest work to this paper is traditional schema matching methods that were not designed for the problem of matching across silos (Section 3.1), as well as dataset discovery & semantic type detection that do not address our problem directly (Section 3.2).

3.1. Schema Matching

A natural choice to bridge data silos would be to employ schema matching (Rahm and Bernstein, 2001; Gal, 2011), namely a set of methods responsible for finding matches among elements of disparate datasets based on various similarity criteria (e.g. Jaccard similarity). Schema matching is a well-studied research topic, with various methods mainly focusing on finding matches between pairs of tables (Do and Rahm, 2002; Madhavan et al., 2001; Zhang et al., 2011; Chen et al., 2018; Shraga et al., 2020; Cappuzzo et al., 2020). However, existing matching methods assume global access to all datasets so that they can compute similarities between pairs of columns. Across data silos, this is usually impossible, since the stakeholders are not willing to share data with each other (Miller, 2018).

Statistics-based methods. One can base a matching method’s similarity calculations on data statistics (Do and Rahm, 2002; Zhang et al., 2011): first compute statistics of columns within a silo, and carry those statistics over to other silos for similarity calculations. However, it is often the case that characteristics of values across silos can differ substantially, even for the same semantic types (e.g., names of people in different countries) leading to false negative matches.

Embedding-based methods. Embedding-based methods could be applied in matching. However, despite their employment on embedding cell-values, and consequently table columns (Fernandez et al., 2018b; Nargesian et al., 2018), pre-trained models have been shown to not work well on domain-specific datasets (Koutras et al., 2021). On the other hand, locally-trained embedding methods (Fernandez and Madden, 2019; Koutras et al., 2020; Cappuzzo et al., 2020) leverage the architecture of skip-gram models (Mikolov et al., 2013; Bojanowski et al., 2017) to train on corpora consisting of tabular data; yet, these still seem to be insufficiently effective when used for matching related columns (Koutras et al., 2021).

Scalability Issues. Most importantly, applying schema matching solutions (Rahm and Bernstein, 2001; Zhang et al., 2011; Lehmberg and Bizer, 2017; Fernandez et al., 2018b; Koutras et al., 2020; Cappuzzo et al., 2020) requires, in the worst case, computation of similarities between all pairs of columns. As the number of columns $n$ increases beyond the thousands – a small number considering the size of data lakes and commercial databases – computing O( $n^{2}$ ) similarities is impractical.

3.2. Dataset Discovery & Semantics Types

Related-dataset search. Related-dataset search methods (Fernandez et al., 2018a; Cafarella et al., 2009; Das Sarma et al., 2012; Nargesian et al., 2018; Zhu et al., 2019; Bogatu et al., 2020; Zhang and Ives, 2020; Fan et al., 2022; Khatiwada et al., 2022; Bogatu et al., 2022) rely on the syntactic-, distribution- or even embedding-similarity of data instances within dataset columns. In order to scale, related dataset search methods make use of LSH (Fernandez et al., 2018a; Nargesian et al., 2018; Bogatu et al., 2020) or inverted (Zhu et al., 2019) indexes. However, their application to the matching problem is not straightforward: dataset search methods return the top-k related datasets to a dataset given as query. In the case of matching across data silos, we are not concerned with capturing the top-k related datasets, but matches among columns across silos. Therefore, to use a related-dataset search method for capturing all possible column matches, one would need to set k to a high value, in order to expand the range of the results. Yet, this could have a very negative impact on precision, as large k values could severely increase false positive rates.

Column-type detection. Solving the problem of matching across data silos as a column-type detection problem (Hulsebos et al., 2019; Zhang et al., [n. d.]), assumes knowledge of the exact set of semantic types that exist across the data silos, and requires massive training data that are tailored to those types. None of these assumptions hold true in the context of data silos. Despite their proven effectiveness on column classification tasks, these methods are not applicable on silos, since the semantic types and their number are unknown when trying to find links among datasets from different silos.

4. Problem Definition

In this work, we are interested in the problem of capturing relevance among tabular datasets that belong to different silos; we focus on tabular data since they constitute the main form of useful, structured datasets in silos and include web tables, spreadsheets, CSV files and database relations. To prepare our problem setting, we start with the following definitions.

Definition 0 (Data silos).

Consider a set of data silos $S=\{S_{1},S_{2},\dots,S_{n}\}$ . Each data silo $S_{i}\ (i\in[1,n])$ consists of a set of tables. Assuming that the number of columns in $S_{i}$ is $d$ , we denote a column from data silo $S_{i}$ as $c^{i}_{l}\ (l\in[1,d])$ .

Definition 0 (Intra-relatedness and Inter-relatedness).

If two columns $c^{i}_{k},c^{i}_{l}$ are from the same data silo $S_{i}$ ( $k\neq l$ ), and represent the same semantic type, we refer to their relationship as intra-related; if two columns $c^{i}_{l}$ and $c^{j}_{t}$ ( $i\neq j$ ) are located in different data silos, and represent the same semantic type, we refer to their relationship as inter-related.

Intra- and inter-related columns refer to three different notions of matches: i) columns that share exact value overlaps and draw values from the same domain, i.e., they are equi-joinable, ii) columns that share semantically equivalent values of different formats and belong to the same domain, i.e., they are fuzzily-joinable, and iii) columns that do not share any kind of value overlaps but store instances from the same domain, i.e., they are unionable.

Given a set of data silos $\mathcal{S}$ , we refer to the set of all columns in $\mathcal{S}$ as $\mathcal{C}$ . For example, in Figure 1 we have $\mathcal{S}=$ {Insurance, Banking, Public District Data }, and the total number of columns $|C|$ is 21. In this work, we assume that the intra-relatedness in each data silo is known, which is common in organizations as discussed in Section 1.

The Problem of Matching Across Data Silos. Consider a set of data silos $\mathcal{S}$ , and that the intra-relatedness relationships in each data silo $S_{i}\in\mathcal{S}$ are known. The problem of matching across data silos, is to capture the potential inter-relatedness relationships among the table columns belonging to different silos.

For instance, in Figure 2 we know that in the silo Insurance the columns Customer.Address and Insurance.Address are related. Now we aim to discover inter-relatedness between different silos, such as Insurance.Address and District.Name (in the silo Public District Data), which are from two data silos and remain unknown among their corresponding stakeholders. In Section 6 we will elaborate on how we transform the above problem to a link prediction problem.

5. GNNs for Matching Data Silos

Table 1. Essential notations used in the paper

Notations	Description
$\mathcal{G}$	A graph
$v$	A node in $\mathcal{G}$
$\mathcal{N}_{v}$	The set of neighborhood nodes of $v$
h ${}_{v}$	Features associated with $v$
$\mathbf{h}^{0}_{v}$	The initial feature vector of $v$
$k$	The layer index
$\textbf{h}^{k}_{v}$	The $k$ -th layer feature vector of $v$
$\textbf{h}^{k}_{\mathcal{N}_{v}}$	The $k$ -th layer feature vector of $\mathcal{N}_{v}$
W ${}^{k}$	The weight matrix to the $k$ -th layer
$\mathcal{S}$	The set of data silos
$i$	The data silo index
$S_{i}$	The $i$ -th data silo
$\mathcal{RG}$	The set of relatedness graphs of $\mathcal{S}$
$RG_{i}$	The relatedness graph of $S_{i}$
$\textbf{f}_{v}$	Initial feature vector of $v$ obtained via profiling
$PE_{i}$	The set of positive edges of $RG_{i}$
$NE_{i}$	The set of negative edges of $RG_{i}$

In this section, we present how SiMa utilizes intra-silo column relatedness knowledge and manages to leverage Graph Neural Networks (GNNs) to provide with inter-silo link suggestions. Towards this direction, we first give a preliminary introduction on GNNs in Section 5.1. Then in Section 5.2 we showcase how we model a set of data silos as graphs, and obtain the initial features via profiling in Section 5.3. We transform the problem of matching across data silo to a link prediction task, and describe how SiMa employs GNNs to solve the problem in Section 6. We explain SiMa’s algorithmic pipeline in Section 6.1. In Table 1 we summarize the notations frequently used in this paper.

5.1. Preliminary: GNNs

Recently, Graph Neural Networks (Wu et al., 2020) have gained a lot of popularity due to their straightforward applicability and impressive results in traditional graph problems such as node classification (Kipf and Welling, 2016; Hamilton et al., 2017), graph classification (Errica et al., 2019) and link prediction (Zhang and Chen, 2018; Ying et al., 2018; Fan et al., 2019). Intuitively, GNNs can learn a “recipe” to incorporate the neighborhood information and the features of each node in order to embed it into a vector.

In this work, we aim at finding a learning model that can perform well, not only on silos with known column relationships, but also on unseen columns in unseen data silos. This requires a generic, inductive learning framework. Based on the wealth of literature around GNNs, we opt for the seminal GNN model of GraphSAGE (Hamilton et al., 2017), which is one of the representative models generalizable to unseen data during the training process. More specifically, GraphSAGE incorporates the features associated with each node $v$ of a graph, denoted by h ${}_{v}$ , together with its neighborhood information $\mathcal{N}_{v}$ , in order to learn a function that is able to embed graph nodes into a vector space of given dimensions. The embedding function is trained through message passing among the nodes of the graph, in addition to an optimization objective that depends on the use case. Typically, GraphSAGE uses several layers for learning how to aggregate messages from each node’s neighborhood, where in the $k$ -th layer it proceeds as follows for a node $v$ :

(1)

\begin{gathered}\textbf{h}^{k}_{\mathcal{N}_{v}}=\text{AGGREGATE}_{k}(\{% \textbf{h}^{k-1}_{u},\forall u\in\mathcal{N}_{v}\})\\ \textbf{h}^{k}_{v}\leftarrow\sigma\big{(}\textbf{W}^{k}\cdot\text{CONCAT}(% \textbf{h}^{k-1}_{v},\textbf{h}^{k}_{\mathcal{N}_{v}})\big{)}\end{gathered}

Given a node $v$ , GraphSAGE first aggregates the representations of its neighborhood nodes from the previous layer $k$ -1, and obtains $\textbf{h}^{k}_{\mathcal{N}_{v}}$ . Then the concatenated (CONCAT) result of the current node representation $\textbf{h}^{k-1}_{v}$ and the neighborhood information $\textbf{h}^{k}_{\mathcal{N}_{v}}$ is combined with the $k$ -th layer weight matrices W ${}^{k}$ . After passing the activation function $\sigma(\cdot)$ , we obtain the feature vector of $v$ on the current layer $k$ , i.e., $\textbf{h}^{k}_{v}$ . Such a process starts from the initial feature vector of the node $v$ , i.e., $\mathbf{h}^{0}_{v}$ . By stacking several such layers GraphSAGE controls the depth from which this information arrives in the graph. For instance, $k=3$ indicates that a node $n$ will aggregate information until 3 hops away from $n$ .

5.2. Modeling Data Silos as Graphs

We see that applying a GNN model on a given graph is seamless and quite intuitive: nodes exchange messages with their neighborhood concerning information about their features, which is then aggregated to reach their final representation. Yet, for the GNN to function properly, the graph on which it is trained should reveal information that is correct, namely we should be sure about the edges connecting different nodes.

Based on this last observation and on the fact that data silos maintain information about relationships among their own datasets, we see that if we model each silo as a graph then this could enable the application of GNNs. In order to do so, for each data silo $S_{i}$ , as defined in Section 4, we construct a relatedness graph that represents the links among the various tabular datasets that reside in the corresponding data silo.

Definition 0 (Relatedness graph).

Given a data silo $S_{i}$ , its relatedness graph $RG_{i}=(V_{i},E_{i})$ is an undirected graph with nodes $V_{i}$ and edges $E_{i}$ . Each column $c^{i}_{l}$ of $S_{i}$ is represented as a node $v\in V_{i}$ . For each pair of columns $c^{i}_{l},c^{i}_{t}$ of $S_{i}$ that are intra-related, there is an edge $e\in E_{i}$ between their corresponding nodes in $RG_{i}$ .

For example, Figure 3 shows the corresponding relatedness graph of the data silo Insurance from Figure 1. Based on it, we see that a silo’s relatedness graph consists of several connected components, where each of them represents a different domain to which columns of the datasets that are stored in the data silo belong; thus, the neighborhood of each node in the graph includes only the nodes that are relevant to it in the silo. This is shown in Figure 3, where we see four different connected components, colored differently, which represent four different domains in the silo: addresses, names, tax ids and purchase info.

5.3. Profiles as Initial Features

Initialization requirement of GNNs. With SiMa we opt for applying the GraphSAGE model using the relatedness graphs of the corresponding data silos. For this to be possible two conditions should be satisfied about the relatedness graph: i) there should be a representative set of edges and ii) each node should come with an initial feature vector. SiMa’s relatedness graphs already satisfy the first condition, since every such graph includes edges denoting similar columns. Yet, nodes in the relatedness graphs are featureless. Moreover, in order to leverage GNNs and use them for matching across data silos, we need to employ them towards a specific goal. Therefore, in the following we discuss how to produce initial features for each column-node in a relatedness graph, and present a method of using a GNN model for bridging data silos by modeling our problem as a link prediction task.

Initial feature vectors from data profiles. In order to handle the feature initialization requirement, in SiMa we draw inspiration from the data profiling literature (Abedjan et al., 2015). In the case of tabular data, profiles summarize the information of a data element, by calculating a series of simple statistics (e.g. number of null values, aggregates etc.). Consequently, we can utilize a simple profiler in order to associate each column in a data silo to a feature vector, summarizing statistical information about it.

In specific, for each data silo, we feed all the including tables into the profiling component we adopt from (Hulsebos et al., 2019). However, since we need the initial profiles to summarize simple information for each column (so as not to depend on complex profiles), we exclude the features referring to pre-trained value and paragraph embeddings. In short, SiMa computes a feature vector for each column in a silo by collecting the following:

– Global statistics. Those include aggregates on high level characteristics of a column, e.g. number of numerical values among the included instances.

– Character-level distributions. For each of the 96 ASCII characters that might be present in the corresponding values of the column, we save charachter-level distributions. Specifically, the profiler counts the number of each such ASCII character in a column and then feeds it to aggregate functions, such as mean, median etc.

Using the above profiling scheme, we associate each node $v$ , belonging to a relatedness graph $RG_{i}$ , with a vector $\textbf{f}_{v}$ . This $\textbf{f}_{v}$ will serve as $\textbf{h}^{0}_{v}$ for initializing the feature vector of $v$ before starting the GraphSAGE training process, as shown in Figure 2b.

6. Training GNNs for Matching Silos

Matching across silos as link prediction. In order to leverage the capabilities of a GNN, there should be an objective function tailored to the goal of the problem that needs to be solved. With SiMa we want to be able to capture relatedness for every pair of columns belonging to different data silos, which translates to the following objective.

Problem 6.1 (Link prediction of relatedness graph).

Consider a set of relatedness graphs $\mathcal{RG}$ , the challenge of link prediction across relatedness graphs is to build a model $\mathcal{M}$ that predicts whether there should be an edge between nodes from different relatedness graphs. Given a pair of nodes $(u,v)$ from two different relatedness graphs $RG_{i},RG_{j}\in\mathcal{RG}$ ( $i\neq j$ ) where $u\in RG_{i}$ , $v\in RG_{j}$ , ideally

\displaystyle\begin{split}\mathcal{M}(u,v)=\begin{cases}1,&u\ and\ v\ are\ % linked\\ 0,&otherwise\end{cases}\end{split}

It is easy to see that we have now transformed our initial matching across data silos problem to a link prediction problem over the relatedness graphs.

Two types of edges for training. Towards this direction, we train a prediction function $\phi$ that receives as input the representations $\textbf{h}_{u}$ and $\textbf{h}_{v}$ , of the corresponding nodes $u$ and $v$ , from the last layer of the GraphSAGE neural network, and computes a similarity score $sim(u,v)=\phi(\textbf{h}_{u},\textbf{h}_{v})$ .

To train a robust GNN model, we need the following two types of edges in our relatedness graph.

Definition 0 (Positive edges and negative edges).

In a relatedness graph $RG_{i}=(V_{i},E_{i})$ , we refer to each edge $e\in E_{i}$ as a positive edge; if a ‘virtual’ edge $e$ connects two unrelated nodes $u$ and $v$ , we refer to it as a negative edge. Thus, we obtain the following two sets of edges.

Positive edges $PE_{i}=\{(u,v)|r(u,v)=1\wedge u,v\in RG_{i}\}$

Negative edges $NE_{i}=\{(u,v)|r(u,v)=0\wedge u,v\in RG_{i}\}$

To differentiate with negative edges $NE_{i}$ , in the sequel we refer to the edges of a relatedness graph $V_{i}$ as positive edges $PE_{i}$ . Notably, the training samples we get from our relatedness graphs contain only pairs of nodes for which a link should exist (i.e., positive edges $PE_{i}$ ). Thus, we need to provide the training process with a corresponding set of negative edges, which connect nodes-columns that are not related. To do so, for every relatedness graph $RG_{i}$ we construct a set of negative edges $NE_{i}$ , since we know that nodes belonging to different connected components in $RG_{i}$ represent pairs of unrelated columns in the corresponding data silo $S_{i}$ ; we elaborate on negative edge sampling strategies in Section 7.1.

Two-fold GNN model training. After constructing the set of negative edges, we initiate the training process with the goal of optimizing the following cross-entropy loss function:

(2)

\displaystyle\begin{split}\mathcal{L}&=-\sum_{(u,v)\in RG_{i}}{\log{\sigma(sim% (u,v))}}\\ &\quad-\sum_{(u,v)\in NE_{i}}{\big{[}1-\log(\sigma(sim(u,v)))\big{]}}\end{split}

where $\sigma(\cdot)$ is the sigmoid function and $1\leq i\leq n$ , with $n$ representing the number of relatedness graphs (constructed from the original data silos) included in training data. The similarity scores are computed by feeding pairs of node representations to a Multi-layer Perceptron (MLP), whose parameters are also learned during the training process in order to give correct predictions. Intuitively, with this model training we want to compute representations, so as to build a similarity function (through the training of the MLP), which based on them, correctly distinguishes semantically related from unrelated nodes-columns.

To summarize, SiMa uses a two-fold model, which consists of:

•

A GraphSAGE neural network that applies message passing and aggregation (Equation 1) in order to embed the nodes-columns of the relatedness graph into a vector space of given dimensions.
•

A MLP, with one hidden layer, which receives in its input a pair of node representations and based on them it calculates a similarity score in order to predict whether there should be a link or not between them, i.e., whether the corresponding columns are related.

Input : Set of data silos

\mathcal{S}

Model

\mathcal{M}

Profiler

\mathcal{P}

Number of training epochs

e

Output : Trained model

\mathcal{M}

\mathcal{RG}\leftarrow\{\}

// Initialize set of relatedness graphs

\text{f}\leftarrow[]

// List of initial node feature vectors

n\leftarrow|\mathcal{S}|

for $i\leftarrow 1$ to $n$ do

RG_{i}\leftarrow\text{ConstructGraphFromSilo}(S_{i})

\mathcal{RG}.\text{add}(RG_{i})

foreach node $u\in RG_{i}$ do

{}_{u}

\leftarrow\mathcal{P}(u)

// Compute profile of corresponding column and store it as u’s initial feature vector

3 f.append(f

{}_{u}

)

4 end foreach

6 end for

\mathcal{PE},\mathcal{NE}\leftarrow\{\}

// Initialize sets of positive/negative edges

7 for $i\leftarrow 1$ to $n$ do

8 foreach edge $(u,v)\in RG_{i}$ do

\mathcal{PE}.\text{add}((u,v))

10 end foreach

\mathcal{NE}.\text{union}(\text{SampleNegativeEdges}(RG_{i}))

12 end for

13for $i\leftarrow 1$ to $e$ do

\leftarrow\mathcal{M}

.GraphSage(

\mathcal{RG}

, f) // Apply GraphSAGE to all relatedness graphs and get node embeddings

PosEdgePred\leftarrow\mathcal{M}.

MLP(

\mathcal{PE}

, h) // Get link predictions for positive edges

NegEdgePred\leftarrow\mathcal{M}.

MLP(

\mathcal{NE}

, h) // Get link predictions for negative edges

Loss

\leftarrow

ComputeLoss(

PosEdgePred

NegEdgePred

) // Compute cross-entropy loss based on predictions

Loss.BackPropagate(

\mathcal{M}

.parameters) // Tune model parameters with backwards propagation

15 end for

Algorithm 1 SiMa

In the above model, there can be certain modifications with respect to the kind of GNN used (e.g. replace GraphSAGE with the classical Graph Convolutional Network (Kipf and Welling, 2016)) and prediction model (e.g. replace MLP by a simple dot product model). However, since the focus of this work is on building a method which uses GNNs as a tool towards matching across data silos, and not on comparing/proposing novel GNN-based link prediction models, we opt for a model architecture similar to the ones employed for link prediction (Vretinaris et al., 2021; Ying et al., 2018).

6.1. SiMa’s Pipeline

In Algorithm 1 we show the pipeline that we employ with SiMa. The key challenge here is to build a model that can represent columns of data silos in such a way, so that relatedness prediction based on them is correct. Our method has four inputs: i) the set of data silos $\mathcal{S}$ , ii) our defined model $\mathcal{M}$ , including the GraphSAGE neural network and the MLP predictor, iii) the profiler $\mathcal{P}$ that we use in order to initialize feature vectors of nodes, and iv) the number of training epochs $e$ . The output of SiMa consists of the trained model $\mathcal{M}$ , which can then be used to embed any column of a data silo and, based on these embeddings, predict links between columns.

Initially, all data silos in $\mathcal{S}$ are transformed to their relatedness graph counterpart. In addition, we compute the corresponding profiles of each node and store them as initial feature vectors (lines 4-11). Based on these graphs, we construct the sets of positive and negative edges to feed our training process (lines 13-18). While getting positive edges is trivial, since we just fetch the edges that are present in the relatedness graphs, constructing a set of negative edges requires a sampling strategy (line 17). This is because the set of all negative edges is orders of magnitude larger than the set of positive ones. Ergo, we need to sample some of these negative edges in order to balance the ratio of positive to negative examples for our training. We elaborate on our optimized strategies for negative edge sampling in Section 7.1.

Following the preparation of positive and negative edge training samples, we move to the training of our model (lines 19 - 25). In specific, we start by applying the current GraphSAGE neural network through the message passing and aggregation functions shown in Equation 1. At the next step, we get the predictions for the pairs of nodes in the set of positive and negative edges respectively (lines 21-22), by placing in the input of our defined MLP architecture their corresponding embeddings. Finally, the cross-entropy loss is calculated (Equation 2) based on all predictions made for both positive and negative edges (line 23) and based on it we back propagate the errors in order to tune the parameters of the GraphSAGE and MLP models used (line 24). The training process repeats for the number of epochs $e$ , which is specified in the input. In the end of this loop, we get our trained model $\mathcal{M}$ which is able to embed columns in data silos and, based on these representations, predict whether they are related or not.

7. Optimization Techniques

In this section, we present novel techniques applied in Algorithm 1: i) sampling (Section 7.1) and ii) incremental model training (Section 7.2).

7.1. Negative Sampling Strategies

Since the number of possible negative edges in our relatedness graphs might be overwhelming with respect to the number of positive edges, we need to devise negative sampling strategies. In fact, negative sampling for graph representation learning has been shown to drastically impact the effectiveness of a model (Yang et al., 2020). Such sampling techniques can provide with negative edge samples that help our link prediction model distinguish related/dissimilar columns. Thus, in the following we describe three negative sampling strategies (termed as NS1, NS2, NS3), each enhanced with different insights. It is important to mention here that these negative sampling techniques take place inside every relatedness graph, where we have the knowledge of which node pairs represent negative examples. In Figure 4 we depict each sampling strategy and how it operates on the relatedness graph of Figure 3.

NS1: Sampling on whole graph. The most straightforward and simple way to compute a sample of negative (non-directed) edges per relatedness graph, is to randomly sample some of them out of the set of all possible negative node pairs. Specifically, based on the node connectivity information we have about each relatedness graph, we are able to compute the full set of distinct pairs which include nodes from different connected components. Then, we randomly pick some of them in order to construct a set with a size equal to the number of positive edges in the corresponding relatedness graph to feed to our loss function.

As we see in Figure 4a, a major drawback is that there could be nodes not connected with any negative edge in the sample, like the three rightmost nodes in the figure. This could severely affect the training process, since for these nodes we miss information about nodes they should not relate to. Moreover, it might be that certain nodes show up more frequently in the negative samples than others, which creates an unwanted imbalance in the training data for negative examples.

NS2: Sampling per node. To guarantee that every node is associated with at least one negative edge, we randomly sample negative edges for each node separately. To balance the number of positive and negative edges that a node is associated with, we specify the sample size to be equal to the degree of the node in the relatedness graph, i.e., to the number of positive edges; since we want to control the number of incoming negative edges per node, we opt for directed edges.

Figure 4b shows a possible output of such a negative edge sampling strategy. In contrast to the previous strategy, we see that now every node in the graph receives a sample of directed negative edges, of size equal to its corresponding degree in the original relatedness graph. Nonetheless, this improved sampling strategy does not ensure that a node will receive negative edges from a set of nodes that belong to different connected components, namely different column domains. For example, in Figure 4b the upper left “Address” node receives two edges both coming from the connected component representing the domain of customer names. This non-diversity of the negative samples that are associated with each node, disrupts the learning process since the model does not receive enough information about which columns should not be regarded as related.

NS3: Sampling per domain. To improve the shortness of diversity in the negative edges each node receives, we impose sampling per node to take place per different domain, i.e., for each different connected component in the relatedness graph. In detail, this time we pick the random samples based on each connected component that has not yet been associated to the node. Hence, each node receives at least one negative edge from every other connected component in the graph, ensuring this way that there is diverse and complete information with respect to domains that the corresponding column does not relate. To keep the number of negative edges close to the number of positive ones, we specify one random sample from each domain per node.

To illustrate how the above strategy proceeds, in Figure 4c we show negative samples computed only for two different nodes, “Tax id” and “Name” (we do so in order to not overload the figure with negative edges for all nodes). Indeed, as we discussed above, both of these nodes receive exactly one randomly picked edge from each unrelated domain. Therefore, every node has complete information which can be leveraged by our proposed model in order to learn correctly which pairs of nodes should not be linked.

Remarks. When using NS3 the number of negative edges sampled for training might be considerably higher than the one of positive edges. To deal with this imbalance of positive and negative data, we use the weighted version of the binary cross-entropy function:

(3)

\displaystyle\begin{split}\mathcal{L}&=-\sum_{(u,v)\in RG_{i}}{\bm{w_{p}}\cdot% \log{\sigma(sim(u,v))}}\\ &\quad-\sum_{(u,v)\in NE_{i}}{\log(1-\sigma(sim(u,v)))}\end{split}

where $w_{p}$ is the weight we use to balance the contribution of the positive and the negative examples, which we set to be equal to the ratio of negative to positive edges included in the training.

n\leftarrow|\mathcal{RG}|

;

TG\leftarrow[]

// List of relatedness graphs included in the training

2 for $i\leftarrow 1$ to $n$ do

TG

.append(

RG_{i}

) ;

4 for $j\leftarrow 1$ to $e_{p}$ do

\leftarrow\mathcal{M}

.GraphSage(

TG_{i}

, f

{}_{i}

) // Apply GraphSAGE only on relatedness graphs in TG

PosEdgePred\leftarrow\mathcal{M}.

MLP(

\mathcal{PE}_{i}

, h) // Get link predictions for positive edges

NegEdgePred\leftarrow\mathcal{M}.

MLP(

\mathcal{NE}_{i}

, h) // Get link predictions for negative edges

Loss

\leftarrow

ComputeLoss(

PosEdgePred

NegEdgePred

) // Compute cross-entropy loss based on predictions

Loss.BackPropagate(

\mathcal{M}

.parameters) // Tune model parameters with backwards propagation

6 end for

8 end for

Algorithm 2 Incremental Training

7.2. Incremental Training

Originally, SiMa trains on the positive and negative samples it receives by taking into consideration every relatedness graph in the input (lines 19-25 in Algorithm 1). However, proceeding with training on the whole set of graphs might harm the effectiveness of the learning process, since the model in each epoch trains on the same set of positive and negative samples; hence, it can potentially overfit. Therefore, we need to devise an alternative training strategy, which feeds the model with new training data periodically.

Towards this direction, we design an incremental training scheme that proceeds per relatedness graph. In specific, we initiate training with one relatedness graph and the corresponding positive/negative samples we get from it. After a specific number of epochs, we add the training samples from another relatedness graph and we continue the process by adding every other relatedness graph. In this way, we help the model to deal periodically with novel samples potentially representing previously unseen domains that the new relatedness graph brings; thus, we increase the chances of boosting the effectiveness that the resulting link prediction will have. Essentially, our incremental training scheme resembles curriculum learning (Bengio et al., 2009) in that it constantly provides the learning process with new data; yet, curriculum learning methods also verify that the training examples are of increasing difficulty.

Algorithm 2 shows how incremental training proceeds and replaces the original training scheme in the context of our initial pipeline in Algorithm 1. We see that the only difference with the previous scheme is that now we train the model on an incrementally growing set of relatedness graphs ( $TG$ of lines 2), which is initialized with the first relatedness graph and it receives an extra graph, periodically every $e_{p}$ epochs (which we assume to be a fraction of the original number of epochs $e$ in Algorithm 1), until it contains all of them in the final iteration. For each epoch, we apply GraphSAGE only on the relatedness graphs in $TG$ (lines 6) and use positive/negative samples coming from them in order to train the MLP (lines 8-9).

8. Experimental Evaluation

In this section, we assess the effectiveness and efficiency of SiMa through an extensive set of experiments. In what follows, we first describe our experimental setup, namely the datasets, baselines and settings against which we assess our method. We then present our experimental results, where we focus on: i) the effect of different parameters for training the GraphSAGE model, ii) how the different sampling and training techniques (Section 7) affect SiMa’s effectiveness and execution time, and iii) how SiMa compares with other matching and simple ML baselines both in terms of effectiveness and efficiency. Our main results can be summarized as follows:

•

The optimization techniques introduced in Section 7 considerably boost SiMa’s effectiveness Figure 5.
•

SiMa’s GNN-based model leverages existing matches better than a simple ML baseline Figure 6 that takes into account only the profiles.
•

SiMa is more effective than the state-of-the-art schema matching method, due to its ability to maintain higher precision values when recall increases Figure 6.
•

Contextualized representations are not suitable for matching columns across data silos.
•

SiMa exhibits lower execution times than other methods Table 4. Especially when compared to state-of-the-art matching methods the gap is considerably high.

8.1. Setup

Methods used for evaluation. We make use of three different methods for comparing SiMa in terms of effectiveness and efficiency:

– COMA (Do and Rahm, 2002), which is a seminal and state-of-the-art matching method that combines multiple criteria in order to output a set of possible matches. We make use of the COMA version that uses both schema and instance-based information about the datasets in order to proceed. Note that COMA is not an applicable solution to the problem of matching across data silos as studied in this paper (Section 3.1). However, we use it in order to see how close SiMa can get to a state of the art matching method, on the same data and in a non-siloed setting. In our experiments we use COMA 3.0 Community Edition.

– Starmie (Fan et al., 2022) is a state-of-the-art top-k unionable table search in data lakes. The method employs a multi-column table encoder that serializes instances from tables to feed them into a pre-trained Language Model (LM) (specifically, the authors use RoBERTa (Liu et al., 2019)). Starmie uses contrastive learning (Chen et al., 2020) to produce column representations that capture relatedness. In our evaluation, we use Starmie, as shared in a public repository²²2https://github.com/megagonlabs/starmie, to produce contextualized column representations for the columns of the datasets included in each silo. We then compute pairwise cosine similarity for columns among datasets of different silos. We ran Starmie with default parameters, except for tuning the max sequence length to 256, number of epochs to 5, and batch size to 8.

– Baseline: MLP. To show the gains of using SiMa’s GNN model to represent dataset columns, we compare our method against a simple Multi-Layer Perceptron (MLP) prediction model. In specific, we use a MLP with one hidden layer, which receives in its input pairs of profiles and learns to predict whether there is a relationship among the columns they represent. In other words, this model disregards column representations computed through SiMa’s GNN model and straightforwardly uses only column profiles and information about existing column matches.

Note that we have not included baselines from the dataset discovery literature (Fernandez et al., 2018a; Cafarella et al., 2009; Das Sarma et al., 2012; Nargesian et al., 2018; Zhu et al., 2019; Bogatu et al., 2020; Zhang and Ives, 2020) as those do not address the problem of matching columns across silos; instead, they address the top-k similar dataset retrieval problem, which makes them not directly applicable to our problem setting.

Except for COMA, Starmie and the MLP baseline, we ran our experiments with two additional schema matching techniques, namely Distribution-based matching (Zhang et al., 2011) and EmbDI (Cappuzzo et al., 2020). Both the Distribution-based method and EmbDI exhibited consistently worse results than COMA, exhibiting very high false negative rates. In addition, EmbDI can only run on considerably smaller number of datasets than the ones we examine in our evaluation. Therefore, we omit presentation of results coming from these methods, and keep COMA as the representative state-of-the-art matching method.

Real-world Datasets and Ground Truth. Since there is no benchmark available for matching across silos in the area of schema matching, nor in the area of related dataset search, we opted for leveraging two real-world, open data repositories:

•

NYC OpenData. The New York City OpenData repository³³3https://opendata.cityofnewyork.us/ contains public data published by New York City agencies and other partners. For the needs of this paper, we use tables under the City Government category.
•

LA OpenData. The Los Angeles OpenData portal⁴⁴4https://data.lacity.org/ encompasses public datasets covering different activities and sectors of the city of Los Angeles. We focus on tabular data under the Administration & Finance category.

For each of the above sources we select a subset of tables and curate them to contain only columns that i) store categorical or text data, and ii) the majority of their instances are not null values. Matching columns that store numerical data is out of the scope of this paper: calculating distribution similarity or set overlaps would be the adequate method to use in such cases (Zhu et al., 2019). Based on this curation, we ended up with 22 tables, for which we manually annotated column matches among them, including equi/fuzzy-joins as well as columns of the same domain that are non-overlap**. We captured 125 and 193 column matches for the tables from the NYC and LA OpenData repositories, respectively.

To derive a larger number of tables, we adopt the method of (Nargesian et al., 2018; Lee et al., 2007). These methods produce pairs of tabular datasets that share a varying number of columns/rows. We make use of Valentine (Koutras et al., 2021) to create scenarios of equi-joinable, fuzzily-joinable and unionable columns of varying difficulty (i.e., zero/low/high - exact/non-exact value overlaps), based on the subsets of tables we created from the two OpenData repositories.

Creating Data Silos. To evaluate our method we construct sets of data silos based on the tables we derived from the two OpenData repositories. Particularly, we create two benchmarks containing a given number of data silos, where each silo contains a number of tables coming from different source tables. This way we ensure that there is a sufficient amount of column matches within each silo from which our model can learn to predict column relationships among datasets across different silos. In Table 2 we detail the number of tables, column matches inside (used for training) and among (used for testing) silos for each of the two benchmarks we created.

Effectiveness calculation. We evaluate the effectiveness of SiMa and the methods we compare against, by computing precision-recall curves based on the predictions (similarity scores for the case of COMA) that we retrieve for every possible pair of columns belonging to datasets of different silos. We opt for using precision-recall curves. Those are ideal for showing effectiveness results, when the distribution of labels in the test set is considerably imbalanced (He and Garcia, 2009), which is the case in our benchmarks. Indeed, in realistic matching scenarios, the number of non-matching column pairs, significantly outnumbers the matching column pairs. Notably, with precision-recall curves we show the effectiveness of methods with respect to varying similarity thresholds. Therefore, we achieve a non-biased presentation of results across the board, in contrast to showing precision and recall values only for specific threshold values.

Implementation details. We experimented with different parameters for training SiMa’s model and the simple MLP-baseline to pick the configuration that performs the best in both benchmarks. We list our observations in the following:

– GNN layers: SiMa does not benefit from using more than one layers for GraphSAGE, since the connected components formed by our graph construction method are complete graphs. Moreover, we found out that using max-pooling, as described in (Hamilton et al., 2017), to aggregate the representations of each node’s neighborhood nodes gives the best results.

– Number of epochs: We ran our model and the simple MLP baseline for several epochs and plotted loss curves when validating their prediction capability in a small subset of the training data (using a 90:10 split). We observed that for more than 100 epochs there is no considerable change in the training/validation loss. Thus, in these experiments we train for 100 epochs. In the case of incremental training and since we have 10 relatedness graphs ( $|\mathcal{RG}|=10$ ) we train incrementally for 10 epochs per relatedness graph, leading to 100 epochs in total.

– Dimension of embeddings: We evaluated the effectiveness of our model for varying dimensions of node representations produced by the GraphSAGE model we use in the range of $\{32,64,128,256,512\}$ . We found out that using embeddings of 256 dimensions provides with the best results.

Benchmark	# Silos	# Datasets	# Training Matches	# Test Matches
NYC OpenData	10	290	5437	34537
LA OpenData	10	224	5913	34986

Table 2. Data silo matching benchmarks used for evaluation.

For training we use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.01, while we use a MLP of one hidden layer for both SiMa and the baseline. Furthermore, our method is implemented in Python 3.7.4 and is openly available for experimenation⁵⁵5https://github.com/delftdata/SiMa, while GraphSAGE was implemented using the Deep Graph Library (Wang et al., 2019) on top of PyTorch.⁶⁶6https://pytorch.org Experiments for SiMa and the MLP baseline ran on an 8-core MacBook Pro, while for running COMA we set up a Linux machine with 128 AMD EPYC 7H12 2.60GHz cores.

8.2. Effect of Optimizations

We assess the effectiveness of different negative sampling techniques and training schemes, as discussed in Section 7. To this end, we run two sets of experiments: i) using the incremental training scheme, we apply four variants of SiMa, where three are based on a different negative sampling strategy and one considers all negative edges without sampling, and ii) using the best such variant, we compare SiMa’s incremental training against training on the whole set of relatedness graphs we get from the data silos.

Sampling Strategies. In Figure 5 we see precision-recall curves for SiMa’s different variants when evaluated upon both data silo benchmarks. First, we validate the boost in effectiveness that sampling edges from each other domain per node, i.e. NS3, can bring. Particularly, we see a considerable increase in both precision and recall, since with NS3 every node receives negative edges that cover the spectrum of other domains present in the corresponding relatedness graph. Consequently, the false positive links that our method predicts decrease (i.e. precision increases), while the better representational quality of the embeddings produced by our encapsulated GraphSAGE model ensures fewer false negatives (i.e. recall increases). Moreover, we see that using SiMa with NS3 can produce a higher precision for high recall values, especially in the case of LA OpenData.

On the other hand, the other two sampling techniques, NS1 and NS2, and the variant using all negative samples exhibit different results depending on each benchmark. In specific, we see that sampling edges per node, as specified by NS2, produces low effectiveness results in both data silo settings. Precision for high recall values is mediocre, due to the lack of diversity and completeness about the knowledge each node receives about other domains in the relatedness graph. Surprisingly, picking negative edges at random on the whole graph, as specified by NS1, seems to bring consistently better results than the NS2 variant, even if it does not guarantee that every node is covered by the negative edges sampled. However, NS1 may pick negative edges that are more informative, yet this cannot be guaranteed due to its randomness.

Finally, we observe that using all available negative edges without sampling brings inconsistent results. In Figure 5 we see the variant using all negative edges during training performs better than employing NS1 and NS2, while it is very close to NS3. On the contrary, in Figure 5 SiMa with all negative edges results is worse than using NS1 and only slightly better than NS2. This behavior is to be expected, since not employing a dedicated sampling strategy that guarantees the quality and amount of negative edges included during training (like NS3), means that the model risks overfitting.

Takeaways: $i)$ among the different sampling strategies, sampling per domain – NS3, yields the best results; $ii)$ removing negative sampling harms effectiveness.

Incremental Training. Here we want to verify whether incremental training has a substantial influence on the effectiveness of the training process. We observe that training on all relatedness graphs from the beginning can severely affect the effectiveness of our method, since our model overfits on the set of possible and negative samples it receives. In contrast, our incremental training scheme drastically helps our model to adapt to new examples and significantly improves its prediction correctness. Indeed, as seen on the right-hand side of Figure 5 by applying SiMa’s model on every relatedness graph incrementally in the order of number of edges they store, we make sure that the learning process can leverage the novel information that each graph brings, i.e. novel examples of semantic types that were not introduced by the previous graphs. Moreover, incremental training ensures faster execution times, since in earlier epochs the model sees less training examples. Therefore, in the following experiments we configure SiMa to apply the incremental training scheme and use NS3 as the negative edge sampling strategy.

Takeaway: SiMa’s incremental training scheme improves the effectiveness of SiMa as shown by the precision-recall curves, with higher precision for high recall values.

8.3. SiMa comparison to other methods

We compare SiMa with COMA, Starmie and the MLP baseline, to showcase the capability of our method to achieve better results in both effectiveness and efficiency. For a fair comparison with the MLP baseline we train it using the best negative sampling technique as found in Section 8.2, i.e., NS3, while we do so by employing the incremental training scheme. Below, we discuss the results.

Effectiveness comparison. Figure 6 shows the comparison of SiMa against COMA (Do and Rahm, 2002), Starmie (Fan et al., 2022) and the MLP baseline, in terms of effectiveness. First, we observe that SiMa learns significantly better how to disambiguate between positive and negative links, based on the knowledge that exists in each data silo, than the MLP baseline. For both data silo benchmarks we see that using only the initial column profiles with a simple MLP prediction model does not give good results. Indeed, existing matches between columns that are represented by profiles that are not similar, cannot help a model. On the contrary, SiMa can learn the intrinsic graph characteristics that lead to a column relationship, by exploiting the message passing component of GNNs.

Surprisingly, even in the case where we could employ the state-of-the-art schema matching method of COMA for matching data silos, we observe that it would give inferior results compared to SiMa. In particular, Figure 6 shows that COMA cannot keep a high precision for recall values above 0.4, which means that there is only a small fraction of matches that it can correctly predict. Similarly, in Figure 6 we see that COMA’s precision significantly drops for recall values above 0.5. On the contrary, SiMa in both cases can be highly precise even for recall values above 0.8. This is due to the fact that the similarity signals that COMA uses are oftentimes not sufficient to distinguish whether a pair of columns is a match or not; however, existing matches in the silos and the architecture of SiMa’s model enable our method to accurately sort out true negatives. Notably, our model outperforms COMA even if matching columns in the benchmarks we created have similar or exactly the same names, which is something that COMA takes advantage of. In a real world scenario, column names might not be human-understandable or could be missing, which would considerably decrease the effectiveness of COMA. SiMa is agnostic to column names, hence its performance is not affected by their existence/quality.

Finally, we observe that the contextualized column representations trained through BERT (Devlin et al., 2018) with Starmie produce results of low quality. In specific, we noticed that the false negative rate is significantly high when considering cosine similarity of Starmie embeddings between columns. Nonetheless, this result is expected: using context information to find column matches among datasets of different silos is not effective in our case, since most of the matches represent joins of columns that share no common context. In the original paper of Starmie (Fan et al., 2022) such column representations are shown to be effective due to the nature of the problem that is targeted there: discovering unionable tables, requires a method that captures well the context of their columns.

Takeaways: $i)$ SiMa exhibits consistently high effectiveness, whereas the competition falls short in precision for high recall values; $ii)$ embeddings computed through GNNs, have higher representational power than initial column features (MLP baseline); $iii)$ contextualized column representations are not suitable for matching columns across data silos.

Best F1 Scores
Benchmark	SiMa	COMA	Starmie	MLP
NYC OpenData	0.787	0.564	0.384	0.656
LA OpenData	0.858	0.600	0.310	0.736
PR-AUC Scores
Benchmark	SiMa	COMA	Starmie	MLP
NYC OpenData	0.774	0.561	0.358	0.619
LA OpenData	0.861	0.578	0.292	0.761

Table 3. Effectiveness scores of SiMa and competition.

Benchmark	SiMa	COMA	Starmie	MLP
NYC OpenData	52	30900	73	59
LA OpenData	51	20100	61	54

Table 4. Total execution times in minutes (CPU).

Efficiency comparison. In Table 4, we see how SiMa compares with the other method in terms of efficiency measured in minutes. The total execution time for SiMa and the MLP baseline refers to the sum of dataset profiling, training and inference times.

First, we observe that SiMa is considerably cheaper than the state-of-the-art traditional matching method COMA. Specifically, SiMa is more than two orders of magnitude faster. SiMa’s runtime is dominated by the computation of profiles (roughly $80\%$ of total execution), hence in the case where these are pre-computed our method can give results in a small fraction of the time shown in the table. Additionally, we verify that employing state-of-the-art schema matching methods, in this scale, might be infeasible: in real-world scenarios where datasets of multiple data silos with variable sizes need to be matched this can be prohibitively expensive. Specifically, COMA’s syntactic similarity-based matching can be slow due to computations of various measures among instance sets of columns (e.g. TF-IDF), especially in the case where there are a lot of text values.

On the other hand, we observe that using the initial profiles of the columns for training a simple prediction model with the MLP baseline not only is much less effective, but also exhibits slower training times. This is because the dimensionality of the initial profiles is much larger than the ones produced through the GraphSAGE model we employ in SiMa. In addition, Starmie is slow when ran on CPU due to the computationally intensive training of the contextualized column representations through BERT, and the generation of positive and negative examples for its contrastive learning process.

Takeaway: the complete pipeline of SiMa (profile computation, graph construction, training and inference) requires orders of magnitude less time and resources than the best-performing schema matching algorithm. This is due to the use of lower-dimension GNN embeddings for training our prediction model.

9. Conclusion

In this paper, we introduced SiMa, a novel method for matching columns across disparate data silos, which uses an effective prediction model based on the representational power of GNNs. SiMa uses the knowledge about existing relationships among datasets in silos, in order to build a model that can capture potential links across them. Our experimental results show that SiMa can be more effective than state-of-the-art matching and column representation methods, while it is significantly faster and cheaper to employ. Moreover, we show that our optimization techniques significantly improve the effectiveness of our method.

References

(1)
Abedjan et al. (2015) Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. 2015. Profiling relational data: a survey. VLDBJ 24, 4 (2015), 557–581.
Authors (2022) Anonymous Authors. 2022. Anonymous Title. Non-archival Workshop (no proceedings). (2022).
Bengio et al. (2009) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning. 41–48.
Bharadwaj et al. (2021) Sagar Bharadwaj, Praveen Gupta, Ranjita Bhagwan, and Saikat Guha. 2021. Discovering related data at scale. Proceedings of the VLDB Endowment 14, 8 (2021), 1392–1400.
Bogatu et al. (2020) Alex Bogatu, Alvaro AA Fernandes, Norman W Paton, and Nikolaos Konstantinou. 2020. Dataset Discovery in Data Lakes. In IEEE ICDE.
Bogatu et al. (2022) Alex Bogatu, Norman W Paton, Mark Douthwaite, and André Freitas. 2022. Voyager: Data Discovery and Integration for Data Science. In Proceedings 25th International Conference on Extending Database Technology (EDBT 2022).
Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL 5 (2017), 135–146.
Cafarella et al. (2009) Michael J Cafarella, Alon Halevy, and Nodira Khoussainova. 2009. Data integration for the relational web. In VLDB.
Cappuzzo et al. (2020) Riccardo Cappuzzo, Paolo Papotti, and Saravanan Thirumuruganathan. 2020. Creating embeddings of heterogeneous relational datasets for data integration tasks. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. 1335–1349.
Chen et al. (2018) Chen Chen, Behzad Golshan, Alon Y Halevy, Wang-Chiew Tan, and AnHai Doan. 2018. BigGorilla: An Open-Source Ecosystem for Data Preparation and Integration. IEEE Data Eng. Bull. 41, 2 (2018), 10–22.
Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning. PMLR, 1597–1607.
Chepurko et al. (2020) Nadiia Chepurko, Ryan Marcus, Emanuel Zgraggen, Raul Castro Fernandez, Tim Kraska, and David Karger. 2020. ARDA: automatic relational data augmentation for machine learning. Proceedings of the VLDB Endowment 13, 9 (2020), 1373–1387.
Das Sarma et al. (2012) Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, and Cong Yu. 2012. Finding Related Tables. In ACM SIGMOD.
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
Do and Rahm (2002) Hong-Hai Do and Erhard Rahm. 2002. COMA: a system for flexible combination of schema matching approaches. In VLDB.
Errica et al. (2019) Federico Errica, Marco Podda, Davide Bacciu, and Alessio Micheli. 2019. A fair comparison of graph neural networks for graph classification. arXiv preprint arXiv:1912.09893 (2019).
Fan et al. (2022) Grace Fan, ** Wang, Yuliang Li, Dan Zhang, and Renée Miller. 2022. Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning. arXiv preprint arXiv:2210.01922 (2022).
Fan et al. (2019) Wenqi Fan, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. 2019. Graph neural networks for social recommendation. In The World Wide Web Conference. 417–426.
Fernandez et al. (2018a) Raul Castro Fernandez, Ziawasch Abedjan, et al. 2018a. Aurum: A data discovery system. In IEEE ICDE.
Fernandez and Madden (2019) Raul Castro Fernandez and Samuel Madden. 2019. Termite: a system for tunneling through heterogeneous data. In Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management. 1–8.
Fernandez et al. (2018b) Raul Castro Fernandez, Essam Mansour, et al. 2018b. See** semantics: Linking datasets using word embeddings for data discovery. In IEEE ICDE.
Gal (2011) Avigdor Gal. 2011. Uncertain schema matching. Synthesis Lectures on Data Management 3, 1 (2011), 1–97.
Halevy et al. (2016) Alon Halevy, Flip Korn, Natalya F Noy, et al. 2016. Goods: Organizing google’s datasets. In SIGMOD. ACM, 795–806.
Hamilton et al. (2017) William L Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 1025–1035.
He and Garcia (2009) Haibo He and Edwardo A Garcia. 2009. Learning from imbalanced data. IEEE Transactions on knowledge and data engineering 21, 9 (2009), 1263–1284.
Hulsebos et al. (2019) Madelon Hulsebos, Kevin Hu, Michiel Bakker, Emanuel Zgraggen, Arvind Satyanarayan, Tim Kraska, Cagatay Demiralp, and César Hidalgo. 2019. Sherlock: A deep learning approach to semantic data type detection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1500–1508.
Khatiwada et al. (2022) Aamod Khatiwada, Grace Fan, Roee Shraga, Zixuan Chen, Wolfgang Gatterbauer, Renée J Miller, and Mirek Riedewald. 2022. SANTOS: Relationship-based Semantic Table Union Search. arXiv preprint arXiv:2209.13589 (2022).
Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
Kipf and Welling (2016) Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
Koutras et al. (2020) Christos Koutras, Marios Fragkoulis, Asterios Katsifodimos, and Christoph Lofi. 2020. REMA: Graph Embeddings-based Relational Schema Matching. In SEAData.
Koutras et al. (2021) Christos Koutras, George Siachamis, Andra Ionescu, Kyriakos Psarakis, Jerry Brons, Marios Fragkoulis, Christoph Lofi, Angela Bonifati, and Asterios Katsifodimos. 2021. Valentine: Evaluating Matching Techniques for Dataset Discovery. In 2021 IEEE 37th International Conference on Data Engineering (ICDE). IEEE, 468–479.
Lee et al. (2007) Yoonkyong Lee, Mayssam Sayyadian, AnHai Doan, and Arnon S. Rosenthal. 2007. ETuner: Tuning Schema Matching Software Using Synthetic Scenarios. VLDBJ 16, 1 (2007), 97–122.
Lehmberg and Bizer (2017) Oliver Lehmberg and Christian Bizer. 2017. Stitching web tables for improving matching quality. In VLDB.
Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, **gfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
Madhavan et al. (2001) Jayant Madhavan, Philip A Bernstein, and Erhard Rahm. 2001. Generic schema matching with cupid. In VLDB.
Mansour et al. (2021) Essam Mansour, Kavitha Srinivas, and Katja Hose. 2021. Federated Data Science to Break Down Silos [Vision]. SIGMOD record (2021).
Mansour et al. (2022) Essam Mansour, Kavitha Srinivas, and Katja Hose. 2022. Federated Data Science to Break Down Silos [Vision]. ACM SIGMOD Record 50, 4 (2022), 16–22.
Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, et al. 2013. Distributed representations of words and phrases and their compositionality. In NIPS.
Miller (2018) Renée J Miller. 2018. Open data integration. Proceedings of the VLDB Endowment 11, 12 (2018), 2130–2139.
Nandi and Bernstein (2009) Arnab Nandi and Philip A Bernstein. 2009. HAMSTER: using search clicklogs for schema and taxonomy matching. Proceedings of the VLDB Endowment 2, 1 (2009), 181–192.
Nargesian et al. (2018) Fatemeh Nargesian, Erkang Zhu, Ken Q Pu, and Renée J Miller. 2018. Table union search on open data. In VLDB.
Psallidas et al. (2022) Fotis Psallidas, Yiwen Zhu, Bojan Karlas, Jordan Henkel, Matteo Interlandi, Subru Krishnan, Brian Kroth, Venkatesh Emani, Wentao Wu, Ce Zhang, et al. 2022. Data Science Through the Looking Glass: Analysis of Millions of GitHub Notebooks and ML. NET Pipelines. ACM SIGMOD Record 51, 2 (2022), 30–37.
Rahm and Bernstein (2001) Erhard Rahm and Philip A Bernstein. 2001. A survey of approaches to automatic schema matching. VLDBJ 10, 4 (2001), 334–350.
Shraga et al. (2020) Roee Shraga, Avigdor Gal, and Haggai Roitman. 2020. Adnev: Cross-domain schema matching using deep similarity matrix adjustment and evaluation. Proceedings of the VLDB Endowment 13, 9 (2020), 1401–1415.
Vretinaris et al. (2021) Alina Vretinaris, Chuan Lei, Vasilis Efthymiou, Xiao Qin, and Fatma Özcan. 2021. Medical entity disambiguation using graph neural networks. In Proceedings of the 2021 International Conference on Management of Data. 2310–2318.
Wang et al. (2019) Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei Li, Xiang Song, ****g Zhou, Chao Ma, Lingfan Yu, Yu Gai, et al. 2019. Deep graph library: A graph-centric, highly-performant package for graph neural networks. arXiv preprint arXiv:1909.01315 (2019).
Wu et al. (2020) Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. 2020. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems 32, 1 (2020), 4–24.
Yang et al. (2020) Zhen Yang, Ming Ding, Chang Zhou, Hongxia Yang, **gren Zhou, and Jie Tang. 2020. Understanding negative sampling in graph representation learning. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1666–1676.
Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 974–983.
Zhang et al. ([n. d.]) Dan Zhang, Yoshihiko Suhara, **feng Li, Madelon Hulsebos, Catagay Demiralp, and Wang-Chiew Tan. [n. d.]. Sato: Contextual Semantic Type Detection in Tables. Proceedings of the VLDB Endowment 13, 11 ([n. d.]).
Zhang and Chen (2018) Muhan Zhang and Yixin Chen. 2018. Link prediction based on graph neural networks. Advances in Neural Information Processing Systems 31 (2018), 5165–5175.
Zhang et al. (2011) Meihui Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, et al. 2011. Automatic discovery of attributes in relational databases. In ACM SIGMOD.
Zhang and Ives (2020) Yi Zhang and Zachary G Ives. 2020. Finding Related Tables in Data Lakes for Interactive Data Science. In ACM SIGMOD.
Zhu et al. (2019) Erkang Zhu, Dong Deng, Fatemeh Nargesian, and Renée J. Miller. 2019. JOSIE Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes. In ACM SIGMOD.