SQLformer: Deep Auto-Regressive Query Graph Generation for Text-to-SQL Translation

Adrián Bazaga, Pietro Liò, Gos Micklem
University of Cambridge, Cambridge, United Kingdom
{ar989,pl219,gm263}@cam.ac.uk

Abstract

In recent years, the task of text-to-SQL translation, which converts natural language questions into executable SQL queries, has gained significant attention for its potential to democratize data access. Despite its promise, challenges such as adapting to unseen databases and aligning natural language with SQL syntax have hindered widespread adoption. To overcome these issues, we introduce SQLformer, a novel Transformer architecture specifically crafted to perform text-to-SQL translation tasks. Our model predicts SQL queries as abstract syntax trees (ASTs) in an autoregressive way, incorporating structural inductive bias in the encoder and decoder layers. This bias, guided by database table and column selection, aids the decoder in generating SQL query ASTs represented as graphs in a Breadth-First Search canonical order. Our experiments demonstrate that SQLformer achieves state-of-the-art performance across six prominent text-to-SQL benchmarks.


1 Introduction

Relational databases are essential tools in critical sectors such as healthcare and industry. For those with technical expertise, accessing data through a structured query language such as SQL can be efficient. However, the intricate nature of SQL can make it daunting for non-technical users to learn, which creates a significant barrier to data access.

Consequently, there has been a surge in interest in the field of text-to-SQL Cai et al. (2018); Zelle and Mooney (1996); Xu et al. (2017); Yu et al. (2018); Yaghmazadeh et al. (2017), which aims to convert natural language questions (NLQs) directly into SQL queries. This has the potential to dramatically reduce the obstacles faced by non-expert users when interacting with relational databases (DBs).

Early work in the field primarily focused on developing and evaluating semantic parsers for individual databases Hemphill et al. (1990); Dahl et al. (1994); Zelle and Mooney (1996); Zettlemoyer and Collins (2012); Dong and Lapata (2016). However, given the widespread use of DBs, an approach based on creating a separate semantic parser for each database does not scale.

One of the key hurdles in achieving domain generalisation Wang et al. (2021); Cao et al. (2021); Wang et al. (2022); Cai et al. (2022); Hui et al. (2022) is the need for complex reasoning to generate SQL queries rich in structure. This involves the ability to accurately contextualise a user query against a specific DB by considering both explicit relations (like the table-column relations defined by the DB schema) and implicit relations (like determining if a phrase corresponds or applies to a specific column or table).

Recently, large-scale datasets Yu et al. (2019b); Zhong et al. (2017) comprising hundreds of DBs and their associated question-SQL pairs have been released. This has opened up the possibility of developing semantic parsers capable of functioning effectively across different DBs Guo et al. (2019); Bogin et al. (2019); Zhang et al. (2019); Wang et al. (2021); Suhr et al. (2020); Choi et al. (2020); Bazaga et al. (2021). However, this requires the model to interpret queries in the context of relational DBs unseen during training, and to precisely convey the query intent through SQL logic. As a result, cross-DB text-to-SQL semantic parsers cannot simply rely on memorising observed SQL patterns. Instead, they must accurately model the natural language query, the underlying DB structures, and the context of both.

Current strategies for cross-DB text-to-SQL semantic parsers generally follow a set of design principles to navigate these challenges. First, the question and schema representation are contextualised mutually by learning an embedding function conditioned on the schema Hwang et al. (2019); Guo et al. (2019); Wang et al. (2021). Second, pre-trained language models (LMs), such as BERT Devlin et al. (2019) or RoBERTa Liu et al. (2019), have been shown to greatly improve parsing accuracy by enhancing generalisation over language variations and capturing long-range dependencies. Related approaches Yin et al. (2020); Yu et al. (2021a) have adopted pre-training on a BERT architecture with the inclusion of grammar-augmented synthetic examples, which when combined with robust base semantic parsers, have achieved state-of-the-art results.

In this paper, we present SQLformer, a novel Transformer variant with grammar-based decoding for text-to-SQL translation. We represent each NLQ as a graph with syntactic and part-of-speech relationships and depict the database schema as a graph of table and column metadata. Inspired by the image domain Dosovitskiy et al. (2021), we incorporate learnable table and column embeddings into the encoder to select relevant tables and columns. Our model enriches the decoder input with this database information, guiding the decoder with schema-aware context. Then, the autoregressive decoder predicts the SQL query as an AST. Unlike large pre-trained language models or prompt-based techniques such as GPT-3, SQLformer offers greater efficiency and adaptability. We investigate SQLformer performance using six common text-to-SQL benchmarks of varying sizes and complexities. Our results show that SQLformer consistently achieves state-of-the-art performance across the multiple benchmarks, delivering more accurate and effective text-to-SQL capabilities on real-world scenarios.

2 Related Work

Earlier research often employed a sketch-based slot filling approach for SQL generation, which divides the task into several independent modules, each predicting a distinct part of the SQL query. Notable methods include SQLNet Xu et al. (2017), TypeSQL Yu et al. (2018), SQLOVA Hwang et al. (2019), X-SQL He et al. (2019), and RYANSQL Choi et al. (2020). These methods work well for simple queries but struggle with more complex scenarios typically encountered in real-world applications.

To address the challenges of complex SQL tasks, attention-based architectures have been widely adopted. For instance, IRNet Guo et al. (2019) separately encodes the question and schema using a LSTM and a self-attention mechanism respectively. Schema linking is accomplished by enhancing the question-schema encoding with custom type embeddings. The SQL rule-based decoder from Yin and Neubig (2017a) was then used in order to decode a query into an intermediate representation, attaining a high-level abstraction for SQL.

On the other hand, graph-based approaches have also been effective in modeling complex question and database relationships. For instance, Global-GNN Bogin et al. (2019) models the database as a graph, while RAT-SQL Wang et al. (2021) introduces schema encoding and linking, attributing a relation to every pair of input items. Further developments include LGESQL Cao et al. (2021), which distinguishes between local and non-local relations. SADGA Cai et al. (2022) utilises contextual and dependency structure to jointly encode the question graph with the database schema graph. $\textnormal{S}^{2}\textnormal{SQL}$ Hui et al. (2022) incorporates syntactic dependencies in a relational graph network Wang et al. (2020), and RASAT Qi et al. (2022) integrates a relation-aware self-attention module into T5 Raffel et al. (2020). These methods have demonstrated the effectiveness of modeling questions and database schema as relational graphs.

Recent work has demonstrated the effectiveness of fine-tuning pre-trained models. For instance, Shaw et al. (2021) showed that fine-tuning a pre-trained T5-3B model could yield competitive results. Building on this, Scholak et al. (2021) introduced PICARD, a technique that constrains the auto-regressive decoder by applying incremental parsing during inference time. This approach filters out grammatically incorrect sequences in real time during beam search, improving the quality of the generated SQL. RESDSQL Li et al. (2023a) proposes a schema ranking approach, retaining only the schema items most relevant to the question, before feeding them to a pre-trained RoBERTa Liu et al. (2019) in a seq2seq setting. However, these methods leverage pre-trained language models without incorporating SQL-specific constraints during decoding, which can limit their performance.

3 Preliminaries

3.1 Problem Formulation

Given a natural language question $\mathcal{Q}$ and a schema $\mathcal{S} = (\mathcal{T}, \mathcal{C})$ for a relational database, our objective is to generate a corresponding SQL query $\mathcal{Y}$. Here, $\mathcal{Q} = \{q_{1}, \ldots, q_{|\mathcal{Q}|}\}$ is a sequence of natural language tokens or words, where $|\mathcal{Q}|$ is the length of the question. The database schema comprises tables $\mathcal{T} = \{t_{1}, \ldots, t_{|\mathcal{T}|}\}$ and columns $\mathcal{C} = \{c_{1}, \ldots, c_{|\mathcal{C}|}\}$, where $|\mathcal{T}|$ and $|\mathcal{C}|$ are the number of tables and columns in the database, respectively. Each column name $c_{i} \in \mathcal{C}$ comprises tokens $\{c_{i,1}, \ldots, c_{i,|c_{i}|}\}$, where $|c_{i}|$ is the number of tokens in the column name. Similarly, each table name $t_{i} \in \mathcal{T}$ comprises tokens $\{t_{i,1}, \ldots, t_{i,|t_{i}|}\}$, where $|t_{i}|$ is the number of tokens in the table name.

3.2 Query Construction

We define the output SQL query $\mathcal{Y}$ as a graph, representing the AST of the query in the context-free grammar of SQL, which our model learns to generate in an autoregressive fashion. The query is an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ of vertices $\mathcal{V}$ and edges $\mathcal{E}$. Similar to previous works Yin and Neubig (2017b); Wang et al. (2021); Qi et al. (2022), the nodes $\mathcal{V} = \mathcal{P} \cup \mathcal{T} \cup \mathcal{C}$ are the possible actions derived from SQL context-free grammar rules Yin and Neubig (2017b), $\mathcal{P}$, such as SelectTable, SelectColumn, Root, as well as the tables ($\mathcal{T}$) and the columns ($\mathcal{C}$) of the database schema. $\mathcal{P}$ is used to represent non-terminal nodes, depicting rules of the grammar, whereas $\mathcal{T}$ and $\mathcal{C}$ are used for terminal nodes, such as when selecting table or column names to be applied within a specific rule. The edge set $\mathcal{E} = \{(v_{i}, v_{j}) \mid v_{i}, v_{j} \in \mathcal{V}\}$ defines the connectivity between the different nodes.

In particular, we choose to represent the graph using an adjacency matrix under a Breadth-First-Search (BFS) node ordering scheme, $\pi$, that maps nodes to rows of the adjacency matrix as a sequence You et al. (2018). This approach permits the modelling of graphs of varying size, such as the ones representing the ASTs of complex SQL queries. Formally, given a mapping $f_{S}$ from graphs, $\mathcal{G}$, to sequences, $\mathcal{S}$, and a graph $\mathcal{G}$ with $n$ nodes under BFS node ordering $\pi$, we can formulate

$$
S^{\pi} = f_{S}(\mathcal{G}, \pi) = (S^{\pi}_{1}, \ldots, S^{\pi}_{n})
\tag{1}
$$

where $S^{\pi}_{i} \in \{0, 1\}^{i-1}$, $i \in \{1, \ldots, n\}$, depicts an adjacency vector between node $\pi(v_{i})$ and the previous nodes $\pi(v_{j})$, $j \in \{1, \ldots, i-1\}$, already existing in the graph, so that:

$$
S^{\pi}_{i} = (A^{\pi}_{1,i}, \ldots, A^{\pi}_{i-1,i})^{T}, \quad \forall i \in \{2, \ldots, n\}
\tag{2}
$$

Then, using $S^{\pi}$ , we can determine uniquely the SQL graph $\mathcal{G}$ in a sequential form and learn to predict it autoregressively.
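The sequence form of Eqs. 1 and 2 can be illustrated with a minimal, standard-library Python sketch. It assumes a toy five-node query graph; the node labels `Select` and `From` and all function names are hypothetical choices for illustration, not the grammar or code used by the authors.

```python
from collections import deque


def bfs_order(adj, root=0):
    """Return node ids in breadth-first order, starting from `root`."""
    order, seen, queue = [], {root}, deque([root])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in sorted(adj[u]):  # sorted neighbours give a deterministic order
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return order


def adjacency_sequence(adj, order):
    """S_i = (A_{1,i}, ..., A_{i-1,i}): links from the i-th node to earlier nodes."""
    return [
        [1 if order[j] in adj[u] else 0 for j in range(i)]
        for i, u in enumerate(order)
    ]


def rebuild_edges(seq):
    """Recover the undirected edge set (as BFS positions) from the sequence."""
    return {(j, i) for i, s in enumerate(seq) for j, bit in enumerate(s) if bit}


# Toy graph (hypothetical labels): Root-Select, Root-From,
# Select-SelectColumn, From-SelectTable.
labels = ["Root", "Select", "From", "SelectColumn", "SelectTable"]
adj = {0: {1, 2}, 1: {0, 3}, 2: {0, 4}, 3: {1}, 4: {2}}

order = bfs_order(adj)
seq = adjacency_sequence(adj, order)
edges = rebuild_edges(seq)
```

Each adjacency vector has length $i-1$, so the first node contributes an empty vector, and the edge set recovered from the sequence matches the original graph.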

4 SQLformer

4.1 Model Overview

Figure 1: An illustration of SQLformer: our model inherits the seq2seq nature of the Transformer architecture, consisting of $L$ layers of encoders and decoders. SQLformer encoder introduces database table and column selection as inductive biases to contextualize the embedding of a question. In this example, the question consists of six tokens (Fig. 2). This schema-conditioned question representation serves as input to the SQLformer decoder module. Here we show the decoding timestep $t = 4$ as an example. The architecture for the decoder module is detailed in Fig. 4.

In light of recent advancements Shaw et al. (2021); Scholak et al. (2021); Li et al. (2023b), we approach the text-to-SQL problem as a translation task by using an encoder-decoder architecture. Specifically, we extend the original Transformer encoder (Subsection 4.3) by incorporating learnable table and column tokens, used to select the most relevant tables and columns in the database schema given the NLQ. This information is injected as input to the decoder, so that it is enriched with both the schema-aware question encoding and the most relevant tables and columns selected by the model. Moreover, the SQLformer decoder extends the original Transformer decoder (Subsection 4.4) so that it integrates node type, adjacency and previously generated action embeddings for generating a SQL query autoregressively as a sequence of actions derived from a SQL grammar Yin and Neubig (2017b). The overall architecture of our SQLformer model is described in Fig. 1.

4.2 Model Inputs

In this section, we detail how the inputs to our model are constructed, in particular, the construction of both the NLQ and schema graphs.

Question Graph Construction.

The natural language question can be formulated as a graph $\mathcal{G}_{Q} = (\mathcal{Q}, \mathcal{R})$, where the node set $\mathcal{Q}$ consists of the natural language tokens, and $\mathcal{R} = \{r_{1}, \ldots, r_{|\mathcal{R}|}\}$ represents one-hop relations between words. We employ two types of relations for the question graph: syntactic dependencies and part-of-speech tagging, incorporating grammatical meaning. These relations form a joint question graph, which is then linearized as a Levi graph. Fig. 2 illustrates an example question graph with some relationships. Tables 7 and 8 describe all relations used. To encode each token in the question graph, we use a Graph Attention Network (GAT) Veličković et al. (2018).
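As a rough illustration of the Levi graph step, the sketch below turns every labelled edge into an extra node connected to both of its endpoints, so that relation labels can be embedded like ordinary nodes. The three-token question and its two dependency labels are hand-written assumptions, not parser output, and `to_levi_graph` is a hypothetical helper.

```python
def to_levi_graph(tokens, relations):
    """Convert labelled edges (head, label, dependent) into a Levi graph.

    Each labelled edge becomes its own node, linked to the head and to the
    dependent by two unlabelled edges.
    """
    nodes = [("token", tok) for tok in tokens]
    edges = []
    for head, label, dep in relations:
        rel_idx = len(nodes)            # index of the new relation node
        nodes.append(("relation", label))
        edges.append((head, rel_idx))
        edges.append((rel_idx, dep))
    return nodes, edges


# Hand-annotated toy example (assumed labels, for illustration only).
tokens = ["list", "all", "singers"]
relations = [(0, "obj", 2), (2, "det", 1)]

nodes, edges = to_levi_graph(tokens, relations)
```

The resulting graph has one node per token plus one per relation, and it carries no edge labels, which is convenient for encoders that only attend over nodes.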

Database Schema Graph Construction.

Similarly, a database schema graph is represented by $\mathcal{G}_{S} = (\mathcal{S}, \mathcal{R})$, where the node set $\mathcal{S} = (\mathcal{T}, \mathcal{C})$ represents the tables, $\mathcal{T}$, and the columns, $\mathcal{C}$, in the schema. The edge set $\mathcal{R} = \{r_{1}, \ldots, r_{|\mathcal{R}|}\}$ depicts the structural relationships among tables and columns in the schema. Similarly to previous works, we use the common relational database-specific relations, such as primary/foreign key for column pairs, column types, and whether a column belongs to a specific table. Fig. 3 shows an example database schema graph and Table 9 provides a description of the types of relationships used for database schema graph construction. We encode the schema using a GAT and use average pooling to obtain a single embedding to represent each database schema.
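A schema graph of this kind can be assembled from plain metadata, as in the following minimal sketch. The relation names (`belongs_to`, `primary_key_of`, `foreign_key_to`), the input dictionary layout, and the two-table schema are assumptions made for illustration; the relation inventory actually used is the one listed in Table 9.

```python
def build_schema_graph(tables, foreign_keys):
    """Build typed nodes and edges for a relational schema.

    tables: {table: {"columns": {column: type}, "primary_key": column}}
    foreign_keys: list of ((table, column), (table, column)) pairs
    """
    nodes, edges = [], []
    for table, spec in tables.items():
        nodes.append(("table", table))
        for column, col_type in spec["columns"].items():
            col_id = f"{table}.{column}"
            nodes.append(("column", col_id, col_type))
            edges.append((col_id, "belongs_to", table))
            if column == spec["primary_key"]:
                edges.append((col_id, "primary_key_of", table))
    for (t1, c1), (t2, c2) in foreign_keys:
        edges.append((f"{t1}.{c1}", "foreign_key_to", f"{t2}.{c2}"))
    return nodes, edges


# Hypothetical two-table schema.
tables = {
    "singer": {"columns": {"id": "number", "name": "text"}, "primary_key": "id"},
    "concert": {"columns": {"id": "number", "singer_id": "number"}, "primary_key": "id"},
}
foreign_keys = [(("concert", "singer_id"), ("singer", "id"))]

nodes, edges = build_schema_graph(tables, foreign_keys)
```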

4.3 Table and Column Selection Encoder

To describe our proposed modification to the Transformer encoder, we first introduce the original Transformer architecture. The Transformer encoder Vaswani et al. (2017) consists of alternating layers of multi-head self-attention (MHA) and fully connected Feed-Forward Network (FFN) blocks. Before every block, Layer Normalisation (LN) is applied, and after every block, a residual connection is added. More formally, in the $\ell^{th}$ encoder layer, the hidden states are represented as $\mathcal{X}^{\ell}_{\mathcal{S}}=\{x^{\ell}_{1},\ldots,x^{\ell}_{N}\}$, where $N$ is the maximum length of the inputs.

First, a MHA block maps $\mathcal{X}$ into a query matrix $\mathbf{Q} \in \mathbb{R}^{m\times d_{k}}$, key matrix $\mathbf{K} \in \mathbb{R}^{n\times d_{k}}$ and value matrix $\mathbf{V} \in \mathbb{R}^{n\times d_{v}}$, where $m$ is the number of query vectors, and $n$ the number of key or value vectors. Then, an attention vector is calculated as follows:

$$
\begin{aligned}
\mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) &= \mathrm{softmax}(\mathbf{A})\,\mathbf{V}, \\
\mathbf{A} &= \frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_{k}}}
\end{aligned}
\tag{3}
$$

In practice, the MHA block calculates the self-attention over $h$ heads, where each head $i$ is independently parametrized by $\mathbf{W}^{Q}_{i} \in \mathbb{R}^{d_{m}\times d_{k}}$, $\mathbf{W}^{K}_{i} \in \mathbb{R}^{d_{m}\times d_{k}}$ and $\mathbf{W}^{V}_{i} \in \mathbb{R}^{d_{m}\times d_{v}}$, mapping the input embeddings $\mathcal{X}$ into queries and key-value pairs. Then, the attention for each head is calculated and concatenated, as follows:

$$
\begin{aligned}
\mathrm{Head}_{i} &= \mathrm{Attention}(\mathbf{Q}\mathbf{W}^{Q}_{i}, \mathbf{K}\mathbf{W}^{K}_{i}, \mathbf{V}\mathbf{W}^{V}_{i}), \\
\mathrm{MHA}(\mathcal{X}^{\ell}_{\mathcal{S}}) &= \mathrm{Concat}(\mathrm{Head}_{1}, \ldots, \mathrm{Head}_{h})\,\mathbf{W}^{U}, \\
\bar{\mathcal{X}}^{\ell}_{\mathcal{S}} &= \mathrm{MHA}(\mathcal{X}^{\ell}_{\mathcal{S}})
\end{aligned}
\tag{4}
$$

where $\mathbf{W^{U}}$ $\mathbf{\in}$ $\mathbb{R}^{d^{h}_{m}\times d_{m}}$ is a trainable parameter matrix. Next, to acquire the hidden states of the input, a FFN block is applied, as follows:

$$
\mathrm{FFN}(\bar{\mathcal{X}}^{\ell}_{\mathcal{S}}) = \max(0, \bar{\mathcal{X}}^{\ell}_{\mathcal{S}}\mathbf{W}_{1} + \mathbf{b}_{1})\,\mathbf{W}_{2} + \mathbf{b}_{2}
\tag{5}
$$

where $\mathbf{W_{1}}$ $\in$ $\mathbb{R}^{d_{m}\times d_{ff}}$ and $\mathbf{W_{2}}$ $\in$ $\mathbb{R}^{d_{ff}\times d_{m}}$ are linear weight matrices. Finally, layer normalisation and residual connection are applied as follows:

$$
\tilde{\mathcal{X}}^{\ell}_{\mathcal{S}} = \mathrm{LayerNorm}(\bar{\mathcal{X}}^{\ell}_{\mathcal{S}} + \mathrm{FFN}(\bar{\mathcal{X}}^{\ell}_{\mathcal{S}}))
\tag{6}
$$
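Eqs. 5 and 6 can be traced for a single token vector with the pure-Python sketch below. It is a simplified illustration under stated assumptions: layer normalisation is applied without the learnable gain and bias, weights are plain nested lists, and the toy matrices are chosen only so the result is easy to check by hand.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and unit variance (no learnable gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def ffn(x, W1, b1, W2, b2):
    """max(0, x W1 + b1) W2 + b2, with W1 of shape d_m x d_ff and W2 of shape d_ff x d_m."""
    hidden = [
        max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
        for col, b in zip(zip(*W1), b1)  # zip(*W1) iterates over the columns of W1
    ]
    return [
        sum(hi * w for hi, w in zip(hidden, col)) + b
        for col, b in zip(zip(*W2), b2)
    ]


def ffn_sublayer(x, W1, b1, W2, b2):
    """LayerNorm(x + FFN(x)), following Eq. 6."""
    y = ffn(x, W1, b1, W2, b2)
    return layer_norm([a + b for a, b in zip(x, y)])


# Toy weights with d_m = 2 and d_ff = 3.
x = [1.0, 2.0]
W1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
b2 = [0.0, 0.0]

out = ffn_sublayer(x, W1, b1, W2, b2)
```

With these weights the FFN acts as the identity on `x`, so the residual sum is `[2.0, 4.0]` and the normalised output is approximately `[-1, 1]`.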

In the SQLformer encoder, we input the 1D sequence of natural language token embeddings, $Z$ , and prepend two learnable tokens: $Z_{tables}$ and $Z_{cols}$ . The states of these tokens at the encoder output, $\tilde{\mathcal{X}}_{tables}$ and $\tilde{\mathcal{X}}_{columns}$ , serve as input to two MLP blocks responsible for selecting $k_{1}$ tables and $k_{2}$ columns based on the NLQ. Sinusoidal vectors retain the original positional information.

After $L$ encoder layers, we obtain the input question embedding $\tilde{\mathcal{X}}^{L}_{\mathcal{S}}$, with the first two tokens as $\tilde{\mathcal{X}}_{tables}$ and $\tilde{\mathcal{X}}_{columns}$ (written $\tilde{\mathcal{X}}_{T}$ and $\tilde{\mathcal{X}}_{C}$ in what follows), and the rest as natural language question tokens $\tilde{\mathcal{X}}_{Q} \in \mathbb{R}^{d\times|\mathcal{Q}|}$. $\tilde{\mathcal{X}}_{T}$ and $\tilde{\mathcal{X}}_{C}$ are the inputs to the MLP blocks $\mathbf{MLP_{T}} \in \mathbb{R}^{d\times|\mathcal{T}|}$ and $\mathbf{MLP_{C}} \in \mathbb{R}^{d\times|\mathcal{C}|}$, where $d$ is the hidden size of the token embeddings, and $|\mathcal{T}|$ and $|\mathcal{C}|$ are the sizes of the table and column vocabularies, respectively. The embeddings are projected into probability vectors:

$$
\begin{aligned}
\mathbf{P_{tables}} &= \mathrm{softmax}(\mathbf{MLP_{T}}(\tilde{\mathcal{X}}_{T})), \\
\mathbf{P_{columns}} &= \mathrm{softmax}(\mathbf{MLP_{C}}(\tilde{\mathcal{X}}_{C}))
\end{aligned}
\tag{7}
$$

Then, the top $k_{1}$ tables and top $k_{2}$ columns are selected according to $\mathbf{P_{tables}}$ and $\mathbf{P_{columns}}$, respectively. A masking vector is applied to $\mathbf{P_{columns}}$ so that only columns belonging to the selected tables can be chosen. Next, two embedding lookup tables, $\mathbf{E_{T}} \in \mathbb{R}^{|\mathcal{T}|\times d_{t}}$ and $\mathbf{E_{C}} \in \mathbb{R}^{|\mathcal{C}|\times d_{c}}$, are used for mapping the top $k_{1}$ tables and top $k_{2}$ columns, respectively, into embeddings, as $\tilde{\mathcal{X}}^{k}_{T} \in \mathbb{R}^{k_{1}\times d}$ and $\tilde{\mathcal{X}}^{k}_{C} \in \mathbb{R}^{k_{2}\times d}$, where $d$ is the size of the learnable embeddings. These are aggregated and concatenated, giving the final representation for the schema, depicted as $\tilde{\mathcal{X}}_{schema}$.

Finally, $\tilde{\mathcal{X}}_{Q}$ and $\mathbf{\tilde{\mathcal{X}}_{schema}}$ are aggregated to effectively contextualize the natural language question embedding by the embedding of the most likely tables and columns in the schema being mentioned. The result of this aggregation is given as input to the decoder module as part of the cross-attention.
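The selection step around Eq. 7 can be sketched in a few lines of standard-library Python. The sketch assumes that the logits have already been produced by the two MLP heads and that the mask zeroes out column probabilities after the softmax; the exact point at which the mask is applied, the function names, and the toy logits are all assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, k):
    """Indices of the k largest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def select_schema_items(table_logits, column_logits, column_to_table, k1, k2):
    """Pick k1 tables, then up to k2 columns restricted to those tables."""
    p_tables = softmax(table_logits)
    tables = top_k(p_tables, k1)
    p_columns = softmax(column_logits)
    # Mask: columns whose parent table was not selected get probability 0.
    masked = [
        p if column_to_table[c] in tables else 0.0
        for c, p in enumerate(p_columns)
    ]
    columns = [c for c in top_k(masked, k2) if masked[c] > 0.0]
    return tables, columns


# Toy schema: 3 tables, 5 columns; column c belongs to table column_to_table[c].
column_to_table = [0, 0, 1, 2, 2]
table_logits = [2.0, 0.1, 1.5]
column_logits = [0.5, 1.0, 3.0, 2.0, -1.0]

tables, columns = select_schema_items(
    table_logits, column_logits, column_to_table, k1=2, k2=2
)
```

In this toy case column 2 has the highest raw score but belongs to the unselected table 1, so the mask removes it and columns 3 and 1 are chosen instead.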

4.4 Autoregressive Query Graph Generation Decoder

During the decoding phase, previous works (e.g. Wang et al. (2021); Cao et al. (2021); Hui et al. (2022); Cai et al. (2022)) widely adopt the LSTM-based tree decoder from Yin and Neubig (2017a) to generate SQL grammar rules. In contrast, the SQLformer decoder (Fig. 4) extends the original Transformer decoder to predict the SQL grammar rules autoregressively. This approach has multiple advantages. First, it maintains the context of previously generated parts of the query for longer sequences than LSTM-based decoders. This is especially important for long queries, such as those containing sub-queries. Also, the Transformer encourages permutation invariance, which is desirable for processing the node embeddings of the SQL graph, as the graph is invariant under any permutation of the nodes. Additionally, the highly parallelizable nature of the inherited Transformer architecture results in higher training and inference efficiency compared to previous LSTM-based approaches (see Table 10 for an analysis of training and inference efficiency).

In the SQLformer decoder, each query node is described by three attributes: node type, node adjacency, and the previous action. Nodes are assigned a type, represented as $N^{V} = \{V_{0}, V_{1}, \ldots, V_{N}\}$, where $V_{i}$ is a one-hot representation of the node type. Nodes are grouped as non-terminal or terminal, with terminal types including $\mathit{table\_id}$ and $\mathit{column\_id}$. Node type embeddings are calculated using a learnable transformation $\mathbf{\Psi}(N^{V}) \in \mathbb{R}^{|V|\times d_{V}}$, where $d_{V}$ is the embedding dimensionality and $|V|$ is the number of possible node types. Node adjacency is represented as $N^{A} = \{A_{0}, A_{1}, \ldots, A_{N}\}$, with $A_{i} \in \{0, 1\}^{M}$, and embeddings obtained from $\mathbf{\Phi}(N^{A}) \in \mathbb{R}^{1\times d_{A}}$, with $d_{A}$ as the embedding dimensionality. The previous action embedding, $a_{t-1}$, is given by the transformation $\mathbf{\Omega}(N^{R}) \in \mathbb{R}^{1\times d_{T}}$, where $N^{R}$ is the SQL grammar rule chosen in the previous timestep and $d_{T}$ is the embedding dimensionality.

We extend the Transformer decoder architecture to incorporate the node type, adjacency and previous action embeddings to represent a node at each timestep. In particular, inspired by Ying et al. (2021), we include the node type and adjacency embeddings in the multi-head self-attention aggregation process as a bias term (see Fig. 4 for an illustration). Formally, we modify Eq. 3 so that $\mathbf{\Psi}(N^{V})$ and $\mathbf{\Phi}(N^{A})$ act as a bias term in the attention calculation, as follows:

$$
\begin{aligned}
\mathbf{A} &= \frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d_{k}}} + \mathbf{U}, \\
\mathbf{U} &= \mathbf{\Psi}(N^{V}) + \mathbf{\Phi}(N^{A})
\end{aligned}
\tag{8}
$$
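To make the role of the bias term concrete, the following pure-Python sketch adds a bias matrix to the scaled dot-product scores before the softmax. How the node type and adjacency embeddings are reduced to one bias value per query-key pair is not detailed in this section, so the sketch simply assumes a ready-made matrix `U` of that shape; the function names and toy values are illustrative and not the authors' implementation. With `U` omitted, the function reduces to Eq. 3.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(Q, K, V, U=None):
    """softmax(Q K^T / sqrt(d_k) + U) V for matrices given as lists of rows."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K
        ]
        if U is not None:
            scores = [s + U[i][j] for j, s in enumerate(scores)]
        weights = softmax(scores)
        out.append(
            [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        )
    return out


# Two nodes; V is the identity, so each output row equals the attention weights.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]

plain = biased_attention(Q, K, V)
# A strongly negative bias stops node 0 from attending to node 1.
U = [[0.0, -1e9], [0.0, 0.0]]
biased = biased_attention(Q, K, V, U)
```

Because the bias is added before the softmax, it reshapes the attention distribution without changing the value vectors themselves.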

Then, the updated residuals for the node embedding, $\mathbf{n}^{\ell}_{t}$ , at layer $\ell$ , can be formalised as

$$
\begin{aligned}
\mathbf{n}^{\ell}_{t} &= \mathbf{n}^{\ell-1}_{t} + \mathbf{O}^{\ell}\,\Big\Vert_{k=1}^{K} \sum_{j=1}^{N} \left( \mathbf{G}^{k,\ell}\,\mathbf{V}^{k,\ell} \right), \\
\mathbf{G}^{k,\ell} &= \mathrm{softmax}(\mathbf{A}^{k,\ell})
\end{aligned}
\tag{9}
$$

where $\parallel$ means concatenation, and K is the number of attention heads. As a result, the decoder state at the current timestep after $L$ decoder layers, $\mathbf{n}^{L}_{t}$ , is fed to an action output MLP head which computes the distribution $P(a_{t+1})$ of next timestep actions based on the node type, adjacency, and previous action at timestep $t$ . Formally, $P(a_{t+1})$ is calculated as follows

$$
P(a_{t+1} \mid \mathbf{n}^{L}_{t}) = \mathrm{softmax}(\mathbf{W}_{a}\mathbf{n}^{L}_{t})
\tag{10}
$$

Finally, the prediction of the SQL query AST can be decoupled into a sequence of actions $a$ = ( $a_{1}$ , $\ldots$ , $a_{|a|}$ ), yielding the training objective for the task as

$$
\mathcal{L} = -\sum_{p=1}^{|a|} \log P(a_{p} \mid a_{<p}, \mathcal{S}, \mathcal{Q})
\tag{11}
$$
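As a small worked instance of Eq. 11, the sketch below sums the negative log-probabilities that the model assigns to the gold actions. It assumes the per-step distributions have already been computed, for example by the softmax head of Eq. 10 under teacher forcing; the action indices and probabilities are made-up toy values.

```python
import math


def sequence_nll(step_distributions, gold_actions):
    """Negative log-likelihood of a gold action sequence.

    step_distributions[p] is the predicted distribution over actions at step p,
    and gold_actions[p] is the index of the correct action at that step.
    """
    return -sum(
        math.log(dist[action])
        for dist, action in zip(step_distributions, gold_actions)
    )


# Two decoding steps over a vocabulary of three actions (toy values).
step_distributions = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
]
gold_actions = [0, 1]

loss = sequence_nll(step_distributions, gold_actions)
```

Here the loss equals $-(\ln 0.7 + \ln 0.8) = -\ln 0.56$, and it decreases as the model places more probability on the gold actions.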

5 Experiments

In this section, we show our model performance on six common text-to-SQL datasets. Also, we present ablation studies to analyse the importance of the different components of the SQLformer architecture.

Method	EM (Dev)	EM (Test)	EX (Dev)	EX (Test)
SADGA + GAP Cai et al. (2022)	73.1	70.1	-	-
RAT-SQL + GraPPa Yu et al. (2021a)	73.4	69.6	-	-
RAT-SQL + GAP + NatSQL Shi et al. (2021)	73.7	68.7	75.0	73.3
SMBOP + GraPPa Rubin and Berant (2021)	74.7	69.5	75.0	71.1
DT-Fixup SQL-SP + RoBERTa Xu et al. (2021)	75.0	70.9	-	-
LGESQL + ELECTRA Cao et al. (2021)	75.1	72.0	-	-
RASAT Qi et al. (2022)	75.3	70.9	80.5	75.5
T5-3B Scholak et al. (2021)	75.5	71.9	79.3	75.1
S²SQL + ELECTRA Hui et al. (2022)	76.4	72.1	-	-
RESDSQL Li et al. (2023a)	80.5	72.0	84.1	79.9
GRAPHIX-T5-3B Li et al. (2023b)	77.1	74.0	81.0	77.6
SQLformer (our approach)	78.2	75.6	82.5	81.9

Table 1: EM and EX results on Spider’s dev and test dataset splits. We compare our approach with recent state-of-the-art methods. Underline depicts the previous best performing method for each metric.

5.1 Experimental Setup

Dataset.

We consider six benchmark datasets, with complete details included in Appendix E. In particular, our experiments use the Spider Yu et al. (2019b) dataset, a large-scale cross-domain text-to-SQL benchmark, as well as context-dependent benchmarks such as the SParC Yu et al. (2019c) and CoSQL Yu et al. (2019a) datasets. Additionally, we evaluate the zero-shot domain generalization performance of our method on the Spider-DK Gan et al. (2021b), Spider-SYN Gan et al. (2021a) and Spider-Realistic datasets.

Evaluation Metrics.

We report results using the same metrics as previous works Yu et al. (2019b); Li et al. (2023a, b). For Spider-family datasets (i.e. Spider, Spider-DK, Spider-SYN and Spider-Realistic), we consider two prevalent evaluation metrics: Exact Match (EM) and Execution (EX) accuracies. The EX metric evaluates whether the predicted and ground-truth SQL queries yield the same execution results on the database. However, EX can give false positives when a semantically different query happens to return the same results. To counteract this, EM measures how closely the predicted SQL query matches the ground-truth query. For SParC and CoSQL, we measure EM at the question (QEM) and interaction (IEM) levels, as well as EX at the question (QEX) and interaction (IEX) levels.
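The difference between the two metrics can be seen on a toy database with Python's built-in `sqlite3` module. This is only a sketch: the `singer` table and both queries are invented, and `naive_exact_match` is a simplified string comparison standing in for EM, whereas the official Spider evaluation compares parsed query components rather than raw strings.

```python
import sqlite3


def execution_match(conn, predicted_sql, gold_sql):
    """EX-style check: both queries return the same rows, ignoring row order."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to execute cannot match
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(predicted_rows) == sorted(gold_rows)


def naive_exact_match(predicted_sql, gold_sql):
    """Simplified stand-in for EM: compare case- and whitespace-normalised text."""
    def normalise(sql):
        return " ".join(sql.lower().split())
    return normalise(predicted_sql) == normalise(gold_sql)


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ana', 30), (2, 'Bo', 30);
    """
)

gold = "SELECT name FROM singer WHERE age = 30"
pred = "SELECT name FROM singer WHERE age >= 30"

ex_ok = execution_match(conn, pred, gold)
em_ok = naive_exact_match(pred, gold)
```

On this data the two queries return identical rows, so the execution check passes even though the predicted condition differs from the gold one; the exact-match check flags the difference.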

Implementation Details.

We implemented SQLformer in PyTorch Paszke et al. (2019). For the graph neural network components, we use PyTorch Geometric Fey and Lenssen (2019). The questions, column names and table names are tokenized and lemmatized using stanza Qi et al. (2020), which we also use for dependency parsing and part-of-speech tagging. We find the best set of hyperparameters on a randomly sampled subset of 10% of the queries from the dev dataset. For training, we set the maximum input length to 1024, the maximum number of generated AST nodes to 200, the batch size to 32 and the maximum number of training steps to 20,000. A detailed list of hyperparameters can be found in Appendix D. Token embeddings are initialized with ELECTRA Clark et al. (2020) using the HuggingFace library Wolf et al. (2020). We use teacher forcing in the decoder for higher training stability. Results are on the test set unless stated otherwise.

5.2 Overall Performance

Results on Spider.

The EM and EX accuracy results on the Spider benchmark are presented in Table 1. Our proposed model SQLformer achieves competitive performance in both EM and EX accuracy. On the test set, compared with RAT-SQL Wang et al. (2021) with GraPPa pre-training, our model's EM increases from 69.6% to 75.6%, a 6.0% absolute improvement. When compared to approaches that fine-tune a Language Model (LM) with a much larger number of parameters, such as T5-3B (71.9%), we achieve a 3.7% absolute improvement. This shows the benefit of our proposed architecture for solving text-to-SQL tasks with fewer parameters. Furthermore, SQLformer sets a new state-of-the-art in test EX accuracy with 81.9%. Compared to RESDSQL Li et al. (2023a), which achieves 72.0% EM and 79.9% EX, SQLformer is higher by 3.6% and 2.0%, respectively. Similarly, SQLformer outperforms GRAPHIX-T5 Li et al. (2023b), which has 74.0% EM and 77.6% EX, by 1.6% and 4.3%. Against RASAT, SQLformer shows improvements of 4.7% in EM and 6.4% in EX. These comparisons highlight the effectiveness of SQLformer in generating accurate SQL queries relative to existing state-of-the-art methods.

Results on Difficult Queries.

We provide a breakdown of accuracy by query difficulty level (easy, medium, hard, extra hard) as defined by Yu et al. (2019b). Table 2 compares our approach to state-of-the-art baselines on the EM accuracy metric. Performance drops with increasing query difficulty, from 92.7% on easy to 51.2% on extra hard queries. Compared to RAT-SQL + BERT, SQLformer shows improvements of 9.7% on hard and 8.3% on extra hard queries, demonstrating its effectiveness in handling complex queries. Overall, SQLformer surpasses the baseline methods across all four subsets, giving supporting evidence for the effectiveness of our approach.

Method	Easy	Medium	Hard	Extra hard	All
RAT-SQL + BERT	86.4	73.6	62.1	42.9	69.7
SADGA	90.3	72.4	63.8	49.4	71.6
LGESQL	91.5	76.7	66.7	48.8	74.1
GRAPHIX-T5-3B	91.9	81.6	61.5	50	75.6
SQLformer (our approach)	92.7	82.9	71.8	51.2	76.8

Table 2: EM accuracy on the Spider queries across different levels of difficulty as defined by Yu et al. (2019b).

Zero-Shot Results on Domain Generalization and Robustness.

In Table 3, we analyze SQLformer's capabilities in zero-shot domain generalization and robustness on the Spider-DK, Spider-SYN, and Spider-Realistic benchmarks. SQLformer achieves EM accuracies of 55.1% on Spider-DK, 71.2% on Spider-SYN, and 78.7% on Spider-Realistic. These results surpass LGESQL with ELECTRA, exceed GRAPHIX-T5-3B by 3.9%, 4.3%, and 6.3% on DK, SYN, and Realistic, respectively, and exceed RESDSQL by 1.8%, 2.1%, and 1.3%. SQLformer's EX accuracies of 68.2%, 78.4%, and 82.6% also outperform RESDSQL. This demonstrates SQLformer's ability to adapt to unseen domains without direct prior training, and highlights its potential for real-world applications where database schemas and linguistic variations are highly variable.

Method	Spider-DK EM	Spider-DK EX	Spider-SYN EM	Spider-SYN EX	Spider-Realistic EM	Spider-Realistic EX
RAT-SQL + GraPPa Yu et al. (2021a)	38.5	-	49.1	-	59.3	-
LGESQL + ELECTRA Cao et al. (2021)	48.4	-	64.6	-	69.2	-
T5-3B Scholak et al. (2021)	-	-	-	-	68.7	71.4
GRAPHIX-T5-3B Li et al. (2023b)	51.2	-	66.9	-	72.4	-
RESDSQL Li et al. (2023a)	53.3	66.0	69.1	76.9	77.4	81.9
SQLformer (our approach)	55.1	68.2	71.2	78.4	78.7	82.6

Table 3: EM and EX on Spider-SYN, Spider-DK and Spider-Realistic benchmarks.

Results on Context-Dependent Settings.

We present the experimental results for SQLformer in comparison with several leading methods on the SParC (Table 4) and CoSQL (Table 5) datasets. For the SParC dataset, SQLformer achieves 68.6% QEM and 51.3% IEM, outperforming RASAT by 1.9% and 4.1% respectively. Additionally, SQLformer shows significant improvements in QEX and IEX metrics, with 74.5% and 55.8%, further confirming its superior capacity for maintaining contextual understanding in multi-turn SQL dialogue tasks. For the CoSQL dataset, SQLformer attains 60.2% QEM and 31.4% IEM, surpassing RASAT by 1.4% and 5.1%. Moreover, SQLformer’s QEX and IEX scores are 68.4% and 39.2%, respectively, highlighting its potential in enhancing interactive SQL query generation. These results underscore the effectiveness of SQLformer in delivering more accurate and contextually aware SQL interpretations compared to previous leading methods.

Method	QEM	IEM	QEX	IEX
EditSQL + BERT Zhang et al. (2019)	47.2	29.5	-	-
IGSQL + BERT Cai and Wan (2020)	50.7	32.5	-	-
RAT-SQL + SCoRe Yu et al. (2021b)	62.2	42.5	-	-
RASAT Qi et al. (2022)	66.7	47.2	72.5	53.1
SQLformer (our approach)	68.6	51.3	74.5	55.8

Table 4: Evaluation results on the SParC dataset.

Method	QEM	IEM	QEX	IEX
EditSQL + BERT Zhang et al. (2019)	39.9	12.3	-	-
IGSQL + BERT Cai and Wan (2020)	44.1	15.8	-	-
RAT-SQL + SCoRe Yu et al. (2021b)	52.1	22.0	-	-
T5-3B Scholak et al. (2021)	56.9	24.2	-	-
HIE-SQL + GraPPa Zheng et al. (2022)	56.4	28.7	-	-
RASAT Qi et al. (2022)	58.8	26.3	66.7	37.5
SQLformer (our approach)	60.2	31.4	68.4	39.2

Table 5: Evaluation results on the CoSQL dataset.
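
To make the two metric families concrete, the sketch below computes question-level and interaction-level match from per-turn correctness flags. It assumes the usual convention for context-dependent benchmarks, in which the question-level score averages over all questions and the interaction-level score counts an interaction as correct only if every question in it is correct. The function name `question_and_interaction_match` and the toy data are illustrative and are not taken from the evaluation code used in this work.

```python
from typing import List, Tuple


def question_and_interaction_match(
    interactions: List[List[bool]],
) -> Tuple[float, float]:
    """Return (question-level, interaction-level) match rates.

    Each inner list holds one flag per question (turn) in an interaction:
    True if the predicted SQL matched the gold SQL for that turn.
    Assumed convention: an interaction counts only if all its turns match.
    """
    total_questions = sum(len(turns) for turns in interactions)
    correct_questions = sum(sum(turns) for turns in interactions)
    correct_interactions = sum(all(turns) for turns in interactions)
    return (
        correct_questions / total_questions,
        correct_interactions / len(interactions),
    )


# Toy data (hypothetical): three interactions with 2, 3, and 1 turns.
toy = [[True, True], [True, False, True], [True]]
qm, im = question_and_interaction_match(toy)
print(f"question match = {qm:.3f}, interaction match = {im:.3f}")
```

Under this convention a single wrong turn invalidates the whole interaction, which is consistent with the IEM and IEX values in Tables 4 and 5 being considerably lower than the corresponding QEM and QEX values.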

5.3 Ablation Study

To validate the importance of each component in our architecture, we performed ablation studies on the SQLformer model. Table 6 compares the impact of four critical design choices: removing table and column selection, replacing the Transformer-based decoder with an LSTM-based decoder, removing the part-of-speech question encoding, and removing the dependency graph question encoding. Additionally, we analyze the impact of varying the number of selected tables ( $k_{1}$ ) and columns ( $k_{2}$ ) on the performance of SQLformer (see Table 11).

Method	EM accuracy (%)
SQLformer w/o table + column selection	72.3 $\pm$ 0.38
SQLformer encoder + LSTM-based decoder	74.2 $\pm$ 0.38
SQLformer w/o Part-of-Speech graph	77.3 $\pm$ 0.63
SQLformer w/o dependency graph	77.5 $\pm$ 0.72
SQLformer (baseline)	78.2 $\pm$ 0.75

Table 6: EM accuracy (and $\pm$ 95% confidence interval) of SQLformer ablation study on the Spider development set.

The results show that table and column selection has the biggest impact, with a performance drop from 78.2% to 72.3% when removed. This highlights the importance of schema-question linking. Removing the dependency graph and part-of-speech encodings leads to smaller decreases of 0.7% and 0.9%, respectively. Using an LSTM-based decoder from Yin and Neubig (2017a) instead of a Transformer-based one decreases performance by 4%, demonstrating the effectiveness of our approach.

6 Conclusion

In this work, we introduced SQLformer, a novel model for text-to-SQL translation that leverages an autoregressive Transformer-based approach. SQLformer uses a specially designed encoder to link questions and schema and utilizes pre-trained models for effective language representation. Its unique decoder integrates node adjacency, type, and previous action information, conditioned on top-selected tables, columns, and schema-aware question encoding. Notably, SQLformer outperformed competitive text-to-SQL baselines across six datasets, demonstrating state-of-the-art performance.

Limitations

One of the main limitations of our work is its focus on the English language, as it is the language used by most publicly available datasets. A potential way to alleviate this is by using multi-language PLMs for processing the questions.

References

Bazaga et al. (2021) Adrián Bazaga, Nupur Gunwant, and Gos Micklem. 2021. Translating synthetic natural language to database queries with a polyglot deep learning framework. Scientific Reports, 11(1).
Bogin et al. (2019) Ben Bogin, Matt Gardner, and Jonathan Berant. 2019. Global Reasoning over Database Structures for Text-to-SQL Parsing. ArXiv:1908.11214 [cs].
Cai et al. (2018) Ruichu Cai, Boyan Xu, Xiaoyan Yang, Zhenjie Zhang, Zijian Li, and Zhihao Liang. 2018. An Encoder-Decoder Framework Translating Natural Language to Database Queries. ArXiv:1711.06061 [cs].
Cai et al. (2022) Ruichu Cai, Jinjie Yuan, Boyan Xu, and Zhifeng Hao. 2022. SADGA: Structure-Aware Dual Graph Aggregation Network for Text-to-SQL. ArXiv:2111.00653 [cs].
Cai and Wan (2020) Yitao Cai and Xiaojun Wan. 2020. IGSQL: Database schema interaction graph based neural model for context-dependent text-to-SQL generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6903–6912, Online. Association for Computational Linguistics.
Cao et al. (2021) Ruisheng Cao, Lu Chen, Zhi Chen, Yanbin Zhao, Su Zhu, and Kai Yu. 2021. LGESQL: Line Graph Enhanced Text-to-SQL Model with Mixed Local and Non-Local Relations. ArXiv:2106.01093 [cs].
Choi et al. (2020) DongHyun Choi, Myeong Cheol Shin, EungGyun Kim, and Dong Ryeol Shin. 2020. RYANSQL: Recursively Applying Sketch-based Slot Fillings for Complex Text-to-SQL in Cross-Domain Databases. ArXiv:2004.03125 [cs].
Clark et al. (2020) Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR.
Dahl et al. (1994) Deborah A. Dahl, Madeleine Bates, Michael Brown, William Fisher, Kate Hunicke-Smith, David Pallett, Christine Pao, Alexander Rudnicky, and Elizabeth Shriberg. 1994. Expanding the Scope of the ATIS Task: The ATIS-3 Corpus. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994.
Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv:1810.04805 [cs].
Dong and Lapata (2016) Li Dong and Mirella Lapata. 2016. Language to Logical Form with Neural Attention. ArXiv:1601.01280 [cs].
Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ArXiv:2010.11929 [cs].
Fey and Lenssen (2019) Matthias Fey and Jan Eric Lenssen. 2019. Fast Graph Representation Learning with PyTorch Geometric. ArXiv:1903.02428 [cs, stat].
Gan et al. (2021a) Yujian Gan, Qiuping Huang, Matthew Purver, John R. Woodward, Jinxia Xie, and Pengsheng Huang. 2021a. Towards robustness of text-to-sql models against synonym substitution.
Gan et al. (2021b) Yujian Gan, Xinyun Chen, and Matthew Purver. 2021b. Exploring underexplored limitations of cross-domain text-to-sql generalization.
Guo et al. (2019) Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang Lou, Ting Liu, and Dongmei Zhang. 2019. Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation. ArXiv:1905.08205 [cs].
He et al. (2019) Pengcheng He, Yi Mao, Kaushik Chakrabarti, and Weizhu Chen. 2019. X-SQL: reinforce schema representation with context. ArXiv:1908.08113 [cs].
Hemphill et al. (1990) Charles T. Hemphill, John J. Godfrey, and George R. Doddington. 1990. The ATIS Spoken Language Systems Pilot Corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.
Hui et al. (2022) Binyuan Hui, Ruiying Geng, Lihan Wang, Bowen Qin, Bowen Li, Jian Sun, and Yongbin Li. 2022. S$^2$SQL: Injecting Syntax to Question-Schema Interaction Graph Encoder for Text-to-SQL Parsers. ArXiv:2203.06958 [cs].
Hwang et al. (2019) Wonseok Hwang, Jinyeong Yim, Seunghyun Park, and Minjoon Seo. 2019. A Comprehensive Exploration on WikiSQL with Table-Aware Word Contextualization. ArXiv:1902.01069 [cs].
Li et al. (2023a) Haoyang Li, Jing Zhang, Cuiping Li, and Hong Chen. 2023a. RESDSQL: Decoupling Schema Linking and Skeleton Parsing for Text-to-SQL. ArXiv:2302.05965 [cs].
Li et al. (2023b) Jinyang Li, Binyuan Hui, Reynold Cheng, Bowen Qin, Chenhao Ma, Nan Huo, Fei Huang, Wenyu Du, Luo Si, and Yongbin Li. 2023b. Graphix-T5: Mixing Pre-trained Transformers with Graph-Aware Layers for Text-to-SQL Parsing. ArXiv:2301.07507 [cs].
Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv:1907.11692 [cs].
Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. ArXiv:1912.01703 [cs, stat].
Qi et al. (2022) Jiexing Qi, Jingyao Tang, Ziwei He, Xiangpeng Wan, Yu Cheng, Chenghu Zhou, Xinbing Wang, Quanshi Zhang, and Zhouhan Lin. 2022. RASAT: Integrating Relational Structures into Pretrained Seq2Seq Model for Text-to-SQL. ArXiv:2205.06983 [cs].
Qi et al. (2020) Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101–108, Online. Association for Computational Linguistics.
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. ArXiv:1910.10683 [cs, stat].
Rubin and Berant (2021) Ohad Rubin and Jonathan Berant. 2021. SmBoP: Semi-autoregressive bottom-up semantic parsing. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 311–324, Online. Association for Computational Linguistics.
Scholak et al. (2021) Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. 2021. PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models. ArXiv:2109.05093 [cs].
Shaw et al. (2021) Peter Shaw, Ming-Wei Chang, Panupong Pasupat, and Kristina Toutanova. 2021. Compositional Generalization and Natural Language Variation: Can a Semantic Parsing Approach Handle Both? ArXiv:2010.12725 [cs].
Shi et al. (2021) Peng Shi, Patrick Ng, Zhiguo Wang, Henghui Zhu, Alexander Hanbo Li, Jun Wang, Cicero Nogueira dos Santos, and Bing Xiang. 2021. Learning contextual representations for semantic parsing with generation-augmented pre-training. Proceedings of the AAAI Conference on Artificial Intelligence, 35(15):13806–13814.
Suhr et al. (2020) Alane Suhr, Ming-Wei Chang, Peter Shaw, and Kenton Lee. 2020. Exploring Unexplored Generalization Challenges for Cross-Database Semantic Parsing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8372–8388, Online. Association for Computational Linguistics.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. ArXiv:1706.03762 [cs].
Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. ArXiv:1710.10903 [cs, stat].
Wang et al. (2021) Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2021. RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. ArXiv:1911.04942 [cs].
Wang et al. (2020) Kai Wang, Weizhou Shen, Yunyi Yang, Xiaojun Quan, and Rui Wang. 2020. Relational Graph Attention Network for Aspect-based Sentiment Analysis. ArXiv:2004.12362 [cs].
Wang et al. (2022) Lihan Wang, Bowen Qin, Binyuan Hui, Bowen Li, Min Yang, Bailin Wang, Binhua Li, Fei Huang, Luo Si, and Yongbin Li. 2022. Proton: Probing Schema Linking Information from Pre-trained Language Models for Text-to-SQL Parsing. ArXiv:2206.14017 [cs].
Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
Xu et al. (2021) Peng Xu, Dhruv Kumar, Wei Yang, Wenjie Zi, Keyi Tang, Chenyang Huang, Jackie Chi Kit Cheung, Simon J.D. Prince, and Yanshuai Cao. 2021. Optimizing deeper transformers on small datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2089–2102, Online. Association for Computational Linguistics.
Xu et al. (2017) Xiaojun Xu, Chang Liu, and Dawn Song. 2017. SQLNet: Generating Structured Queries From Natural Language Without Reinforcement Learning. ArXiv:1711.04436 [cs].
Yaghmazadeh et al. (2017) Navid Yaghmazadeh, Yuepeng Wang, Isil Dillig, and Thomas Dillig. 2017. SQLizer: query synthesis from natural language. Proceedings of the ACM on Programming Languages, 1(OOPSLA):63:1–63:26.
Yin and Neubig (2017a) Pengcheng Yin and Graham Neubig. 2017a. A Syntactic Neural Model for General-Purpose Code Generation. ArXiv:1704.01696 [cs].
Yin and Neubig (2017b) Pengcheng Yin and Graham Neubig. 2017b. A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 440–450, Vancouver, Canada. Association for Computational Linguistics.
Yin et al. (2020) Pengcheng Yin, Graham Neubig, Wen-tau Yih, and Sebastian Riedel. 2020. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. ArXiv:2005.08314 [cs].
Ying et al. (2021) Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. 2021. Do Transformers Really Perform Bad for Graph Representation? ArXiv:2106.05234 [cs].
You et al. (2018) Jiaxuan You, Rex Ying, Xiang Ren, William L. Hamilton, and Jure Leskovec. 2018. GraphRNN: Generating Realistic Graphs with Deep Auto-regressive Models. ArXiv:1802.08773 [cs].
Yu et al. (2018) Tao Yu, Zifan Li, Zilin Zhang, Rui Zhang, and Dragomir Radev. 2018. TypeSQL: Knowledge-based Type-Aware Neural Text-to-SQL Generation. ArXiv:1804.09769 [cs].
Yu et al. (2021a) Tao Yu, Chien-Sheng Wu, Xi Victoria Lin, Bailin Wang, Yi Chern Tan, Xinyi Yang, Dragomir Radev, Richard Socher, and Caiming Xiong. 2021a. GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing. ArXiv:2009.13845 [cs].
Yu et al. (2019a) Tao Yu, Rui Zhang, He Yang Er, Suyi Li, Eric Xue, Bo Pang, Xi Victoria Lin, Yi Chern Tan, Tianze Shi, Zihan Li, Youxuan Jiang, Michihiro Yasunaga, Sungrok Shim, Tao Chen, Alexander Fabbri, Zifan Li, Luyao Chen, Yuwen Zhang, Shreya Dixit, Vincent Zhang, Caiming Xiong, Richard Socher, Walter S Lasecki, and Dragomir Radev. 2019a. Cosql: A conversational text-to-sql challenge towards cross-domain natural language interfaces to databases.
Yu et al. (2021b) Tao Yu, Rui Zhang, Alex Polozov, Christopher Meek, and Ahmed Hassan Awadallah. 2021b. SCoRe: Pre-training for context representation in conversational semantic parsing. In International Conference on Learning Representations.
Yu et al. (2019b) Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2019b. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. ArXiv:1809.08887 [cs].
Yu et al. (2019c) Tao Yu, Rui Zhang, Michihiro Yasunaga, Yi Chern Tan, Xi Victoria Lin, Suyi Li, Heyang Er, Irene Li, Bo Pang, Tao Chen, Emily Ji, Shreya Dixit, David Proctor, Sungrok Shim, Jonathan Kraft, Vincent Zhang, Caiming Xiong, Richard Socher, and Dragomir Radev. 2019c. Sparc: Cross-domain semantic parsing in context.
Zelle and Mooney (1996) John M. Zelle and Raymond J. Mooney. 1996. Learning to parse database queries using inductive logic programming. In Proceedings of the thirteenth national conference on Artificial intelligence - Volume 2, AAAI’96, pages 1050–1055, Portland, Oregon. AAAI Press.
Zettlemoyer and Collins (2012) Luke S. Zettlemoyer and Michael Collins. 2012. Learning to Map Sentences to Logical Form: Structured Classification with Probabilistic Categorial Grammars. ArXiv:1207.1420 [cs].
Zhang et al. (2019) Rui Zhang, Tao Yu, He Yang Er, Sungrok Shim, Eric Xue, Xi Victoria Lin, Tianze Shi, Caiming Xiong, Richard Socher, and Dragomir Radev. 2019. Editing-Based SQL Query Generation for Cross-Domain Context-Dependent Questions. ArXiv:1909.00786 [cs].
Zheng et al. (2022) Yanzhao Zheng, Haibin Wang, Baohua Dong, Xingjun Wang, and Changshan Li. 2022. Hie-sql: History information enhanced network for context-dependent text-to-sql semantic parsing.
Zhong et al. (2017) Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning. ArXiv:1709.00103 [cs].

Appendix A Details on structural information types

All types of structural information used for question graph construction are shown in Table 7 and Table 8. In particular, Table 7 summarizes the part-of-speech tags and Table 8 lists the syntactic dependency tags used during the parsing of questions. Moreover, all relationships used for database schema graph construction are listed in Table 9.

Tag	Description
ADJ	Adjective: describes a noun or pronoun.
ADV	Adverb: modifies a verb, adjective, or another adverb.
INTJ	Interjection: expresses a spontaneous feeling or reaction.
NOUN	Noun: names a specific object or set of objects.
PROPN	Proper Noun: names specific individuals, places, organizations.
VERB	Verb: describes an action, occurrence, or state of being.
ADP	Adposition: relates to other words, specifying relationships.
AUX	Auxiliary: helps form verb tenses, moods, or voices.
CCONJ	Coordinating Conjunction: connects words, phrases, or clauses of equal rank.
DET	Determiner: modifies a noun, indicating reference.
NUM	Numeral: represents a number.

Table 7: Types of part-of-speech tags used during question graph construction

Tag	Description
ACL	Clausal modifier of noun.
ADVCL	Adverbial clause modifier.
ADVMOD	Adverbial modifier.
AMOD	Adjectival modifier.
APPOS	Appositional modifier.
AUX	Auxiliary.
CC	Coordinating conjunction.
CCOMP	Clausal complement.
COMP	Compound.
CONJ	Conjunct.
COP	Copula.
CSUBJ	Clausal subject.
DET	Determiner.
IOBJ	Indirect object.
NMOD	Nominal modifier.
NSUBJ	Nominal subject.
NUMMOD	Numeric modifier.
OBJ	Object.

Table 8: Types of dependency parsing tags used during question graph construction

Source node $x$	Target node $y$	Relationship	Description
Table	Column	Has-Column	Column y belongs to the table x.
Column	Table	Is-Primary-Key	The column x is primary key of table y.
Column	Column	Is-Foreign-Key	Column x is the foreign key of column y.
Column	Literal	Column-Type	The column x has type y.

Table 9: Summary of structural information types used in SQLformer during database schema graph construction.
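
The relationships in Table 9 can be read as typed, directed edges. The following sketch builds such an edge list for a toy two-table schema. The dictionary layout, the function name `build_schema_edges`, and the example schema are assumptions made for illustration and do not reflect SQLformer's actual data structures.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (source node, relationship, target node)


def build_schema_edges(schema: Dict[str, dict]) -> List[Edge]:
    """Turn a toy schema description into the four edge types of Table 9."""
    edges: List[Edge] = []
    for table, info in schema.items():
        for column, col_type in info["columns"].items():
            node = f"{table}.{column}"
            edges.append((table, "Has-Column", node))
            edges.append((node, "Column-Type", col_type))
        # Column -> Table edge for the primary key.
        edges.append((f"{table}.{info['primary_key']}", "Is-Primary-Key", table))
        # Column -> Column edges for foreign keys.
        for column, target in info.get("foreign_keys", {}).items():
            edges.append((f"{table}.{column}", "Is-Foreign-Key", target))
    return edges


# Hypothetical schema, used only for this example.
toy_schema = {
    "singer": {
        "columns": {"id": "number", "name": "text"},
        "primary_key": "id",
    },
    "concert": {
        "columns": {"id": "number", "singer_id": "number"},
        "primary_key": "id",
        "foreign_keys": {"singer_id": "singer.id"},
    },
}

schema_edges = build_schema_edges(toy_schema)
for edge in schema_edges:
    print(edge)
```

Keeping the relationship name on every edge lets a downstream encoder treat each relation type differently, which is the purpose of distinguishing the four rows of Table 9.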

Appendix B Analysis on training and inference efficiency

Table 10 presents a comparative analysis of training and inference times between LSTM-based methods and SQLformer. Specifically, the average training time for every 50 iterations is calculated for both types of methods. The findings indicate that SQLformer trains roughly 2.6 to 3.8 times faster, depending on the dataset, and runs inference about 1.2 times faster than the LSTM-based methods.

Method	Spider Tr	Spider In	SParC Tr	SParC In	CoSQL Tr	CoSQL In
LSTM	203.1	19.3	174.2	18.5	162.8	19.7
SQLformer	52.9	16.2	67.4	15.6	53.7	15.8

Table 10: Training (Tr) and inference (In) efficiency comparison between LSTM-based approaches and SQLformer. Training efficiency is calculated as the average training time in seconds from 50 iterations. Inference efficiency is calculated as seconds per 100 queries.
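
The speed-up factors can be recomputed directly from Table 10. The short script below does so; the numbers are copied from the table, while the variable names are only illustrative.

```python
# Values copied from Table 10:
# (training seconds per 50 iterations, inference seconds per 100 queries).
timings = {
    "Spider": {"LSTM": (203.1, 19.3), "SQLformer": (52.9, 16.2)},
    "SParC": {"LSTM": (174.2, 18.5), "SQLformer": (67.4, 15.6)},
    "CoSQL": {"LSTM": (162.8, 19.7), "SQLformer": (53.7, 15.8)},
}

speedups = {}
for dataset, rows in timings.items():
    lstm_tr, lstm_in = rows["LSTM"]
    ours_tr, ours_in = rows["SQLformer"]
    # Ratio > 1 means SQLformer is faster than the LSTM-based method.
    speedups[dataset] = (lstm_tr / ours_tr, lstm_in / ours_in)
    print(f"{dataset}: training x{speedups[dataset][0]:.2f}, "
          f"inference x{speedups[dataset][1]:.2f}")
```

Dividing the LSTM timings by the SQLformer timings gives one training and one inference speed-up factor per dataset.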

Appendix C Details on the decoder architecture

In the SQLformer decoder (Figure 4), the inputs are the node adjacencies and types in the current timestep of the generation process, as well as the previous action embedding. The node type and adjacency embeddings are integrated with the previous action embedding into the aggregation process of the MHA mechanism as a bias term. The node embedding is then transformed through a series of $L$ decoding layers with $H$ heads. The final representation is used to generate the probability distribution of actions to take in the next timestep.
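
As an illustration of what injecting structural information "as a bias term" can look like, the sketch below implements single-head scaled dot-product attention with an additive bias on the attention logits. The exact form used in SQLformer is not spelled out in this appendix, so the formula $\mathrm{softmax}(QK^{\top}/\sqrt{d} + B)V$, the function name `biased_attention`, and the toy inputs are all assumptions. Here the entry `bias[i][j]` stands in for a scalar derived from the node adjacency, node type, and previous action embeddings.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


def biased_attention(q: Matrix, k: Matrix, v: Matrix, bias: Matrix) -> Matrix:
    """Single-head attention with an additive bias on the logits.

    Assumed form: softmax(q @ k.T / sqrt(d) + bias) @ v.
    """
    d = len(q[0])
    output = []
    for i, query in enumerate(q):
        logits = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(d) + bias[i][j]
            for j, key in enumerate(k)
        ]
        weights = softmax(logits)
        output.append(
            [sum(w * row[c] for w, row in zip(weights, v)) for c in range(len(v[0]))]
        )
    return output


# Toy inputs (hypothetical): two nodes with 2-dimensional embeddings.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
no_bias = [[0.0, 0.0], [0.0, 0.0]]
strong_bias = [[0.0, 50.0], [50.0, 0.0]]  # pushes each node to attend to the other

print(biased_attention(q, k, v, no_bias))
print(biased_attention(q, k, v, strong_bias))
```

With the zero bias each node attends mostly to itself, whereas the large off-diagonal bias makes each node copy the value of the other node. This shows how a structural bias can override purely content-based attention.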

Appendix D Summary of best hyperparameters used for SQLformer training

For training SQLformer, we select the best set of hyperparameters on a randomly sampled subset of 10% of the queries from the Spider dev split. Specifically, we find the best maximum number of previous AST nodes in the BFS ordering to be 30, and the best maximum number of training steps to be 20,000. The number of layers for both the encoder and the decoder is set to 6, and the number of heads is 8. The dimensionality of both the encoder and the decoder is set to 512. $k_{1}$ and $k_{2}$ are set to 10. The embedding sizes for tables and columns are set to 512. The node adjacency, node type, and action embedding sizes are 512. The output MLP for generating the next output action during decoding has 2 layers and a hidden dimensionality of 512.
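
For convenience, the values above can be collected in a single configuration object. The sketch below uses a Python dataclass; the class name `SQLformerConfig` and all field names are hypothetical, and only the numeric values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SQLformerConfig:
    """Hyperparameter values reported in Appendix D (field names are hypothetical)."""

    max_prev_ast_nodes: int = 30      # maximum previous AST nodes in the BFS ordering
    max_train_steps: int = 20_000
    num_encoder_layers: int = 6
    num_decoder_layers: int = 6
    num_heads: int = 8
    hidden_dim: int = 512             # encoder and decoder dimensionality
    top_k_tables: int = 10            # k_1
    top_k_columns: int = 10           # k_2
    schema_embedding_dim: int = 512   # table and column embeddings
    node_embedding_dim: int = 512     # adjacency, type, and action embeddings
    output_mlp_layers: int = 2
    output_mlp_hidden_dim: int = 512


config = SQLformerConfig()
# Per-head size, assuming the hidden dimension is split evenly across heads
# (this split is an assumption and is not stated in the text).
head_dim = config.hidden_dim // config.num_heads
print(config)
print("per-head dim:", head_dim)
```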

Appendix E Dataset details

For our experiments we use (1) the Spider dataset Yu et al. (2019b), a large-scale cross-domain text-to-SQL benchmark whose training set also incorporates examples from several earlier text-to-SQL datasets. Spider contains 8,659 training examples of question and SQL query pairs, 1,034 development (dev) examples and 2,147 test examples, spanning 200 complex databases across 138 different domains. We also run experiments in context-dependent settings with the (2) SParC and (3) CoSQL datasets, and evaluate zero-shot domain generalization on the (4) Spider-DK, (5) Spider-SYN and (6) Spider-Realistic benchmarks.

Appendix F Impact of Number of Selected Top Tables and Columns

In this section, we analyze the impact of selecting different numbers of top tables and columns on the performance of SQLformer. The performance is measured using EM accuracy on the Spider development set. Table 11 summarizes the results of varying the number of selected tables ( $k_{1}$ ) and columns ( $k_{2}$ ). As shown in the table, selecting more tables and columns generally improves the EM accuracy. However, this improvement comes with diminishing returns, indicating a trade-off between the number of selected schema elements and the model’s complexity and efficiency.

# tables ( $k_{1}$ )	# columns ( $k_{2}$ )	EM accuracy (%)
5	5	73.1
10	5	75.2
5	10	76.7
10	10	78.2

Table 11: EM accuracy of SQLformer with varying numbers of top selected tables and columns.

These results demonstrate that while including more tables and columns can enhance performance, the gains are not linear and should be balanced against computational efficiency. Adjusting the number of top selected tables and columns can be a critical hyperparameter for optimizing performance in different application scenarios.
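
The diminishing returns can be checked with a small calculation on the four entries of Table 11. The script below computes the gain from doubling one of $k_{1}$ or $k_{2}$ while the other is held fixed; the accuracies are copied from the table and the variable names are illustrative.

```python
# EM accuracy from Table 11, keyed by (k1 tables, k2 columns).
em = {(5, 5): 73.1, (10, 5): 75.2, (5, 10): 76.7, (10, 10): 78.2}

# Gain from doubling k1 (tables) at each fixed k2.
table_gain = {k2: round(em[(10, k2)] - em[(5, k2)], 1) for k2 in (5, 10)}
# Gain from doubling k2 (columns) at each fixed k1.
column_gain = {k1: round(em[(k1, 10)] - em[(k1, 5)], 1) for k1 in (5, 10)}

print("gain from k1 5->10:", table_gain)
print("gain from k2 5->10:", column_gain)
```

In this table each gain shrinks once the other quantity is already at 10 (from 2.1 to 1.5 points for tables and from 3.6 to 3.0 points for columns), and increasing the number of selected columns helps more than increasing the number of selected tables.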