SQLformer: Deep Auto-Regressive Query Graph Generation for Text-to-SQL Translation

Adrián Bazaga, Pietro Liò, Gos Micklem
University of Cambridge, Cambridge, United Kingdom
{ar989,pl219,gm263𝑎𝑟989𝑝𝑙219𝑔𝑚263ar989,pl219,gm263italic_a italic_r 989 , italic_p italic_l 219 , italic_g italic_m 263}@cam.ac.uk
Abstract

In recent years, the task of text-to-SQL translation, which converts natural language questions into executable SQL queries, has gained significant attention for its potential to democratize data access. Despite its promise, challenges such as adapting to unseen databases and aligning natural language with SQL syntax have hindered widespread adoption. To overcome these issues, we introduce SQLformer, a novel Transformer architecture specifically crafted to perform text-to-SQL translation tasks. Our model predicts SQL queries as abstract syntax trees (ASTs) in an autoregressive way, incorporating structural inductive bias in the encoder and decoder layers. This bias, guided by database table and column selection, aids the decoder in generating SQL query ASTs represented as graphs in a Breadth-First Search canonical order. Our experiments demonstrate that SQLformer achieves state-of-the-art performance across six prominent text-to-SQL benchmarks.

\newcolumntype

M>l< \newcolumntype=> \newcolumntype+>

SQLformer: Deep Auto-Regressive Query Graph Generation for Text-to-SQL Translation


Adrián Bazaga, Pietro Liò, Gos Micklem University of Cambridge, Cambridge, United Kingdom {ar989,pl219,gm263𝑎𝑟989𝑝𝑙219𝑔𝑚263ar989,pl219,gm263italic_a italic_r 989 , italic_p italic_l 219 , italic_g italic_m 263}@cam.ac.uk


1 Introduction

Relational databases are essential tools within various critical sectors like healthcare and industry among others. For those with technical expertise, accessing data from these databases using some form of structured query language, such as SQL, can be efficient. However, the intricate nature of SQL can make it daunting for non-technical users to learn, creating significant barriers to users.

Consequently, there has been a surge in interest in the field of text-to-SQL Cai et al. (2018); Zelle and Mooney (1996); Xu et al. (2017); Yu et al. (2018); Yaghmazadeh et al. (2017), which aims to convert natural language questions (NLQs) directly into SQL queries. This has the potential to dramatically reduce the obstacles faced by non-expert users when interacting with relational databases (DBs).

Early work in the field primarily focused on develo** and evaluating semantic parsers for individual databases Hemphill et al. (1990); Dahl et al. (1994); Zelle and Mooney (1996); Zettlemoyer and Collins (2012); Dong and Lapata (2016). However, given the widespread use of DBs, an approach based on creating a separate semantic parser for each database does not scale.

One of the key hurdles in achieving domain generalisation Wang et al. (2021); Cao et al. (2021); Wang et al. (2022); Cai et al. (2022); Hui et al. (2022) is the need for complex reasoning to generate SQL queries rich in structure. This involves the ability to accurately contextualise a user query against a specific DB by considering both explicit relations (like the table-column relations defined by the DB schema) and implicit relations (like determining if a phrase corresponds or applies to a specific column or table).

Recently, there has been a release of large-scale datasets Yu et al. (2019b); Zhong et al. (2017) comprising hundreds of DBs and their associated question-SQL pairs. This has opened up the possibility of develo** semantic parsers capable of functioning effectively across different DBs Guo et al. (2019); Bogin et al. (2019); Zhang et al. (2019); Wang et al. (2021); Suhr et al. (2020); Choi et al. (2020); Bazaga et al. (2021). However, this demands the model to interpret queries in the context of relational DBs unseen during training, and precisely convey the query intent through SQL logic. As a result, cross-DB text-to-SQL semantic parsers cannot simply rely on memorising observed SQL patterns. Instead, they must accurately model the natural language query, the underlying DB structures, and the context of both.

Current strategies for cross-DB text-to-SQL semantic parsers generally follow a set of design principles to navigate these challenges. First, the question and schema representation are contextualised mutually by learning an embedding function conditioned on the schema Hwang et al. (2019); Guo et al. (2019); Wang et al. (2021). Second, pre-trained language models (LMs), such as BERT Devlin et al. (2019) or RoBERTa Liu et al. (2019), have been shown to greatly improve parsing accuracy by enhancing generalisation over language variations and capturing long-range dependencies. Related approaches Yin et al. (2020); Yu et al. (2021a) have adopted pre-training on a BERT architecture with the inclusion of grammar-augmented synthetic examples, which when combined with robust base semantic parsers, have achieved state-of-the-art results.

In this paper, we present SQLformer, a novel Transformer variant with grammar-based decoding for text-to-SQL translation. We represent each NLQ as a graph with syntactic and part-of-speech relationships and depict the database schema as a graph of table and column metadata. Inspired by the image domain Dosovitskiy et al. (2021), we incorporate learnable table and column embeddings into the encoder to select relevant tables and columns. Our model enriches the decoder input with this database information, guiding the decoder with schema-aware context. Then, the autoregressive decoder predicts the SQL query as an AST. Unlike large pre-trained language models or prompt-based techniques such as GPT-3, SQLformer offers greater efficiency and adaptability. We investigate SQLformer performance using six common text-to-SQL benchmarks of varying sizes and complexities. Our results show that SQLformer consistently achieves state-of-the-art performance across the multiple benchmarks, delivering more accurate and effective text-to-SQL capabilities on real-world scenarios.

2 Related Work

Earlier research often employed a sketch-based slot filling approach for SQL generation, which divides the task into several independent modules, each predicting a distinct part of the SQL query. Notable methods include SQLNet Xu et al. (2017), TypeSQL Yu et al. (2018), SQLOVA Hwang et al. (2019), X-SQL He et al. (2019), and RYANSQL Choi et al. (2020). These methods work well for simple queries but struggle with more complex scenarios typically encountered in real-world applications.

To address the challenges of complex SQL tasks, attention-based architectures have been widely adopted. For instance, IRNet Guo et al. (2019) separately encodes the question and schema using a LSTM and a self-attention mechanism respectively. Schema linking is accomplished by enhancing the question-schema encoding with custom type embeddings. The SQL rule-based decoder from Yin and Neubig (2017a) was then used in order to decode a query into an intermediate representation, attaining a high-level abstraction for SQL.

On the other hand, graph-based approaches have also been effective in modeling complex question and database relationships. For instance, Global-GNN Bogin et al. (2019) models the database as a graph, while RAT-SQL Wang et al. (2021) introduces schema encoding and linking, attributing a relation to every pair of input items. Further developments include LGESQL Cao et al. (2021), which distinguishes between local and non-local relations. SADGA Cai et al. (2022) utilises contextual and dependency structure to jointly encode the question graph with the database schema graph. S2SQLsuperscriptS2SQL\textnormal{S}^{2}\textnormal{SQL}S start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT SQL Hui et al. (2022) incorporates syntactic dependencies in a relational graph network Wang et al. (2020), and RASAT Qi et al. (2022) integrates a relation-aware self-attention module into T5 Raffel et al. (2020). These methods have demonstrated the effectiveness of modeling questions and database schema as relational graphs.

Recent work has demonstrated the effectiveness of fine-tuning pre-trained models. For instance, Shaw et al. (2021) showed that fine-tuning a pre-trained T5-3B model could yield competitive results. Building on this, Scholak et al. (2021) introduced PICARD, a technique that constrains the auto-regressive decoder by applying incremental parsing during inference time. This approach filters out grammatically incorrect sequences in real time during beam search, improving the quality of the generated SQL. RESDSQL Li et al. (2023a) proposes an schema ranking approach, retaining only the schemas most relevant to the question, before feeding it to a pre-trained RoBERTa Liu et al. (2019) in a seq2seq setting. However, these methods leverage pre-trained language models without incorporating SQL-specific constraints during decoding, which can limit their performance.

3 Preliminaries

3.1 Problem Formulation

Given a natural language question 𝒬𝒬\mathcal{Q}caligraphic_Q and a schema 𝒮𝒮\mathcal{S}caligraphic_S = (𝒯𝒯\mathcal{T}caligraphic_T, 𝒞𝒞\mathcal{C}caligraphic_C) for a relational database, our objective is to generate a corresponding SQL query 𝒴𝒴\mathcal{Y}caligraphic_Y. Here, the sequence 𝒬𝒬\mathcal{Q}caligraphic_Q === {{\{{q1 \ldots q||||𝒬𝒬\mathcal{Q}caligraphic_Q||||}}\}} is a sequence of natural language tokens or words, where ||||𝒬𝒬\mathcal{Q}caligraphic_Q|||| is the length of the question. The database schema is comprised of tables 𝒯𝒯\mathcal{T}caligraphic_T === {{\{{t1, \ldots, t||||𝒯𝒯\mathcal{T}caligraphic_T||||}}\}} and columns 𝒞𝒞\mathcal{C}caligraphic_C === {{\{{c1, \ldots, c||||𝒞𝒞\mathcal{C}caligraphic_C||||}}\}}, where ||||𝒯𝒯\mathcal{T}caligraphic_T|||| and ||||𝒞𝒞\mathcal{C}caligraphic_C|||| are the number of tables and columns in the database, respectively. Each column name ci \in 𝒞𝒞\mathcal{C}caligraphic_C, is comprised of tokens {ci,1, \ldots, ci,|Ci|subscript𝐶𝑖|C_{i}|| italic_C start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT |} , where ||||Ci|||| is the number of tokens in the column name, and similarly table names are also comprised of tokens {ti,1, \ldots, ti,|ti|subscript𝑡𝑖|t_{i}|| italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT |}, where |ti|subscript𝑡𝑖|t_{i}|| italic_t start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT | is the number of tokens in the table name.

3.2 Query Construction

We define the output SQL query 𝒴𝒴\mathcal{Y}caligraphic_Y as a graph, representing the AST of the query in the context-free grammar of SQL, which our model learns to generate in an autoregressive fashion. The query is an undirected graph 𝒢𝒢\mathcal{G}caligraphic_G = (𝒱𝒱\mathcal{V}caligraphic_V, \mathcal{E}caligraphic_E), of vertices 𝒱𝒱\mathcal{V}caligraphic_V and edges \mathcal{E}caligraphic_E. Similar to previous works Yin and Neubig (2017b); Wang et al. (2021); Qi et al. (2022), the nodes 𝒱𝒱\mathcal{V}caligraphic_V === 𝒫𝒫\mathcal{P}caligraphic_P \cup 𝒯𝒯\mathcal{T}caligraphic_T \cup 𝒞𝒞\mathcal{C}caligraphic_C are the possible actions derived from SQL context-free grammar rules Yin and Neubig (2017b), 𝒫𝒫\mathcal{P}caligraphic_P, such as SelectTable, SelectColumn, Root, as well as the tables (𝒯𝒯\mathcal{T}caligraphic_T) and the columns (𝒞𝒞\mathcal{C}caligraphic_C) of the database schema. 𝒫𝒫\mathcal{P}caligraphic_P are used to represent non-terminal nodes, depicting rules of the grammar, whereas 𝒯𝒯\mathcal{T}caligraphic_T and 𝒞𝒞\mathcal{C}caligraphic_C are used for terminal nodes, such as when selecting table or column names to be applied within a specific rule. The edge set \mathcal{E}caligraphic_E === {{\{{(vi,vj) |||| vi, vj \in 𝒱𝒱\mathcal{V}caligraphic_V}}\}} defines the connectivity between the different nodes.

In particular, we choose to represent the graph using an adjacency matrix under a Breadth-First-Search (BFS) node ordering scheme, π𝜋\piitalic_π, that maps nodes to rows of the adjacency matrix as a sequence You et al. (2018). This approach permits the modelling of graphs of varying size, such as the ones representing the ASTs of complex SQL queries. Formally, given a map** fSsubscript𝑓𝑆f_{S}italic_f start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT from graph, 𝒢𝒢\mathcal{G}caligraphic_G, to sequences, 𝒮𝒮\mathcal{S}caligraphic_S, and a graph 𝒢𝒢\mathcal{G}caligraphic_G with n𝑛nitalic_n nodes under BFS node ordering π𝜋\piitalic_π, we can formulate

𝐒π=𝐟𝐒(𝒢,π)=(𝐒𝟏π,,𝐒𝐧π)superscript𝐒𝜋subscript𝐟𝐒𝒢𝜋subscriptsuperscript𝐒𝜋1subscriptsuperscript𝐒𝜋𝐧\mathbf{S^{\pi}}=\mathbf{f_{S}(\mathcal{G},\pi)=(S^{\pi}_{1},\ldots,S^{\pi}_{n% })}bold_S start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT = bold_f start_POSTSUBSCRIPT bold_S end_POSTSUBSCRIPT ( caligraphic_G , italic_π ) = ( bold_S start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT , … , bold_S start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_n end_POSTSUBSCRIPT ) (1)

where Siπsubscriptsuperscript𝑆𝜋𝑖S^{\pi}_{i}italic_S start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT \in {{\{{0, 1}}\}}i-1, i \in {{\{{1, \ldots, n}}\}} depicts an adjacency vector between node π𝜋\piitalic_π(vi) and the previous nodes π𝜋\piitalic_π(vj), j \in {{\{{1, \ldots, i-1}}\}} already existing in the graph, so that:

𝐒𝐢π=𝐀(𝟏,𝐢π,,𝐀𝐢𝟏,𝐢π)𝐓,i{𝟐,,n}\mathbf{S^{\pi}_{i}}=\mathbf{A(^{\pi}_{1,i},\ldots,A^{\pi}_{i-1,i})^{T},% \forall\textit{i}\in\{2,\ldots,\textit{n}\}}bold_S start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_i end_POSTSUBSCRIPT = bold_A ( start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_1 , bold_i end_POSTSUBSCRIPT , … , bold_A start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_i - bold_1 , bold_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT bold_T end_POSTSUPERSCRIPT , ∀ i ∈ { bold_2 , … , n } (2)

Then, using Sπsuperscript𝑆𝜋S^{\pi}italic_S start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT, we can determine uniquely the SQL graph 𝒢𝒢\mathcal{G}caligraphic_G in a sequential form and learn to predict it autoregressively.

4 SQLformer

4.1 Model Overview

Refer to caption
Figure 1: An illustration of SQLformer: our model inherits the seq2seq nature of the Transformer architecture, consisting of L𝐿Litalic_L layers of encoders and decoders. SQLformer encoder introduces database table and column selection as inductive biases to contextualize the embedding of a question. In this example, the question consists of six tokens (Fig. 2). This schema-conditioned question representation serves as input to the SQLformer decoder module. Here we show the decoding timestep t𝑡titalic_t = 4 as an example. The architecture for the decoder module is detailed in Fig. 4.

In light of recent advancements Shaw et al. (2021); Scholak et al. (2021); Li et al. (2023b), we approach the text-to-SQL problem as a translation task by using an encoder-decoder architecture. Specifically, we extend the original Transformer encoder (Subsection 4.3) by incorporating learnable table and column tokens in the encoder, used to select the most relevant tables and columns in the database schema given the NLQ. This information is injected as input to the decoder, so that it can be enriched with the representation of the schema-aware question encoding and the most relevant tables and columns in the database schema selected by the model. Moreover, the SQLformer decoder extends the original Transformer decoder (Subsection 4.4) in a way that integrates both node type, adjacency and previous generated action embeddings for generating a SQL query autoregressively as a sequence of actions derived from a SQL grammar Yin and Neubig (2017b). The overall architecture of our SQLformer model is described in Fig. 1.

4.2 Model Inputs

In this section, we detail how the inputs to our model are constructed, in particular, the construction of both the NLQ and schema graphs.

Question Graph Construction.

The natural language question can be formulated as a graph 𝒢Qsubscript𝒢𝑄\mathcal{G}_{Q}caligraphic_G start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT = (𝒬𝒬\mathcal{Q}caligraphic_Q, \mathcal{R}caligraphic_R), where the node set 𝒬𝒬\mathcal{Q}caligraphic_Q consists of the natural language tokens, and \mathcal{R}caligraphic_R = {{\{{r1, \ldots, r||||\mathcal{R}caligraphic_R||||}}\}} represents one-hop relations between words. We employ two types of relations for the question graph: syntactic dependencies and part-of-speech tagging, incorporating grammatical meaning. These relations form a joint question graph, which is then linearized as a Levi graph. Fig. 2 illustrates an example question graph with some relationships. Tables 7 and 8 describe all relations used. To encode each token in the question graph, we use a Graph Attention Network (GAT) Veličković et al. (2018).

Refer to caption
Figure 2: Illustration of an example Spider question with six tokens as a graph 𝒢𝒢\mathcal{G}caligraphic_G with part-of-speech and dependency relations. In this example, the token number𝑛𝑢𝑚𝑏𝑒𝑟numberitalic_n italic_u italic_m italic_b italic_e italic_r has a OBJECT dependency with Find𝐹𝑖𝑛𝑑Finditalic_F italic_i italic_n italic_d, and Find𝐹𝑖𝑛𝑑Finditalic_F italic_i italic_n italic_d and number𝑛𝑢𝑚𝑏𝑒𝑟numberitalic_n italic_u italic_m italic_b italic_e italic_r are tagged as verb (VB) and noun (NN), respectively.

Database Schema Graph Construction.

Similarly, a database schema graph is represented by 𝒢Ssubscript𝒢𝑆\mathcal{G}_{S}caligraphic_G start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT = (𝒮𝒮\mathcal{S}caligraphic_S, \mathcal{R}caligraphic_R) where the node set 𝒮𝒮\mathcal{S}caligraphic_S === (𝒯𝒯\mathcal{T}caligraphic_T, 𝒞𝒞\mathcal{C}caligraphic_C) represents the tables, 𝒯𝒯\mathcal{T}caligraphic_T, and the columns, 𝒞𝒞\mathcal{C}caligraphic_C, in the schema. The edge set \mathcal{R}caligraphic_R === {{\{{r1, \ldots, r||||\mathcal{R}caligraphic_R||||}}\}} depicts the structural relationships among tables and columns in the schema. Similarly to previous works, we use the common relational database-specific relations, such as primary/foreign key for column pairs, column types, and whether a column belongs to a specific table. Fig. 3 shows an example database schema graph and Table 9 provides a description of the types of relationships used for database schema graph construction. We encode the schema using a GAT and use average pooling to obtain a single embedding to represent each database schema.

Refer to caption
Figure 3: An illustration of an example Spider schema for database scientist_1𝑠𝑐𝑖𝑒𝑛𝑡𝑖𝑠𝑡_1scientist\_1italic_s italic_c italic_i italic_e italic_n italic_t italic_i italic_s italic_t _ 1. In this example, there are a total of 3 tables (scientists𝑠𝑐𝑖𝑒𝑛𝑡𝑖𝑠𝑡𝑠scientistsitalic_s italic_c italic_i italic_e italic_n italic_t italic_i italic_s italic_t italic_s, projects𝑝𝑟𝑜𝑗𝑒𝑐𝑡𝑠projectsitalic_p italic_r italic_o italic_j italic_e italic_c italic_t italic_s, assigned_to𝑎𝑠𝑠𝑖𝑔𝑛𝑒𝑑_𝑡𝑜assigned\_toitalic_a italic_s italic_s italic_i italic_g italic_n italic_e italic_d _ italic_t italic_o), with multiple columns for each table and relationships between the tables.

4.3 Table and Column Selection Encoder

To describe our proposed modification to the Transformer encoder, we first introduce the original Transformer architecture. The Transformer encoder Vaswani et al. (2017) consists of alternating layers of multi-head self-attention (MHA) and Fully-connected Forward Network (FFN) blocks. Before every block, Layer Normalisation (LN) is applied, and after every block, a residual connection is added. More formally, in the thsuperscript𝑡\ell^{th}roman_ℓ start_POSTSUPERSCRIPT italic_t italic_h end_POSTSUPERSCRIPT encoder layer, the hidden states are represented as XS={x1,,xN}subscriptsuperscript𝑋𝑆subscriptsuperscript𝑥1subscriptsuperscript𝑥𝑁X^{\ell}_{S}=\{x^{\ell}_{1},\ldots,x^{\ell}_{N}\}italic_X start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT = { italic_x start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_x start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_N end_POSTSUBSCRIPT }, where N is the maximum length of the inputs.

First, a MHA block maps X into a query matrix Q \in n×dksuperscript𝑛subscript𝑑𝑘\mathbb{R}^{n\times d_{k}}blackboard_R start_POSTSUPERSCRIPT italic_n × italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, key matrix K \in n×dksuperscript𝑛subscript𝑑𝑘\mathbb{R}^{n\times d_{k}}blackboard_R start_POSTSUPERSCRIPT italic_n × italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUPERSCRIPT and value matrix V \in n×dvsuperscript𝑛subscript𝑑𝑣\mathbb{R}^{n\times d_{v}}blackboard_R start_POSTSUPERSCRIPT italic_n × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, where m is the number of query vectors, and n the number of key or value vectors. Then, an attention vector is calculated as follows:

𝐀𝐭𝐭𝐞𝐧𝐭𝐢𝐨𝐧(𝐐,𝐊,𝐕)=𝐀𝐭𝐭𝐞𝐧𝐭𝐢𝐨𝐧𝐐𝐊𝐕absent\displaystyle\mathbf{Attention(Q,K,V)}{=}bold_Attention ( bold_Q , bold_K , bold_V ) = 𝐬𝐨𝐟𝐭𝐦𝐚𝐱(𝐀)𝐕,𝐬𝐨𝐟𝐭𝐦𝐚𝐱𝐀𝐕\displaystyle\mathbf{softmax(A)}\mathbf{V},bold_softmax ( bold_A ) bold_V , (3)
𝐀=𝐀absent\displaystyle\mathbf{A}{=}bold_A = 𝐐𝐊𝐓𝐝𝐤superscript𝐐𝐊𝐓subscript𝐝𝐤\displaystyle\frac{\mathbf{Q}\mathbf{K^{T}}}{\mathbf{\sqrt{d_{k}}}}divide start_ARG bold_QK start_POSTSUPERSCRIPT bold_T end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG bold_d start_POSTSUBSCRIPT bold_k end_POSTSUBSCRIPT end_ARG end_ARG

In practice, the MHA block calculates the self-attention over h heads, where each head i is independently parametrized by 𝐖𝐢𝐐subscriptsuperscript𝐖𝐐𝐢\mathbf{W^{Q}_{i}}bold_W start_POSTSUPERSCRIPT bold_Q end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_i end_POSTSUBSCRIPT \in dm×dksuperscriptsubscript𝑑𝑚subscript𝑑𝑘\mathbb{R}^{d_{m}\times d_{k}}blackboard_R start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, 𝐖𝐢𝐊subscriptsuperscript𝐖𝐊𝐢\mathbf{W^{K}_{i}}bold_W start_POSTSUPERSCRIPT bold_K end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_i end_POSTSUBSCRIPT \in dm×dksuperscriptsubscript𝑑𝑚subscript𝑑𝑘\mathbb{R}^{d_{m}\times d_{k}}blackboard_R start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUPERSCRIPT and 𝐖𝐢𝐕subscriptsuperscript𝐖𝐕𝐢\mathbf{W^{V}_{i}}bold_W start_POSTSUPERSCRIPT bold_V end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_i end_POSTSUBSCRIPT \in dm×dvsuperscriptsubscript𝑑𝑚subscript𝑑𝑣\mathbb{R}^{d_{m}\times d_{v}}blackboard_R start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, map** the input embeddings 𝒳𝒳\mathcal{X}caligraphic_X into queries and key-value pairs. Then, the attention for each head is calculated and concatenated, as follows:

𝐇𝐞𝐚𝐝𝐢=subscript𝐇𝐞𝐚𝐝𝐢absent\displaystyle\mathbf{Head_{i}}{=}bold_Head start_POSTSUBSCRIPT bold_i end_POSTSUBSCRIPT = 𝐀𝐭𝐭𝐞𝐧𝐭𝐢𝐨𝐧(𝐐𝐖𝐢𝐐,𝐊𝐖𝐢𝐊,𝐕𝐖𝐢𝐕)𝐀𝐭𝐭𝐞𝐧𝐭𝐢𝐨𝐧subscriptsuperscript𝐐𝐖𝐐𝐢subscriptsuperscript𝐊𝐖𝐊𝐢subscriptsuperscript𝐕𝐖𝐕𝐢\displaystyle\mathbf{Attention(QW^{Q}_{i},KW^{K}_{i},VW^{V}_{i})}bold_Attention ( bold_QW start_POSTSUPERSCRIPT bold_Q end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_i end_POSTSUBSCRIPT , bold_KW start_POSTSUPERSCRIPT bold_K end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_i end_POSTSUBSCRIPT , bold_VW start_POSTSUPERSCRIPT bold_V end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_i end_POSTSUBSCRIPT ) (4)
𝐌𝐇𝐀(𝒳𝒮)=𝐌𝐇𝐀subscriptsuperscript𝒳𝒮absent\displaystyle\mathbf{MHA(\mathcal{X}^{\ell}_{\mathcal{S}})}{=}bold_MHA ( caligraphic_X start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_S end_POSTSUBSCRIPT ) = 𝐂𝐨𝐧𝐜𝐚𝐭(𝐇𝐞𝐚𝐝𝟏,,Head𝐡)𝐖𝐔𝐂𝐨𝐧𝐜𝐚𝐭subscript𝐇𝐞𝐚𝐝1subscriptHead𝐡superscript𝐖𝐔\displaystyle\mathbf{Concat(Head_{1},\ldots,\textnormal{Head}_{h})W^{U}}bold_Concat ( bold_Head start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT , … , Head start_POSTSUBSCRIPT bold_h end_POSTSUBSCRIPT ) bold_W start_POSTSUPERSCRIPT bold_U end_POSTSUPERSCRIPT
𝒳¯𝒮=subscriptsuperscript¯𝒳𝒮absent\displaystyle\mathbf{\bar{\mathcal{X}}^{\ell}_{\mathcal{S}}}{=}over¯ start_ARG caligraphic_X end_ARG start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_S end_POSTSUBSCRIPT = 𝐌𝐇𝐀(𝒳𝒮)𝐌𝐇𝐀subscriptsuperscript𝒳𝒮\displaystyle\mathbf{MHA(\mathcal{X}^{\ell}_{\mathcal{S}})}bold_MHA ( caligraphic_X start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_S end_POSTSUBSCRIPT )

where 𝐖𝐔superscript𝐖𝐔\mathbf{W^{U}}bold_W start_POSTSUPERSCRIPT bold_U end_POSTSUPERSCRIPT \mathbf{\in} dmh×dmsuperscriptsubscriptsuperscript𝑑𝑚subscript𝑑𝑚\mathbb{R}^{d^{h}_{m}\times d_{m}}blackboard_R start_POSTSUPERSCRIPT italic_d start_POSTSUPERSCRIPT italic_h end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUPERSCRIPT is a trainable parameter matrix. Next, to acquire the hidden states of the input, a FFN block is applied, as follows:

𝐅𝐅𝐍(𝒳¯𝒮)=𝐦𝐚𝐱(𝟎,𝒳¯𝒮𝐖𝟏+𝐛𝟏)𝐖𝟐+𝐛𝟐𝐅𝐅𝐍subscriptsuperscript¯𝒳𝒮𝐦𝐚𝐱0subscriptsuperscript¯𝒳𝒮subscript𝐖1subscript𝐛1subscript𝐖2subscript𝐛2\mathbf{FFN(\bar{\mathcal{X}}^{\ell}_{\mathcal{S}})}=\mathbf{max(0,\bar{% \mathcal{X}}^{\ell}_{\mathcal{S}}W_{1}+b_{1})W_{2}+b_{2}}bold_FFN ( over¯ start_ARG caligraphic_X end_ARG start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_S end_POSTSUBSCRIPT ) = bold_max ( bold_0 , over¯ start_ARG caligraphic_X end_ARG start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_S end_POSTSUBSCRIPT bold_W start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT + bold_b start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT ) bold_W start_POSTSUBSCRIPT bold_2 end_POSTSUBSCRIPT + bold_b start_POSTSUBSCRIPT bold_2 end_POSTSUBSCRIPT (5)

where 𝐖𝟏subscript𝐖1\mathbf{W_{1}}bold_W start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT \in dm×dffsuperscriptsubscript𝑑𝑚subscript𝑑𝑓𝑓\mathbb{R}^{d_{m}\times d_{ff}}blackboard_R start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_f italic_f end_POSTSUBSCRIPT end_POSTSUPERSCRIPT and 𝐖𝟐subscript𝐖2\mathbf{W_{2}}bold_W start_POSTSUBSCRIPT bold_2 end_POSTSUBSCRIPT \in dff×dmsuperscriptsubscript𝑑𝑓𝑓subscript𝑑𝑚\mathbb{R}^{d_{ff}\times d_{m}}blackboard_R start_POSTSUPERSCRIPT italic_d start_POSTSUBSCRIPT italic_f italic_f end_POSTSUBSCRIPT × italic_d start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_POSTSUPERSCRIPT are linear weight matrices. Finally, layer normalisation and residual connection are applied as follows:

𝒳~𝒮=𝐋𝐚𝐲𝐞𝐫𝐍𝐨𝐫𝐦(𝒳¯𝒮+𝐅𝐅𝐍(𝒳¯𝒮))subscriptsuperscript~𝒳𝒮𝐋𝐚𝐲𝐞𝐫𝐍𝐨𝐫𝐦subscriptsuperscript¯𝒳𝒮𝐅𝐅𝐍subscriptsuperscript¯𝒳𝒮\mathbf{\tilde{\mathcal{X}}^{\ell}_{\mathcal{S}}=\mathbf{LayerNorm}(\bar{% \mathbf{\mathcal{X}}}^{\ell}_{\mathcal{S}}+\mathbf{FFN}(\bar{\mathbf{\mathcal{% X}}}^{\ell}_{\mathcal{S}}))}over~ start_ARG caligraphic_X end_ARG start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_S end_POSTSUBSCRIPT = bold_LayerNorm ( over¯ start_ARG caligraphic_X end_ARG start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_S end_POSTSUBSCRIPT + bold_FFN ( over¯ start_ARG caligraphic_X end_ARG start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_S end_POSTSUBSCRIPT ) ) (6)

In the SQLformer encoder, we input the 1D sequence of natural language token embeddings, Z𝑍Zitalic_Z, and prepend two learnable tokens: Ztablessubscript𝑍𝑡𝑎𝑏𝑙𝑒𝑠Z_{tables}italic_Z start_POSTSUBSCRIPT italic_t italic_a italic_b italic_l italic_e italic_s end_POSTSUBSCRIPT and Zcolssubscript𝑍𝑐𝑜𝑙𝑠Z_{cols}italic_Z start_POSTSUBSCRIPT italic_c italic_o italic_l italic_s end_POSTSUBSCRIPT. The states of these tokens at the encoder output, 𝒳~tablessubscript~𝒳𝑡𝑎𝑏𝑙𝑒𝑠\tilde{\mathcal{X}}_{tables}over~ start_ARG caligraphic_X end_ARG start_POSTSUBSCRIPT italic_t italic_a italic_b italic_l italic_e italic_s end_POSTSUBSCRIPT and 𝒳~columnssubscript~𝒳𝑐𝑜𝑙𝑢𝑚𝑛𝑠\tilde{\mathcal{X}}_{columns}over~ start_ARG caligraphic_X end_ARG start_POSTSUBSCRIPT italic_c italic_o italic_l italic_u italic_m italic_n italic_s end_POSTSUBSCRIPT, serve as input to two MLP blocks responsible for selecting k1subscript𝑘1k_{1}italic_k start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT tables and k2subscript𝑘2k_{2}italic_k start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT columns based on the NLQ. Sinusoidal vectors retain the original positional information.

After L encoder layers, we obtain the input question embedding 𝒳~𝒮subscriptsuperscript~𝒳𝒮\tilde{\mathcal{X}}^{\ell}_{\mathcal{S}}over~ start_ARG caligraphic_X end_ARG start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT caligraphic_S end_POSTSUBSCRIPT, with the first two tokens as 𝒳~tablessubscript~𝒳𝑡𝑎𝑏𝑙𝑒𝑠\tilde{\mathcal{X}}_{tables}over~ start_ARG caligraphic_X end_ARG start_POSTSUBSCRIPT italic_t italic_a italic_b italic_l italic_e italic_s end_POSTSUBSCRIPT and 𝒳~columnssubscript~𝒳𝑐𝑜𝑙𝑢𝑚𝑛𝑠\tilde{\mathcal{X}}_{columns}over~ start_ARG caligraphic_X end_ARG start_POSTSUBSCRIPT italic_c italic_o italic_l italic_u italic_m italic_n italic_s end_POSTSUBSCRIPT, and the rest as natural language question tokens 𝒳~Qsubscript~𝒳𝑄\tilde{\mathcal{X}}_{Q}over~ start_ARG caligraphic_X end_ARG start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT \in d×Qsuperscript𝑑𝑄\mathbb{R}^{d\times Q}blackboard_R start_POSTSUPERSCRIPT italic_d × italic_Q end_POSTSUPERSCRIPT. 𝒳~Tsubscript~𝒳𝑇\tilde{\mathcal{X}}_{T}over~ start_ARG caligraphic_X end_ARG start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT and 𝒳~Csubscript~𝒳𝐶\tilde{\mathcal{X}}_{C}over~ start_ARG caligraphic_X end_ARG start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT input to MLP blocks 𝐌𝐋𝐏𝐓subscript𝐌𝐋𝐏𝐓\mathbf{MLP_{T}}bold_MLP start_POSTSUBSCRIPT bold_T end_POSTSUBSCRIPT \in 𝐝×|𝒯|superscript𝐝𝒯\mathbf{\mathbb{R}^{d\times|\mathcal{T}|}}blackboard_R start_POSTSUPERSCRIPT bold_d × | caligraphic_T | end_POSTSUPERSCRIPT and 𝐌𝐋𝐏𝐂subscript𝐌𝐋𝐏𝐂\mathbf{MLP_{C}}bold_MLP start_POSTSUBSCRIPT bold_C end_POSTSUBSCRIPT \in 𝐝×|𝒞|superscript𝐝𝒞\mathbf{\mathbb{R}^{d\times|\mathcal{C}|}}blackboard_R start_POSTSUPERSCRIPT bold_d × | caligraphic_C | end_POSTSUPERSCRIPT, where d𝑑ditalic_d is the hidden size of the token embeddings, and |𝒯|𝒯|\mathcal{T}|| caligraphic_T | and |𝒞|𝒞|\mathcal{C}|| caligraphic_C | are the sizes of the tables and columns vocabularies, respectively. The embeddings are projected into probability vectors:

𝐏𝐭𝐚𝐛𝐥𝐞𝐬=subscript𝐏𝐭𝐚𝐛𝐥𝐞𝐬absent\displaystyle\mathbf{P_{tables}}{=}bold_P start_POSTSUBSCRIPT bold_tables end_POSTSUBSCRIPT = 𝐬𝐨𝐟𝐭𝐦𝐚𝐱(𝐌𝐋𝐏𝐓(𝒳~𝐓))𝐬𝐨𝐟𝐭𝐦𝐚𝐱subscript𝐌𝐋𝐏𝐓subscript~𝒳𝐓\displaystyle\mathbf{softmax(MLP_{T}(\tilde{\mathcal{X}}_{T}))}bold_softmax ( bold_MLP start_POSTSUBSCRIPT bold_T end_POSTSUBSCRIPT ( over~ start_ARG caligraphic_X end_ARG start_POSTSUBSCRIPT bold_T end_POSTSUBSCRIPT ) ) (7)
𝐏𝐜𝐨𝐥𝐮𝐦𝐧𝐬=subscript𝐏𝐜𝐨𝐥𝐮𝐦𝐧𝐬absent\displaystyle\mathbf{P_{columns}}{=}bold_P start_POSTSUBSCRIPT bold_columns end_POSTSUBSCRIPT = 𝐬𝐨𝐟𝐭𝐦𝐚𝐱(𝐌𝐋𝐏𝐂(𝒳~𝐂))𝐬𝐨𝐟𝐭𝐦𝐚𝐱subscript𝐌𝐋𝐏𝐂subscript~𝒳𝐂\displaystyle\mathbf{softmax(MLP_{C}(\tilde{\mathcal{X}}_{C}))}bold_softmax ( bold_MLP start_POSTSUBSCRIPT bold_C end_POSTSUBSCRIPT ( over~ start_ARG caligraphic_X end_ARG start_POSTSUBSCRIPT bold_C end_POSTSUBSCRIPT ) )

Then, the top k1subscript𝑘1k_{1}italic_k start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and k2subscript𝑘2k_{2}italic_k start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT tables and columns, respectively, are selected according to 𝐏𝐭𝐚𝐛𝐥𝐞𝐬subscript𝐏𝐭𝐚𝐛𝐥𝐞𝐬\mathbf{P_{tables}}bold_P start_POSTSUBSCRIPT bold_tables end_POSTSUBSCRIPT and 𝐏𝐜𝐨𝐥𝐮𝐦𝐧𝐬subscript𝐏𝐜𝐨𝐥𝐮𝐦𝐧𝐬\mathbf{P_{columns}}bold_P start_POSTSUBSCRIPT bold_columns end_POSTSUBSCRIPT. A masking vector is applied to 𝐏𝐜𝐨𝐥𝐮𝐦𝐧𝐬subscript𝐏𝐜𝐨𝐥𝐮𝐦𝐧𝐬\mathbf{P_{columns}}bold_P start_POSTSUBSCRIPT bold_columns end_POSTSUBSCRIPT to ensure that only columns from the selected tables are considered, avoiding the selection of columns not present in the selected tables. Next, two embedding lookup tables, 𝐄𝐓subscript𝐄𝐓\mathbf{E_{T}}bold_E start_POSTSUBSCRIPT bold_T end_POSTSUBSCRIPT \in |𝒯|×𝐝𝐭superscript𝒯subscript𝐝𝐭\mathbf{\mathbb{R}^{|\mathcal{T}|\times d_{t}}}blackboard_R start_POSTSUPERSCRIPT | caligraphic_T | × bold_d start_POSTSUBSCRIPT bold_t end_POSTSUBSCRIPT end_POSTSUPERSCRIPT and 𝐄𝐂subscript𝐄𝐂\mathbf{E_{C}}bold_E start_POSTSUBSCRIPT bold_C end_POSTSUBSCRIPT \in |𝒞|×𝐝𝐜superscript𝒞subscript𝐝𝐜\mathbf{\mathbb{R}^{|\mathcal{C}|\times d_{c}}}blackboard_R start_POSTSUPERSCRIPT | caligraphic_C | × bold_d start_POSTSUBSCRIPT bold_c end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, are used for map** the k top tables and columns, respectively, into embeddings, as 𝒳~Tksubscriptsuperscript~𝒳𝑘𝑇\tilde{\mathcal{X}}^{k}_{T}over~ start_ARG caligraphic_X end_ARG start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT \in 𝐤𝟏×𝐝superscriptsubscript𝐤1𝐝\mathbf{\mathbb{R}^{k_{1}\times d}}blackboard_R start_POSTSUPERSCRIPT bold_k start_POSTSUBSCRIPT bold_1 end_POSTSUBSCRIPT × bold_d end_POSTSUPERSCRIPT and 𝒳~Cksubscriptsuperscript~𝒳𝑘𝐶\tilde{\mathcal{X}}^{k}_{C}over~ start_ARG caligraphic_X end_ARG start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_C end_POSTSUBSCRIPT \in 𝐤𝟐×𝐝superscriptsubscript𝐤2𝐝\mathbf{\mathbb{R}^{k_{2}\times d}}blackboard_R start_POSTSUPERSCRIPT bold_k start_POSTSUBSCRIPT bold_2 end_POSTSUBSCRIPT × bold_d end_POSTSUPERSCRIPT, where d𝑑ditalic_d is the size of the learnable embeddings. These are aggregated and concatenated, giving the final representation for the schema, depicted as 𝒳~𝐬𝐜𝐡𝐞𝐦𝐚subscript~𝒳𝐬𝐜𝐡𝐞𝐦𝐚\mathbf{\tilde{\mathcal{X}}_{schema}}over~ start_ARG caligraphic_X end_ARG start_POSTSUBSCRIPT bold_schema end_POSTSUBSCRIPT

Finally, 𝒳~Qsubscript~𝒳𝑄\tilde{\mathcal{X}}_{Q}over~ start_ARG caligraphic_X end_ARG start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT and 𝒳~𝐬𝐜𝐡𝐞𝐦𝐚subscript~𝒳𝐬𝐜𝐡𝐞𝐦𝐚\mathbf{\tilde{\mathcal{X}}_{schema}}over~ start_ARG caligraphic_X end_ARG start_POSTSUBSCRIPT bold_schema end_POSTSUBSCRIPT are aggregated to effectively contextualize the natural language question embedding by the embedding of the most likely tables and columns in the schema being mentioned. The result of this aggregation is given as input to the decoder module as part of the cross-attention.

4.4 Autoregressive Query Graph Generation Decoder

During the decoding phase, previous works (e.g. Wang et al. (2021); Cao et al. (2021); Hui et al. (2022); Cai et al. (2022)) widely adopt the LSTM-based tree decoder from Yin and Neubig (2017a) to generate SQL grammar rules. In contrast, the SQLformer decoder (Fig. 4) extends the original Transformer decoder to predict the SQL grammar rules autoregressively. This approach has multiple advantages. First, it maintains the context of previously generated parts of the query for longer sequences than LSTM-based decoders. This is especially important for long queries, such as these containing sub-queries. Also, the Transformer encourages permutation invariance desirable for processing the node embeddings of the SQL graph, as the graph is invariant under any permutation of the nodes. Additionally, the highly parallelizable nature of the inherited Transformer architecture results in higher efficiency for both training and inference speed compared to previous LSTM-based approaches (see Table 10 for an analysis on training and inference efficiency).

In the SQLformer decoder, each query node is described by three attributes: node type, node adjacency, and the previous action. Nodes are assigned a type, represented as NVsuperscript𝑁𝑉N^{V}italic_N start_POSTSUPERSCRIPT italic_V end_POSTSUPERSCRIPT === {{\{{V0, V1, \ldots, VN}}\}}, where Visubscript𝑉𝑖V_{i}italic_V start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is a one-hot representation of the node type. Nodes are grouped as non-terminal or terminal, with terminal types including tableid𝑡𝑎𝑏𝑙subscript𝑒𝑖𝑑table_{i}ditalic_t italic_a italic_b italic_l italic_e start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_d and columnid𝑐𝑜𝑙𝑢𝑚subscript𝑛𝑖𝑑column_{i}ditalic_c italic_o italic_l italic_u italic_m italic_n start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_d. Node type embeddings are calculated using a learnable transformation 𝚿(NV)𝚿superscript𝑁𝑉\mathbf{\Psi}(N^{V})bold_Ψ ( italic_N start_POSTSUPERSCRIPT italic_V end_POSTSUPERSCRIPT ) \in |𝐕|×𝐝𝐕superscript𝐕subscript𝐝𝐕\mathbf{\mathbb{R}^{|V|\times d_{V}}}blackboard_R start_POSTSUPERSCRIPT | bold_V | × bold_d start_POSTSUBSCRIPT bold_V end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, where dVsubscript𝑑𝑉d_{V}italic_d start_POSTSUBSCRIPT italic_V end_POSTSUBSCRIPT is the embedding dimensionality and ||||V|||| is the number of possible node types. Node adjacency is represented as NAsuperscript𝑁𝐴N^{A}italic_N start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT === {{\{{A0, A1, \ldots, AN}}\}}, with Aisubscript𝐴𝑖A_{i}italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT \in {{\{{0, 1}}\}}M, and embeddings obtained from 𝚽(NA)𝚽superscript𝑁𝐴\mathbf{\Phi}(N^{A})bold_Φ ( italic_N start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT ) \in 𝟏×𝐝𝐀superscript1subscript𝐝𝐀\mathbf{\mathbb{R}^{1\times d_{A}}}blackboard_R start_POSTSUPERSCRIPT bold_1 × bold_d start_POSTSUBSCRIPT bold_A end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, with dAsubscript𝑑𝐴d_{A}italic_d start_POSTSUBSCRIPT italic_A end_POSTSUBSCRIPT as the embedding dimensionality. The previous action embedding, at1subscript𝑎𝑡1a_{t-1}italic_a start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT, is given by the transformation 𝛀(NR)𝛀superscript𝑁𝑅\mathbf{\Omega}(N^{R})bold_Ω ( italic_N start_POSTSUPERSCRIPT italic_R end_POSTSUPERSCRIPT ) \in 𝟏×𝐝𝐓superscript1subscript𝐝𝐓\mathbf{\mathbb{R}^{1\times d_{T}}}blackboard_R start_POSTSUPERSCRIPT bold_1 × bold_d start_POSTSUBSCRIPT bold_T end_POSTSUBSCRIPT end_POSTSUPERSCRIPT, where NRsuperscript𝑁𝑅N^{R}italic_N start_POSTSUPERSCRIPT italic_R end_POSTSUPERSCRIPT is the SQL grammar rule chosen in the previous timestep and dTsubscript𝑑𝑇d_{T}italic_d start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT is the embedding dimensionality.

We extend the Transformer decoder architecture to incorporate the node type, adjacency and previous action embeddings to represent a node at each timestep. In particular, inspired by Ying et al. (2021), we include the node type and adjacency embeddings in the multi head self-attention aggregation process as a bias term (see Fig. 4 for an illustration). Formally, we modify Eq. 3 so that 𝚿(NV)𝚿superscript𝑁𝑉\mathbf{\Psi}(N^{V})bold_Ψ ( italic_N start_POSTSUPERSCRIPT italic_V end_POSTSUPERSCRIPT ) and 𝚽(NA)𝚽superscript𝑁𝐴\mathbf{\Phi}(N^{A})bold_Φ ( italic_N start_POSTSUPERSCRIPT italic_A end_POSTSUPERSCRIPT ) act as a bias term in the attention calculation, as follows

𝐀=𝐀absent\displaystyle\mathbf{A}{=}bold_A = 𝐐𝐊𝐓𝐝𝐤+𝐔superscript𝐐𝐊𝐓subscript𝐝𝐤𝐔\displaystyle\mathbf{\frac{QK^{T}}{\sqrt{d_{k}}}+U}divide start_ARG bold_QK start_POSTSUPERSCRIPT bold_T end_POSTSUPERSCRIPT end_ARG start_ARG square-root start_ARG bold_d start_POSTSUBSCRIPT bold_k end_POSTSUBSCRIPT end_ARG end_ARG + bold_U (8)
𝐔=𝐔absent\displaystyle\mathbf{U}{=}bold_U = 𝚿(𝐍𝐕)+𝚽(𝐍𝐀)𝚿superscript𝐍𝐕𝚽superscript𝐍𝐀\displaystyle\mathbf{\Psi(N^{V})}+\mathbf{\Phi(N^{A})}bold_Ψ ( bold_N start_POSTSUPERSCRIPT bold_V end_POSTSUPERSCRIPT ) + bold_Φ ( bold_N start_POSTSUPERSCRIPT bold_A end_POSTSUPERSCRIPT )

Then, the updated residuals for the node embedding, 𝐧tsubscriptsuperscript𝐧𝑡\mathbf{n}^{\ell}_{t}bold_n start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, at layer \ellroman_ℓ, can be formalised as

𝐧t=subscriptsuperscript𝐧𝑡absent\displaystyle\mathbf{n}^{\ell}_{t}{=}bold_n start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = 𝐧𝐭𝟏+𝐎𝐤=𝟏𝐊𝐣=𝟏𝐍(𝐆𝐤, 𝐕𝐤,)\displaystyle\mathbf{n^{\ell-1}_{t}+\mathbf{O}^{\ell}\bigm{\|}_{k=1}^{K}\sum_{% j=1}^{N}\left(G^{k,\ell}\textnormal{ }\mathbf{V}^{k,\ell}\right)}bold_n start_POSTSUPERSCRIPT roman_ℓ - bold_1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_t end_POSTSUBSCRIPT + bold_O start_POSTSUPERSCRIPT roman_ℓ end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT bold_k = bold_1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_K end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT bold_j = bold_1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT bold_N end_POSTSUPERSCRIPT ( bold_G start_POSTSUPERSCRIPT bold_k , roman_ℓ end_POSTSUPERSCRIPT bold_V start_POSTSUPERSCRIPT bold_k , roman_ℓ end_POSTSUPERSCRIPT ) (9)
𝐆𝐤,=superscript𝐆𝐤absent\displaystyle\mathbf{G^{k,\ell}}{=}bold_G start_POSTSUPERSCRIPT bold_k , roman_ℓ end_POSTSUPERSCRIPT = 𝐬𝐨𝐟𝐭𝐦𝐚𝐱(𝐀𝐤,)𝐬𝐨𝐟𝐭𝐦𝐚𝐱superscript𝐀𝐤\displaystyle\mathbf{softmax(A^{k,\ell})}bold_softmax ( bold_A start_POSTSUPERSCRIPT bold_k , roman_ℓ end_POSTSUPERSCRIPT )

where parallel-to\parallel means concatenation, and K is the number of attention heads. As a result, the decoder state at the current timestep after L𝐿Litalic_L decoder layers, 𝐧tLsubscriptsuperscript𝐧𝐿𝑡\mathbf{n}^{L}_{t}bold_n start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT, is fed to an action output MLP head which computes the distribution P(at+1)𝑃subscript𝑎𝑡1P(a_{t+1})italic_P ( italic_a start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ) of next timestep actions based on the node type, adjacency, and previous action at timestep t𝑡titalic_t. Formally, P(at+1)𝑃subscript𝑎𝑡1P(a_{t+1})italic_P ( italic_a start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ) is calculated as follows

𝐏(𝐚𝐭+𝟏𝐧tL)=𝐬𝐨𝐟𝐭𝐦𝐚𝐱(𝐖𝐚𝐧𝐭𝐋)𝐏conditionalsubscript𝐚𝐭1subscriptsuperscript𝐧𝐿𝑡𝐬𝐨𝐟𝐭𝐦𝐚𝐱subscript𝐖𝐚subscriptsuperscript𝐧𝐋𝐭\mathbf{P(a_{t+1}}\mid\mathbf{n}^{L}_{t})=\mathbf{softmax(W_{a}n^{L}_{t})}bold_P ( bold_a start_POSTSUBSCRIPT bold_t + bold_1 end_POSTSUBSCRIPT ∣ bold_n start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) = bold_softmax ( bold_W start_POSTSUBSCRIPT bold_a end_POSTSUBSCRIPT bold_n start_POSTSUPERSCRIPT bold_L end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_t end_POSTSUBSCRIPT ) (10)

Finally, the prediction of the SQL query AST can be decoupled into a sequence of actions a𝑎aitalic_a = (a1subscript𝑎1a_{1}italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, \ldots , a|a|subscript𝑎𝑎a_{|a|}italic_a start_POSTSUBSCRIPT | italic_a | end_POSTSUBSCRIPT), yielding the training objective for the task as

=𝐩=𝟏|𝐚|𝐥𝐨𝐠 𝐏(𝐚𝐩𝐚<𝐩,𝒮,𝒬)superscriptsubscript𝐩1𝐚𝐥𝐨𝐠 𝐏conditionalsubscript𝐚𝐩subscript𝐚absent𝐩𝒮𝒬\mathbf{\mathcal{L}}=\mathbf{-\sum_{p=1}^{|a|}log\textnormal{ }P(a_{p}\mid a_{% <p},\mathcal{S},\mathcal{Q})}caligraphic_L = - ∑ start_POSTSUBSCRIPT bold_p = bold_1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT | bold_a | end_POSTSUPERSCRIPT bold_log bold_P ( bold_a start_POSTSUBSCRIPT bold_p end_POSTSUBSCRIPT ∣ bold_a start_POSTSUBSCRIPT < bold_p end_POSTSUBSCRIPT , caligraphic_S , caligraphic_Q ) (11)

5 Experiments

In this section, we show our model performance on six common text-to-SQL datasets. Also, we present ablation studies to analyse the importance of the different components of the SQLformer architecture.

Method EM EX
Dev Test Dev Test
SADGA + GAP Cai et al. (2022) 73.1 70.1 - -
RAT-SQL + GraPPa Yu et al. (2021a) 73.4 69.6 - -
RAT-SQL + GAP + NatSQL Shi et al. (2021) 73.7 68.7 75.0 73.3
SMBOP + GraPPa Rubin and Berant (2021) 74.7 69.5 75.0 71.1
DT-Fixup SQL-SP + RoBERTa Xu et al. (2021) 75.0 70.9 - -
LGESQL + ELECTRA Cao et al. (2021) 75.1 72.0 - -
RASAT Qi et al. (2022) 75.3 70.9 80.5 75.5
T5-3B Scholak et al. (2021) 75.5 71.9 79.3 75.1
S2SQL + ELECTRA Hui et al. (2022) 76.4 72.1 - -
RESDSQL Li et al. (2023a) 80.5 72.0 84.1 79.9
GRAPHIX-T5-3B Li et al. (2023b) 77.1 74.0 81.0 77.6
SQLformer (our approach) 78.2 75.6 82.5 81.9
Table 1: EM and EX results on Spider’s dev and test dataset splits. We compare our approach with recent state-of-the-art methods. Underline depicts the previous best performing method for each metric.

5.1 Experimental Setup

Dataset.

We consider six benchmark datasets, with complete details included in Appendix E. In particular, our experiments use the Spider Yu et al. (2019b) dataset, a large-scale cross-domain text-to-SQL benchmark, as well as context-dependent benchmarks such as the SparC Yu et al. (2019c) and CoSQL Yu et al. (2019a) datasets. Additionally, we evaluate our method for zero-shot domain generalization performance on the Spider-DK Gan et al. (2021b), Spider-SYN Gan et al. (2021a) and Spider-Realistic datasets.

Evaluation Metrics.

We report results using the same metrics as previous works Yu et al. (2019b); Li et al. (2023a, b). For Spider-family datasets (i.e. Spider, Spider-DK, Spider-SYN and Spider-Realistic), we consider two prevalent evaluation metrics: Exact Match (EM) and Execution (EX) accuracies. The EX metric evaluates whether the predicted and ground-truth SQL queries yields the same execution results on the database. However, there can be instances where EX gives false positives. To counteract this, EM evaluates how much a predicted SQL query is comparable to the ground truth query. For SParC and CoSQL, we measure EM at the question (QEM) and interaction (IEM) levels, as well as EX at both question (QEX) and interaction levels (IEX).

Implementation Details.

We implemented SQLformer in PyTorch Paszke et al. (2019). For the graph neural network components, we use PyTorch Geometric Fey and Lenssen (2019). The questions, column and table names are tokenized and lemmatized using stanza Qi et al. (2020). For dependency parsing and part-of-speech tagging, stanza Qi et al. (2020) is used. We find the best set of hyperparameters on a randomly sampled subset of 10% queries from the dev dataset. For training, we set the maximum input length as 1024, maximum number of generated AST nodes to 200, batch size as 32 and maximum training steps to 20,000. A detailed list of hyperparameters can be found in Appendix D. Tokens embeddings are initialized with ELECTRA Clark et al. (2020) using the HuggingFace library Wolf et al. (2020). We use teacher forcing in the decoder for higher training stability. Results are on the test set unless stated otherwise.

5.2 Overall Performance

Results on Spider.

The EM and EX accuracy results on the Spider benchmark are presented in Table 1. Our proposed model SQLformer achieves competitive performance in both EM and EX accuracy. On the test set, compared with RAT-SQL Wang et al. (2021), our model’s EM increases from 69.6% to 75.6%, achieving a 6.0% absolute improvement. When compared to approaches that fine-tune a Language Model (LM) with a much larger amount of parameters, such as T5-3B (71.9%), we achieve a 3.7% absolute improvement. This effectively shows the benefit of our proposed architecture for solving text-to-SQL tasks with fewer parameters. Furthermore, SQLformer sets a new state-of-the-art in EX accuracy with 81.9%. Compared to RESDSQL Li et al. (2023a), which achieves 72.0% EM and 79.9% EX, SQLformer surpasses it by 3.6% and 2.0% respectively. Similarly, SQLformer outperforms GRAPHIX-T5 Li et al. (2023b), which has 74.0% EM and 77.6% EX, by 1.6% and 4.3%. Against other methods like RASAT, SQLformer shows significant improvements of 4.7% in EM and 6.4% in EX. These comparisons highlight the effectiveness of SQLformer in generating highly accurate SQL queries, significantly improving upon existing state-of-the-art methods.

Results on Difficult Queries.

We provide a breakdown of accuracy by query difficulty level (easy, medium, hard, extra hard) as defined by Yu et al. (2019b). Table 2 compares our approach to state-of-the-art baselines on the EM accuracy metric. Performance drops with increasing query difficulty, from 92.7% on easy𝑒𝑎𝑠𝑦easyitalic_e italic_a italic_s italic_y to 51.2% on extra𝑒𝑥𝑡𝑟𝑎extraitalic_e italic_x italic_t italic_r italic_a hard queries. Compared to RAT-SQL, SQLformer shows improvements of 9.7% on hard𝑎𝑟𝑑harditalic_h italic_a italic_r italic_d and 8.3% on extra𝑒𝑥𝑡𝑟𝑎extraitalic_e italic_x italic_t italic_r italic_a hard queries, demonstrating its effectiveness in handling complex queries. Therefore, SQLformer surpasses the baseline methods across all four subsets by a significant margin, giving supporting evidence for the effectiveness of our approach.

Method Easy Medium Hard Extra All
RAT-SQL + BERT 86.4 73.6 62.1 42.9 69.7
SADGA 90.3 72.4 63.8 49.4 71.6
LGESQL 91.5 76.7 66.7 48.8 74.1
GRAPHIX-T5-3B 91.9 81.6 61.5 50 75.6
SQLformer (our approach) 92.7 82.9 71.8 51.2 76.8
Table 2: EM accuracy on the Spider queries across different levels of difficulty as defined by Yu et al. (2019b).

Zero-Shot Results on Domain Generalization and Robustness.

In Table 3, we analyze SQLformer’s capabilities in zero-shot domain generalization and robustness on the Spider-DK, Spider-SYN, and Spider-Realistic benchmarks. SQLformer excels with EM accuracies of 55.1% on Spider-DK, 71.2% on Spider-SYN, and 78.7% on Spider-Realistic. These results surpass models like LGESQL with ELECTRA and sophisticated systems like GRAPHIX-T5-3B by 3.9%, 4.3%, and 6.3% on DK, SYN, and Realistic, respectively, and RESDSQL by 1.8%, 2.1%, and 1.3%. SQLformer’s EX accuracies of 68.2%, 78.4%, and 82.6% also outperform RESDSQL, demonstrating SQLformer’s ability to adapt to unseen domains without direct prior training potential for real-world applications where database schemas and linguistic variations are highly variable.

Method Spider-DK Spider-SYN Spider-R
EM EX EM EX EM EX
RAT-SQL + GraPPa Yu et al. (2021a) 38.5 - 49.1 - 59.3 -
LGESQL + ELECTRA Cao et al. (2021) 48.4 - 64.6 - 69.2 -
T5-3B Scholak et al. (2021) - - - - 68.7 71.4
GRAPHIX-T5-3B Li et al. (2023b) 51.2 - 66.9 - 72.4 -
RESDSQL Li et al. (2023a) 53.3 66.0 69.1 76.9 77.4 81.9
SQLformer (our approach) 55.1 68.2 71.2 78.4 78.7 82.6
Table 3: EM and EX on Spider-SYN, Spider-DK and Spider-Realistic benchmarks.

Results on Context-Dependent Settings.

We present the experimental results for SQLformer in comparison with several leading methods on the SParC (Table 4) and CoSQL (Table 5) datasets. For the SParC dataset, SQLformer achieves 68.6% QEM and 51.3% IEM, outperforming RASAT by 1.9% and 4.1% respectively. Additionally, SQLformer shows significant improvements in QEX and IEX metrics, with 74.5% and 55.8%, further confirming its superior capacity for maintaining contextual understanding in multi-turn SQL dialogue tasks. For the CoSQL dataset, SQLformer attains 60.2% QEM and 31.4% IEM, surpassing RASAT by 1.4% and 5.1%. Moreover, SQLformer’s QEX and IEX scores are 68.4% and 39.2%, respectively, highlighting its potential in enhancing interactive SQL query generation. These results underscore the effectiveness of SQLformer in delivering more accurate and contextually aware SQL interpretations compared to previous leading methods.

Method QEM IEM QEX IEX
EditSQL + BERT Zhang et al. (2019) 47.2 29.5 - -
IGSQL + BERT Cai and Wan (2020) 50.7 32.5 - -
RAT-SQL + SCoRe Yu et al. (2021b) 62.2 42.5 - -
RASAT Qi et al. (2022) 66.7 47.2 72.5 53.1
SQLformer (our approach) 68.6 51.3 74.5 55.8
Table 4: Evaluation results on the SParC dataset.
Method QEM IEM QEX IEX
EditSQL + BERT Zhang et al. (2019) 39.9 12.3 - -
IGSQL + BERT Cai and Wan (2020) 44.1 15.8 - -
RAT-SQL + SCoRe Yu et al. (2021b) 52.1 22.0 - -
T5-3B Scholak et al. (2021) 56.9 24.2 - -
HIE-SQL + GraPPa Zheng et al. (2022) 56.4 28.7 - -
RASAT Qi et al. (2022) 58.8 26.3 66.7 37.5
SQLformer (our approach) 60.2 31.4 68.4 39.2
Table 5: Evaluation results on the CoSQL dataset.

5.3 Ablation Study

To validate the importance of each component in our architecture, we performed ablation studies on the SQLformer model. Table 6 compares the impact of four critical design choices: removing table and column selection, part-of-speech question encoding, and dependency graph question encoding. Additionally, we analyze the impact of varying the number of selected tables (k1subscript𝑘1k_{1}italic_k start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT) and columns (k2subscript𝑘2k_{2}italic_k start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT) on the performance of SQLformer (see Table 11).

Method EM accuracy (%percent\%%)
SQLformer w/o table + column selection 72.3 ±plus-or-minus\pm± 0.38
SQLformer encoder + LSTM-based decoder 74.2 ±plus-or-minus\pm± 0.38
SQLformer w/o Part-of-Speech graph 77.3 ±plus-or-minus\pm± 0.63
SQLformer w/o dependency graph 77.5 ±plus-or-minus\pm± 0.72
SQLformer (baseline) 78.2 ±plus-or-minus\pm± 0.75
Table 6: EM accuracy (and ±plus-or-minus\pm± 95% confidence interval) of SQLformer ablation study on the Spider development set.

The results show that table and column selection has the biggest impact, with a performance drop from 78.2% to 72.3% when removed. This highlights the importance of schema-question linking. Removing the dependency graph and part-of-speech encodings leads to smaller decreases of 0.7% and 0.9%, respectively. Using an LSTM-based decoder from Yin and Neubig (2017a) instead of a Transformer-based one decreases performance by 4%, demonstrating the effectiveness of our approach.

6 Conclusion

In this work, we introduced SQLformer, a novel model for text-to-SQL translation that leverages an autoregressive Transformer-based approach. SQLformer uses a specially designed encoder to link questions and schema and utilizes pre-trained models for effective language representation. Its unique decoder integrates node adjacency, type, and previous action information, conditioned on top-selected tables, columns, and schema-aware question encoding. Notably, SQLformer outperformed competitive text-to-SQL baselines across six datasets, demonstrating state-of-the-art performance.

Limitations

One of the main limitations of our work is its focus on the English language, as it is the language used by most publicly available datasets. A potential way to alleviate this is by using multi-language PLMs for processing the questions.

References

Appendix A Details on structural information types

All types of structural information used for question graph construction are shown in Table 8 and Table 7. In particular, Table 8 highlights the syntactic dependency tags using during parsing of questions and Table 7 summarizes the semantic part-of-speech tags. Moreover, all relationships used for database schema graph construction are listed in Table 9.

Tag Description
ADJ Adjective: describes a noun or pronoun.
ADV Adverb: modifies a verb, adjective, or another adverb.
INTJ Interjection: expresses a spontaneous feeling or reaction.
NOUN Noun: names a specific object or set of objects.
PROPN Proper Noun: names specific individuals, places, organizations.
VERB Verb: describes an action, occurrence, or state of being.
ADP Adposition: relates to other words, specifying relationships.
AUX Auxiliary: helps form verb tenses, moods, or voices.
CCONJ Coordinating Conjunction: connects words, phrases, or clauses of equal rank.
DET Determiner: modifies a noun, indicating reference.
NUM Numeral: represents a number.
Table 7: Types of part-of-speech tags used during question graph construction
Tag Description
ACL Clausal modifier of noun.
ADVCL Adverbial clause modifier.
ADVMOD Adverbial modifier.
AMOD Adjectival modifier.
APPOS Appositional modifier.
AUX Auxiliary.
CC Coordinating conjunction.
CCOMP Clausal complement.
COMP Compound.
CONJ Conjunct.
COP Copula.
CSUBJ Clausal subject.
DET Determiner.
IOBJ Indirect object.
NMOD Nominal modifier.
NSUBJ Nominal subject.
NUMMOD Numeric modifier.
OBJ Object.
Table 8: Types of dependency parsing tags used during question graph construction
Source node x𝑥xitalic_x Target node y𝑦yitalic_y Relationship Description
Table Column Has-Column Column y belongs to the table x.
Column Table Is-Primary-Key The column x is primary key of table y.
Column Column Is-Foreign-Key Column x is the foreign key of column y.
Column Literal Column-Type The column x has type y.
Table 9: Summary of structural information types used in SQLformer during database schema graph construction.

Appendix B Analysis on training and inference efficiency

Table 10 presents a comparative analysis of training and inference times between LSTM-based methods and SQLformer. Specifically, the average training time for every 50 iterations is calculated for both types of methods. The findings indicate that SQLformer achieves a training speed that is approximately four times faster and an inference speed that is 1.2 times faster than that of the LSTM-based methods.

Method Spider SParC CoSQL
Tr In Tr In Tr In
LSTM 203.1 19.3 174.2 18.5 162.8 19.7
SQLformer 52.9 16.2 67.4 15.6 53.7 15.8
Table 10: Training (Tr) and inference (In) efficiency comparison between LSTM-based approaches and SQLformer. Training efficiency is calculated as the average training time in seconds from 50 iterations. Inference efficiency is calculated as seconds per 100 queries.

Appendix C Details on the decoder architecture

In the SQLformer decoder (Figure 4), the inputs are the node adjacencies and types in the current timestep of the generation process, as well as the previous action embedding. The node type and adjacency embeddings are integrated with the previous action embedding into the aggregation process of the MHA mechanism as a bias term. The node embedding is then transformed through a series of L𝐿Litalic_L decoding layers with H𝐻Hitalic_H heads. The final representation is used to generate the probability distribution of actions to take in the next timestep.

Refer to caption
Figure 4: Overview of the SQLformer decoder architecture.

Appendix D Summary of best hyperparameters used for SQLformer training

For training SQLformer, we find the best set of hyperparameters on a randomly sampled subset of 10% queries from the Spider dev split. Specifically, we find the best maximum previous AST nodes in the BFS ordering to be 30, and maximum training steps as 20,000. The number of layers for the encoder and decoder are both set to 6 and number of heads is 8. The dimensionality of the encoder and the decoder are set to 512. k1subscript𝑘1k_{1}italic_k start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and k2subscript𝑘2k_{2}italic_k start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are set to 10. The embedding sizes for tables and columns are set to 512. The node adjacency, node type and action embeddings sizes are 512. The output MLP for generating the next output action during decoding has 2 layers and hidden dimensionality of 512.

Appendix E Dataset details

For our experiments we use the (1) Spider dataset Yu et al. (2019b), a large-scale cross-domain text-to-SQL benchmark. This dataset also incorporates multiple text-to-SQL datasets. The Spider dataset contains 8,659 training examples of question and SQL query pairs, 1,034 development (dev) examples and 2,147 test examples, spanning 300 complex databases across 138 different domains. Also, we run experiments on context-dependent settings with the (2) SParC and (3) CoSQL datasets, as well as zero-shot domain generalization performance on (4) Spider-DK, (5) Spider-SYN and (6) Spider-Realistic benchmarks.

Appendix F Impact of Number of Selected Top Tables and Columns

In this section, we analyze the impact of selecting different numbers of top tables and columns on the performance of SQLformer. The performance is measured using EM accuracy on the Spider development set. Table 11 summarizes the results of varying the number of selected tables (k1subscript𝑘1k_{1}italic_k start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT) and columns (k2subscript𝑘2k_{2}italic_k start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT). As shown in the table, selecting more tables and columns generally improves the EM accuracy. However, this improvement comes with diminishing returns, indicating a trade-off between the number of selected schema elements and the model’s complexity and efficiency.

# tables (k1subscript𝑘1k_{1}italic_k start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT) # columns (k2subscript𝑘2k_{2}italic_k start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT) EM accuracy (%)
5 5 73.1
10 5 75.2
5 10 76.7
10 10 78.2
Table 11: EM accuracy of SQLformer with varying numbers of top selected tables and columns.

These results demonstrate that while including more tables and columns can enhance performance, the gains are not linear and should be balanced against computational efficiency. Adjusting the number of top selected tables and columns can be a critical hyperparameter for optimizing performance in different application scenarios.