RATSF: Empowering Customer Service Volume Management through Retrieval-Augmented Time-Series Forecasting

Tianfeng Wang Alibaba GroupHangzhouChina [email protected]  and  Gaojie Cui Yuaiweiwu TechBei**gChina [email protected]
Abstract.

An efficient customer service management system hinges on precise forecasting of service volume. In this scenario, where data non-stationarity is pronounced, successful forecasting heavily relies on identifying and leveraging similar historical data rather than merely summarizing periodic patterns. Existing models based on RNN or Transformer architectures may struggle with this flexible and effective utilization. To tackle this challenge, we initially developed the Time Series Knowledge Base (TSKB) with an advanced indexing system for efficient historical data retrieval. We also developed the Retrieval Augmented Cross-Attention (RACA) module, a variant of the cross-attention mechanism within Transformer’s decoder layers, designed to be seamlessly integrated into the vanilla Transformer architecture to assimilate key historical data segments. The synergy between TSKB and RACA forms the backbone of our Retrieval-Augmented Time Series Forecasting (RATSF) framework. Based on the above two components, RATSF not only significantly enhances performance in the context of Fliggy hotel service volume forecasting but also adapts flexibly to various scenarios and integrates with a multitude of Transformer variants for time-series forecasting. Extensive experimentation has validated the effectiveness and generalizability of this system design across multiple diverse contexts.

PVLDB Reference Format:
PVLDB, 14(1): XXX-XXX, 2020.
doi:XX.XX/XXX.XX This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 14, No. 1 ISSN 2150-8097.
doi:XX.XX/XXX.XX

1. Introduction

Accurate estimation of total service demand is crucial for customer service volume management in the travel industry, including hotels, airlines, attractions, and transaction-brokering apps like Fliggy, significantly affecting system costs.Underestimating by 100 service requests incurs urgent mobilization costs, equivalent to three times the labor cost for a single request; overestimation, meanwhile, leads to wasted labor costs. The travel industry presents unique challenges in service volume forecasting due to its interplay with variables like order statuses, weather, international events, and destination country policies. At Fliggy, precise forecasting is essential for planning the recruitment and training of customer service staff, as well as scheduling. These factors result in non-stationary data patterns that can introduce significant biases with traditional uni-variant time-series forecasting methods (Hyndman and Athanasopoulos, 2018; Anderson and Kendall, 1976; Taylor and Letham, 2018), which rely on periodic summaries and trend analyses.

In fact, across domains like stock market analysis and station traffic forecasting where business cycles and fluctuations play a significant role, there is a shared desire for a universal and flexible time series forecasting system design that can adeptly utilize historical sequence information to tackle intricate prediction challenges. Meanwhile, in the realm of time series forecasting(TSF), where models, whether based on Recurrent Neural Networks (RNNs) or Transformer, commonly face difficulties efficiently processing and extracting insights from vast amounts of historical data.

One naive way to solve this is to try elongating the sequence that the transformer processes. Some approach involves sampling historical data to fit within a limited context window. Specifically, the Informer (Zhou et al., 2021) algorithm samples K points from the sequence and derives a shorter Q sequence based on these sampled points. However, this approach assumes that all historical information is equally important, which may not be suitable for many time series scenarios where different data points can carry varying significance. Perceiver (Jaegle et al., 2021) and similar methods opt for a different approach by map** Query sequences to fixed lengths, reducing computation and allowing for more historical data storage. Nonetheless, they still face challenges in efficiently extracting and interpreting critical information from extended time series.

In the field of natural language processing, strides have been made to expand Transformer models’ context handling. On one front, techniques like flash-attention (Dao et al., 2022) enhance efficiency and reduce complexity, enabling longer context processing. However, it’s crucial to note that directly extending receptive fields may cause larger models to overlook significant details in lengthy inputs, like (Liu et al., 2023) mentioned.

On another front, Retrieval-Augmented Generation (RAG)-like methods have drawn attention for enhancing model performance by incorporating external information. NVIDIA’s (Xu et al., 2024) research indicates that even when models already handle large context windows in text tasks, they can still achieve substantial performance gains by retrieving and using relevant data from external sources.

To address the above challenges, we have identified a potential approach employing the concept of retrieval augmentation (RA). Concretely, we focus on implementing two central enhancements: a knowledge base schema that can efficiently index all historical series, and a cross-attention module embedded in a transformer model to integrate historical information for pinpointing and exploiting the most predictive segments, thereby refining prediction accuracy. The synergy between TSKB and RACA constitutes the RATSF framework, which significantly improves the performance of most univariate time series forecasting tasks. Currently, RATSF has been integrated into four key service areas within Fliggy, encompassing hotel bookings and after-sales services, train ticketing, and flight reservations and modifications. This integration has led to a notable decrease in forecasting errors, from an approximate range of 15% to approximately 8% across all the aforementioned sectors, which has significantly reduced personnel management costs, although the exact extent of cost savings cannot be divulged for commercial reasons.

In concise terms, our main contributions are:

1. We present a straightforward and manageable design for a time-series knowledge base (TSKB) based on the characteristic that time series data is easy to be stored in a structured way, which significantly facilitates efficient management of historical data.

2. We introduce a versatile cross-attention module, retrieval augmented cross-attention (RACA) module, designed to integrate retrieved historical data into the forecasting process. This module is easily adaptable and can be seamlessly integrated with various time-series transformers.

3. We raise a Retrieval-Augmented Time Series Forecasting (RATSF) framework, which decreases 7% forecasting errors in the real service areas within Fliggy and therefore reduced huge personnel practical management costs. Extensive experiments on more publicly available datasets demonstrate that our method achieves the best performance for the univariate time series forecasting task and is general for a broader range of industrial applications.

2. Review

Despite intense competition (Salinas et al., 2020; Das et al., 2023; Ye et al., 2022; Han and Woo, 2022), Transformer models (Vaswani et al., 2017) and their improved variants (Qi et al., 2021; Woo et al., 2022; Liu et al., 2021; Kitaev et al., 2020; Lim et al., 2021) have become the mainstream choice in time-series forecasting tasks. However, the computational complexity of the original Transformer model scales as O(n²) with respect to sequence length, significantly limiting the maximum sequence length it can handle and thereby constraining its ability to use historical data.

Improvements in Time-Series Forecasting with Transformers. Numbers of works focus on enhancing the Transformer model’s capability to extract temporal features, thereby improving prediction accuracy, exemplified by Fedformer’s (Zhou et al., 2022) introduction of a frequency-augmented Attention mechanism that directly incorporates Fourier operators, a similar path taken by TimesNet (Wu et al., 2023). In contrast, Autoformer (Wu et al., 2021) proposes an operator that decomposes time series information into trend and periodic components. These two types of solutions perform well for relatively stationary sequences but may see performance drop when dealing with highly non-stationary time series data.

Another core strategy involves learning a robust representation of the time series first, which is then used to enhance the forecasting accuracy. TNC (Tonekaboni et al., 2021) harnesses the core concept of contrastive learning and employs samples within a specific temporal neighborhood within a window as positive pairs, while treating samples from differing temporal neighborhoods as negative pairs. Despite these improvements having collectively boosted the ability of Transformers in time-series forecasting tasks, none has fundamentally increased the Transformer’s receptive field.

Retrieve Augmented Method in NLP. In last two years, the Retrieval-Augmented Generation (RAG) approach (Guu et al., 2020; Borgeaud et al., 2022; Pasupat et al., 2021; Zhang et al., 2022; Shuster et al., 2021; Khandelwal et al., 2020; Zuo et al., 2019) has gained widespread adoption in the field of NLP. REALM harnesses a knowledge retriever to distill information from vast corpora and thereby enhance the performance of pre-trained language models. Transformer-XL+kNN (Bonetta et al., 2021) incorporates a K-Nearest Neighbors algorithm to search through training data, refining dialogue generation capabilities. In the context of named entity recognition tasks, U-RaNER (Tan et al., 2023) utilizes multi-modal heterogeneous retrieval techniques to boost knowledge retrive and, by integrating retrieved knowledge into the model, strengthens its understanding of queries and improves entity recognition accuracy.

Retrieval Augmented Method in Time-series Forecasting. MQ-ReTCNN (Yang et al., 2022) is designed for complex time series prediction tasks involving multiple entities and variables. It employs a scoring function to compile relevant contexts from offline data, selects scored segments, and appends them to the prediction sequence.Its retrieval mechanism emphasizes leveraging historical sequences of one entity to enhance predictions about another, without directly utilizing historical data for direct prediction assistance.

ReTime (**g et al., 2022) creates a relation graph based on temporal closeness between sequences and employs relational retrieval instead of content-based retrieval.It does not optimally use historically similar sequences as reference points due to its inherent design limitations. Both MQ-ReTCNN and ReTime incorporate retrieval enhancement strategies but have yet to introduce a general and efficient retrieval technique specifically for single-variable time series prediction scenarios.

Refer to caption
Figure 1. RATSF uses a TSKB to store and index historical sequences, alongside a transformer-based forecasting model. As illustrated in the bottom-right corner, TSKB segments the full history for efficient retrieval. To do forecasting, the Encoder fetches 𝐗osubscript𝐗𝑜\mathbf{X}_{o}bold_X start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT from t recent time steps, and forms a retrieval sequence with d latest steps, retrieves N related sequences 𝐗rsubscript𝐗𝑟\mathbf{X}_{r}bold_X start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT from the TSKB. RACA in the Decoder then merges 𝐗rsubscript𝐗𝑟\mathbf{X}_{r}bold_X start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT and 𝐗osubscript𝐗𝑜\mathbf{X}_{o}bold_X start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT info to deliver result.

3. Method

3.1. Setting & Notations

Before elucidating our approach, we first clarify the setting of our problem and define some symbols that will be used throughout the text.

In actual business operations, service volume is influenced by a multitude of factors, some of which are challenging to fully represent through variables. To simplify the reasoning process and enhance the versatility of our system, we have configured our hotel service volume prediction using a uni-variant time-series forecast setup. This means that the inputs to the model consist solely of the the values of the time series and it’s temporal features . Due to the requirements of our actual task—where personnel management necessitates advance preparation for recruitment, short-term scheduling, and training. Our forecasting task is designed to start from a specific time point t𝑡titalic_t and predict a sequence of Lfsubscript𝐿𝑓L_{f}italic_L start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT data points in one go.

Time Marking. We take moment t𝑡titalic_t as the reference point, and the collection of the future time points to be forecasted is represented as [t+1,t+2,,t+lf]𝑡1𝑡2𝑡subscript𝑙𝑓{\left[t+1,t+2,\ldots,t+l_{f}\right]}[ italic_t + 1 , italic_t + 2 , … , italic_t + italic_l start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ], where lfsubscript𝑙𝑓l_{f}italic_l start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT denotes the length of the forecast period. Concurrently, we define the retrieve segment length as lrsubscript𝑙𝑟l_{r}italic_l start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT, with the indexing sequence 𝐊𝐊\mathbf{K}bold_K in the TSKB having a length of Lrsubscript𝐿𝑟L_{r}italic_L start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT, and the length of the V sequence as lvsubscript𝑙𝑣l_{v}italic_l start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT.

Sequence Source Marking. The subscript o denotes the original sequence, that is, the sequence 𝐗𝐨=x1,,xtsubscript𝐗𝐨subscript𝑥1subscript𝑥𝑡\mathbf{X_{o}}={x_{1},...,x_{t}}bold_X start_POSTSUBSCRIPT bold_o end_POSTSUBSCRIPT = italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT prior to moment t𝑡titalic_t; 𝐗𝐟subscript𝐗𝐟\mathbf{X_{f}}bold_X start_POSTSUBSCRIPT bold_f end_POSTSUBSCRIPT represents the sequence to be forecasted; 𝐗𝐫subscript𝐗𝐫\mathbf{X_{r}}bold_X start_POSTSUBSCRIPT bold_r end_POSTSUBSCRIPT signifies the 𝐕𝐕\mathbf{V}bold_V sequence retrieved from the TSKB.

Transformer Processing Marking. The superscript is used to indicate the results after processing by various parts of the Transformer. 𝐗𝐨𝐛subscriptsuperscript𝐗𝐛𝐨\mathbf{X^{b}_{o}}bold_X start_POSTSUPERSCRIPT bold_b end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_o end_POSTSUBSCRIPT denotes the original sequence after Embedding, and 𝐗𝐨𝐞subscriptsuperscript𝐗𝐞𝐨\mathbf{X^{e}_{o}}bold_X start_POSTSUPERSCRIPT bold_e end_POSTSUPERSCRIPT start_POSTSUBSCRIPT bold_o end_POSTSUBSCRIPT denotes the sequence after the Encoder. Within the Decoder, H represents the internal hidden state of the Decoder, l𝑙litalic_l represents the layer number, and Hlsuperscript𝐻𝑙H^{l}italic_H start_POSTSUPERSCRIPT italic_l end_POSTSUPERSCRIPT indicates the hidden state processed by the l𝑙litalic_l-th layer.

3.2. Overview of RATSF

The RATSF system is composed of two core components: a Time Series Knowledge Base (TSKB, detailed in Section 3.3) and a Transformer-based time series forecasting model. The latter has been enhanced by replacing its original decoder with our novel Retrieval Augmented Cross-Attention (RACA, described in Section 3.4) module. The TSKB efficiently segments and archives historical time series data, establishing a precise indexing system. The transformer model, with the integration of RACA, effectively utilizes retrieved historical data to significantly enhance the accuracy of its predictions. The data flow diagram of RATSF is illustrated in Figure 1, providing a visual representation of the process.

3.3. TSKB

Unlike traditional methods that solely store and retrieve the entire original sequence, TSKB preserves the original sequence as the core content (V) while selects certain segments from it to construct a discriminative indexing sequence (K). This approach can be likened to processing a written piece where conventional methods involve direct full-text searches for desired information, which may be inefficient and less precise; whereas with TSKB, it’s akin to extracting key headlines from the body of the text to serve as indices that facilitate rapid access to core information. This dual-sequence structure allows the system, when handling large-scale data, to efficiently locate and access targeted portions of the original sequence through customized index sequences, thereby significantly enhancing overall performance.

3.3.1. Sequential Slicing

As illustrated in Figure 2, the content sequence V is obtained using a rolling window approach with a step size of S𝑆Sitalic_S and a window length of Lvsubscript𝐿𝑣L_{v}italic_L start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT. This approach allows us to sample all historical sequences and incorporate them into TSKB. Since prediction models only use present data to infer the future, we align indexing sequence K with this constraint. We extract the initial segment of each V, having a length of Lrsubscript𝐿𝑟L_{r}italic_L start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT, to serve as its index K. The selection of Lvsubscript𝐿𝑣L_{v}italic_L start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT, Lrsubscript𝐿𝑟L_{r}italic_L start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT and S𝑆Sitalic_S is introduced in section 3.5.

Refer to caption
Figure 2. TSKB utilizes a rolling window of length Lvsubscript𝐿𝑣L_{v}italic_L start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT to collect V, with an indexing segment of length Lrsubscript𝐿𝑟L_{r}italic_L start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT taken from its leading part as K , and advances the window in steps of size S.
Refer to caption
Figure 3. Within the cold start stage, we use raw sequence of retrieval input to find Top-K relevant sequences using DTW. After one epoch, we switch to Euclidean Distance for retrieval embedding to gather Top-K matches.

3.3.2. Embedding Learning

A key approach to enhancing the recall and accuracy of knowledge base retrieval is to employ a well-trained embedding vector to improve the precision of the index. In the following sections, we will focus on how to train and utilize these index embedding vectors. Judging the quality of a retrieval embedding mainly depends on how well it captures similarities at key information points relevant to forecasting tasks. Time series data is complex due to numerous information points and the challenge of presetting comparative weights. Therefore, we use this principle: if a representation closely mirrors the important details needed for prediction, its overall performance will be better.

To ensure that the embeddings capture the essential information for forecasting, we employ the encoder from the RATSF’s forecasting model, which is designed to select precise information during training, thereby enhancing forecast accuracy. Specifically, each indexing sequence K rom the knowledge base is processed through the encoder layer of the RATSF forecasting model to generate a retrieval embedding.

Meanwhile, in the early stages of training the forecasting model, its encoder is not yet capable of producing representations that accurately match target sequences. The model’s effectiveness and learning pace are significantly correlated with its ability to retrieve historical sequences beneficial to forecasting tasks.

To avoid time and data-consuming iterations stemming from a random initialization state, we introduce Dynamic Time War** (DTW(Müller, 2007)) as an auxiliary tool during the initial phase of model training, as shown in left part of Figure 3. DTW is initially used for similarity-based sequence retrieval, aiding in the iterative training process of the RATSF forecasting model. After completing one epoch of training, we transition to using embeddings generated by the forecasting model itself for sequence retrieval, continuing the training until the model converges.

Refer to caption
Figure 4. Each RACA has two cross-attention modules: one utilizes the Encoder output as K,V, while the other employs embedded retrieved sequences as K,V. Both outputs are concatenated and passed through a Linear module to reshape back to the input dimensions.

3.4. RACA

As previously mentioned, Retrieval-Augmented Cross-Attention (RACA) is a module designed to be integrated into a time-series forecasting model, coupled with the TSKB. In our demonstration, we employ the vanilla Transformer as the time-series forecasting model; however, it’s important to note that the RACA module is designed to be compatible and can be seamlessly plug into any transformer-based time-series model. We will specifically demonstrate this in Section 4 .

The decoder consists of l𝑙litalic_l stacked RACA modules for inputs of length Lfsubscript𝐿𝑓L_{f}italic_L start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT. Each module has two parallel units: Unit 1, like in Function (1), uses 𝐗fbsubscriptsuperscript𝐗𝑏𝑓\mathbf{X}^{b}_{f}bold_X start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT as Query with Key and Value from 𝐗oesubscriptsuperscript𝐗𝑒𝑜\mathbf{X}^{e}_{o}bold_X start_POSTSUPERSCRIPT italic_e end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT, outputting 𝐇fl,1subscriptsuperscript𝐇𝑙1𝑓\mathbf{H}^{l,1}_{f}bold_H start_POSTSUPERSCRIPT italic_l , 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT of the same shape. Unit 2, retrieval-augment part, like in Function (2) ,also queries 𝐗fbsubscriptsuperscript𝐗𝑏𝑓\mathbf{X}^{b}_{f}bold_X start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT but fetches Keys and Values from the concatenated sequence of 𝐗resubscriptsuperscript𝐗𝑒𝑟\mathbf{X}^{e}_{r}bold_X start_POSTSUPERSCRIPT italic_e end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT, generating 𝐇fl,2subscriptsuperscript𝐇𝑙2𝑓\mathbf{H}^{l,2}_{f}bold_H start_POSTSUPERSCRIPT italic_l , 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT. s𝑠sitalic_s in the Functions is the scaling factor.

Like shown in Figure 4, retrieved sequences are transformed through the encoder’s embedding module, generating Ns 𝐗resubscriptsuperscript𝐗𝑒𝑟\mathbf{X}^{e}_{r}bold_X start_POSTSUPERSCRIPT italic_e end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT, and we concatenate these N vectors into one with shape[1,NLv,D]1𝑁subscript𝐿𝑣𝐷[1,N*L_{v},D][ 1 , italic_N ∗ italic_L start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT , italic_D ] for later use.

(1) 𝐇fl,1=softmax(𝐗fb𝐗oes)𝐗oe,subscriptsuperscript𝐇𝑙1𝑓softmaxsubscriptsuperscript𝐗𝑏𝑓subscriptsuperscript𝐗𝑒𝑜𝑠subscriptsuperscript𝐗𝑒𝑜\displaystyle\mathbf{H}^{l,1}_{f}=\text{softmax}\left(\frac{\mathbf{X}^{b}_{f}% \mathbf{X}^{e}_{o}}{s}\right)\mathbf{X}^{e}_{o},bold_H start_POSTSUPERSCRIPT italic_l , 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT = softmax ( divide start_ARG bold_X start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT bold_X start_POSTSUPERSCRIPT italic_e end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT end_ARG start_ARG italic_s end_ARG ) bold_X start_POSTSUPERSCRIPT italic_e end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT ,
(2) 𝐇fl,2=softmax(𝐗fb𝐗res)𝐗re,subscriptsuperscript𝐇𝑙2𝑓softmaxsubscriptsuperscript𝐗𝑏𝑓subscriptsuperscript𝐗𝑒𝑟𝑠subscriptsuperscript𝐗𝑒𝑟\displaystyle\mathbf{H}^{l,2}_{f}=\text{softmax}\left(\frac{\mathbf{X}^{b}_{f}% \mathbf{X}^{e}_{r}}{s}\right)\mathbf{X}^{e}_{r},bold_H start_POSTSUPERSCRIPT italic_l , 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT = softmax ( divide start_ARG bold_X start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT bold_X start_POSTSUPERSCRIPT italic_e end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT end_ARG start_ARG italic_s end_ARG ) bold_X start_POSTSUPERSCRIPT italic_e end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ,
(3) 𝐇fl+1=δ(concat(𝐇fl,1,𝐇fl,2)).subscriptsuperscript𝐇𝑙1𝑓𝛿concatsubscriptsuperscript𝐇𝑙1𝑓subscriptsuperscript𝐇𝑙2𝑓\displaystyle\mathbf{H}^{l+1}_{f}=\delta\left(\text{concat}\left(\mathbf{H}^{l% ,1}_{f},\mathbf{H}^{l,2}_{f}\right)\right).bold_H start_POSTSUPERSCRIPT italic_l + 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT = italic_δ ( concat ( bold_H start_POSTSUPERSCRIPT italic_l , 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT , bold_H start_POSTSUPERSCRIPT italic_l , 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ) ) .

As shown in Function 3, after concatenating both intermediate vectors,𝐇fl,1subscriptsuperscript𝐇𝑙1𝑓\mathbf{H}^{l,1}_{f}bold_H start_POSTSUPERSCRIPT italic_l , 1 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT and 𝐇fl,2subscriptsuperscript𝐇𝑙2𝑓\mathbf{H}^{l,2}_{f}bold_H start_POSTSUPERSCRIPT italic_l , 2 end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT, along the sequence dimension, forms a [1,2Lf,D]12subscript𝐿𝑓𝐷[1,2*L_{f},D][ 1 , 2 ∗ italic_L start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT , italic_D ] vector, it undergoes linear transformation before being passed to the next layer.

3.5. Deployments Procedure

In this section, we briefly introduce the procedure to deploy RATSF in a new field.

3.5.1. TSKB Initialization

Set the initial retrieval length (Lrsubscript𝐿𝑟L_{r}italic_L start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT), which can be equivalent to the prediction length (Lfsubscript𝐿𝑓L_{f}italic_L start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT). Set the sequence length for the V sequence (Lvsubscript𝐿𝑣L_{v}italic_L start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT) to be at least the sum of Lrsubscript𝐿𝑟L_{r}italic_L start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT and Lfsubscript𝐿𝑓L_{f}italic_L start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT. This ensures adequate info for retrieval and forecast. It is suggested to set several Lvsubscript𝐿𝑣L_{v}italic_L start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT values, such as Lrsubscript𝐿𝑟L_{r}italic_L start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT + Lfsubscript𝐿𝑓L_{f}italic_L start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT, 2*Lfsubscript𝐿𝑓L_{f}italic_L start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT + Lrsubscript𝐿𝑟L_{r}italic_L start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT, 3*Lfsubscript𝐿𝑓L_{f}italic_L start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT + Lrsubscript𝐿𝑟L_{r}italic_L start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT, etc.,to build several TSKBs for optimal value selection later. Set the rolling window step (stride) to 1 and collect V𝑉Vitalic_V sequences for each TSKBs. Then, Select the initial lrsubscript𝑙𝑟l_{r}italic_l start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT elements of V𝑉Vitalic_V sequences to form K𝐾Kitalic_K sequences.

3.5.2. Identifying Optimal Lvsubscript𝐿𝑣L_{v}italic_L start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT

Train diverse RATSF models on their respective TSKBs and evaluate each by prediction accuracy.Select the model with the highest prediction accuracy as the optimal RATSF model (Msuperscript𝑀M^{*}italic_M start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT), and identify the corresponding V sequence length (Lvsubscript𝐿𝑣L_{v}italic_L start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT) as the final chosen value (Lvsuperscriptsubscript𝐿𝑣L_{v}^{*}italic_L start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT).

3.5.3. Adjusting Retrieval Length Lrsubscript𝐿𝑟L_{r}italic_L start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT

Adjust the retrieval length (Lrsubscript𝐿𝑟L_{r}italic_L start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT) to observe the impact on the performance of the optimal model (M*). The adjustment range could be from 0.5Lf0.5subscript𝐿𝑓0.5L_{f}0.5 italic_L start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT to Lvsubscript𝐿𝑣L_{v}italic_L start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT, with each adjustment increment being 0.5Lf0.5subscript𝐿𝑓0.5L_{f}0.5 italic_L start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT. Record the Lrsubscript𝐿𝑟L_{r}italic_L start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT value that enables M to achieve the highest prediction accuracy, denoted as Lrsuperscriptsubscript𝐿𝑟L_{r}^{*}italic_L start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT.

3.5.4. TSKB and Model Optimization

With the confirmed optimal retrieval length (LrL_{r}*italic_L start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT ∗) and V sequence length (Lvsuperscriptsubscript𝐿𝑣L_{v}^{*}italic_L start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT), reconfigure the TSKB. Reinitialize and train the RATSF model with the updated TSKB till converge.

4. Experiments

To demonstrate the superiority of our method in service volume forecasting, we first conducted experiments using the Fliggy Hotel Service Volume Dataset (FHSV), showcasing the performance enhancement of typical time-series models when integrated with RATSF. Additionally, to validate the general applicability of our approach, we extended our experiments to three other datasets. Furthermore, through a series of ablation studies, we individually assessed the impact of each key design choice in RATSF to confirm the correctness of our detailed selections.

4.1. Experiment Setup

Datasets. Given that our primary focus lies in the forecasting task within the context of customer service volume management, we initially employed the Fliggy Hotel Service Volume Dataset to substantiate the efficacy of RATSF, the detail description of the dataset is in Appendix A. Additionally, to demonstrate the performance of our approach in other contexts, we employed three classic time-series forecasting datasets: ETT(Zhou et al., 2021), Exchange(Zhao et al., 2019) , Traffic(tra, 2017) .

Models.In Section 3.4, we established that the Retrieval Augmented Cross-Attention (RACA) module is compatible with various Transformer variants. To underscore this versatility, we selected three prominent time-series domain models: Fedformer(Zhou et al., 2022), Autoformer(Wu et al., 2021), and NS-Transformer(Liu et al., 2022), and compared them to the vanilla Transformer. Leveraging the specialized operators within Fedformer and Autoformer that cater to time-series periodicity and trends, such as Autoformer’s seasonal and trend-cyclical initializations, we have seamlessly integrated these models with the RACA module. This integration allows for the effective incorporation of their distinct sequential decomposition capabilities, enhancing the overall processing power. Furthermore, to demonstrate the advantage of our recall mechanism in the RATSF framework, we compared it with ReTime, a RAG-based model that also uses specialized retrieval techniques. Since ReTime has not released their code and their application domain is distinct from our focus on univariate time-series forecasting, we reconstructed the Relational Retrieval method based on their published paper. (It is worth noting that, as of the time of writing this paper, neither MQ-ReTCNN nor ReTime have released their code to the public.)

As shown in Figure 1, the training outcomes of the encoder’s output significantly influence both the retrieval quality and the decoder’s performance. In the main experiments, we will compare the retrieval schemes of RATSF and ReTime. Furthermore, in several subsequent experiments, we will elucidate the differences between RATSF’s retrieval approach and DTW. Consequently, for the majority of the experiments in this paper, to facilitate these comparisons, we have replaced the input to RACA from 𝐗resubscriptsuperscript𝐗𝑒𝑟\mathbf{X}^{e}_{r}bold_X start_POSTSUPERSCRIPT italic_e end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT to 𝐗rbsubscriptsuperscript𝐗𝑏𝑟\mathbf{X}^{b}_{r}bold_X start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT, and we will provide an ablation study to explore the specific effects arising from this substitution.

Forecasting Setting. Like we claim in Section 3.1, we adopt the uni-variant time-series forecasting setting. In the Fliggy’s practical operations, we are required to finalize our forecasting for the upcoming week’s day-by-day staffing needs one week in advance, which is why the actual prediction window period is set at 7 days, represented as 7 tokens. In the travel industry, which Fliggy App’s serving, a quarter generally defines a relatively complete business trend. Therefore, we choose an encoder window length of 98 days—exceeding 90 days and divisible by 14—which consequently results in 𝐗𝐨subscript𝐗𝐨\mathbf{X_{o}}bold_X start_POSTSUBSCRIPT bold_o end_POSTSUBSCRIPT is 98 units long. And we concate {xt14,,xt}subscript𝑥𝑡14subscript𝑥𝑡\left\{x_{t-14},...,x_{t}\right\}{ italic_x start_POSTSUBSCRIPT italic_t - 14 end_POSTSUBSCRIPT , … , italic_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT } with a placeholder of length 7 to form 𝐗𝐟subscript𝐗𝐟\mathbf{X_{f}}bold_X start_POSTSUBSCRIPT bold_f end_POSTSUBSCRIPT. To match the length of 𝐗𝐟subscript𝐗𝐟\mathbf{X_{f}}bold_X start_POSTSUBSCRIPT bold_f end_POSTSUBSCRIPT, we set Lvsubscript𝐿𝑣L_{v}italic_L start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT=21 and Lrsubscript𝐿𝑟L_{r}italic_L start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT=14. Regarding normalization, we employ unified μ𝜇\muitalic_μ and σ𝜎\sigmaitalic_σ values for whitening operations during the preprocessing stage and inverse normalization is carried out after forecasting. We have applied the same data processing scheme to both the ETT, Exchange and Traffic datasets.

Training Setting.We have selected the Adam optimizer, with the batch size===64 and max training epochs===10. The initial learning rate (lr) is set to 0.0001, employing a linear decay strategy where the decay parameter for the first 5 epochs is 0.9, for the last 5 epochs is 0.5. We also incorporate L1𝐿1L1italic_L 1 regularization with λ𝜆\lambdaitalic_λ=0.0001. Concurrently, an early stop** mechanism is activated when the loss ceases to decrease.

Metric.To ensure that our evaluation metrics directly reflect business performance, we first inverse normalize the forecast result as above mentioned and then compute the MSE (Mean Squared Error) and MAE (Mean Absolute Error) against the Ground Truth. The lower these two indicators are, the more accurate the prediction results prove to be. Moreover, each decrement of 1 unit in MAE signifies a corresponding reduction of 1 unit in service staff management expenditure. This configuration has been applied uniformly across all experiments.

Table 1. Performance comparison of the Transformer models with their retrieval-based variants on four experimental datasets. ’baseline’ denotes the original model without retrieval, compared to Relational Retrieva from ReTime and our approach RATSF.All results are evaluated using MSE (Mean Squared Error) and MAE (Mean Absolute Error), and the best result is bolded.
models transformer nstransformer autoformer fedformer
metric mse mae mse mae mse mae mse mae
FHSV baseline 4344450.512 1349.146 4414326.000 1242.128 6096673.500 1752.798 5517485.000 1638.275
Relational Retrieval 3737283.320 1276.836 4037294.000 1168.127 5821383.300 1703.328 5411003.700 1610.320
RATSF 3421546.027 1101.172 3823635.000 1070.933 5597297.500 1657.561 5386818.500 1595.657
ETTh1 baseline 3.187 1.391 1.501 0.920 8.651 2.264 4.159 1.597
Relational Retrieval 2.876 1.206 1.465 0.900 8.203 2.197 3.012 1.572
RATSF 2.366 1.180 1.407 0.886 7.300 2.113 3.952 1.565
Exchange baseline 0.00165 0.03134 0.00014 0.00922 0.00058 0.01906 0.00172 0.03493
Relational Retrieval 0.00143 0.02845 0.00013 0.00920 0.00069 0.02013 0.00102 0.02821
RATSF 0.00068 0.02012 0.00013 0.00912 0.00067 0.02000 0.00068 0.02084
Traffic baseline 0.00007 0.00564 0.00006 0.00569 0.00008 0.00643 0.00012 0.00876
Relational Retrieval 0.00007 0.00534 0.00006 0.00512 0.00007 0.00601 0.00009 0.00718
RATSF 0.00006 0.00498 0.00006 0.00472 0.00007 0.00514 0.00007 0.00650

4.2. Main Restult

Table 1 displays the prediction accuracy of the Transformer and its variants on four datasets, comparing performance without RA, with ReTime’s RA, and with our RATSF approach. In the Fliggy Hotel Customer Service Volume Dataset forecasting scenario, the Transformer model’s MAE loss was reduced by 5.34% with ReTime’s Relational Retrieval and by 18% after incorporating RATSF. This improvement could potentially translate to a reduction of approximately 200 in redundant personnel costs in practical management operations. A horizontal analysis of the first row in Table 1 reveals that all compared Transformer-based time-series model variants experienced a significant reduction in MAE after employing Retrieval-Augmented (RA) strategies. Specifically, the adoption of RATSF resulted in respective decreases of 14%, 5%, and 4%. In contrast, the application of ReTime’s retreval method led to improvements of 5.9%, 2.8%, and 1.7% compared to their original designs without RA. These results clearly demonstrate two conclusions: first, effective RA strategies can enhance the performance of time-series Transformer models; second, RATSF exhibits a significant advantage in its retrieval strategy.

Furthermore, cross-comparison reveals that Autoformer and FedFormer yield slightly inferior results compared to the Transformer, mainly due to the stronger non-stationarity inherent in customer service volumes data within the hotel industry. As an illustrative example, while Mondays typically exhibit similar cyclical characteristics, actual service volume during Mondays preceding a short holiday can surge significantly compared to regular ones. In such instances, historical data from periods close to other short holidays provide more valuable insights than simple temporal periodicity and short-term trends; specific case studies will be showcased later.

Additionally, across the ETT, Exchange and the Traffic dataset, the adoption of RATSF design in the Transformer and its time-series optimized variants led to noticeable reductions in MAE metrics. Moreover, among these, the NSTransformer consistently outperforms the other models, likely because its mean-adapted feature proves effective in a broad range of time-series prediction domains.

4.3. Ablation Study

We confirm the effectiveness of several design choices in this section. Section 4.3.1 emphasizes the superiority of RACA’s approach in integrating historical data for improved forecasting. Section 4.3.2 demonstrates the advantages of utilizing model encoder for recalling historical pieces. Section 4.3.3 demonstrates the benefits of incorporating DTW into the training process.

Refer to caption
Figure 5. This figure shows RACA’s use of retrieved sequences for a random forecast sequence 𝐗𝐟superscriptsubscript𝐗𝐟\mathbf{X_{f}^{*}}bold_X start_POSTSUBSCRIPT bold_f end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT.The lower half features the top 3 similar sequences of 𝐗𝐟superscriptsubscript𝐗𝐟\mathbf{X_{f}^{*}}bold_X start_POSTSUBSCRIPT bold_f end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT, concatenated along the time axis to form RACA’s input. Attention weights at the first forecast point (xt+1subscript𝑥𝑡1x_{t+1}italic_x start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT) are marked by data bars, with yellow and green indicating RACA’s focus on downward-curving segments. The upper half replicates prediction results (yellow ”pred” lines for xt+1xt+fsubscript𝑥𝑡1subscript𝑥𝑡𝑓x_{t+1}...x_{t+f}italic_x start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT … italic_x start_POSTSUBSCRIPT italic_t + italic_f end_POSTSUBSCRIPT) and true values (orange ”gt” lines), aligned with the corresponding final Lfsubscript𝐿𝑓L_{f}italic_L start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT lengths of the retrieved sequences. This visual comparison highlights the model’s pattern recognition, with RACA’s focus areas correlating to the predictions’ inflection at xt+1subscript𝑥𝑡1x_{t+1}italic_x start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT.For this sample, the model notably focuses on segments indicating ascent and predicts a subsequent increase in value.
Table 2. Ablation of the specifically designed structures within RACA on FHSV scenario.
dataset methods metric
mse 4344450.512
baseline mae 1349.146
baseline + mse 4071384.750
Design One mae 1226.850
mse 3421546.027
FHSV baseline + RACA mae 1101.172

4.3.1. Integrate Historical Sequences with RACA

Based on the Transformer architecture, two designs are considered for integrating historical pieces into the forecasting model: Design One represents a straightforward approach where historical pieces are combined with the forecasting context inputs within the encoder; Design Two employs our advanced RACA design that integrate historical pieces using cross-attention mechanisms within the decoder.

With the Fliggy Hotel Service Volume Dataset, we contrasted the performance of three methods: No historical pieces used(baseline experiment), Strategy One, and Strategy Two (RACA). The outcomes in Table 2 demonstrate that the RACA design excels, with its MAE notably lower than both the baseline and Strategy One.

A deeper examination disclosed that the Cross Attention mechanism in the RACA design efficiently guides the forecast sequence to concentrate on past trends analogous to the current scenario. In Figure 5, we illustrate how retrieved sequences influence the predicted values for the first forecast point of a specific sequence through RACA’s cross-attention mechanism. The lower half features the top 3 similar sequences, concatenated along the time axis to form RACA’s input. Attention weights at the first forecast point (xt+1subscript𝑥𝑡1x_{t+1}italic_x start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT) are marked by data bars, with yellow and green indicating RACA’s focus on downward-curving segments. The upper half replicates prediction results (yellow ”pred” lines for xt+1xt+fsubscript𝑥𝑡1subscript𝑥𝑡𝑓x_{t+1}...x_{t+f}italic_x start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT … italic_x start_POSTSUBSCRIPT italic_t + italic_f end_POSTSUBSCRIPT) and true values (orange ”gt” lines), aligned with the corresponding final Lfsubscript𝐿𝑓L_{f}italic_L start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT lengths of the retrieved sequences. This visual comparison highlights the model’s pattern recognition, with RACA’s focus areas correlating to the predictions’ inflection at xt+1subscript𝑥𝑡1x_{t+1}italic_x start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT.

Table 3. Comparison of several variants of retrieval representation schemes on the FHSH dataset.
stage methods metric
mse 4344450.512
baseline mae 1349.146
mse 3787191.250
baseline + DTW mae 1226.690
mse 3784188.541
baseline + MLP mae 1210.785
mse 3421546.027
representation retrieval baseline + encoder mae 1101.172
mse 3421546.027
RATSF with embedding mae 1101.172
mse 3343793.904
decoding RATSF with encoder mae 1004.326

4.3.2. Retrieval Embedding

As previously mentioned, in our main experiments, we aimed to isolate the contribution of the encoder’s output to the retrieval performance. To do this, we utilized the encoder’s output 𝐗resubscriptsuperscript𝐗𝑒𝑟\mathbf{X}^{e}_{r}bold_X start_POSTSUPERSCRIPT italic_e end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT for the retrieval process and the embedding output 𝐗rbsubscriptsuperscript𝐗𝑏𝑟\mathbf{X}^{b}_{r}bold_X start_POSTSUPERSCRIPT italic_b end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_r end_POSTSUBSCRIPT for decoding. In this section, we have completed the discussion of this part of the logic. Here, we will compare different schemes for retrieval and evaluate the effects of various decoding schemes.

We argue that DTW is not the optimal retrieval method and demonstrate this through an experiment on the FHSV dataset, where we compare DTW, a separately trained two-layer MLP, and our current design choice: using the forecasting model’s encoder for retrieval. The upper half of Table 3 presents a comparison of the forecasting precision achieved with these different retrieval embedding strategies. The ’Baseline’ indicates the standard Transformer model without any retrieval mechanism. As the results indicate, the encoder output outperforms the MLP, likely due to the attention mechanism within the Transformer’s encoder, which is adept at identifying and refining the relationships among data points in sequences, thus capturing more accurate patterns.

The lower half of Table 3 compares the decoding performance of RACA when using the Transformer’s embedding output and encoder output as the information source. The experimental results indicate that refining the retrieved sequence information with the Encoder can reduce the MAE by 18.38% to 1004.326, compared to the vanilla Transformer, and by 8.79% compared to using the embedding output. This substantiates the effectiveness of the RATSF architecture demonstrated in Figure 1.

Table 4. Effectiveness of Using DTW as Auxiliary Training.
dataset number of epochs metric
mse 3421546.027
1 epoch mae 1101.172
mse 3769362.251
2 epoch mae 1252.136
mse 3697616.753
3 epoch mae 1135.631
mse 3720408.244
4 epoch mae 1248.433
mse 3842794.029
5 epoch mae 1278.324
mse 3800124.170
FHSV end to end mae 1302.054

4.3.3. Effectiveness of Using DTW as Auxiliary Training

In Section 3.3.2, we assert the effectiveness of employing DTW as a retrieval method during the cold start phase to facilitate the training of forecasting models for a single epoch. To substantiate the rationale for this strategy, we conducted a comparative experimental study: one baseline group without DTW-assisted training (referred to as ’end-to-end’), and several other groups, each trained with DTW assistance for 1 to 5 epochs. We ensured that all experiments had the same maximum number of training epochs and utilized an identical early stop** strategy. Table 4 demonstrates that the model’s predictive performance is the weakest when DTW is not utilized, as indicated by the highest MSE and MAE values in the last row of Table 4; furthermore, extending DTW-assisted training beyond one epoch did not lead to improved outcomes.

This result aligns with our hypothesis that in the absence of DTW support during the initial training phase, the model has difficulty identifying high-quality retrieval vectors, which are crucial for accurate forecasting. However, as our primary experiments have shown, DTW is not inherently the best retrieval method for enhancing predictive accuracy. Therefore, extending the period of DTW-assisted training may hinder rather than enhance the model’s ability to express and forecast effectively.

5. Conclusion

In conclusion, RATSF stands out in the domain of univariant time-series forecasting by integrating a TSKB that meticulously slices and indexes historical data, leading to superior retrieval precision. This is further enhanced by the Encoder’s representations, which act as a sophisticated retrieval mechanism. Coupled with our innovative RACA mechanism for adeptly merging retrieved segments, we achieve heightened forecast accuracy. The adaptability of our method is highlighted by its compatibility with diverse Transformer variants and its extensive, successful application within Fliggy’s service volume forecasting scenarios.

References

  • (1)
  • tra (2017) 2017. Traffic Dataset. http://pems.dot.ca.gov/.
  • Anderson and Kendall (1976) O. D. Anderson and Maurce Kendall. 1976. Time-series. Vol. 2nd edn. J. R. Stat. Soc. (Series D).
  • Bonetta et al. (2021) Giovanni Bonetta, Rossella Cancelliere, Ding Liu, and Paul Vozila. 2021. Retrieval-Augmented Transformer-XL for Close-Domain Dialog Generation. In The 34th International FLAIRS Conference. https://doi.org/10.32473/flairs.v34i1.128369
  • Borgeaud et al. (2022) Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. In International conference on machine learning. PMLR, 2206–2240.
  • Dao et al. (2022) Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems 35 (2022), 16344–16359.
  • Das et al. (2023) Abhimanyu Das, Weihao Kong, Andrew Leach, Shaan K Mathur, Rajat Sen, and Rose Yu. 2023. Long-term Forecasting with TiDE: Time-series Dense Encoder. Transactions on Machine Learning Research (2023).
  • Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In International conference on machine learning. PMLR, 3929–3938.
  • Han and Woo (2022) Siho Han and Simon S Woo. 2022. Learning sparse latent graph representations for anomaly detection in multivariate time series. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2977–2986.
  • Hyndman and Athanasopoulos (2018) Rob J Hyndman and George Athanasopoulos. 2018. Forecasting: principles and practice. OTexts.
  • Jaegle et al. (2021) Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. 2021. Perceiver: General perception with iterative attention. In International conference on machine learning. PMLR, 4651–4664.
  • **g et al. (2022) Baoyu **g, Si Zhang, Yada Zhu, Bin Peng, Kaiyu Guan, Andrew Margenot, and Hanghang Tong. 2022. Retrieval Based Time Series Forecasting. arXiv preprint arXiv:2209.13525 (2022).
  • Khandelwal et al. (2020) Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through Memorization: Nearest Neighbor Language Models. In International Conference on Learning Representations.
  • Kitaev et al. (2020) Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The Efficient Transformer. In International Conference on Learning Representations.
  • Lim et al. (2021) Bryan Lim, Sercan Ö Arık, Nicolas Loeff, and Tomas Pfister. 2021. Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting 37, 4 (2021), 1748–1764.
  • Liu et al. (2023) Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172.
  • Liu et al. (2021) Shizhan Liu, Hang Yu, Cong Liao, Jianguo Li, Weiyao Lin, Alex X Liu, and Schahram Dustdar. 2021. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In International conference on learning representations.
  • Liu et al. (2022) Yong Liu, Haixu Wu, Jianmin Wang, and Mingsheng Long. 2022. Non-stationary transformers: Exploring the stationarity in time series forecasting. Advances in Neural Information Processing Systems 35 (2022), 9881–9893.
  • Müller (2007) Meinard Müller. 2007. Dynamic time war**. Information retrieval for music and motion (2007), 69–84.
  • Pasupat et al. (2021) Panupong Pasupat, Yuan Zhang, and Kelvin Guu. 2021. Controllable Semantic Parsing via Retrieval Augmentation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuan**g Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 7683–7698.
  • Qi et al. (2021) Xinyuan Qi, Kai Hou, Tong Liu, Zhongzhong Yu, Sihao Hu, and Wenwu Ou. 2021. From Known to Unknown: Knowledge-guided Transformer for Time-Series Sales Forecasting in Alibaba. CoRR abs/2109.08381 (2021).
  • Salinas et al. (2020) David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. 2020. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting 36, 3 (2020), 1181–1191.
  • Shuster et al. (2021) Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval Augmentation Reduces Hallucination in Conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, Marie-Francine Moens, Xuan**g Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, Punta Cana, Dominican Republic, 3784–3803.
  • Tan et al. (2023) Zeqi Tan, Shen Huang, Zixia Jia, Jiong Cai, Yinghui Li, Weiming Lu, Yueting Zhuang, Kewei Tu, Pengjun Xie, Fei Huang, and Yong Jiang. 2023. DAMO-NLP at SemEval-2023 Task 2: A Unified Retrieval-augmented System for Multilingual Named Entity Recognition. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), Atul Kr. Ojha, A. Seza Doğruöz, Giovanni Da San Martino, Harish Tayyar Madabushi, Ritesh Kumar, and Elisa Sartori (Eds.). Association for Computational Linguistics, Toronto, Canada, 2014–2028.
  • Taylor and Letham (2018) Sean J Taylor and Benjamin Letham. 2018. Forecasting at scale. The American Statistician 72, 1 (2018), 37–45.
  • Tonekaboni et al. (2021) Sana Tonekaboni, Danny Eytan, and Anna Goldenberg. 2021. Unsupervised Representation Learning for Time Series with Temporal Neighborhood Coding. In International Conference on Learning Representations.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
  • Woo et al. (2022) Gerald Woo, Chenghao Liu, Doyen Sahoo, Akshat Kumar, and Steven C. H. Hoi. 2022. ETSformer: Exponential Smoothing Transformers for Time-series Forecasting. (2022).
  • Wu et al. (2023) Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. 2023. TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis. In International Conference on Learning Representations.
  • Wu et al. (2021) Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. 2021. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems 34 (2021), 22419–22430.
  • Xu et al. (2024) Peng Xu, Wei **, Xianchao Wu, Lawrence McAfee, Chen Zhu, Zihan Liu, Sandeep Subramanian, Evelina Bakhturina, Mohammad Shoeybi, and Bryan Catanzaro. 2024. Retrieval meets Long Context Large Language Models. In The Twelfth International Conference on Learning Representations.
  • Yang et al. (2022) Sitan Yang, Carson Eisenach, and Dhruv Madeka. 2022. MQRetNN: Multi-Horizon Time Series Forecasting with Retrieval Augmentation. arXiv preprint arXiv:2207.10517 (2022).
  • Ye et al. (2022) Junchen Ye, Zihan Liu, Bowen Du, Leilei Sun, Weimiao Li, Yanjie Fu, and Hui Xiong. 2022. Learning the evolutionary and multi-scale graph structure for multivariate time series forecasting. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2296–2306.
  • Zhang et al. (2022) Yue Zhang, Hongliang Fei, and ** Li. 2022. End-to-end distantly supervised information extraction with retrieval augmentation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2449–2455.
  • Zhao et al. (2019) Liang Zhao, Olga Gkountouna, and Dieter Pfoser. 2019. Traffic Dataset. ACM Transactions on Spatial Algorithms and Systems (TSAS) 5, 3 (2019), 1–28.
  • Zhou et al. (2021) Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 2021. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35. 11106–11115.
  • Zhou et al. (2022) Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong **. 2022. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proc. 39th International Conference on Machine Learning (ICML 2022) (Baltimore, Maryland).
  • Zuo et al. (2019) Xiaochen Zuo, Xue Yang, Zhicheng Dou, and Ji Rong Wen. 2019. RUCIR at TREC 2019: Conversational Assistance Track. 28th Text REtrieval Conference, TREC 2019 - Proceedings. https://doi.org/10.1145/1122445.1122456
{appendices}

6. Fliggy Hotel Service Volume Dataset

The Fliggy Hotel Service Volume Dataset (FHSV) is derived from service records within the Fliggy APP’s customer service center, capturing the volume of daily customer service operations related to inquiries about hotel reservations, cancellations, and changes. Each instance of a service request received either through the APP interface or by phone is counted as a single service event. Daily aggregates of such events are recorded, forming individual data points. This dataset comprises 1740 data points collected between January 1, 2019, and October 7, 2023.

To facilitate experimentation, this time series dataset has been sequentially partitioned in a chronological order into three subsets: the train set (from January 1, 2019, up until May 24, 2022), the evaluation set (from May 25, 2022, to October 31, 2022), and the test set (from November 1, 2022, onward). The reported experimental results are based on the model’s performance on the test set alone.

Table 5. Length of Retrieval Index Segment of Knowledge Base.
dataset retrieval length metric
mse 4621934.067
4 mae 1303.329
mse 3828479.700
7 mae 1178.574
mse 3421546.027
14 mae 1101.172
mse 3747974.421
21 mae 1208.379
mse 3816283.253
28 mae 1243.133
mse 3913793.800
FHSV 35 mae 1332.472

7. Other Experiment Details

7.1. Length of Retrieval Index Segment of Knowledge Base

The length of the retrieval segment in a knowledge base significantly affects the accuracy of retrieving the original sequence. Through controlled experiments, we identified the most suitable retrieval segment length for the Fliggy scenario. As shown in Table 5, we incrementally extended the length of the retrieval segment K from 4 units to 35 (corresponding to half to five times the length of the prediction sequence at 7 units), then observed the impact on model performance as measured by MSE and MAE.

The results revealed that during the expansion of the retrieval segment from 4 to 14 units, the predictive performance improved. However, beyond this point, as the retrieval segment continued to lengthen, the predictive performance began to decline. This trend demonstrates that, in practical scenarios, as the retrieval segment grows from short to long, its representation transitions from being information-poor to increasingly mixed and complex.

This situation underscores why, in traditional time series databases, using full-ordered representations for retrieval often yields effective outcomes. By contrast, our K,V design effectively circumvents this issue, providing flexible and accurate retrieval results.

7.2. Optimal Retrieval Count

Having established that integrating effective historical pieces positively impacts forecasting, we then inquire about the optimal number of sequences to retrieve. To demonstrate the effect of this choice, we gradually increase the number of integrated retrieve sequences from 0 to 5. As shown in Table 6, during the process of expanding from integrating no historical piece to integrating up to three historical pieces, the MAE error consistently decreases; however, when the quantity of integrated historical pieces is further increased, the forecasting error starts to escalate.

Upon analyzing samples, the insight is obvious: not all of the top five retrieved historical piece for most of samples closely resemble the Ground Truth. The reason being, focusing solely on the TopK ranking without adequately considering similarity thresholds can easily lead to a scenario where, as K increases, the quality of the retrieved pieces becomes less assured and more prone to introduce noise and irrelevant information, thus potentially causing confusion instead.

Table 6. Results of recalling different numbers of pieces in our RATSF model.
dataset methods metric
mse 4344450.512
FHSV baseline mae 1349.146
mse 3678230.020
recalled Top-1 pieces mae 1160.101
mse 3700934.754
recalled Top-2 pieces mae 1155.952
mse 3421546.027
recalled Top-3 pieces mae 1101.172
mse 3460814.538
recalled Top-4 pieces mae 1108.895
mse 3750426.751
recalled Top-5 pieces mae 1231.564