RATSF: Empowering Customer Service Volume Management through Retrieval-Augmented Time-Series Forecasting

Tianfeng Wang Alibaba GroupHangzhouChina [email protected] and Gaojie Cui Yuaiweiwu TechBei**gChina [email protected]

Abstract.

An efficient customer service management system hinges on precise forecasting of service volume. In this scenario, where data non-stationarity is pronounced, successful forecasting heavily relies on identifying and leveraging similar historical data rather than merely summarizing periodic patterns. Existing models based on RNN or Transformer architectures may struggle with this flexible and effective utilization. To tackle this challenge, we initially developed the Time Series Knowledge Base (TSKB) with an advanced indexing system for efficient historical data retrieval. We also developed the Retrieval Augmented Cross-Attention (RACA) module, a variant of the cross-attention mechanism within Transformer’s decoder layers, designed to be seamlessly integrated into the vanilla Transformer architecture to assimilate key historical data segments. The synergy between TSKB and RACA forms the backbone of our Retrieval-Augmented Time Series Forecasting (RATSF) framework. Based on the above two components, RATSF not only significantly enhances performance in the context of Fliggy hotel service volume forecasting but also adapts flexibly to various scenarios and integrates with a multitude of Transformer variants for time-series forecasting. Extensive experimentation has validated the effectiveness and generalizability of this system design across multiple diverse contexts.

PVLDB Reference Format:
PVLDB, 14(1): XXX-XXX, 2020.
doi:XX.XX/XXX.XX ^†^†This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 14, No. 1 ISSN 2150-8097.
doi:XX.XX/XXX.XX

1. Introduction

Accurate estimation of total service demand is crucial for customer service volume management in the travel industry, including hotels, airlines, attractions, and transaction-brokering apps like Fliggy, significantly affecting system costs.Underestimating by 100 service requests incurs urgent mobilization costs, equivalent to three times the labor cost for a single request; overestimation, meanwhile, leads to wasted labor costs. The travel industry presents unique challenges in service volume forecasting due to its interplay with variables like order statuses, weather, international events, and destination country policies. At Fliggy, precise forecasting is essential for planning the recruitment and training of customer service staff, as well as scheduling. These factors result in non-stationary data patterns that can introduce significant biases with traditional uni-variant time-series forecasting methods (Hyndman and Athanasopoulos, 2018; Anderson and Kendall, 1976; Taylor and Letham, 2018), which rely on periodic summaries and trend analyses.

In fact, across domains like stock market analysis and station traffic forecasting where business cycles and fluctuations play a significant role, there is a shared desire for a universal and flexible time series forecasting system design that can adeptly utilize historical sequence information to tackle intricate prediction challenges. Meanwhile, in the realm of time series forecasting(TSF), where models, whether based on Recurrent Neural Networks (RNNs) or Transformer, commonly face difficulties efficiently processing and extracting insights from vast amounts of historical data.

One naive way to solve this is to try elongating the sequence that the transformer processes. Some approach involves sampling historical data to fit within a limited context window. Specifically, the Informer (Zhou et al., 2021) algorithm samples K points from the sequence and derives a shorter Q sequence based on these sampled points. However, this approach assumes that all historical information is equally important, which may not be suitable for many time series scenarios where different data points can carry varying significance. Perceiver (Jaegle et al., 2021) and similar methods opt for a different approach by map** Query sequences to fixed lengths, reducing computation and allowing for more historical data storage. Nonetheless, they still face challenges in efficiently extracting and interpreting critical information from extended time series.

In the field of natural language processing, strides have been made to expand Transformer models’ context handling. On one front, techniques like flash-attention (Dao et al., 2022) enhance efficiency and reduce complexity, enabling longer context processing. However, it’s crucial to note that directly extending receptive fields may cause larger models to overlook significant details in lengthy inputs, like (Liu et al., 2023) mentioned.

On another front, Retrieval-Augmented Generation (RAG)-like methods have drawn attention for enhancing model performance by incorporating external information. NVIDIA’s (Xu et al., 2024) research indicates that even when models already handle large context windows in text tasks, they can still achieve substantial performance gains by retrieving and using relevant data from external sources.

To address the above challenges, we have identified a potential approach employing the concept of retrieval augmentation (RA). Concretely, we focus on implementing two central enhancements: a knowledge base schema that can efficiently index all historical series, and a cross-attention module embedded in a transformer model to integrate historical information for pinpointing and exploiting the most predictive segments, thereby refining prediction accuracy. The synergy between TSKB and RACA constitutes the RATSF framework, which significantly improves the performance of most univariate time series forecasting tasks. Currently, RATSF has been integrated into four key service areas within Fliggy, encompassing hotel bookings and after-sales services, train ticketing, and flight reservations and modifications. This integration has led to a notable decrease in forecasting errors, from an approximate range of 15% to approximately 8% across all the aforementioned sectors, which has significantly reduced personnel management costs, although the exact extent of cost savings cannot be divulged for commercial reasons.

In concise terms, our main contributions are:

1. We present a straightforward and manageable design for a time-series knowledge base (TSKB) based on the characteristic that time series data is easy to be stored in a structured way, which significantly facilitates efficient management of historical data.

2. We introduce a versatile cross-attention module, retrieval augmented cross-attention (RACA) module, designed to integrate retrieved historical data into the forecasting process. This module is easily adaptable and can be seamlessly integrated with various time-series transformers.

3. We raise a Retrieval-Augmented Time Series Forecasting (RATSF) framework, which decreases 7% forecasting errors in the real service areas within Fliggy and therefore reduced huge personnel practical management costs. Extensive experiments on more publicly available datasets demonstrate that our method achieves the best performance for the univariate time series forecasting task and is general for a broader range of industrial applications.

2. Review

Despite intense competition (Salinas et al., 2020; Das et al., 2023; Ye et al., 2022; Han and Woo, 2022), Transformer models (Vaswani et al., 2017) and their improved variants (Qi et al., 2021; Woo et al., 2022; Liu et al., 2021; Kitaev et al., 2020; Lim et al., 2021) have become the mainstream choice in time-series forecasting tasks. However, the computational complexity of the original Transformer model scales as O(n²) with respect to sequence length, significantly limiting the maximum sequence length it can handle and thereby constraining its ability to use historical data.

Improvements in Time-Series Forecasting with Transformers. Numbers of works focus on enhancing the Transformer model’s capability to extract temporal features, thereby improving prediction accuracy, exemplified by Fedformer’s (Zhou et al., 2022) introduction of a frequency-augmented Attention mechanism that directly incorporates Fourier operators, a similar path taken by TimesNet (Wu et al., 2023). In contrast, Autoformer (Wu et al., 2021) proposes an operator that decomposes time series information into trend and periodic components. These two types of solutions perform well for relatively stationary sequences but may see performance drop when dealing with highly non-stationary time series data.

Another core strategy involves learning a robust representation of the time series first, which is then used to enhance the forecasting accuracy. TNC (Tonekaboni et al., 2021) harnesses the core concept of contrastive learning and employs samples within a specific temporal neighborhood within a window as positive pairs, while treating samples from differing temporal neighborhoods as negative pairs. Despite these improvements having collectively boosted the ability of Transformers in time-series forecasting tasks, none has fundamentally increased the Transformer’s receptive field.

Retrieve Augmented Method in NLP. In last two years, the Retrieval-Augmented Generation (RAG) approach (Guu et al., 2020; Borgeaud et al., 2022; Pasupat et al., 2021; Zhang et al., 2022; Shuster et al., 2021; Khandelwal et al., 2020; Zuo et al., 2019) has gained widespread adoption in the field of NLP. REALM harnesses a knowledge retriever to distill information from vast corpora and thereby enhance the performance of pre-trained language models. Transformer-XL+kNN (Bonetta et al., 2021) incorporates a K-Nearest Neighbors algorithm to search through training data, refining dialogue generation capabilities. In the context of named entity recognition tasks, U-RaNER (Tan et al., 2023) utilizes multi-modal heterogeneous retrieval techniques to boost knowledge retrive and, by integrating retrieved knowledge into the model, strengthens its understanding of queries and improves entity recognition accuracy.

Retrieval Augmented Method in Time-series Forecasting. MQ-ReTCNN (Yang et al., 2022) is designed for complex time series prediction tasks involving multiple entities and variables. It employs a scoring function to compile relevant contexts from offline data, selects scored segments, and appends them to the prediction sequence.Its retrieval mechanism emphasizes leveraging historical sequences of one entity to enhance predictions about another, without directly utilizing historical data for direct prediction assistance.

ReTime (**g et al., 2022) creates a relation graph based on temporal closeness between sequences and employs relational retrieval instead of content-based retrieval.It does not optimally use historically similar sequences as reference points due to its inherent design limitations. Both MQ-ReTCNN and ReTime incorporate retrieval enhancement strategies but have yet to introduce a general and efficient retrieval technique specifically for single-variable time series prediction scenarios.

Refer to caption — Figure 1. RATSF uses a TSKB to store and index historical sequences, alongside a transformer-based forecasting model. As illustrated in the bottom-right corner, TSKB segments the full history for efficient retrieval. To do forecasting, the Encoder fetches $\mathbf{X}_{o}$ from t recent time steps, and forms a retrieval sequence with d latest steps, retrieves N related sequences $\mathbf{X}_{r}$ from the TSKB. RACA in the Decoder then merges $\mathbf{X}_{r}$ and $\mathbf{X}_{o}$ info to deliver result.

3. Method

3.1. Setting & Notations

Before elucidating our approach, we first clarify the setting of our problem and define some symbols that will be used throughout the text.

In actual business operations, service volume is influenced by a multitude of factors, some of which are challenging to fully represent through variables. To simplify the reasoning process and enhance the versatility of our system, we have configured our hotel service volume prediction using a uni-variant time-series forecast setup. This means that the inputs to the model consist solely of the the values of the time series and it’s temporal features . Due to the requirements of our actual task—where personnel management necessitates advance preparation for recruitment, short-term scheduling, and training. Our forecasting task is designed to start from a specific time point $t$ and predict a sequence of $L_{f}$ data points in one go.

Time Marking. We take moment $t$ as the reference point, and the collection of the future time points to be forecasted is represented as ${\left[t+1,t+2,\ldots,t+l_{f}\right]}$ , where $l_{f}$ denotes the length of the forecast period. Concurrently, we define the retrieve segment length as $l_{r}$ , with the indexing sequence $\mathbf{K}$ in the TSKB having a length of $L_{r}$ , and the length of the V sequence as $l_{v}$ .

Sequence Source Marking. The subscript o denotes the original sequence, that is, the sequence $\mathbf{X_{o}}={x_{1},...,x_{t}}$ prior to moment $t$ ; $\mathbf{X_{f}}$ represents the sequence to be forecasted; $\mathbf{X_{r}}$ signifies the $\mathbf{V}$ sequence retrieved from the TSKB.

Transformer Processing Marking. The superscript is used to indicate the results after processing by various parts of the Transformer. $\mathbf{X^{b}_{o}}$ denotes the original sequence after Embedding, and $\mathbf{X^{e}_{o}}$ denotes the sequence after the Encoder. Within the Decoder, H represents the internal hidden state of the Decoder, $l$ represents the layer number, and $H^{l}$ indicates the hidden state processed by the $l$ -th layer.

3.2. Overview of RATSF

The RATSF system is composed of two core components: a Time Series Knowledge Base (TSKB, detailed in Section 3.3) and a Transformer-based time series forecasting model. The latter has been enhanced by replacing its original decoder with our novel Retrieval Augmented Cross-Attention (RACA, described in Section 3.4) module. The TSKB efficiently segments and archives historical time series data, establishing a precise indexing system. The transformer model, with the integration of RACA, effectively utilizes retrieved historical data to significantly enhance the accuracy of its predictions. The data flow diagram of RATSF is illustrated in Figure 1, providing a visual representation of the process.

3.3. TSKB

Unlike traditional methods that solely store and retrieve the entire original sequence, TSKB preserves the original sequence as the core content (V) while selects certain segments from it to construct a discriminative indexing sequence (K). This approach can be likened to processing a written piece where conventional methods involve direct full-text searches for desired information, which may be inefficient and less precise; whereas with TSKB, it’s akin to extracting key headlines from the body of the text to serve as indices that facilitate rapid access to core information. This dual-sequence structure allows the system, when handling large-scale data, to efficiently locate and access targeted portions of the original sequence through customized index sequences, thereby significantly enhancing overall performance.

3.3.1. Sequential Slicing

As illustrated in Figure 2, the content sequence V is obtained using a rolling window approach with a step size of $S$ and a window length of $L_{v}$ . This approach allows us to sample all historical sequences and incorporate them into TSKB. Since prediction models only use present data to infer the future, we align indexing sequence K with this constraint. We extract the initial segment of each V, having a length of $L_{r}$ , to serve as its index K. The selection of $L_{v}$ , $L_{r}$ and $S$ is introduced in section 3.5.

3.3.2. Embedding Learning

A key approach to enhancing the recall and accuracy of knowledge base retrieval is to employ a well-trained embedding vector to improve the precision of the index. In the following sections, we will focus on how to train and utilize these index embedding vectors. Judging the quality of a retrieval embedding mainly depends on how well it captures similarities at key information points relevant to forecasting tasks. Time series data is complex due to numerous information points and the challenge of presetting comparative weights. Therefore, we use this principle: if a representation closely mirrors the important details needed for prediction, its overall performance will be better.

To ensure that the embeddings capture the essential information for forecasting, we employ the encoder from the RATSF’s forecasting model, which is designed to select precise information during training, thereby enhancing forecast accuracy. Specifically, each indexing sequence K rom the knowledge base is processed through the encoder layer of the RATSF forecasting model to generate a retrieval embedding.

Meanwhile, in the early stages of training the forecasting model, its encoder is not yet capable of producing representations that accurately match target sequences. The model’s effectiveness and learning pace are significantly correlated with its ability to retrieve historical sequences beneficial to forecasting tasks.

To avoid time and data-consuming iterations stemming from a random initialization state, we introduce Dynamic Time War** (DTW(Müller, 2007)) as an auxiliary tool during the initial phase of model training, as shown in left part of Figure 3. DTW is initially used for similarity-based sequence retrieval, aiding in the iterative training process of the RATSF forecasting model. After completing one epoch of training, we transition to using embeddings generated by the forecasting model itself for sequence retrieval, continuing the training until the model converges.

3.4. RACA

As previously mentioned, Retrieval-Augmented Cross-Attention (RACA) is a module designed to be integrated into a time-series forecasting model, coupled with the TSKB. In our demonstration, we employ the vanilla Transformer as the time-series forecasting model; however, it’s important to note that the RACA module is designed to be compatible and can be seamlessly plug into any transformer-based time-series model. We will specifically demonstrate this in Section 4 .

The decoder consists of $l$ stacked RACA modules for inputs of length $L_{f}$ . Each module has two parallel units: Unit 1, like in Function (1), uses $\mathbf{X}^{b}_{f}$ as Query with Key and Value from $\mathbf{X}^{e}_{o}$ , outputting $\mathbf{H}^{l,1}_{f}$ of the same shape. Unit 2, retrieval-augment part, like in Function (2) ,also queries $\mathbf{X}^{b}_{f}$ but fetches Keys and Values from the concatenated sequence of $\mathbf{X}^{e}_{r}$ , generating $\mathbf{H}^{l,2}_{f}$ . $s$ in the Functions is the scaling factor.

Like shown in Figure 4, retrieved sequences are transformed through the encoder’s embedding module, generating Ns $\mathbf{X}^{e}_{r}$ , and we concatenate these N vectors into one with shape $[1,N*L_{v},D]$ for later use.

(1)		$\displaystyle\mathbf{H}^{l,1}_{f}=\text{softmax}\left(\frac{\mathbf{X}^{b}_{f}% \mathbf{X}^{e}_{o}}{s}\right)\mathbf{X}^{e}_{o},$
(2)		$\displaystyle\mathbf{H}^{l,2}_{f}=\text{softmax}\left(\frac{\mathbf{X}^{b}_{f}% \mathbf{X}^{e}_{r}}{s}\right)\mathbf{X}^{e}_{r},$
(3)		$\displaystyle\mathbf{H}^{l+1}_{f}=\delta\left(\text{concat}\left(\mathbf{H}^{l% ,1}_{f},\mathbf{H}^{l,2}_{f}\right)\right).$

As shown in Function 3, after concatenating both intermediate vectors, $\mathbf{H}^{l,1}_{f}$ and $\mathbf{H}^{l,2}_{f}$ , along the sequence dimension, forms a $[1,2*L_{f},D]$ vector, it undergoes linear transformation before being passed to the next layer.

3.5. Deployments Procedure

In this section, we briefly introduce the procedure to deploy RATSF in a new field.

3.5.1. TSKB Initialization

Set the initial retrieval length ( $L_{r}$ ), which can be equivalent to the prediction length ( $L_{f}$ ). Set the sequence length for the V sequence ( $L_{v}$ ) to be at least the sum of $L_{r}$ and $L_{f}$ . This ensures adequate info for retrieval and forecast. It is suggested to set several $L_{v}$ values, such as $L_{r}$ + $L_{f}$ , 2* $L_{f}$ + $L_{r}$ , 3* $L_{f}$ + $L_{r}$ , etc.,to build several TSKBs for optimal value selection later. Set the rolling window step (stride) to 1 and collect $V$ sequences for each TSKBs. Then, Select the initial $l_{r}$ elements of $V$ sequences to form $K$ sequences.

3.5.2. Identifying Optimal $L_{v}$

Train diverse RATSF models on their respective TSKBs and evaluate each by prediction accuracy.Select the model with the highest prediction accuracy as the optimal RATSF model ( $M^{*}$ ), and identify the corresponding V sequence length ( $L_{v}$ ) as the final chosen value ( $L_{v}^{*}$ ).

3.5.3. Adjusting Retrieval Length $L_{r}$

Adjust the retrieval length ( $L_{r}$ ) to observe the impact on the performance of the optimal model (M*). The adjustment range could be from $0.5L_{f}$ to $L_{v}$ , with each adjustment increment being $0.5L_{f}$ . Record the $L_{r}$ value that enables M to achieve the highest prediction accuracy, denoted as $L_{r}^{*}$ .

3.5.4. TSKB and Model Optimization

With the confirmed optimal retrieval length ( $L_{r}*$ ) and V sequence length ( $L_{v}^{*}$ ), reconfigure the TSKB. Reinitialize and train the RATSF model with the updated TSKB till converge.

4. Experiments

To demonstrate the superiority of our method in service volume forecasting, we first conducted experiments using the Fliggy Hotel Service Volume Dataset (FHSV), showcasing the performance enhancement of typical time-series models when integrated with RATSF. Additionally, to validate the general applicability of our approach, we extended our experiments to three other datasets. Furthermore, through a series of ablation studies, we individually assessed the impact of each key design choice in RATSF to confirm the correctness of our detailed selections.

4.1. Experiment Setup

Datasets. Given that our primary focus lies in the forecasting task within the context of customer service volume management, we initially employed the Fliggy Hotel Service Volume Dataset to substantiate the efficacy of RATSF, the detail description of the dataset is in Appendix A. Additionally, to demonstrate the performance of our approach in other contexts, we employed three classic time-series forecasting datasets: ETT(Zhou et al., 2021), Exchange(Zhao et al., 2019) , Traffic(tra, 2017) .

Models.In Section 3.4, we established that the Retrieval Augmented Cross-Attention (RACA) module is compatible with various Transformer variants. To underscore this versatility, we selected three prominent time-series domain models: Fedformer(Zhou et al., 2022), Autoformer(Wu et al., 2021), and NS-Transformer(Liu et al., 2022), and compared them to the vanilla Transformer. Leveraging the specialized operators within Fedformer and Autoformer that cater to time-series periodicity and trends, such as Autoformer’s seasonal and trend-cyclical initializations, we have seamlessly integrated these models with the RACA module. This integration allows for the effective incorporation of their distinct sequential decomposition capabilities, enhancing the overall processing power. Furthermore, to demonstrate the advantage of our recall mechanism in the RATSF framework, we compared it with ReTime, a RAG-based model that also uses specialized retrieval techniques. Since ReTime has not released their code and their application domain is distinct from our focus on univariate time-series forecasting, we reconstructed the Relational Retrieval method based on their published paper. (It is worth noting that, as of the time of writing this paper, neither MQ-ReTCNN nor ReTime have released their code to the public.)

As shown in Figure 1, the training outcomes of the encoder’s output significantly influence both the retrieval quality and the decoder’s performance. In the main experiments, we will compare the retrieval schemes of RATSF and ReTime. Furthermore, in several subsequent experiments, we will elucidate the differences between RATSF’s retrieval approach and DTW. Consequently, for the majority of the experiments in this paper, to facilitate these comparisons, we have replaced the input to RACA from $\mathbf{X}^{e}_{r}$ to $\mathbf{X}^{b}_{r}$ , and we will provide an ablation study to explore the specific effects arising from this substitution.

Forecasting Setting. Like we claim in Section 3.1, we adopt the uni-variant time-series forecasting setting. In the Fliggy’s practical operations, we are required to finalize our forecasting for the upcoming week’s day-by-day staffing needs one week in advance, which is why the actual prediction window period is set at 7 days, represented as 7 tokens. In the travel industry, which Fliggy App’s serving, a quarter generally defines a relatively complete business trend. Therefore, we choose an encoder window length of 98 days—exceeding 90 days and divisible by 14—which consequently results in $\mathbf{X_{o}}$ is 98 units long. And we concate $\left\{x_{t-14},...,x_{t}\right\}$ with a placeholder of length 7 to form $\mathbf{X_{f}}$ . To match the length of $\mathbf{X_{f}}$ , we set $L_{v}$ =21 and $L_{r}$ =14. Regarding normalization, we employ unified $\mu$ and $\sigma$ values for whitening operations during the preprocessing stage and inverse normalization is carried out after forecasting. We have applied the same data processing scheme to both the ETT, Exchange and Traffic datasets.

Training Setting.We have selected the Adam optimizer, with the batch size $=$ 64 and max training epochs $=$ 10. The initial learning rate (lr) is set to 0.0001, employing a linear decay strategy where the decay parameter for the first 5 epochs is 0.9, for the last 5 epochs is 0.5. We also incorporate $L1$ regularization with $\lambda$ =0.0001. Concurrently, an early stop** mechanism is activated when the loss ceases to decrease.

Metric.To ensure that our evaluation metrics directly reflect business performance, we first inverse normalize the forecast result as above mentioned and then compute the MSE (Mean Squared Error) and MAE (Mean Absolute Error) against the Ground Truth. The lower these two indicators are, the more accurate the prediction results prove to be. Moreover, each decrement of 1 unit in MAE signifies a corresponding reduction of 1 unit in service staff management expenditure. This configuration has been applied uniformly across all experiments.

Table 1. Performance comparison of the Transformer models with their retrieval-based variants on four experimental datasets. ’baseline’ denotes the original model without retrieval, compared to Relational Retrieva from ReTime and our approach RATSF.All results are evaluated using MSE (Mean Squared Error) and MAE (Mean Absolute Error), and the best result is bolded.

models		transformer		nstransformer		autoformer		fedformer
metric		mse	mae	mse	mae	mse	mae	mse	mae
FHSV	baseline	4344450.512	1349.146	4414326.000	1242.128	6096673.500	1752.798	5517485.000	1638.275
	Relational Retrieval	3737283.320	1276.836	4037294.000	1168.127	5821383.300	1703.328	5411003.700	1610.320
	RATSF	3421546.027	1101.172	3823635.000	1070.933	5597297.500	1657.561	5386818.500	1595.657
ETTh1	baseline	3.187	1.391	1.501	0.920	8.651	2.264	4.159	1.597
	Relational Retrieval	2.876	1.206	1.465	0.900	8.203	2.197	3.012	1.572
	RATSF	2.366	1.180	1.407	0.886	7.300	2.113	3.952	1.565
Exchange	baseline	0.00165	0.03134	0.00014	0.00922	0.00058	0.01906	0.00172	0.03493
	Relational Retrieval	0.00143	0.02845	0.00013	0.00920	0.00069	0.02013	0.00102	0.02821
	RATSF	0.00068	0.02012	0.00013	0.00912	0.00067	0.02000	0.00068	0.02084
Traffic	baseline	0.00007	0.00564	0.00006	0.00569	0.00008	0.00643	0.00012	0.00876
	Relational Retrieval	0.00007	0.00534	0.00006	0.00512	0.00007	0.00601	0.00009	0.00718
	RATSF	0.00006	0.00498	0.00006	0.00472	0.00007	0.00514	0.00007	0.00650

4.2. Main Restult

Table 1 displays the prediction accuracy of the Transformer and its variants on four datasets, comparing performance without RA, with ReTime’s RA, and with our RATSF approach. In the Fliggy Hotel Customer Service Volume Dataset forecasting scenario, the Transformer model’s MAE loss was reduced by 5.34% with ReTime’s Relational Retrieval and by 18% after incorporating RATSF. This improvement could potentially translate to a reduction of approximately 200 in redundant personnel costs in practical management operations. A horizontal analysis of the first row in Table 1 reveals that all compared Transformer-based time-series model variants experienced a significant reduction in MAE after employing Retrieval-Augmented (RA) strategies. Specifically, the adoption of RATSF resulted in respective decreases of 14%, 5%, and 4%. In contrast, the application of ReTime’s retreval method led to improvements of 5.9%, 2.8%, and 1.7% compared to their original designs without RA. These results clearly demonstrate two conclusions: first, effective RA strategies can enhance the performance of time-series Transformer models; second, RATSF exhibits a significant advantage in its retrieval strategy.

Furthermore, cross-comparison reveals that Autoformer and FedFormer yield slightly inferior results compared to the Transformer, mainly due to the stronger non-stationarity inherent in customer service volumes data within the hotel industry. As an illustrative example, while Mondays typically exhibit similar cyclical characteristics, actual service volume during Mondays preceding a short holiday can surge significantly compared to regular ones. In such instances, historical data from periods close to other short holidays provide more valuable insights than simple temporal periodicity and short-term trends; specific case studies will be showcased later.

Additionally, across the ETT, Exchange and the Traffic dataset, the adoption of RATSF design in the Transformer and its time-series optimized variants led to noticeable reductions in MAE metrics. Moreover, among these, the NSTransformer consistently outperforms the other models, likely because its mean-adapted feature proves effective in a broad range of time-series prediction domains.

4.3. Ablation Study

We confirm the effectiveness of several design choices in this section. Section 4.3.1 emphasizes the superiority of RACA’s approach in integrating historical data for improved forecasting. Section 4.3.2 demonstrates the advantages of utilizing model encoder for recalling historical pieces. Section 4.3.3 demonstrates the benefits of incorporating DTW into the training process.

Table 2. Ablation of the specifically designed structures within RACA on FHSV scenario.

dataset	methods	metric
		mse	4344450.512
	baseline	mae	1349.146
	baseline +	mse	4071384.750
	Design One	mae	1226.850
		mse	3421546.027
FHSV	baseline + RACA	mae	1101.172

4.3.1. Integrate Historical Sequences with RACA

Based on the Transformer architecture, two designs are considered for integrating historical pieces into the forecasting model: Design One represents a straightforward approach where historical pieces are combined with the forecasting context inputs within the encoder; Design Two employs our advanced RACA design that integrate historical pieces using cross-attention mechanisms within the decoder.

With the Fliggy Hotel Service Volume Dataset, we contrasted the performance of three methods: No historical pieces used(baseline experiment), Strategy One, and Strategy Two (RACA). The outcomes in Table 2 demonstrate that the RACA design excels, with its MAE notably lower than both the baseline and Strategy One.

A deeper examination disclosed that the Cross Attention mechanism in the RACA design efficiently guides the forecast sequence to concentrate on past trends analogous to the current scenario. In Figure 5, we illustrate how retrieved sequences influence the predicted values for the first forecast point of a specific sequence through RACA’s cross-attention mechanism. The lower half features the top 3 similar sequences, concatenated along the time axis to form RACA’s input. Attention weights at the first forecast point ( $x_{t+1}$ ) are marked by data bars, with yellow and green indicating RACA’s focus on downward-curving segments. The upper half replicates prediction results (yellow ”pred” lines for $x_{t+1}...x_{t+f}$ ) and true values (orange ”gt” lines), aligned with the corresponding final $L_{f}$ lengths of the retrieved sequences. This visual comparison highlights the model’s pattern recognition, with RACA’s focus areas correlating to the predictions’ inflection at $x_{t+1}$ .

Table 3. Comparison of several variants of retrieval representation schemes on the FHSH dataset.

stage	methods	metric
		mse	4344450.512
	baseline	mae	1349.146
		mse	3787191.250
	baseline + DTW	mae	1226.690
		mse	3784188.541
	baseline + MLP	mae	1210.785
		mse	3421546.027
representation retrieval	baseline + encoder	mae	1101.172
		mse	3421546.027
	RATSF with embedding	mae	1101.172
		mse	3343793.904
decoding	RATSF with encoder	mae	1004.326

4.3.2. Retrieval Embedding

As previously mentioned, in our main experiments, we aimed to isolate the contribution of the encoder’s output to the retrieval performance. To do this, we utilized the encoder’s output $\mathbf{X}^{e}_{r}$ for the retrieval process and the embedding output $\mathbf{X}^{b}_{r}$ for decoding. In this section, we have completed the discussion of this part of the logic. Here, we will compare different schemes for retrieval and evaluate the effects of various decoding schemes.

We argue that DTW is not the optimal retrieval method and demonstrate this through an experiment on the FHSV dataset, where we compare DTW, a separately trained two-layer MLP, and our current design choice: using the forecasting model’s encoder for retrieval. The upper half of Table 3 presents a comparison of the forecasting precision achieved with these different retrieval embedding strategies. The ’Baseline’ indicates the standard Transformer model without any retrieval mechanism. As the results indicate, the encoder output outperforms the MLP, likely due to the attention mechanism within the Transformer’s encoder, which is adept at identifying and refining the relationships among data points in sequences, thus capturing more accurate patterns.

The lower half of Table 3 compares the decoding performance of RACA when using the Transformer’s embedding output and encoder output as the information source. The experimental results indicate that refining the retrieved sequence information with the Encoder can reduce the MAE by 18.38% to 1004.326, compared to the vanilla Transformer, and by 8.79% compared to using the embedding output. This substantiates the effectiveness of the RATSF architecture demonstrated in Figure 1.

Table 4. Effectiveness of Using DTW as Auxiliary Training.

dataset	number of epochs	metric
		mse	3421546.027
	1 epoch	mae	1101.172
		mse	3769362.251
	2 epoch	mae	1252.136
		mse	3697616.753
	3 epoch	mae	1135.631
		mse	3720408.244
	4 epoch	mae	1248.433
		mse	3842794.029
	5 epoch	mae	1278.324
		mse	3800124.170
FHSV	end to end	mae	1302.054

4.3.3. Effectiveness of Using DTW as Auxiliary Training

In Section 3.3.2, we assert the effectiveness of employing DTW as a retrieval method during the cold start phase to facilitate the training of forecasting models for a single epoch. To substantiate the rationale for this strategy, we conducted a comparative experimental study: one baseline group without DTW-assisted training (referred to as ’end-to-end’), and several other groups, each trained with DTW assistance for 1 to 5 epochs. We ensured that all experiments had the same maximum number of training epochs and utilized an identical early stop** strategy. Table 4 demonstrates that the model’s predictive performance is the weakest when DTW is not utilized, as indicated by the highest MSE and MAE values in the last row of Table 4; furthermore, extending DTW-assisted training beyond one epoch did not lead to improved outcomes.

This result aligns with our hypothesis that in the absence of DTW support during the initial training phase, the model has difficulty identifying high-quality retrieval vectors, which are crucial for accurate forecasting. However, as our primary experiments have shown, DTW is not inherently the best retrieval method for enhancing predictive accuracy. Therefore, extending the period of DTW-assisted training may hinder rather than enhance the model’s ability to express and forecast effectively.

5. Conclusion

In conclusion, RATSF stands out in the domain of univariant time-series forecasting by integrating a TSKB that meticulously slices and indexes historical data, leading to superior retrieval precision. This is further enhanced by the Encoder’s representations, which act as a sophisticated retrieval mechanism. Coupled with our innovative RACA mechanism for adeptly merging retrieved segments, we achieve heightened forecast accuracy. The adaptability of our method is highlighted by its compatibility with diverse Transformer variants and its extensive, successful application within Fliggy’s service volume forecasting scenarios.

References

(1)
tra (2017) 2017. Traffic Dataset. http://pems.dot.ca.gov/.
Anderson and Kendall (1976) O. D. Anderson and Maurce Kendall. 1976. Time-series. Vol. 2nd edn. J. R. Stat. Soc. (Series D).
Bonetta et al. (2021) Giovanni Bonetta, Rossella Cancelliere, Ding Liu, and Paul Vozila. 2021. Retrieval-Augmented Transformer-XL for Close-Domain Dialog Generation. In The 34th International FLAIRS Conference. https://doi.org/10.32473/flairs.v34i1.128369
Borgeaud et al. (2022) Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. 2022. Improving language models by retrieving from trillions of tokens. In International conference on machine learning. PMLR, 2206–2240.
Dao et al. (2022) Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems 35 (2022), 16344–16359.
Das et al. (2023) Abhimanyu Das, Weihao Kong, Andrew Leach, Shaan K Mathur, Rajat Sen, and Rose Yu. 2023. Long-term Forecasting with TiDE: Time-series Dense Encoder. Transactions on Machine Learning Research (2023).
Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In International conference on machine learning. PMLR, 3929–3938.
Han and Woo (2022) Siho Han and Simon S Woo. 2022. Learning sparse latent graph representations for anomaly detection in multivariate time series. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2977–2986.
Hyndman and Athanasopoulos (2018) Rob J Hyndman and George Athanasopoulos. 2018. Forecasting: principles and practice. OTexts.
Jaegle et al. (2021) Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. 2021. Perceiver: General perception with iterative attention. In International conference on machine learning. PMLR, 4651–4664.
**g et al. (2022) Baoyu **g, Si Zhang, Yada Zhu, Bin Peng, Kaiyu Guan, Andrew Margenot, and Hanghang Tong. 2022. Retrieval Based Time Series Forecasting. arXiv preprint arXiv:2209.13525 (2022).
Khandelwal et al. (2020) Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through Memorization: Nearest Neighbor Language Models. In International Conference on Learning Representations.
Kitaev et al. (2020) Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The Efficient Transformer. In International Conference on Learning Representations.
Lim et al. (2021) Bryan Lim, Sercan Ö Arık, Nicolas Loeff, and Tomas Pfister. 2021. Temporal fusion transformers for interpretable multi-horizon time series forecasting. International Journal of Forecasting 37, 4 (2021), 1748–1764.
Liu et al. (2023) Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172.
Liu et al. (2021) Shizhan Liu, Hang Yu, Cong Liao, Jianguo Li, Weiyao Lin, Alex X Liu, and Schahram Dustdar. 2021. Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting. In International conference on learning representations.
Liu et al. (2022) Yong Liu, Haixu Wu, Jianmin Wang, and Mingsheng Long. 2022. Non-stationary transformers: Exploring the stationarity in time series forecasting. Advances in Neural Information Processing Systems 35 (2022), 9881–9893.
Müller (2007) Meinard Müller. 2007. Dynamic time war**. Information retrieval for music and motion (2007), 69–84.
Pasupat et al. (2021) Panupong Pasupat, Yuan Zhang, and Kelvin Guu. 2021. Controllable Semantic Parsing via Retrieval Augmentation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Marie-Francine Moens, Xuan**g Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 7683–7698.
Qi et al. (2021) Xinyuan Qi, Kai Hou, Tong Liu, Zhongzhong Yu, Sihao Hu, and Wenwu Ou. 2021. From Known to Unknown: Knowledge-guided Transformer for Time-Series Sales Forecasting in Alibaba. CoRR abs/2109.08381 (2021).
Salinas et al. (2020) David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. 2020. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting 36, 3 (2020), 1181–1191.
Shuster et al. (2021) Kurt Shuster, Spencer Poff, Moya Chen, Douwe Kiela, and Jason Weston. 2021. Retrieval Augmentation Reduces Hallucination in Conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, Marie-Francine Moens, Xuan**g Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, Punta Cana, Dominican Republic, 3784–3803.
Tan et al. (2023) Zeqi Tan, Shen Huang, Zixia Jia, Jiong Cai, Yinghui Li, Weiming Lu, Yueting Zhuang, Kewei Tu, Pengjun Xie, Fei Huang, and Yong Jiang. 2023. DAMO-NLP at SemEval-2023 Task 2: A Unified Retrieval-augmented System for Multilingual Named Entity Recognition. In Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), Atul Kr. Ojha, A. Seza Doğruöz, Giovanni Da San Martino, Harish Tayyar Madabushi, Ritesh Kumar, and Elisa Sartori (Eds.). Association for Computational Linguistics, Toronto, Canada, 2014–2028.
Taylor and Letham (2018) Sean J Taylor and Benjamin Letham. 2018. Forecasting at scale. The American Statistician 72, 1 (2018), 37–45.
Tonekaboni et al. (2021) Sana Tonekaboni, Danny Eytan, and Anna Goldenberg. 2021. Unsupervised Representation Learning for Time Series with Temporal Neighborhood Coding. In International Conference on Learning Representations.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
Woo et al. (2022) Gerald Woo, Chenghao Liu, Doyen Sahoo, Akshat Kumar, and Steven C. H. Hoi. 2022. ETSformer: Exponential Smoothing Transformers for Time-series Forecasting. (2022).
Wu et al. (2023) Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. 2023. TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis. In International Conference on Learning Representations.
Wu et al. (2021) Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. 2021. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Advances in Neural Information Processing Systems 34 (2021), 22419–22430.
Xu et al. (2024) Peng Xu, Wei **, Xianchao Wu, Lawrence McAfee, Chen Zhu, Zihan Liu, Sandeep Subramanian, Evelina Bakhturina, Mohammad Shoeybi, and Bryan Catanzaro. 2024. Retrieval meets Long Context Large Language Models. In The Twelfth International Conference on Learning Representations.
Yang et al. (2022) Sitan Yang, Carson Eisenach, and Dhruv Madeka. 2022. MQRetNN: Multi-Horizon Time Series Forecasting with Retrieval Augmentation. arXiv preprint arXiv:2207.10517 (2022).
Ye et al. (2022) Junchen Ye, Zihan Liu, Bowen Du, Leilei Sun, Weimiao Li, Yanjie Fu, and Hui Xiong. 2022. Learning the evolutionary and multi-scale graph structure for multivariate time series forecasting. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2296–2306.
Zhang et al. (2022) Yue Zhang, Hongliang Fei, and ** Li. 2022. End-to-end distantly supervised information extraction with retrieval augmentation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2449–2455.
Zhao et al. (2019) Liang Zhao, Olga Gkountouna, and Dieter Pfoser. 2019. Traffic Dataset. ACM Transactions on Spatial Algorithms and Systems (TSAS) 5, 3 (2019), 1–28.
Zhou et al. (2021) Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. 2021. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, Vol. 35. 11106–11115.
Zhou et al. (2022) Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong **. 2022. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proc. 39th International Conference on Machine Learning (ICML 2022) (Baltimore, Maryland).
Zuo et al. (2019) Xiaochen Zuo, Xue Yang, Zhicheng Dou, and Ji Rong Wen. 2019. RUCIR at TREC 2019: Conversational Assistance Track. 28th Text REtrieval Conference, TREC 2019 - Proceedings. https://doi.org/10.1145/1122445.1122456

{appendices}

6. Fliggy Hotel Service Volume Dataset

The Fliggy Hotel Service Volume Dataset (FHSV) is derived from service records within the Fliggy APP’s customer service center, capturing the volume of daily customer service operations related to inquiries about hotel reservations, cancellations, and changes. Each instance of a service request received either through the APP interface or by phone is counted as a single service event. Daily aggregates of such events are recorded, forming individual data points. This dataset comprises 1740 data points collected between January 1, 2019, and October 7, 2023.

To facilitate experimentation, this time series dataset has been sequentially partitioned in a chronological order into three subsets: the train set (from January 1, 2019, up until May 24, 2022), the evaluation set (from May 25, 2022, to October 31, 2022), and the test set (from November 1, 2022, onward). The reported experimental results are based on the model’s performance on the test set alone.

Table 5. Length of Retrieval Index Segment of Knowledge Base.

dataset	retrieval length	metric
		mse	4621934.067
	4	mae	1303.329
		mse	3828479.700
	7	mae	1178.574
		mse	3421546.027
	14	mae	1101.172
		mse	3747974.421
	21	mae	1208.379
		mse	3816283.253
	28	mae	1243.133
		mse	3913793.800
FHSV	35	mae	1332.472

7. Other Experiment Details

7.1. Length of Retrieval Index Segment of Knowledge Base

The length of the retrieval segment in a knowledge base significantly affects the accuracy of retrieving the original sequence. Through controlled experiments, we identified the most suitable retrieval segment length for the Fliggy scenario. As shown in Table 5, we incrementally extended the length of the retrieval segment K from 4 units to 35 (corresponding to half to five times the length of the prediction sequence at 7 units), then observed the impact on model performance as measured by MSE and MAE.

The results revealed that during the expansion of the retrieval segment from 4 to 14 units, the predictive performance improved. However, beyond this point, as the retrieval segment continued to lengthen, the predictive performance began to decline. This trend demonstrates that, in practical scenarios, as the retrieval segment grows from short to long, its representation transitions from being information-poor to increasingly mixed and complex.

This situation underscores why, in traditional time series databases, using full-ordered representations for retrieval often yields effective outcomes. By contrast, our K,V design effectively circumvents this issue, providing flexible and accurate retrieval results.

7.2. Optimal Retrieval Count

Having established that integrating effective historical pieces positively impacts forecasting, we then inquire about the optimal number of sequences to retrieve. To demonstrate the effect of this choice, we gradually increase the number of integrated retrieve sequences from 0 to 5. As shown in Table 6, during the process of expanding from integrating no historical piece to integrating up to three historical pieces, the MAE error consistently decreases; however, when the quantity of integrated historical pieces is further increased, the forecasting error starts to escalate.

Upon analyzing samples, the insight is obvious: not all of the top five retrieved historical piece for most of samples closely resemble the Ground Truth. The reason being, focusing solely on the TopK ranking without adequately considering similarity thresholds can easily lead to a scenario where, as K increases, the quality of the retrieved pieces becomes less assured and more prone to introduce noise and irrelevant information, thus potentially causing confusion instead.

Table 6. Results of recalling different numbers of pieces in our RATSF model.

dataset	methods	metric
		mse	4344450.512
FHSV	baseline	mae	1349.146
		mse	3678230.020
	recalled Top-1 pieces	mae	1160.101
		mse	3700934.754
	recalled Top-2 pieces	mae	1155.952
		mse	3421546.027
	recalled Top-3 pieces	mae	1101.172
		mse	3460814.538
	recalled Top-4 pieces	mae	1108.895
		mse	3750426.751
	recalled Top-5 pieces	mae	1231.564