F-FOMAML: GNN-Enhanced Meta-Learning for Peak Period Demand Forecasting with Proxy Data

Zexing Xu (corresponding author: [email protected]), University of Illinois Urbana-Champaign; Linjun Zhang, Rutgers University; Sitan Yang, Amazon; Rasoul Etesami, University of Illinois Urbana-Champaign; Hanghang Tong, University of Illinois Urbana-Champaign; Huan Zhang, University of Illinois Urbana-Champaign; Jiawei Han, University of Illinois Urbana-Champaign

(May 24, 2024)

Abstract

Demand prediction is a crucial task for e-commerce and physical retail businesses, especially during high-stakes sales events. However, the limited availability of historical data from these peak periods poses a significant challenge for traditional forecasting methods. In this paper, we propose a novel approach that leverages strategically chosen proxy data reflective of potential sales patterns from similar entities during non-peak periods, enriched by features learned from a graph neural network (GNN)-based forecasting model, to predict demand during peak events. We formulate the demand prediction as a meta-learning problem and develop the Feature-based First-Order Model-Agnostic Meta-Learning (F-FOMAML) algorithm that leverages proxy data from non-peak periods and GNN-generated relational metadata to learn feature-specific layer parameters, thereby adapting to demand forecasts for peak events. Theoretically, we show that by considering domain similarities through task-specific metadata, our model achieves improved generalization, where the excess risk decreases as the number of training tasks increases. Empirical evaluations on large-scale industrial datasets demonstrate the superiority of our approach. Compared to existing state-of-the-art models, our method demonstrates a notable improvement in demand prediction accuracy, reducing the Mean Absolute Error by 26.24% on an internal vending machine dataset and by 1.04% on the publicly accessible JD.com dataset.

1 Introduction

Forecasting product demand during high-stakes sales events such as Black Friday or Prime Day is a daunting task for both e-commerce giants like Amazon and JD.com and physical retailers. This challenge stems largely from the scarcity of event-specific historical data. Commonly, businesses anchor their strategies on regular sales data, which may not fully capture the distinct consumer behaviors observed during promotional periods. Beyond standard demand prediction, promotional forecasting includes predicting "extreme" events (Le Guen and Thome, 2020). These events, marked by deeper discounts and atypical merchandising strategies, deviate significantly from the typical sales patterns influenced by factors like seasonality or product life cycles. This deviation necessitates a specialized approach to deal-level forecasting, one that thoroughly considers promotion-specific intricacies, from the depth of discounts to deal combinations.

For instance, an online retailer aiming to anticipate the demand spike for a newly launched electronic item during a holiday sale might struggle. They might be unsure how various promotions will influence demand during these events, particularly when previous similar event data is limited or non-existent. To mitigate this, our research uses proxy data from non-peak sales to inform decisions during peak sales events. However, this supplemental data alone is insufficient, given the intricate interrelationships among various products and categories and even across different shopping platforms. We thus introduce a representation learning task for each product, leveraging a cutting-edge Graph Neural Network (GNN) based forecasting model (Yang et al., 2023). This model generates embeddings enriched with graph-enhanced features, encapsulating cross-product information derived from pertinent graph structures. Such structures offer insights into a myriad of dynamics, from relationships between products to patterns of inter-platform shopping behaviors.

Our proposed Feature-based First-Order Model-Agnostic Meta-Learning (F-FOMAML) approach refines the foundational MAML framework (Finn et al., 2017a; Nichol et al., 2018), incorporating task-specific insights (Meng et al., 2022) drawn from GNN-processed data. By training F-FOMAML with this enhanced metadata, the model showcases an unparalleled ability to adapt, consistently surpassing conventional forecasting techniques in accuracy metrics. While we primarily target enhancing e-commerce and brick-and-mortar retail demand prediction, the potential of our GNN-augmented F-FOMAML is vast. Its versatility makes it a candidate for various applications, from fortifying online banking fraud detection systems to optimizing digital advertising click-through rates.

Our main contributions can be summarized as follows:

1. Model: We propose a novel approach to model demand prediction, reframing it as a graph-augmented meta-learning challenge.

2. Algorithm: We introduce the GNN-infused F-FOMAML algorithm, which combines meta-learning with feature-wise linear modulation (FiLM) layers. This results in a model capable of producing robust predictions, even when historical data is sparse.

3. Theory: We develop a theoretical framework that offers insights into how our proposed algorithm reduces predictive risk through the lens of the bias-variance trade-off.

4. Numerical Experiments: Our numerical experiments address the inherent challenges of forecasting with multi-modal time series data (combining static and dynamic features) while facing data scarcity in both the target domain and source tasks. Empirical tests validate F-FOMAML’s proficiency: the model reduces prediction MAE relative to existing forecasting methods by 26.24% on the vending machine dataset and by 8.7% on the JD.com dataset when using features constructed from domain knowledge. Furthermore, our model achieves a 1.04% MAE improvement over baselines with GNN integrated.

2 Related Work

In this section, we discuss prior studies related to sales prediction models, meta-learning’s role in time series forecasting, and the significance of proxy data in prediction endeavors. We categorize the related works into four main sub-domains: Prediction with Limited Data, Meta-Learning for Demand Prediction, Few-Shot Meta-Learning Methods, and Graph Neural Networks for Time Series Forecasting.

Prediction with Limited Data

Previous work in transfer learning has focused on learning from data-rich domains and transferring knowledge to data-sparse regions or underrepresented classes (Gupta et al., 2016; Jean et al., 2016; Zhu et al., 2020).

In e-commerce, we aim to learn from popular products to improve the performance of new or less popular products. Multi-task learning has also been used to enhance model performance on data-sparse tasks (Bishop et al., 2014; Chang et al., 2019; Fiot and Dinuzzo, 2015; Pan and Yang, 2009). Conventional transfer learning methods learn transferable latent factors between one source domain and one target domain (Long et al., 2013; Gong et al., 2012; Tzeng et al., 2017; Long et al., 2015). In our work, we focus on adopting meta-learning techniques to learn from various tasks and then adapt them to unseen tasks in demand prediction.

Meta-Learning for Demand Prediction

Meta-learning has been applied to various retail and demand prediction tasks, with an emphasis on learning from diverse data sources and adapting to new tasks with limited data. For instance, Li et al. (2020) employed meta-learning to predict demand in retail settings, demonstrating the effectiveness of meta-learning in capturing complex patterns across diverse scenarios. Similarly, Wang et al. (2019) applied meta-learning to online retail data, highlighting the potential for meta-learning in e-commerce applications.

For time series problems, Oreshkin et al. (2020) briefly discuss the relation between neural time series prediction and meta-learning. Yao et al. (2019) incorporate gradient-based meta-learning with a region-functionality-based memory for spatiotemporal prediction. However, this method relies on the spatial semantic correlations between tasks, which limits its applicability to our problem.

Our work contributes to the problem of learning customer demand for new products with few historical data points. Previous works have suggested comparing the features of new products to existing ones (Ferreira et al., 2016; Baardman et al., 2017), or efficient methods for eliciting additional information (Cao and Zhang, 2021; Ma and Simchi-Levi, 2022). Our paper assumes that sales have already been observed at limited prices and leverages more information from other related products and environments as proxy data.

Meta-Learning Methods for Few-Shot Learning

Meta-learning methods for few-shot learning can be broadly categorized into two main approaches: metric-learning-based and optimization-based. Metric-learning-based approaches focus on establishing similarity or dissimilarity between classes, as demonstrated by works such as Prototypical Networks (Snell et al., 2017), Matching Networks (Vinyals et al., 2016), and Relation Networks (Sung et al., 2017). These methods aim to learn representations that facilitate comparisons between few-shot examples and known classes. On the other hand, optimization-based approaches aim to learn a good initialization point that can quickly adapt to new tasks with minimal parameter updates. Prominent examples of this category include Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017a), Reptile (Nichol et al., 2018), and Meta-SGD (Li et al., 2017b). These methods have been further extended by advanced techniques such as Task-Specific Adaptation (TSA) (Zhou et al., 2020) and Multi-Modal Model-Agnostic Meta-Learning (MUMOMAML) (Vuorio et al., 2019). These optimization-based approaches enhance the adaptability and robustness of the learned models across diverse tasks and domains.

Our method falls into the domain of optimization-based approaches.

Graph Neural Networks for Time Series Forecasting

Deep learning models have been extensively explored for time series forecasting, especially those with the Seq2Seq structure (Sutskever et al., 2014), which involves learning an encoder to transform various inputs into fixed-length hidden states for producing forecasts. Recent developments include DeepAR (Salinas et al., 2020), TFT (Lim et al., 2019) and MQ-Forecasters (Wen et al., 2017; Eisenach et al., 2020). However, these methods do not account for cross-observation information, which becomes important in many practical applications. As a result, Graph Neural Networks (GNNs) have rapidly emerged as a promising framework to address this issue by combining temporal processing with graph convolution to augment the learning of individual time series (Kipf and Welling, 2017; Li et al., 2017a; Wu et al., 2020; Shang et al., 2021). A popular family of methods proposes graph structure learning for the joint inference of a latent structure through a GNN while forecasting (Kipf et al., 2018; Wu et al., 2020). However, these methods scale poorly to large datasets. A scalable approach recently introduced by Yang et al. (2023) uses predefined graphs as data augmentations rather than performing graph structure learning; it not only scales to graphs with millions of nodes but also substantially improves model performance, especially for cold-start problems when data is scarce.

Our work builds upon these foundations by specifically applying meta-learning and few-shot learning techniques to the demand forecasting problem, with the goal of improving the adaptability and performance of models in this context. To the best of our knowledge, we are the first to study peak period demand prediction with limited records by borrowing relation-aware knowledge from other time periods. We focus on this domain, exploring the application of meta-learning for few-shot prediction and incorporating auxiliary information, such as proxy data from other related tasks, to improve model performance.

3 Problem Formulation

Figure 1: Pipeline of the GNN-enhanced F-FOMAML for demand forecasting.

During peak periods, promoted products often have limited historical sales (e.g., less popular items) or are new items without historical transaction data. Consequently, the demand forecasting tasks for product-location pairs during these periods are new and unseen compared to regular products and periods (e.g., paper towels). To address this challenge, we frame our research problem within a generic setting, focusing on a few-shot meta-learning paradigm, specifically targeting demand forecasting. Throughout our discussion, we use JD.com’s transactional data as the primary example to illustrate our approach.

3.1 Task Definition

Demand forecasting aims to predict the future demand for a product in a specific environment based on observed features. Each forecasting task is associated with a product and its environment.

Formally, let $\mathcal{P}({\mathcal{T}})$ denote a distribution over tasks ${\mathcal{T}}_{ij}$ , each corresponding to product $i$ in environment $j$ . For a set $[n]=\{1,\ldots,n\}$ of $n$ products with product $i$ present in $t_{i}$ environments, we have a total of $\sum_{i=1}^{n}t_{i}$ tasks. Each task dataset is symbolized as a pair $(\mathbf{x}_{ij},y_{ij})$ , where $\mathbf{x}_{ij}$ is the feature vector and $y_{ij}$ signifies the associated demand.

To provide a concrete example, consider a scenario where we have $n=10$ products, each available in $t_{i}=5$ locations. Therefore, we have a total of 50 tasks in our meta-training set. The dataset corresponding to each task is represented as a demand-feature pair $(\mathbf{x}_{ij},y_{ij})$ , where $\mathbf{x}_{ij}\in\mathbb{R}^{m}$ is a feature vector and $y_{ij}\in\mathbb{R}$ is the associated demand.
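As a small illustration of the task count in this example, the sketch below enumerates one task per product-environment pair. The variable names (`n_products`, `envs_per_product`, `tasks`) are hypothetical and chosen only for this illustration.

```python
# Hypothetical sizes taken from the example in the text:
# n = 10 products, each present in t_i = 5 environments.
n_products = 10
envs_per_product = {i: 5 for i in range(n_products)}

# One task T_ij per (product i, environment j) pair.
tasks = [(i, j) for i in range(n_products) for j in range(envs_per_product[i])]

# The task count equals the sum of t_i over all products.
num_tasks = len(tasks)
```

Because `envs_per_product` is a per-product mapping, the same enumeration also covers the general case in which different products appear in different numbers of environments.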

Our goal is to train a model, denoted by $f:\mathbb{R}^{m}\rightarrow\mathbb{R}^{+}$, capable of mapping $m$-dimensional observations $\mathbf{x}$ to outputs $y$ across a large or possibly infinite number of tasks. We employ the First-Order Model Agnostic Meta-Learning (FOMAML) algorithm for this purpose. For a given product characterized by a feature vector $s_{i},\forall i\in[n]$ and an environment (e.g., location) characterized by a feature vector $v_{j},\forall j\in[t_{i}]$, we consider a single historical price and demand observation $(\tilde{p}_{ij},\tilde{y}_{ij})$.

Given a price of interest $p_{ij}$ , we assume our task as the following demand function:

y_{ij}=f_{ij}(\mathbf{x}_{ij})+\epsilon_{ij},

(1)

where $\mathbf{x}_{ij}$ is the feature tuple $(s_{i},v_{j},\tilde{p}_{ij},\tilde{y}_{ij},p_{ij})$ and $y_{ij}$ is the corresponding demand. Here, $f_{ij}$ is a flexible function (e.g., linear regression, MLP, etc.) and each task is associated with a unique model parameter $\beta_{ij}\in{\mathbb{R}}^{m}$. We assume that the noise $\epsilon_{ij}$ follows a centered sub-Gaussian distribution with parameter $\sigma_{i}^{2}$. Furthermore, without loss of generality, we assume that the distribution $\mathcal{P}_{X}$ of the features is an isotropic centered sub-Gaussian distribution, i.e., $\mathbb{E}(\mathbf{x}_{ij}\mathbf{x}_{ij}^{\top})=\mathbb{I}_{m}$. Exploiting structural similarities in $\mathcal{P}\left({\mathcal{T}}\right)$, the goal is to train a model for a new task ${\mathcal{T}}^{\rm new}$, drawn from $\mathcal{P}\left({\mathcal{T}}\right)$, using only a small training dataset ${\mathcal{D}}=\big(\mathbf{x}^{\text{new}}_{ij},y^{\text{new}}_{ij}\big)$.

Remark 1.

Incorporating features allows us to capture an additional form of shared structure. However, despite accounting for observed product features, the demand functions of two products can exhibit distinct behaviors. For instance, even for Diet Coke, price sensitivities may vary significantly on different vending machines due to factors such as customer demographics or preferences that are challenging to capture as explicit features. To account for these product-location-specific nuances, we introduce the flexibility for the demand function’s coefficients (e.g., price elasticity) to differ.

In the First-Order MAML (FOMAML) approach, the model parameters for each task in the meta-training dataset are computed after a single gradient update. Specifically, for each task $\mathcal{T}_{ij}$ , the task-specific model parameters, denoted $\beta_{ij}^{\prime}$ , are updated as follows:

\beta_{ij}^{\prime}\leftarrow\beta^{*}-\lambda\nabla_{\beta^{*}}\mathcal{L}_{\mathcal{T}_{ij}}(\beta_{ij}),

(2)

where $\lambda$ is the learning rate, $\beta^{*}$ is the global model parameter shared across tasks, and $\mathcal{L}_{\mathcal{T}_{ij}}(\beta_{ij})$ is the task-specific loss, such as the mean squared error:

\mathcal{L}_{\mathcal{T}_{ij}}(\beta_{ij})=\frac{1}{2}\left(y_{ij}-f_{ij}(\mathbf{x}_{ij})\right)^{2}.

(3)

After updating the task-specific parameters, a meta-update is performed on the shared global parameter $\beta^{*}$ using the performance of the updated $\beta_{ij}^{\prime}$ on their corresponding tasks. This meta-update is given by the following:

\beta^{*}\leftarrow\beta^{*}-\eta\sum_{i=1}^{n}\sum_{j=1}^{t_{i}}\nabla_{\beta^{*}}\mathcal{L}_{\mathcal{T}_{ij}}(\beta_{ij}^{\prime}),

(4)

where $\eta$ is the meta-learning rate. The objective of this meta-learning process is to optimize the shared global parameter $\beta^{*}$ such that, after a few updates on each task, the task-specific parameters $\beta_{ij}$ yield improved performance on their corresponding tasks. Once the meta-learning process is complete, the model parameters of a newly arriving task can be estimated using the learned meta-parameters $\beta^{*}$ . These task-specific parameters $\beta_{ij}^{\prime}$ can then be fine-tuned on the new task using the available data, yielding improved performance and adaptability to new tasks.
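The inner update (2) and the first-order meta-update (4) can be sketched for a linear demand model $f(\mathbf{x};\beta)=\beta^{\top}\mathbf{x}$ with the squared loss (3). This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the linear model, the split of each task into a support and a query observation, and the names `predict`, `grad`, and `fomaml_step` are all introduced here for illustration.

```python
def predict(beta, x):
    """Linear demand model: inner product of parameters and features."""
    return sum(b * xi for b, xi in zip(beta, x))


def grad(beta, x, y):
    """Gradient of the loss 0.5 * (y - beta.x)^2 with respect to beta."""
    residual = y - predict(beta, x)
    return [-residual * xi for xi in x]


def fomaml_step(beta_star, tasks, lam, eta):
    """One FOMAML round over a list of tasks.

    Each task is ((x_support, y_support), (x_query, y_query)); the
    support/query split is an assumption of this sketch.
    """
    meta_grad = [0.0] * len(beta_star)
    for (x_s, y_s), (x_q, y_q) in tasks:
        # Inner update: one gradient step starting from beta_star.
        g = grad(beta_star, x_s, y_s)
        beta_task = [b - lam * gi for b, gi in zip(beta_star, g)]
        # First-order approximation: the gradient at the adapted parameters
        # is used directly as the gradient for beta_star.
        g_q = grad(beta_task, x_q, y_q)
        meta_grad = [m + gq for m, gq in zip(meta_grad, g_q)]
    # Meta-update: aggregate the per-task gradients.
    return [b - eta * m for b, m in zip(beta_star, meta_grad)]
```

Calling `fomaml_step` repeatedly over batches of tasks moves the shared initialization toward a point from which a single gradient step adapts well to each task; second-order terms are ignored, which is what distinguishes the first-order variant from full MAML.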

By incorporating the FOMAML algorithm into our meta-learning framework, we aim to construct an efficient model for sales prediction that can adapt to new tasks with limited historical sales data.

4 Methodology

We illustrate the pipeline of our algorithm in Figure 1. Imagine there are 3 locations offering 6 drink types with transaction data, capturing their historical sales. First, a graph $G$ for the graph neural network (GNN) is formed using both static features like machine locations and dynamic features from past sales time series. To predict the demand for Coke at the gym with a discount, relevant nodes and edges from $G$ are extracted. This subset, denoted as $G_{\mathcal{T}}$, undergoes training using MAML’s inner loop, yielding initial task-specific parameters. These parameters are further refined through the FiLM transformer, considering proxy data that might suggest a promotion for Coke. The shared meta-parameters are updated in MAML’s outer loop based on the specific task losses. Once this cycle is completed across all tasks, the model is evaluated on fresh data to project the demand.

Our proposed methodology for e-commerce demand prediction encompasses three pivotal components: proxy data selection, GNN-enhanced representation learning, and the F-FOMAML algorithm design. To cater to the multi-faceted nature of e-commerce products and their varied demand across different locations or customer segments, we weave task-adaptive estimators into the meta-learning framework. Further, we employ GNN and the FiLM layer to utilize and encode proxy data into hidden representations, thus enabling the modulation of learner parameters for enhanced adaptation to the specific characteristics of products and customer segments.

4.1 Proxy Data Selection

The proxy data, vital for tasks with limited historical sales data, is judiciously selected. The ideal proxy data simulates the potential sales behavior of the focal product, informed by sales trends of similar products or those in related categories.

For e-commerce scenarios, task similarity might arise from: 1) Historical Transactions: Edges represent products often purchased together. 2) User Behavioral Patterns: Edges might indicate similar purchase behaviors or browsing patterns of users. 3) Product Similarities: Linking products of the same category or with similar attributes. 4) Domain Knowledge: Connections deriving from expert insights into customer behaviors, seasonal trends, or market dynamics.

Traditional methods use clustering techniques and measure distances with metrics like Euclidean and cosine similarity to quantify task similarity. Our approach, however, leverages a GNN-based method for selecting relevant tasks as proxy data. For a given task, we denote its proxy data as $Z_{ij}$ , representing the most relevant data identified through our GNN-based approach. This ensures the proxy data accurately reflects the target task, enhancing demand forecasting accuracy.

Graph Construction for Proxy Data

A tailored graph for our GNN encapsulates relationships among tasks. In determining proxy data for e-commerce settings, we choose tasks from support set $\mathcal{T}$ resembling our target task, $\mathcal{T}_{\text{new}}$ , guided by $\text{correlation}(\mathcal{T},\mathcal{T}_{new})>\delta,$ where $\delta$ is a threshold indicating task similarity, and the function $\text{correlation}(\mathcal{T},\mathcal{T}_{new})$ captures the similarity between $\mathcal{T}_{new}$ and $\mathcal{T}$ through different methods such as the ones described above.
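The threshold rule above can be sketched as follows. The text leaves the choice of the correlation function open, so this sketch assumes cosine similarity between task embeddings; the function names `cosine_similarity` and `select_proxy_tasks` and the dictionary layout are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_proxy_tasks(target_embedding, support_embeddings, delta):
    """Return ids of support tasks whose similarity to the target exceeds delta.

    support_embeddings maps a task id to its embedding vector.
    """
    return [
        task_id
        for task_id, emb in support_embeddings.items()
        if cosine_similarity(emb, target_embedding) > delta
    ]
```

A larger `delta` yields fewer but more closely related proxy tasks, while a smaller `delta` admits more tasks at the cost of relevance.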

4.2 GNN-enhanced Representation Learning

Here, we describe how to obtain the graph-enhanced features for each product. In a nutshell, we set up a time series forecasting task and utilize a GNN-based demand forecasting model to predict future sales given each product’s historical information as well as cross-product relationships defined by a predefined graph. We then extract the hidden encoded context from the trained model to produce the product embeddings as features.

Input Product Features

E-commerce platforms host a plethora of products, each with unique characteristics and consumer interactions. In this case, we construct the graph using product-specific attributes such as brands (i.e., we connect all products with the same brand). The input features for node $N_{i}$ (representing product $i$) are: 1) Static features $S_{i}$, like the product category, brand, and manufacturing details. 2) Dynamic features $D_{i}$, encompassing time-evolving aspects like recent sales and price changes.

Product Embedding Generation via Forecasting

A crucial aspect is to generate meaningful product embeddings that can capture the multifaceted nature of e-commerce products. To facilitate this, we set up a demand forecasting task as:

\widehat{Y}_{t+1}=f\left(Y_{t-C:t},D_{t-C:t},S\right),

(5)

where $f$ represents the forecasting model. At time $t$, the target $Y_{t+1}\in\mathbb{R}^{N\times 1}$ is the future one-day sales, $D_{t-C:t}\in\mathbb{R}^{N\times d}$ are $d$ dynamic features with a history length of $C$ days, and $S\in\mathbb{R}^{N\times m}$ are $m$ static features for all $N$ products. We adopt the GNN-based forecasting model introduced by Yang et al. (2023) and use the brand information to craft the predefined graph. The GNN is utilized both for forecasting and for generating the embedding of tasks. After training converges, we extract the embedding for each product, which serves as a compact representation of product dynamics.
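To make the input-target pairing in the forecasting task (5) concrete, the sketch below builds (history, next-day) training pairs from a single product's daily sales series. It assumes that the history window holds the last $C$ observations up to and including day $t$; the function name `make_windows` is hypothetical, and the dynamic and static features are omitted for brevity.

```python
def make_windows(series, C):
    """Pair each length-C history window with the following day's value.

    For each day t, the history is series[t-C+1 .. t] (an assumed convention)
    and the target is series[t+1].
    """
    return [
        (series[t - C + 1 : t + 1], series[t + 1])
        for t in range(C - 1, len(series) - 1)
    ]
```

A series of length $L$ produces $L-C$ pairs, so very short sales histories yield few training examples, which is one reason cross-product information from the graph is useful.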

Edge Relationship Determination

Let $E(\mathcal{T}_{i},\mathcal{T}_{j})$ denote the edge between tasks $\mathcal{T}_{i}$ and $\mathcal{T}_{j}$ . The edge relationships between the two entities are inferred using:

E(\mathcal{T}_{i},\mathcal{T}_{j})=\begin{cases}1&\text{if }\text{dist}(emb(\mathcal{T}_{i}),emb(\mathcal{T}_{j}))<\theta\text{ or }h_{\mathcal{T}_{i}}=h_{\mathcal{T}_{j}}\\ 0&\text{otherwise,}\end{cases}

(6)

where $emb(\mathcal{T}_{i})$ stands for the embedding of task $\mathcal{T}_{i}$ , $\text{dist}(\cdot,\cdot)$ denotes a function measuring the distance between two embeddings, $h$ denotes the task (i.e., product) hierarchy or taxonomy, and $\theta$ is a pre-determined threshold to determine closeness. We will create an edge between tasks $\mathcal{T}_{i}$ and $\mathcal{T}_{j}$ if either their corresponding embeddings are close to each other or they belong to the same category.
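The edge rule (6) can be sketched directly. The sketch assumes the Euclidean distance for $\text{dist}(\cdot,\cdot)$, which the text does not fix, and a single category label per task for the hierarchy $h$; the function name `build_edges` and the dictionary inputs are hypothetical.

```python
import math


def build_edges(embeddings, categories, theta):
    """Return the set of undirected edges (a, b) defined by the edge rule.

    An edge is created when the embeddings are closer than theta
    (Euclidean distance, an assumption) or the categories match.
    """
    ids = list(embeddings)
    edges = set()
    for idx, a in enumerate(ids):
        for b in ids[idx + 1 :]:
            close = math.dist(embeddings[a], embeddings[b]) < theta
            same_category = categories[a] == categories[b]
            if close or same_category:
                edges.add((a, b))
    return edges
```

The pairwise loop is quadratic in the number of tasks, which is acceptable for a sketch; at the scale of millions of nodes an approximate nearest-neighbor search would be the more plausible choice.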

4.3 F-FOMAML Algorithm Description

We present the GNN-Integrated Feature-based First-Order MAML (F-FOMAML) for demand forecasting in Algorithm 1, which incorporates transactional data with static and dynamic features, together with proxy data, to forecast demand in peak periods. This variant of the MAML algorithm divides the learning process into several stages: meta-learner, base learners, FiLM layer, and fine-tuning.

Algorithm 1 GNN-Integrated Feature-based First-Order MAML (F-FOMAML) for Demand Forecasting

1: Input: dataset $\mathcal{D}$ with static features $S=\{(s_{i},v_{j})\}_{i,j}$ and dynamic features $D=\{\tilde{y}_{ij}\}_{i,j}$; proxy data $Z$; pre-constructed graph $G=(N,E)$; learning rates $\eta,\lambda$
2: Learn the GNN with nodes $N$ and edges $E$ using $S$ and $D$
3: Initialize global meta-parameters $\beta^{*}$
4: for each task $\mathcal{T}_{ij}$ in $\mathcal{D}$ do
5:   Extract relevant nodes $N_{\mathcal{T}_{ij}}\subset N$ and edges $E_{\mathcal{T}_{ij}}\subset E$ from $G$ for product $i$ in environment $j$
6:   Represent task $\mathcal{T}_{ij}$ using a subgraph $G_{\mathcal{T}_{ij}}=(N_{\mathcal{T}_{ij}},E_{\mathcal{T}_{ij}})$
7:   Initialize task-specific parameters $\beta_{ij}$ from $\beta^{*}$
8:   Compute task-specific loss $\mathcal{L}_{\mathcal{T}_{ij}}(\beta_{ij})$: $\mathcal{L}_{\mathcal{T}_{ij}}(\beta_{ij})=\frac{1}{2}\left(y_{ij}-f_{ij}(\mathbf{x}_{ij};\beta_{ij})\right)^{2}.$
9:   Update task-specific parameters $\beta_{ij}^{\prime}$ using Equation (2): $\beta_{ij}^{\prime}\leftarrow\beta^{*}-\lambda\nabla_{\beta^{*}}\mathcal{L}_{\mathcal{T}_{ij}}(\beta_{ij}).$
10:   Apply FiLM transformer using proxy data $Z_{ij}$ to modulate the input features $\mathbf{x}_{ij}$ for task $\mathcal{T}_{ij}$, enhancing the adaptation of the task-specific parameters $\beta_{ij}^{\prime}$: $\text{FiLM}(\mathbf{x}_{ij};Z_{ij})=\beta(Z_{ij})\odot\mathbf{x}_{ij}+\gamma(Z_{ij})$
11: end for
12: Perform meta-update on $\beta^{*}$ using: $\beta^{*}\leftarrow\beta^{*}-\eta\sum_{i=1}^{n}\sum_{j=1}^{t_{i}}\nabla_{\beta^{*}}\mathcal{L}_{\mathcal{T}_{ij}}(\beta_{ij}^{\prime}).$
13: Evaluate the model on testing data to get forecasted demand $\widehat{y}$
14: return Forecasted demand $\widehat{y}$ for products.

FiLM Layer.

The feature-wise linear modulation (FiLM) layer (Perez et al., 2018) is a critical component in tailoring the learner parameters based on the proxy data features. This layer applies a feature-wise affine transformation to its input, modulating the hidden vector outputs of the meta-model using the proxy data $Z_{ij}$ as task encodings. The construction and purpose of the proxy data $Z_{ij}$ are described in Section 4.1. The FiLM layer facilitates a more refined adaptation to the distinctive traits of the product and vending machine location by exploiting the relationship between the product-specific and machine-specific price-sensitivity estimators encapsulated in the proxy data.

The FiLM layer’s mechanism can be formally described as

\text{FiLM}(\mathbf{x}_{ij};Z_{ij})=\beta(Z_{ij})\odot\mathbf{x}_{ij}+\gamma(Z_{ij}),

(7)

where $\mathbf{x}_{ij}$ represents the input feature representation, $\beta(Z_{ij})$ (abbreviated $\beta_{ij}$) and $\gamma(Z_{ij})$ are the scaling and shifting factors learned from the proxy data $Z_{ij}$, and $\odot$ denotes element-wise multiplication. The functions $\beta(Z_{ij})$ and $\gamma(Z_{ij})$ are learned during the training phase to cater to the specific task at hand. By applying this transformation to the task-specific model parameters $\beta_{ij}^{\prime}$, the model captures complex feature interactions and becomes better equipped to adapt to the specific characteristics of each unique product-environment pair.
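The modulation in equation (7) can be sketched as below. The text does not specify the form of the learned functions $\beta(\cdot)$ and $\gamma(\cdot)$, so this sketch assumes each is a single linear map of the proxy encoding; the names `linear` and `film` and the `(weights, bias)` parameter layout are hypothetical.

```python
def linear(weights, bias, z):
    """Affine map of the proxy encoding z: one output per row of weights."""
    return [
        sum(w * zi for w, zi in zip(row, z)) + b
        for row, b in zip(weights, bias)
    ]


def film(x, z, scale_params, shift_params):
    """Feature-wise modulation: scale(z) * x + shift(z), element by element.

    scale_params and shift_params are (weights, bias) pairs; modelling the
    scale and shift generators as linear maps is an assumption of this sketch.
    """
    scale = linear(scale_params[0], scale_params[1], z)
    shift = linear(shift_params[0], shift_params[1], z)
    return [s * xi + t for s, xi, t in zip(scale, x, shift)]
```

Because the scale and shift depend only on the proxy encoding, two tasks with similar proxy data receive similar modulation even when their own observations are scarce.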

Meta-Learner.

The core of the meta-learning approach is the meta-learner, an overarching model that helps in initializing and updating the meta-parameters, $\beta^{*}$ . These parameters serve as a shared knowledge base that aids in swift adaptation across a myriad of tasks. The meta-learner initializes the global meta-parameters $\beta^{*}$ and, after task-specific adaptations are performed, updates $\beta^{*}$ using the aggregated first-order gradients from each task. This process ensures that the meta-parameters incorporate insights from various tasks, enabling rapid adaptation to new tasks and reducing the cold-start problem in the e-commerce domain.

Base Learners.

The base learners are models tailored to specific tasks, such as predicting the demand for a new product launch or forecasting sales during a flash sale. Each base learner operates by extracting relevant nodes and edges to form a subgraph for each task, initializing task-specific parameters $\beta_{ij}$ from the global meta-parameters $\beta^{*}$ , computing the task-specific loss, and updating the task-specific parameters $\beta_{ij}^{\prime}$ using first-order gradient descent. Additionally, the FiLM transformer applies proxy data to modulate input features, enhancing the adaptation of the task-specific parameters.

Fine-Tuning.

In our approach, fine-tuning involves a final meta-update on $\beta^{*}$ after task-specific updates, evaluating the model on testing data to forecast demand $\widehat{y}$ for products and returning the forecasted demand $\widehat{y}$ as the final prediction output. This process leverages the FiLM transformer and proxy data to ensure that the models are not just generic but tailored to capture the heterogeneity in tasks.

The strength of this method mainly lies in its ability to utilize shared structures across tasks while also adapting swiftly to unique task characteristics using the FiLM transformer and proxy data.

5 Theoretical analysis

In this section, we provide a theoretical model to illustrate the benefit of our proposed method and shed light on why proxy data improves the few-shot prediction.

Data generative model.

Suppose we have a set of tasks $\mathcal{T}$ . For each task $t\in\mathcal{T}$ , we observe training samples $\{(x_{k}^{(t)},y_{k}^{(t)})\}_{k=1}^{n_{t}}$ , where $n_{t}$ is the sample size for the task $t$ . In addition, for each task, we observe a task-specific feature $v_{t}\in\mathbb{R}^{r}$ . For the simplicity of presentation, we assume $v_{t}\in[0,1]^{r}$ . We denote the set of training domains by $\mathcal{D}^{tr}$ and assume there are ${T}=|\mathcal{T}|$ training tasks. Following equation (1), we assume that for each task $t$ , the outcome prediction function $g$ takes the form of $y=g_{t}(x)+\epsilon:=h_{t}(f(x))+\epsilon$ , where $h_{t}$ is the base-learner that depends on individual task $t$ , $f$ is the meta-learner, and $\epsilon$ is a noise term which is assumed to be sub-Gaussian with mean 0 and variance $\sigma^{2}$ .

Following Section 4, for each task $t\in\mathcal{T}$, we construct the proxy data $Z_{t}$ by including all similar tasks $t^{\prime}$ such that $\|v_{t^{\prime}}-v_{t}\|\leq h$ for some threshold parameter $h>0$. Similarly, for the test task $\widetilde{t}$, the outcome prediction function $\widehat{g}_{\widetilde{t}}$ is computed as $\widehat{g}_{\widetilde{t}}(x)=\widehat{h}_{\widetilde{t}}(f(x)),$ where $\widehat{h}_{\widetilde{t}}(\cdot)=\frac{\sum_{t\in\mathcal{T}}w(\widetilde{t},t)\widehat{h}_{t}(\cdot)}{\sum_{t\in\mathcal{T}}w(\widetilde{t},t)}$, with the weight $w(\widetilde{t},t)=\bm{1}\{\|v_{\widetilde{t}}-v_{t}\|\leq h\}$. In the case where the denominator is $0$, we define $\widehat{h}_{\widetilde{t}}=0$.
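The test-task estimator described above is a uniform average of the base learners whose task features fall within distance $h$ of the test task's feature. A minimal sketch, assuming the Euclidean norm and base learners represented as plain Python callables (the names `aggregate_base_learners`, `task_features`, and `base_learners` are hypothetical):

```python
import math


def aggregate_base_learners(v_test, task_features, base_learners, h):
    """Average the base learners of all training tasks within distance h.

    task_features maps a task id to its feature vector v_t, and
    base_learners maps the same ids to fitted callables. If no task lies
    within distance h, the zero function is returned, matching the
    convention for a zero denominator.
    """
    neighbors = [
        t for t, v in task_features.items() if math.dist(v, v_test) <= h
    ]
    if not neighbors:
        return lambda u: 0.0
    return lambda u: sum(base_learners[t](u) for t in neighbors) / len(neighbors)
```

The threshold `h` controls how many training tasks contribute: a small `h` uses only near-identical tasks, while a large `h` pools tasks that may differ substantially from the test task.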

Theoretical results.

To facilitate the theoretical analysis, we first assume that the distance between task-specific features indeed captures the similarity of tasks: there exists a universal constant $C$ , such that $\|h_{t_{1}}-h_{t_{2}}\|_{\infty}\leq C\cdot\|v_{t_{1}}-v_{t_{2}}\|.$

In addition, we assume that for each training domain $t$ , $\widehat{h}_{t}$ is well learned such that $\operatorname*{\mathbb{E}}\big[\big(\widehat{h}_{t}(f(x))-h_{t}(f(x))\big)^{2}\big]=O\big(\frac{C(\mathcal{H})}{n_{t}}\big)$ , where $C(\mathcal{H})$ is the Rademacher complexity of the function class $\mathcal{H}$ . We further assume $v_{t}$ has a positive density over $[0,1]^{r}$ . Then, we have:

Theorem 1.

Consider the data generative model, the algorithm $\widehat{g}_{\widetilde{t}}$ , and the assumptions above. Suppose we have $n_{t}\gtrsim n$ for all $t\in\mathcal{D}^{tr}$ . Define the excess risk for the test domain $\widetilde{t}$ by $R(\widehat{g}_{\widetilde{t}})=\operatorname*{\mathbb{E}}_{(x,y)\sim\widetilde{t}}[l(\widehat{g}_{\widetilde{t}}(x;\mathcal{D}^{tr},A),y)]-\operatorname*{\mathbb{E}}_{(x,y)\sim\widetilde{t}}[l(g_{\widetilde{t}}(x;\mathcal{D}^{tr},A),y)]$ . If the loss function $l$ is Lipschitz with respect to the first argument, then the excess risk satisfies

R(\widehat{g}_{\widetilde{t}})\lesssim h+\sqrt{\frac{C(\mathcal{H})/n}{\max\{1,Th^{r}\}}}.

(8)

In particular, if $h$ is properly chosen such that $h\asymp(\frac{C(\mathcal{H})/n}{T})^{\frac{1}{r+2}}$ , then $R(\widehat{g}_{\widetilde{t}})\lesssim\left(\frac{C(\mathcal{H})/n}{T}\right)^{\frac{1}{r+2}}.$
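The stated choice of $h$ can be recovered by balancing the bias and variance terms of the bound. The short derivation below is a sketch that writes the task-dependent part of the denominator as $Th^{r}$ , the form consistent with the rate above, and works in the regime $Th^{r}\geq 1$ :

h^{2}\asymp\frac{C(\mathcal{H})/n}{Th^{r}}\;\Longleftrightarrow\;h^{r+2}\asymp\frac{C(\mathcal{H})/n}{T}\;\Longleftrightarrow\;h\asymp\Big(\frac{C(\mathcal{H})/n}{T}\Big)^{\frac{1}{r+2}}.

With this choice both terms are of order $h$ , which gives the stated rate. Substituting the same $h$ into $Th^{r}\geq 1$ shows that this regime corresponds to $T\gtrsim(n/C(\mathcal{H}))^{r/2}$ .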

Theorem 1 suggests that the advantage of our algorithm comes from a better bias-variance trade-off. More concretely, the threshold $h$ tunes the trade-off for the excess risk. On the one hand, when we do not use relational data at all (corresponding to the case where $h=0$ ), the first term in equation (8), the bias, vanishes, while the second term, the variance, becomes dominant because the data is limited. As the excess risk of a single task is of order $(C(\mathcal{H})/n)^{1/2}$ , our result implies that when $T$ is sufficiently large such that $T>\big(\frac{n}{C(\mathcal{H})}\big)^{r/2}$ , our proposed method overcomes the potential bias incurred by incorporating similar (but still different) tasks and performs better than learning from a single source task. On the other hand, if we simply use the standard ERM to pool all the data together (corresponding to $h=\infty$ ), the variance becomes small but the bias dominates. The second part of the theorem suggests that one can efficiently incorporate the proxy data with a carefully chosen threshold. The proof of Theorem 1 is deferred to Appendix B.
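The trade-off can also be checked numerically. The sketch below evaluates the shape of the bound with all constants dropped, taking the variance denominator to be $\max\{1,Th^{r}\}$ (the form that the stated choice of $h$ balances). The function names and the values of `c_over_n`, `r`, and `T` are hypothetical and chosen only for illustration.

```python
def excess_risk_bound(h, c_over_n, T, r):
    """Shape of the bound: h + sqrt((C(H)/n) / max(1, T * h**r))."""
    return h + (c_over_n / max(1.0, T * h ** r)) ** 0.5

def balanced_threshold(c_over_n, T, r):
    """Threshold h that equates the bias term and the variance term."""
    return (c_over_n / T) ** (1.0 / (r + 2))

c_over_n, r = 0.01, 2                     # hypothetical C(H)/n and feature dimension
single_task_rate = c_over_n ** 0.5        # order of the single-task excess risk
task_threshold = (1.0 / c_over_n) ** (r / 2)  # T must exceed (n / C(H))^(r/2)

for T in (10, 100, 10_000):
    h_star = balanced_threshold(c_over_n, T, r)
    bound = excess_risk_bound(h_star, c_over_n, T, r)
    print(f"T={T}: h*={h_star:.4f}, bound={bound:.4f}, "
          f"single-task rate={single_task_rate:.4f}")
```

Setting $h=0$ in `excess_risk_bound` recovers the single-task rate, and the balanced rate falls below it only once `T` exceeds `task_threshold`, matching the condition on $T$ discussed above up to constants.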

The theoretical perspective we discussed is particularly pertinent to the context of demand prediction. Given the limited data available from high-stakes sales events, relying solely on this data for prediction (akin to $h=0$ ) can result in outcomes with substantial variance. Conversely, utilizing the entire historical dataset (corresponding to $h=\infty$ ) can introduce significant bias, given the marked differences between regular and high-stakes sales events. Our method uses GNNs to learn the relationships within historical data and to select only the relevant proxy data, which corresponds to an intermediate threshold $h$ . This choice strikes a favorable balance in the bias-variance trade-off, leading to improved prediction accuracy.

6 Experiment

In this section, we conduct extensive experiments to evaluate the efficacy of our proposed F-FOMAML for peak-period demand prediction, focusing on two key research questions:

• How does F-FOMAML’s prediction performance compare to various baselines?
• To what extent do the components we introduce, such as the proxy data selection method (GNN versus MQCNN), impact the model’s predictive capabilities?

By addressing these questions, we provide a comprehensive evaluation of F-FOMAML, highlighting its performance relative to existing approaches and analyzing the contributions of individual components to the model’s overall predictive power.

6.1 Experimental Setups

In this section, we detail the experimental setups, focusing on two real-world datasets and the evaluation criteria for our method’s performance.

For brevity, the main text covers the data description, experimental setup, and results for the JD.com dataset, while the details for the vending machine dataset are provided in Appendix C.

Datasets.

We validate our methodology using transactional records from JD.com (dataset available at https://connect.informs.org/msom/events/datadriven2020), which include both static and dynamic features related to products (SKUs) and order details for March 2018.

The goal is to predict the demand at the promotional price given the demand at the regular price. We use the category information (3 categories in total) for product features and the region information (63 regions) for location features. We use the last 15 days for testing and the preceding 15 days for training. The detailed data description and dataset construction are deferred to Appendix C.1.
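The two 15-day windows can be made concrete with a small date computation. The sketch below assumes that the records span March 1 to March 31, 2018 and that both windows are inclusive; the exact boundary dates are not stated in the text, and the function name is hypothetical.

```python
from datetime import date, timedelta

def split_last_two_windows(first_day, last_day, window=15):
    """Return (train, test) date ranges: the test window is the last
    `window` days and the training window is the `window` days before it."""
    test_start = last_day - timedelta(days=window - 1)
    train_end = test_start - timedelta(days=1)
    train_start = train_end - timedelta(days=window - 1)
    if train_start < first_day:
        raise ValueError("not enough days for two windows")
    return (train_start, train_end), (test_start, last_day)

# Assumed span of the data: all of March 2018.
train_range, test_range = split_last_two_windows(date(2018, 3, 1), date(2018, 3, 31))
print("train:", train_range[0], "to", train_range[1])
print("test: ", test_range[0], "to", test_range[1])
```

Under these assumptions the first day of the month is left out, since two 15-day windows cover only 30 of the 31 days.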

Table 1: Experiment results on real-world JD.com E-Commerce sales data with proxy data features generated from different methods. This includes regression techniques, ensemble strategies, neural-network-based methods, and transfer methods for a comprehensive benchmark.

Method	MSE	MAE	MAPE
Linear Regression	1.2298	0.4757	0.2106
+ MQCNN	1.3789	0.5165	0.2486
+ GNN	0.7633	0.4138	0.1821
Random Forest	2.0397	0.5051	0.2706
+ MQCNN	1.7943	0.5639	0.3249
+ GNN	1.8419	0.5971	0.3473
XGBoost	1.7056	0.5516	0.3091
+ MQCNN	1.9021	0.5745	0.3352
+ GNN	1.7943	0.5639	0.3249
MLP	1.2708	0.4661	0.1945
+ MQCNN	1.2520	0.4668	0.1988
+ GNN	1.2708	0.4661	0.1945
GRU
+ MQCNN	1.7661	0.4780	0.2694
+ GNN	3.1427	0.7104	0.2234
LSTM
+ MQCNN	3.1850	0.6686	0.1752
+ GNN	1.2073	0.2936	0.1694
MAML	1.0752	0.4769	0.2345
+ MQCNN	3.2979	0.5898	0.3565
+ GNN	1.0383	0.4646	0.2201
Reptile	4.9341	0.8118	0.5410
+ MQCNN	1.5990	0.4457	0.1831
+ GNN	1.2041	0.4827	0.2287
MetaSGD	1.1941	0.4837	0.2379
+ GNN	1.2667	0.4651	0.1908
+ MQCNN	1.2111	0.4562	0.1983
TSA	1.7022	0.7310	0.4936
+ GNN	1.5116	0.5253	0.2585
+ MQCNN	1.1785	0.5315	0.3084
MUMOMAML	0.9449	0.3251	0.2189
+ MQCNN	1.0945	0.3861	0.1920
+ GNN	0.7577	0.3752	0.1553
F-FOMAML (Ours)	0.6552	0.3876	0.2117
+ MQCNN	0.6371	0.4134	0.2000
+ GNN	0.6089	0.3713	0.2077

Baselines.

Our evaluation encompasses a diverse range of baseline techniques for comparative analysis. This includes traditional regression techniques like Linear Regression, along with ensemble strategies such as Random Forest and the well-regarded XGBoost algorithm (Chen and Guestrin, 2016). In the realm of neural-network-based methods, we consider the Multi-Layer Perceptron network, Gated Recurrent Unit (GRU) (Chung et al., 2014), Dipole (Ma et al., 2017), and LSTNet (Lai et al., 2018). Additionally, advanced transfer methods like Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017b), Reptile (Nichol et al., 2018), Meta-SGD (Li et al., 2017b), TSA (Zhou et al., 2020) and MUMOMAML (Vuorio et al., 2019) are included. Consistency in the feature set is maintained across all baseline models, aligning them with our proposed method, and ensuring a fair comparison.

Model Evaluation and Training.

With the meta-learning framework in place, we train the base learners on the proxy data and evaluate their performances using evaluation metrics such as mean squared error (MSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). The meta-learner, which could be a neural network (Goodfellow et al., 2016), support vector machine (Cortes and Vapnik, 1995), or decision tree (Breiman et al., 1984), selects the best base learners and their corresponding hyper-parameters based on the evaluation results. Next, the selected base learners are fine-tuned on the available historical sales data from the regular sales period, if any, to adapt the model to the peak period. This fine-tuning step allows our model to better capture the unique relationships between features and sales in the target vending machine, leading to more accurate predictions and improved generalization to new tasks.
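For reference, the three evaluation metrics can be written out directly. The sketch below uses their standard definitions, with MAPE reported as a fraction rather than a percentage to match the scale of the values in Table 1. How zero actual demand is handled in the paper's MAPE is not stated, so raising an error in that case is an assumption of this sketch.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)

def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction of the actual value."""
    if any(a == 0 for a in y_true):
        # Assumption: zero actuals are rejected rather than skipped or smoothed.
        raise ValueError("MAPE is undefined when an actual value is zero")
    return sum(abs((a - p) / a) for a, p in zip(y_true, y_pred)) / len(y_true)

# Toy demand values, purely illustrative.
actual = [2.0, 4.0]
predicted = [1.0, 5.0]
print(mse(actual, predicted), mae(actual, predicted), mape(actual, predicted))
```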

6.2 Analysis of Results

Table 1 presents a detailed comparison of various machine learning methods applied to real-world e-commerce sales data from JD.com, evaluating their performance through three metrics (MSE, MAE, and MAPE). The methods encompass traditional regression, ensemble strategies, neural network-based approaches, transfer learning methods, and some advanced meta-learning algorithms, with the inclusion of proxy data features generated by either MQCNN or Graph Neural Networks (GNN). Linear regression, serving as a baseline, shows moderate performance, which slightly deteriorates when combined with MQCNN but improves with GNN, indicating GNN’s effectiveness in feature enhancement. Random Forest and XGBoost, both ensemble methods, exhibit higher MSE than linear regression, with their performance variably impacted by MQCNN and GNN additions, suggesting a complex interaction between ensemble methods and proxy data features. Among neural network-based methods, the addition of MQCNN generally does not significantly alter performance, whereas GNN integration shows mixed results. Notably, advanced methods like MAML, Reptile, and MetaSGD show varied outcomes, with some combinations leading to increased errors. In particular, Reptile demonstrates a substantial error reduction when combined with MQCNN (MSE drops from 4.9341 to 1.5990), highlighting the potential of integrating advanced algorithms with proxy data feature generation techniques. The performance of TSA and MUMOMAML, with their respective enhancements, underscores the importance of selecting appropriate proxy data feature generation methods to improve prediction accuracy. MUMOMAML combined with GNN achieves the best MAPE score across all methods (0.1553), emphasizing the strength of multimodal meta-learning techniques when optimized with suitable proxy data features.

The choice of proxy data selection method, GNN versus MQCNN, has a marked effect on performance. Adding GNN-generated features improves all three metrics (MSE, MAE, MAPE) for several models, including linear regression, MAML, and our proposed F-FOMAML, although the effect is not uniform: XGBoost, for example, degrades on all three metrics. For F-FOMAML, the GNN variant improves on both the base model and the MQCNN variant in MSE and MAE, and it yields the lowest MSE in Table 1.

Ablation study.

To better understand the effect of proxy data, we perform an ablation study by varying the k parameter in the k-shot proxy data selection and evaluating the performance metrics as the value of k changes. As illustrated in Figure 3, we observed that initially increasing k leads to a rise in the error metric, suggesting a decline in model performance due to less relevant data. This trend reaches a plateau, after which further increases in k result in decreased error, indicating improved performance from a larger proxy data set. These findings highlight a critical threshold where the quantity of proxy data begins to enhance model performance, emphasizing the potential benefits of utilizing larger proxy data sets. Detailed analyses and additional studies on algorithm convergence are provided in the Appendix C.3.
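One simple way to realize k-shot proxy selection is to keep the k candidates closest to the target in an embedding space. The sketch below assumes Euclidean distance on precomputed embeddings; the function name, the SKU identifiers, and the embedding values are hypothetical, and the similarity measure used in the paper may differ.

```python
import math

def select_k_proxies(target_embedding, candidate_embeddings, k):
    """Return the identifiers of the k candidates nearest to the target."""
    ranked = sorted(
        candidate_embeddings.items(),
        key=lambda item: math.dist(target_embedding, item[1]),
    )
    return [name for name, _ in ranked[:k]]

# Hypothetical embeddings for three candidate SKUs.
candidates = {
    "sku_a": (0.0, 0.0),
    "sku_b": (1.0, 0.0),
    "sku_c": (3.0, 4.0),
}
target = (0.1, 0.0)

for k in (1, 2, 3):
    print(k, select_k_proxies(target, candidates, k))
```

Increasing k pulls in progressively more distant candidates (here `sku_c`), which is the mechanism by which a larger proxy set can initially add less relevant data.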

Our analysis indicates that F-FOMAML, especially when enhanced with GNN-generated proxy data, compares favorably with traditional regression models, ensemble strategies, neural networks, and other advanced meta-learning algorithms in predicting e-commerce sales on JD.com. It achieves the lowest MSE in Table 1 (0.6089) and the lowest MAE among the GNN-augmented meta-learning methods (0.3713, versus 0.3752 for MUMOMAML + GNN), although some baselines, such as LSTM + GNN (MAE 0.2936), attain a lower MAE or MAPE. Moreover, using GNN as the proxy data selection method improves F-FOMAML on all three metrics (MSE, MAE, and MAPE) relative to the base model, underscoring the role of proxy data techniques in improving demand prediction models. The comparison with MQCNN further illustrates GNN’s ability to capture complex data relationships: F-FOMAML + GNN attains lower MSE and MAE than F-FOMAML + MQCNN. These findings highlight the value of combining GNN with meta-learning-based prediction models and offer insights into the development of more precise demand prediction algorithms.
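The 1.04% MAE reduction on JD.com quoted in the abstract can be reproduced from Table 1. The sketch below assumes that the reference baseline is MUMOMAML + GNN; this pairing is inferred from the numbers in the table rather than stated in the text, and the function name is hypothetical.

```python
def relative_reduction(baseline, ours):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - ours) / baseline

mae_mumomaml_gnn = 0.3752  # MUMOMAML + GNN, MAE column of Table 1
mae_ffomaml_gnn = 0.3713   # F-FOMAML + GNN, MAE column of Table 1

reduction_pct = 100 * relative_reduction(mae_mumomaml_gnn, mae_ffomaml_gnn)
print(f"{reduction_pct:.2f}%")
```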

7 Conclusion

This paper presents a novel approach to the challenging task of predicting demand during promotional events characterized by special buying behaviors. Traditional sales data often falls short due to the limited availability of historical data for such events. To address this, we framed demand prediction within the graph-augmented meta-learning paradigm. Utilizing the GNN-enhanced F-FOMAML algorithm, which integrates the generalizability of meta-learning with the adaptability of FiLM layers, we developed a robust forecasting model particularly effective in data-sparse scenarios.

Our method is supported by a theoretical analysis showing that the algorithm controls the excess risk by managing the bias-variance trade-off. Empirical evaluations highlight our model’s advantage over conventional forecasting techniques and suggest potential applicability beyond retail, for example in online banking security and digital marketing. Specifically, F-FOMAML reduces prediction MAE by 26.24% on the vending machine dataset and by 1.04% on the JD.com dataset, with a notable 10.18% enhancement over GNN-based benchmarks. Further discussion on the strengths, limitations, and future research directions is provided in Appendix D.

References

Baardman et al. (2017) Lennart Baardman, Igor Levin, Georgia Perakis, and Divya Singhvi. 2017. Leveraging comparables for new product sales forecasting. Available at SSRN 3086237 (2017).
Bishop et al. (2014) Jared Bishop, John Peters, and York Yannikos. 2014. Rapid response learning of humanitarian interventions. In 2014 IEEE Symposium on Computational Intelligence in Multicriteria Decision-Making (MCDM). IEEE, 9–15.
Breiman et al. (1984) Leo Breiman, Jerome H Friedman, Richard A Olshen, and Charles J Stone. 1984. Classification and regression trees. CRC press (1984).
Cao and Zhang (2021) Xinyu Cao and Juanjuan Zhang. 2021. Preference learning and demand forecast. Marketing Science 40, 1 (2021), 62–79.
Chang et al. (2019) Kevin Chang, Jingxuan Wu, Xinyang Yu, Ruichen Yu, Xiaoxin Chen, Weiwei Dai, and David Anastasiu. 2019. Chimera: Large-Scale Classification using Machine Learning, Rules, and Crowdsourcing. In Proceedings of the VLDB Endowment, Vol. 12. VLDB Endowment, 193–206.
Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016), 785–794.
Choe et al. (2023) Sang Keun Choe, Sanket Vaibhav Mehta, Hwijeen Ahn, Willie Neiswanger, Pengtao Xie, Emma Strubell, and Eric Xing. 2023. Making Scalable Meta Learning Practical. arXiv:2310.05674 [cs.LG]
Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
Cortes and Vapnik (1995) Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine learning 20, 3 (1995), 273–297.
Eisenach et al. (2020) Carson Eisenach, Yagna Patel, and Dhruv Madeka. 2020. MQTransformer: Multi-Horizon Forecasts with Context Dependent and Feedback-Aware Attention. CoRR abs/2009.14799 (2020).
Ferreira et al. (2016) Kris Johnson Ferreira, Bin Hong Alex Lee, and David Simchi-Levi. 2016. Analytics for an online retailer: Demand forecasting and price optimization. Manufacturing & service operations management 18, 1 (2016), 69–88.
Finn et al. (2017a) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017a. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR, 1126–1135.
Finn et al. (2017b) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017b. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In ICML. 1126–1135.
Fiot and Dinuzzo (2015) Jean-Baptiste Fiot and Francesco Dinuzzo. 2015. Electricity Demand Forecasting by Multi-Task Learning. arXiv:1512.08178 [cs.LG]
Gong et al. (2012) Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. 2012. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. 2066–2073.
Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. In MIT Press.
Gupta et al. (2016) Shivam Gupta, Kiran Ramesh, and Anil Kumar. 2016. Transfer Learning for Yield Prediction. In 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW). IEEE, 1109–1116.
Jean et al. (2016) Neal Jean, Marshall Burke, Michael Xie, W Matthew Davis, David B Lobell, and Stefano Ermon. 2016. Combining satellite imagery and machine learning to predict poverty. In Science, Vol. 353. American Association for the Advancement of Science, 790–794.
Kipf et al. (2018) Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard Zemel. 2018. Neural relational inference for interacting systems. In International Conference on Machine Learning. 2688–2697.
Kipf and Welling (2017) Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR (Poster). OpenReview.net.
Kong et al. (2020a) Weihao Kong, Raghav Somani, Sham Kakade, and Sewoong Oh. 2020a. Robust Meta-learning for Mixed Linear Regression with Small Batches. In Advances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., 4683–4696.
Kong et al. (2020b) Weihao Kong, Raghav Somani, Zhao Song, Sham Kakade, and Sewoong Oh. 2020b. Meta-Learning for Mixed Linear Regression. In Proceedings of the 37th International Conference on Machine Learning (ICML’20). JMLR.org, Article 500, 11 pages.
Lai et al. (2018) Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. 2018. Modeling long-and short-term temporal patterns with deep neural networks. In SIGIR. 95–104.
Le Guen and Thome (2020) Vincent Le Guen and Nicolas Thome. 2020. Probabilistic Time Series Forecasting with Structured Shape and Temporal Diversity. In Proceedings of the 34th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada) (NIPS’20). Curran Associates Inc., Red Hook, NY, USA, Article 372, 14 pages.
Li et al. (2020) Jing Li, Yan Zhang, Zhenzhen Yang, and Shengjun Wang. 2020. Relation-aware Meta-learning for E-commerce Market Segment Demand Prediction with Limited Records. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management. ACM, 2325–2334.
Li et al. (2017a) Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2017a. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926 (2017).
Li et al. (2017b) Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. 2017b. Meta-sgd: Learning to learn quickly for few shot learning. arXiv preprint arXiv:1707.09835 (2017).
Lim et al. (2019) Bryan Lim, Sercan Ömer Arik, Nicolas Loeff, and Tomas Pfister. 2019. Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting. CoRR abs/1912.09363 (2019).
Long et al. (2015) Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I Jordan. 2015. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning. PMLR, 97–105.
Long et al. (2013) Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and SY Philip. 2013. Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE international conference on computer vision. 2200–2207.
Ma et al. (2017) Fenglong Ma, Radha Chitta, Jing Zhou, Quanzeng You, Tong Sun, and Jing Gao. 2017. Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks. In KDD. ACM, 1903–1911.
Ma and Simchi-Levi (2022) Will Ma and David Simchi-Levi. 2022. Constructing demand curves from a single observation of bundle sales. In International Conference on Web and Internet Economics. Springer, 150–166.
Meng et al. (2022) Yu Meng, Jiaxin Huang, Yu Zhang, and Jiawei Han. 2022. Adapting Pretrained Representations for Text Mining. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Washington DC, USA) (KDD ’22). Association for Computing Machinery, New York, NY, USA, 4806–4807. https://doi.org/10.1145/3534678.3542607
Nichol et al. (2018) Alex Nichol, Joshua Achiam, and John Schulman. 2018. On First-Order Meta-Learning Algorithms. ArXiv abs/1803.02999 (2018). https://api.semanticscholar.org/CorpusID:4587331
Oreshkin et al. (2020) Boris N Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. 2020. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In Proceedings of the 8th International Conference on Learning Representations (ICLR).
Pan and Yang (2009) Sinno Jialin Pan and Qiang Yang. 2009. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, 10 (2009), 1345–1359.
Perez et al. (2018) Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron C. Courville. 2018. FiLM: Visual Reasoning with a General Conditioning Layer. In AAAI. AAAI Press, 3942–3951.
Salinas et al. (2020) David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. 2020. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting 36, 3 (2020), 1181–1191.
Shang et al. (2021) Chao Shang, Jie Chen, and Jinbo Bi. 2021. Discrete Graph Structure Learning for Forecasting Multiple Time Series. In ICLR. OpenReview.net.
Shen et al. (2020) Max Shen, Christopher S Tang, Di Wu, Rong Yuan, and Wei Zhou. 2020. JD. com: Transaction-level data for the 2020 MSOM data driven research challenge. Manufacturing & Service Operations Management (2020).
Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard S. Zemel. 2017. Prototypical Networks for Few-shot Learning. In NIPS. 4077–4087.
Sung et al. (2017) Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M. Hospedales. 2017. Learning to Compare: Relation Network for Few-Shot Learning. https://doi.org/10.48550/ARXIV.1711.06025
Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In NIPS. 3104–3112.
Tzeng et al. (2017) Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. 2017. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2962–2971.
Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Tim Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. 2016. Matching Networks for One Shot Learning. In NIPS. 3630–3638.
Vuorio et al. (2019) Risto Vuorio, Shao-Hua Sun, Hexiang Hu, and Joseph J. Lim. 2019. Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation. In Neural Information Processing Systems.
Wang et al. (2019) Yaqing Wang, Qin Yao, and Ivor W Tsang. 2019. Meta-learning for online retail sales prediction. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. ACM, 2073–2076.
Wen et al. (2017) Ruofeng Wen, Kari Torkkola, Balakrishnan Narayanaswamy, and Dhruv Madeka. 2017. A multi-horizon quantile recurrent forecaster. arXiv preprint arXiv:1711.11053 (2017).
Wu et al. (2020) Neo Wu, Bradley Green, Xue Ben, and Shawn O’Banion. 2020. Deep transformer models for time series forecasting: The influenza prevalence case. arXiv preprint arXiv:2001.08317 (2020).
Yang et al. (2023) Sitan Yang, Malcolm Wolff, Shankar Ramasubramanian, Vincent Quenneville-Belair, Ronak Mehta, and Michael Mahoney. 2023. GEANN: Scalable graph augmentations for multi-horizon time series forecasting. In KDD 2023 Workshop on Deep Learning on Graphs.
Yao et al. (2019) Huaxiu Yao, Ying Wei, Junzhou Huang, and Zhenhui Li. 2019. Learning to learn by remembering. In Advances in Neural Information Processing Systems. 1574–1584.
Zhou et al. (2020) Pan Zhou, Yingtian Zou, Xiaotong Yuan, Jiashi Feng, Caiming Xiong, and SC Hoi. 2020. Task Similarity Aware Meta Learning: Theory-inspired Improvement on MAML. In 4th Workshop on Meta-Learning at NeurIPS.
Zhu et al. (2020) Qi Zhu, Yidan Xu, Haonan Wang, Chao Zhang, Jiawei Han, and Carl Yang. 2020. Transfer Learning of Graph Neural Networks with Ego-graph Information Maximization. In Neural Information Processing Systems. https://api.semanticscholar.org/CorpusID:221640925

Appendix A Appendix

This appendix offers additional content to complement the core findings of our research. Below is a summary of the sections included:

• Proof of Theorem 1 (Appendix B): A step-by-step proof of Theorem 1 is presented, supporting the theoretical underpinnings of our proposed methodology.
• Experimental Details and Supplementary Results (Appendix C): We outline the datasets employed, the design choices, and the implementation specifics of our experiments, and present additional results that reinforce our conclusions.
• Discussion and Future Directions (Appendix D): We discuss the strengths and limitations of our approach and point out potential avenues for future research.

Appendix B Proof of Theorem 1

Proof.

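Throughout the proof, $n$ denotes the number of training tasks (written $T$ in the main text), $N$ denotes the per-task sample size (written $n$ in the main text), and $h_{i}$ and $\widehat{h}_{i}$ abbreviate $h_{t_{i}}$ and $\widehat{h}_{t_{i}}$ . Let us first define an intermediate function:

h_{\widetilde{t}}^{(im)}=\frac{\sum_{i=1}^{n}w(\widetilde{t},t_{i})h_{i}}{\sum_{i=1}^{n}w(\widetilde{t},t_{i})}.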

We then define the event $E_{n}=\{\sum_{i=1}^{n}w(\widetilde{t},t_{i})>0\}$ . Conditioned on the event $E_{n}$ , we have

\begin{aligned}
\operatorname*{\mathbb{E}}\big[\big(h^{(im)}_{\widetilde{t}}(f(x))-\widehat{h}_{\widetilde{t}}(f(x))\big)^{2}\big]
&\leq\frac{\sum_{i=1}^{n}w(\widetilde{t},t_{i})\cdot\operatorname*{\mathbb{E}}[(\widehat{h}_{i}(f(x))-h_{i}(f(x)))^{2}]}{(\sum_{i=1}^{n}w(\widetilde{t},t_{i}))^{2}}\\
&\leq\frac{\max_{i}\operatorname*{\mathbb{E}}[(\widehat{h}_{i}(f(x))-h_{i}(f(x)))^{2}]}{\sum_{i=1}^{n}w(\widetilde{t},t_{i})}\\
&=O\left(\frac{C(\mathcal{H})}{N\sum_{i=1}^{n}w(\widetilde{t},t_{i})}\right),
\end{aligned}

where the last step uses the assumption that $\operatorname*{\mathbb{E}}[(\widehat{h}_{t}(f(x))-h_{t}(f(x)))^{2}]=O\left(\frac{C(\mathcal{H})}{N_{t}}\right)$ and $N_{t}\gtrsim N$ .

Moreover, since $\|h_{t_{1}}-h_{t_{2}}\|_{\infty}\leq C\cdot\|v_{t_{1}}-v_{t_{2}}\|\leq Ch$ when $\|v_{t_{1}}-v_{t_{2}}\|\leq h$ , we have that on the event $E_{n}$ ,

\begin{aligned}
\|h_{\widetilde{t}}^{(im)}-h_{\widetilde{t}}\|
&=\Big\|\frac{\sum_{i=1}^{n}w(\widetilde{t},t_{i})(h_{i}-h_{\widetilde{t}})}{\sum_{i=1}^{n}w(\widetilde{t},t_{i})}\Big\|\\
&=\Big\|\frac{\sum_{i=1}^{n}\bm{1}\{\|v_{\widetilde{t}}-v_{t_{i}}\|\leq h\}(h_{i}-h_{\widetilde{t}})}{\sum_{i=1}^{n}\bm{1}\{\|v_{\widetilde{t}}-v_{t_{i}}\|\leq h\}}\Big\|\\
&\leq Ch.
\end{aligned}

On the other hand, on the event $E_{n}^{c}$ , the denominator equals 0, so by definition $h_{\widetilde{t}}^{(im)}=0$ , and therefore

|h_{\widetilde{t}}^{(im)}(f(x))-h_{\widetilde{t}}(f(x))|^{2}=h_{\widetilde{t}}^{2}(f(x)).

Consequently, we can write

|h_{\widetilde{t}}^{(im)}(f(x))-h_{\widetilde{t}}(f(x))|^{2}\leq C^{2}h^{2}+h_{\widetilde{t}}^{2}(f(x))\bm{1}_{E_{n}^{c}}.

Therefore, we have

\begin{align}
\operatorname*{\mathbb{E}}\big[\big(\widehat{h}_{\widetilde{t}}-h_{\widetilde{t}}\big)^{2}\big]\lesssim{}&\operatorname*{\mathbb{E}}\left[\frac{C(\mathcal{H})}{N\sum_{i=1}^{n}w(\widetilde{t},t_{i})}\cdot\bm{1}_{E_{n}}\right]\tag{9}\\
&+h^{2}\tag{10}\\
&+\operatorname*{\mathbb{E}}[h_{\widetilde{t}}^{2}(f(x))\cdot\bm{1}_{E_{n}^{c}}].\tag{11}
\end{align}

To bound the first term on the right-hand side (RHS), we let

Y=\sum_{i=1}^{n}w(\widetilde{t},t_{i})=\sum_{i=1}^{n}\bm{1}\{\|v_{\widetilde{t}}-v_{t_{i}}\|\leq h\}.

Since the $v_{t}$ are uniformly distributed on $[0,1]^{r}$ , we have that $Y\sim Binomial(n,q)$ with $q=\mathbb{P}(\|v-v_{\widetilde{t}}\|\leq h)\gtrsim h^{r}$ . Using the property of binomial distribution, we have

\operatorname*{\mathbb{E}}\left[\frac{\bm{1}\{Y>0\}}{Y}\right]\lesssim\frac{1}{nq}\lesssim\frac{1}{nh^{r}}.
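The binomial property invoked here is the standard inequality $\operatorname*{\mathbb{E}}[\bm{1}\{Y>0\}/Y]\leq 2/((n+1)q)$ for $Y\sim Binomial(n,q)$ . The sketch below checks it numerically by computing the expectation exactly from the probability mass function; the function name and the $(n,q)$ pairs are arbitrary illustrative choices.

```python
from math import comb

def expected_inverse_positive_binomial(n, q):
    """Exact value of E[1{Y > 0} / Y] for Y ~ Binomial(n, q)."""
    return sum(
        comb(n, k) * q ** k * (1 - q) ** (n - k) / k
        for k in range(1, n + 1)
    )

for n, q in [(10, 0.5), (50, 0.1), (200, 0.05)]:
    exact = expected_inverse_positive_binomial(n, q)
    bound = 2.0 / ((n + 1) * q)
    print(f"n={n}, q={q}: exact={exact:.4f}, bound={bound:.4f}")
```

The inequality follows from $1/k\leq 2/(k+1)$ for $k\geq 1$ together with $\operatorname*{\mathbb{E}}[1/(1+Y)]\leq 1/((n+1)q)$ .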

Therefore, the first term on RHS is upper bounded as:

\operatorname*{\mathbb{E}}\left[\frac{C(\mathcal{H})}{N\sum_{i=1}^{n}w(\widetilde{t},t_{i})}\cdot\bm{1}_{E_{n}}\right]\lesssim\frac{C(\mathcal{H})}{N\cdot nh^{r}}.

The third term, (11), can be bounded as

\begin{aligned}
\operatorname*{\mathbb{E}}[h_{\widetilde{t}}^{2}(f(x))\cdot\bm{1}_{E_{n}^{c}}]
&\leq\sup h_{\widetilde{t}}^{2}(f(x))\operatorname*{\mathbb{E}}[(1-q)^{n}]\\
&\lesssim\sup h_{\widetilde{t}}^{2}(f(x))\frac{1}{qn}\\
&\lesssim\frac{1}{nh^{r}}.
\end{aligned}
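The middle step relies on an elementary bound that is left implicit. As a sketch, for $q\in(0,1]$ ,

(1-q)^{n}\leq e^{-nq}\leq\frac{1}{e\,nq},

where the first inequality uses $1-q\leq e^{-q}$ and the second uses $xe^{-x}\leq 1/e$ for $x\geq 0$ with $x=nq$ . The final step then follows from $q\gtrsim h^{r}$ , provided $\sup h_{\widetilde{t}}^{2}(f(x))$ is bounded by a constant, which is assumed here.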

Combining all the pieces, we get

\operatorname*{\mathbb{E}}[(\widehat{h}_{\widetilde{t}}-h_{\widetilde{t}})^{2}]\lesssim h^{2}+\frac{C(\mathcal{H})/N}{nh^{r}}.

Therefore, when $l$ is Lipschitz with respect to the first argument, we have that

\begin{aligned}
\operatorname*{\mathbb{E}}_{(x,y)\sim\widetilde{t}}\left[l(\widehat{g}_{\widetilde{t}}(x;\mathcal{D}^{tr},A),y)\right]-\operatorname*{\mathbb{E}}_{(x,y)\sim\widetilde{t}}[l(g_{\widetilde{t}}(x),y)]
&\leq\operatorname*{\mathbb{E}}[\|\widehat{h}_{\widetilde{t}}-h_{\widetilde{t}}\|]\\
&\leq\sqrt{\operatorname*{\mathbb{E}}[(\widehat{h}_{\widetilde{t}}-h_{\widetilde{t}})^{2}]}\\
&\lesssim h+\sqrt{\frac{C(\mathcal{H})/N}{nh^{r}}}.
\end{aligned}

If we further take $h\asymp(\frac{C(\mathcal{H})/N}{n})^{\frac{1}{r+2}}$ , we obtain

\operatorname*{\mathbb{E}}_{(x,y)\sim\widetilde{t}}[l(\widehat{g}_{\widetilde{t}}(x;\mathcal{D}^{tr},A),y)]-\operatorname*{\mathbb{E}}_{(x,y)\sim\widetilde{t}}[l(g_{\widetilde{t}}(x;\mathcal{D}^{tr},A),y)]\lesssim\left(\frac{C(\mathcal{H})/N}{n}\right)^{\frac{1}{r+2}},

which completes the proof. ∎

Appendix C Detailed Experiment Design and Additional Results

C.1 Dataset Description

Vending Machine Company

In our experiments, we utilize various datasets that contain detailed information about the sales orders and product attributes.

Sales Orders (all files ending with sale_order.csv): Each dataset covers a specific time range and includes both participating and non-participating products in the experiments. Each record corresponds to a sales order, characterized by the business area, shelf code, order code, product code, user code, purchase date, quantity purchased, sale price, actual payment amount, and product category.

Experiment Details (experiment_detail_product.csv): This dataset includes the products participating in pricing experiments A and B. Each product is characterized by attributes such as the business area, shelf code, scene, scene subdivision, product category, product sub-category, experimental group designation, sale price, adjusted price, cross-price indicator, and an indicator for prices below 95% of the overall average price.

Product Details (product_detail.csv): This dataset contains information about the attributes of each product, including the product code, product category, product type, product sub-category, and an indicator specifying whether it is a common product.

Shelf Details (shelf_detail.csv): This dataset provides attributes of each shelf, including the business area, shelf code, an indicator for low-selling devices in the previous month, an indicator for the ability to place high-priced products, old user rate, and shelf grade.

The various data points in these datasets allow for an extensive and comprehensive analysis of the sales patterns and facilitate the learning and evaluation of the proposed model.

Table 2: Detailed feature definition in the Vending Machine dataset.

(a) Sales Order Datasets
Feature	Type	Description
business_area	String	Area of the business.
shelf_code	String	Unique identifier for the shelf.
order_code	String	Unique identifier for the order.
product_code	String	Unique identifier for the product.
user_code	String	Unique identifier for the user.
pay_date	Date	Purchase date.
quantity_act	Integer	Quantity purchased.
sale_price	Float	Sale price of the product.
real_total_price	Float	Actual payment amount.
product_type	String	Product category.
(b) Experiment Detail Dataset
Feature	Type	Description
business_area	String	Area of the business.
shelf_code	String	Unique identifier for the shelf.
product_code	String	Unique identifier for the product.
mtype	String	Scene.
scene	String	Scene subdivision.
second_type_name	String	Product category.
sub_type_name	String	Product sub-category.
if_exper	Integer	Indicator for the experimental group (1) or control group (0).
sale_price	Float	Sale price of the product.
ab_price	Float	Adjusted price.
cross_price	Float	Cross-price indicator.
lower_price95	Integer	Indicator for prices below 95% of the overall average price.
(c) Product Detail Dataset
Feature	Type	Description
product_code	String	Unique identifier for the product.
type_name	String	Product category.
second_type_name	String	Product type.
sub_type_name	String	Product sub-category.
is_common_product	Integer	Indicator for common products (1 for yes, 0 for no).
(d) Shelf Detail Dataset
Feature	Type	Description
business_area	String	Area of the business.
shelf_code	String	Unique identifier for the shelf.
is_low_sale	Integer	Indicator for low-selling devices in the previous month (1 for yes, 0 for no).
can_fill_high_price	Integer	Indicator for the ability to place high-priced products.
old_user_rate	Float	Old user rate.
grade	String	Shelf grade.

JD.com

We work with the transactional records from JD.com, which offer a blend of both static and dynamic features related to the product (SKU) and order details for March 2018 (Shen et al., 2020). The SKU table contains information about the SKUs that were clicked at least once during March 2018. Each SKU entry has a unique SKU ID and is associated with a seller. For this study, 9,167 SKUs were selected. Each SKU possesses two pivotal attributes, which could, depending on the category, represent product features like SPF for face moisturizers or the number of personalized shaving modes for electric shavers.

The Order table encompasses details about unique customer orders within our designated product category from March 2018. This table elucidates specifics like order quantity, order date and time, SKU type, and the promised delivery time of the order. Additionally, it captures the product pricing and promotional activities, delineating the difference between the original and final unit price, thereby indicating the promotional discounts offered.

For our analysis, we split our tasks into the following partitions: 1) the training set $\mathcal{D}_{train}$, comprising data from regular sales days, which is used for initial model training; 2) the query set $\mathcal{D}_{query}$, which consists of slightly modified versions of tasks from $\mathcal{D}_{train}$ and facilitates the inner-loop adaptation; and 3) the testing set $\mathcal{D}_{test}$, which incorporates data from peak sales periods and is reserved for assessing model performance.

Table 3: Detailed feature description of the JD.com dataset.

(a) Description of the SKU table
Field	Data type	Description	Sample value
sku_ID	String	Unique identifier of a product	b4822497a5
type	Int	1P or 3P SKU	1
brand_ID	String	Brand unique identification code	c840ce7809
attribute1	Int	First key attribute of the category	3
attribute2	Int	Second key attribute of the category	60
activate_date	String	The date on which the SKU is first introduced	2018-03-01
deactivate_date	String	The date on which the SKU is terminated	2018-03-01
(b) Description of the users table
Field	Data type	Description	Sample value
user_ID	String	User unique identification code	000000f736
user_level	Int	User level	10
first_order_month	String	First month in which the customer placed an order on JD.com	2017-07
plus	Int	If user is with a PLUS membership	0
gender	String	User gender (estimated)	F
age	String	User age range (estimated)	26–35
marital_status	String	User marital status (estimated)	M
education	Int	User education level (estimated)	3
purchase_power	Int	User purchase power (estimated)	2
city_level	Int	City level of user address	1
(c) Description of the orders table
Field	Data type	Description	Sample value
order_ID	String	Order unique identification code	3b76bfcd3b
user_ID	String	User unique identification code	3cde601074
sku_ID	String	SKU unique identification code	443fd601f0
order_date	String	Order date (format: yyyy-mm-dd)	2018-03-01
order_time	String	Specific time at which the order gets placed	2018-03-01 11:10:40.0
quantity	Int	Number of units ordered	1
type	Int	1P or 3P orders	1
promise	Int	Expected delivery time (in days)	2
original_unit_price	Float	Original list price	99.9
final_unit_price	Float	Final purchase price	53.9
direct_discount_per_unit	Float	Discount due to SKU direct discount	5.0
quantity_discount_per_unit	Float	Discount due to purchase quantity	41.0
bundle_discount_per_unit	Float	Discount due to “bundle promotion”	0.0
coupon_discount_per_unit	Float	Discount due to customer coupon	0.0
gift_item	Int	If the SKU is with gift promotion	0
dc_ori	Int	Distribution center ID where the order is shipped from	29
dc_des	Int	Destination address where the order is shipped to	29
		(represented by the closest distribution center ID)

C.2 Implementation Details

We employ PyTorch for model implementation and Adam for optimization. The learning rate scheduler uses a warmup phase, accounting for 10% of the total training steps, followed by a linear decay to zero. We set the dropout rate at 0.5 to prevent overfitting during training.
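The warmup-then-linear-decay schedule described above can be sketched as a multiplier on the base learning rate. This is a minimal illustration rather than the authors' implementation: the function name `warmup_linear_factor` and the step count are hypothetical, and only the 10% warmup ratio and the base learning rate of 1e-3 come from the text.

```python
def warmup_linear_factor(step: int, total_steps: int, warmup_ratio: float = 0.1) -> float:
    """Multiplier applied to the base learning rate at a given step.

    Rises linearly from 0 to 1 during warmup, then decays linearly to 0.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))


base_lr = 1e-3      # learning rate reported in the paper
total_steps = 1000  # hypothetical step count, for illustration only
schedule = [base_lr * warmup_linear_factor(s, total_steps) for s in range(total_steps + 1)]
```

In PyTorch, a function of this form could be passed as the `lr_lambda` argument of `torch.optim.lr_scheduler.LambdaLR`, which multiplies the optimizer's initial learning rate by the returned factor at each scheduler step.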

Each training and test episode corresponds to a single task. For each task, we sample 5 data points (k-shot) for training, validation, and testing. We iterate over 1000 training episodes, 200 validation episodes, and 1000 test episodes.
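One way to read the episode construction above is sketched below, under the assumption that an episode picks a single task and draws $k=5$ points from it without replacement. The helper `sample_episode` and the toy task dictionary are hypothetical and are not taken from the paper's code.

```python
import random


def sample_episode(tasks, k=5, rng=None):
    """Pick one task at random and draw k (features, demand) pairs from it without replacement."""
    rng = rng or random.Random()
    task_id = rng.choice(sorted(tasks))
    points = tasks[task_id]
    if len(points) < k:
        raise ValueError(f"task {task_id!r} has fewer than {k} points")
    return task_id, rng.sample(points, k)


# Hypothetical toy data: task id -> list of (features, demand) pairs.
toy_tasks = {
    "task_a": [((float(i),), 2.0 * i) for i in range(10)],
    "task_b": [((float(i),), 0.5 * i) for i in range(8)],
}
rng = random.Random(0)  # fixed seed so the sketch is reproducible
episodes = [sample_episode(toy_tasks, k=5, rng=rng) for _ in range(3)]
```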

The hidden dimensions of the product-specific and machine-specific estimators are set to 32. After training for 100 epochs, the model with the lowest validation MSE is saved and used for testing.

All experiment outputs are saved in a timestamped directory under the project’s output path for reproducibility and future reference.

After training, we load the model with the best performance on the validation set to evaluate on the testing set, and report the test scores.

For the GNN-based learning task, we aggregate the order information of each product to obtain the daily quantities sold during the one month of JD.com data. We also include the original and final price sequences as time series features, and we obtain each product's static features, such as brand and attributes. The forecasting task is to predict each product's sales for the next day using the past 16 days of information (i.e., $C=16$ in Equation (5)).
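The windowing implied by this setup can be sketched as follows. The sketch assumes a single daily sales series per product and a hypothetical helper `make_windows`; the actual model also consumes price sequences and static features, which are omitted here.

```python
def make_windows(series, context=16, horizon=1):
    """Split one product's daily series into (past `context` days, target) pairs."""
    samples = []
    for start in range(len(series) - context - horizon + 1):
        past = series[start:start + context]
        target = series[start + context + horizon - 1]
        samples.append((past, target))
    return samples


# Hypothetical daily sales for one product over a 31-day month.
daily_sales = [float(d % 7) for d in range(31)]
windows = make_windows(daily_sales)
```

With a 31-day series and a 16-day context, this yields 15 training pairs per product, the last of which targets the final day of the month.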

The graph constructed uses products’ brand information such that products under the same brand are connected. In this task, the graph contains 9159 nodes that represent all products having at least 1 unit of sale in one month. We use two graph convolutional (GCN) layers to explore all 2-hop neighbors for each product, together with its own static and dynamic features in the prediction task.
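The same-brand connectivity rule can be illustrated with a small edge-list builder. This is a sketch under the assumption that every pair of products sharing a brand is linked by one undirected edge; the function `brand_edges` and the identifiers in the toy mapping are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def brand_edges(sku_to_brand):
    """Return undirected edges (u, v), with u < v, between SKUs that share a brand."""
    by_brand = defaultdict(list)
    for sku, brand in sku_to_brand.items():
        by_brand[brand].append(sku)
    edges = []
    for skus in by_brand.values():
        # Every pair inside a brand group becomes one edge.
        edges.extend(combinations(sorted(skus), 2))
    return sorted(edges)


# Hypothetical SKU and brand identifiers.
toy = {"sku1": "brandA", "sku2": "brandA", "sku3": "brandB", "sku4": "brandA"}
edges = brand_edges(toy)
```

Note that the number of edges grows quadratically with the size of a brand group, so a production pipeline would likely need a sparse representation or neighbor sampling for large brands.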

We train the graph-based model until convergence, using the Adam optimizer with default settings to minimize the Mean Absolute Error (MAE) loss with a batch size of 8 for up to 100 epochs. The final embedding for each product is a numerical vector of length 90, which contains 50, 8, and 32 encoded features representing static, dynamic, and graph-enhanced features, respectively.

Our F-FOMAML model is configured with the hyper-parameters in Table 4. We use two-layer feed-forward neural networks with ReLU activation to encode the product and the vending machine separately. The hidden size is set to $64$. All models are trained for 100K episodes, where each episode is a regression task, using the mean-square-error loss. We use Adam as the optimizer with a learning rate of 1e-3. A linear warmup scheduler is used, with the first 10% of episodes as warm-up episodes. The dropout is set to 0.5. We implement our method using PyTorch 1.11 and Python 3.8. The model is trained on a CentOS Linux 7 machine with 128 AMD EPYC 7513 32-Core Processors, 512 GB memory, and eight NVIDIA RTX A6000 GPUs.

Table 4: Hyperparameter configuration of our F-FOMAML network.

Dropout rate	0.5
Hidden dimension size	32
Training $k$-shot	5
Training episodes	1000
Validation $k$-shot	5
Validation episodes	200
Test $k$-shot	5
Test episodes	1000
Epochs	100
Learning rate scheduler	WarmupLinear
Warmup ratio	0.1
Optimizer	Adam
Learning rate	$0.001$
Weight decay	0
Monitor metric	Mean Squared Error (MSE)

C.3 Additional Experiment Results

To validate our methodology, we sourced data from two major trading contexts: the vending machine merchandising dataset and the JD.com dataset from a renowned e-commerce platform.

Experiments on Vending machine data

Vending Machine Merchandising Dataset. This dataset comes from a private vending machine company and contains sales data from March 10, 2022, to April 20, 2022, for 246 products and 1715 vending machines. Each product in a specific vending machine has a base price, in effect for the first 20 days, and an adjusted price, in effect for the last 20 days. The goal is to estimate the demand at the adjusted price given the demand at the base price. We use the category information (7 categories in total) as product features, and the region (4 regions) and scene (8 scenes) information as vending machine features. We use the last 10 days as the testing set and the second-to-last 10 days as the training set. The detailed data description can be found in Table 2.

Table 5: Experiment results of F-FOMAML using vending machine sales data. Our F-FOMAML obtains the smallest error on the real-world dataset among the competing baselines.

Method	with GNN	MSE	MAE	MAPE
Linear Regression	No	0.6218	0.4782	0.2900
MLP	No	0.2811	0.2038	0.1499
MAML	No	0.2985	0.2143	0.1587
F-FOMAML (Ours)	No	0.2345	0.1532	0.1206

Table 5 compares our F-FOMAML with Linear Regression, MLP, and the meta-learning model MAML in predicting vending machine sales. The MLP baseline is strong: it performs on par with MAML, and slightly better on all three metrics, which may suggest that MAML overfits in data-limited scenarios. Linear Regression trails both by a wide margin. F-FOMAML surpasses all of these models. Our analysis indicates that F-FOMAML improves MAE by 26.24% over the existing models on the vending machine dataset, benefiting from domain-specific insights for relation construction. This result underscores F-FOMAML's effectiveness in demand forecasting.
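For reference, the three error metrics reported in Table 5 can be computed as in the sketch below. The function `regression_metrics` is hypothetical, and it assumes that MAPE is expressed as a fraction rather than a percentage and that a small `eps` guards against zero demand; the paper does not spell out either convention.

```python
def regression_metrics(y_true, y_pred, eps=1e-8):
    """Return (MSE, MAE, MAPE) for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # Assumption: MAPE is a fraction, and eps avoids division by zero demand.
    mape = sum(abs(e) / max(abs(t), eps) for e, t in zip(errors, y_true)) / n
    return mse, mae, mape


# Hypothetical true and predicted demand values.
mse, mae, mape = regression_metrics([2.0, 4.0, 5.0], [1.0, 4.0, 7.0])
```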

In the analysis of the vending machine dataset (Table 5), GNNs, MQCNN, and other sequential models were excluded because the dataset lacks continuous time-series data, which GNNs need in order to produce effective embeddings. Therefore, a direct comparison involving GNNs is not included for this dataset.

Additional Ablation studies and analysis on JD.com dataset

In our ablation study on proxy data, we vary the k parameter of the k-shot proxy data selection and examine its effect on our method's error metrics. The results are presented in Figure 3. As k initially increases, the error rises, possibly because the larger proxy set includes less relevant data. The error then plateaus, and beyond this saturation point it decreases as k grows further, suggesting that the model starts to benefit from the expanded pool of proxy data. This points to a threshold beyond which the quantity of proxy data outweighs the dilution of relevance and model performance improves.

Additional details regarding the training performance are presented in Figure 4, where our algorithm converges to a lower error than the other baselines. The fluctuation in our method's curve is due to the feature-based adaptation, which introduces a stochastic pattern into the convergence behavior.

Appendix D Discussion

Adaptability and Scalability of F-FOMAML Algorithm: The GNN-enhanced F-FOMAML algorithm is designed to be adaptable beyond demand forecasting, making it suitable for a range of prediction and classification tasks in data-limited scenarios. Its performance on large-scale datasets, such as the JD.com data, indicates its scalability. Together with the algorithm's flexibility, this suggests potential applicability across diverse industries and data environments. Our model also uses the first-order MAML method.

As demonstrated in (Yang et al., 2023), the GNN-based forecaster that generates the embeddings can operate in a mini-batch fashion and hence can scale to graphs with millions of nodes, which allows the embeddings to be generated from very large datasets (much larger than the current open-source JD.com data). Moreover, our F-FOMAML algorithm is also scalable: it is mathematically similar to Reptile (Nichol et al., 2018), and it can be scaled through distributed training, as demonstrated in recent work (Choe et al., 2023).
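The similarity to Reptile can be made concrete with a toy comparison of the two update rules. The sketch below uses generic first-order MAML and Reptile on a scalar model $y=wx$ with a single inner step; it does not include the feature-based layers of F-FOMAML, and all function names, learning rates, and data are hypothetical.

```python
def grad_mse(w, data):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)


def adapt(w, support, inner_lr):
    """One inner-loop gradient step on a task's support set."""
    return w - inner_lr * grad_mse(w, support)


def fomaml_step(w, support, query, inner_lr=0.05, outer_lr=0.05):
    """First-order MAML: apply the query gradient taken at the adapted weight to w."""
    w_adapted = adapt(w, support, inner_lr)
    return w - outer_lr * grad_mse(w_adapted, query)


def reptile_step(w, support, inner_lr=0.05, outer_lr=0.5):
    """Reptile: move w part of the way toward the adapted weight."""
    w_adapted = adapt(w, support, inner_lr)
    return w + outer_lr * (w_adapted - w)


# Hypothetical toy task with true slope 3.0.
support = [(1.0, 3.0), (2.0, 6.0)]
query = [(3.0, 9.0), (4.0, 12.0)]
w_fomaml = w_reptile = 0.0
for _ in range(50):
    w_fomaml = fomaml_step(w_fomaml, support, query)
    w_reptile = reptile_step(w_reptile, support)
```

Both rules use only first-order gradients and never differentiate through the inner step, which is what keeps the per-episode cost low and makes the updates easy to distribute across workers.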

Nevertheless, our solution is not devoid of limitations. The success of our methodology is critically tied to the quality and relevance of the proxy data employed. Any deficiencies or omissions in this proxy data could compromise the effectiveness of our approach.

In future work, we intend to incorporate meta-learning algorithms specifically designed for regression problems (Kong et al., 2020a, b), with the aim of embedding deeper domain-specific knowledge into our model and improving its predictive accuracy and generalizability. Given the generality of our methodology, we also envision adapting it to other data-limited scenarios, such as cold-start recommendation or trend forecasting.

	$\displaystyle\|h_{\widetilde{t}}^{(im)}-h_{t}\|$	$\displaystyle=\Big\|\frac{\sum_{i=1}^{n}w(\widetilde{t},t_{i})(h_{i}-h_{t})}{\sum_{i=1}^{n}w(\widetilde{t},t_{i})}\Big\|$
		$\displaystyle=\Big\|\frac{\sum_{i=1}^{n}\bm{1}\{A(\widetilde{t},t_{i})<h\}(h_{i}-h_{t})}{\sum_{i=1}^{n}\bm{1}\{A(\widetilde{t},t_{i})<h\}}\Big\|$
		$\displaystyle\leq Ch.$
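The equalities above treat $h_{\widetilde{t}}^{(im)}$ as the average of the training-task embeddings $h_{i}$ whose metadata distance $A(\widetilde{t},t_{i})$ falls below the bandwidth $h$. A minimal sketch of this indicator-weighted average follows. The function `impute_embedding` and the toy numbers are hypothetical, and the code names the bandwidth `bandwidth` to avoid clashing with the embedding symbol.

```python
def impute_embedding(distances, embeddings, bandwidth):
    """Average the embeddings of training tasks whose distance is below the bandwidth."""
    selected = [e for d, e in zip(distances, embeddings) if d < bandwidth]
    if not selected:
        raise ValueError("no training task within the bandwidth")
    dim = len(selected[0])
    return [sum(e[j] for e in selected) / len(selected) for j in range(dim)]


# Hypothetical distances A(t~, t_i) and 2-d embeddings h_i for four training tasks.
distances = [0.1, 0.4, 0.2, 0.9]
embeddings = [[1.0, 0.0], [5.0, 5.0], [3.0, 2.0], [9.0, 9.0]]
h_imputed = impute_embedding(distances, embeddings, bandwidth=0.3)
```

Only the first and third tasks fall within the bandwidth in this toy example, so the imputed embedding is their coordinate-wise mean.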