F-FOMAML: GNN-Enhanced Meta-Learning for Peak Period Demand Forecasting with Proxy Data

Zexing Xu (corresponding author: [email protected]), University of Illinois Urbana-Champaign; Linjun Zhang, Rutgers University; Sitan Yang, Amazon; Rasoul Etesami, University of Illinois Urbana-Champaign; Hanghang Tong, University of Illinois Urbana-Champaign; Huan Zhang, University of Illinois Urbana-Champaign; Jiawei Han, University of Illinois Urbana-Champaign
(May 24, 2024)

Abstract

Demand prediction is a crucial task for e-commerce and physical retail businesses, especially during high-stakes sales events. However, the limited availability of historical data from these peak periods poses a significant challenge for traditional forecasting methods. In this paper, we propose a novel approach that leverages strategically chosen proxy data reflective of potential sales patterns from similar entities during non-peak periods, enriched by features learned from a graph neural network (GNN)-based forecasting model, to predict demand during peak events. We formulate the demand prediction as a meta-learning problem and develop the Feature-based First-Order Model-Agnostic Meta-Learning (F-FOMAML) algorithm that leverages proxy data from non-peak periods and GNN-generated relational metadata to learn feature-specific layer parameters, thereby adapting to demand forecasts for peak events. Theoretically, we show that by considering domain similarities through task-specific metadata, our model achieves improved generalization, where the excess risk decreases as the number of training tasks increases. Empirical evaluations on large-scale industrial datasets demonstrate the superiority of our approach. Compared to existing state-of-the-art models, our method demonstrates a notable improvement in demand prediction accuracy, reducing the Mean Absolute Error by 26.24% on an internal vending machine dataset and by 1.04% on the publicly accessible JD.com dataset.

1 Introduction

Forecasting product demand during high-stakes sales events such as Black Friday or Prime Day is a daunting task for both e-commerce giants like Amazon and JD.com and physical retailers. This challenge stems largely from the scarcity of event-specific historical data. Commonly, businesses anchor their strategies on regular sales data, which may not fully capture the distinct consumer behaviors observed during promotional periods. Beyond standard demand prediction, promotional forecasting includes predicting "extreme" events (Le Guen and Thome, 2020). These events, marked by deeper discounts and atypical merchandising strategies, significantly deviate from the typical sales patterns influenced by factors like seasonality or product life cycles. This deviation necessitates a specialized approach to deal-level forecasting, one that thoroughly considers promotion-specific intricacies, from the depth of discounts to deal combinations.

For instance, an online retailer aiming to anticipate the demand spike for a newly launched electronic item during a holiday sale might struggle. They might be unsure how various promotions will influence demand during these events, particularly when data from previous similar events is limited or non-existent. To mitigate this, our research uses proxy data from non-peak sales to inform decisions during peak sales events. However, this supplemental data alone is insufficient, given the intricate interrelationships among various products and categories and even across different shopping platforms. We thus introduce a representation learning task for each product, leveraging a state-of-the-art Graph Neural Network (GNN) based forecasting model (Yang et al., 2023). This model generates embeddings enriched with graph-enhanced features, encapsulating cross-product information derived from pertinent graph structures. Such structures offer insights into a range of dynamics, from relationships between products to patterns of inter-platform shopping behaviors.

Our proposed Feature-based First-Order Model-Agnostic Meta-Learning (F-FOMAML) approach refines the foundational MAML framework (Finn et al., 2017a; Nichol et al., 2018), incorporating task-specific insights (Meng et al., 2022) drawn from GNN-processed data. Trained with this enhanced metadata, the model adapts effectively to new tasks and, in our experiments, consistently surpasses conventional forecasting techniques in accuracy. While we primarily target e-commerce and brick-and-mortar retail demand prediction, the GNN-augmented F-FOMAML is a candidate for other applications as well, from online banking fraud detection to digital advertising click-through rate optimization.

Our main contributions can be summarized as follows:

  1. Model: We propose a novel approach to model demand prediction, reframing it as a graph-augmented meta-learning challenge.

  2. Algorithm: We introduce the GNN-infused F-FOMAML algorithm, which combines meta-learning with feature-wise linear modulation (FiLM) layers. This results in a model capable of producing robust predictions, even when historical data is sparse.

  3. Theory: We provide a theoretical framework that explains how our proposed algorithm reduces predictive risk through the lens of the bias-variance trade-off.

  4. Numerical Experiments: Our numerical experiments address the inherent challenges of forecasting with multi-modal time series data (combining static and dynamic features) while facing data scarcity in both the target domain and source tasks. Empirical tests show that F-FOMAML consistently outperforms existing forecasting methods, reducing prediction MAE by 26.24% on the vending machine dataset and by 8.7% on the JD.com dataset when using features constructed from domain knowledge. Furthermore, our model achieves a 1.04% MAE improvement over GNN-integrated baselines.

2 Related Work

In this section, we discuss prior studies related to sales prediction models, meta-learning’s role in time series forecasting, and the significance of proxy data in prediction endeavors. We categorize the related works into four main sub-domains: Prediction with Limited Data, Meta-Learning for Demand Prediction, Meta-Learning Methods for Few-Shot Learning, and Graph Neural Networks for Time Series Forecasting.

Prediction with Limited Data

Previous work in transfer learning has focused on learning from data-rich domains and transferring knowledge to data-sparse regions or underrepresented classes (Gupta et al., 2016; Jean et al., 2016; Zhu et al., 2020).

In e-commerce, we aim to learn from popular products to improve the performance of new or less popular products. Multi-task learning has also been used to enhance model performance on data-sparse tasks (Bishop et al., 2014; Chang et al., 2019; Fiot and Dinuzzo, 2015; Pan and Yang, 2009). Conventional transfer learning methods learn transferable latent factors between one source domain and one target domain (Long et al., 2013; Gong et al., 2012; Tzeng et al., 2017; Long et al., 2015). In our work, we focus on adopting meta-learning techniques to learn from various tasks and then adapt them to unseen tasks in demand prediction.

Meta-Learning for Demand Prediction

Meta-learning has been applied to various retail and demand prediction tasks, with an emphasis on learning from diverse data sources and adapting to new tasks with limited data. For instance, Li et al. (2020) employed meta-learning to predict demand in retail settings, demonstrating the effectiveness of meta-learning in capturing complex patterns across diverse scenarios. Similarly, Wang et al. (2019) applied meta-learning to online retail data, highlighting the potential for meta-learning in e-commerce applications.

For time series problems, Oreshkin et al. (2020) briefly discuss the relation between neural time series prediction and meta-learning. Yao et al. (2019) combine gradient-based meta-learning with a region-functionality-based memory for spatiotemporal prediction. However, this method relies on the spatial semantic correlations between tasks, which limits its applicability to our problem.

Our work contributes to the problem of learning customer demand for new products with few historical data points. Previous works have suggested comparing the features of new products to existing ones (Ferreira et al., 2016; Baardman et al., 2017), or efficient methods for eliciting additional information (Cao and Zhang, 2021; Ma and Simchi-Levi, 2022). Our paper assumes that sales have already been observed at limited prices and leverages more information from other related products and environments as proxy data.

Meta-Learning Methods for Few-Shot Learning

Meta-learning methods for few-shot learning can be broadly categorized into two main approaches: metric-learning-based and optimization-based. Metric-learning-based approaches focus on establishing similarity or dissimilarity between classes, as demonstrated by works such as Prototypical Networks (Snell et al., 2017), Matching Networks (Vinyals et al., 2016), and Relation Networks (Sung et al., 2017). These methods aim to learn representations that facilitate comparisons between few-shot examples and known classes. On the other hand, optimization-based approaches aim to learn a good initialization point that can quickly adapt to new tasks with minimal parameter updates. Prominent examples of this category include Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017a), Reptile (Nichol et al., 2018), and Meta-SGD (Li et al., 2017b). These methods have been further extended by advanced techniques such as Task-Specific Adaptation (TSA) (Zhou et al., 2020) and Multi-Modal Model-Agnostic Meta-Learning (MUMOMAML) (Vuorio et al., 2019). These optimization-based approaches enhance the adaptability and robustness of the learned models across diverse tasks and domains.

Our method falls into the domain of optimization-based approaches.

Graph Neural Networks for Time Series Forecasting

Deep learning models have been extensively explored for time series forecasting, especially those with the Seq2Seq structure (Sutskever et al., 2014), which involves learning an encoder to transform various inputs into fixed-length hidden states for producing forecasts. Recent developments include DeepAR (Salinas et al., 2020), TFT (Lim et al., 2019) and MQ-Forecasters (Wen et al., 2017; Eisenach et al., 2020). However, these methods do not account for cross-observation information, which becomes important in many practical applications. As a result, Graph Neural Networks (GNNs) have rapidly emerged as a promising framework to address this issue by combining temporal processing with graph convolution to augment the learning of individual time series (Kipf and Welling, 2017; Li et al., 2017a; Wu et al., 2020; Shang et al., 2021). A popular family of methods proposes graph structure learning, jointly inferring a latent structure through the GNN while forecasting (Kipf et al., 2018; Wu et al., 2020). However, these methods are limited in their ability to scale to large datasets. A scalable approach recently introduced by Yang et al. (2023) uses predefined graphs as data augmentations rather than learning the graph structure. It has been shown to scale to graphs with millions of nodes and to substantially improve model performance, especially for cold-start problems when data is scarce.

Our work builds upon these foundations by specifically applying meta-learning and few-shot learning techniques to the demand forecasting problem, with the goal of improving the adaptability and performance of models in this context. To the best of our knowledge, we are the first to study peak period demand prediction with limited records by borrowing relation-aware knowledge from other time periods. We focus on this domain, exploring the application of meta-learning for few-shot prediction and incorporating auxiliary information, such as proxy data from other related tasks, to improve model performance.

3 Problem Formulation

Figure 1: Pipeline of the GNN-enhanced F-FOMAML for demand forecasting.

During peak periods, promoted products often have limited historical sales (e.g., less popular items) or are new items without historical transaction data. Consequently, the demand forecasting tasks for product-location pairs during these periods are new and unseen compared to regular products and periods (e.g., paper towels). To address this challenge, we frame our research problem within a generic setting, focusing on a few-shot meta-learning paradigm, specifically targeting demand forecasting. Throughout our discussion, we use JD.com’s transactional data as the primary example to illustrate our approach.

3.1 Task Definition

Demand forecasting aims to predict the future demand for a product in a specific environment based on observed features. Each forecasting task is associated with a product and its environment.

Formally, let $\mathcal{P}(\mathcal{T})$ denote a distribution over tasks $\mathcal{T}_{ij}$, each corresponding to product $i$ in environment $j$. For a set $[n]=\{1,\ldots,n\}$ of $n$ products with product $i$ present in $t_i$ environments, we have a total of $\sum_{i=1}^{n} t_i$ tasks. Each task dataset is symbolized as a pair $(\mathbf{x}_{ij}, y_{ij})$, where $\mathbf{x}_{ij}$ is the feature vector and $y_{ij}$ signifies the associated demand.

To provide a concrete example, consider a scenario where we have $n=10$ products, each available in $t_i=5$ locations. Therefore, we have a total of 50 tasks in our meta-training set. The dataset corresponding to each task is represented as a feature-demand pair $(\mathbf{x}_{ij}, y_{ij})$, where $\mathbf{x}_{ij}\in\mathbb{R}^{m}$ is a feature vector and $y_{ij}\in\mathbb{R}$ is the associated demand.

Our goal is to train a model, denoted by $f:\mathbb{R}^{m}\rightarrow\mathbb{R}^{+}$, capable of mapping $m$-dimensional observations $\mathbf{x}$ to outputs $y$ across a large or possibly infinite number of tasks. We employ the First-Order Model-Agnostic Meta-Learning (FOMAML) algorithm for this purpose. For a given product characterized by a feature vector $s_i, \forall i\in[n]$, and an environment (e.g., location) characterized by a feature vector $v_j, \forall j\in[t_i]$, we consider a single historical price and demand observation $(\tilde{p}_{ij}, \tilde{y}_{ij})$.

Given a price of interest $p_{ij}$, we model each task with the following demand function:

$$y_{ij} = f_{ij}(\mathbf{x}_{ij}) + \epsilon_{ij}, \tag{1}$$

where $\mathbf{x}_{ij}$ is the feature tuple $(s_i, v_j, \tilde{p}_{ij}, \tilde{y}_{ij}, p_{ij})$ and $y_{ij}$ is the corresponding demand. Here, $f_{ij}$ is a flexible function (e.g., linear regression, MLP, etc.) and each task is associated with a unique model parameter $\beta_{ij}\in\mathbb{R}^{m}$. We assume that the noise $\epsilon_{ij}$ follows a centered sub-Gaussian distribution with parameter $\sigma_i^2$. Furthermore, without loss of generality, we assume that $\mathcal{P}_X$ is an isotropic, centered sub-Gaussian distribution, i.e., $\mathbb{E}(\mathbf{x}_{ij}\mathbf{x}_{ij}^{\top}) = \mathbb{I}_d$. Exploiting structural similarities in $\mathcal{P}(\mathcal{T})$, the goal is to train a model for a new task $\mathcal{T}^{\rm new}$, drawn from $\mathcal{P}(\mathcal{T})$, using a small training dataset $\mathcal{D} = (\mathbf{x}^{\text{new}}_{ij}, y^{\text{new}}_{ij})$.
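As a small illustration of the task definition, the sketch below enumerates the tasks for the example with 10 products in 5 locations and assembles one feature tuple $(s_i, v_j, \tilde{p}_{ij}, \tilde{y}_{ij}, p_{ij})$. The feature dimensions, the random values, and the helper name `make_feature` are assumptions made purely for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

n_products = 10                      # n in the text
envs_per_product = [5] * n_products  # t_i = 5 for every product i

# One task T_ij per (product, environment) pair.
tasks = [(i, j) for i in range(n_products) for j in range(envs_per_product[i])]

# Hypothetical static descriptors: 3 product features s_i, 2 environment features v_j.
s = rng.normal(size=(n_products, 3))
v = rng.normal(size=(max(envs_per_product), 2))


def make_feature(i, j, p_hist, y_hist, p_query):
    """Concatenate (s_i, v_j, p~_ij, y~_ij, p_ij) into one feature vector x_ij."""
    return np.concatenate([s[i], v[j], [p_hist, y_hist, p_query]])


x_00 = make_feature(0, 0, p_hist=2.5, y_hist=40.0, p_query=2.0)
print(len(tasks), x_00.shape)
```

Here the total number of tasks is $\sum_i t_i$, and the feature dimension $m$ is the sum of the (assumed) product, environment, and price-demand components.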

Remark 1.

Incorporating features allows us to capture an additional form of shared structure. However, despite accounting for observed product features, the demand functions of two products can exhibit distinct behaviors. For instance, even for Diet Coke, price sensitivities may vary significantly on different vending machines due to factors such as customer demographics or preferences that are challenging to capture as explicit features. To account for these product-location-specific nuances, we introduce the flexibility for the demand function’s coefficients (e.g., price elasticity) to differ.

In the First-Order MAML (FOMAML) approach, the model parameters for each task in the meta-training dataset are computed after a single gradient update. Specifically, for each task $\mathcal{T}_{ij}$, the task-specific model parameters, denoted $\beta_{ij}'$, are updated as follows:

$$\beta_{ij}' \leftarrow \beta^{*} - \lambda \nabla_{\beta^{*}} \mathcal{L}_{\mathcal{T}_{ij}}(\beta_{ij}), \tag{2}$$

where $\lambda$ is the learning rate, $\beta^{*}$ is the global model parameter shared across tasks, and $\mathcal{L}_{\mathcal{T}_{ij}}(\beta_{ij})$ is the task-specific loss, such as the squared error:

$$\mathcal{L}_{\mathcal{T}_{ij}}(\beta_{ij}) = \frac{1}{2}\left(y_{ij} - f_{ij}(\mathbf{x}_{ij})\right)^{2}. \tag{3}$$

After updating the task-specific parameters, a meta-update is performed on the shared global parameter $\beta^{*}$ using the performance of the updated $\beta_{ij}'$ on their corresponding tasks. This meta-update is given by the following:

$$\beta^{*} \leftarrow \beta^{*} - \eta \sum_{i=1}^{n}\sum_{j=1}^{t_i} \nabla_{\beta^{*}} \mathcal{L}_{\mathcal{T}_{ij}}(\beta_{ij}'), \tag{4}$$

where $\eta$ is the meta-learning rate. The objective of this meta-learning process is to optimize the shared global parameter $\beta^{*}$ such that, after a few updates on each task, the task-specific parameters $\beta_{ij}$ yield improved performance on their corresponding tasks. Once the meta-learning process is complete, the model parameters of a newly arriving task can be estimated using the learned meta-parameters $\beta^{*}$. These task-specific parameters $\beta_{ij}'$ can then be fine-tuned on the new task using the available data, yielding improved performance and adaptability to new tasks.

By incorporating the FOMAML algorithm into our meta-learning framework, we aim to construct an efficient model for sales prediction that can adapt to new tasks with limited historical sales data.
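The inner and outer updates in Eqs. (2) and (4) can be sketched for the simplest case of a linear model $f(\mathbf{x}) = \beta^{\top}\mathbf{x}$ with the squared-error loss of Eq. (3). This is a minimal sketch under stated assumptions, not the authors' implementation: the synthetic tasks, the split of each task into support and query samples, and the hyperparameter values (`lam`, `eta`, `n_iters`) are all chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_tasks, k = 4, 50, 5             # feature dim, number of tasks, samples per task
lam, eta, n_iters = 0.05, 0.01, 200  # inner rate (lambda), meta rate (eta), meta steps

# Synthetic tasks (assumption): each task's coefficients lie near a shared centre.
beta_centre = np.array([2.0, -1.0, 0.5, 1.5])
task_betas = beta_centre + 0.1 * rng.normal(size=(n_tasks, m))


def sample(beta, size):
    """Draw (x, y) pairs from y = beta^T x + noise with isotropic features."""
    x = rng.normal(size=(size, m))
    y = x @ beta + 0.1 * rng.normal(size=size)
    return x, y


def loss_and_grad(beta, x, y):
    """Mean of 0.5 * (y - beta^T x)^2 and its gradient with respect to beta."""
    resid = x @ beta - y
    return 0.5 * np.mean(resid ** 2), x.T @ resid / len(y)


# (support, query) samples per task; the split is an assumption of this sketch.
data = [(sample(b, k), sample(b, k)) for b in task_betas]


def meta_loss(beta_star):
    """Mean query loss after one inner gradient step from beta_star."""
    total = 0.0
    for (xs, ys), (xq, yq) in data:
        adapted = beta_star - lam * loss_and_grad(beta_star, xs, ys)[1]
        total += loss_and_grad(adapted, xq, yq)[0]
    return total / len(data)


beta_star = np.zeros(m)
initial = meta_loss(beta_star)
for _ in range(n_iters):
    meta_grad = np.zeros(m)
    for (xs, ys), (xq, yq) in data:
        # Inner step, Eq. (2): one gradient update from the shared parameters.
        adapted = beta_star - lam * loss_and_grad(beta_star, xs, ys)[1]
        # First-order approximation: evaluate the gradient at the adapted
        # parameters and ignore how `adapted` depends on beta_star.
        meta_grad += loss_and_grad(adapted, xq, yq)[1]
    # Outer step, Eq. (4): move beta_star along the summed per-task gradients.
    beta_star = beta_star - eta * meta_grad
final = meta_loss(beta_star)
print(f"meta-loss before: {initial:.3f}, after: {final:.3f}")
```

The first-order variant avoids differentiating through the inner update, which is what makes it cheaper than full MAML; for a new task, the learned `beta_star` would serve as the starting point for the same inner step.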

4 Methodology

We illustrate the pipeline of our algorithm in Figure 1. Imagine there are 3 locations offering 6 drink types with transaction data, capturing their historical sales. First, a graph neural network (GNN), $G$, is formed using both static features like machine locations and dynamic features from past sales time series. To predict the demand for Coke at the gym with a discount, relevant nodes and edges from $G$ are extracted. This subset, denoted as $G_{\mathcal{T}}$, undergoes training using MAML’s inner loop, yielding initial task-specific parameters. These parameters are further refined through the FiLM transformer, considering proxy data that might suggest a promotion for Coke. The shared meta-parameters are updated in MAML’s outer loop based on the specific task losses. Once this cycle is completed across all tasks, the model is evaluated on fresh data to project the demand.

Our proposed methodology for e-commerce demand prediction encompasses three pivotal components: proxy data selection, GNN-enhanced representation learning, and the F-FOMAML algorithm design. To cater to the multi-faceted nature of e-commerce products and their varied demand across different locations or customer segments, we weave task-adaptive estimators into the meta-learning framework. Further, we employ a GNN and the FiLM layer to encode proxy data into hidden representations, thus enabling the modulation of learner parameters for enhanced adaptation to the specific characteristics of products and customer segments.
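For readers unfamiliar with FiLM, the following sketch shows the generic operation: a conditioning vector produces a per-feature scale and shift that are applied to a hidden activation. How F-FOMAML wires this into the learner is not specified in this section, so the conditioning input `z` (standing in for an encoding of proxy data), the weight matrices, and all dimensions below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, cond_dim = 8, 4

# Hypothetical generator weights mapping the conditioning vector to scale and shift.
W_scale = 0.1 * rng.normal(size=(hidden_dim, cond_dim))
W_shift = 0.1 * rng.normal(size=(hidden_dim, cond_dim))


def film(h, z):
    """Feature-wise linear modulation: scale(z) * h + shift(z)."""
    scale = 1.0 + W_scale @ z   # centred at 1 so that z = 0 leaves h unchanged
    shift = W_shift @ z
    return scale * h + shift


h = rng.normal(size=hidden_dim)   # hidden activation of the base learner
z = rng.normal(size=cond_dim)     # stand-in for an encoding of the proxy data
out = film(h, z)
print(out.shape)
```

Because the modulation is element-wise, the conditioning signal changes how each hidden feature is used without altering the shared weights of the base learner.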

4.1 Proxy Data Selection

The proxy data, vital for tasks with limited historical sales data, is judiciously selected. The ideal proxy data simulates the potential sales behavior of the focal product, informed by sales trends of similar products or those in related categories.

For e-commerce scenarios, task similarity might arise from: 1) Historical Transactions: Edges represent products often purchased together. 2) User Behavioral Patterns: Edges might indicate similar purchase behaviors or browsing patterns of users. 3) Product Similarities: Linking products of the same category or with similar attributes. 4) Domain Knowledge: Connections deriving from expert insights into customer behaviors, seasonal trends, or market dynamics.

Traditional methods use clustering techniques and measure distances with metrics like Euclidean and cosine similarity to quantify task similarity. Our approach, however, leverages a GNN-based method for selecting relevant tasks as proxy data. For a given task, we denote its proxy data as $Z_{ij}$, representing the most relevant data identified through our GNN-based approach. This ensures the proxy data accurately reflects the target task, enhancing demand forecasting accuracy.

Graph Construction for Proxy Data

A tailored graph for our GNN encapsulates relationships among tasks. In determining proxy data for e-commerce settings, we choose tasks from the support set $\mathcal{T}$ resembling our target task, $\mathcal{T}_{\text{new}}$, guided by $\text{correlation}(\mathcal{T}, \mathcal{T}_{\text{new}}) > \delta$, where $\delta$ is a threshold indicating task similarity, and the function $\text{correlation}(\mathcal{T}, \mathcal{T}_{\text{new}})$ captures the similarity between $\mathcal{T}_{\text{new}}$ and $\mathcal{T}$ through different methods such as the ones described above.
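The thresholding rule can be sketched as follows. The text leaves the correlation function open, so this example assumes, purely for illustration, that each task is summarized by a sales series and that similarity is the Pearson correlation between series; the function name `select_proxy_tasks` and the toy data are hypothetical.

```python
import numpy as np


def select_proxy_tasks(target, candidates, delta):
    """Return indices of candidate tasks whose series correlate with the target above delta."""
    selected = []
    for idx, series in enumerate(candidates):
        corr = np.corrcoef(target, series)[0, 1]  # Pearson correlation
        if corr > delta:
            selected.append(idx)
    return selected


target = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
candidates = np.array([
    [2.0, 4.0, 6.0, 8.0, 10.0],   # perfectly correlated with the target
    [5.0, 4.0, 3.0, 2.0, 1.0],    # perfectly anti-correlated
    [1.0, 3.0, 2.0, 5.0, 4.0],    # positively but imperfectly correlated
])
proxy_idx = select_proxy_tasks(target, candidates, delta=0.9)
print(proxy_idx)
```

Lowering `delta` admits more, but less similar, tasks into the proxy set, so the threshold trades off the amount of proxy data against its relevance.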

4.2 GNN-enhanced Representation Learning

Here, we describe how to obtain the graph-enhanced features for each product. In a nutshell, we set up a time series forecasting task and utilize a GNN-based demand forecasting model to predict future sales given each product’s historical information as well as cross-product relationships defined by a predefined graph. We then extract the hidden encoded context from the trained model to produce the product embeddings as features.

Input Product Features

E-commerce platforms host a plethora of products, each with unique characteristics and consumer interactions. In this case, we construct the graph using product-specific attributes such as brands (i.e., we connect all products with the same brand). The input features for node $N_i$ (representing product $i$) are: 1) Static features $S_i$, like the product category, brand, and manufacturing details. 2) Dynamic features $D_i$, encompassing time-evolving aspects like recent sales and price changes.
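Connecting all products that share a brand amounts to building one clique per brand. A minimal sketch is given below; the function name `brand_edges` and the toy brand labels are hypothetical, and a production graph over millions of nodes would likely need a sparser representation than an explicit pair list.

```python
from collections import defaultdict
from itertools import combinations


def brand_edges(brands):
    """Connect every pair of products that share a brand; returns undirected edges (i, j) with i < j."""
    by_brand = defaultdict(list)
    for product_id, brand in enumerate(brands):
        by_brand[brand].append(product_id)
    edges = []
    for members in by_brand.values():
        edges.extend(combinations(members, 2))
    return sorted(edges)


brands = ["A", "B", "A", "C", "B", "A"]   # brand of product 0, 1, ..., 5
edges = brand_edges(brands)
print(edges)
```

Products whose brand appears only once (brand "C" here) end up as isolated nodes and rely solely on their own static and dynamic features.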

Product Embedding Generation via Forecasting

A crucial aspect is to generate meaningful product embeddings that can capture the multifaceted nature of e-commerce products. To facilitate this, we set up a demand forecasting task as:

$$\widehat{Y}_{t+1} = f\left(Y_{t-C:t}, D_{t-C:t}, S\right), \tag{5}$$

where $f$ represents the forecasting model. At time $t$, the target $Y_{t+1}\in\mathbb{R}^{N\times 1}$ contains the future one-day sales, $D_{t-C:t}\in\mathbb{R}^{N\times d}$ contains the $d$ dynamic features with a history length of $C$ days, and $S\in\mathbb{R}^{N\times m}$ contains the $m$ static features for all $N$ products. We adopt the GNN-based forecasting model introduced in (Yang et al., 2023) and use the brand information to craft the predefined graph. The GNN is used both for forecasting and for generating the task embeddings. After training converges, we extract the embedding for each product, which serves as a compact representation of product dynamics.
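
To make the one-step-ahead setup concrete, the sketch below pairs each history window with the next day's sales for a single product. It assumes that the index range $t-C:t$ denotes the $C$ most recent days ending at day $t$; the function name `make_windows` is hypothetical, and dynamic and static features are omitted for brevity.

```python
def make_windows(sales, C):
    """Pair each length-C history window with the next day's sales."""
    pairs = []
    for t in range(C - 1, len(sales) - 1):
        history = sales[t - C + 1 : t + 1]  # the C days ending at day t
        pairs.append((history, sales[t + 1]))
    return pairs

# Hypothetical daily sales of one product.
daily_sales = [3, 5, 4, 6, 8, 7]
pairs = make_windows(daily_sales, C=3)
for history, target in pairs:
    print(history, "->", target)
```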

Edge Relationship Determination

Let $E(\mathcal{T}_i,\mathcal{T}_j)$ denote the edge between tasks $\mathcal{T}_i$ and $\mathcal{T}_j$. The edge relationship between two tasks is inferred using:

$$E(\mathcal{T}_i,\mathcal{T}_j)=\begin{cases}1 & \text{if } \text{dist}(emb(\mathcal{T}_i),emb(\mathcal{T}_j))<\theta \text{ or } h_{\mathcal{T}_i}=h_{\mathcal{T}_j}\\ 0 & \text{otherwise,}\end{cases} \tag{6}$$

where $emb(\mathcal{T}_i)$ stands for the embedding of task $\mathcal{T}_i$, $\text{dist}(\cdot,\cdot)$ denotes a function measuring the distance between two embeddings, $h$ denotes the task (i.e., product) hierarchy or taxonomy, and $\theta$ is a predetermined threshold for closeness. We create an edge between tasks $\mathcal{T}_i$ and $\mathcal{T}_j$ if either their embeddings are close to each other or they belong to the same category.
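
The edge rule can be sketched in a few lines. The snippet assumes Euclidean distance for $\text{dist}$ and a single category label per task for the hierarchy $h$; both are illustrative choices, and the names `build_edges`, `embeddings`, and `hierarchy` as well as the values are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def build_edges(embeddings, hierarchy, theta):
    """Return undirected edges (i, j), i < j, created when the embeddings
    are closer than theta or the two tasks share a hierarchy label."""
    names = sorted(embeddings)
    edges = set()
    for idx, i in enumerate(names):
        for j in names[idx + 1:]:
            close = euclidean(embeddings[i], embeddings[j]) < theta
            same_group = hierarchy[i] == hierarchy[j]
            if close or same_group:
                edges.add((i, j))
    return edges

# Hypothetical task embeddings and category labels.
embeddings = {"t1": [0.0, 0.0], "t2": [0.1, 0.0], "t3": [5.0, 5.0], "t4": [9.0, 9.0]}
hierarchy = {"t1": "snacks", "t2": "drinks", "t3": "snacks", "t4": "toys"}
edges = build_edges(embeddings, hierarchy, theta=0.5)
print(sorted(edges))
```

Here `t1` and `t2` are linked because their embeddings are close, whereas `t1` and `t3` are linked only through the shared category.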

4.3 F-FOMAML Algorithm Description

We present the GNN-Integrated Feature-based First-Order MAML (F-FOMAML) algorithm for demand forecasting in Algorithm 1. It incorporates transactional data with static and dynamic features, together with proxy data, to forecast demand in peak periods. This variant of the MAML algorithm divides the learning process into several components: the meta-learner, the base learners, the FiLM layer, and fine-tuning.

Algorithm 1 GNN-Integrated Feature-based First-Order MAML (F-FOMAML) for Demand Forecasting
1: $\mathcal{D}=\{\text{Static features } S=\{(s_i,v_j)\}_{i,j},\ \text{Dynamic features } D=\{\tilde{y}_{ij}\}_{i,j}\}$; Proxy data $Z$; pre-constructed graph $G=(N,E)$; learning rates $\eta,\lambda$.
2: Learn the GNN with nodes $N$ and edges $E$ using $S$ and $D$.
3: Initialize global meta-parameters $\beta^{*}$.
4: for each task $\mathcal{T}_{ij}$ in $\mathcal{D}$ do
5:     Extract relevant nodes $N_{\mathcal{T}_{ij}}\subset N$ and edges $E_{\mathcal{T}_{ij}}\subset E$ from $G$ for product $i$ in environment $j$.
6:     Represent task $\mathcal{T}_{ij}$ using a subgraph $G_{\mathcal{T}_{ij}}=(N_{\mathcal{T}_{ij}},E_{\mathcal{T}_{ij}})$.
7:     Initialize task-specific parameters $\beta_{ij}$ from $\beta^{*}$.
8:     Compute task-specific loss $\mathcal{L}_{\mathcal{T}_{ij}}(\beta_{ij})$: $\mathcal{L}_{\mathcal{T}_{ij}}(\beta_{ij})=\frac{1}{2}\left(y_{ij}-f_{ij}(\mathbf{x}_{ij};\beta_{ij})\right)^{2}$.
9:     Update task-specific parameters $\beta_{ij}^{\prime}$ using Equation 2: $\beta_{ij}^{\prime}\leftarrow\beta^{*}-\lambda\nabla_{\beta^{*}}\mathcal{L}_{\mathcal{T}_{ij}}(\beta_{ij})$.
10:     Apply FiLM transformer using proxy data $Z_{ij}$ to modulate the input features $\mathbf{x}_{ij}$ for task $\mathcal{T}_{ij}$, enhancing the adaptation of the task-specific parameters $\beta_{ij}^{\prime}$: $\text{FiLM}(\mathbf{x}_{ij};Z_{ij})=\beta(Z_{ij})\odot\mathbf{x}_{ij}+\gamma(Z_{ij})$.
11: end for
12: Perform meta-update on $\beta^{*}$ using: $\beta^{*}\leftarrow\beta^{*}-\eta\sum_{i=1}^{n}\sum_{j=1}^{t_{i}}\nabla_{\beta^{*}}\mathcal{L}_{\mathcal{T}_{ij}}(\beta_{ij}^{\prime})$.
13: Evaluate the model on testing data to get forecasted demand $\widehat{y}$.
14: return Forecasted demand $\widehat{y}$ for products.
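
The inner and outer updates of Algorithm 1 can be illustrated with a deliberately simplified sketch. It assumes a linear base model $f(\mathbf{x};\beta)=\langle\beta,\mathbf{x}\rangle$ with one sample per task, and it omits the GNN subgraph extraction and the FiLM modulation; `inner_lr` plays the role of $\lambda$ and `outer_lr` the role of $\eta$. All function names and data are hypothetical, so this is not the authors' implementation.

```python
def predict(beta, x):
    """Linear base model f(x; beta) = <beta, x> (an illustrative assumption)."""
    return sum(b * xi for b, xi in zip(beta, x))

def loss(beta, x, y):
    """Squared-error task loss, as in step 8 of Algorithm 1."""
    return 0.5 * (y - predict(beta, x)) ** 2

def grad(beta, x, y):
    """Gradient of the squared-error loss with respect to beta."""
    residual = predict(beta, x) - y
    return [residual * xi for xi in x]

def fomaml_step(beta_star, tasks, inner_lr, outer_lr):
    """One pass over all tasks followed by a first-order meta-update."""
    meta_grad = [0.0] * len(beta_star)
    for x, y in tasks:
        # Inner step: adapt from the shared initialisation.
        g = grad(beta_star, x, y)
        beta_task = [b - inner_lr * gi for b, gi in zip(beta_star, g)]
        # First-order approximation: use the gradient at the adapted
        # parameters instead of differentiating through the inner step.
        g_adapted = grad(beta_task, x, y)
        meta_grad = [m + gi for m, gi in zip(meta_grad, g_adapted)]
    return [b - outer_lr * m for b, m in zip(beta_star, meta_grad)]

# Hypothetical tasks, each a single (features, demand) pair.
tasks = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
beta = [0.0, 0.0]
initial = sum(loss(beta, x, y) for x, y in tasks)
for _ in range(200):
    beta = fomaml_step(beta, tasks, inner_lr=0.1, outer_lr=0.1)
final = sum(loss(beta, x, y) for x, y in tasks)
print(initial, final, beta)
```

Because the toy tasks share an exact linear solution, the meta-parameters converge to it; with heterogeneous tasks the meta-update instead settles on an initialisation from which each task adapts quickly.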

FiLM Layer.

The feature-wise linear modulation (FiLM) layer (Perez et al., 2018) is a critical component in tailoring the learner parameters based on the proxy data features. This layer applies a feature-wise affine transformation to its input, modulating the hidden vector outputs of the meta-model using the proxy data $Z_{ij}$ as task encodings. The construction and purpose of the proxy data $Z_{ij}$ are elaborated in Section 4.1. The FiLM layer facilitates a more refined adaptation to the distinctive traits of the product and vending machine location by exploiting the relationship between the product-specific and machine-specific price-sensitivity estimators encapsulated in the proxy data.

The FiLM layer’s mechanism can be formally described as

$$\text{FiLM}(\mathbf{x}_{ij};Z_{ij})=\beta(Z_{ij})\odot\mathbf{x}_{ij}+\gamma(Z_{ij}), \tag{7}$$

where $\mathbf{x}_{ij}$ represents the input feature representation, $\beta(Z_{ij})$ (abbreviated $\beta_{ij}$) and $\gamma(Z_{ij})$ denote the scaling and shifting factors learned from the proxy data $Z_{ij}$, and $\odot$ denotes element-wise multiplication. The functions $\beta(Z_{ij})$ and $\gamma(Z_{ij})$ are learned during the training phase to cater to the specific task at hand. By applying this transformation to the task-specific model parameters $\beta_{ij}^{\prime}$, the model captures complex feature interactions and becomes better equipped to adapt to the specific characteristics of each unique product-environment pair.
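
A minimal sketch of the FiLM transformation is given below. It assumes that the scaling and shifting generators are plain affine maps of the proxy encoding with fixed, hypothetical weights; in the actual model these generators are learned during training. The names `linear_map` and `film` are illustrative.

```python
def linear_map(weights, bias, z):
    """Affine map z -> W z + b, with W given as a list of rows."""
    return [sum(w * zi for w, zi in zip(row, z)) + b
            for row, b in zip(weights, bias)]

def film(x, z, scale_params, shift_params):
    """Apply scale(z) * x + shift(z) element-wise."""
    scale = linear_map(*scale_params, z)
    shift = linear_map(*shift_params, z)
    return [s * xi + t for s, xi, t in zip(scale, x, shift)]

# Hypothetical generator weights (W, b) for the scale and shift branches.
scale_params = ([[1.0, 0.0], [0.0, 2.0]], [0.0, 0.0])
shift_params = ([[0.0, 1.0], [1.0, 0.0]], [0.5, -0.5])

z = [2.0, 3.0]    # proxy-data encoding of the task
x = [10.0, 1.0]   # input features of the task
modulated = film(x, z, scale_params, shift_params)
print(modulated)
```

With these weights the scale is `[2.0, 6.0]` and the shift is `[3.5, 1.5]`, so each feature is rescaled and offset by an amount that depends only on the proxy encoding.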

Meta-Learner.

The core of the meta-learning approach is the meta-learner, an overarching model that helps in initializing and updating the meta-parameters $\beta^{*}$. These parameters serve as a shared knowledge base that aids in swift adaptation across a myriad of tasks. The meta-learner initializes the global meta-parameters $\beta^{*}$ and, after task-specific adaptations are performed, updates $\beta^{*}$ using the aggregated first-order gradients from each task. This process ensures that the meta-parameters incorporate insights from various tasks, enabling rapid adaptation to new tasks and reducing the cold-start problem in the e-commerce domain.

Base Learners.

The base learners are models tailored to specific tasks, such as predicting the demand for a new product launch or forecasting sales during a flash sale. Each base learner operates by extracting relevant nodes and edges to form a subgraph for each task, initializing task-specific parameters $\beta_{ij}$ from the global meta-parameters $\beta^{*}$, computing the task-specific loss, and updating the task-specific parameters $\beta_{ij}^{\prime}$ using first-order gradient descent. Additionally, the FiLM transformer applies proxy data to modulate input features, enhancing the adaptation of the task-specific parameters.

Fine-Tuning.

In our approach, fine-tuning involves a final meta-update on $\beta^{*}$ after task-specific updates, evaluating the model on testing data to forecast demand $\widehat{y}$ for products and returning the forecasted demand $\widehat{y}$ as the final prediction output. This process leverages the FiLM transformer and proxy data to ensure that the models are not just generic but tailored to capture the heterogeneity in tasks.

The strength of this method mainly lies in its ability to utilize shared structures across tasks while also adapting swiftly to unique task characteristics using the FiLM transformer and proxy data.

5 Theoretical analysis

In this section, we provide a theoretical model to illustrate the benefit of our proposed method and shed light on why proxy data improves the few-shot prediction.

Data generative model.

Suppose we have a set of tasks $\mathcal{T}$. For each task $t\in\mathcal{T}$, we observe training samples $\{(x_k^{(t)},y_k^{(t)})\}_{k=1}^{n_t}$, where $n_t$ is the sample size for task $t$. In addition, for each task, we observe a task-specific feature $v_t\in\mathbb{R}^{r}$. For simplicity of presentation, we assume $v_t\in[0,1]^{r}$. We denote the set of training domains by $\mathcal{D}^{tr}$ and assume there are $T=|\mathcal{T}|$ training tasks. Following Equation (1), we assume that for each task $t$, the outcome prediction function $g$ takes the form $y=g_t(x)+\epsilon:=h_t(f(x))+\epsilon$, where $h_t$ is the base learner that depends on the individual task $t$, $f$ is the meta-learner, and $\epsilon$ is a noise term assumed to be sub-Gaussian with mean $0$ and variance $\sigma^{2}$.

Following Section 4, for each task $t\in\mathcal{T}$, we construct the proxy data $Z_t$ by including all similar tasks $t^{\prime}$ such that $\|v_{t^{\prime}}-v_t\|\leq h$ for some threshold parameter $h>0$. Then similarly, for the test task $\widetilde{t}$, the outcome prediction function $\widehat{g}_{\widetilde{t}}$ is computed as $\widehat{g}_{\widetilde{t}}(x)=\widehat{h}_{\widetilde{t}}(f(x))$, where $\widehat{h}_{\widetilde{t}}(\cdot)=\frac{\sum_{t\in\mathcal{T}}w(\widetilde{t},t)\widehat{h}_{t}(\cdot)}{\sum_{t\in\mathcal{T}}w(\widetilde{t},t)}$, with the weight $w(\widetilde{t},t)=\mathbf{1}\{\|v_{\widetilde{t}}-v_t\|\leq h\}$. In the case where the denominator is $0$, we define $\widehat{h}_{\widetilde{t}}=0$.
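
The indicator-weighted average of base learners can be sketched as follows. The snippet assumes the Euclidean norm for $\|\cdot\|$ and represents each fitted base learner as a plain Python callable; the name `combine_base_learners` and the toy learners are hypothetical.

```python
import math

def distance(u, v):
    """Euclidean distance (the norm is an assumption of this sketch)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def combine_base_learners(v_test, train_tasks, h):
    """Average the base learners of training tasks whose feature vector
    lies within distance h of the test task's feature vector."""
    neighbours = [learner for v, learner in train_tasks
                  if distance(v, v_test) <= h]
    if not neighbours:
        # Convention used when the denominator is zero.
        return lambda u: 0.0
    return lambda u: sum(f(u) for f in neighbours) / len(neighbours)

# Hypothetical training tasks: (task feature v_t, fitted base learner).
train_tasks = [
    ([0.1, 0.1], lambda u: 2 * u),
    ([0.2, 0.1], lambda u: 4 * u),
    ([0.9, 0.9], lambda u: -10 * u),
]
h_hat = combine_base_learners([0.15, 0.1], train_tasks, h=0.2)
isolated = combine_base_learners([0.5, 0.5], train_tasks, h=0.1)
print(h_hat(3.0), isolated(3.0))
```

Only the two nearby tasks contribute to `h_hat`, so the dissimilar third task does not bias the prediction; a larger `h` would include it and trade lower variance for higher bias.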

Theoretical results.

To facilitate the theoretical analysis, we first assume that the distance between task-specific features indeed captures the similarity of tasks: there exists a universal constant $C$ such that $\|h_{t_1}-h_{t_2}\|_{\infty}\leq C\cdot\|v_{t_1}-v_{t_2}\|$.

In addition, we assume that for each training domain $t$, $\widehat{h}_t$ is well learned such that $\mathbb{E}\big[\big(\widehat{h}_t(f(x))-h_t(f(x))\big)^{2}\big]=O\big(\frac{C(\mathcal{H})}{n_t}\big)$, where $C(\mathcal{H})$ is the Rademacher complexity of the function class $\mathcal{H}$. We further assume $v_t$ has a positive density over $[0,1]^{r}$. Then, we have:

Theorem 1.

Consider the data generative model, the algorithm $\widehat{g}_{\widetilde{t}}$, and the assumptions above. Suppose we have $n_t\gtrsim n$ for all $t\in\mathcal{D}^{tr}$. Define the excess risk for the test domain $\widetilde{t}$ by $R(\widehat{g}_{\widetilde{t}})=\mathbb{E}_{(x,y)\sim\widetilde{t}}[l(\widehat{g}_{\widetilde{t}}(x;\mathcal{D}^{tr},A),y)]-\mathbb{E}_{(x,y)\sim\widetilde{t}}[l(g_{\widetilde{t}}(x;\mathcal{D}^{tr},A),y)]$. If the loss function $l$ is Lipschitz with respect to the first argument, then the excess risk satisfies

$$R(\widehat{g}_{\widetilde{t}})\lesssim h+\sqrt{\frac{C(\mathcal{H})/n}{\max\{1,nh^{r}\}}}. \tag{8}$$

In particular, if $h$ is properly chosen such that $h\asymp\big(\frac{C(\mathcal{H})/n}{T}\big)^{\frac{1}{r+2}}$, then $R(\widehat{g}_{\widetilde{t}})\lesssim\big(\frac{C(\mathcal{H})/n}{T}\big)^{\frac{1}{r+2}}$.

Theorem 1 suggests that the superiority of our algorithm comes from a better bias-variance trade-off. More concretely, the threshold $h$ tunes the trade-off for the excess risk. On the one hand, when we do not use relational data at all (corresponding to $h=0$), the first term in Equation (8), the bias, is negligible, while the second term, the variance, becomes dominant because the data is limited. As the excess risk of a single task is of order $(C(\mathcal{H})/n)^{1/2}$, our result implies that when $T$ is sufficiently large such that $T>\big(\frac{n}{C(\mathcal{H})}\big)^{r/2}$, our proposed method overcomes the potential bias introduced by incorporating similar (but still different) tasks and performs better than learning from a single source task alone. On the other hand, if we simply use standard ERM to pool all the data together (corresponding to $h=\infty$), the variance becomes small but the bias dominates. The second part of the theorem suggests that one can efficiently incorporate the proxy data with a carefully chosen threshold. The proof of Theorem 1 is deferred to Appendix B.

The theoretical perspective we discussed is particularly pertinent to the context of demand prediction. Given the limited data available from high-stakes sales events, relying solely on this data for prediction (akin to $h=0$) can result in outcomes with substantial variance. Conversely, utilizing the entire historical dataset (corresponding to $h=\infty$) can introduce significant bias, given the marked differences between regular and high-stakes sales events. Our method harnesses GNNs to understand the relationships within historical data. This approach strikes an optimal balance in the bias-variance trade-off, leading to improved prediction accuracy.

6 Experiment

In this section, we conduct extensive experiments to evaluate the efficacy of our proposed F-FOMAML for peak-period demand prediction, focusing on two key research questions:

  • How does F-FOMAML’s prediction performance compare to various baselines?

  • To what extent do the components we introduce, such as the proxy data selection method (GNN versus MQCNN), impact the model’s predictive capabilities?

By addressing these questions, we provide a comprehensive evaluation of F-FOMAML, highlighting its performance relative to existing approaches and analyzing the contributions of individual components to the model’s overall predictive power.

6.1 Experimental Setups

In this section, we detail the experimental setups, focusing on two real-world datasets and the evaluation criteria for our method’s performance.

For brevity, the main text covers the data description, experimental setup, and results for the JD.com dataset, while the details for the vending machine dataset are provided in Appendix C.

Datasets.

We validate our methodology using transactional records from JD.com (dataset available at: https://connect.informs.org/msom/events/datadriven2020), which include both static and dynamic features related to products (SKUs) and order details for March 2018.

The goal is to predict the demand at the promotional price given the demand at the regular price. We use the category information (3 categories in total) for product features and the region information (63 regions) for location features. We use the last 15 days for testing and the preceding 15 days for training. The detailed data description and dataset construction are deferred to Appendix C.1.

Table 1: Experiment results on real-world JD.com E-Commerce sales data with proxy data features generated from different methods. This includes regression techniques, ensemble strategies, neural-network-based methods, and transfer methods for a comprehensive benchmark.
Method MSE MAE MAPE
Linear Regression 1.2298 0.4757 0.2106
    + MQCNN 1.3789 0.5165 0.2486
    + GNN 0.7633 0.4138 0.1821
Random Forest 2.0397 0.5051 0.2706
    + MQCNN 1.7943 0.5639 0.3249
    + GNN 1.8419 0.5971 0.3473
XGBoost 1.7056 0.5516 0.3091
    + MQCNN 1.9021 0.5745 0.3352
    + GNN 1.7943 0.5639 0.3249
MLP 1.2708 0.4661 0.1945
    + MQCNN 1.2520 0.4668 0.1988
    + GNN 1.2708 0.4661 0.1945
GRU
    + MQCNN 1.7661 0.4780 0.2694
    + GNN 3.1427 0.7104 0.2234
LSTM
    + MQCNN 3.1850 0.6686 0.1752
    + GNN 1.2073 0.2936 0.1694
MAML 1.0752 0.4769 0.2345
    + MQCNN 3.2979 0.5898 0.3565
    + GNN 1.0383 0.4646 0.2201
Reptile 4.9341 0.8118 0.5410
    + MQCNN 1.5990 0.4457 0.1831
    + GNN 1.2041 0.4827 0.2287
MetaSGD 1.1941 0.4837 0.2379
    + GNN 1.2667 0.4651 0.1908
    + MQCNN 1.2111 0.4562 0.1983
TSA 1.7022 0.7310 0.4936
    + GNN 1.5116 0.5253 0.2585
    + MQCNN 1.1785 0.5315 0.3084
MUMOMAML 0.9449 0.3251 0.2189
    + MQCNN 1.0945 0.3861 0.1920
    + GNN 0.7577 0.3752 0.1553
F-FOMAML (Ours) 0.6552 0.3876 0.2117
    + MQCNN 0.6371 0.4134 0.2000
    + GNN 0.6089 0.3713 0.2077

Baselines.

Our evaluation encompasses a diverse range of baseline techniques for comparative analysis. This includes traditional regression techniques like Linear Regression, along with ensemble strategies such as Random Forest and the well-regarded XGBoost algorithm (Chen and Guestrin, 2016). In the realm of neural-network-based methods, we consider the Multi-Layer Perceptron network, Gated Recurrent Unit (GRU) (Chung et al., 2014), Dipole (Ma et al., 2017), and LSTNet (Lai et al., 2018). Additionally, advanced transfer methods like Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017b), Reptile (Nichol et al., 2018), Meta-SGD (Li et al., 2017b), TSA (Zhou et al., 2020) and MUMOMAML (Vuorio et al., 2019) are included. Consistency in the feature set is maintained across all baseline models, aligning them with our proposed method, and ensuring a fair comparison.

Model Evaluation and Training.

With the meta-learning framework in place, we train the base learners on the proxy data and evaluate their performance using metrics such as mean squared error (MSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). The meta-learner, which could be a neural network (Goodfellow et al., 2016), support vector machine (Cortes and Vapnik, 1995), or decision tree (Breiman et al., 1984), selects the best base learners and their corresponding hyper-parameters based on the evaluation results. Next, the selected base learners are fine-tuned on any available historical sales data from the regular sales period to adapt the model to the peak period. This fine-tuning step allows our model to better capture the unique relationships between features and sales in the target vending machine, leading to more accurate predictions and improved generalization to new tasks.
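
As a concrete illustration of the three metrics named above, the following is a minimal Python sketch. The function name `regression_metrics`, the `eps` guard, and the toy arrays are assumptions made for illustration and are not taken from the original evaluation code. MAPE is returned as a fraction rather than a percentage, which appears to match the scale of the values in Table 1.

```python
import numpy as np

def regression_metrics(y_true, y_pred, eps=1e-8):
    """Return MSE, MAE, and MAPE (as a fraction) for two equal-length 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mse = float(np.mean(err ** 2))
    mae = float(np.mean(np.abs(err)))
    # eps avoids division by zero when the true demand is zero
    # (a choice of this sketch, not something stated in the paper).
    mape = float(np.mean(np.abs(err) / np.maximum(np.abs(y_true), eps)))
    return {"MSE": mse, "MAE": mae, "MAPE": mape}

# Hypothetical toy demand values and forecasts.
metrics = regression_metrics([2.0, 4.0, 5.0], [1.0, 4.0, 7.0])
print(metrics)
```

Because MAPE divides by the true value, it weights errors on low-demand items more heavily than MSE or MAE do, which is one reason the three metrics can rank methods differently.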

6.2 Analysis of Results

Table 1 compares machine learning methods on real-world e-commerce sales data from JD.com using three metrics: MSE, MAE, and MAPE. The methods span traditional regression, ensemble strategies, neural-network-based approaches, and meta-learning and transfer methods, evaluated with proxy data features generated by either MQCNN or a Graph Neural Network (GNN). Linear Regression, serving as a baseline, shows moderate performance (MSE 1.2298), which deteriorates slightly when combined with MQCNN (1.3789) but improves with GNN (0.7633), indicating GNN's effectiveness in feature enhancement. Random Forest and XGBoost, both ensemble methods, exhibit higher errors than Linear Regression, and MQCNN and GNN features affect them unevenly, suggesting a complex interaction between ensemble methods and proxy data features. Among neural-network-based methods, adding MQCNN to the MLP leaves performance nearly unchanged, whereas GNN integration shows mixed results: GNN features outperform MQCNN features for LSTM on all three metrics but trail them for GRU on MSE and MAE. Meta-learning methods such as MAML, Reptile, and MetaSGD show varied outcomes, with some combinations increasing errors; for example, MAML's MSE rises from 1.0752 to 3.2979 with MQCNN. Reptile, in contrast, improves substantially with MQCNN (MSE 4.9341 to 1.5990), highlighting the potential of combining meta-learning algorithms with proxy data feature generation. The results for TSA and MUMOMAML underscore the importance of selecting an appropriate proxy data feature generation method. MUMOMAML combined with GNN achieves the best MAPE across all methods (0.1553), emphasizing the strength of multimodal meta-learning when paired with suitable proxy data features.

The choice of proxy data selection method, particularly GNN compared with alternatives such as clustering, has a substantial impact. Integrating GNN features improved all three metrics (MSE, MAE, MAPE) for Linear Regression, MAML, Reptile, TSA, and our proposed F-FOMAML, and improved at least one metric for Random Forest, MetaSGD, and MUMOMAML, highlighting the effectiveness of GNN in enhancing a model's ability to predict demand accurately. The gain is clearest for F-FOMAML combined with GNN, which attains the lowest MSE in Table 1 (0.6089).

Ablation study.

To better understand the effect of proxy data, we perform an ablation study that varies the parameter k in the k-shot proxy data selection and tracks the performance metrics as k changes. As illustrated in Figure 3, increasing k initially raises the error metric, suggesting a decline in model performance due to less relevant data. This trend reaches a plateau, after which further increases in k reduce the error, indicating improved performance from a larger proxy data set. These findings point to a threshold beyond which the quantity of proxy data begins to enhance model performance, and they suggest that larger proxy data sets can be beneficial. Detailed analyses and additional studies on algorithm convergence are provided in Appendix C.3.
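
The k-shot selection varied in this ablation can be sketched as a nearest-neighbor lookup. The snippet below assumes, purely for illustration, that each entity has an embedding vector (for example, one produced by the GNN) and that the proxies are the k entities closest to the target in Euclidean distance. The function name `select_k_proxies`, the distance choice, and the toy embeddings are hypothetical and may differ from the selection rule actually used.

```python
import numpy as np

def select_k_proxies(embeddings, target_idx, k):
    """Return indices of the k entities closest to the target in embedding space."""
    emb = np.asarray(embeddings, dtype=float)
    dist = np.linalg.norm(emb - emb[target_idx], axis=1)
    dist[target_idx] = np.inf  # never select the target itself
    order = np.argsort(dist, kind="stable")
    k = min(k, len(emb) - 1)   # at most all other entities
    return order[:k].tolist()

# Hypothetical toy embeddings; entity 0 is the target.
toy = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.0, 0.3], [2.0, 2.0]]
selections = {k: select_k_proxies(toy, target_idx=0, k=k) for k in (1, 2, 3)}
print(selections)
```

Under this rule, each increase in k adds a proxy that is farther from the target than all earlier ones, so a larger k trades relevance for sample size.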

Our analysis shows that F-FOMAML, especially when enhanced with GNN-generated proxy data features, outperforms traditional regression models, ensemble strategies, neural networks, and other meta-learning algorithms in MSE when predicting e-commerce sales on JD.com. It achieves the lowest MSE in Table 1 (0.6089), and its MAE (0.3713) is among the lowest reported, although LSTM + GNN (0.2936) and MUMOMAML (0.3251) attain lower MAE. Moreover, GNN features improve F-FOMAML on all three metrics (MSE, MAE, and MAPE) relative to F-FOMAML alone, underscoring the role of proxy data techniques in improving demand prediction models. The comparison with MQCNN features, which improve F-FOMAML's MSE and MAPE but not its MAE, further illustrates GNN's ability to capture complex data relationships. These findings highlight the value of combining GNN with meta-learning prediction models and offer insights for developing more precise demand prediction algorithms.
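
The relative improvements can be checked directly from Table 1. The short sketch below computes the fractional MAE reduction of F-FOMAML + GNN relative to MUMOMAML + GNN; that this is the pair behind the 1.04% figure reported for JD.com is an assumption, although the arithmetic is consistent with it. The helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline, ours):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline - ours) / baseline

# MAE values copied from Table 1.
mae_mumomaml_gnn = 0.3752
mae_ffomaml_gnn = 0.3713

pct = 100 * relative_reduction(mae_mumomaml_gnn, mae_ffomaml_gnn)
print(f"MAE reduction: {pct:.2f}%")  # prints "MAE reduction: 1.04%"
```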

7 Conclusion

This paper presents a novel approach to the challenging task of predicting demand during promotional events characterized by special buying behaviors. Traditional forecasting methods often fall short because historical data for such events is limited. To address this, we framed demand prediction within the graph-augmented meta-learning paradigm. Utilizing the GNN-enhanced F-FOMAML algorithm, which integrates the generalizability of meta-learning with the adaptability of FiLM layers, we developed a robust forecasting model particularly effective in data-sparse scenarios.

Our method is grounded in solid theoretical foundations, demonstrating the algorithm's ability to optimize predictive risk by managing the bias-variance trade-off. Empirical evaluations highlight our model's superiority over conventional forecasting techniques and underscore its applicability beyond retail, with potential uses in fields such as online banking security and digital marketing. F-FOMAML reduces prediction MAE by 26.24% on the vending machine dataset and by 1.04% on the JD.com dataset, with a notable 10.18% enhancement over GNN-based benchmarks. Further discussion of the strengths, limitations, and future research directions is provided in Appendix D.

References

  • Baardman et al. (2017) Lennart Baardman, Igor Levin, Georgia Perakis, and Divya Singhvi. 2017. Leveraging comparables for new product sales forecasting. Available at SSRN 3086237 (2017).
  • Bishop et al. (2014) Jared Bishop, John Peters, and York Yannikos. 2014. Rapid response learning of humanitarian interventions. In 2014 IEEE Symposium on Computational Intelligence in Multicriteria Decision-Making (MCDM). IEEE, 9–15.
  • Breiman et al. (1984) Leo Breiman, Jerome H Friedman, Richard A Olshen, and Charles J Stone. 1984. Classification and regression trees. CRC press (1984).
  • Cao and Zhang (2021) Xinyu Cao and Juanjuan Zhang. 2021. Preference learning and demand forecast. Marketing Science 40, 1 (2021), 62–79.
  • Chang et al. (2019) Kevin Chang, Jingxuan Wu, Xinyang Yu, Ruichen Yu, Xiaoxin Chen, Weiwei Dai, and David Anastasiu. 2019. Chimera: Large-Scale Classification using Machine Learning, Rules, and Crowdsourcing. In Proceedings of the VLDB Endowment, Vol. 12. VLDB Endowment, 193–206.
  • Chen and Guestrin (2016) Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016), 785–794.
  • Choe et al. (2023) Sang Keun Choe, Sanket Vaibhav Mehta, Hwijeen Ahn, Willie Neiswanger, Pengtao Xie, Emma Strubell, and Eric Xing. 2023. Making Scalable Meta Learning Practical. arXiv:2310.05674 [cs.LG]
  • Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
  • Cortes and Vapnik (1995) Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine learning 20, 3 (1995), 273–297.
  • Eisenach et al. (2020) Carson Eisenach, Yagna Patel, and Dhruv Madeka. 2020. MQTransformer: Multi-Horizon Forecasts with Context Dependent and Feedback-Aware Attention. CoRR abs/2009.14799 (2020).
  • Ferreira et al. (2016) Kris Johnson Ferreira, Bin Hong Alex Lee, and David Simchi-Levi. 2016. Analytics for an online retailer: Demand forecasting and price optimization. Manufacturing & service operations management 18, 1 (2016), 69–88.
  • Finn et al. (2017a) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017a. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR, 1126–1135.
  • Finn et al. (2017b) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017b. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In ICML. 1126–1135.
  • Fiot and Dinuzzo (2015) Jean-Baptiste Fiot and Francesco Dinuzzo. 2015. Electricity Demand Forecasting by Multi-Task Learning. arXiv:1512.08178 [cs.LG]
  • Gong et al. (2012) Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. 2012. Geodesic flow kernel for unsupervised domain adaptation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. 2066–2073.
  • Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
  • Gupta et al. (2016) Shivam Gupta, Kiran Ramesh, and Anil Kumar. 2016. Transfer Learning for Yield Prediction. In 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW). IEEE, 1109–1116.
  • Jean et al. (2016) Neal Jean, Marshall Burke, Michael Xie, W Matthew Davis, David B Lobell, and Stefano Ermon. 2016. Combining satellite imagery and machine learning to predict poverty. In Science, Vol. 353. American Association for the Advancement of Science, 790–794.
  • Kipf et al. (2018) Thomas Kipf, Ethan Fetaya, Kuan-Chieh Wang, Max Welling, and Richard Zemel. 2018. Neural relational inference for interacting systems. In International Conference on Machine Learning. 2688–2697.
  • Kipf and Welling (2017) Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In ICLR (Poster). OpenReview.net.
  • Kong et al. (2020a) Weihao Kong, Raghav Somani, Sham Kakade, and Sewoong Oh. 2020a. Robust Meta-learning for Mixed Linear Regression with Small Batches. In Advances in Neural Information Processing Systems, Vol. 33. Curran Associates, Inc., 4683–4696.
  • Kong et al. (2020b) Weihao Kong, Raghav Somani, Zhao Song, Sham Kakade, and Sewoong Oh. 2020b. Meta-Learning for Mixed Linear Regression. In Proceedings of the 37th International Conference on Machine Learning (ICML’20). JMLR.org, Article 500, 11 pages.
  • Lai et al. (2018) Guokun Lai, Wei-Cheng Chang, Yiming Yang, and Hanxiao Liu. 2018. Modeling long-and short-term temporal patterns with deep neural networks. In SIGIR. 95–104.
  • Le Guen and Thome (2020) Vincent Le Guen and Nicolas Thome. 2020. Probabilistic Time Series Forecasting with Structured Shape and Temporal Diversity. In Proceedings of the 34th International Conference on Neural Information Processing Systems (Vancouver, BC, Canada) (NIPS’20). Curran Associates Inc., Red Hook, NY, USA, Article 372, 14 pages.
  • Li et al. (2020) Jing Li, Yan Zhang, Zhenzhen Yang, and Shengjun Wang. 2020. Relation-aware Meta-learning for E-commerce Market Segment Demand Prediction with Limited Records. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management. ACM, 2325–2334.
  • Li et al. (2017a) Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. 2017a. Diffusion convolutional recurrent neural network: Data-driven traffic forecasting. arXiv preprint arXiv:1707.01926 (2017).
  • Li et al. (2017b) Zhenguo Li, Fengwei Zhou, Fei Chen, and Hang Li. 2017b. Meta-sgd: Learning to learn quickly for few shot learning. arXiv preprint arXiv:1707.09835 (2017).
  • Lim et al. (2019) Bryan Lim, Sercan Ömer Arik, Nicolas Loeff, and Tomas Pfister. 2019. Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting. CoRR abs/1912.09363 (2019).
  • Long et al. (2015) Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I Jordan. 2015. Learning transferable features with deep adaptation networks. In International Conference on Machine Learning. PMLR, 97–105.
  • Long et al. (2013) Mingsheng Long, Jianmin Wang, Guiguang Ding, Jiaguang Sun, and SY Philip. 2013. Transfer feature learning with joint distribution adaptation. In Proceedings of the IEEE international conference on computer vision. 2200–2207.
  • Ma et al. (2017) Fenglong Ma, Radha Chitta, Jing Zhou, Quanzeng You, Tong Sun, and Jing Gao. 2017. Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks. In KDD. ACM, 1903–1911.
  • Ma and Simchi-Levi (2022) Will Ma and David Simchi-Levi. 2022. Constructing demand curves from a single observation of bundle sales. In International Conference on Web and Internet Economics. Springer, 150–166.
  • Meng et al. (2022) Yu Meng, Jiaxin Huang, Yu Zhang, and Jiawei Han. 2022. Adapting Pretrained Representations for Text Mining. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Washington DC, USA) (KDD ’22). Association for Computing Machinery, New York, NY, USA, 4806–4807. https://doi.org/10.1145/3534678.3542607
  • Nichol et al. (2018) Alex Nichol, Joshua Achiam, and John Schulman. 2018. On First-Order Meta-Learning Algorithms. ArXiv abs/1803.02999 (2018). https://api.semanticscholar.org/CorpusID:4587331
  • Oreshkin et al. (2020) Boris N Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio. 2020. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. In Proceedings of the 8th International Conference on Learning Representations (ICLR).
  • Pan and Yang (2009) Sinno Jialin Pan and Qiang Yang. 2009. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, 10 (2009), 1345–1359.
  • Perez et al. (2018) Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron C. Courville. 2018. FiLM: Visual Reasoning with a General Conditioning Layer. In AAAI. AAAI Press, 3942–3951.
  • Salinas et al. (2020) David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. 2020. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting 36, 3 (2020), 1181–1191.
  • Shang et al. (2021) Chao Shang, Jie Chen, and Jinbo Bi. 2021. Discrete Graph Structure Learning for Forecasting Multiple Time Series. In ICLR. OpenReview.net.
  • Shen et al. (2020) Max Shen, Christopher S Tang, Di Wu, Rong Yuan, and Wei Zhou. 2020. JD. com: Transaction-level data for the 2020 MSOM data driven research challenge. Manufacturing & Service Operations Management (2020).
  • Snell et al. (2017) Jake Snell, Kevin Swersky, and Richard S. Zemel. 2017. Prototypical Networks for Few-shot Learning. In NIPS. 4077–4087.
  • Sung et al. (2017) Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M. Hospedales. 2017. Learning to Compare: Relation Network for Few-Shot Learning. https://doi.org/10.48550/ARXIV.1711.06025
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In NIPS. 3104–3112.
  • Tzeng et al. (2017) Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. 2017. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2962–2971.
  • Vinyals et al. (2016) Oriol Vinyals, Charles Blundell, Tim Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. 2016. Matching Networks for One Shot Learning. In NIPS. 3630–3638.
  • Vuorio et al. (2019) Risto Vuorio, Shao-Hua Sun, Hexiang Hu, and Joseph J. Lim. 2019. Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation. In Neural Information Processing Systems.
  • Wang et al. (2019) Yaqing Wang, Qin Yao, and Ivor W Tsang. 2019. Meta-learning for online retail sales prediction. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. ACM, 2073–2076.
  • Wen et al. (2017) Ruofeng Wen, Kari Torkkola, Balakrishnan Narayanaswamy, and Dhruv Madeka. 2017. A multi-horizon quantile recurrent forecaster. arXiv preprint arXiv:1711.11053 (2017).
  • Wu et al. (2020) Neo Wu, Bradley Green, Xue Ben, and Shawn O’Banion. 2020. Deep transformer models for time series forecasting: The influenza prevalence case. arXiv preprint arXiv:2001.08317 (2020).
  • Yang et al. (2023) Sitan Yang, Malcolm Wolff, Shankar Ramasubramanian, Vincent Quenneville-Belair, Ronak Mehta, and Michael Mahoney. 2023. GEANN: Scalable graph augmentations for multi-horizon time series forecasting. In KDD 2023 Workshop on Deep Learning on Graphs.
  • Yao et al. (2019) Huaxiu Yao, Ying Wei, Junzhou Huang, and Zhenhui Li. 2019. Learning to learn by remembering. In Advances in Neural Information Processing Systems. 1574–1584.
  • Zhou et al. (2020) Pan Zhou, Yingtian Zou, Xiaotong Yuan, Jiashi Feng, Caiming Xiong, and SC Hoi. 2020. Task Similarity Aware Meta Learning: Theory-inspired Improvement on MAML. In 4th Workshop on Meta-Learning at NeurIPS.
  • Zhu et al. (2020) Qi Zhu, Yidan Xu, Haonan Wang, Chao Zhang, Jiawei Han, and Carl Yang. 2020. Transfer Learning of Graph Neural Networks with Ego-graph Information Maximization. In Neural Information Processing Systems. https://api.semanticscholar.org/CorpusID:221640925

Appendix A Appendix

This appendix offers additional content to complement the core findings of our research. Below is a summary of the sections included:

  • Proof of Theorem 1 (Section B): A rigorous, step-by-step proof of Theorem 1 is presented, solidifying the theoretical underpinnings of our proposed methodology.

  • Experimental Details and Supplementary Results (Section C): We meticulously outline the datasets employed, the design choices, and the implementation specifics of our experiments, and present additional results that reinforce our conclusions.

  • Discussion and Future Directions (Section D): We critically discuss the strengths and limitations of our approach, illuminating potential avenues for future research endeavors.

Appendix B Proof of Theorem 1

Proof.

Let us first define an intermediate function:

$$h_{\widetilde{t}}^{(im)} = \frac{\sum_{i=1}^{n} w(\widetilde{t}, t_i)\, h_i}{\sum_{i=1}^{n} w(\widetilde{t}, t_i)}.$$

We then define the event $E_n = \{\sum_{i=1}^{n} w(\widetilde{t}, t_i) > 0\}$. Conditioned on the event $E_n$, we have

$$\begin{aligned}
\mathbb{E}\big[\big(h^{(im)}_{\widetilde{t}}(f(x)) - \widehat{h}_{\widetilde{t}}(f(x))\big)^2\big]
&\leq \frac{\sum_{i=1}^{n} w(\widetilde{t}, t_i) \cdot \mathbb{E}[(\widehat{h}_i(f(x)) - h_i(f(x)))^2]}{\big(\sum_{i=1}^{n} w(\widetilde{t}, t_i)\big)^2} \\
&\leq \frac{\max_i \mathbb{E}[(\widehat{h}_i(f(x)) - h_i(f(x)))^2]}{\sum_{i=1}^{n} w(\widetilde{t}, t_i)} \\
&= O\left(\frac{C(\mathcal{H})}{N \sum_{i=1}^{n} w(\widetilde{t}, t_i)}\right),
\end{aligned}$$

where the last step uses the assumption that $\mathbb{E}[(\widehat{h}_t(f(x)) - h_t(f(x)))^2] = O\left(\frac{C(\mathcal{H})}{N_d}\right)$ and $N_d \gtrsim N$.

Moreover, since $\|h_{t_1} - h_{t_2}\|_\infty \leq C \cdot \|Z_{t_1} - Z_{t_2}\| \leq Ch$ when $\|Z_{t_1} - Z_{t_2}\| \leq h$, we have that on the event $E_n$,

$$\begin{aligned}
|h_{\widetilde{t}}^{(im)} - h_t|
&= \Big|\frac{\sum_{i=1}^{n} w(\widetilde{t}, t_i)(h_i - h_t)}{\sum_{i=1}^{n} w(\widetilde{t}, t_i)}\Big| \\
&= \Big|\frac{\sum_{i=1}^{n} \mathbf{1}\{A(\widetilde{t}, t_i) < h\}(h_i - h_t)}{\sum_{i=1}^{n} \mathbf{1}\{A(\widetilde{t}, t_i) < h\}}\Big| \\
&\leq Ch.
\end{aligned}$$
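
The weighted average $h_{\widetilde{t}}^{(im)}$ with indicator weights can be illustrated numerically. The sketch below assumes that $A(\widetilde{t}, t_i)$ is the Euclidean distance between task metadata vectors and treats each $h_i$ as a scalar (its value at one fixed input); both are simplifications made for illustration. The function name `box_kernel_average` and the toy data are hypothetical, and the zero fallback reflects our reading of the convention the proof adopts when the denominator vanishes.

```python
import numpy as np

def box_kernel_average(z_target, z_tasks, h_values, bandwidth):
    """Average h_i over tasks whose metadata lies within `bandwidth` of the target.

    Returns 0.0 when no task qualifies (the complement of the event E_n).
    """
    z_tasks = np.asarray(z_tasks, dtype=float)
    h_values = np.asarray(h_values, dtype=float)
    dist = np.linalg.norm(z_tasks - np.asarray(z_target, dtype=float), axis=1)
    w = (dist < bandwidth).astype(float)  # w(t~, t_i) = 1{dist < h}
    total = w.sum()
    if total == 0:
        return 0.0
    return float(w @ h_values / total)

# Hypothetical metadata for three training tasks and one target task.
z_tasks = [[0.5, 0.6], [0.4, 0.5], [0.9, 0.9]]
h_values = [1.0, 3.0, 10.0]
near = box_kernel_average([0.5, 0.5], z_tasks, h_values, bandwidth=0.2)
empty = box_kernel_average([0.5, 0.5], z_tasks, h_values, bandwidth=0.05)
print(near, empty)
```

With bandwidth 0.2 only the first two tasks fall inside the window, so the distant third task does not influence the average; shrinking the bandwidth far enough leaves no neighbors and triggers the fallback.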

On the other hand, conditioned on the event $E_n^c$, when the denominator equals 0, by definition we have $h_t = 0$, and therefore

$$|h_{\widetilde{t}}^{(im)}(f(x)) - h_t(f(x))|^2 = h_{\widetilde{t}}^2(f(x)).$$

Consequently, we can write

$$|h_{\widetilde{t}}^{(im)}(f(x)) - h_t(f(x))|^2 \leq C^2 h^2 + h_{\widetilde{t}}^2(f(x))\, \mathbf{1}_{E_n^c}.$$

Therefore, we have

\begin{align}
\mathbb{E}\big[(\widehat{h}_{\widetilde{t}} - h_{\widetilde{t}})^2\big] \lesssim{} & \mathbb{E}\left[\frac{C(\mathcal{H})}{N \sum_{i=1}^{n} w(\widetilde{t}, t_i)} \cdot \mathbf{1}_{E_n}\right] \tag{9} \\
& + h^2 \tag{10} \\
& + \mathbb{E}\big[h_{\widetilde{t}}^2(f(x)) \cdot \mathbf{1}_{E_n^c}\big]. \tag{11}
\end{align}

To bound the first term on the right-hand side (RHS), we let

$$Y = \sum_{i=1}^{n} w(\widetilde{t}, t_i) = \sum_{i=1}^{n} \mathbf{1}\{\|Z_{\widetilde{t}} - Z_{t_i}\| < h\}.$$

Since the $Z_d$ are uniformly distributed on $[0,1]^r$, we have that $Y \sim \mathrm{Binomial}(n, q)$ with $q = \mathbb{P}(\|Z - Z_{\widetilde{t}}\| < h)$. Using a property of the binomial distribution, we have

\[
\mathbb{E}\left[\frac{\bm{1}\{Y>0\}}{Y}\right]\lesssim\frac{1}{nq}\lesssim\frac{1}{nh^{r}}.
\]
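The binomial property invoked above can be checked numerically. The sketch below assumes the standard form $\mathbb{E}[\bm{1}\{Y>0\}/Y]\le 2/((n+1)q)$ for $Y\sim\mathrm{Binomial}(n,q)$; the constant is not stated in the text, and the function names are illustrative only.

```python
from math import comb


def expected_inverse_count(n: int, q: float) -> float:
    """Exact E[1{Y>0}/Y] for Y ~ Binomial(n, q), summed over the pmf."""
    return sum(
        comb(n, k) * q**k * (1.0 - q) ** (n - k) / k
        for k in range(1, n + 1)
    )


def binomial_bound(n: int, q: float) -> float:
    """Assumed standard upper bound 2 / ((n + 1) q), i.e. of order 1 / (n q)."""
    return 2.0 / ((n + 1) * q)


# Compare the exact expectation with the bound for a few (n, q) pairs.
checks = [
    (n, q, expected_inverse_count(n, q), binomial_bound(n, q))
    for n, q in [(10, 0.3), (50, 0.1), (200, 0.05)]
]
```

Each tuple in `checks` holds the exact expectation followed by the bound, so the $1/(nq)$ scaling can be inspected directly.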

Therefore, the first term on RHS is upper bounded as:

\[
\mathbb{E}\left[\frac{C(\mathcal{H})}{N\sum_{i=1}^{n}w(\widetilde{t},t_{i})}\cdot\bm{1}_{E_{n}}\right]\lesssim\frac{C(\mathcal{H})}{N\cdot nh^{r}}.
\]

The third term in relation (9) can be bounded as

\begin{align*}
\mathbb{E}[h_{\widetilde{t}}^{2}(f(x))\cdot\bm{1}_{E_{n}^{c}}] &\leq \sup h_{\widetilde{t}}^{2}(f(x))\,\mathbb{E}[(1-q)^{n}] \\
&\lesssim \sup h_{\widetilde{t}}^{2}(f(x))\,\frac{1}{qn} \\
&\lesssim \frac{1}{nh^{r}}.
\end{align*}
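The middle step of the display above can be unpacked as follows. This is a sketch under two assumptions that are not spelled out in the text: that $E_{n}^{c}$ is the event $\{Y=0\}$, and that $\sup h_{\widetilde{t}}^{2}(f(x))$ is finite.

```latex
\begin{align*}
\mathbb{P}(Y=0) = (1-q)^{n} \le e^{-nq} \le \frac{1}{e\,nq},
\end{align*}
```

Here the first inequality uses $1-q\le e^{-q}$ and the second uses $xe^{-x}\le e^{-1}$ for $x\ge 0$, applied with $x=nq$.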

Combining all the pieces, we get

\[
\mathbb{E}[(\widehat{h}_{\widetilde{t}}-h_{\widetilde{t}})^{2}]\lesssim h^{2}+\frac{C(\mathcal{H})/N}{nh^{r}}.
\]

Therefore, when $l$ is Lipschitz with respect to the first argument, we have that

\begin{align*}
\mathbb{E}_{(x,y)\sim\widetilde{t}}\left[l(\widehat{g}_{\widetilde{t}}(x;\mathcal{D}^{tr},A),y)\right]-\mathbb{E}_{(x,y)\sim\widetilde{t}}[l(g_{\widetilde{t}}(x),y)]
&\leq \mathbb{E}[|\widehat{h}_{\widetilde{t}}-h_{\widetilde{t}}|] \\
&\leq \sqrt{\mathbb{E}[(\widehat{h}_{\widetilde{t}}-h_{\widetilde{t}})^{2}]} \\
&\lesssim h+\sqrt{\frac{C(\mathcal{H})/N}{nh^{r}}}.
\end{align*}

If we further take $h\asymp\left(\frac{C(\mathcal{H})/N}{n}\right)^{\frac{1}{r+2}}$, we obtain

\begin{align*}
\mathbb{E}_{(x,y)\sim\widetilde{t}}[l(\widehat{g}_{\widetilde{t}}(x;\mathcal{D}^{tr},A),y)]-\mathbb{E}_{(x,y)\sim\widetilde{t}}[l(g_{\widetilde{t}}(x;\mathcal{D}^{tr},A),y)]
\lesssim\left(\frac{C(\mathcal{H})/N}{n}\right)^{\frac{1}{r+2}},
\end{align*}

which completes the proof. ∎
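For reference, the bandwidth used in the final step is the one that balances the two terms of the bound $h+\sqrt{\frac{C(\mathcal{H})/N}{nh^{r}}}$. A short worked version, using only symbols already defined in the proof:

```latex
\begin{align*}
h = \sqrt{\frac{C(\mathcal{H})/N}{n h^{r}}}
\;\Longleftrightarrow\;
h^{r+2} = \frac{C(\mathcal{H})/N}{n}
\;\Longleftrightarrow\;
h = \left(\frac{C(\mathcal{H})/N}{n}\right)^{\frac{1}{r+2}}.
\end{align*}
```

With this choice both terms equal $h$, so the bound is of order $\left(\frac{C(\mathcal{H})/N}{n}\right)^{\frac{1}{r+2}}$. It shrinks as $N$ or $n$ grows, and the rate slows as the dimension $r$ of $Z$ increases.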

Appendix C Detailed Experiment Design and Additional Results

C.1 Dataset Description

Vending Machine Company

In our experiments, we utilize various datasets that contain detailed information about the sales orders and product attributes.

Sales Orders (all files whose names end with sale_order.csv): Each dataset covers a specific time range and includes both participating and non-participating products in the experiments. Each record corresponds to a sales order, characterized by the business area, shelf code, order code, product code, user code, purchase date, quantity purchased, sale price, actual payment amount, and product category.

Experiment Details (experiment_detail_product.csv): This dataset includes the products participating in pricing experiments A and B. Each product is characterized by attributes such as the business area, shelf code, scene, scene subdivision, product category, product sub-category, experimental group designation, sale price, adjusted price, cross-price indicator, and an indicator for prices below 95% of the overall average price.

Product Details (product_detail.csv): This dataset contains information about the attributes of each product, including the product code, product category, product type, product sub-category, and an indicator specifying whether it is a common product.

Shelf Details (shelf_detail.csv): This dataset provides attributes of each shelf, including the business area, shelf code, an indicator for low-selling devices in the previous month, an indicator for the ability to place high-priced products, old user rate, and shelf grade.

The various data points in these datasets allow for an extensive and comprehensive analysis of the sales patterns and facilitate the learning and evaluation of the proposed model.

Figure 2: Evaluation performance (MSE values) throughout training epochs for MAML, MLP, and our proposed method, where $k$ is set to 5.
Table 2: Detailed feature definition in the Vending Machine dataset.

(a) Sales Order Datasets

| Feature | Type | Description |
| --- | --- | --- |
| business_area | String | Area of the business. |
| shelf_code | String | Unique identifier for the shelf. |
| order_code | String | Unique identifier for the order. |
| product_code | String | Unique identifier for the product. |
| user_code | String | Unique identifier for the user. |
| pay_date | Date | Purchase date. |
| quantity_act | Integer | Quantity purchased. |
| sale_price | Float | Sale price of the product. |
| real_total_price | Float | Actual payment amount. |
| product_type | String | Product category. |

(b) Experiment Detail Dataset

| Feature | Type | Description |
| --- | --- | --- |
| business_area | String | Area of the business. |
| shelf_code | String | Unique identifier for the shelf. |
| product_code | String | Unique identifier for the product. |
| mtype | String | Scene. |
| scene | String | Scene subdivision. |
| second_type_name | String | Product category. |
| sub_type_name | String | Product sub-category. |
| if_exper | Integer | Indicator for the experimental group (1) or control group (0). |
| sale_price | Float | Sale price of the product. |
| ab_price | Float | Adjusted price. |
| cross_price | Float | Cross-price indicator. |
| lower_price95 | Integer | Indicator for prices below 95% of the overall average price. |

(c) Product Detail Dataset

| Feature | Type | Description |
| --- | --- | --- |
| product_code | String | Unique identifier for the product. |
| type_name | String | Product category. |
| second_type_name | String | Product type. |
| sub_type_name | String | Product sub-category. |
| is_common_product | Integer | Indicator for common products (1 for yes, 0 for no). |

(d) Shelf Detail Dataset

| Feature | Type | Description |
| --- | --- | --- |
| business_area | String | Area of the business. |
| shelf_code | String | Unique identifier for the shelf. |
| is_low_sale | Integer | Indicator for low-selling devices in the previous month (1 for yes, 0 for no). |
| can_fill_high_price | Integer | Indicator for the ability to place high-priced products. |
| old_user_rate | Float | Old user rate. |
| grade | String | Shelf grade. |
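The four vending machine tables share the keys `shelf_code` and `product_code`, so they can be combined into one record per order. The sketch below shows one way to do this with the standard library. The rows are made up for illustration, only a subset of the columns in Table 2 is used, and the helper names are hypothetical.

```python
import csv
import io

# Made-up rows for illustration only; column names follow Table 2.
SALES_CSV = """shelf_code,product_code,pay_date,quantity_act,sale_price
S001,P01,2022-03-10,2,3.5
S001,P02,2022-03-10,1,5.0
S002,P01,2022-03-11,3,3.5
"""
SHELF_CSV = """shelf_code,business_area,grade
S001,A,high
S002,B,low
"""
PRODUCT_CSV = """product_code,type_name,is_common_product
P01,drinks,1
P02,snacks,0
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def index_by(rows, key):
    """Build a lookup table from a key column to its row."""
    return {row[key]: row for row in rows}


def join_orders(sales, shelves, products):
    """Attach shelf and product attributes to every sales order."""
    shelf_by_code = index_by(shelves, "shelf_code")
    product_by_code = index_by(products, "product_code")
    joined = []
    for order in sales:
        row = dict(order)
        row.update(shelf_by_code[order["shelf_code"]])
        row.update(product_by_code[order["product_code"]])
        joined.append(row)
    return joined


joined = join_orders(
    read_rows(SALES_CSV), read_rows(SHELF_CSV), read_rows(PRODUCT_CSV)
)
```

With real files, `read_rows` would read from disk instead of from in-memory strings; the join logic stays the same.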

JD.com

We work with the transactional records from JD.com, which offer a blend of both static and dynamic features related to the product (SKU) and order details for March 2018 (Shen et al., 2020). The SKU table contains information about the SKUs that were clicked at least once during March 2018. Each SKU entry has a unique SKU ID and is associated with a seller. For this study, 9,167 SKUs were selected. Each SKU possesses two pivotal attributes, which could, depending on the category, represent product features like SPF for face moisturizers or the number of personalized shaving modes for electric shavers.

The Order table encompasses details about unique customer orders within our designated product category from March 2018. This table elucidates specifics like order quantity, order date and time, SKU type, and the promised delivery time of the order. Additionally, it captures the product pricing and promotional activities, delineating the difference between the original and final unit price, thereby indicating the promotional discounts offered.

For our analysis, we split our tasks into the following partitions: 1) the training set $\mathcal{D}_{train}$, comprising data from regular sales days and used for initial model training; 2) the query set $\mathcal{D}_{query}$, which consists of slightly modified versions of tasks from $\mathcal{D}_{train}$ and facilitates the inner-loop adaptation; 3) the testing set $\mathcal{D}_{test}$, which incorporates data from peak sales periods and is reserved for assessing model performance.

Table 3: Detailed feature description of the JD.com dataset.

(a) Description of the SKU table

| Field | Data type | Description | Sample value |
| --- | --- | --- | --- |
| sku_ID | String | Unique identifier of a product | b4822497a5 |
| type | Int | 1P or 3P SKU | 1 |
| brand_ID | String | Brand unique identification code | c840ce7809 |
| attribute1 | Int | First key attribute of the category | 3 |
| attribute2 | Int | Second key attribute of the category | 60 |
| activate_date | String | The date at which the SKU is first introduced | 2018-03-01 |
| deactivate_date | String | The date at which the SKU is terminated | 2018-03-01 |

(b) Description of the users table

| Field | Data type | Description | Sample value |
| --- | --- | --- | --- |
| user_ID | String | User unique identification code | 000000f736 |
| user_level | Int | User level | 10 |
| first_order_month | String | First month in which the customer placed an order on JD.com | 2017-07 |
| plus | Int | If user is with a PLUS membership | 0 |
| gender | String | User gender (estimated) | F |
| age | String | User age range (estimated) | 26–35 |
| marital_status | String | User marital status (estimated) | M |
| education | Int | User education level (estimated) | 3 |
| purchase_power | Int | User purchase power (estimated) | 2 |
| city_level | Int | City level of user address | 1 |

(c) Description of the orders table

| Field | Data type | Description | Sample value |
| --- | --- | --- | --- |
| order_ID | String | Order unique identification code | 3b76bfcd3b |
| user_ID | String | User unique identification code | 3cde601074 |
| sku_ID | String | SKU unique identification code | 443fd601f0 |
| order_date | String | Order date (format: yyyy-mm-dd) | 2018-03-01 |
| order_time | String | Specific time at which the order gets placed | 2018-03-01 11:10:40.0 |
| quantity | Int | Number of units ordered | 1 |
| type | Int | 1P or 3P orders | 1 |
| promise | Int | Expected delivery time (in days) | 2 |
| original_unit_price | Float | Original list price | 99.9 |
| final_unit_price | Float | Final purchase price | 53.9 |
| direct_discount_per_unit | Float | Discount due to SKU direct discount | 5.0 |
| quantity_discount_per_unit | Float | Discount due to purchase quantity | 41.0 |
| bundle_discount_per_unit | Float | Discount due to “bundle promotion” | 0.0 |
| coupon_discount_per_unit | Float | Discount due to customer coupon | 0.0 |
| gift_item | Int | If the SKU is with gift promotion | 0 |
| dc_ori | Int | Distribution center ID where the order is shipped from | 29 |
| dc_des | Int | Destination address where the order is shipped to (represented by the closest distribution center ID) | 29 |
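The promotional discount described above is the gap between the original and final unit price. The sketch below computes it from the sample values in the orders table and compares it with the sum of the four itemized discount fields. That the two agree for every order is an assumption; the sample row is merely consistent with it. The helper names are hypothetical.

```python
import math

# Sample values copied from the orders table (Table 3c).
sample_order = {
    "original_unit_price": 99.9,
    "final_unit_price": 53.9,
    "direct_discount_per_unit": 5.0,
    "quantity_discount_per_unit": 41.0,
    "bundle_discount_per_unit": 0.0,
    "coupon_discount_per_unit": 0.0,
}

DISCOUNT_FIELDS = (
    "direct_discount_per_unit",
    "quantity_discount_per_unit",
    "bundle_discount_per_unit",
    "coupon_discount_per_unit",
)


def total_discount(order):
    """Promotional discount per unit: original price minus final price."""
    return order["original_unit_price"] - order["final_unit_price"]


def itemized_discount(order):
    """Sum of the four per-unit discount components."""
    return sum(order[field] for field in DISCOUNT_FIELDS)


# Compare with a tolerance because the prices are floating-point values.
discounts_agree = math.isclose(
    total_discount(sample_order), itemized_discount(sample_order)
)
```

Either quantity can serve as a promotion feature; the itemized fields additionally indicate which type of promotion was applied.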

C.2 Implementation Details

We employ PyTorch for model implementation and Adam for optimization. The learning rate scheduler uses a warmup phase, accounting for 10% of the total training steps, followed by a linear decay to zero. We set the dropout rate at 0.5 to prevent overfitting during training.
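The warmup-then-linear-decay schedule can be written as a multiplier applied to the base learning rate. The following is a minimal sketch under the assumption that the warmup itself is linear and starts at zero, which the text does not state; `warmup_linear_factor` is a hypothetical name, not a function from the released code.

```python
def warmup_linear_factor(step: int, total_steps: int, warmup_ratio: float = 0.1) -> float:
    """Multiplier for the base learning rate at a given training step.

    Rises linearly from 0 to 1 over the first `warmup_ratio` fraction of
    steps (assumed linear warmup), then decays linearly to 0 at `total_steps`.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    decay_steps = max(1, total_steps - warmup_steps)
    return max(0.0, (total_steps - step) / decay_steps)


# Example: the effective rate halfway through warmup for a base rate of 1e-3.
base_lr = 1e-3
lr_mid_warmup = base_lr * warmup_linear_factor(50, 1000)
```

In PyTorch, a function of this shape can be passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, for example via `lambda step: warmup_linear_factor(step, total_steps)`.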

Each training and test episode corresponds to a single task. For each task, we sample 5 data points (k-shot) for training, validation, and testing. We iterate over 1000 training episodes, 200 validation episodes, and 1000 test episodes.

The hidden dimensions of the product-specific and machine-specific estimators are set to 32. After training for 100 epochs, the model with the lowest validation MSE is saved and used for testing.

All experiment outputs are saved in a timestamped directory under the project’s output path for reproducibility and future reference.

After training, we load the model with the best performance on the validation set to evaluate on the testing set, and report the test scores.

For the GNN-based learning tasks, we aggregate the order information for each product to obtain the daily quantities sold over the one month of JD.com data. We also include the original and final price sequences as time series features, together with each product’s static features such as brand and attributes. The forecasting task is to predict each product’s sales for the next day using the past 16 days of information (i.e., $C=16$ in Equation (5)).
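The aggregation and windowing described above can be sketched as follows. This is an illustration rather than the pipeline used in the experiments: the input is assumed to be a list of `(order_date, quantity)` pairs for a single product, and both function names are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def daily_quantities(orders, start=date(2018, 3, 1), days=31):
    """Sum ordered quantities per day; days without orders count as 0."""
    totals = defaultdict(int)
    for order_date, quantity in orders:
        totals[order_date] += quantity
    return [
        totals[(start + timedelta(days=offset)).isoformat()]
        for offset in range(days)
    ]


def make_windows(series, context=16, horizon=1):
    """Split a series into (past `context` values, next `horizon` values) pairs."""
    samples = []
    for begin in range(len(series) - context - horizon + 1):
        past = series[begin:begin + context]
        target = series[begin + context:begin + context + horizon]
        samples.append((past, target))
    return samples


# Toy orders for one product: (order_date, quantity).
toy_orders = [("2018-03-01", 1), ("2018-03-01", 2), ("2018-03-03", 5)]
series = daily_quantities(toy_orders)
windows = make_windows(series)
```

With 31 days of data and a context of 16 days, each product yields 15 one-day-ahead training samples.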

The graph constructed uses products’ brand information such that products under the same brand are connected. In this task, the graph contains 9159 nodes that represent all products having at least 1 unit of sale in one month. We use two graph convolutional (GCN) layers to explore all 2-hop neighbors for each product, together with its own static and dynamic features in the prediction task.
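The same-brand connectivity rule can be illustrated with a small edge-list builder. This is a sketch with hypothetical SKU and brand identifiers; how the resulting edges are fed into the GCN layers is not shown here.

```python
from collections import defaultdict
from itertools import combinations


def brand_edges(sku_to_brand):
    """Return undirected edges linking every pair of SKUs with the same brand."""
    skus_by_brand = defaultdict(list)
    for sku, brand in sku_to_brand.items():
        skus_by_brand[brand].append(sku)
    edges = set()
    for skus in skus_by_brand.values():
        # Sorting makes each undirected edge appear once, as (smaller, larger).
        for left, right in combinations(sorted(skus), 2):
            edges.add((left, right))
    return edges


# Hypothetical identifiers: three SKUs of brand "x" and one of brand "y".
toy_edges = brand_edges({"a": "x", "b": "x", "c": "x", "d": "y"})
```

A brand with $m$ products contributes $m(m-1)/2$ edges under this rule, so the edge count is dominated by the largest brands.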

We train the graph-based model until convergence. The Adam optimizer with default settings is used to minimize the Mean Absolute Error (MAE) loss with a batch size of 8 for up to 100 epochs. The final embedding for each product is a numerical vector of length 90, which contains 50, 8, and 32 encoded features representing static, dynamic, and graph-enhanced features, respectively.

Our F-FOMAML model is configured with the hyper-parameters in Table 4. We use two-layer feed-forward neural networks with ReLU activation to encode the product and vending machine separately. The hidden size is set to 64. All models are trained for 100K episodes. Each episode is a regression task. We use the mean-square-error loss. We use Adam as the optimizer with a learning rate of 1e-3. A linear warmup scheduler is used with the first 10% episodes as the warm-up episodes. The dropout is set to 0.5. We implement our method using PyTorch 1.11 and Python 3.8. The model is trained on a CentOS Linux 7 machine with 128 AMD EPYC 7513 32-Core Processors, 512 GB memory, and eight NVIDIA RTX A6000 GPUs.

Table 4: Hyperparameter configuration of our F-FOMAML network.

| Hyperparameter | Value |
| --- | --- |
| Dropout rate | 0.5 |
| Hidden dimension size | 32 |
| Training $k$-shot | 5 |
| Training episodes | 1000 |
| Validation $k$-shot | 5 |
| Validation episodes | 200 |
| Test $k$-shot | 5 |
| Test episodes | 1000 |
| Epochs | 100 |
| Learning rate scheduler | WarmupLinear |
| Warmup ratio | 0.1 |
| Optimizer | Adam |
| Learning rate | 0.001 |
| Weight decay | 0 |
| Monitor metric | Mean Squared Error (MSE) |

C.3 Additional Experiment Results

To validate our methodology, we sourced data from two retail contexts: a vending machine merchandising dataset and a dataset from the e-commerce platform JD.com.

Experiments on Vending machine data

Vending Machine Merchandising Dataset. This dataset is derived from a private vending machine company. It contains sales data from March 10, 2022, to April 20, 2022, for 246 products and 1715 vending machines. Each product in a specific vending machine has a base price (in effect for the first 20 days) and an adjusted price (in effect for the last 20 days). The goal is to estimate the demand at the adjusted price given the demand at the base price. We use the category information (7 categories in total) as product features, and the region (4 regions) and scene (8 scenes) information as vending machine features. We use the last 10 days as the testing set and the second-to-last 10 days as the training set. The detailed data description can be found in Table 2.

Table 5: Experiment results of F-FOMAML using vending machine sales data. Our F-FOMAML obtains the smallest error on the real-world dataset among the competing baselines.

| Method | With GNN | MSE | MAE | MAPE |
| --- | --- | --- | --- | --- |
| Linear Regression | No | 0.6218 | 0.4782 | 0.2900 |
| MLP | No | 0.2811 | 0.2038 | 0.1499 |
| MAML | No | 0.2985 | 0.2143 | 0.1587 |
| F-FOMAML (Ours) | No | 0.2345 | 0.1532 | 0.1206 |

Table 5 compares our F-FOMAML with Linear Regression and MLP, alongside the meta-learning model MAML, in predicting vending machine sales. Linear Regression lags well behind the other methods, while MLP is a strong baseline that slightly outperforms MAML (MSE of 0.2811 versus 0.2985), which may suggest that MAML overfits in data-limited scenarios. F-FOMAML surpasses all of these models on every metric. Our analysis indicates that F-FOMAML improves MAE by 26.24% over the existing models on the vending machine dataset, benefiting from domain-specific insights for relation construction. This result underscores F-FOMAML’s effectiveness in demand forecasting when proxy data and relational metadata are available.

In the analysis of the vending machine dataset (Table 5), GNNs, MQCNN, and other sequential models were excluded because the dataset lacks the continuous time-series data that GNNs need to produce effective embeddings. Therefore, a direct comparison involving GNNs is not included for this dataset.
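For reference, the three error metrics reported in Table 5 can be computed as below. These are the textbook definitions, given as a sketch; details such as how MAPE treats zero-demand targets are not specified in the text, so the version here assumes all targets are nonzero.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction; assumes nonzero targets."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy targets and predictions, for illustration only.
toy_true = [2.0, 4.0]
toy_pred = [1.0, 5.0]
toy_scores = (mse(toy_true, toy_pred), mae(toy_true, toy_pred), mape(toy_true, toy_pred))
```

Reporting MAPE as a fraction rather than a percentage matches the scale of the values in Table 5, although that reading of the table is itself an assumption.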

Additional Ablation studies and analysis on JD.com dataset

Figure 3: Evaluation of metrics versus different values of k for the k-shot proxy data selection.

In our ablation study for proxy data, we aim to unpack the impact of varying volumes of proxy data on our methodology by adjusting the k parameter within the k-shot proxy data selection framework and examining its influence on our method’s error metrics. The outcomes of this investigation are presented in Figure 3, where we meticulously track how changes in the k value affect the error metric. Our observations reveal a distinct pattern: as the value of k initially increases, there is a corresponding rise in the error metric, suggesting a diminution in model performance possibly due to the incorporation of a larger but less relevant data set. This upward trend in error reaches a plateau, indicating a point of saturation where further increases in k do not adversely affect the model to the same extent. Interestingly, beyond this saturation point, the error metric begins to decrease with further increases in k, suggesting that the model starts to derive benefits from the expanded pool of proxy data. This indicates the potential value of using larger sets of proxy data, highlighting a critical threshold beyond which the quantity of data begins to outweigh the dilution of relevance, thereby enhancing model performance.

Additional details regarding the training performance are presented in Figure 4, where our algorithm converges to a lower error than other baselines. The fluctuation of our method is due to the feature-based adaptation which results in the stochastic pattern of the convergence behavior.

Figure 4: Training performance (MSE values) throughout training epochs for MAML, MLP, and our proposed method, where $k$ is set to 5.

Appendix D Discussion

Adaptability and Scalability of F-FOMAML Algorithm: The GNN-enhanced F-FOMAML algorithm is designed for adaptability beyond just demand forecasting, making it suitable for a range of prediction and classification tasks in data-limited scenarios. Its performance on large-scale datasets, like those from JD.com, indicates its scalability. This scalability, alongside the algorithm’s flexibility, suggests potential applicability across diverse industries and data environments. Our model also relies on the first-order MAML method, which avoids computing second-order derivatives.

As demonstrated in (Yang et al., 2023), the GNN-based forecaster used to generate embeddings can operate in a mini-batch fashion and hence can scale to graphs with millions of nodes, which allows the embeddings to be generated from very large datasets (much larger than the current open-source JD.com data). Moreover, our F-FOMAML algorithm is also scalable: it is mathematically similar to Reptile (Nichol et al., 2018), and it can be scaled further through distributed training, as demonstrated in recent work (Choe et al., 2023).
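To make the first-order update concrete, the sketch below runs plain first-order MAML on a toy family of scalar regression tasks $y=s\cdot x$. It is not the F-FOMAML algorithm: there are no feature-specific layers, no proxy data selection, and no GNN metadata, and the tasks, step sizes, and function names are all assumptions chosen for illustration.

```python
def grad_mse(w, data):
    """Gradient of the mean squared error of the model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)


def fomaml_step(w, tasks, inner_lr=0.05, outer_lr=0.1):
    """One meta-update of first-order MAML over a batch of tasks."""
    meta_grad = 0.0
    for support, query in tasks:
        # Inner loop: one gradient step on the task's support set.
        w_adapted = w - inner_lr * grad_mse(w, support)
        # First-order approximation: use the query gradient at the adapted
        # parameter directly and ignore how w_adapted depends on w.
        meta_grad += grad_mse(w_adapted, query)
    return w - outer_lr * meta_grad / len(tasks)


def make_task(slope):
    """Noiseless toy task with separate support and query inputs."""
    support = [(x, slope * x) for x in (1.0, 2.0)]
    query = [(x, slope * x) for x in (0.5, 1.5)]
    return support, query


toy_tasks = [make_task(slope) for slope in (1.5, 2.0, 2.5)]

w = 0.0
for _ in range(200):
    w = fomaml_step(w, toy_tasks)
```

On these symmetric toy tasks the meta-parameter settles at the average slope, from which a single inner step moves it toward any individual task. Because each meta-update needs only ordinary gradients, per-task computations can be distributed across workers, which is the property the scalability argument above relies on.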

Nevertheless, our solution is not devoid of limitations. The success of our methodology is critically tied to the quality and relevance of the proxy data employed. Any deficiencies or omissions in this proxy data could compromise the effectiveness of our approach.

Looking ahead, we intend to incorporate meta-learning algorithms specifically designed for regression problems (Kong et al., 2020a, b). By doing so, we aim to embed deeper domain-specific knowledge into our model, enhancing its predictive accuracy and generalizability. Given the generality of our methodology, we envision adapting it to other data-limited scenarios, such as cold-start recommendation or trend forecasting, as directions for future work.