AIGB: Generative Auto-bidding via Diffusion Modeling

Jiayan Guo, Peking University and Alibaba Group, Beijing, China
Yusen Huo, Alibaba Group, Beijing, China
Zhilin Zhang, Alibaba Group, Beijing, China
Tianyu Wang (yves.wty@alibaba-inc.com), Alibaba Group, Beijing, China
Chuan Yu, Alibaba Group, Beijing, China
Jian Xu, Alibaba Group, Beijing, China
Yan Zhang, Peking University, Beijing, China
Bo Zheng, Alibaba Group, Beijing, China
(2024)
Abstract.

Auto-bidding plays a crucial role in facilitating online advertising by automatically providing bids for advertisers. Reinforcement learning (RL) has gained popularity for auto-bidding. However, most current RL auto-bidding methods are modeled as a Markov decision process (MDP), which assumes Markovian state transitions. This assumption restricts performance in long-horizon scenarios and makes the model unstable in highly random online advertising environments. To tackle this issue, this paper introduces AI-Generated Bidding (AIGB), a novel paradigm for auto-bidding through generative modeling. Within this paradigm, we propose DiffBid, a conditional diffusion modeling approach for bid generation. DiffBid directly models the correlation between the return and the entire trajectory, effectively avoiding error propagation across time steps over long horizons. Additionally, DiffBid offers a versatile approach for generating trajectories that maximize given targets while adhering to specific constraints. Extensive experiments on a real-world dataset and an online A/B test on the Alibaba advertising platform demonstrate the effectiveness of DiffBid, which achieves a 2.81% increase in GMV and a 3.36% increase in ROI.

Online Advertising, Auto-bidding, Generative Learning, Diffusion Modeling
journalyear: 2024
copyright: acmlicensed
conference: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; August 25–29, 2024; Barcelona, Spain
booktitle: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24), August 25–29, 2024, Barcelona, Spain
doi: 10.1145/3637528.3671526
isbn: 979-8-4007-0490-1/24/08
submissionid: 79
ccs: Information systems Computational advertising

1. Introduction

The ever-increasing digitalization of commerce has exponentially expanded the scope and importance of online advertising platforms (Ha, 2008; Evans, 2009). These ad platforms have become indispensable for businesses to effectively target their audience and drive sales. Traditionally, advertisers had to manually adjust bid prices to optimize overall ad performance. However, this coarse bidding process becomes impractical when dealing with trillions of impression opportunities, because it requires extensive domain knowledge (Chiesi et al., 1979) and comprehensive information about the advertising environment.

To alleviate the burden of bid optimization for advertisers, these ad platforms provide auto-bidding services (Deng et al., 2021; Balseiro et al., 2021b, a; Ou et al., 2023). These services automate the determination of bids for each impression opportunity by employing well-designed bidding strategies. Such strategies consider a variety of factors about advertising environments and advertisers, such as the distribution of impression opportunities, budgets, and average cost constraints (Li and Tang, 2022). Considering the dynamic nature of advertising environments, it is essential to regularly optimize the bidding strategy, typically at intervals of a few minutes, in response to changing conditions. With advertising episodes typically extending beyond 24 hours, auto-bidding can be seen as a sequential decision-making process with a long planning horizon where the bidding strategy seeks to optimize performance throughout the entire episode.

Recently, reinforcement learning (RL) techniques have been employed to optimize auto-bidding strategies by training agents on bidding logs collected from online advertising environments (Jin et al., 2018; Cai et al., 2017; Wang et al., 2017; He et al., 2021; Zhang et al., 2023b; Mou et al., 2022). By leveraging realistic historical bidding information, these agents can learn patterns and trends to make informed bidding decisions. However, most existing RL auto-bidding methods are based on the Markov decision process (MDP), where the next state depends only on the current state and action. In the online auto-bidding environment, this assumption is challenged by our statistical analysis presented in Figure 1, which shows that the correlation between the history states and the next bidding state increases significantly as the length of the history sequence grows. This result indicates that solving auto-bidding by considering only the last state will encounter several problems, including instability in the highly random online advertising environment. Additionally, RL methods that rely on the Bellman equation often suffer from compounding errors (Fujimoto et al., 2022). This issue is especially pronounced in the auto-bidding problem, which is characterized by sparse returns and limited data coverage. A detailed statistical analysis is provided in Appendix A.5.

Figure 1. Correlation Coefficients between History and the Next State.

In this paper, instead of employing RL-based methods, we present a novel paradigm, AI-Generated Bidding (AIGB), that regards auto-bidding as a generative sequential decision-making problem. AIGB directly captures the correlation between the return and the entire bidding trajectory, which consists of a sequence of states or actions, thereby transforming the problem into learning to generate an optimal bidding trajectory. This approach enables us to overcome the limitations of RL when dealing with the highly random online advertising environment, sparse returns, and limited data coverage.

Within the new paradigm, we propose DiffBid, a diffusion-based auto-bidding model. In the forward process, it gradually corrupts the bidding trajectory by injecting scheduled Gaussian noise. It then reconstructs the trajectory from its corrupted version, given returns and temporal conditions, via a parameterized neural network. We further propose a non-Markovian inverse dynamics model (Nguyen-Tuong et al., 2008; Guo et al., 2022; Zhang et al., 2023a) to generate optimal bidding parameters more accurately. Going one step further, DiffBid provides the flexibility to closely align with the specific needs of advertisers by accommodating diverse constraints such as cost-per-click (CPC) and by incorporating human feedback. Notably, DiffBid serves as a unified model capable of mastering multiple tasks simultaneously, dynamically composing various bidding trajectory components to generate sequences that efficiently maximize diverse targets while adhering to a range of predefined constraints. To assess the effectiveness of DiffBid, we conducted extensive offline and online evaluations against baselines. Our results indicate that DiffBid surpasses RL methods for auto-bidding. In summary:

  • We uncover that the Markov assumptions upon which common decision-making methods rely are not applicable to the auto-bidding problem. Therefore, we propose a novel bidding paradigm with non-Markovian properties based on generative learning. This paradigm represents a significant innovation in modeling methodology compared with existing RL methods commonly used in auto-bidding.

  • Unlike common bidding methods, our approach captures the correlation between the return and the entire bidding trajectory. This design enables the method to address important challenges, such as sparse returns, and ensures stability in the highly random advertising environment. Finally, we prove that the proposed diffusion modeling is equivalent in terms of optimality to solving a non-Markovian decision problem.

  • We demonstrate that the method can integrate capabilities to handle a variety of tasks within a unified solution, transcending the limitations of traditional task-specific methods. Experiments show that DiffBid outperforms conventional RL methods in auto-bidding and achieves significant performance gains on a leading e-commerce ad platform in both offline and online evaluation. Specifically, it achieves a 2.81% increase in GMV and a 3.36% increase in ROI.

2. Preliminary

2.1. Problem Formulation

Figure 2. Overall Framework for Generative Auto-bidding.

For simplicity, we consider auto-bidding with cost-related constraints. During a time period, suppose there are $N$ impression opportunities arriving sequentially and indexed by $i$. In this setting, advertisers submit bids to compete for each impression opportunity. An advertiser wins the impression if its bid $b_i$ is higher than all other bids. It then incurs a cost $c_i$ for winning and receives the value of the impression.

During the period, the goal of an advertiser is to maximize the total received value $\sum_i o_i v_i$, where $v_i$ is the value of impression $i$ and $o_i$ indicates whether the advertiser wins impression $i$. In addition, a budget and several constraints control the performance of ad delivery. The budget constraint is simply $\sum_i o_i c_i \le B$, where $c_i$ is the cost of impression $i$ and $B$ is the budget. The other constraints are more complex; according to (He et al., 2021), they admit the unified formulation:

(1) $$\frac{\sum_i c_{ij} o_i}{\sum_i p_{ij} o_i} \le C_j,$$

where $C_j$ is the upper bound of the $j$-th constraint, $p_{ij}$ can be any performance indicator (e.g., return) or a constant, and $c_{ij}$ is the cost of constraint $j$. Given $J$ constraints, the Multi-constrained Bidding (MCB) problem is:

(2) $$\begin{aligned} \mathop{\text{maximize}}_{o_i} \quad & \sum_i o_i v_i \\ \text{s.t.} \quad & \sum_i o_i c_i \le B \\ & \frac{\sum_i c_{ij} o_i}{\sum_i p_{ij} o_i} \le C_j, \quad \forall j \\ & o_i \in \{0, 1\}, \quad \forall i \end{aligned}$$

A previous study (He et al., 2021) has already shown the optimal solution:

(3) $$b_i^* = \lambda_0 v_i + C_i \sum_{j=1}^{J} \lambda_j p_{ij},$$

where $b_i^*$ is the predicted optimal bid for impression $i$ and $\lambda_j,\ j \in \{0, \dots, J\}$ are the optimal bidding parameters. Specifically, when only the budget constraint is considered, the problem is called Max Return bidding. When both the budget constraint and the CPC constraint are considered, it is called Target-CPC bidding. From an alternative perspective, the optimal strategy arranges all impressions in order of their cost-effectiveness (CE) and then selects every impression opportunity whose CE surpasses the optimal CE ratio $ce^*$. This threshold enforces the constraint, and the optimal bidding parameter is $\lambda_0 = 1 / ce^*$.
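
The cost-effectiveness view can be illustrated with a small sketch for the budget-only (Max Return) case. It assumes that the values $v_i$ and costs $c_i$ of all impressions are known in advance, which does not hold online, and it uses a greedy prefix selection that is exact only for the fractional relaxation of the problem. The function name `max_return_threshold` and the sample numbers are hypothetical and do not come from the paper.

```python
import numpy as np

def max_return_threshold(values, costs, budget):
    """Greedy selection by cost-effectiveness (CE = value / cost) under a budget.

    Returns the win indicators o_i, the CE of the last selected impression
    (used here as a stand-in for ce*), and lambda_0 = 1 / ce*.
    Assumption: all values and costs are known up front (offline setting).
    """
    values = np.asarray(values, dtype=float)
    costs = np.asarray(costs, dtype=float)
    ce = values / costs
    order = np.argsort(-ce)                      # most cost-effective first
    cum_cost = np.cumsum(costs[order])
    # Number of leading impressions whose cumulative cost fits in the budget.
    n_selected = int(np.searchsorted(cum_cost, budget, side="right"))
    wins = np.zeros(len(values), dtype=int)
    wins[order[:n_selected]] = 1
    if n_selected == 0:
        return wins, None, None
    ce_star = ce[order[n_selected - 1]]
    return wins, ce_star, 1.0 / ce_star

values = [10.0, 6.0, 8.0, 3.0]
costs = [2.0, 3.0, 2.0, 3.0]
wins, ce_star, lambda_0 = max_return_threshold(values, costs, budget=5.0)
# Bidding lambda_0 * v_i wins exactly the impressions with v_i / c_i >= ce*.
bids = lambda_0 * np.asarray(values)
```

In this toy instance the bid $\lambda_0 v_i$ reaches the cost $c_i$ only for the selected impressions, which is how a single scalar parameter can encode the threshold.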

2.2. Auto-Bidding as Decision-Making

Eq. (3) gives the form of the optimal bid $b_i^*$ with bidding parameters $\lambda_j,\ j \in \{0, \dots, J\}$. However, in practice, the highly random and complex nature of the advertising environment prevents direct calculation of the bidding parameters. They must be carefully calibrated to adapt to the environment and dynamically adjusted as it evolves. This makes auto-bidding a sequential decision-making problem. To model it as such, we introduce states $\boldsymbol{s}_t \in \mathcal{S}$ to describe the real-time advertising status and actions $\boldsymbol{a}_t \in \mathcal{A}$ to adjust the corresponding bidding parameters. The auto-bidding agent takes action $\boldsymbol{a}_t$ at state $\boldsymbol{s}_t$ based on its policy $\pi$; the state then transitions to the next state $\boldsymbol{s}_{t+1}$ and the agent gains reward $\boldsymbol{r}_t \in \mathcal{R}$ according to the advertising environment dynamics $\mathcal{T}$. When the dynamics satisfy $\mathcal{T}: \boldsymbol{s}_t \times \boldsymbol{a}_t \rightarrow \boldsymbol{s}_{t+1} \times \boldsymbol{r}_t$, that is, when the next state and reward depend only on the current state and action, the process is called a Markov decision process (MDP). Otherwise, it is a non-Markovian decision process. We next describe the key items of the automated bidding agent in the industrial online advertising system:

  • State $\boldsymbol{s}_t$ describes the real-time advertising status at time period $t$, which includes 1) the remaining time of the advertiser; 2) the remaining budget; 3) the budget spend speed; 4) the real-time cost-efficiency (CPC); and 5) the average cost-efficiency (CPC).

  • Action $\boldsymbol{a}_t$ indicates the adjustment to the bidding parameters at time period $t$. Its dimension equals the number of bidding parameters $\lambda_j,\ j = 0, \dots, J$, and it is modeled as $(a_t^{\lambda_0}, \dots, a_t^{\lambda_J})$.

  • The reward $\boldsymbol{r}_t$ is the value contributed to the objective obtained within time period $t$.

  • A trajectory $\tau$ is the index of a sequence of states, actions, and rewards within an episode.
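
As a rough illustration of how these items might be stored in a bidding log, the sketch below defines a per-period record and an episode container. The class names `Step` and `Trajectory`, the field layout, and the scalar reward are assumptions made for this example; the paper does not prescribe a data format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    """One time period t of an episode (hypothetical log record)."""
    state: List[float]    # s_t: e.g. the five status features listed above
    action: List[float]   # a_t: one adjustment per bidding parameter
    reward: float         # r_t: value obtained in period t (assumed scalar)

@dataclass
class Trajectory:
    """A sequence of steps within one episode."""
    steps: List[Step] = field(default_factory=list)

    def total_return(self) -> float:
        # Sum of per-period rewards over the whole episode.
        return sum(step.reward for step in self.steps)

    def states(self) -> List[List[float]]:
        # State sequence (s_1, ..., s_T) in time order.
        return [step.state for step in self.steps]

episode = Trajectory()
episode.steps.append(Step(state=[0.9, 100.0, 1.0, 0.5, 0.5], action=[0.1], reward=2.0))
episode.steps.append(Step(state=[0.8, 90.0, 1.1, 0.6, 0.55], action=[-0.05], reward=3.5))
```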

In the online advertising system, learning a policy through direct interaction with the online environment is infeasible due to safety concerns. Nonetheless, historical bidding logs, which contain trajectories from a variety of bidding strategies, are accessible and provide a viable alternative. Prevalent auto-bidding methods predominantly leverage this offline data to craft effective policies. Our approach follows this practice and is elaborated in the subsequent sections.

3. AIGB Paradigm for Auto-bidding

To thoroughly investigate the auto-bidding problem, we conducted a series of statistical analyses of bidding trajectories, with detailed information available in Appendix A.5. These analyses provide the insight that devising an effective bidding strategy is essentially equivalent to optimizing a state trajectory. Armed with this insight, we propose a hierarchical paradigm for auto-bidding that prioritizes state trajectory optimization and subsequently generates actions aligned with the optimized trajectory.

For state trajectory optimization, we can employ a generative model to capture the joint distribution of the entire bidding trajectory and its associated returns, and then generate the trajectory distribution conditioned on the desired return. This approach enables us to address key auto-bidding challenges by employing state-of-the-art generative algorithms. This paper presents an implementation that utilizes Denoising Diffusion Probabilistic Models (DDPM). For action generation, several off-the-shelf methods can be used to predict the proper action given the target state trajectory. In this paper, we apply a widely used inverse dynamics model. The hierarchical paradigm divides auto-bidding into two supervised learning problems, offering several advantages, including enhanced interpretability and increased stability during training.

4. Diffusion Auto-bidding Model

In this section, we give a detailed introduction to the proposed Diffusion Auto-bidding Model (DiffBid). We first present the modeling of auto-bidding through diffusion models in Section 4.1.1. We then describe the forward process in Section 4.1.2, the reverse process in Section 4.1.3, and the training process in Section 4.2. Finally, we give the complexity analysis in Section 4.4.

4.1. Diffusion Modeling of Auto-bidding

4.1.1. Overview

We model this sequential decision-making problem through conditional generative modeling (Chen et al., 2021; Ajay et al., 2022) by maximum likelihood estimation (MLE):

(4) $$\max_{\theta} \mathbb{E}_{\tau \sim D}\left[\log p_{\theta}(\boldsymbol{x}_0(\tau) \mid \boldsymbol{y}(\tau))\right]$$

where $\tau$ is the trajectory index, $\boldsymbol{x}_0(\tau)$ is the original trajectory of states, and $\boldsymbol{y}(\tau)$ is the corresponding property. The goal is to estimate the conditional data distribution with $p_{\theta}$ so that the future states of a trajectory $\boldsymbol{x}_0(\tau)$ can be generated from the information $\boldsymbol{y}(\tau)$. For example, in the context of online advertising, $\boldsymbol{y}(\tau)$ can be the constraints or the total value of the entire trajectory. Under this setting, we can formalize the conditional diffusion modeling for auto-bidding:

(5) $$q(\boldsymbol{x}_{k+1}(\tau) \mid \boldsymbol{x}_k(\tau)), \quad p_{\theta}(\boldsymbol{x}_{k-1}(\tau) \mid \boldsymbol{x}_k(\tau), \boldsymbol{y}(\tau)),$$

where $q$ represents the forward process, in which noise is gradually added to the trajectory, while $p_{\theta}$ is the reverse process, in which a model is used for denoising. A detailed introduction to diffusion modeling can be found in Appendix A.2. The overall framework is presented in Figure 2. We discuss the two processes in detail in the following sections.

4.1.2. Forward Process via Diffusion over States

We model the forward process $q(\boldsymbol{x}_{k+1}(\tau) \mid \boldsymbol{x}_k(\tau))$ via diffusion over states, where:

(6) $$\boldsymbol{x}_k(\tau) := \left(\boldsymbol{s}_1, \dots, \boldsymbol{s}_t, \dots, \boldsymbol{s}_T\right)_k,$$

where $\boldsymbol{s}_t$ is modeled as a one-dimensional vector. $\boldsymbol{x}_k(\tau)$ is a noisy sequence of states and can be represented by a two-dimensional array whose first dimension indexes the time periods and whose second dimension holds the state values. Merely sampling states is not enough for an agent, which must also determine actions; this is handled by the inverse dynamics model in Section 4.1.3. Given $\boldsymbol{x}_k(\tau)$, we model the diffusion process as a Markov chain, where $\boldsymbol{x}_k(\tau)$ depends only on $\boldsymbol{x}_{k-1}(\tau)$:

(7) $$q(\boldsymbol{x}_k(\tau) \mid \boldsymbol{x}_{k-1}(\tau)) = \mathcal{N}\left(\boldsymbol{x}_k(\tau); \sqrt{1 - \beta_k}\, \boldsymbol{x}_{k-1}(\tau), \beta_k I\right),$$

As $k \rightarrow \infty$, $\boldsymbol{x}_k(\tau)$ approaches a sequence of standard Gaussian variables, from which we can sample via the re-parameterization trick and then gradually denoise the trajectory to produce the final state sequence. For the design of $\beta_k,\ k = 1, \dots, K$, we apply the cosine schedule (Nichol and Dhariwal, 2021), which smoothly increases the diffusion noise using a cosine function to prevent sudden changes in the noise level. The details of the noise schedule can be found in the appendix.
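
A minimal sketch of the forward process follows, assuming the cosine schedule in the form given by Nichol and Dhariwal (offset $s = 0.008$, with $\beta_k$ clipped at 0.999). The exact schedule used by DiffBid is deferred to the appendix and may differ. The sketch also uses the standard DDPM closed form $\boldsymbol{x}_k = \sqrt{\bar{\alpha}_k}\,\boldsymbol{x}_0 + \sqrt{1 - \bar{\alpha}_k}\,\boldsymbol{\epsilon}$, which follows from composing Eq. (7) over $k$ steps. The function names and the array shape of 48 periods by 5 state features are illustrative assumptions.

```python
import numpy as np

def cosine_beta_schedule(num_steps, s=0.008, max_beta=0.999):
    """Cosine noise schedule; returns an array holding beta_1, ..., beta_K."""
    k = np.arange(num_steps + 1)
    f = np.cos(((k / num_steps) + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]                          # cumulative product of alphas
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)

def q_sample(x0, k, betas, rng):
    """Draw x_k ~ q(x_k | x_0) directly, without simulating k single steps."""
    alpha_bar_k = np.prod(1.0 - betas[:k])        # equals 1.0 when k == 0
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_k) * x0 + np.sqrt(1.0 - alpha_bar_k) * noise

rng = np.random.default_rng(0)
betas = cosine_beta_schedule(num_steps=100)
x0 = np.ones((48, 5))                             # hypothetical: T = 48 periods, 5 features
x_mid = q_sample(x0, k=50, betas=betas, rng=rng)  # partially corrupted trajectory
x_end = q_sample(x0, k=100, betas=betas, rng=rng) # close to pure Gaussian noise
```

Sampling $\boldsymbol{x}_k$ in one shot is what makes training practical, since each training example needs only a random diffusion step rather than a full simulated chain.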

4.1.3. Reverse Process for Bid Generation

Following (Ho and Salimans, 2021; Ajay et al., 2022), we use a classifier-free guidance strategy with low-temperature sampling to guide bid generation and to extract high-likelihood trajectories from the dataset. During the training phase, we jointly train the unconditional model $\epsilon_{\theta}(\boldsymbol{x}_k(\tau), k)$ and the conditional model $\epsilon_{\theta}(\boldsymbol{x}_k(\tau), \boldsymbol{y}(\tau), k)$ by randomly dropping out conditions. During generation, a linear combination of the conditional and unconditional score estimates is used:

(8) $$\hat{\epsilon}_k := \epsilon_{\theta}(\boldsymbol{x}_k(\tau), k) + \omega \left(\epsilon_{\theta}\left(\boldsymbol{x}_k(\tau), \boldsymbol{y}(\tau), k\right) - \epsilon_{\theta}\left(\boldsymbol{x}_k(\tau), k\right)\right),$$

where the scale $\omega$ is applied to extract the most suitable portion of the trajectories in the dataset that co-occur with $\boldsymbol{y}(\tau)$. After that, we can use DiffBid to produce bidding parameters by sampling from $p_{\theta}(\boldsymbol{x}_{k-1}(\tau) \mid \boldsymbol{x}_k(\tau), \boldsymbol{y}(\tau))$:

(9) $$\boldsymbol{x}_{k-1}(\tau) \sim \mathcal{N}\left(\boldsymbol{x}_{k-1}(\tau) \mid \boldsymbol{\mu}_{\theta}\left(\boldsymbol{x}_k(\tau), \boldsymbol{y}(\tau), k\right), \boldsymbol{\Sigma}_{\theta}\left(\boldsymbol{x}_k(\tau), k\right)\right)$$

where a widely used parameterization is $\boldsymbol{\mu}_{\theta}(\boldsymbol{x}_k(\tau), \boldsymbol{y}(\tau), k) = \frac{1}{\sqrt{\alpha_k}}\left(\boldsymbol{x}_k(\tau) - \frac{\beta_k}{\sqrt{1 - \overline{\alpha}_k}} \hat{\epsilon}_k\right)$ and $\boldsymbol{\Sigma}_{\theta}(\cdot) = \beta_k$, in which $\alpha_k = 1 - \beta_k$ and $\overline{\alpha}_k = \prod_{i=1}^{k} \alpha_i$.
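
The guided noise estimate of Eq. (8) and the mean parameterization above can be sketched in a few lines. In this sketch the two noise estimates are plain arrays standing in for the outputs of a trained network $\epsilon_{\theta}$, which is not shown. The function names and the 1-indexed step convention are assumptions for illustration.

```python
import numpy as np

def guided_noise(eps_uncond, eps_cond, omega):
    """Eq. (8): linear combination of unconditional and conditional estimates."""
    return eps_uncond + omega * (eps_cond - eps_uncond)

def reverse_mean(x_k, eps_hat, k, betas):
    """Mean of p_theta(x_{k-1} | x_k, y) under the parameterization above.

    `betas` holds beta_1, ..., beta_K, so step k (1-indexed) reads betas[k - 1].
    """
    alphas = 1.0 - betas
    alpha_k = alphas[k - 1]
    alpha_bar_k = np.prod(alphas[:k])             # product of alpha_1 .. alpha_k
    scaled_noise = betas[k - 1] / np.sqrt(1.0 - alpha_bar_k) * eps_hat
    return (x_k - scaled_noise) / np.sqrt(alpha_k)

# Sanity check at k = 1: with the true noise, the mean recovers x_0 exactly.
betas = np.array([0.1, 0.2])
x0 = np.array([[1.0, -2.0], [0.5, 3.0]])
eps = np.array([[0.3, -0.1], [1.2, 0.4]])
x1 = np.sqrt(1.0 - betas[0]) * x0 + np.sqrt(betas[0]) * eps
recovered = reverse_mean(x1, guided_noise(eps, eps, omega=1.5), k=1, betas=betas)
```

Setting $\omega = 0$ ignores the condition, $\omega = 1$ uses the conditional estimate alone, and larger values push samples further toward trajectories associated with $\boldsymbol{y}(\tau)$.
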
When serving at time period t𝑡titalic_t, the agent first sample a initial trajectory xK(τ)𝒩(0,I)similar-tosubscriptsuperscript𝑥𝐾𝜏𝒩0𝐼x^{\prime}_{K}(\tau)\sim\mathcal{N}(0,I)italic_x start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT ( italic_τ ) ∼ caligraphic_N ( 0 , italic_I ) and assign the history states 𝒔0:tsubscript𝒔:0𝑡\boldsymbol{s}_{0:t}bold_italic_s start_POSTSUBSCRIPT 0 : italic_t end_POSTSUBSCRIPT into it. Then, we can sample predicted states with the reverse process recursively by

(10) $\boldsymbol{x}'_{k-1}(\tau)=\boldsymbol{\mu}_{\theta}(\boldsymbol{x}'_{k}(\tau),\boldsymbol{y}(\tau),k)+\sqrt{\beta_{k}}\boldsymbol{z}$

where $\boldsymbol{z}\sim\mathcal{N}(0,I)$. Given $\boldsymbol{x}'_{0}(\tau)$, we can extract the next predicted state $\boldsymbol{s}'_{t+1}$ and determine how much to bid to reach that state. In this setting, we apply the widely used inverse dynamics (Agrawal et al., 2016; Pathak et al., 2018) with a non-Markovian state sequence to determine the current bidding parameters at time period $t$:

(11) $\boldsymbol{\hat{a}}_{t}=f_{\phi}(\boldsymbol{s}_{t-L:t},\boldsymbol{s}'_{t+1}),$

where $\boldsymbol{\hat{a}}_{t}\in\mathbb{R}^{J}$ contains the predicted bidding parameters (i.e., $\lambda_{i},\ i=1,...,n$) at time $t$, and $L$ is the length of the history states. The inverse dynamics function $f_{\phi}$ can be trained with the same offline logs as the reverse process. This design disentangles the learning of states and actions, making it easier to learn the connection between states and thus achieving better empirical performance. The overall procedure is summarized in Algorithm 1.
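The serving procedure of Eqs (9)-(11) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: `eps_model` and `inv_dyn` are hypothetical stand-ins for $\epsilon_{\theta}$ and $f_{\phi}$, the linear $\beta$ schedule is an arbitrary choice, re-imposing the history after every step is one possible reading of "assign the history states into it", and the noise is set to zero on the final step (a common DDPM convention that Eq (10) does not state).

```python
import numpy as np

def make_schedule(K, beta_min=1e-4, beta_max=2e-2):
    """Linear beta schedule (an assumption; the paper's schedule may differ)."""
    betas = np.linspace(beta_min, beta_max, K)
    alphas = 1.0 - betas                 # alpha_k = 1 - beta_k
    alpha_bars = np.cumprod(alphas)      # alpha_bar_k = prod_{i<=k} alpha_i
    return betas, alphas, alpha_bars

def sample_trajectory(eps_model, history, y, T, K, rng):
    """Reverse process of Eq (10), keeping the observed states s_{0:t} fixed."""
    betas, alphas, alpha_bars = make_schedule(K)
    t_obs, state_dim = history.shape
    x = rng.standard_normal((T, state_dim))       # x'_K ~ N(0, I)
    x[:t_obs] = history                           # assign history states
    for k in range(K, 0, -1):                     # k = K, ..., 1
        i = k - 1                                 # 0-based index into the schedule
        eps_hat = eps_model(x, y, k)
        mu = (x - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps_hat) / np.sqrt(alphas[i])
        z = rng.standard_normal(x.shape) if k > 1 else np.zeros_like(x)
        x = mu + np.sqrt(betas[i]) * z
        x[:t_obs] = history                       # re-impose the known states
    return x

def next_action(inv_dyn, history, x0, L):
    """Eq (11): a_hat_t = f_phi(s_{t-L:t}, s'_{t+1})."""
    t = history.shape[0] - 1
    window = history[max(0, t - L): t + 1]        # s_{t-L:t}
    return inv_dyn(window, x0[t + 1])             # x0[t + 1] is the predicted s'_{t+1}

# Toy usage with stand-in models (both hypothetical).
rng = np.random.default_rng(0)
history = np.zeros((3, 2))                        # s_{0:2}, two state features
eps_model = lambda x, cond, k: np.zeros_like(x)
inv_dyn = lambda window, s_next: s_next - window[-1]
x0 = sample_trajectory(eps_model, history, y=np.array([1.0]), T=8, K=5, rng=rng)
a_hat = next_action(inv_dyn, history, x0, L=2)
```

Only the first predicted state after the history is consumed per decision; at the next time period the procedure is rerun with the newly observed state appended to the history.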

4.2. DiffBid Training

Following (Ho et al., 2020), we train DiffBid to approximate the given noise and the returns in a supervised manner. Given a bidding trajectory $\boldsymbol{x}_{0}(\tau)$, we have its corresponding returns (e.g., the values the advertiser received), the constraints the model should obey, and the history states $s_{l},\ l=1,...,t$ before time $t+1$. We then train the reverse process model $p_{\theta}$, which is parameterized through the noise model $\epsilon_{\theta}$, together with the inverse dynamics $f_{\phi}$ through:

(12) $\mathcal{L}(\theta,\phi)=\mathbb{E}_{k,\tau\in\mathcal{D}}\left[\|\epsilon-\epsilon_{\theta}(\boldsymbol{x}_{k}(\tau),\boldsymbol{y}(\tau),k)\|^{2}\right]+\mathbb{E}_{(\boldsymbol{s}_{t-L:t},\boldsymbol{a}_{t},\boldsymbol{s}'_{t+1})\in\mathcal{D}}\left[\|\boldsymbol{a}_{t}-f_{\phi}(\boldsymbol{s}_{t-L:t},\boldsymbol{s}'_{t+1})\|^{2}\right],$

In the training process, we randomly sample a bidding trajectory $\tau$ and a diffusion step $k$, then construct a noisy trajectory $\boldsymbol{x}_{k}(\tau)$ and predict the noise through Eq (8). Following (Ho and Salimans, 2021), we randomly drop the conditions $\boldsymbol{y}(\tau)$ with probability $p$ during training to enhance the robustness of DiffBid. The process is presented in Algorithm 2.
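A single-sample estimate of the objective in Eq (12), including condition dropout, might look like the sketch below. Several pieces are assumptions rather than details from the paper: the noising step uses the standard DDPM closed form $\boldsymbol{x}_{k}=\sqrt{\overline{\alpha}_{k}}\boldsymbol{x}_{0}+\sqrt{1-\overline{\alpha}_{k}}\epsilon$, a dropped condition is represented by an all-zero vector, and `eps_model` and `inv_dyn` are hypothetical callables standing in for $\epsilon_{\theta}$ and $f_{\phi}$.

```python
import numpy as np

def diffbid_loss(eps_model, inv_dyn, x0, y, s_hist, a_t, s_next,
                 alpha_bars, p_drop, rng):
    """Single-sample estimate of the two-term objective in Eq (12)."""
    K = len(alpha_bars)
    k = int(rng.integers(1, K + 1))              # diffusion step k in {1, ..., K}
    eps = rng.standard_normal(x0.shape)          # target noise
    a_bar = alpha_bars[k - 1]
    # Standard DDPM closed-form noising (assumed to match the paper's forward process).
    x_k = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    # Condition dropout: use an all-zero "null" condition with probability p_drop.
    y_in = np.zeros_like(y) if rng.random() < p_drop else y
    noise_loss = np.sum((eps - eps_model(x_k, y_in, k)) ** 2)
    action_loss = np.sum((a_t - inv_dyn(s_hist, s_next)) ** 2)
    return noise_loss + action_loss

# Toy usage with stand-in models (both hypothetical).
rng = np.random.default_rng(0)
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 2e-2, 10))
x0 = rng.standard_normal((8, 2))                 # one trajectory: 8 steps, 2 state features
y = np.array([1.0, 1.0])                         # e.g. normalized return and a constraint flag
eps_model = lambda x, cond, k: np.zeros_like(x)
inv_dyn = lambda s_hist, s_next: s_next - s_hist[-1]
loss = diffbid_loss(eps_model, inv_dyn, x0, y, x0[:3], np.zeros(2), x0[3],
                    alpha_bars, p_drop=0.2, rng=rng)
```

In practice the two terms would be averaged over a mini-batch and minimized with a gradient-based optimizer; the sketch only shows how one sample contributes to each term.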

4.3. Design of Conditions.

In this section, we present approaches transforming industrial metrics into conditions of DiffBid.

4.3.1. Generation with Returns

For each trajectory $\tau$, we take the total value the advertiser received as the return $R(\tau)=\sum_{t=1}^{T}r_{t}$. We normalize the return by:

(13) $R=\frac{R(\tau)-R_{\text{min}}}{R_{\text{max}}-R_{\text{min}}},$

where $R_{\text{min}}$ and $R_{\text{max}}$ are the smallest and the largest returns in the dataset. Through Eq. 13 we normalize the return into $[0,1]$ and merge it into $\boldsymbol{y}(\tau)$. Subsequently, we train the model to generate trajectories conditioned on the normalized returns. Note that trajectories with more value received have higher normalized returns. Thus $R=1$ indicates the best trajectory with the highest value, which best fits the advertisers’ needs. At generation time, we simply set $R=1$ and generate the trajectory under the max-return condition for the advertiser.
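As a small worked example of Eq (13), the snippet below normalizes a list of trajectory returns with min-max scaling. The return values are made up for illustration, and the guard for a degenerate dataset is an addition not discussed in the paper.

```python
def normalize_returns(returns):
    """Min-max normalization of trajectory returns into [0, 1], as in Eq (13)."""
    r_min, r_max = min(returns), max(returns)
    if r_max == r_min:
        # Eq (13) is undefined when every trajectory has the same return.
        raise ValueError("all returns are equal; cannot normalize")
    return [(r - r_min) / (r_max - r_min) for r in returns]

returns = [120.0, 300.0, 480.0]            # made-up total values of three trajectories
normalized = normalize_returns(returns)    # [0.0, 0.5, 1.0]
```

During training each trajectory is paired with its own normalized return; at generation time the condition is simply fixed to 1.0.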

4.3.2. Generation with Constraints or Human Feedback

In MCB, the cumulative performance related to the constraints within a given episode should be controlled so as not to exceed the advertisers’ expectations. In such a setting, we can design $\boldsymbol{y}(\tau)$ to control the generation process. For example, in the Target-CPC setting, we can maintain a binary variable $E$ to indicate whether the final CPC stays within the given constraint $C$:

(14) $E=\text{I}_{x\leq C}(x)$

where $x=\frac{\sum_{i}c_{i}o_{i}}{\sum_{i}p_{i}o_{i}}$ is defined in Eq (2). We can then normalize $x$ into $[0,1]$ through min-max normalization for simplification. $E$ can be used to indicate whether trajectory $\tau$ breaks the CPC constraint. We can also design $\boldsymbol{y}(\tau)$ to include $E=1$ to make the model generate bids that do not break the CPC constraint. Sometimes it is also important to adjust the bidding parameters given real-time feedback provided by the advertiser to enable flexibility. Here we use two example indicators that reflect the experience of advertisers:

  1. Smoothness: an advertiser may expect the cost curve to be as smooth as possible to avoid sudden changes. By defining $x=\frac{1}{T}\sum_{t}\left|cost_{t}-cost_{t-1}\right|$, we can model it as a binary variable $S$ indicating whether this cost change between adjacent time periods exceeds a threshold, as in Eq (14).

  2. Early/Late Spend: an advertiser may expect the budget to be spent in the morning or in the evening when there are promotions. Here we model the ratio of cost in the first half of the day through $x=\frac{\sum_{t=0}^{T/2}cost_{t}}{\sum_{t=0}^{T}cost_{t}}$, and use a binary variable to indicate whether the spend in the first half of the day exceeds a certain threshold $C$, as in Eq (14).

We can also compose several constraints together to form 𝒚(τ)𝒚𝜏\boldsymbol{y}(\tau)bold_italic_y ( italic_τ ) to guide the model to generate bid parameters that adhere to different constraints. In this setting, 𝒚(τ)𝒚𝜏\boldsymbol{y}(\tau)bold_italic_y ( italic_τ ) will be a vector.
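The indicators above can be composed into a condition vector as in the following sketch. The function and argument names are hypothetical, and the interpretation of `c`, `p`, `o` as per-impression cost, click-related quantity, and win indicator is an assumption about Eq (2), which is not reproduced here. The direction of each flag is also a convention chosen for this example: `E` and `S` are 1.0 when the metric stays within its threshold (the form of Eq (14)), while the early-spend flag is 1.0 when the ratio exceeds its threshold.

```python
def indicator_leq(x, threshold):
    """Eq (14)-style indicator: 1.0 if x <= threshold, else 0.0."""
    return 1.0 if x <= threshold else 0.0

def cpc(c, p, o):
    """x = sum_i c_i o_i / sum_i p_i o_i over impression opportunities."""
    return sum(ci * oi for ci, oi in zip(c, o)) / sum(pi * oi for pi, oi in zip(p, o))

def smoothness(step_costs):
    """Mean absolute change in cost between adjacent time periods."""
    T = len(step_costs)
    return sum(abs(step_costs[t] - step_costs[t - 1]) for t in range(1, T)) / T

def early_spend_ratio(step_costs):
    """Share of the total cost spent in the first half of the episode."""
    half = len(step_costs) // 2
    return sum(step_costs[:half]) / sum(step_costs)

def build_condition(R, c, p, o, step_costs, cpc_limit, smooth_limit, early_limit):
    """Compose y(tau) = [return, CPC flag, smoothness flag, early-spend flag]."""
    E = indicator_leq(cpc(c, p, o), cpc_limit)
    S = indicator_leq(smoothness(step_costs), smooth_limit)
    early = 1.0 - indicator_leq(early_spend_ratio(step_costs), early_limit)
    return [R, E, S, early]

# Made-up numbers: CPC = 5.0, smoothness = 1.5, early-spend ratio = 0.375.
y = build_condition(R=1.0, c=[2.0, 1.0, 3.0], p=[0.5, 0.2, 0.5], o=[1, 0, 1],
                    step_costs=[1.0, 2.0, 4.0, 1.0],
                    cpc_limit=6.0, smooth_limit=2.0, early_limit=0.5)
# y == [1.0, 1.0, 1.0, 0.0]
```

At generation time the same vector layout is used, with each entry set to the value the advertiser wants the generated trajectory to satisfy.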

4.4. Complexity Analysis

The complexity analysis of DiffBid covers the training process and the inference process. For training, given that the time complexity of the noise prediction model $\epsilon_{\theta}$ is $\mathcal{O}(T_{1})$ and that of the inverse dynamics model $f_{\phi}$ is $\mathcal{O}(T_{2})$, the complexity of a training epoch is $\mathcal{O}(|\mathcal{B}|(T_{1}+T_{2}))$. The training complexity is therefore linear in the input, given that $T_{1}$ and $T_{2}$ are relatively fixed, so the training of DiffBid is efficient. For generation, given the total number of diffusion steps $K$ and the trajectory length $L$, the time complexity of inference is $\mathcal{O}(KL(T_{1}+T_{2}))$, which scales linearly with $K$. In image generation, $K$ is usually very large to ensure good generation quality, which makes inference inefficient. For bid generation, however, we find that $K$ need not be very large: a relatively small $K$ already produces promising results.
Moreover, in auto-bidding, a higher tolerance for latency is acceptable, enabling the use of a relatively larger $K$.

5. Theoretical Analysis

In this section, we theoretically analyze the properties of DiffBid. Specifically, we show that DiffBid, which uses MLE as its objective, has a corresponding non-Markovian decision problem (Majeed and Hutter, 2018; Qin et al., 2023; Gaon and Brafman, 2020; Mutti et al., 2022). The detailed proofs can be found in Appendix A.7.

Lemma 5.1 (MLE as non-Markovian decision-making).

Assuming the Markovian transition $p_{\gamma^{*}}(s_{t+1}|s_{t},a_{t})$ is known and the ground-truth conditional state distribution $p^{*}(s_{t+1}|s_{0:t})$ for demonstration sequences is accessible, we can construct a non-Markovian sequential decision-making problem based on a reward function $r_{\alpha}(s_{t+1},s_{0:t}):=\log\int p_{\alpha}(a_{t}|s_{0:t})p_{\gamma^{*}}(s_{t+1}|s_{t},a_{t})da_{t}$ for an arbitrary energy-based policy $p_{\alpha}(a_{t}|s_{0:t})$. Its objective is

(15) $\sum_{t=0}^{T}\mathbb{E}_{p^{*}(s_{0:t})}\left[V^{p_{\alpha}}(s_{0:t})\right]=\mathbb{E}_{p^{*}(s_{0:T})}\left[\sum_{t=0}^{T}\sum_{k=t}^{T}r_{\alpha}(s_{k+1};s_{0:k})\right]$

where $V^{p_{\alpha}}(s_{0:t}):=\mathbb{E}_{p^{*}(s_{t+1:T}|s_{0:t})}\left[\sum_{k=t}^{T}r_{\alpha}(s_{k+1};s_{0:k})\right]$ is the value function of $p_{\alpha}$. This objective yields the same optimal policy as the Maximum Likelihood Estimation $\mathbb{E}_{p^{*}(s_{0:T})}\left[\log p_{\theta}(s_{0:T})\right]$.

Remarks. This analysis shows that DiffBid with the MLE objective has a corresponding non-Markovian decision problem, and that their optima are equivalent. It means that DiffBid does not require the MDP assumption and is thus more powerful in handling randomness and sparse returns, as in the advertising environment.
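One way to see why the reward $r_{\alpha}$ of Lemma 5.1 relates to MLE is the chain rule of probability. The lines below are only a sketch and not the proof from the appendix. They assume that the state model $p_{\theta}$ is the one induced by the policy $p_{\alpha}$ together with the known transition $p_{\gamma^{*}}$, and they index the transitions inside $s_{0:T}$ by $t=0,\dots,T-1$, a convention chosen here for illustration.

```latex
\begin{align*}
p_{\theta}(s_{t+1}\mid s_{0:t})
  &= \int p_{\alpha}(a_t\mid s_{0:t})\,p_{\gamma^{*}}(s_{t+1}\mid s_t,a_t)\,da_t
   = \exp r_{\alpha}(s_{t+1},s_{0:t}),\\
\log p_{\theta}(s_{0:T})
  &= \log p(s_0) + \sum_{t=0}^{T-1}\log p_{\theta}(s_{t+1}\mid s_{0:t})
   = \log p(s_0) + \sum_{t=0}^{T-1} r_{\alpha}(s_{t+1},s_{0:t}).
\end{align*}
```

Under these assumptions, maximizing the expected log-likelihood over demonstration sequences amounts to maximizing the expected sum of per-step rewards $r_{\alpha}$, up to the term $\log p(s_0)$, which does not depend on the policy.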

6. Experiments

Table 1. Performance comparison with baselines in different settings, including different data scales and budgets in Max Return bidding. improv indicates the relative improvement of DiffBid over the most competitive baseline. The best results are bolded and the second-best results are underlined.
Training Dataset  Budget  USCB    BCQ     CQL     IQL     DT      DiffBid  improv
USCB-5K           1.5K    454.25  454.72  461.82  456.80  477.39  480.76   0.71%
USCB-5K           2.0K    482.67  483.50  475.78  486.56  507.30  511.17   0.76%
USCB-5K           2.5K    497.66  498.77  481.37  518.27  527.88  531.29   0.65%
USCB-5K           3.0K    500.60  501.86  491.36  549.19  550.66  556.32   1.03%
USCBEx-5K         1.5K    454.25  453.74  358.43  464.69  378.64  475.62   2.35%
USCBEx-5K         2.0K    482.67  487.63  356.80  529.36  439.03  544.38   2.84%
USCBEx-5K         2.5K    497.66  510.75  356.41  613.67  505.43  624.29   1.73%
USCBEx-5K         3.0K    500.60  512.18  355.42  670.65  574.79  678.73   1.17%
USCBEx-50K        1.5K    454.25  458.64  435.06  446.23  396.24  495.57   8.05%
USCBEx-50K        2.0K    482.67  491.72  431.49  533.58  478.29  551.73   3.40%
USCBEx-50K        2.5K    497.66  513.23  428.39  592.32  554.48  606.34   2.37%
USCBEx-50K        3.0K    500.60  526.21  425.29  633.26  611.50  644.88   1.83%

6.1. Experimental Setup

6.1.1. Experimental Environment

The simulated experimental environment is a manually built offline real advertising system (RAS) as in (Mou et al., 2022). Specifically, the RAS is composed of two consecutive stages, where the auction mechanisms resemble those in the RAS. We consider the bidding process in a day, where the episode is divided into 96 time steps. Thus, the duration between any two adjacent time steps $t$ and $t+1$ is 15 minutes. The number of impression opportunities between time steps $t$ and $t+1$ fluctuates from 100 to 500. Detailed parameters of the RAS are shown in Table 5. We keep the parameters the same for all experiments.

6.1.2. Data Collection

We use the widely applied auto-bidding RL method USCB in the online environment to generate the bidding logs for offline RL training. This results in a total of 5,000 trajectories for the base dataset and 50,000 for a larger one. To increase the diversity of the action space, we also add random exploration to generate a dataset with more noise. The above process results in three datasets: USCB-5K, USCBEx-5K, and USCBEx-50K, where USCBEx indicates USCB logs with random exploration data.

6.1.3. Baselines.

We use the state-of-the-art auto-bidding method USCB as well as four other recently proposed offline RL methods as our baselines. The details of the baselines are as follows:

  • USCB (He et al., 2021): an RL method designed for real-time bidding that dynamically adjusts parameters to achieve the optimum. It has outperformed many RL baselines and is also the base policy used to collect the data for offline training.

  • BCQ (Fujimoto et al., 2019): a classic offline RL method that requires no interaction with the environment.

  • CQL (Kumar et al., 2020): addresses the limitations of offline RL methods by learning a conservative Q-function such that the expected value of a policy under it lower-bounds its true value.

  • IQL (Kostrikov et al., 2021): an offline RL method that does not require evaluating actions outside of the dataset, yet enables substantial improvement of the learned policy beyond the best behavior in the data through generalization.

  • DT (Chen et al., 2021): a prevalent generative method based on the transformer architecture for sequential decision-making.

6.1.4. Implementation Details

For the implementation of baselines, we use the default hyper-parameters suggested in their papers and also tune them to the best of our ability. For DiffBid, the number of diffusion steps is searched within $\{5,10,20,30,50\}$. $\gamma$ is set to 0.008. $L$ is searched in $\{1,2,3\}$. $\omega$ for the noise schedule is set to 0.2 empirically. The batch size is set to 2% of all training trajectories. The total number of training epochs is set to 500. For the implementation of $p_{\theta}$, we adopt the most widely used model, U-Net, for diffusion modeling with hidden sizes of 128 and 256. We use the Adam optimizer with a learning rate of 1e-4 to optimize the model. The condition dropout ratio is set to 0.2 during training. We update the model with momentum updates over a period of 4 steps.

6.1.5. Evaluation

For evaluation, we randomly initialize a multi-agent advertising environment with USCB as the base auto-bidding agents and use other methods to compete with these agents. We test the performance under 4 different budgets, 1500, 2000, 2500, and 3000, to test the generalization under different budget scales. We use the cumulative reward as the evaluation metric, which reflects the total gain received by the target agent. For each method, we randomly initialize 50 times and report the average of top-5 scores.
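The reporting rule described above (50 random initializations, average of the top-5 scores) can be sketched as follows. The scores are made-up placeholders and `top_k_mean` is a hypothetical helper name, not part of the authors' evaluation code.

```python
def top_k_mean(scores, k=5):
    """Average of the k highest scores among all runs."""
    if len(scores) < k:
        raise ValueError("need at least k scores")
    return sum(sorted(scores, reverse=True)[:k]) / k

scores = [float(s) for s in range(1, 51)]   # 50 made-up run scores: 1.0, ..., 50.0
result = top_k_mean(scores)                 # (46 + 47 + 48 + 49 + 50) / 5 = 48.0
```

Averaging only the top runs reports the best-case behavior of each method, so it is worth reading alongside a stability analysis across random seeds.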

6.2. Performance Evaluation

The performance against baselines is shown in Table 1, which reports the cumulative reward of all models under different budgets. We make the following observations.

First, offline RL methods generally outperform the state-of-the-art auto-bidding method, USCB. This finding underscores the advantages of leveraging historical bidding data to train RL agents. Offline RL methods such as BCQ, IQL, and DT achieve higher cumulative rewards than USCB in most budget scenarios. Their advantage can be attributed to the ability to learn from past bidding experiences without interaction with a simulation environment. This mitigates the challenges associated with inconsistencies between the online bidding environment and the offline bidding environment, leading to policies that are better aligned with real-world scenarios.

Second, DiffBid is the top-performing approach among all the methods evaluated. In all budget scenarios and training datasets, DiffBid achieves the highest cumulative reward. This highlights the efficacy of optimizing bidding strategies by directly modeling the correlation between the returns and entire trajectories. By decoupling the computational complexity from horizon length, DiffBid achieves superior decision-making capabilities, outperforming traditional RL methods in both foresight and strategy.

Third, the size of the training dataset affects model performance. When comparing the "USCB-5K" and "USCBEx-50K" settings, a larger training dataset generally leads to improved cumulative rewards. This finding underscores the significance of data size in training RL models for automated bidding. A richer dataset allows the models to capture more diverse bidding scenarios and make more informed decisions, ultimately resulting in better performance.

Finally, DiffBid is resilient to noise. In real-world advertising environments, there can be inherent uncertainty and variability in the bidding process due to factors like market dynamics and competitor behavior. DiffBid appears to handle such noise more effectively than the RL baselines, maintaining competitive performance even when bidding outcomes are less predictable.
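The improv column of Table 1 can be reproduced from the other columns, assuming it is computed against the strongest baseline in each row. The snippet below checks this for the USCB-5K row with a 1.5K budget; `relative_improvement` is a helper name introduced here for illustration.

```python
def relative_improvement(ours, baselines):
    """Percentage gain of `ours` over the best baseline score."""
    best = max(baselines)
    return 100.0 * (ours - best) / best

# USCB-5K, budget 1.5K (values copied from Table 1).
row = {"USCB": 454.25, "BCQ": 454.72, "CQL": 461.82, "IQL": 456.80, "DT": 477.39}
improv = round(relative_improvement(480.76, row.values()), 2)   # 0.71
```

The same computation on the USCBEx-50K row with a 1.5K budget, where BCQ (458.64) is the strongest baseline, gives 8.05, matching the table.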

6.3. Ablation Study

Table 2. Ablation Study
Model USCBEx-5K USCBEx-50K
DiffBid 2280.12 2395.60
DiffBid w/o cond 1812.64 1852.21
DiffBid w/o non-mkv 2254.78 2287.41

To study different parts of the proposed DiffBid, we run the model without a certain module to see if the removed corresponding module will result in a performance drop. The result of the ablation study is shown in Table 2. Due to the space limitation, we only provide the results on USCBEx-5K and USCBEx-50K. w/o cond refers to the DiffBid with the condition set to 0.0 (rather than 1.0). w/o non-mkv refers to the situation where we only use the current state and the predicted next state to generate the bidding coefficient. From the table, we find both of the two parts contribute to the final result, and removing either of them will result in a performance drop. It verifies the effectiveness of the proposed methods in boosting DiffBid’s performance for auto-bidding.

6.4. In-depth Analysis

6.4.1. Study of State Transition

Here we compare the state transition of the baseline method USCB and our proposed method DiffBid. The results for grouped and non-grouped state transitions during a day are shown in Figure 3, which plots the remaining-budget ratio against the time steps in one day. From the figure, we can observe that under USCB, most advertisers do not exhaust their budgets. This is attributed to the inconsistency between the offline virtual environment and the real online environment faced by USCB. In contrast, budget completion improves under DiffBid, where most of the advertisers spend more than 80% of their budgets. One possible reason is that DiffBid finds that trajectories with a high budget completion ratio also have a high cumulative reward, and thus tends to generate such trajectories. Moreover, advertisers with small budgets tend to spend money in the afternoon. This is because the impressions in the afternoon offer higher cost-effectiveness, albeit in limited quantity.

6.4.2. Performance under Constraints and Feedbacks.

Figure 3. State Transition in One Episode. Panels: (a) USCB, (b) DiffBid.
Figure 4. Performance under CPC constraint. Panels: (a) IQL, (b) DiffBid.
Figure 5. Performance of Human Feedback. Panels: (a) Smoothness, (b) Early Spend.

We additionally investigate DiffBid’s multi-objective optimization capability under specific constraints, comparing its performance with offline RL. Specifically, we choose the CPC exceeding ratio and the overall return as metrics and examine the ability of DiffBid and IQL to control the overall CPC exceeding ratio while maximizing the overall return. During training, we set different CPC thresholds as in Eq (14). At test time, we have DiffBid generate trajectories under the expected CPC. Figure 4 shows the exceeding ratio and overall return under different CPC constraints and training settings. From the figure, we find that DiffBid can control diverse levels of exceeding ratio while keeping the return intact, surpassing IQL by a significant margin. Consequently, DiffBid holds a distinct advantage in addressing MCB problems. We also study the performance under different advertiser feedback. During training, we split the trajectories into high and low levels using the thresholds of Eq. (14), and learn the conditional distribution under each level. During generation, we adjust the condition, generate the corresponding samples, and summarize the metrics. The distributions of the metrics for the low level, the high level, and the original trajectories are shown in Figure 5. We find that the trajectories obtained from deploying DiffBid are well controlled by the condition.

6.4.3. Impact of Diffusion Steps

We also study the overall performance under different diffusion steps, which is an important factor in influencing the efficiency and performance. The overall impact of diffusion steps with respect to different budgets is illustrated in Figure 6. From the figure, we have the following discoveries. First of all, we observe that diffusion steps have a larger impact on advertisers with small budgets (1500 yuan). Secondly, larger budgets are not sensitive to the diffusion steps, where we can get the best result in most situations within 30 diffusion steps.

Figure 6. In-depth Analysis. (a) Impact of Diffusion Steps. (b) Stability.

6.4.4. Stability

In this study, we randomly initialized the parameters of three models (CQL, IQL, and DiffBid) and conducted thirty training trials for each to examine the stability of their performance. As depicted in Figure 6(b), the RL-based models, CQL and IQL, tended to be unstable under varying random seeds. Notably, IQL performed slightly better than CQL, which may be attributed to its design optimized for conservative regularization. In contrast, the generative model DiffBid was remarkably stable, with significantly fewer failures than its RL counterparts.

6.5. Online A/B Test

To further substantiate the effectiveness of DiffBid, we deployed it on the Alibaba advertising platform and compared it against the baseline IQL method (Kostrikov et al., 2021), which performs best among various auto-bidding methods.

Table 3. Online A/B Test Result.
Method | #Plan | Budget | Cost | Buycnt | GMV | ROI
--- | --- | --- | --- | --- | --- | ---
Baseline | 2068 | 886744 | 834426.104 | 23584.6836 | 1853823 | 2.221
DiffBid | 2068 | 886744 | 829992.384 | 24078.6883 | 1905954 | 2.296
Compare | - | - | -0.53% | +2.09% | +2.81% | +3.36%

The online A/B test was conducted from February 01, 2024, to February 08, 2024. The results are shown in Table 3. DiffBid improves Buycnt by 2.09%, GMV by 2.81%, and ROI by 3.36%, showing its effectiveness in optimizing the overall performance. In terms of efficiency, DiffBid takes 0.2s per request with GPU acceleration, while the baseline takes 0.07s, so the latency requirement can still be met.
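The relative changes in the last row of Table 3 can be checked with a few lines of arithmetic. The sketch below copies the Cost, Buycnt, and GMV values from the table. It assumes that ROI is GMV divided by Cost, a definition not stated in this section; under that assumption it reproduces the reported ROI lift and agrees with the tabulated ROI values to about three significant digits. The variable and function names are illustrative.

```python
# Values copied from Table 3 (online A/B test).
baseline = {"cost": 834426.104, "buycnt": 23584.6836, "gmv": 1853823.0}
diffbid = {"cost": 829992.384, "buycnt": 24078.6883, "gmv": 1905954.0}


def roi(arm):
    # Assumption: ROI = GMV / Cost (not defined explicitly in this section).
    return arm["gmv"] / arm["cost"]


def relative_lift(new, old):
    """Relative change of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


# Per-metric lift of DiffBid over the baseline.
lifts = {name: relative_lift(diffbid[name], baseline[name]) for name in baseline}
lifts["roi"] = relative_lift(roi(diffbid), roi(baseline))

for name, value in lifts.items():
    print(f"{name}: {value:+.2f}%")
```

Computing the ROI lift from the ratio of GMV to Cost, rather than from the rounded ROI column, avoids compounding the rounding error of the two tabulated ROI values.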

7. Related Works

Offline Reinforcement Learning. Offline reinforcement learning (RL) has gained significant attention in recent years. Its primary goal is to learn effective policies from a fixed dataset without additional online interaction with the environment. This approach is particularly beneficial when online interaction is costly, risky, or otherwise infeasible. Notable works include Conservative Q-Learning (CQL) by Kumar et al. (Kumar et al., 2020) and Batch-Constrained deep Q-learning (BCQ) by Fujimoto et al. (Fujimoto et al., 2019). Both algorithms aim to tackle the overestimation bias that tends to occur in offline RL settings. Kostrikov et al. (Kostrikov et al., 2021) propose implicit Q-learning (IQL) to address the training instability of CQL. Chen et al. (Chen et al., 2021) propose to use transformers for offline RL to increase the model capability. Hansen-Estruch et al. (Hansen-Estruch et al., 2023) propose a diffusion-based approach with implicit Q-learning for offline RL.

Diffusion Models. Diffusion models have recently shown strong capability in high-quality generation (Croitoru et al., 2023), unconditional generation (Austin et al., 2021), and conditional generation (Chao et al., 2022; Huang et al., 2022). They have also shown promising performance in decision-making. Hansen-Estruch et al. (Hansen-Estruch et al., 2023) propose a diffusion-based approach with implicit Q-learning for offline RL. Wang et al. (Wang et al., 2022) propose an expressive policy class through diffusion modeling. Chen et al. (Chen et al., 2022) propose to use diffusion models for behavior modeling. Hu et al. (Hu et al., 2023) introduce temporal conditions for trajectory generation. Li et al. (Li et al., 2023) utilize a diffusion model in anti-money laundering. Despite these preliminary explorations, no work has been devoted to diffusion-based auto-bidding, which requires the model to adapt to the random advertising environment.

Auto-bidding. Auto-bidding systems are widely used in programmatic advertising, where they automatically place bids on ad spaces. The main focus of such systems is to optimize a given key performance indicator (KPI), such as the number of clicks or conversions, while staying within a certain budget (Wang et al., 2017). Cai et al. (Cai et al., 2017) proposed an RL-based approach to auto-bidding for display advertising. They designed a bidding environment and applied a deep RL algorithm to learn the optimal bidding strategy. He et al. (He et al., 2021) propose a unified RL solution that supports multiple constraints for auto-bidding. Jin et al. (Jin et al., 2018) extend RL to multi-agent competition. Mou et al. (Mou et al., 2022) also adopted the RL framework for auto-bidding and showed that their approach can outperform traditional bidding strategies. Wen et al. (Wen et al., 2022) propose a multi-agent approach for auto-bidding, which models multiple auto-bidding agents at the same time to include more information and has been deployed online.

8. Conclusion

In this paper, we design a new paradigm for auto-bidding through the lens of generative modeling. To achieve this goal, we propose a decision-denoising diffusion approach that generates conditional bidding trajectories while keeping the generated samples under certain constraints. This generative modeling approach can integrate different kinds of industrial metrics and is the first unified model for bidding. Extensive experiments on real-world simulation environments demonstrate the effectiveness of the proposed approach. In the future, we will consider developing new methods to accelerate the generation process and to ensure the robustness of DiffBid.

References

  • Agrawal et al. (2016) Pulkit Agrawal, Ashvin V Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. 2016. Learning to poke by poking: Experiential learning of intuitive physics. Advances in neural information processing systems 29 (2016).
  • Ajay et al. (2022) Anurag Ajay, Yilun Du, Abhi Gupta, Joshua B Tenenbaum, Tommi S Jaakkola, and Pulkit Agrawal. 2022. Is Conditional Generative Modeling all you need for Decision Making?. In The Eleventh International Conference on Learning Representations.
  • Austin et al. (2021) Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. 2021. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems 34 (2021), 17981–17993.
  • Balseiro et al. (2021a) Santiago Balseiro, Yuan Deng, Jieming Mao, Vahab Mirrokni, and Song Zuo. 2021a. Robust auction design in the auto-bidding world. Advances in Neural Information Processing Systems 34 (2021), 17777–17788.
  • Balseiro et al. (2021b) Santiago R Balseiro, Yuan Deng, Jieming Mao, Vahab S Mirrokni, and Song Zuo. 2021b. The landscape of auto-bidding auctions: Value versus utility maximization. In Proceedings of the 22nd ACM Conference on Economics and Computation. 132–133.
  • Cai et al. (2017) Han Cai, Kan Ren, Weinan Zhang, Kleanthis Malialis, Jun Wang, Yong Yu, and Defeng Guo. 2017. Real-time bidding by reinforcement learning in display advertising. In Proceedings of the tenth ACM international conference on web search and data mining. 661–670.
  • Chao et al. (2022) Chen-Hao Chao, Wei-Fang Sun, Bo-Wun Cheng, Yi-Chen Lo, Chia-Che Chang, Yu-Lun Liu, Yu-Lin Chang, Chia-Ping Chen, and Chun-Yi Lee. 2022. Denoising likelihood score matching for conditional score-based data generation. arXiv preprint arXiv:2203.14206 (2022).
  • Chen et al. (2022) Huayu Chen, Cheng Lu, Chengyang Ying, Hang Su, and Jun Zhu. 2022. Offline reinforcement learning via high-fidelity generative behavior modeling. arXiv preprint arXiv:2209.14548 (2022).
  • Chen et al. (2021) Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. 2021. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems 34 (2021), 15084–15097.
  • Chiesi et al. (1979) Harry L Chiesi, George J Spilich, and James F Voss. 1979. Acquisition of domain-related information in relation to high and low domain knowledge. Journal of verbal learning and verbal behavior 18, 3 (1979), 257–273.
  • Croitoru et al. (2023) Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. 2023. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
  • Deng et al. (2021) Yuan Deng, Jieming Mao, Vahab Mirrokni, and Song Zuo. 2021. Towards efficient auctions in an auto-bidding world. In Proceedings of the Web Conference 2021. 3965–3973.
  • Evans (2009) David S Evans. 2009. The online advertising industry: Economics, evolution, and privacy. Journal of economic perspectives 23, 3 (2009), 37–60.
  • Fujimoto et al. (2019) Scott Fujimoto, David Meger, and Doina Precup. 2019. Off-policy deep reinforcement learning without exploration. In International conference on machine learning. PMLR, 2052–2062.
  • Fujimoto et al. (2022) Scott Fujimoto, David Meger, Doina Precup, Ofir Nachum, and Shixiang Shane Gu. 2022. Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error. In Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (Eds.). PMLR, 6918–6943. https://proceedings.mlr.press/v162/fujimoto22a.html
  • Gaon and Brafman (2020) Maor Gaon and Ronen Brafman. 2020. Reinforcement learning with non-markovian rewards. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 3980–3987.
  • Guo et al. (2022) Jiayan Guo, Yaming Yang, Xiangchen Song, Yuan Zhang, Yujing Wang, Jing Bai, and Yan Zhang. 2022. Learning Multi-granularity Consecutive User Intent Unit for Session-based Recommendation. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining (Virtual Event, AZ, USA) (WSDM ’22). Association for Computing Machinery, New York, NY, USA, 343–352. https://doi.org/10.1145/3488560.3498524
  • Ha (2008) Louisa Ha. 2008. Online advertising research in advertising journals: A review. Journal of Current Issues & Research in Advertising 30, 1 (2008), 31–48.
  • Hansen-Estruch et al. (2023) Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. 2023. Idql: Implicit q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573 (2023).
  • Hao et al. (2020) Xiaotian Hao, Zhaoqing Peng, Yi Ma, Guan Wang, Junqi Jin, Jianye Hao, Shan Chen, Rongquan Bai, Mingzhou Xie, Miao Xu, Zhenzhe Zheng, Chuan Yu, Han Li, Jian Xu, and Kun Gai. 2020. Dynamic Knapsack Optimization Towards Efficient Multi-Channel Sequential Advertising. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event (Proceedings of Machine Learning Research, Vol. 119). PMLR, 4060–4070. http://proceedings.mlr.press/v119/hao20b.html
  • He et al. (2021) Yue He, Xiujun Chen, Di Wu, Junwei Pan, Qing Tan, Chuan Yu, Jian Xu, and Xiaoqiang Zhu. 2021. A unified solution to constrained bidding in online display advertising. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2993–3001.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems 33 (2020), 6840–6851.
  • Ho and Salimans (2021) Jonathan Ho and Tim Salimans. 2021. Classifier-Free Diffusion Guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications.
  • Hu et al. (2023) Jifeng Hu, Yanchao Sun, Sili Huang, SiYuan Guo, Hechang Chen, Li Shen, Lichao Sun, Yi Chang, and Dacheng Tao. 2023. Instructed Diffuser with Temporal Condition Guidance for Offline Reinforcement Learning. arXiv preprint arXiv:2306.04875 (2023).
  • Huang et al. (2022) R Huang, MWY Lam, J Wang, D Su, D Yu, Y Ren, and Z Zhao. 2022. FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis. In IJCAI International Joint Conference on Artificial Intelligence. IJCAI: International Joint Conferences on Artificial Intelligence Organization, 4157–4163.
  • Jaynes (1957) Edwin T Jaynes. 1957. Information theory and statistical mechanics. Physical review 106, 4 (1957), 620.
  • Jin et al. (2018) Junqi Jin, Chengru Song, Han Li, Kun Gai, Jun Wang, and Weinan Zhang. 2018. Real-time bidding with multi-agent reinforcement learning in display advertising. In Proceedings of the 27th ACM international conference on information and knowledge management. 2193–2201.
  • Kingma et al. (2015) Durk P Kingma, Tim Salimans, and Max Welling. 2015. Variational dropout and the local reparameterization trick. Advances in neural information processing systems 28 (2015).
  • Kostrikov et al. (2021) Ilya Kostrikov, Ashvin Nair, and Sergey Levine. 2021. Offline Reinforcement Learning with Implicit Q-Learning. In International Conference on Learning Representations.
  • Kumar et al. (2020) Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. 2020. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems 33 (2020), 1179–1191.
  • Li and Tang (2022) Juncheng Li and Pingzhong Tang. 2022. Auto-bidding Equilibrium in ROI-Constrained Online Advertising Markets. arXiv preprint arXiv:2210.06107 (2022).
  • Li et al. (2023) Xujia Li, Yuan Li, Xueying Mo, Hebing Xiao, Yanyan Shen, and Lei Chen. 2023. Diga: Guided diffusion model for graph recovery in anti-money laundering. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4404–4413.
  • Majeed and Hutter (2018) Sultan Javed Majeed and Marcus Hutter. 2018. On Q-learning Convergence for Non-Markov Decision Processes.. In IJCAI, Vol. 18. 2546–2552.
  • Misra (2019) Diganta Misra. 2019. Mish: A self regularized non-monotonic activation function. arXiv preprint arXiv:1908.08681 (2019).
  • Mou et al. (2022) Zhiyu Mou, Yusen Huo, Rongquan Bai, Mingzhou Xie, Chuan Yu, Jian Xu, and Bo Zheng. 2022. Sustainable Online Reinforcement Learning for Auto-bidding. Advances in Neural Information Processing Systems 35 (2022), 2651–2663.
  • Mutti et al. (2022) Mirco Mutti, Riccardo De Santi, and Marcello Restelli. 2022. The importance of non-markovianity in maximum state entropy exploration. In International Conference on Machine Learning. PMLR, 16223–16239.
  • Nguyen-Tuong et al. (2008) Duy Nguyen-Tuong, Jan Peters, Matthias Seeger, and Bernhard Schölkopf. 2008. Learning inverse dynamics: a comparison. In European symposium on artificial neural networks.
  • Nichol and Dhariwal (2021) Alexander Quinn Nichol and Prafulla Dhariwal. 2021. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning. PMLR, 8162–8171.
  • Ou et al. (2023) Weitong Ou, Bo Chen, Yingxuan Yang, Xinyi Dai, Weiwen Liu, Weinan Zhang, Ruiming Tang, and Yong Yu. 2023. Deep landscape forecasting in multi-slot real-time bidding. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4685–4695.
  • Pathak et al. (2018) Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. 2018. Zero-shot visual imitation. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 2050–2053.
  • Qin et al. (2023) Aoyang Qin, Feng Gao, Qing Li, Song-Chun Zhu, and Sirui Xie. 2023. Learning non-Markovian Decision-Making from State-only Sequences. In Thirty-seventh Conference on Neural Information Processing Systems.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, 234–241.
  • Vincent (2011) Pascal Vincent. 2011. A connection between score matching and denoising autoencoders. Neural computation 23, 7 (2011), 1661–1674.
  • Wang et al. (2017) Jun Wang, Weinan Zhang, Shuai Yuan, et al. 2017. Display advertising with real-time bidding (RTB) and behavioural targeting. Foundations and Trends® in Information Retrieval 11, 4-5 (2017), 297–435.
  • Wang et al. (2022) Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. 2022. Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning. In The Eleventh International Conference on Learning Representations.
  • Wen et al. (2022) Chao Wen, Miao Xu, Zhilin Zhang, Zhenzhe Zheng, Yuhui Wang, Xiangyu Liu, Yu Rong, Dong Xie, Xiaoyang Tan, Chuan Yu, et al. 2022. A cooperative-competitive multi-agent framework for auto-bidding in online advertising. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 1129–1139.
  • Wu and He (2018) Yuxin Wu and Kaiming He. 2018. Group normalization. In Proceedings of the European conference on computer vision (ECCV). 3–19.
  • Zhang et al. (2023b) Haoqi Zhang, Lvyin Niu, Zhenzhe Zheng, Zhilin Zhang, Shan Gu, Fan Wu, Chuan Yu, Jian Xu, Guihai Chen, and Bo Zheng. 2023b. A Personalized Automated Bidding Framework for Fairness-aware Online Advertising. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5544–5553.
  • Zhang et al. (2023a) Peiyan Zhang, Jiayan Guo, Chaozhuo Li, Yueqi Xie, Jae Boum Kim, Yan Zhang, Xing Xie, Haohan Wang, and Sunghun Kim. 2023a. Efficiently Leveraging Multi-level User Intent for Session-based Recommendation via Atten-Mixer Network. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining (Singapore, Singapore) (WSDM ’23). Association for Computing Machinery, New York, NY, USA, 168–176. https://doi.org/10.1145/3539597.3570445
  • Ziebart (2010) Brian D. Ziebart. 2010. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Ph. D. Dissertation. USA. Advisor(s) Bagnell, J. Andrew. AAI3438449.

Appendix A Appendix

A.1. Notations

Table 4. Definition of Notations.
Symbol | Definition
--- | ---
$\tau$ | The trajectory index of a serving policy.
$\boldsymbol{x}(\tau)_{k}$ | Sequence of states of trajectory $\tau$ at diffusion step $k$.
$\boldsymbol{y}(\tau)$ | Properties or conditions for $\tau$.
$R$ | Return of a trajectory.
$E$ | Binary indicator variable.
$B$ | The budget of the advertiser.
$C_{i}$ | The $i$-th constraint.
$\boldsymbol{o}_{i}$ | Whether the advertiser wins impression $i$.
$\boldsymbol{v}_{i}$ | The true value of impression $i$.
$\boldsymbol{b}_{i}^{*}$ | The optimal bidding price for impression $i$.
$\boldsymbol{s}_{t}$ | The state at time period $t$.
$\boldsymbol{\hat{a}}_{t}$ | Predicted bidding parameters at time period $t$.
$\lambda_{i}$ | The bidding parameters.
$\boldsymbol{\epsilon}_{\theta}$ | The denoising model that predicts the noise.
$f_{\phi}$ | The model that generates bids.
$\overline{\alpha}_{k}$ | The cumulative product of $1-\beta_{j}$, $j=1,\ldots,k$.
$\beta_{k}$ | Scheduling factors.
$\alpha_{k}$ | $1-\beta_{k}$.

A.2. Diffusion Modeling

As a kind of generative model, diffusion models (Ho et al., 2020; Vincent, 2011) use a diffusion process to gradually denoise latent samples into new samples and have been widely used to generate pictures, videos, and audio. One widely used diffusion model, the denoising diffusion probabilistic model (DDPM), consists of two processes:

Forward process. In the forward process, noise is gradually added to the latent variable through a Markov chain with the transition $q(\boldsymbol{x}_{k} \mid \boldsymbol{x}_{k-1})=\mathcal{N}\left(\boldsymbol{x}_{k};\sqrt{1-\beta_{k}}\,\boldsymbol{x}_{k-1},\beta_{k}I\right)$, where $k\in\{1,\ldots,K\}$ refers to the diffusion step and $\beta_{k}\in(0,1)$ is a pre-defined factor that controls the noise scale at step $k$. By defining $\overline{\alpha}_{k}=\prod_{i=1}^{k}\alpha_{i}=\prod_{i=1}^{k}(1-\beta_{i})$, we obtain the conditional distribution:

(16) $q(x_{k} \mid x_{0})=\mathcal{N}\left(x_{k};\sqrt{\overline{\alpha}_{k}}\,x_{0},(1-\overline{\alpha}_{k})I\right)$

In this paper, we apply a cosine noise schedule to control the noise:

(17) $\overline{\alpha}_{k}=\frac{g(k)}{g(0)}=\frac{\cos\left(\frac{k/K+\gamma}{1+\gamma}\cdot\frac{\pi}{2}\right)}{\cos\left(\frac{\gamma}{1+\gamma}\cdot\frac{\pi}{2}\right)},$

where $\gamma$ is a constant. When $K\rightarrow\infty$, $q(x_{K})$ approaches a standard Gaussian distribution (Ho et al., 2020). Given the original trajectory $x_{0}$ and $\epsilon\sim\mathcal{N}(0,I)$, the noisy version at step $k$ is $x_{k}=\sqrt{\overline{\alpha}_{k}}\,x_{0}+\sqrt{1-\overline{\alpha}_{k}}\,\epsilon$.
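As a rough illustration of Eq. (17) and the closed-form noising step, the following Python sketch evaluates the cosine schedule and draws a noisy sample $x_k$ from $x_0$. The function names, the value $\gamma=0.008$, and the toy trajectory are assumptions made for this example and are not settings reported in the paper.

```python
import math
import random


def cosine_alpha_bar(k, K, gamma=0.008):
    """Cumulative signal level alpha_bar_k from Eq. (17); gamma=0.008 is an assumed value."""
    g_k = math.cos((k / K + gamma) / (1 + gamma) * math.pi / 2)
    g_0 = math.cos(gamma / (1 + gamma) * math.pi / 2)
    return g_k / g_0


def q_sample(x0, k, K, rng, gamma=0.008):
    """Draw x_k = sqrt(alpha_bar_k) * x0 + sqrt(1 - alpha_bar_k) * eps, eps ~ N(0, I)."""
    ab = cosine_alpha_bar(k, K, gamma)
    signal, noise = math.sqrt(ab), math.sqrt(1.0 - ab)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]


rng = random.Random(0)
x0 = [0.2, 0.4, 0.6, 0.8]  # hypothetical normalized trajectory
K = 30
for k in (1, 15, 30):
    # alpha_bar shrinks from about 1 to about 0 as k grows, so x_k drifts toward pure noise.
    print(k, round(cosine_alpha_bar(k, K), 4), q_sample(x0, k, K, rng))
```

Because $x_k$ is sampled directly from $x_0$, training never has to simulate the $k$ intermediate Markov transitions.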

Reverse process. In the reverse process, diffusion models aim to remove the noise added to $x_{k}$ and recursively recover $x_{k-1}$. To achieve this goal, a Gaussian distribution $p_{\theta}(\boldsymbol{x}_{k-1} \mid \boldsymbol{x}_{k})=\mathcal{N}\left(\boldsymbol{x}_{k-1} \mid \boldsymbol{\mu}_{\theta}(\boldsymbol{x}_{k},k),\boldsymbol{\Sigma}_{\theta}(\boldsymbol{x}_{k},k)\right)$ is learned, where $\boldsymbol{\mu}_{\theta}(\boldsymbol{x}_{k},k)$ is the learned mean and $\boldsymbol{\Sigma}_{\theta}(\boldsymbol{x}_{k},k)$ is the learned covariance, both parameterized by a neural network with parameters $\theta$.
To generate new samples, we can use the re-parameterization trick (Kingma et al., 2015) to sample a noise $\boldsymbol{x}_{K}\sim\boldsymbol{\mu}_{K}+\epsilon\sigma_{K}$ and recursively denoise it with $p_{\theta}(\boldsymbol{x}_{k-1} \mid \boldsymbol{x}_{k})$.
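The recursive denoising described above can be sketched as a short ancestral sampling loop. This is a minimal sketch under stated assumptions, not DiffBid's actual sampler: it uses the fixed-variance DDPM update of Ho et al. (2020) instead of a learned covariance, starts from $x_K\sim\mathcal{N}(0,I)$, clips $\beta_k$ at 0.999, and replaces the trained network with a hypothetical closed-form oracle that is exact only when every training trajectory equals `target`. All function names are illustrative.

```python
import math
import random


def make_schedule(K, gamma=0.008, max_beta=0.999):
    """Cosine schedule: returns (betas, alpha_bars) for steps k = 1..K (stored at index k-1)."""
    def g(k):
        return math.cos((k / K + gamma) / (1 + gamma) * math.pi / 2)

    betas, alpha_bars, ab = [], [], 1.0
    for k in range(1, K + 1):
        beta = min(1.0 - g(k) / g(k - 1), max_beta)  # clipping is an assumption
        ab *= 1.0 - beta
        betas.append(beta)
        alpha_bars.append(ab)
    return betas, alpha_bars


def make_oracle(target, alpha_bars):
    """Hypothetical stand-in for eps_theta: exact when all data equal `target`."""
    def eps_model(x, k):
        ab = alpha_bars[k - 1]
        return [(xi - math.sqrt(ab) * ti) / math.sqrt(1.0 - ab)
                for xi, ti in zip(x, target)]
    return eps_model


def p_sample_loop(eps_model, dim, betas, alpha_bars, rng):
    """Start from Gaussian noise x_K and apply p(x_{k-1} | x_k) for k = K..1."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for k in range(len(betas), 0, -1):
        beta, ab = betas[k - 1], alpha_bars[k - 1]
        eps_hat = eps_model(x, k)
        coef = beta / math.sqrt(1.0 - ab)
        # Posterior mean of the fixed-variance DDPM update (alpha_k = 1 - beta_k).
        mean = [(xi - coef * ei) / math.sqrt(1.0 - beta) for xi, ei in zip(x, eps_hat)]
        sigma = math.sqrt(beta) if k > 1 else 0.0  # no noise on the final step
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x


betas, alpha_bars = make_schedule(K=20)
target = [0.5, -1.0, 2.0]  # toy "trajectory"
sample = p_sample_loop(make_oracle(target, alpha_bars), 3, betas, alpha_bars, random.Random(0))
print(sample)
```

With the oracle in place of a trained network, the final step maps $x_1$ back to `target` up to floating-point error, which gives a simple sanity check on the update rule.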

Optimization. DDPM optimizes the Evidence Lower BOund (ELBO) of generative models. In the context of the diffusion model, we can take the latent samples as hidden variables and rewrite ELBO in the following form:

(18) $\begin{split}&\mathbb{E}_{q}\left[-\log p_{\theta}(\boldsymbol{x}_{0})\right]\\ \leq\ &\mathbb{E}_{q}\left[-\log\frac{p_{\theta}(\boldsymbol{x}_{0:K})}{q(\boldsymbol{x}_{1:K} \mid \boldsymbol{x}_{0})}\right]\\ =\ &\mathbb{E}_{q}\left[D_{KL}\left(q(\boldsymbol{x}_{K} \mid \boldsymbol{x}_{0})\,\|\,p_{\theta}(\boldsymbol{x}_{K})\right)\right]-\mathbb{E}_{q}\left[\log p_{\theta}(\boldsymbol{x}_{0} \mid \boldsymbol{x}_{1})\right]\\ +\ &\mathbb{E}_{q}\left[\sum_{k>1}D_{KL}\left(q(\boldsymbol{x}_{k-1} \mid \boldsymbol{x}_{k},\boldsymbol{x}_{0})\,\|\,p_{\theta}(\boldsymbol{x}_{k-1} \mid \boldsymbol{x}_{k})\right)\right],\end{split}$

where the first term contains no learnable parameters because the variances $\beta_k$ are fixed to constants, so it can be ignored during training. The second term is the reconstruction term, in which $p_\theta(\cdot)$ is trained to recover the original sample $\boldsymbol{x}_0$ from the noisy sample $\boldsymbol{x}_1$. The last term is the denoising term, in which $p_\theta(\cdot)$ is trained to denoise $\boldsymbol{x}_k$ into $\boldsymbol{x}_{k-1}$, so that the latent samples can be denoised recursively. Ho et al. (2020) show that the last term can be simplified to the noise prediction objective $\mathbb{E}_{k,\boldsymbol{x}_0,\boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\left(\sqrt{\overline{\alpha}_k}\,\boldsymbol{x}_0+\sqrt{1-\overline{\alpha}_k}\,\boldsymbol{\epsilon},k\right)\right\|^2\right]$, where $\overline{\alpha}_k=\prod_{i=1}^{k}\alpha_i=\prod_{i=1}^{k}(1-\beta_i)$.
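The noise prediction objective can be illustrated with a minimal sketch. It assumes a hypothetical linear schedule for $\beta_k$, a toy three-dimensional sample, and the squared Euclidean norm used by Ho et al. (2020); the names `alpha_bars`, `q_sample`, and `noise_prediction_loss` are illustrative and do not come from the paper.

```python
import math
import random


def alpha_bars(betas):
    """Cumulative products: alpha_bar_k = prod_{i=1}^{k} (1 - beta_i)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def q_sample(x0, k, abar, eps):
    """Closed-form forward noising for a 1-indexed diffusion step k."""
    a = abar[k - 1]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]


def noise_prediction_loss(eps_model, x0, k, abar, eps):
    """Squared error between the true noise and the model's prediction."""
    x_k = q_sample(x0, k, abar, eps)
    pred = eps_model(x_k, k)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))


# Hypothetical schedule and data, chosen only for illustration.
K = 10
betas = [1e-4 + (0.02 - 1e-4) * i / (K - 1) for i in range(K)]
abar = alpha_bars(betas)

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]
eps = [rng.gauss(0.0, 1.0) for _ in x0]

# A placeholder model that always predicts zero noise.
zero_model = lambda x_k, k: [0.0] * len(x_k)
loss = noise_prediction_loss(zero_model, x0, 5, abar, eps)
```

In training, $k$ and $\boldsymbol{\epsilon}$ would be resampled for every example and the loss averaged, which gives a Monte Carlo estimate of the expectation above.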

A.3. Model Configuration

We parameterize the noise model $\epsilon_\theta$ with a temporal U-Net (Ronneberger et al., 2015) consisting of 3 repeated residual blocks. Each block consists of two temporal convolutions, followed by group normalization (Wu and He, 2018) and a final Mish activation function (Misra, 2019). Timestep and condition embeddings, both 128-dimensional vectors, are produced by separate 2-layer MLPs (with 256 hidden units and Mish activations) and are concatenated before being added to the activations of the first temporal convolution within each block. The inverse dynamics model $f_\phi$ is parameterized with a 3-layer MLP.
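As a small illustration of the configuration above, the sketch below implements the Mish activation, $\text{mish}(x)=x\tanh(\text{softplus}(x))$, and records the stated hyperparameters in a plain dictionary. The dictionary keys are hypothetical names chosen for this example, not identifiers from the authors' code.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation (Misra, 2019): x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


# Hyperparameters as stated in the text; key names are illustrative.
NOISE_MODEL_CONFIG = {
    "num_residual_blocks": 3,
    "temporal_convs_per_block": 2,
    "embedding_dim": 128,        # for both timestep and condition embeddings
    "embedding_mlp_layers": 2,
    "embedding_mlp_hidden": 256,
    "inverse_dynamics_mlp_layers": 3,
}

# Concatenating the two 128-dimensional embeddings gives a 256-dimensional vector.
conditioning_dim = 2 * NOISE_MODEL_CONFIG["embedding_dim"]
```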

A.4. Pseudo-code

The training and inference procedures of DiffBid are shown in Algorithm 2 and Algorithm 1, respectively.

Algorithm 1 Bid Generation with DiffBid.
Input: noise model $\epsilon_\theta$, inverse dynamics $f_\phi$, guidance scale $\omega$, condition $\boldsymbol{y}$, max diffusion step $K$, scales $\beta_k,\ k=1,\dots,K$.
Output: bidding parameters $\boldsymbol{a}_t$.
Get the history of states $\boldsymbol{s}_{0:t}$;
Sample $\boldsymbol{x}_K(\tau)\sim\mathcal{N}(0,\beta_K I)$;
for $k=K,\dots,1$ do
     $\boldsymbol{x}_k(\tau)[:t]\leftarrow\boldsymbol{s}_{0:t}$
     Estimate the noise $\hat{\epsilon}_k$ through Eq. (8)
     $(\mu_{k-1},\Sigma_{k-1})\leftarrow\text{Denoise}(\boldsymbol{x}_k(\tau),\hat{\epsilon}_k)$
     $\boldsymbol{x}_{k-1}\sim\mathcal{N}(\mu_{k-1},\alpha\Sigma_{k-1})$
end for
Extract $(\boldsymbol{s}_{t-L:t},\boldsymbol{s}'_{t+1})$ from $\boldsymbol{x}_0(\tau)$;
Generate $\boldsymbol{\hat{a}}_t=f_\phi(\boldsymbol{s}_{t-L:t},\boldsymbol{s}'_{t+1})$;
return $\boldsymbol{\hat{a}}_t$.
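The sampling loop of Algorithm 1 can be sketched in plain Python under several simplifying assumptions: states are scalars, the Denoise step uses the standard DDPM posterior mean with variance $\beta_k$, and the noise estimate of Eq. (8), which is not reproduced in this section, is assumed to take the usual classifier-free guidance form. The callables `eps_model` and `inverse_dynamics` and the parameter `temperature` (standing in for the variance scale $\alpha$ in the sampling step) are hypothetical placeholders.

```python
import math
import random


def generate_bid(eps_model, inverse_dynamics, history, horizon, y, betas,
                 omega=1.0, temperature=0.5, L=1, rng=None):
    """Sketch of Algorithm 1 for scalar states; not the authors' implementation."""
    rng = rng or random.Random(0)
    K = len(betas)
    t = len(history) - 1  # index of the current time step

    # alpha_bar_k = prod_{i<=k} (1 - beta_i)
    abar, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        abar.append(running)

    # x_K ~ N(0, beta_K I)
    x = [rng.gauss(0.0, math.sqrt(betas[-1])) for _ in range(horizon)]

    for k in range(K, 0, -1):
        x[: t + 1] = history  # clamp the observed states s_{0:t} (inclusive of t)
        beta, a_bar = betas[k - 1], abar[k - 1]
        # Assumed classifier-free guidance: unconditional + omega * (conditional - unconditional).
        eps_c = eps_model(x, y, k)
        eps_u = eps_model(x, None, k)
        eps_hat = [u + omega * (c - u) for c, u in zip(eps_c, eps_u)]
        # Standard DDPM posterior mean.
        mu = [(xi - beta / math.sqrt(1.0 - a_bar) * e) / math.sqrt(1.0 - beta)
              for xi, e in zip(x, eps_hat)]
        # Scaled variance; no noise is added on the final step.
        sigma = math.sqrt(temperature * beta) if k > 1 else 0.0
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mu]

    x[: t + 1] = history
    window = x[max(0, t - L): t + 1]  # s_{t-L:t}
    return inverse_dynamics(window, x[t + 1])  # uses the planned s'_{t+1}
```

At each time step the caller would re-run this procedure with the updated history, so that only the first planned transition is ever executed.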
Algorithm 2 Training of DiffBid.
Input: randomly initialized $\theta$, $\phi$, bidding trajectory set $\mathcal{D}$.
Output: optimized $\theta$, $\phi$.
while not converged do
     Sample a batch of trajectories $\mathcal{B}\subset\mathcal{D}$;
     for all $\tau\in\mathcal{B}$ do
         Sample $k\sim\text{Uniform}(1,K)$, $\epsilon\sim\mathcal{N}(0,I)$;
         Compute $\boldsymbol{x}_k(\tau)$ via $q(\boldsymbol{x}_k(\tau)|\boldsymbol{x}_0(\tau))$ in Eq. (7);
         Compute $\mathcal{L}(\theta,\phi)$ by Eq. (12);
         Perform gradient descent to optimize $\theta$ and $\phi$;
     end for
end while
return optimized $\theta$, $\phi$

A.5. Analytical Results for Action Control

Figure 7. Ability of Action Control.

We analyze the ability of different models to control actions. To this end, we redefine the return function as the sum of actions at odd time steps minus the sum of actions at even time steps. The results are shown in Figure 7. We find that DiffBid controls actions better than IQL. The main reason is that controlling actions over long horizons is difficult for RL. In contrast, DiffBid directly models the correlation between trajectories and returns, and can therefore handle long trajectories well.
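The redefined return can be written as a short function. This is a sketch that assumes time steps are numbered from 1, which the text does not state explicitly; the function name is illustrative.

```python
def alternating_return(actions):
    """Sum of actions at odd time steps minus sum at even time steps.

    Assumes the first action belongs to time step 1.
    """
    odd = sum(a for step, a in enumerate(actions, start=1) if step % 2 == 1)
    even = sum(a for step, a in enumerate(actions, start=1) if step % 2 == 0)
    return odd - even
```

A model that controls actions well should produce high actions at odd steps and low actions at even steps when conditioned on a high value of this return.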

A.6. Statistical Analyses for Bidding Trajectory

The study by Hao et al. (2020) indicates that CE follows a power-law decline as the number of winning impressions increases. Our statistical analysis confirms that this finding holds at every discrete time step, with decay rates varying over time due to the heterogeneous nature of the impressions. Figure 8(a) shows three time steps sampled from the online advertising system. A further key insight from this figure is that the optimal bidding strategy's $ce^*$ is equivalent to selecting a specific number of winning impressions per time step. We denote this number at time step $t$ as $n_t$.

Another finding, illustrated in Figure 8(b), is that the costs of impressions remain relatively stable throughout the entire episode, fluctuating by less than 5%. This stability allows us to approximate the cost of each impression $c_i$ with the average cost $\bar{c}=\frac{1}{N}\sum_{i=1}^{N}c_i$, where $N$ is the number of winning impressions in the entire episode. Therefore, the total cost at each time step is $c_t=n_t\cdot\bar{c}$. In auto-bidding modeling, $c_t$ can be calculated from the state trajectory as the difference in the remaining budget between two consecutive steps. Consequently, we can conclude that the optimal strategy corresponds to a specific state trajectory.
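The link between the state trajectory and the per-step number of winning impressions can be sketched as follows. The example assumes the remaining budget is available as one number per time step and that $\bar{c}$ is known; the function names and the numbers in the usage example are illustrative only.

```python
def step_costs(remaining_budget):
    """c_t as the drop in remaining budget between consecutive time steps."""
    return [b0 - b1 for b0, b1 in zip(remaining_budget, remaining_budget[1:])]


def implied_win_counts(remaining_budget, avg_cost):
    """n_t = c_t / c_bar, using the approximation c_t = n_t * c_bar."""
    return [c / avg_cost for c in step_costs(remaining_budget)]


# Hypothetical budget trajectory and average impression cost.
costs = step_costs([100.0, 90.0, 70.0, 65.0])
counts = implied_win_counts([100.0, 90.0, 70.0, 65.0], avg_cost=2.5)
```

Here `costs` is `[10.0, 20.0, 5.0]` and `counts` is `[4.0, 8.0, 2.0]`, so a budget trajectory determines the number of impressions won at each step.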

A.7. Theoretical Analysis

We first define two decision processes and then present the theoretical analysis.

Definition A.1 (Markovian Decision Process (MDP)).

An MDP is a stochastic mapping from state-action pairs to state-reward pairs. Formally, $\mathcal{T}:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{S}\times\mathcal{R}$, where $\mathcal{T}$ denotes the stochastic mapping.

Definition A.2 (History-based Decision Process (HDP)).

An HDP is a stochastic mapping from history-action pairs to observation-reward pairs. Formally, $\mathcal{P}:\mathcal{H}^{*}\times\mathcal{A}\rightarrow\mathcal{O}\times\mathcal{R}$, where $\mathcal{P}$ denotes the stochastic mapping.

We show that a sequential decision-making problem can be constructed to maximize the same objective. The main results are due to Qin et al. (2023); we include the proofs here for completeness. To start, let the ground-truth distribution of demonstrations be $p^{*}(\boldsymbol{x}_0(\tau))$ and the learned marginal distribution of state sequences be $p_\theta(\boldsymbol{x}_0(\tau))$. Then Eq. (4) is an empirical estimate of

(19) $$\mathbb{E}_{p^{*}(\boldsymbol{s}_0)}\left[\log p^{*}(\boldsymbol{s}_0)+\mathbb{E}_{p^{*}(\boldsymbol{s}_{1:T}|\boldsymbol{s}_0)}\left[\log p_\theta(\boldsymbol{s}_{1:T}|\boldsymbol{s}_0)\right]\right]$$

Suppose the MLE attains the maximum, so that $p_\theta^{*}=p^{*}$. We then define $V^{*}(s_0):=\mathbb{E}_{p^{*}(s_{1:T}|s_0)}[\log p^{*}(s_{1:T}|s_0)]$ and generalize it to a $V$ function:

(20) $$V^{*}(s_{0:t})=\mathbb{E}_{p^{*}(s_{t+1:T}|s_{0:t})}\left[\log p^{*}(s_{t+1:T}|s_{0:t})\right]$$

which comes with a Bellman optimality equation:

(21) $$V^{*}(s_{0:t}):=\mathbb{E}_{p^{*}(s_{t+1}|s_{0:t})}\left[r(s_{t+1},s_{0:t})+V^{*}(s_{0:t+1})\right]$$

with $r(s_{t+1},s_{0:t}):=\log p^{*}(s_{t+1}|s_{0:t})=\log\int p^{*}_{\alpha}(a_t|s_{0:t})\,p^{*}(s_{t+1}|s_t,a_t)\,da_t$ and $V^{*}(s_{0:T}):=0$. It is worth noting that the $r$ defined above involves the optimal policy, which may not be known a priori. We can resolve this by replacing it with $r_\alpha$ for an arbitrary policy $p_\alpha(a_t|s_{0:t})$. All Bellman identities and updates should still hold. The entailed Bellman update (value iteration) for arbitrary $V$ and $\alpha$ is

(22) $$V(s_{0:t})=\mathbb{E}_{p^{*}(s_{t+1}|s_{0:t})}\left[r_\alpha(s_{0:t},s_{t+1})+V(s_{0:t+1})\right].$$
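For completeness, the Bellman equation in Eq. (21) can be recovered from the definition in Eq. (20) with the chain rule of probability. The derivation below is a sketch that uses only the definitions given above:

```latex
\begin{aligned}
V^{*}(s_{0:t})
&= \mathbb{E}_{p^{*}(s_{t+1:T}|s_{0:t})}\left[\log p^{*}(s_{t+1:T}|s_{0:t})\right] \\
&= \mathbb{E}_{p^{*}(s_{t+1}|s_{0:t})}\,\mathbb{E}_{p^{*}(s_{t+2:T}|s_{0:t+1})}
   \left[\log p^{*}(s_{t+1}|s_{0:t}) + \log p^{*}(s_{t+2:T}|s_{0:t+1})\right] \\
&= \mathbb{E}_{p^{*}(s_{t+1}|s_{0:t})}\left[r(s_{t+1},s_{0:t}) + V^{*}(s_{0:t+1})\right].
\end{aligned}
```

The second line factorizes $p^{*}(s_{t+1:T}|s_{0:t})=p^{*}(s_{t+1}|s_{0:t})\,p^{*}(s_{t+2:T}|s_{0:t+1})$, and the third line substitutes the definitions of $r$ and of $V^{*}(s_{0:t+1})$.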

We then define $r(s_{t+1},a_t,s_{0:t}):=r(s_{t+1},s_{0:t})+\log p^{*}_{\alpha}(a_t|s_{0:t})$ to construct a $Q$ function:

(23) $$Q^{*}(a_t;s_{0:t}):=\mathbb{E}_{p^{*}(s_{t+1}|s_{0:t})}\left[r(s_{t+1},a_t,s_{0:t})+V^{*}(s_{0:t+1})\right],$$

which entails a Bellman update (Q backup) for arbitrary $\alpha$, $Q$, and $V$:

(24) $$Q(a_t;s_{0:t})=\mathbb{E}_{p^{*}(s_{t+1}|s_{0:t})}\left[r_\alpha(s_{0:t},a_t,s_{t+1})+V(s_{0:t+1})\right].$$

Also note that the $V$ and $Q$ in the identities Eq. (22) and Eq. (24), respectively, are not necessarily associated with the policy $p_\alpha(a_t|s_{0:t})$. Slightly overloading the notation, we use $Q^{\alpha}$ and $V^{\alpha}$ to denote the expected returns from policy $p_\alpha(a_t|s_{0:t})$. Having constructed the atomic algebraic components, we move on to check whether the relations between them align with the algebraic structure of a sequential decision-making problem. We first prove that the construction above is valid at optimality.

Lemma A.3.

When $f_\alpha(a_t;s_{0:t})=Q^{*}(a_t;s_{0:t})-V^{*}(s_{0:t})$, $p_\alpha(a_t|s_{0:t})$ is the optimal policy.

Proof.

Note that the construction gives us

(25) $$\begin{aligned}&Q^{*}(a_t;s_{0:t})\\=&\ \mathbb{E}_{p^{*}(s_{t+1}|s_{0:t})}\left[r(s_{t+1},s_{0:t})+\log p^{*}_{\alpha}(a_t|s_{0:t})+V^{*}(s_{0:t+1})\right]\\=&\ \log p^{*}_{\alpha}(a_t|s_{0:t})+\mathbb{E}_{p^{*}(s_{t+1}|s_{0:t})}\left[r(s_{t+1},s_{0:t})+V^{*}(s_{0:t+1})\right]\\=&\ \log p^{*}_{\alpha}(a_t|s_{0:t})+V^{*}(s_{0:t})\end{aligned}$$

Obviously, $Q^{*}(a_t;s_{0:t})$ lies in the hypothesis space of $f_\alpha(a_t;s_{0:t})$. This indicates that we need to parameterize either $f_\alpha(a_t;s_{0:t})$ or $Q(a_t;s_{0:t})$. While $Q^{*}$ and $V^{*}$ are constructed from optimality, the derived $Q^{\alpha}$ and $V^{\alpha}$ measure the performance of an interactive agent when it executes the policy $p_\alpha(a_t|s_{0:t})$. They should be consistent.
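Lemma A.3 can be checked numerically on a toy discrete action set. The sketch below assumes the energy-based parameterization $p_\alpha(a_t|s_{0:t})\propto\exp f_\alpha(a_t;s_{0:t})$, which is not restated in this section, and uses a made-up optimal policy and value; all names and numbers are illustrative.

```python
import math


def energy_policy(f_values):
    """Normalize exp(f) over a finite action set (softmax with max-shift for stability)."""
    m = max(f_values)
    weights = [math.exp(f - m) for f in f_values]
    z = sum(weights)
    return [w / z for w in weights]


# Hypothetical optimal policy over three actions and an arbitrary value V*(s_{0:t}).
p_star = [0.2, 0.5, 0.3]
v_star = -1.7

# Eq. (25): Q*(a; s) = log p*(a | s) + V*(s).
q_star = [math.log(p) + v_star for p in p_star]

# Lemma A.3: with f = Q* - V*, the energy-based policy recovers p*.
f_alpha = [q - v_star for q in q_star]
recovered = energy_policy(f_alpha)
```

Because $Q^{*}-V^{*}=\log p^{*}_{\alpha}$ by Eq. (25), exponentiating and normalizing returns the optimal policy itself, so `recovered` matches `p_star` up to floating-point error.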

Lemma A.4.

$V^{\alpha}(s_{0:t})$ and $\mathbb{E}_{p_\alpha(a_t|s_{0:t})}\left[Q^{\alpha}(a_t;s_{0:t})\right]$ yield the same optimal policy $p^{*}_{\alpha}(a_t|s_{0:t})$.

Proof.

$$\begin{aligned}&\mathbb{E}_{p_\alpha(a_t|s_{0:t})}\left[Q^{\alpha}(a_t;s_{0:t})\right]\\:=&\ \mathbb{E}_{p_\alpha(a_t|s_{0:t})}\left[\mathbb{E}_{p^{*}(s_{t+1}|s_{0:t})}\left[r(s_{t+1},a_t,s_{0:t})+V^{\alpha}(s_{0:t+1})\right]\right]\\=&\ \mathbb{E}_{p_\alpha(a_t|s_{0:t})}\mathbb{E}_{p^{*}(s_{t+1}|s_{0:t})}\left[\log p^{*}_{\alpha}(a_t|s_{0:t})+r(s_{t+1},s_{0:t})+V^{\alpha}(s_{0:t+1})\right]\\=&\ \mathbb{E}_{p^{*}(s_{t+1}|s_{0:t})}\left[r(s_{t+1},s_{0:t})-H_\alpha(a_t|s_{0:t})+V^{\alpha}(s_{0:t+1})\right]\\&-\sum_{k=t+1}^{T-1}\mathbb{E}_{p^{*}(s_{t+1:k}|s_{0:t})}\left[H_\alpha(a_k|s_{0:k})\right]\end{aligned}$$

where $\mathcal{H}(\cdot)$ is the entropy term. The last line is derived by recursively applying the Bellman equation in the line above until $s_{0:T}$. As an energy-based policy, $p_{\alpha}(a_{t}|s_{0:t})$'s entropy is inherently maximized (Jaynes, 1957). Therefore, within the hypothesis space, $p_{\alpha}^{*}(a_{t}|s_{0:t})$ that optimizes $V^{\alpha}(s_{0:t})$ also leads to the optimal expected return $\mathbb{E}_{p_{\alpha}(a_{t}|s_{0:t})}\left[Q^{\alpha}(a_{t};s_{0:t})\right]$. ∎

Given the convergence proof by Ziebart (Ziebart, 2010), we have:

Lemma A.5.

If $p^{*}(s_{t+1}|s_{0:t})$ is accessible and $p^{*}_{\gamma}(s_{t+1}|s_{t},a_{t})$ is known, soft policy iteration and soft $Q$ learning both converge to $p^{*}_{\alpha}(a_{t}|s_{0:t})=p^{*}_{\alpha}(a_{t}|s_{0:t})\propto\exp(Q^{*}(a_{t};s_{0:t}))$ under certain conditions.

Lemma A.5 means that, given $p^{*}(s_{t+1}|s_{0:t})$ and $p^{*}_{\gamma}(s_{t+1}|s_{t},a_{t})$, we can recover $p^{*}_{\alpha}$ through reinforcement learning methods instead of the proposed MLE. So $p_{\alpha}(a_{t}|s_{0:t})$ is a viable policy space for the constructed sequential decision-making problem. Together, Lemma A.1, Lemma A.2 and Lemma A.3 provide proof for a valid sequential decision-making problem that maximizes the same objective as MLE, by Lemma A.4.
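The Boltzmann form $p^{*}_{\alpha}\propto\exp(Q^{*})$ in Lemma A.5 can be illustrated with a minimal sketch of finite-horizon soft value iteration. The sketch makes simplifying assumptions that are not part of the lemma: a small tabular problem, a history $s_{0:t}$ collapsed to the current state, and a randomly generated toy model. The helper names (`soft_value_iteration`, `make_toy_mdp`) are hypothetical.

```python
import math
import random


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) for a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def soft_value_iteration(P, R, horizon):
    """Run `horizon` soft Bellman backups on a tabular model.

    P[s][a] is a list of next-state probabilities, R[s][a] a scalar reward.
    Returns soft Q-values, soft state values, and the induced policy.
    """
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states  # soft value after the final step
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(horizon):
        # Soft Bellman backup: Q(s, a) = r(s, a) + E[V(s')]
        Q = [[R[s][a] + sum(p * v for p, v in zip(P[s][a], V))
              for a in range(n_actions)] for s in range(n_states)]
        # Soft maximum over actions: V(s) = log sum_a exp Q(s, a)
        V = [logsumexp(Q[s]) for s in range(n_states)]
    # Energy-based policy: pi(a | s) = exp(Q(s, a) - V(s)), proportional to exp(Q)
    policy = [[math.exp(Q[s][a] - V[s]) for a in range(n_actions)]
              for s in range(n_states)]
    return Q, V, policy


def make_toy_mdp(n_states=3, n_actions=2, seed=0):
    """Build a random toy model (an assumption for illustration only)."""
    rng = random.Random(seed)
    P, R = [], []
    for _ in range(n_states):
        P_s, R_s = [], []
        for _ in range(n_actions):
            weights = [rng.random() + 1e-3 for _ in range(n_states)]
            total = sum(weights)
            P_s.append([w / total for w in weights])
            R_s.append(rng.uniform(-1.0, 1.0))
        P.append(P_s)
        R.append(R_s)
    return P, R


P, R = make_toy_mdp()
Q, V, policy = soft_value_iteration(P, R, horizon=10)
```

Each row of `policy` is a normalized distribution whose ratio to `exp(Q)` is the same for every action, which is the proportionality stated in the lemma, here only for this toy setting.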

Lemma A.6 (MLE as non-Markovian decision-making process).

Assuming the Markovian transition $p_{\gamma^{*}}(s_{t+1}|s_{t},a_{t})$ is known and the ground-truth conditional state distribution $p^{*}(s_{t+1}|s_{0:t})$ for demonstration sequences is accessible, we can construct a non-Markovian sequential decision-making problem based on a reward function $r_{\alpha}(s_{t+1},s_{0:t}):=\log\int p_{\alpha}(a_{t}|s_{0:t})\,p_{\gamma^{*}}(s_{t+1}|s_{t},a_{t})\,da_{t}$ for an arbitrary energy-based policy $p_{\alpha}(a_{t}|s_{0:t})$. Its objective is

$$
\sum_{t=0}^{T}\mathbb{E}_{p^{*}(s_{0:t})}\left[V^{p_{\alpha}}(s_{0:t})\right]=\mathbb{E}_{p^{*}(s_{0:T})}\left[\sum_{t=0}^{T}\sum_{k=t}^{T}r_{\alpha}(s_{k+1};s_{0:k})\right]
\tag{26}
$$

where $V^{p_{\alpha}}(s_{0:t}):=\mathbb{E}_{p^{*}(s_{t+1:T}|s_{0:t})}\left[\sum_{k=t}^{T}r_{\alpha}(s_{k+1};s_{0:k})\right]$ is the value function of $p_{\alpha}$. This objective yields the same optimal policy as the Maximum Likelihood Estimation objective $\mathbb{E}_{p^{*}(s_{0:T})}\left[\log p_{\theta}(s_{0:T})\right]$.
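The link between the reward in Lemma A.6 and the likelihood can be checked numerically on a toy example. The sketch below assumes that the state-sequence likelihood factorizes as a product over steps of $\int p_{\alpha}(a_{t}|s_{0:t})\,p_{\gamma^{*}}(s_{t+1}|s_{t},a_{t})\,da_{t}$, as the definition of $r_{\alpha}$ suggests. The transition table `P_GAMMA` and the history-dependent `policy` are hypothetical stand-ins, with discrete states and actions replacing the integral by a sum. Under these assumptions, the summed rewards along a state sequence equal the log-likelihood of that sequence given $s_{0}$, obtained here by brute-force marginalization over all action sequences.

```python
import itertools
import math

# Hypothetical known Markovian transition p_gamma(s' | s, a): 2 states, 2 actions.
P_GAMMA = {
    (0, 0): [0.9, 0.1],
    (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5],
    (1, 1): [0.3, 0.7],
}
N_ACTIONS = 2


def policy(history):
    """Hypothetical non-Markovian policy p_alpha(a | s_{0:t}).

    The logits depend on the whole state history, not only the last state.
    """
    visits = sum(history)
    logits = [0.5 * a * visits - 0.2 * a * len(history) for a in range(N_ACTIONS)]
    norm = sum(math.exp(x) for x in logits)
    return [math.exp(x) / norm for x in logits]


def reward(history, s_next):
    """r_alpha(s_{t+1}, s_{0:t}) = log sum_a p_alpha(a | s_{0:t}) p_gamma(s_{t+1} | s_t, a)."""
    probs = policy(history)
    s_t = history[-1]
    return math.log(sum(probs[a] * P_GAMMA[(s_t, a)][s_next] for a in range(N_ACTIONS)))


def summed_rewards(states):
    """Sum of r_alpha over every transition in the state sequence."""
    return sum(reward(states[: t + 1], states[t + 1]) for t in range(len(states) - 1))


def brute_force_log_likelihood(states):
    """log p(s_1, ..., s_T | s_0), marginalizing over every action sequence."""
    horizon = len(states) - 1
    total = 0.0
    for actions in itertools.product(range(N_ACTIONS), repeat=horizon):
        prob = 1.0
        for t, a in enumerate(actions):
            prob *= policy(states[: t + 1])[a] * P_GAMMA[(states[t], a)][states[t + 1]]
        total += prob
    return math.log(total)


states = [0, 1, 1, 0, 1]
lhs = summed_rewards(states)
rhs = brute_force_log_likelihood(states)
print(abs(lhs - rhs) < 1e-9)
```

The check covers only the conditional likelihood given $s_{0}$; the initial-state term does not depend on the policy and is left out.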

Table 5. Parameters of the Real Advertising System.

| Parameters | Values |
| --- | --- |
| Number of advertisers | 30 |
| Time steps in an episode, $T$ | 96 |
| Minimum number of impression opportunities, $N_{\text{min}}$ | 50 |
| Maximum number of impression opportunities, $N_{\text{max}}$ | 300 |
| Minimum budget | 1000 Yuan |
| Maximum budget | 4000 Yuan |
| Value of impression opportunities in stage 1, $v_{j,t}^{1}$ | $0 \sim 1$ |
| Value of impression opportunities in stage 2, $v_{j,t}^{2}$ | $0 \sim 1$ |
| Minimum bid price, $A_{\text{min}}$ | 0 Yuan |
| Maximum bid price, $A_{\text{max}}$ | 1000 Yuan |
| Maximum value of impression opportunity, $v_{M}$ | 1 |
| Maximum market price, $p_{M}$ | 1000 Yuan |
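
The ranges in Table 5 can be collected into a small configuration object, sketched below. The numeric values come from the table; everything else is an assumption made for illustration. In particular, the table does not say how budgets or impression counts are distributed, nor whether the impression bounds apply per time step, so the uniform sampling and per-step interpretation here are hypothetical, as are the names `SystemParams`, `sample_episode`, and `clip_bid`.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemParams:
    """Values copied from Table 5 (budgets and prices in Yuan)."""
    num_advertisers: int = 30
    episode_steps: int = 96          # T
    min_impressions: int = 50        # N_min
    max_impressions: int = 300       # N_max
    min_budget: float = 1000.0
    max_budget: float = 4000.0
    min_bid: float = 0.0             # A_min
    max_bid: float = 1000.0          # A_max
    max_value: float = 1.0           # v_M
    max_market_price: float = 1000.0  # p_M


def sample_episode(params, seed=0):
    """Draw one budget per advertiser and one impression count per time step.

    Assumption: uniform sampling within the table's bounds, with the
    impression bounds applied per time step.
    """
    rng = random.Random(seed)
    budgets = [rng.uniform(params.min_budget, params.max_budget)
               for _ in range(params.num_advertisers)]
    impressions = [rng.randint(params.min_impressions, params.max_impressions)
                   for _ in range(params.episode_steps)]
    return budgets, impressions


def clip_bid(bid, params):
    """Clamp a proposed bid into the allowed range [A_min, A_max]."""
    return min(max(bid, params.min_bid), params.max_bid)


params = SystemParams()
budgets, impressions = sample_episode(params)
```

Freezing the dataclass keeps the table values immutable, so any sampled episode can be traced back to the same documented bounds.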
(a) Cost Effectiveness Curve Samples from Three Steps
(b) Impression Cost Fluctuation at Different Time Steps
Figure 8. Statistical Results From Online Advertising System.