AIGB: Generative Auto-bidding via Diffusion Modeling

Jiayan Guo ([email protected]), Peking University and Alibaba Group, Beijing, China
Yusen Huo ([email protected]), Alibaba Group, Beijing, China
Zhilin Zhang ([email protected]), Alibaba Group, Beijing, China
Tianyu Wang (yves.wty@alibaba-inc.com), Alibaba Group, Beijing, China
Chuan Yu ([email protected]), Alibaba Group, Beijing, China
Jian Xu ([email protected]), Alibaba Group, Beijing, China
Yan Zhang ([email protected]), Peking University, Beijing, China
Bo Zheng ([email protected]), Alibaba Group, Beijing, China

(2024)

Abstract.

Auto-bidding plays a crucial role in facilitating online advertising by automatically providing bids for advertisers. Reinforcement learning (RL) has gained popularity for auto-bidding. However, most current RL auto-bidding methods are modeled through the Markovian Decision Process (MDP), which assumes Markovian state transitions. This assumption restricts the ability to perform in long-horizon scenarios and makes the model unstable when dealing with highly random online advertising environments. To tackle this issue, this paper introduces AI-Generated Bidding (AIGB), a novel paradigm for auto-bidding through generative modeling. In this paradigm, we propose DiffBid, a conditional diffusion modeling approach for bid generation. DiffBid directly models the correlation between the return and the entire trajectory, effectively avoiding error propagation across time steps in long horizons. Additionally, DiffBid offers a versatile approach for generating trajectories that maximize given targets while adhering to specific constraints. Extensive experiments on a real-world dataset and an online A/B test on the Alibaba advertising platform demonstrate the effectiveness of DiffBid, which achieves a 2.81% increase in GMV and a 3.36% increase in ROI.

Keywords: Online Advertising, Auto-bidding, Generative Learning, Diffusion Modeling

Published in: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’24), August 25–29, 2024, Barcelona, Spain. Copyright: ACM licensed. DOI: 10.1145/3637528.3671526. ISBN: 979-8-4007-0490-1/24/08. CCS Concepts: Information systems → Computational advertising.

1. Introduction

The ever-increasing digitalization of commerce has exponentially expanded the scope and importance of online advertising platforms (Ha, 2008; Evans, 2009). These ad platforms have become indispensable for businesses to effectively target their audience and drive sales. Traditionally, advertisers need to manually adjust bid prices to optimize overall ad performance. However, this coarse bidding process becomes impractical when dealing with trillions of impression opportunities, requiring extensive domain knowledge (Chiesi et al., 1979) and comprehensive information about the advertising environments.

To alleviate the burden of bid optimization for advertisers, these ad platforms provide auto-bidding services (Deng et al., 2021; Balseiro et al., 2021b, a; Ou et al., 2023). These services automate the determination of bids for each impression opportunity by employing well-designed bidding strategies. Such strategies consider a variety of factors about advertising environments and advertisers, such as the distribution of impression opportunities, budgets, and average cost constraints (Li and Tang, 2022). Considering the dynamic nature of advertising environments, it is essential to regularly optimize the bidding strategy, typically at intervals of a few minutes, in response to changing conditions. With advertising episodes typically extending beyond 24 hours, auto-bidding can be seen as a sequential decision-making process with a long planning horizon where the bidding strategy seeks to optimize performance throughout the entire episode.

Recently, reinforcement learning (RL) techniques have been employed to optimize auto-bidding strategies through the training of agents with bidding logs collected from online advertising environments (** et al., 2018; Cai et al., 2017; Wang et al., 2017; He et al., 2021; Zhang et al., 2023b; Mou et al., 2022). By leveraging realistic historical bidding information, these agents can learn patterns and trends to make informed bidding decisions. However, most existing RL auto-bidding methods are based on the Markovian decision process (MDP), where the next state depends only on the current state and action. In the online auto-bidding environment, this assumption is challenged by our statistical analysis presented in Figure 1, which shows that the correlation between the history states and the next bidding state increases significantly as the length of the history sequence grows. This result indicates that solving auto-bidding with only the last state will encounter several problems, including instability in the highly random online advertising environment. Additionally, RL methods that rely on the Bellman equation often suffer from compounding errors (Fujimoto et al., 2022). This issue is especially pronounced in the auto-bidding problem, which is characterized by sparse returns and limited data coverage. A detailed statistical analysis is provided in Appendix A.5.

Figure 1. Correlation Coefficients between History and the Next State.

In this paper, instead of employing RL-based methods, we present a novel paradigm, AI-Generated Bidding (AIGB), that regards auto-bidding as a generative sequential decision-making problem. AIGB directly captures the correlation between the return and the entire bidding trajectory, which consists of a sequence of states or actions, thereby transforming the problem into learning to generate an optimal bidding trajectory. This approach enables us to overcome the limitations of RL when dealing with the highly random online advertising environment, sparse returns, and limited data coverage.

In the new paradigm, we propose DiffBid, a diffusion-based auto-bidding model. In the forward process, it gradually corrupts the bidding trajectory by injecting scheduled Gaussian noise. In the reverse process, it reconstructs the trajectory from its corrupted version, given returns and temporal conditions, via a parameterized neural network. We further propose a non-Markovian inverse dynamics model (Nguyen-Tuong et al., 2008; Guo et al., 2022; Zhang et al., 2023a) to generate optimal bidding parameters more accurately. Taking one step further, DiffBid provides the flexibility to closely align with the specific needs of advertisers by accommodating diverse constraints such as cost-per-click (CPC) and by incorporating human feedback. Notably, DiffBid serves as a unified model capable of mastering multiple tasks simultaneously, dynamically composing various bidding trajectory components to generate sequences that efficiently maximize diverse targets while adhering to a range of predefined constraints. To assess the effectiveness of DiffBid, we conducted extensive offline and online evaluations against baselines. Our results indicate that DiffBid surpasses RL methods for auto-bidding. In summary:

• We uncover that the Markov assumption upon which common decision-making methods rely does not hold for the auto-bidding problem. Therefore, we propose a novel bidding paradigm with non-Markovian properties based on generative learning. This paradigm represents a significant innovation in modeling methodology compared with existing RL methods commonly used in auto-bidding.
• Unlike common bidding methods, our approach captures the correlation between the return and the entire bidding trajectory. This design enables the method to address important challenges, such as sparse returns, and ensures stability in the highly random advertising environment. Finally, we prove that the proposed diffusion modeling is equivalent in terms of optimality to solving a non-Markovian decision problem.
• We demonstrate that the method can handle a variety of tasks within a unified solution, transcending the limitations of traditional task-specific methods. DiffBid outperforms conventional RL methods in auto-bidding and achieves significant performance gains on a leading e-commerce ad platform in both offline and online evaluation. Specifically, it achieves a 2.81% increase in GMV and a 3.36% increase in ROI.

2. Preliminary

2.1. Problem Formulation

For simplicity, we consider auto-bidding with cost-related constraints. During a time period, suppose there are $N$ impression opportunities arriving sequentially and indexed by $i$ . In this setting, advertisers submit bids to compete for each impression opportunity. An advertiser will win the impression if its bid $b_{i}$ is greater than others. Then it will incur a cost $c_{i}$ for winning and getting the value.

During the period, the mission of an advertiser is to maximize the total received value $\sum_{i}o_{i}v_{i}$ , where $v_{i}$ is the value of impression $i$ and $o_{i}$ is whether the advertiser wins impression $i$ . Besides, we have the budget and several constraints to control the performance of ad deliveries. Budget constraints are simply $\sum_{i}o_{i}c_{i}\leq B$ , where $c_{i}$ is the cost of impression $i$ and $B$ is the budget. The other constraints are complex and according to (He et al., 2021) we have the unified formulation:

(1)

\frac{\sum_{i}c_{ij}o_{i}}{\sum_{i}p_{ij}o_{i}}\leq C_{j},

where $C_{j}$ is the upper bound of $j$ ’th constraint. $p_{ij}$ can be any performance indicator, e.g. return, or constant. $c_{ij}$ is the cost of constraint $j$ . Given $J$ constraints, we have the Multi-constrained Bidding (MCB) as:

(2)

\begin{split}\mathop{\text{maximize}}_{o_{i}}&\sum_{i}o_{i}v_{i}\\ \text{s.t.}&\sum_{i}o_{i}c_{i}\leq B\\ &\frac{\sum_{i}c_{ij}o_{i}}{\sum_{i}p_{ij}o_{i}}\leq C_{j},\ \ \ \forall j\\ &\ \ \ \ \ \ \ \ \ \ \ o_{i}\in\{0,1\},\ \forall i\\ \end{split}

A previous study (He et al., 2021) has already shown the optimal solution:

(3)

b_{i}^{*}=\lambda_{0}v_{i}+C_{i}\sum_{j=1}^{J}\lambda_{j}p_{ij},

where $b_{i}^{*}$ is the predicted optimal bid for impression $i$ and $\lambda_{j},\ j\in\{0,...,J\}$ are the optimal bidding parameters. When only the budget constraint is considered, the problem is called Max Return bidding. When both the budget constraint and the CPC constraint are considered, it is called Target-CPC bidding. From an alternative perspective, the optimal strategy arranges all impressions in order of their cost-effectiveness (CE) and then selects every impression opportunity whose CE surpasses the optimal CE ratio $ce^{*}$ . This threshold enforces the constraint, and the optimal bidding parameter is $\lambda_{0}=1/ce^{*}$ .
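The cost-effectiveness view can be illustrated with a minimal sketch for the budget-only (Max Return) case. It assumes, for illustration only, that the cost of each impression is known in advance and does not depend on the bid; the function name and the toy numbers are hypothetical, and the greedy rule is a heuristic for the 0/1 selection problem rather than the procedure used in the paper.

```python
def select_by_cost_effectiveness(values, costs, budget):
    """Greedy sketch: win impressions in descending value/cost order until the budget is hit."""
    order = sorted(range(len(values)), key=lambda i: values[i] / costs[i], reverse=True)
    won, spend, ce_star = [], 0.0, None
    for i in order:
        if spend + costs[i] > budget:
            break  # the next impression would exceed the budget
        won.append(i)
        spend += costs[i]
        ce_star = values[i] / costs[i]  # CE of the last impression that was won
    return won, spend, ce_star


# Toy data (hypothetical): four impressions with known values and costs.
values = [4.0, 3.0, 2.0, 1.0]
costs = [1.0, 1.0, 2.0, 2.0]
won, spend, ce_star = select_by_cost_effectiveness(values, costs, budget=4.0)
lambda_0 = 1.0 / ce_star if ce_star else None  # lambda_0 = 1 / ce*
```

With this threshold, bidding $\lambda_{0}v_{i}$ wins exactly the impressions whose value-to-cost ratio is at least $ce^{*}$ in the toy setting.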

2.2. Auto-Bidding as Decision-Making

Eq. (3) gives the form of the optimal bid $b_{i}^{*}$ with bidding parameters $\lambda_{j},\ j\in\{0,...,J\}$ . However, in practice, the highly random and complex nature of the advertising environment prevents direct calculation of the bidding parameters. They must be carefully calibrated to adapt to the environment and dynamically adjusted as it evolves. This makes auto-bidding a sequential decision-making problem. To model it as such, we introduce states $\boldsymbol{s}_{t}\in\mathcal{S}$ to describe the real-time advertising status and actions $\boldsymbol{a}_{t}\in\mathcal{A}$ to adjust the corresponding bidding parameters. The auto-bidding agent takes action $\boldsymbol{a}_{t}$ at state $\boldsymbol{s}_{t}$ based on its policy $\pi$ ; the state then transitions to the next state $\boldsymbol{s}_{t+1}$ and the agent gains reward $\boldsymbol{r}_{t}\in\mathcal{R}$ according to the advertising environment dynamics $\mathcal{T}$ . When $\mathcal{T}:\boldsymbol{s}_{t}\times\boldsymbol{a}_{t}\rightarrow\boldsymbol{s}_{t+1}\times\boldsymbol{r}_{t}$ holds, that is, when the next state and reward depend only on the current state and action, the process is called a Markovian decision process (MDP). Otherwise, it is a non-Markovian decision process. We next describe the key items of the automated bidding agent in the industrial online advertising system:

• State $\boldsymbol{s}_{t}$ describes the real-time advertising status at time period $t$ , which includes 1) remaining time of the advertiser; 2) remaining budget; 3) budget spend speed; 4) real-time cost-efficiency (CPC); and 5) average cost-efficiency (CPC).
• Action $\boldsymbol{a}_{t}$ indicates the adjustment to the bidding parameters at time period $t$ . Its dimension equals the number of bidding parameters $\lambda_{j},\ j=0,...,J$ , and it is modeled as $(a_{t}^{\lambda_{0}},...,a_{t}^{\lambda_{J}})$ .
• The reward $\boldsymbol{r}_{t}$ is the value contributed to the objective within time period $t$ .
• A trajectory $\tau$ is the index of a sequence of states, actions, and rewards within an episode.

In the online advertising system, learning a policy through direct interaction with the online environment is infeasible due to safety concerns. Nonetheless, historical bidding logs, which contain trajectories from a variety of bidding strategies, are accessible and provide a viable alternative. Prevalent auto-bidding methods predominantly leverage this offline data to craft effective policies. Our approach follows this practice and is elaborated in the subsequent sections.

3. AIGB Paradigm for Auto-bidding

To thoroughly investigate the auto-bidding problem, we conducted a series of statistical analyses of bidding trajectories, with detailed information available in Appendix A.5. These analyses provide us with the insight that devising an effective bidding strategy is essentially equivalent to optimizing a state trajectory. Armed with this insight, we propose a hierarchical paradigm for auto-bidding that prioritizes state trajectory optimization and subsequently generates actions aligned with the optimized trajectory.

For state trajectory optimization, we can employ a generative model to capture the joint distribution of the entire bidding trajectory and its associated returns, subsequently generating the trajectory distribution conditioned on the desired return. This approach enables us to address key auto-bidding challenges by employing SOTA generative algorithms. This paper presents an implementation that utilizes Denoising Diffusion Probabilistic Models (DDPM). For action generation, several off-the-shelf methods can be utilized to predict the proper action given the target state trajectory. In this paper, we apply a widely used inverse dynamics model. The hierarchical paradigm divides auto-bidding into two supervised learning problems, offering several advantages that include enhanced interpretability and increased stability during the training process.

4. Diffusion Auto-bidding Model

In this section, we introduce the proposed Diffusion Auto-bidding Model (DiffBid) in detail. We first describe the modeling of auto-bidding through diffusion models in Section 4.1.1. We then describe the forward process in Section 4.1.2, the reverse process in Section 4.1.3, the training process in Section 4.2, and the design of conditions in Section 4.3. Finally, we give the complexity analysis in Section 4.4.

4.1. Diffusion Modeling of Auto-bidding

4.1.1. Overview

We model this sequential decision-making problem through conditional generative modeling (Chen et al., 2021; Ajay et al., 2022) by maximum likelihood estimation (MLE):

(4)

\mathop{\text{max}}_{\theta}\mathbb{E}_{\tau\sim D}\left[\text{log}p_{\theta}(\boldsymbol{x}_{0}(\tau)|\boldsymbol{y}(\tau))\right]

where $\tau$ is the trajectory index, $\boldsymbol{x}_{0}(\tau)$ is the original trajectory of states and $\boldsymbol{y}(\tau)$ is the corresponding property. The goal is to estimate the conditional data distribution with $p_{\theta}$ so that the future states of a trajectory $\boldsymbol{x}_{0}(\tau)$ from information $\boldsymbol{y}(\tau)$ can be generated. For example, in the context of online advertising, $\boldsymbol{y}(\tau)$ can be the constraints or the total value of the entire trajectory. Under such a setting, we can formalize the conditional diffusion modeling for auto-bidding:

(5)

q(\boldsymbol{x}_{k+1}(\tau)|\boldsymbol{x}_{k}(\tau)),\ \ p_{\theta}(\boldsymbol{x}_{k-1}(\tau)|\boldsymbol{x}_{k}(\tau),\boldsymbol{y}(\tau)),

where $q$ represents the forward process in which noises are gradually added to the trajectory while $p_{\theta}$ is the reverse process where a model is used for denoising. The detailed introduction of diffusion modeling can be found in Appendix A.2. The overall framework is presented in Figure 2. We will make a detailed discussion about the two modeling processes in the following sections.

4.1.2. Forward Process via Diffusion over States

We model the forward process $q(\boldsymbol{x}_{k+1}(\tau)|\boldsymbol{x}_{k}(\tau))$ via diffusion over states, where:

(6)

\boldsymbol{x}_{k}(\tau):=\left(\boldsymbol{s}_{1},...,\boldsymbol{s}_{t},...,\boldsymbol{s}_{T}\right)_{k},

where $\boldsymbol{s}_{t}$ is modeled as a one-dimensional vector. $\boldsymbol{x}_{k}(\tau)$ is a noised sequence of states and can be represented by a two-dimensional array whose first dimension indexes the time periods and whose second dimension holds the state values. Merely sampling states is not enough for an agent, so actions are recovered separately through inverse dynamics (Section 4.1.3). Given $\boldsymbol{x}_{k}(\tau)$ , we model the diffusion process as a Markov chain, where $\boldsymbol{x}_{k}(\tau)$ depends only on $\boldsymbol{x}_{k-1}(\tau)$ :

(7)

q(\boldsymbol{x}_{k}(\tau)|\boldsymbol{x}_{k-1}(\tau))=\mathcal{N}\left(\boldsymbol{x}_{k}(\tau);\sqrt{1-\beta_{k}}\boldsymbol{x}_{k-1}(\tau),\beta_{k}I\right),

When $k\rightarrow\infty$ , $\boldsymbol{x}_{k}(\tau)$ approaches a sequence of standard Gaussian variables. We can therefore sample from this distribution through the re-parameterization trick and then gradually denoise the trajectory to produce the final state sequence. For the design of $\beta_{k},\ \ k=1,...,K$ , we apply the cosine schedule (Nichol and Dhariwal, 2021), which increases the diffusion noise smoothly using a cosine function to prevent sudden changes in the noise level. The details of the noise schedule can be found in the appendix.
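As an illustration of the forward process, the following sketch computes a cosine schedule and applies one noising step of Eq. (7). The schedule follows the commonly used formulation of Nichol and Dhariwal (2021); the offset `s`, the clipping value `max_beta`, and the toy trajectory are assumptions made for this example and are not taken from the paper's appendix.

```python
import math
import random


def cosine_betas(K, s=0.008, max_beta=0.999):
    """Cosine schedule: beta_k = 1 - f(k) / f(k - 1), clipped at max_beta."""
    def f(k):
        return math.cos((k / K + s) / (1 + s) * math.pi / 2) ** 2
    return [min(1 - f(k) / f(k - 1), max_beta) for k in range(1, K + 1)]


def forward_step(x_prev, beta, rng):
    """One step of Eq. (7): x_k ~ N(sqrt(1 - beta) * x_{k-1}, beta * I)."""
    scale, std = math.sqrt(1 - beta), math.sqrt(beta)
    return [[scale * v + std * rng.gauss(0.0, 1.0) for v in state] for state in x_prev]


# Toy trajectory (hypothetical): T = 3 time periods, 2 state features each.
rng = random.Random(0)
betas = cosine_betas(K=20)
x = [[0.9, 0.1], [0.6, 0.3], [0.2, 0.5]]
for beta in betas:
    x = forward_step(x, beta, rng)  # after K steps x is close to Gaussian noise
```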

4.1.3. Reverse Process for Bid Generation

Following (Ho and Salimans, 2021; Ajay et al., 2022), we use a classifier-free guidance strategy with low-temperature sampling to guide bid generation and to extract high-likelihood trajectories from the dataset. During the training phase, we jointly train the unconditional model $\epsilon_{\theta}(\boldsymbol{x}_{k}(\tau),k)$ and the conditional model $\epsilon_{\theta}(\boldsymbol{x}_{k}(\tau),\boldsymbol{y}(\tau),k)$ by randomly dropping out conditions. During generation, a linear combination of conditional and unconditional score estimates is used:

(8)

\hat{\epsilon}_{k}:=\epsilon_{\theta}(\boldsymbol{x}_{k}(\tau),k)+\omega\left(\epsilon_{\theta}\left(\boldsymbol{x}_{k}(\tau),\boldsymbol{y}(\tau),k\right)-\epsilon_{\theta}\left(\boldsymbol{x}_{k}(\tau),k\right)\right),

where the scale $\omega$ is applied to extract the most suitable portion of the trajectory in the dataset that co-occurs with $\boldsymbol{y}(\tau)$ . After that, we can produce bidding parameters with DiffBid by sampling from $p_{\theta}(\boldsymbol{x}_{k-1}(\tau)|\boldsymbol{x}_{k}(\tau),\boldsymbol{y}(\tau))$ :

(9)

\boldsymbol{x}_{k-1}(\tau)\sim\mathcal{N}\left(\boldsymbol{x}_{k-1}(\tau)|\boldsymbol{\mu}_{\theta}\left(\boldsymbol{x}_{k}(\tau),\boldsymbol{y}(\tau),k\right),\boldsymbol{\Sigma}_{\theta}\left(\boldsymbol{x}_{k}(\tau),k\right)\right)

where a widely used parameterization is $\boldsymbol{\mu}_{\theta}(\boldsymbol{x}_{k}(\tau),\boldsymbol{y}(\tau),k)=\frac{1}{\sqrt{\alpha_{k}}}(\boldsymbol{x}_{k}(\tau)-\frac{\beta_{k}}{\sqrt{1-\overline{\alpha}_{k}}}\hat{\epsilon}_{k})$ and $\boldsymbol{\Sigma}_{\theta}(\cdot)=\beta_{k}$ , in which $\alpha_{k}=1-\beta_{k}$ and $\overline{\alpha}_{k}=\prod_{i=1}^{k}\alpha_{i}$ . When serving at time period $t$ , the agent first samples an initial trajectory $\boldsymbol{x}^{\prime}_{K}(\tau)\sim\mathcal{N}(0,I)$ and assigns the history states $\boldsymbol{s}_{0:t}$ into it. Then, we can sample predicted states with the reverse process recursively by

(10)

\boldsymbol{x}^{\prime}_{k-1}(\tau)=\boldsymbol{\mu}_{\theta}(\boldsymbol{x}^{\prime}_{k}(\tau),\boldsymbol{y}(\tau),k)+\sqrt{\beta_{k}}\boldsymbol{z}

where $\boldsymbol{z}\sim\mathcal{N}(0,I)$ . Given $\boldsymbol{x}^{\prime}_{0}(\tau)$ , we can extract the next predicted state $s^{\prime}_{t+1}$ , and determine how much it should bid to achieve that state. In this setting, we apply widely used inverse dynamics (Agrawal et al., 2016; Pathak et al., 2018) with non-Markovian state sequence to determine current bidding parameters at time period $t$ :

(11)

\boldsymbol{\hat{a}}_{t}=f_{\phi}(\boldsymbol{s}_{t-L:t},\boldsymbol{s}^{\prime}_{t+1}),

where $\boldsymbol{\hat{a}}_{t}\in\mathbb{R}^{J}$ contains predicted bidding parameters (i.e. $\lambda_{i},\ i=1,...,n$ ) at time $t$ . $L$ is the length of history states. The inverse dynamic function $f_{\phi}$ can be trained with the same offline logs as the reverse process. This design disentangles the learning of states and actions, making it easier to learn the connection between states thus achieving better empirical performance. The overall procedure is summarized in Algorithm 1.
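The generation steps in Eqs. (8) and (10) can be sketched as follows. This is an illustrative sketch, not the paper's Algorithm 1: the trajectory is flattened to a list of scalar states for brevity, `dummy_eps_model` is a hypothetical stand-in for the trained noise network $\epsilon_{\theta}$, and noise is added at every step exactly as Eq. (10) is written. The inverse dynamics model $f_{\phi}$ of Eq. (11) is not implemented here.

```python
import math
import random


def guided_noise(eps_uncond, eps_cond, omega):
    """Eq. (8): combine unconditional and conditional noise estimates."""
    return [u + omega * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def reverse_step(x_k, eps_hat, k, betas, rng):
    """Eq. (10): x_{k-1} = mu_theta + sqrt(beta_k) * z, with k 1-indexed."""
    beta = betas[k - 1]
    alpha = 1.0 - beta
    alpha_bar = math.prod(1.0 - b for b in betas[:k])
    coef = beta / math.sqrt(1.0 - alpha_bar)
    mean = [(x - coef * e) / math.sqrt(alpha) for x, e in zip(x_k, eps_hat)]
    return [m + math.sqrt(beta) * rng.gauss(0.0, 1.0) for m in mean]


def generate(eps_model, y, history, length, betas, omega, rng):
    """Sample a trajectory whose first len(history) entries are the observed states."""
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]
    for k in range(len(betas), 0, -1):
        x[:len(history)] = history  # keep the known history states fixed
        eps_hat = guided_noise(eps_model(x, None, k), eps_model(x, y, k), omega)
        x = reverse_step(x, eps_hat, k, betas, rng)
    x[:len(history)] = history
    return x


def dummy_eps_model(x, y, k):
    # Hypothetical stand-in for the trained noise network: predicts zero noise.
    return [0.0] * len(x)


trajectory = generate(dummy_eps_model, y=[1.0], history=[0.5, 0.4], length=6,
                      betas=[0.1, 0.2, 0.3], omega=2.0, rng=random.Random(0))
next_state = trajectory[2]  # predicted s'_{t+1}, the input to inverse dynamics
```

In a real system, `next_state` and the recent history would then be passed to the inverse dynamics model to obtain the bidding parameters for the current period.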

4.2. DiffBid Training

Following (Ho et al., 2020), we train DiffBid to approximate the given noise and the returns in a supervised manner. Given a bidding trajectory $\boldsymbol{x}_{0}(\tau)$ , we have its corresponding returns e.g., values the advertiser received, the constraint the model should obey and the history states $s_{l},l=1,...,t$ before time $t+1$ . Then we just train the reverse process model $p_{\theta}$ which is parameterized through the noise model $\epsilon_{\theta}$ and the inverse dynamics $f_{\phi}$ through:

(12)

\begin{split}\mathcal{L}(\theta,\phi)&=\mathbb{E}_{k,\tau\in\mathcal{D}}\left[||\epsilon-\epsilon_{\theta}(\boldsymbol{x}_{k}(\tau),\boldsymbol{y}(\tau),k)||^{2}\right]\\ &+\mathbb{E}_{(\boldsymbol{s}_{t-L:t},\boldsymbol{a}_{t},\boldsymbol{s}^{\prime}_{t+1})\in\mathcal{D}}\left[||\boldsymbol{a}_{t}-f_{\phi}(\boldsymbol{s}_{t-L:t},\boldsymbol{s}^{\prime}_{t+1})||^{2}\right].\end{split}

In the training process, we randomly sample a bidding trajectory $\tau$ and a diffusion step $k$ , then construct a noised trajectory $\boldsymbol{x}_{k}(\tau)$ and predict the noise through Eq. (8). Following (Ho and Salimans, 2021), we randomly drop the conditions $\boldsymbol{y}(\tau)$ with probability $p$ during training to enhance robustness. The process is presented in Algorithm 2.
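The construction of one training example for the first term of Eq. (12) can be sketched as below. The closed-form noising $\boldsymbol{x}_{k}=\sqrt{\overline{\alpha}_{k}}\boldsymbol{x}_{0}+\sqrt{1-\overline{\alpha}_{k}}\epsilon$ is the standard DDPM identity from (Ho et al., 2020) and is an assumption about how $\boldsymbol{x}_{k}(\tau)$ is built here; the function names and the flattened toy trajectory are hypothetical, and this is not the paper's Algorithm 2.

```python
import math
import random


def make_training_example(x0, y, betas, p_drop, rng):
    """Sample a diffusion step k, noise x0 to x_k, and drop the condition with prob. p_drop."""
    k = rng.randint(1, len(betas))  # diffusion step, 1-indexed
    alpha_bar = math.prod(1.0 - b for b in betas[:k])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_k = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
           for x, e in zip(x0, eps)]
    cond = None if rng.random() < p_drop else y  # None trains the unconditional branch
    return x_k, cond, k, eps


def squared_error(pred, target):
    """Squared L2 norm ||target - pred||^2, as in the noise term of Eq. (12)."""
    return sum((t - p) ** 2 for p, t in zip(pred, target))


rng = random.Random(0)
x_k, cond, k, eps = make_training_example(
    x0=[0.2, 0.4, 0.6], y=[1.0], betas=[0.1, 0.2, 0.3], p_drop=0.2, rng=rng)
# A real model would predict eps from (x_k, cond, k); a zero prediction is used here.
loss = squared_error([0.0] * len(eps), eps)
```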

4.3. Design of Conditions

In this section, we present approaches transforming industrial metrics into conditions of DiffBid.

4.3.1. Generation with Returns

For each trajectory $\tau$ , the return is the total value the advertiser received, $R(\tau)=\sum_{t=1}^{T}r_{t}$ . We normalize the return by:

(13)

R=\frac{R(\tau)-R_{\text{min}}}{R_{\text{max}}-R_{\text{min}}},

where $R_{\text{min}}$ and $R_{\text{max}}$ are the smallest and the largest return in the dataset. Through Eq. (13) we normalize the return into $[0,1]$ and merge it into $\boldsymbol{y}(\tau)$ . Subsequently, we train the model to generate trajectories conditioned on the normalized returns. Trajectories with more value received have higher normalized returns, so $R=1$ indicates the best trajectory with the highest value, which best fits the advertisers’ needs. During generation, we simply set $R=1$ and generate the trajectory under the max-return condition for the advertiser.
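A minimal sketch of the normalization in Eq. (13), using hypothetical return values:

```python
def normalize_return(ret, r_min, r_max):
    """Eq. (13): min-max normalize a trajectory return into [0, 1]."""
    if r_max == r_min:
        raise ValueError("r_max and r_min must differ")
    return (ret - r_min) / (r_max - r_min)


# Hypothetical returns of three logged trajectories.
returns = [120.0, 300.0, 480.0]
r_min, r_max = min(returns), max(returns)
normalized = [normalize_return(r, r_min, r_max) for r in returns]
target_condition = 1.0  # at generation time the condition is set to R = 1
```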

4.3.2. Generation with Constraints or Human Feedback

In MCB, the cumulative performance related to the constraints within a given episode should be controlled so as not to exceed the advertisers’ expectations. In such a setting, we can design $\boldsymbol{y}(\tau)$ to control the generation process. For example, in the Target-CPC setting, we can maintain a binary variable $E$ to indicate whether the final CPC stays within the given constraint $C$ :

(14)

E=\text{I}_{x\leq C}(x)

where $x=\frac{\sum_{i}c_{i}o_{i}}{\sum_{i}p_{i}o_{i}}$ is the ratio defined in Eq. (2). We can then normalize $x$ into $[0,1]$ through min-max normalization for simplification. $E$ indicates whether trajectory $\tau$ satisfies the CPC constraint, so we can design $\boldsymbol{y}(\tau)$ to include $E=1$ to make the model generate bids that do not break the CPC constraint. Sometimes it is also important to adjust the bidding parameters given real-time feedback provided by the advertiser to enable flexibility. Here we use two example indicators that reflect the experience of advertisers:

(1) Smoothness: an advertiser may expect the cost curve to be as smooth as possible to avoid sudden changes. By defining $x=\frac{1}{T}\sum_{t}\left|cost_{t}-cost_{t-1}\right|$ , we can model it as a binary variable $S$ indicating whether the average cost change between adjacent time periods exceeds a threshold, as in Eq. (14).
(2) Early/Late Spend: an advertiser may expect the budget to be spent in the morning or in the evening when there are promotions. Here we model the ratio of cost in the early half of the day through $x=\frac{\sum_{t=0}^{T/2}{cost_{t}}}{\sum_{t=0}^{T}{cost_{t}}}$ , and use a binary variable to indicate whether the spend in the early half of the day exceeds a certain threshold $C$ , as in Eq. (14).

We can also compose several constraints together to form $\boldsymbol{y}(\tau)$ to guide the model to generate bid parameters that adhere to different constraints. In this setting, $\boldsymbol{y}(\tau)$ will be a vector.
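The indicators above can be composed into a condition vector as in the following sketch. The per-period costs and clicks, the thresholds, and the ordering of the vector entries are all hypothetical choices for illustration; the CPC and smoothness flags use the "at most the threshold" form of Eq. (14), while the early-spend flag is set when the ratio exceeds its threshold.

```python
def indicator_leq(x, threshold):
    """Eq. (14): 1 if x <= threshold, else 0."""
    return 1 if x <= threshold else 0


def cpc(costs, clicks):
    """Final cost per click over the episode."""
    return sum(costs) / sum(clicks)


def smoothness(costs):
    """Average absolute cost change between adjacent periods, divided by T as in the text."""
    return sum(abs(costs[t] - costs[t - 1]) for t in range(1, len(costs))) / len(costs)


def early_spend_ratio(costs):
    """Share of total cost spent in the first half of the episode."""
    half = len(costs) // 2
    return sum(costs[:half]) / sum(costs)


# Hypothetical per-period logs for one trajectory.
costs = [1.0, 2.0, 2.0, 3.0]
clicks = [1, 1, 2, 0]
y = [
    1.0,                                          # normalized return target R = 1
    indicator_leq(cpc(costs, clicks), 2.5),       # E: CPC constraint satisfied
    indicator_leq(smoothness(costs), 0.6),        # S: cost curve smooth enough
    1 - indicator_leq(early_spend_ratio(costs), 0.5),  # early spend exceeds 50%
]
```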

4.4. Complexity Analysis

The complexity analysis of DiffBid covers both the training process and the inference process. For training, given that the time complexity of the noise prediction model $\epsilon_{\theta}$ is $\mathcal{O}(T_{1})$ and the complexity of the inverse dynamics model $f_{\phi}$ is $\mathcal{O}(T_{2})$ , the complexity of a training epoch is $\mathcal{O}(|\mathcal{B}|(T_{1}+T_{2}))$ . The training complexity is therefore linear in the input, given that $T_{1}$ and $T_{2}$ are relatively fixed, so the training of DiffBid is efficient. For generation, given the total number of diffusion steps $K$ and the trajectory length $L$ , the time complexity of inference is $\mathcal{O}(KL(T_{1}+T_{2}))$ , which scales linearly with the number of diffusion steps $K$ . In image generation, $K$ is usually very large to ensure good generation quality, which leads to inefficiency. For bidding generation, however, we find that $K$ need not be very large: a relatively small $K$ already produces promising results. Moreover, auto-bidding tolerates higher latency, which allows a relatively larger $K$ when needed.

5. Theoretical Analysis

In this section, we theoretically analyze the properties of DiffBid. Specifically, we show that DiffBid, which uses MLE as its objective, has a corresponding non-Markovian decision problem (Majeed and Hutter, 2018; Qin et al., 2023; Gaon and Brafman, 2020; Mutti et al., 2022). The detailed proofs can be found in Appendix A.7.

Lemma 5.1 (MLE as non-Markovian decision-making).

Assuming the Markovian transition $p_{\gamma^{*}}(s_{t+1}|s_{t},a_{t})$ is known and the ground-truth conditional state distribution $p^{*}(s_{t+1}|s_{0:t})$ for demonstration sequences is accessible, we can construct a non-Markovian sequential decision-making problem based on a reward function $r_{\alpha}(s_{t+1};s_{0:t}):={\rm{log}}\int p_{\alpha}(a_{t}|s_{0:t})p_{\gamma^{*}}(s_{t+1}|s_{t},a_{t})da_{t}$ for an arbitrary energy-based policy $p_{\alpha}(a_{t}|s_{0:t})$ . Its objective is

(15)

\sum_{t=0}^{T}\mathbb{E}_{p^{*}(s_{0:t})}\left[V^{p_{\alpha}}(s_{0:t})\right]=\mathbb{E}_{p^{*}(s_{0:T})}\left[\sum_{t=0}^{T}\sum_{k=t}^{T}r_{\alpha}(s_{k+1};s_{0:k})\right]

where $V^{p_{\alpha}}(s_{0:t}):=\mathbb{E}_{p^{*}(s_{t+1:T}|s_{0:t})}[\sum_{k=t}^{T}r_{\alpha}(s_{k+1};s_{0:k})]$ is the value function of $p_{\alpha}$ . This objective yields the same optimal policy as the Maximum Likelihood Estimation objective $\mathbb{E}_{p^{*}(s_{0:T})}\left[{\rm{log}}p_{\theta}(s_{0:T})\right]$ .

Remarks. This analysis shows that DiffBid with the MLE objective has a corresponding non-Markovian decision problem, and that their optima are equivalent. It means that DiffBid does not require the MDP assumption and is thus more powerful in handling the randomness and sparse returns found in the advertising environment.

6. Experiments

Table 1. Performance comparison with baselines in different settings, including different data scales and budgets, in Max Return bidding. "improv" indicates the relative improvement of DiffBid over the most competitive baseline. In the original table, the best results are bolded and the second-best results are underlined.

Training Dataset	Budget	USCB	BCQ	CQL	IQL	DT	DiffBid	improv
USCB-5K	1.5K	454.25	454.72	461.82	456.80	477.39	480.76	0.71%
	2.0K	482.67	483.50	475.78	486.56	507.30	511.17	0.76%
	2.5K	497.66	498.77	481.37	518.27	527.88	531.29	0.65%
	3.0K	500.60	501.86	491.36	549.19	550.66	556.32	1.03%
USCBEx-5K	1.5K	454.25	453.74	358.43	464.69	378.64	475.62	2.35%
	2.0K	482.67	487.63	356.80	529.36	439.03	544.38	2.84%
	2.5K	497.66	510.75	356.41	613.67	505.43	624.29	1.73%
	3.0K	500.60	512.18	355.42	670.65	574.79	678.73	1.17%
USCBEx-50K	1.5K	454.25	458.64	435.06	446.23	396.24	495.57	8.05%
	2.0K	482.67	491.72	431.49	533.58	478.29	551.73	3.40%
	2.5K	497.66	513.23	428.39	592.32	554.48	606.34	2.37%
	3.0K	500.60	526.21	425.29	633.26	611.50	644.88	1.83%
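As a worked check of the "improv" column, the sketch below recomputes the relative improvement of DiffBid over the strongest baseline for two rows of Table 1. The function name is hypothetical; the numbers are copied from the table.

```python
def relative_improvement(diffbid, baselines):
    """Percentage gain of DiffBid over the best baseline score in a row."""
    best = max(baselines.values())
    return 100.0 * (diffbid - best) / best


# Row USCB-5K, budget 1.5K (values from Table 1).
row_a = {"USCB": 454.25, "BCQ": 454.72, "CQL": 461.82, "IQL": 456.80, "DT": 477.39}
improv_a = round(relative_improvement(480.76, row_a), 2)

# Row USCBEx-50K, budget 1.5K (values from Table 1).
row_b = {"USCB": 454.25, "BCQ": 458.64, "CQL": 435.06, "IQL": 446.23, "DT": 396.24}
improv_b = round(relative_improvement(495.57, row_b), 2)
```

The two results match the 0.71% and 8.05% entries reported in the table.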

6.1. Experimental Setup

6.1.1. Experimental Environment

The simulation experiments are conducted in a manually built offline real advertising system (RAS) as in (Mou et al., 2022). Specifically, the RAS is composed of two consecutive stages, whose auction mechanisms resemble those of the real advertising system. We consider the bidding process in a day, where the episode is divided into 96 time steps. Thus, the duration between any two adjacent time steps $t$ and $t+1$ is 15 minutes. The number of impression opportunities between time step $t$ and $t+1$ fluctuates from 100 to 500. Detailed parameters in the RAS are shown in Table 5. We keep the parameters the same for all experiments.

6.1.2. Data Collection

We use the widely applied auto-bidding RL method USCB in the online environment to generate the bidding logs for offline RL training. This results in a total of $5,000$ trajectories for the base dataset and $50,000$ for a larger one. To increase the diversity of the action space, we also perform random exploration to generate datasets with more noise. The above process results in three datasets: USCB-5K, USCBEx-5K, and USCBEx-50K, where USCBEx indicates USCB logs with random exploration data.

6.1.3. Baselines

We use the state-of-the-art auto-bidding method USCB as well as four other recently proposed offline RL methods as our baselines. The details of the baselines are as follows:

• USCB (He et al., 2021): an RL method designed for real-time bidding that dynamically adjusts parameters to achieve the optimum. It has outperformed many RL baselines and is also the base policy used to collect the data for offline training.
• BCQ (Fujimoto et al., 2019): a classic offline RL method that learns without interaction with the environment.
• CQL (Kumar et al., 2020): addresses the limitations of offline RL methods by learning a conservative Q-function such that the expected value of a policy under it lower-bounds its true value.
• IQL (Kostrikov et al., 2021): an offline RL method that does not require evaluating actions outside of the dataset, yet enables substantial improvement of the learned policy beyond the best behavior in the data through generalization.
• DT (Chen et al., 2021): a prevalent generative method based on the transformer architecture for sequential decision-making.

6.1.4. Implementation Details

For the baselines, we start from the default hyper-parameters suggested in their papers and tune them to the best of our ability. For DiffBid, the number of diffusion steps is searched within $\{5,10,20,30,50\}$. $\gamma$ is set to 0.008. $L$ is searched in $\{1,2,3\}$. $\omega$ for noise schedule is set to 0.2 empirically. The batch size is set to $2\%$ of all training trajectories. The total number of training epochs is set to 500. For the implementation of $p_{\theta}$, we adopt U-Net, the most widely used model for diffusion modeling, with hidden sizes of 128 and 256. We use the Adam optimizer with a learning rate of $10^{-4}$ to optimize the model. The condition dropout ratio is set to 0.2 during training. We update the model with momentum updates over a period of 4 steps.

6.1.5. Evaluation

For evaluation, we randomly initialize a multi-agent advertising environment with USCB as the base auto-bidding agents and use the other methods to compete with these agents. We evaluate under four budgets, 1500, 2000, 2500, and 3000, to assess generalization across budget scales. We use the cumulative reward as the evaluation metric, which reflects the total gain received by the target agent. For each method, we randomly initialize 50 times and report the average of the top-5 scores.
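The reporting protocol can be sketched as follows. This is a minimal illustration, assuming the cumulative rewards of the 50 random initializations have already been collected; the function name `top_k_mean` and the reward values are hypothetical.

```python
def top_k_mean(scores, k=5):
    """Average of the k largest scores (top-5 of 50 runs in the protocol above)."""
    if len(scores) < k:
        raise ValueError("need at least k scores")
    best = sorted(scores, reverse=True)[:k]
    return sum(best) / k


# Hypothetical cumulative rewards from 50 randomly initialized runs.
runs = [2000.0 + 10.0 * i for i in range(50)]
report = top_k_mean(runs, k=5)
```

Averaging only the best five runs rewards peak performance, so it is best read together with the stability analysis across random seeds.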

6.2. Performance Evaluation

The performance against baselines is shown in Table 1, which reports the cumulative reward of all models under different budgets. We make the following observations.

First, offline RL methods consistently outperform the state-of-the-art auto-bidding method, USCB. This finding underscores the advantages of leveraging historical bidding data to train RL agents. Offline RL methods, such as BCQ, IQL, and DT, achieve higher cumulative rewards across the budget scenarios. Their advantage can be attributed to the ability to learn from past bidding experiences without interacting with a simulation environment. This mitigates the inconsistency between the online and offline bidding environments and leads to policies that are better aligned with real-world scenarios.

Second, DiffBid is the top-performing approach among all evaluated methods. In all budget scenarios and training datasets, it achieves the highest cumulative rewards. This highlights the efficacy of optimizing bidding strategies by directly modeling the correlation between returns and entire trajectories. By decoupling the computational complexity from horizon length, DiffBid achieves superior decision-making capabilities, outperforming traditional RL methods in both foresight and strategy.

Third, the size of the training dataset affects model performance. Comparing the "USCB-5K" and "USCBEx-50K" settings shows that a larger training dataset consistently leads to improved cumulative rewards. A richer dataset allows the models to capture more diverse bidding scenarios and make more informed decisions, ultimately resulting in better performance.

Finally, DiffBid is resilient to noise. In real-world advertising environments, the bidding process is inherently uncertain and variable due to factors such as market dynamics and competitor behavior. DiffBid appears to handle such noise more effectively than the RL baselines: even when bidding outcomes are less predictable, it maintains competitive performance.

6.3. Ablation Study

Table 2. Ablation Study

Model	USCBEx-5K	USCBEx-50K
DiffBid	2280.12	2395.60
DiffBid w/o cond	1812.64	1852.21
DiffBid w/o non-mkv	2254.78	2287.41

To study the contribution of each part of DiffBid, we remove one module at a time and check whether the removal results in a performance drop. The result of the ablation study is shown in Table 2. Due to space limitations, we only provide the results on USCBEx-5K and USCBEx-50K. w/o cond refers to DiffBid with the condition set to 0.0 (rather than 1.0). w/o non-mkv refers to the variant where only the current state and the predicted next state are used to generate the bidding coefficient. From the table, we find that both parts contribute to the final result, and removing either of them results in a performance drop. This verifies the effectiveness of the proposed components in boosting DiffBid’s performance for auto-bidding.
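To put the drops in Table 2 on a common scale, the relative loss of each ablated variant with respect to the full model can be computed directly from the reported rewards. The snippet below only re-uses the numbers in Table 2; the helper name `relative_drop` is hypothetical.

```python
# Cumulative rewards copied from Table 2.
table2 = {
    "DiffBid": {"USCBEx-5K": 2280.12, "USCBEx-50K": 2395.60},
    "DiffBid w/o cond": {"USCBEx-5K": 1812.64, "USCBEx-50K": 1852.21},
    "DiffBid w/o non-mkv": {"USCBEx-5K": 2254.78, "USCBEx-50K": 2287.41},
}


def relative_drop(full, ablated):
    """Percentage of the full model's reward lost by the ablated variant."""
    return (full - ablated) / full * 100.0


drops = {
    variant: {
        dataset: relative_drop(table2["DiffBid"][dataset], reward)
        for dataset, reward in rewards.items()
    }
    for variant, rewards in table2.items()
    if variant != "DiffBid"
}
```

On these numbers, removing the condition costs roughly 20% to 23% of the reward, while removing the non-Markovian input costs roughly 1% to 5%.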

6.4. In-depth Analysis

6.4.1. Study of State Transition

Here we compare the state transition of the baseline method USCB and our proposed method DiffBid. The results for grouped and non-grouped state transitions during a day are shown in Figure 3. In this figure, we plot the remaining-budget ratio against the time steps of one day. From the figure, we can observe that under USCB, most advertisers do not exhaust their budgets. This is attributed to the inconsistency between the offline virtual environment and the real online environment faced by USCB. In contrast, budget completion improves under DiffBid, where most advertisers spend more than 80% of their budgets. One possible reason is that DiffBid finds that trajectories with a high budget completion ratio also have a high cumulative reward, and thus tends to generate such trajectories. Moreover, advertisers with small budgets tend to spend money in the afternoon. This is because the impressions in the afternoon offer higher cost-effectiveness, albeit in limited quantity.

6.4.2. Performance under Constraints and Feedback

We additionally investigate DiffBid’s multi-objective optimization capability under specific constraints, comparing its performance with offline RL. Specifically, we choose the CPC exceeding ratio and the overall return as metrics and examine the ability of DiffBid and IQL to control the overall CPC exceeding ratio while maximizing the overall return. During training, we set different CPC thresholds as in Eq. (14). At test time, we let DiffBid generate trajectories under the expected CPC. In Figure 4, we show the exceeding ratio and overall return under different CPC constraints and training settings. From the figure, we find that DiffBid can control diverse levels of exceeding ratio while keeping the return intact, surpassing IQL by a significant margin. Consequently, DiffBid holds a distinct advantage in addressing MCB problems. We also study the performance under different advertiser feedback. During training, we split the trajectories by the thresholds of Eq. (14) into high and low levels, and learn the conditional distribution under each level. During generation, we adjust the condition, generate the corresponding samples, and summarize the metrics. The statistical distributions of the metrics for the low-level, high-level, and original trajectories are shown in Figure 5. We find that the trajectory obtained from deploying DiffBid is well controlled by the condition.

6.4.3. Impact of Diffusion Steps

We also study the overall performance under different numbers of diffusion steps, an important factor for both efficiency and performance. The overall impact of diffusion steps with respect to different budgets is illustrated in Figure 6. From the figure, we make two observations. First, diffusion steps have a larger impact on advertisers with small budgets (1500 yuan). Second, larger budgets are not sensitive to the number of diffusion steps, and the best result is obtained within 30 diffusion steps in most situations.

6.4.4. Stability

In this study, we randomly initialized the parameters of three models - CQL, IQL, and DiffBid - and conducted thirty training trials for each to examine stability in performance. As depicted in Figure 3(b), the RL-based models, CQL and IQL, showed a tendency towards instability under varying random seeds. Notably, IQL demonstrated slightly better performance than CQL, which may be attributed to its design optimized for conservative regularization. Contrasting with these, the generative model DiffBid exhibited remarkable stability, with significantly fewer instances of failure compared to its RL counterparts.

6.5. Online A/B Test

To further substantiate the effectiveness of DiffBid, we deployed it on the Alibaba advertising platform and compared it against the baseline IQL (Kostrikov et al., 2021) method, which performs best among various auto-bidding methods.

Table 3. Online A/B Test Result.

Metrics	#Plan	Budget	Cost	Buycnt	GMV	ROI
Baseline	2068	886744	834426.104	23584.6836	1853823	2.221
DiffBid	2068	886744	829992.384	24078.6883	1905954	2.296
compare	-	-	-0.53%	+2.09%	+2.81%	+3.36%

The online A/B test was conducted from February 01, 2024, to February 08, 2024. The results are shown in Table 3. DiffBid improves Buycnt by 2.09%, GMV by 2.81%, and ROI by 3.36%, showing its effectiveness in optimizing the overall performance. For efficiency, DiffBid takes 0.2s per request with GPU acceleration, compared with 0.07s for the baseline, which means latency can be well guaranteed.
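The "compare" row of Table 3 can be re-derived from the two raw rows. The sketch below assumes that ROI is defined as GMV divided by cost, which is not stated explicitly here but is consistent with the reported values; the helper names are hypothetical.

```python
# Rows copied from Table 3.
baseline = {"Cost": 834426.104, "Buycnt": 23584.6836, "GMV": 1853823.0}
diffbid = {"Cost": 829992.384, "Buycnt": 24078.6883, "GMV": 1905954.0}


def roi(row):
    # Assumption: ROI = GMV / Cost.
    return row["GMV"] / row["Cost"]


def pct_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


changes = {name: pct_change(diffbid[name], baseline[name]) for name in baseline}
changes["ROI"] = pct_change(roi(diffbid), roi(baseline))

for name, value in changes.items():
    print(f"{name}: {value:+.2f}%")
```

Under this assumption the script reproduces the reported -0.53% (Cost), +2.09% (Buycnt), +2.81% (GMV), and +3.36% (ROI).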

7. Related Works

Offline Reinforcement Learning. Offline reinforcement learning is a research direction that has gained significant attention in recent years. The primary goal of offline RL is to learn effective policies from a fixed dataset without additional online interaction with the environment. This approach is particularly beneficial when online interaction is costly, risky, or otherwise not feasible. Notable works include Conservative Q-learning (CQL) by Kumar et al. (Kumar et al., 2020) and Batch-Constrained deep Q-learning (BCQ) by Fujimoto et al. (Fujimoto et al., 2019). Both algorithms aim to tackle the overestimation bias that tends to occur in offline RL settings. Kostrikov et al. (Kostrikov et al., 2021) propose an implicit Q-learning approach to address the training instability of CQL. Chen et al. (Chen et al., 2021) propose to use transformers for offline RL to increase the model capability. Hansen-Estruch et al. (Hansen-Estruch et al., 2023) propose a diffusion-based approach with implicit Q-learning for offline RL.

Diffusion Models. Diffusion models have recently shown the capability of high-quality generation (Croitoru et al., 2023), unconditional generation (Austin et al., 2021), and conditional generation (Chao et al., 2022; Huang et al., 2022). They have also shown promising performance in decision-making. Hansen-Estruch et al. (Hansen-Estruch et al., 2023) propose a diffusion-based approach with implicit Q-learning for offline RL. Wang et al. (Wang et al., 2022) propose an expressive policy through diffusion modeling. Chen et al. (Chen et al., 2022) propose to use diffusion models for behavior modeling. Hu et al. (Hu et al., 2023) introduce temporal conditions for trajectory generation. Li et al. (Li et al., 2023) utilize a diffusion model in anti-money laundering. Despite these preliminary explorations, no work has been devoted to diffusion-based auto-bidding, which requires the model to adapt to the random advertising environment.

Auto-bidding. Auto-bidding systems are widely used in programmatic advertising, where they are employed to automatically place bids on ad spaces. The main focus of such systems is to optimise a given key performance indicator (KPI), such as the number of clicks or conversions, while maintaining a certain budget (Wang et al., 2017). Cai et al. (Cai et al., 2017) proposed an RL-based approach to the problem of auto-bidding for display advertising. They designed a bidding environment and applied a deep RL algorithm to learn the optimal bidding strategy. He et al. (He et al., 2021) propose a unified solution with RL to enable multiple constraints for auto-bidding. Jin et al. extend RL to enable multi-agent competition (Jin et al., 2018). Mou et al. (Mou et al., 2022) also adopted the RL framework for auto-bidding and showed that their approach can outperform traditional bidding strategies. Wen et al. (Wen et al., 2022) propose a multi-agent-based approach for auto-bidding, which models multiple auto-bidding agents at the same time to include more information and has also been deployed online.

8. Conclusion

In this paper, we design a new paradigm for auto-bidding through the lens of generative modeling. To achieve this goal, we propose a decision-denoising diffusion approach that generates conditional bidding trajectories while keeping the generated samples under given constraints. This new generative modeling approach enables the integration of different kinds of industrial metrics, making it the first unified model for bidding. Extensive experiments on real-world simulation environments demonstrate the effectiveness of the newly proposed approach. In the future, we will consider developing new methods to accelerate the generation process and to ensure the robustness of DiffBid.

References

Agrawal et al. (2016) Pulkit Agrawal, Ashvin V Nair, Pieter Abbeel, Jitendra Malik, and Sergey Levine. 2016. Learning to poke by poking: Experiential learning of intuitive physics. Advances in neural information processing systems 29 (2016).
Ajay et al. (2022) Anurag Ajay, Yilun Du, Abhi Gupta, Joshua B Tenenbaum, Tommi S Jaakkola, and Pulkit Agrawal. 2022. Is Conditional Generative Modeling all you need for Decision Making?. In The Eleventh International Conference on Learning Representations.
Austin et al. (2021) Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. 2021. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems 34 (2021), 17981–17993.
Balseiro et al. (2021a) Santiago Balseiro, Yuan Deng, Jieming Mao, Vahab Mirrokni, and Song Zuo. 2021a. Robust auction design in the auto-bidding world. Advances in Neural Information Processing Systems 34 (2021), 17777–17788.
Balseiro et al. (2021b) Santiago R Balseiro, Yuan Deng, Jieming Mao, Vahab S Mirrokni, and Song Zuo. 2021b. The landscape of auto-bidding auctions: Value versus utility maximization. In Proceedings of the 22nd ACM Conference on Economics and Computation. 132–133.
Cai et al. (2017) Han Cai, Kan Ren, Weinan Zhang, Kleanthis Malialis, Jun Wang, Yong Yu, and Defeng Guo. 2017. Real-time bidding by reinforcement learning in display advertising. In Proceedings of the tenth ACM international conference on web search and data mining. 661–670.
Chao et al. (2022) Chen-Hao Chao, Wei-Fang Sun, Bo-Wun Cheng, Yi-Chen Lo, Chia-Che Chang, Yu-Lun Liu, Yu-Lin Chang, Chia-Ping Chen, and Chun-Yi Lee. 2022. Denoising likelihood score matching for conditional score-based data generation. arXiv preprint arXiv:2203.14206 (2022).
Chen et al. (2022) Huayu Chen, Cheng Lu, Chengyang Ying, Hang Su, and Jun Zhu. 2022. Offline reinforcement learning via high-fidelity generative behavior modeling. arXiv preprint arXiv:2209.14548 (2022).
Chen et al. (2021) Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. 2021. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems 34 (2021), 15084–15097.
Chiesi et al. (1979) Harry L Chiesi, George J Spilich, and James F Voss. 1979. Acquisition of domain-related information in relation to high and low domain knowledge. Journal of verbal learning and verbal behavior 18, 3 (1979), 257–273.
Croitoru et al. (2023) Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. 2023. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
Deng et al. (2021) Yuan Deng, Jieming Mao, Vahab Mirrokni, and Song Zuo. 2021. Towards efficient auctions in an auto-bidding world. In Proceedings of the Web Conference 2021. 3965–3973.
Evans (2009) David S Evans. 2009. The online advertising industry: Economics, evolution, and privacy. Journal of economic perspectives 23, 3 (2009), 37–60.
Fujimoto et al. (2019) Scott Fujimoto, David Meger, and Doina Precup. 2019. Off-policy deep reinforcement learning without exploration. In International conference on machine learning. PMLR, 2052–2062.
Fujimoto et al. (2022) Scott Fujimoto, David Meger, Doina Precup, Ofir Nachum, and Shixiang Shane Gu. 2022. Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error. In Proceedings of the 39th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 162), Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (Eds.). PMLR, 6918–6943. https://proceedings.mlr.press/v162/fujimoto22a.html
Gaon and Brafman (2020) Maor Gaon and Ronen Brafman. 2020. Reinforcement learning with non-markovian rewards. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 3980–3987.
Guo et al. (2022) Jiayan Guo, Yaming Yang, Xiangchen Song, Yuan Zhang, Yujing Wang, Jing Bai, and Yan Zhang. 2022. Learning Multi-granularity Consecutive User Intent Unit for Session-based Recommendation. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining (Virtual Event, AZ, USA) (WSDM ’22). Association for Computing Machinery, New York, NY, USA, 343–352. https://doi.org/10.1145/3488560.3498524
Ha (2008) Louisa Ha. 2008. Online advertising research in advertising journals: A review. Journal of Current Issues & Research in Advertising 30, 1 (2008), 31–48.
Hansen-Estruch et al. (2023) Philippe Hansen-Estruch, Ilya Kostrikov, Michael Janner, Jakub Grudzien Kuba, and Sergey Levine. 2023. Idql: Implicit q-learning as an actor-critic method with diffusion policies. arXiv preprint arXiv:2304.10573 (2023).
Hao et al. (2020) Xiaotian Hao, Zhaoqing Peng, Yi Ma, Guan Wang, Junqi Jin, Jianye Hao, Shan Chen, Rongquan Bai, Mingzhou Xie, Miao Xu, Zhenzhe Zheng, Chuan Yu, Han Li, Jian Xu, and Kun Gai. 2020. Dynamic Knapsack Optimization Towards Efficient Multi-Channel Sequential Advertising. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event (Proceedings of Machine Learning Research, Vol. 119). PMLR, 4060–4070. http://proceedings.mlr.press/v119/hao20b.html
He et al. (2021) Yue He, Xiujun Chen, Di Wu, Junwei Pan, Qing Tan, Chuan Yu, Jian Xu, and Xiaoqiang Zhu. 2021. A unified solution to constrained bidding in online display advertising. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2993–3001.
Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in neural information processing systems 33 (2020), 6840–6851.
Ho and Salimans (2021) Jonathan Ho and Tim Salimans. 2021. Classifier-Free Diffusion Guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications.
Hu et al. (2023) Jifeng Hu, Yanchao Sun, Sili Huang, SiYuan Guo, Hechang Chen, Li Shen, Lichao Sun, Yi Chang, and Dacheng Tao. 2023. Instructed Diffuser with Temporal Condition Guidance for Offline Reinforcement Learning. arXiv preprint arXiv:2306.04875 (2023).
Huang et al. (2022) R Huang, MWY Lam, J Wang, D Su, D Yu, Y Ren, and Z Zhao. 2022. FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis. In IJCAI International Joint Conference on Artificial Intelligence. IJCAI: International Joint Conferences on Artificial Intelligence Organization, 4157–4163.
Jaynes (1957) Edwin T Jaynes. 1957. Information theory and statistical mechanics. Physical review 106, 4 (1957), 620.
Jin et al. (2018) Junqi Jin, Chengru Song, Han Li, Kun Gai, Jun Wang, and Weinan Zhang. 2018. Real-time bidding with multi-agent reinforcement learning in display advertising. In Proceedings of the 27th ACM international conference on information and knowledge management. 2193–2201.
Kingma et al. (2015) Durk P Kingma, Tim Salimans, and Max Welling. 2015. Variational dropout and the local reparameterization trick. Advances in neural information processing systems 28 (2015).
Kostrikov et al. (2021) Ilya Kostrikov, Ashvin Nair, and Sergey Levine. 2021. Offline Reinforcement Learning with Implicit Q-Learning. In International Conference on Learning Representations.
Kumar et al. (2020) Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. 2020. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems 33 (2020), 1179–1191.
Li and Tang (2022) Juncheng Li and Pingzhong Tang. 2022. Auto-bidding Equilibrium in ROI-Constrained Online Advertising Markets. arXiv preprint arXiv:2210.06107 (2022).
Li et al. (2023) Xujia Li, Yuan Li, Xueying Mo, Hebing Xiao, Yanyan Shen, and Lei Chen. 2023. Diga: Guided diffusion model for graph recovery in anti-money laundering. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4404–4413.
Majeed and Hutter (2018) Sultan Javed Majeed and Marcus Hutter. 2018. On Q-learning Convergence for Non-Markov Decision Processes.. In IJCAI, Vol. 18. 2546–2552.
Misra (2019) Diganta Misra. 2019. Mish: A self regularized non-monotonic activation function. arXiv preprint arXiv:1908.08681 (2019).
Mou et al. (2022) Zhiyu Mou, Yusen Huo, Rongquan Bai, Mingzhou Xie, Chuan Yu, Jian Xu, and Bo Zheng. 2022. Sustainable Online Reinforcement Learning for Auto-bidding. Advances in Neural Information Processing Systems 35 (2022), 2651–2663.
Mutti et al. (2022) Mirco Mutti, Riccardo De Santi, and Marcello Restelli. 2022. The importance of non-markovianity in maximum state entropy exploration. In International Conference on Machine Learning. PMLR, 16223–16239.
Nguyen-Tuong et al. (2008) Duy Nguyen-Tuong, Jan Peters, Matthias Seeger, and Bernhard Schölkopf. 2008. Learning inverse dynamics: a comparison. In European symposium on artificial neural networks.
Nichol and Dhariwal (2021) Alexander Quinn Nichol and Prafulla Dhariwal. 2021. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning. PMLR, 8162–8171.
Ou et al. (2023) Weitong Ou, Bo Chen, Yingxuan Yang, Xinyi Dai, Weiwen Liu, Weinan Zhang, Ruiming Tang, and Yong Yu. 2023. Deep landscape forecasting in multi-slot real-time bidding. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4685–4695.
Pathak et al. (2018) Deepak Pathak, Parsa Mahmoudieh, Guanghao Luo, Pulkit Agrawal, Dian Chen, Yide Shentu, Evan Shelhamer, Jitendra Malik, Alexei A Efros, and Trevor Darrell. 2018. Zero-shot visual imitation. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 2050–2053.
Qin et al. (2023) Aoyang Qin, Feng Gao, Qing Li, Song-Chun Zhu, and Sirui Xie. 2023. Learning non-Markovian Decision-Making from State-only Sequences. In Thirty-seventh Conference on Neural Information Processing Systems.
Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer, 234–241.
Vincent (2011) Pascal Vincent. 2011. A connection between score matching and denoising autoencoders. Neural computation 23, 7 (2011), 1661–1674.
Wang et al. (2017) Jun Wang, Weinan Zhang, Shuai Yuan, et al. 2017. Display advertising with real-time bidding (RTB) and behavioural targeting. Foundations and Trends® in Information Retrieval 11, 4-5 (2017), 297–435.
Wang et al. (2022) Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. 2022. Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning. In The Eleventh International Conference on Learning Representations.
Wen et al. (2022) Chao Wen, Miao Xu, Zhilin Zhang, Zhenzhe Zheng, Yuhui Wang, Xiangyu Liu, Yu Rong, Dong Xie, Xiaoyang Tan, Chuan Yu, et al. 2022. A cooperative-competitive multi-agent framework for auto-bidding in online advertising. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 1129–1139.
Wu and He (2018) Yuxin Wu and Kaiming He. 2018. Group normalization. In Proceedings of the European conference on computer vision (ECCV). 3–19.
Zhang et al. (2023b) Haoqi Zhang, Lvyin Niu, Zhenzhe Zheng, Zhilin Zhang, Shan Gu, Fan Wu, Chuan Yu, Jian Xu, Guihai Chen, and Bo Zheng. 2023b. A Personalized Automated Bidding Framework for Fairness-aware Online Advertising. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5544–5553.
Zhang et al. (2023a) Peiyan Zhang, Jiayan Guo, Chaozhuo Li, Yueqi Xie, Jae Boum Kim, Yan Zhang, Xing Xie, Haohan Wang, and Sunghun Kim. 2023a. Efficiently Leveraging Multi-level User Intent for Session-based Recommendation via Atten-Mixer Network. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining (Singapore, Singapore) (WSDM ’23). Association for Computing Machinery, New York, NY, USA, 168–176. https://doi.org/10.1145/3539597.3570445
Ziebart (2010) Brian D. Ziebart. 2010. Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Ph. D. Dissertation. USA. Advisor(s) Bagnell, J. Andrew. AAI3438449.

Appendix A Appendix

A.1. Notations

Table 4. Definition of Notations.

Symbol	Definition
$\tau$	The trajectory index of a serving policy.
$\boldsymbol{x}(\tau)_{k}$	Sequence of states of trajectory $\tau$ in diffusion step $k$ .
$\boldsymbol{y}(\tau)$	Properties or conditions for $\tau$ .
$R$	Return of a trajectory.
$E$	Binary indicator variable.
$B$	The budget of the advertiser.
$C_{i}$	The $i$-th constraint.
$\boldsymbol{o}_{i}$	Whether the advertiser wins impression $i$ .
$\boldsymbol{v}_{i}$	The true value of the impression $i$ .
$\boldsymbol{b}_{i}^{*}$	The optimal bidding price for the impression $i$ .
$\boldsymbol{s}_{t}$	The state at time period $t$ .
$\boldsymbol{\hat{a}}_{t}$	Predicted bidding parameters at time period $t$ .
$\lambda_{i}$	The bidding parameters.
$\boldsymbol{\epsilon}_{\theta}$	The denoising model that predicts the noise.
$f_{\phi}$	The model that generates bids.
$\overline{\alpha}_{k}$	The cumulative product of $1-\beta_{j}$, $j=1,...,k$.
$\beta_{k}$	Scheduling factors.
$\alpha_{k}$	$1-\beta_{k}$.

A.2. Diffusion Modeling

As a kind of generative model, diffusion models (Ho et al., 2020; Vincent, 2011) use the diffusion process to gradually denoise latent samples to generate the new sample and have been widely used in generating pictures, videos, and audio. One of the widely used diffusion models, denoising diffusion probabilistic model (DDPM), consists of two processes:

Forward process. In the forward process, noise is gradually added to the latent variable through a Markov chain with the transition $q(\boldsymbol{x}_{k}|\boldsymbol{x}_{k-1})=\mathcal{N}\left(\boldsymbol{x}_{k};\sqrt{1-\beta_{k}}\boldsymbol{x}_{k-1},\beta_{k}I\right)$, where $k\in\{1,...,K\}$ refers to the diffusion step, and $\beta_{k}\in(0,1)$ is a pre-defined scale that controls the noise level at step $k$. By defining $\alpha_{k}=1-\beta_{k}$ and $\overline{\alpha}_{k}=\prod_{i=1}^{k}\alpha_{i}=\prod_{i=1}^{k}(1-\beta_{i})$, we obtain the conditional distribution:

$$q(x_{k}|x_{0})=\mathcal{N}\left(x_{k};\sqrt{\overline{\alpha}_{k}}x_{0},(1-\overline{\alpha}_{k})I\right) \tag{16}$$
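Eq. (16) follows from unrolling the one-step transition. The derivation below is the standard DDPM argument (Ho et al., 2020), written with $\alpha_{k}=1-\beta_{k}$; the noise variables $\epsilon_{k},\bar{\epsilon}_{k-1},\bar{\epsilon}\sim\mathcal{N}(0,I)$ are auxiliary symbols introduced here for illustration. The third line uses the fact that the sum of two independent zero-mean Gaussians is Gaussian with variance $\alpha_{k}(1-\alpha_{k-1})+(1-\alpha_{k})=1-\alpha_{k}\alpha_{k-1}$.

$$\begin{aligned}
x_{k} &= \sqrt{\alpha_{k}}\,x_{k-1}+\sqrt{1-\alpha_{k}}\,\epsilon_{k}\\
&= \sqrt{\alpha_{k}\alpha_{k-1}}\,x_{k-2}+\sqrt{\alpha_{k}(1-\alpha_{k-1})}\,\epsilon_{k-1}+\sqrt{1-\alpha_{k}}\,\epsilon_{k}\\
&= \sqrt{\alpha_{k}\alpha_{k-1}}\,x_{k-2}+\sqrt{1-\alpha_{k}\alpha_{k-1}}\,\bar{\epsilon}_{k-1}\\
&= \dots = \sqrt{\overline{\alpha}_{k}}\,x_{0}+\sqrt{1-\overline{\alpha}_{k}}\,\bar{\epsilon}
\end{aligned}$$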

In this paper, we apply a cosine noise schedule to control the noise by:

$$\overline{\alpha}_{k}=\frac{g(k)}{g(0)}=\frac{\cos\left(\frac{k/K+\gamma}{1+\gamma}\cdot\frac{\pi}{2}\right)}{\cos\left(\frac{\gamma}{1+\gamma}\cdot\frac{\pi}{2}\right)}, \tag{17}$$

where $\gamma$ is a constant. When $K\rightarrow\infty$, $q(x_{K})$ approaches a standard Gaussian distribution (Ho et al., 2020). Given the original trajectory $x_{0}$ and $\epsilon\sim\mathcal{N}(0,I)$, the noisy version at step $k$ is $x_{k}=\sqrt{\overline{\alpha}_{k}}x_{0}+\sqrt{1-\overline{\alpha}_{k}}\epsilon$.
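The schedule and the closed-form noising step can be sketched in NumPy as follows. This is a minimal illustration that follows Eq. (17) as written, with $\gamma=0.008$ and $K=20$ taken from the implementation details; the function names, the clipping of tiny round-off values, and the trajectory shape (96 time steps, 4 state features) are assumptions made for the example.

```python
import numpy as np


def cosine_alpha_bar(K, gamma=0.008):
    """Cumulative signal level alpha-bar_k for k = 0..K, following Eq. (17)."""
    k = np.arange(K + 1)
    g = np.cos((k / K + gamma) / (1.0 + gamma) * np.pi / 2.0)
    # Clip to [0, 1] so floating-point round-off near k = K cannot go negative.
    return np.clip(g / g[0], 0.0, 1.0)


def q_sample(x0, k, alpha_bar, rng):
    """Draw x_k ~ q(x_k | x_0) in closed form and return the noise that was used."""
    eps = rng.standard_normal(x0.shape)
    xk = np.sqrt(alpha_bar[k]) * x0 + np.sqrt(1.0 - alpha_bar[k]) * eps
    return xk, eps


rng = np.random.default_rng(0)
alpha_bar = cosine_alpha_bar(K=20)
x0 = rng.standard_normal((96, 4))  # hypothetical trajectory: 96 steps, 4 features
xk, eps = q_sample(x0, k=10, alpha_bar=alpha_bar, rng=rng)
```

Because $\overline{\alpha}_{0}=1$ and $\overline{\alpha}_{K}\approx 0$, the sample moves from the clean trajectory at $k=0$ to almost pure noise at $k=K$.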

Reverse process. In the reverse process, diffusion models aim to remove the noise added to $\boldsymbol{x}_{k}$ and recursively recover $\boldsymbol{x}_{k-1}$. To achieve this goal, a Gaussian distribution $p_{\theta}(\boldsymbol{x}_{k-1}|\boldsymbol{x}_{k})=\mathcal{N}\left(\boldsymbol{x}_{k-1}|\boldsymbol{\mu}_{\theta}(\boldsymbol{x}_{k},k),\boldsymbol{\Sigma}_{\theta}(\boldsymbol{x}_{k},k)\right)$ is learned, where $\boldsymbol{\mu}_{\theta}(\boldsymbol{x}_{k},k)$ is the learned mean and $\boldsymbol{\Sigma}_{\theta}(\boldsymbol{x}_{k},k)$ is the learned covariance, both parameterized by a neural network with parameters $\theta$. To generate new samples, we can use the re-parameterization trick (Kingma et al., 2015) to sample a noise $\boldsymbol{x}_{K}\sim\boldsymbol{\mu}_{K}+\epsilon\sigma_{K}$ and recursively denoise the sample with $p_{\theta}(\boldsymbol{x}_{k-1}|\boldsymbol{x}_{k})$.

Optimization. DDPM optimizes the Evidence Lower BOund (ELBO) of generative models. In the context of the diffusion model, we can take the latent samples as hidden variables and rewrite ELBO in the following form:

$$\begin{aligned}
&\mathbb{E}_{q}\left[-\log p_{\theta}(\boldsymbol{x}_{0})\right]\\
\leq\ &\mathbb{E}_{q}\left[-\log\frac{p_{\theta}(\boldsymbol{x}_{0:K})}{q(\boldsymbol{x}_{1:K}|\boldsymbol{x}_{0})}\right]\\
=\ &\mathbb{E}_{q}\left[D_{KL}(q(\boldsymbol{x}_{K}|\boldsymbol{x}_{0})||p_{\theta}(\boldsymbol{x}_{K}))\right]-\mathbb{E}_{q}\left[\log p_{\theta}\left(\boldsymbol{x}_{0}|\boldsymbol{x}_{1}\right)\right]\\
+\ &\mathbb{E}_{q}\left[\sum_{k>1}D_{KL}\left(q(\boldsymbol{x}_{k-1}|\boldsymbol{x}_{k},\boldsymbol{x}_{0})||p_{\theta}(\boldsymbol{x}_{k-1}|\boldsymbol{x}_{k})\right)\right],
\end{aligned} \tag{18}$$

where the first term has no learnable parameters because the variances $\beta_{k}$ are fixed to constants, and can thus be ignored during training. The second term is the reconstruction term, where $p_{\theta}(\cdot)$ is trained to recover the original sample $\boldsymbol{x}_{0}$ from the noisy sample $\boldsymbol{x}_{1}$. The last term is the denoising term, where $p_{\theta}(\cdot)$ is trained to denoise $\boldsymbol{x}_{k}$ to get $\boldsymbol{x}_{k-1}$, so that the latent samples can be denoised recursively. In the original paper (Ho et al., 2020), the authors show that the last term can be simplified to the noise prediction objective $\mathbb{E}_{k,\boldsymbol{x}_{0},\boldsymbol{\epsilon}}\left[||\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta}\left(\sqrt{\overline{\alpha}_{k}}\boldsymbol{x}_{0}+\sqrt{1-\overline{\alpha}_{k}}\boldsymbol{\epsilon},k\right)||^{2}\right]$, where $\overline{\alpha}_{k}=\prod_{i=1}^{k}\alpha_{i}=\prod_{i=1}^{k}(1-\beta_{i})$.

A.3. Model Configuration

We parameterize the noise model $\epsilon_{\theta}$ with a temporal U-Net (Ronneberger et al., 2015) consisting of 3 repeated residual blocks. Each block consists of two temporal convolutions, followed by group normalization (Wu and He, 2018), and a final Mish activation function (Misra, 2019). Timestamp and condition embeddings, both 128-dimensional vectors, are produced by separate 2-layer MLPs (with 256 hidden units and Mish activation functions) and are concatenated before being added to the activations of the first temporal convolution within each block. $f_{\phi}$ is parameterized with a 3-layer MLP.

A.4. Pseudo-code

The process of training and inference of DiffBid is shown in Algorithm 2 and Algorithm 1 respectively.

Algorithm 1 Bid Generation with DiffBid.

1: Input: noise model $\epsilon_{\theta}$, inverse dynamics $f_{\phi}$, guidance scale $\omega$, condition $\boldsymbol{y}$, max diffusion step $K$, scales $\beta_{k},\ k=1,...,K$
2: Output: bidding parameters $\boldsymbol{a}_{t}$
3: Get history of states $\boldsymbol{s}_{0:t}$;
4: Sample $\boldsymbol{x}_{K}(\tau)\sim\mathcal{N}(0,\beta_{K}I)$;
5: for $k=K,...,1$ do
6:   $\boldsymbol{x}_{k}(\tau)[:t]\leftarrow\boldsymbol{s}_{0:t}$;
7:   Estimate noise $\hat{\epsilon}_{k}$ through Eq. (8);
8:   $\left(\mu_{k-1},\Sigma_{k-1}\right)\leftarrow\text{Denoise}(\boldsymbol{x}_{k}(\tau),\hat{\epsilon}_{k})$;
9:   $\boldsymbol{x}_{k-1}(\tau)\sim\mathcal{N}(\mu_{k-1},\alpha\Sigma_{k-1})$;
10: end for
11: Extract $(\boldsymbol{s}_{t-L:t},\boldsymbol{s}^{\prime}_{t+1})$ from $\boldsymbol{x}_{0}(\tau)$;
12: Generate $\boldsymbol{\hat{a}}_{t}=f_{\phi}(\boldsymbol{s}_{t-L:t},\boldsymbol{s}^{\prime}_{t+1})$;
13: return $\boldsymbol{\hat{a}}_{t}$
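The inference loop of Algorithm 1 can be sketched in plain Python as follows. Several points are assumptions made for illustration rather than details confirmed by the text: Eq. (8) is taken to have the usual classifier-free guidance form, Denoise is taken to be the standard DDPM posterior mean and variance, the initial noise is drawn from a unit Gaussian, states are scalars, and `temp` plays the role of the variance scale $\alpha$. All function names are hypothetical.

```python
import math
import random

def guided_noise(eps_model, x, y, k, omega):
    """Classifier-free guidance: blend unconditional and conditional estimates."""
    eps_u = eps_model(x, None, k)
    eps_c = eps_model(x, y, k)
    return [u + omega * (c - u) for u, c in zip(eps_u, eps_c)]

def generate_plan(eps_model, history, horizon, y, betas, omega=1.2, temp=0.5, rng=None):
    """Reverse diffusion over a scalar state trajectory with the observed prefix pinned."""
    rng = rng or random.Random(0)
    abar, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        abar.append(prod)
    x = [rng.gauss(0.0, 1.0) for _ in range(horizon)]
    for k in range(len(betas), 0, -1):
        x[:len(history)] = history                  # overwrite with observed states
        eps = guided_noise(eps_model, x, y, k, omega)
        beta, ab = betas[k - 1], abar[k - 1]
        # DDPM posterior mean given the noise estimate (assumed form of Denoise).
        mean = [(xi - beta / math.sqrt(1.0 - ab) * ei) / math.sqrt(1.0 - beta)
                for xi, ei in zip(x, eps)]
        if k > 1:
            var = temp * beta * (1.0 - abar[k - 2]) / (1.0 - ab)
            x = [m + math.sqrt(var) * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean                                # no noise on the final step
    x[:len(history)] = history
    return x

def next_action(inverse_dynamics, plan, t, L):
    """Apply f_phi to the states s_{t-L:t} and the planned next state s'_{t+1}."""
    window = plan[max(0, t - L): t + 1]
    return inverse_dynamics(window, plan[t + 1])
```

A caller would pass the observed states as `history`, generate a plan over the full horizon, and then call `next_action` with the trained inverse dynamics model to obtain the bidding parameter for the current step.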

Algorithm 2 Training of DiffBid.

1: Input: randomly initialized $\theta$, $\phi$, bidding trajectory set $\mathcal{D}$
2: Output: optimized $\theta$, $\phi$
3: while not converged do
4:   Sample a batch of trajectories $\mathcal{B}\subset\mathcal{D}$;
5:   for all $\tau\in\mathcal{B}$ do
6:     Sample $k\sim\text{Uniform}(1,K)$, $\epsilon\sim\mathcal{N}(0,I)$;
7:     Compute $\boldsymbol{x}_{k}(\tau)$ via $q(\boldsymbol{x}_{k}(\tau)|\boldsymbol{x}_{0}(\tau))$ in Eq. (7);
8:     Compute $\mathcal{L}(\theta,\phi)$ by Eq. (12);
9:     Perform gradient descent to optimize $\theta$ and $\phi$;
10:   end for
11: end while
12: return optimized $\theta$, $\phi$

A.5. Analytical Results for Action Control

We analyze how well different models can control actions. To this end, we redefine the return function as the sum of actions at odd time steps minus the sum of actions at even time steps. The results are shown in Figure 7. We find that DiffBid controls actions better than IQL. The main reason is that controlling actions over long horizons is difficult for RL. In contrast, DiffBid directly models the correlation between trajectories and returns, and thus handles long trajectories well.
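The redefined return can be written down in a few lines. This is a minimal sketch: the function name is hypothetical, actions are treated as scalars, and whether time steps start at 0 or 1 is not stated in the text, so the starting index is exposed as a parameter.

```python
def alternating_return(actions, first_step=1):
    """Sum of actions at odd time steps minus sum of actions at even time steps."""
    total = 0.0
    for offset, a in enumerate(actions):
        t = first_step + offset        # time-step index of this action
        total += a if t % 2 == 1 else -a
    return total
```

A model that controls actions well should produce trajectories whose actions alternate between high and low values, which is what makes this return a useful probe.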

A.6. Statistical Analyses for Bidding Trajectory

The study by (Hao et al., 2020) indicates that CE follows a power-law decline as the number of winning impressions increases. Our statistical analysis confirms that this finding holds at every discrete time step, with decay rates varying over time due to the heterogeneous nature of the impressions. Figure 8(a) shows three time steps sampled from the online advertising system. Another key insight from this figure is that the optimal bidding strategy's $ce^{*}$ is equivalent to selecting a specific number of winning impressions per time step. We denote this number at time step $t$ as $n_{t}$.

Another finding, illustrated in Figure 8(b), is that the costs of impressions remain relatively stable throughout the episode, fluctuating by less than 5%. This stability allows us to approximate the cost of each impression $c_{i}$ with the average cost $\bar{c}=\frac{1}{N}\sum_{i=1}^{N}c_{i}$, where $N$ is the number of winning impressions in the whole episode. Therefore, the total cost at each time step is $c_{t}=n_{t}\cdot\bar{c}$. In auto-bidding modeling, $c_{t}$ can be calculated from the state trajectory as the difference in the remaining budget between two consecutive steps. Consequently, we can conclude that the optimal strategy corresponds to a specific state trajectory.
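The link between a state trajectory and the number of winning impressions per step can be sketched as follows, assuming the remaining budget is available at every step. The function names are hypothetical, and the average cost $\bar{c}$ is passed in rather than estimated.

```python
def step_costs(remaining_budget):
    """Cost per step c_t, the drop in remaining budget between consecutive steps."""
    return [b0 - b1 for b0, b1 in zip(remaining_budget, remaining_budget[1:])]

def implied_wins(remaining_budget, avg_cost):
    """Approximate winning impressions per step, n_t = c_t / c_bar."""
    if avg_cost <= 0:
        raise ValueError("avg_cost must be positive")
    return [c / avg_cost for c in step_costs(remaining_budget)]
```

For example, a remaining-budget sequence of 100, 90, 75, 75 with an average cost of 5 per impression implies roughly 2, 3 and 0 winning impressions in the three steps.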

A.7. Theoretical Analysis

We first give the definitions of several decision processes and then present the theoretical analysis.

Definition A.1 (Markovian Decision Process (MDP)).

MDP is a stochastic mapping from a state-action pair to state-reward pairs. Formally, $\mathcal{T}:\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{S}\times\mathcal{R}$, where $\mathcal{T}$ denotes a stochastic mapping.

Definition A.2 (History-based Decision Process (HDP)).

HDP is a stochastic mapping from a history-action pair to observation-reward pairs. Formally, $\mathcal{P}:\mathcal{H}^{*}\times\mathcal{A}\rightarrow\mathcal{O}\times\mathcal{R}$, where $\mathcal{P}$ denotes a stochastic mapping.

We show that a sequential decision-making problem can be constructed to maximize the same objective. The main results are due to (Qin et al., 2023), and we include the proofs here for completeness. To start, let the ground-truth distribution of demonstrations be $p^{*}(\boldsymbol{x}_{0}(\tau))$ and the learned marginal distribution of state sequences be $p_{\theta}(\boldsymbol{x}_{0}(\tau))$. Then Eq. (4) is an empirical estimate of

(19)

\begin{split}\mathbb{E}_{p^{*}(\boldsymbol{s}_{0})}\left[\text{log}p^{*}(% \boldsymbol{s}_{0})+\mathbb{E}_{p^{*}(\boldsymbol{s}_{1:T}|\boldsymbol{s}_{0})% }\left[\text{log}p_{\theta}(\boldsymbol{s}_{1:T}|\boldsymbol{s}_{0})\right]% \right]\end{split}

If the MLE attains the maximum, we have $p_{\theta}^{*}=p^{*}$. We then define $V^{*}(s_{0}):=\mathbb{E}_{p^{*}(s_{1:T}|s_{0})}[\text{log}p^{*}(s_{1:T}|s_{0})]$ and generalize it to a $V$ function:

(20)

V^{*}(s_{0:t})=\mathbb{E}_{p^{*}(s_{t+1:T}|s_{0:t})}[\text{log}p^{*}(s_{t+1:T}% |s_{0:t})]

which comes with a Bellman optimal equation:

(21)

V^{*}(s_{0:t}):=\mathbb{E}_{p^{*}(s_{t+1}|s_{0:t})}[r(s_{t+1},s_{0:t})+V^{*}(s% _{0:t+1})]

with $r(s_{t+1},s_{0:t}):=\log p^{*}(s_{t+1}|s_{0:t})=\log\int p^{*}_{\alpha}(a_{t}|s_{0:t})p^{*}(s_{t+1}|s_{t},a_{t})da_{t}$ and $V^{*}(s_{0:T}):=0$. It is worth noting that the $r$ defined above involves the optimal policy, which may not be known a priori. We can resolve this by replacing it with $r_{\alpha}$ for an arbitrary policy $p_{\alpha}(a_{t}|s_{0:t})$. All Bellman identities and updates still hold. The entailed Bellman update (value iteration) for arbitrary $V$ and $\alpha$ is

(22)

V(s_{0:t})=\mathbb{E}_{p^{*}(s_{t+1}|s_{0:t})}[r_{\alpha}(s_{t+1},s_{0:t})+V(s_{0:t+1})].
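As a quick check, the Bellman equation in Eq. (21) follows from Eq. (20) by the chain rule of probability, $p^{*}(s_{t+1:T}|s_{0:t})=p^{*}(s_{t+1}|s_{0:t})\,p^{*}(s_{t+2:T}|s_{0:t+1})$. A sketch of the step, using only the symbols defined above:

\begin{split}V^{*}(s_{0:t})&=\mathbb{E}_{p^{*}(s_{t+1:T}|s_{0:t})}\left[\log p^{*}(s_{t+1}|s_{0:t})+\log p^{*}(s_{t+2:T}|s_{0:t+1})\right]\\ &=\mathbb{E}_{p^{*}(s_{t+1}|s_{0:t})}\left[r(s_{t+1},s_{0:t})+\mathbb{E}_{p^{*}(s_{t+2:T}|s_{0:t+1})}\left[\log p^{*}(s_{t+2:T}|s_{0:t+1})\right]\right]\\ &=\mathbb{E}_{p^{*}(s_{t+1}|s_{0:t})}\left[r(s_{t+1},s_{0:t})+V^{*}(s_{0:t+1})\right].\end{split}

The inner expectation in the second line is exactly $V^{*}(s_{0:t+1})$ by Eq. (20), and the recursion terminates at $V^{*}(s_{0:T})=0$.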

We then define $r(s_{t+1},a_{t},s_{0:t}):=r(s_{t+1},s_{0:t})+\log p^{*}_{\alpha}(a_{t}|s_{0:t})$ to construct a $Q$ function:

(23)

Q^{*}(a_{t};s_{0:t}):=\mathbb{E}_{p^{*}(s_{t+1}|s_{0:t})}[r(s_{t+1},a_{t},s_{0% :t})+V^{*}(s_{0:t+1})],

which entails a Bellman update, Q backup, for arbitrary $\alpha$ , $Q$ and $V$

(24)

Q(a_{t};s_{0:t})=\mathbb{E}_{p^{*}(s_{t+1}|s_{0:t})}[r_{\alpha}(s_{t+1},a_{t},s_{0:t})+V(s_{0:t+1})].

Also note that the $V$ and $Q$ in Eq. (22) and Eq. (24), respectively, are not necessarily associated with the policy $p_{\alpha}(a_{t}|s_{0:t})$. Slightly overloading the notation, we use $Q_{\alpha},V_{\alpha}$ to denote the expected returns from policy $p_{\alpha}(a_{t}|s_{0:t})$. We have now finished constructing the atomic algebraic components and move on to check whether the relations between them align with the algebraic structure of a sequential decision-making problem. We first prove that the construction above is valid at optimality.

Lemma A.3.

When $f_{\alpha}(a_{t};s_{0:t})=Q^{*}(a_{t};s_{0:t})-V^{*}(s_{0:t})$, $p_{\alpha}(a_{t}|s_{0:t})$ is the optimal policy.

Proof.

Note that the construction gives us

(25)

\begin{split}&Q^{*}(a_{t};s_{0:t})\\ =&\mathbb{E}_{p^{*}(s_{t+1}|s_{0:t})}\left[r(s_{t+1},s_{0:t})+\log p^{*}_{% \alpha}(a_{t}|s_{0:t})+V^{*}(s_{0:t+1})\right]\\ =&\log p^{*}_{\alpha}(a_{t}|s_{0:t})+\mathbb{E}_{p^{*}(s_{t+1}|s_{0:t})}\left[% r(s_{t+1},s_{0:t})+V^{*}(s_{0:t+1})\right]\\ =&\log p^{*}_{\alpha}(a_{t}|s_{0:t})+V^{*}(s_{0:t})\end{split}

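Rearranging gives $\log p^{*}_{\alpha}(a_{t}|s_{0:t})=Q^{*}(a_{t};s_{0:t})-V^{*}(s_{0:t})$, so choosing $f_{\alpha}=Q^{*}-V^{*}$ for the energy-based policy recovers the optimal policy. ∎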

Obviously, $Q^{*}(a_{t};s_{0:t})$ lies in the hypothesis space of $f_{\alpha}(a_{t};s_{0:t})$. This indicates that we need to parameterize either $f_{\alpha}(a_{t};s_{0:t})$ or $Q(a_{t};s_{0:t})$. While $Q^{*}$ and $V^{*}$ are constructed from optimality, the derived $Q^{\alpha}$ and $V^{\alpha}$ measure the performance of an interactive agent when it executes the policy $p_{\alpha}(a_{t}|s_{0:t})$. They should be consistent with each other.

Lemma A.4.

$V^{\alpha}(s_{0:t})$ and $\mathbb{E}_{p_{\alpha}(a_{t}|s_{0:t})}\left[Q^{\alpha}(a_{t};s_{0:t})\right]$ yield the same optimal policy $p^{*}_{\alpha}(a_{t}|s_{0:t})$.

Proof.

\small\begin{split}&\mathbb{E}_{p_{\alpha}(a_{t}|s_{0:t})}\left[Q^{\alpha}(a_{% t};s_{0:t})\right]\\ :=&\mathbb{E}_{p_{\alpha}(a_{t}|s_{0:t})}\left[\mathbb{E}_{p^{*}(s_{t+1}|s_{0:% t})}\left[r(s_{t+1},a_{t},s_{0:t})+V^{\alpha}(s_{0:t+1})\right]\right]\\ =&\mathbb{E}_{p_{\alpha}(a_{t}|s_{0:t})}\mathbb{E}_{p^{*}(s_{t+1}|s_{0:t})}% \left[\log p^{*}_{\alpha}(a_{t}|s_{0:t})+r(s_{t+1},s_{0:t})+V^{\alpha}(s_{0:t+% 1})\right]\\ =&\mathbb{E}_{p^{*}(s_{t+1}|s_{0:t})}\left[r(s_{t+1},s_{0:t})-H_{\alpha}(a_{t}% |s_{0:t})+V^{\alpha}(s_{0:t+1})\right]\\ &-\sum_{k=t+1}^{T-1}\mathbb{E}_{p^{*}(s_{t+1:k}|s_{0:t})}\left[H_{\alpha}(a_{k% }|s_{0:k})\right]\end{split}

where $H_{\alpha}(\cdot)$ is the entropy term. The last line is derived by recursively applying the Bellman equation in the line above until $s_{0:T}$. As an energy-based policy, $p_{\alpha}(a_{t}|s_{0:t})$ has inherently maximized entropy (Jaynes, 1957). Therefore, within the hypothesis space, the $p_{\alpha}^{*}(a_{t}|s_{0:t})$ that optimizes $V^{\alpha}(s_{0:t})$ also leads to the optimal expected return $\mathbb{E}_{p_{\alpha}(a_{t}|s_{0:t})}\left[Q^{\alpha}(a_{t};s_{0:t})\right]$. ∎

Given the convergence proof by Ziebart (Ziebart, 2010), we have:

Lemma A.5.

If $p^{*}(s_{t+1}|s_{0:t})$ is accessible and $p^{*}_{\gamma}(s_{t+1}|s_{t},a_{t})$ is known, soft policy iteration and soft $Q$ learning both converge to $p^{*}_{\alpha}(a_{t}|s_{0:t})\propto\exp(Q^{*}(a_{t};s_{0:t}))$ under certain conditions.

Lemma A.5 means that, given $p^{*}(s_{t+1}|s_{0:t})$ and $p^{*}_{\gamma}(s_{t+1}|s_{t},a_{t})$, we can recover $p^{*}_{\alpha}$ through reinforcement learning methods instead of the proposed MLE. So $p_{\alpha}(a_{t}|s_{0:t})$ is a viable policy space for the constructed sequential decision-making problem. Together, Lemma A.3, Lemma A.4 and Lemma A.5 establish a valid sequential decision-making problem that maximizes the same objective as MLE, as described by Lemma A.6.

Lemma A.6 (MLE as non-Markovian decision-making process).

(26)

\sum_{t=0}^{T}\mathbb{E}_{p^{*}(s_{0:t})}\left[V^{p_{\alpha}}(s_{0:t})\right]=% \mathbb{E}_{p^{*}(s_{0:T})}\left[\sum_{t=0}^{T}\sum_{k=t}^{T}r_{\alpha}(s_{k+1% };s_{0:k})\right]

where $V^{p_{\alpha}}(s_{0:t}):=\mathbb{E}_{p^{*}(s_{t+1:T}|s_{0:t})}[\sum_{k=t}^{T}r_{\alpha}(s_{k+1};s_{0:k})]$ is the value function of $p_{\alpha}$. This objective yields the same optimal policy as the Maximum Likelihood Estimation $\mathbb{E}_{p^{*}(s_{0:T})}\left[\log p_{\theta}(s_{0:T})\right]$.

Table 5. Parameters of the Real Advertising System.

Parameters	Values
Number of advertisers	30
Time steps in an episode, $T$	96
Minimum number of impression opportunities $N_{\text{min}}$	50
Maximum number of impression opportunities $N_{\text{max}}$	300
Minimum budget	1000 Yuan
Maximum budget	4000 Yuan
Value of impression opportunities in stage 1, $v_{j,t}^{1}$	0 $\sim$ 1
Value of impression opportunities in stage 2, $v_{j,t}^{2}$	0 $\sim$ 1
Minimum bid price, $A_{\text{min}}$	0 Yuan
Maximum bid price, $A_{\text{max}}$	1000 Yuan
Maximum value of impression opportunity, $v_{M}$	1
Maximum market price, $p_{M}$	1000 Yuan