Stochastic Approach for Price Optimization Problems with Decision-dependent Uncertainty

Yuya Hikima¹
Graduate School of Information Science and Technology, University of Tokyo, Tokyo, Japan

Akiko Takeda
Graduate School of Information Science and Technology, University of Tokyo, Tokyo, Japan
Center for Advanced Intelligence Project, RIKEN, Tokyo, Japan

¹ Corresponding author. E-mail: [email protected]

Abstract

Price determination is a central research topic of revenue management in marketing. An important aspect of pricing is controlling the stochastic behavior of demand, and previous studies have tackled price optimization problems with uncertainties. However, many of those studies assumed that uncertainties are independent of decision variables (i.e., prices) and did not consider situations where demand uncertainty depends on price. Although some price optimization studies have dealt with decision-dependent uncertainty, they make application-specific assumptions in order to obtain an optimal or approximate solution. To handle a wider range of applications with decision-dependent uncertainty, we propose a general non-convex stochastic optimization formulation. This approach aims to maximize the expectation of a revenue function with respect to a random variable representing demand under a decision-dependent distribution. We derived an unbiased stochastic gradient estimator by using a well-tuned variance reduction parameter and used it in a projected stochastic gradient descent method to find a stationary point of our problem. We conducted synthetic experiments and simulation experiments with real data on a retail service application. The results show that the proposed method outputs solutions with higher total revenues than the baselines.

1 Introduction

Price determination is a central research topic of revenue management in marketing, and many pricing studies have targeted applications in agricultural (Wang and Wang, 2019), online retail (Ferreira et al., 2016), electrical power (Dong et al., 2017), and hospitality industries (Koushik et al., 2012).

An important aspect in pricing is controlling the stochastic behavior of demand. This is because stochastic over/under demand causes a loss in many cases; for example, in road pricing, overuse of a certain road causes congestion or traffic accidents; in an electricity market, if demand is much lower than the available electricity supply, capital investment costs cannot be recovered.

To obtain greater profits under demand uncertainty, many of the previous studies have tackled price optimization problems with decision-independent random variables. For example, He et al. (2009) and Dong et al. (2017) define the demand for a product/service as $d(x)+\xi$ , where $x$ is price and $\xi$ is a decision-independent random variable. Correa et al. (2017) and Chawla et al. (2010) assume multi-agent systems where each buyer $i$ has a random variable $v_{i}$ as their value for a product and purchases it when the price is below the value. However, in practical applications, it is natural for the distribution of stochastic demand to vary with price: when the price of a product is close to (far from) those of competing products, it is difficult (easy) to predict the demand and its uncertainty is large (small). Furthermore, the settings of these studies with decision-independent random variables need to use discontinuous functions to represent buyers’ discrete actions (e.g., buy or leave), which makes the optimization problem difficult to solve (see Section 3.4.1).

Although some pricing studies have dealt with decision-dependent uncertainty, they assume specific demand distributions and problem settings in order to obtain an optimal or approximate solution. For example, Bertsimas and de Boer (2005) determine prices for multiple products produced with limited resources. They consider the demand for item $i$ at time $t$, i.e., $\alpha^{t}_{i}(x^{t}_{i})+\beta^{t}_{i}(x^{t}_{i})\xi^{t}_{i}$, where $\alpha^{t}_{i}(\cdot)$ and $\beta^{t}_{i}(\cdot)$ are given functions, $x^{t}_{i}$ is the price, and $\xi^{t}_{i}$ is a random variable following a given distribution. Schulte and Sachs (2020) optimize prices over multiple periods to sell a single product with a fixed unit cost $c\geq 0$, where the item’s demand follows a Poisson distribution with a given intensity function $\lambda(x)$. While these studies can find optimal or approximate solutions for their problems, they appear to be difficult to apply to a wide range of probability distributions or problem settings (e.g., a nonlinear cost setting for selling products) due to their specific assumptions.

To resolve these issues, for general price optimization, we propose a non-convex stochastic optimization formulation that maximizes the expectation of a revenue function with respect to a random variable representing demand under a decision-dependent distribution. Our formulation assumes that (i) the objective function is differentiable and Lipschitz continuous, (ii) the given probability density function of the random variables is differentiable and its gradient, normalized by the value of the probability density function, is bounded, and (iii) the feasible region is compact and convex. These assumptions may seem strong, but they often hold in the price optimization literature. Indeed, we show three application examples satisfying our assumptions (see Section 3.3).

The formulated problem for practical applications is generally non-convex and the dimension of the decision variables may be large. We derive an unbiased stochastic gradient estimator of the objective function by using information on the probability density function and incorporate the estimator in a projected stochastic gradient descent method to find a stationary point of our problem. When deriving a gradient estimator, it is important to design it so that its variance is small, because a small variance leads to fast convergence of the algorithm. Our unbiased stochastic gradient includes a variance reduction parameter, which is inspired by the baseline technique (Williams, 1992; Sutton and Barto, 2018) in the reinforcement-learning literature. After confirming that the variance of the proposed stochastic gradient is bounded, we present a method for calculating the variance reduction parameter. Then, we develop a projected stochastic gradient descent method, which converges to a stationary point, by incorporating the proposed stochastic gradient and the method for calculating the variance reduction parameter into a recent gradient descent algorithm (Ghadimi and Lan, 2016). Moreover, we show a way of speeding up the computation of the minibatch gradient under additional assumptions that hold in applications where multiple agents make purchase decisions.

While some of the previous methods might seem applicable to our formulation, they are not suitable for the following reasons: the retraining method (Perdomo et al., 2020; Mendler-Dünner et al., 2020) requires strong convexity of the objective function; Bayesian optimization (Brochu et al., 2010; Frazier, 2018) and gradient-free methods (Spall, 2005; Flaxman et al., 2005) require a huge number of evaluations of objective values, which makes it difficult to find good solutions for large-scale problems in a reasonable time.

We conducted synthetic experiments and simulation experiments with real data on a retail service application. The results show that the proposed method outputs solutions with higher total revenues than baselines such as the (modified) retraining method and Bayesian optimization.

Notation

Bold lowercase symbols (e.g., $\bm{x},\bm{y}$) denote vectors, and $\|\bm{x}\|$ denotes the Euclidean norm of a vector $\bm{x}$. The inner product of the vectors $\bm{x},\bm{y}$ is denoted by $\bm{x}^{\top}\bm{y}$. Let $\mathbb{R}_{+}$ be the set of positive real numbers. The gradient of a real-valued function $f$ w.r.t. $\bm{x}$ is denoted by $\nabla_{\bm{x}}f$ and the Jacobian matrix of a vector-valued function $\bm{p}$ w.r.t. $\bm{x}$ is denoted by $\frac{d\bm{p}}{d\bm{x}}$. A binomial coefficient of a pair of integers $m$ and $n$ is written as ${}_{m}C_{n}$. Let $[N]$ denote the set $\{1,2,\dots,N\}$.

2 Related Works

2.1 Price Optimization Problems with Stochastic Demand

The previous studies on pricing with stochastic demand considered three types of random variable: (a) decision-independent random variables included in buyers’ purchase behavior; (b) decision-independent random variables directly included in demand; (c) decision-dependent random variables included in demand. Regarding (a), Chawla et al. (2010) and Correa et al. (2017) address pricing problems with stochastic behaviors of multiple agents; each agent $i$ has a (decision-independent) random variable $v_{i}$ as its value for a product and purchases a product when the price is below that value. Regarding (b), He et al. (2009); Heydari and Norouzinasab (2015), and Dong et al. (2017) deal with demand with a decision-independent uncertainty, such as $d(x)+\xi$ , where $x$ is price and $\xi$ is a random variable independent of price. Regarding (c), Bertsimas and de Boer (2005); Wang and Wang (2019); Schulte and Sachs (2020), and Hikima et al. (2021, 2022, 2023) tackle pricing problems with decision-dependent stochastic demand, such as $d(x)+\xi(x)$ , where $\xi(x)$ is a (decision-dependent) random variable. Our study is categorized into (c).

In this paper, we propose a new general pricing problem with decision-dependent random variables. Our problem has advantages over the previous ones for tackling (a), (b), and (c). Regarding (a), while the previous studies need to define agents’ actions (e.g., buy or leave) by discontinuous functions, our formulation can define them without a discontinuous function, which makes gradient-based methods applicable. Regarding (b), we generalize the noise $\xi$ of demand to make it depend on the decision variable, which allows us to deal with situations where the demand uncertainty varies with price. Regarding (c), previous studies have limited applications since they make application-specific assumptions to obtain an optimal or approximate solution: Wang and Wang (2019) and Schulte and Sachs (2020) consider specific situations to optimize prices over multiple periods to sell items and describe efficient methods to find an optimal solution; Hikima et al. (2021, 2022, 2023) tackle resource allocation problems while controlling agents’ acceptance probabilities for prices and present approximation algorithms with constant approximation ratios; Bertsimas and de Boer (2005) consider a simple demand function where the price of each item does not affect demand for other items and present heuristics to obtain an approximate solution. In contrast, we deal with a more general framework that has various applications (see Section 3.3). Consequently, our formulation is a non-convex optimization problem and we develop a stochastic method that is theoretically guaranteed to converge to a stationary point.

2.2 Optimization Methods for Stochastic Problems with Decision-dependent Uncertainty

Our price optimization problem, (P) in Section 3.1, is categorized as a stochastic problem with decision-dependent uncertainty (Hellemo et al., 2018; Varaiya and Wets, 1989). This is because the demand of items and services follows a probability distribution depending on price (decision variables). Here, we explain three different techniques for solving the problem.²

² Another formulation dealing with decision-dependent random variables is decision-dependent distributionally robust optimization (Luo and Mehrotra, 2020; Basciftci et al., 2021). Although such methods are effective at finding an optimal solution in the worst case when the probability distribution is ambiguous, they are not appropriate for the purpose of this study.

Retraining methods (Perdomo et al., 2020; Mendler-Dünner et al., 2020).

Retraining methods fix the distribution at each iteration and update the current iterate. Specifically, Perdomo et al. (2020) proposed repeated gradient descent (RGD): $\bm{x}_{k+1}:=\mathrm{proj}_{\mathcal{C}}(\bm{x}_{k}-\eta_{k}\mathbb{E}_{\bm{\xi}\sim D(\bm{x}_{k})}[\nabla_{\bm{x}}f(\bm{x}_{k},\bm{\xi})])$, where $\eta_{k}$ is the step size, $\mathcal{C}$ is the feasible region, and $\mathrm{proj}_{\mathcal{C}}$ is the Euclidean projection operator onto $\mathcal{C}$. It converges to a performatively stable point $\bm{x}_{\mathrm{PS}}=\mathrm{arg}\min_{\bm{x}}\mathbb{E}_{\bm{\xi}\sim D(\bm{x}_{\mathrm{PS}})}[f(\bm{x},\bm{\xi})]$. However, these methods assume the strong convexity of $f(\bm{x},\bm{\xi})$ w.r.t. $\bm{x}$ and are not applicable to our problem. We provide an intuitive example where RGD fails to work in price optimization, where the objective function is generally not strongly convex.

Example 1.

Suppose that a seller determines the price $x\in[0,M]$ of a product. The buyer purchases the product ($\xi=1$) with probability $p(x)$ or does not purchase it ($\xi=0$) with probability $1-p(x)$, where $p:[0,M]\to[0,1]$ is a decreasing function. The seller wants to solve $\min_{x\in[0,M]}\mathbb{E}_{\xi\sim D(x)}[-x\xi]$ to maximize the expected revenue, where $D(x)$ is the distribution for $\xi$. Then, the optimal solution is $x^{*}\in\arg\min_{x\in[0,M]}-xp(x)$. However, RGD continues to raise the price until the purchase probability reaches zero or the price reaches $M$, since $\mathbb{E}_{\xi\sim D(x^{k})}[-\nabla_{x}(x\xi)]=\mathbb{E}_{\xi\sim D(x^{k})}[-\xi]=-p(x^{k})$ and $p(x^{k})\geq 0$ for all $x^{k}\in[0,M]$, so every descent step moves the price upward. The resulting price is generally not equal to $x^{*}$.
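The behavior described in Example 1 can be reproduced numerically. The following is a minimal sketch, not taken from the paper: the linear purchase probability $p(x)=1-x/M$, the bound $M=10$, the constant step size, and all function names are assumptions chosen only for illustration.

```python
def purchase_prob(x, M=10.0):
    """Hypothetical decreasing purchase probability p(x) = 1 - x/M (an assumption)."""
    return max(0.0, 1.0 - x / M)


def repeated_gradient_descent(x0, M=10.0, step=0.5, iters=200):
    """RGD on min E[-x * xi]: the distribution is frozen at x_k, so the
    gradient is E[-xi] = -p(x_k) and every step raises the price."""
    x = x0
    for _ in range(iters):
        grad = -purchase_prob(x, M)              # E_{xi ~ D(x_k)}[-xi]
        x = min(M, max(0.0, x - step * grad))    # projection onto [0, M]
    return x


def best_price_on_grid(M=10.0, points=1001):
    """Grid search for the price maximizing the true expected revenue x * p(x)."""
    grid = [M * i / (points - 1) for i in range(points)]
    return max(grid, key=lambda x: x * purchase_prob(x, M))


x_rgd = repeated_gradient_descent(x0=1.0)
x_opt = best_price_on_grid()
print(f"RGD price:  {x_rgd:.3f}, expected revenue {x_rgd * purchase_prob(x_rgd):.3f}")
print(f"Grid price: {x_opt:.3f}, expected revenue {x_opt * purchase_prob(x_opt):.3f}")
```

Under these assumptions the RGD iterate drifts toward the upper bound $M$, where almost nobody buys, while the grid search settles on an interior price.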

Meta-model methods (Brochu et al., 2010; Frazier, 2018; Miller et al., 2021).

This type of method creates a meta-model of the objective function or the distribution map $D(\cdot)$ from multiple sample points. Bayesian optimization (Brochu et al., 2010; Frazier, 2018) is the process of learning the objective function through Gaussian process regression while finding the global optimal solution. The two-stage approach (Miller et al., 2021) estimates a coarse model of the distribution map $D(\cdot)$ and then optimizes a proxy to the objective function by treating the estimated distribution as if it were the true distribution map. While these methods are powerful for certain problems, they are not suitable for ours: Bayesian optimization cannot find good solutions when the dimension of the decision variables is too large to be adequately explored; the two-stage approach assumes that the distribution map is included in location-scale families (Miller et al., 2021, Eq. (2)), which cannot be assumed in our problem.

Gradient-free methods (Spall, 2005; Flaxman et al., 2005).

Gradient-free methods estimate the gradient by querying objective values at randomly perturbed points around the current iterate. While this type of method is generic, it often requires many evaluations of objective values to estimate the gradient accurately.

We developed a new projected stochastic gradient descent method by deriving an unbiased stochastic gradient. Our method has advantages over the existing ones: unlike retraining methods, it can find stationary points for general pricing problems whose objective functions are not strongly convex; unlike meta-model methods, it can find stationary points in high-dimensional optimization problems and does not place a strong assumption on the distribution map; while gradient-free methods naively approximate the gradient, our method approximates it by using gradient information on the objective function and the probability density function, which enables us to estimate gradients more accurately in a shorter computation time.

3 Optimization Problem

3.1 Problem Definition

We will consider the following hypothetical situation. There is a decision maker determining a price vector $\bm{x}\in\mathcal{C}\subseteq\mathbb{R}^{n}$ for items $i=1,2,\dots,n$ , where the index $i$ denotes the type of items and/or the time period. Then, the demand vector $\bm{\xi}\in\Xi\subseteq\mathbb{R}^{n}$ of $n$ items is sampled from a probability distribution $D(\bm{x})$ . The decision maker obtains a profit of $s(\bm{x},\bm{\xi})-c(\bm{\xi})$ , where $s:\mathcal{C}\times\Xi\to\mathbb{R}$ and $c:\Xi\to\mathbb{R}$ are the sales and cost functions, respectively.

The revenue maximization problem is as follows:

\displaystyle\mathrm{(P)}\quad\min_{\bm{x}\in\mathcal{C}}\quad\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[f(\bm{x},\bm{\xi})],

where $f(\bm{x},\bm{\xi}):=-s(\bm{x},\bm{\xi})+c(\bm{\xi})$ is real-valued and possibly non-convex. $D(\bm{x})$ is a decision-dependent distribution for the measurable set $\Xi$. Here, we let $\mathrm{Pr}(\bm{\xi}\mid\bm{x})$ be the probability density function of $D(\bm{x})$ and assume that the decision maker can obtain the value of $\mathrm{Pr}(\bm{\xi}\mid\bm{x})$ and $\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})$ for given $\bm{x}$ and $\bm{\xi}$. This assumption naturally holds in many applications of price optimization.³

³ For example, in (Bertsimas and de Boer, 2005), the demand for item $i$ at price $x_{i}$ is defined by $d_{i}(x_{i}):=\alpha_{i}(x_{i})+\beta_{i}(x_{i})\xi_{i}$, where $\xi_{i}$ is a random variable and its probability density function is given. In (Schulte and Sachs, 2020), the buyer’s arrival rate at price $x$ is assumed to follow a Poisson process with the intensity function $\lambda(x)$, which identifies the probability density function of demand. Hikima et al. (2022) define $p_{vt}(x)$ as the probability that buyer $v$ arrives at time interval $t$ for price $x$, and then give a probability density function for demand.
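For orientation, access to both $\mathrm{Pr}(\bm{\xi}\mid\bm{x})$ and $\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})$ matters because of the standard score-function (likelihood-ratio) identity. The following is a sketch for the continuous case, assuming that $\mathrm{Pr}(\bm{\xi}\mid\bm{x})>0$ and that differentiation and integration can be interchanged. It is not the paper's estimator, which may differ in form (for example, by including a variance reduction parameter).

```latex
\begin{align*}
\nabla_{\bm{x}}\,\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[f(\bm{x},\bm{\xi})]
 &= \nabla_{\bm{x}}\int_{\Xi} f(\bm{x},\bm{\xi})\,\mathrm{Pr}(\bm{\xi}\mid\bm{x})\,d\bm{\xi} \\
 &= \int_{\Xi}\Bigl(\nabla_{\bm{x}} f(\bm{x},\bm{\xi})\,\mathrm{Pr}(\bm{\xi}\mid\bm{x})
    + f(\bm{x},\bm{\xi})\,\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})\Bigr)\,d\bm{\xi} \\
 &= \mathbb{E}_{\bm{\xi}\sim D(\bm{x})}\left[\nabla_{\bm{x}} f(\bm{x},\bm{\xi})
    + f(\bm{x},\bm{\xi})\,\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})}{\mathrm{Pr}(\bm{\xi}\mid\bm{x})}\right].
\end{align*}
```

The bracketed quantity in the last line can be evaluated at a single sample $\bm{\xi}\sim D(\bm{x})$ whenever the density and its gradient are available, which is what the assumption above provides. For a discrete $\bm{\xi}$ the integral is replaced by a sum.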

3.2 Assumptions

Our development of an unbiased stochastic gradient for (P) whose variance is bounded by a constant requires a number of assumptions. In particular, we will make the following assumptions.

Assumption 1.

For all $\bm{x}\in\mathbb{R}^{n}$ and $\bm{\xi}\in\Xi$, the following hold:

(i) $f(\bm{x},\bm{\xi})$ is differentiable and Lipschitz continuous with modulus $L_{f}$ w.r.t. $\bm{x}$ and continuous w.r.t. $\bm{\xi}$,

(ii) $\mathrm{Pr}(\bm{\xi}\mid\bm{x})$ is differentiable w.r.t. $\bm{x}$ and $\mathrm{Pr}(\bm{\xi}\mid\bm{x})>0$, and

(iii) $\left\|\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})}{\mathrm{Pr}(\bm{\xi}\mid\bm{x})}\right\|\leq M$ for a constant $M$.

Assumption 2.

The set $\mathcal{C}$ is compact and convex. The set $\Xi$ is compact.

Moreover, we need the following assumption when $\bm{\xi}$ is a continuous random vector:

Assumption 3.

The set $\Xi$ is a Borel set on $\mathbb{R}^{n}$ . Moreover, $\mathrm{Pr}(\bm{\xi}\mid\bm{x})$ is continuous w.r.t. $\bm{\xi}$ for all $\bm{x}\in\mathbb{R}^{n}$ .

Assumptions 1–3 do not depend on a specific application; there are various applications that satisfy them (see Section 3.3). Condition (i) of Assumption 1 usually holds in pricing applications; the sales function $s(\bm{x},\bm{\xi})$ is usually expressed as $\bm{x}^{\top}\bm{\xi}$ (the product of price and demand), so it is differentiable w.r.t. $\bm{x}$ and Lipschitz continuous when $\bm{\xi}$ is bounded; the cost function $c(\bm{\xi})$ is usually continuous w.r.t. $\bm{\xi}$ since the production cost is usually continuous with respect to demand. Condition (ii) of Assumption 1 is satisfied by many distributions with a (statistical) parameter $\bm{\lambda}(\bm{x})$, where $\bm{\lambda}$ is a differentiable vector-valued function.⁴ Condition (iii) of Assumption 1 means that when the probability of a given demand is small, the effect of price on that probability is also small. In our application examples presented in Section 3.3, the multinomial and truncated normal distributions parameterized by price satisfy these conditions. Assumption 2 is natural for practical pricing applications since price and demand ranges are usually bounded. Assumption 3 is satisfied if $\bm{\xi}$ follows one of the major continuous probability distributions such as the normal and logistic distributions. In the next section, we show that our application examples satisfy Assumptions 1–3.

⁴ For example, the probability density functions of normal and multinomial distributions satisfy condition (ii) of Assumption 1. Since the probability density functions of these distributions are differentiable with respect to their parameters $\bm{\lambda}$ (e.g., mean, variance), they are also differentiable w.r.t. $\bm{x}$ from the differentiability of $\bm{\lambda}(\bm{x})$.

Remark.

Assumption 2 does not hold in the case of unconstrained price optimization, but we can assume $\bm{x}\in[-G,G]^{n}$ for a sufficiently large $G\in\mathbb{R}_{+}$ in practice.

3.3 Application Examples

3.3.1 Multiproduct Pricing

We consider a variant of the problems in Gallego and Wang (2014) and Zhang et al. (2018) in which a decision maker determines the prices of $n$ products for $m$ buyers. Let $\bm{x}:=(x_{1},x_{2},\dots,x_{n})\in[x_{\min},x_{\max}]^{n}$ be the price vector for the products. We assume that each buyer stochastically chooses at most one product: each buyer chooses product $i\in I:=\{1,\dots,n\}$ with probability $p_{i}(\bm{x})=\frac{e^{\gamma_{i}(\alpha_{i}-x_{i})}}{a_{0}+\operatorname*{\textstyle\sum}_{j=1}^{n}e^{\gamma_{j}(\alpha_{j}-x_{j})}}$ or does not choose any product with probability $p_{0}(\bm{x})=\frac{a_{0}}{a_{0}+\operatorname*{\textstyle\sum}_{j=1}^{n}e^{\gamma_{j}(\alpha_{j}-x_{j})}}$.⁵ Here, $\alpha_{i}$ and $\gamma_{i}$ are positive constants that can be estimated from historical transaction data (Croissant, 2012). Let $\bm{\xi}\in\{0,1,\dots,m\}^{n+1}$ be a random vector, where $\xi_{0}$ represents the number of buyers not purchasing any product and $\xi_{i}$ for $i=1,\dots,n$ represents the number of sales of each product. Let $s(\bm{x},\bm{\xi})$ and $c(\bm{\xi})$ be real-valued functions representing the sales and costs of products, respectively. The following functions are possible for $s$ and $c$:

⁵ Besides the multinomial logit model, various other models can be considered, such as the nested logit model (Gallego and Wang, 2014) and the generalized nested logit model (Zhang et al., 2018).

\displaystyle s(\bm{x},\bm{\xi}):=\sum_{i=1}^{n}x_{i}\xi_{i},\quad c(\bm{\xi}):=\sum_{i=1}^{n}c_{i}(\xi_{i}),\quad\mathrm{where}\ c_{i}(\xi_{i}):=\begin{cases}\eta^{1}_{i}\xi_{i},&\xi_{i}\leq l_{i},\\ \eta^{2}_{i}(\xi_{i}-l_{i})+\eta^{1}_{i}l_{i},&l_{i}<\xi_{i}\leq u_{i},\\ \eta^{3}_{i}(\xi_{i}-u_{i})+\eta^{2}_{i}(u_{i}-l_{i})+\eta^{1}_{i}l_{i},&\xi_{i}>u_{i}.\end{cases}

Here, $\eta^{1}_{i}$ , $\eta^{2}_{i}$ , $\eta^{3}_{i}$ , $l_{i}$ , and $u_{i}$ are constants for each $i$ . The function $c_{i}$ represents the case where the cost rate varies with the number of sold products (which is also called economies of scale or diseconomies of scale).
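As an illustration of the piecewise-linear cost $c_{i}$, the sketch below evaluates it for a single product. The function name and the numerical constants are hypothetical; the decreasing rates $3>2>1$ are chosen here only to mimic economies of scale.

```python
def piecewise_cost(xi, eta1, eta2, eta3, l, u):
    """Cost c_i(xi) whose marginal rate changes at the breakpoints l and u."""
    if xi <= l:
        return eta1 * xi
    if xi <= u:
        return eta2 * (xi - l) + eta1 * l
    return eta3 * (xi - u) + eta2 * (u - l) + eta1 * l


# Hypothetical constants: marginal cost 3 up to 10 units, 2 up to 20 units, 1 beyond.
params = dict(eta1=3.0, eta2=2.0, eta3=1.0, l=10, u=20)
for sold in (5, 10, 15, 20, 25):
    print(sold, piecewise_cost(sold, **params))
```

Because each branch starts from the value the previous branch reaches at its breakpoint, the cost is continuous in $\xi_{i}$ even though its slope changes.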

The revenue-maximizing problem is as follows:

\displaystyle\min_{\bm{x}\in[x_{\min},x_{\max}]^{n}}\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}\left[-s(\bm{x},\bm{\xi})+c(\bm{\xi})\right],

where the probability mass function of $D(\bm{x})$ is $\Pr(\bm{\xi}\mid\bm{x}):=\prod_{i=0}^{n}{}_{m}C_{\xi_{i}}p_{i}(\bm{x})^{\xi_{i}}$. It can be written in the form of (P).⁶

⁶ If the functions $s$ and $c$ are linear w.r.t. $\bm{\xi}$, then $\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}\left[-s(\bm{x},\bm{\xi})+c(\bm{\xi})\right]=-s(\bm{x},\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[\bm{\xi}])+c(\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[\bm{\xi}])$, and the problem is a deterministic optimization as is tackled in (Gallego and Wang, 2014; Zhang et al., 2018). Therefore, the problem can be regarded as a generalization of the problems in the previous studies in terms of sales and cost functions.

The following proposition shows that this application satisfies our assumptions.

Proposition 1.

Let $\gamma^{\max}:=\max_{i\in I}|\gamma_{i}|$ . The problem of multiproduct pricing satisfies Assumptions 1 and 2, where $f(\bm{x},\bm{\xi}):=-s(\bm{x},\bm{\xi})+c(\bm{\xi})$ , $\mathcal{C}:=[x_{\min},x_{\max}]^{n}$ , $L_{f}:=m$ , and $M:=nm\gamma^{\max}$ .

The proofs of this proposition and the others can be found in Appendix A.
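To make the constant $M:=nm\gamma^{\max}$ in Proposition 1 concrete, the sketch below samples a demand vector from the multinomial logit model and evaluates $\nabla_{\bm{x}}\log\Pr(\bm{\xi}\mid\bm{x})$, which equals the normalized gradient in condition (iii) of Assumption 1. The closed form $\gamma_{k}(mp_{k}(\bm{x})-\xi_{k})$ used here is worked out from the choice probabilities for this illustration and is not quoted from the paper's proof; all function names and parameter values are hypothetical.

```python
import math
import random


def choice_probs(x, alpha, gamma, a0):
    """Return [p_0, p_1, ..., p_n] for the multinomial logit model."""
    weights = [math.exp(g * (a - price)) for price, a, g in zip(x, alpha, gamma)]
    denom = a0 + sum(weights)
    return [a0 / denom] + [w / denom for w in weights]


def sample_demand(x, alpha, gamma, a0, m, rng):
    """Draw xi = (xi_0, ..., xi_n): each of the m buyers picks one option."""
    probs = choice_probs(x, alpha, gamma, a0)
    counts = [0] * len(probs)
    for option in rng.choices(range(len(probs)), weights=probs, k=m):
        counts[option] += 1
    return counts


def log_kernel(x, xi, alpha, gamma, a0):
    """sum_i xi_i * log p_i(x): log Pr(xi | x) up to a term constant in x."""
    probs = choice_probs(x, alpha, gamma, a0)
    return sum(c * math.log(p) for c, p in zip(xi, probs))


def score(x, xi, alpha, gamma, a0):
    """grad_x log Pr(xi | x); component k is gamma_k * (m * p_k(x) - xi_k)."""
    m = sum(xi)
    probs = choice_probs(x, alpha, gamma, a0)
    return [g * (m * probs[k + 1] - xi[k + 1]) for k, g in enumerate(gamma)]


# Hypothetical instance: n = 3 products, m = 50 buyers.
alpha, gamma, a0, m = [5.0, 6.0, 4.0], [0.8, 0.5, 1.2], 1.0, 50
x = [4.0, 5.0, 3.5]
xi = sample_demand(x, alpha, gamma, a0, m, random.Random(0))
norm = math.sqrt(sum(s * s for s in score(x, xi, alpha, gamma, a0)))
bound = len(x) * m * max(abs(g) for g in gamma)   # M = n * m * gamma_max
print(f"demand {xi}, score norm {norm:.3f}, bound M = {bound:.1f}")
```

The multinomial coefficient does not depend on $\bm{x}$, so it drops out of the gradient of the log-probability; this is why `log_kernel` omits it.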

Remark.

Assumption 1 is not satisfied if $p_{i}(\bm{x})$ is non-differentiable. However, we can satisfy it by using a smoothing technique (Chen and Mangasarian, 1996; Chen, 2012) to smooth $p_{i}(\bm{x})$ .

3.3.2 Congestion Pricing for HOT Lanes

We consider a stochastic variant of (Lou et al., 2011) in the following traffic situation:⁷ there are two lanes, a high-occupancy/toll (HOT) lane and a regular lane; drivers can only switch from the regular lane to the HOT lane. There is a decision maker determining a price $x_{i}\in[x_{\min},x_{\max}]$ of the HOT lane for each time interval $i\in I:=\{1,2,\dots,n\}$. The purpose of the decision maker is (i) to maximize the total flow rate at the bottlenecks of the HOT and regular lanes and (ii) to prevent the density of vehicles at the switching point from exceeding a certain level (to avoid traffic accidents). Let $d_{i}$ be the number of homogeneous drivers in the regular lane in a time interval $i\in I$. Here, each driver in $i$ changes lanes with a probability $p_{i}(x_{i}):=\frac{1}{1+e^{\alpha_{i}h_{i}+\beta_{i}x_{i}+\gamma_{i}}}$, where $h_{i}$ is a constant indicating the average time savings if a driver chooses the HOT lane at time $i\in I$. The parameters $\alpha_{i}$, $\beta_{i}$, and $\gamma_{i}$ are constants, which can be estimated in real time (Lou et al., 2011, Section 2.2).

⁷ Our method can be extended to more general situations, such as ones with many lanes.

The optimization problem is as follows (the details can be found in (Lou et al., 2011, Section 3)):

\displaystyle\max_{\bm{x}\in[x_{\min},x_{\max}]^{I}}\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}\left[\operatorname*{\textstyle\sum}_{i\in I}\left(q_{H}\left(\xi_{i}\right)+q_{R}\left(d_{i}-\xi_{i}\right)\right)+\theta\min\left(\tilde{k}-\frac{1}{|I|}\operatorname*{\textstyle\sum}_{i\in I}k\left(\xi_{i}\right),0\right)\right],

where $\xi_{i}\in\{0,1,\dots,d_{i}\}$ is a random variable indicating the number of drivers switching their lanes in $i$. Regarding the first term, $q_{H}$ and $q_{R}$ are continuous functions, and $q_{H}\left(\xi_{i}\right)$ and $q_{R}\left(d_{i}-\xi_{i}\right)$ represent the flow rates at the bottlenecks on the HOT lane and the regular lane, respectively. This term aims to maximize the flow rate of each lane. Regarding the second term, $\tilde{k}\in\mathbb{R}_{+}$ is the critical density (the density likely to cause traffic accidents) of vehicles at the switching point, and $k(\xi_{i})$ is a continuous function representing the density at the switching point for the demand $\xi_{i}$ in $i$. Therefore, $\theta\min\left(\tilde{k}-\frac{1}{|I|}\operatorname*{\textstyle\sum}_{i\in I}k\left(\xi_{i}\right),0\right)$ is a penalty term for densities above the critical density, where $\theta\geq 0$ is the penalty parameter. This optimization problem can be written in the form of (P), where $s(\bm{x},\bm{\xi}):=0$, $c(\bm{\xi}):=\operatorname*{\textstyle\sum}_{i\in I}\left(q_{H}\left(\xi_{i}\right)+q_{R}\left(d_{i}-\xi_{i}\right)\right)+\theta\min\left(\tilde{k}-\frac{1}{|I|}\operatorname*{\textstyle\sum}_{i\in I}k\left(\xi_{i}\right),0\right)$, and $\Pr(\bm{\xi}\mid\bm{x}):=\prod_{i\in I}{}_{d_{i}}C_{\xi_{i}}p_{i}(x_{i})^{\xi_{i}}(1-p_{i}(x_{i}))^{d_{i}-\xi_{i}}$.

The following proposition shows that this application satisfies our assumptions.

Proposition 2.

The problem of congestion pricing for HOT lanes satisfies Assumptions 1 and 2, where $f(\bm{x},\bm{\xi}):=\operatorname*{\textstyle\sum}_{i\in I}\left(q_{H}\left(\xi_{i}\right)+q_{R}\left(d_{i}-\xi_{i}\right)\right)+\theta\min(\tilde{k}-\frac{1}{|I|}\operatorname*{\textstyle\sum}_{i\in I}k\left(\xi_{i}\right),0)$, $L_{f}:=0$, $\mathcal{C}:=[x_{\min},x_{\max}]^{I}$, and $M:=|I|\max_{i\in I}\left(|\beta_{i}|d_{i}\right)$.
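The role of $|\beta_{i}|d_{i}$ in the constant $M$ of Proposition 2 can be checked for a single time interval. The sketch below is an illustration under hypothetical parameter values; the simplification of the derivative of the binomial log-probability to $\beta_{i}(d_{i}p_{i}(x_{i})-\xi_{i})$ is worked out here from the logistic switching probability and is not quoted from the paper's proof.

```python
import math


def switch_prob(x_i, alpha, h, beta, gamma):
    """Probability p_i(x_i) that a driver moves to the HOT lane at toll x_i."""
    return 1.0 / (1.0 + math.exp(alpha * h + beta * x_i + gamma))


def log_pmf(xi, d, x_i, alpha, h, beta, gamma):
    """Binomial log-probability that xi of the d drivers switch lanes."""
    p = switch_prob(x_i, alpha, h, beta, gamma)
    return (math.log(math.comb(d, xi))
            + xi * math.log(p) + (d - xi) * math.log(1.0 - p))


def score(xi, d, x_i, alpha, h, beta, gamma):
    """d/dx_i of log Pr(xi | x_i), which simplifies to beta * (d * p - xi)."""
    p = switch_prob(x_i, alpha, h, beta, gamma)
    return beta * (d * p - xi)


# Hypothetical parameters for one time interval.
par = dict(alpha=-0.5, h=4.0, beta=0.8, gamma=0.3)
d, toll = 30, 2.0
worst = max(abs(score(k, d, toll, **par)) for k in range(d + 1))
print(f"largest |score| over xi: {worst:.3f}, |beta| * d = {abs(par['beta']) * d:.1f}")
```

Since $0\leq d_{i}p_{i}(x_{i})\leq d_{i}$ and $0\leq\xi_{i}\leq d_{i}$, each component is at most $|\beta_{i}|d_{i}$ in absolute value, and summing over the $|I|$ intervals gives a bound of the same form as $M$.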

3.3.3 Pricing with Demand Prediction from Limited Data Points

Here, we will consider optimizing prices of $n$ types of item. Regarding the prices $\bm{x}\in[x_{\min},x_{\max}]^{n}$ , the demand $\xi_{i}\in\{\xi\mid 0\leq\xi\leq\xi_{i}^{\max}\}$ of item $i$ is predicted using data points $\hat{D}:=\{(\hat{\bm{x}}^{d},\hat{\bm{\xi}}^{d})\}_{d=1}^{N}$ through the truncated Gaussian process (Swiler et al., 2020, Section 8.1):

$\displaystyle\xi_{i}\sim\frac{\bm{1}_{\{0\leq\xi_{i}\leq\xi_{i}^{\max}\}}(\xi_{i})}{C^{i}(\bm{x})}N[\bm{v}^{i}(\bm{x})^{\top}\bm{a}^{i},(\sigma^{i})^{2}-\bm{v}^{i}(\bm{x})^{\top}A^{i}\bm{v}^{i}(\bm{x})],$

$\displaystyle\textrm{where}\ C^{i}(\bm{x})=\int_{0}^{\xi_{i}^{\max}}\frac{1}{\sqrt{2\pi((\sigma^{i})^{2}-\bm{v}^{i}(\bm{x})^{\top}A^{i}\bm{v}^{i}(\bm{x}))}}\exp\left(-\frac{(\phi-\bm{v}^{i}(\bm{x})^{\top}\bm{a}^{i})^{2}}{2((\sigma^{i})^{2}-\bm{v}^{i}(\bm{x})^{\top}A^{i}\bm{v}^{i}(\bm{x}))}\right)d\phi.$

Here, $\bm{v}^{i}:\mathbb{R}^{n}\to\mathbb{R}^{N}$, $\bm{a}^{i}\in\mathbb{R}^{N}$, $\sigma^{i}\in\mathbb{R}$, and $A^{i}\in\mathbb{R}^{N\times N}$ are respectively the function, vector, scalar, and matrix learned from the data points $\hat{D}$. The $j$-th element $v_{j}^{i}(\bm{x})$ of $\bm{v}^{i}(\bm{x})$ is defined by $v^{i}_{j}(\bm{x}):=\theta_{1}^{i}\exp\left(-\frac{\|\bm{x}-\hat{\bm{x}}^{j}\|^{2}}{\theta^{i}_{2}}\right)$, where $\theta_{1}^{i}\in\mathbb{R}_{+}$ and $\theta_{2}^{i}\in\mathbb{R}_{+}$ are learned constants. The normalization function $C^{i}(\bm{x})$ is the probability that a sample lies in $\{0\leq\xi_{i}\leq\xi_{i}^{\max}\}$. Here, $(\sigma^{i})^{2}-\bm{v}^{i}(\bm{x})^{\top}A^{i}\bm{v}^{i}(\bm{x})\geq\Delta$ for some $\Delta\in\mathbb{R}_{+}$.⁸

⁸ Given that the observations are subject to noise, it is natural to predict that the variance $(\sigma^{i})^{2}-\bm{v}^{i}(\bm{x})^{\top}A^{i}\bm{v}^{i}(\bm{x})$ is more than or equal to a certain constant ($\Delta$).

The revenue-maximizing problem is as follows:

\displaystyle\min_{\bm{x}\in[x_{\min},x_{\max}]^{n}}\ \mathbb{E}_{\bm{\xi}\sim D(\bm{x})}\left[-s(\bm{x},\bm{\xi})+c\left(\bm{\xi}\right)\right],

where $s(\bm{x},\bm{\xi}):=\bm{\xi}^{\top}\bm{x}$, $c(\bm{\xi}):=\sum_{i=1}^{n}c_{i}(\xi_{i})$, and $c_{i}:\mathbb{R}\to\mathbb{R}$ is a continuous function for $i=1,2,\dots,n$. $c_{i}(\xi_{i})$ represents the cost for item $i$. This problem can be written in the form of (P), where $\Pr(\bm{\xi}\mid\bm{x}):=\prod_{i=1}^{n}\frac{1}{C^{i}(\bm{x})\sqrt{2\pi((\sigma^{i})^{2}-\bm{v}^{i}(\bm{x})^{\top}A^{i}\bm{v}^{i}(\bm{x}))}}\exp\left(-\frac{(\xi_{i}-\bm{v}^{i}(\bm{x})^{\top}\bm{a}^{i})^{2}}{2((\sigma^{i})^{2}-\bm{v}^{i}(\bm{x})^{\top}A^{i}\bm{v}^{i}(\bm{x}))}\right)$ for $\bm{\xi}\in\Xi:=\{\bm{\xi}\mid\forall i\in[n],0\leq\xi_{i}\leq\xi_{i}^{\max}\}$.

The following proposition shows that this application satisfies our assumptions.

Proposition 3.

Let $\theta_{1}^{\max}:=\max_{i}\theta_{1}^{i}$, $\theta_{2}^{\min}:=\min_{i}\theta_{2}^{i}$, $a^{\max}:=\max_{i,k}|a^{i}_{k}|$, $A^{\max}:=\max_{i,k,l}|A^{i}_{k,l}|$, and $\xi^{\max}:=\max_{i}\xi_{i}^{\max}$. The problem of pricing with demand prediction from limited data points satisfies Assumptions 1–3, where $f(\bm{x},\bm{\xi}):=-s(\bm{x},\bm{\xi})+c(\bm{\xi})$, $\mathcal{C}:=[x_{\min},x_{\max}]^{n}$, $L_{f}:=n\xi^{\max}$, and $M:=\frac{4n^{2}N\theta_{1}^{\max}(x_{\max}-x_{\min})}{\Delta\theta_{2}^{\min}}\left(NA^{\max}\theta_{1}^{\max}+(\xi^{\max}+N\theta_{1}^{\max}a^{\max})\left(a^{\max}+NA^{\max}\theta_{1}^{\max}\frac{\xi^{\max}+N\theta_{1}^{\max}a^{\max}}{\Delta}\right)\right)$.

3.4 Advantages of Our Formulation

3.4.1 Benefits of Using Decision-dependent Random Variables

The multiproduct pricing problem in Section 3.3.1 can also be expressed in terms of decision-independent random variables as follows. Each buyer $j=1,2,\dots,m$ has a value $\gamma_{i}(\alpha_{i}-x_{i})+\mu_{ij}$ for each product $i$ , where $\alpha_{i}$ and $\gamma_{i}$ are constants, $x_{i}$ is the price, and $\mu_{ij}$ is a random variable following a Gumbel distribution with mode 0 and variance $\frac{\pi^{2}}{6}$ . Each buyer purchases a product $i$ with the highest $\gamma_{i}(\alpha_{i}-x_{i})+\mu_{ij}$ . Accordingly, the demand $\xi_{i}$ for product $i$ can be defined by $\xi_{i}(\bm{x},\bm{\mu}):=\sum_{j=1}^{m}\xi_{ij}(\bm{x},\bm{\mu}_{j})$ , where $\bm{\mu}_{j}=(\mu_{ij})_{i=1}^{n}$ and $\xi_{ij}(\bm{x},\bm{\mu}_{j}):=1$ if $i=\operatorname*{argmax}_{r}\{\gamma_{r}(\alpha_{r}-x_{r})+\mu_{rj}\}$ with $\xi_{ij}(\bm{x},\bm{\mu}_{j}):=0$ otherwise. The optimization problem can be written as follows by letting $\bm{\xi}(\bm{x},\bm{\mu}):=(\xi_{i}(\bm{x},\bm{\mu}))_{i=1}^{n}$ :

\displaystyle\min_{\bm{x}\in[x_{\min},x_{\max}]^{n}}\mathbb{E}_{\bm{\mu}}\left% [-s(\bm{x},\bm{\xi}(\bm{x},\bm{\mu}))+c(\bm{\xi}(\bm{x},\bm{\mu}))\right].

Although the multiproduct pricing problem can be formulated in the above manner with decision-independent random variables, the discontinuous function $\xi_{ij}(\bm{x},\bm{\mu}_{j})$ makes it difficult to optimize. (Footnote 9: Optimization problems involving such discontinuous functions have been addressed by Correa et al. (2017), who propose approximation methods to deal with this difficulty.) In contrast, our problem does not involve a discontinuous function, which allows us to use gradient-based methods.
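The decision-independent formulation above can be simulated directly. The following is a minimal sketch, assuming illustrative values for $\alpha_{i}$, $\gamma_{i}$, and $\bm{x}$ that are not taken from the paper; the function name `simulate_demand` is hypothetical. It draws the Gumbel noise $\mu_{ij}$, lets each buyer pick the product with the highest value, and compares the empirical choice frequencies with the softmax probabilities that Gumbel noise is known to induce. The `argmax` step is the discontinuity in $\bm{x}$ discussed above.

```python
import numpy as np

def simulate_demand(x, alpha, gamma, m, rng):
    """Demand xi(x, mu): each of m buyers picks the product with the highest value."""
    mu = rng.gumbel(loc=0.0, scale=1.0, size=(m, len(x)))  # mode 0, variance pi^2 / 6
    value = gamma * (alpha - x) + mu                        # gamma_i (alpha_i - x_i) + mu_ij
    choice = value.argmax(axis=1)                           # piecewise constant in x
    return np.bincount(choice, minlength=len(x))            # xi_i = number of buyers choosing i

rng = np.random.default_rng(0)
alpha = np.array([1.0, 0.8, 0.6])   # illustrative values, not from the paper
gamma = np.array([2.0, 2.0, 2.0])
x = np.array([0.5, 0.5, 0.5])
xi = simulate_demand(x, alpha, gamma, m=10_000, rng=rng)

# Under Gumbel noise the choice frequencies approach softmax probabilities.
logits = gamma * (alpha - x)
p = np.exp(logits) / np.exp(logits).sum()
print(xi / xi.sum(), p)
```

Because the sampled demand is piecewise constant in $\bm{x}$ for fixed noise, differentiating through this simulation gives a zero gradient almost everywhere, which is why the decision-dependent distribution is the more convenient object for gradient-based methods.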

Moreover, He et al. (2009) and Dong et al. (2017) tackle similar problems to ours by defining the demand for a product/service as $d(x)+\xi$ , where $x$ is price and $\xi$ is a decision-independent random variable. However, assuming $\xi$ is decision-independent makes it impossible to handle situations where demand uncertainty varies with price. In contrast, our problem setting can deal with such a situation by using decision-dependent random variables.

3.4.2 Differences from Existing Pricing Problems with Decision-dependent Uncertainty

The existing formulations with decision-dependent uncertainty make assumptions specific to their applications. For example, Schulte and Sachs (2020) assume that demand follows a Poisson distribution with an intensity $\lambda(x)$ , and they cannot use a multinomial or truncated Gaussian distribution as the demand distribution. Moreover, since a fixed cost is charged on their products, they cannot handle a nonlinear cost. Hikima et al. (2021) assume that the probability density function for demand is ${\rm Pr}(\bm{\xi}\mid\bm{x})=\displaystyle{\prod\nolimits_{u\in U}}\left\{p_{u}(x_{u})^{\xi_{u}}(1-p_{u}(x_{u}))^{(1-\xi_{u})}\right\}$ , where $p_{u}(x_{u})$ is the probability that service user $u\in U$ accepts the price $x_{u}$ ; they cannot use a truncated Gaussian distribution. In addition, they assume a specific objective function, which is defined by a bipartite matching problem with uncertainty.

In contrast to the existing formulations, ours covers a wider range of applications because its assumptions are more general. The trade-off for this generality is that our problem is non-convex and that, because the probability distribution depends on the decision variables, conventional stochastic optimization theory does not apply directly. Below, we focus on finding a stationary point and develop a projected stochastic gradient descent method by deriving unbiased stochastic gradient estimators.

4 Proposed Method

4.1 Preliminaries

Definition 1 (Projection oracle).

Given a point $\bm{x}$ , we define the following as a projection oracle:

\displaystyle\mathrm{proj}_{\mathcal{C}}(\bm{x}):=\mathrm{arg}\min_{\bm{y}\in% \mathbb{R}^{n}}\{\|\bm{x}-\bm{y}\|_{2}\mid\bm{y}\in\mathcal{C}\}.

Definition 2 (Unbiased stochastic gradient).

Given a point $\bm{x}$ , we call $g(\bm{x},\bm{\xi})$ an “unbiased stochastic gradient” if

\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[g(\bm{x},\bm{\xi})]=\nabla_{\bm{x}}\mathbb% {E}_{\bm{\xi}\sim D(\bm{x})}[f(\bm{x},\bm{\xi})].

Definition 3 (Gradient mapping).

Given a point $\bm{x}$ and $\eta\in\mathbb{R}_{+}$ , the gradient mapping of (P) is defined by

\mathcal{G}(\bm{x},\eta):=\frac{1}{\eta}(\bm{x}-\mathrm{proj}_{\mathcal{C}}(% \bm{x}-\eta\nabla_{\bm{x}}\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[f(\bm{x},\bm{\xi% })])).

Definition 4 ( $\varepsilon$ -stationary point).

We call $\hat{\bm{x}}$ an $\varepsilon$ -stationary point for (P) if $\mathbb{E}_{\hat{\bm{x}}}[\|\mathcal{G}(\hat{\bm{x}},\eta)\|^{2}]\leq% \varepsilon^{2}$ for some $\eta\in\mathbb{R}_{+}$ , where $\hat{\bm{x}}$ denotes the point returned by a stochastic algorithm.
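For the box constraint $\mathcal{C}=[x_{\min},x_{\max}]^{n}$ used in the pricing applications, the projection oracle of Definition 1 reduces to coordinate-wise clipping, and the gradient mapping of Definition 3 follows directly. The sketch below assumes that the exact gradient at $\bm{x}$ is supplied by the caller; the function names and the numerical values are illustrative only.

```python
import numpy as np

def proj_box(x, x_min, x_max):
    """Euclidean projection onto C = [x_min, x_max]^n (coordinate-wise clipping)."""
    return np.clip(x, x_min, x_max)

def gradient_mapping(x, grad, eta, x_min, x_max):
    """G(x, eta) = (x - proj_C(x - eta * grad)) / eta, with grad the gradient at x."""
    return (x - proj_box(x - eta * grad, x_min, x_max)) / eta

x = np.array([0.01, 5.0])        # first coordinate sits on the lower bound
grad = np.array([1.0, 0.2])      # illustrative gradient values
G = gradient_mapping(x, grad, eta=0.5, x_min=0.01, x_max=10.0)
print(G)
```

In this example the first coordinate lies on the boundary and the descent direction points outside $\mathcal{C}$, so the corresponding entry of the gradient mapping is zero, whereas the interior coordinate simply returns its gradient entry.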

The following preliminary lemmas are needed for ensuring our method’s convergence when the random variables are continuous.

Lemma 4.

Suppose that Assumptions 1–3 hold. Let $h(\bm{x},\bm{\xi}):=f(\bm{x},\bm{\xi})\mathrm{Pr}(\bm{\xi}\mid\bm{x})$ . Then,

\displaystyle\nabla_{\bm{x}}\int_{\bm{\xi}\in\Xi}h(\bm{x},\bm{\xi})d\bm{\xi}=% \int_{\bm{\xi}\in\Xi}\nabla_{\bm{x}}h(\bm{x},\bm{\xi})d{\bm{\xi}}

for all $\bm{x}\in\mathcal{C}$ .

Lemma 5.

Suppose that conditions (ii) and (iii) of Assumption 1 and Assumptions 2 and 3 hold. Then,

\displaystyle\nabla_{\bm{x}}\int_{\bm{\xi}\in\Xi}q(\bm{\xi})\mathrm{Pr}(\bm{% \xi}\mid\bm{x})d\bm{\xi}=\int_{\bm{\xi}\in\Xi}\nabla_{\bm{x}}q(\bm{\xi})% \mathrm{Pr}(\bm{\xi}\mid\bm{x})d{\bm{\xi}}

for all $\bm{x}\in\mathcal{C}$ and any real-valued continuous function $q:\Xi\to\mathbb{R}$ .

Throughout the paper, we let

\displaystyle f_{\max}:=\max_{\bm{x}\in\mathcal{C},\bm{\xi}\in\Xi}|f(\bm{x},\bm{\xi})|,\qquad(1)

which exists since $\mathcal{C}$ and $\Xi$ are compact from Assumption 2 and $f(\bm{x},\bm{\xi})$ is real-valued and continuous from Assumption 1.

4.2 Unbiased Stochastic Gradient for (P)

First, we propose an unbiased stochastic gradient for (P).

Lemma 6.

Suppose that conditions (i) and (ii) of Assumption 1 hold. Moreover, suppose that condition (iii) of Assumption 1, Assumption 2, and Assumption 3 hold if $\bm{\xi}$ is a continuous random vector. Let $\delta\in\mathbb{R}$ and

\bm{g}(\bm{x},\bm{\xi},\delta):=\nabla_{\bm{x}}f(\bm{x},\bm{\xi})+\left(f(\bm{% x},\bm{\xi})-\delta\right)\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})% }{\mathrm{Pr}(\bm{\xi}\mid\bm{x})}.

Then, $\bm{g}(\bm{x},\bm{\xi},\delta)$ is an unbiased stochastic gradient for (P) for any $\delta\in\mathbb{R}$ .

Inspired by the baseline technique in reinforcement learning (Williams, 1992; Sutton and Barto, 2018), we include a variance reduction parameter $\delta$ in the unbiased stochastic gradient. If $\delta$ is close to $\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[f(\bm{x},\bm{\xi})]$ , the second term of $\bm{g}(\bm{x},\bm{\xi},\delta)$ is small, and the variance of $\bm{g}(\bm{x},\bm{\xi},\delta)$ is reduced. We show how to determine $\delta$ in Section 4.3.
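The effect of $\delta$ can be checked on a one-dimensional toy problem. The sketch below assumes a single product with $f(x,\xi)=-x\xi$ and $\xi\sim\mathrm{Binomial}(d,p(x))$, where the logistic form of $p(x)$ and all numerical values are hypothetical choices made for illustration. It evaluates the estimator of Lemma 6 with $\delta=0$ and with $\delta=\mathbb{E}[f]$ and compares both with the closed-form gradient.

```python
import numpy as np

d, a, x = 50, 1.0, 1.0                      # illustrative values, not from the paper
p = 1.0 / (1.0 + np.exp(x - a))             # hypothetical acceptance probability p(x)
dp_dx = -p * (1.0 - p)                      # derivative of p(x)

def g(xi, delta):
    """Lemma 6 estimator for f(x, xi) = -x * xi with xi ~ Binomial(d, p(x))."""
    f = -x * xi
    grad_f = -xi                                        # nabla_x f(x, xi)
    score = dp_dx * (xi / p - (d - xi) / (1.0 - p))     # nabla_x Pr / Pr
    return grad_f + (f - delta) * score

rng = np.random.default_rng(0)
xi = rng.binomial(d, p, size=200_000)
true_grad = -d * p - x * d * dp_dx          # derivative of E[f] = -x d p(x)
expected_f = -x * d * p
g_no_baseline = g(xi, 0.0)
g_baseline = g(xi, expected_f)
print(true_grad, g_no_baseline.mean(), g_baseline.mean())
print(g_no_baseline.var(), g_baseline.var())
```

Both sample means should be close to the exact gradient, in line with Lemma 6, while the sample variance with $\delta=\mathbb{E}[f]$ should be much smaller than with $\delta=0$ in this toy setting.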

The gradient in Lemma 6 has the following useful feature.

Lemma 7.

Suppose that Assumptions 1 and 2 hold. Moreover, suppose that Assumption 3 holds if $\bm{\xi}$ is a continuous random vector. Let $\delta\in[-f_{\max},f_{\max}]$ . Then, for all $\bm{x}\in\mathcal{C}$ ,

\displaystyle\mathbb{E}_{\bm{\xi}^{\prime}\sim D(\bm{x})}[\|\bm{g}(\bm{x},\bm{% \xi}^{\prime},\delta)-\nabla_{\bm{x}}\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[f(\bm% {x},\bm{\xi})]\|^{2}]\leq(L_{f}+2f_{\max}M)^{2},

where $\bm{g}(\bm{x},\bm{\xi}^{\prime},\delta)$ is defined as in Lemma 6.

Lemma 7 shows that the variance of the stochastic gradient of Lemma 6 can be bounded by a constant. This is a necessary condition for stochastic gradient methods to have a convergence rate independent of the number of possible values of $\bm{\xi}$ (Li and Li, 2018).

Moreover, the following lemma is necessary for ensuring the convergence of the proposed method.

Lemma 8.

Suppose that Assumptions 1 and 2 hold. Moreover, suppose that Assumption 3 holds if $\bm{\xi}$ is a continuous random vector. Then, for all $\bm{x}\in\mathcal{C}$ ,

\displaystyle\|\nabla_{\bm{x}}\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[f(\bm{x},\bm% {\xi})]\|\leq L_{f}+f_{\max}M.

4.3 Calculation of Variance Reduction Parameter $\delta$

To reduce the variance of the gradient in Lemma 6, the parameter $\delta$ should be close to $\mathbb{E}_{\bm{\xi}\sim D(\bm{x}_{k})}[f(\bm{x}_{k},\bm{\xi})]$ for the iterate $\bm{x}_{k}$ . During the iterations of the algorithm, $\delta_{k}$ is updated to bring it closer to the target value. We consider the following sequential stochastic problem for $\delta_{k}$ : a decision maker selects $\delta_{k}\in[-f_{\max}-\kappa,f_{\max}+\kappa]$ at iteration $k$ and incurs an unobserved cost $\psi_{k}(\delta_{k}):=\frac{1}{2}(\delta_{k}-\mathbb{E}_{\bm{\xi}\sim D(\bm{x}_{k})}[f(\bm{x}_{k},\bm{\xi})])^{2}$ , where $\bm{x}_{k}$ and $\kappa\in\mathbb{R}_{+}$ are given (footnote 10: $\kappa$ is an arbitrarily small positive value; it extends the range of $\delta_{k}$ , which is needed in Proposition 9). The decision maker obtains an unbiased estimate of $\mathbb{E}_{\bm{\xi}\sim D(\bm{x}_{k})}[f(\bm{x}_{k},\bm{\xi})]$ , denoted by $v_{k}$ , by sampling. Here, we assume $v_{k}\in[-f_{\max},f_{\max}]$ , which usually holds from the definition (1) of $f_{\max}$ .

As a way to solve the above problem, we propose Algorithm 1, which is based on the online gradient descent (OGD) algorithm (Besbes et al., 2015).

Algorithm 1 OGD algorithm

1: Input: iteration limit $R$ , initial parameter $\delta_{1}\in[-f_{\max},f_{\max}]$ , stepsize parameters $\{\zeta_{k}\}_{k=2}^{R}$ , noisy target values $\{v_{k}\}_{k\in[R]}$
2: Output: $\delta_{k}$ for $k\in[R]$
3: for $k=1,\dots,R-1$ do
4:   $\delta_{k+1}\leftarrow(1-\zeta_{k+1})\delta_{k}+\zeta_{k+1}v_{k}$

This method updates $\delta_{k}$ by using the stochastic gradient since $(1-\zeta_{k+1})\delta_{k}+\zeta_{k+1}v_{k}=\delta_{k}-\zeta_{k+1}(\delta_{k}-v% _{k})\approx\delta_{k}-\zeta_{k+1}\nabla\psi_{k}(\delta_{k})$ . Note that $\nabla\psi_{k}(\delta_{k})=\delta_{k}-\mathbb{E}_{\bm{\xi}\sim D(\bm{x}_{k})}[% f(\bm{x}_{k},\bm{\xi})]$ from the definition of $\psi_{k}$ . Accordingly, the following proposition holds from (Besbes et al., 2015, Lemma C-5), which guarantees that Algorithm 1 outputs a solution close to the optimum in terms of regret.

Proposition 9.

Let $\zeta_{k}:=\frac{1}{k}$ for $k\in[R]$ , and let $\delta_{k}$ be the output of Algorithm 1 for $k\in[R]$ . Then, there exists a constant $\bar{C}$ such that

\displaystyle\mathbb{E}\left[\sum_{k=1}^{R}\psi_{k}(\delta_{k})\right]-\min_{% \delta}\sum_{k=1}^{R}\psi_{k}(\delta)\leq\bar{C}\log R.

From this proposition and the definition of $\psi_{k}$ , we find that the output $\delta_{k}$ of Algorithm 1 is a reasonable approximation of $\mathbb{E}_{\bm{\xi}\sim D(\bm{x}_{k})}[f(\bm{x}_{k},\bm{\xi})]$ .
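The update of Algorithm 1 is short enough to write out. The following sketch uses a hypothetical function name and made-up target values; with $\zeta_{k}=\frac{1}{k}$ , unrolling the recursion gives $\delta_{k}=\frac{1}{k}(\delta_{1}+v_{1}+\dots+v_{k-1})$ , i.e., a running average of the initial value and the noisy targets observed so far.

```python
def ogd_baseline(delta1, v, zeta=lambda k: 1.0 / k):
    """Algorithm 1: delta_{k+1} = (1 - zeta_{k+1}) * delta_k + zeta_{k+1} * v_k."""
    deltas = [delta1]
    for k in range(1, len(v)):                 # k = 1, ..., R-1
        z = zeta(k + 1)
        deltas.append((1.0 - z) * deltas[-1] + z * v[k - 1])   # v[k-1] stores v_k
    return deltas                              # delta_1, ..., delta_R

deltas = ogd_baseline(0.0, [2.0, 4.0, 6.0])    # made-up noisy targets v_1, v_2, v_3
print(deltas)
```

Here $\delta_{2}=(0+2)/2=1$ and $\delta_{3}=(0+2+4)/3=2$ ; the last target $v_{3}$ is not used because the loop stops at $k=R-1$ .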

4.4 Proposed Algorithm

We propose Algorithm 2 for solving problem (P). It incorporates our stochastic gradient and Algorithm 1 into a projected stochastic gradient method (Ghadimi and Lan, 2016, Algorithm 4). Lines 5–9 update the iterate on the basis of (Ghadimi and Lan, 2016, Algorithm 4) by using our proposed stochastic gradient. Line 10 updates the variance reduction parameter on the basis of Algorithm 1 by letting $v_{k}$ be $\frac{1}{m_{k}}\sum_{\ell=1}^{m_{k}}f(\bm{x}_{k}^{md},\bm{\xi}^{\ell})$ . Note that line 10 does not impose any additional computation cost since $\frac{1}{m_{k}}{\sum_{\ell=1}^{m_{k}}}f(\bm{x}_{k}^{md},\bm{\xi}^{\ell})$ is already computed on line 7.

Then, from Lemmas 6–8 and (Ghadimi and Lan, 2016, Corollary 6), the following convergence theorem holds.

Algorithm 2 Projected stochastic gradient algorithm

1: Input: initial iterate $\bm{x}_{0}\in\mathcal{C}$ , initial variance reduction parameter $\delta_{1}\in[-f_{\max},f_{\max}]$ , iteration limit $N\geq 1$ , batch sizes $\{m_{k}\}_{k\in[N]}$ , probability distribution $D_{R}(N)$ for set $\{1,2,\dots,N\}$ , and stepsize parameters $\{\alpha_{k}\in(0,1]\}_{k\in[N]}$ , $\{\beta_{k}\in\mathbb{R}_{+}\}_{k\in[N]}$ , $\{\lambda_{k}\in\mathbb{R}_{+}\}_{k\in[N]}$ , and $\{\zeta_{k}\}_{k=2}^{N}$
2: Output: $\bm{x}^{md}_{R}$
3: Set $\bm{x}^{ag}_{0}=\bm{x}_{0}$ and sample $R\sim D_{R}(N)$
4: for $k=1,2,\dots,R$ do
5:   $\bm{x}_{k}^{md}\leftarrow(1-\alpha_{k})\bm{x}_{k-1}^{ag}+\alpha_{k}\bm{x}_{k-1}$
6:   Sample $\bm{\xi}^{\ell}\sim D(\bm{x}_{k}^{md})$ for $\ell=1,\dots,m_{k}$
7:   $\bm{g}_{k}\leftarrow\frac{1}{m_{k}}\sum_{\ell=1}^{m_{k}}\left(\nabla_{\bm{x}}f(\bm{x}_{k}^{md},\bm{\xi}^{\ell})+(f(\bm{x}_{k}^{md},\bm{\xi}^{\ell})-\delta_{k})\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}^{\ell}\mid\bm{x}_{k}^{md})}{\mathrm{Pr}(\bm{\xi}^{\ell}\mid\bm{x}_{k}^{md})}\right)$
8:   $\bm{x}_{k}\leftarrow\mathrm{proj}_{\mathcal{C}}(\bm{x}_{k-1}-\lambda_{k}\bm{g}_{k})$
9:   $\bm{x}_{k}^{ag}\leftarrow\mathrm{proj}_{\mathcal{C}}(\bm{x}_{k}^{md}-\beta_{k}\bm{g}_{k})$
10:   if $k\leq R-1$ then $\delta_{k+1}\leftarrow(1-\zeta_{k+1})\delta_{k}+\frac{\zeta_{k+1}}{m_{k}}\sum_{\ell=1}^{m_{k}}f(\bm{x}_{k}^{md},\bm{\xi}^{\ell})$

Theorem 10.

Suppose that Assumptions 1 and 2 hold. Moreover, suppose that Assumption 3 holds if $\bm{\xi}$ is a continuous random vector. Let the inputs of Algorithm 2 be $\alpha_{k}:=\frac{2}{k+1},\ \beta_{k}:=\frac{1}{2L_{Ef}},\ \lambda_{k}:=\frac{% k\beta_{k}}{2},m_{k}:=\left\lceil\frac{(L_{f}+2f_{\max}M)^{2}k}{L_{Ef}\tilde{D% }^{2}}\right\rceil$ , and $\Pr(R=k):=\frac{\Gamma_{k}^{-1}\beta_{k}(1-L_{Ef}\beta_{k})}{\sum_{{\tau}=1}^{% N}\Gamma_{\tau}^{-1}\beta_{\tau}(1-L_{Ef}\beta_{\tau})}$ for $k=1,2,\dots,N,$ where $\tilde{D}$ is some parameter, $L_{Ef}:=L_{f}+f_{\max}M$ , $\Gamma_{1}:=1$ , and $\Gamma_{k}:=(1-\alpha_{k})\Gamma_{k-1}$ for $k=2,\dots,N$ . Let $\zeta_{k}:=\frac{1}{k}$ for $k=2,\dots,N$ . Then,

\displaystyle\mathbb{E}[\|\mathcal{G}(\bm{x}_{R}^{md},{\beta_{R}})\|^{2}]\leq 9% 6L_{Ef}\left[\frac{4L_{Ef}\|\bm{x}_{0}-\bm{x}^{*}\|^{2}}{N(N+1)(N+2)}+\frac{L_% {Ef}(\|\bm{x}^{*}\|^{2}+H^{2})+2\tilde{D}^{2}}{N}\right],

where $H:=\max_{\bm{x}\in\mathcal{C}}\|\bm{x}\|$ . Consequently, to obtain an $\varepsilon$ -stationary point of Definition 4, we need at most $O\left(\left[\frac{L_{Ef}^{2}\|\bm{x}_{0}-\bm{x}^{*}\|^{2}}{\varepsilon^{2}}% \right]^{\frac{1}{3}}+\frac{L_{Ef}^{2}(\|\bm{x}^{*}\|^{2}+H^{2})+L_{Ef}\tilde{% D}^{2}}{\varepsilon^{2}}\right)$ iterations.

The parameter $\tilde{D}$ in Theorem 10 determines the balance between the minibatch size and the iteration complexity: a small $\tilde{D}$ results in smaller iteration complexity but a larger minibatch size; a large $\tilde{D}$ leads to a smaller minibatch size but larger iteration complexity.
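The inputs prescribed in Theorem 10 can be generated mechanically. The sketch below uses a hypothetical function name and arbitrary constants for $L_{f}$ , $f_{\max}$ , $M$ , and $\tilde{D}$ . Because $\beta_{k}$ is constant and $\Gamma_{k}=\frac{2}{k(k+1)}$ for $\alpha_{k}=\frac{2}{k+1}$ , the stopping distribution simplifies to $\Pr(R=k)=\frac{3k(k+1)}{N(N+1)(N+2)}$ , which the example checks numerically.

```python
import numpy as np

def theorem10_inputs(N, L_f, f_max, M, D_tilde):
    """Stepsizes, batch sizes, and Pr(R = k) for k = 1..N as stated in Theorem 10."""
    L_Ef = L_f + f_max * M
    k = np.arange(1, N + 1)
    alpha = 2.0 / (k + 1)
    beta = np.full(N, 1.0 / (2.0 * L_Ef))
    lam = k * beta / 2.0
    m = np.ceil((L_f + 2 * f_max * M) ** 2 * k / (L_Ef * D_tilde ** 2)).astype(int)
    Gamma = np.ones(N)                               # Gamma_1 = 1
    for i in range(1, N):                            # index i corresponds to k = i + 1
        Gamma[i] = (1.0 - alpha[i]) * Gamma[i - 1]
    w = beta * (1.0 - L_Ef * beta) / Gamma           # unnormalized Pr(R = k)
    return alpha, beta, lam, m, w / w.sum()

# Arbitrary constants chosen only to make the arithmetic easy to follow.
alpha, beta, lam, m, prob = theorem10_inputs(N=5, L_f=1.0, f_max=1.0, M=1.0, D_tilde=3.0)
print(m, prob)
```

The distribution puts most of its mass on late iterations, and halving $\tilde{D}$ multiplies every batch size $m_{k}$ by roughly four, which is the trade-off described above.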

Bottleneck of Algorithm 2.

The bottleneck of Algorithm 2 is line 7 because it requires $\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}^{\ell}\mid\bm{x}_{k}^{md})}{\mathrm{% Pr}(\bm{\xi}^{\ell}\mid\bm{x}_{k}^{md})}$ to be computed $m_{k}$ times. This calculation takes a lot of time since $m_{k}$ has to be at least proportional to the number $k$ of current iterations to obtain the convergence rate of Theorem 10.
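To make the cost structure concrete, the following is a minimal sketch of the loop in Algorithm 2 on a toy instance; it is not the implementation used in our experiments. It assumes independent demand $\xi_{i}\sim\mathrm{Binomial}(d,p_{i}(x_{i}))$ with a hypothetical logistic $p_{i}$ and $f(\bm{x},\bm{\xi})=-\bm{\xi}^{\top}\bm{x}$ , passes the stopping index $R$ in directly instead of sampling it from $D_{R}(N)$ , and uses hand-picked stepsizes. For this model, $\frac{\nabla_{\bm{x}}\mathrm{Pr}}{\mathrm{Pr}}$ has the closed form $d\,p_{i}(x_{i})-\xi_{i}$ in coordinate $i$ .

```python
import numpy as np

def algorithm2(x0, delta1, R, m, alpha, beta, lam, zeta, sample, f, grad_f, score, proj):
    """Sketch of Algorithm 2 with the stopping index R passed in directly."""
    x_prev = np.asarray(x0, dtype=float)
    x_ag, delta = x_prev.copy(), delta1
    for k in range(1, R + 1):
        x_md = (1 - alpha(k)) * x_ag + alpha(k) * x_prev          # line 5
        xi = sample(x_md, m(k))                                    # line 6, shape (m_k, n)
        fv = f(x_md, xi)                                           # f values, shape (m_k,)
        g = (grad_f(x_md, xi)
             + (fv - delta)[:, None] * score(x_md, xi)).mean(axis=0)   # line 7
        x_prev = proj(x_prev - lam(k) * g)                         # line 8
        x_ag = proj(x_md - beta(k) * g)                            # line 9
        if k <= R - 1:                                             # line 10
            delta = (1 - zeta(k + 1)) * delta + zeta(k + 1) * fv.mean()
    return x_md

# Toy instance (all values hypothetical): xi_i ~ Binomial(d, p_i(x_i)), f = -xi^T x.
d, a = 20, np.array([2.0, 2.0])
rng = np.random.default_rng(0)
prob = lambda x: 1.0 / (1.0 + np.exp(x - a))
expected_f = lambda x: float(-(d * prob(x) * x).sum())

x_out = algorithm2(
    x0=[1.0, 1.0], delta1=0.0, R=300,
    m=lambda k: 10 * k,                          # batch size grows linearly in k
    alpha=lambda k: 2.0 / (k + 1), beta=lambda k: 0.01,
    lam=lambda k: 0.005 * k, zeta=lambda k: 1.0 / k,
    sample=lambda x, mk: rng.binomial(d, prob(x), size=(mk, len(x))),
    f=lambda x, xi: -(xi @ x),
    grad_f=lambda x, xi: -xi.astype(float),
    score=lambda x, xi: d * prob(x) - xi,        # nabla_x Pr / Pr for this model
    proj=lambda x: np.clip(x, 0.01, 10.0),
)
print(x_out, expected_f(x_out), expected_f(np.array([1.0, 1.0])))
```

With $m_{k}=10k$ , the total number of score evaluations over $R$ iterations is $5R(R+1)$ , so the work on line 7 grows quadratically in the number of iterations even in this small example.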

4.5 Specialized Projected Stochastic Gradient Method for Price Optimization in Multi-agent Applications

To reduce the computation cost at the bottleneck of Algorithm 2, we propose a specialized projected stochastic gradient method that adds the following assumptions to (P).

Assumption 4.

$\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[s(\bm{x},\bm{\xi})]=s(\bm{x},\mathbb{E}_{% \bm{\xi}\sim D(\bm{x})}[\bm{\xi}])$ and $c(\bm{\xi})$ is continuous.

Assumption 5.

The probability density function $\Pr(\bm{\xi}\mid\bm{x})$ is defined as $\phi(\bm{p}(\bm{x}),\bm{\xi})$ , where $\phi$ is real-valued and differentiable w.r.t. $\bm{p}$ , $\frac{\nabla_{\bm{p}}\phi(\bm{p},\bm{\xi})}{\phi(\bm{p},\bm{\xi})}$ is easily computed, and $\bm{p}$ is vector-valued and differentiable.

The above assumptions are often satisfied in price optimization for multi-agent applications. In particular, Assumption 4 tends to hold because the sales function $s$ in price optimization is usually linear with respect to $\bm{\xi}$ and the cost function $c$ is usually continuous with respect to $\bm{\xi}$ . Assumption 5 holds for many parameterized distributions (e.g., binomial, multinomial, and Poisson distributions) since the probability density function and its gradient can be written simply in terms of the distribution's parameters. Many multi-agent applications satisfy Assumption 5 because the demand follows a binomial or multinomial distribution with parameters $\bm{p}$ , which represent the probabilities of each agent’s actions.

The following propositions show that the applications with multiple agents described in Section 3.3 satisfy Assumptions 4 and 5.

Proposition 11.

The problem of multiproduct pricing satisfies Assumption 4. Moreover, let $\phi(\bm{p}(\bm{x}),\bm{\xi}):=\prod_{i=0}^{n}{}_{m}C_{\xi_{i}}p_{i}(\bm{x})^{% \xi_{i}}.$ Then, $\Pr(\bm{\xi}\mid\bm{x})=\phi(\bm{p}(\bm{x}),\bm{\xi})$ and $\left(\frac{\nabla_{\bm{p}}\phi(\bm{p}(\bm{x}),\bm{\xi})}{\phi(\bm{p}(\bm{x}),% \bm{\xi})}\right)_{k}=\frac{\xi_{k}}{p_{k}(\bm{x})}.$

Proposition 12.

The problem of congestion pricing for Hot lanes satisfies Assumption 4. Moreover, let $\phi(\bm{p}(\bm{x}),\bm{\xi}):=\prod_{i\in I}{}_{d_{i}}C_{\xi_{i}}p_{i}(x_{i})% ^{\xi_{i}}(1-p_{i}(x_{i}))^{d_{i}-\xi_{i}}.$ Then, $\Pr(\bm{\xi}\mid\bm{x})=\phi(\bm{p}(\bm{x}),\bm{\xi})$ and $\left(\frac{\nabla_{\bm{p}}\phi(\bm{p}(\bm{x}),\bm{\xi})}{\phi(\bm{p}(\bm{x}),% \bm{\xi})}\right)_{k}=\frac{\xi_{k}}{p_{k}(x_{k})}-\frac{d_{k}-\xi_{k}}{1-p_{k% }(x_{k})}.$
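The closed-form ratio in Proposition 12 can be sanity-checked numerically, since $\frac{\nabla_{\bm{p}}\phi}{\phi}=\nabla_{\bm{p}}\log\phi$ . The sketch below compares the formula with a central finite difference of $\log\phi$ for an independent binomial model; the function names and the values of $\bm{p}$ , $\bm{\xi}$ , and $d_{i}$ are arbitrary illustrations.

```python
import math

def log_phi(p, xi, d):
    """log of prod_i C(d_i, xi_i) * p_i^xi_i * (1 - p_i)^(d_i - xi_i)."""
    return sum(math.log(math.comb(di, k)) + k * math.log(pi) + (di - k) * math.log(1 - pi)
               for pi, k, di in zip(p, xi, d))

def score_ratio(p, xi, d):
    """(grad_p phi / phi)_k = xi_k / p_k - (d_k - xi_k) / (1 - p_k)."""
    return [k / pi - (di - k) / (1 - pi) for pi, k, di in zip(p, xi, d)]

p, xi, d = [0.3, 0.6], [2, 5], [10, 8]     # arbitrary illustrative values
h = 1e-6
finite_diff = []
for k in range(len(p)):
    up, down = list(p), list(p)
    up[k] += h
    down[k] -= h
    finite_diff.append((log_phi(up, xi, d) - log_phi(down, xi, d)) / (2 * h))
ratio = score_ratio(p, xi, d)
print(ratio, finite_diff)
```

The formula needs only a few arithmetic operations per coordinate, which is the sense in which Assumption 5 asks for the ratio to be easily computed.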

Below, we present the lemmas for our specialized method under Assumptions 1–5. Let $c_{\max}:=\max_{\bm{\xi}\in\Xi}|c(\bm{\xi})|$ , which exists since $\Xi$ is compact from Assumption 2 and $c(\bm{\xi})$ is continuous from Assumption 4.

Lemma 13.

Suppose that condition (ii) of Assumption 1 and Assumptions 4 and 5 hold. Moreover, suppose that condition (iii) of Assumption 1, Assumption 2, and Assumption 3 hold if $\bm{\xi}$ is a continuous random vector. Let $\bm{g}_{2}(\bm{x},\bm{\xi}^{\prime},\delta):=-\nabla_{\bm{x}}s(\bm{x},\mathbb{% E}_{\bm{\xi}\sim D(\bm{x})}[\bm{\xi}])+(c(\bm{\xi}^{\prime})-\delta)\frac{d\bm% {p}(\bm{x})}{d\bm{x}}\frac{\nabla_{\bm{p}}\phi(\bm{p}(\bm{x}),\bm{\xi}^{\prime% })}{\phi(\bm{p}(\bm{x}),\bm{\xi}^{\prime})}.$ Then, $\bm{g}_{2}(\bm{x},\bm{\xi}^{\prime},\delta)$ is an unbiased stochastic gradient for (P).

Lemma 14.

Suppose that conditions (ii) and (iii) of Assumption 1, and Assumptions 2, 4, and 5 hold. Moreover, suppose that Assumption 3 holds if $\bm{\xi}$ is a continuous random vector. Let $\delta\in[-c_{\max},c_{\max}]$ . Then, for all $\bm{x}\in\mathcal{C}$ , $\mathbb{E}_{\bm{\xi}^{\prime}\sim D(\bm{x})}[\|\bm{g}_{2}(\bm{x},\bm{\xi}^{% \prime},\delta)-\nabla_{\bm{x}}\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[f(\bm{x},% \bm{\xi})]\|^{2}]\leq 4\left(c_{\max}M\right)^{2},$ where $\bm{g}_{2}(\bm{x},\bm{\xi}^{\prime},\delta)$ is defined as in Lemma 13.

Now, let us examine Algorithm 3. Its computational cost is lower than that of Algorithm 2: Algorithm 2 requires $m_{k}$ calculations of $\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}^{\ell}\mid\bm{x}_{k}^{md})}{\mathrm{% Pr}(\bm{\xi}^{\ell}\mid\bm{x}_{k}^{md})}$ , whereas Algorithm 3 requires $m_{k}$ calculations of $\frac{\nabla_{\bm{p}}\phi(\bm{p}(\bm{x}_{k}^{md}),\bm{\xi}^{\ell})}{\phi(\bm{p% }(\bm{x}_{k}^{md}),\bm{\xi}^{\ell})}$ , which can be easily computed from Assumption 5.

Algorithm 3 Specialized projected stochastic gradient algorithm

In Algorithm 2, let $\delta_{1}\in[-c_{\max},c_{\max}]$ in the input, replace line 7 by

\displaystyle\bm{g}_{k}\leftarrow-\nabla_{\bm{x}}s(\bm{x}_{k}^{md},\mathbb{E}_{\bm{\xi}\sim D(\bm{x}_{k}^{md})}[\bm{\xi}])+\frac{d\bm{p}(\bm{x}_{k}^{md})}{d\bm{x}}\frac{1}{m_{k}}\sum_{\ell=1}^{m_{k}}\left(c(\bm{\xi}^{\ell})-\delta_{k}\right)\frac{\nabla_{\bm{p}}\phi(\bm{p}(\bm{x}_{k}^{md}),\bm{\xi}^{\ell})}{\phi(\bm{p}(\bm{x}_{k}^{md}),\bm{\xi}^{\ell})},

and replace line 10 by

if $k\leq R-1$ then $\delta_{k+1}\leftarrow(1-\zeta_{k+1})\delta_{k}+\frac{\zeta_{k+1}}{m_{k}}\sum_{\ell=1}^{m_{k}}c(\bm{\xi}^{\ell})$
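As an illustration of the modified gradient step, the sketch below evaluates the minibatch estimator for independent binomial demand. It assumes a hypothetical logistic $p_{i}(x_{i})$ , a hypothetical quadratic cost $c(\bm{\xi})=\kappa\sum_{i}\xi_{i}^{2}$ , and $s(\bm{x},\bm{\xi})=\bm{\xi}^{\top}\bm{x}$ ; the first term is read as the gradient of the map $\bm{x}\mapsto s(\bm{x},\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[\bm{\xi}])$ , which is what unbiasedness in Lemma 13 requires. The estimate is compared with a closed-form gradient that is available for this toy model.

```python
import numpy as np

d = np.array([20, 20])                 # illustrative values, not from the paper
a = np.array([2.0, 2.0])
kappa = 0.05                           # hypothetical cost c(xi) = kappa * sum_i xi_i^2

def g2(x, xi, delta):
    """Minibatch version of the Lemma 13 estimator for independent binomial demand."""
    p = 1.0 / (1.0 + np.exp(x - a))                    # hypothetical p_i(x_i)
    dp_dx = -p * (1.0 - p)                             # diagonal of dp/dx
    grad_s = d * (p + x * dp_dx)                       # gradient of sum_i x_i d_i p_i(x_i)
    cost = kappa * (xi ** 2).sum(axis=1)               # c(xi^l), shape (m,)
    ratio = xi / p - (d - xi) / (1.0 - p)              # nabla_p phi / phi, shape (m, n)
    return -grad_s + dp_dx * ((cost - delta)[:, None] * ratio).mean(axis=0)

rng = np.random.default_rng(0)
x = np.array([2.0, 2.0])
p = 1.0 / (1.0 + np.exp(x - a))
xi = rng.binomial(d, p, size=(200_000, 2))
delta = kappa * (d * p * (1 - p) + (d * p) ** 2).sum()   # E[c(xi)] in closed form
estimate = g2(x, xi, delta)

# Closed-form gradient, using E[xi_i^2] = d p (1 - p) + (d p)^2.
dp_dx = -p * (1.0 - p)
exact = -d * (p + x * dp_dx) + kappa * dp_dx * (d * (1 - 2 * p) + 2 * d ** 2 * p)
print(estimate, exact)
```

Only the cost term is estimated by sampling here; the sales term is computed exactly from $\bm{p}(\bm{x})$ , so the score ratio is multiplied by $c(\bm{\xi}^{\ell})-\delta_{k}$ rather than by $f(\bm{x},\bm{\xi}^{\ell})-\delta_{k}$ .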

Similarly to Algorithm 2, we can determine the convergence rate of Algorithm 3.

Theorem 15.

Suppose that Assumptions 1, 2, 4, and 5 hold. Moreover, suppose that Assumption 3 holds if $\bm{\xi}$ is a continuous random vector. Let the inputs of Algorithm 3 other than $m_{k}$ be as in Theorem 10 and let $m_{k}:=\left\lceil\frac{4(c_{\max}M)^{2}k}{L_{Ef}\tilde{D}^{2}}\right\rceil$ , where $L_{Ef}:=L_{f}+f_{\max}M$ . Then, Algorithm 3 achieves the same convergence rate as in Theorem 10.

5 Experiments

We conducted experiments on an application of multiproduct pricing to show that Algorithm 3 outputs solutions with higher total revenues than the existing methods. We performed synthetic experiments and simulation experiments with real retail data from a supermarket service provider in Japan. (Footnote 11: We used publicly available data, “New Product Sales Ranking”, provided by KSP-SP Co., Ltd, http://www.ksp-sp.com.) The details of our experiments are in Appendix B.

We implemented the following methods.
Proposed Method: We implemented Algorithm 3 with $\alpha_{k}:=\frac{10}{k+1},\beta_{k}:=\frac{0.1}{2m},\lambda_{k}:=\frac{k\beta% _{k}}{2}$ , $m_{k}=0.1km$ , and $\zeta_{k}=\frac{1}{k}$ , where $k$ is the current iteration number and $m$ is the number of buyers.
Proposed Method (fixed $\delta$ ): This is the proposed method with a fixed $\delta_{k}$ from the information at the initial iterate $\bm{x}_{0}$ . Specifically, $\delta_{k}$ was set to $\frac{1}{10^{3}}\sum_{\ell=1}^{10^{3}}c(\bm{\xi}^{\ell}(\bm{x}_{0}))$ for all $k$ , where $\bm{\xi}^{\ell}(\bm{x}_{0})\sim D(\bm{x}_{0})$ .
Proposed Method ( $\delta=0$ ): This is the proposed method with $\delta_{k}$ set to zero.
L2-Regularized Repeated Gradient Descent (L2-RGD( $\alpha$ )) (Perdomo et al., 2020, Appendix E): This method applies a repeated gradient descent (Perdomo et al., 2020, Section 3.3) to the objective function with a regularization term $\frac{\alpha}{2}\|\bm{x}-\bm{x}_{0}\|$ , where $\bm{x}_{0}$ is the initial point. (Footnote 12: This is introduced in (Perdomo et al., 2020, Appendix E) as a remedy for the retraining method for non-strongly convex objective functions. Note that the retraining method was originally intended for strongly convex objective functions.) We implemented this method for several values of $\alpha$ .
Bayesian Optimization (BO) (GPyOpt-authors, 2016): This method sequentially searches for points where the objective value is likely to be small and outputs the solution with the lowest objective value among the evaluated points. We used GPyOpt, a Python open-source library for Bayesian optimization (GPyOpt-authors, 2016).
Simultaneous Perturbation Stochastic Approximation (SPSA) (Spall, 2005, 1998): This method updates the current iterate by using an approximated gradient, which is calculated from the difference between the objective values at two perturbed iterates.
Projected Sub-gradient Descent for Average Demand (PSD-AD) (Boyd et al., 2003, Section 3): This is a projected subgradient descent method for deterministic pricing problems with average demand.
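For readers unfamiliar with the SPSA baseline, the gradient approximation it relies on can be sketched as follows. This is the standard two-sided estimator with random $\pm 1$ perturbations; the perturbation size and the test function are illustrative assumptions, and the exact SPSA configuration used in the experiments is not reproduced here.

```python
import numpy as np

def spsa_gradient(y, x, c, rng):
    """Two-sided SPSA estimate of the gradient of y at x from two evaluations."""
    delta = rng.choice([-1.0, 1.0], size=x.shape)          # random +/-1 perturbation
    return (y(x + c * delta) - y(x - c * delta)) / (2.0 * c * delta)

rng = np.random.default_rng(0)
y = lambda x: float((x ** 2).sum())                        # illustrative noiseless objective
x = np.array([1.0, -2.0, 0.5])
estimates = np.array([spsa_gradient(y, x, 0.1, rng) for _ in range(20_000)])
print(estimates.mean(axis=0), 2 * x)
```

A single estimate needs only two objective evaluations regardless of the dimension, but each estimate is noisy even for a noiseless objective, and the noise grows further when the objective itself must be estimated by sampling.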

We performed our experiments under the following settings.
Initial points. For all methods other than BO, we set the initial points as $\bm{x}_{0}:=0.5\bm{e}$ , where $\bm{e}\in\mathbb{R}^{n}$ is a vector with all elements equal to $1$ . BO first evaluates five random points; then it runs the Bayesian optimization.
Metric. We computed $\frac{1}{10^{3}}\sum_{\ell=1}^{10^{3}}\left(-s(\bm{x}_{k},\bm{\xi}^{\ell}(\bm{x}_{k}))+c(\bm{\xi}^{\ell}(\bm{x}_{k}))\right)$ for each iterate $\bm{x}_{k}$ , where $\bm{\xi}^{\ell}(\bm{x}_{k})\sim D(\bm{x}_{k})$ , and defined the smallest value among the iterates as the Negative Expected Revenue (NER).
Termination criteria. We terminated all methods at a maximum computation time of $500$ seconds.
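The metric can be computed with a few lines of code. The sketch below uses hypothetical function names and takes the demand sampler, the sales function $s$ , and the cost function $c$ as arguments. To keep the example checkable by hand, it replaces $D(\bm{x})$ with a deterministic stand-in that always returns the same demand vector.

```python
import numpy as np

def negative_expected_revenue(x, sample_demand, s, c, n_samples=1000):
    """Monte Carlo estimate of E[-s(x, xi) + c(xi)] with xi ~ D(x)."""
    draws = sample_demand(x, n_samples)                 # shape (n_samples, n)
    return float(np.mean([-s(x, xi) + c(xi) for xi in draws]))

def ner(iterates, sample_demand, s, c, n_samples=1000):
    """Smallest estimated objective value among the iterates."""
    return min(negative_expected_revenue(x, sample_demand, s, c, n_samples)
               for x in iterates)

# Deterministic stand-in for D(x), used only to make the example easy to verify.
constant_demand = lambda x, n_samples: np.tile([2.0, 3.0], (n_samples, 1))
s = lambda x, xi: float(xi @ x)
c = lambda xi: 0.5 * float(xi.sum())
iterates = [np.array([1.0, 1.0]), np.array([2.0, 1.0])]
value = ner(iterates, constant_demand, s, c)
print(value)
```

In this example the two iterates have objective values $-5+2.5=-2.5$ and $-7+2.5=-4.5$ , so the reported NER is $-4.5$ .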

5.1 Synthetic Experiments

Synthetic Parameter Setup.

We performed experiments by varying each parameter from the following default settings. We set $n:=20$ and $m:=200$ , which are the numbers of products and buyers, respectively. For each product, we let the minimum price be $x_{\min}:=0.01$ and the maximum price be $x_{\max}:=10$ . For the parameters of the function $p_{i}$ , we generated $\alpha_{i}$ for each $i$ from a uniform distribution on $[0.01,1]$ , and we let $\gamma_{i}:=\frac{2\pi}{\sqrt{6}\alpha_{i}}$ and $a_{0}:=0.25n$ . For the parameters of the function $c_{i}$ for each $i$ , we set $\eta^{1}_{i}:=2.0w_{i}$ , $\eta^{2}_{i}:=w_{i}$ , and $\eta^{3}_{i}:=3.0w_{i}$ , where $w_{i}$ was generated from a uniform distribution on $[0.25\alpha_{i},0.5\alpha_{i}]$ . We let $l_{i}:=\frac{0.5m}{n}$ and $u_{i}:=\frac{1.5m}{n}$ . We then varied $m$ and $n$ under these default settings.
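The default setup above can be reproduced with a short generator. The sketch below follows the stated distributions and formulas; the function name, the dictionary keys, and the random seed are illustrative choices, not part of the experimental code.

```python
import numpy as np

def synthetic_instance(n=20, m=200, rng=None):
    """Draw one problem instance following the default synthetic setup."""
    rng = np.random.default_rng() if rng is None else rng
    alpha = rng.uniform(0.01, 1.0, size=n)                 # alpha_i ~ U[0.01, 1]
    gamma = 2.0 * np.pi / (np.sqrt(6.0) * alpha)           # gamma_i = 2 pi / (sqrt(6) alpha_i)
    w = rng.uniform(0.25 * alpha, 0.5 * alpha)             # w_i ~ U[0.25 alpha_i, 0.5 alpha_i]
    return {
        "alpha": alpha, "gamma": gamma, "a0": 0.25 * n,
        "eta1": 2.0 * w, "eta2": w, "eta3": 3.0 * w,
        "l": np.full(n, 0.5 * m / n), "u": np.full(n, 1.5 * m / n),
        "x_min": 0.01, "x_max": 10.0,
    }

inst = synthetic_instance(rng=np.random.default_rng(0))
print(inst["a0"], inst["l"][0], inst["u"][0])
```

With the defaults $n=20$ and $m=200$ , this gives $a_{0}=5$ , $l_{i}=5$ , and $u_{i}=15$ for every product.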

Table 1: Results of synthetic experiments for 20 randomly generated problem instances. The NER (SD) column represents the average (standard deviation) of NER. The best value of the average NER for each experiment is in bold.

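| ($n$, $m$) | Proposed | Proposed (fixed $\delta$) | Proposed ($\delta=0$) | L2-RGD ($\alpha=0.1$) | L2-RGD ($\alpha=1$) | L2-RGD ($\alpha=10$) | BO | SPSA | PSD-AD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (20,200) | $\mathbf{-56.3}$ (5.4) | -54.9 (5.5) | -54.7 (5.2) | -28.0 (10.6) | -28.2 (10.7) | -28.2 (10.7) | -22.3 (6.2) | -33.7 (17.6) | -45.9 (8.9) |
| (10,200) | $\mathbf{-55.4}$ (8.4) | -54.5 (8.8) | -54.4 (8.3) | -8.2 (23.0) | -9.8 (23.2) | -11.9 (23.4) | -34.4 (10.6) | -46.8 (13.4) | -31.8 (13.9) |
| (40,200) | $\mathbf{-56.6}$ (3.2) | -54.7 (3.5) | -54.1 (3.4) | -24.5 (9.5) | -24.4 (9.5) | -24.4 (9.6) | -14.6 (3.9) | -1.7 (13.0) | -47.4 (3.7) |
| (20,100) | $\mathbf{-26.9}$ (3.7) | -26.2 (3.8) | -26.1 (3.7) | -8.8 (6.8) | -8.8 (6.8) | -8.8 (6.8) | -11.7 (4.0) | -4.6 (6.3) | -20.4 (5.1) |
| (20,400) | $\mathbf{-106.9}$ (14.2) | -104.4 (14.6) | -103.4 (14.0) | -38.5 (28.5) | -38.5 (28.2) | -39.2 (26.8) | -37.8 (7.9) | -36.4 (17.5) | -79.4 (21.0) |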

Experimental Results

Table 1 shows the results of the synthetic experiments with different parameter values. The proposed method outperformed the baselines in terms of NER for all parameter settings, for the following reasons: (i) Proposed (fixed $\delta$ ) and Proposed ( $\delta=0$ ) converged to low-quality local solutions because the variance of the gradient was larger than that of the proposed method; (ii) L2-RGD continued to increase prices without considering the effect of prices on the probability distribution, as shown in Example 1 in Section 2.2, which led to unreasonably high prices; (iii) BO did not adequately explore $\bm{x}$ because it took a lot of time to evaluate the objective value at each search point; (iv) SPSA did not accurately estimate the gradient because the noise in the gradient was too large; (v) PSD-AD ignored demand uncertainty, which increases the objective value since over/under demand occurs stochastically and causes unprofitable costs.

5.2 Simulation Experiments with Real Data

Data Set and Parameter Setup

We used retail data from a supermarket service provider in Japan. These data record the average sales prices of top-selling new products in food supermarkets. We targeted sales data for $n=50$ different confectionery products for randomly selected weeks from 2022. We set the recorded average selling price as the general value $\alpha_{i}$ for each product $i$ . The other parameters were set as in the synthetic experiments. Since the parameter $w_{i}$ for each $i$ was generated randomly, experiments were performed on 20 problem instances for each week’s data.

Table 2: Results of simulation experiments with real data for 20 randomly generated problem instances. The NER (SD) column represents the average (standard deviation) of NER. The best value of the average NER for each experiment is in bold.

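| date | Proposed | Proposed (fixed $\delta$) | Proposed ($\delta=0$) | L2-RGD ($\alpha=0.1$) | L2-RGD ($\alpha=1$) | L2-RGD ($\alpha=10$) | BO | SPSA | PSD-AD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 02/21–02/27 | $\mathbf{-28.1}$ (1.0) | -21.5 (1.4) | -25.3 (1.0) | 13.8 (18.2) | 8.8 (19.0) | 2.2 (16.3) | -8.5 (2.1) | 10.3 (8.7) | -9.0 (2.5) |
| 03/21–03/27 | $\mathbf{-20.6}$ (0.7) | -20.1 (1.0) | -18.5 (1.0) | -7.5 (3.4) | -7.5 (3.4) | -7.6 (3.4) | -4.4 (0.7) | -10.4 (3.3) | -17.7 (1.8) |
| 05/23–05/29 | $\mathbf{-22.6}$ (0.9) | -17.8 (1.8) | -20.2 (1.0) | 12.4 (7.1) | 12.3 (7.1) | 12.1 (7.9) | -6.1 (1.6) | -1.2 (6.6) | -10.2 (3.2) |
| 06/20–06/26 | $\mathbf{-32.3}$ (2.1) | -21.6 (3.6) | -28.8 (2.4) | 79.2 (39.5) | 55.0 (58.9) | 53.3 (57.9) | -14.1 (4.1) | 30.6 (15.4) | -8.1 (5.6) |
| 08/08–08/14 | $\mathbf{-33.6}$ (0.9) | -31.7 (1.0) | -31.2 (1.0) | -24.5 (3.8) | -24.5 (3.7) | -24.6 (3.7) | -7.2 (1.8) | -11.1 (3.4) | -29.4 (1.6) |
| 09/19–09/25 | $\mathbf{-31.3}$ (1.5) | -23.9 (3.4) | -28.5 (1.8) | 0.0 (24.4) | -6.4 (22.1) | -10.3 (18.5) | -9.8 (2.3) | 11.6 (7.4) | -13.5 (5.3) |
| 12/05–12/11 | $\mathbf{-73.0}$ (3.2) | -66.0 (3.9) | -71.1 (3.4) | 172.3 (30.7) | 152.6 (44.0) | 146.2 (37.2) | -37.9 (7.0) | 72.5 (22.3) | -28.9 (10.8) |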

Experimental Results

Table 2 shows the results of the experiments using real data from different weeks. The proposed method was superior to the baselines in terms of NER for all weeks of data.

6 Conclusion

We formulated a new price optimization problem with decision-dependent uncertainty to address the drawbacks of existing formulations that (i) cannot deal with decision-dependent demand uncertainty, (ii) require discontinuous functions to define buyers’ discrete actions, or (iii) have limited applications due to specific assumptions. Moreover, we developed a projected stochastic gradient descent method by deriving an unbiased stochastic gradient with a variance reduction parameter. Our method is guaranteed to converge to an $\varepsilon$ -stationary point. Synthetic experiments and simulation experiments with real data confirmed the effectiveness of our formulation and method.

Our formulation and results suggest directions for further research. The first is to construct methods to find a globally optimal solution rather than a stationary point (e.g., incorporating multi-start techniques (György and Kocsis, 2011) into our methods or building fast Bayesian optimization under more specific assumptions). The second is analyzing the performance of our method when some of our assumptions are relaxed. This would include analyzing the performance when the probability density function is not differentiable and smoothed with the existing techniques.

References

Basciftci et al. [2021] B. Basciftci, S. Ahmed, and S. Shen. Distributionally robust facility location problem under decision-dependent stochastic demand. European Journal of Operational Research, 292(2):548–561, 2021.
Bertsimas and de Boer [2005] D. Bertsimas and S. de Boer. Special issue papers: Dynamic pricing and inventory control for multiple products. Journal of Revenue and Pricing Management, 3:303–319, 2005.
Besbes et al. [2015] O. Besbes, Y. Gur, and A. Zeevi. Non-stationary stochastic optimization. Operations research, 63(5):1227–1244, 2015.
Boyd et al. [2003] S. Boyd, L. Xiao, and A. Mutapcic. Subgradient methods. lecture notes of EE392o, Stanford University, Autumn Quarter, 2004:2004–2005, 2003.
Brochu et al. [2010] E. Brochu, V. M. Cora, and N. De Freitas. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.
Chawla et al. [2010] S. Chawla, J. D. Hartline, D. L. Malec, and B. Sivan. Multi-parameter mechanism design and sequential posted pricing. In STOC, pages 311–320, 2010.
Chen and Mangasarian [1996] C. Chen and O. L. Mangasarian. A class of smoothing functions for nonlinear and mixed complementarity problems. Computational Optimization and Applications, 5(2):97–138, 1996.
Chen [2012] X. Chen. Smoothing methods for nonsmooth, nonconvex minimization. Mathematical Programming, 134(1):71–99, 2012.
Correa et al. [2017] J. Correa, P. Foncea, R. Hoeksma, T. Oosterwijk, and T. Vredeveld. Posted price mechanisms for a random stream of customers. In EC, pages 169–186, 2017.
Croissant [2012] Y. Croissant. Estimation of multinomial logit models in r: The mlogit packages. https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=03dbc1728d3860d239132b5af95367d4a5b273c3, 2012.
Dong et al. [2017] C. Dong, C. T. Ng, and T. Cheng. Electricity time-of-use tariff with stochastic demand. Production and Operations Management, 26(1):64–79, 2017.
Ferreira et al. [2016] K. J. Ferreira, B. H. A. Lee, and D. Simchi-Levi. Analytics for an online retailer: Demand forecasting and price optimization. Manufacturing & Service Operations Management, 18(1):69–88, 2016.
Flaxman et al. [2005] A. Flaxman, A. T. Kalai, and H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In SODA, pages 385–394, 2005.
Frazier [2018] P. I. Frazier. A tutorial on bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
Gallego and Wang [2014] G. Gallego and R. Wang. Multiproduct price optimization and competition under the nested logit model with product-differentiated price sensitivities. Operations Research, 62(2):450–461, 2014.
Ghadimi and Lan [2016] S. Ghadimi and G. Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1):59–99, 2016.
GPyOpt-authors [2016] T. GPyOpt-authors. Gpyopt: A bayesian optimization framework in python. http://github.com/SheffieldML/GPyOpt, 2016.
György and Kocsis [2011] A. György and L. Kocsis. Efficient multi-start strategies for local search algorithms. Journal of Artificial Intelligence Research, 41:407–444, 2011.
He et al. [2009] Y. He, X. Zhao, L. Zhao, and J. He. Coordinating a supply chain with effort and price dependent stochastic demand. Applied Mathematical Modelling, 33(6):2777–2790, 2009.
Hellemo et al. [2018] L. Hellemo, P. I. Barton, and A. Tomasgard. Decision-dependent probabilities in stochastic programs with recourse. Computational Management Science, 15(3):369–395, 2018.
Heydari and Norouzinasab [2015] J. Heydari and Y. Norouzinasab. A two-level discount model for coordinating a decentralized supply chain considering stochastic price-sensitive demand. Journal of Industrial Engineering International, 11:531–542, 2015.
Hikima et al. [2021] Y. Hikima, Y. Akagi, H. Kim, M. Kohjima, T. Kurashima, and H. Toda. Integrated optimization of bipartite matching and its stochastic behavior: New formulation and approximation algorithm via min-cost flow optimization. In AAAI, pages 3796–3805, 2021.
Hikima et al. [2022] Y. Hikima, Y. Akagi, N. Marumo, and H. Kim. Online matching with controllable rewards and arrival probabilities. In IJCAI, pages 1825–1833, 2022.
Hikima et al. [2023] Y. Hikima, Y. Akagi, H. Kim, and T. Asami. An improved approximation algorithm for wage determination and online task allocation in crowd-sourcing. In AAAI, pages 3977–3986, 2023.
Koushik et al. [2012] D. Koushik, J. A. Higbie, and C. Eister. Retail price optimization at intercontinental hotels group. Interfaces, 42(1):45–57, 2012.
Li and Li [2018] Z. Li and J. Li. A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In NeurIPS, pages 5569–5579, 2018.
Lou et al. [2011] Y. Lou, Y. Yin, and J. A. Laval. Optimal dynamic pricing strategies for high-occupancy/toll lanes. Transportation Research Part C: Emerging Technologies, 19(1):64–74, 2011.
Luo and Mehrotra [2020] F. Luo and S. Mehrotra. Distributionally robust optimization with decision dependent ambiguity sets. Optimization Letters, 14:2565–2594, 2020.
Mendler-Dünner et al. [2020] C. Mendler-Dünner, J. Perdomo, T. Zrnic, and M. Hardt. Stochastic optimization for performative prediction. In NeurIPS, pages 4929–4939, 2020.
Miller et al. [2021] J. P. Miller, J. C. Perdomo, and T. Zrnic. Outside the echo chamber: Optimizing the performative risk. In ICML, pages 7710–7720, 2021.
Perdomo et al. [2020] J. Perdomo, T. Zrnic, C. Mendler-Dünner, and M. Hardt. Performative prediction. In ICML, pages 7599–7609, 2020.
Royden and Fitzpatrick [1988] H. L. Royden and P. Fitzpatrick. Real analysis, volume 32. Macmillan New York, 1988.
Schulte and Sachs [2020] B. Schulte and A.-L. Sachs. The price-setting newsvendor with poisson demand. European Journal of Operational Research, 283(1):125–137, 2020.
Spall [1998] J. C. Spall. Implementation of the simultaneous perturbation algorithm for stochastic optimization. IEEE Transactions on aerospace and electronic systems, 34(3):817–823, 1998.
Spall [2005] J. C. Spall. Introduction to stochastic search and optimization: estimation, simulation, and control. John Wiley & Sons, 2005.
Sutton and Barto [2018] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
Swiler et al. [2020] L. P. Swiler, M. Gulian, A. L. Frankel, C. Safta, and J. D. Jakeman. A survey of constrained gaussian process regression: Approaches and implementation challenges. Journal of Machine Learning for Modeling and Computing, 1(2), 2020.
Varaiya and Wets [1989] P. Varaiya and R. J. Wets. Stochastic dynamic optimization, approaches and computation. In Mathematical Programming, Recent Developments and Applications, 1989.
Wang and Wang [2019] X.-z. Wang and G.-q. Wang. Integrating dynamic pricing and inventory control for fresh-agri product under consumer choice. Australian Economic Papers, 58(1):96–111, 2019.
Williams [1992] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8:229–256, 1992.
Zhang et al. [2018] H. Zhang, P. Rusmevichientong, and H. Topaloglu. Multiproduct pricing under the generalized extreme value models with homogeneous price sensitivity parameters. Operations Research, 66(6):1559–1570, 2018.

Appendix A Proofs

A.1 Proof of Proposition 1

Proof.

Assumption 2 holds since $\mathcal{C}=[x_{\min},x_{\max}]^{n}$ and $\bm{\xi}\in\{0,1,\dots,m\}^{n+1}$ . It remains to verify conditions (i)–(iii) of Assumption 1.
(i) From the definitions of $s$ and $c$ , the function $f(\bm{x},\bm{\xi})$ is differentiable w.r.t. $\bm{x}$ and continuous w.r.t. $\bm{\xi}$ for all $\bm{x}\in\mathbb{R}^{n}$ and $\bm{\xi}\in\Xi$ . Moreover, $\|\nabla_{\bm{x}}f(\bm{x},\bm{\xi})\|=\|\nabla_{\bm{x}}\left(-s(\bm{x},\bm{\xi})+c(\bm{\xi})\right)\|=\|-\nabla_{\bm{x}}s(\bm{x},\bm{\xi})\|\leq\sum_{i=1}^{n}|\xi_{i}|\leq m$ , where the second inequality holds because the total demand for all products never exceeds the number $m$ of buyers. Therefore, $f(\bm{x},\bm{\xi})$ is Lipschitz continuous w.r.t. $\bm{x}$ with modulus $L_{f}=m$ .

(ii) Since $\Pr(\bm{\xi}\mid\bm{x})=\prod_{i=0}^{n}{}_{m}C_{\xi_{i}}p_{i}(\bm{x})^{\xi_{i}}$ , the function $\mathrm{Pr}(\bm{\xi}\mid\bm{x})$ is differentiable w.r.t. $\bm{x}$ and $\mathrm{Pr}(\bm{\xi}\mid\bm{x})\neq 0$ for all $\bm{x}\in\mathbb{R}^{n}$ and $\bm{\xi}\in\Xi$ from the definition of $p_{i}(\bm{x})$ for each $i\in\{0,1,\dots,n\}$ .

(iii) We have $0<p_{i}(\bm{x})<1$ for all $\bm{x}\in\mathbb{R}^{n}$ and $i\in\{0,1,\dots,n\}$ from the definition of $p_{i}(\bm{x})$ for each $i\in\{0,1,\dots,n\}$ . Then, since $\left(\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})\right)_{k}=\prod_{i=0}^{n% }{}_{m}C_{\xi_{i}}p_{i}(\bm{x})^{\xi_{i}}\sum_{j=0}^{n}\frac{\xi_{j}}{p_{j}(% \bm{x})}\frac{\partial p_{j}(\bm{x})}{\partial x_{k}}$ , we have

\displaystyle\left(\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})}{% \mathrm{Pr}(\bm{\xi}\mid\bm{x})}\right)_{k}=\sum_{j=0}^{n}\frac{\xi_{j}}{p_{j}% (\bm{x})}\frac{\partial p_{j}(\bm{x})}{\partial x_{k}}.

(2)

Let $a_{i}(x):=e^{\gamma_{i}(\alpha_{i}-x)}$ for all $i\in I\ (=\{1,2,\dots,n\})$ . Then, $\frac{\partial a_{i}(x)}{\partial x}=-\gamma_{i}a_{i}(x)$ . For $k\in I$ ,

	$\displaystyle\frac{\partial p_{k}(\bm{x})}{\partial x_{k}}$	$\displaystyle=\frac{\frac{\partial a_{k}(x_{k})}{\partial x_{k}}(a_{0}+\sum_{i% \in I}e^{\gamma_{i}(\alpha_{i}-x_{i})})-a_{k}(x_{k})\frac{\partial a_{k}(x_{k}% )}{\partial x_{k}}}{(a_{0}+\sum_{i\in I}e^{\gamma_{i}(\alpha_{i}-x_{i})})^{2}}% =\frac{-\gamma_{k}a_{k}(x_{k})(a_{0}+\sum_{i\in I}e^{\gamma_{i}(\alpha_{i}-x_{% i})}-a_{k}(x_{k}))}{(a_{0}+\sum_{i\in I}e^{\gamma_{i}(\alpha_{i}-x_{i})})^{2}}$
		$\displaystyle=-\gamma_{k}\frac{a_{k}(x_{k})}{a_{0}+\sum_{i\in I}e^{\gamma_{i}(% \alpha_{i}-x_{i})}}\frac{a_{0}+\sum_{i\in I}e^{\gamma_{i}(\alpha_{i}-x_{i})}-a% _{k}(x_{k})}{a_{0}+\sum_{i\in I}e^{\gamma_{i}(\alpha_{i}-x_{i})}}=-\gamma_{k}p% _{k}(\bm{x})(1-p_{k}(\bm{x})).$		(3)

For $k\in I$ and $j\in I\setminus\{k\}$ ,

	$\displaystyle\frac{\partial p_{j}(\bm{x})}{\partial x_{k}}$	$\displaystyle=\frac{-a_{j}(x_{j})\frac{\partial a_{k}(x_{k})}{\partial x_{k}}}% {(a_{0}+\sum_{i\in I}e^{\gamma_{i}(\alpha_{i}-x_{i})})^{2}}=\frac{-a_{j}(x_{j}% )(-\gamma_{k}a_{k}(x_{k}))}{(a_{0}+\sum_{i\in I}e^{\gamma_{i}(\alpha_{i}-x_{i}% )})^{2}}$
		$\displaystyle=\gamma_{k}\frac{a_{j}(x_{j})}{a_{0}+\sum_{i\in I}e^{\gamma_{i}(% \alpha_{i}-x_{i})}}\frac{a_{k}(x_{k})}{a_{0}+\sum_{i\in I}e^{\gamma_{i}(\alpha% _{i}-x_{i})}}=\gamma_{k}p_{j}(\bm{x})p_{k}(\bm{x}).$		(4)

For $k\in I$ ,

	$\displaystyle\frac{\partial p_{0}(\bm{x})}{\partial x_{k}}$	$\displaystyle=\frac{-a_{0}\frac{\partial a_{k}(x_{k})}{\partial x_{k}}}{(a_{0}% +\sum_{i\in I}e^{\gamma_{i}(\alpha_{i}-x_{i})})^{2}}=\frac{-a_{0}(-\gamma_{k}a% _{k}(x_{k}))}{(a_{0}+\sum_{i\in I}e^{\gamma_{i}(\alpha_{i}-x_{i})})^{2}}$
		$\displaystyle=\gamma_{k}\frac{a_{0}}{a_{0}+\sum_{i\in I}e^{\gamma_{i}(\alpha_{% i}-x_{i})}}\frac{a_{k}(x_{k})}{a_{0}+\sum_{i\in I}e^{\gamma_{i}(\alpha_{i}-x_{% i})}}=\gamma_{k}p_{0}(\bm{x})p_{k}(\bm{x}).$		(5)

From (2)–(5),

	$\displaystyle\left(\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})}{% \mathrm{Pr}(\bm{\xi}\mid\bm{x})}\right)_{k}$	$\displaystyle=\sum_{j\in I\setminus\{k\}}\left(\frac{\xi_{j}}{p_{j}(\bm{x})}% \gamma_{k}p_{k}(\bm{x})p_{j}(\bm{x})\right)-\frac{\xi_{k}}{p_{k}(\bm{x})}% \gamma_{k}p_{k}(\bm{x})(1-p_{k}(\bm{x}))+\frac{\xi_{0}}{p_{0}(\bm{x})}\gamma_{% k}p_{0}(\bm{x})p_{k}(\bm{x})$
		$\displaystyle=\sum_{j\in I\cup\{0\}\setminus\{k\}}\xi_{j}\gamma_{k}p_{k}(\bm{x% })-\xi_{k}\gamma_{k}(1-p_{k}(\bm{x}))\leq\sum_{j\in I\cup\{0\}}\|\xi_{j}\|\|% \gamma_{k}\|\leq m\gamma^{\max},$

where the first inequality follows from $0<p_{k}(\bm{x})<1$ for all $\bm{x}\in\mathbb{R}^{n}$ , and the second inequality holds because $\sum_{j\in I\cup\{0\}}|\xi_{j}|$ equals the number $m$ of buyers by definition. Then, for all $\bm{x}\in\mathbb{R}^{n}$ and $\bm{\xi}\in\Xi$ , $\left\|\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})}{\mathrm{Pr}(\bm{\xi}\mid\bm{x})}\right\|\leq\sum_{k=1}^{n}\left|\left(\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})}{\mathrm{Pr}(\bm{\xi}\mid\bm{x})}\right)_{k}\right|\leq nm\gamma^{\max}$ . ∎
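As a numerical sanity check on the closed form above, note that, since $\sum_{j\in I\cup\{0\}}\xi_{j}=m$ , the $k$ -th component of the ratio simplifies to $\gamma_{k}(mp_{k}(\bm{x})-\xi_{k})$ . The following Python sketch compares this expression with a central finite difference of $\sum_{j}\xi_{j}\log p_{j}(\bm{x})$ and checks the per-coordinate bound $m\gamma^{\max}$ . The parameter values, the function names, and the use of NumPy are illustrative assumptions and are not taken from the experiments.

```python
import numpy as np

def choice_probs(x, alpha, gamma, a0):
    """Return (p_0, p_1, ..., p_n) with a_i(x_i) = exp(gamma_i * (alpha_i - x_i))."""
    a = np.exp(gamma * (alpha - x))
    denom = a0 + a.sum()
    return np.concatenate(([a0 / denom], a / denom))

def score_closed_form(x, xi, alpha, gamma, a0):
    """k-th entry: gamma_k * (m * p_k(x) - xi_k), where m = sum_j xi_j."""
    p = choice_probs(x, alpha, gamma, a0)
    m = xi.sum()
    return gamma * (m * p[1:] - xi[1:])

def score_finite_diff(x, xi, alpha, gamma, a0, eps=1e-6):
    """Central difference of sum_j xi_j * log p_j(x).

    The combinatorial factor of the probability does not depend on x,
    so it drops out of the gradient of the log-probability.
    """
    def log_lik(z):
        return float(xi @ np.log(choice_probs(z, alpha, gamma, a0)))
    grad = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = eps
        grad[k] = (log_lik(x + e) - log_lik(x - e)) / (2 * eps)
    return grad

# Hypothetical parameters, chosen only for illustration.
alpha = np.array([1.0, 1.5, 0.8])
gamma = np.array([0.5, 1.2, 0.9])
a0 = 1.0
x = np.array([0.7, 1.1, 0.4])
xi = np.array([3.0, 2.0, 4.0, 1.0])  # (xi_0, xi_1, ..., xi_n), so m = 10

closed = score_closed_form(x, xi, alpha, gamma, a0)
numeric = score_finite_diff(x, xi, alpha, gamma, a0)
bound_per_coord = xi.sum() * gamma.max()  # m * gamma_max
```

Under these assumptions, `closed` and `numeric` should agree up to finite-difference error, and every entry of `closed` should be at most `bound_per_coord` in absolute value.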

A.2 Proof of Proposition 2

Proof.

Assumption 2 holds since $\mathcal{C}=[x_{\min},x_{\max}]^{I}$ and $\xi_{i}\in\{0,1,\dots,d_{i}\}$ for all $i\in I$ . It remains to verify conditions (i)–(iii) of Assumption 1.
(i) From the definition, the value of $f(\bm{x},\bm{\xi})$ is independent of $\bm{x}$ . Therefore, $f(\bm{x},\bm{\xi})$ is differentiable and Lipschitz continuous with modulus $L_{f}=0$ w.r.t. $\bm{x}$ . Moreover, $f(\bm{x},\bm{\xi})$ is continuous w.r.t. $\bm{\xi}$ from the definition since $q_{H}$ , $q_{R}$ , and $k$ are continuous functions.

(ii) Since $\Pr(\bm{\xi}\mid\bm{x})=\prod_{i\in I}{}_{d_{i}}C_{\xi_{i}}p_{i}(x_{i})^{\xi_{% i}}(1-p_{i}(x_{i}))^{d_{i}-\xi_{i}}$ , $\mathrm{Pr}(\bm{\xi}\mid\bm{x})$ is differentiable w.r.t. $\bm{x}$ from the definition of $p_{i}$ . Moreover, since $0<p_{i}(x)<1$ for all $x\in\mathbb{R}$ and $i\in I$ , $\mathrm{Pr}(\bm{\xi}\mid\bm{x})\neq 0$ for all $(\bm{x},\bm{\xi})\in\mathbb{R}^{n}\times\Xi$ .

(iii) For all $x\in\mathbb{R}$ and $i\in I$ , the definition of $p_{i}$ gives $0<p_{i}(x)<1$ . Then, since

\displaystyle\left(\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})\right)_{k}

\displaystyle=\prod_{i\in I}{}_{d_{i}}C_{\xi_{i}}p_{i}(x_{i})^{\xi_{i}}(1-p_{i% }(x_{i}))^{d_{i}-\xi_{i}}\left(\frac{\xi_{k}p^{\prime}_{k}(x_{k})}{p_{k}(x_{k}% )}-\frac{(d_{k}-\xi_{k})p^{\prime}_{k}(x_{k})}{1-p_{k}(x_{k})}\right),

we have

\displaystyle\left(\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})}{% \mathrm{Pr}(\bm{\xi}\mid\bm{x})}\right)_{k}=\frac{\xi_{k}p^{\prime}_{k}(x_{k})% }{p_{k}(x_{k})}-\frac{(d_{k}-\xi_{k})p^{\prime}_{k}(x_{k})}{1-p_{k}(x_{k})}.

(6)

Here,

\displaystyle p^{\prime}_{k}(x)

\displaystyle=\frac{-\beta_{k}e^{\alpha_{k}h_{k}+\beta_{k}x+\gamma_{k}}}{(1+e^% {\alpha_{k}h_{k}+\beta_{k}x+\gamma_{k}})^{2}}=-\beta_{k}\frac{1}{(1+e^{\alpha_% {k}h_{k}+\beta_{k}x+\gamma_{k}})}\frac{e^{\alpha_{k}h_{k}+\beta_{k}x+\gamma_{k% }}}{(1+e^{\alpha_{k}h_{k}+\beta_{k}x+\gamma_{k}})}=-\beta_{k}p_{k}(x)(1-p_{k}(% x)).

(7)

From (6) and (7),

	$\displaystyle\left(\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})}{% \mathrm{Pr}(\bm{\xi}\mid\bm{x})}\right)_{k}$	$\displaystyle=-\beta_{k}(1-p_{k}(x_{k}))\xi_{k}+\beta_{k}p_{k}(x_{k})(d_{k}-% \xi_{k})=\beta_{k}(d_{k}p_{k}(x_{k})-\xi_{k})$
		$\displaystyle\leq\|\beta_{k}\|\|d_{k}p_{k}(x_{k})-\xi_{k}\|\leq\|\beta_{k}\|d_{k},$

where the second inequality follows from $\xi_{k}\in\{0,1,\dots,d_{k}\}$ and $0\leq p_{k}(x_{k})\leq 1$ for all $k\in I$ . Then, for all $\bm{x}\in\mathbb{R}^{I}$ and $\bm{\xi}\in\Xi$ , $\left\|\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})}{\mathrm{Pr}(\bm{\xi}\mid\bm{x})}\right\|\leq\sum_{k\in I}\left|\left(\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})}{\mathrm{Pr}(\bm{\xi}\mid\bm{x})}\right)_{k}\right|\leq|I|\max_{i\in I}\left(|\beta_{i}|d_{i}\right)$ . ∎
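The identity $\left(\nabla_{\bm{x}}\mathrm{Pr}/\mathrm{Pr}\right)_{k}=\beta_{k}(d_{k}p_{k}(x_{k})-\xi_{k})$ and the bound $|\beta_{k}|d_{k}$ can be checked numerically for a single product. The sketch below assumes the logistic form $p_{k}(x)=1/(1+e^{\alpha_{k}h_{k}+\beta_{k}x+\gamma_{k}})$ , which is the form consistent with (7); the parameter values and function names are hypothetical.

```python
import math

def purchase_prob(x, alpha, h, beta, gamma):
    # Assumed logistic form consistent with (7):
    # p(x) = 1 / (1 + exp(alpha * h + beta * x + gamma)).
    return 1.0 / (1.0 + math.exp(alpha * h + beta * x + gamma))

def log_binom_pmf(xi, d, p):
    # Log of the binomial probability of xi purchases among d buyers.
    return math.log(math.comb(d, xi)) + xi * math.log(p) + (d - xi) * math.log(1.0 - p)

def score_closed_form(x, xi, d, alpha, h, beta, gamma):
    # beta * (d * p(x) - xi), obtained by substituting (7) into (6).
    return beta * (d * purchase_prob(x, alpha, h, beta, gamma) - xi)

def score_finite_diff(x, xi, d, alpha, h, beta, gamma, eps=1e-6):
    # Central difference of the log-probability with respect to the price x.
    up = log_binom_pmf(xi, d, purchase_prob(x + eps, alpha, h, beta, gamma))
    down = log_binom_pmf(xi, d, purchase_prob(x - eps, alpha, h, beta, gamma))
    return (up - down) / (2 * eps)

# Hypothetical parameter values for one product k.
alpha, h, beta, gamma, d = 0.3, 2.0, 0.8, -1.0, 12
x = 1.5

# Largest gap between closed form and finite difference over all demand values.
max_gap = max(
    abs(score_closed_form(x, xi, d, alpha, h, beta, gamma)
        - score_finite_diff(x, xi, d, alpha, h, beta, gamma))
    for xi in range(d + 1)
)
# Largest absolute score over all demand values, to compare with |beta| * d.
max_abs_score = max(
    abs(score_closed_form(x, xi, d, alpha, h, beta, gamma)) for xi in range(d + 1)
)
```

Here `max_gap` measures the finite-difference discrepancy, and `max_abs_score` can be compared directly with `abs(beta) * d`.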

A.3 Proof of Proposition 3

Proof.

Assumption 2 holds since $\mathcal{C}=[x_{\min},x_{\max}]^{n}$ and $\Xi=\{\bm{\xi}\mid\forall i\in[n],0\leq\xi_{i}\leq\xi_{i}^{\max}\}$ . Moreover, Assumption 3 holds since $\Xi$ is a Borel set on $\mathbb{R}^{n}$ and $\mathrm{Pr}(\bm{\xi}\mid\bm{x})$ is continuous w.r.t. $\bm{\xi}$ for all $\bm{x}\in\mathbb{R}^{n}$ from the definition of $\mathrm{Pr}(\bm{\xi}\mid\bm{x})$ . We give proof for each condition of Assumption 1.
(i) From definitions of $s$ and $c$ , $f(\bm{x},\bm{\xi})$ is differentiable w.r.t. $\bm{x}$ and continuous w.r.t. $\bm{\xi}$ for all $\bm{x}\in\mathbb{R}^{n}$ and $\bm{\xi}\in\Xi$ . Moreover, $\|\nabla_{\bm{x}}f(\bm{x},\bm{\xi})\|=\|\bm{\xi}\|\leq n\xi^{\max}$ . Therefore, $f(\bm{x},\bm{\xi})$ is Lipschitz continuous with modulus $L_{f}=n\xi^{\max}$ .

(ii) $\mathrm{Pr}(\bm{\xi}\mid\bm{x})$ is differentiable w.r.t. $\bm{x}$ and $\mathrm{Pr}(\bm{\xi}\mid\bm{x})\neq 0$ for all $\bm{x}\in\mathbb{R}^{n}$ and $\bm{\xi}\in\Xi$ from definitions of $\Pr(\bm{\xi}\mid\bm{x})$ , $C^{i}(\bm{x})$ , and $\bm{v}^{i}(\bm{x})$ .

(iii) Let $g_{i}(\bm{x},\xi):=-\frac{(\xi-\bm{v}^{i}(\bm{x})^{\top}\bm{a}^{i})^{2}}{2((% \sigma^{i})^{2}-\bm{v}^{i}(\bm{x})^{\top}A^{i}\bm{v}^{i}(\bm{x}))}$ and $h_{i}(\bm{x}):=(\sigma^{i})^{2}-\bm{v}^{i}(\bm{x})^{\top}A^{i}\bm{v}^{i}(\bm{x})$ . Then, $\Pr(\bm{\xi}\mid\bm{x}):=\prod_{i=1}^{n}\frac{1}{C^{i}(\bm{x})\sqrt{2\pi h_{i}% (\bm{x})}}\exp\left(g_{i}(\bm{x},\xi_{i})\right)$ . Therefore,

\displaystyle\frac{\partial\Pr(\bm{\xi}\mid\bm{x})}{\partial x_{k}}=\prod_{i=1}^{n}\frac{1}{C^{i}(\bm{x})\sqrt{2\pi h_{i}(\bm{x})}}\exp\left(g_{i}(\bm{x},\xi_{i})\right)\left(-\sum_{i=1}^{n}\frac{1}{C^{i}(\bm{x})}\frac{\partial C^{i}(\bm{x})}{\partial x_{k}}-\sum_{i=1}^{n}\frac{1}{2}\frac{1}{h_{i}(\bm{x})}\frac{\partial h_{i}(\bm{x})}{\partial x_{k}}+\sum_{i=1}^{n}\frac{\partial g_{i}(\bm{x},\xi_{i})}{\partial x_{k}}\right).

(8)

Here,

\displaystyle\left|\frac{\partial v_{j}^{i}(\bm{x})}{\partial x_{k}}\right|=% \left|-\theta_{1}^{i}\frac{2x_{k}-2\hat{x}_{k}^{j}}{\theta_{2}^{i}}\exp\left(-% \frac{\|\bm{x}-\hat{\bm{x}}^{j}\|^{2}}{\theta_{2}^{i}}\right)\right|\leq\frac{% 2|\theta_{1}^{i}||x_{k}-\hat{x}_{k}^{j}|}{|\theta_{2}^{i}|}\leq\frac{2\theta_{% 1}^{\max}(x_{\max}-x_{\min})}{\theta_{2}^{\min}}.

(9)

For all $\xi_{i}\in\mathbb{R}$ ,

$\displaystyle\left\|\frac{\partial g_{i}(\bm{x},\xi_{i})}{\partial x_{k}}\right\|$	$\displaystyle=\left\|-\frac{2(\xi_{i}-\bm{v}^{i}(\bm{x})^{\top}\bm{a}^{i})(-% \sum_{j=1}^{N}\frac{\partial v_{j}^{i}(\bm{x})}{\partial x_{k}}a_{j}^{i})}{2((% \sigma^{i})^{2}-\bm{v}^{i}(\bm{x})^{\top}A^{i}\bm{v}^{i}(\bm{x}))}\right.$
	$\displaystyle\quad\left.+\frac{(\xi_{i}-\bm{v}^{i}(\bm{x})^{\top}\bm{a}^{i})^{% 2}(-2\sum_{s=1}^{N}\sum_{t=1}^{N}A_{st}^{i}(v_{t}^{i}(\bm{x})\frac{\partial v_% {s}^{i}(\bm{x})}{\partial x_{k}}+v_{s}^{i}(\bm{x})\frac{\partial v_{t}^{i}(\bm% {x})}{\partial x_{k}}))}{(2((\sigma^{i})^{2}-\bm{v}^{i}(\bm{x})^{\top}A^{i}\bm% {v}^{i}(\bm{x})))^{2}}\right\|$
	$\displaystyle\leq\left\|\frac{\xi_{i}-\bm{v}^{i}(\bm{x})^{\top}\bm{a}^{i}}{(% \sigma^{i})^{2}-\bm{v}^{i}(\bm{x})^{\top}A^{i}\bm{v}^{i}(\bm{x})}\right\|\left\|% \sum_{j=1}^{N}\frac{\partial v_{j}^{i}(\bm{x})}{\partial x_{k}}a_{j}^{i}\right\|$
	$\displaystyle\quad+\left\|\frac{(\xi_{i}-\bm{v}^{i}(\bm{x})^{\top}\bm{a}^{i})^{% 2}}{(2((\sigma^{i})^{2}-\bm{v}^{i}(\bm{x})^{\top}A^{i}\bm{v}^{i}(\bm{x})))^{2}% }\right\|\left\|2\sum_{s=1}^{N}\sum_{t=1}^{N}A_{st}^{i}\left(v_{t}^{i}(\bm{x})% \frac{\partial v_{s}^{i}(\bm{x})}{\partial x_{k}}+v_{s}^{i}(\bm{x})\frac{% \partial v_{t}^{i}(\bm{x})}{\partial x_{k}}\right)\right\|$
	$\displaystyle\leq\frac{\xi^{\max}+N\theta_{1}^{\max}a^{\max}}{\Delta}\cdot 2Na% ^{\max}\frac{\theta_{1}^{\max}(x_{\max}-x_{\min})}{\theta_{2}^{\min}}$
	$\displaystyle\quad+\frac{(\xi^{\max}+N\theta_{1}^{\max}a^{\max})^{2}}{4\Delta^% {2}}\cdot 8N^{2}A^{\max}\frac{(\theta_{1}^{\max})^{2}(x_{\max}-x_{\min})}{% \theta_{2}^{\min}}$
	$\displaystyle\leq\frac{2N\theta_{1}^{\max}(\xi^{\max}+N\theta_{1}^{\max}a^{% \max})(x_{\max}-x_{\min})}{\Delta\theta_{2}^{\min}}\left(a^{\max}+NA^{\max}% \theta_{1}^{\max}\frac{\xi^{\max}+N\theta_{1}^{\max}a^{\max}}{\Delta}\right),$	(10)

where the second inequality comes from (9) and the fact that $(\sigma^{i})^{2}-\bm{v}^{i}(\bm{x})^{\top}A^{i}\bm{v}^{i}(\bm{x})\geq\Delta$ and $|v_{t}^{i}(\bm{x})|\leq|\theta_{1}^{i}|\leq\theta_{1}^{\max}$ for $i\in[n]$ and $t\in[N]$ .

Moreover, for all $\bm{x}\in\mathbb{R}^{n}$ ,

	$\displaystyle\left\|\frac{1}{2}\frac{1}{h_{i}(\bm{x})}\frac{\partial h_{i}(\bm{% x})}{\partial x_{k}}\right\|$	$\displaystyle=\frac{1}{2}\left\|\frac{1}{h_{i}(\bm{x})}\right\|\left\|\sum_{s=1}^% {N}\sum_{t=1}^{N}A_{st}^{i}\left(v_{t}^{i}(\bm{x})\frac{\partial v_{s}^{i}(\bm% {x})}{\partial x_{k}}+v_{s}^{i}(\bm{x})\frac{\partial v_{t}^{i}(\bm{x})}{% \partial x_{k}}\right)\right\|$
		$\displaystyle\leq\frac{2}{\Delta}N^{2}A^{\max}\frac{(\theta_{1}^{\max})^{2}(x_% {\max}-x_{\min})}{\theta_{2}^{\min}},$		(11)

where the inequality follows from (9) and the fact that $h_{i}(\bm{x})\geq\Delta$ and $|v_{t}^{i}(\bm{x})|\leq\theta_{1}^{\max}$ for $i\in[n]$ and $t\in[N]$ .

Let $r_{1}:=\frac{2N\theta_{1}^{\max}(\xi^{\max}+N\theta_{1}^{\max}a^{\max})(x_{% \max}-x_{\min})}{\Delta\theta_{2}^{\min}}\left(a^{\max}+NA^{\max}\theta_{1}^{% \max}\frac{\xi^{\max}+N\theta_{1}^{\max}a^{\max}}{\Delta}\right)$ , and
$r_{2}:=\frac{2}{\Delta}N^{2}A^{\max}\frac{(\theta_{1}^{\max})^{2}(x_{\max}-x_{% \min})}{\theta_{2}^{\min}}$ . Then,

	$\displaystyle\left\|\frac{1}{C^{i}(\bm{x})}\frac{\partial C^{i}(\bm{x})}{% \partial x_{k}}\right\|$	$\displaystyle=\left\|\frac{1}{C^{i}(\bm{x})}\frac{\partial\int_{0}^{\xi_{i}^{% \max}}\frac{1}{\sqrt{2\pi h_{i}(\bm{x})}}\exp\left(g_{i}(\bm{x},\phi)\right)d% \phi}{\partial x_{k}}\right\|$
		$\displaystyle=\left\|\frac{1}{C^{i}(\bm{x})}\int_{0}^{\xi_{i}^{\max}}\frac{1}{% \sqrt{2\pi h_{i}(\bm{x})}}\exp\left(g_{i}(\bm{x},\phi)\right)\left(-\frac{1}{2% }\frac{1}{h_{i}(\bm{x})}\frac{\partial h_{i}(\bm{x})}{\partial x_{k}}+\frac{% \partial g_{i}(\bm{x},\phi)}{\partial x_{k}}\right)d\phi\right\|$
		$\displaystyle\leq\left\|\frac{1}{C^{i}(\bm{x})}\int_{0}^{\xi_{i}^{\max}}\frac{1% }{\sqrt{2\pi h_{i}(\bm{x})}}\exp\left(g_{i}(\bm{x},\phi)\right)\left(\left\|% \frac{1}{2}\frac{1}{h_{i}(\bm{x})}\frac{\partial h_{i}(\bm{x})}{\partial x_{k}% }\right\|+\left\|\frac{\partial g_{i}(\bm{x},\phi)}{\partial x_{k}}\right\|\right% )d\phi\right\|$
		$\displaystyle\leq\left\|\frac{1}{C^{i}(\bm{x})}\int_{0}^{\xi_{i}^{\max}}\frac{1% }{\sqrt{2\pi h_{i}(\bm{x})}}\exp\left(g_{i}(\bm{x},\phi)\right)\left(r_{1}+r_{% 2}\right)d\phi\right\|=r_{1}+r_{2},$

where the second inequality follows from (10) and (11). Here, from (8),

\displaystyle\left|\frac{\partial\Pr(\bm{\xi}\mid\bm{x})}{\partial x_{k}}\cdot\frac{1}{\Pr(\bm{\xi}\mid\bm{x})}\right|

\displaystyle\leq\sum_{i=1}^{n}\left|\frac{1}{C^{i}(\bm{x})}\frac{\partial C^{i}(\bm{x})}{\partial x_{k}}\right|+\sum_{i=1}^{n}\left|\frac{1}{2}\frac{1}{h_{i}(\bm{x})}\frac{\partial h_{i}(\bm{x})}{\partial x_{k}}\right|+\sum_{i=1}^{n}\left|\frac{\partial g_{i}(\bm{x},\xi_{i})}{\partial x_{k}}\right|\leq 2n(r_{1}+r_{2}).

Therefore, $\left\|\frac{\nabla\Pr(\bm{\xi}\mid\bm{x})}{\Pr(\bm{\xi}\mid\bm{x})}\right\|\leq\sum_{k=1}^{n}\left|\frac{\partial\Pr(\bm{\xi}\mid\bm{x})}{\partial x_{k}}\cdot\frac{1}{\Pr(\bm{\xi}\mid\bm{x})}\right|\leq 2n^{2}(r_{1}+r_{2})$ . Condition (iii) of Assumption 1 then holds by the definitions of $r_{1}$ , $r_{2}$ , and $M$ . ∎
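The bound (9) drives the constants $r_{1}$ and $r_{2}$ , so it is worth checking in isolation. The sketch below assumes the kernel form $v_{j}^{i}(\bm{x})=\theta_{1}^{i}\exp(-\|\bm{x}-\hat{\bm{x}}^{j}\|^{2}/\theta_{2}^{i})$ , which is the form whose partial derivative matches (9), and compares the analytic derivative with a central finite difference at random points of $[x_{\min},x_{\max}]^{n}$ . The parameter values, the random sampling, and the function names are illustrative assumptions.

```python
import numpy as np

def kernel(x, x_hat, theta1, theta2):
    # Assumed form consistent with (9): theta1 * exp(-||x - x_hat||^2 / theta2).
    return theta1 * np.exp(-np.sum((x - x_hat) ** 2) / theta2)

def kernel_grad(x, x_hat, theta1, theta2):
    # Analytic gradient of the kernel with respect to x, as in (9).
    return -theta1 * 2.0 * (x - x_hat) / theta2 * np.exp(-np.sum((x - x_hat) ** 2) / theta2)

rng = np.random.default_rng(0)
x_min, x_max, n = 0.0, 2.0, 4
theta1, theta2 = 1.3, 0.7                              # hypothetical hyperparameters
bound = 2.0 * theta1 * (x_max - x_min) / theta2        # right-hand side of (9)

eps = 1e-6
worst_gap = 0.0   # largest finite-difference discrepancy
worst_abs = 0.0   # largest absolute partial derivative observed
for _ in range(200):
    x = rng.uniform(x_min, x_max, size=n)
    x_hat = rng.uniform(x_min, x_max, size=n)
    g = kernel_grad(x, x_hat, theta1, theta2)
    for k in range(n):
        e = np.zeros(n)
        e[k] = eps
        fd = (kernel(x + e, x_hat, theta1, theta2)
              - kernel(x - e, x_hat, theta1, theta2)) / (2 * eps)
        worst_gap = max(worst_gap, abs(fd - g[k]))
    worst_abs = max(worst_abs, float(np.max(np.abs(g))))
```

In this sketch both $\bm{x}$ and $\hat{\bm{x}}^{j}$ are drawn from the box, which is the situation in which $|x_{k}-\hat{x}_{k}^{j}|\leq x_{\max}-x_{\min}$ is guaranteed.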

A.4 Proof of Lemma 4

Proof.

For a given $\bm{x}\in\mathcal{C}$ , let $\{\Delta_{k}\}$ be a sequence of nonzero scalars such that $\lim_{k\to\infty}\Delta_{k}=0$ and $\bm{x}+\Delta_{k}\bm{e}^{i}\in\mathcal{C}$ , where $\bm{e}^{i}$ is the vector whose $i$ -th element is $1$ and whose other elements are $0$ . Let $g_{k,i}(\bm{x},\bm{\xi}):=\frac{h(\bm{x}+\Delta_{k}\bm{e}^{i},\bm{\xi})-h(\bm{x},\bm{\xi})}{\Delta_{k}}$ . By the mean-value theorem, there exists $\bm{x}^{\prime}\in\mathcal{C}$ such that $g_{k,i}(\bm{x},\bm{\xi})=\frac{\partial h(\bm{x}^{\prime},\bm{\xi})}{\partial x_{i}}$ . Moreover, $f_{\max}(=\max_{\bm{x}\in\mathcal{C},\bm{\xi}\in\Xi}|f(\bm{x},\bm{\xi})|)$ exists since $\Xi$ and $\mathcal{C}$ are compact from Assumption 2 and $f$ is a real-valued continuous function from Assumption 1. Then, for all $\bm{x}\in\mathcal{C}$ and $i=1,\dots,n$ ,

	$\displaystyle\|g_{k,i}(\bm{x},\bm{\xi})\|$	$\displaystyle=\left\|\frac{\partial h(\bm{x}^{\prime},\bm{\xi})}{\partial x_{i}% }\right\|=\left\|\frac{\partial f(\bm{x}^{\prime},\bm{\xi})}{\partial x_{i}}% \mathrm{Pr}(\bm{\xi}\mid\bm{x}^{\prime})+f(\bm{x}^{\prime},\bm{\xi})\frac{% \partial\mathrm{Pr}(\bm{\xi}\mid\bm{x}^{\prime})}{\partial x_{i}}\right\|$
		$\displaystyle\leq\left\|\frac{\partial f(\bm{x}^{\prime},\bm{\xi})}{\partial x_% {i}}\right\|+f_{\max}\left\|\frac{\partial\mathrm{Pr}(\bm{\xi}\mid\bm{x}^{\prime% })}{\partial x_{i}}\frac{1}{\mathrm{Pr}(\bm{\xi}\mid\bm{x}^{\prime})}\right\|% \leq L_{f}+f_{\max}M,$

where the first inequality comes from $0<\mathrm{Pr}(\bm{\xi}\mid\bm{x}^{\prime})\leq 1$ , and the second inequality follows from conditions (i) and (iii) of Assumption 1. Here, $g_{k,i}$ is measurable on $\Xi$ since $\Xi$ is a Borel set and $g_{k,i}$ is continuous w.r.t. $\bm{\xi}$ from Assumption 3 and the definitions of $g_{k,i}$ and $h$ . The constant function $r(\bm{\xi}):=L_{f}+f_{\max}M<\infty$ is integrable over $\Xi$ . Moreover, $g_{k,i}(\bm{x},\bm{\xi})\to\frac{\partial h(\bm{x},\bm{\xi})}{\partial x_{i}}$ pointwise when $k\to\infty$ since $h(\bm{x},\bm{\xi})$ is differentiable w.r.t. $\bm{x}$ from conditions (i) and (ii) of Assumption 1. Then, the Lebesgue dominated convergence theorem [Royden and Fitzpatrick, 1988, Chapter 4.4, page 88] holds for $g_{k,i}$ for all $\bm{x}\in\mathcal{C}$ and $i=1,\dots,n$ , that is,

\displaystyle\lim_{k\to\infty}\int_{\bm{\xi}\in\Xi}g_{k,i}(\bm{x},\bm{\xi})d% \bm{\xi}=\int_{\bm{\xi}\in\Xi}\lim_{k\to\infty}g_{k,i}(\bm{x},\bm{\xi})d\bm{% \xi},\ \textrm{for all}\ \bm{x}\in\mathcal{C}\ \textrm{and}\ i=1,\dots,n.

Then, for all $\bm{x}\in\mathcal{C}$ and $i=1,\dots,n$ ,

	$\displaystyle\left(\nabla_{\bm{x}}\int_{\bm{\xi}\in\Xi}h(\bm{x},\bm{\xi})d\bm{% \xi}\right)_{i}$	$\displaystyle=\lim_{k\to\infty}\frac{\int_{\bm{\xi}\in\Xi}h(\bm{x}+\Delta_{k}% \bm{e}^{i},\bm{\xi})d\bm{\xi}-\int_{\bm{\xi}\in\Xi}h(\bm{x},\bm{\xi})d\bm{\xi}% }{\Delta_{k}}$
		$\displaystyle=\lim_{k\to\infty}\int_{\bm{\xi}\in\Xi}\frac{h(\bm{x}+\Delta_{k}% \bm{e}^{i},\bm{\xi})-h(\bm{x},\bm{\xi})}{\Delta_{k}}d\bm{\xi}$
		$\displaystyle=\lim_{k\to\infty}\int_{\bm{\xi}\in\Xi}g_{k,i}(\bm{x},\bm{\xi})d% \bm{\xi}=\int_{\bm{\xi}\in\Xi}\lim_{k\to\infty}g_{k,i}(\bm{x},\bm{\xi})d\bm{\xi}$
		$\displaystyle=\int_{\bm{\xi}\in\Xi}\left(\nabla_{\bm{x}}h(\bm{x},\bm{\xi})% \right)_{i}d\bm{\xi}.$

Therefore, for all $\bm{x}\in\mathcal{C}$ ,

\displaystyle\nabla_{\bm{x}}\int_{\bm{\xi}\in\Xi}h(\bm{x},\bm{\xi})d\bm{\xi}=% \int_{\bm{\xi}\in\Xi}\nabla_{\bm{x}}h(\bm{x},\bm{\xi})d\bm{\xi}.

∎

A.5 Proof of Lemma 5

Proof.

For a given $\bm{x}\in\mathcal{C}$ , let $\{\Delta_{k}\}$ be a sequence of nonzero scalars such that $\lim_{k\to\infty}\Delta_{k}=0$ and $\bm{x}+\Delta_{k}\bm{e}^{i}\in\mathcal{C}$ , where $\bm{e}^{i}$ is the vector whose $i$ -th element is $1$ and whose other elements are $0$ . Let $g_{k,i}(\bm{x},\bm{\xi}):=\frac{q(\bm{\xi})\mathrm{Pr}(\bm{\xi}\mid\bm{x}+\Delta_{k}\bm{e}^{i})-q(\bm{\xi})\mathrm{Pr}(\bm{\xi}\mid\bm{x})}{\Delta_{k}}$ . By the mean-value theorem, there exists $\bm{x}^{\prime}\in\mathcal{C}$ such that $g_{k,i}(\bm{x},\bm{\xi})=\frac{\partial q(\bm{\xi})\mathrm{Pr}(\bm{\xi}\mid\bm{x}^{\prime})}{\partial x_{i}}$ . Moreover, let $q^{\max}:=\max_{\bm{\xi}\in\Xi}|q(\bm{\xi})|$ , which exists since $\Xi$ is compact from Assumption 2 and $q$ is a real-valued continuous function. Then, for all $\bm{x}\in\mathcal{C}$ and $i=1,\dots,n$ ,

	$\displaystyle\|g_{k,i}(\bm{x},\bm{\xi})\|$	$\displaystyle=\left\|\frac{\partial q(\bm{\xi})\mathrm{Pr}(\bm{\xi}\mid\bm{x}^{% \prime})}{\partial x_{i}}\right\|=\|q(\bm{\xi})\|\left\|\frac{\partial\mathrm{Pr}(% \bm{\xi}\mid\bm{x}^{\prime})}{\partial x_{i}}\right\|$
		$\displaystyle\leq\|q(\bm{\xi})\|\left\|\frac{\partial\mathrm{Pr}(\bm{\xi}\mid\bm{% x}^{\prime})}{\partial x_{i}}\frac{1}{\mathrm{Pr}(\bm{\xi}\mid\bm{x}^{\prime})% }\right\|\leq q^{\max}M,$

where the first inequality follows from $0<\mathrm{Pr}(\bm{\xi}\mid\bm{x}^{\prime})\leq 1$ . The second inequality comes from condition (iii) of Assumption 1. Here, $g_{k,i}$ is measurable on $\Xi$ since $\Xi$ is a Borel set and $g_{k,i}$ is continuous w.r.t. $\bm{\xi}$ from Assumption 3 and the definition of $g_{k,i}$ . The constant function $r(\bm{\xi}):=q^{\max}M<\infty$ is integrable over $\Xi$ . Moreover, $g_{k,i}(\bm{x},\bm{\xi})\to\frac{\partial q(\bm{\xi})\mathrm{Pr}(\bm{\xi}\mid% \bm{x})}{\partial x_{i}}$ pointwise when $k\to\infty$ since $\mathrm{Pr}(\bm{\xi}\mid\bm{x})$ is differentiable w.r.t. $\bm{x}$ from condition (ii) of Assumption 1. Then, the Lebesgue dominated convergence theorem [Royden and Fitzpatrick, 1988, Chapter 4.4, page 88] holds for $g_{k,i}$ for all $\bm{x}\in\mathcal{C}$ and $i=1,\dots,n$ , that is,

\displaystyle\lim_{k\to\infty}\int_{\bm{\xi}\in\Xi}g_{k,i}(\bm{x},\bm{\xi})d{% \bm{\xi}}=\int_{\bm{\xi}\in\Xi}\lim_{k\to\infty}g_{k,i}(\bm{x},\bm{\xi})d{\bm{% \xi}},\ \textrm{for all}\ \bm{x}\in\mathcal{C}\ \textrm{and}\ i=1,\dots,n.

Then, for all $\bm{x}\in\mathcal{C}$ and $i=1,\dots,n$ ,

	$\displaystyle\left(\nabla_{\bm{x}}\int_{\bm{\xi}\in\Xi}q(\bm{\xi})\mathrm{Pr}(% \bm{\xi}\mid\bm{x})d\bm{\xi}\right)_{i}$
	$\displaystyle=\lim_{k\to\infty}\frac{\int_{\bm{\xi}\in\Xi}q(\bm{\xi})\mathrm{% Pr}(\bm{\xi}\mid\bm{x}+\Delta_{k}\bm{e}^{i})d\bm{\xi}-\int_{\bm{\xi}\in\Xi}q(% \bm{\xi})\mathrm{Pr}(\bm{\xi}\mid\bm{x})d\bm{\xi}}{\Delta_{k}}$
	$\displaystyle=\lim_{k\to\infty}\int_{\bm{\xi}\in\Xi}\frac{q(\bm{\xi})\mathrm{% Pr}(\bm{\xi}\mid\bm{x}+\Delta_{k}\bm{e}^{i})-q(\bm{\xi})\mathrm{Pr}(\bm{\xi}% \mid\bm{x})}{\Delta_{k}}d\bm{\xi}$
	$\displaystyle=\lim_{k\to\infty}\int_{\bm{\xi}\in\Xi}g_{k,i}(\bm{x},\bm{\xi})d% \bm{\xi}=\int_{\bm{\xi}\in\Xi}\lim_{k\to\infty}g_{k,i}(\bm{x},\bm{\xi})d\bm{\xi}$
	$\displaystyle=\int_{\bm{\xi}\in\Xi}\left(\nabla_{\bm{x}}(q(\bm{\xi})\mathrm{Pr% }(\bm{\xi}\mid\bm{x}))\right)_{i}d\bm{\xi}.$		(12)

Therefore, for all $\bm{x}\in\mathcal{C}$ ,

\displaystyle\nabla_{\bm{x}}\int_{\bm{\xi}\in\Xi}q(\bm{\xi})\mathrm{Pr}(\bm{% \xi}\mid\bm{x})d\bm{\xi}=\int_{\bm{\xi}\in\Xi}\nabla_{\bm{x}}(q(\bm{\xi})% \mathrm{Pr}(\bm{\xi}\mid\bm{x}))d{\bm{\xi}}.

∎

A.6 Proof of Lemma 6

Proof.

We have

	$\displaystyle\int_{\bm{\xi}\in\Xi}\left(\delta\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})}{\mathrm{Pr}(\bm{\xi}\mid\bm{x})}\right)\mathrm{Pr}(\bm{\xi}\mid\bm{x})d\bm{\xi}=\delta\int_{\bm{\xi}\in\Xi}\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})d\bm{\xi}$
	$\displaystyle=\delta\nabla_{\bm{x}}\int_{\bm{\xi}\in\Xi}\mathrm{Pr}(\bm{\xi}\mid\bm{x})d\bm{\xi}=\delta\nabla_{\bm{x}}(1)=0,$		(13)

where the second equality comes from Lemma 5 with $q(\bm{\xi})=1$ . Then,

	$\displaystyle\nabla_{\bm{x}}\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[f(\bm{x},\bm{\xi})]=\nabla_{\bm{x}}\int_{\bm{\xi}\in\Xi}f(\bm{x},\bm{\xi})d\mathrm{Pr}(\bm{\xi}\mid\bm{x})=\nabla_{\bm{x}}\int_{\bm{\xi}\in\Xi}f(\bm{x},\bm{\xi})\mathrm{Pr}(\bm{\xi}\mid\bm{x})d\bm{\xi}$
	$\displaystyle=\int_{\bm{\xi}\in\Xi}\nabla_{\bm{x}}\left(f(\bm{x},\bm{\xi})\mathrm{Pr}(\bm{\xi}\mid\bm{x})\right)d\bm{\xi}=\int_{\bm{\xi}\in\Xi}\left(\nabla_{\bm{x}}f(\bm{x},\bm{\xi})\mathrm{Pr}(\bm{\xi}\mid\bm{x})+f(\bm{x},\bm{\xi})\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})\right)d\bm{\xi}$
	$\displaystyle=\int_{\bm{\xi}\in\Xi}\left(\nabla_{\bm{x}}f(\bm{x},\bm{\xi})+f(\bm{x},\bm{\xi})\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})}{\mathrm{Pr}(\bm{\xi}\mid\bm{x})}\right)\mathrm{Pr}(\bm{\xi}\mid\bm{x})d\bm{\xi}$
	$\displaystyle=\int_{\bm{\xi}\in\Xi}\left(\nabla_{\bm{x}}f(\bm{x},\bm{\xi})+f(\bm{x},\bm{\xi})\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})}{\mathrm{Pr}(\bm{\xi}\mid\bm{x})}\right)d\mathrm{Pr}(\bm{\xi}\mid\bm{x})$
	$\displaystyle=\int_{\bm{\xi}\in\Xi}\left(\nabla_{\bm{x}}f(\bm{x},\bm{\xi})+(f(\bm{x},\bm{\xi})-\delta)\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})}{\mathrm{Pr}(\bm{\xi}\mid\bm{x})}\right)d\mathrm{Pr}(\bm{\xi}\mid\bm{x})$
	$\displaystyle=\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}\left[\nabla_{\bm{x}}f(\bm{x},\bm{\xi})+(f(\bm{x},\bm{\xi})-\delta)\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})}{\mathrm{Pr}(\bm{\xi}\mid\bm{x})}\right].$

Here, the third equality obviously holds when $\bm{\xi}$ is a discrete random vector. If $\bm{\xi}$ is a continuous random vector, the third equality follows from Lemma 4 since Assumptions 1–3 hold. The fourth equality is due to the fact that $f(\bm{x},\bm{\xi})$ and $\mathrm{Pr}(\bm{\xi}\mid\bm{x})$ are differentiable w.r.t. $\bm{x}$ from conditions (i) and (ii) of Assumption 1. The fifth equality is due to the fact that $\mathrm{Pr}(\bm{\xi}\mid\bm{x})\neq 0$ from condition (ii) of Assumption 1. The seventh equality comes from (13). Then, Lemma 6 holds from Definition 2. ∎
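For a discrete demand with finite support, the statement of Lemma 6 can be checked exactly by enumeration instead of by sampling. The sketch below is a toy instance and not the paper's experimental setup: it assumes a single product with binomial demand, a hypothetical logistic purchase probability $p(x)=1/(1+e^{\beta x+c})$ , and the objective $f(x,\xi)=-x\xi$ (negative revenue). It computes the exact mean and variance of the estimator $\nabla_{x}f+(f-\delta)\nabla_{x}\mathrm{Pr}/\mathrm{Pr}$ for several values of $\delta$ .

```python
import math

def purchase_prob(x, beta, c):
    # Hypothetical single-product logistic model: p(x) = 1 / (1 + exp(beta * x + c)).
    return 1.0 / (1.0 + math.exp(beta * x + c))

def binom_pmf(xi, d, p):
    # Probability of xi purchases among d buyers.
    return math.comb(d, xi) * p ** xi * (1.0 - p) ** (d - xi)

def estimator(x, xi, delta, d, beta, c):
    """grad_x f + (f - delta) * score, with f(x, xi) = -x * xi (illustrative objective)."""
    p = purchase_prob(x, beta, c)
    score = beta * (d * p - xi)   # (grad_x Pr) / Pr for this binomial model
    return -xi + (-x * xi - delta) * score

def exact_mean_and_var(x, delta, d, beta, c):
    # Enumerate the finite support {0, ..., d} instead of sampling.
    p = purchase_prob(x, beta, c)
    vals = [estimator(x, xi, delta, d, beta, c) for xi in range(d + 1)]
    w = [binom_pmf(xi, d, p) for xi in range(d + 1)]
    mean = sum(wi * v for wi, v in zip(w, vals))
    var = sum(wi * (v - mean) ** 2 for wi, v in zip(w, vals))
    return mean, var

d, beta, c, x = 12, 0.8, -0.4, 1.5   # hypothetical values
p = purchase_prob(x, beta, c)
# Gradient of E[f] = -x * d * p(x), using p'(x) = -beta * p(x) * (1 - p(x)).
true_grad = -d * p + x * d * beta * p * (1.0 - p)
expected_f = -x * d * p

# Exact (mean, variance) of the estimator for three choices of delta.
results = {delta: exact_mean_and_var(x, delta, d, beta, c)
           for delta in (0.0, expected_f, 5.0)}
```

In this toy instance the mean equals `true_grad` for every $\delta$ , while the variance changes with $\delta$ ; for these particular values, choosing $\delta$ equal to the expected objective gives a smaller variance than $\delta=0$ .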

A.7 Proof of Lemma 7

Proof.

We have

	$\displaystyle\mathbb{E}_{\bm{\xi}^{\prime}\sim D(\bm{x})}[\|g_{2}(\bm{x},\bm{\xi}^{\prime})-\nabla_{\bm{x}}\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[f(\bm{x},\bm{\xi})]\|^{2}]=\mathbb{E}_{\bm{\xi}^{\prime}\sim D(\bm{x})}[\|g_{2}(\bm{x},\bm{\xi}^{\prime})\|^{2}]-\|\nabla_{\bm{x}}\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[f(\bm{x},\bm{\xi})]\|^{2}$
	$\displaystyle\leq\mathbb{E}_{\bm{\xi}^{\prime}\sim D(\bm{x})}[\|g_{2}(\bm{x},\bm{\xi}^{\prime})\|^{2}]\leq\mathbb{E}_{\bm{\xi}^{\prime}\sim D(\bm{x})}\left[\left(\|\nabla_{\bm{x}}f(\bm{x},\bm{\xi}^{\prime})\|+\|f(\bm{x},\bm{\xi}^{\prime})-\delta\|\left\|\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}^{\prime}\mid\bm{x})}{\mathrm{Pr}(\bm{\xi}^{\prime}\mid\bm{x})}\right\|\right)^{2}\right]$
	$\displaystyle\leq\mathbb{E}_{\bm{\xi}^{\prime}\sim D(\bm{x})}[(L_{f}+2f_{\max}M)^{2}]=(L_{f}+2f_{\max}M)^{2},$

where the first equality comes from Lemma 6. The third inequality follows from conditions (i) and (iii) of Assumption 1 and Assumption 2. ∎

A.8 Proof of Lemma 8

Proof.

For all $\bm{x}\in\mathcal{C}$ ,

	$\displaystyle\|\nabla_{\bm{x}}\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[f(\bm{x},\bm{\xi})]\|=\left\|\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}\left[\nabla_{\bm{x}}f(\bm{x},\bm{\xi})+f(\bm{x},\bm{\xi})\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})}{\mathrm{Pr}(\bm{\xi}\mid\bm{x})}\right]\right\|$
	$\displaystyle\leq\left\|\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}\left[\|\nabla_{\bm{x}}f(\bm{x},\bm{\xi})\|+\|f(\bm{x},\bm{\xi})\|\left\|\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})}{\mathrm{Pr}(\bm{\xi}\mid\bm{x})}\right\|\right]\right\|\leq\left\|\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}\left[L_{f}+f_{\max}M\right]\right\|$
	$\displaystyle=L_{f}+f_{\max}M,$

where the first equality follows from Lemma 6 with $\delta=0$ , and the second inequality comes from conditions (i) and (iii) of Assumption 1 and Assumption 2. ∎

A.9 Proof of Proposition 9

Proof.

We show that our problem satisfies the assumptions of [Besbes et al., 2015, Lemma C-5]. First, we show that $\{\psi_{k}\}_{k=1}^{R}$ in our problem is included in $\mathcal{F}_{s}$ defined by [Besbes et al., 2015, Section 5]. $\mathcal{F}_{s}$ is a class of sequences $\{g_{k}\}_{k=1}^{R}$ of convex cost functions from $\mathcal{Y}\subset\mathbb{R}^{d}$ into $\mathbb{R}$ , where $\mathcal{Y}$ is convex, compact, and non-empty. Moreover, $\mathcal{F}_{s}$ and $\mathcal{Y}$ satisfy the following conditions for all $k\in[R]$ :

1. There is a finite number $G>0$ such that $|g_{k}(\bm{y})|\leq G$ and $\|\nabla_{\bm{y}}g_{k}(\bm{y})\|\leq G$ for all $\bm{y}\in\mathcal{Y}$ .
2. There is some $\nu>0$ such that $\{\bm{y}\in\mathbb{R}^{d}:\|\bm{y}-\bm{y}^{*}_{k}\|\leq\nu\}\subset\mathcal{Y}$ , where $\bm{y}^{*}_{k}\in\arg\min_{\bm{y}\in\mathcal{Y}}\psi_{k}(\bm{y})$ .
3. There are finite numbers $H>0$ and $G>0$ such that $H\bm{I}_{d}\preceq\nabla_{\bm{y}}^{2}g_{k}(\bm{y})\preceq G\bm{I}_{d}$ , where $\bm{I}_{d}$ is the $d$ -dimensional identity matrix.

We consider the case of $d=1$ and $\mathcal{Y}=[-f_{\max}-\kappa,f_{\max}+\kappa]$ . Accordingly, $\{\psi_{k}\}_{k=1}^{R}$ in our problem is included in $\mathcal{F}_{s}$ since the following holds for any $\delta\in\mathcal{Y}$ and $k\in[R]$ :

	$\displaystyle|\psi_{k}(\delta)|=\frac{1}{2}(\delta-\mathbb{E}_{\bm{\xi}\sim D(\bm{x}_{k})}[f(\bm{x}_{k},\bm{\xi})])^{2}\leq\frac{1}{2}(2f_{\max}+\kappa)^{2},$
	$\displaystyle|\nabla_{\delta}\psi_{k}(\delta)|=|\delta-\mathbb{E}_{\bm{\xi}\sim D(\bm{x}_{k})}[f(\bm{x}_{k},\bm{\xi})]|\leq 2f_{\max}+\kappa,$
	$\displaystyle\arg\min_{\delta^{\prime}}\psi_{k}(\delta^{\prime})\in[-f_{\max},f_{\max}],\ \textrm{and}$
	$\displaystyle\nabla_{\delta}^{2}\psi_{k}(\delta)=1.$

Here, (C-9) in the proof of Lemma C-5 of Besbes et al. [2015] holds by letting $\phi^{1}(\delta_{k},\psi_{k}):=\delta_{k}-v_{k}$ since $\mathbb{E}_{v_{k}}[\delta_{k}-v_{k}]=\delta_{k}-\mathbb{E}_{\bm{\xi}\sim D(\bm{x}_{k})}[f(\bm{x}_{k},\bm{\xi})]=\nabla\psi_{k}(\delta_{k})$ and $\mathbb{E}_{v_{k}}[|\delta_{k}-v_{k}|^{2}]\leq|2f_{\max}+\kappa|^{2}$. Proposition 9 follows from the same argument as in the proof of [Besbes et al., 2015, Lemma C-5]. ∎
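The stochastic gradient step on $\psi_{k}$ used in this argument can be illustrated with a small Python sketch. It is a toy example only, not Algorithm 2: the target mean is held fixed, and the noise model, the step size `zeta = 1/(k+1)`, the constants, and the function name `update_delta` are all assumptions made for illustration.

```python
import random

def update_delta(delta, v, zeta, lo, hi):
    """One projected stochastic gradient step on psi(delta) = 0.5*(delta - E[v])**2.

    (delta - v) is an unbiased estimate of the gradient delta - E[v].
    """
    step = delta - zeta * (delta - v)
    return min(max(step, lo), hi)  # projection onto the interval Y = [lo, hi]

random.seed(0)
f_max, kappa = 1.0, 0.1            # hypothetical constants
lo, hi = -f_max - kappa, f_max + kappa
target = 0.3                       # stands in for E[f(x_k, xi)], held fixed here
delta = -1.0
for k in range(1, 2001):
    v = target + random.uniform(-0.5, 0.5)   # noisy observation with mean `target`
    delta = update_delta(delta, v, zeta=1.0 / (k + 1), lo=lo, hi=hi)
```

In this sketch each observation `v` lies in $[-f_{\max},f_{\max}]$ and the step size lies in $(0,1)$, so the iterate stays inside $\mathcal{Y}$ and drifts toward the mean it is tracking.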

A.10 Proof of Theorem 10

Proof.

When $\delta_{k}\in[-f_{\max},f_{\max}]$, the output $\delta_{k+1}$ of Algorithm 2 is also included in $[-f_{\max},f_{\max}]$, which follows from $\frac{1}{m_{k}}\sum_{\ell=1}^{m_{k}}f(\bm{x}_{k}^{md},\bm{\xi}^{\ell})\in[-f_{\max},f_{\max}]$, $\zeta_{k+1}\in(0,1)$, and line 10 of Algorithm 2. Since $\delta_{1}\in[-f_{\max},f_{\max}]$, we therefore have $\delta_{k}\in[-f_{\max},f_{\max}]$ for all $k\in[R]$. From Lemmas 6-8, Assumption 2, and [Ghadimi and Lan, 2016, Corollary 6], we have

	$\displaystyle\mathbb{E}[\|\mathcal{G}(\bm{x}_{R}^{md},{\beta_{R}})\|^{2}]\leq 96L_{Ef}\left[\frac{4L_{Ef}\|\bm{x}_{0}-\bm{x}^{*}\|^{2}}{N(N+1)(N+2)}+\frac{L_{Ef}(\|\bm{x}^{*}\|^{2}+H^{2})+2\tilde{D}^{2}}{N}\right].$

Here, in [Ghadimi and Lan, 2016, Corollary 6], we let $L_{\psi}:=L_{Ef}$ , $L_{f}:=L_{Ef}$ , and $\sigma^{2}:=(L_{f}+2f_{\max}M)^{2}$ . Then, to obtain an $\varepsilon$ -stationary point, we need the iteration number $\hat{N}$ such that

	$\displaystyle 96L_{Ef}\left[\frac{4L_{Ef}\|\bm{x}_{0}-\bm{x}^{*}\|^{2}}{\hat{N}(\hat{N}+1)(\hat{N}+2)}+\frac{L_{Ef}(\|\bm{x}^{*}\|^{2}+H^{2})+2\tilde{D}^{2}}{\hat{N}}\right]\leq\varepsilon^{2}.$		(14)

Eq. (14) can be reformulated as

	$\displaystyle\hat{N}(\hat{N}+1)(\hat{N}+2)\geq\frac{384L_{Ef}^{2}}{\varepsilon^{2}}\|\bm{x}_{0}-\bm{x}^{*}\|^{2}+\frac{96L_{Ef}}{\varepsilon^{2}}(L_{Ef}(\|\bm{x}^{*}\|^{2}+H^{2})+2\tilde{D}^{2})(\hat{N}+1)(\hat{N}+2).$

Therefore, a sufficient condition for (14) is as follows:

	$\displaystyle\hat{N}^{3}\geq\frac{768L_{Ef}^{2}}{\varepsilon^{2}}\|\bm{x}_{0}-\bm{x}^{*}\|^{2},\ \hat{N}\geq\frac{192L_{Ef}}{\varepsilon^{2}}(L_{Ef}(\|\bm{x}^{*}\|^{2}+H^{2})+2\tilde{D}^{2}).$

∎
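To make the sufficient condition concrete, the following Python sketch computes the smallest integer $\hat{N}$ satisfying both inequalities and evaluates the left-hand side of (14). The function names and the numerical values are hypothetical; `dist_sq` stands for $\|\bm{x}_{0}-\bm{x}^{*}\|^{2}$ and `C` for $L_{Ef}(\|\bm{x}^{*}\|^{2}+H^{2})+2\tilde{D}^{2}$.

```python
import math

def sufficient_iterations(L_Ef, dist_sq, C, eps):
    """Smallest integer N with N**3 >= 768*L_Ef**2*dist_sq/eps**2
    and N >= 192*L_Ef*C/eps**2 (the two sufficient conditions)."""
    cubic_rhs = 768.0 * L_Ef**2 * dist_sq / eps**2
    linear_rhs = 192.0 * L_Ef * C / eps**2
    N = max(1, math.ceil(max(cubic_rhs ** (1.0 / 3.0), linear_rhs)))
    # Guard against floating-point rounding in the cube root.
    while N**3 < cubic_rhs:
        N += 1
    return N

def bound_14(L_Ef, dist_sq, C, N):
    """Left-hand side of (14) for a given iteration number N."""
    return 96.0 * L_Ef * (4.0 * L_Ef * dist_sq / (N * (N + 1) * (N + 2)) + C / N)

# Hypothetical constants, chosen only to exercise the formulas.
L_Ef, dist_sq, C, eps = 2.0, 1.5, 3.0, 0.1
N_hat = sufficient_iterations(L_Ef, dist_sq, C, eps)
lhs = bound_14(L_Ef, dist_sq, C, N_hat)
```

For small $\varepsilon$ the linear condition dominates, so $\hat{N}$ grows on the order of $\varepsilon^{-2}$.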

A.11 Proof of Proposition 11

Proof.

Assumption 4 holds since $c(\bm{\xi})$ is continuous from the definition and
$\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[s(\bm{x},\bm{\xi})]=\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}\left[\sum_{i=1}^{n}x_{i}\xi_{i}\right]=\sum_{i=1}^{n}x_{i}\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[\xi_{i}]=s(\bm{x},\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[\bm{\xi}])$. Moreover, since $p_{k}(\bm{x})\neq 0$ for all $\bm{x}\in\mathbb{R}^{n}$ and $k\in\{0,\dots,n\}$ from the definition of $p_{i}(\bm{x})$ for each $i\in\{0,1,\dots,n\}$, $\left(\frac{\nabla_{\bm{p}}\phi(\bm{p}(\bm{x}),\bm{\xi})}{\phi(\bm{p}(\bm{x}),\bm{\xi})}\right)_{k}=\frac{\partial\prod_{i=0}^{n}{}_{m}C_{\xi_{i}}p_{i}(\bm{x})^{\xi_{i}}}{\partial p_{k}}\frac{1}{\prod_{i=0}^{n}{}_{m}C_{\xi_{i}}p_{i}(\bm{x})^{\xi_{i}}}=\prod_{i=0}^{n}{}_{m}C_{\xi_{i}}p_{i}(\bm{x})^{\xi_{i}}\frac{\xi_{k}}{p_{k}(\bm{x})}\frac{1}{\prod_{i=0}^{n}{}_{m}C_{\xi_{i}}p_{i}(\bm{x})^{\xi_{i}}}=\frac{\xi_{k}}{p_{k}(\bm{x})}$. ∎

A.12 Proof of Proposition 12

Proof.

Assumption 4 holds since $c(\bm{\xi})$ is continuous and $s(\bm{x},\bm{\xi})=0$ . Moreover, since $0<p_{i}(x)<1$ for all $x\in\mathbb{R}$ from the definition of $p_{i}$ , we have

	$\displaystyle\left(\frac{\nabla_{\bm{p}}\phi(\bm{p}(\bm{x}),\bm{\xi})}{\phi(\bm{p}(\bm{x}),\bm{\xi})}\right)_{k}=\frac{\partial\prod_{i\in I}{}_{d_{i}}C_{\xi_{i}}p_{i}(x_{i})^{\xi_{i}}(1-p_{i}(x_{i}))^{d_{i}-\xi_{i}}}{\partial p_{k}}\frac{1}{\prod_{i\in I}{}_{d_{i}}C_{\xi_{i}}p_{i}(x_{i})^{\xi_{i}}(1-p_{i}(x_{i}))^{d_{i}-\xi_{i}}}$
	$\displaystyle=\prod_{i\in I}{}_{d_{i}}C_{\xi_{i}}p_{i}(x_{i})^{\xi_{i}}(1-p_{i}(x_{i}))^{d_{i}-\xi_{i}}\left(\frac{\xi_{k}}{p_{k}(x_{k})}-\frac{d_{k}-\xi_{k}}{1-p_{k}(x_{k})}\right)\frac{1}{\prod_{i\in I}{}_{d_{i}}C_{\xi_{i}}p_{i}(x_{i})^{\xi_{i}}(1-p_{i}(x_{i}))^{d_{i}-\xi_{i}}}$
	$\displaystyle=\frac{\xi_{k}}{p_{k}(x_{k})}-\frac{d_{k}-\xi_{k}}{1-p_{k}(x_{k})}.$

∎
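The closed form above can be checked numerically. The following Python sketch compares it with a central finite difference for a single binomial factor (only the factor with index $k$ depends on $p_{k}$, so one factor suffices). The values of `d` and `p`, the step `h`, and the function names are hypothetical choices for illustration.

```python
import math

def binom_pmf(xi, d, p):
    """Binomial probability of xi successes out of d trials."""
    return math.comb(d, xi) * p**xi * (1.0 - p) ** (d - xi)

def score(xi, d, p):
    """Closed form xi/p - (d - xi)/(1 - p) of (d phi / d p) / phi."""
    return xi / p - (d - xi) / (1.0 - p)

def score_fd(xi, d, p, h=1e-6):
    """Central finite-difference estimate of the same ratio."""
    num = binom_pmf(xi, d, p + h) - binom_pmf(xi, d, p - h)
    return num / (2.0 * h * binom_pmf(xi, d, p))

d, p = 10, 0.3   # hypothetical values
max_err = max(abs(score(xi, d, p) - score_fd(xi, d, p)) for xi in range(d + 1))
```

The binomial coefficient cancels in the ratio, which is why it does not appear in the closed form.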

A.13 Proof of Lemma 13

Proof.

It follows from the definition of $\phi$ that

	$\displaystyle\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}\left[\delta\frac{\nabla_{\bm{x}}\phi(\bm{p}(\bm{x}),\bm{\xi})}{\phi(\bm{p}(\bm{x}),\bm{\xi})}\right]=\int_{\bm{\xi}\in\Xi}\left(\delta\frac{\nabla_{\bm{x}}\phi(\bm{p}(\bm{x}),\bm{\xi})}{\phi(\bm{p}(\bm{x}),\bm{\xi})}\right)d\mathrm{Pr}(\bm{\xi}\mid\bm{x})=\delta\int_{\bm{\xi}\in\Xi}\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}\mid\bm{x})d\bm{\xi}$
	$\displaystyle=\delta\nabla_{\bm{x}}\int_{\bm{\xi}\in\Xi}\mathrm{Pr}(\bm{\xi}\mid\bm{x})d\bm{\xi}=\delta\nabla_{\bm{x}}(1)=0,$		(15)

where the third equality comes from Lemma 5 with $q(\bm{\xi})=1$ . Since

	$\displaystyle\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[f(\bm{x},\bm{\xi})]=-s(\bm{x},\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[\bm{\xi}])+\int_{\bm{\xi}\in\Xi}c(\bm{\xi})\phi(\bm{p}(\bm{x}),\bm{\xi})d\bm{\xi}$

from Assumptions 4 and 5, we have

	$\displaystyle\nabla_{\bm{x}}\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[f(\bm{x},\bm{\xi})]$	$\displaystyle=\nabla_{\bm{x}}\left(-s(\bm{x},\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[\bm{\xi}])+\int_{\bm{\xi}\in\Xi}c(\bm{\xi})\phi(\bm{p}(\bm{x}),\bm{\xi})d\bm{\xi}\right)$
		$\displaystyle=-\nabla_{\bm{x}}s(\bm{x},\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[\bm{\xi}])+\nabla_{\bm{x}}\int_{\bm{\xi}\in\Xi}c(\bm{\xi})\phi(\bm{p}(\bm{x}),\bm{\xi})d\bm{\xi}$
		$\displaystyle=-\nabla_{\bm{x}}s(\bm{x},\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[\bm{\xi}])+\int_{\bm{\xi}\in\Xi}\nabla_{\bm{x}}\left(c(\bm{\xi})\phi(\bm{p}(\bm{x}),\bm{\xi})\right)d\bm{\xi}$
		$\displaystyle=-\nabla_{\bm{x}}s(\bm{x},\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[\bm{\xi}])+\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}\left[c(\bm{\xi})\frac{\nabla_{\bm{x}}\phi(\bm{p}(\bm{x}),\bm{\xi})}{\phi(\bm{p}(\bm{x}),\bm{\xi})}\right]$
		$\displaystyle=-\nabla_{\bm{x}}s(\bm{x},\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[\bm{\xi}])+\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}\left[(c(\bm{\xi})-\delta)\frac{\nabla_{\bm{x}}\phi(\bm{p}(\bm{x}),\bm{\xi})}{\phi(\bm{p}(\bm{x}),\bm{\xi})}\right]$
		$\displaystyle=-\nabla_{\bm{x}}s(\bm{x},\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[\bm{\xi}])+\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}\left[(c(\bm{\xi})-\delta)\frac{d\bm{p}(\bm{x})}{d\bm{x}}\frac{\nabla_{\bm{p}}\phi(\bm{p}(\bm{x}),\bm{\xi})}{\phi(\bm{p}(\bm{x}),\bm{\xi})}\right].$

Here, the third equality holds when $\bm{\xi}$ is a discrete random vector. When $\bm{\xi}$ is a continuous random vector, the third equality comes from Assumption 3 and Lemma 5 by letting $q(\bm{\xi}):=c(\bm{\xi})$ . The fourth equality is due to the fact that $\phi(\bm{p}(\bm{x}),\bm{\xi})=\mathrm{Pr}(\bm{\xi}\mid\bm{x})\neq 0$ from condition (ii) of Assumption 1. The fifth equality comes from (15). ∎
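As a sanity check of identity (15) and of the fifth equality, the following Python sketch evaluates the expectation of the cost term exactly by enumeration for a one-dimensional binomial demand and several values of $\delta$, and compares it with a finite-difference derivative of the expected cost. It covers only the term involving $c$, not $\nabla_{\bm{x}}s$. The purchase probability `p`, the cost `c`, the constants, and all function names are hypothetical choices for illustration.

```python
import math

d = 8                                   # hypothetical number of customers

def p(x):                               # hypothetical purchase probability at price x
    return 1.0 / (1.0 + math.exp(x))

def dp_dx(x):                           # derivative of the logistic form above
    return -p(x) * (1.0 - p(x))

def pmf(xi, x):                         # Pr(xi | x) = phi(p(x), xi), binomial demand
    return math.comb(d, xi) * p(x) ** xi * (1.0 - p(x)) ** (d - xi)

def c(xi):                              # hypothetical cost term
    return 0.5 * xi ** 2

def grad_with_baseline(x, delta):
    """Exact E[(c(xi) - delta) * dp/dx * (d phi/dp)/phi] by enumeration over xi."""
    total = 0.0
    for xi in range(d + 1):
        score = xi / p(x) - (d - xi) / (1.0 - p(x))
        total += pmf(xi, x) * (c(xi) - delta) * dp_dx(x) * score
    return total

def expected_cost(x):
    return sum(pmf(xi, x) * c(xi) for xi in range(d + 1))

x, h = 0.4, 1e-5
fd = (expected_cost(x + h) - expected_cost(x - h)) / (2.0 * h)
grads = [grad_with_baseline(x, delta) for delta in (0.0, 3.0, -7.5)]
```

In this sketch the three entries of `grads` agree with each other and with `fd` up to numerical error, so the choice of $\delta$ affects only the variance of the estimator, not its mean.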

A.14 Proof of Lemma 14

Proof.

Under Assumptions 4 and 5, we have

	$\displaystyle\nabla_{\bm{x}}\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[f(\bm{x},\bm{\xi})]=-\nabla_{\bm{x}}s(\bm{x},\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[\bm{\xi}])+\nabla_{\bm{x}}\int_{\bm{\xi}\in\Xi}c(\bm{\xi})\phi(\bm{p}(\bm{x}),\bm{\xi})d\bm{\xi}.$		(16)

From Lemma 13,

	$\displaystyle\nabla_{\bm{x}}\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[f(\bm{x},\bm{\xi})]=-\nabla_{\bm{x}}s(\bm{x},\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[\bm{\xi}])+\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}\left[(c(\bm{\xi})-\delta)\frac{d\bm{p}(\bm{x})}{d\bm{x}}\frac{\nabla_{\bm{p}}\phi(\bm{p}(\bm{x}),\bm{\xi})}{\phi(\bm{p}(\bm{x}),\bm{\xi})}\right].$		(17)

Then, from (16) and (17),

	$\displaystyle\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}\left[(c(\bm{\xi})-\delta)\frac{d\bm{p}(\bm{x})}{d\bm{x}}\frac{\nabla_{\bm{p}}\phi(\bm{p}(\bm{x}),\bm{\xi})}{\phi(\bm{p}(\bm{x}),\bm{\xi})}\right]=\nabla_{\bm{x}}\int_{\bm{\xi}\in\Xi}c(\bm{\xi})\phi(\bm{p}(\bm{x}),\bm{\xi})d\bm{\xi}.$		(18)

Then,

	$\displaystyle\mathbb{E}_{\bm{\xi}^{\prime}\sim D(\bm{x})}[\|g_{2}(\bm{x},\bm{\xi}^{\prime})-\nabla_{\bm{x}}\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[f(\bm{x},\bm{\xi})]\|^{2}]$
	$\displaystyle=\mathbb{E}_{\bm{\xi}^{\prime}\sim D(\bm{x})}\left[\left\|-\nabla_{\bm{x}}s(\bm{x},\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[\bm{\xi}])+(c(\bm{\xi}^{\prime})-\delta)\frac{d\bm{p}(\bm{x})}{d\bm{x}}\frac{\nabla_{\bm{p}}\phi(\bm{p}(\bm{x}),\bm{\xi}^{\prime})}{\phi(\bm{p}(\bm{x}),\bm{\xi}^{\prime})}\right.\right.$
	$\displaystyle\qquad\qquad\qquad\left.\left.+\nabla_{\bm{x}}s(\bm{x},\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[\bm{\xi}])-\nabla_{\bm{x}}\int_{\bm{\xi}\in\Xi}c(\bm{\xi})\phi(\bm{p}(\bm{x}),\bm{\xi})d\bm{\xi}\right\|^{2}\right]$
	$\displaystyle=\mathbb{E}_{\bm{\xi}^{\prime}\sim D(\bm{x})}\left[\left\|(c(\bm{\xi}^{\prime})-\delta)\frac{d\bm{p}(\bm{x})}{d\bm{x}}\frac{\nabla_{\bm{p}}\phi(\bm{p}(\bm{x}),\bm{\xi}^{\prime})}{\phi(\bm{p}(\bm{x}),\bm{\xi}^{\prime})}-\nabla_{\bm{x}}\int_{\bm{\xi}\in\Xi}c(\bm{\xi})\phi(\bm{p}(\bm{x}),\bm{\xi})d\bm{\xi}\right\|^{2}\right]$
	$\displaystyle=\mathbb{E}_{\bm{\xi}^{\prime}\sim D(\bm{x})}\left[\left\|(c(\bm{\xi}^{\prime})-\delta)\frac{d\bm{p}(\bm{x})}{d\bm{x}}\frac{\nabla_{\bm{p}}\phi(\bm{p}(\bm{x}),\bm{\xi}^{\prime})}{\phi(\bm{p}(\bm{x}),\bm{\xi}^{\prime})}\right\|^{2}\right]-\left\|\nabla_{\bm{x}}\int_{\bm{\xi}\in\Xi}c(\bm{\xi})\phi(\bm{p}(\bm{x}),\bm{\xi})d\bm{\xi}\right\|^{2}$
	$\displaystyle\leq\mathbb{E}_{\bm{\xi}^{\prime}\sim D(\bm{x})}\left[\left\|(c(\bm{\xi}^{\prime})-\delta)\frac{d\bm{p}(\bm{x})}{d\bm{x}}\frac{\nabla_{\bm{p}}\phi(\bm{p}(\bm{x}),\bm{\xi}^{\prime})}{\phi(\bm{p}(\bm{x}),\bm{\xi}^{\prime})}\right\|^{2}\right]=\mathbb{E}_{\bm{\xi}^{\prime}\sim D(\bm{x})}\left[\left\|(c(\bm{\xi}^{\prime})-\delta)\frac{\nabla_{\bm{x}}\mathrm{Pr}(\bm{\xi}^{\prime}\mid\bm{x})}{\mathrm{Pr}(\bm{\xi}^{\prime}\mid\bm{x})}\right\|^{2}\right]$
	$\displaystyle\leq\mathbb{E}_{\bm{\xi}^{\prime}\sim D(\bm{x})}\left[(2c_{\max})^{2}M^{2}\right]=4(c_{\max}M)^{2},$

where the first equality comes from (16), and the third equality follows from (18). The second inequality follows from condition (iii) of Assumption 1. ∎
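The variance decomposition and the final bound can be illustrated on a small example. The following Python sketch computes, by exact enumeration for a one-dimensional binomial demand, the mean, second moment, and variance of $(c(\xi)-\delta)\,\nabla_{x}\mathrm{Pr}(\xi\mid x)/\mathrm{Pr}(\xi\mid x)$ for several $\delta\in[-c_{\max},c_{\max}]$. The demand model, the cost values, and the way `M` is obtained (as the largest score magnitude in this toy instance, standing in for the constant $M$ of Assumption 1) are assumptions made for illustration.

```python
import math

d, x = 8, 0.4                           # hypothetical problem size and price

p = 1.0 / (1.0 + math.exp(x))           # hypothetical purchase probability at x
dp_dx = -p * (1.0 - p)
xis = range(d + 1)
pmf = [math.comb(d, xi) * p ** xi * (1.0 - p) ** (d - xi) for xi in xis]
# d/dx log Pr(xi | x) for each outcome, via the chain rule
score_x = [dp_dx * (xi / p - (d - xi) / (1.0 - p)) for xi in xis]
c = [0.5 * xi ** 2 for xi in xis]       # hypothetical cost term

c_max = max(abs(v) for v in c)
M = max(abs(s) for s in score_x)        # stands in for the constant M in this toy case

def moments(delta):
    """Return (mean, second moment, variance) of (c(xi) - delta) * score_x."""
    g = [(ci - delta) * s for ci, s in zip(c, score_x)]
    mean = sum(w * gi for w, gi in zip(pmf, g))
    second = sum(w * gi ** 2 for w, gi in zip(pmf, g))
    var = sum(w * (gi - mean) ** 2 for w, gi in zip(pmf, g))
    return mean, second, var

results = {delta: moments(delta) for delta in (-c_max, 0.0, 0.5 * c_max, c_max)}
```

In each case the variance equals the second moment minus the squared mean, and the second moment stays below $4(c_{\max}M)^{2}$ because $|c(\xi)-\delta|\leq 2c_{\max}$ whenever $|\delta|\leq c_{\max}$.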

A.15 Proof of Theorem 15

Proof.

When $\delta_{k}\in[-c_{\max},c_{\max}]$ , the output $\delta_{k+1}$ of Algorithm 3 is included in $[-c_{\max},c_{\max}]$ from $\frac{1}{m_{k}}\sum_{\ell=1}^{m_{k}}c(\bm{\xi}^{\ell})\in[-c_{\max},c_{\max}]$ , $\zeta_{k+1}\in(0,1)$ , and the update rule for $\delta_{k}$ in Algorithm 3. Therefore, $\delta_{k}\in[-c_{\max},c_{\max}]$ for all $k\in[R]$ from $\delta_{1}\in[-c_{\max},c_{\max}]$ . From Lemmas 8, 13, 14, and [Ghadimi and Lan, 2016, Corollary 6], we have

	$\displaystyle\mathbb{E}[\|\mathcal{G}(\bm{x}_{R}^{md},{\beta_{R}})\|^{2}]\leq 96L_{Ef}\left[\frac{4L_{Ef}\|\bm{x}_{0}-\bm{x}^{*}\|^{2}}{N(N+1)(N+2)}+\frac{L_{Ef}(\|\bm{x}^{*}\|^{2}+H^{2})+2\tilde{D}^{2}}{N}\right].$

Here, in [Ghadimi and Lan, 2016, Corollary 6], we let $L_{\psi}:=L_{Ef}$, $L_{f}:=L_{Ef}$, and $\sigma^{2}:=4(c_{\max}M)^{2}$. Then, as in the proof of Theorem 10, an $\varepsilon$-stationary point is obtained if the iteration number $\hat{N}$ satisfies

	$\displaystyle\hat{N}^{3}\geq\frac{768L_{Ef}^{2}}{\varepsilon^{2}}\|\bm{x}_{0}-\bm{x}^{*}\|^{2},\ \hat{N}\geq\frac{192L_{Ef}}{\varepsilon^{2}}(L_{Ef}(\|\bm{x}^{*}\|^{2}+H^{2})+2\tilde{D}^{2}).$

∎

Appendix B Details of our experiments

B.1 Common Settings

All experiments were conducted on a computer with an AMD EPYC 7413 24-Core Processor, 503.6 GiB of RAM, and Ubuntu 20.04.6 LTS. The program code was implemented in Python 3.8.3.

B.2 Settings of Baselines

L2-Regularized Repeated Gradient Descent (L2-RGD( $\alpha$ )): This method is described in Section 2.2. We used the fixed step size $\eta_{k}:=0.01$ at each iteration $k$ .
Bayesian Optimization (BO): We used GPyOpt, a Python open-source library for Bayesian optimization [GPyOpt-authors, 2016]. We used the default setting of the library for parameters other than the termination criteria.
Simultaneous Perturbation Stochastic Approximation (SPSA): At each iteration $k$ , this method updates the current iterate by using the stochastic perturbation gradient:

	$\displaystyle\frac{f(\bm{x}^{k}+c_{k}\Delta^{k},\bm{\xi}^{k,1})-f(\bm{x}^{k}-c_{k}\Delta^{k},\bm{\xi}^{k,2})}{c_{k}\Delta^{k}},$

where $c_{k}:=\frac{1}{(k+1)^{0.101}}$, each element of $\Delta^{k}$ is sampled from a Rademacher distribution (i.e., it takes the values $\pm 1$ with probability $0.5$ each), and $\bm{\xi}^{k,1}$ and $\bm{\xi}^{k,2}$ are random vectors sampled from the distribution $D(\bm{x}^{k})$. We set $a_{k}:=\frac{0.16}{(100+k+1)^{0.602}}$ as the step size at each iteration. The settings of $c_{k}$, $\Delta^{k}$, and $a_{k}$ are based on [Spall, 1998, Section III].
Projected Sub-gradient Descent for Average Demand (PSD-AD): This method is a projected subgradient descent method for

	$\displaystyle\min_{\bm{x}\in[x_{\min},x_{\max}]^{n}}(-s(\bm{x},\bar{\bm{\xi}}(\bm{x}))+c(\bar{\bm{\xi}}(\bm{x}))),$

where $\bar{\bm{\xi}}(\bm{x}):=\mathbb{E}_{\bm{\xi}\sim D(\bm{x})}[\bm{\xi}]$ represents the average demand for $\bm{x}$. At each iteration, we chose the step size by repeatedly multiplying it by $\delta=0.9$ until the objective value decreased.
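As a rough illustration of the SPSA baseline described above, the following Python sketch implements the perturbation gradient with the stated $c_{k}$ and $a_{k}$ and applies it to a toy pricing problem. It is not the code used in the experiments: the demand model in `sample`, the objective `f`, the box bounds, the clipping of iterates to the box, and all function names are assumptions made for illustration.

```python
import math
import random

def spsa_gradient(f, sample, x, k, rng):
    """SPSA gradient estimate at iterate x for iteration k."""
    c_k = 1.0 / (k + 1) ** 0.101
    delta = [rng.choice((-1.0, 1.0)) for _ in x]          # Rademacher perturbation
    xi1, xi2 = sample(x, rng), sample(x, rng)             # two draws from D(x)
    x_plus = [xj + c_k * dj for xj, dj in zip(x, delta)]
    x_minus = [xj - c_k * dj for xj, dj in zip(x, delta)]
    diff = f(x_plus, xi1) - f(x_minus, xi2)
    return [diff / (c_k * dj) for dj in delta]            # elementwise division

def spsa_step(f, sample, x, k, rng, x_min, x_max):
    """One descent step with step size a_k; clipping to the box is an assumption."""
    a_k = 0.16 / (100 + k + 1) ** 0.602
    g = spsa_gradient(f, sample, x, k, rng)
    return [min(max(xj - a_k * gj, x_min), x_max) for xj, gj in zip(x, g)]

# Toy problem (hypothetical): demand for product j is Binomial(20, p(x_j)).
def sample(x, rng):
    return [sum(rng.random() < 1.0 / (1.0 + math.exp(xj - 1.0)) for _ in range(20))
            for xj in x]

def f(x, xi):
    return -sum(xj * dj for xj, dj in zip(x, xi))         # negative revenue

rng = random.Random(0)
x = [0.5, 0.5, 0.5]
for k in range(200):
    x = spsa_step(f, sample, x, k, rng, x_min=0.0, x_max=3.0)
```

Because the same scalar difference is divided by each $\pm c_{k}$ entry, all components of the estimate have equal magnitude and differ only in sign.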
