Online Stackelberg Optimization via Nonlinear Control

William Brown Columbia University & Morgan Stanley MLR. Email: [email protected]. Christos Papadimitriou Columbia University. Email: [email protected]. Tim Roughgarden Columbia University & a16z crypto. Email: [email protected]

(June 27, 2024)

Abstract

In repeated interaction problems with adaptive agents, our objective often requires anticipating and optimizing over the space of possible agent responses. We show that many problems of this form can be cast as instances of online (nonlinear) control which satisfy local controllability, with convex losses over a bounded state space which encodes agent behavior, and we introduce a unified algorithmic framework for tractable regret minimization in such cases. When the instance dynamics are known but otherwise arbitrary, we obtain oracle-efficient $O(\sqrt{T})$ regret by reduction to online convex optimization, which can be made computationally efficient if dynamics are locally action-linear. In the presence of adversarial disturbances to the state, we give tight bounds in terms of either the cumulative or per-round disturbance magnitude (for strongly or weakly locally controllable dynamics, respectively). Additionally, we give sublinear regret results for the cases of unknown locally action-linear dynamics as well as for the bandit feedback setting. Finally, we demonstrate applications of our framework to well-studied problems including performative prediction, recommendations for adaptive agents, adaptive pricing of real-valued goods, and repeated gameplay against no-regret learners, directly yielding extensions beyond prior results in each case.

1 Introduction

Machine learning problems involving strategic or adaptive agents are commonly framed as Stackelberg games, wherein the leader aims to commit to an optimal strategy in anticipation of the follower’s best response. This approach has been effectively applied to challenges ranging from performative feature manipulation (Hardt et al., 2015; Dong et al., 2018; Perdomo et al., 2020; Jagadeesan et al., 2022b) and optimal pricing (Roth et al., 2015; Daskalakis and Syrgkanis, 2015; Nedelec et al., 2020) to resource allocation in security games (Blum et al., 2014; Balcan et al., 2015; Alcantara-Jiménez and Clempner, 2020) and learning in tabular games (Letchford et al., 2009; Peng et al., 2019; Lauffer et al., 2022; Collina et al., 2023), often with a regret minimization objective. Additionally, several of these settings have been independently extended to account for agents that may update their strategies gradually over time rather than optimally responding in each round (Zrnic et al., 2021a; Brown et al., 2022; Braverman et al., 2017; Deng et al., 2019; Brown et al., 2023). Despite their conceptual similarities, these problems have largely been approached as distinct areas of study, each with their own growing body of techniques. Our aim in this work is to offer a unifying perspective and algorithmic approach for problems of this form, through the lens of online control.

For the broad family of online “Stackelberg-style” optimization problems, the language of control is quite natural to adopt: we are navigating a dynamical system where states corresponding to agent strategies evolve as a function of our own actions, and where objectives which consider best-response stability can be expressed in terms of the stationary behavior of this system. Our results consider a general class of online control instances for representing such problems, which we introduce in Section 2, and in Section 3 we give a sequence of no-regret algorithms for these instances satisfying a range of robustness properties. In Section 4, we show that several online optimization problems involving adaptive agents, including variants of online performative prediction (as in Kumar et al. (2022)), online recommendations (as in Agarwal and Brown (2023)), adaptive pricing (as in Roth et al. (2015)), and learning in time-varying games (as in Anagnostides et al. (2023)) can be embedded in our framework and solved by our algorithms.

While there has been a great deal of recent progress in online linear control, yielding algorithms which can optimize over stabilizing linear policies even with general convex costs, adversarial disturbances, and unknown dynamics (Agarwal et al., 2019a; Simchowitz et al., 2020; Cassel et al., 2022; Minasyan et al., 2022), the required assumptions and regret benchmarks for these algorithms do not always type-check with the settings we are interested in. For the examples we consider, we will often wish to allow for nonlinear dynamics (e.g. encoding an agent’s utility function) and explicitly bounded spaces (e.g. via projection into the simplex), and we will seek to compete with regret benchmarks which correspond to stable responses by the agent. Unfortunately, as we show in Proposition 2, the latter goal is incompatible with linear policies even under linear dynamics and in the absence of any disturbances: the performance of every linear policy can be $\Omega(T)$ worse than the best policy in the class of affine “state-targeting” policies.

In contrast, the orthogonal set of assumptions we identify enables tractable regret minimization even for nonlinear control problems and comports with the requirements of Stackelberg optimization across a wide range of settings, including the ability to compete with state-targeting policies. For convex and compact action and state spaces $\operatorname{\mathcal{X}}$ and $\operatorname{\mathcal{Y}}$, our first key assumption is that the dynamics $D(x,y):\operatorname{\mathcal{X}}\times\operatorname{\mathcal{Y}}\rightarrow\operatorname{\mathcal{Y}}$ satisfy a notion of local controllability. While local controllability is well-studied for continuous-time and asymptotic control (Aoki, 1974; Kuhn and Wohltmann, 1989; Barbero-Liñán and Jakubczyk, 2013; Boscain et al., 2021), we are unaware of any prior applications to finite-time online optimization, and we adapt existing definitions to be appropriate for this setting. We say that $D(x,y)$ is strongly locally controllable if every state in a fixed-radius ball around $y$ is reachable in a single round by an appropriate choice of $x$, and that $D(x,y)$ is weakly locally controllable if the reachable radius around $y$ is allowed to vanish near the boundary of $\operatorname{\mathcal{Y}}$. We also assume that our loss $f_{t}$ in each round is determined by (or well-approximated by) an adversarially-chosen convex function depending only on the state $y_{t}$.

When these conditions hold, we show in Theorem 1 that this is sufficient to obtain $O(\sqrt{T})$ regret with respect to the loss of the best fixed state, provided that dynamics are known and we have offline access to an oracle for non-convex optimization; the oracle call can be removed if dynamics are locally action-linear, i.e. given by (or locally well-approximated by) a function linear in $x$ at each fixed $y$ . If adversarial disturbances to the dynamics are present, our approach can be extended for both weakly (Theorem 2) and strongly (Theorem 3) locally controllable dynamics with additional regret scaling linearly in total disturbance magnitude, provided that each round’s disturbance cannot be too large in the case of weak local controllability; we give lower bounds showing that each dependence on disturbance magnitude is tight. The aforementioned results all extend to the case where the dynamics (absent disturbances) are given by a known but time-dependent function $D_{t}(x,y)$ . If dynamics are unknown but time-invariant, and locally action-linear with appropriate regularity parameters, we obtain sublinear regret provided that a “near-stabilizing” action is known at $t=1$ . We additionally extend our approach to the bandit feedback setting, where we obtain $O(T^{3/4})$ regret. In Section 4 we show that each of the following, with appropriate assumptions, can be cast as a locally controllable instance with state-only convex surrogate losses:

• Performative prediction: Minimize prediction loss $\operatorname*{\mathbb{E}}_{z\sim p_{t}}f_{t}(x_{t},z)$ for a classifier $x_{t}$, where the distribution $p_{t}$ in each round is updated according to the prior classifier and distribution.

• Adaptive recommendations: Maximize the reward $f_{t}(i_{t})$ when showing menus $K_{t}\subseteq[n]$ of size $k\ll n$ to an agent, whose choice $i_{t}\sim p(K_{t},v_{t})$ in each round depends on preferences which are influenced by choices in prior rounds (encoded in the “memory vector” $v_{t}$).

• Adaptive pricing: Maximize profit $\langle p_{t},x_{t}\rangle-c_{t}(x_{t})$ for selling bundles of goods $x_{t}$ to an agent at prices $p_{t}$ and with costs $c_{t}$, where the agent’s purchased bundle $x_{t}$ is a function of their utility function, consumption rate, and existing reserves.

• Repeated gameplay: Maximize the reward $x_{t}^{\top}A_{t}y_{t}$ obtained from playing a sequence of time-varying games $(A_{t},B_{t})$ against a no-regret learning agent.

In each case, application of our algorithms from Section 3 yields results which extend beyond the applicability regimes of prior work, such as by enabling relaxation of previous assumptions or a novel extension to adversarial or dynamic problem variants.

1.1 Related Work

Online control.

Much of the recent progress in online control (Agarwal et al., 2019a, b; Cassel et al., 2022; Minasyan et al., 2022) considers linear systems with general convex losses, benchmarking against a class of (“strongly stable”) fast-mixing linear policies introduced for linear-quadratic control (Cohen et al., 2018) by leveraging the framework of “OCO with memory” (Anava et al., 2014). Results have also been shown for nonlinear policy classes via neural networks (Chen et al., 2022), and for nonlinear dynamics with oracles in episodic settings (Kakade et al., 2020), via approximation with random Fourier features (Lale et al., 2021; Luo et al., 2022), via adaptive regret for time-varying linear systems (Gradu et al., 2022; Minasyan et al., 2022), and via dynamic regret over actions in terms of disturbance “attenuation” (Muthirayan and Khargonekar, 2022). For a further overview of online control and its historical context, see Hazan and Singh (2022). In contrast to the bulk of prior work in which states and actions are bounded implicitly via policy stability notions, we consider state and action spaces which are bounded explicitly, as enabled by nonlinearity in dynamics (e.g. via projection, or range decay of dynamics near the boundary). These works also view disturbances as intrinsic to the system, and account for their influence directly in regret benchmarks (the “optimal policy” will face the same sequence of disturbances in hindsight, regardless of state). Within the context of Stackelberg optimization where a fixed protocol largely determines an agent’s strategy updates, we view the role of disturbances as more akin to adversarial corruptions as considered in reinforcement learning (Lykouris et al., 2021; Zhang et al., 2021); while we incur linear dependence, our regret benchmarks are agnostic to alternate counterfactual disturbance sequences.

Strategizing against learners.

Initially formulated within the context of repeated auctions (Braverman et al., 2017), a recent line of work has considered the problem of optimizing long-run rewards in a repeated game against a no-regret learner across a range of tabular and Bayesian settings (Deng et al., 2019; Mansour et al., 2022; Brown et al., 2023; Zhang et al., 2023). While bounds on attainable reward have been known in terms of the Price of Anarchy (Blum et al., 2008; Hartline et al., 2015b), this sequence of results has highlighted important connections with Stackelberg equilibria: the Stackelberg value of the game is attainable on average against any no-regret learner, and it is the maximum attainable value against many common no-regret algorithms (such as no-swap learners, as shown by Deng et al. (2019)). This theme has emerged in other simultaneous learning settings as well; notably, Zrnic et al. (2021b) show that long-run outcomes in strategic classification are shaped by relative learning rates between parties, which can designate either as the Stackelberg leader.

Nested convex optimization.

The technique of identifying convex structure nested inside a more general problem has been applied broadly across a range of online optimization settings (Neu and Olkhovskaya, 2021; Shen et al., 2023; Flokas et al., 2019). For repeated interaction problems involving an agent with unknown utility, such as optimal pricing, Roth et al. (2015) identify utility conditions under which the non-convex objective over prices becomes convex in the space of agent actions, and where explorability properties resembling local controllability hold, which enables convex optimization by locally learning agent preferences; this “revealed preferences” approach has also been applied to strategic classification (Dong et al., 2018). In recent work concerning recommendations for agents with history-dependent preferences (Agarwal and Brown, 2022, 2023), properties related to local controllability are leveraged to enable tractable optimization as well. We consider each of these settings as applications in Section 4.

2 Model and Preliminaries

Let $\operatorname{\mathcal{X}}$ and $\operatorname{\mathcal{Y}}$ be convex and compact subsets of Euclidean space, respectively denoting the action and state spaces, where we assume $\dim(\operatorname{\mathcal{X}})\geq\dim(\operatorname{\mathcal{Y}})$ . Further, we assume that $\operatorname{\mathcal{Y}}$ contains a ball of radius $r$ around the origin $\mathbf{0}$ , and is contained in a ball of radius $R$ around the origin.

An instance of our control problem consists of choosing a sequence of actions $\{x_{t}\in\operatorname{\mathcal{X}}\}$ over $T$ rounds, which will yield a sequence of states $\{y_{t}\in\operatorname{\mathcal{Y}}\}$ , and we will incur losses determined by adversarially chosen functions $\{f_{t}\}$ . Let the initial state be $y_{0}=\mathbf{0}$ . In the basic version of our problem, upon choosing each $x_{t}$ for rounds $t\in[T]$ , we observe the state update to

$$y_{t}=D(x_{t},y_{t-1}),$$

where $D:\operatorname{\mathcal{X}}\times\operatorname{\mathcal{Y}}\rightarrow\operatorname{\mathcal{Y}}$ is an arbitrary continuous function which we refer to as the dynamics of our problem. We sometimes allow disturbances to the dynamics, where $y_{t}=D(x_{t},y_{t-1})+w_{t}$ for $\{w_{t}\}$ chosen adversarially. In some cases we allow time-varying dynamics $D:\operatorname{\mathcal{X}}\times\operatorname{\mathcal{Y}}\times[T]\rightarrow\operatorname{\mathcal{Y}}$, where the dynamics in each round are denoted by $D_{t}(x_{t},y_{t-1})$.
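To make the state update concrete, the following minimal sketch rolls out a toy one-dimensional instance with $\operatorname{\mathcal{Y}}=[-1,1]$ and $y_{0}=0$. The dynamics `toy_dynamics` (an additive update followed by projection) and the helper `rollout` are illustrative assumptions and do not appear in the paper.

```python
from typing import Callable, List, Optional, Sequence


def clip(value: float, low: float, high: float) -> float:
    """Euclidean projection of a scalar onto the interval [low, high]."""
    return max(low, min(high, value))


def toy_dynamics(x: float, y: float) -> float:
    """Hypothetical D(x, y): shift the state by x, then project onto Y = [-1, 1]."""
    return clip(y + x, -1.0, 1.0)


def rollout(
    dynamics: Callable[[float, float], float],
    actions: Sequence[float],
    disturbances: Optional[Sequence[float]] = None,
) -> List[float]:
    """Return the states y_1, ..., y_T produced from the initial state y_0 = 0."""
    y = 0.0
    states = []
    for t, x in enumerate(actions):
        # Disturbances are assumed small enough that the state stays inside Y.
        w = disturbances[t] if disturbances is not None else 0.0
        y = dynamics(x, y) + w
        states.append(y)
    return states
```

For example, `rollout(toy_dynamics, [0.5, 0.5, 0.5])` returns `[0.5, 1.0, 1.0]`: the state saturates at the boundary once the projection becomes active.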

Here and in Section 3, we assume that our loss in round $t$ is given by $f_{t}(y_{t})$, where each $f_{t}$ is an $L$-Lipschitz convex function revealed after playing $x_{t}$; we relax these assumptions for some of our applications in Section 4, e.g. to allow dependence on $x_{t}$ as well. We will generally measure performance with respect to the best fixed state, and the regret for an algorithm $\operatorname{\mathcal{A}}$ yielding $\{y_{t}\}$ is

$$\operatorname{\textup{{Reg}}}_{T}(\operatorname{\mathcal{A}})=\sum_{t=1}^{T}f_{t}(y_{t})-\min_{y\in\operatorname{\mathcal{Y}}}\sum_{t=1}^{T}f_{t}(y).$$
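As a rough illustration of this benchmark, the sketch below evaluates the regret of a given state sequence in one dimension. The function names are hypothetical, and the minimum over $\operatorname{\mathcal{Y}}$ is approximated by a minimum over a finite list of candidate states supplied by the caller, which is an assumption made only to keep the example short.

```python
from typing import Callable, Sequence


def regret(
    states: Sequence[float],
    losses: Sequence[Callable[[float], float]],
    candidates: Sequence[float],
) -> float:
    """Total incurred loss minus the loss of the best fixed candidate state."""
    incurred = sum(f(y) for f, y in zip(losses, states))
    # Approximation: the best fixed state is searched over `candidates` only.
    best_fixed = min(sum(f(y) for f in losses) for y in candidates)
    return incurred - best_fixed


def absolute_loss(center: float) -> Callable[[float], float]:
    """A convex 1-Lipschitz loss f(y) = |y - center|, used here as a toy example."""
    return lambda y: abs(y - center)
```

With losses centered at $0.5$, $0.5$, and $-0.5$, staying at the origin for three rounds incurs total loss $1.5$, while the fixed state $0.5$ incurs $1.0$, so the regret over the grid $\{-1,-0.5,0,0.5,1\}$ is $0.5$.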

In Proposition 2, we relate this benchmark to the class of “state-targeting” policies, which can sometimes be expressed by affine functions, and we compare their performance to linear policies. Throughout, we use $\left\lVert\cdot\right\rVert$ to denote the Euclidean norm, and we let $\operatorname{\mathcal{B}}_{\epsilon}(y)=\{\hat{y}:\left\lVert y-\hat{y}\right\rVert\leq\epsilon\}$ denote the norm ball of radius $\epsilon$ around $y$. We let $\Pi_{\operatorname{\mathcal{Y}}}(\cdot)$ denote Euclidean projection into the set $\operatorname{\mathcal{Y}}$; $\mathbf{u}_{n}$ denotes the uniform distribution over $n$ items, and $\Delta(n)$ denotes the probability simplex.

2.1 Locally Controllable Dynamics

A number of properties under the name “local controllability” have been considered for various continuous-time and asymptotic control settings (Aoki, 1974; Kuhn and Wohltmann, 1989; Barbero-Liñán and Jakubczyk, 2013; Boscain et al., 2021), generally relating to the notion that all states in a neighborhood around a given state are reachable. We give two formulations of local controllability for our setting, which we take as properties of the dynamics $D$ holding over all inputs.

Definition 1 (Weak Local Controllability).

For $\rho\in(0,1]$, an instance $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D)$ satisfies (weak) $\rho$-local controllability if for any $y\in\operatorname{\mathcal{Y}}$ and $y^{*}\in\operatorname{\mathcal{B}}_{\rho\cdot\pi(y)}(y)$, there is some $x$ such that $D(x,y)=y^{*}$, where $\pi(y)=\min_{\hat{y}\in\operatorname*{bd}(\operatorname{\mathcal{Y}})}\left\lVert\hat{y}-y\right\rVert$ is the distance from $y$ to the boundary of $\operatorname{\mathcal{Y}}$.

Definition 2 (Strong Local Controllability).

For $\rho>0$ , an instance $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D)$ satisfies strong $\rho$ -local controllability if for any $y\in\operatorname{\mathcal{Y}}$ and $y^{*}\in\operatorname{\mathcal{B}}_{\rho}(y)\cap\operatorname{\mathcal{Y}}$ , there is some $x$ such that $D(x,y)=y^{*}$ .
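Both definitions can be checked by hand on a toy one-dimensional instance. The sketch below assumes $\operatorname{\mathcal{X}}=\operatorname{\mathcal{Y}}=[-1,1]$, so that $\pi(y)=1-|y|$; the two dynamics and the helper names are hypothetical examples chosen so that the reachable sets match Definitions 1 and 2 exactly, and they are not instances from the paper.

```python
def clip(value: float, low: float, high: float) -> float:
    """Euclidean projection of a scalar onto the interval [low, high]."""
    return max(low, min(high, value))


def boundary_distance(y: float) -> float:
    """pi(y): distance from y to the boundary of Y = [-1, 1]."""
    return 1.0 - abs(y)


def weak_dynamics(x: float, y: float, rho: float) -> float:
    """Reachable set from y is exactly the ball of radius rho * pi(y) around y."""
    return y + rho * boundary_distance(y) * x


def strong_dynamics(x: float, y: float, rho: float) -> float:
    """Reachable set from y is the ball of radius rho around y, intersected with Y."""
    return clip(y + rho * x, -1.0, 1.0)


def weak_action(target: float, y: float, rho: float) -> float:
    """Action x in [-1, 1] with weak_dynamics(x, y, rho) == target, if one exists."""
    radius = rho * boundary_distance(y)
    if abs(target - y) > radius:
        raise ValueError("target lies outside the weakly reachable ball")
    # On the boundary the radius is zero and only the current state is reachable.
    return 0.0 if radius == 0.0 else (target - y) / radius


def strong_action(target: float, y: float, rho: float) -> float:
    """Action x in [-1, 1] with strong_dynamics(x, y, rho) == target, if one exists."""
    if abs(target - y) > rho or abs(target) > 1.0:
        raise ValueError("target lies outside the strongly reachable set")
    return (target - y) / rho
```

In the weak instance the state can never leave a boundary point once it arrives there, whereas in the strong instance a step of size up to $\rho$ back into the interior is always available.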

We often refer to weak local controllability simply as local controllability. This property ensures that there is always some action $x_{t}$ which results in the next state $y_{t}$ staying fixed at $y_{t-1}$ , as well as some action which moves the state to any point in a surrounding ball; in the weak case, the size of the reachable ball is allowed to decay as $y_{t}$ approaches the boundary of $\operatorname{\mathcal{Y}}$ . The parameter $\rho$ controls the speed at which we can navigate the state space: when $\rho=1$ in the weak case (or $\rho\geq R$ in the strong case), we can always immediately reach some point on the boundary of $\operatorname{\mathcal{Y}}$ , yet for $\rho$ close to zero we may only be able to move in a small neighborhood. Our results use local controllability to minimize regret over $\operatorname{\mathcal{Y}}$ by reduction to online convex optimization. As we prove in Appendix A, up to a quantifier alternation which vanishes as $\rho$ approaches $0$ , a property of this form is essentially necessary: competing with the best state $y$ is impossible if we cannot remain in its neighborhood.

Proposition 1.

Suppose there is some $y\in\operatorname{\mathcal{Y}}$ and values $\alpha,\beta>0$ such that for all $\hat{y}\in\operatorname{\mathcal{B}}_{\alpha}(y)$ and $x\in\operatorname{\mathcal{X}}$ , $D(x,\hat{y})\notin\operatorname{\mathcal{B}}_{\beta}(\hat{y})$ . Then, there are losses such that $\operatorname{\textup{{Reg}}}_{T}(\operatorname{\mathcal{A}})=\Omega(T)$ for any algorithm $\operatorname{\mathcal{A}}$ .

2.2 States vs. Policies

While regret benchmarks in online control are typically expressed in terms of a reference class of policies, we note that there is a class of “state-targeting” policies which track the reward of fixed states (asymptotically, and up to the influence of disturbances), and which can be implemented if $D$ is known; we maintain the formulation in terms of fixed states for clarity with respect to our motivations for Stackelberg optimization. Existing no-regret algorithms for online control typically compete with linear policies, and choose actions each round by implementing policies which are linear in multiple past states (as in e.g. Agarwal et al. (2019a)). Here, we show that all such policies can be arbitrarily suboptimal when compared to state-targeting policies, even for dynamics which are linear up to projection and with fixed convex losses over states, as they may yield actions and states which remain fixed at $\mathbf{0}$ in every round even if the optimal state is always immediately accessible under the dynamics. We prove Proposition 2 in Appendix A.

Proposition 2.

For an instance $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D)$, let the class of state-targeting policies for $\hat{\operatorname{\mathcal{Y}}}\subseteq\operatorname{\mathcal{Y}}$ be given by $\mathcal{P}_{\hat{\operatorname{\mathcal{Y}}}}=\{P_{\hat{y}}:\hat{y}\in\hat{\operatorname{\mathcal{Y}}}\}$ where $P_{\hat{y}}(y)=\operatorname*{argmin}_{\{x\in\operatorname{\mathcal{X}}:D(x,y)\in\hat{\operatorname{\mathcal{Y}}}\}}\left\lVert D(x,y)-\hat{y}\right\rVert^{2}$. Define the regret of a policy class $\mathcal{P}$ as

$$\operatorname{\textup{{Reg}}}_{T}(\mathcal{P})=\min_{P\in\mathcal{P}}\left(\sum_{t=1}^{T}f_{t}(y_{t})\right)-\min_{y\in\operatorname{\mathcal{Y}}}\left(\sum_{t=1}^{T}f_{t}(y)\right),$$

where $y_{t}$ is updated by playing $P$ at each round. For any $\rho$-locally controllable instance, there is a set $\hat{\operatorname{\mathcal{Y}}}\subseteq\operatorname{\mathcal{Y}}$ for which $\operatorname{\textup{{Reg}}}_{T}(\mathcal{P}_{\hat{\operatorname{\mathcal{Y}}}})={O}(\sqrt{T\rho^{-1}})$. Further, for any class $\mathcal{P}_{\mathcal{K}}$ where each $K\in\mathcal{P}_{\mathcal{K}}$ is a matrix yielding actions $x_{t}=-Ky_{t-1}$, there is an instance where $\operatorname{\textup{{Reg}}}_{T}(\mathcal{P}_{\mathcal{K}})\geq\Omega(T)$ for $\rho=1$.

If dynamics are linear up to projection with $D(x,y)=\Pi_{\operatorname{\mathcal{Y}}}(By+Ax)$ for full-rank $A$, and $\dim(\operatorname{\mathcal{X}})=\dim(\operatorname{\mathcal{Y}})$, note that the affine map $P_{\hat{y}}(y)=A^{-1}(\hat{y}-By)$ implements any $P_{\hat{y}}$ for sufficiently large $\operatorname{\mathcal{X}}$.
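The contrast between the two policy classes can be seen in a scalar sketch. The code below assumes dynamics $D(x,y)=\Pi_{[-1,1]}(by+ax)$ with illustrative values $a=1$ and $b=0.5$ and an action space large enough to contain every action used; all names are hypothetical. Starting from $y_{0}=0$, a linear policy $x_{t}=-Ky_{t-1}$ never moves the state, while the affine state-targeting policy reaches and holds its target.

```python
from typing import Callable, List


def clip(value: float, low: float, high: float) -> float:
    """Euclidean projection of a scalar onto the interval [low, high]."""
    return max(low, min(high, value))


def linear_dynamics(x: float, y: float, a: float = 1.0, b: float = 0.5) -> float:
    """Scalar dynamics that are linear up to projection onto Y = [-1, 1]."""
    return clip(b * y + a * x, -1.0, 1.0)


def state_targeting_policy(target: float, a: float = 1.0, b: float = 0.5) -> Callable[[float], float]:
    """Affine policy y -> (target - b * y) / a, the scalar case of A^{-1}(y_hat - B y)."""
    return lambda y: (target - b * y) / a


def linear_policy(k: float) -> Callable[[float], float]:
    """Linear policy y -> -k * y."""
    return lambda y: -k * y


def run_policy(policy: Callable[[float], float], rounds: int) -> List[float]:
    """Play the policy from y_0 = 0 and return the visited states."""
    y = 0.0
    states = []
    for _ in range(rounds):
        y = linear_dynamics(policy(y), y)
        states.append(y)
    return states
```

If every loss is minimized at a state away from the origin, the linear policy therefore pays a constant gap in every round, regardless of the gain $K$.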

3 No-Regret Algorithms for Locally Controllable Dynamics

Here we give a sequence of no-regret algorithms satisfying a range of robustness properties. Our primary algorithm $\operatorname{\textup{{NestedOCO}}}$ , presented in Section 3.1, operates over known time-varying dynamics without disturbances and requires an offline non-convex optimization oracle, and we identify conditions in Section 3.2 which remove the oracle requirement. In Section 3.3 we give two algorithms, $\operatorname{\textup{{NestedOCO-BD}}}$ and $\operatorname{\textup{{NestedOCO-UD}}}$ , which allow adversarial disturbances to weakly and strongly locally controllable dynamics, respectively. In Section 3.4 we extend $\operatorname{\textup{{NestedOCO}}}$ to accommodate unknown dynamics under appropriate regularity conditions (provided an initial “approximately stabilizing” action is known at $t=1$ ), and in Section 3.5 we give an algorithm which obtains $O(T^{3/4})$ regret under bandit feedback.

3.1 Nonlinear Control via Online Convex Optimization

When dynamics satisfy local controllability and $y_{t-1}$ is not too close to $\operatorname*{bd}(\operatorname{\mathcal{Y}})$, all points $y_{t}$ in a ball around $y_{t-1}$ are feasible with an appropriate $x_{t}$; this enables execution of an online convex optimization (OCO) algorithm over $\operatorname{\mathcal{Y}}$ by playing the action $x_{t}$ which yields a state update to the target $y_{t}$ chosen at each iteration, computed via offline non-convex optimization. Here we assume that $D$ is known and can be queried for any inputs, and that disturbances to the state are not present. We allow the dynamics to change over time, potentially as a function of previous actions $x_{s}$ and losses $f_{s}$ for $s<t$, provided that $D_{t}$ can be determined in each round. We use Follow the Regularized Leader ($\operatorname{\textup{{FTRL}}}$) as our OCO subroutine (Shalev-Shwartz and Singer, 2006; Abernethy et al., 2008), yet we note that it may be replaced by any OCO algorithm whose per-round step size is guaranteed to be sufficiently small (such as OGD with a constant learning rate); statements of the $\operatorname{\textup{{FTRL}}}$ algorithm and its key properties are provided in Appendix B. We instantiate $\operatorname{\textup{{FTRL}}}$ over a contracted space $\tilde{\operatorname{\mathcal{Y}}}\subseteq\operatorname{\mathcal{Y}}$, calibrated to ensure that the minimum loss over $\tilde{\operatorname{\mathcal{Y}}}$ is close to that for $\operatorname{\mathcal{Y}}$, yet where each step of $\operatorname{\textup{{FTRL}}}$ lies within the feasible region ensured by (weak) local controllability.

Algorithm 1 Nested Online Convex Optimization ($\operatorname{\textup{{NestedOCO}}}$)

Let $\psi:\operatorname{\mathcal{Y}}\rightarrow\operatorname{\mathbb{R}}$ be $\gamma$-strongly convex with $\text{argmin}_{y}\psi(y)=\mathbf{0}$ and $\max_{y,y^{\prime}}\left\lvert\psi(y)-\psi(y^{\prime})\right\rvert\leq G$
Let $\eta=(G\gamma)^{1/2}((1+\frac{R}{r\rho})TL^{2})^{-1/2}$
Let $\widetilde{\operatorname{\mathcal{Y}}}=\{y:\frac{1}{1-\delta}y\in\operatorname{\mathcal{Y}}\}$ for $\delta=\eta\frac{L}{r\rho\gamma}$
Initialize $\operatorname{\textup{{FTRL}}}$ to run for $T$ rounds over $\widetilde{\operatorname{\mathcal{Y}}}$ with regularizer $\psi$ and parameter $\eta$
for $t=1$ to $T$ do
    Let $y^{*}$ be the point chosen by $\operatorname{\textup{{FTRL}}}$
    Use $\texttt{Oracle}(y_{t-1},y^{*})$ to compute $x_{t}=\operatorname*{argmin}_{x}\left\lVert D_{t}(x,y_{t-1})-y^{*}\right\rVert^{2}$
    Play action $x_{t}$
    Observe $y_{t}$ and loss $f_{t}(y_{t})$, update $\operatorname{\textup{{FTRL}}}$
end for

Theorem 1.

For a $\rho$ -locally controllable instance $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D)$ without disturbances and with $D_{t}$ known at each $t$ , the regret of $\operatorname{\textup{{NestedOCO}}}$ for convex $L$ -Lipschitz losses $f_{t}:\operatorname{\mathcal{Y}}\rightarrow\operatorname{\mathbb{R}}$ is at most

$$\operatorname{\textup{{Reg}}}_{T}(\operatorname{\textup{{NestedOCO}}})\leq 2L\sqrt{(1+{R}(r\rho)^{-1})TG\gamma^{-1}}$$

with respect to any state $y^{*}\in\operatorname{\mathcal{Y}}$ , with $T$ queries made to a non-convex optimization oracle.

The proof for Theorem 1 is given in Appendix C.
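The structure of $\operatorname{\textup{{NestedOCO}}}$ can be sketched on a toy one-dimensional instance. The code below makes several assumptions that are not from the paper: $\operatorname{\mathcal{Y}}=[-1,1]$ (so $R=r=1$), dynamics $D(x,y)=y+\rho(1-|y|)x$ with $x\in[-1,1]$, the regularizer $\psi(y)=y^{2}/2$ (so $\gamma=1$ and $G=1/2$), and $\operatorname{\textup{{FTRL}}}$ run on linearized losses, for which the iterate is the projection of $-\eta\sum_{s<t}g_{s}$ onto $\widetilde{\operatorname{\mathcal{Y}}}$. The oracle call is replaced by a closed-form inversion that is specific to this toy dynamics.

```python
import math
from typing import Callable, List, Sequence


def clip(value: float, low: float, high: float) -> float:
    """Euclidean projection of a scalar onto the interval [low, high]."""
    return max(low, min(high, value))


def absolute_loss_gradient(center: float) -> Callable[[float], float]:
    """A subgradient of the toy loss f(y) = |y - center|."""
    return lambda y: float((y > center) - (y < center))


def nested_oco_1d(
    gradients: Sequence[Callable[[float], float]],
    rho: float,
    lipschitz: float = 1.0,
) -> List[float]:
    """Toy NestedOCO; assumes T is large enough that delta < 1."""
    horizon = len(gradients)
    outer_radius = inner_radius = 1.0  # R and r for Y = [-1, 1]
    gamma, psi_range = 1.0, 0.5  # psi(y) = y^2 / 2 on Y
    eta = math.sqrt(psi_range * gamma) / math.sqrt(
        (1.0 + outer_radius / (inner_radius * rho)) * horizon * lipschitz ** 2
    )
    delta = eta * lipschitz / (inner_radius * rho * gamma)
    bound = 1.0 - delta  # contracted space is [-bound, bound]
    y, gradient_sum, states = 0.0, 0.0, []
    for gradient in gradients:
        # FTRL iterate for linearized losses with the quadratic regularizer.
        target = clip(-eta * gradient_sum, -bound, bound)
        # Stand-in for Oracle(y_{t-1}, y*): invert the toy dynamics in closed form.
        radius = rho * (1.0 - abs(y))
        x = clip((target - y) / radius, -1.0, 1.0)
        y = y + radius * x  # state update y_t = D(x_t, y_{t-1})
        states.append(y)
        gradient_sum += gradient(y)  # loss revealed after playing x_t
    return states
```

Because each $\operatorname{\textup{{FTRL}}}$ step has length at most $\eta L/\gamma$ and every point of the contracted space has reachable radius at least $\rho\delta r=\eta L/\gamma$, the computed action never needs to be clipped in exact arithmetic; the `clip` on `x` only guards against rounding error.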

3.2 Efficient Updates for Action-Linear Dynamics

While $\operatorname{\textup{{NestedOCO}}}$ requires no assumptions on the dynamics beyond local controllability, there are large classes of dynamics for which the oracle call can be removed. We say that dynamics are action-linear if $y_{x}=D(x,y)$ is linear in $x$, for $y_{x}\in\operatorname*{int}(\operatorname{\mathcal{Y}})$ (and arbitrary for $y_{x}\in\operatorname*{bd}(\operatorname{\mathcal{Y}})$).

Proposition 3.

For a $\rho$ -locally controllable and action-linear instance $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D)$ , the per-round optimization problem for $\texttt{{Oracle}}(y_{t-1},y^{*})$ in $\operatorname{\textup{{NestedOCO}}}$ is convex.

Proof

For $y=y_{t-1}\in\widetilde{\operatorname{\mathcal{Y}}}\subseteq\operatorname*{int}(\operatorname{\mathcal{Y}})$, we have $D(x,y)=A_{y}\cdot x+b_{y}$ for some matrix $A_{y}$ and vector $b_{y}$, and so we can solve $x_{t}=\operatorname*{argmin}_{x\in\operatorname{\mathcal{X}}}\left\lVert A_{y}\cdot x+b_{y}-y^{*}\right\rVert^{2}$ efficiently. ∎
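One way to solve this convex problem is projected gradient descent, sketched below under the assumption that $\operatorname{\mathcal{X}}$ is a box $[-c,c]^{n}$ so that projection is a coordinate-wise clip. The solver name, the box-shaped action space, and the step-size choice are illustrative assumptions; any constrained least-squares method would serve equally well.

```python
from typing import List, Sequence


def solve_action_linear_oracle(
    A: Sequence[Sequence[float]],
    b: Sequence[float],
    target: Sequence[float],
    box: float = 1.0,
    steps: int = 500,
) -> List[float]:
    """Approximately minimize ||A x + b - target||^2 over x in [-box, box]^n."""
    rows, cols = len(A), len(A[0])
    # The gradient is Lipschitz with constant at most 2 * ||A||_F^2,
    # so the inverse of that bound is a safe (conservative) step size.
    frobenius_sq = sum(entry * entry for row in A for entry in row)
    step = 1.0 / (2.0 * frobenius_sq)
    x = [0.0] * cols
    for _ in range(steps):
        residual = [
            sum(A[i][j] * x[j] for j in range(cols)) + b[i] - target[i]
            for i in range(rows)
        ]
        gradient = [
            2.0 * sum(A[i][j] * residual[i] for i in range(rows))
            for j in range(cols)
        ]
        # Gradient step followed by projection onto the box.
        x = [max(-box, min(box, x[j] - step * gradient[j])) for j in range(cols)]
    return x
```

When the target is reachable, as local controllability guarantees for the targets chosen by $\operatorname{\textup{{FTRL}}}$, the minimum value is zero and the returned action drives the state to $y^{*}$ up to the solver tolerance.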

The class of action-linear dynamics is quite general, owing to the flexibility permitted by nonlinear parameterizations of $(A_{y},b_{y})$ in terms of $y$ ; in Appendix D, we show that local controllability holds for multiple explicit families of instances when appropriate eigenvalue conditions are satisfied. We can further relax this condition to accommodate dynamics where action-linearity holds only locally in the neighborhood of stabilizing actions (i.e. actions $x^{*}$ where $D(x^{*},y)=y$ ).

Definition 3 (Locally Action-Linear Dynamics).

An instance $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D)$ is locally action-linear if, for any $y\in\operatorname*{int}(\operatorname{\mathcal{Y}})$, $x^{*}$ such that $D(x^{*},y)=y$, and $x$ such that $D(x,y)\in\operatorname*{int}(\operatorname{\mathcal{Y}})$, the dynamics are given by $D(x,y)=A_{y}x+b_{y}+q_{y}(x)$, where $A_{y}$ is a matrix and $b_{y}$ is a vector, both with norms bounded by some absolute constant, and where $q_{y}:\operatorname{\mathcal{X}}\rightarrow\operatorname{\mathbb{R}}^{\dim(\operatorname{\mathcal{Y}})}$ is any function where $\left\lVert q_{y}(x)\right\rVert\leq C\left\lVert A_{y}(x-x^{*})\right\rVert^{1+c}$ for some constants $C,c>0$.

By this condition, for any $x$ in a sufficiently small neighborhood around $x^{*}$, the deviation of dynamics (and thus the resulting $y_{t+1}$) from action-linearity vanishes. Note that our algorithm always chooses a target $y_{t}$ which is near $y_{t-1}$; as such, these deviations from action-linearity can be modeled as disturbances with magnitude strictly less than our per-round step size $\left\lVert y_{t+1}-y_{t}\right\rVert$ (along with universal constant factors). The existence of an efficient implementation follows as a straightforward corollary of Theorem 2 in Section 3.3, which extends $\operatorname{\textup{{NestedOCO}}}$ to accommodate bounded adversarial disturbances, as we can then select actions by disregarding the influence of $q_{y}$ and only considering the local approximation $D(x,y)=A_{y}x+b_{y}$ at each state $y$ (assuming that each decomposition between $q_{y}$ and the action-linear component is known).

3.3 Adversarial Disturbances

Our algorithm $\operatorname{\textup{{NestedOCO}}}$ can be extended to accommodate adversarial disturbances, where the state is updated as $y_{t}=D(x_{t},y_{t-1})+w_{t}$ , with $\{w_{t}\}$ chosen adversarially. In the weak local controllability case, we show a sharp threshold effect in terms of whether or not $\left\lVert w_{t}\right\rVert$ is allowed to exceed the undisturbed distance from the boundary by a factor of $\frac{\rho}{1+\rho}$ : if disturbances are bounded below this threshold, regret minimization remains feasible with a tight $\Theta(E)$ dependence on the total disturbance magnitude, yet if disturbances may exceed this, no sublinear regret rate is attainable even for a constant total disturbance magnitude. When $\rho$ is small, an adversary can push us to the boundary faster than we can “undo” past disturbances, causing our feasible range to decay.

Theorem 2 (Bounded Disturbances for Weak Local Controllability).

For any $\rho\in(0,1]$, suppose that a sequence of adversarial disturbances $w_{t}$ for a $\rho$-locally controllable instance $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D)$ satisfies $\sum_{t=1}^{T}\left\lVert w_{t}\right\rVert\leq E$ and $\left\lVert w_{t}\right\rVert\leq\frac{\rho-\alpha\rho}{1+\rho}\cdot\pi\left(D(x_{t},y_{t-1})\right)$, for some $\alpha\in\operatorname{\mathbb{R}}$. If $\alpha>0$, there is an algorithm $\operatorname{\textup{{NestedOCO-BD}}}$ with regret for convex Lipschitz losses $f_{t}$ bounded by

$$\operatorname{\textup{{Reg}}}_{T}(\operatorname{\textup{{NestedOCO-BD}}})\leq O\left(\sqrt{T\cdot(\alpha\rho)^{-1}}+E\right),$$

and there is an instance where any algorithm $\operatorname{\mathcal{A}}$ obtains $\operatorname{\textup{{Reg}}}_{T}(\operatorname{\mathcal{A}})=\Omega(E)$ . If $\alpha<0$ , there is an instance such that any algorithm $\operatorname{\mathcal{A}}$ obtains $\operatorname{\textup{{Reg}}}_{T}(\operatorname{\mathcal{A}})\geq\Omega\left(T\right)$ even when $E=O(1)$ .
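A heuristic way to read the threshold $\frac{\rho}{1+\rho}$ (an informal sketch, not the argument given in the paper's appendix) is as the point at which a single disturbance can still be undone in one round. If the undisturbed state has distance $\pi$ from the boundary and the disturbance has norm $c\cdot\pi$, the disturbed state has distance at least $(1-c)\pi$ from the boundary, so its reachable radius is at least $\rho(1-c)\pi$; this covers the disturbance exactly when $\rho(1-c)\geq c$, i.e. when $c\leq\frac{\rho}{1+\rho}$. The helper below, whose name is hypothetical, computes this margin.

```python
def correction_margin(rho: float, c: float, boundary_distance: float = 1.0) -> float:
    """Worst-case reachable radius after a disturbance, minus the disturbance size.

    A nonnegative value means the undisturbed state can be recovered in one round.
    """
    disturbance = c * boundary_distance
    # Worst case: the disturbance pushes the state straight toward the boundary.
    remaining_distance = boundary_distance - disturbance
    return rho * remaining_distance - disturbance
```

For $\rho=0.5$ the threshold is $1/3$: a disturbance fraction of $0.3$ leaves a positive margin, while $0.4$ leaves a negative one.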

The maximum disturbance bound can be removed when dynamics are strongly locally controllable, as the ensured feasible range of the dynamics does not vanish at the boundary of the state space. For such instances, we can minimize regret (with tight $O(E\cdot\rho^{-1})$ dependence) even if disturbances are only implicitly bounded by the state space diameter (which is at least $\rho$ , without loss of generality).

Theorem 3 (Unbounded Disturbances for Strong Local Controllability).

For any $\rho>0$ and strongly $\rho$ -locally controllable instance $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D)$ with disturbances $w_{t}$ satisfying $\sum_{t=1}^{T}\left\lVert w_{t}\right\rVert\leq E$ , there is an algorithm $\operatorname{\textup{{NestedOCO-UD}}}$ with regret for convex Lipschitz losses $f_{t}$ bounded by

$$\operatorname{\textup{{Reg}}}_{T}(\operatorname{\textup{{NestedOCO-UD}}})\leq O\left(\sqrt{T}+E\cdot\rho^{-1}\right),$$

and there is an instance where any algorithm $\operatorname{\mathcal{A}}$ obtains $\operatorname{\textup{{Reg}}}_{T}(\operatorname{\mathcal{A}})\geq\Omega\left(E\cdot\rho^{-1}\right)$.

In each case, our lower bounds in terms of $E$ hold for the same constants obtained by our algorithms, and our algorithms obtain the stated regret guarantees even when $E$ is not known in advance. We present the algorithms and analysis for each theorem in Appendix E; both operate by tracking deviations from an idealized trajectory without disturbances, and calibrating parameters to preserve sufficient reachability margin for applying corrections towards this trajectory in each round. The lower bounds both proceed by considering an instance with a fixed target state $y^{*}$ and losses which track the distance from $y^{*}$ , along with an adversary whose goal is to maximize this distance by selecting disturbances which push the current state away from $y^{*}$ .

3.4 Unknown Dynamics

Up until this point, we have assumed that the dynamics $D$ can be queried arbitrarily in each round. While this has required minimal assumptions on $D$ beyond local controllability, accommodation of unknown dynamics is often desired in online control (Cassel et al., 2022; Minasyan et al., 2022) and for several of our applications (Roth et al., 2015; Agarwal and Brown, 2023). Here we give conditions under which regret minimization can be implemented without advance knowledge of $D$ by an algorithm $\operatorname{\textup{{ProbingOCO}}}$ , which maintains continuously-updating local linear approximations of $D$ near $y_{t}$ across rounds. Crucially, we assume that $D$ is time-invariant and locally action-linear with sufficiently small Lipschitz parameters, and that for the initial state $y_{0}$ some near-stabilizing action $x_{1}$ is known, i.e. $\left\lVert D(x_{1},y_{0})-y_{0}\right\rVert\leq\epsilon$ , for some $\epsilon=o(\sqrt{T})$ .

Theorem 4.

For any $\rho$-locally controllable and time-invariant instance $(D,\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}})$ which satisfies local action-linearity and appropriate Lipschitz conditions, there is an algorithm $\operatorname{\textup{{ProbingOCO}}}$ with $\operatorname{\textup{{Reg}}}_{T}(\operatorname{\textup{{ProbingOCO}}})\leq O(\sqrt{T})$ for convex Lipschitz losses $f_{t}$ and unknown dynamics $D$, provided that at $t=1$ we are given some $x_{1}$ such that $\left\lVert D(x_{1},y_{0})-y_{0}\right\rVert=o(\sqrt{T})$.

We state $\operatorname{\textup{{ProbingOCO}}}$ and prove Theorem 4 in Appendix F, along with additional details on the regularity and near-stability assumptions. Beyond the arguments used for our previous results, the crux of the analysis is maintaining and updating local linear approximations of $D$ throughout our optimization which are accurate enough that the effects of both learned representation errors and action non-linearity from $q_{y}(x)$ can be treated as bounded disturbances. We implement each update from our nested regret minimization algorithm as a series of $O(\dim(\operatorname{\mathcal{X}}))$ steps involving small near-orthogonal perturbations to our targets $y_{t}$, which we then use to update our local estimate for $D$.
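To give some intuition for how a local linear approximation can be recovered from a small number of probes, the following sketch fits a linear map from observed perturbation and response pairs by least squares. It is a simplified stand-in rather than $\operatorname{\textup{{ProbingOCO}}}$ itself: it perturbs actions along coordinate directions and assumes a hypothetical `dynamics(x, y)` callable that can be queried at a fixed state, which the online setting with unknown dynamics does not permit.

```python
import numpy as np


def fit_local_linear_map(action_deltas, state_deltas):
    """Least-squares fit of M such that state_delta ~ M @ action_delta."""
    dX = np.asarray(action_deltas, dtype=float)  # shape (k, n)
    dY = np.asarray(state_deltas, dtype=float)   # shape (k, d)
    # Solve dX @ M.T ~ dY for M.T in the least-squares sense.
    M_T, *_ = np.linalg.lstsq(dX, dY, rcond=None)
    return M_T.T


def probe_local_map(dynamics, x, y, step=1e-3):
    """Estimate the local action-linear part of `dynamics` around (x, y).

    Assumption (illustrative only): `dynamics` is a hypothetical black box
    which can be evaluated repeatedly at the same state y.
    """
    x = np.asarray(x, dtype=float)
    base = dynamics(x, y)
    deltas = step * np.eye(len(x))  # dim(X) orthogonal probe directions
    responses = np.array([dynamics(x + d, y) - base for d in deltas])
    return fit_local_linear_map(deltas, responses)
```

When the dynamics are exactly linear in the action, the fit recovers the linear map up to floating-point error; otherwise the residual plays the role of the approximation error discussed above.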

3.5 Bandit Feedback

We can extend our approach from $\operatorname{\textup{{NestedOCO}}}$ to accommodate bandit feedback for convex losses by replacing $\operatorname{\textup{{FTRL}}}$ with the $\operatorname{\textup{{FKM}}}$ algorithm (Flaxman et al., 2004) and appropriately recalibrating parameters. $\operatorname{\textup{{FKM}}}$ obtains ${O}(T^{3/4})$ regret, which is the best currently-known bound for bandit convex optimization without additional assumptions (e.g. strong convexity), and we obtain an analogous bound here for nested optimization. We note that this extension to bandit feedback can again be applied for any algorithm with a small per-round step-size bound, though this property does not hold for algorithms which sample from larger sets to reduce variance of gradient estimators (e.g. those from Abernethy et al. (2008); Hazan and Levy (2014)).

Theorem 5.

For any $\rho$ -locally controllable instance $(D,\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}})$ , there is an oracle-efficient algorithm $\operatorname{\textup{{NestedBCO}}}$ with expected regret bounded by

$$\operatorname{\textup{{Reg}}}_{T}(\operatorname{\textup{{NestedBCO}}})=O\left(nRLT^{3/4}(r\rho)^{-1}\right)$$

for $L$ -Lipschitz convex losses $f_{t}$ under bandit feedback.

We present the $\operatorname{\textup{{NestedBCO}}}$ algorithm and prove Theorem 5 in Appendix G.
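For readers unfamiliar with $\operatorname{\textup{{FKM}}}$, the following is a minimal sketch of its one-point gradient estimate (Flaxman et al., 2004) over a Euclidean ball. It shows plain $\operatorname{\textup{{FKM}}}$ only, not $\operatorname{\textup{{NestedBCO}}}$; the function name, the choice of domain, and the default parameter values are assumptions for illustration and are not the calibrated values from our analysis.

```python
import numpy as np


def fkm_ball(loss_fns, dim, radius=1.0, eta=0.01, delta=0.1, seed=0):
    """Bandit gradient descent with one-point gradient estimates.

    Each round observes only the scalar loss at the played point.
    Assumption (illustrative only): the domain is a ball of the given radius.
    """
    if not 0.0 < delta < radius:
        raise ValueError("delta must lie in (0, radius)")
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)           # center iterate, kept in the shrunk ball
    shrunk = radius - delta     # leaves room for the perturbation
    played, losses = [], []
    for f in loss_fns:
        u = rng.normal(size=dim)
        u /= np.linalg.norm(u)  # uniform direction on the unit sphere
        y = x + delta * u       # point actually played
        loss = f(y)             # bandit feedback: a single scalar
        g = (dim / delta) * loss * u  # one-point gradient estimate
        x = x - eta * g
        norm = np.linalg.norm(x)
        if norm > shrunk:       # project back onto the shrunk ball
            x *= shrunk / norm
        played.append(y)
        losses.append(loss)
    return np.array(played), np.array(losses)
```

Each played point stays within $\delta$ of the center iterate, which is the small per-round step-size property referred to above.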

4 Applications for Online Stackelberg Optimization

We give several applications of our framework to online Stackelberg problems involving strategic or adaptive agents, each cast as an instance of online control with nonlinear dynamics where local controllability holds, and where our objectives are well-approximated by convex surrogate losses only over the state. Each application extends prior work by either allowing for more relaxed assumptions, unifying distinct problem instances, or giving a novel formulation to account for dynamic and adversarial behavior; analysis and comparison to related work is contained in Appendices H-K.

4.1 Online Performative Prediction

Performative Prediction was introduced by Perdomo et al. (2020) to capture settings in which the data distribution may shift as a function of the classifier itself. We consider the online formulation of Performative Prediction introduced in Kumar et al. (2022) as an instance of online convex optimization with unbounded memory, which we extend to accommodate a stateful variant of the problem (as in Brown et al. (2022)) in which the update to the distribution is a function of both the classifier and the current distribution itself. Let $\operatorname{\mathcal{X}}\subseteq\operatorname{\mathbb{R}}^{n}$ denote our space of classifiers, and let $p_{0}$ be the initial distribution over $\operatorname{\mathbb{R}}^{n}$ . When a classifier $x_{t}$ is deployed, the distribution is updated to

$$p_{t}=(1-\theta)p_{t-1}+\theta\operatorname{\mathcal{D}}(x_{t},y_{t-1})$$

where $\operatorname{\mathcal{D}}(x_{t},y_{t-1})=A(x_{t},y_{t-1})+\xi$, for a random variable $\xi\in\operatorname{\mathbb{R}}^{n}$ with mean $\mu$ and covariance $\Sigma$, and with $y_{t}=A(x_{t},y_{t-1})$, where $A$ satisfies $\rho$-local controllability for some $\rho>0$ and appropriate smoothness notions. We also assume there is some linear $s:\operatorname{\mathcal{X}}\rightarrow\operatorname{\mathcal{Y}}$ such that $A(x,y)=s(x)$ if $y=s(x)$. We then receive loss $\tilde{f}_{t}(x_{t},p_{t})=\operatorname*{\mathbb{E}}_{z\sim p_{t}}[f_{t}(x_{t},z)]$, where each $f_{t}$ is convex and Lipschitz.

This generalizes the model of Kumar et al. (2022), in which $A(x,y)=A\in\operatorname{\mathbb{R}}^{n\times n}$ is taken to be a fixed matrix; there, $\rho$ -local controllability is satisfied for some $\rho>0$ provided that $A$ is nonsingular. Their aim is to compete with the best fixed classifier by running regret minimization over $\operatorname{\mathcal{X}}$ . Here we run $\operatorname{\textup{{NestedOCO}}}$ over $\operatorname{\mathcal{Y}}$ , taken over the range of $s$ , which allows us to compete against the best fixed classifier as well by the properties of $s$ ; while the classifiers $x_{t}$ we play will generally not result in stabilizing points of $A$ , their excess loss compared to each $s^{-1}(y_{t})$ is bounded.

Theorem 6 (Regret Minimization for Performative Prediction).

For any $\theta>0$ , the dynamics for Online Performative Prediction are $\rho$ -locally controllable, and $\operatorname{\textup{{NestedOCO}}}$ obtains regret $O(\sqrt{T(\rho^{-1}+\theta^{-1})})$ with respect to the best fixed classifier.

4.2 Adaptive Recommendations

Online interactions with economic agents of various types are ubiquitous, and the resulting control problems tend to be manifestly nonlinear; here we treat two diverse examples from this space. The Adaptive Recommendations problem, as introduced by Agarwal and Brown (2022), is about providing menu recommendations repeatedly to an agent, whose choice distribution is a function of their past selections, while the controller’s reward in each round depends on adversarial losses over the choice. In each round $t\in[T]$ , we show the agent a (possibly randomized) menu $K_{t}$ containing $k$ (out of $n$ ) items, and the agent’s instantaneous choice distribution conditioned on seeing $K_{t}$ is

$$p_{t}(i;K_{t},v_{t-1})=\begin{cases}\frac{s_{i}(v_{t-1})}{\sum_{j\in K_{t}}s_{j}(v_{t-1})}&i\in K_{t}\\ 0&i\notin K_{t}\end{cases}$$

where each $s_{i}:\Delta(n)\rightarrow[\lambda,1]$ is the agent’s preference scoring function for item $i$ , for some $\lambda>0$ , taking as input the agent’s memory vector $v\in\Delta(n)$ . The memory vector updates each round as

$$v_{t}=(1-\theta_{t})v_{t-1}+\theta_{t}p_{t},$$

where $\theta_{t}\in[\theta,1]$ for $\theta>0$ is a possibly time-dependent update speed, and we receive loss $f_{t}(p_{t})$, where each $f_{t}$ is convex and $L$-Lipschitz. Note that the set of feasible choice distributions when considering all menu distributions $x_{t}\in\Delta({n\choose k})$ depends on the memory vector $v_{t}$. The regret benchmark considered by Agarwal and Brown (2022) is the intersection of all such sets, denoted the “everywhere instantaneously-realizable distribution” set $\operatorname{\textup{{EIRD}}}=\cap_{v\in\Delta(n)}\operatorname{\textup{{IRD}}}(v)$, where $\operatorname{\textup{{IRD}}}(v)$ is the “instantaneously realizable distribution” set for $v$, given as the convex hull of the choice distributions $p(K_{t})$ resulting from each menu $K_{t}\in[{n\choose k}]$ when $v$ is the memory vector. There, it is shown that $\operatorname{\textup{{EIRD}}}$ is non-empty when $\lambda$ is not too small, and algorithms which minimize regret with respect to any distribution in $\operatorname{\textup{{EIRD}}}$ are given in Agarwal and Brown (2022) and Agarwal and Brown (2023) under varying assumptions regarding the scoring functions and update speed.
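The choice and memory dynamics above can be sketched in a few lines. The example below uses a single deterministic menu per round (a randomized menu would yield the corresponding mixture of choice distributions), and the scoring function $s_{i}(v)=(1-\lambda)v_{i}+\lambda$ is a hypothetical choice made only so that scores lie in $[\lambda,1]$; the function names are likewise illustrative.

```python
import numpy as np


def choice_distribution(menu, scores):
    """Scores renormalized over the menu items, zero for items off the menu."""
    p = np.zeros(len(scores))
    idx = list(menu)
    p[idx] = scores[idx] / scores[idx].sum()
    return p


def memory_update(v_prev, p, theta_t):
    """v_t = (1 - theta_t) * v_{t-1} + theta_t * p_t."""
    return (1.0 - theta_t) * v_prev + theta_t * p


def example_scores(v, lam=0.5):
    """Hypothetical scoring functions s_i(v) = (1 - lam) * v_i + lam."""
    return (1.0 - lam) * v + lam
```

Since both $v_{t-1}$ and $p_{t}$ lie in $\Delta(n)$, the updated memory vector remains in $\Delta(n)$ for any $\theta_{t}\in[0,1]$.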

While the prior work considers a bandit version of the problem with unknown dynamics, here we consider a full-feedback deterministic variant of the problem for simplicity, which further allows us to circumvent barriers posed by uncertainty (Agarwal and Brown, 2022, 2023) and relax structural assumptions (e.g. on $\theta_{t}$ or $s_{i}$). We can cast this as an instance of our framework by taking $\operatorname{\mathcal{X}}=\Delta({n\choose k})$ and $\operatorname{\mathcal{Y}}=\operatorname{\textup{{EIRD}}}$, where $D$ expresses updates to the memory vector. We assume $v_{0}=\mathbf{u}_{n}$, and we reparameterize to run our algorithm over $\Delta(n)$. We optimize surrogate losses $f^{*}_{t}(v_{t})$, and bound excess regret from $f_{t}(p_{t})$.

Theorem 7 (Regret Minimization over $\operatorname{\textup{{EIRD}}}$ ).

For $\lambda>\frac{k-1}{n-1}$ , the dynamics for Adaptive Recommendations over $\operatorname{\textup{{EIRD}}}$ are $\theta$ -locally controllable, and $\operatorname{\textup{{NestedOCO}}}$ obtains regret $O(\sqrt{T\theta^{-1}})$ .

In Agarwal and Brown (2023), a property for scoring functions is considered which enables regret minimization over a potentially much larger set of distributions than $\operatorname{\textup{{EIRD}}}$ . A scoring function $s_{i}:\Delta(n)\rightarrow[\frac{\lambda}{\sigma},1]$ is said to be $(\sigma,\lambda)$ -scale-bounded for $\sigma>1$ if, for all $v\in\Delta(n)$ , we have that

$$\sigma^{-1}((1-\lambda)v_{i}+\lambda)\leq s_{i}(v)\leq\sigma((1-\lambda)v_{i}+\lambda).$$

The set considered is the $\phi$ -smoothed simplex $\Delta^{\phi}(n)=\{(1-\phi)v+\phi\mathbf{u}_{n}:v\in\Delta(n)\}$ , for $\phi=\Theta(k\lambda\sigma^{2})$ , where it is shown that $\operatorname{\textup{{IRD}}}(v)$ contains a ball around $v$ for $v\in\Delta^{\phi}(n)$ . We take $\operatorname{\mathcal{Y}}=\Delta^{\phi}(n)$ , which satisfies local controllability, and optimize over $f_{t}^{*}(v_{t})$ with $\operatorname{\textup{{NestedOCO}}}$ .

Theorem 8 (Regret Minimization over $\Delta^{\phi}(n)$ ).

For $(\sigma,\lambda)$ -scale-bounded scoring functions $s_{i}$ , for any $\lambda>0$ and $\sigma>1$ , the dynamics for Adaptive Recommendations over $\Delta^{\phi}(n)$ are $\Omega(\theta\lambda\phi)$ -locally controllable, and $\operatorname{\textup{{NestedOCO}}}$ obtains regret $O(\sqrt{T(\theta\lambda\phi)^{-1}})$ .

4.3 Adaptive Pricing

Here we consider an Adaptive Pricing problem for real-valued goods, formulated as a dynamic extension of the setting of Roth et al. (2015) where purchase history and consumption affect demand. In each round we set per-unit price vectors $p_{t}\in\operatorname{\mathbb{R}}_{+}^{n}$ , and an agent buys some bundle of goods $x_{t}\in\operatorname{\mathbb{R}}_{+}^{n}$ , which results in us obtaining a reward $\langle p_{t},x_{t}\rangle-c_{t}(x_{t})$ , where our production cost function $c_{t}$ at each round is convex and $L_{c}$ -Lipschitz, and may be chosen adversarially.

Departing from Roth et al. (2015), we consider an agent who maintains goods reserves $y_{t-1}\in\operatorname{\mathbb{R}}_{\geq 0}^{n}$ and consumes an adversarially chosen fraction $\theta_{t}\in[\theta,1]$ of every good’s reserve at each round (for some $\theta>0$ ). The agent then chooses a bundle $x_{t}$ to maximize their utility $g(p_{t},x_{t},y_{t})=v(y_{t})-\langle p_{t},x_{t}\rangle$ , where $y_{t}=(1-\theta_{t})y_{t-1}+x_{t}$ is their updated reserve bundle. We make several regularity assumptions on the agent’s valuation function $v:\operatorname{\mathbb{R}}_{+}^{n}\rightarrow\operatorname{\mathbb{R}}_{+}$ , all of which are satisfied by several classically studied utility families (which we discuss in Appendix 4.3). Notably, we assume that $v$ is strictly concave and increasing, and homogeneous; the range is bounded under rationality.

Our aim will be to set prices which allow us to compete with the best stable reserve policy, i.e. against any pricing policy where the agent maintains the same reserve bundle $y_{t}=y^{*}$ at each round for some $y^{*}$ regardless of $\theta_{t}$. We take an appropriate convex set of such bundles as our state space, for which we show that local controllability holds. Observe that to induce a purchase of $x_{t}=\theta_{t}y_{t-1}$, it suffices to set prices $p_{t}=\nabla v(y_{t-1})$, as we then have that $\nabla_{x_{t}}(v((1-\theta_{t})y_{t-1}+x_{t})-\langle p_{t},x_{t}\rangle)=\mathbf{0}$ at $x_{t}=\theta_{t}y_{t-1}$. By homogeneity of $v$, we also have that $\langle\nabla v(y_{t}),\theta_{t}y_{t}\rangle=\theta_{t}k\cdot v(y_{t})$ for some $k$, and we show that optimization via the concave surrogate rewards

$$f^{*}_{t}(y_{t})=\theta_{t}k\cdot v(y_{t})-c_{t}(\theta_{t}y_{t})$$

will closely track our true rewards $f_{t}(p_{t},x_{t})=\langle p_{t},x_{t}\rangle-c_{t}(x_{t})$ . While neither our true nor surrogate rewards will be Lipschitz, we extend $\operatorname{\textup{{NestedOCO}}}$ to obtain sublinear regret over Hölder continuous losses by appropriately calibrating our step size (which may be of independent interest).
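As a quick numerical sanity check of the homogeneity identity and the surrogate reward, the sketch below uses a Cobb-Douglas valuation $v(y)=\prod_{i}y_{i}^{a_{i}}$ with $\sum_{i}a_{i}=k<1$, which is increasing, strictly concave on the positive orthant, and homogeneous of degree $k$. This valuation is a hypothetical choice for illustration (we make no claim here about which utility families the appendix covers), and all function names are assumptions.

```python
import numpy as np


def cobb_douglas(y, a):
    """v(y) = prod_i y_i ** a_i, homogeneous of degree k = sum(a)."""
    return float(np.prod(np.power(y, a)))


def cobb_douglas_grad(y, a):
    """Gradient of v: the i-th entry is a_i * v(y) / y_i."""
    return a * cobb_douglas(y, a) / y


def stabilizing_prices(y_prev, a):
    """Prices p_t = grad v(y_{t-1}), inducing the purchase theta_t * y_{t-1}."""
    return cobb_douglas_grad(y_prev, a)


def surrogate_reward(y, theta_t, a, cost):
    """theta_t * k * v(y) - c_t(theta_t * y), with k = sum(a)."""
    k = float(np.sum(a))
    return theta_t * k * cobb_douglas(y, a) - cost(theta_t * y)
```

At a stable reserve $y_{t}=y_{t-1}$, the revenue $\langle p_{t},x_{t}\rangle$ with $x_{t}=\theta_{t}y_{t-1}$ equals $\theta_{t}k\cdot v(y_{t})$ by Euler's identity for homogeneous functions, so the surrogate and true rewards coincide in this special case.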

Theorem 9 (Regret Minimization over Stable Reserve Policies).

For any $\theta>0$, the dynamics for Adaptive Pricing are $\theta$-locally controllable, and $\operatorname{\textup{{NestedOCO}}}$ obtains regret $o(T\theta^{-1})$ with respect to the best stable reserve policy.

4.4 Steering Learners in Online Games

A recent line of work (Deng et al., 2019; Mansour et al., 2022; Brown et al., 2023) explores maximizing rewards in a repeated game against a no-regret learner, and Anagnostides et al. (2023) study no-regret dynamics in time-varying games. We consider these questions in unison, and aim to optimize reward against a no-regret learner for game matrices chosen adversarially and online.

Consider adversarial sequences of two-player $m\times n$ bimatrix games $(A_{t},B_{t})$ , where $m>n$ ; we assume that the convex hull of the rows of each $B_{t}$ contains the unit ball. As Player A, we choose strategies $x_{t}\in\Delta(m)$ each round to maximize our reward against Player B, who chooses their strategies $y_{t}\in\Delta(n)$ according to a no-regret algorithm (in particular, online projected gradient descent). The game $(A_{t},B_{t})$ is only revealed after both players have chosen strategies for round $t$ . Our aim here is to illustrate the feasibility of steering the opponent’s trajectory, and so we consider games where Player A’s reward is predominantly a function only of Player B’s actions. We assume that $\left\lVert xA_{t}-x{A^{*}_{t}}\right\rVert\leq\delta_{t}$ for any $x\in\Delta(m)$ , where each $A^{*}_{t}$ is a matrix with identical rows, and that per-round changes to $B_{t}$ are bounded, with $\left\lVert xB_{t}-xB_{t-1}\right\rVert\leq\epsilon_{t}$ for any $x\in\Delta(m)$ . We measure the regret of an algorithm $\operatorname{\mathcal{A}}$ with respect to any profile $(x,y)\in\Delta(m)\times\Delta(n)$ , where

$$\operatorname{\textup{{Reg}}}_{T}(\operatorname{\mathcal{A}})=\max_{(x,y)\in\Delta(m)\times\Delta(n)}\sum_{t=1}^{T}xA_{t}y-x_{t}A_{t}y_{t}.$$

When Player B plays $\operatorname{\textup{OGD}}$ with step size $\theta=\Theta(T^{-1/2})$ , their strategy updates each round as

$$y_{t+1}=\Pi_{\Delta(n)}\left(y_{t}+\theta(x_{t}B_{t})\right),$$

with $y_{1}=\mathbf{u}_{n}$ , and yields regret $O(\sqrt{T})$ for Player B with respect to any $y\in\Delta(n)$ for the loss sequence $\{x_{t}B_{t}:t\in[T]\}$ . To cast this in our framework, we consider $\Delta(n)=\operatorname{\mathcal{Y}}$ as our state space, where we select actions $x_{t-1}$ to induce desired updates to $y_{t}$ and optimize over the surrogate losses $\{\mathbf{u}_{m}A^{*}_{t}y_{t}:t\in[T]\}$ . While we do not see $B_{t}$ prior to choosing each $x_{t}$ , we view our update errors from instead selecting an action in terms of the dynamics resulting from $B_{t-1}$ as adversarial disturbances and run $\operatorname{\textup{{NestedOCO-UD}}}$ , as the dynamics are strongly locally controllable.
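Player B's update can be sketched as follows, using the standard sort-based Euclidean projection onto the simplex. This illustrates only the opponent's $\operatorname{\textup{OGD}}$ dynamics that we aim to steer, not $\operatorname{\textup{{NestedOCO-UD}}}$; the function names are hypothetical, and any payoff matrix passed in is an arbitrary example which need not satisfy our assumptions on $B_{t}$.

```python
import numpy as np


def project_simplex(z):
    """Euclidean projection of z onto the probability simplex (sort-based)."""
    z = np.asarray(z, dtype=float)
    u = np.sort(z)[::-1]                 # coordinates in decreasing order
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(z) + 1)
    last = np.nonzero(u - css / ks > 0)[0][-1]
    tau = css[last] / (last + 1)         # uniform shift applied to z
    return np.maximum(z - tau, 0.0)


def ogd_step(y, x, B, theta):
    """Player B's update: project y_t + theta * (x_t B_t) onto Delta(n)."""
    return project_simplex(y + theta * (x @ B))
```

From Player A's perspective, the map $(x_{t},y_{t})\mapsto y_{t+1}$ computed by `ogd_step` is the dynamics $D$, and the choice of $x_{t}$ determines the direction in which $y_{t}$ moves.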

Theorem 10 (Regret Minimization in Online Games).

For $\theta=\Theta(T^{-1/2})$ , repeated play against $\operatorname{\textup{OGD}}$ in online $m\times n$ games can be cast as a $\theta$ -strongly locally controllable instance of online control with nonlinear dynamics, for which $\operatorname{\textup{{NestedOCO-UD}}}$ obtains regret $O(\sqrt{T}+\sum_{t}(\delta_{t}+\epsilon_{t}))$ .

References

Abernethy et al. (2008) Jacob D. Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In Annual Conference Computational Learning Theory, 2008. URL https://api.semanticscholar.org/CorpusID:8547150.
Agarwal and Brown (2022) Arpit Agarwal and William Brown. Diversified recommendations for agents with adaptive preferences. In Advances in Neural Information Processing Systems, volume 35, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/a75db7d2ee1e4bee8fb819979b0a6cad-Paper-Conference.pdf.
Agarwal and Brown (2023) Arpit Agarwal and William Brown. Online recommendations for agents with discounted adaptive preferences, 2023.
Agarwal et al. (2019a) Naman Agarwal, Brian Bullins, Elad Hazan, Sham M. Kakade, and Karan Singh. Online control with adversarial disturbances, 2019a.
Agarwal et al. (2019b) Naman Agarwal, Elad Hazan, and Karan Singh. Logarithmic regret for online control. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019b. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/78719f11fa2df9917de3110133506521-Paper.pdf.
Agrawal et al. (2023) Shipra Agrawal, Yiding Feng, and Wei Tang. Dynamic pricing and learning with bayesian persuasion, 2023.
Ahmadi et al. (2023) Saba Ahmadi, Avrim Blum, and Kunhe Yang. Fundamental bounds on online strategic classification, 2023.
Alcantara-Jiménez and Clempner (2020) Guillermo Alcantara-Jiménez and Julio B. Clempner. Repeated stackelberg security games: Learning with incomplete state information. Reliability Engineering & System Safety, 195:106695, 2020. ISSN 0951-8320. doi: https://doi.org/10.1016/j.ress.2019.106695. URL https://www.sciencedirect.com/science/article/pii/S0951832019304478.
Anagnostides et al. (2022) Ioannis Anagnostides, Constantinos Daskalakis, Gabriele Farina, Maxwell Fishelson, Noah Golowich, and Tuomas Sandholm. Near-optimal no-regret learning for correlated equilibria in multi-player general-sum games. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pages 736–749, 2022.
Anagnostides et al. (2023) Ioannis Anagnostides, Ioannis Panageas, Gabriele Farina, and Tuomas Sandholm. On the convergence of no-regret learning dynamics in time-varying games, 2023.
Anava et al. (2014) Oren Anava, Elad Hazan, and Shie Mannor. Online convex optimization against adversaries with memory and application to statistical arbitrage, 2014.
Aoki (1974) Masanao Aoki. Local Controllability of a Decentralized Economic System. The Review of Economic Studies, 41(1):51–63, 01 1974. ISSN 0034-6527. doi: 10.2307/2296398. URL https://doi.org/10.2307/2296398.
Balcan et al. (2015) Maria-Florina Balcan, Avrim Blum, Nika Haghtalab, and Ariel D. Procaccia. Commitment without regrets: Online learning in stackelberg security games. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, EC ’15, page 61–78, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 9781450334105. doi: 10.1145/2764468.2764478. URL https://doi.org/10.1145/2764468.2764478.
Barbero-Liñán and Jakubczyk (2013) M. Barbero-Liñán and B. Jakubczyk. Second order conditions for optimality and local controllability of discrete-time systems, 2013.
Blum et al. (2008) Avrim Blum, MohammadTaghi Hajiaghayi, Katrina Ligett, and Aaron Roth. Regret minimization and the price of total anarchy. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 373–382, 2008.
Blum et al. (2014) Avrim Blum, Nika Haghtalab, and Ariel D Procaccia. Learning optimal commitment to overcome insecurity. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper_files/paper/2014/file/cc1aa436277138f61cda703991069eaf-Paper.pdf.
Boscain et al. (2021) Ugo Boscain, Daniele Cannarsa, Valentina Franceschi, and Mario Sigalotti. Local controllability does imply global controllability, 2021.
Braverman et al. (2017) Mark Braverman, Jieming Mao, Jon Schneider, and S. Matthew Weinberg. Selling to a no-regret buyer. CoRR, abs/1711.09176, 2017. URL http://arxiv.longhoe.net/abs/1711.09176.
Brown et al. (2022) Gavin Brown, Shlomi Hod, and Iden Kalemaj. Performative prediction in a stateful world, 2022.
Brown et al. (2023) William Brown, Jon Schneider, and Kiran Vodrahalli. Is learning in games good for the learners?, 2023.
Cassel et al. (2022) Asaf Cassel, Alon Cohen, and Tomer Koren. Efficient online linear control with stochastic convex costs and unknown dynamics, 2022.
Chen et al. (2022) Xinyi Chen, Edgar Minasyan, Jason D. Lee, and Elad Hazan. Provable regret bounds for deep online learning and control, 2022.
Cohen et al. (2018) Alon Cohen, Avinatan Hassidim, Tomer Koren, Nevena Lazic, Yishay Mansour, and Kunal Talwar. Online linear quadratic control. CoRR, abs/1806.07104, 2018. URL http://arxiv.longhoe.net/abs/1806.07104.
Collina et al. (2023) Natalie Collina, Eshwar Ram Arunachaleswaran, and Michael Kearns. Efficient stackelberg strategies for finitely repeated games. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, AAMAS ’23, page 643–651, Richland, SC, 2023. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9781450394321.
Daskalakis and Syrgkanis (2015) Constantinos Daskalakis and Vasilis Syrgkanis. Learning in auctions: Regret is hard, envy is easy. CoRR, abs/1511.01411, 2015. URL http://arxiv.longhoe.net/abs/1511.01411.
Dean and Morgenstern (2022) Sarah Dean and Jamie Morgenstern. Preference dynamics under personalized recommendations, 2022.
Deng et al. (2019) Yuan Deng, Jon Schneider, and Balasubramanian Sivan. Strategizing against no-regret learners, 2019.
Dong et al. (2018) Jinshuo Dong, Aaron Roth, Zachary Schutzman, Bo Waggoner, and Zhiwei Steven Wu. Strategic classification from revealed preferences. In Proceedings of the 2018 ACM Conference on Economics and Computation, EC ’18, page 55–70, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450358293. doi: 10.1145/3219166.3219193. URL https://doi.org/10.1145/3219166.3219193.
Feng et al. (2019) Zhe Feng, Okke Schrijvers, and Eric Sodomka. Online learning for measuring incentive compatibility in ad auctions. CoRR, abs/1901.06808, 2019. URL http://arxiv.longhoe.net/abs/1901.06808.
Flaxman et al. (2004) Abraham Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. CoRR, cs.LG/0408007, 2004. URL http://arxiv.longhoe.net/abs/cs.LG/0408007.
Flaxman et al. (2016) Seth Flaxman, Sharad Goel, and Justin M. Rao. Filter Bubbles, Echo Chambers, and Online News Consumption. Public Opinion Quarterly, 80(S1):298–320, 03 2016. ISSN 0033-362X. doi: 10.1093/poq/nfw006. URL https://doi.org/10.1093/poq/nfw006.
Flokas et al. (2019) Lampros Flokas, Emmanouil-Vasileios Vlatakis-Gkaragkounis, and Georgios Piliouras. Poincaré recurrence, cycles and spurious equilibria in gradient-descent-ascent for non-convex non-concave zero-sum games, 2019.
Gaitonde et al. (2021) Jason Gaitonde, Jon M. Kleinberg, and Éva Tardos. Polarization in geometric opinion dynamics. In Péter Biró, Shuchi Chawla, and Federico Echenique, editors, EC ’21: The 22nd ACM Conference on Economics and Computation, Budapest, Hungary, July 18-23, 2021, pages 499–519. ACM, 2021.
Golrezaei et al. (2020) Negin Golrezaei, Adel Javanmard, and Vahab S. Mirrokni. Dynamic incentive-aware learning: Robust pricing in contextual auctions. CoRR, abs/2002.11137, 2020. URL https://arxiv.longhoe.net/abs/2002.11137.
Gradu et al. (2022) Paula Gradu, Elad Hazan, and Edgar Minasyan. Adaptive regret for control of time-varying dynamics, 2022.
Hardt et al. (2015) Moritz Hardt, Nimrod Megiddo, Christos H. Papadimitriou, and Mary Wootters. Strategic classification. CoRR, abs/1506.06980, 2015. URL http://arxiv.longhoe.net/abs/1506.06980.
Hartline et al. (2015a) Jason Hartline, Vasilis Syrgkanis, and Eva Tardos. No-regret learning in bayesian games. Advances in Neural Information Processing Systems, 28, 2015a.
Hartline et al. (2015b) Jason D. Hartline, Vasilis Syrgkanis, and Éva Tardos. No-regret learning in repeated bayesian games. CoRR, abs/1507.00418, 2015b. URL http://arxiv.longhoe.net/abs/1507.00418.
Hazan (2021) Elad Hazan. Introduction to online convex optimization, 2021.
Hazan and Levy (2014) Elad Hazan and Kfir Levy. Bandit convex optimization: Towards tight bounds. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper_files/paper/2014/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.
Hazan and Singh (2022) Elad Hazan and Karan Singh. Introduction to online nonstochastic control, 2022.
Hazla et al. (2019) Jan Hazla, Yan Jin, Elchanan Mossel, and Govind Ramnarayan. A geometric model of opinion polarization. CoRR, abs/1910.05274, 2019.
Jagadeesan et al. (2022a) Meena Jagadeesan, Nikhil Garg, and Jacob Steinhardt. Supply-side equilibria in recommender systems, 2022a.
Jagadeesan et al. (2022b) Meena Jagadeesan, Tijana Zrnic, and Celestine Mendler-Dünner. Regret minimization with performative feedback. CoRR, abs/2202.00628, 2022b. URL https://arxiv.longhoe.net/abs/2202.00628.
Jia et al. (2014) Liyan Jia, Lang Tong, and Qing Zhao. An online learning approach to dynamic pricing for demand response, 2014.
Kakade et al. (2020) Sham Kakade, Akshay Krishnamurthy, Kendall Lowrey, Motoya Ohnishi, and Wen Sun. Information theoretic regret bounds for online nonlinear control, 2020.
Kanoria and Nazerzadeh (2020) Yash Kanoria and Hamid Nazerzadeh. Dynamic reserve prices for repeated auctions: Learning from bids. CoRR, abs/2002.07331, 2020. URL https://arxiv.longhoe.net/abs/2002.07331.
Kuhn and Wohltmann (1989) H. Kuhn and H.-W. Wohltmann. Controllability of economic systems under alternative expectations hypotheses—the discrete case. Computers & Mathematics with Applications, 18(6):617–628, 1989. ISSN 0898-1221. doi: https://doi.org/10.1016/0898-1221(89)90112-0. URL https://www.sciencedirect.com/science/article/pii/0898122189901120.
Kumar et al. (2022) Raunak Kumar, Sarah Dean, and Robert D. Kleinberg. Online convex optimization with unbounded memory, 2022.
Lale et al. (2021) Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Model learning predictive control in nonlinear dynamical systems. In 2021 60th IEEE Conference on Decision and Control (CDC), pages 757–762, 2021. doi: 10.1109/CDC45484.2021.9683670.
Lauffer et al. (2022) Niklas Lauffer, Mahsa Ghasemi, Abolfazl Hashemi, Yagiz Savas, and Ufuk Topcu. No-regret learning in dynamic stackelberg games, 2022.
Letchford et al. (2009) Joshua Letchford, Vincent Conitzer, and Kamesh Munagala. Learning and approximating the optimal strategy to commit to. In Algorithmic Game Theory, 2009. URL https://api.semanticscholar.org/CorpusID:1795572.
Luo et al. (2022) Wenhao Luo, Wen Sun, and Ashish Kapoor. Sample-efficient safe learning for online nonlinear control with control barrier functions, 2022.
Lykouris et al. (2021) Thodoris Lykouris, Max Simchowitz, Alex Slivkins, and Wen Sun. Corruption-robust exploration in episodic reinforcement learning. In Mikhail Belkin and Samory Kpotufe, editors, Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 3242–3245. PMLR, 15–19 Aug 2021. URL https://proceedings.mlr.press/v134/lykouris21a.html.
Mansour et al. (2022) Yishay Mansour, Mehryar Mohri, Jon Schneider, and Balasubramanian Sivan. Strategizing against learners in bayesian games, 2022.
Mehta et al. (2007) Aranyak Mehta, Amin Saberi, Umesh Vazirani, and Vijay Vazirani. Adwords and generalized online matching. J. ACM, 54(5):22–es, oct 2007. ISSN 0004-5411. doi: 10.1145/1284320.1284321. URL https://doi.org/10.1145/1284320.1284321.
Mendler-Dünner et al. (2020) Celestine Mendler-Dünner, Juan Perdomo, Tijana Zrnic, and Moritz Hardt. Stochastic optimization for performative prediction. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 4929–4939. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/33e75ff09dd601bbe69f351039152189-Paper.pdf.
Miller et al. (2021) John Miller, Juan C. Perdomo, and Tijana Zrnic. Outside the echo chamber: Optimizing the performative risk. CoRR, abs/2102.08570, 2021. URL https://arxiv.longhoe.net/abs/2102.08570.
Minasyan et al. (2022) Edgar Minasyan, Paula Gradu, Max Simchowitz, and Elad Hazan. Online control of unknown time-varying dynamical systems, 2022.
Morgenstern and Roughgarden (2016) Jamie Morgenstern and Tim Roughgarden. Learning simple auctions. CoRR, abs/1604.03171, 2016. URL http://arxiv.longhoe.net/abs/1604.03171.
Mussi et al. (2022) Marco Mussi, Gianmarco Genalti, Alessandro Nuara, Francesco Trovò, Marcello Restelli, and Nicola Gatti. Dynamic pricing with volume discounts in online settings, 2022.
Muthirayan and Khargonekar (2022) Deepan Muthirayan and Pramod P. Khargonekar. Online learning robust control of nonlinear dynamical systems, 2022.
Nedelec et al. (2020) Thomas Nedelec, Clement Calauzenes, Vianney Perchet, and Noureddine El Karoui. Robust stackelberg buyers in repeated auctions. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 1342–1351. PMLR, 26–28 Aug 2020. URL https://proceedings.mlr.press/v108/nedelec20a.html.
Neu and Olkhovskaya (2021) Gergely Neu and Julia Olkhovskaya. Online learning in mdps with linear function approximation and bandit feedback. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 10407–10417. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/5631e6ee59a4175cd06c305840562ff3-Paper.pdf.
Peng et al. (2019) Binghui Peng, Weiran Shen, Pingzhong Tang, and Song Zuo. Learning optimal strategies to commit to. In AAAI Conference on Artificial Intelligence, 2019. URL https://api.semanticscholar.org/CorpusID:92982174.
Perdomo et al. (2020) Juan C. Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. Performative prediction. CoRR, abs/2002.06673, 2020. URL https://arxiv.longhoe.net/abs/2002.06673.
Piliouras and Yu (2022) Georgios Piliouras and Fang-Yi Yu. Multi-agent performative prediction: From global stability and optimality to chaos, 2022.
Roth et al. (2015) Aaron Roth, Jonathan R. Ullman, and Zhiwei Steven Wu. Watch and learn: Optimizing from revealed preferences feedback. CoRR, abs/1504.01033, 2015. URL http://arxiv.longhoe.net/abs/1504.01033.
Roughgarden (2015) Tim Roughgarden. Intrinsic robustness of the price of anarchy. J. ACM, 62(5), nov 2015. ISSN 0004-5411. doi: 10.1145/2806883. URL https://doi.org/10.1145/2806883.
Shalev-Shwartz and Singer (2006) Shai Shalev-Shwartz and Yoram Singer. Online learning meets optimization in the dual. In Proceedings of the 19th Annual Conference on Learning Theory, COLT’06, pages 423–437, Berlin, Heidelberg, 2006. Springer-Verlag. ISBN 3540352945. doi: 10.1007/11776420_32. URL https://doi.org/10.1007/11776420_32.
Shen et al. (2023) Lingqing Shen, Nam Ho-Nguyen, and Fatma Kılınç-Karzan. An online convex optimization-based framework for convex bilevel optimization. Mathematical Programming, 198(2):1519–1582, 04 2023. ISSN 1436-4646. doi: 10.1007/s10107-022-01894-5. URL https://doi.org/10.1007/s10107-022-01894-5.
Simchowitz et al. (2020) Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control. CoRR, abs/2001.09254, 2020. URL https://arxiv.longhoe.net/abs/2001.09254.
Yue et al. (2012) Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012. ISSN 0022-0000. doi: https://doi.org/10.1016/j.jcss.2011.12.028. URL https://www.sciencedirect.com/science/article/pii/S0022000012000281. JCSS Special Issue: Cloud Computing 2011.
Zhang et al. (2023) Brian Hu Zhang, Gabriele Farina, Ioannis Anagnostides, Federico Cacciamani, Stephen Marcus McAleer, Andreas Alexander Haupt, Andrea Celli, Nicola Gatti, Vincent Conitzer, and Tuomas Sandholm. Steering no-regret learners to optimal equilibria, 2023.
Zhang et al. (2021) Xuezhou Zhang, Yiding Chen, Jerry Zhu, and Wen Sun. Corruption-robust offline reinforcement learning. CoRR, abs/2106.06630, 2021. URL https://arxiv.longhoe.net/abs/2106.06630.
Zinkevich (2003) Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. 2, 04 2003.
Zrnic et al. (2021a) Tijana Zrnic, Eric Mazumdar, S. Shankar Sastry, and Michael I. Jordan. Who leads and who follows in strategic classification? CoRR, abs/2106.12529, 2021a. URL https://arxiv.longhoe.net/abs/2106.12529.
Zrnic et al. (2021b) Tijana Zrnic, Eric Mazumdar, Shankar Sastry, and Michael Jordan. Who leads and who follows in strategic classification? In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 15257–15269. Curran Associates, Inc., 2021b. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/812214fb8e7066bfa6e32c626c2c688b-Paper.pdf.

Appendix A Omitted Proofs for Section 2

Proof

of Proposition 1. Without loss of generality, assume $\alpha\leq\beta/2$ and that $T$ is even. Let $f_{t}(y_{t})=\left\lVert y_{t}-y\right\rVert$ for each $t$. Consider any round $t$ where $y_{t-1}\in\operatorname{\mathcal{B}}_{\alpha}(y)$; then, for all actions $x_{t}$, we have that $y_{t}\notin\operatorname{\mathcal{B}}_{\alpha}(y)$, as $\operatorname{\mathcal{B}}_{\alpha}(y)\subseteq\operatorname{\mathcal{B}}_{\beta}(y_{t-1})$; as such, we incur loss $f_{t}(y_{t})\geq\alpha$ in round $t$. Now suppose $y_{t-1}\notin\operatorname{\mathcal{B}}_{\alpha}(y)$; then, we must have incurred loss at least $f_{t-1}(y_{t-1})\geq\alpha$ in round $t-1$. As losses are non-negative, our total loss is at least $\alpha T/2$, as loss $\alpha$ is incurred at least every other round; given that the best fixed state $y^{*}=y$ incurs total loss $0$, we have that $\operatorname{\textup{{Reg}}}_{\operatorname{\mathcal{A}}}(T)=\Omega(T)$ for any algorithm $\operatorname{\mathcal{A}}$. ∎

Proof

of Proposition 2. We begin by observing that for instances $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D)$, the class of state-targeting policies contains a policy which obtains the reward of the best fixed state up to $O(\sqrt{T{\rho^{-1}}})$, for sufficiently large $T$. Consider the set $\hat{\operatorname{\mathcal{Y}}}=\{y^{*}\in\operatorname{\mathcal{Y}}:\pi(y^{*})\geq(T\rho)^{-1/2}\}$. Note that the reward of any $y\in\operatorname{\mathcal{Y}}$ is matched by some $y^{*}\in\hat{\operatorname{\mathcal{Y}}}$ up to $O(\sqrt{T\rho^{-1}})$ for any fixed inner radius $r$, outer radius $R$, and Lipschitz constant $L$. For any such $y^{*}$, note that under the policy $P_{y^{*}}$ when starting at $y_{0}=0$, the distance between $y_{t}$ and $y^{*}$ in each round $t$ is updated to at most:

$$\left\lVert y_{t}-y^{*}\right\rVert\leq\max\left(0,\rho\cdot\pi(y_{t-1})\right).$$

It is straightforward to see that $\hat{\operatorname{\mathcal{Y}}}$ is convex, and so our state $y_{t}$ will never leave $\hat{\operatorname{\mathcal{Y}}}$ on its path to $y^{*}$ ; as such, we reach $y^{*}$ within $O(\sqrt{T\rho^{-1}})$ rounds, after which point our reward exactly tracks that of $y^{*}$ . For some $y^{*}\in\hat{\operatorname{\mathcal{Y}}}$ , this yields a regret for $P_{y^{*}}$ of at most $O(\sqrt{T\rho^{-1}})$ to the best fixed state in $\operatorname{\mathcal{Y}}$ .

Next, consider an instance where $\operatorname{\mathcal{X}}$ and $\operatorname{\mathcal{Y}}$ are both the unit ball in $\operatorname{\mathbb{R}}^{n}$ . With $y_{0}=0$ , let the dynamics be given by

$$y_{t}=\Pi_{\operatorname{\mathcal{Y}}}\left(y_{t-1}+x_{t}\right).$$

Observe that this satisfies $\rho$ -local controllability for any $\rho\leq 1$ , as a ball of radius $\pi(y_{t-1})$ is always feasible around $y_{t-1}$ . Let each loss $f_{t}=\left\lVert y-p\right\rVert^{2}$ , for some $p\neq 0$ . Immediately we can see that any matrix policy $K\in\mathcal{P}_{\mathcal{K}}$ has regret $\Omega(T)$ , as the action $x_{t}=0$ will be played in each round. ∎
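The separation in the second half of this proof can be seen in a one-dimensional sketch. The code below is illustrative only: it assumes $\operatorname{\mathcal{X}}=\operatorname{\mathcal{Y}}=[-1,1]$, that a matrix policy plays $x_{t}=K\cdot y_{t-1}$ (here a scalar `k`), and that the state-targeting policy moves toward $p$ by at most $\rho\cdot\pi(y)$ per round with $\pi(y)=1-\lvert y\rvert$. The function names and the values of `k`, `P`, and `rho` are hypothetical.

```python
def clip(v, lo=-1.0, hi=1.0):
    """Projection onto the interval Y = [lo, hi]."""
    return max(lo, min(hi, v))


def run_linear_policy(k, p, horizon):
    """Play x_t = k * y_{t-1} from y_0 = 0; return the total loss sum_t (y_t - p)^2."""
    y, total = 0.0, 0.0
    for _ in range(horizon):
        x = clip(k * y)        # a linear policy maps the state 0 to the action 0
        y = clip(y + x)        # dynamics y_t = Pi_Y(y_{t-1} + x_t)
        total += (y - p) ** 2
    return total


def run_state_targeting(p, rho, horizon):
    """Step toward p by at most rho * pi(y) per round, where pi(y) = 1 - |y|."""
    y, total = 0.0, 0.0
    for _ in range(horizon):
        reach = rho * (1.0 - abs(y))
        x = max(-reach, min(reach, p - y))
        y = clip(y + x)
        total += (y - p) ** 2
    return total


T, P = 1000, 0.5
linear_loss = run_linear_policy(k=0.9, p=P, horizon=T)
target_loss = run_state_targeting(p=P, rho=0.25, horizon=T)
print(linear_loss, target_loss)
```

In this sketch the linear policy never leaves $y=0$ and pays $p^{2}$ in every round, while the state-targeting policy reaches $p$ after a few rounds and pays nothing afterwards.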

Appendix B Follow the Regularized Leader

Here we state the $\operatorname{\textup{{FTRL}}}$ algorithm and several of its key properties; see e.g. Hazan (2021) for proofs of Propositions 4 and 5.

Algorithm 2 Follow the Regularized Leader ($\operatorname{\textup{{FTRL}}}$)

Choose a time horizon $T$, step size $\eta$, and $\gamma$-strongly convex regularizer $\psi:\operatorname{\mathcal{Y}}\rightarrow\operatorname{\mathbb{R}}$
Let $y_{1}=\text{argmin}_{y\in\operatorname{\mathcal{Y}}}~\psi(y)$
for $t=1$ to $T$ do
    Play $y_{t}$ and observe loss $f_{t}(y_{t})$
    Set $\nabla_{t}=\nabla f_{t}(y_{t})$
    Set $y_{t+1}=\text{argmin}_{y\in\operatorname{\mathcal{Y}}}\left(\eta\cdot\sum_{s=1}^{t}\nabla_{s}^{\top}y+\psi(y)\right)$
end for

Proposition 4.

For a $\gamma$ -strongly convex regularizer $\psi:\operatorname{\mathcal{Y}}\rightarrow\operatorname{\mathbb{R}}$ where $\left\lvert\psi(y)-\psi(y^{\prime})\right\rvert\leq G$ for all $y,y^{\prime}\in\operatorname{\mathcal{Y}}$ , and for convex $L$ -Lipschitz losses $f_{1},\ldots,f_{T}$ , the regret of $\operatorname{\textup{{FTRL}}}$ is bounded by

$$\operatorname{\textup{{Reg}}}_{T}(\operatorname{\textup{{FTRL}}})\leq\eta\frac{TL^{2}}{\gamma}+\frac{G}{\eta}.$$

Proposition 5.

Any pair of points $y_{t}$ and $y_{t+1}$ chosen by $\operatorname{\textup{{FTRL}}}$ satisfies $\left\lVert y_{t+1}-y_{t}\right\rVert\leq\eta\frac{L}{\gamma}$ .
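As a minimal sketch of the algorithm and of the stability property in Proposition 5, the code below runs $\operatorname{\textup{{FTRL}}}$ under assumptions that are not part of the original: $\operatorname{\mathcal{Y}}$ is a Euclidean ball, the regularizer is $\psi(y)=\frac{1}{2}\left\lVert y\right\rVert^{2}$ (so $\gamma=1$), and the losses are linear with unit-norm gradients (so $L=1$). With this regularizer the argmin step reduces to projecting $-\eta\sum_{s}\nabla_{s}$ onto the ball. The names `ftrl_quadratic` and `project_ball` are hypothetical.

```python
import math


def project_ball(v, radius):
    """Euclidean projection of v onto the ball of the given radius around 0."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm <= radius:
        return list(v)
    return [c * radius / norm for c in v]


def ftrl_quadratic(gradient_fns, eta, radius, dim):
    """FTRL with psi(y) = 0.5 * ||y||^2 over a ball.

    For this psi, argmin_y eta * <grad_sum, y> + psi(y) over the ball is the
    projection of -eta * grad_sum onto the ball.
    """
    grad_sum = [0.0] * dim
    y = [0.0] * dim                      # y_1 = argmin of psi over the ball
    iterates = [y]
    for grad_fn in gradient_fns:
        g = grad_fn(y)                   # gradient of f_t at the played point
        grad_sum = [s + gi for s, gi in zip(grad_sum, g)]
        y = project_ball([-eta * s for s in grad_sum], radius)
        iterates.append(y)
    return iterates


T, L_CONST, RADIUS = 200, 1.0, 1.0
# Step size eta = sqrt(G * gamma / (T * L^2)) with G = 0.5 * R^2 and gamma = 1.
ETA = math.sqrt(0.5 * RADIUS ** 2 / (T * L_CONST ** 2))
# Hypothetical linear losses f_t(y) = <g_t, y> with unit-norm g_t.
grads = [(lambda y, t=t: [math.cos(t), math.sin(t)]) for t in range(T)]
points = ftrl_quadratic(grads, ETA, RADIUS, dim=2)
max_step = max(math.dist(a, b) for a, b in zip(points, points[1:]))
print(max_step, ETA * L_CONST)
```

Because projection onto a convex set is non-expansive, consecutive iterates in this sketch move by at most $\eta L/\gamma$, matching Proposition 5.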

Appendix C Analysis for $\operatorname{\textup{{NestedOCO}}}$

Proof

of Theorem 1. First we show that any point chosen by $\operatorname{\textup{{FTRL}}}$ will be feasible under local controllability, by induction. It is straightforward to see that $\tilde{\operatorname{\mathcal{Y}}}$ is convex and $\tilde{\operatorname{\mathcal{Y}}}\subseteq\operatorname{\mathcal{Y}}$; further, any $y\in\tilde{\operatorname{\mathcal{Y}}}$ is bounded away from $\operatorname*{bd}(\operatorname{\mathcal{Y}})$. By the definition of $\tilde{\operatorname{\mathcal{Y}}}$, we have that $y=(1-\delta)y^{\prime}$ for some $y^{\prime}\in\operatorname{\mathcal{Y}}$. Recall that $\operatorname{\mathcal{B}}_{r}(\mathbf{0})\subseteq\operatorname{\mathcal{Y}}$, and note that $\operatorname{\mathcal{B}}_{\delta r}(y)=\{y+\delta\hat{y}:\hat{y}\in\operatorname{\mathcal{B}}_{r}(\mathbf{0})\}$. Let $y^{\prime\prime}$ be any point in $\operatorname{\mathcal{B}}_{r}(\mathbf{0})$. By convexity of $\operatorname{\mathcal{Y}}$, we then have that any point $(1-\delta)y^{\prime}+\delta y^{\prime\prime}$ lies in $\operatorname{\mathcal{Y}}$, and so for any $y\in\tilde{\operatorname{\mathcal{Y}}}$ we have that $\operatorname{\mathcal{B}}_{r\delta}(y)\subseteq\operatorname{\mathcal{Y}}$. Each $y_{t-1}$ lies in $\tilde{\operatorname{\mathcal{Y}}}$, and so we have that $\pi(y_{t-1})\geq r\delta$; as such, any point $y_{t}$ in $\operatorname{\mathcal{B}}_{r\delta\rho}(y_{t-1})\subseteq\operatorname{\mathcal{B}}_{\rho\cdot\pi(y_{t-1})}(y_{t-1})$ is feasible. Given that $\eta\frac{L}{\gamma}\leq r\delta\rho$, by Proposition 5 we have that $y_{t}\in\operatorname{\mathcal{B}}_{r\delta\rho}(y_{t-1})$ in each round for the chosen point. Each action will be selected by solving for

$$\operatorname*{argmin}_{x_{t}\in\operatorname{\mathcal{X}}}\left\lVert D(x_{t},y_{t-1})-y^{*}\right\rVert^{2}$$

via a call to $\texttt{Oracle}(y_{t-1},y^{*})$ . Each call is guaranteed to have a solution which achieves an objective of 0 where $D(x_{t},y_{t-1})=y^{*}$ for some $y^{*}\in\operatorname{\mathcal{B}}_{\rho\cdot\pi(y_{t-1})}(y_{t-1})$ by local controllability, yielding an exact state update to $y_{t}=y^{*}$ as we assume Oracle can solve arbitrary non-convex minimization problems. To bound the regret, first note that for any $y^{*}\in{\operatorname{\mathcal{Y}}}$ , we have

$$\sum_{t=1}^{T}f_{t}(y_{t})\leq\eta\frac{TL^{2}}{\gamma}+\frac{G}{\eta}+\sum_{t=1}^{T}f_{t}((1-\delta)y^{*})$$

by Proposition 4, as $(1-\delta)y^{*}\in\tilde{\operatorname{\mathcal{Y}}}$ for any $y^{*}\in\operatorname{\mathcal{Y}}$ . Then, observe that for any $y^{*}\in\operatorname{\mathcal{Y}}$ , we have that

$$\begin{aligned}
\sum_{t=1}^{T}f_{t}((1-\delta)y^{*})&\leq\sum_{t=1}^{T}\left(f_{t}(y^{*})+L\left\lVert\delta y^{*}\right\rVert\right)\\
&\leq\sum_{t=1}^{T}\left(f_{t}(y^{*})+\delta LR\right).
\end{aligned}$$

Combining the previous claims, we have that

$$\begin{aligned}
\sum_{t=1}^{T}f_{t}(y_{t})-f_{t}(y^{*})&\leq\delta TLR+\eta\frac{TL^{2}}{\gamma}+\frac{G}{\eta}\\
&=\eta\left(1+\frac{R}{r\rho}\right)\frac{TL^{2}}{\gamma}+\frac{G}{\eta}\\
&=2\sqrt{\frac{(1+\frac{R}{r\rho})TGL^{2}}{\gamma}}
\end{aligned}$$

upon setting $\delta=\eta\frac{L}{r\rho\gamma}$ and $\eta=\sqrt{\frac{G\gamma}{(1+\frac{R}{r\rho})TL^{2}}}$ , which yields the theorem. ∎
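The parameter choice at the end of the proof can be checked numerically. The sketch below plugs the stated $\eta$ and $\delta$ into the three regret terms and compares their sum against the closed form; the function name `theorem1_tuning` and the numeric inputs are illustrative assumptions, not values from the original.

```python
import math


def theorem1_tuning(T, L, G, gamma, R, r, rho):
    """Return (eta, delta, summed bound, closed-form bound) for the proof's choices."""
    c = 1.0 + R / (r * rho)
    eta = math.sqrt(G * gamma / (c * T * L ** 2))
    delta = eta * L / (r * rho * gamma)
    # delta*T*L*R is the cost of shrinking Y; the rest is the FTRL bound.
    summed = delta * T * L * R + eta * T * L ** 2 / gamma + G / eta
    closed_form = 2.0 * math.sqrt(c * T * G * L ** 2 / gamma)
    return eta, delta, summed, closed_form


eta, delta, summed, closed_form = theorem1_tuning(
    T=10_000, L=1.0, G=0.5, gamma=1.0, R=1.0, r=1.0, rho=0.5
)
print(eta, delta, summed, closed_form)
```

With these choices the step bound $\eta\frac{L}{\gamma}$ equals $r\delta\rho$ exactly, so the feasibility condition used earlier in the proof holds with equality.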

Appendix D Examples and Analysis for Action-Linear Dynamics

As a simple yet general example of dynamics which are both action-linear and locally controllable, consider update rules in which a step is taken by applying a nonsingular matrix transformation to the action, where the matrix can be parameterized by the state, with projection back into $\operatorname{\mathcal{Y}}$ if necessary.

Example 1.

Let both $\operatorname{\mathcal{X}}$ and $\operatorname{\mathcal{Y}}$ be given by the unit ball $\operatorname{\mathcal{B}}_{1}(\mathbf{0})$ in $\operatorname{\mathbb{R}}^{n}$ . For any fixed $y$ , let the updates from $D(x,y)$ be given by

$$D(x,y)=\Pi_{\operatorname{\mathcal{Y}}}\left(y+A_{y}\cdot x\right),$$

where each $A_{y}$ is a square matrix with minimum absolute eigenvalue $\left\lvert\lambda_{n}(A_{y})\right\rvert\geq\pi(y)\cdot\rho$ for some $\rho>0$ . Then, the instance $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D)$ is action-linear and satisfies $\rho$ -local controllability.

Proof

for Example 1. It is straightforward to see that $D(x,y)$ is action-linear. To show $\rho$ -local controllability, let $y^{*}$ be any point in $\operatorname{\mathcal{B}}_{\rho\cdot\pi(y)}(y)$ . It suffices to show that there is some $x^{*}\in\operatorname{\mathcal{X}}$ such that $A_{y}\cdot x^{*}=y^{*}-y$ . As $A_{y}$ is non-singular, we can solve for $x^{*}=A_{y}^{-1}(y^{*}-y)$ , where $\left\lVert y^{*}-y\right\rVert\leq\rho\cdot\pi(y)$ and $\left\lvert\lambda_{1}(A_{y}^{-1})\right\rvert\leq\frac{1}{\rho\cdot\pi(y)}$ , and so we have that $x^{*}\in\operatorname{\mathcal{B}}_{1}(\mathbf{0})=\operatorname{\mathcal{X}}$ . ∎
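A small two-dimensional sketch of Example 1 follows. It assumes a particular, hypothetical choice of $A_{y}$: a rotation scaled by $\rho\cdot\pi(y)$, picked because its eigenvalue magnitudes and singular values all equal $\rho\cdot\pi(y)$, so the norm bound on $A_{y}^{-1}$ used in the proof holds directly. The value of `RHO`, the angle `theta`, and all function names are illustrative.

```python
import math

RHO = 0.5


def norm(v):
    return math.hypot(v[0], v[1])


def pi(y):
    """Distance from y to the boundary of the unit ball."""
    return 1.0 - norm(y)


def a_matrix(y, theta=0.7):
    """Hypothetical A_y: rotation by theta scaled by rho * pi(y)."""
    s = RHO * pi(y)
    c, sn = math.cos(theta), math.sin(theta)
    return [[s * c, -s * sn], [s * sn, s * c]]


def matvec(a, v):
    return [a[0][0] * v[0] + a[0][1] * v[1], a[1][0] * v[0] + a[1][1] * v[1]]


def inverse_2x2(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]


def project_unit_ball(v):
    n = norm(v)
    return list(v) if n <= 1.0 else [v[0] / n, v[1] / n]


def dynamics(x, y):
    """D(x, y) = Pi_Y(y + A_y x)."""
    step = matvec(a_matrix(y), x)
    return project_unit_ball([y[0] + step[0], y[1] + step[1]])


def action_for_target(y, y_star):
    """Solve A_y x = y_star - y, as in the proof."""
    return matvec(inverse_2x2(a_matrix(y)), [y_star[0] - y[0], y_star[1] - y[1]])


y = [0.3, -0.2]
radius = RHO * pi(y)
# A target on the edge of the ball of radius rho * pi(y) around y.
target = [y[0] + radius * math.cos(2.0), y[1] + radius * math.sin(2.0)]
x_star = action_for_target(y, target)
reached = dynamics(x_star, y)
print(norm(x_star), reached, target)
```

The computed action stays inside the unit ball $\operatorname{\mathcal{X}}$ and the update lands on the requested target, which is the content of $\rho$-local controllability at this state.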

We can also extend this to include state-parameterized generalizations of any linear system governed by nonsingular matrices over a bounded-radius state space (for a sufficiently large action space).

Example 2.

Let $\operatorname{\mathcal{Y}}$ be given by the radius- $R$ ball $\operatorname{\mathcal{B}}_{R}(\mathbf{0})$ in $\operatorname{\mathbb{R}}^{n}$ , and let $\operatorname{\mathcal{X}}=\operatorname{\mathcal{B}}_{cR}(\mathbf{0})$ . For any fixed $y$ , let the updates from $D(x,y)$ be given by

$$D(x,y)=\Pi_{\operatorname{\mathcal{Y}}}\left(K_{y}\cdot y+A_{y}\cdot x\right),$$

where both $K_{y}$ and $A_{y}$ are square matrices. For any $y$, let $M_{y}=K_{y}-I$, and suppose we take $c$ large enough such that $c\cdot\left\lvert\lambda_{n}(A_{y})\right\rvert\geq\left\lvert\lambda_{1}(M_{y})\right\rvert+\pi(y)\cdot\rho$ for some $\rho>0$. Then, the instance $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D)$ is action-linear and satisfies $\rho$-local controllability.

Proof

for Example 2. Here, again it is evident that $D(x,y)$ is action-linear, and so it suffices to show that there is some $x^{*}\in\operatorname{\mathcal{X}}$ such that

$$\begin{aligned}
K_{y}\cdot y+A_{y}\cdot x^{*}&=y+M_{y}\cdot y+A_{y}\cdot x^{*}\\
&=y^{*}
\end{aligned}$$

for any $y^{*}$ in $\operatorname{\mathcal{B}}_{\rho\cdot\pi(y)}(y)$. As in the proof for Example 1, we have that $\left\lVert M_{y}\cdot y\right\rVert\leq R\cdot\left\lvert\lambda_{1}(M_{y})\right\rvert$, and for large enough $c$ there is some $x^{*}$ such that $A_{y}\cdot x^{*}=\hat{y}$ for any $\hat{y}$ where $\left\lVert\hat{y}\right\rVert\leq R\cdot\left\lvert\lambda_{1}(M_{y})\right\rvert+\pi(y)\cdot\rho$. Thus, any point $y^{*}\in\operatorname{\mathcal{B}}_{R\cdot\left\lvert\lambda_{1}(M_{y})\right\rvert+\pi(y)\cdot\rho}(y+M_{y}\cdot y)$ is feasible by some $x^{*}$, and this set contains the ball $\operatorname{\mathcal{B}}_{\pi(y)\cdot\rho}(y)$. ∎

Appendix E Algorithms for Adversarial Disturbances

E.1 $\operatorname{\textup{{NestedOCO-BD}}}$ and Proofs for Theorem 2

We show that it is possible to simulate $\operatorname{\textup{{NestedOCO}}}$ over the undisturbed states $\hat{y}_{t}$ under the assumption that the dynamics are $\alpha\rho$-locally controllable for some $\alpha\in(0,1)$, while retaining sufficient range in the feasible region around $y_{t}$ to correct for the disturbance $w_{t-1}$ from the previous round. Here, the oracle call for computing $x_{t}$ in each round is updated to consider the true state $y_{t-1}$.

Algorithm 3 $\operatorname{\textup{{NestedOCO}}}$ with Adversarial Disturbances ($\operatorname{\textup{{NestedOCO-BD}}}$)

Initialize $\operatorname{\textup{{NestedOCO}}}$ for $T$ rounds over $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D)$ for $\alpha\rho$-locally controllable dynamics
for $t=1$ to $T$ do
    Let $\hat{y}_{t}$ be the target state chosen by $\operatorname{\textup{{NestedOCO}}}$
    Use $\texttt{Oracle}(y_{t-1},\hat{y}_{t})$ to compute $x_{t}=\operatorname*{argmin}_{x\in\operatorname{\mathcal{X}}}\left\lVert D(x,y_{t-1})-\hat{y}_{t}\right\rVert^{2}$
    Play action $x_{t}$
    Observe disturbed state $y_{t}=\hat{y}_{t}+w_{t}$ and loss $f_{t}(y_{t})$
    Update $\operatorname{\textup{{NestedOCO}}}$ with state $\hat{y}_{t}$ and loss $f_{t}(\hat{y}_{t})$
end for

Theorem 2 follows directly from Theorems 11, 12, and 13. Intuitively, when the per-round disturbance magnitude is at most $\frac{\rho-\alpha\rho}{1+\rho}\cdot\pi\left(D(x_{t},y_{t-1})\right)$ , one can calibrate $\operatorname{\textup{{NestedOCO}}}$ for the case of $\alpha\rho$ -locally controllable dynamics and maintain sufficient “slack” to correct for the previous round’s disturbance in every round. When disturbances exceed $\frac{\rho}{1+\rho}\cdot\pi\left(D(x_{t},y_{t-1})\right)$ , an adversary can continually push the state towards the boundary of $\operatorname{\mathcal{Y}}$ , which may require vanishing disturbance magnitude as rounds progress due to the limited range promised by local controllability near the boundary.

Theorem 11.

For a $\rho$-locally controllable instance $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D)$ with convex losses $f_{t}:\operatorname{\mathcal{Y}}\rightarrow\operatorname{\mathbb{R}}$ and adversarial disturbances $w_{t}$ where $\left\lVert w_{t}\right\rVert\leq\frac{\rho-\alpha\rho}{1+\rho}\cdot\pi\left(D(x_{t},y_{t-1})\right)$ and $\sum_{t=1}^{T}\left\lVert w_{t}\right\rVert\leq E$, the regret of $\operatorname{\textup{{NestedOCO-BD}}}$ with respect to the reward of any state is bounded by

$$\operatorname{\textup{{Reg}}}_{T}(\operatorname{\textup{{NestedOCO-BD}}})\leq O\left(\sqrt{T\cdot(\alpha\rho)^{-1}}+E\right),$$

with $T$ queries made to an oracle for non-convex optimization.

Proof

We show by induction that each call to $\texttt{Oracle}(y_{t-1},\hat{y}_{t})$ yields a feasible action $x_{t}$ satisfying $\hat{y}_{t}=D(x_{t},y_{t-1})$ . This is immediate for $t=1$ , and suppose this holds up to some round $t-1$ , where we have that $y_{t-1}=\hat{y}_{t-1}+w_{t-1}$ . Given that $\operatorname{\textup{{NestedOCO}}}$ selects actions under $\alpha\rho$ -local controllability, we can bound

$$\left\lVert\hat{y}_{t}-\hat{y}_{t-1}\right\rVert\leq\alpha\rho\cdot\pi(\hat{y}_{t-1}).$$

Further, the magnitude of the disturbance $w_{t-1}$ is bounded by

$$\left\lVert w_{t-1}\right\rVert\leq\frac{\rho-\alpha\rho}{1+\rho}\cdot\pi(\hat{y}_{t-1}),$$

yielding that

$$\begin{aligned}
\left\lVert\hat{y}_{t}-y_{t-1}\right\rVert&\leq\left\lVert\hat{y}_{t}-\hat{y}_{t-1}-w_{t-1}\right\rVert\\
&\leq\left(\alpha\rho+\frac{\rho-\alpha\rho}{1+\rho}\right)\cdot\pi(\hat{y}_{t-1}).&&(y_{t-1}=w_{t-1}+\hat{y}_{t-1})
\end{aligned}$$

As such, we have that

$$\begin{aligned}
\rho\cdot\pi(y_{t-1})&\geq\rho\left(1-\frac{\rho-\alpha\rho}{1+\rho}\right)\cdot\pi(\hat{y}_{t-1})\\
&=\rho\left(\alpha+\frac{1-\alpha}{1+\rho}\right)\cdot\pi(\hat{y}_{t-1}),
\end{aligned}$$

and so by $\rho$-local controllability some feasible action $x_{t}$ exists, as $\hat{y}_{t}$ lies in $\operatorname{\mathcal{B}}_{\rho\cdot\pi(y_{t-1})}(y_{t-1})$. The regret bound for $\operatorname{\textup{{NestedOCO}}}$ holds over the states $\hat{y}_{t}$, and so we can bound the total regret of $\operatorname{\textup{{NestedOCO-BD}}}$ with respect to any $y^{*}\in\operatorname{\mathcal{Y}}$ as:

$$\begin{aligned}
\sum_{t=1}^{T}f_{t}(y_{t})-f_{t}(y^{*})&\leq\sum_{t=1}^{T}f_{t}(\hat{y}_{t})-f_{t}(y^{*})+L\left\lVert y_{t}-\hat{y}_{t}\right\rVert\\
&\leq\operatorname{\textup{{Reg}}}_{T}(\operatorname{\textup{{NestedOCO}}})+L\sum_{t=1}^{T}\left\lVert w_{t}\right\rVert&&\text{(Thm. 1)}\\
&\leq 2\sqrt{\frac{(1+\frac{R}{r\alpha\rho})TGL^{2}}{\gamma}}+LE.
\end{aligned}$$

∎
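The feasibility step in the proof above relies on the upper bound on $\left\lVert\hat{y}_{t}-y_{t-1}\right\rVert$ and the lower bound on $\rho\cdot\pi(y_{t-1})$ having the same coefficient of $\pi(\hat{y}_{t-1})$. The short derivation below spells this out; the numeric values in the closing comment are illustrative and not taken from the original.

```latex
\begin{aligned}
\alpha\rho+\frac{\rho-\alpha\rho}{1+\rho}
  &=\frac{\alpha\rho(1+\rho)+\rho-\alpha\rho}{1+\rho}
   =\frac{\rho(1+\alpha\rho)}{1+\rho},\\
\rho\left(\alpha+\frac{1-\alpha}{1+\rho}\right)
  &=\frac{\rho\left(\alpha(1+\rho)+1-\alpha\right)}{1+\rho}
   =\frac{\rho(1+\alpha\rho)}{1+\rho}.
\end{aligned}
% Illustrative values: \rho=1, \alpha=1/2 give a disturbance cap of \pi(\hat{y}_{t-1})/4,
% and both coefficients above equal 3/4.
```

Since the two coefficients agree, $\left\lVert\hat{y}_{t}-y_{t-1}\right\rVert\leq\rho\cdot\pi(y_{t-1})$, so the target is always within the range promised by $\rho$-local controllability at the disturbed state.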

We show that the dependence on $E$ is tight up to the constant. Note that we can obtain regret $O(\sqrt{T\cdot(\alpha\rho)^{-1}})+LE$ in the following instance via $\operatorname{\textup{{NestedOCO-BD}}}$.

Theorem 12 (Regret Lower Bound for Bounded Disturbances).

Suppose for any $\alpha>0$ and $\rho\in(0,1]$ an adversary can choose $w_{t}$ with $\left\lVert w_{t}\right\rVert\leq\frac{\rho-\alpha\rho}{1+\rho}\cdot\pi\left(D(x_{t},y_{t-1})\right)$, where $\sum_{t=1}^{T}\left\lVert w_{t}\right\rVert=E$ for any $E$. There is a $\rho$-locally controllable instance $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D)$ with $L$-Lipschitz convex losses $f_{t}$ such that any algorithm $\operatorname{\mathcal{A}}$ obtains regret $\operatorname{\textup{{Reg}}}_{T}(\operatorname{\mathcal{A}})\geq\min(LE,\frac{\rho-\alpha\rho}{1+\rho}TL)$.

Proof

Consider any norm $\left\lVert\cdot\right\rVert$ over $\operatorname{\mathbb{R}}^{n}$. Let $\operatorname{\mathcal{Y}}$ be the unit ball $\operatorname{\mathcal{B}}_{1}(\mathbf{0})$, and let each $f_{t}(y_{t})=L\left\lVert y_{t}\right\rVert$. Consider any action space $\operatorname{\mathcal{X}}$ and dynamics $D$ where $\rho$-local controllability exactly characterizes the range of $D$, i.e. for any $y$ and $y^{\prime}$, there is some $x$ such that $D(x,y)=y^{\prime}$ if and only if $y^{\prime}\in\operatorname{\mathcal{B}}_{\rho\cdot\pi(y)}(y)$.

First, note that $\pi(y)=1-\left\lVert y\right\rVert$ for any $y\in\operatorname{\mathcal{Y}}$. In each round $t$, suppose an algorithm plays an action $x_{t}$ at state $y_{t-1}$ which yields a target undisturbed update $\hat{y}_{t}=D(x_{t},y_{t-1})$. The adversary can then choose any $w_{t}$ satisfying $\left\lVert w_{t}\right\rVert\leq\frac{\rho-\alpha\rho}{1+\rho}\cdot(1-\left\lVert\hat{y}_{t}\right\rVert)$; suppose each $w_{t}$ is given by

$$w_{t}=\hat{y}_{t}\cdot\frac{\frac{\rho-\alpha\rho}{1+\rho}\cdot(1-\left\lVert\hat{y}_{t}\right\rVert)}{\left\lVert\hat{y}_{t}\right\rVert}$$

if $\hat{y}_{t}$ is non-zero, and an arbitrary vector $w_{t}$ with $\left\lVert w_{t}\right\rVert=\frac{\rho-\alpha\rho}{1+\rho}$ if $\hat{y}_{t}=\mathbf{0}$. This satisfies the disturbance norm bound, and further yields $y_{t}=\hat{y}_{t}+w_{t}$, where for non-zero $\hat{y}_{t}$ we have

$$y_{t}=\hat{y}_{t}\cdot\left(1+\frac{\frac{\rho-\alpha\rho}{1+\rho}\cdot(1-\left\lVert\hat{y}_{t}\right\rVert)}{\left\lVert\hat{y}_{t}\right\rVert}\right)$$

and thus for any $\hat{y}_{t}$,

$$\begin{aligned}
\left\lVert y_{t}\right\rVert&\geq\left\lVert\hat{y}_{t}\right\rVert+\frac{\rho-\alpha\rho}{1+\rho}\cdot(1-\left\lVert\hat{y}_{t}\right\rVert)\\
&\geq\frac{\rho-\alpha\rho}{1+\rho},
\end{aligned}$$

yielding a loss $f_{t}(y_{t})\geq L\cdot\frac{\rho-\alpha\rho}{1+\rho}$ at a disturbance cost of $\left\lVert w_{t}\right\rVert=\frac{\rho-\alpha\rho}{1+\rho}(1-\left\lVert\hat{y}_{t}\right\rVert)$. Assuming the adversary continues this strategy in each round until any disturbance budget $E=\sum_{t=1}^{T}\left\lVert w_{t}\right\rVert$ is exhausted, this yields a regret for any algorithm of at least

$$\operatorname{\textup{{Reg}}}_{T}(\operatorname{\mathcal{A}})\geq\min\left(LE,\frac{\rho-\alpha\rho}{1+\rho}TL\right),$$

as $y^{*}=\mathbf{0}$ obtains total loss 0. ∎
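The adversary's radial push from this proof is simple enough to sketch directly. The code below assumes the Euclidean norm on $\operatorname{\mathbb{R}}^{2}$ (the proof allows any norm), and the function name `radial_disturbance` and the sample points are hypothetical.

```python
import math


def radial_disturbance(y_hat, rho, alpha):
    """Push y_hat outward by c * pi(y_hat), with c = (rho - alpha*rho) / (1 + rho)."""
    c = (rho - alpha * rho) / (1.0 + rho)
    n = math.hypot(y_hat[0], y_hat[1])
    if n == 0.0:
        return [c, 0.0]                  # any direction works at the origin
    scale = c * (1.0 - n) / n            # pi(y_hat) = 1 - ||y_hat|| on the unit ball
    return [scale * y_hat[0], scale * y_hat[1]]


RHO, ALPHA = 1.0, 0.5
C = (RHO - ALPHA * RHO) / (1.0 + RHO)
checks = []
for y_hat in ([0.0, 0.0], [0.6, 0.0], [-0.3, 0.4], [0.0, -0.99]):
    w = radial_disturbance(y_hat, RHO, ALPHA)
    y = [y_hat[0] + w[0], y_hat[1] + w[1]]
    # Record (||w||, ||y||, pi(y_hat)) for each sample point.
    checks.append((math.hypot(*w), math.hypot(*y), 1.0 - math.hypot(*y_hat)))
print(checks)
```

For each sample point the disturbance respects the per-round cap, the disturbed state stays in the unit ball, and its norm (hence the loss divided by $L$) is at least $\frac{\rho-\alpha\rho}{1+\rho}$.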

The disturbance upper bound is indeed necessary for $\rho$ -locally controllable dynamics. We show a sharp threshold effect at $\frac{\rho}{1+\rho}\cdot\pi(D(x_{t},y_{t-1}))$ , wherein an adversary who is allowed to exceed this limit by any amount can force an algorithm to incur linear regret even with only a constant budget. Note that for any $\rho\in(0,1]$ and $\alpha<0$ , there is some $\beta\in[0,1)$ such that $\frac{\rho-\alpha\rho}{1+\rho}\geq\frac{\rho}{1+\beta\rho}$ .

Theorem 13.

Suppose an adversary can choose any state disturbances $w_{t}$ with $\left\lVert w_{t}\right\rVert\leq\frac{\rho}{1+\beta\rho}\cdot\pi\left(D(x_{t},y_{t-1})\right)$, for any $\rho\in(0,1]$ and any $\beta\in[0,1)$. Then, there is a $\rho$-locally controllable instance $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D)$ with convex losses $f_{t}$ such that any algorithm $\operatorname{\mathcal{A}}$ obtains regret $\operatorname{\textup{{Reg}}}_{T}(\operatorname{\mathcal{A}})=\Theta(T)$ even if $\sum_{t=1}^{T}\left\lVert w_{t}\right\rVert=O(1)$.

Proof

Consider any instance $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D)$ where $\rho$-local controllability exactly characterizes the range of $D$, i.e. for any $y$ and $y^{\prime}$, there is some $x$ such that $D(x,y)=y^{\prime}$ if and only if $y^{\prime}\in\operatorname{\mathcal{B}}_{\rho\cdot\pi(y)}(y)$.

Let $d_{t}=\pi(y_{t})$ for each round. Beginning at any round $t$, suppose the adversary observes an action $x_{t}$ which yields an update $\hat{y}_{t}=D(x_{t},y_{t-1})$. Let $z_{t}=\operatorname*{argmin}_{y\in\operatorname*{bd}(\operatorname{\mathcal{Y}})}\left\lVert y-\hat{y}_{t}\right\rVert$, and suppose the adversary chooses the disturbance:

$$w_{t}=\operatorname*{argmin}_{w:\left\lVert w\right\rVert\leq\frac{\rho}{1+\beta\rho}\cdot\pi\left(\hat{y}_{t}\right)}\left\lVert\hat{y}_{t}+w-z_{t}\right\rVert.$$

This forces $y_{t}$ closer to the boundary at each round, regardless of the choice of $x_{t}$ :

$$\begin{aligned}
d_{t}&=\left(1-\frac{\rho}{1+\beta\rho}\right)\cdot\pi(\hat{y}_{t})\\
&\leq\left(1+\rho-\frac{\rho}{1+\beta\rho}-\frac{\rho^{2}}{1+\beta\rho}\right)d_{t-1}&&(\pi(\hat{y}_{t})\leq(1+\rho)d_{t-1})\\
&\leq\frac{1+\beta\rho+\beta\rho^{2}-\rho^{2}}{1+\beta\rho}d_{t-1}\\
&\leq\left(1-\frac{(1-\beta)\rho^{2}}{1+\beta\rho}\right)d_{t-1},
\end{aligned}$$

where $\pi(\hat{y}_{t})\leq(1+\rho)d_{t-1}$ holds by our assumption on $D(x,y)$. Assuming the adversary applies a disturbance $w_{t}$ selected as above in each round $t\leq T$, we have that

$$d_{t}\leq\left(1-\frac{(1-\beta)\rho^{2}}{1+\beta\rho}\right)^{t}\cdot d_{0},$$

where the magnitude of each disturbance is bounded by

$$\begin{aligned}
\left\lVert w_{t}\right\rVert&\leq\frac{\rho+\rho^{2}}{1+\beta\rho}d_{t-1}\\
&\leq\frac{\rho+\rho^{2}}{1+\beta\rho}\left(1-\frac{(1-\beta)\rho^{2}}{1+\beta\rho}\right)^{t-1}\cdot d_{0},
\end{aligned}$$

where we take the initial state distance to the boundary $d_{0}=\pi(y_{0})$ to be a constant bounded away from zero. This yields that the sum of disturbance magnitudes $E=\sum_{t=1}^{T}\left\lVert w_{t}\right\rVert$ is at most:

$$\begin{aligned}
\sum_{t=1}^{T}\left\lVert w_{t}\right\rVert&\leq d_{0}\frac{\rho+\rho^{2}}{1+\beta\rho}\cdot\sum_{t=1}^{T}\left(1-\frac{(1-\beta)\rho^{2}}{1+\beta\rho}\right)^{t-1}\\
&\leq d_{0}\cdot\frac{\rho+\rho^{2}}{(1-\beta)\rho^{2}}\\
&=O(1).
\end{aligned}$$

Now suppose that the loss at each round is given by $f_{t}(y_{t})=\left\lVert y_{t}-y_{0}\right\rVert$. As $\pi$ measures distance to the boundary, we have $\left\lVert y_{t}-y_{0}\right\rVert\geq d_{0}-d_{t}$, while $f_{t}(y_{0})=0$. Then, our regret with respect to $y_{0}$ is at least:

$$\begin{aligned}
\sum_{t=1}^{T}f_{t}(y_{t})-f_{t}(y_{0})&\geq\sum_{t=1}^{T}d_{0}-d_{t}\\
&\geq d_{0}\left(T-\sum_{t=1}^{T}\left(1-\frac{(1-\beta)\rho^{2}}{1+\beta\rho}\right)^{t}\right)\\
&\geq d_{0}\left(T-\frac{1-\frac{(1-\beta)\rho^{2}}{1+\beta\rho}}{\frac{(1-\beta)\rho^{2}}{1+\beta\rho}}\right)\\
&\geq d_{0}\left(T-\frac{1+\beta\rho}{(1-\beta)\rho^{2}}\right)\\
&=\Theta(T).
\end{aligned}$$

∎
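The geometric decay in this proof can be traced in a one-dimensional sketch. The code below assumes $\operatorname{\mathcal{Y}}=[-1,1]$ with $\pi(y)=1-\lvert y\rvert$, a start $y_{0}>0$, and one particular learner that always retreats toward the center by the full $\rho\cdot\pi(y_{t-1})$; the theorem covers every algorithm, so this is only an illustration. The function name and parameter values are hypothetical.

```python
def simulate_boundary_push(rho, beta, y0, horizon):
    """Return (distances d_t, total disturbance, total loss) for the 1-D sketch."""
    y = y0
    dists, budget, loss = [], 0.0, 0.0
    for _ in range(horizon):
        d_prev = 1.0 - abs(y)
        y_hat = y - rho * d_prev          # learner moves toward 0 as far as allowed
        # Adversary pushes toward the nearest boundary point (+1, since y_hat > 0).
        w = rho / (1.0 + beta * rho) * (1.0 - abs(y_hat))
        y = y_hat + w
        dists.append(1.0 - abs(y))
        budget += abs(w)
        loss += abs(y - y0)               # f_t(y_t) = |y_t - y_0|
    return dists, budget, loss


RHO, BETA, Y0, T = 0.5, 0.5, 0.5, 200
D0 = 1.0 - abs(Y0)
Q = (1.0 - BETA) * RHO ** 2 / (1.0 + BETA * RHO)   # per-round contraction rate
dists, budget, loss = simulate_boundary_push(RHO, BETA, Y0, T)
print(dists[-1], budget, loss)
```

In this run the distance to the boundary contracts by the factor $1-\frac{(1-\beta)\rho^{2}}{1+\beta\rho}$ each round, the total disturbance stays below $d_{0}\cdot\frac{\rho+\rho^{2}}{(1-\beta)\rho^{2}}$, and the loss relative to $y_{0}$ grows linearly in the horizon.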

Together, the previous three theorems yield Theorem 2.

E.2 $\operatorname{\textup{{NestedOCO-UD}}}$ and Proofs for Theorem 3

We can remove the bound on the maximum disturbance for strongly locally controllable instances, as the feasible update sets do not vanish at the boundary of $\operatorname{\mathcal{Y}}$ . Recall that an instance $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D)$ satisfies strong $\rho$ -local controllability for $\rho>0$ if, for any $y\in\operatorname{\mathcal{Y}}$ and $y^{*}\in\operatorname{\mathcal{B}}_{\rho}(y)\cap\operatorname{\mathcal{Y}}$ , there is some $x$ such that $D(x,y)=y^{*}$ . We assume without loss of generality that $\rho\leq 2R$ , where $R$ is the radius of $\operatorname{\mathcal{Y}}$ .

Intuitively, our algorithm tracks the target state which would be chosen by $\operatorname{\textup{{FTRL}}}$ in the absence of all disturbances (by recording the counterfactual loss rather than the one truly experienced), and always seeks to minimize distance to that state.

Algorithm 4 $\operatorname{\textup{{NestedOCO}}}$ with Unbounded Disturbances ($\operatorname{\textup{{NestedOCO-UD}}}$)

Initialize $\operatorname{\textup{{FTRL}}}$ for $T$ rounds over $\operatorname{\mathcal{Y}}$ with step size $\eta=\sqrt{\frac{G\gamma}{TL^{2}}}$
for $t=1$ to $T$ do
    Let $\hat{y}_{t}$ be the target state chosen by $\operatorname{\textup{{FTRL}}}$
    Use $\texttt{Oracle}(y_{t-1},\hat{y}_{t})$ to compute $x_{t}=\operatorname*{argmin}_{x\in\operatorname{\mathcal{X}}}\left\lVert D(x,y_{t-1})-\hat{y}_{t}\right\rVert^{2}$
    Play action $x_{t}$
    Observe disturbed state $y_{t}=D(x_{t},y_{t-1})+w_{t}$ and loss $f_{t}(y_{t})$
    Update $\operatorname{\textup{{FTRL}}}$ with state $\hat{y}_{t}$ and loss $f_{t}(\hat{y}_{t})$
end for
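The loop above can be sketched in one dimension under several assumptions that are not part of the original: $\operatorname{\mathcal{Y}}=[-R,R]$, $\operatorname{\mathcal{X}}=[-\rho,\rho]$, dynamics $D(x,y)=\Pi_{\operatorname{\mathcal{Y}}}(y+x)$ (which are strongly $\rho$-locally controllable), losses $f_{t}(y)=\lvert y-p\rvert$, the regularizer $\psi(y)=\frac{1}{2}y^{2}$, and disturbed states clipped back into $\operatorname{\mathcal{Y}}$. The oracle reduces to clipping $\hat{y}_{t}-y_{t-1}$ to $[-\rho,\rho]$. All names and numeric values are hypothetical.

```python
import math


def clip(v, lo, hi):
    return max(lo, min(hi, v))


def nested_oco_ud(disturbances, p, rho, radius=1.0):
    """Run the 1-D sketch; return (disturbed states y_t, FTRL targets y_hat_t)."""
    horizon = len(disturbances)
    lipschitz, gamma, g_range = 1.0, 1.0, 0.5 * radius ** 2
    eta = math.sqrt(g_range * gamma / (horizon * lipschitz ** 2))
    grad_sum, y = 0.0, 0.0
    states, targets = [], []
    for w in disturbances:
        y_hat = clip(-eta * grad_sum, -radius, radius)   # FTRL target for psi = y^2 / 2
        x = clip(y_hat - y, -rho, rho)                   # oracle: closest reachable state
        y = clip(y + x + w, -radius, radius)             # disturbed state (clipped: assumption)
        diff = y_hat - p
        grad_sum += (diff > 0) - (diff < 0)              # subgradient of |y - p| at the target
        states.append(y)
        targets.append(y_hat)
    return states, targets


clean_states, clean_targets = nested_oco_ud([0.0] * 100, p=0.5, rho=0.2)
shock = [0.0] * 100
shock[9] = 0.5                                           # one large disturbance in round 10
shock_states, shock_targets = nested_oco_ud(shock, p=0.5, rho=0.2)
errors = [abs(s - t) for s, t in zip(shock_states, shock_targets)]
print(max(errors), sum(e > 1e-9 for e in errors))
```

Without disturbances the state tracks the FTRL target exactly, since targets move by at most $\eta\frac{L}{\gamma}\leq\rho$ per round; after the single shock, the tracking error in this run shrinks by up to $\rho$ per round until the state rejoins the target.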

Theorem 14.

For a strongly $\rho$ -locally controllable instance $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D)$ with convex losses $f_{t}:\operatorname{\mathcal{Y}}\rightarrow\operatorname{\mathbb{R}}$ and adversarial disturbances $w_{t}$ where $\sum_{t=1}^{T}\left\lVert w_{t}\right\rVert\leq E$ , the regret of $\operatorname{\textup{{NestedOCO-UD}}}$ is bounded by

$$\operatorname{\textup{{Reg}}}_{T}(\operatorname{\textup{{NestedOCO-UD}}})\leq O\left(\sqrt{T}+E\cdot\rho^{-1}\right)$$

with respect to the reward of any state, with $T$ queries made to an oracle for non-convex optimization.

Proof

We begin by bounding the total state error $\sum_{t=1}^{T}\left\lVert y_{t}-\hat{y}_{t}\right\rVert$ across rounds. First, note that for any fixed $\rho>0$, and any desired $\alpha\in(0,1)$, we have that $\eta\frac{L}{\gamma}\leq\rho\alpha$ for sufficiently large $T$, as $\eta\frac{L}{\gamma}=\sqrt{\frac{G}{T\gamma}}$; we assume this holds for any given choice of $\alpha$, and so we have that $\left\lVert\hat{y}_{t+1}-\hat{y}_{t}\right\rVert\leq\rho\alpha$ by Proposition 5. For a total disturbance budget $E$, we separately consider disturbances $w_{t}$ depending on whether or not the accumulated disturbance error up to $w_{t}$ is driven to 0 in the next round. Define $W_{+}$ and $W_{-}$ as:

$$W_{+}=\{w_{t}:D(x_{t+1},y_{t})\neq\hat{y}_{t+1}\}$$

and

$$W_{-}=\{w_{t}:D(x_{t+1},y_{t})=\hat{y}_{t+1}\}$$

with $E_{+}=\sum_{w_{t}\in W_{+}}\left\lVert w_{t}\right\rVert$ and $E_{-}=\sum_{w_{t}\in W_{-}}\left\lVert w_{t}\right\rVert$ . First, observe that at each round $t$ corresponding to $w_{t}\in W_{-}$ , given that $\left\lVert\hat{y}_{t+1}-y_{t}\right\rVert\leq\rho$ we have that $\left\lVert w_{t}\right\rVert=\left\lVert y_{t}-\hat{y}_{t}\right\rVert\leq(1+% \alpha)\rho$ , as $\left\lVert\hat{y}_{t+1}-\hat{y}_{t}\right\rVert\leq\alpha\rho$ . As such, we have that

	$\displaystyle\sum_{t:w_{t}\in W_{-}}f_{t}(y_{t})-f_{t}(\hat{y}_{t})\leq$	$\displaystyle\;\sum_{t:w_{t}\in W_{-}}L\left\lVert y_{t}-\hat{y}_{t}\right\rVert$
	$\displaystyle\leq$	$\displaystyle\;(1+\alpha)LE_{-}.$

Next, consider any $w_{t}\in W_{+}$. As our instance is strongly $\rho$-locally controllable, we must have that $\left\lVert\hat{y}_{t+1}-y_{t}\right\rVert>\rho$, as otherwise there would be some feasible action $x_{t+1}$ which would be selected that would yield $w_{t}\in W_{-}$. Since $\left\lVert\hat{y}_{t+1}-\hat{y}_{t}\right\rVert\leq\alpha\rho$, it then must be the case that $\left\lVert w_{t}\right\rVert=\left\lVert y_{t}-\hat{y}_{t}\right\rVert>(1-\alpha)\rho$, and so we can bound the number of disturbances in $W_{+}$ as:

\displaystyle\left\lvert W_{+}\right\rvert\leq

\displaystyle\;\frac{E_{+}}{(1-\alpha)\rho}.

Assuming a maximal distance $\left\lVert\hat{y}_{t}-y_{t}\right\rVert=2R$ for each round $t$ corresponding to some $w_{t}\in W_{+}$ , this yields

	$\displaystyle\sum_{t:w_{t}\in W_{+}}f_{t}(y_{t})-f_{t}(\hat{y}_{t})\leq$	$\displaystyle\;\sum_{t:w_{t}\in W_{+}}L\left\lVert y_{t}-\hat{y}_{t}\right\rVert$
	$\displaystyle\leq$	$\displaystyle\;\frac{2LRE_{+}}{(1-\alpha)\rho}$

We can assume $\alpha$ is small enough to yield $\frac{2R}{\rho}\geq(1+\alpha)\cdot(1-\alpha)$ , and so we have

\displaystyle\sum_{t=1}^{T}f_{t}(y_{t})-f_{t}(\hat{y}_{t})\leq

\displaystyle\;\frac{2LRE}{(1-\alpha)\rho}.

The regret bound for $\operatorname{\textup{{FTRL}}}$ holds over the states $\hat{y}_{t}$, and so we can bound the total regret of $\operatorname{\textup{{NestedOCO-UD}}}$ with respect to any $y^{*}\in\operatorname{\mathcal{Y}}$ as:

$\displaystyle\sum_{t=1}^{T}f_{t}(y_{t})-f_{t}(y^{*})\leq$	$\displaystyle\;\sum_{t=1}^{T}f_{t}(\hat{y}_{t})-f_{t}(y^{*})+\sum_{t=1}^{T}f_{% t}(y_{t})-f_{t}(\hat{y}_{t})$
$\displaystyle\leq$	$\displaystyle\;\eta\frac{TL^{2}}{\gamma}+\frac{G}{\eta}+\frac{2LRE}{(1-\alpha)\rho}$	(Prop. 4)
$\displaystyle\leq$	$\displaystyle\;2\sqrt{\frac{TGL^{2}}{\gamma}}+\frac{2LRE}{(1-\alpha)\rho}.$

∎
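The last step of the proof balances the two FTRL terms through the choice of $\eta$. A minimal numeric sketch of that balancing is given below; the function names are hypothetical, and the parameter values used to exercise them are arbitrary.

```python
import math


def ftrl_bound(eta, T, L, G, gamma):
    """The two FTRL terms from the proof: eta * T * L^2 / gamma + G / eta."""
    return eta * T * L ** 2 / gamma + G / eta


def balanced_eta(T, L, G, gamma):
    """Step size used by the algorithm, eta = sqrt(G * gamma / (T * L^2))."""
    return math.sqrt(G * gamma / (T * L ** 2))


def ud_regret_bound(T, L, G, gamma, R, E, rho, alpha):
    """Final bound of the proof: 2 sqrt(T G L^2 / gamma) + 2 L R E / ((1 - alpha) rho)."""
    return 2.0 * math.sqrt(T * G * L ** 2 / gamma) + 2.0 * L * R * E / ((1.0 - alpha) * rho)
```

At the balanced step size both FTRL terms equal $\sqrt{TGL^{2}/\gamma}$, which is where the factor of 2 in the final line comes from; any other step size makes the sum of the two terms larger.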

Theorem 15 (Regret Lower Bound for Unbounded Disturbances).

Suppose an adversary can choose any state disturbances $w_{t}$ with $\sum_{t=1}^{T}\left\lVert w_{t}\right\rVert=E$. For any $\rho\in(0,1]$, there is a strongly $\rho$-locally controllable instance $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D)$ with convex losses $f_{t}$ such that any algorithm $\operatorname{\mathcal{A}}$ incurs regret $\operatorname{\textup{{Reg}}}_{T}(\operatorname{\mathcal{A}})\geq\min(\frac{2LRE}{\rho},2TLR)$.

Proof

Let $\operatorname{\mathcal{Y}}=[-R,R]$ for any $R>0$ and let $f_{t}(y)=-Ly+LR$ for each $t$. Suppose strong $\rho$-local controllability exactly characterizes the range of $D$, i.e. for any $y,y^{\prime}\in\operatorname{\mathcal{Y}}$ there is some $x$ such that $D(x,y)=y^{\prime}$ if and only if $\left\lvert y-y^{\prime}\right\rvert\leq\rho$. Consider an adversary who chooses disturbances $w_{t}$ in each round such that $y_{t}=-R$ until their disturbance budget $E$ is exhausted. This requires a disturbance of magnitude at most $R+\rho$ for $w_{1}$, as we assume $y_{0}=0$, and at most $\rho$ in subsequent rounds, and thus the adversary can force any algorithm to remain at $y_{t}=-R$ for $({E-R}){\rho^{-1}}$ rounds.

As such, any algorithm must incur loss of at least ${2LR(E-R)}\rho^{-1}$ across these rounds, and further must incur average loss $LR$ over the subsequent ${2R}\rho^{-1}$ rounds (if $T$ is not yet reached), for an additional loss of ${2LR^{2}}{\rho^{-1}}$ , as they can only decrease per-round loss by $L\rho$ given the restriction on the range of $D$ . As the optimal state $y^{*}=R$ obtains loss 0, the total regret is at least:

\displaystyle\sum_{t=1}^{T}f_{t}(y_{t})-f_{t}(y^{*})\geq

\displaystyle\;\min\left(\frac{2LRE}{\rho},2TLR\right).

∎
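The adversary in this construction can be simulated directly. The sketch below pits it against one particular learner, a greedy one that always moves as far toward $y^{*}=R$ as the dynamics allow; the greedy learner and the function name are illustrative assumptions, since the theorem concerns every algorithm.

```python
def greedy_vs_adversary(T, R, rho, L, E):
    """Regret of a greedy learner on Y = [-R, R] with f_t(y) = -L*y + L*R.

    The learner moves by at most rho per round toward y* = R; the adversary
    spends its disturbance budget E dragging the state back toward -R.
    """
    y = 0.0                                  # y_0 = 0
    budget = E
    total_loss = 0.0
    for _ in range(T):
        y_pre = min(y + rho, R)              # best reachable state toward y*
        push = min(budget, y_pre + R)        # adversary pushes as far as it can afford
        budget -= push
        y = y_pre - push
        total_loss += -L * y + L * R         # per-round loss at the disturbed state
    return total_loss                        # y* = R has zero loss, so this is the regret
```

With $R=1$, $\rho=0.25$, $L=1$, $E=3$ and $T=100$, this toy run accumulates regret 23 against $2LRE/\rho=24$, so in this instance the greedy learner comes within $LR$ of the stated quantity.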

Together, the previous two theorems yield Theorem 3. Note that for both algorithms it remains computationally efficient to optimize over action-linear dynamics, as the constraint that $D(x,y_{t-1})\in\operatorname{\mathcal{Y}}$ can be encoded as a convex constraint over $\operatorname{\mathcal{X}}$.

Appendix F Unknown Dynamics: Analysis for $\operatorname{\textup{{ProbingOCO}}}$

Algorithm 5 Probing Online Convex Optimization ($\operatorname{\textup{{ProbingOCO}}}$)
Let $n=\dim(\operatorname{\mathcal{X}})$, let $y_{0}=\mathbf{0}$, and let $x_{1}\in\operatorname{\mathcal{X}}$ such that $\left\lVert D(x_{1},y_{0})-y_{0}\right\rVert\leq\epsilon=o(\sqrt{T})$
Initialize $\operatorname{\textup{{NestedOCO-BD}}}$ to run over $\operatorname{\mathcal{Y}}$ for $T/(2n+1)$ rounds
Run Estimate for $2n+1$ rounds:
  Play $x_{1}$
  for $i=1$ to $n$ do
    Play $x_{1}+\epsilon\cdot e_{i}$
    Play $x_{1}-\epsilon\cdot e_{i}$
  end for
  Solve for estimates $(\hat{A}_{y},\hat{b}_{y})$ which are consistent with the previous $2n+1$ observed state updates, up to error $O(\epsilon)$
for $t=2n+1$ to $T$ do
  Let $t^{*}=t$
  Using $(\hat{A}_{y},\hat{b}_{y})$, target $y=y_{t^{*}}$
  Let $y^{*}$ be the next point chosen by $\operatorname{\textup{{NestedOCO-BD}}}$
  for $i=1$ to $n$ do
    Using $(\hat{A}_{y},\hat{b}_{y})$, target $y=y_{t^{*}}+\frac{2i-1}{2n}(y^{*}-y_{t^{*}})+\epsilon\cdot e_{i}$
    Using $(\hat{A}_{y},\hat{b}_{y})$, target $y=y_{t^{*}}+\frac{2i}{2n}(y^{*}-y_{t^{*}})-\epsilon\cdot e_{i}$
  end for
  Update estimates $(\hat{A}_{y},\hat{b}_{y})$, solving for values which are consistent with the previous $2n+1$ observed state updates, up to error $O(\epsilon)$
end for

Proof of Theorem 4. Assume the following hold for $D(x,y)$ at each $y$:
- $D(x,y)=A_{y}\cdot x+b_{y}+y+q_{y}(x)$, for a function $q_{y}:\operatorname{\mathcal{X}}\rightarrow\operatorname{\mathbb{R}}^{n}$;
- $A_{y}$ has a largest absolute eigenvalue bounded by an absolute constant, smallest absolute eigenvalue bounded away from 0, and is $L_{\alpha}$-Lipschitz in the matrix $\ell_{2}$ norm;
- $b_{y}$ has a norm bounded by an absolute constant, and is $L_{\beta}$-Lipschitz;
- $\left\lVert q_{y}(x)\right\rVert\leq\epsilon$ for any $x$ such that $\left\lVert A_{y}\cdot x+b_{y}-y\right\rVert=O(\sqrt{T})$.

In the neighborhood of any $y^{*}$, observe that playing $x=A^{-1}_{y}(y^{*}-y-b_{y})$ yields an update to $y^{*}+w_{\epsilon}$, where the error term $w_{\epsilon}$ has magnitude bounded linearly in terms of the neighborhood size as well as polynomially in the relevant constants. We assume sufficiently small values of $\epsilon$, $L_{\alpha}$, and $L_{\beta}$ (whose relative bounds may trade off with each other, and in general will be inverse-polynomial in problem parameters other than $T$) to bound the error of this process in accordance with the requirements of Theorem 2, as well as to ensure that estimation error for $(\hat{A}_{y},\hat{b}_{y})$ is uniformly bounded for all $t\leq T$. Given $\epsilon=o(\sqrt{T})$, this yields estimation error terms $w_{t}\leq C\sqrt{T}$ in each round, for small enough $C$ to obtain the desired regret bound. ∎
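The estimation step can be illustrated with central differences over the $2n+1$ probes. The sketch below is a simplification under stated assumptions: the state $y$ is held fixed while probing (in the algorithm the state moves between probes, which is what the Lipschitz conditions on $A_{y}$ and $b_{y}$ control), the residual $q_{y}$ is ignored, and `update` is a hypothetical callback returning the observed state change $D(x,y)-y$.

```python
def estimate_affine(update, x1, eps):
    """Recover (A_hat, b_hat) with update(x) ~ A x + b from 2n + 1 probes.

    Probes: x1, then x1 + eps * e_i and x1 - eps * e_i for each coordinate i.
    """
    n = len(x1)
    base = update(x1)
    m = len(base)
    A_hat = [[0.0] * n for _ in range(m)]
    for i in range(n):
        x_plus = list(x1)
        x_plus[i] += eps
        x_minus = list(x1)
        x_minus[i] -= eps
        up, um = update(x_plus), update(x_minus)
        for row in range(m):
            # Central difference gives column i of A.
            A_hat[row][i] = (up[row] - um[row]) / (2.0 * eps)
    # The offset follows from the unperturbed probe.
    b_hat = [base[row] - sum(A_hat[row][i] * x1[i] for i in range(n)) for row in range(m)]
    return A_hat, b_hat
```

For exactly affine updates the recovered pair matches the true one up to floating-point error; with a residual of magnitude at most $\epsilon$ and probe radius $\epsilon$, each central difference picks up an additional error of constant order, so the sketch should be read only as an illustration of the probing pattern.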

Appendix G Bandit Feedback: Analysis for $\operatorname{\textup{{NestedBCO}}}$

We first state the $\operatorname{\textup{{FKM}}}$ algorithm and its bounds for regret and per-round step size.

Algorithm 6 $\operatorname{\textup{{FKM}}}$ (Flaxman et al., 2004)
Input: decision set $\operatorname{\mathcal{K}}$ containing $\mathbf{0}$, set $v_{1}=\mathbf{0}$, parameters $\eta,\tilde{\delta}$
Let $v_{1}\in\operatorname*{int}(\operatorname{\mathcal{K}})$ such that $\nabla\mathcal{R}(v_{1})=0$
for $t=1$ to $T$ do
  Draw $u_{t}\in\mathbb{S}$ uniformly, set $y_{t}=v_{t}+\tilde{\delta}u_{t}$
  Play $y_{t}$, observe loss $f_{t}(y_{t})$, set $g_{t}=\frac{n}{\tilde{\delta}}f_{t}(y_{t})u_{t}$
  Update $v_{t+1}={\Pi}_{\operatorname{\mathcal{K}}_{\tilde{\delta}}}\left[v_{t}-\eta g_{t}\right]$, where $\operatorname{\mathcal{K}}_{\tilde{\delta}}=\{(1-\tilde{\delta})v:v\in\operatorname{\mathcal{K}}\}$
end for

Proposition 6 (Flaxman et al. (2004)).

For $L$ -Lipschitz convex losses and a domain $\operatorname{\mathcal{K}}$ with diameter $2R$ which contains a ball of radius $r$ around the origin, $\operatorname{\textup{{FKM}}}$ obtains expected regret

\displaystyle\operatorname{\textup{{Reg}}}_{T}(\operatorname{\textup{{FKM}}})\leq

\displaystyle\;\eta\frac{n^{2}}{\tilde{\delta}^{2}}T+\frac{4R^{2}}{\eta r^{2}}% +\frac{8\tilde{\delta}RLT}{r},

with each point $y_{t}$ contained in $\operatorname{\mathcal{K}}$ . Further, each pair of consecutive points $y_{t}$ , $y_{t+1}$ chosen by $\operatorname{\textup{{FKM}}}$ satisfies $\left\lVert y_{t+1}-y_{t}\right\rVert\leq 2\tilde{\delta}+\frac{\eta nL}{% \tilde{\delta}}$ .
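A minimal sketch of the one-point gradient estimator follows, assuming for concreteness that the decision set is the Euclidean unit ball (so that projection onto the shrunk set is a rescaling). The function name, the seeding, and the Gaussian-normalization sampler for the unit sphere are illustrative choices rather than part of the algorithm statement.

```python
import math
import random


def fkm_unit_ball(losses, n, eta, delta, seed=0):
    """One-point bandit gradient descent over the unit ball in R^n (sketch).

    losses is a list of callables f_t(y); only the value f_t(y_t) is used.
    Returns the list of played points y_t.
    """
    rng = random.Random(seed)
    v = [0.0] * n                                   # v_1 = 0
    limit = 1.0 - delta                             # radius of the shrunk set
    played = []
    for f in losses:
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(c * c for c in g))
        u = [c / norm for c in g]                   # uniform direction on the sphere
        y = [v[i] + delta * u[i] for i in range(n)]
        played.append(y)
        scale = (n / delta) * f(y)                  # g_t = (n / delta) f_t(y_t) u_t
        w = [v[i] - eta * scale * u[i] for i in range(n)]
        w_norm = math.sqrt(sum(c * c for c in w))
        if w_norm > limit:                          # project onto (1 - delta) K
            w = [c * limit / w_norm for c in w]
        v = w
    return played
```

Because the projection is non-expansive, consecutive played points in this sketch differ by at most $2\tilde{\delta}+\eta n B/\tilde{\delta}$ whenever the observed loss values are bounded in magnitude by $B$, which mirrors the per-round step bound quoted in Proposition 6.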

The $\operatorname{\textup{{NestedBCO}}}$ algorithm is essentially equivalent to $\operatorname{\textup{{NestedOCO}}}$ , replacing $\operatorname{\textup{{FTRL}}}$ with $\operatorname{\textup{{FKM}}}$ and recalibrating parameters.

Algorithm 7 Nested Bandit Convex Optimization ($\operatorname{\textup{{NestedBCO}}}$)
Let $\tilde{\delta}=\frac{1}{T^{1/4}}=r\delta\rho/4$, let $\eta=\frac{R}{2nrLT^{3/4}}$
Let $\widetilde{\operatorname{\mathcal{Y}}}=\{y:\frac{1}{1-\delta}y\in\operatorname{\mathcal{Y}}\}$
Initialize $\operatorname{\textup{{FKM}}}$ to run for $T$ rounds over $\widetilde{\operatorname{\mathcal{Y}}}$ with parameters $\eta,\tilde{\delta}$
for $t=1$ to $T$ do
  Let $y^{*}$ be the point chosen by $\operatorname{\textup{{FKM}}}$
  Use $\texttt{Oracle}(y_{t-1},y^{*})$ to compute $x_{t}=\operatorname*{argmin}_{x}\left\lVert D_{t}(x,y_{t-1})-y^{*}\right\rVert^{2}$
  Play action $x_{t}$
  Observe $y_{t}$ and loss $f_{t}(y_{t})$, update $\operatorname{\textup{{FKM}}}$
end for

Proof of Theorem 5. Following the proof of Theorem 1, to apply the bound of $\operatorname{\textup{{FKM}}}$ to our setting (along with excess regret at most $\delta LR$ per round from contracting $\operatorname{\mathcal{Y}}$ to $\widetilde{\operatorname{\mathcal{Y}}}$), the key step is to show that each point selected by $\operatorname{\textup{{FKM}}}$ is feasible under weakly locally controllable dynamics over $\widetilde{\operatorname{\mathcal{Y}}}$, i.e. $\left\lVert y_{t+1}-y_{t}\right\rVert\leq r\delta\rho$. Let $\tilde{\delta}=\frac{1}{T^{1/4}}=r\delta\rho/4$, and let $\eta=\frac{R}{2nrLT^{3/4}}$. Assume for simplicity that $r\leq 1$ and $T^{1/4}\geq\frac{R}{r}$. When instantiating $\operatorname{\textup{{FKM}}}$ over $\widetilde{\operatorname{\mathcal{Y}}}$ with parameters $\eta$ and $\tilde{\delta}$, by Proposition 6 we then have

	$\displaystyle\left\lVert y_{t+1}-y_{t}\right\rVert\leq$	$\displaystyle\;2\tilde{\delta}+\frac{\eta nL}{\tilde{\delta}}$
	$\displaystyle\leq$	$\displaystyle\;r\delta\rho/2+\left(\frac{R}{2nrLT^{3/4}}\right)\frac{nL}{% \tilde{\delta}}$
	$\displaystyle\leq$	$\displaystyle\;r\delta\rho/2+\tilde{\delta}/2$
	$\displaystyle\leq$	$\displaystyle\;r\delta\rho,$

and so each selected point is feasible. This allows us to bound our regret by

$\displaystyle\operatorname{\textup{{Reg}}}_{T}(\operatorname{\textup{{NestedBCO}}})\leq$	$\displaystyle\;\operatorname{\textup{{Reg}}}_{T}(\operatorname{\textup{{FKM}}})+\delta LRT$
$\displaystyle\leq$	$\displaystyle\;\eta\frac{n^{2}}{\tilde{\delta}^{2}}T+\frac{4R^{2}}{\eta r^{2}}+\frac{8\tilde{\delta}LRT}{r}+{\delta LRT}$
$\displaystyle=$	$\displaystyle\;\eta\frac{16n^{2}}{r^{2}{\delta}^{2}\rho^{2}}T+\frac{4R^{2}}{% \eta r^{2}}+2\delta\rho LRT+{\delta LRT}$	( $\tilde{\delta}=r\delta\rho/4$ )
$\displaystyle\leq$	$\displaystyle\;16\eta n^{2}T^{3/2}+\frac{4R^{2}}{\eta r^{2}}+\frac{12LRT^{3/4}% }{r\rho}$	( $\delta=\frac{4}{r\rho T^{1/4}},r\leq 1$ )
$\displaystyle\leq$	$\displaystyle\;\frac{16nLRT^{3/4}}{r}+\frac{12LRT^{3/4}}{r\rho}$	( $\eta=\frac{R}{2nrLT^{3/4}}$ )
$\displaystyle=$	$\displaystyle\;O\left(nRLT^{3/4}(r\rho)^{-1}\right).$

∎
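The feasibility chain in the proof can be checked numerically for concrete parameters. The helper below is a hypothetical sketch that computes $\tilde{\delta}$, $\delta$, $\eta$ and the resulting per-round step bound from Proposition 6; the sample values used to exercise it are arbitrary and chosen to satisfy $r\leq 1$ and $T^{1/4}\geq R/r$.

```python
def nested_bco_params(T, n, R, r, L, rho):
    """Parameters from the proof and the resulting step bound (sketch).

    Returns (delta_tilde, delta, eta, step), where
    step = 2 * delta_tilde + eta * n * L / delta_tilde.
    """
    delta_tilde = T ** -0.25                        # perturbation radius
    delta = 4.0 * delta_tilde / (r * rho)           # from delta_tilde = r * delta * rho / 4
    eta = R / (2.0 * n * r * L * T ** 0.75)
    step = 2.0 * delta_tilde + eta * n * L / delta_tilde
    return delta_tilde, delta, eta, step


def step_is_feasible(T, n, R, r, L, rho):
    """Check the conclusion of the chain: step <= r * delta * rho."""
    _, delta, _, step = nested_bco_params(T, n, R, r, L, rho)
    return step <= r * delta * rho
```

For example, with $T=10^{4}$, $n=3$, $R=2$, $r=1$, $L=1$ and $\rho=0.5$ the step bound is roughly $0.21$ while $r\delta\rho$ is roughly $0.4$, so each selected point is reachable in this instance.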

Appendix H Background and Proofs for Section 4.1: Performative Prediction

H.1 Background

Introduced by Perdomo et al. (2020), the Performative Prediction problem captures settings in which the data distribution for which a classifier is deployed may shift as a function of the classifier itself, notably including strategic classification Hardt et al. (2015) as well as problems related to reinforcement learning and causal inference. While a number of extensions of strategic classification to online settings have been considered Dong et al. (2018); Zrnic et al. (2021b); Ahmadi et al. (2023), the bulk of the literature on performative prediction considers settings with a fixed loss function and distribution “update map” Perdomo et al. (2020); Miller et al. (2021); Jagadeesan et al. (2022b); Mendler-Dünner et al. (2020); Piliouras and Yu (2022); Brown et al. (2022), where the update map may sometimes depend on the current distribution (as in the Stateful Performative Prediction setting of Brown et al. (2022)). For the location-scale family of update maps introduced by Miller et al. (2021) (and additionally explored by Jagadeesan et al. (2022b) from a regret minimization perspective), which yields a convex “performative risk” objective function, a formulation of Online Performative Prediction is given by Kumar et al. (2022) as an application of online convex optimization with unbounded memory, in which the classification loss function may change over time and the distribution updates may occur gradually.

Here, we generalize the problem formulation of Kumar et al. (2022) to also accommodate notions of statefulness similar to that in Brown et al. (2022). In particular, the instances we consider will resemble location-scale maps when restricting attention only to the performatively stable classifiers for each distribution, yet the update effect of a non-stable classifier may be distribution-dependent and nonlinear, provided that the update map satisfies local controllability (viewing classifiers as actions and distributions as states) and mild regularity properties (e.g. invertibility and Lipschitz conditions).

H.2 Model

In the setting of Online Performative Prediction we consider, as formulated by Kumar et al. (2022), in each round $t\in[T]$ we deploy some classifier $x_{t}$ , and observe samples from some distribution $p_{t}$ , which may change dynamically as a function of the history of interactions. Here, we take $\operatorname{\mathcal{X}}\subseteq\operatorname{\mathbb{R}}^{n}$ as our space of classifiers, e.g. representing weight vectors for regression, which we assume is bounded and convex. The initial data distribution is given by some distribution $p_{0}$ over $\operatorname{\mathbb{R}}^{n}$ . In each round, upon deploying a classifier $x_{t}$ , the distribution is updated according to

\displaystyle p_{t}=

\displaystyle\;(1-\theta)p_{t-1}+\theta\operatorname{\mathcal{D}}(x_{t},y_{t-1% }),

for $\theta\in(0,1]$ , where $\operatorname{\mathcal{D}}(x_{t},y_{t-1})$ is the distribution update map taking as input our classifier $x_{t}$ and some representation of the state $y\in\operatorname{\mathcal{Y}}$ , where we assume $\operatorname{\mathcal{Y}}\subseteq\operatorname{\mathbb{R}}^{n}$ is convex, contains $\operatorname{\mathcal{B}}_{r}(\mathbf{0})$ , is bounded with radius $R$ , and that $y_{0}=0$ . We make the following assumptions on $\operatorname{\mathcal{D}}$ .

Assumption 1.

We assume the distribution update map $\operatorname{\mathcal{D}}(x,y)$ operates as follows:

• $\operatorname{\mathcal{D}}(x,y)=A(x,y)+\xi$, with $A:\operatorname{\mathcal{X}}\times\operatorname{\mathcal{Y}}\rightarrow\operatorname{\mathcal{Y}}$,

• $\xi$ is a random variable in $\operatorname{\mathbb{R}}^{n}$ with mean $\mu$ and covariance $\Sigma$,

• $A(x,y)$ satisfies $\rho$-local controllability and has an inverse action mapping $X(y,y^{*})$ where

\displaystyle A(X(y,y^{*}),y)=y^{*},

defined over feasible pairs, which is $L_{y}$-Lipschitz in $y$ (when feasibility of $y^{*}$ holds), and

• There is a linear invertible function $s:\operatorname{\mathcal{X}}\rightarrow\operatorname{\mathcal{Y}}$ such that $A(x,y)=s(x)$ if $y=s(x)$, where $s^{-1}:\operatorname{\mathcal{Y}}\rightarrow\operatorname{\mathcal{X}}$ is $S$-Lipschitz.

Further, $A(x,y)$ is known and $\xi$ can be sampled freely.

The inverse action mapping assumption simply enforces that classifiers need not change drastically to have the same update effect under small changes to the state. The final assumption imposes a linear structure over performatively stable classifiers (i.e. classifiers for which the resulting distribution will remain fixed under $\operatorname{\mathcal{D}}$, as formulated by Perdomo et al. (2020)), but we note that the distribution may update in an arbitrarily nonlinear fashion (subject to the other conditions) when $x_{t}$ is not a performatively stable classifier for the distribution induced by the previous state $y_{t-1}$. The ability to accommodate a state component is reminiscent of prior work involving notions of statefulness in performative prediction such as Brown et al. (2022). Our setting generalizes that of Kumar et al. (2022), in which the map $A$ is taken to be a fixed matrix. For any nonsingular matrix $A$ there is immediately a linear map $s(x)=A^{-1}x$, and local controllability can be defined in terms of the largest and smallest absolute eigenvalues of $A$ (as a special case of our Example 1 with a fixed matrix). We view the nonsingularity assumption (and invertibility in the more general case) as fairly mild, as it amounts to assuming that the distribution map can depend on all parameters of the classifier without any necessary (linear) dependency structure imposed, and that no two classifiers are equivalent only to the population but not the optimizer (as otherwise one could simply reduce dimensionality of $\operatorname{\mathcal{X}}$). However, even in the case where $A$ is singular, we note that this issue is resolvable by augmenting the state representation $y_{t}$ to incorporate the choice of free classifier parameters which affect loss but not distribution updates (e.g. by adding a vector $w_{t}$ to $y_{t}$ which is orthogonal to the range of $A$ and linear in $x_{t}$). We assume invertibility here for simplicity, and we take $\operatorname{\mathcal{Y}}$ to simply be given by the range of $s$ over $\operatorname{\mathcal{X}}$. At each round $t$, some scoring function $f_{t}(x,z)$ is chosen adversarially, and our loss is then given by

\displaystyle\tilde{f}_{t}(x_{t},p_{t})=

\displaystyle\;\operatorname*{\mathbb{E}}_{z\sim p_{t}}[f_{t}(x_{t},z)].

We assume each $f_{t}$ is convex and $L_{z}$-Lipschitz in both $x$ and $z$, and that $p_{0}=y_{0}+\xi$. We measure our regret with respect to the best performatively stable classifier, i.e. the loss of any classifier as if it were held constant indefinitely as the distribution updates. We define our regret as follows:

\displaystyle\operatorname{\textup{{Reg}}}_{T}(\operatorname{\mathcal{A}})=

\displaystyle\;\max_{x^{*}}\sum_{t=1}^{T}\tilde{f}_{t}(x_{t},p_{t})-\tilde{f}_% {t}(x^{*},\operatorname{\mathcal{D}}(x^{*},s(x^{*})))

Here, the role of $s(x^{*})$ captures the convergence of the distribution to a stable point, resulting from taking the limit of the distribution update rule as $t$ grows large.

As in many of the applications we consider, here our loss is determined both by our action (the classifier) and the state (in terms of the distribution). Our approach for casting Online Performative Prediction as an instance of online nonlinear control in our framework will be to define appropriate surrogate convex losses which depend only on the state, over which we run $\operatorname{\textup{{NestedOCO}}}$ . Here, these will correspond to losses only over the updated distribution component $\operatorname{\mathcal{D}}(x_{t},y_{t-1})$ , which we show closely track our true incurred loss.

H.3 Analysis

For each round $t$ , define the surrogate loss $f^{*}_{t}(y)$ as:

\displaystyle f^{*}_{t}(y)=\operatorname*{\mathbb{E}}_{z\sim y+\xi}\left[f_{t}(s^{-1}(y),z)\right].

Lemma 1.

Each $f_{t}^{*}(y)$ is convex and $(1+S)L_{z}$ -Lipschitz in $y$ .

Proof

Consider any individual sample $v\sim\xi$. We can then view $g(y)=(s^{-1}(y),y+v)$ as a vector-valued function which is $(1+S)$-Lipschitz. The function $f_{t}(g(y))$ is an $L_{z}$-Lipschitz and convex function of this linear function of $y$, and thus $f_{t}(s^{-1}(y),y+v)$ is convex and $(1+S)L_{z}$-Lipschitz in $y$. The function $f^{*}_{t}(y)$ is an average of such functions, taken over the expectation of $\xi$, and thus is convex and $(1+S)L_{z}$-Lipschitz in $y$ as well. ∎

Observe that $f^{*}_{t}(y)=\tilde{f}_{t}(s^{-1}(y),\operatorname{\mathcal{D}}(s^{-1}(y),y))$. We will run $\operatorname{\textup{{NestedOCO}}}$ for these losses over the $\rho$-locally controllable instance $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},A)$, where we can track the current state $y_{t}=A(x_{t},y_{t-1})$ at each step as a function of our past actions given knowledge of $A$, and can compute gradients of $f^{*}_{t}(y_{t})$ to arbitrary desired precision by sampling from $\xi$. This will yield the regret bound from Theorem 1 with respect to the surrogate losses, and the key challenge will be to analyze our error between the true and surrogate losses.

Lemma 2.

For any round $t$ we have that

\displaystyle\tilde{f}_{t}(x_{t},p_{t})-f_{t}^{*}(y_{t})\leq

\displaystyle\;(1-\theta)^{h}M+\frac{\eta L_{z}(1+S)}{\gamma}\cdot\left(L_{y}+% \frac{1-\theta}{\theta}\right)

Proof

For any $h<t$ , the loss of $x_{t}$ over the distribution $y_{t-h}+\xi=\operatorname{\mathcal{D}}(x_{t-h},y_{t-h-1})$ can be expressed as

\displaystyle\hat{f}_{t}(x_{t},y_{t-h})=

\displaystyle\;\operatorname*{\mathbb{E}}_{z\sim\xi+y_{t-h}}\left[f_{t}(x_{t},% z)\right],

which is convex and $L_{z}$-Lipschitz in both parameters when taking the expectation over $\xi$. For round $t$ in isolation, using the inverse action mapping bound and the bound on $\left\lVert y_{t}-y_{t-1}\right\rVert$ from Proposition 5 we have that

	$\displaystyle\hat{f}_{t}(x_{t},y_{t})-f^{*}_{t}(y_{t})=$	$\displaystyle\;\hat{f}_{t}(x_{t},y_{t})-\hat{f}_{t}(s^{-1}(y_{t}),y_{t})$
	$\displaystyle=$	$\displaystyle\;\hat{f}_{t}(X(y_{t-1},y_{t}),y_{t})-\hat{f}_{t}(X(y_{t},y_{t}),% y_{t})$
	$\displaystyle\leq$	$\displaystyle\;\frac{\eta L_{y}L_{z}}{\gamma},$

and further for previous states that

\displaystyle\hat{f}_{t}(x_{t},y_{t-h})-f^{*}_{t}(y_{t})\leq

\displaystyle\;(L_{y}+h)\frac{\eta L_{z}(1+S)}{\gamma}.

We can decompose the distribution $p_{t}$ into updates from past rounds as

\displaystyle p_{t}=

\displaystyle\;(1-\theta)^{t}p_{0}+\sum_{h=0}^{t-1}\theta(1-\theta)^{h}% \operatorname{\mathcal{D}}(x_{t-h},y_{t-h-1})

which then yields a loss discrepancy of at most

	$\displaystyle\tilde{f}_{t}(x_{t},p_{t})-f_{t}^{*}(y_{t})\leq$	$\displaystyle\;(1-\theta)^{t}f_{t}(x_{t},p_{0})+\frac{\eta L_{z}(1+S)}{\gamma}% \left(\sum_{h=0}^{t-1}\theta(1-\theta)^{h}(L_{y}+h)\right)$
	$\displaystyle\leq$	$\displaystyle\;\frac{\eta L_{z}(1+S)}{\gamma}\cdot\left(L_{y}+\frac{1-\theta}{% \theta}+(1-\theta)^{t}\right)$

between the true and surrogate loss for round $t$ . ∎
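The decomposition of $p_{t}$ and the geometric weighting of lags used in this proof can be sanity-checked on scalars. In the sketch below each distribution is summarized by a single number (for instance its mean), which is an assumption made purely for illustration; the function names are hypothetical.

```python
def recursive_mixture(m0, updates, theta):
    """Apply p_t = (1 - theta) p_{t-1} + theta D_t round by round."""
    m = m0
    for d in updates:
        m = (1.0 - theta) * m + theta * d
    return m


def unrolled_mixture(m0, updates, theta):
    """Closed form: (1 - theta)^t p_0 + sum_h theta (1 - theta)^h D_{t-h}."""
    t = len(updates)
    total = (1.0 - theta) ** t * m0
    for h in range(t):
        # updates[t - 1 - h] plays the role of D(x_{t-h}, y_{t-h-1}).
        total += theta * (1.0 - theta) ** h * updates[t - 1 - h]
    return total


def weighted_lag(theta, t):
    """sum_{h=0}^{t-1} theta (1 - theta)^h h, which is at most (1 - theta) / theta."""
    return sum(theta * (1.0 - theta) ** h * h for h in range(t))
```

The two mixture functions agree up to floating-point error, and the weighted lag approaches $(1-\theta)/\theta$ from below as $t$ grows, which is the source of the $\frac{1-\theta}{\theta}$ term in the bound.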

We can now bound the cumulative regret of $\operatorname{\textup{{NestedOCO}}}$ for the problem.

Theorem 16.

For any $\theta>0$ , when Assumption 1 holds for the distribution update rule, Online Performative Prediction can be cast as a $\rho$ -locally controllable instance of online control with nonlinear dynamics, for which $\operatorname{\textup{{NestedOCO}}}$ obtains regret

\displaystyle\operatorname{\textup{{Reg}}}_{T}(\operatorname{\textup{{% NestedOCO}}})\leq

\displaystyle\;2\sqrt{\frac{(1+L_{y}+\frac{R}{r\rho}+\frac{2-\theta}{\theta})% TGL_{z}^{2}(1+S)^{2}}{\gamma}}

with respect to the best performatively stable classifier.

Proof

Combining the previous results with Theorem 1, we have that for any $x^{*}\in\operatorname{\mathcal{X}}$ our regret is at most

	$\displaystyle\sum_{t=1}^{T}\tilde{f}_{t}(x_{t},p_{t})-\tilde{f}_{t}(x^{*},\operatorname{\mathcal{D}}(x^{*},s(x^{*})))\leq$	$\displaystyle\;\sum_{t=1}^{T}f^{*}_{t}(y_{t})-\tilde{f}_{t}(x^{*},\operatorname{\mathcal{D}}(x^{*},s(x^{*})))+\sum_{t=1}^{T}\tilde{f}_{t}(x_{t},p_{t})-f^{*}_{t}(y_{t})$
	$\displaystyle\leq$	$\displaystyle\;\eta\left(1+L_{y}+\frac{2-\theta}{\theta}+\frac{R}{r\rho}\right% )\frac{TL_{z}(1+S)}{\gamma}+\frac{G}{\eta}$
	$\displaystyle=$	$\displaystyle\;2\sqrt{\frac{(1+L_{y}+\frac{R}{r\rho}+\frac{2-\theta}{\theta})% TGL_{z}^{2}(1+S)^{2}}{\gamma}}$

upon setting $\eta=\sqrt{\frac{G\gamma}{(1+L_{y}+\frac{R}{r\rho}+\frac{2-\theta}{\theta})TL_% {z}^{2}(1+S)^{2}}}$ . ∎

Theorem 6 follows directly from Theorem 16. For Online Performative Prediction, in the full generality of the setting considered, the per-round optimization problem may not be convex, in which case we make use of the non-convex optimization oracle access for $\operatorname{\textup{{NestedOCO}}}$ . However, in each of the following applications we show that the action selection step can indeed be implemented efficiently without imposing additional restrictions on the dynamics.

Appendix I Background and Proofs for Section 4.2: Adaptive Recommendations

I.1 Background

Motivated by problems involving preference dynamics and feedback loops in recommendation systems (see e.g. Flaxman et al. (2016)), a number of recent works Hazla et al. (2019); Gaitonde et al. (2021); Dean and Morgenstern (2022); Jagadeesan et al. (2022a); Agarwal and Brown (2022, 2023) have explored models of repeated recommendation where recommendations are given to an agent whose preferences or opinions evolve over time. Several of these models Hazla et al. (2019); Dean and Morgenstern (2022); Jagadeesan et al. (2022a) consider population-level effects for settings where a single recommendation is given each round and consumers (or producers) update their behavior according to linear dynamics. Nonlinear preference dynamics with menus of recommendations for a single agent are considered in Agarwal and Brown (2022, 2023), where the aim is to minimize regret for adversarial losses over the agent’s choices. The Adaptive Recommendations formulation of Agarwal and Brown (2022) somewhat resembles the “Dueling Bandits” setting of Yue et al. (2012), where $k>1$ actions are chosen in each round, yet where preferences can now evolve dynamically as a function of the history rather than remaining fixed. Whereas Agarwal and Brown (2022, 2023) study a bandit formulation of the problem with unknown preference dynamics, here we consider a full-feedback model with known dynamics, allowing for relaxed structural assumptions (on the agent’s “memory horizon” and “preference scoring functions”) at the cost of stronger informational assumptions, while maintaining the overall dynamics of the problem.

I.2 Model

Here, we are tasked with repeatedly recommending menus of content to an agent. Out of a universe of $n$ elements (e.g. video channels, clothing items), we show a subset of size $k$ (denoted $K_{t}$) to the agent in each round, for $T$ total rounds. The agent chooses one item $i\in K_{t}$ from the menu, according to a distribution in terms of their preferences, which are a function of their selection history. Conditioned on being shown a menu $K_{t}$, the agent’s choice distribution has positive mass only on the $k$ items $i\in K_{t}$. The agent’s representation of their selection history is given by their memory vector $v_{t}\in\Delta(n)$, and choices are determined by their preference scoring functions $s_{i}:\Delta(n)\rightarrow[\lambda,1]$ for each $i$, which map the agent’s memory vector to relative preference scores for each item. The menu we show to the agent may be chosen from some distribution $x_{t}\in\Delta({n\choose k})$, and for each $K_{t}\in[{n\choose k}]$ the agent’s menu-conditional distribution $p_{t}(\cdot;K_{t},v_{t-1})\in\Delta(n)$ is proportional to the scores $s_{i}(v_{t-1})$ for items in $K_{t}$, given as

\displaystyle p_{t}(i;K_{t},v_{t-1})=

\displaystyle\;\frac{s_{i}(v_{t-1})}{\sum_{j\in K_{t}}s_{j}(v_{t-1})}

for each $i\in K_{t}$ , with $p_{t}(j;K_{t},v_{t-1})=0$ for $j\notin K_{t}$ . The joint item choice distribution, considering both random selection of a menu $K_{t}$ according to $x_{t}$ , and the agent’s choice from $K_{t}$ , is given by

\displaystyle p_{t}(\cdot;x_{t},v_{t-1})=

\displaystyle\;\sum_{K_{t}\in{n\choose k}}x_{t}(K_{t})\cdot p_{t}(\cdot;K_{t},% v_{t-1})

which we may denote simply by the vector $p_{t}\in\Delta(n)$ , or as a function $p_{t}(x_{t})$ . In contrast to prior work, here we consider a deterministic variant of the problem as an illustration of the flexibility of our framework for online nonlinear control. In particular, we assume that the agent’s memory vector $v_{t}$ updates according to its expectation over $p_{t}$ as

\displaystyle v_{t}=(1-\theta_{t})v_{t-1}+\theta_{t}p_{t},

where $\theta_{t}\in[\theta,1]$ is the per-round update speed, and we assume that the agent’s scoring functions $s_{i}$ are known. We receive convex and $L$ -Lipschitz losses $f_{t}(p_{t})$ in each round in terms of the agent’s choices, over which we aim to minimize regret with respect to some distribution set $\operatorname{\mathcal{Y}}\subseteq\Delta(n)$ .
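The choice model and memory update described above can be sketched in a few lines. The helper names are hypothetical, menus are represented as tuples of item indices, and the affine scoring function `example_score` is an illustrative assumption rather than a form taken from the paper.

```python
from itertools import combinations


def example_score(v, i, lam):
    """Hypothetical scoring function s_i(v) = (1 - lam) * v_i + lam."""
    return (1.0 - lam) * v[i] + lam


def menu_choice(menu, scores):
    """Menu-conditional choice distribution, proportional to scores on the menu."""
    total = sum(scores[j] for j in menu)
    return {i: scores[i] / total for i in menu}


def joint_choice(x, scores, n):
    """Joint item distribution p_t for a menu distribution x (dict: menu -> prob)."""
    p = [0.0] * n
    for menu, weight in x.items():
        for i, prob in menu_choice(menu, scores).items():
            p[i] += weight * prob
    return p


def update_memory(v, p, theta_t):
    """Deterministic memory update v_t = (1 - theta_t) v_{t-1} + theta_t p_t."""
    return [(1.0 - theta_t) * vi + theta_t * pi for vi, pi in zip(v, p)]


def uniform_menu_distribution(n, k):
    """Uniform distribution over all size-k menus, for experimentation."""
    menus = list(combinations(range(n), k))
    return {menu: 1.0 / len(menus) for menu in menus}
```

Here `scores` is the list of values $s_{i}(v_{t-1})$ evaluated at the previous memory vector, so one round consists of computing the scores, forming $p_{t}$ with `joint_choice`, and then calling `update_memory`.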

The prior work (Agarwal and Brown, 2022, 2023) has considered two particular subsets of $\Delta(n)$ as regret benchmarks. We show that both can be cast as locally controllable instances of online control, and further, we make use of local controllability to give a general characterization of convex sets $\operatorname{\mathcal{Y}}\subseteq\Delta(n)$ over which sublinear regret is attainable. We recall some key definitions and results from (Agarwal and Brown, 2022, 2023).

Definition 4 (Instantaneously Realizable Distributions).

The set of instantaneously realizable distributions at a memory vector $v\in\Delta(n)$ is given by

\displaystyle\operatorname{\textup{{IRD}}}(v)=

\displaystyle\;\operatorname*{convhull}\left\{p(\cdot;K,v):K\in\left[{n\choose k% }\right]\right\}.

Each such set $\operatorname{\textup{{IRD}}}(v_{t-1})$ corresponds to the feasible distributions $p_{t}$, given the agent’s scoring functions and memory $v_{t-1}$. It is shown by Agarwal and Brown (2023) that each $\operatorname{\textup{{IRD}}}$ set can be directly characterized in terms of the ratios between target frequencies and scores.

Proposition 7 (Menu Times for $\operatorname{\textup{{IRD}}}$ Agarwal and Brown (2023)).

Given a memory vector $v\in\Delta(n)$ and target distribution $p\in\Delta(n)$ , let the menu time $\mu_{i}$ for item $i$ be given by

\displaystyle\mu_{i}=

\displaystyle\;\frac{k\cdot\frac{p(i)}{s_{i}(v)}}{\sum_{j=1}^{n}\frac{p(j)}{s_% {j}(v)}},

where $\sum_{i=1}^{n}\mu_{i}=k$ . Then, $p\in\operatorname{\textup{{IRD}}}(v)$ if and only if $\mu_{i}\leq 1$ for each $i\in[n]$ .
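The menu-time test of Proposition 7 is straightforward to compute. The sketch below uses hypothetical function names and takes the scores $s_{i}(v)$ as a precomputed list; the small tolerance in the membership check is an implementation choice for floating-point comparisons.

```python
def menu_times(p, scores, k):
    """Menu times mu_i = k * (p_i / s_i) / sum_j (p_j / s_j)."""
    ratios = [pi / si for pi, si in zip(p, scores)]
    total = sum(ratios)
    return [k * r / total for r in ratios]


def in_ird(p, scores, k, tol=1e-12):
    """Membership test from Proposition 7: p is in IRD(v) iff every mu_i <= 1."""
    return all(mu <= 1.0 + tol for mu in menu_times(p, scores, k))
```

As a quick check with $n=4$, $k=2$ and equal scores, the uniform distribution has all menu times equal to $1/2$ and passes the test, while a target placing mass $0.7$ on one item has menu time $1.4$ for that item and fails, matching the intuition that no item can be chosen more than half the time from menus of two equally scored items.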

We recall the prior benchmark sets considered, and the corresponding assumptions which yield feasibility of regret minimization. We state informal analogues of the prior results as translated to our setting, which we then show formally below.

Definition 5 (Everywhere Instantaneously Realizable Distributions).

The set of everywhere instantaneously realizable distributions is given by

\displaystyle\operatorname{\textup{{EIRD}}}=

\displaystyle\;\bigcap_{v\in\Delta(n)}\operatorname{\textup{{IRD}}}(v).

Proposition 8 (Corollary of Agarwal and Brown (2022)).

If $\lambda\geq\frac{k}{n}+\frac{k}{n(n-1)}$ , then $\operatorname{\textup{{EIRD}}}$ is non-empty, and there is a $o(T)$ regret algorithm with respect to any distribution $p\in\operatorname{\textup{{EIRD}}}$ .

Distributions $p_{t}\in\operatorname{\textup{{EIRD}}}$ are always feasible regardless of $v_{t-1}$ by an appropriate choice of $x_{t}$ , but $\operatorname{\textup{{EIRD}}}$ may be quite small in relation to $\Delta(n)$ . Under stronger assumptions for each $s_{i}$ , a potentially much larger set becomes feasible as a regret benchmark.

Definition 6 ( $\phi$ -Smoothed Simplex).

The $\phi$ -smoothed simplex $\Delta^{\phi}(n)$ for $\phi\in[0,1]$ is given by

\displaystyle\Delta^{\phi}(n)=

\displaystyle\;\{(1-\phi)v+\phi\mathbf{u}_{n}:v\in\Delta(n)\}

Definition 7 (Scale-Bounded Functions).

A scoring function $s_{i}:\Delta(n)\rightarrow[\frac{\lambda}{\sigma},1]$ is said to be $(\sigma,\lambda)$ -scale-bounded for $\sigma>1$ and $\lambda>0$ if, for all $v\in\Delta(n)$ , we have that

\displaystyle\sigma^{-1}((1-\lambda)v_{i}+\lambda)\leq s_{i}(v)\leq\sigma((1-% \lambda)v_{i}+\lambda).

For such functions, each score $s_{i}(v)$ cannot be too far from item $i$ ’s weight in memory, and it is shown that $\operatorname{\textup{{IRD}}}(v)$ contains a ball around $v$ for each $v\in\Delta^{\phi}(n)$ , for an appropriate choice of $\phi$ .

Proposition 9 (Corollary of Agarwal and Brown (2023)).

If each $s_{i}$ is $(\sigma,\lambda)$ -scale-bounded, then there is a $o(T)$ regret algorithm with respect to any distribution $p\in\Delta^{\phi}(n)$ , for $\phi=\Theta(k\lambda\sigma^{2})$ .

We extend these results to general convex benchmark sets $\operatorname{\mathcal{Y}}\subseteq\Delta(n)$ , where we can characterize the feasibility of regret minimization via local controllability using the menu times $\mu_{i}$ . When $\rho$ -local controllability holds over a set $\operatorname{\mathcal{Y}}$ , we can minimize regret via $\operatorname{\textup{{NestedOCO}}}$ using surrogate losses $f_{t}^{*}(v_{t})$ , which closely track our true losses $f_{t}(p_{t})$ .

I.3 Analysis

We make use of the menu time quantities $\mu_{i}$ for a memory vector $v$ and target distribution $p$ to translate our notion of local controllability to the Adaptive Recommendations setting. Let $\operatorname{\mathcal{Y}}$ be any convex subset of $\Delta(n)$ , let $\operatorname{\mathcal{X}}=\Delta({n\choose k})$ , where the dynamics $D_{t}(x_{t},v_{t-1})$ are given by

\displaystyle D_{t}(x_{t},v_{t-1})=

\displaystyle\;(1-\theta_{t})v_{t-1}+\theta_{t}p_{t}(x_{t}).

Note that $D_{t}(x_{t},v_{t-1})$ is action-linear in $x_{t}$ , and thus we can solve for $x_{t}$ efficiently (in terms of $\dim(\operatorname{\mathcal{X}})=O(n^{k})$ ); further, there is a construction given in Agarwal and Brown (2023) for removing exponential dependence on $k$ when computing menu distributions. We consider $\operatorname{\mathcal{Y}}$ as an $(n-1)$ -dimensional subset of $\operatorname{\mathbb{R}}^{n}$ , where we define the ball $\operatorname{\mathcal{B}}_{\rho}(v)$ of radius $\rho$ around a point $v\in\operatorname{\mathcal{Y}}$ as:

\displaystyle\operatorname{\mathcal{B}}_{\rho}(v)=

\displaystyle\;\{p\in\Delta(n):\left\lVert p-v\right\rVert\leq\rho\}.

Theorem 17.

An instance of Adaptive Recommendations $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D)$ satisfies $\rho\theta$ -local controllability if, for any $v\in\operatorname{\mathcal{Y}}$ and $p\in\operatorname{\mathcal{B}}_{\rho\cdot\pi(v)}(v)$ , we have that

\displaystyle\frac{(k-1)p(i)}{s_{i}(v)}\leq

\displaystyle\;\sum_{j\neq i}^{n}\frac{p(j)}{s_{j}(v)}

for every $i\in[n]$ .
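As a hedged illustration, the inequality in Theorem 17 can be checked directly and compared against the menu-time condition $\mu_{i}\leq 1$, and the memory dynamics can be applied for one round. The function names below are hypothetical, and the scores $s_{i}(v)$ are assumed to be given as a list of positive numbers.

```python
def theorem17_condition(p, scores, k, tol=1e-12):
    """Check (k - 1) * p_i / s_i <= sum_{j != i} p_j / s_j for every item i."""
    ratios = [p_i / s_i for p_i, s_i in zip(p, scores)]
    total = sum(ratios)
    return all((k - 1) * r <= (total - r) + tol for r in ratios)


def menu_time_condition(p, scores, k, tol=1e-12):
    """Check mu_i = k * (p_i / s_i) / sum_j (p_j / s_j) <= 1 for every item i."""
    ratios = [p_i / s_i for p_i, s_i in zip(p, scores)]
    total = sum(ratios)
    return all(k * r / total <= 1.0 + tol for r in ratios)


def memory_update(v_prev, p, theta):
    """One step of the dynamics: v_t = (1 - theta) * v_{t-1} + theta * p_t."""
    return [(1 - theta) * v + theta * q for v, q in zip(v_prev, p)]
```

Adding $p(i)/s_{i}(v)$ to both sides of the inequality and dividing by the full sum recovers $\mu_{i}\leq 1$, so the two checks agree away from numerical ties.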

This follows immediately from Proposition 7 and the definition of local controllability, since rearranging $\mu_{i}\leq 1$ gives exactly the inequality above; the argument extends analogously to strong local controllability. We can use this formulation to unify the feasibility analysis for each of the previously considered sets.

Lemma 3.

For $\lambda\geq\frac{k-1}{n-1}+\epsilon$ and $\epsilon\geq 0$ , the $\operatorname{\textup{{EIRD}}}$ set contains a ball of radius $\rho=\Theta(\frac{\epsilon}{nk+\epsilon})$ around $\mathbf{u}_{n}$ , and any instance $(\operatorname{\mathcal{X}},\operatorname{\textup{{EIRD}}},D)$ satisfies $\theta$ -local controllability.

Proof

For any $v\in\Delta(n)$ , $i\in[n]$ , and $p\in\operatorname{\mathcal{B}}_{\rho}(\mathbf{u}_{n})$ we have $p(i)\leq\frac{1}{n}+\frac{\rho\sqrt{2}}{2}$ and $s_{i}(v)\geq\frac{k-1}{n-1}+\epsilon$ , yielding that

\displaystyle\frac{(k-1)p(i)}{s_{i}(v)}\leq

\displaystyle\;\frac{1+\frac{\rho n\sqrt{2}}{2}}{\frac{n}{n-1}+\frac{\epsilon n% }{k-1}},

and over all items $j\neq i$ (with $s_{j}(v)\leq 1$ ) we have

\displaystyle\sum_{j\neq i}^{n}\frac{p(j)}{s_{j}(v)}\geq

\displaystyle\;1-\frac{1}{n}-\frac{\rho\sqrt{2}}{2}.

Observe that the bounds for each term are equalized at $\frac{n-1}{n}$ when $\rho=\epsilon=0$ , and so $\mathbf{u}_{n}\in\operatorname{\textup{{EIRD}}}$ whenever $\lambda\geq\frac{k-1}{n-1}$ . We can specify $\epsilon(\rho)$ in terms of $\rho$ to maintain equality, and thus inclusion of $p\in\operatorname{\textup{{EIRD}}}$ . Taking $\epsilon(\rho)$ in terms of $\rho$ as

	$\displaystyle\epsilon(\rho)=$	$\displaystyle\;\frac{\rho n(k-1)}{\frac{2(n-1)}{\sqrt{2}n}-\rho}$
	$\displaystyle=$	$\displaystyle\;\frac{\frac{\rho n(k-1)\sqrt{2}}{2}}{\left(1-\frac{1}{n}-\frac{% \rho\sqrt{2}}{2}\right)}$
	$\displaystyle=$	$\displaystyle\;(k-1)\left(\frac{\frac{1}{n}+\frac{\rho\sqrt{2}}{2}}{1-\frac{1}% {n}-\frac{\rho\sqrt{2}}{2}}-\frac{1}{n-1}\right)$

gives us that

\displaystyle\frac{1}{n-1}+\frac{\epsilon(\rho)}{k-1}\geq

\displaystyle\;\frac{\frac{1}{n}+\frac{\rho\sqrt{2}}{2}}{1-\frac{1}{n}-\frac{% \rho\sqrt{2}}{2}}

for $\rho\geq 0$ , and so we maintain that $p\in\operatorname{\textup{{EIRD}}}$ . Inverting, we have

\displaystyle\rho(\epsilon)=

\displaystyle\;\frac{\epsilon\frac{2(n-1)}{\sqrt{2}n}}{n(k-1)+\epsilon}

as the radius of a ball around $\mathbf{u}_{n}$ contained in $\operatorname{\textup{{EIRD}}}$ . To see that $\operatorname{\textup{{EIRD}}}$ is $\theta$ -locally controllable, consider any $v_{t-1}$ and $v^{*}$ in $\operatorname{\textup{{EIRD}}}$ where $v^{*}\in\operatorname{\mathcal{B}}_{\pi(v_{t-1})}(v_{t-1})$ , and let $v_{t}=(1-\theta_{t})v_{t-1}+\theta_{t}v^{*}$ . By playing an action distribution $x_{t}$ which induces $p_{t}(x_{t})=v^{*}$ , the memory vector is then updated to $v_{t}$ . This is feasible for any $v_{t}\in\operatorname{\mathcal{B}}_{\theta\cdot\pi(v_{t-1})}(v_{t-1})$ , as each corresponds to some $v^{*}\in\operatorname{\mathcal{B}}_{\pi(v_{t-1})}(v_{t-1})$ . ∎
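The inversion between $\epsilon(\rho)$ and $\rho(\epsilon)$ in the proof above can be sanity-checked numerically. The sketch below implements the first displayed form of $\epsilon(\rho)$ and the stated $\rho(\epsilon)$; the function names are illustrative only, and the parameter values in any check are arbitrary.

```python
import math


def eps_of_rho(rho: float, n: int, k: int) -> float:
    """epsilon(rho) = rho * n * (k - 1) / (2(n - 1) / (sqrt(2) n) - rho)."""
    c = 2 * (n - 1) / (math.sqrt(2) * n)
    return rho * n * (k - 1) / (c - rho)


def rho_of_eps(eps: float, n: int, k: int) -> float:
    """rho(epsilon) = epsilon * (2(n - 1) / (sqrt(2) n)) / (n (k - 1) + epsilon)."""
    c = 2 * (n - 1) / (math.sqrt(2) * n)
    return eps * c / (n * (k - 1) + eps)
```

Composing the two functions returns the original argument up to floating-point error, and $\rho(0)=0$, matching the observation that the bounds are equalized when $\rho=\epsilon=0$.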

We remark that for the $\operatorname{\textup{{EIRD}}}$ set, if losses are given over $p_{t}$ rather than $v_{t}$ , one can define dynamics which directly consider the state to simply be the induced distribution $p_{t}$ in each round, which satisfies strong local controllability with any $p_{t}\in\operatorname{\textup{{EIRD}}}$ feasible at each round; in general, we consider dynamics to view the memory vector as the state, as the feasible updates $p_{t}$ are a function of $v_{t}$ . Such is the case for the $\phi$ -smoothed simplex, for which we can state an analogous local controllability result.

Lemma 4.

If each $s_{i}$ is $(\sigma,\lambda)$ -scale-bounded, then any instance $(\operatorname{\mathcal{X}},\Delta^{\phi}(n),D)$ over the $\phi$ -smoothed simplex for $\phi=\Theta(k\lambda\sigma^{2})$ satisfies $\Omega(\theta\lambda\phi)$ -local controllability.

Proof

The following lemma from Agarwal and Brown (2023) shows that a ball of distributions around any memory vector $v\in\Delta^{\phi}(n)$ is feasible under $\operatorname{\textup{{IRD}}}(v)$ .

Lemma 5 ( $\operatorname{\textup{{IRD}}}$ for Scale-Bounded Preferences Agarwal and Brown (2023)).

Let each $s_{i}$ be $(\sigma,\lambda)$ -scale-bounded with $\sigma\leq\sqrt{4(n-1)/k}$ , and let $v\in\Delta^{\phi}(n)$ be a vector in the $\phi$ -smoothed simplex, for $\phi\geq\Theta(k\lambda\sigma^{2})$ . Then, $p\in\operatorname{\textup{{IRD}}}(v)$ for any vector $p\in\operatorname{\mathcal{B}}_{\lambda\phi}(v)\cap\Delta^{\phi}(n)$ .

Let $d=\min(\lambda\phi,\pi(v_{t-1}))\geq\lambda\phi\pi(v_{t-1})$ for any $v_{t-1}$ in $\Delta^{\phi}(n)$ . Any $v^{*}\in\operatorname{\mathcal{B}}_{d}(v_{t-1})$ then is contained in $\operatorname{\textup{{IRD}}}(v_{t-1})$ , and so playing $x_{t}$ such that $p_{t}(x_{t})=v^{*}$ yields an update to $v_{t}=(1-\theta_{t})v_{t-1}+\theta_{t}v^{*}$ , which is feasible for any $v_{t}\in\operatorname{\mathcal{B}}_{d\theta}(v_{t-1})$ , and so $\Omega(\theta\lambda\phi)$ -local controllability holds. ∎

For any such set $\operatorname{\mathcal{Y}}$ which yields locally controllable dynamics for the instance $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D)$ , we can minimize regret over $\operatorname{\mathcal{Y}}$ via $\operatorname{\textup{{NestedOCO}}}$ , where we optimize with respect to the surrogate losses $f_{t}^{*}(v_{t})$ . Note that for our regret benchmark of the best per-round instantaneously realizable distribution in $\operatorname{\mathcal{Y}}$ , any fixed vector $v^{*}$ which is instantaneously targeted across all rounds yields an item distribution $p_{t}=v^{*}$ in each round, and so $f^{*}_{t}(v^{*})=f_{t}(p^{*})$ . We assume that $y_{0}$ is bounded inside $\operatorname{\mathcal{Y}}$ (which typically will hold for $y_{0}=\mathbf{u}_{n}$ ).

Theorem 18.

For any $\rho$ -locally controllable instance $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D)$ of Adaptive Recommendations with update speed $\theta>0$ , running $\operatorname{\textup{{NestedOCO}}}$ over the surrogate losses $f_{t}^{*}(v_{t})$ yields regret

\displaystyle\operatorname{\textup{{Reg}}}_{T}(\operatorname{\textup{{% NestedOCO}}})\leq

\displaystyle\;2\sqrt{\frac{(2+\frac{R}{r\rho}+\frac{1}{\theta})TGL^{2}}{% \gamma}}

with respect to the true losses $f_{t}(p_{t})$ over $\operatorname{\mathcal{Y}}$ .

Proof

Beyond applying the regret bound for $\operatorname{\textup{{NestedOCO}}}$ from Theorem 1, the key step here is to bound surrogate loss errors as:

	$\displaystyle\sum_{t=1}^{T}f_{t}(p_{t})-f_{t}(v^{*})\leq$	$\displaystyle\;\sum_{t=1}^{T}f_{t}^{*}(v_{t})-f_{t}(v^{*})+\sum_{t=1}^{T}f_{t}(v_{t})-f_{t}(p_{t})$
	$\displaystyle\leq$	$\displaystyle\;\eta\left(1+\frac{R}{r\rho}\right)\frac{TL^{2}}{\gamma}+\frac{G% }{\eta}+\sum_{t=1}^{T}f_{t}(v_{t})-f_{t}\left(\frac{v_{t}-(1-\theta_{t})v_{t-1% }}{\theta_{t}}\right)$
	$\displaystyle\leq$	$\displaystyle\;\eta\left(1+\frac{R}{r\rho}\right)\frac{TL^{2}}{\gamma}+\frac{G% }{\eta}+\sum_{t=1}^{T}f_{t}(v_{t})-f_{t}\left(v_{t-1}+\frac{v_{t}-v_{t-1}}{% \theta_{t}}\right)$
	$\displaystyle\leq$	$\displaystyle\;\eta\left(1+\frac{R}{r\rho}\right)\frac{TL^{2}}{\gamma}+\frac{G% }{\eta}+L\left(1+\frac{1}{\theta}\right)\sum_{t=1}^{T}\left\lVert v_{t}-v_{t-1% }\right\rVert$
	$\displaystyle\leq$	$\displaystyle\;\eta\left(2+\frac{R}{r\rho}+\frac{1}{\theta}\right)\frac{TL^{2}% }{\gamma}+\frac{G}{\eta}$
	$\displaystyle=$	$\displaystyle\;2\sqrt{\frac{(2+\frac{R}{r\rho}+\frac{1}{\theta})TGL^{2}}{% \gamma}}$

upon setting $\eta=\sqrt{\frac{G\gamma}{(2+\frac{R}{r\rho}+\frac{1}{\theta})TL^{2}}}$ , which yields the theorem. ∎
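The final step balances the two terms $\eta A$ and $G/\eta$, where $A=\left(2+\frac{R}{r\rho}+\frac{1}{\theta}\right)\frac{TL^{2}}{\gamma}$. The following sketch, with hypothetical function names and arbitrary parameter values, evaluates the bound and the stated step size so that the closed form $2\sqrt{AG}$ can be confirmed numerically.

```python
import math


def control_term(T, L, gamma, R, r, rho, theta):
    """A = (2 + R / (r * rho) + 1 / theta) * T * L^2 / gamma."""
    return (2 + R / (r * rho) + 1 / theta) * T * L ** 2 / gamma


def regret_upper_bound(eta, A, G):
    """The penultimate line of the derivation: eta * A + G / eta."""
    return eta * A + G / eta


def tuned_step_size(A, G):
    """The step size from the proof, eta = sqrt(G / A)."""
    return math.sqrt(G / A)
```

At the tuned step size both terms equal $\sqrt{AG}$, and any other choice of $\eta$ gives a strictly larger value of the bound.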

Theorems 7 and 8 follow from Theorem 18, as well as from Lemmas 3 and 4, respectively.

Appendix J Background and Proofs for Section 4.3: Adaptive Pricing

J.1 Background

While there is a large literature on designing online mechanisms for pricing discrete goods via auctions (Mehta et al., 2007; Kanoria and Nazerzadeh, 2020; Golrezaei et al., 2020; Morgenstern and Roughgarden, 2016; Feng et al., 2019; Braverman et al., 2017), there is comparatively little work related to online pricing problems for real-valued goods. Most work for such problems to date requires strong assumptions on valuation functions, often either assuming linearity (Jia et al., 2014) or additivity (Agrawal et al., 2023), or requiring approximability via discretization (Mussi et al., 2022). Here, we introduce a novel formulation for an Adaptive Pricing problem which builds on the myopic-demand fixed-cost setting of Roth et al. (2015), which we extend to accommodate adversarial consumption rates for the agent (which affect demand, as a function of the agent’s reserves) as well as adversarial production costs. As in Roth et al. (2015), our setting can accommodate general convex (increasing) production cost functions and concave (increasing) valuations for the agent, provided that valuations additionally are homogeneous; to our knowledge, this encompasses a much wider class of valuations and costs than considered by any prior work on no-regret dynamic pricing for real-valued goods.

J.2 Model

In each round $t$ , an agent (the consumer) begins with goods reserves $y_{t-1}\in\operatorname{\mathbb{R}}_{\geq 0}^{n}$ (with $y_{0}=\mathbf{0}$ ), then consumes an adversarially chosen fraction $\theta_{t}\in[\theta,1]$ of each good simultaneously (e.g. corresponding to their rate of manufacturing downstream items, using the goods as components), updating their reserves to $(1-\theta_{t})y_{t-1}$ . We (the producer) show the consumer some vector $p_{t}\in\operatorname{\mathbb{R}}_{+}^{n}$ of per-unit prices for each good, and the consumer purchases some bundle of goods $x_{t}$ . The consumer’s valuation function for reserves of goods is given by $v:\operatorname{\mathbb{R}}_{+}^{n}\rightarrow\operatorname{\mathbb{R}}_{+}$ , and their selection of $x_{t}=x^{*}(p_{t},\theta_{t},y_{t-1})$ is given by

\displaystyle x^{*}(p_{t},\theta_{t},y_{t-1})=

\displaystyle\;\operatorname*{argmax}_{x\in\operatorname{\mathbb{R}}_{+}^{n}}v% (x+(1-\theta_{t})y_{t-1})-\langle p_{t},x\rangle.

We later discuss behavior of $x^{*}$ when the $\operatorname*{argmax}$ is undefined; it will suffice for us to only consider price vectors for which it is defined. This updates the consumer’s reserves to $y_{t}=x_{t}+(1-\theta_{t})y_{t-1}$ . Upon seeing the consumer’s purchased bundle $x_{t}$ , we receive their payment $\langle p_{t},x_{t}\rangle$ minus our production cost $c_{t}(x_{t}):\operatorname{\mathbb{R}}_{+}^{n}\rightarrow\operatorname{\mathbb% {R}}_{+}$ , where $c_{t}$ is adversarially chosen. Our utility is then given by

\displaystyle f_{t}(p_{t},x_{t})=

\displaystyle\;\langle p_{t},x_{t}\rangle-c_{t}(x_{t}).

We make the following assumptions on production costs $c_{t}$ and the consumer’s valuation $v$ .

Assumption 2 (Production Costs).

We assume that for each $c_{t}$ , the following hold over $\operatorname{\mathbb{R}}_{+}^{n}$ :

•

$c_{t}$ is non-negative, convex, and $L_{c}$ -Lipschitz,
•

$\lim_{\epsilon\rightarrow 0}c_{t}(\epsilon\cdot\mathbf{1})\leq C_{0}$ for some $C_{0}\geq 0$ , and
•

$c_{t}(x)\geq\phi\left\lVert x\right\rVert+C_{0}$ for some $\phi>0$ .

Further, each $c_{t}$ is revealed prior to setting prices $p_{t+1}$ .

Assumption 3 (Consumer Valuations).

We assume that the following hold over some set $\operatorname{\mathcal{Y}}\subseteq\operatorname{\mathbb{R}}^{n}_{+}$ :

•

$v$ is non-negative, continuous, and differentiable,
•

$v$ is strictly concave and increasing,

•

$v$ is $(\lambda,\beta)$ -Hölder continuous for some $\lambda\geq 1$ and $\beta\in(0,1]$ , i.e.

\left\lvert v(y)-v(y^{\prime})\right\rvert\leq\lambda\left\lVert y-y^{\prime}% \right\rVert^{\beta},

and

•

$v$ is homogeneous of degree $k$ for some $k\in(0,1)$ , i.e. $v(by)=b^{k}v(y)$ for any $b>0$ .

Further, $v$ is known to the producer.

Given the concavity assumption, we note that it is without loss of generality to assume that $k\in(0,1)$ for the homogeneity parameter. There are several well-studied valuation families which satisfy these properties for an appropriate set $\operatorname{\mathcal{Y}}$ ; see Roth et al. (2015) for proofs of each example.

Example 3 (Constant Elasticity of Substitution (CES)).

Valuations of the form

\displaystyle v(y)=

\displaystyle\;\left(\sum_{i=1}^{n}\alpha_{i}y_{i}^{\kappa}\right)^{\beta},

with each $\alpha_{i},\kappa,\beta>0$ and $\kappa,\beta\kappa<1$ , are Hölder continuous, differentiable, strictly concave, non-decreasing, and homogeneous over a convex set in $\operatorname{\mathbb{R}}^{n}_{+}$ .

Example 4 (Cobb-Douglas).

Valuations of the form

\displaystyle v(y)=

\displaystyle\;\prod_{i=1}^{n}y_{i}^{\alpha_{i}},

with $\alpha_{i}>0$ and $\sum_{i=1}^{n}\alpha_{i}<1$ are Hölder continuous, differentiable, strictly concave, non-decreasing, and homogeneous over a convex set in $\operatorname{\mathbb{R}}^{n}_{+}$ .
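For concreteness, both example families can be written in a few lines of Python, and the homogeneity property $v(by)=b^{k}v(y)$ can be checked numerically. In this sketch the degree is $k=\sum_{i}\alpha_{i}$ for Cobb-Douglas and $k=\kappa\beta$ for CES; the function names and the helper `homogeneity_gap` are illustrative assumptions, not notation from the paper.

```python
import math


def cobb_douglas(y, alpha):
    """v(y) = prod_i y_i^{alpha_i}; homogeneous of degree sum(alpha)."""
    return math.prod(y_i ** a_i for y_i, a_i in zip(y, alpha))


def ces(y, alpha, kappa, beta):
    """v(y) = (sum_i alpha_i * y_i^kappa)^beta; homogeneous of degree kappa * beta."""
    return sum(a_i * y_i ** kappa for a_i, y_i in zip(alpha, y)) ** beta


def homogeneity_gap(v, y, b, k):
    """|v(b * y) - b^k * v(y)|, which should vanish for degree-k homogeneous v."""
    scaled = [b * y_i for y_i in y]
    return abs(v(scaled) - b ** k * v(y))
```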

We initially assume that Assumption 3 holds over all of $\operatorname{\mathbb{R}}_{+}^{n}$ , but will restrict our attention to the set $\operatorname{\mathcal{Y}}\subseteq\operatorname{\mathbb{R}}^{n}_{+}$ of bundles where $v(y)\geq\phi\left\lVert y\right\rVert$ for each $y\in\operatorname{\mathcal{Y}}$ , and we note that our results can be extended to arbitrary downward-closed convex sets (where $by\in\operatorname{\mathcal{Y}}$ for any $y\in\operatorname{\mathcal{Y}}$ and $b\in(0,1]$ ). In Section J.3 we show that Assumptions 2 and 3 yield several important properties which enable optimization via our framework. We show a unique mapping between price vectors and bundle purchases (for any fixed reserves and consumption rate), that restricting attention to $\operatorname{\mathcal{Y}}$ is justified under rationality constraints, and that $\operatorname{\mathcal{Y}}$ is convex.

Further, there is some price vector which yields a reserve update to any $y_{t}\in\operatorname{\mathcal{Y}}$ in a neighborhood around $y_{t-1}$ , yielding local controllability. Crucially, we show that there are concave surrogate rewards $f^{*}_{t}(y_{t})$ which will closely track our true rewards $f_{t}(p_{t},x_{t})$ , leveraging the following property of homogeneous functions.

Proposition 10 (Euler’s Theorem for Homogeneous Functions).

A continuous and differentiable function $v:\operatorname{\mathcal{Y}}\rightarrow\operatorname{\mathbb{R}}_{+}$ is homogeneous of degree $k$ if and only if

\displaystyle\langle\nabla v(y),y\rangle=

\displaystyle\;k\cdot v(y).
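Euler's identity can be verified numerically for a specific valuation. The sketch below uses a CES valuation with hypothetical parameters and a central finite-difference gradient (an approximation chosen here for illustration, not a method from the paper) to compare $\langle\nabla v(y),y\rangle$ with $k\cdot v(y)$, where $k=\kappa\beta$.

```python
def ces_valuation(y, alpha=(1.0, 2.0), kappa=0.5, beta=0.8):
    """CES valuation with illustrative parameters; degree k = kappa * beta = 0.4."""
    return sum(a_i * y_i ** kappa for a_i, y_i in zip(alpha, y)) ** beta


def numerical_gradient(f, y, h=1e-6):
    """Central finite-difference approximation of the gradient of f at y."""
    grad = []
    for i in range(len(y)):
        up = list(y)
        down = list(y)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def euler_gap(f, y, k):
    """|<grad f(y), y> - k * f(y)|, which is near zero for degree-k homogeneous f."""
    grad = numerical_gradient(f, y)
    inner = sum(g_i * y_i for g_i, y_i in zip(grad, y))
    return abs(inner - k * f(y))
```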

We run $\operatorname{\textup{{NestedOCO}}}$ directly over these concave surrogate rewards (by inverting the sign of each), where each $p_{t}$ can be computed efficiently in terms of $y_{t-1}$ and $\theta_{t}$ , and we show that the surrogate reward distance from our true rewards is bounded. While our rewards will not be Lipschitz over $\operatorname{\mathcal{Y}}$ in general, we show that appropriately calibrating our step size yields sublinear regret with dependence on the Hölder continuity parameters. We measure our regret with respect to the set of stable reserve policies, i.e. pricing policies where $y_{t}$ remains constant.

Definition 8 (Regret for Stable Reserve Policies).

Let $\operatorname{\mathcal{P}}_{\operatorname{\mathcal{Y}}}=\{P_{y^{*}}:y^{*}\in\operatorname{\mathcal{Y}}\}$ be the set of stable reserve policies, where for any $y_{t-1}$ and $\theta_{t}$ satisfying $(1-\theta_{t})y_{t-1}\leq y^{*}$ , playing prices computed by a policy $p_{t}=P_{y^{*}}(y_{t-1},\theta_{t})$ yields

\displaystyle(1-\theta_{t})y_{t-1}+x^{*}(p_{t},\theta_{t},y_{t-1})=y^{*}.

It is straightforward to see that any $P_{y^{*}}\in\operatorname{\mathcal{P}}_{\operatorname{\mathcal{Y}}}$ maintains the invariant that $y_{t}=y^{*}$ , provided that some such $p_{t}$ is always feasible.

J.3 Analysis

We show a series of results establishing the key conditions allowing us to formulate this problem as a locally controllable instance of online nonlinear control. We first show that any positive bundle is the unique optimal purchase for some positive price vector.

Lemma 6.

For any reserves $y_{t-1}\in\operatorname{\mathbb{R}}_{\geq 0}^{n}$ , consumption rate $\theta_{t}\in[\theta,1]$ , and vector $y_{t}\in\operatorname{\mathbb{R}}_{+}^{n}$ where $y_{t}>(1-\theta_{t})y_{t-1}$ elementwise, the bundle $x_{t}=y_{t}-(1-\theta_{t})y_{t-1}$ is the unique solution to

\displaystyle x_{t}=

\displaystyle\;x^{*}(p_{t},\theta_{t},y_{t-1})

for prices $p_{t}=\nabla v(y_{t})$ .

Proof

Recall that the consumer’s bundle choice is given by

\displaystyle x^{*}(p_{t},\theta_{t},y_{t-1})=

\displaystyle\;\operatorname*{argmax}_{x\in\operatorname{\mathbb{R}}_{+}^{n}}v% (x+(1-\theta_{t})y_{t-1})-\langle p_{t},x\rangle.

Note that $v((1-\theta_{t})y_{t-1}+x)-\langle p_{t},x\rangle$ is strictly concave in $x$ for any $x\in\operatorname{\mathbb{R}}^{n}_{+}$ , as the gradients

\displaystyle\nabla_{x}v((1-\theta_{t})y_{t-1}+x)=

\displaystyle\;\nabla_{y_{t}}v(y_{t})

are preserved at each point $y_{t}=(1-\theta_{t})y_{t-1}+x$ , and subtracting the linear function $\langle x,p_{t}\rangle$ does not affect strict concavity. We also have that $p_{t}\in\operatorname{\mathbb{R}}_{+}^{n}$ for prices $p_{t}=\nabla v(y_{t})$ , as $v$ is strictly concave and non-decreasing. This yields that $v((1-\theta_{t})y_{t-1}+x)-\langle p_{t},x\rangle$ has a unique global maximum at $x_{t}=y_{t}-(1-\theta_{t})y_{t-1}$ , as $\nabla_{x}(v((1-\theta_{t})y_{t-1}+x)-\langle p_{t},x\rangle)=\mathbf{0}$ at this point. ∎
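Lemma 6 can be illustrated with a small numerical experiment. The sketch below assumes a Cobb-Douglas valuation with hypothetical exponents, sets prices to $\nabla v(y_{t})$ for a target $y_{t}$, and evaluates the consumer's objective at the bundle $x_{t}=y_{t}-(1-\theta_{t})y_{t-1}$; all names and parameter values are illustrative.

```python
import math

ALPHA = [0.3, 0.4]  # hypothetical Cobb-Douglas exponents with sum below 1


def v(y):
    """Cobb-Douglas valuation, strictly concave on the positive orthant."""
    return math.prod(y_i ** a_i for y_i, a_i in zip(y, ALPHA))


def grad_v(y):
    """Analytic gradient: dv/dy_i = alpha_i * v(y) / y_i."""
    value = v(y)
    return [a_i * value / y_i for y_i, a_i in zip(y, ALPHA)]


def consumer_objective(x, prices, y_prev, theta):
    """v(x + (1 - theta) * y_prev) - <prices, x>, which the consumer maximizes."""
    reserves = [x_i + (1 - theta) * y_i for x_i, y_i in zip(x, y_prev)]
    return v(reserves) - sum(p_i * x_i for p_i, x_i in zip(prices, x))


def inducing_prices_and_bundle(y_target, y_prev, theta):
    """Prices grad v(y_target) and the bundle they induce, as in Lemma 6."""
    bundle = [y_i - (1 - theta) * y0_i for y_i, y0_i in zip(y_target, y_prev)]
    if any(x_i <= 0 for x_i in bundle):
        raise ValueError("target must exceed the post-consumption reserves elementwise")
    return grad_v(y_target), bundle
```

Perturbing the induced bundle in any coordinate direction strictly lowers the consumer's objective, consistent with it being the unique maximizer.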

As such, the $\operatorname*{argmax}$ for $x^{*}(p_{t},\theta_{t},y_{t-1})$ is unique whenever $p_{t}=\nabla v(y)$ for some $y\in\operatorname{\mathbb{R}}_{+}^{n}$ . We let $p^{*}(x_{t};y_{t-1},\theta_{t})=\nabla v((1-\theta_{t})y_{t-1}+x_{t})$ denote this price vector which induces a purchase of $x_{t}$ . For any other price vector $p$ , the maximizing bundle $x_{t}$ either approaches a point on the boundary of $\operatorname{\mathbb{R}}_{+}^{n}$ , or grows unboundedly. We restrict our attention to bundles contained in $\operatorname{\mathbb{R}}_{+}^{n}$ , and show that the issue of unboundedness is resolved by rationality considerations for the producer. We characterize the per-round rewards of stable reserve policies as concave functions of $y\in\operatorname{\mathbb{R}}_{+}^{n}$ , and show that the optimal such policy corresponds to some state $y^{*}\in\operatorname{\mathcal{Y}}$ , where $\operatorname{\mathcal{Y}}$ is convex and bounded.

Lemma 7.

The round- $t$ reward of a stable reserve policy $P_{y}$ corresponding to any $y\in\operatorname{\mathbb{R}}_{+}^{n}$ is given by a strictly concave function

\displaystyle f_{t}(P_{y})=

\displaystyle\;\theta_{t}k\cdot v(y)-c_{t}(\theta_{t}y).

Proof

We first note that we can maintain $y_{t}=y$ in every round by Lemma 6, as $y_{0}=\mathbf{0}$ and $(1-\theta_{t})y<y$ . As such, a bundle $x_{t}=\theta_{t}y$ is purchased in each round at prices $\nabla v(y)$ , and our reward is given by

	$\displaystyle f_{t}(P_{y})=$	$\displaystyle\;f_{t}(p^{*}(\theta_{t}y;y,\theta_{t}),\theta_{t}y)$
	$\displaystyle=$	$\displaystyle\;\langle\nabla v(y),\theta_{t}y\rangle-c_{t}(\theta_{t}y)$
	$\displaystyle=$	$\displaystyle\;\theta_{t}k\cdot v(y)-c_{t}(\theta_{t}y),$

where the final step follows from Proposition 10 for homogeneous functions. The function $\theta_{t}k\cdot v(y)$ is strictly concave, which is preserved upon subtracting the convex function $c_{t}(\theta_{t}y)$ . ∎
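The identity in Lemma 7 can be checked by computing the stable-policy reward in two ways: directly as payment minus cost, and via the closed form $\theta_{t}k\cdot v(y)-c_{t}(\theta_{t}y)$. The sketch assumes a Cobb-Douglas valuation and a cost of the form $c(x)=\phi\lVert x\rVert+C_{0}$, both chosen here only for illustration.

```python
import math

ALPHA = [0.3, 0.4]  # hypothetical Cobb-Douglas exponents
K = sum(ALPHA)      # homogeneity degree of the valuation


def v(y):
    return math.prod(y_i ** a_i for y_i, a_i in zip(y, ALPHA))


def grad_v(y):
    value = v(y)
    return [a_i * value / y_i for y_i, a_i in zip(y, ALPHA)]


def cost(x, phi=0.2, c0=0.1):
    """Illustrative convex cost c(x) = phi * ||x|| + C_0."""
    return phi * math.sqrt(sum(x_i ** 2 for x_i in x)) + c0


def reward_direct(y, theta):
    """Payment <grad v(y), theta * y> minus production cost of the bundle theta * y."""
    bundle = [theta * y_i for y_i in y]
    payment = sum(p_i * x_i for p_i, x_i in zip(grad_v(y), bundle))
    return payment - cost(bundle)


def reward_closed_form(y, theta):
    """theta * k * v(y) - c(theta * y), using Euler's theorem for the payment."""
    return theta * K * v(y) - cost([theta * y_i for y_i in y])
```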

Lemma 8.

The set $\operatorname{\mathcal{Y}}=\{y\in\operatorname{\mathbb{R}}_{+}^{n}:v(y)\geq% \phi\left\lVert y\right\rVert\}$ is convex.

Proof

Consider any two points $y,y^{\prime}\in\operatorname{\mathcal{Y}}$ , and let $y^{\prime\prime}=ay+(1-a)y^{\prime}$ for any $a\in[0,1]$ . Recall that $y^{*}\in\operatorname{\mathbb{R}}_{+}^{n}$ belongs to $\operatorname{\mathcal{Y}}$ if and only if $v(y^{*})\geq\phi\left\lVert y^{*}\right\rVert$ . By concavity of $v$ , we have that

	$\displaystyle v(y^{\prime\prime})=$	$\displaystyle\;v(ay+(1-a)y^{\prime})$
	$\displaystyle\geq$	$\displaystyle\;av(y)+(1-a)v(y^{\prime})$
	$\displaystyle\geq$	$\displaystyle\;\phi\left\lVert ay\right\rVert+\phi\left\lVert(1-a)y^{\prime}\right\rVert$
	$\displaystyle\geq$	$\displaystyle\;\phi\left\lVert ay+(1-a)y^{\prime}\right\rVert$
	$\displaystyle=$	$\displaystyle\;\phi\left\lVert y^{\prime\prime}\right\rVert$

and so $y^{\prime\prime}\in\operatorname{\mathcal{Y}}$ , yielding convexity of $\operatorname{\mathcal{Y}}$ . ∎

Lemma 9.

For any $z\in\operatorname{\mathbb{R}}_{+}^{n}$ where $z\notin\operatorname{\mathcal{Y}}$ , there is some $y\in\operatorname{\mathcal{Y}}$ such that $f_{t}(P_{y})\geq f_{t}(P_{z})$ for any $\theta_{t}$ and $c_{t}$ .

Proof

Consider some $z\notin\operatorname{\mathcal{Y}}$ such that $v(z)=\psi\left\lVert z\right\rVert$ , for $\psi<\phi$ , and let $y=\left(\frac{\psi}{\phi}\right)^{1/k}z$ . By homogeneity of $v$ , we have that $v(y)=\frac{\phi}{\psi}v(z)=\phi\left\lVert z\right\rVert$ , and so $y\in\operatorname{\mathcal{Y}}$ as $\left\lVert z\right\rVert>\left\lVert y\right\rVert$ . For any round with costs $c_{t}$ and consumption rate $\theta_{t}$ we then have that:

$\displaystyle f_{t}(P_{y})-f_{t}(P_{z})=$	$\displaystyle\;\theta_{t}k\left(v(y)-v(z)\right)-c_{t}(\theta_{t}y)+c_{t}(% \theta_{t}z)$
$\displaystyle=$	$\displaystyle\;\theta_{t}k\left(\frac{\psi}{\phi}-1\right)\psi\left\lVert z% \right\rVert-c_{t}(\theta_{t}y)+c_{t}(\theta_{t}z)$	(homogeneity of $v$ )
$\displaystyle\geq$	$\displaystyle\;\theta_{t}k\left(\frac{\psi}{\phi}-1\right)\psi\left\lVert z% \right\rVert+\theta_{t}\phi\left\lVert z-y\right\rVert$	( lower bound and convexity of $c_{t}$ )
$\displaystyle\geq$	$\displaystyle\;\theta_{t}k\left(\frac{\psi}{\phi}-1\right)\psi\left\lVert z% \right\rVert+\theta_{t}\left(1-\left(\frac{\psi}{\phi}\right)^{1/k}\right)\phi% \left\lVert z\right\rVert$
$\displaystyle\geq$	$\displaystyle\;\theta_{t}\left(1-\frac{\psi}{\phi}\right)\phi\left\lVert z% \right\rVert-\theta_{t}\left(1-\frac{\psi}{\phi}\right)\psi\left\lVert z\right\rVert$	( $k,\frac{\psi}{\phi}<1$ )
$\displaystyle>$	$\displaystyle\;0.$	( $\phi>\psi$ )

∎

Thus the optimal $P_{y}$ for any cost and consumption sequence corresponds to some $y\in\operatorname{\mathcal{Y}}$ . We can also bound the radius of $\operatorname{\mathcal{Y}}$ .

Lemma 10.

Let $V=\max_{y\in\operatorname{\mathbb{R}}_{+}^{n}:\left\lVert y\right\rVert=1}v(y)$ . Then, for every $y\in\operatorname{\mathcal{Y}}$ we have that

\displaystyle\left\lVert y\right\rVert\leq

\displaystyle\;\left(\frac{V}{\phi}\right)^{\frac{1}{1-k}}.

Proof

Let $y^{*}=\operatorname*{argmax}_{y:\left\lVert y\right\rVert=1}v(y)$ , where we have $v(y^{*})=V$ . Consider the vector $by^{*}$ for any $b>0$ . By homogeneity of $v$ , we have that

	$\displaystyle v(by^{*})=$	$\displaystyle\;b^{k}v(y^{*})$
	$\displaystyle=$	$\displaystyle\;b^{k}V.$

For any $b>\left(\frac{V}{\phi}\right)^{\frac{1}{1-k}}$ we have that

	$\displaystyle v(by^{*})=$	$\displaystyle\;\frac{b}{b^{1-k}}\cdot V$
	$\displaystyle\leq$	$\displaystyle\;b\phi,$

where $\left\lVert by^{*}\right\rVert=b$ and thus $by^{*}\notin\operatorname{\mathcal{Y}}$ . This holds for all vectors with norm $b$ , as any such vector $z$ has value $v(z)$ at most $b^{k}V$ by homogeneity, which yields the result. ∎
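A two-dimensional example makes the radius bound concrete. The sketch below assumes a Cobb-Douglas valuation with hypothetical exponents, approximates $V$ by a maximum over a finite grid of unit directions in the positive quadrant (so the computed radius is exact only relative to that grid), and evaluates membership in $\operatorname{\mathcal{Y}}$ at points just beyond and just inside the bound.

```python
import math

ALPHA = [0.3, 0.4]  # hypothetical Cobb-Douglas exponents
K = sum(ALPHA)      # homogeneity degree k


def v(y):
    return math.prod(y_i ** a_i for y_i, a_i in zip(y, ALPHA))


def unit_directions(num=2001):
    """Unit vectors at evenly spaced angles strictly inside the positive quadrant."""
    angles = (math.pi / 2 * (i + 1) / (num + 1) for i in range(num))
    return [(math.cos(t), math.sin(t)) for t in angles]


def radius_bound(phi, directions):
    """(V / phi)^(1 / (1 - k)), with V taken as the maximum of v over the grid."""
    big_v = max(v(u) for u in directions)
    return (big_v / phi) ** (1 / (1 - K))


def in_y(y, phi):
    """Membership in Y = {y : v(y) >= phi * ||y||}."""
    return v(y) >= phi * math.hypot(*y)
```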

The previous result also implies that $by\in\operatorname{\mathcal{Y}}$ for any $b<1$ and $y\in\operatorname{\mathcal{Y}}$ . We assume that $V>\phi$ , which is without loss of generality as we may otherwise take $\phi$ to be smaller artificially; we assume $\phi$ is small enough to ensure that $\operatorname{\mathcal{Y}}$ contains a ball $\operatorname{\mathcal{B}}_{1}(y_{1})$ of radius 1 around some $y_{1}\in\operatorname{\mathcal{Y}}$ , and we let $R=\left(\frac{V}{\phi}\right)^{\frac{1}{1-k}}$ . We consider the dynamics to be given by

\displaystyle D_{t}(p_{t},y_{t-1})=

\displaystyle\;(1-\theta_{t})y_{t-1}+x^{*}(p_{t},\theta_{t},y_{t-1}).

We let $\operatorname{\mathcal{Z}}=\operatorname{\mathbb{R}}_{+}^{n}$ denote our action space of price vectors; while dynamics here are not action-linear, we can still compute our desired action $p_{t}=\nabla v(y_{t})$ efficiently, as we assume we have knowledge of $v$ . While the dynamics depend on $\theta_{t}$ , our choice of action $p_{t}$ depends only on the target update $y_{t}$ to the consumer’s reserves, by Lemma 6. Further, upon observing $x_{t}$ , we can solve for $\theta_{t}$ as

\displaystyle\theta_{t}=

\displaystyle\;1-\frac{y_{t}-x_{t}}{y_{t-1}}

for purposes of representing our surrogate losses, which are given by

\displaystyle f^{*}_{t}(y_{t})=

\displaystyle\;\theta_{t}k\cdot v(y_{t})-c_{t}(\theta_{t}y_{t}).
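Recovering $\theta_{t}$ from the observed purchase is a one-line computation, read elementwise. The sketch below uses the first coordinate with positive previous reserves; the function name is hypothetical, and the handling of the case $y_{t-1}=\mathbf{0}$ (where the ratio is undefined) is an assumption made here for illustration.

```python
def recover_theta(y_t, x_t, y_prev, tol=1e-12):
    """Solve y_t = x_t + (1 - theta) * y_prev for theta using one coordinate."""
    for y_i, x_i, y0_i in zip(y_t, x_t, y_prev):
        if y0_i > tol:
            return 1.0 - (y_i - x_i) / y0_i
    # With zero previous reserves the consumption rate cannot be identified.
    raise ValueError("theta is not identifiable when previous reserves are zero")
```

Since every good is consumed at the same rate, each coordinate with positive previous reserves yields the same value of $\theta_{t}$.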

We now show that the dynamics satisfy local controllability.

Lemma 11 (Local Controllability).

The instance $(\operatorname{\mathcal{Z}},\operatorname{\mathcal{Y}},D_{t})$ satisfies $\theta$ -local controllability for each round $t$ .

Proof

We show that $\theta$ -local controllability holds over all of $\operatorname{\mathbb{R}}_{+}^{n}$ , which implies $\theta$ -local controllability over $\operatorname{\mathcal{Y}}$ , as each distance $\pi(y_{t-1})$ can only decrease while the feasible update region remains the same. By Lemma 6, any update where $y_{t}\geq(1-\theta_{t})y_{t-1}$ elementwise is feasible. Each $\pi(y_{t-1})$ over $\operatorname{\mathbb{R}}_{+}^{n}$ is simply the minimum element of $y_{t-1}$ , which we denote here by $m$ . Each element of $y_{t-1}$ is decreased by at least $\theta m$ through consumption, and so any $y_{t}$ in the $\ell_{\infty}$ ball of radius $\theta m=\theta\pi(y_{t-1})$ around $y_{t-1}$ , and thus the $\ell_{2}$ ball of radius $\theta\pi(y_{t-1})$ , is feasible. ∎

We are now ready to analyse the regret of $\operatorname{\textup{{NestedOCO}}}$ for the problem. The remaining key issues to resolve will be the errors between our true and surrogate rewards $f_{t}$ and $f_{t}^{*}$ , as well as the lack of Lipschitz continuity for our rewards. We will make use of more general formulations of the guarantees of $\operatorname{\textup{{FTRL}}}$ (see e.g. Hazan (2021)).

Proposition 11.

For a $\gamma$ -strongly convex regularizer $\psi:\operatorname{\mathcal{Y}}\rightarrow\operatorname{\mathbb{R}}$ where $\left\lvert\psi(y)-\psi(y^{\prime})\right\rvert\leq G$ for all $y,y^{\prime}\in\operatorname{\mathcal{Y}}$ , and for convex losses $f_{1},\ldots,f_{T}$ , the regret of $\operatorname{\textup{{FTRL}}}$ is bounded by

\displaystyle\operatorname{\textup{{Reg}}}_{T}(\textup{$\operatorname{\textup{% {FTRL}}}$})\leq

\displaystyle\;\sum_{t=1}^{T}(g_{t}(y_{t})-g_{t}(y_{t+1}))+\frac{G}{\eta},

where $g_{t}(y)=\langle\nabla_{t}f_{t}(y_{t}),y\rangle$ and $g_{t}(y_{t})-g_{t}(y_{t+1})\geq\frac{\gamma}{\eta}\left\lVert y_{t+1}-y_{t}% \right\rVert^{2}$ .

We show that this implies a regret bound for $(\lambda,\beta)$ -Hölder continuous convex losses, recovering the $\lambda$ -Lipschitz bounds when $\beta=1$ .

Theorem 19.

For $(\lambda,\beta)$ -Hölder continuous convex losses, $\operatorname{\textup{{FTRL}}}$ obtains regret bounded by

\displaystyle\operatorname{\textup{{Reg}}}_{T}(\operatorname{\textup{{FTRL}}})\leq

\displaystyle\;T\lambda\left(\frac{\eta\lambda}{\gamma}\right)^{\beta/(2-\beta% )}+\frac{G}{\eta}

and chooses points which satisfy $\left\lVert y_{t+1}-y_{t}\right\rVert\leq\left(\frac{\eta\lambda}{\gamma}% \right)^{1/(2-\beta)}$ in each round.

Proof

For $(\lambda,\beta)$ -Hölder continuous convex losses $f_{t}$ , we have that

	$\displaystyle g_{t}(y_{t})-g_{t}(y_{t+1})=$	$\displaystyle\;\langle\nabla_{t}f_{t}(y_{t}),y_{t}-y_{t+1}\rangle$
	$\displaystyle=$	$\displaystyle\;\langle\nabla_{t}f_{t}(y_{t}),(2y_{t}-y_{t+1})-y_{t}\rangle$
	$\displaystyle\leq$	$\displaystyle\;f_{t}(2y_{t}-y_{t+1})-f_{t}(y_{t})$

by convexity of $f_{t}$ , where $\left\lVert(2y_{t}-y_{t+1})-y_{t}\right\rVert=\left\lVert y_{t}-y_{t+1}\right\rVert$ , and so

\displaystyle g_{t}(y_{t})-g_{t}(y_{t+1})\leq

\displaystyle\;\lambda\left\lVert y_{t}-y_{t+1}\right\rVert^{\beta}

by Hölder continuity. Combining with the lower bound on $g_{t}(y_{t})-g_{t}(y_{t+1})$ from Proposition 11 gives us that

\displaystyle\frac{\gamma}{\eta}\left\lVert y_{t+1}-y_{t}\right\rVert^{2}\leq

\displaystyle\;g_{t}(y_{t})-g_{t}(y_{t+1})\leq\lambda\left\lVert y_{t}-y_{t+1}% \right\rVert^{\beta}

and thus

\displaystyle g_{t}(y_{t})-g_{t}(y_{t+1})\leq

\displaystyle\;\lambda\left(\frac{\eta\lambda}{\gamma}\right)^{\beta/(2-\beta)},

yielding a regret bound of

\displaystyle\operatorname{\textup{{Reg}}}_{T}(\operatorname{\textup{{FTRL}}})\leq

\displaystyle\;T\lambda\left(\frac{\eta\lambda}{\gamma}\right)^{\beta/(2-\beta)}+\frac{G}{\eta}

with per-round distance at most $\left\lVert y_{t+1}-y_{t}\right\rVert\leq\left(\frac{\eta\lambda}{\gamma}\right)^{1/(2-\beta)}$. ∎
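The passage from the two-sided bound to the per-round estimates can be spelled out as follows. This is a short worked step rather than part of the original proof; the shorthand $d_{t}=\left\lVert y_{t+1}-y_{t}\right\rVert$ is introduced here only for illustration, and we assume $d_{t}>0$ (the case $d_{t}=0$ is immediate) and $\beta<2$.

```latex
\begin{align*}
\frac{\gamma}{\eta}\, d_t^{2} \le \lambda\, d_t^{\beta}
  &\;\Longrightarrow\; d_t^{\,2-\beta} \le \frac{\eta\lambda}{\gamma}
  \;\Longrightarrow\; d_t \le \left(\frac{\eta\lambda}{\gamma}\right)^{1/(2-\beta)}, \\
g_t(y_t) - g_t(y_{t+1}) \le \lambda\, d_t^{\beta}
  &\;\le\; \lambda \left(\frac{\eta\lambda}{\gamma}\right)^{\beta/(2-\beta)}.
\end{align*}
```

For $\beta=1$ the regret bound reads $T\eta\lambda^{2}/\gamma+G/\eta$, which is the usual form for $\lambda$-Lipschitz losses.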

We note that the concave surrogate rewards $f_{t}^{*}(y_{t})$ are a sum of a $(k\lambda,\beta)$-Hölder continuous function and a $(L_{c},1)$-Hölder continuous (i.e. Lipschitz) function; we assume that each function is $(L,\beta)$-Hölder continuous with $L=k\lambda+L_{c}$, which is sufficient for large enough $T$, as we will have $\left\lVert y_{t}-y_{t-1}\right\rVert\leq 1$ and thus $\left\lVert y_{t}-y_{t-1}\right\rVert\leq\left\lVert y_{t}-y_{t-1}\right\rVert^{\beta}$. We use a similar analysis to bound the error between true and surrogate rewards, yielding our regret bound for $\operatorname{\textup{{NestedOCO}}}$.

Theorem 20.

The regret of $\operatorname{\textup{{NestedOCO}}}$ with respect to the stable reserve policies $\operatorname{\mathcal{P}}_{\operatorname{\mathcal{Y}}}$ is bounded by

\displaystyle\operatorname{\textup{{Reg}}}_{T}(\operatorname{\textup{{NestedOCO}}})\leq

\displaystyle\;2L\left(\frac{G}{\gamma}\right)^{\beta/2}\left(T\left(3+\left(\frac{R}{\theta}\right)^{\beta}\right)\right)^{(2-\beta)/{2}}.

Proof

We reparameterize to treat the bundle $y_{1}$ where $\operatorname{\mathcal{B}}_{1}(y_{1})\subseteq\operatorname{\mathcal{Y}}$ as the origin, and assume the choice of regularizer has $y_{1}$ as its minimum. By Theorem 1, for any step size and $\delta>0$ such that $\left\lVert y_{t}-y_{t-1}\right\rVert\leq\delta\theta$ , running $\operatorname{\textup{{NestedOCO}}}$ for the $\theta$ -locally controllable instance $(\operatorname{\mathcal{Z}},\operatorname{\mathcal{Y}},D)$ over the surrogate rewards $f_{t}^{*}$ , with inradius 1 and radius $R$ , obtains

	$\displaystyle\sum_{t=1}^{T}f_{t}^{*}(y^{*})-\sum_{t=1}^{T}f_{t}^{*}(y_{t})\leq$	$\displaystyle\;TL(\delta R)^{\beta}+TL\left(\frac{\eta L}{\gamma}\right)^{\beta/(2-\beta)}+\frac{G}{\eta}$
	$\displaystyle\leq$	$\displaystyle\;TL\left(1+\left(\frac{R}{\theta}\right)^{\beta}\right)\left(\frac{\eta L}{\gamma}\right)^{\beta/(2-\beta)}+\frac{G}{\eta}$
	$\displaystyle\leq$	$\displaystyle\;2L\left(\frac{G}{\gamma}\right)^{\beta/2}\left(T\left(1+\left(\frac{R}{\theta}\right)^{\beta}\right)\right)^{(2-\beta)/{2}}$
	$\displaystyle\overset{\Delta}{=}$	$\displaystyle\;\operatorname{\textup{{Reg}}}_{T}(f^{*})$

for any $y^{*}\in\operatorname{\mathcal{Y}}$, upon setting $\delta=\frac{1}{\theta}\left(\frac{\eta L}{\gamma}\right)^{1/(2-\beta)}$ and $\eta=\left(\frac{G}{K^{*}T}\right)^{(2-\beta)/2}$, where

K^{*}=L\left(1+\left(\frac{R}{\theta}\right)^{\beta}\right)\left(\frac{L}{\gamma}\right)^{\beta/(2-\beta)}.

Note that the surrogate rewards exactly track the true rewards when a stable reserve policy $P_{y^{*}}$ is played, and so our regret with respect to the best stable reserve policy $P_{y^{*}}$ is at most

$\displaystyle\sum_{t=1}^{T}f_{t}(P_{y^{*}})-\sum_{t=1}^{T}f_{t}(y_{t})\leq$	$\displaystyle\;\operatorname{\textup{{Reg}}}_{T}(f^{*})+\sum_{t=1}^{T}f^{*}_{t}(y_{t})-f_{t}(p_{t},x_{t})$
$\displaystyle\leq$	$\displaystyle\;\operatorname{\textup{{Reg}}}_{T}(f^{*})+\sum_{t=1}^{T}\langle\nabla v(y_{t}),\theta y_{t}-x_{t}\rangle-c_{t}(\theta y_{t})+c_{t}(x_{t})$
$\displaystyle\leq$	$\displaystyle\;\operatorname{\textup{{Reg}}}_{T}(f^{*})+\sum_{t=1}^{T}(1-\theta_{t})\left(\langle\nabla v(y_{t}),y_{t-1}-y_{t}\rangle+L\left\lVert y_{t}-y_{t-1}\right\rVert\right)$	( $x_{t}=(1-\theta_{t})y_{t-1}$ )
$\displaystyle\leq$	$\displaystyle\;\operatorname{\textup{{Reg}}}_{T}(f^{*})+\sum_{t=1}^{T}\left(\langle\nabla v(y_{t}),y_{t}-(2y_{t}-y_{t-1})\rangle+L\left\lVert y_{t}-y_{t-1}\right\rVert\right)$
$\displaystyle\leq$	$\displaystyle\;\operatorname{\textup{{Reg}}}_{T}(f^{*})+\sum_{t=1}^{T}v(y_{t})-v(2y_{t}-y_{t-1})+L\left\lVert y_{t}-y_{t-1}\right\rVert$	(concavity of $v$ )
$\displaystyle\leq$	$\displaystyle\;\operatorname{\textup{{Reg}}}_{T}(f^{*})+\sum_{t=1}^{T}2L\left\lVert y_{t}-y_{t-1}\right\rVert^{\beta}$	(Hölder, $\left\lVert y_{t}-y_{t-1}\right\rVert\leq 1$ )
$\displaystyle\leq$	$\displaystyle\;\operatorname{\textup{{Reg}}}_{T}(f^{*})+2TL\left(\frac{\eta L}{\gamma}\right)^{\beta/(2-\beta)}$
$\displaystyle\leq$	$\displaystyle\;2L\left(\frac{G}{\gamma}\right)^{\beta/2}\left(T\left(3+\left(\frac{R}{\theta}\right)^{\beta}\right)\right)^{(2-\beta)/{2}}$

upon updating $K^{*}$ to $K$ as

K=L\left(3+\left(\frac{R}{\theta}\right)^{\beta}\right)\left(\frac{L}{\gamma}\right)^{\beta/(2-\beta)},

which yields the theorem. ∎
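The final inequality rests on a step size that balances the two $\eta$-dependent terms $TK\eta^{\beta/(2-\beta)}$ and $G/\eta$. The following is a minimal numerical sketch of that balancing, not part of the original analysis; the function names and all constant values are hypothetical and chosen only for illustration.

```python
import math

def balanced_step_size(G, K, T, beta):
    """Step size eta = (G / (K T))^((2 - beta) / 2) used in the proof."""
    return (G / (K * T)) ** ((2.0 - beta) / 2.0)

def regret_terms(eta, G, K, T, beta):
    """Return the two terms T K eta^(beta / (2 - beta)) and G / eta."""
    return T * K * eta ** (beta / (2.0 - beta)), G / eta

def closed_form_bound(G, L, gamma, R, theta, T, beta):
    """2 L (G / gamma)^(beta / 2) (T (3 + (R / theta)^beta))^((2 - beta) / 2)."""
    scale = T * (3.0 + (R / theta) ** beta)
    return 2.0 * L * (G / gamma) ** (beta / 2.0) * scale ** ((2.0 - beta) / 2.0)

# Hypothetical constants, for illustration only.
G, L, gamma, R, theta, T, beta = 2.0, 1.5, 0.5, 3.0, 0.25, 10_000, 0.7

# K as defined at the end of the proof of Theorem 20.
K = L * (3.0 + (R / theta) ** beta) * (L / gamma) ** (beta / (2.0 - beta))

eta = balanced_step_size(G, K, T, beta)
first, second = regret_terms(eta, G, K, T, beta)
bound = closed_form_bound(G, L, gamma, R, theta, T, beta)
```

At this choice of $\eta$ the two terms coincide, and their sum equals the closed-form bound stated in Theorem 20.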

Theorem 9 follows directly from Theorem 20.

Appendix K Background and Proofs for Section 4.4: Steering Learners

K.1 Background

While much of the literature related to no-regret learning in general-sum games considers either rates of convergence to (coarse) correlated equilibria (Blum et al., 2008; Anagnostides et al., 2022) or welfare guarantees for such equilibria (Roughgarden, 2015; Hartline et al., 2015a), a recent line of work (Braverman et al., 2017; Deng et al., 2019; Mansour et al., 2022) has considered the question of optimizing one’s reward when playing against a no-regret learner. A target benchmark which has emerged for this problem is the value of the Stackelberg equilibrium of a game (the optimal mixed strategy to “commit to”, assuming an opponent best responds). Deng et al. (2019) showed this value to be attainable against any no-regret algorithm and optimal in many cases (e.g. for no-swap learners), both up to $o(T)$ terms; it may moreover yield higher reward for the optimizer than (coarse) correlated equilibria.

We show a class of instances for which the problem of optimizing reward against a learner playing according to gradient descent can be formulated as a locally controllable instance of online nonlinear control with adversarial perturbations and surrogate state-based losses. The simplest non-trivial instances we consider are those where the optimizer’s reward is a function only of the learner’s actions (i.e. all rows of their reward matrix are identical), and the optimization problem amounts to steering the learner to a desired strategy via one’s choice of actions. Additionally, we allow the game matrices to change over time, which to our knowledge has not been substantially considered in prior work. We require that the learner’s matrices do not change too quickly (which we model as adversarial disturbances to dynamics), and the optimizer’s matrices can change arbitrarily provided that they remain close to some row-identical matrix (which we model as imprecision in our surrogate loss function).

K.2 Model

Here we are tasked with playing a sequence of bimatrix games against a no-regret learning opponent, where the game matrices may change adversarially in each round. We assume the following properties hold for the adversarial sequence of games.

Assumption 4.

For a sequence $\{(A_{t},B_{t}):t\in[T]\}$ of $m\times n$ bimatrix games, with $m>n$ :

• each entry of $A_{t}$ and $B_{t}$ lies in $[-\frac{L}{2\sqrt{n}},\frac{L}{2\sqrt{n}}]$,

• the convex hull of the rows of each $B_{t}$ contains the unit ball in $\operatorname{\mathbb{R}}^{n}$,

• $\left\lVert xA_{t}-xA^{*}_{t}\right\rVert\leq\delta_{t}$ for any $x\in\Delta(m)$, where each row of $A^{*}_{t}$ is identical, and

• $\left\lVert xB_{t}-xB_{t-1}\right\rVert\leq\epsilon_{t}$ for any $x\in\Delta(m)$.

Each game $(A_{t},B_{t})$ is revealed after Players A and B commit to their respective strategies $x_{t}\in\Delta(m)$ and $y_{t}\in\Delta(n)$. Observe that due to the second property, for any $z\in\operatorname{\mathcal{B}}_{1}(\mathbf{0})$, there is some $x\in\Delta(m)$ such that $xB_{t}=z$. By the third property, we have that $xA^{*}_{t}=x^{\prime}A^{*}_{t}$ for any $x,x^{\prime}\in\Delta(m)$.

We recall the Online Gradient Descent algorithm with convex losses $\ell_{t}$ from Zinkevich (2003).

Algorithm 8 Online Gradient Descent (OGD)

Input: Convex set $\operatorname{\mathcal{Y}}\subseteq\operatorname{\mathbb{R}}^{n}$, initial point $y_{1}\in\operatorname{\mathcal{Y}}$, and step sizes $\theta_{1},\ldots,\theta_{T}$
for $t=1$ to $T$ do
    Play $y_{t}$ and observe loss $\ell_{t}(y_{t})$
    Set $\nabla_{t}=\nabla\ell_{t}(y_{t})$
    Set $y_{t+1}=\Pi_{\operatorname{\mathcal{Y}}}\left(y_{t}-\theta_{t}\nabla_{t}\right)=\text{argmin}_{y\in\operatorname{\mathcal{Y}}}\left\lVert y_{t}-\theta_{t}\nabla_{t}-y\right\rVert$
end for
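As an illustration of Algorithm 8, the following is a minimal sketch of OGD for the case $\operatorname{\mathcal{Y}}=\Delta(n)$, which is the case used for Player B below. The helper names `project_simplex` and `ogd`, the sort-based projection routine, and the demo loss are assumptions made for this sketch and are not taken from the paper.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    n = v.size
    u = np.sort(v)[::-1]                        # entries in decreasing order
    css = np.cumsum(u) - 1.0                    # shifted cumulative sums
    ks = np.arange(1, n + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]   # last index kept in the support
    tau = css[rho] / (rho + 1)                  # uniform shift
    return np.maximum(v - tau, 0.0)

def ogd(grad_fns, y1, step_sizes, project):
    """Run OGD; grad_fns[t](y) returns the gradient of the round-t loss at y.

    Returns the played points y_1, ..., y_T and the next iterate y_{T+1}.
    """
    y = np.asarray(y1, dtype=float)
    played = []
    for grad, theta in zip(grad_fns, step_sizes):
        played.append(y.copy())                 # play y_t
        y = project(y - theta * grad(y))        # projected gradient step
    return np.array(played), y

# Hypothetical demo: a fixed linear loss <c, y> on the 3-point simplex.
c = np.array([1.0, 0.0, 0.0])
T = 50
grads = [lambda y, c=c: c] * T
played, y_final = ogd(grads, np.ones(3) / 3, [0.1] * T, project_simplex)
```

In this demo the iterates shift mass away from the first coordinate, the only one that incurs loss, and settle at the point that splits mass evenly over the other two.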

Proposition 12 (Zinkevich (2003)).

For differentiable convex losses $\ell_{t}:\operatorname{\mathcal{Y}}\rightarrow\operatorname{\mathbb{R}}$ and step sizes satisfying $\theta_{t+1}\leq\theta_{t}$ for each $t\leq T$, the regret of OGD with respect to any $y^{*}\in\operatorname{\mathcal{Y}}$ is bounded by

\displaystyle\sum_{t=1}^{T}\ell_{t}(y_{t})-\ell_{t}(y^{*})\leq

\displaystyle\;\frac{2R^{2}_{B}}{\theta_{T}}+\sum_{t=1}^{T}\frac{\theta_{t}}{2}\left\lVert\nabla_{t}\right\rVert^{2},

where $R_{B}$ is the radius of $\operatorname{\mathcal{Y}}$ . If $\left\lVert\nabla_{t}\right\rVert\leq G_{B}$ and $\theta_{t}=\frac{2R_{B}}{G_{B}\sqrt{T}}$ for all $t\leq T$ , we have that

\displaystyle\sum_{t=1}^{T}\ell_{t}(y_{t})-\ell_{t}(y^{*})\leq

\displaystyle\;2R_{B}G_{B}\sqrt{T}.
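The constant-step-size bound follows by direct substitution of $\theta_{t}=\frac{2R_{B}}{G_{B}\sqrt{T}}$ into the general bound; as a short worked check, using only the quantities in the statement above:

```latex
\begin{align*}
\frac{2R_B^{2}}{\theta_T}
  &= 2R_B^{2}\cdot\frac{G_B\sqrt{T}}{2R_B} = R_B G_B\sqrt{T}, \\
\sum_{t=1}^{T}\frac{\theta_t}{2}\left\lVert\nabla_t\right\rVert^{2}
  &\le T\cdot\frac{R_B}{G_B\sqrt{T}}\cdot G_B^{2} = R_B G_B\sqrt{T},
\end{align*}
```

so that the two terms sum to $2R_{B}G_{B}\sqrt{T}$.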

We assume that Player B plays according to OGD in our setup, with $y_{1}=\mathbf{u}_{n}$ and $\theta=\frac{2R_{B}}{G_{B}\sqrt{T}}$. At each round $t$, we (Player A) choose some mixed strategy $x_{t}\in\Delta(m)$, and Player B plays some mixed strategy $y_{t}\in\Delta(n)$. Utilities for each player are given by the game $(A_{t},B_{t})$ as

	$\displaystyle u_{t}^{A}(x_{t},y_{t})=$	$\displaystyle\;x_{t}A_{t}y_{t};$
	$\displaystyle u_{t}^{B}(x_{t},y_{t})=$	$\displaystyle\;x_{t}B_{t}y_{t}.$

Note that the loss gradient $-\nabla u_{t}^{B}(x_{t},y_{t})$ each round for Player B (for negative utilities) is given by

\displaystyle\nabla_{t}=

\displaystyle\;-x_{t}B_{t},

and so their mixed strategy is updated at each round according to

\displaystyle y_{t}=

\displaystyle\;\Pi_{\Delta(n)}\left(y_{t-1}+\theta(x_{t-1}B_{t-1})\right).

Our utility is given by $x_{t}A_{t}y_{t}=\mathbf{u}_{n}A_{t}^{*}y_{t}+x_{t}(A_{t}-A_{t}^{*})y_{t}$ , as $x_{t}$ does not affect rewards from $A_{t}^{*}$ . We benchmark the regret of an algorithm $\operatorname{\mathcal{A}}$ against the optimal profile $(x,y)\in\Delta(m)\times\Delta(n)$ :

\displaystyle\operatorname{\textup{{Reg}}}_{T}(\operatorname{\mathcal{A}})=

\displaystyle\;\max_{(x,y)\in\Delta(m)\times\Delta(n)}\sum_{t=1}^{T}xA_{t}y-x_{t}A_{t}y_{t}.

Note that the per-round average utility for the maximizing $(x,y)$ is at least as high as that obtained by the Stackelberg equilibrium of the average game $\left(\sum_{t}\frac{A_{t}}{T},\sum_{t}\frac{B_{t}}{T}\right)$ , as for this objective one can choose both players’ strategies without restriction. We remark that finding the Stackelberg equilibrium for any fixed game $(A_{t}^{*},B_{t})$ in our setting, where $A_{t}^{*}$ has identical rows, is straightforward: it suffices to optimize over $[n]$ , as any fixed action $j\in[n]$ is a best response to some $x\in\Delta(m)$ by our assumption on the rows of $B_{t}$ , and as our rewards are only a function of Player B’s strategy $y$ . However, we are not aware of any prior work which enables competing with the average-game Stackelberg value against a learning opponent when games arrive online.

K.3 Analysis

We first show that the problem can be formulated via known, strongly $\theta$ -locally controllable dynamics with adversarial disturbances. As $B_{t}$ changes slowly between rounds, we can run $\operatorname{\textup{{NestedOCO-UD}}}$ with disturbances representing the error resulting from assuming that $B_{t}$ does not change from $B_{t-1}$ .

Lemma 12.

Given the knowledge available prior to selecting $x_{t}$ , updates for $y_{t+1}$ can be expressed via known action-linear dynamics $(\operatorname{\mathcal{X}},\operatorname{\mathcal{Y}},D_{t})$ which satisfy strong $\theta$ -local controllability, and with adversarial disturbances $w_{t}$ satisfying $\sum_{t=1}^{T}\left\lVert w_{t}\right\rVert\leq\theta\sum_{t=1}^{T}\epsilon_{t}$ .

Proof

First, note that we can compute Player B’s current strategy $y_{t}$ , as it is a function only of games and strategies up to round $t-1$ , all of which are observable. Given the update rule for $\operatorname{\textup{OGD}}$ , we can formulate the dynamics ${D}_{t}(x_{t},y_{t})$ update as

	$\displaystyle{D}_{t}(x_{t},y_{t})=$	$\displaystyle\;\Pi_{\Delta(n)}\left(y_{t}+\theta(x_{t}B_{t})\right)$
	$\displaystyle=$	$\displaystyle\;\Pi_{\Delta(n)}\left(y_{t}+\theta(x_{t}B_{t-1})+\theta(x_{t}(B_{t}-B_{t-1}))\right)$
	$\displaystyle=$	$\displaystyle\;\Pi_{\Delta(n)}\left(y_{t}+\theta(x_{t}B_{t-1})\right)+w_{t}$

where $w_{t}$ represents the error from assuming $B_{t}=B_{t-1}$. By non-expansiveness of Euclidean projection onto a convex set, together with the change bound on $B_{t}$, we have that $\left\lVert w_{t}\right\rVert\leq\left\lVert\theta(x_{t}(B_{t}-B_{t-1}))\right\rVert\leq\theta\epsilon_{t}$. Further, the update is action-linear (up to projection, prior to $w_{t}$).

To see that $D_{t}$ satisfies strong $\theta$-local controllability, we recall that the convex hull of the rows of $B_{t-1}$ contains the unit ball, and so for any $y^{*}$ in $\operatorname{\mathcal{B}}_{\theta}(y_{t})\cap\Delta(n)$ there is some $x_{t}\in\Delta(m)$ such that $\theta(x_{t}B_{t-1})=y^{*}-y_{t}$. ∎
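The disturbance bound in the proof of Lemma 12 can be checked numerically. The sketch below is illustrative only: the helper names, the matrix sizes, the drift scale, and the randomly drawn games are all hypothetical, and the random matrices are not required to satisfy Assumption 4, since the bound $\left\lVert w_{t}\right\rVert\leq\theta\left\lVert x_{t}(B_{t}-B_{t-1})\right\rVert$ relies only on non-expansiveness of the projection.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    n = v.size
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, n + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]
    tau = css[rho] / (rho + 1)
    return np.maximum(v - tau, 0.0)

def disturbance_and_bound(y, x, B_prev, B_curr, theta):
    """Return (||w_t||, theta * ||x (B_t - B_{t-1})||) for a single round."""
    true_next = project_simplex(y + theta * (x @ B_curr))     # actual update
    nominal_next = project_simplex(y + theta * (x @ B_prev))  # assumes B_t = B_{t-1}
    w = true_next - nominal_next
    return np.linalg.norm(w), theta * np.linalg.norm(x @ (B_curr - B_prev))

rng = np.random.default_rng(0)
m, n, theta = 6, 3, 0.05   # hypothetical sizes and step size

results = []
for _ in range(1000):
    B_prev = rng.uniform(-1.0, 1.0, size=(m, n))
    B_curr = B_prev + 0.1 * rng.uniform(-1.0, 1.0, size=(m, n))  # slow drift
    x = rng.dirichlet(np.ones(m))   # Player A's mixed strategy
    y = rng.dirichlet(np.ones(n))   # Player B's current strategy
    results.append(disturbance_and_bound(y, x, B_prev, B_curr, theta))
results = np.array(results)
```

Each row of `results` pairs the realized disturbance norm with its bound; the first entry never exceeds the second, up to floating-point tolerance.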

At each round $t$, our loss is given by $f_{t}(x_{t},y_{t})=-x_{t}A_{t}{y}_{t}$. There are two barriers to running our algorithm. First, the update for $y_{t}$ is determined by $x_{t-1}$ and not $x_{t}$, yet we do not see $A_{t-1}$ prior to selecting $x_{t-1}$, which would be required to take the appropriate step following $f_{t-1}$. Second, the loss depends on $x_{t}$ in addition to $y_{t}$. To address both issues, we instead run $\operatorname{\textup{{NestedOCO-UD}}}$ with surrogate losses $\tilde{f}_{t}(\tilde{y}_{t})=-\mathbf{u}_{n}A_{t-1}{y}_{t}$, with action rounds relabeled to account for the fact that $x_{t-1}$ influences the step for $y_{t}$ (which does not change the behavior of the algorithm). We set $A_{0}=\mathbf{0}_{m,n}$.

Theorem 21.

Repeated play against an opponent using $\operatorname{\textup{OGD}}$ with step size $\theta=\Theta(T^{-1/2})$ in a sequence of games $(A_{t},B_{t})$ satisfying Assumption 4 can be cast as an instance of online control with strongly $\theta$ -locally controllable dynamics, for which the regret of $\operatorname{\textup{{NestedOCO-UD}}}$ is at most

\displaystyle\operatorname{\textup{{Reg}}}_{T}(\operatorname{\textup{{NestedOCO-UD}}})\leq

\displaystyle\;O\left(\sqrt{T}+\sum_{t=1}^{T}(\delta_{t}+\epsilon_{t})\right),

with efficient per-round computation.

Proof

We first analyze regret with respect to the surrogate losses $\tilde{f}_{t}(y_{t})$ . To run $\operatorname{\textup{{NestedOCO-UD}}}$ for $\alpha>0$ , it suffices to calibrate the step size for the internal $\operatorname{\textup{{FTRL}}}$ instance such that $\eta\frac{L}{\gamma}\leq\theta\alpha$ . Given that rewards are bounded in $[-\frac{L}{2\sqrt{n}},\frac{L}{2\sqrt{n}}]$ , we have that each $x_{t}B_{t}y_{t}$ is $\frac{L}{\sqrt{n}}$ -Lipschitz for the $\ell_{1}$ norm, and thus $L$ -Lipschitz for the $\ell_{2}$ norm, so we can take $G_{B}=L$ . Further, the $\ell_{2}$ radius of $\Delta(n)$ is $R_{B}={\sqrt{2}}/{2}$ , and so we have that

\theta=\sqrt{\frac{2}{L^{2}T}}.

Then, for a strongly $\theta$ -locally controllable instance with total perturbation bound $\sum_{t=1}^{T}\left\lVert w_{t}\right\rVert\leq E$ , we obtain the regret bound

\displaystyle\operatorname{\textup{{Reg}}}_{T}(\operatorname{\textup{{NestedOCO-UD}}})\leq

\displaystyle\;\eta\frac{TL^{2}}{\gamma}+\frac{G}{\eta}+\frac{2LRE}{(1-\alpha)\theta}	(Thm. 14)

for any

\displaystyle\eta\leq\min\left(\sqrt{\frac{G\gamma}{L^{2}T}},\alpha\sqrt{\frac{2}{T}}\right).

By Lemma 12, we can efficiently run $\operatorname{\textup{{NestedOCO-UD}}}$ over the surrogate losses $\tilde{f}_{t}$ and bound regret with respect to any $y^{*}\in\operatorname{\mathcal{Y}}$ as:

\displaystyle\sum_{t=1}^{T}\tilde{f}_{t}(y_{t})-\tilde{f}_{t}(y^{*})\leq

\displaystyle\;\eta\frac{TL^{2}}{\gamma}+\frac{G}{\eta}+\frac{\sqrt{2}L\cdot\sum_{t=1}^{T}\epsilon_{t}}{1-\alpha}.

Further, we can bound the error from the surrogate losses as

$\displaystyle\sum_{t=1}^{T}{f}_{t}(x_{t},y_{t})-\tilde{f}_{t}(y_{t})=$	$\displaystyle\;\sum_{t=1}^{T}{f}_{t}(x_{t},y_{t})-{f}_{t-1}(\mathbf{u}_{n},y_{t})$
$\displaystyle\leq$	$\displaystyle\;\frac{L}{2\sqrt{n}}+\sum_{t=1}^{T-1}{f}_{t}(x_{t},y_{t})-{f}_{t}(\mathbf{u}_{n},y_{t+1})$	( $f_{0}(\mathbf{u}_{n},y_{1})=0$ , $f_{T}(x_{T},y_{T})\leq\frac{L}{2\sqrt{n}}$ )
$\displaystyle\leq$	$\displaystyle\;\frac{L}{2\sqrt{n}}+\eta\frac{TL^{2}}{\gamma}+\sum_{t=1}^{T-1}x_{t}(A_{t}-A_{t}^{*})y_{t}$	(Prop. 5)
$\displaystyle\leq$	$\displaystyle\;\frac{L}{2\sqrt{n}}+\eta\frac{TL^{2}}{\gamma}+\sum_{t=1}^{T}\delta_{t},$	(Assumption 4, Cauchy-Schwarz)

and likewise, for any $(x^{*},y^{*})\in\Delta(m)\times\Delta(n)$ we can bound

	$\displaystyle\sum_{t=1}^{T}\tilde{f}_{t}(y^{*})-f_{t}(x^{*},y^{*})\leq$	$\displaystyle\;-f_{T}(x^{*},y^{*})-\sum_{t=1}^{T-1}x^{*}(A_{t}-A_{t}^{*})y^{*}$
	$\displaystyle\leq$	$\displaystyle\;\frac{L}{2\sqrt{n}}+\sum_{t=1}^{T}\delta_{t}.$

Combining the previous results, we have that for any $(x^{*},y^{*})\in\Delta(m)\times\Delta(n)$ , the regret of $\operatorname{\textup{{NestedOCO-UD}}}$ with respect to the true losses is bounded by

	$\displaystyle\sum_{t=1}^{T}f_{t}(x_{t},y_{t})-f_{t}(x^{*},y^{*})\leq$	$\displaystyle\;\sum_{t=1}^{T}\tilde{f}_{t}(\tilde{y}_{t})-\tilde{f}_{t}(y^{*})+\sum_{t=1}^{T}{f}_{t}(x_{t},y_{t})-\tilde{f}_{t}(y_{t})+\sum_{t=1}^{T}\tilde{f}_{t}(y^{*})-f_{t}(x^{*},y^{*})$
	$\displaystyle\leq$	$\displaystyle\;\eta\frac{2TL^{2}}{\gamma}+\frac{G}{\eta}+\frac{L}{\sqrt{n}}+2\sum_{t=1}^{T}\delta_{t}+\frac{\sqrt{2}L\cdot\sum_{t=1}^{T}\epsilon_{t}}{1-\alpha}$
	$\displaystyle\leq$	$\displaystyle\;3\cdot\max\left(\sqrt{\frac{TGL^{2}}{\gamma}},\sqrt{\frac{T}{2\alpha^{2}}}\right)+\frac{L}{\sqrt{n}}+2\sum_{t=1}^{T}\delta_{t}+\frac{\sqrt{2}L\cdot\sum_{t=1}^{T}\epsilon_{t}}{1-\alpha}$

for any $\alpha\in(0,1)$ , which yields the theorem. ∎

Theorem 10 follows directly from Theorem 21.
