Online Stackelberg Optimization via Nonlinear Control
Abstract
In repeated interaction problems with adaptive agents, our objective often requires anticipating and optimizing over the space of possible agent responses. We show that many problems of this form can be cast as instances of online (nonlinear) control which satisfy local controllability, with convex losses over a bounded state space which encodes agent behavior, and we introduce a unified algorithmic framework for tractable regret minimization in such cases. When the instance dynamics are known but otherwise arbitrary, we obtain oracle-efficient regret by reduction to online convex optimization, which can be made computationally efficient if dynamics are locally action-linear. In the presence of adversarial disturbances to the state, we give tight bounds in terms of either the cumulative or per-round disturbance magnitude (for strongly or weakly locally controllable dynamics, respectively). Additionally, we give sublinear regret results for the cases of unknown locally action-linear dynamics as well as for the bandit feedback setting. Finally, we demonstrate applications of our framework to well-studied problems including performative prediction, recommendations for adaptive agents, adaptive pricing of real-valued goods, and repeated gameplay against no-regret learners, directly yielding extensions beyond prior results in each case.
1 Introduction
Machine learning problems involving strategic or adaptive agents are commonly framed as Stackelberg games, wherein the leader aims to commit to an optimal strategy in anticipation of the follower’s best response. This approach has been effectively applied to challenges ranging from performative feature manipulation (Hardt et al., 2015; Dong et al., 2018; Perdomo et al., 2020; Jagadeesan et al., 2022b) and optimal pricing (Roth et al., 2015; Daskalakis and Syrgkanis, 2015; Nedelec et al., 2020) to resource allocation in security games (Blum et al., 2014; Balcan et al., 2015; Alcantara-Jiménez and Clempner, 2020) and learning in tabular games (Letchford et al., 2009; Peng et al., 2019; Lauffer et al., 2022; Collina et al., 2023), often with a regret minimization objective. Additionally, several of these settings have been independently extended to account for agents that may update their strategies gradually over time rather than optimally responding in each round (Zrnic et al., 2021a; Brown et al., 2022; Braverman et al., 2017; Deng et al., 2019; Brown et al., 2023). Despite their conceptual similarities, these problems have largely been approached as distinct areas of study, each with their own growing body of techniques. Our aim in this work is to offer a unifying perspective and algorithmic approach for problems of this form, through the lens of online control.
For the broad family of online “Stackelberg-style” optimization problems, the language of control is quite natural to adopt: we are navigating a dynamical system where states corresponding to agent strategies evolve as a function of our own actions, and where objectives which consider best-response stability can be expressed in terms of the stationary behavior of this system. Our results consider a general class of online control instances for representing such problems, which we introduce in Section 2, and in Section 3 we give a sequence of no-regret algorithms for these instances satisfying a range of robustness properties. In Section 4, we show that several online optimization problems involving adaptive agents, including variants of online performative prediction (as in Kumar et al. (2022)), online recommendations (as in Agarwal and Brown (2023)), adaptive pricing (as in Roth et al. (2015)), and learning in time-varying games (as in Anagnostides et al. (2023)) can be embedded in our framework and solved by our algorithms.
While there has been a great deal of recent progress in online linear control, yielding algorithms which can optimize over stabilizing linear policies even with general convex costs, adversarial disturbances, and unknown dynamics (Agarwal et al., 2019a; Simchowitz et al., 2020; Cassel et al., 2022; Minasyan et al., 2022), the required assumptions and regret benchmarks for these algorithms do not always type-check with the settings we are interested in. For the examples we consider, we will often wish to allow for nonlinear dynamics (e.g. encoding an agent’s utility function) and explicitly bounded spaces (e.g. via projection into the simplex), and we will seek to compete with regret benchmarks which correspond to stable responses by the agent. Unfortunately, as we show in Proposition 2, the latter goal is incompatible with linear policies even under linear dynamics and in the absence of any disturbances: the performance of every linear policy can be worse than the best policy in the class of affine “state-targeting” policies.
In contrast, the orthogonal set of assumptions we identify enables tractable regret minimization even for nonlinear control problems and comports with the requirements of Stackelberg optimization across a wide range of settings, including the ability to compete with state-targeting policies. For convex and compact state and action spaces and , our first key assumption is that the dynamics satisfy a notion of local controllability. While local controllability is well-studied for continuous-time and asymptotic control (Aoki, 1974; Kuhn and Wohltmann, 1989; Barbero-Liñán and Jakubczyk, 2013; Boscain et al., 2021), we are unaware of any prior applications to finite-time online optimization, and we adapt existing definitions to be appropriate for this setting. We say that is strongly locally controllable if every state in a fixed-radius ball around is reachable in a single round by an appropriate choice of , and that is weakly locally controllable if the reachable radius around is allowed to vanish near the boundary of . We also assume that our loss in each round is determined by (or well-approximated by) an adversarially chosen convex function depending only on the state .
When these conditions hold, we show in Theorem 1 that this is sufficient to obtain regret with respect to the loss of the best fixed state, provided that dynamics are known and we have offline access to an oracle for non-convex optimization; the oracle call can be removed if dynamics are locally action-linear, i.e. given by (or locally well-approximated by) a function linear in at each fixed . If adversarial disturbances to the dynamics are present, our approach can be extended for both weakly (Theorem 2) and strongly (Theorem 3) locally controllable dynamics with additional regret scaling linearly in total disturbance magnitude, provided that each round’s disturbance cannot be too large in the case of weak local controllability; we give lower bounds showing that each dependence on disturbance magnitude is tight. The aforementioned results all extend to the case where the dynamics (absent disturbances) are given by a known but time-dependent function . If dynamics are unknown but time-invariant, and locally action-linear with appropriate regularity parameters, we obtain sublinear regret provided that a “near-stabilizing” action is known at . We additionally extend our approach to the bandit feedback setting, where we obtain regret. In Section 4 we show that each of the following, with appropriate assumptions, can be cast as a locally controllable instance with state-only convex surrogate losses:
• Performative prediction: Minimize prediction loss for a classifier , where the distribution in each round is updated according to the prior classifier and distribution.
• Adaptive recommendations: Maximize the reward when showing menus of size to an agent, whose choice in each round depends on preferences which are influenced by choices in prior rounds (encoded in the “memory vector” ).
• Adaptive pricing: Maximize profit - for selling bundles of goods to an agent at prices and with costs , where the agent’s purchased bundle is a function of their utility function, consumption rate, and existing reserves.
• Repeated gameplay: Maximize the reward obtained from playing a sequence of time-varying games against a no-regret learning agent.
In each case, application of our algorithms from Section 3 yields results which extend beyond the applicability regimes of prior work, such as by enabling relaxation of previous assumptions or a novel extension to adversarial or dynamic problem variants.
1.1 Related Work
Online control.
Much of the recent progress in online control (Agarwal et al., 2019a, b; Cassel et al., 2022; Minasyan et al., 2022) considers linear systems with general convex losses, benchmarking against a class of (“strongly stable”) fast-mixing linear policies introduced for linear-quadratic control (Cohen et al., 2018) by leveraging the framework of “OCO with memory” (Anava et al., 2014). Results have also been shown for nonlinear policy classes via neural networks (Chen et al., 2022), and for nonlinear dynamics with oracles in episodic settings (Kakade et al., 2020), via approximation with random Fourier features (Lale et al., 2021; Luo et al., 2022), via adaptive regret for time-varying linear systems (Gradu et al., 2022; Minasyan et al., 2022), and via dynamic regret over actions in terms of disturbance “attenuation” (Muthirayan and Khargonekar, 2022). For a further overview of online control and its historical context, see Hazan and Singh (2022). In contrast to the bulk of prior work in which states and actions are bounded implicitly via policy stability notions, we consider state and action spaces which are bounded explicitly, as enabled by nonlinearity in dynamics (e.g. via projection, or range decay of dynamics near the boundary). These works also view disturbances as intrinsic to the system, and account for their influence directly in regret benchmarks (the “optimal policy” will face the same sequence of disturbances in hindsight, regardless of state). Within the context of Stackelberg optimization where a fixed protocol largely determines an agent’s strategy updates, we view the role of disturbances as more akin to adversarial corruptions as considered in reinforcement learning (Lykouris et al., 2021; Zhang et al., 2021); while we incur linear dependence, our regret benchmarks are agnostic to alternate counterfactual disturbance sequences.
Strategizing against learners.
Initially formulated within the context of repeated auctions (Braverman et al., 2017), a recent line of work has considered the problem of optimizing long-run rewards in a repeated game against a no-regret learner across a range of tabular and Bayesian settings (Deng et al., 2019; Mansour et al., 2022; Brown et al., 2023; Zhang et al., 2023). While bounds on attainable reward have been known in terms of the Price of Anarchy (Blum et al., 2008; Hartline et al., 2015b), this sequence of results has highlighted important connections with Stackelberg equilibria: the Stackelberg value of the game is attainable on average against any no-regret learner, and it is the maximum attainable value against many common no-regret algorithms (such as no-swap learners, as shown by Deng et al. (2019)). This theme has emerged in other simultaneous learning settings as well; notably, Zrnic et al. (2021b) show that long-run outcomes in strategic classification are shaped by relative learning rates between parties, which can designate either as the Stackelberg leader.
Nested convex optimization.
The technique of identifying convex structure nested inside a more general problem has been applied broadly across a range of online optimization settings (Neu and Olkhovskaya, 2021; Shen et al., 2023; Flokas et al., 2019). For repeated interaction problems involving an agent with unknown utility, such as optimal pricing, Roth et al. (2015) identify utility conditions under which the non-convex objective over prices becomes convex in the space of agent actions, and where explorability properties resembling local controllability hold, which enables convex optimization by locally learning agent preferences; this “revealed preferences” approach has also been applied to strategic classification (Dong et al., 2018). In recent work concerning recommendations for agents with history-dependent preferences (Agarwal and Brown, 2022, 2023), properties related to local controllability are leveraged to enable tractable optimization as well. We consider each of these settings as applications in Section 4.
2 Model and Preliminaries
Let and be convex and compact subsets of Euclidean space, respectively denoting the action and state spaces, where we assume . Further, we assume that contains a ball of radius around the origin , and is contained in a ball of radius around the origin.
An instance of our control problem consists of choosing a sequence of actions over rounds, which will yield a sequence of states , and we will incur losses determined by adversarially chosen functions . Let the initial state be . In the basic version of our problem, upon choosing each for rounds , we observe the state update to
where is an arbitrary continuous function which we refer to as the dynamics of our problem. We sometimes allow disturbances to the dynamics, where for chosen adversarially. In some cases we allow time-varying dynamics , where the dynamics in each round are denoted by .
Here and in Section 3, we assume that our loss in round is given by , where each is a -Lipschitz convex function revealed after playing ; we relax these assumptions for some of our applications in Section 4, e.g. to allow dependence on as well. We generally measure performance with respect to the best fixed state, and the regret for an algorithm yielding is
In Proposition 2, we relate this benchmark to the class of “state-targeting” policies, which can sometimes be expressed by affine functions, and we compare their performance to linear policies. Throughout, we use to denote the Euclidean norm, and we let denote the norm ball of radius around . We let denote Euclidean projection into the set ; denotes the uniform distribution over items, and denotes the probability simplex.
2.1 Locally Controllable Dynamics
A number of properties under the name “local controllability” have been considered for various continuous-time and asymptotic control settings (Aoki, 1974; Kuhn and Wohltmann, 1989; Barbero-Liñán and Jakubczyk, 2013; Boscain et al., 2021), generally relating to the notion that all states in a neighborhood around a given state are reachable. We give two formulations of local controllability for our setting, which we take as properties of the dynamics holding over all inputs.
Definition 1 (Weak Local Controllability).
For , an instance satisfies (weak) -local controllability if for any and , there is some such that , where is the distance from to the boundary of .
Definition 2 (Strong Local Controllability).
For , an instance satisfies strong -local controllability if for any and , there is some such that .
We often refer to weak local controllability simply as local controllability. This property ensures that there is always some action which results in the next state staying fixed at , as well as some action which moves the state to any point in a surrounding ball; in the weak case, the size of the reachable ball is allowed to decay as approaches the boundary of . The parameter controls the speed at which we can navigate the state space: when in the weak case (or in the strong case), we can always immediately reach some point on the boundary of , yet for close to zero we may only be able to move in a small neighborhood. Our results use local controllability to minimize regret over by reduction to online convex optimization. As we prove in Appendix A, up to a quantifier alternation which vanishes as approaches , a property of this form is essentially necessary: competing with the best state is impossible if we cannot remain in its neighborhood.
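To make the distinction between the two definitions concrete, the following is a minimal sketch and not part of our formal development. It assumes a state space given by an origin-centered Euclidean ball, and it assumes that the reachable radius in the weak case is the controllability parameter times the distance to the boundary, while in the strong case it is the parameter itself. The names `reachable_radius`, `clip_target`, `rho`, and `state_radius` are hypothetical and introduced only for illustration.

```python
import numpy as np

def reachable_radius(x, rho, state_radius=1.0, strong=False):
    """Radius of the ball around x assumed reachable in one round.

    Assumption: the state space is the origin-centered ball of radius
    `state_radius`. Weak case: rho times the distance to the boundary.
    Strong case: a fixed radius rho.
    """
    if strong:
        return rho
    dist_to_boundary = state_radius - np.linalg.norm(x)
    return rho * max(dist_to_boundary, 0.0)

def clip_target(x, target, rho, state_radius=1.0, strong=False):
    """Pull a desired next state back into the ball assumed reachable from x."""
    r = reachable_radius(x, rho, state_radius, strong)
    step = target - x
    norm = np.linalg.norm(step)
    if norm <= r:
        return target
    return x + step * (r / norm)

# Illustrative values: halfway to the boundary, with rho = 0.5.
x = np.array([0.5, 0.0])
desired = np.array([1.0, 0.0])
weak_target = clip_target(x, desired, rho=0.5)
strong_target = clip_target(x, desired, rho=0.5, strong=True)
```

Under these assumptions the weak-case target is pulled back toward the current state, whereas the strong-case target is left unchanged, reflecting that the weak reachable ball shrinks near the boundary.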
Proposition 1.
Suppose there is some and values such that for all and , . Then, there are losses such that for any algorithm .
2.2 States vs. Policies
While regret benchmarks in online control are typically expressed in terms of a reference class of policies, we note that there is a class of “state-targeting” policies which track the reward of fixed states (asymptotically, and up to the influence of disturbances), and which can be implemented if is known; we maintain the formulation in terms of fixed states for clarity with respect to our motivations for Stackelberg optimization. Existing no-regret algorithms for online control typically compete with linear policies, and choose actions each round by implementing policies which are linear in multiple past states (as in e.g. Agarwal et al. (2019a)). Here, we show that all such policies can be arbitrarily suboptimal when compared to state-targeting policies, even for dynamics which are linear up to projection and with fixed convex losses over states, as they may yield actions and states which remain fixed at in every round even if the optimal state is always immediately accessible under the dynamics. We prove Proposition 2 in Appendix A.
Proposition 2.
For an instance , let the class of state-targeting policies for be given by where . Define the regret of a policy class as
where is updated by playing at each round. For any -locally controllable instance, there is a set for which . Further, for any class where each is a matrix yielding actions , there is an instance where for .
If dynamics are linear up to projection with for full-rank , and , note that implements any for sufficiently large .
3 No-Regret Algorithms for Locally Controllable Dynamics
Here we give a sequence of no-regret algorithms satisfying a range of robustness properties. Our primary algorithm , presented in Section 3.1, operates over known time-varying dynamics without disturbances and requires an offline non-convex optimization oracle, and we identify conditions in Section 3.2 which remove the oracle requirement. In Section 3.3 we give two algorithms, and , which allow adversarial disturbances to weakly and strongly locally controllable dynamics, respectively. In Section 3.4 we extend to accommodate unknown dynamics under appropriate regularity conditions (provided an initial “approximately stabilizing” action is known at ), and in Section 3.5 we give an algorithm which obtains regret under bandit feedback.
3.1 Nonlinear Control via Online Convex Optimization
When dynamics satisfy local controllability and is not too close to , all points in a ball around are feasible with an appropriate ; this enables execution of an online convex optimization (OCO) algorithm over by playing the action which yields a state update to the target chosen at each iteration, computed via offline non-convex optimization. Here we assume that is known and can be queried for any inputs, and that disturbances to the state are not present. We allow the dynamics to change over time, potentially as a function of previous actions and losses for , provided that can be determined in each round. We use Follow the Regularized Leader () as our OCO subroutine (Shalev-Shwartz and Singer, 2006; Abernethy et al., 2008), yet we note that it may be replaced by any OCO algorithm whose per-round step size is guaranteed to be sufficiently small (such as OGD with a constant learning rate); statements of the algorithm and its key properties are provided in Appendix B. We instantiate over a contracted space , calibrated to ensure that the minimum loss over is close to that for , yet where each step of lies within the feasible region ensured by (weak) local controllability.
Theorem 1.
For a -locally controllable instance without disturbances and with known at each , the regret of for convex -Lipschitz losses is at most
with respect to any state , with queries made to a non-convex optimization oracle.
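The reduction can be illustrated with a minimal sketch, which is not the algorithm analyzed in Theorem 1. It substitutes projected online gradient descent with a constant step size for the regularized-leader subroutine, and it assumes toy dynamics in which the next state is the current state plus an invertible matrix `B` applied to the action, so that the action realizing each target is found by a linear solve rather than an oracle call. The names `nested_ogd`, `project_ball`, `B`, `radius`, and `step` are hypothetical.

```python
import numpy as np

def project_ball(x, radius):
    """Euclidean projection onto the origin-centered ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def nested_ogd(grad_fns, B, x0, radius, step):
    """Run projected OGD over states, realizing each target via an action.

    Assumed toy dynamics: x_next = x + B @ u, with B invertible.
    """
    x = np.asarray(x0, dtype=float)
    states = [x.copy()]
    for grad in grad_fns:
        # OCO step over the contracted state space (ball of radius `radius`).
        target = project_ball(x - step * grad(x), radius)
        # Choose the action whose state update lands on the target.
        u = np.linalg.solve(B, target - x)
        x = x + B @ u
        states.append(x.copy())
    return np.array(states)

# Illustrative run: fixed quadratic loss 0.5 * ||x - c||^2 in every round.
c = np.array([0.5, 0.0])
B = np.array([[2.0, 0.0], [0.0, 1.0]])
grad_fns = [lambda x, c=c: x - c] * 200
states = nested_ogd(grad_fns, B, x0=np.zeros(2), radius=0.9, step=0.1)
```

In this sketch each round moves the state by at most the step size times the gradient norm, which is the property that keeps every target inside the locally reachable region.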
3.2 Efficient Updates for Action-Linear Dynamics
While requires no assumptions on the dynamics beyond local controllability, there are large classes of dynamics for which the oracle call can be removed. We say that dynamics are action-linear if is linear in , for (and arbitrary for ).
Proposition 3.
For a -locally controllable and action-linear instance , the per-round optimization problem for in is convex.
Proof
For , we have for some matrix and vector , and so we can solve efficiently. ∎
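As a minimal sketch of the convex per-round problem, and assuming an action space given by an origin-centered Euclidean ball, the action can be found by projected gradient descent on the squared distance between the action-linear update and the target. The function name `solve_action` and the parameters `u_radius` and `iters` are hypothetical, and projected gradient descent is one of many possible convex solvers rather than a method prescribed by the analysis.

```python
import numpy as np

def solve_action(A, b, target, u_radius, iters=500):
    """Minimize ||A u + b - target||^2 over the ball of radius u_radius.

    Uses projected gradient descent with step 1/L, where
    L = 2 * ||A||_2^2 is the smoothness constant of the objective.
    """
    u = np.zeros(A.shape[1])
    lr = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)
    for _ in range(iters):
        grad = 2.0 * A.T @ (A @ u + b - target)
        u = u - lr * grad
        norm = np.linalg.norm(u)
        if norm > u_radius:
            u = u * (u_radius / norm)  # project back onto the action ball
    return u

# Illustrative values for the matrix and vector at a fixed state.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([0.1, -0.2])
target = np.array([0.5, 0.3])
u = solve_action(A, b, target, u_radius=1.0)
next_state = A @ u + b
```

When the target is reachable, as local controllability guarantees for targets near the current state, the minimizer attains the target exactly.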
The class of action-linear dynamics is quite general, owing to the flexibility permitted by nonlinear parameterizations of in terms of ; in Appendix D, we show that local controllability holds for multiple explicit families of instances when appropriate eigenvalue conditions are satisfied. We can further relax this condition to accommodate dynamics where action-linearity holds only locally in the neighborhood of stabilizing actions (i.e. actions where ).
Definition 3 (Locally Action-Linear Dynamics).
An instance is locally action-linear if, for any , such that , and such that , the dynamics are given by , where is a matrix and is a vector, both with norms bounded by some absolute constant, where and is any function where for some constants .
By this condition, for any in a sufficiently small neighborhood around , the deviation of dynamics (and thus the resulting ) from action-linearity vanishes. Note that our algorithm always chooses a target which will be near ; as such, these deviations from action-linearity can be modeled as disturbances with magnitude strictly less than our per-round step size (along with universal constant factors). The existence of an efficient implementation follows as a straightforward corollary of Theorem 2 in Section 3.3, which extends to accommodate bounded adversarial disturbances, as we can then select actions by disregarding the influence of and only considering the local approximation at each state (assuming that each decomposition between and the action-linear component is known).
3.3 Adversarial Disturbances
Our algorithm can be extended to accommodate adversarial disturbances, where the state is updated as , with chosen adversarially. In the weak local controllability case, we show a sharp threshold effect in terms of whether or not is allowed to exceed the undisturbed distance from the boundary by a factor of : if disturbances are bounded below this threshold, regret minimization remains feasible with a tight dependence on the total disturbance magnitude, yet if disturbances may exceed this, no sublinear regret rate is attainable even for a constant total disturbance magnitude. When is small, an adversary can push us to the boundary faster than we can “undo” past disturbances, causing our feasible range to decay.
Theorem 2 (Bounded Disturbances for Weak Local Controllability).
For any , suppose that a sequence of adversarial disturbances for a -locally controllable instance satisfies and , for some . If , there is an algorithm with regret for convex Lipschitz losses bounded by
and there is an instance where any algorithm obtains . If , there is an instance such that any algorithm obtains even when .
The maximum disturbance bound can be removed when dynamics are strongly locally controllable, as the ensured feasible range of the dynamics does not vanish at the boundary of the state space. For such instances, we can minimize regret (with tight dependence) even if disturbances are only implicitly bounded by the state space diameter (which is at least , without loss of generality).
Theorem 3 (Unbounded Disturbances for Strong Local Controllability).
For any and strongly -locally controllable instance with disturbances satisfying , there is an algorithm with regret for convex Lipschitz losses bounded by
and there is an instance where any algorithm obtains .
In each case, our lower bounds in terms of hold for the same constants obtained by our algorithms, and our algorithms obtain the stated regret guarantees even when is not known in advance. We present the algorithms and analysis for each theorem in Appendix E; both operate by tracking deviations from an idealized trajectory without disturbances, and calibrating parameters to preserve sufficient reachability margin for applying corrections towards this trajectory in each round. The lower bounds both proceed by considering an instance with a fixed target state and losses which track the distance from , along with an adversary whose goal is to maximize this distance by selecting disturbances which push the current state away from .
3.4 Unknown Dynamics
Up until this point, we have assumed that the dynamics can be queried arbitrarily in each round. While this has required minimal assumptions on beyond local controllability, accommodation of unknown dynamics is often desired in online control (Cassel et al., 2022; Minasyan et al., 2022) and for several of our applications (Roth et al., 2015; Agarwal and Brown, 2023). Here we give conditions under which regret minimization can be implemented without advance knowledge of by an algorithm , which maintains continuously-updating local linear approximations of near across rounds. Crucially, we assume that is time-invariant and locally action-linear with sufficiently small Lipschitz parameters, and that for the initial state some near-stabilizing action is known, i.e. , for some .
Theorem 4.
For any -locally controllable and time-invariant instance which satisfies local action-linearity and appropriate Lipschitz conditions, there is an algorithm with for convex Lipschitz losses and unknown dynamics , provided that at we are given some such that .
We state and prove Theorem 4 in Appendix F, along with additional details on the regularity and near-stability assumptions. The crux of our analysis, beyond that from our previous results, hinges on being able to maintain and update local linear approximations of throughout our optimization which are sufficiently accurate to allow us to discard the effects of both learned representation errors and action non-linearity from as bounded disturbances. We implement each update from our nested regret minimization algorithm as a series of steps involving small near-orthogonal perturbations to our targets , which we then use to update our local estimate for .
3.5 Bandit Feedback
We can extend our approach from to accommodate bandit feedback for convex losses by replacing with the algorithm (Flaxman et al., 2004) and appropriately recalibrating parameters. obtains regret, which is the best currently-known bound for bandit convex optimization without additional assumptions (e.g. strong convexity), and we obtain an analogous bound here for nested optimization. We note that this extension to bandit feedback can again be applied for any algorithm with a small per-round step-size bound, though this property does not hold for algorithms which sample from larger sets to reduce variance of gradient estimators (e.g. those from Abernethy et al. (2008); Hazan and Levy (2014)).
Theorem 5.
For any -locally controllable instance , there is an oracle-efficient algorithm with expected regret bounded by
for -Lipschitz convex losses under bandit feedback.
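For intuition on the bandit extension, the following is a minimal sketch of a one-point gradient estimator in the style of Flaxman et al. (2004), which uses a single loss evaluation at a randomly perturbed point. It illustrates only the estimator and not the full nested algorithm or its parameter calibration; the names `one_point_gradient`, `delta`, and `rng`, and the linear test loss, are assumptions made for illustration.

```python
import numpy as np

def one_point_gradient(loss, x, delta, rng):
    """Gradient estimate from a single loss evaluation at x + delta * v.

    v is drawn uniformly from the unit sphere by normalizing a Gaussian.
    The estimate is unbiased for the gradient of a smoothed version of
    the loss; for a linear loss the smoothing has no effect.
    """
    d = x.shape[0]
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)
    return (d / delta) * loss(x + delta * v) * v

# Illustrative check with a linear loss, whose gradient is the vector a.
rng = np.random.default_rng(0)
a = np.array([1.0, -2.0, 0.5])
loss = lambda x: float(a @ x)
x = np.zeros(3)
samples = [one_point_gradient(loss, x, 0.1, rng) for _ in range(20000)]
estimate = np.mean(samples, axis=0)
```

Because each query point lies within distance `delta` of the current iterate, the perturbation stays small, which is consistent with the small per-round step-size requirement noted above.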
4 Applications for Online Stackelberg Optimization
We give several applications of our framework to online Stackelberg problems involving strategic or adaptive agents, each cast as an instance of online control with nonlinear dynamics where local controllability holds, and where our objectives are well-approximated by convex surrogate losses only over the state. Each application extends prior work by either allowing for more relaxed assumptions, unifying distinct problem instances, or giving a novel formulation to account for dynamic and adversarial behavior; analysis and comparison to related work is contained in Appendices H-K.
4.1 Online Performative Prediction
Performative Prediction was introduced by Perdomo et al. (2020) to capture settings in which the data distribution may shift as a function of the classifier itself. We consider the online formulation of Performative Prediction introduced in Kumar et al. (2022) as an instance of online convex optimization with unbounded memory, which we extend to accommodate a stateful variant of the problem (as in Brown et al. (2022)) in which the update to the distribution is a function of both the classifier and the current distribution itself. Let denote our space of classifiers, and let be the initial distribution over . When a classifier is deployed, the distribution is updated to
where , for a random variable with mean and covariance , and with , where satisfies -local controllability for some and appropriate smoothness notions. We also assume there is some linear such that if . We then receive loss , where each is convex and Lipschitz.
This generalizes the model of Kumar et al. (2022), in which is taken to be a fixed matrix; there, -local controllability is satisfied for some provided that is nonsingular. Their aim is to compete with the best fixed classifier by running regret minimization over . Here we run over , taken over the range of , which allows us to compete against the best fixed classifier as well by the properties of ; while the classifiers we play will generally not result in stabilizing points of , their excess loss compared to each is bounded.
Theorem 6 (Regret Minimization for Performative Prediction).
For any , the dynamics for Online Performative Prediction are -locally controllable, and obtains regret with respect to the best fixed classifier.
4.2 Adaptive Recommendations
Online interactions with economic agents of various types are ubiquitous, and the resulting control problems tend to be manifestly nonlinear; here we treat two diverse examples from this space. The Adaptive Recommendations problem, as introduced by Agarwal and Brown (2022), is about providing menu recommendations repeatedly to an agent, whose choice distribution is a function of their past selections, while the controller’s reward in each round depends on adversarial losses over the choice. In each round , we show the agent a (possibly randomized) menu containing (out of ) items, and the agent’s instantaneous choice distribution conditioned on seeing is
where each is the agent’s preference scoring function for item , for some , taking as input the agent’s memory vector . The memory vector updates each round as
where for is a possibly time-dependent update speed, and we receive loss , where each is convex and -Lipschitz. Note that the set of feasible choice distributions when considering all menu distributions depends on the memory vector . The regret benchmark considered by Agarwal and Brown (2022) is the intersection of all such sets, denoted the “everywhere instantaneously-realizable distribution” set , where is the “instantaneously realizable distribution” set for , given as the convex hull of the choice distributions resulting from each menu when is the memory vector. It is shown that the set is non-empty when is not too small, and algorithms which minimize regret with respect to any distribution in are given in Agarwal and Brown (2022) and Agarwal and Brown (2023) under varying assumptions regarding the scoring functions and update speed.
While the prior work considers a bandit version of the problem with unknown dynamics, here we consider a full-feedback deterministic variant of the problem for simplicity, which further allows us to circumvent barriers posed by uncertainty (Agarwal and Brown, 2022, 2023) and relax structural assumptions (e.g. on or ). We can cast this as an instance of our framework by taking and , where expresses updates to the memory vector. We assume , and we reparameterize to run our algorithm over . We optimize surrogate losses , and bound excess regret from .
Theorem 7 (Regret Minimization over ).
For , the dynamics for Adaptive Recommendations over are -locally controllable, and obtains regret .
In Agarwal and Brown (2023), a property for scoring functions is considered which enables regret minimization over a potentially much larger set of distributions than . A scoring function is said to be -scale-bounded for if, for all , we have that
The set considered is the -smoothed simplex , for , where it is shown that contains a ball around for . We take , which satisfies local controllability, and optimize over with .
Theorem 8 (Regret Minimization over ).
For -scale-bounded scoring functions , for any and , the dynamics for Adaptive Recommendations over are -locally controllable, and obtains regret .
4.3 Adaptive Pricing
Here we consider an Adaptive Pricing problem for real-valued goods, formulated as a dynamic extension of the setting of Roth et al. (2015) where purchase history and consumption affect demand. In each round we set per-unit price vectors , and an agent buys some bundle of goods , which results in us obtaining a reward , where our production cost function at each round is convex and -Lipschitz, and may be chosen adversarially.
Departing from Roth et al. (2015), we consider an agent who maintains goods reserves and consumes an adversarially chosen fraction of every good’s reserve at each round (for some ). The agent then chooses a bundle to maximize their utility , where is their updated reserve bundle. We make several regularity assumptions on the agent’s valuation function , all of which are satisfied by several classically studied utility families (which we discuss in Appendix 4.3). Notably, we assume that is strictly concave and increasing, and homogeneous; the range is bounded under rationality.
Our aim will be to set prices which allow us to compete with the best stable reserve policy, i.e. against any pricing policy where the agent maintains the same reserve bundle at each round for some regardless of . We take an appropriate convex set of such bundles as our state space, for which we show that local controllability holds. Observe that to induce a purchase of , it suffices to set prices , as we then have that . By homogeneity of , we also have that for some , and we show that optimization via the concave surrogate rewards
will closely track our true rewards . While neither our true nor surrogate rewards will be Lipschitz, we extend to obtain sublinear regret over Hölder continuous losses by appropriately calibrating our step size (which may be of independent interest).
Theorem 9 (Regret Minimization over Stable Reserve Policies).
For any , the dynamics for Adaptive Pricing are -locally controllable, and obtains regret with respect to the best stable reserve policy.
4.4 Steering Learners in Online Games
A recent line of work (Deng et al., 2019; Mansour et al., 2022; Brown et al., 2023) explores maximizing rewards in a repeated game against a no-regret learner, and Anagnostides et al. (2023) study no-regret dynamics in time-varying games. We consider these questions in unison, and aim to optimize reward against a no-regret learner for game matrices chosen adversarially and online.
Consider adversarial sequences of two-player bimatrix games , where ; we assume that the convex hull of the rows of each contains the unit ball. As Player A, we choose strategies each round to maximize our reward against Player B, who chooses their strategies according to a no-regret algorithm (in particular, online projected gradient descent). The game is only revealed after both players have chosen strategies for round . Our aim here is to illustrate the feasibility of steering the opponent’s trajectory, and so we consider games where Player A’s reward is predominantly a function only of Player B’s actions. We assume that for any , where each is a matrix with identical rows, and that per-round changes to are bounded, with for any . We measure the regret of an algorithm with respect to any profile , where
When Player B plays with step size , their strategy updates each round as
with , and yields regret for Player B with respect to any for the loss sequence . To cast this in our framework, we consider as our state space, where we select actions to induce desired updates to and optimize over the surrogate losses . While we do not see prior to choosing each , we view our update errors from instead selecting an action in terms of the dynamics resulting from as adversarial disturbances and run , as the dynamics are strongly locally controllable.
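As an illustration of the opponent's dynamics, here is a minimal sketch of a single online projected gradient step for Player B. It assumes that Player B's strategies lie in the probability simplex and that their gradient is the payoff vector induced by Player A's strategy; both are assumptions made for this example rather than details confirmed above. The helper names `project_simplex` and `ogd_step` are hypothetical.

```python
def project_simplex(z):
    """Euclidean projection of z onto the probability simplex (sort-based)."""
    u = sorted(z, reverse=True)
    css = 0.0
    tau = 0.0
    for k, uk in enumerate(u, start=1):
        css += uk
        t = (css - 1.0) / k
        if uk - t > 0:      # holds for a prefix of k; the last hit gives tau
            tau = t
    return [max(zi - tau, 0.0) for zi in z]


def ogd_step(y, B, x, theta):
    """One projected gradient ascent step for Player B with step size theta.

    Assumed gradient: entry j is Player B's expected payoff for pure
    action j when Player A plays the mixed strategy x.
    """
    payoff = [sum(x[i] * B[i][j] for i in range(len(x))) for j in range(len(y))]
    return project_simplex([yj + theta * gj for yj, gj in zip(y, payoff)])


B = [[1.0, 0.0],
     [0.0, 1.0]]            # hypothetical payoff matrix for Player B
y = [0.5, 0.5]              # Player B's current strategy (our state)
x = [1.0, 0.0]              # Player A's action this round
y_next = ogd_step(y, B, x, theta=0.1)
```

In this toy instance Player A's choice of `x` determines the direction in which `y` moves, which is the mechanism the reduction above exploits when it treats Player B's strategy as the state.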
Theorem 10 (Regret Minimization in Online Games).
For , repeated play against in online games can be cast as a -strongly locally controllable instance of online control with nonlinear dynamics, for which obtains regret .
References
- Abernethy et al. (2008) Jacob D. Abernethy, Elad Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithm for bandit linear optimization. In Annual Conference Computational Learning Theory, 2008. URL https://api.semanticscholar.org/CorpusID:8547150.
- Agarwal and Brown (2022) Arpit Agarwal and William Brown. Diversified recommendations for agents with adaptive preferences. In Advances in Neural Information Processing Systems, volume 35, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/a75db7d2ee1e4bee8fb819979b0a6cad-Paper-Conference.pdf.
- Agarwal and Brown (2023) Arpit Agarwal and William Brown. Online recommendations for agents with discounted adaptive preferences, 2023.
- Agarwal et al. (2019a) Naman Agarwal, Brian Bullins, Elad Hazan, Sham M. Kakade, and Karan Singh. Online control with adversarial disturbances, 2019a.
- Agarwal et al. (2019b) Naman Agarwal, Elad Hazan, and Karan Singh. Logarithmic regret for online control. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019b. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/78719f11fa2df9917de3110133506521-Paper.pdf.
- Agrawal et al. (2023) Shipra Agrawal, Yiding Feng, and Wei Tang. Dynamic pricing and learning with bayesian persuasion, 2023.
- Ahmadi et al. (2023) Saba Ahmadi, Avrim Blum, and Kunhe Yang. Fundamental bounds on online strategic classification, 2023.
- Alcantara-Jiménez and Clempner (2020) Guillermo Alcantara-Jiménez and Julio B. Clempner. Repeated stackelberg security games: Learning with incomplete state information. Reliability Engineering & System Safety, 195:106695, 2020. ISSN 0951-8320. doi: https://doi.org/10.1016/j.ress.2019.106695. URL https://www.sciencedirect.com/science/article/pii/S0951832019304478.
- Anagnostides et al. (2022) Ioannis Anagnostides, Constantinos Daskalakis, Gabriele Farina, Maxwell Fishelson, Noah Golowich, and Tuomas Sandholm. Near-optimal no-regret learning for correlated equilibria in multi-player general-sum games. In Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, pages 736–749, 2022.
- Anagnostides et al. (2023) Ioannis Anagnostides, Ioannis Panageas, Gabriele Farina, and Tuomas Sandholm. On the convergence of no-regret learning dynamics in time-varying games, 2023.
- Anava et al. (2014) Oren Anava, Elad Hazan, and Shie Mannor. Online convex optimization against adversaries with memory and application to statistical arbitrage, 2014.
- Aoki (1974) Masanao Aoki. Local Controllability of a Decentralized Economic System. The Review of Economic Studies, 41(1):51–63, 01 1974. ISSN 0034-6527. doi: 10.2307/2296398. URL https://doi.org/10.2307/2296398.
- Balcan et al. (2015) Maria-Florina Balcan, Avrim Blum, Nika Haghtalab, and Ariel D. Procaccia. Commitment without regrets: Online learning in stackelberg security games. In Proceedings of the Sixteenth ACM Conference on Economics and Computation, EC ’15, page 61–78, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 9781450334105. doi: 10.1145/2764468.2764478. URL https://doi.org/10.1145/2764468.2764478.
- Barbero-Liñán and Jakubczyk (2013) M. Barbero-Liñán and B. Jakubczyk. Second order conditions for optimality and local controllability of discrete-time systems, 2013.
- Blum et al. (2008) Avrim Blum, MohammadTaghi Hajiaghayi, Katrina Ligett, and Aaron Roth. Regret minimization and the price of total anarchy. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 373–382, 2008.
- Blum et al. (2014) Avrim Blum, Nika Haghtalab, and Ariel D Procaccia. Learning optimal commitment to overcome insecurity. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper_files/paper/2014/file/cc1aa436277138f61cda703991069eaf-Paper.pdf.
- Boscain et al. (2021) Ugo Boscain, Daniele Cannarsa, Valentina Franceschi, and Mario Sigalotti. Local controllability does imply global controllability, 2021.
- Braverman et al. (2017) Mark Braverman, Jieming Mao, Jon Schneider, and S. Matthew Weinberg. Selling to a no-regret buyer. CoRR, abs/1711.09176, 2017. URL http://arxiv.longhoe.net/abs/1711.09176.
- Brown et al. (2022) Gavin Brown, Shlomi Hod, and Iden Kalemaj. Performative prediction in a stateful world, 2022.
- Brown et al. (2023) William Brown, Jon Schneider, and Kiran Vodrahalli. Is learning in games good for the learners?, 2023.
- Cassel et al. (2022) Asaf Cassel, Alon Cohen, and Tomer Koren. Efficient online linear control with stochastic convex costs and unknown dynamics, 2022.
- Chen et al. (2022) Xinyi Chen, Edgar Minasyan, Jason D. Lee, and Elad Hazan. Provable regret bounds for deep online learning and control, 2022.
- Cohen et al. (2018) Alon Cohen, Avinatan Hassidim, Tomer Koren, Nevena Lazic, Yishay Mansour, and Kunal Talwar. Online linear quadratic control. CoRR, abs/1806.07104, 2018. URL http://arxiv.longhoe.net/abs/1806.07104.
- Collina et al. (2023) Natalie Collina, Eshwar Ram Arunachaleswaran, and Michael Kearns. Efficient stackelberg strategies for finitely repeated games. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, AAMAS ’23, page 643–651, Richland, SC, 2023. International Foundation for Autonomous Agents and Multiagent Systems. ISBN 9781450394321.
- Daskalakis and Syrgkanis (2015) Constantinos Daskalakis and Vasilis Syrgkanis. Learning in auctions: Regret is hard, envy is easy. CoRR, abs/1511.01411, 2015. URL http://arxiv.longhoe.net/abs/1511.01411.
- Dean and Morgenstern (2022) Sarah Dean and Jamie Morgenstern. Preference dynamics under personalized recommendations, 2022.
- Deng et al. (2019) Yuan Deng, Jon Schneider, and Balasubramanian Sivan. Strategizing against no-regret learners, 2019.
- Dong et al. (2018) Jinshuo Dong, Aaron Roth, Zachary Schutzman, Bo Waggoner, and Zhiwei Steven Wu. Strategic classification from revealed preferences. In Proceedings of the 2018 ACM Conference on Economics and Computation, EC ’18, page 55–70, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450358293. doi: 10.1145/3219166.3219193. URL https://doi.org/10.1145/3219166.3219193.
- Feng et al. (2019) Zhe Feng, Okke Schrijvers, and Eric Sodomka. Online learning for measuring incentive compatibility in ad auctions. CoRR, abs/1901.06808, 2019. URL http://arxiv.longhoe.net/abs/1901.06808.
- Flaxman et al. (2004) Abraham Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. CoRR, cs.LG/0408007, 2004. URL http://arxiv.longhoe.net/abs/cs.LG/0408007.
- Flaxman et al. (2016) Seth Flaxman, Sharad Goel, and Justin M. Rao. Filter Bubbles, Echo Chambers, and Online News Consumption. Public Opinion Quarterly, 80(S1):298–320, 03 2016. ISSN 0033-362X. doi: 10.1093/poq/nfw006. URL https://doi.org/10.1093/poq/nfw006.
- Flokas et al. (2019) Lampros Flokas, Emmanouil-Vasileios Vlatakis-Gkaragkounis, and Georgios Piliouras. Poincaré recurrence, cycles and spurious equilibria in gradient-descent-ascent for non-convex non-concave zero-sum games, 2019.
- Gaitonde et al. (2021) Jason Gaitonde, Jon M. Kleinberg, and Éva Tardos. Polarization in geometric opinion dynamics. In Péter Biró, Shuchi Chawla, and Federico Echenique, editors, EC ’21: The 22nd ACM Conference on Economics and Computation, Budapest, Hungary, July 18-23, 2021, pages 499–519. ACM, 2021.
- Golrezaei et al. (2020) Negin Golrezaei, Adel Javanmard, and Vahab S. Mirrokni. Dynamic incentive-aware learning: Robust pricing in contextual auctions. CoRR, abs/2002.11137, 2020. URL https://arxiv.longhoe.net/abs/2002.11137.
- Gradu et al. (2022) Paula Gradu, Elad Hazan, and Edgar Minasyan. Adaptive regret for control of time-varying dynamics, 2022.
- Hardt et al. (2015) Moritz Hardt, Nimrod Megiddo, Christos H. Papadimitriou, and Mary Wootters. Strategic classification. CoRR, abs/1506.06980, 2015. URL http://arxiv.longhoe.net/abs/1506.06980.
- Hartline et al. (2015a) Jason Hartline, Vasilis Syrgkanis, and Eva Tardos. No-regret learning in bayesian games. Advances in Neural Information Processing Systems, 28, 2015a.
- Hartline et al. (2015b) Jason D. Hartline, Vasilis Syrgkanis, and Éva Tardos. No-regret learning in repeated bayesian games. CoRR, abs/1507.00418, 2015b. URL http://arxiv.longhoe.net/abs/1507.00418.
- Hazan (2021) Elad Hazan. Introduction to online convex optimization, 2021.
- Hazan and Levy (2014) Elad Hazan and Kfir Levy. Bandit convex optimization: Towards tight bounds. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceedings.neurips.cc/paper_files/paper/2014/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.
- Hazan and Singh (2022) Elad Hazan and Karan Singh. Introduction to online nonstochastic control, 2022.
- Hazla et al. (2019) Jan Hazla, Yan Jin, Elchanan Mossel, and Govind Ramnarayan. A geometric model of opinion polarization. CoRR, abs/1910.05274, 2019.
- Jagadeesan et al. (2022a) Meena Jagadeesan, Nikhil Garg, and Jacob Steinhardt. Supply-side equilibria in recommender systems, 2022a.
- Jagadeesan et al. (2022b) Meena Jagadeesan, Tijana Zrnic, and Celestine Mendler-Dünner. Regret minimization with performative feedback. CoRR, abs/2202.00628, 2022b. URL https://arxiv.longhoe.net/abs/2202.00628.
- Jia et al. (2014) Liyan Jia, Lang Tong, and Qing Zhao. An online learning approach to dynamic pricing for demand response, 2014.
- Kakade et al. (2020) Sham Kakade, Akshay Krishnamurthy, Kendall Lowrey, Motoya Ohnishi, and Wen Sun. Information theoretic regret bounds for online nonlinear control, 2020.
- Kanoria and Nazerzadeh (2020) Yash Kanoria and Hamid Nazerzadeh. Dynamic reserve prices for repeated auctions: Learning from bids. CoRR, abs/2002.07331, 2020. URL https://arxiv.longhoe.net/abs/2002.07331.
- Kuhn and Wohltmann (1989) H. Kuhn and H.-W. Wohltmann. Controllability of economic systems under alternative expectations hypotheses—the discrete case. Computers & Mathematics with Applications, 18(6):617–628, 1989. ISSN 0898-1221. doi: https://doi.org/10.1016/0898-1221(89)90112-0. URL https://www.sciencedirect.com/science/article/pii/0898122189901120.
- Kumar et al. (2022) Raunak Kumar, Sarah Dean, and Robert D. Kleinberg. Online convex optimization with unbounded memory, 2022.
- Lale et al. (2021) Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Model learning predictive control in nonlinear dynamical systems. In 2021 60th IEEE Conference on Decision and Control (CDC), pages 757–762, 2021. doi: 10.1109/CDC45484.2021.9683670.
- Lauffer et al. (2022) Niklas Lauffer, Mahsa Ghasemi, Abolfazl Hashemi, Yagiz Savas, and Ufuk Topcu. No-regret learning in dynamic stackelberg games, 2022.
- Letchford et al. (2009) Joshua Letchford, Vincent Conitzer, and Kamesh Munagala. Learning and approximating the optimal strategy to commit to. In Algorithmic Game Theory, 2009. URL https://api.semanticscholar.org/CorpusID:1795572.
- Luo et al. (2022) Wenhao Luo, Wen Sun, and Ashish Kapoor. Sample-efficient safe learning for online nonlinear control with control barrier functions, 2022.
- Lykouris et al. (2021) Thodoris Lykouris, Max Simchowitz, Alex Slivkins, and Wen Sun. Corruption-robust exploration in episodic reinforcement learning. In Mikhail Belkin and Samory Kpotufe, editors, Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 3242–3245. PMLR, 15–19 Aug 2021. URL https://proceedings.mlr.press/v134/lykouris21a.html.
- Mansour et al. (2022) Yishay Mansour, Mehryar Mohri, Jon Schneider, and Balasubramanian Sivan. Strategizing against learners in bayesian games, 2022.
- Mehta et al. (2007) Aranyak Mehta, Amin Saberi, Umesh Vazirani, and Vijay Vazirani. Adwords and generalized online matching. J. ACM, 54(5):22–es, oct 2007. ISSN 0004-5411. doi: 10.1145/1284320.1284321. URL https://doi.org/10.1145/1284320.1284321.
- Mendler-Dünner et al. (2020) Celestine Mendler-Dünner, Juan Perdomo, Tijana Zrnic, and Moritz Hardt. Stochastic optimization for performative prediction. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 4929–4939. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/33e75ff09dd601bbe69f351039152189-Paper.pdf.
- Miller et al. (2021) John Miller, Juan C. Perdomo, and Tijana Zrnic. Outside the echo chamber: Optimizing the performative risk. CoRR, abs/2102.08570, 2021. URL https://arxiv.longhoe.net/abs/2102.08570.
- Minasyan et al. (2022) Edgar Minasyan, Paula Gradu, Max Simchowitz, and Elad Hazan. Online control of unknown time-varying dynamical systems, 2022.
- Morgenstern and Roughgarden (2016) Jamie Morgenstern and Tim Roughgarden. Learning simple auctions. CoRR, abs/1604.03171, 2016. URL http://arxiv.longhoe.net/abs/1604.03171.
- Mussi et al. (2022) Marco Mussi, Gianmarco Genalti, Alessandro Nuara, Francesco Trovò, Marcello Restelli, and Nicola Gatti. Dynamic pricing with volume discounts in online settings, 2022.
- Muthirayan and Khargonekar (2022) Deepan Muthirayan and Pramod P. Khargonekar. Online learning robust control of nonlinear dynamical systems, 2022.
- Nedelec et al. (2020) Thomas Nedelec, Clement Calauzenes, Vianney Perchet, and Noureddine El Karoui. Robust stackelberg buyers in repeated auctions. In Silvia Chiappa and Roberto Calandra, editors, Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 1342–1351. PMLR, 26–28 Aug 2020. URL https://proceedings.mlr.press/v108/nedelec20a.html.
- Neu and Olkhovskaya (2021) Gergely Neu and Julia Olkhovskaya. Online learning in mdps with linear function approximation and bandit feedback. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 10407–10417. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/5631e6ee59a4175cd06c305840562ff3-Paper.pdf.
- Peng et al. (2019) Binghui Peng, Weiran Shen, Pingzhong Tang, and Song Zuo. Learning optimal strategies to commit to. In AAAI Conference on Artificial Intelligence, 2019. URL https://api.semanticscholar.org/CorpusID:92982174.
- Perdomo et al. (2020) Juan C. Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. Performative prediction. CoRR, abs/2002.06673, 2020. URL https://arxiv.longhoe.net/abs/2002.06673.
- Piliouras and Yu (2022) Georgios Piliouras and Fang-Yi Yu. Multi-agent performative prediction: From global stability and optimality to chaos, 2022.
- Roth et al. (2015) Aaron Roth, Jonathan R. Ullman, and Zhiwei Steven Wu. Watch and learn: Optimizing from revealed preferences feedback. CoRR, abs/1504.01033, 2015. URL http://arxiv.longhoe.net/abs/1504.01033.
- Roughgarden (2015) Tim Roughgarden. Intrinsic robustness of the price of anarchy. J. ACM, 62(5), nov 2015. ISSN 0004-5411. doi: 10.1145/2806883. URL https://doi.org/10.1145/2806883.
- Shalev-Shwartz and Singer (2006) Shai Shalev-Shwartz and Yoram Singer. Online learning meets optimization in the dual. In Proceedings of the 19th Annual Conference on Learning Theory, COLT’06, page 423–437, Berlin, Heidelberg, 2006. Springer-Verlag. ISBN 3540352945. doi: 10.1007/11776420_32. URL https://doi.org/10.1007/11776420_32.
- Shen et al. (2023) Lingqing Shen, Nam Ho-Nguyen, and Fatma Kılınç-Karzan. An online convex optimization-based framework for convex bilevel optimization. Mathematical Programming, 198(2):1519–1582, 04 2023. ISSN 1436-4646. doi: 10.1007/s10107-022-01894-5. URL https://doi.org/10.1007/s10107-022-01894-5.
- Simchowitz et al. (2020) Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control. CoRR, abs/2001.09254, 2020. URL https://arxiv.longhoe.net/abs/2001.09254.
- Yue et al. (2012) Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012. ISSN 0022-0000. doi: https://doi.org/10.1016/j.jcss.2011.12.028. URL https://www.sciencedirect.com/science/article/pii/S0022000012000281. JCSS Special Issue: Cloud Computing 2011.
- Zhang et al. (2023) Brian Hu Zhang, Gabriele Farina, Ioannis Anagnostides, Federico Cacciamani, Stephen Marcus McAleer, Andreas Alexander Haupt, Andrea Celli, Nicola Gatti, Vincent Conitzer, and Tuomas Sandholm. Steering no-regret learners to optimal equilibria, 2023.
- Zhang et al. (2021) Xuezhou Zhang, Yiding Chen, Jerry Zhu, and Wen Sun. Corruption-robust offline reinforcement learning. CoRR, abs/2106.06630, 2021. URL https://arxiv.longhoe.net/abs/2106.06630.
- Zinkevich (2003) Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. 2, 04 2003.
- Zrnic et al. (2021a) Tijana Zrnic, Eric Mazumdar, S. Shankar Sastry, and Michael I. Jordan. Who leads and who follows in strategic classification? CoRR, abs/2106.12529, 2021a. URL https://arxiv.longhoe.net/abs/2106.12529.
- Zrnic et al. (2021b) Tijana Zrnic, Eric Mazumdar, Shankar Sastry, and Michael Jordan. Who leads and who follows in strategic classification? In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 15257–15269. Curran Associates, Inc., 2021b. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/812214fb8e7066bfa6e32c626c2c688b-Paper.pdf.
Appendix A Omitted Proofs for Section 2
Proof of Proposition 1. Without loss of generality, assume and that is even. Let for each . Consider any round where ; then, for all actions , we have that , as ; as such, we incur loss in round . Now suppose ; then, we must have incurred loss at least in round . As losses are non-negative, our total loss is at least , as loss is incurred at least every other round; given that the best fixed state incurs total loss , we have that for any algorithm . ∎
Proof of Proposition 2. We begin by observing that for instances , the class of state-targeting policies contains a policy which obtains the reward of the best fixed state up to , for sufficiently large . Consider the set . Note that the reward of any is matched by some up to for any fixed inner radius , outer radius , and Lipschitz constant . For any such , note that under the policy when starting at , the distance between and in each round is updated to at most:
It is straightforward to see that is convex, and so our state will never leave on its path to ; as such, we reach within rounds, after which point our reward exactly tracks that of . For some , this yields a regret for of at most to the best fixed state in .
Next, consider an instance where and are both the unit ball in . With , let the dynamics be given by
Observe that this satisfies -local controllability for any , as a ball of radius is always feasible around . Let each loss , for some . Immediately we can see that any matrix policy has regret , as the action will be played in each round. ∎
Appendix B Follow the Regularized Leader
Here we state the algorithm and several of its key properties; see e.g. Hazan (2021) for proofs of Propositions 4 and 5.
Proposition 4.
For a -strongly convex regularizer where for all , and for convex -Lipschitz losses , the regret of is bounded by
Proposition 5.
Any pair of points and chosen by satisfies .
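For intuition, the following is a minimal sketch of the algorithm in one special case, not the general form analyzed above: linearized losses, the regularizer taken to be half the squared Euclidean norm, and a ball as the feasible set. In that case each iterate has a closed form as the projection of the scaled negative gradient sum. The names `project_ball`, `ftrl_quadratic`, and `dist` are hypothetical.

```python
import math


def project_ball(z, radius=1.0):
    """Euclidean projection onto the ball of the given radius."""
    norm = math.sqrt(sum(zi * zi for zi in z))
    if norm <= radius:
        return list(z)
    return [zi * radius / norm for zi in z]


def ftrl_quadratic(grads, eta, radius=1.0):
    """FTRL iterates for linear losses with regularizer 0.5 * ||y||^2.

    The minimizer of eta * <sum of gradients, y> + 0.5 * ||y||^2 over the
    ball is the projection of -eta * (sum of gradients) onto the ball.
    """
    d = len(grads[0])
    g_sum = [0.0] * d
    iterates = [[0.0] * d]
    for g in grads:
        g_sum = [s + gi for s, gi in zip(g_sum, g)]
        iterates.append(project_ball([-eta * s for s in g_sum], radius))
    return iterates


def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Unit-norm gradients (Lipschitz constant 1): a constant push, then a rotation.
grads = [[1.0, 0.0]] * 20 + [[math.cos(0.7 * t), math.sin(0.7 * t)] for t in range(20)]
eta = 0.1
ys = ftrl_quadratic(grads, eta)
max_step = max(dist(ys[t], ys[t + 1]) for t in range(len(ys) - 1))
```

Because projection onto a convex set is nonexpansive, consecutive iterates in this special case move by at most `eta` times the gradient norm, which is the kind of per-round stability that Proposition 5 provides and that the later analysis uses to keep each target state within reach.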
Appendix C Analysis for
Proof of Theorem 1. First we show that any point chosen by will be feasible under local controllability, by induction. It is straightforward to see that is convex and ; further, any is bounded away from . By the definition of , we have that for some . Recall that , and note that . Let be any point in . By convexity of , we then have that any point lies in , and so for any we have that . Each lies in , and so we have that ; as such, any point in is feasible. Given that , by Proposition 5 we have that in each round for the chosen point. Each action will be selected by solving for
via a call to . Each call is guaranteed to have a solution which achieves an objective of 0 where for some by local controllability, yielding an exact state update to as we assume Oracle can solve arbitrary non-convex minimization problems. To bound the regret, first note that for any , we have
by Proposition 4, as for any . Then, observe that for any , we have that
Combining the previous claims, we have that
upon setting and , which yields the theorem. ∎
Appendix D Examples and Analysis for Action-Linear Dynamics
As a simple yet general example of dynamics which are both action-linear and locally controllable, consider update rules in which a step is taken by applying a nonsingular matrix transformation to the action, where the matrix can be parameterized by the state, with projection back into if necessary.
Example 1.
Let both and be given by the unit ball in . For any fixed , let the updates from be given by
where each is a square matrix with minimum absolute eigenvalue for some . Then, the instance is action-linear and satisfies -local controllability.
Proof for Example 1. It is straightforward to see that is action-linear. To show -local controllability, let be any point in . It suffices to show that there is some such that . As is non-singular, we can solve for , where and , and so we have that . ∎
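The argument above can be sketched numerically in two dimensions. The sketch assumes the update takes the form of the current state plus a step-size multiple of the state-dependent matrix applied to the action, with the projection omitted because the target already lies in the state space; this form is an assumption for illustration. The matrix family in `state_matrix` is hypothetical, chosen to be symmetric with all eigenvalues at least 1 so that the solved action has norm at most that of the scaled offset.

```python
import math


def state_matrix(y):
    """Hypothetical state-dependent matrix I + y y^T (eigenvalues >= 1)."""
    return [[1.0 + y[0] * y[0], y[0] * y[1]],
            [y[0] * y[1], 1.0 + y[1] * y[1]]]


def solve_2x2(m, b):
    """Solve m z = b for a nonsingular 2x2 matrix via Cramer's rule."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [(m[1][1] * b[0] - m[0][1] * b[1]) / det,
            (m[0][0] * b[1] - m[1][0] * b[0]) / det]


def dynamics(x, y, theta):
    """Assumed action-linear update: y + theta * M_y x (no projection)."""
    m = state_matrix(y)
    return [y[i] + theta * (m[i][0] * x[0] + m[i][1] * x[1]) for i in range(2)]


def action_for_target(y, y_target, theta):
    """Action that moves the state from y exactly to y_target."""
    rhs = [(y_target[i] - y[i]) / theta for i in range(2)]
    return solve_2x2(state_matrix(y), rhs)


theta = 0.5
y = [0.3, 0.4]
y_target = [0.57, 0.04]     # offset (0.27, -0.36) has norm 0.45 <= theta
x = action_for_target(y, y_target, theta)
x_norm = math.sqrt(x[0] ** 2 + x[1] ** 2)
y_reached = dynamics(x, y, theta)
```

Here the solved action stays inside the unit ball because the target is within distance `theta` of the current state and the inverse matrix does not expand vectors.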
We can also extend this to include state-parameterized generalizations of any linear system governed by nonsingular matrices over a bounded-radius state space (for a sufficiently large action space).
Example 2.
Let be given by the radius- ball in , and let . For any fixed , let the updates from be given by
where both and are square matrices. For any , let , and suppose we take large enough such that for some . Then, the instance is action-linear and satisfies -local controllability.
Proof for Example 2. Here, again it is evident that is action-linear, and so it suffices to show that there is some such that
for any in . As in the proof for Example 1, we have that , and for large enough there is some such that for any where . Thus, any point is feasible by some , which contains the ball . ∎
Appendix E Algorithms for Adversarial Disturbances
E.1 and Proofs for Theorem 2
We show that it is possible to simulate over the undisturbed states under the assumption that the dynamics are -locally controllable for some while retaining sufficient range in the feasible region around to correct for the disturbance from the previous round. Here, the oracle call for computing in each round is updated to consider the true state .
Theorem 2 follows directly from Theorems 11, 12, and 13. Intuitively, when the per-round disturbance magnitude is at most , one can calibrate for the case of -locally controllable dynamics and maintain sufficient “slack” to correct for the previous round’s disturbance in every round. When disturbances exceed , an adversary can continually push the state towards the boundary of , which may require vanishing disturbance magnitude as rounds progress due to the limited range promised by local controllability near the boundary.
Theorem 11.
For a -locally controllable instance with convex losses and adversarial disturbances where and , the regret of with respect to the reward of any state is bounded by
with queries made to an oracle for non-convex optimization.
Proof.
We show by induction that each call to yields a feasible action satisfying . This is immediate for , and suppose this holds up to some round , where we have that . Given that selects actions under -local controllability, we can bound
Further, the magnitude of the disturbance is bounded by
yielding that
() As such, we have that
and so by -local controllability some feasible action exists, as lies in . The regret bound for holds over the states , and so we can bound the total regret of with respect to any as:
(Thm. 1) ∎
We show that the dependence on is tight up to the constant. Note that we can obtain regret in the following instance via .
Theorem 12 (Regret Lower Bound for Bounded Disturbances).
Suppose for any and an adversary can choose with , where for any . There is a -locally controllable instance with -Lipschitz convex losses such that any algorithm obtains regret .
Proof.
Consider any norm over . Let be the unit ball , and let each . Consider any action space and dynamics where -local controllability exactly characterizes the range of , i.e. for any and , there is some such that if and only if .
First, note that for any . In each round , suppose an algorithm plays an action at state which yields a target undisturbed update . The adversary can then choose any satisfying ; suppose each is given by
if is non-zero, and an arbitrary vector with if . This satisfies the disturbance norm bound, and further yields , where for non-zero we have
and thus for any ,
yielding a loss at a disturbance cost of . Assuming the adversary continues this strategy in each round until any disturbance budget is exhausted, this yields a regret for any algorithm of at least
as obtains total loss 0. ∎
The disturbance upper bound is indeed necessary for -locally controllable dynamics. We show a sharp threshold effect at , wherein an adversary who is allowed to exceed this limit by any amount can force an algorithm to incur linear regret even with only a constant budget. Note that for any and , there is some such that .
Theorem 13.
Suppose an adversary can choose any state disturbances with , for any and any . Then, there is a -locally controllable instance with convex losses such that any algorithm obtains regret even if .
Proof.
Consider any instance where -local controllability exactly characterizes the range of , i.e. for any and , there is some such that if and only if .
Let for each round. Beginning at any round , suppose the adversary observes an action which yields an update . Let , and suppose the adversary chooses the disturbance:
This forces closer to the boundary at each round, regardless of the choice of :
() where holds by our assumption on . Assuming the adversary applies a disturbance selected as above in each round , we have that
where the magnitude of each disturbance is bounded by
where we take the initial state distance to the boundary to be a constant bounded away from zero. This yields that the sum of disturbance magnitudes is at most:
Now suppose that the loss at each round is given by . Then, our regret with respect to is at least:
∎
Together, the previous three theorems yield Theorem 2.
E.2 and Proofs for Theorem 3
We can remove the bound on the maximum disturbance for strongly locally controllable instances, as the feasible update sets do not vanish at the boundary of . Recall that an instance satisfies strong -local controllability for if, for any and , there is some such that . We assume without loss of generality that , where is the radius of .
Intuitively, our algorithm tracks the target state which would be chosen by in the absence of all disturbances (by recording the counterfactual loss rather than the one truly experienced), and always seeks to minimize distance to that state.
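The tracking step can be sketched as follows, assuming for simplicity that every point within the controllability radius of the current state is reachable in one round, so that the distance-minimizing update is a straight-line step toward the target, clipped to that radius. The function name `track_target` and the toy numbers are hypothetical; in the algorithm itself the corresponding action is obtained from the optimization oracle rather than in closed form.

```python
import math


def track_target(y, target, rho):
    """Closest point to `target` within distance rho of the current state y.

    If y and target both lie in a convex state space, the returned point
    lies on the segment between them and therefore stays in that space.
    """
    gap = [ti - yi for ti, yi in zip(target, y)]
    d = math.sqrt(sum(g * g for g in gap))
    if d <= rho:
        return list(target)          # target is reachable this round
    return [yi + (rho / d) * g for yi, g in zip(y, gap)]


# A large disturbance has left the state far from the target; recover.
y = [0.0, 0.0]
target = [1.0, 0.0]
path = [y]
for _ in range(4):
    y = track_target(y, target, rho=0.3)
    path.append(y)
```

In this toy run the state closes the gap by the full radius in each round and lands exactly on the target in the fourth, which mirrors the accounting in the proof: rounds in which the accumulated error is not yet cleared each consume a fixed amount of the adversary's disturbance budget.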
Theorem 14.
For a strongly -locally controllable instance with convex losses and adversarial disturbances where , the regret of is bounded by
with respect to the reward of any state, with queries made to an oracle for non-convex optimization.
Proof
We begin by bounding the total state error across rounds. First, note that for any fixed , and any desired , we have that for sufficiently large , as ; we assume this holds for any given choice of , and so we have that by Proposition 5. For a total disturbance budget , we separately consider disturbances depending on whether or not the accumulated disturbance error up to is driven to 0 in the next round. Define and as:
and
with and . First, observe that at each round corresponding to , given that we have that , as . As such, we have that
Next, consider any . As our instance is strongly -locally controllable, we must have that , as otherwise there would be some feasible action which would be selected that would yield . Since , it then must be the case that , and so we can bound the number of disturbances in as:
Assuming a maximal distance for each round corresponding to some , this yields
We can assume is small enough to yield , and so we have
The regret bound for holds over the states , and so we can bound the total regret of with respect to any as:
(Prop. 4) ∎
Theorem 15 (Regret Lower Bound for Unbounded Disturbances).
Suppose an adversary can choose any state disturbances with . For any , there is a strongly -locally controllable instance with convex losses such that any algorithm obtains regret .
Proof
Let for any and let for each . Suppose strong -local controllability exactly characterizes the range of , i.e. for any there is some such that if and only if . Consider an adversary who chooses disturbances in each round such that until their disturbance budget is exhausted. This requires a disturbance of magnitude at most for , as we assume , and at most in subsequent rounds, and thus the adversary can force any algorithm to remain at for rounds.
As such, any algorithm must incur loss of at least across these rounds, and further must incur average loss over the subsequent rounds (if is not yet reached), for an additional loss of , as they can only decrease per-round loss by given the restriction on the range of . As the optimal state obtains loss 0, the total regret is at least:
∎
Together, the previous two theorems yield Theorem 3. Note that for both algorithms it remains computationally efficient to optimize over action-linear dynamics, as the constraint that can be encoded as a convex constraint over .
Appendix F Unknown Dynamics: Analysis for
Proof of Theorem 4. Assume the following hold for at each :
– , for a function ;
– has a largest absolute eigenvalue bounded by an absolute constant, smallest absolute eigenvalue bounded away from 0, and is -Lipschitz in the matrix norm;
– has a norm bounded by an absolute constant, and is -Lipschitz;
– for any such that .
In the neighborhood of any , observe that playing yields an update to , where the error term has magnitude bounded linearly in terms of the neighborhood size as well as polynomial in the relevant constants. We assume sufficiently small values of , , and (whose relative bounds may trade off with each other, and in general will be inverse-polynomial in problem parameters other than ) to bound the error of this process in accordance with the requirements of Theorem 2, as well as to ensure that estimation error for is uniformly bounded for all . Given , this yields estimation error terms in each round, for small enough to obtain the desired regret bound. ∎
Appendix G Bandit Feedback: Analysis for
We first state the algorithm and its bounds for regret and per-round step size.
Proposition 6 (Flaxman et al. (2004)).
For -Lipschitz convex losses and a domain with diameter which contains a ball of radius around the origin, obtains expected regret
with each point contained in . Further, each pair of consecutive points , chosen by satisfies .
The algorithm is essentially equivalent to , replacing with and recalibrating parameters.
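As a rough illustration of the bandit subroutine behind Proposition 6, the following is a minimal sketch of the textbook one-point gradient estimator of Flaxman et al. (2004) over a Euclidean ball. It is not the exact variant or parameter calibration used in this appendix: the function names, the ball-shaped domain, and the parameters `step_size` and `explore` are assumptions made for illustration only.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a direction uniformly from the unit sphere via normalized Gaussians."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in g))
        if norm > 0:
            return [x / norm for x in g]


def project_to_ball(point, radius):
    """Euclidean projection onto the origin-centered ball of the given radius."""
    norm = math.sqrt(sum(x * x for x in point))
    if norm <= radius:
        return list(point)
    return [x * radius / norm for x in point]


def bandit_gradient_descent(loss_fns, dim, radius, step_size, explore, seed=0):
    """Play perturbed points and update using only the observed loss values."""
    rng = random.Random(seed)
    center = [0.0] * dim  # kept inside the shrunk ball of radius (radius - explore)
    played = []
    for loss in loss_fns:
        u = random_unit_vector(dim, rng)
        x = [c + explore * ui for c, ui in zip(center, u)]
        played.append(x)
        value = loss(x)  # bandit feedback: a single scalar per round
        # One-point estimate of the gradient of the smoothed loss.
        grad_est = [(dim / explore) * value * ui for ui in u]
        center = project_to_ball(
            [c - step_size * g for c, g in zip(center, grad_est)],
            radius - explore,
        )
    return played
```

Because the centers are projected onto the shrunk ball, every played point stays inside the original domain, mirroring the containment statement in Proposition 6; the bounded per-round movement is what the feasibility argument for locally controllable dynamics relies on.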
Proof of Theorem 5. Following the proof of Theorem 1, to apply the bound of to our setting (along with excess regret at most per round from contracting to ), the key step is to show that each point selected by is feasible under weakly locally controllable dynamics over , i.e. . Let , and let . Assume for simplicity that and . When instantiating over with parameters and , by Proposition 6 we then have
and so each selected point is feasible. This allows us to bound our regret by
() () () ∎
Appendix H Background and Proofs for Section 4.1: Performative Prediction
H.1 Background
Introduced by Perdomo et al. (2020), the Performative Prediction problem captures settings in which the data distribution for which a classifier is deployed may shift as a function of the classifier itself, notably including strategic classification Hardt et al. (2015) as well as problems related to reinforcement learning and causal inference. While a number of extensions of strategic classification to online settings have been considered Dong et al. (2018); Zrnic et al. (2021b); Ahmadi et al. (2023), the bulk of the literature on performative prediction considers settings with a fixed loss function and distribution “update map” Perdomo et al. (2020); Miller et al. (2021); Jagadeesan et al. (2022b); Mendler-Dünner et al. (2020); Piliouras and Yu (2022); Brown et al. (2022), where the update map may sometimes depend on the current distribution (as in the Stateful Performative Prediction setting of Brown et al. (2022)). For the location-scale family of update maps introduced by Miller et al. (2021) (and additionally explored by Jagadeesan et al. (2022b) from a regret minimization perspective), which yields a convex “performative risk” objective function, a formulation of Online Performative Prediction is given by Kumar et al. (2022) as an application of online convex optimization with unbounded memory, in which the classification loss function may change over time and the distribution updates may occur gradually.
Here, we generalize the problem formulation of Kumar et al. (2022) to also accommodate notions of statefulness similar to that in Brown et al. (2022). In particular, the instances we consider will resemble location-scale maps when restricting attention only to the performatively stable classifiers for each distribution, yet the update effect of a non-stable classifier may be distribution-dependent and nonlinear, provided that the update map satisfies local controllability (viewing classifiers as actions and distributions as states) and mild regularity properties (e.g. invertibility and Lipschitz conditions).
H.2 Model
In the setting of Online Performative Prediction we consider, as formulated by Kumar et al. (2022), in each round we deploy some classifier , and observe samples from some distribution , which may change dynamically as a function of the history of interactions. Here, we take as our space of classifiers, e.g. representing weight vectors for regression, which we assume is bounded and convex. The initial data distribution is given by some distribution over . In each round, upon deploying a classifier , the distribution is updated according to
for , where is the distribution update map taking as input our classifier and some representation of the state , where we assume is convex, contains , is bounded with radius , and that . We make the following assumptions on .
Assumption 1.
We assume the distribution update map operates as follows:
• , with ,
• is a random variable in with mean and covariance ,
• satisfies -local controllability and has an inverse action mapping where
defined over feasible pairs, which is -Lipschitz in (when feasibility of holds), and
• There is a linear invertible function such that if , where is -Lipschitz.
Further, is known and can be sampled freely.
The inverse action mapping assumption simply enforces that classifiers need not change drastically to have the same update effect under small changes to the state. The final assumption imposes a linear structure over performatively stable classifiers (i.e. classifiers for which the resulting distribution will remain fixed under , as formulated by Perdomo et al. (2020)), but we note that the distribution may update in an arbitrarily nonlinear fashion (subject to the other conditions) when is not a performatively stable classifier for the distribution induced by the previous state . The ability to accommodate a state component is reminiscent of prior work involving notions of statefulness in performative prediction such as Brown et al. (2022). Our setting generalizes that of Kumar et al. (2022), in which the map is taken to be a fixed matrix. For any nonsingular matrix there is immediately a linear map , and local controllability can be defined in terms of the largest and smallest absolute eigenvalues of (as a special case of our Example 1 with a fixed matrix). We view the nonsingularity assumption (and invertibility in the more general case) as fairly mild, as it amounts to assuming that the distribution map can depend on all parameters of classifier without any necessary (linear) dependency structure imposed, and that no two classifiers are equivalent only to the population but not the optimizer (as otherwise one could simply reduce dimensionality of ). However, even in the case where is singular, we note that this issue is resolvable by augmenting the state representation to incorporate the choice of free classifier parameters which affect loss but not distribution updates (e.g. by adding a vector to which is orthogonal to the range of and linear in ). We assume invertibility here for simplicity, and we take to simply be given by the range of over . At each round , some scoring function is chosen adversarially, and our loss is then given by
We assume each is convex and -Lipschitz in both and , and that . We measure our regret with respect to the best performatively stable classifier, i.e. the loss of any classifier as if were held constant indefinitely as the distribution updates. We define our regret as follows:
Here, the role of captures the convergence of the distribution to a stable point, resulting from taking the limit of the distribution update rule as grows large.
As in many of the applications we consider, here our loss is determined both by our action (the classifier) and the state (in terms of the distribution). Our approach for casting Online Performative Prediction as an instance of online nonlinear control in our framework will be to define appropriate surrogate convex losses which depend only on the state, over which we run . Here, these will correspond to losses only over the updated distribution component , which we show closely track our true incurred loss.
H.3 Analysis
For each round , define the surrogate loss as:
Lemma 1.
Each is convex and -Lipschitz in .
Proof
Consider any individual sample . We can then view as a vector-valued function which is -Lipschitz. The function is a -Lipschitz and convex function of this linear function of , and thus is convex and -Lipschitz in . The function is an average of such functions, taken over the expectation of , and thus is convex and -Lipschitz in as well. ∎
Observe that . We will run for these losses over the -locally controllable instance , where we can track the current state at each step as a function of our past actions given knowledge of , and can compute gradients of to arbitrary desired precision by sampling from . This will yield the regret bound from Theorem 1 with respect to the surrogate losses, and the key challenge will be to analyze our error between the true and surrogate losses.
Lemma 2.
For any round we have that
Proof
For any , the loss of over the distribution can be expressed as
which is convex and -Lipschitz in both parameters when taking the expectation over . For round in isolation, using the inverse action mapping bound and the bound on from Proposition 5 we have that
and further for previous states that
We can decompose the distribution into updates from past rounds as
which then yields a loss discrepancy of at most
between the true and surrogate loss for round . ∎
We can now bound the cumulative regret of for the problem.
Theorem 16.
For any , when Assumption 1 holds for the distribution update rule, Online Performative Prediction can be cast as a -locally controllable instance of online control with nonlinear dynamics, for which obtains regret
with respect to the best performatively stable classifier.
Proof
Combining the previous results with Theorem 1, we have that for any our regret is at most
upon setting . ∎
Theorem 6 follows directly from Theorem 16. For Online Performative Prediction, in the full generality of the setting considered, the per-round optimization problem may not be convex, in which case we make use of the non-convex optimization oracle access for . However, in each of the following applications we show that the action selection step can indeed be implemented efficiently without imposing additional restrictions on the dynamics.
Appendix I Background and Proofs for Section 4.2: Adaptive Recommendations
I.1 Background
Motivated by problems involving preference dynamics and feedback loops in recommendation systems (see e.g. Flaxman et al. (2016)), a number of recent works Hazla et al. (2019); Gaitonde et al. (2021); Dean and Morgenstern (2022); Jagadeesan et al. (2022a); Agarwal and Brown (2022, 2023) have explored models of repeated recommendation in which recommendations are given to an agent whose preferences or opinions evolve over time. Several of these models Hazla et al. (2019); Dean and Morgenstern (2022); Jagadeesan et al. (2022a) consider population-level effects for settings where a single recommendation is given each round and consumers (or producers) update their behavior according to linear dynamics. Nonlinear preference dynamics with menus of recommendations for a single agent are considered in Agarwal and Brown (2022, 2023), where the recommender aims to minimize regret for adversarial losses over the agent’s choices. The Adaptive Recommendations formulation of Agarwal and Brown (2022) somewhat resembles the “Dueling Bandits” setting of Yue et al. (2012), where actions are chosen in each round, yet where preferences can now evolve dynamically as a function of the history rather than remaining fixed. Whereas Agarwal and Brown (2022, 2023) study a bandit formulation of the problem with unknown preference dynamics, here we consider a full-feedback model with known dynamics, allowing for relaxed structural assumptions (on the agent’s “memory horizon” and “preference scoring functions”) at the cost of stronger informational assumptions, while maintaining the overall dynamics of the problem.
I.2 Model
Here, we are tasked with repeatedly recommending menus of content to an agent. Out of a universe of elements (e.g. video channels, clothing items), we show a subset of size (denoted ) to the agent in each round, for total rounds. The agent chooses one item from the menu, according to a distribution in terms of their preferences, which are a function of their selection history. Conditioned on being shown a menu , the agent’s choice distribution has positive mass only on the items . The agent’s representation of their selection history is given by their memory vector , and choices are determined by their preference scoring functions for each , which map the agent’s memory vector to relative preference scores for each item. The menu we show to the agent may be chosen from some distribution , and for each the agent’s menu-conditional distribution is proportional to the scores for items in , given as
for each , with for . The joint item choice distribution, considering both random selection of a menu according to , and the agent’s choice from , is given by
which we may denote simply by the vector , or as a function . In contrast to prior work, here we consider a deterministic variant of the problem as an illustration of the flexibility of our framework for online nonlinear control. In particular, we assume that the agent’s memory vector updates according to its expectation over as
where is the per-round update speed, and we assume that the agent’s scoring functions are known. We receive convex and -Lipschitz losses in each round in terms of the agent’s choices, over which we aim to minimize regret with respect to some distribution set .
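To make the choice model above concrete, the following is a small sketch of how the joint item choice distribution arises from a menu distribution and the agent's scores. It assumes the scores have already been evaluated at the current memory vector; the function names and the dictionary representation of the menu distribution are hypothetical, and the memory update itself is not modeled.

```python
def menu_conditional_distribution(menu, scores):
    """Choice probabilities over a menu, proportional to the items' scores."""
    total = sum(scores[i] for i in menu)
    return {i: scores[i] / total for i in menu}


def joint_choice_distribution(menu_dist, scores, num_items):
    """Mix the menu-conditional distributions by the probability of each menu.

    menu_dist maps a menu (tuple of item indices) to its probability of being shown.
    """
    joint = [0.0] * num_items
    for menu, menu_prob in menu_dist.items():
        conditional = menu_conditional_distribution(menu, scores)
        for item, choice_prob in conditional.items():
            joint[item] += menu_prob * choice_prob
    return joint


# Three items, menus of size two, each shown with probability one half.
example_scores = [1.0, 2.0, 1.0]
example_menus = {(0, 1): 0.5, (1, 2): 0.5}
example_joint = joint_choice_distribution(example_menus, example_scores, 3)
```

Items outside a shown menu receive no mass from that menu, so the designer shapes the joint distribution only through the weights placed on menus.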
The prior work (Agarwal and Brown, 2022, 2023) has considered two particular subsets of as regret benchmarks. We show that both can be cast as locally controllable instances of online control, and further, we make use of local controllability to give a general characterization of convex sets over which sublinear regret is attainable. We recall some key definitions and results from (Agarwal and Brown, 2022, 2023).
Definition 4 (Instantaneously Realizable Distributions).
The set of instantaneously realizable distributions at a memory vector is given by
Each such set corresponds to the feasible distributions , given the agent’s scoring functions and memory . It is shown by Agarwal and Brown (2023) that each such set can be directly characterized in terms of the ratios between target frequencies and scores.
Proposition 7 (Menu Times for Agarwal and Brown (2023)).
Given a memory vector and target distribution , let the menu time for item be given by
where . Then, if and only if for each .
We recall the prior benchmark sets considered, and the corresponding assumptions which yield feasibility of regret minimization. We state informal analogues of the prior results as translated to our setting, which we then show formally below.
Definition 5 (Everywhere Instantaneously Realizable Distributions).
The set of everywhere instantaneously realizable distributions is given by
Proposition 8 (Corollary of Agarwal and Brown (2022)).
If , then is non-empty, and there is a regret algorithm with respect to any distribution .
Distributions are always feasible regardless of by an appropriate choice of , but may be quite small in relation to . Under stronger assumptions for each , a potentially much larger set becomes feasible as a regret benchmark.
Definition 6 (-Smoothed Simplex).
The -smoothed simplex for is given by
Definition 7 (Scale-Bounded Functions).
A scoring function is said to be -scale-bounded for and if, for all , we have that
For such functions, each score cannot be too far from item ’s weight in memory, and it is shown that contains a ball around for each , for an appropriate choice of .
Proposition 9 (Corollary of Agarwal and Brown (2023)).
If each is -scale-bounded, then there is a regret algorithm with respect to any distribution , for .
We extend these results to general convex benchmark sets , where we can characterize the feasibility of regret minimization via local controllability using the menu times . When -local controllability holds over a set , we can minimize regret via using surrogate losses , which closely track our true losses .
I.3 Analysis
We make use of the menu time quantities for a memory vector and target distribution to translate our notion of local controllability to the Adaptive Recommendations setting. Let be any convex subset of , let , where the dynamics are given by
Note that is action-linear in , and thus we can solve for efficiently (in terms of ); further, there is a construction given in Agarwal and Brown (2023) for removing exponential dependence on when computing menu distributions. We consider as an -dimensional subset of , where we define the ball of radius around a point as:
Theorem 17.
An instance of Adaptive Recommendations satisfies -local controllability if, for any and , we have that
for every .
This follows immediately from Proposition 8 and the definition of local controllability, which can analogously extend to strong local controllability. We can use this formulation to unify the feasibility analysis for each of the previously considered sets.
Lemma 3.
For and , the set contains a ball of radius around , and any instance satisfies -local controllability.
Proof
For any , , and we have and , yielding that
and over all items (with ) we have
Observe that the bounds for each term are equalized at when , and so whenever . We can specify in terms of to maintain equality, and thus inclusion of . Taking in terms of as
gives us that
for , and so we maintain that . Inverting, we have
as the radius of a ball around contained in . To see that is -locally controllable, consider any and in where , and let . By playing an action distribution which induces , the memory vector is then updated to . This is feasible for any , as each corresponds to some . ∎
We remark that for the set, if losses are given over rather than , one can define dynamics which directly consider the state to simply be the induced distribution in each round, which satisfies strong local controllability with any feasible at each round; in general, we consider dynamics to view the memory vector as the state, as the feasible updates are a function of . Such is the case for the -smoothed simplex, for which we can state an analogous local controllability result.
Lemma 4.
If each is -scale-bounded, then any instance over the -smoothed simplex for satisfies -local controllability.
Proof
The following lemma from Agarwal and Brown (2023) shows that a ball of distributions around any memory vector is feasible under .
Lemma 5 ( for Scale-Bounded Preferences Agarwal and Brown (2023)).
Let each be -scale-bounded with , and let be a vector in the -smoothed simplex, for . Then, for any vector .
Let for any in . Any then is contained in , and so playing such that yields an update to , which is feasible for any , and so -local controllability holds. ∎
For any such set which yields locally controllable dynamics for the instance , we can minimize regret over via , where we optimize with respect to the surrogate losses . Note that for our regret benchmark of the best per-round instantaneously realizable distribution in , any fixed vector which is instantaneously targeted across all rounds yields an item distribution in each round, and so . We assume that is bounded inside (which typically will hold for ).
Theorem 18.
For any -locally controllable instance of Adaptive Recommendations with update speed , running over the surrogate losses yields regret
with respect to the true losses over .
Appendix J Background and Proofs for Section 4.3: Adaptive Pricing
J.1 Background
While there is a large literature on designing online mechanisms for pricing discrete goods via auctions (Mehta et al., 2007; Kanoria and Nazerzadeh, 2020; Golrezaei et al., 2020; Morgenstern and Roughgarden, 2016; Feng et al., 2019; Braverman et al., 2017), there is comparatively little work related to online pricing problems for real-valued goods. Most work for such problems to date requires strong assumptions on valuation functions, often either assuming linearity (Jia et al., 2014) or additivity (Agrawal et al., 2023), or requiring approximability via discretization (Mussi et al., 2022). Here, we introduce a novel formulation for an Adaptive Pricing problem which builds on the myopic-demand fixed-cost setting of Roth et al. (2015), which we extend to accommodate adversarial consumption rates for the agent (which affect demand, as a function of the agent’s reserves) as well as adversarial production costs. As in Roth et al. (2015), our setting can accommodate general convex (increasing) production cost functions and concave (increasing) valuations for the agent, provided that valuations additionally are homogeneous; to our knowledge, this encompasses a much wider class of valuations and costs than considered by any prior work on no-regret dynamic pricing for real-valued goods.
J.2 Model
In each round , an agent (the consumer) begins with goods reserves (with ), then consumes an adversarially chosen fraction of each good simultaneously (e.g. corresponding to their rate of manufacturing downstream items, using the goods as components), updating their reserves to . We (the producer) show the consumer some vector of per-unit prices for each good, and the consumer purchases some bundle of goods . The consumer’s valuation function for reserves of goods is given by , and their selection of is given by
We later discuss behavior of when the is undefined; it will suffice for us to only consider price vectors for which it is defined. This updates the consumer’s reserves to . Upon seeing the consumer’s purchased bundle , we receive their payment minus our production cost , where is adversarially chosen. Our utility is then given by
We make the following assumptions on production costs and the consumer’s valuation .
Assumption 2 (Production Costs).
We assume that for each , the following hold over :
• is non-negative, convex, and -Lipschitz,
• for some , and
• for some .
Further, each is revealed prior to setting prices .
Assumption 3 (Consumer Valuations).
We assume that the following hold over some set :
• is non-negative, continuous, and differentiable,
• is strictly concave and increasing,
• is -Hölder continuous for some and , i.e.
and
• is homogeneous of degree for some , i.e. for any .
Further, is known to the producer.
Given the concavity assumption, we note that it is without loss of generality to assume that for the homogeneity parameter. There are several well-studied valuation families which satisfy these properties for an appropriate set ; see Roth et al. (2015) for proofs of each example.
Example 3 (Constant Elasticity of Substitution (CES)).
Valuations of the form
with each and , are Hölder continuous, differentiable, strictly concave, non-decreasing, and homogeneous over a convex set in .
Example 4 (Cobb-Douglas).
Valuations of the form
with and are Hölder continuous, differentiable, strictly concave, non-decreasing, and homogeneous over a convex set in .
We initially assume that Assumption 3 holds over all of , but will restrict our attention to the set of bundles where for each , and we note that our results can be extended to arbitrary downward-closed convex sets (where for any and ). In Section J.3 we show that Assumptions 2 and 3 yield several important properties which enable optimization via our framework. We show a unique mapping between price vectors and bundle purchases (for any fixed reserves and consumption rate), that restricting attention to is justified under rationality constraints, and that is convex.
Further, there is some price vector which yields a reserve update to any in a neighborhood around , yielding local controllability. Crucially, we show that there are concave surrogate rewards which will closely track our true rewards , leveraging the following property of homogeneous functions.
Proposition 10 (Euler’s Theorem for Homogeneous Functions).
A continuous and differentiable function is homogeneous of degree if and only if
We run directly over these concave surrogate rewards (by inverting the sign of each), where each can be computed efficiently in terms of and , and we show that the surrogate reward distance from our true rewards is bounded. While our rewards will not be Lipschitz over in general, we show that appropriately calibrating our step size yields sublinear regret with dependence on the Hölder continuity parameters. We measure our regret with respect to the set of stable reserve policies, i.e. pricing policies where remains constant.
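Euler's identity from Proposition 10 states that the inner product of a point with the gradient of a homogeneous function equals the degree times the function value. The sketch below checks this numerically for a Cobb-Douglas valuation, whose degree of homogeneity is the sum of its exponents. The specific exponents, evaluation point, and function names are illustrative assumptions, and the gradient is approximated by central finite differences rather than taken from the paper.

```python
import math


def cobb_douglas(x, exponents):
    """Product of powers; homogeneous of degree sum(exponents)."""
    return math.prod(xi ** a for xi, a in zip(x, exponents))


def numerical_gradient(f, x, h=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def euler_gap(x, exponents):
    """Difference between <x, grad f(x)> and degree * f(x); should be near zero."""
    f = lambda z: cobb_douglas(z, exponents)
    grad = numerical_gradient(f, x)
    degree = sum(exponents)
    return sum(g * xi for g, xi in zip(grad, x)) - degree * f(x)
```

With exponents summing to less than one, as the concavity assumption requires, the same function can be used to confirm the scaling property directly by comparing the value at a scaled point against the scaled value.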
Definition 8 (Regret for Stable Reserve Policies).
Let be the set of stable reserve policies, where for any and satisfying , playing prices computed by a policy yields
It is straightforward to see that any maintains the invariant that , provided that some such is always feasible.
J.3 Analysis
We show a series of results establishing the key conditions allowing us to formulate this problem as a locally controllable instance of online nonlinear control. We first show that any positive bundle is the unique optimal purchase for some positive price vector.
Lemma 6.
For any reserves , consumption rate , and vector where elementwise, the bundle is the unique solution to
for prices .
Proof
Recall that the consumer’s bundle choice is given by
Note that is strictly concave in for any , as the gradients
are preserved at each point , and subtracting the linear function does not affect strict concavity. We also have that for prices , as is strictly concave and non-decreasing. This yields that has a unique global maximum at , as . ∎
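The argument of Lemma 6 can be illustrated numerically: pricing at the gradient of the valuation, evaluated at the desired post-purchase reserves, makes the target bundle a stationary point of a strictly concave objective, and hence its unique maximizer. The sketch below assumes a quasilinear consumer objective (valuation of updated reserves minus payment) and a Cobb-Douglas valuation with exponents summing to less than one; all names and numbers are hypothetical choices for illustration.

```python
import math
import random


def valuation(reserves, exponents):
    """Cobb-Douglas valuation; strictly concave when the exponents sum to < 1."""
    return math.prod(r ** a for r, a in zip(reserves, exponents))


def valuation_gradient(reserves, exponents):
    """Analytic gradient of the Cobb-Douglas valuation."""
    value = valuation(reserves, exponents)
    return [a * value / r for r, a in zip(reserves, exponents)]


def consumer_utility(bundle, reserves, prices, exponents):
    """Assumed quasilinear objective: value of updated reserves minus payment."""
    updated = [r + b for r, b in zip(reserves, bundle)]
    payment = sum(p * b for p, b in zip(prices, bundle))
    return valuation(updated, exponents) - payment


def target_beats_perturbations(reserves, target_bundle, exponents, trials=1000, seed=0):
    """Price at the gradient of the target reserves, then compare random bundles."""
    target_reserves = [r + b for r, b in zip(reserves, target_bundle)]
    prices = valuation_gradient(target_reserves, exponents)
    best = consumer_utility(target_bundle, reserves, prices, exponents)
    rng = random.Random(seed)
    for _ in range(trials):
        other = [max(0.0, b + rng.uniform(-0.4, 0.4)) for b in target_bundle]
        if consumer_utility(other, reserves, prices, exponents) > best + 1e-12:
            return False
    return True
```

Here `reserves` stands for the post-consumption reserves, so the producer only needs the target update and the known valuation to compute prices.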
As such, the for is unique whenever for some . We let denote this price vector which induces a purchase of . For any other price vector , the maximizing bundle either approaches a point on the boundary of , or grows unboundedly. We restrict our attention to bundles contained in , and show that the issue of unboundedness is resolved by rationality considerations for the producer. We characterize the per-round rewards of stable reserve policies as concave functions of , and show that the optimal such policy corresponds to some state , where is convex and bounded.
Lemma 7.
The round- reward of a stable reserve policy corresponding to any is given by a strictly concave function
Proof
We first note that we can maintain in every round by Lemma 6, as and . As such, a bundle is purchased in each round at prices , and our reward is given by
where the final step follows from Proposition 10 for homogeneous functions. The function is strictly concave, which is preserved upon subtracting the convex function . ∎
Lemma 8.
The set is convex.
Proof
Consider any two points , and let for any . Recall that belongs to if and only if . By concavity of , we have that
and so , yielding convexity of . ∎
Lemma 9.
For any where , there is some such that for any and .
Proof
Consider some such that , for , and let . By homogeneity of , we have that , and so as . For any round with costs and consumption rate we then have that:
(homogeneity of ) ( lower bound and convexity of ) () () ∎
Thus the optimal for any cost and consumption sequence corresponds to some . We can also bound the radius of .
Lemma 10.
Let . Then, for every we have that
Proof
Let , where we have . Consider the vector for any . By homogeneity of , we have that
For any we have that
where and thus . This holds for all vectors with norm , as any such vector will have at most by homogeneity, which yields the result. ∎
The previous result also implies that for any and . We assume that , which is without loss of generality as we may otherwise take to be smaller artificially; we assume is small enough to ensure that contains a ball of radius 1 around some , and we let . We consider the dynamics to be given by
We let denote our action space of price vectors; while dynamics here are not action-linear, we can still compute our desired action efficiently, as we assume we have knowledge of . While the dynamics depend on , our choice of action depends only on the target update to the consumer’s reserves, by Lemma 6. Further, upon observing , we can solve for as
for purposes of representing our surrogate losses, which are given by
We now show that the dynamics satisfy local controllability.
Lemma 11 (Local Controllability).
The instance satisfies -local controllability for each round .
Proof
We show that -local controllability holds over all of , which implies -local controllability over as each distance while the feasible update region remains the same. By Lemma 6, any update where elementwise is feasible. Each over is simply the minimum element of , which we denote here by . Each element of is decreased by at least , and so any in the ball of radius , and thus the ball of radius , is feasible. ∎
We are now ready to analyse the regret of for the problem. The remaining key issues to resolve will be the errors between our true and surrogate rewards and , as well as the lack of Lipschitz continuity for our rewards. We will make use of more general formulations of the guarantees of (see e.g. Hazan (2021)).
Proposition 11.
For a -strongly convex regularizer where for all , and for convex losses , the regret of is bounded by
where and .
We show that this implies a regret bound for -Hölder continuous convex losses, recovering the -Lipschitz bounds when .
Theorem 19.
For -Hölder continuous convex losses, with obtains regret bounded by
and chooses points which satisfy in each round.
Proof
For -Hölder continuous convex losses , we have that
by convexity of , where , and so
by Hölder continuity. Combining with the lower bound on from Proposition 11 gives us that
and thus
yielding a regret bound of
with per-round distance at most . ∎
We note that the concave surrogate rewards are a sum of a -Hölder continuous function and a -Hölder continuous (i.e. Lipschitz) function; we assume that each function is -Hölder continuous with , which is sufficient for for large enough as we will have and thus . We use a similar analysis to bound the error between true and surrogate rewards, yielding our regret bound for .
Theorem 20.
The regret of with respect to the stable reserve policies is bounded by
Proof
We reparameterize to treat the bundle where as the origin, and assume the choice of regularizer has as its minimum. By Theorem 1, for any step size and such that , running for the -locally controllable instance over the surrogate rewards , with inradius 1 and radius , obtains
for any , upon setting and , where
Note that the surrogate rewards exactly track the true rewards when a stable reserve policy is played, and so our regret with respect to the best stable reserve policy is at most
() (concavity of ) (Hölder, ) upon updating to as
which yields the theorem. ∎
Appendix K Background and Proofs for Section 4.4: Steering Learners
K.1 Background
While much of the literature related to no-regret learning in general-sum games considers either rates of convergence to (coarse) correlated equilibria (Blum et al., 2008; Anagnostides et al., 2022) or welfare guarantees for such equilibria (Roughgarden, 2015; Hartline et al., 2015a), a recent line of work (Braverman et al., 2017; Deng et al., 2019; Mansour et al., 2022) has considered the question of optimizing one’s reward when playing against a no-regret learner. A target benchmark which has emerged for this problem is the value of the Stackelberg equilibrium of a game (the optimal mixed strategy to “commit to”, assuming an opponent best responds). Deng et al. (2019) showed that this value is attainable against any no-regret algorithm and optimal in many cases (e.g. for no-swap learners), both up to terms; it may also yield higher reward for the optimizer than (coarse) correlated equilibria.
We show a class of instances for which the problem of optimizing reward against a learner playing according to gradient descent can be formulated as a locally controllable instance of online nonlinear control with adversarial perturbations and surrogate state-based losses. The simplest non-trivial instances we consider are those where the optimizer’s reward is a function only of the learner’s actions (i.e. all rows of their reward matrix are identical), and the optimization problem amounts to steering the learner to a desired strategy via one’s choice of actions. Additionally, we allow the game matrices to change over time, which to our knowledge has not been substantially considered in prior work. We require that the learner’s matrices do not change too quickly (which we model as adversarial disturbances to dynamics), while the optimizer’s matrices can change arbitrarily provided that they remain close to some row-identical matrix (which we model as imprecision in our surrogate loss function).
K.2 Model
Here we are tasked with playing a sequence of bimatrix games against a no-regret learning opponent, where the game matrices may change adversarially in each round. We assume the following properties hold for the adversarial sequence of games.
Assumption 4.
For a sequence of bimatrix games, with :
• Each entry of and lies in
• the convex hull of the rows of each contains the unit ball in ,
• for any , where each row of is identical, and
• for any .
Each game is revealed after Players A and B commit to their respective strategies and . Observe that due to the first property, for any , there is some such that . By the second property, we have that for any .
We recall the Online Gradient Descent algorithm with convex losses from Zinkevich (2003).
Proposition 12 (Zinkevich (2003)).
For differentiable convex losses , with for each , and for all , the regret of OGD is bounded by
where is the radius of . If and for all , we have that
We assume that Player B plays according to OPGD in our setup, with and . At each round , we (Player A) choose some mixed strategy , and Player B plays some mixed strategy . Utilities for each player are given by the game as
Note that the loss gradient each round for Player B (for negative utilities) is given by
and so their mixed strategy is updated at each round according to
Our utility is given by , as does not affect rewards from . We benchmark the regret of an algorithm against the optimal profile :
Note that the per-round average utility for the maximizing is at least as high as that obtained by the Stackelberg equilibrium of the average game , as for this objective one can choose both players’ strategies without restriction. We remark that finding the Stackelberg equilibrium for any fixed game in our setting, where has identical rows, is straightforward: it suffices to optimize over , as any fixed action is a best response to some by our assumption on the rows of , and as our rewards are only a function of Player B’s strategy . However, we are not aware of any prior work which enables competing with the average-game Stackelberg value against a learning opponent when games arrive online.
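The learner's update in this model can be illustrated with a minimal sketch, assuming Player B's strategy space is the probability simplex and that the update is a gradient step on the utility followed by Euclidean projection. The function names `project_to_simplex` and `learner_step`, the argument layout, and the numbers in the usage lines are hypothetical choices for illustration and are not taken from the paper.

```python
def project_to_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        candidate = (cumulative - 1.0) / k
        # The condition holds exactly for the first rho sorted entries,
        # so the last accepted candidate is the correct shift.
        if u_k - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def learner_step(y, x, B, eta):
    """One projected gradient step for Player B (assumed form).

    y   : Player B's current mixed strategy (length n)
    x   : Player A's mixed strategy this round (length m)
    B   : Player B's m-by-n payoff matrix, as a list of rows
    eta : step size
    """
    n = len(y)
    # Gradient of x^T B y with respect to y is the x-weighted mix of rows of B.
    grad = [sum(x[i] * B[i][j] for i in range(len(x))) for j in range(n)]
    # Ascent on utility (equivalently, descent on the negative utility).
    return project_to_simplex([y[j] + eta * grad[j] for j in range(n)])


# Hypothetical usage: a 2-by-2 game where Player A commits to the first row.
y_next = learner_step([0.5, 0.5], [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.1)
```

Before projection the step is linear in Player A's strategy, which is the structure the analysis relies on; the projection only matters when the step would leave the simplex.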
K.3 Analysis
We first show that the problem can be formulated via known, strongly -locally controllable dynamics with adversarial disturbances. As changes slowly between rounds, we can run with disturbances representing the error resulting from assuming that does not change from .
Lemma 12.
Given the knowledge available prior to selecting , updates for can be expressed via known action-linear dynamics which satisfy strong -local controllability, and with adversarial disturbances satisfying .
Proof
First, note that we can compute Player B’s current strategy , as it is a function only of games and strategies up to round , all of which are observable. Given the update rule for , we can formulate the dynamics update as
where represents the error from assuming . By standard properties of Euclidean projection and the change bound on , we have that . Further, the update is action-linear (up to projection, prior to ).
To see that satisfies strong -local controllability, we recall that the convex hull of the rows of contains the unit ball, and so for any in there is some such that . ∎
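The convex-hull argument can be made concrete with a small sketch. It assumes a hypothetical learner matrix whose rows are the scaled signed basis vectors, plus and minus sqrt(n) times each coordinate vector, so that the convex hull of the rows contains the unit ball. This matrix is chosen only because it admits a closed-form mixed strategy; it is not claimed to satisfy the entry-range condition of Assumption 4, and the names `steering_strategy` and `induced_direction` are illustrative.

```python
import math


def steering_strategy(v):
    """Mixed strategy x over the rows of the hypothetical matrix with x^T B = v.

    Rows are ordered as +c*e_0, -c*e_0, +c*e_1, -c*e_1, ... with c = sqrt(n).
    Requires the Euclidean norm of v to be at most 1.
    """
    n = len(v)
    c = math.sqrt(n)
    if math.sqrt(sum(t * t for t in v)) > 1.0 + 1e-12:
        raise ValueError("target direction must lie in the unit ball")
    x = []
    for t in v:
        x.append(max(t, 0.0) / c)   # weight on row +c*e_j
        x.append(max(-t, 0.0) / c)  # weight on row -c*e_j
    # Total weight is ||v||_1 / c <= ||v||_2 <= 1, so the slack is nonnegative.
    slack = max(1.0 - sum(x), 0.0)
    # Spreading the slack evenly adds no net direction: opposite rows cancel.
    return [w + slack / (2 * n) for w in x]


def induced_direction(x):
    """Direction x^T B produced by weights x on the hypothetical rows."""
    n = len(x) // 2
    c = math.sqrt(n)
    return [c * (x[2 * j] - x[2 * j + 1]) for j in range(n)]


# Hypothetical usage: steer along a unit-norm direction in two dimensions.
target = [0.6, -0.8]
weights = steering_strategy(target)
achieved = induced_direction(weights)
```

For a general matrix whose rows contain the unit ball in their convex hull, the same weights would be found by solving a small linear feasibility problem rather than by a closed form.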
At each round , our loss is given by . There are two barriers to running our algorithm. First, the update for is determined by and not , yet we do not see prior to selecting , which would be required to take the appropriate step following . Second, the loss depends on in addition to . To address both issues, we instead run with surrogate losses , with action rounds relabeled to account for the fact that influences the step for (which does not change the behavior of the algorithm). We set .
Theorem 21.
Repeated play against an opponent using with step size in a sequence of games satisfying Assumption 4 can be cast as an instance of online control with strongly -locally controllable dynamics, for which the regret of is at most
with efficient per-round computation.
Proof
We first analyze regret with respect to the surrogate losses . To run for , it suffices to calibrate the step size for the internal instance such that . Given that rewards are bounded in , we have that each is -Lipschitz for the norm, and thus -Lipschitz for the norm, so we can take . Further, the radius of is , and so we have that
Then, for a strongly -locally controllable instance with total perturbation bound , we obtain the regret bound
(Thm. 14) for any
By Lemma 12, we can efficiently run over the surrogate losses and bound regret with respect to any as:
Further, we can bound the error from the surrogate losses as
(, ) (Prop. 5) (Assumption 4, Cauchy-Schwarz) and likewise, for any we can bound
Combining the previous results, we have that for any , the regret of with respect to the true losses is bounded by
for any , which yields the theorem. ∎