License: CC Zero
arXiv:2302.01680v3 [cs.LG] 09 Jan 2024

Two-Stage Constrained Actor-Critic for Short Video Recommendation

Qingpeng Cai (Kuaishou Technology, Beijing, China); Zhenghai Xue (Kuaishou Technology, Beijing, China); Chi Zhang (Kuaishou Technology, Beijing, China); Wanqi Xue (Kuaishou Technology, Beijing, China); Shuchang Liu (Kuaishou Technology, Beijing, China); Ruohan Zhan (Hong Kong University of Science and Technology, Hong Kong, China); Xueliang Wang (Kuaishou Technology, Beijing, China); Tianyou Zuo (Kuaishou Technology, Beijing, China); Wentao Xie (Kuaishou Technology, Beijing, China); Dong Zheng (Kuaishou Technology, Beijing, China); Peng Jiang (Kuaishou Technology, Beijing, China); and Kun Gai (Unaffiliated, Beijing, China)
(2023)
Abstract.

The wide popularity of short videos on social media poses new opportunities and challenges for optimizing recommender systems on video-sharing platforms. Users sequentially interact with the system and provide complex and multi-faceted responses, including WatchTime and various types of interactions with multiple videos. On the one hand, the platforms aim to optimize the users’ cumulative WatchTime (main goal) in the long term, which can be effectively optimized by Reinforcement Learning. On the other hand, the platforms also need to satisfy the constraint of accommodating the responses of multiple user interactions (auxiliary goals) such as Like, Follow, Share, etc. In this paper, we formulate the problem of short video recommendation as a Constrained Markov Decision Process (CMDP). We find that traditional constrained reinforcement learning algorithms fail to work well in this setting. We propose a novel two-stage constrained actor-critic method: In stage one, we learn individual policies to optimize each auxiliary signal. In stage two, we learn a policy to (i) optimize the main signal and (ii) stay close to the policies learned in the first stage, which effectively guarantees the performance of this main policy on the auxiliaries. Through extensive offline evaluations, we demonstrate the effectiveness of our method over alternatives in both optimizing the main goal and balancing the others. We further show the advantage of our method in live experiments of short video recommendations, where it significantly outperforms other baselines in terms of both WatchTime and interactions. Our approach has been fully launched in the production system to optimize user experiences on the platform.

Keywords: constrained reinforcement learning, recommender systems, short video recommendation
Published in: Proceedings of the ACM Web Conference 2023 (WWW ’23), May 1–5, 2023, Austin, TX, USA. DOI: 10.1145/3543507.3583259. ISBN: 978-1-4503-9416-1/23/04. CCS concepts: Information systems → Recommender systems; Computing methodologies → Reinforcement learning.

1. Introduction

The surging popularity of short videos has been changing the status quo of social media. Short video consumption has brought huge business opportunities for organizations. As a result, there has been increasing interest in optimizing recommendation strategies (Wang et al., 2022a; Zhan et al., 2022; Lin et al., 2022; Gong et al., 2022) for short video platforms. Users interact with the platform by scrolling up and down and watching multiple videos, as shown in Figure 1(a). Users provide multi-dimensional responses to each video. As shown in the left part of Figure 1(b), potential responses from a user after consuming a video include WatchTime (the time spent watching the video) and several types of interactions: Follow (follow the author of the video), Like (like the video), Comment (comment on the video), Collect (collect the video), Share (share the video with friends), etc.

Figure 1. An example of a popular short video platform (TikTok, Kuaishou, etc.).

On the one hand, the main goal of the platform is to optimize the cumulative WatchTime of multiple videos, as WatchTime reflects user attention and is highly related to daily active users (DAU). Recently, a growing literature has focused on applying reinforcement learning (RL) to recommender systems, due to its ability to improve cumulative reward (Nemati et al., 2016; Zhao et al., 2017, 2018; Chen et al., 2018; Zou et al., 2019; Liu and Yang, 2019; Chen et al., 2019b; Xian et al., 2019; Ma et al., 2020; Afsar et al., 2021; Ge et al., 2021; Gao et al., 2022a; Wang et al., 2022b; Xin et al., 2022). In particular, RL approaches can effectively maximize cumulative WatchTime, increasing the time users spend across multiple videos. On the other hand, other responses such as Like/Follow/Share also reflect user satisfaction levels. Thus the platform needs to satisfy constraints on user interactions. Consequently, established recommender systems that exclusively optimize a single objective (such as gross merchandise volume for e-commerce platforms (Pi et al., 2020)) are no longer sufficient: applied systems should take all aspects of responses into consideration to optimize user experiences.

In this paper, we model the problem of short video recommendation as a Constrained Markov Decision Process: users serve as the environment, and the recommendation algorithm is the agent; at each time step the agent plays an action (recommends a video to the user), and the environment sends multiple rewards (responses) to the agent. The objective of the agent is to maximize the cumulative WatchTime (main goal) subject to the constraints on other interaction responses (auxiliary goals). Our aim differs from approaches that seek a Pareto optimal solution (Sener and Koltun, 2018; Lin et al., 2019; Chen et al., 2021), which may not prioritize the main goal of the system.

This constrained policy optimization problem is much more challenging than its unconstrained counterpart. A natural idea would be to apply standard constrained reinforcement learning algorithms that maximize the Lagrangian with pre-specified multipliers (Tessler et al., 2018). However, such a method cannot be applied to our setting for the following two reasons:

First, it is not sufficient to use a single policy evaluation model to estimate the Lagrangian dual objective, because the user provides different types of responses. Combining the responses in this way is inadequate, particularly for responses with their own discount factors, because the formulation of the temporal difference error in value-based models only allows for a single discount value. Even in scenarios where one discount factor suffices, it can still be difficult for a single value model to evaluate the policy accurately, especially when different responses are observed at different frequencies, as is typical for short video recommendations. The WatchTime response is dense and observed from each video view, while interaction signals such as Like/Follow/Share are much sparser and may not be provided within dozens of views. The signal from the sparse responses will be weakened by the dense responses when they are naively summed together. To address this multi-response evaluation difficulty, we separately evaluate each response via its own value model, which allows for response-specific discount factors and mitigates the interference of one response with the evaluation of another. Experiments in Section 4.1 validate the effectiveness of this method.

Second, whereas (Tessler et al., 2018) consider only one constraint, recommender systems, and short video systems in particular, involve multiple constraints. We find that algorithms that maximize the Lagrangian are harder to optimize in this case because of the larger search space of multi-dimensional Lagrangian multipliers. Grid search over the Lagrangian multipliers is time-consuming, as training reinforcement learning algorithms takes a long time. We therefore propose to first learn policies that optimize each auxiliary response and then “softly” regularize the policy of the main response to stay close to them, instead of searching for optimal values of the Lagrangian multipliers. We theoretically derive the closed form of the optimal solution. We demonstrate empirically that our approach can better maximize the main response and balance other responses in both offline and live experiments.

Together, we summarize our contributions as below:

  • Constrained Optimization in Short Video Recommendations: We formalize the problem of constrained policy learning in short video recommendations, where different responses may be observed at various frequencies, and the agent maximizes one with the constraint of balancing others.

  • Two-Stage Constrained Actor-Critic Algorithm: We propose a novel two-stage constrained actor-critic algorithm that effectively tackles the challenge: (1) Multi-Critic Policy Estimation: To better evaluate the policy on multiple responses that may differ in discount factors and observation frequencies, we propose to learn a separate value model to evaluate each response. (2) Two-Stage Actor Learning: We propose a two-stage actor learning method that first learns a policy to optimize each auxiliary response and then softly regularizes the policy of the main response to stay close to these policies, which we demonstrate to be more effective than the alternatives for constrained optimization with multiple constraints.

  • Significant Gains in Offline and Live Experiments: We demonstrate the effectiveness of our method in both offline and live experiments.

  • Deployment in a Real-World Short Video Application: We have fully launched our method on a popular short video platform.

2. Related Work

Reinforcement Learning for Recommendation

There is a growing literature on applying RL to recommender systems, owing to its ability to optimize long-term user satisfaction (Afsar et al., 2021). Value-based approaches estimate the user's satisfaction with each item in the available candidate set and then select the item with the largest predicted satisfaction (Nemati et al., 2016; Zhao et al., 2018; Liu and Yang, 2019; Chen et al., 2018). Policy-based methods directly learn the policy (which item to recommend) and optimize it in the direction of increasing user satisfaction (Chen et al., 2019a; ** (Chen et al., 2021); we view our work as complementary to the third line. In the face of multi-faceted user responses, real-world systems often have preferences over the different types of user responses, for which we propose the constrained optimization problem, in contrast to pursuing Pareto optimality as proposed in (Chen et al., 2021) and (Ge et al., 2022).

Constrained Reinforcement Learning

Our work is also closely related to the literature on constrained reinforcement learning, where the sequential decision-making problem is formulated as a constrained Markov Decision Process (Sutton and Barto, 2018), and the policy learning procedure is expected to respect the constraints (Liu et al., 2021; García and Fernández, 2015; Chow et al., 2017; Tessler et al., 2018; Dalal et al., 2018). As an example, (Tessler et al., 2018) propose to update the policy and the Lagrangian multiplier alternately and prove the convergence of their algorithm to a fixed point. This approach, however, models only one constraint and cannot scale well to problems with multiple constraints. In contrast, for each auxiliary response, we learn a policy to maximize it specifically, and then we “softly” regularize the main policy to be close to these policies. We show empirically that this is a more effective way of performing constrained policy learning when dealing with multiple responses in recommender systems. Unlike (Nair et al., 2020), who study offline RL and regularize the learned policy to stay near a single behavior policy, we softly restrict the policy to stay close to the other policies that maximize the auxiliary responses.

Multi-objective Optimization

We also discuss a relevant line of work on multi-objective optimization. To trade off different objectives, methods in this field can be broadly categorized into two classes: Pareto optimization and joint optimization with pre-specified weights. The goal of Pareto optimization is to find a solution such that no other solution can concurrently improve all objectives, known as Pareto optimality (Nguyen et al., 2020; Sener and Koltun, 2018; Chen et al., 2021; Ge et al., 2022). However, a Pareto optimal solution may not prioritize the objective that is most valued in applications. The other class combines different objectives into a single one via pre-specified weights (White et al., 1980; Mossalam et al., 2016). However, it is difficult to choose weights that accurately reflect preferences in real applications (Tessler et al., 2018).

3. Constrained Markov Decision Process for Short Video Recommendation

Figure 2. The MDP of short video recommendation.

We start by formulating the problem of short video recommendation, which is shown in Figure 2. When a user $u$ opens the app, a new session starts. A session consists of multiple requests. At each request $t$ the recommender system (agent) takes an action $a_t$ that recommends a video to the user based on the user's current state. The user then provides multi-faceted responses (such as WatchTime, Like, Share, and Follow) to the shown video, which are received by the agent as a vector-valued reward signal. After the user leaves the app, the session ends. The goal of the recommender system is to optimize the cumulative reward of the main response (e.g., WatchTime), under the constraint of not sacrificing the others much.

We model the above procedure as a Constrained Markov Decision Process (CMDP) (Sutton and Barto, 2018) $(S, A, P, R, C, \rho_0, \Gamma)$, where $S$ is the state space of the user's current representation $s_t$, $A$ is the action space (each action $a_t$ corresponds to a recommended video for one request), $P: S \times A \rightarrow \Delta(S)$ captures the state transition, $R: S \times A \rightarrow \mathbb{R}^m$ defines the vector-valued reward function that yields $m$ different rewards $r(s_t, a_t) = \big(r_1(s_t, a_t), \dots, r_m(s_t, a_t)\big)$, $\rho_0$ is the initial state distribution, and $\Gamma = (\gamma_1, \dots, \gamma_m) \in (0, 1)^m$ denotes the vector of discount factors, one for the reward of each response. $C$ specifies the constraints on the auxiliary responses, namely lower bounds on the total numbers of signals of the other objectives.

Define the vector-valued discounted cumulative reward $R_t$ as $R_t = \sum_{t'=t}^{T} \Gamma^{t'-t} \cdot r(s_{t'}, a_{t'})$, where $T$ is the session length (i.e., the number of requests), $\Gamma^{b} = \big(\gamma_1^{b}, \dots, \gamma_m^{b}\big)$, and $\mathbf{x} \cdot \mathbf{y}$ denotes the pointwise product. Let $V^{\pi}(s) = \big(V_1^{\pi}(s), \dots, V_m^{\pi}(s)\big)$ be the state value $E_{\pi}[R_t \mid s_t = s]$ under actions sampled in accordance with policy $\pi$, and let $Q^{\pi}(s, a) = \big(Q_1^{\pi}(s, a), \dots, Q_m^{\pi}(s, a)\big)$ be its state-action value $E_{\pi}[R_t \mid s_t = s, a_t = a]$. Denote by $\rho_{\pi}$ the state distribution induced by policy $\pi$. Without loss of generality, we set the first response as our main response. The goal is to learn a recommendation policy $\pi(\cdot \mid s)$ to solve the following optimization problem:

$$
\begin{aligned}
\max_{\pi}\quad & E_{\rho_{\pi}}\big[V^{\pi}_{1}(s)\big] \\
\text{s.t.}\quad & E_{\rho_{\pi}}\big[V^{\pi}_{i}(s)\big] \geq C_i, \quad i = 2, \dots, m
\end{aligned}
\tag{1}
$$

where $C_i$ is the constraint on the auxiliary response $i$.
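As an illustration of the definitions above, the following minimal Python sketch computes the vector-valued discounted return with one discount factor per response and checks the auxiliary constraints of Eq. (1) on sampled sessions. The function names and array layouts are hypothetical and not taken from the paper; in particular, using per-session returns from the first request as a Monte Carlo proxy for the expectation in Eq. (1) is a simplifying assumption.

```python
import numpy as np

def discounted_vector_return(rewards, gammas):
    """Compute R_t = sum_{t' >= t} Gamma^(t'-t) * r_{t'} for every step t.

    rewards: array of shape (T, m), one row of m response rewards per request.
    gammas:  array of shape (m,), one discount factor per response.
    Returns an array of shape (T, m).
    """
    rewards = np.asarray(rewards, dtype=float)
    gammas = np.asarray(gammas, dtype=float)
    returns = np.zeros_like(rewards)
    running = np.zeros(rewards.shape[1])
    # Backward recursion R_t = r_t + Gamma * R_{t+1} (pointwise product).
    for t in range(rewards.shape[0] - 1, -1, -1):
        running = rewards[t] + gammas * running
        returns[t] = running
    return returns

def constraints_satisfied(session_returns, lower_bounds):
    """Check the auxiliary constraints of Eq. (1) on sampled sessions.

    session_returns: array (n_sessions, m); column 0 is the main response.
    lower_bounds:    array (m - 1,) holding C_2, ..., C_m.
    Assumption: the sample mean of session returns stands in for the expectation.
    """
    mean_aux = np.asarray(session_returns, dtype=float)[:, 1:].mean(axis=0)
    return bool(np.all(mean_aux >= np.asarray(lower_bounds, dtype=float)))

# Toy session: column 0 is WatchTime (dense), column 1 is Like (sparse).
example_returns = discounted_vector_return(
    [[10.0, 0.0], [5.0, 1.0], [8.0, 0.0]], gammas=[0.9, 0.5]
)
```

The backward recursion keeps each response on its own discount schedule, which is the property that a single scalar reward with one discount factor cannot express.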

4. Two-Stage Constrained Actor-Critic

In this section, we propose a novel two-stage constrained actor-critic method, addressing the learning challenges in the context of short video recommendation:

Multi-Critic Policy Estimation:

We propose to estimate the responses separately to better estimate dense and sparse signals.

Stage One:

For each auxiliary response, we learn a policy to optimize its cumulative reward.

Stage Two:

For the main response, we learn a policy to optimize its cumulative reward, while softly constraining it to stay close to the policies learned in stage one to optimize the auxiliary responses.

We first discuss the advantage of evaluating different responses separately over evaluating them jointly. We then elaborate on our method in the setting of online learning with stochastic policies in Sections 4.2 and 4.3, and finally discuss its extensions to the offline setting and deterministic policies.

4.1. Multi-Critic Policy Estimation

We showcase the advantage of separate evaluation for each response over a joint evaluation of summed response. Specifically, we consider two types of responses from each video view: WatchTime and interactions (which is an indicator function of whether the interactions happen during the view).

  • For the joint evaluation, we learn a value model $V_{joint}$ with reward as a sum of WatchTime and interactions.

  • For the separate evaluation, we learn two value models $V_w$ and $V_i$ with reward as WatchTime and interactions respectively. Define the value of separate evaluation as $V_{separate} = V_w + V_i$.

For a fair comparison, we use the same discount factor $0.95$ for all value models and train them on the same data, collected from a popular short video platform over one day. To evaluate the accuracy of the value models in terms of WatchTime and interactions, we compute the correlation between the model values $V_{joint}$ and $V_{separate}$ and the Monte Carlo value of the sum of the corresponding responses in each session. Compared to $V_{joint}$, $V_{separate}$ is more correlated with WatchTime and interactions by $0.19\%$ and $0.14\%$ respectively (a $0.1\%$ improvement on WatchTime and interactions is significant), demonstrating that separate evaluation learns the different reward responses better than joint learning.
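The evaluation protocol described above can be sketched as follows. This is only an illustrative reading of it: the text does not specify which correlation measure is used, so the Pearson correlation (via `np.corrcoef`) is an assumption, and the function names are hypothetical.

```python
import numpy as np

def monte_carlo_returns(rewards, gamma):
    """Discounted return-to-go for the reward sequence of one session."""
    rewards = np.asarray(rewards, dtype=float)
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def value_return_correlation(predicted_values, rewards, gamma=0.95):
    """Correlation between critic outputs and Monte Carlo returns.

    Assumption: Pearson correlation; the paper only says "correlation".
    """
    targets = monte_carlo_returns(rewards, gamma)
    predicted = np.asarray(predicted_values, dtype=float)
    return float(np.corrcoef(predicted, targets)[0, 1])

# Toy check: a critic that predicts the Monte Carlo returns exactly.
toy_rewards = [1.0, 0.0, 2.0]
toy_targets = monte_carlo_returns(toy_rewards, gamma=0.5)
toy_corr = value_return_correlation(toy_targets, toy_rewards, gamma=0.5)
```

Under this sketch, the separate evaluation would be scored by passing the elementwise sum of the two critics' predictions, and the joint evaluation by passing the single critic's predictions, against the same Monte Carlo targets.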

4.2. Stage One: Policy Learning for Auxiliary Responses

At this stage, we learn policies to optimize the cumulative reward of each auxiliary response separately. For completeness, we write out our procedure for stochastic policies (Williams, 1992). Considering response $i$, let the learned actor and critic be parameterized by $\pi_{\theta_i}$ and $V_{\phi_i}$ respectively. At iteration $k$, we observe a sample $(s, a, s')$ collected by $\pi_{\theta_i^{(k)}}$, i.e., $s \sim \rho_{\pi_{\theta_i^{(k)}}}$, $a \sim \pi_{\theta_i^{(k)}}(\cdot \mid s)$ and $s' \sim P(\cdot \mid s, a)$. We update the critic to minimize the squared Bellman error:

$$
\phi_i^{(k+1)} \leftarrow \arg\min_{\phi} E_{\pi_{\theta_i^{(k)}}}\Big[\big(r_i(s, a) + \gamma_i V_{\phi_i^{(k)}}(s') - V_{\phi}(s)\big)^2\Big].
\tag{2}
$$

We update the actor to maximize the advantage:

$$
\begin{aligned}
& \theta_i^{(k+1)} \leftarrow \arg\max_{\theta} E_{\pi_{\theta_i^{(k)}}}\Big[A_i^{(k)} \log\big(\pi_{\theta}(a \mid s)\big)\Big] \\
\text{where}\quad & A_i^{(k)} = r_i(s, a) + \gamma_i V_{\phi_i^{(k)}}(s') - V_{\phi_i^{(k)}}(s).
\end{aligned}
\tag{3}
$$
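A minimal sketch of one sample-based update in the spirit of Eqs. (2) and (3) is given below. It assumes a tabular critic and a softmax actor over a small discrete action set, and it takes a single gradient step instead of solving the arg min and arg max exactly; these choices, the function names, and the learning rates are illustrative assumptions, not the production implementation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def actor_critic_step(V, logits, s, a, r, s_next, gamma, lr_v=0.1, lr_pi=0.1):
    """One update of a tabular critic V and a softmax actor for one response.

    V:      array (n_states,) of state values.
    logits: array (n_states, n_actions) of policy logits.
    Both arrays are updated in place; the advantage estimate is returned.
    """
    # TD target built from the critic of the current iteration (Eq. 2).
    td_target = r + gamma * V[s_next]
    advantage = td_target - V[s]  # A_i^{(k)} in Eq. (3)
    # Gradient step on the squared Bellman error with the target held fixed
    # (the constant factor 2 is absorbed into lr_v).
    V[s] += lr_v * advantage
    # d log pi(a|s) / d logits[s] = onehot(a) - pi(.|s) for a softmax policy.
    grad_log_pi = -softmax(logits[s])
    grad_log_pi[a] += 1.0
    logits[s] += lr_pi * advantage * grad_log_pi
    return advantage

# Toy usage: two states, two actions, one observed transition.
V_toy = np.zeros(2)
logits_toy = np.zeros((2, 2))
adv_toy = actor_critic_step(V_toy, logits_toy, s=0, a=1, r=1.0, s_next=1, gamma=0.9)
```

With a positive advantage the step raises the probability of the sampled action, and with a negative advantage it lowers it, which is the behavior the advantage-weighted log-likelihood objective in Eq. (3) encodes.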

4.3. Stage Two: Softly Constrained Optimization of the Main Response

After pre-training the policies $\pi_{\theta_2}, \dots, \pi_{\theta_m}$ that optimize the auxiliary responses, we now move on to the second stage of learning the policy to optimize the main response. We propose a new constrained policy optimization method with multiple constraints.

Let the actor and the critic be $\pi_{\theta_1}$ and $V_{\phi_1}$ respectively. At iteration $k$, we similarly update the critic to minimize the squared Bellman error:

$$
\phi_1^{(k+1)} \leftarrow \arg\min_{\phi} E_{\pi_{\theta_1^{(k)}}}\Big[\big(r_1(s, a) + \gamma_1 V_{\phi_1^{(k)}}(s') - V_{\phi}(s)\big)^2\Big].
\tag{4}
$$

The principle of updating the actor is two-fold: (i) maximizing the advantage; (ii) restricting the policy to the domain that is not far from other policies. The optimization is formalized below:

$$
\begin{aligned}
\max_{\pi}\quad & E_{\pi}\big[A_1^{(k)}\big] \\
\text{s.t.}\quad & D_{KL}(\pi \,\|\, \pi_{\theta_i}) \leq \epsilon_i, \quad i = 2, \dots, m, \\
\text{where}\quad & A_1^{(k)} = r_1(s, a) + \gamma_1 V_{\phi_1^{(k)}}(s') - V_{\phi_1^{(k)}}(s).
\end{aligned}
\tag{5}
$$

We give the closed-form solution of the Lagrangian of Eq. (5) in the following theorem. We omit the proof due to lack of space; please refer to Appendix A.

Theorem 1.

The Lagrangian of Eq. (5) has the closed form solution

(6) $\pi^{*}(a|s)\propto\prod_{i=2}^{m}\big(\pi_{\theta_{i}}(a|s)\big)^{\frac{\lambda_{i}}{\sum_{j=2}^{m}\lambda_{j}}}\exp\bigg(\frac{A_{1}^{(k)}}{\sum_{j=2}^{m}\lambda_{j}}\bigg),$

where $\lambda_{i}$ with $i=2,\dots,m$ are Lagrangian multipliers.
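For intuition, the outline below sketches the standard Lagrangian argument for a fixed state $s$ with discrete actions. It is an illustrative sketch and may differ in detail from the proof in Appendix A. The multiplier $\mu$ for the normalization constraint $\sum_a \pi(a|s)=1$ is notation introduced here, and $A_{1}^{(k)}$ is treated as a function of $(s,a)$.

```latex
\begin{aligned}
\mathcal{L}(\pi,\lambda,\mu)
 &= \sum_{a}\pi(a|s)A_{1}^{(k)}
    -\sum_{i=2}^{m}\lambda_{i}\Big(\sum_{a}\pi(a|s)\log\frac{\pi(a|s)}{\pi_{\theta_{i}}(a|s)}-\epsilon_{i}\Big)
    +\mu\Big(1-\sum_{a}\pi(a|s)\Big),\\
\frac{\partial\mathcal{L}}{\partial\pi(a|s)}
 &= A_{1}^{(k)}-\sum_{i=2}^{m}\lambda_{i}\Big(\log\frac{\pi(a|s)}{\pi_{\theta_{i}}(a|s)}+1\Big)-\mu = 0,\\
\log\pi^{*}(a|s)
 &= \frac{A_{1}^{(k)}+\sum_{i=2}^{m}\lambda_{i}\log\pi_{\theta_{i}}(a|s)}{\sum_{j=2}^{m}\lambda_{j}}+\text{const}.
\end{aligned}
```

Exponentiating the last line and absorbing the constant into the normalization gives the product-of-policies form in (6).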

Given data collected by $\pi_{\theta_{1}^{(k)}}$, we learn the policy $\pi_{\theta_{1}}$ by minimizing its KL divergence from the optimal policy $\pi^{*}$:

(7) $\begin{aligned}&\theta_{1}^{(k+1)}\leftarrow\arg\min_{\theta}E_{\pi_{\theta_{1}^{(k)}}}[D_{KL}(\pi^{*}(a|s)||\pi_{\theta}(a|s))]\\ =\;&\arg\max_{\theta}E_{\pi_{\theta_{1}^{(k)}}}\Big[\frac{\prod_{i=2}^{m}\big(\pi_{\theta_{i}}(a|s)\big)^{\frac{\lambda_{i}}{\sum_{j=2}^{m}\lambda_{j}}}}{\pi_{\theta_{1}^{(k)}}(a|s)}\exp\bigg(\frac{A_{1}^{(k)}}{\sum_{j=2}^{m}\lambda_{j}}\bigg)\log\pi_{\theta}(a|s)\Big].\end{aligned}$
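The step from the KL objective to the weighted log-likelihood in (7) can be spelled out as follows. This is a sketch for a fixed state $s$ with discrete actions, not a reproduction of the authors' derivation.

```latex
\begin{aligned}
D_{KL}\big(\pi^{*}(\cdot|s)\,||\,\pi_{\theta}(\cdot|s)\big)
 &= \sum_{a}\pi^{*}(a|s)\log\pi^{*}(a|s)-\sum_{a}\pi^{*}(a|s)\log\pi_{\theta}(a|s),\\
\arg\min_{\theta}D_{KL}\big(\pi^{*}\,||\,\pi_{\theta}\big)
 &= \arg\max_{\theta}\sum_{a}\pi^{*}(a|s)\log\pi_{\theta}(a|s)\\
 &= \arg\max_{\theta}E_{a\sim\pi_{\theta_{1}^{(k)}}}\Big[\frac{\pi^{*}(a|s)}{\pi_{\theta_{1}^{(k)}}(a|s)}\log\pi_{\theta}(a|s)\Big].
\end{aligned}
```

The first term of the KL divergence does not depend on $\theta$, and the last line rewrites the sum over actions as an expectation under the data-collecting policy. Substituting the unnormalized form of $\pi^{*}$ from (6) yields the weight in (7); note that (7) as written omits the state-dependent normalizing constant of $\pi^{*}$.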

The procedure of the two-stage constrained actor-critic algorithm, which we call TSCAC for short, is shown in Appendix B. We here provide some intuition behind the actor update in (7). The term $\pi_{\theta_{i}}(a|s)$ denotes the probability that policy $i$ selects the action $a$ and serves as an importance weight, which softly regularizes the learned policy $\pi_{\theta_{1}}$ to be close to the other policies $\pi_{\theta_{i}}$. Smaller Lagrangian multipliers $\lambda_{i}$ indicate weaker constraints, and when $\lambda_{i}=0$, the learned policy $\pi_{\theta_{1}}$ is allowed to be independent of the constraint policy $\pi_{\theta_{i}}$. Note that we set all $\lambda_{i}$ to the same value $\lambda$, which is more practical for the production system. The performance of TSCAC would be better if we tuned the Lagrangian multipliers individually, but the effectiveness of TSCAC with a shared value of $\lambda$ is validated in both offline and live experiments, as we will see in the following sections.
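To make the role of the multipliers concrete, the following sketch evaluates the closed form (6) for a single state with three discrete actions. The probabilities, advantages, multiplier values, and the function name `constrained_optimal_policy` are made up for illustration.

```python
import math


def constrained_optimal_policy(aux_policies, advantages, lambdas):
    """Closed-form policy of Eq. (6) for one state with discrete actions.

    aux_policies: list of action-probability lists, one per auxiliary policy.
    advantages:   list of A_1(s, a) values, one per action.
    lambdas:      list of positive multipliers, one per auxiliary policy.
    """
    lam_sum = sum(lambdas)
    logits = []
    for a, adv in enumerate(advantages):
        # Log of the weighted geometric mean of the auxiliary policies.
        log_geo = sum(lam / lam_sum * math.log(p[a])
                      for lam, p in zip(lambdas, aux_policies))
        logits.append(log_geo + adv / lam_sum)
    top = max(logits)                      # subtract max for numerical stability
    unnorm = [math.exp(l - top) for l in logits]
    z = sum(unnorm)
    return [u / z for u in unnorm]


aux = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]   # two auxiliary policies prefer action 0
adv = [0.0, 0.5, 1.0]                      # main-goal advantage prefers action 2
weak = constrained_optimal_policy(aux, adv, [0.05, 0.05])
strong = constrained_optimal_policy(aux, adv, [50.0, 50.0])
```

With small multipliers the exponential advantage term dominates and `weak` concentrates on action 2; with large multipliers `strong` stays close to the geometric mean of the auxiliary policies and prefers action 0.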

Offline Learning We now discuss adapting our constrained actor-critic method to the offline setting, i.e., learning from a fixed dataset. The main change when moving from online learning to offline learning is the bias correction on the policy gradient. The actor is no longer updated on data collected by the current policy but on data collected by another behavior policy $\pi_{\beta}$, so the data distribution may differ from the one induced by the policy being updated. To address this distribution mismatch when estimating the policy gradient, a common strategy is to apply a bias-correction ratio via importance sampling (Precup, 2000; Precup et al., 2001). Given a trajectory $\tau=(s_{1},a_{1},s_{2},a_{2},\dots)$, the bias-correction ratio on the policy gradient for policy $\pi_{\theta_{i}}$ is $w(s_{t},a_{t})=\prod_{t^{\prime}=1}^{t}\frac{\pi_{\theta_{i}}(a_{t^{\prime}}|s_{t^{\prime}})}{\pi_{\beta}(a_{t^{\prime}}|s_{t^{\prime}})}$, which gives an unbiased estimate, but the variance can be huge. Therefore, we suggest using a first-order approximation, i.e., using only the current action-selection ratio, when optimizing the actors of the auxiliary responses:
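The difference between the full trajectory ratio and the one-step approximation can be seen in a small numeric sketch. The probabilities below are made up, and the function names are illustrative.

```python
def trajectory_ratios(target_probs, behavior_probs):
    """Cumulative ratios w(s_t, a_t): product of per-step ratios up to step t."""
    ratios, running = [], 1.0
    for p, b in zip(target_probs, behavior_probs):
        running *= p / b
        ratios.append(running)
    return ratios


def one_step_ratios(target_probs, behavior_probs):
    """First-order approximation: keep only the current step's ratio."""
    return [p / b for p, b in zip(target_probs, behavior_probs)]


# Probabilities each policy assigned to the logged actions of a 6-step trajectory.
pi_target = [0.6, 0.6, 0.6, 0.6, 0.6, 0.6]
pi_beta = [0.3, 0.3, 0.3, 0.3, 0.3, 0.3]
full = trajectory_ratios(pi_target, pi_beta)   # grows as 2, 4, 8, ...
approx = one_step_ratios(pi_target, pi_beta)   # stays at 2 for every step
```

Even with a constant per-step ratio of 2, the cumulative weight reaches 64 after six steps, which illustrates why the product form can have very high variance on long trajectories.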

(8) $\theta_{i}^{(k+1)}\leftarrow\arg\max_{\theta}E_{\pi_{\beta}}\bigg[\frac{\pi_{\theta_{i}^{(k)}}(a|s)}{\pi_{\beta}(a|s)}A_{i}^{(k)}\log(\pi_{\theta}(a|s))\bigg].$

When updating the actor of the main response, we have

(9) $\theta_{1}^{(k+1)}\leftarrow\arg\max_{\theta}E_{\pi_{\beta}}\Big[\frac{\prod_{i=2}^{m}\big(\pi_{\theta_{i}}(a|s)\big)^{\frac{\lambda_{i}}{\sum_{j=2}^{m}\lambda_{j}}}}{\pi_{\beta}(a|s)}\exp\bigg(\frac{A_{1}^{(k)}}{\sum_{j=2}^{m}\lambda_{j}}\bigg)\log(\pi_{\theta}(a|s))\Big].$
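A minimal sketch of the per-sample weights in (8) and (9) is given below, assuming the probabilities of the logged action under each policy are already available as plain numbers. The function names and the numeric values are hypothetical; in practice the weights would be treated as constants when differentiating with respect to $\theta$.

```python
import math


def aux_actor_weight(pi_i_old, pi_beta, advantage_i):
    """Per-sample weight on log pi_theta(a|s) in Eq. (8)."""
    return pi_i_old / pi_beta * advantage_i


def main_actor_weight(aux_probs, lambdas, pi_beta, advantage_1):
    """Per-sample weight on log pi_theta(a|s) in Eq. (9).

    aux_probs: pi_{theta_i}(a|s) for i = 2..m, evaluated at the logged action.
    """
    lam_sum = sum(lambdas)
    log_geo = sum(lam / lam_sum * math.log(p) for lam, p in zip(lambdas, aux_probs))
    return math.exp(log_geo + advantage_1 / lam_sum) / pi_beta


def weighted_log_likelihood(weights, log_probs):
    """Monte Carlo estimate of the objective: mean of weight * log pi_theta(a|s)."""
    return sum(w * lp for w, lp in zip(weights, log_probs)) / len(weights)


w_aux = aux_actor_weight(pi_i_old=0.4, pi_beta=0.2, advantage_i=0.5)
w_main = main_actor_weight(aux_probs=[0.4, 0.1], lambdas=[1.0, 1.0],
                           pi_beta=0.2, advantage_1=0.0)
objective = weighted_log_likelihood([w_aux, w_main], [math.log(0.5), math.log(0.5)])
```

In this example the geometric mean of the auxiliary probabilities equals the behavior probability and the advantage is zero, so the main-actor weight reduces to 1.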

Deterministic Policies We now discuss the extension of TSCAC to deterministic policies (Lillicrap et al., 2015), inspired by the updating rule for the actor of the constrained policy discussed in (7). Similarly, at stage one, for each auxiliary response $i$, we learn separate critic models $Q_{\phi_{i}}(s,a)$ and actor models $\pi_{\theta_{i}}(s)$. At stage two, for the main response, we learn the critic $Q_{\phi_{1}}(s,a)$ via temporal-difference learning, and for the actor $\pi_{\theta_{1}}(s)$, the updating rule takes the form:

(10) $\max_{\theta}\quad\prod_{i=2}^{m}\bigg(h(\pi_{\theta_{i}}(s),\pi_{\theta_{1}}(s))\bigg)^{\lambda_{i}}Q_{\phi_{1}}(s,\pi_{\theta}(s)),$

where $h(a_{1},a_{2})$ scores high when two actions $a_{1},a_{2}$ are close to each other and low otherwise, so $h(\pi_{\theta_{i}}(s),\pi_{\theta_{1}}(s))$ scores high when the actions selected by policies $\pi_{\theta_{1}}$ and $\pi_{\theta_{i}}$ are close. $\lambda_{i}\geq 0$ plays a role similar to that of the Lagrangian multiplier of the constraint: a larger $\lambda_{i}$ denotes a stronger constraint. As an example, given an $n$-dimensional action space, one can choose $h(a_{1},a_{2})=\sum_{d=1}^{n}\exp\big(-\frac{(a_{1d}-a_{2d})^{2}}{2}\big)$. The deterministic version of TSCAC can be applied to settings with continuous actions, such as the embedding of the user preference.
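The example similarity function and the objective in (10) can be evaluated for a single state as in the sketch below. The action vectors, multipliers, and Q-value are made up, and the critic output is passed in as a plain number rather than computed by a model.

```python
import math


def similarity(a1, a2):
    """h(a1, a2) = sum_d exp(-(a1_d - a2_d)^2 / 2); equals n when a1 == a2."""
    return sum(math.exp(-((x - y) ** 2) / 2.0) for x, y in zip(a1, a2))


def deterministic_actor_objective(main_action, aux_actions, lambdas, q_value):
    """Value of the objective in Eq. (10) for one state.

    q_value is assumed to be Q_{phi_1}(s, main_action), computed elsewhere.
    """
    weight = 1.0
    for aux_action, lam in zip(aux_actions, lambdas):
        weight *= similarity(aux_action, main_action) ** lam
    return weight * q_value


aux_actions = [[0.0, 1.0], [0.2, 0.8]]     # actions chosen by two auxiliary actors
near = deterministic_actor_objective([0.1, 0.9], aux_actions, [1.0, 1.0], q_value=2.0)
far = deterministic_actor_objective([3.0, -2.0], aux_actions, [1.0, 1.0], q_value=2.0)
```

For the same (positive) Q-value, the action that lies close to the auxiliary actions receives the larger objective, which is the soft regularization effect described above.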

5. Offline Experiments

In this section, we evaluate our method on a public dataset about short video recommendation via extensive offline learning simulations. We demonstrate the effectiveness of our approach as compared to existing baselines in both achieving the main goal and balancing the auxiliaries. We also test the versatility of our method on another public recommendation dataset, please refer to Appendix C due to lack of space.

5.1. Setup

Dataset

We consider a public dataset for short video recommendation named KuaiRand (https://kuairand.com/) (Gao et al., 2022b), which is collected from a famous video-sharing mobile app and is suitable for the offline evaluation of RL methods as it is unbiased. This dataset records not only the overall WatchTime of the videos, but also the interaction behavior of the users, including Click, Like, Comment, and Hate. The statistics of the dataset are given in Table 1, which shows that Like, Comment, and Hate are sparse signals; Hate is extremely sparse. Logs provided by the same user are concatenated to form a trajectory, and we choose the top 150 videos that are most frequently viewed.

Table 1. The statistics of KuaiRand.
| Dimension | Number | Sparse Ratio |
|---|---|---|
| users | 26,858 | - |
| items | 10,221,515 | - |
| samples | 68,148,288 | - |
| click | 25,693,008 | 37.70% |
| like | 1,094,434 | 1.61% |
| comment | 163,977 | 0.24% |
| hate | 32,449 | 0.048% |

MDP

  • state $s_{t}$: a 1044-dimensional vector, which is a concatenation of user features (user property), the features of the last 20 videos viewed by the user (user history), and the features of all 150 candidate videos (context).

  • action $a_{t}$: the ID of the video to be recommended at the current step.

  • reward $r_{t}$: a vector of five scores that the user provided for the viewed video in terms of Click, Like, Comment, Hate, and WatchTime.

  • episode: a sequence of a user's video viewing history.

  • discount factor $\gamma$: 0.99

  • objective: We set the main goal to be maximizing the video WatchTime, and treat others as the auxiliaries.

Evaluation

We use the Normalised Capped Importance Sampling (NCIS) approach to evaluate different policies, which is a standard offline evaluation approach for RL methods in recommender systems (Zou et al., 2019). We also evaluate our method in terms of other metrics; please refer to Appendix D. The NCIS score is defined as:

(11) $N(\pi)=\frac{\sum_{s,a\in D}w(s,a)r(s,a)}{\sum_{s,a\in D}w(s,a)},\quad w(s,a)=\min\Big\{c,\frac{\pi(a|s)}{\pi_{\beta}(a|s)}\Big\},$

where $D$ is the dataset, $w(s,a)$ is the clipped importance sampling ratio, $\pi_{\beta}$ denotes the behavior policy, and $c$ is a positive constant.
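A direct transcription of (11) into code might look as follows. The logged samples and the cap value are made-up numbers, and `ncis_score` is an illustrative name.

```python
def ncis_score(samples, cap):
    """Normalised Capped Importance Sampling estimate, following Eq. (11).

    samples: iterable of (pi_prob, beta_prob, reward) for each logged (s, a),
             where pi_prob = pi(a|s) and beta_prob = pi_beta(a|s).
    cap:     the positive clipping constant c.
    """
    numerator, denominator = 0.0, 0.0
    for pi_prob, beta_prob, reward in samples:
        w = min(cap, pi_prob / beta_prob)   # clipped importance ratio
        numerator += w * reward
        denominator += w
    return numerator / denominator


logged = [
    (0.50, 0.25, 10.0),   # ratio 2.0
    (0.10, 0.50, 4.0),    # ratio 0.2
    (0.90, 0.10, 20.0),   # ratio 9.0, clipped to the cap
]
score = ncis_score(logged, cap=3.0)
```

Here the clipped weights are 2.0, 0.2, and 3.0, so the score is the weighted average 80.8 / 5.2 of the logged rewards. Clipping limits the influence of samples that the evaluated policy favors far more strongly than the behavior policy.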

Baselines

We compare TSCAC with the following baselines.

  • BC: A supervised behavior-cloning policy $\pi_{\beta}$ that mimics the recommendation policy in the dataset; it takes the user state as input and outputs the video ID.

  • Wide&Deep (Cheng et al., 2016): A supervised model that uses wide and deep layers to balance memorization and generalization. It takes the user state as input and outputs the item ID; the weight of each sample is set to the weighted sum of all responses of this item.

  • DeepFM (Guo et al., 2017): A supervised recommendation model that combines a deep neural network and a factorization machine. It takes the user state as input and outputs the item ID; the weight of each sample is set to the weighted sum of all responses of this item.

  • RCPO (Tessler et al., 2018): A constrained actor-critic approach called reward-constrained policy optimization, which optimizes the policy to maximize the Lagrange dual function of the constrained program. Specifically, the reward function is defined as $r=r_{0}+\sum_{i=1}^{n}\lambda_{i}r_{i}$, where $r_{0}$ is the main objective (WatchTime), $r_{i}$ denotes the other feedback, and $\lambda_{i}$ is the Lagrangian multiplier.

  • RCPO-Multi-Critic: We test an improved version of RCPO with multiple critics. We separately learn multiple critic models to evaluate the cumulative rewards of each feedback. Then when optimizing the actor, we maximize a linear combination of critics, weighted by the Lagrangian multipliers.

  • Pareto (Chen et al., 2021): A multi-objective RL algorithm that finds the Pareto optimal solution for recommender systems.

  • TSCAC: our two-stage constrained actor-critic algorithm.
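The difference between the two RCPO variants above can be sketched as follows: RCPO folds all feedback into one scalar reward before a single critic sees it, whereas RCPO-Multi-Critic keeps one critic per response and combines their outputs. The multiplier values are hypothetical, and giving the main critic a weight of 1 in the combination is an assumption of this sketch.

```python
def rcpo_reward(main_reward, aux_rewards, lambdas):
    """Single scalar reward used by RCPO: r = r_0 + sum_i lambda_i * r_i."""
    return main_reward + sum(l * r for l, r in zip(lambdas, aux_rewards))


def multi_critic_value(main_value, aux_values, lambdas):
    """RCPO-Multi-Critic actor target: linear combination of per-response critics.

    Assumption: the main critic enters with weight 1, mirroring rcpo_reward.
    """
    return main_value + sum(l * v for l, v in zip(lambdas, aux_values))


lambdas = [0.5, 0.25]                              # hypothetical multipliers
step_reward = rcpo_reward(12.0, [1.0, 0.0], lambdas)
actor_target = multi_critic_value(3.0, [1.0, 2.0], lambdas)
```

With a single critic, a rarely non-zero auxiliary reward is only a small perturbation of the dense main reward, which is one way to read the observation in Section 5.2 that sparse signals are dominated by dense ones.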

5.2. Overall Performance

Table 2. Performance of different algorithms on KuaiRand.
| Algorithm | Click↑ | Like↑ (e-2) | Comment↑ (e-3) | Hate↓ (e-4) | WatchTime↑ |
|---|---|---|---|---|---|
| BC | 0.5338 | 1.231 | 3.225 | 2.304 | 12.85 |
| Wide&Deep | 0.5544 | 1.244 | 3.344 | 2.011 | 12.84 |
| | 3.86% | 1.07% | 3.69% | -12.7% | -0.08% |
| DeepFM | 0.5549* | 1.388* | 3.310 | 2.112 | 12.92 |
| | 3.95%* | 12.76%* | 2.64% | -8.31% | 0.53% |
| RCPO | 0.5510 | 1.386 | 3.628* | 2.951 | 13.07* |
| | 3.23% | 12.57% | 12.5%* | 28.1% | 1.70%* |
| RCPO-Multi-Critic | 0.5519 | 1.367 | 3.413 | 2.108 | 13.00 |
| | 3.41% | 11.04% | 5.83% | -8.49% | 1.14% |
| Pareto | 0.5438 | 1.171 | 3.393 | 0.9915* | 11.90 |
| | 1.87% | -4.85% | 5.22% | -56.96%* | -7.4% |
| TSCAC | **0.5570** | **1.462** | **3.728** | 1.870 | **13.14** |
| | 4.35% | **18.80%** | **15.6%** | -18.83% | **2.23%** |

The number in the bracket stands for the unit of this column; The number in the first row of each algorithm is the NCIS score.
The percentage in the second row means the performance gap between the algorithm and the BC algorithm.
The numbers with * denote the best performance among all baseline methods in each response dimension.
The last row is marked by bold font when TSCAC achieves the best performance at each response dimension.
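As a worked example of the percentage rows, the sketch below recomputes the gap to BC for the Click column, assuming that "performance gap" means the relative difference to the BC score. Values recomputed for other columns from the rounded scores can deviate slightly from the table, presumably because the reported scores are rounded.

```python
def relative_gap(score, bc_score):
    """Relative difference to the BC baseline, in percent."""
    return 100.0 * (score - bc_score) / bc_score


# Click scores taken from Table 2.
bc_click, wide_deep_click, tscac_click = 0.5338, 0.5544, 0.5570
wide_deep_gap = relative_gap(wide_deep_click, bc_click)   # about 3.86
tscac_gap = relative_gap(tscac_click, bc_click)           # about 4.35
```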

Table 2 presents the performance of different algorithms in terms of five scores. We can see that our TSCAC algorithm significantly outperforms other algorithms, including both constrained reinforcement learning and supervised learning methods: for the main goal (WatchTime), TSCAC achieves the highest score of 13.14 (2.23%); for the auxiliary goals, TSCAC also ranks highest on 3 out of 4 scores (Click, Like, Comment). Note that TSCAC outperforms BC and RCPO in each dimension. The Pareto algorithm indeed learns a Pareto optimal solution that achieves the best performance on Hate, but it obtains the lowest WatchTime, 11.90 (-7.4%), i.e., it does not fit the setting in which the main goal is to optimize WatchTime. The RCPO algorithm achieves the second-highest WatchTime, 13.07 (1.70%), but its score on Hate is the worst, as the sparse signals are dominated by dense signals in a single evaluation model. Compared with RCPO, RCPO-Multi-Critic achieves a much better score on Hate, which demonstrates the effectiveness of the multi-critic policy estimation method. TSCAC also outperforms RCPO-Multi-Critic in each dimension, which shows the ability of our two-stage actor learning method to deal with multiple responses.

5.3. Ablation Study

We investigate how the value of the Lagrangian multiplier affects the performance. As we set the value of $\lambda$ to be the same for all constraints in the second stage, we vary $\lambda$ across [1e-1, 1e-2, 1e-3, 1e-4, 1e-5] and present the performance of TSCAC in terms of all responses. Recall that a larger $\lambda$ denotes stronger constraints from the auxiliary responses. Figure 3 shows that as $\lambda$ increases, the main goal, WatchTime, decreases because the constraints of the auxiliary responses become stronger. As shown in Figure 3, the performance on interactions drops with the small value $\lambda=$ 1e-5, as the constraints are weak. Interestingly, the performance on interactions also decreases with larger $\lambda$, which shows that overly strong constraints affect the learning of the policy. The value 1e-4 achieves the best performance on interactions and improves WatchTime significantly compared with other baselines.

Figure 3. Effect of the value of the Lagrangian multiplier on the performance.

6. Live Experiments

To demonstrate the effectiveness of our algorithm, we compare it with alternatives via live experiments on a popular short video platform. The algorithms are embedded in a candidate-ranking system used in production on this platform: when a user arrives, the algorithms rank the candidate videos, and the system recommends the top video to the user. We show that the proposed TSCAC algorithm is able to learn a policy that maximizes the main goal while also effectively balancing the auxiliary goal. In particular, we set the main goal as maximizing WatchTime and the auxiliary goal as improving the interactions between users and videos.

6.1. Setup

Evaluation metrics

We use online metrics to evaluate policy performance. For the main goal, we look at the total amount of time users spend on the videos, referred to as WatchTime. For the auxiliary goal, users can interact with videos in multiple ways, such as sharing a video with friends, downloading it, or commenting on it. Here, we focus on three online metrics associated with user-video interactions: the total numbers of Share, Download, and Comment interactions.

Figure 4. The workflow of RL in production system.

MDP

Following the formulation in Section 2, we present the details of the Constrained MDP for short video recommendation.

  • state $s_{t}$: user historical interactions (the list of items recommended to the user in previous rounds and the corresponding user feedback), user properties (such as device and location), and the features (embeddings and statistics) of candidate videos at time $t$.

  • action $a_{t}$: a vector embedding of algorithm-predicted user preferences on different video topics, which determines the actual recommendation action (the video to be recommended) via the ranking function described below:

    the ranking function: for each candidate video, this function calculates the dot product between the predicted user preference vector ($a_{t}$) and the video embedding (representing its topic and quality), as in (Dulac-Arnold et al., 2015). The video with the largest score is then recommended.

  • reward $r_{t}=(l_{t},i_{t})$: after each recommendation, the system observes how long the user spent on the video (WatchTime), denoted as $l_{t}$, and whether the user has interacted with the video (Share/Download/Comment), denoted as $i_{t}$.

  • episode: a trajectory starts when a user opens the app and ends when the user leaves.

  • policy: we choose to learn a Gaussian policy in the live experiments. Specifically, the action $a_{t}$ is sampled from a multivariate Gaussian distribution whose mean and variance are outputs of the actor model.
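The action and ranking steps described in the list above can be sketched as follows. The diagonal covariance, the two-dimensional embeddings, and all numeric values are assumptions made for illustration; the production feature dimensions are not given here.

```python
import random


def sample_action(mean, std, rng):
    """Sample a preference vector from a diagonal Gaussian policy."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


def rank_candidates(action, candidate_embeddings):
    """Score each candidate by its dot product with the action.

    Returns the index of the highest-scoring candidate and all scores.
    """
    scores = [sum(a * e for a, e in zip(action, emb)) for emb in candidate_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


rng = random.Random(0)
candidates = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]      # hypothetical video embeddings
action = sample_action(mean=[0.9, 0.1], std=[0.05, 0.05], rng=rng)
best_index, scores = rank_candidates(action, candidates)
```

Because the action is continuous while the final recommendation is discrete, exploration happens in the preference space: small perturbations of the sampled vector can change which candidate obtains the largest dot product.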

Workflow

As shown in Figure 4, RL runs as follows:

  • Inference: When a user arrives, the user state is sent to the actor network, which samples an action from the Gaussian distribution. The ranking function then takes both the action and the embeddings of the candidates as input, calculates the dot product between the action and each video embedding as a score, and outputs the item with the highest score to the user. After that, the tuple (state, action, rewards, next state) is saved in the replay buffer.

  • Training: The actor and the critic networks are trained with mini-batches of (state, action, rewards, next state) tuples sampled from the replay buffer.
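
The replay buffer that connects the inference and training steps above could look like the following sketch. The class name `ReplayBuffer`, the first-in-first-out eviction, and the uniform sampling are assumptions; the text does not specify how the production buffer evicts or samples.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity FIFO buffer of (state, action, rewards, next_state)."""

    def __init__(self, capacity):
        # A bounded deque drops the oldest transition once it is full.
        self._storage = deque(maxlen=capacity)

    def add(self, state, action, rewards, next_state):
        self._storage.append((state, action, rewards, next_state))

    def sample(self, batch_size, rng=random):
        """Return a uniformly sampled mini-batch (without replacement)."""
        batch_size = min(batch_size, len(self._storage))
        return rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    # rewards is the pair (l_t, i_t): WatchTime and the interaction signal.
    buffer.add(state=step, action=[0.1 * step], rewards=(12.0, 0), next_state=step + 1)

mini_batch = buffer.sample(batch_size=2)
```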

Compared algorithms

We complement our evaluation with a supervised learning-to-rank (LTR) baseline, which is the default model run on the platform.

  • RCPO: Following (Tessler et al., 2018), we define a combined reward $l_t+\lambda i_t$ and learn a policy to maximize the cumulative combined reward with discount factor $0.95$, where $\lambda$ is the Lagrangian multiplier.

  • TSCAC: We first learn a policy $\pi_2$ to optimize the auxiliary goal. Then we learn a policy $\pi_1$ to optimize the main goal with the soft constraint that $\pi_1$ is close to $\pi_2$.

    • Interaction-AC: At the first stage, we learn a policy $\pi_2$ to maximize the interaction reward, with critic update following (2) and actor update following (3).

    • TSCAC: At the second stage, we learn a main policy $\pi_1$ to maximize the cumulative reward of WatchTime and softly regularize $\pi_1$ to be close to $\pi_2$, with critic update following (4) and actor update following (7).

  • LTR (Baseline): The learning-to-rank model (Liu et al., 2009) that takes user state embedding and video embedding as input and fits the sum of responses.
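
To make the RCPO baseline in the list above concrete, the sketch below computes the combined reward $l_t+\lambda i_t$ and its discounted sum with discount factor 0.95. The multiplier is held fixed here purely for illustration (in RCPO it is adjusted during training), and the function names and numbers are hypothetical.

```python
def combined_reward(watch_time, interaction, lam):
    """Combined reward l_t + lambda * i_t for a single step."""
    return watch_time + lam * interaction


def discounted_return(rewards, gamma=0.95):
    """Discounted sum of a reward sequence, accumulated from the last step."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


watch_times = [10.0, 20.0]   # hypothetical l_t values
interactions = [0, 1]        # hypothetical i_t values
lam = 2.0                    # fixed multiplier, an assumption for this example

step_rewards = [combined_reward(l, i, lam) for l, i in zip(watch_times, interactions)]
episode_return = discounted_return(step_rewards)
```

Since a single scalar mixes a dense signal (WatchTime) with a sparse one (interactions), the value of the multiplier largely decides which signal dominates the return.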

Experimental details

To test different algorithms, we randomly split users on the platform into several buckets. The first bucket runs the baseline LTR model, and the remaining buckets run the RCPO, Interaction-AC, and TSCAC models. Models are trained for a couple of days and are then frozen, after which their performance is tested over one day.
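
One common way to realize such a random user split is deterministic hashing, sketched below. The text does not describe the actual splitting mechanism, so the hashing scheme, the salt, and the name `assign_bucket` are all assumptions.

```python
import hashlib

# Bucket names follow the models compared in the text.
BUCKETS = ["LTR", "RCPO", "Interaction-AC", "TSCAC"]


def assign_bucket(user_id, salt="experiment-salt"):
    """Map a user to one bucket; the same user always gets the same bucket."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return BUCKETS[int(digest, 16) % len(BUCKETS)]


bucket = assign_bucket("user-42")   # hypothetical user ID
```

Hashing keeps each user in one bucket for the whole experiment, so the training and test days observe the same population per model.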

Table 3. Performance comparison of different algorithms with the LTR baseline in live experiments.

Algorithm       WatchTime  Share    Download  Comment
RCPO            +0.309%    -0.707%  0.153%    -1.313%
Interaction-AC  +0.117%    +5.008%  +1.952%   -0.101%
TSCAC           +0.379%    +3.376%  +1.733%   -0.619%

6.2. Results

Table 3 shows the performance improvement of each algorithm over the LTR baseline in terms of WatchTime, Share, Download, and Comment. RCPO learns to improve WatchTime compared with the baseline, but the interaction signals are so sparse relative to WatchTime that, when these responses are combined into a single reward, RCPO cannot balance the interactions effectively. The Interaction-AC algorithm performs as expected: with signal from only the interaction reward, it learns to improve the interaction-related metrics (Share, Download, Comment); such interactions between users and videos also improve WatchTime, since more interesting videos with a high potential of invoking interactions are recommended, which improves the overall user experience. Finally, the TSCAC algorithm achieves the best performance: compared with RCPO, it has better WatchTime and does much better on the interaction metrics, thanks to the soft regularization during training that keeps it from straying too far from the Interaction-AC policy. Note that a 0.1% improvement in WatchTime and a 1% improvement in interactions are statistically significant on the short video platform. That is, the performance improvement of our proposed method over the baselines is significant. The universal drop in Comment for all RL methods is due to the natural trade-off between WatchTime and Comment.

Figure 5. Online performance gap of TSCAC over the LTR baseline of each day.

To understand how the TSCAC algorithm learns to balance the main and auxiliary goals, Figure 5 plots the online performance gap of the second stage over the LTR baseline on both WatchTime and interactions. As shown, the algorithm quickly learns to improve the interaction metrics Share and Comment at the beginning, under the constraint of the Interaction-AC policy. The model then gradually learns to improve WatchTime over time while sacrificing the interactions a little. Note that the live performance of TSCAC significantly outperforms RCPO on every dimension, which demonstrates the effectiveness of our method.

7. Conclusion

In this paper, we study the problem of optimizing the main cumulative response under multiple sparse auxiliary constraints in short video platforms. To tackle the challenge of multiple constraints, we propose a novel constrained reinforcement learning method, called TSCAC, that optimizes the main goal while balancing the others. Our method consists of multi-critic estimation and two learning stages. At stage one, for each auxiliary response, we learn a policy to optimize its cumulative reward. At stage two, we learn the main policy to optimize the cumulative main response, with a soft constraint that restricts the policy to be close to the policies optimized for the other responses. We demonstrate the advantages of our method over existing alternatives via extensive offline evaluations as well as live experiments. For future work, it is promising to apply our method to other recommender systems. It would also be interesting to study the performance of a deterministic version of TSCAC.

References

  • Afsar et al. (2021) M Mehdi Afsar, Trafford Crump, and Behrouz Far. 2021. Reinforcement learning based recommender systems: A survey. arXiv preprint arXiv:2101.06286 (2021).
  • Alam et al. (2016) Md Hijbul Alam, Woo-Jong Ryu, and SangKeun Lee. 2016. Joint multi-grain topic sentiment: modeling semantic aspects for online reviews. Information Sciences 339 (2016), 206–223.
  • Chen et al. (2019b) Haokun Chen, Xinyi Dai, Han Cai, Weinan Zhang, Xuejian Wang, Ruiming Tang, Yuzhou Zhang, and Yong Yu. 2019b. Large-scale interactive recommendation with tree-structured policy gradient. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 3312–3320.
  • Chen et al. (2019a) Minmin Chen, Alex Beutel, Paul Covington, Sagar Jain, Francois Belletti, and Ed H Chi. 2019a. Top-k off-policy correction for a REINFORCE recommender system. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining. 456–464.
  • Chen et al. (2018) Shi-Yong Chen, Yang Yu, Qing Da, Jun Tan, Hai-Kuan Huang, and Hai-Hong Tang. 2018. Stabilizing reinforcement learning in dynamic environment with application to online recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1187–1196.
  • Chen et al. (2021) Xu Chen, Yali Du, Long Xia, and Jun Wang. 2021. Reinforcement Recommendation with User Multi-aspect Preference. In Proceedings of the Web Conference 2021. 425–435.
  • Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems. 7–10.
  • Chow et al. (2017) Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. 2017. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research 18, 1 (2017), 6070–6120.
  • Dalal et al. (2018) Gal Dalal, Krishnamurthy Dvijotham, Matej Vecerik, Todd Hester, Cosmin Paduraru, and Yuval Tassa. 2018. Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757 (2018).
  • Dulac-Arnold et al. (2015) Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, and Ben Coppin. 2015. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679 (2015).
  • Gao et al. (2022a) Chongming Gao, Wenqiang Lei, Jiawei Chen, Shiqi Wang, Xiangnan He, Shijun Li, Biao Li, Yuan Zhang, and Peng Jiang. 2022a. CIRS: Bursting Filter Bubbles by Counterfactual Interactive Recommender System. arXiv preprint arXiv:2204.01266 (2022).
  • Gao et al. (2022b) Chongming Gao, Shijun Li, Yuan Zhang, Jiawei Chen, Biao Li, Wenqiang Lei, Peng Jiang, and Xiangnan He. 2022b. KuaiRand: An Unbiased Sequential Recommendation Dataset with Randomly Exposed Videos. In Proceedings of the 31st ACM International Conference on Information and Knowledge Management (Atlanta, GA, USA) (CIKM ’22). 5 pages. https://doi.org/10.1145/3511808.3557624
  • García and Fernández (2015) Javier García and Fernando Fernández. 2015. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16, 1 (2015), 1437–1480.
  • Ge et al. (2021) Yingqiang Ge, Shuchang Liu, Ruoyuan Gao, Yikun Xian, Yunqi Li, Xiangyu Zhao, Changhua Pei, Fei Sun, Junfeng Ge, Wenwu Ou, et al. 2021. Towards Long-term Fairness in Recommendation. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 445–453.
  • Ge et al. (2022) Yingqiang Ge, Xiaoting Zhao, Lucia Yu, Saurabh Paul, Diane Hu, Chu-Cheng Hsieh, and Yongfeng Zhang. 2022. Toward Pareto Efficient Fairness-Utility Trade-off in Recommendation through Reinforcement Learning. arXiv preprint arXiv:2201.00140 (2022).
  • Gong et al. (2022) Xudong Gong, Qinlin Feng, Yuan Zhang, Jiangling Qin, Weijie Ding, Biao Li, and Peng Jiang. 2022. Real-time Short Video Recommendation on Mobile Devices. In Proceedings of the 31st ACM International Conference on Information and Knowledge Management (Atlanta, GA, USA) (CIKM ’22).
  • Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247 (2017).
  • Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
  • Lin et al. (2019) Xiao Lin, Hongjie Chen, Changhua Pei, Fei Sun, Xuanji Xiao, Hanxiao Sun, Yongfeng Zhang, Wenwu Ou, and Peng Jiang. 2019. A pareto-efficient algorithm for multiple objective optimization in e-commerce recommendation. In Proceedings of the 13th ACM Conference on recommender systems. 20–28.
  • Lin et al. (2022) Zihan Lin, Hui Wang, **gshu Mao, Wayne Xin Zhao, Cheng Wang, Peng Jiang, and Ji-Rong Wen. 2022. Feature-aware Diversified Re-ranking with Disentangled Representations for Relevant Recommendation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3327–3335.
  • Liu and Yang (2019) Dong Liu and Chenyang Yang. 2019. A deep reinforcement learning approach to proactive content pushing and recommendation for mobile users. IEEE Access 7 (2019), 83120–83136.
  • Liu et al. (2009) Tie-Yan Liu et al. 2009. Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval 3, 3 (2009), 225–331.
  • Liu et al. (2021) Yongshuai Liu, Avishai Halev, and Xin Liu. 2021. Policy learning with constraints in model-free reinforcement learning: A survey. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence.
  • Ma et al. (2020) Jiaqi Ma, Zhe Zhao, Xinyang Yi, Ji Yang, Minmin Chen, Jiaxi Tang, Lichan Hong, and Ed H Chi. 2020. Off-policy learning in two-stage recommender systems. In Proceedings of The Web Conference 2020. 463–473.
  • Mossalam et al. (2016) Hossam Mossalam, Yannis M Assael, Diederik M Roijers, and Shimon Whiteson. 2016. Multi-objective deep reinforcement learning. arXiv preprint arXiv:1610.02707 (2016).
  • Nair et al. (2020) Ashvin Nair, Murtaza Dalal, Abhishek Gupta, and Sergey Levine. 2020. AWAC: Accelerating Online Reinforcement Learning with Offline Datasets. (2020).
  • Nemati et al. (2016) Shamim Nemati, Mohammad M Ghassemi, and Gari D Clifford. 2016. Optimal medication dosing from suboptimal clinical examples: A deep reinforcement learning approach. In 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2978–2981.
  • Nguyen et al. (2020) Thanh Thi Nguyen, Ngoc Duy Nguyen, Peter Vamplew, Saeid Nahavandi, Richard Dazeley, and Chee Peng Lim. 2020. A multi-objective deep reinforcement learning framework. Engineering Applications of Artificial Intelligence 96 (2020), 103915.
  • Pi et al. (2020) Qi Pi, Guorui Zhou, Yu**g Zhang, Zhe Wang, Lejian Ren, Ying Fan, Xiaoqiang Zhu, and Kun Gai. 2020. Search-based user interest modeling with lifelong sequential behavior data for click-through rate prediction. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 2685–2692.
  • Precup (2000) Doina Precup. 2000. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series (2000), 80.
  • Precup et al. (2001) Doina Precup, Richard S Sutton, and Sanjoy Dasgupta. 2001. Off-policy temporal-difference learning with function approximation. In ICML. 417–424.
  • Sener and Koltun (2018) Ozan Sener and Vladlen Koltun. 2018. Multi-task learning as multi-objective optimization. arXiv preprint arXiv:1810.04650 (2018).
  • Stamenkovic et al. (2021) Dusan Stamenkovic, Alexandros Karatzoglou, Ioannis Arapakis, Xin Xin, and Kleomenis Katevas. 2021. Choosing the Best of Both Worlds: Diverse and Novel Recommendations through Multi-Objective Reinforcement Learning. arXiv preprint arXiv:2110.15097 (2021).
  • Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.
  • Tessler et al. (2018) Chen Tessler, Daniel J Mankowitz, and Shie Mannor. 2018. Reward constrained policy optimization. arXiv preprint arXiv:1805.11074 (2018).
  • Wang et al. (2022a) Jiayin Wang, Weizhi Ma, Jiayu Li, Hongyu Lu, Min Zhang, Biao Li, Yiqun Liu, Peng Jiang, and Shao** Ma. 2022a. Make Fairness More Fair: Fair Item Utility Estimation and Exposure Re-Distribution. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1868–1877.
  • Wang et al. (2022b) Yuyan Wang, Mohit Sharma, Can Xu, Sriraj Badam, Qian Sun, Lee Richardson, Lisa Chung, Ed H Chi, and Minmin Chen. 2022b. Surrogate for Long-Term User Experience in Recommender Systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4100–4109.
  • White et al. (1980) C Ch White, CC III WHITE, and KIM KW. 1980. Solution procedures for vector criterion Markov decision processes. (1980).
  • Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Reinforcement learning (1992), 5–32.
  • Xian et al. (2019) Yikun Xian, Zuohui Fu, Shan Muthukrishnan, Gerard De Melo, and Yongfeng Zhang. 2019. Reinforcement knowledge graph reasoning for explainable recommendation. In Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval. 285–294.
  • Xin et al. (2022) Xin Xin, Alexandros Karatzoglou, Ioannis Arapakis, and Joemon M Jose. 2022. Supervised Advantage Actor-Critic for Recommender Systems. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 1186–1196.
  • Zhan et al. (2022) Ruohan Zhan, Changhua Pei, Qiang Su, Jianfeng Wen, Xueliang Wang, Guanyu Mu, Dong Zheng, Peng Jiang, and Kun Gai. 2022. Deconfounding Duration Bias in Watch-time Prediction for Video Recommendation. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4472–4481.
  • Zhao et al. (2018) Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Long Xia, Jiliang Tang, and Dawei Yin. 2018. Recommendations with negative feedback via pairwise deep reinforcement learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1040–1048.
  • Zhao et al. (2017) Xiangyu Zhao, Liang Zhang, Long Xia, Zhuoye Ding, Dawei Yin, and Jiliang Tang. 2017. Deep reinforcement learning for list-wise recommendations. arXiv preprint arXiv:1801.00209 (2017).
  • Zou et al. (2019) Lixin Zou, Long Xia, Zhuoye Ding, Jiaxing Song, Weidong Liu, and Dawei Yin. 2019. Reinforcement learning to optimize long-term user engagement in recommender systems. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2810–2818.

Appendix A Proof of Theorem 1

Theorem 1.

The Lagrangian of Eq. (5) has the closed form solution

$$\pi^{*}(a|s)\propto\prod_{i=2}^{m}\big(\pi_{\theta_{i}}(a|s)\big)^{\frac{\lambda_{i}}{\sum_{j=2}^{m}\lambda_{j}}}\exp\bigg(\frac{A_{1}^{(k)}}{\sum_{j=2}^{m}\lambda_{j}}\bigg), \tag{12}$$

where $\lambda_{i}$ with $i=2,\dots,m$ are Lagrangian multipliers.

Proof.

The Lagrangian of Eq. (5) is

$$\mathcal{L}(\pi,\lambda_{2},\dots,\lambda_{m})=-E_{\pi}[A_{1}^{(k)}]+\sum_{i=2}^{m}\lambda_{i}\big(D_{KL}(\pi||\pi_{\theta_{i}})-\epsilon_{i}\big). \tag{13}$$

Compute the gradient of $\mathcal{L}(\pi,\lambda_{2},\dots,\lambda_{m})$ with respect to $\pi$,

$$\frac{\partial\mathcal{L}}{\partial\pi}=-A_{1}^{(k)}+\sum_{i=2}^{m}\lambda_{i}\big(1+\log(\pi(a|s))-\log(\pi_{\theta_{i}}(a|s))\big). \tag{14}$$

Setting $\frac{\partial\mathcal{L}}{\partial\pi}=0$, we find that the solution $\pi^{*}$ satisfies

$$\pi^{*}(a|s)=\frac{1}{Z(s)}\prod_{i=2}^{m}\big(\pi_{\theta_{i}}(a|s)\big)^{\frac{\lambda_{i}}{\sum_{j=2}^{m}\lambda_{j}}}\exp\bigg(\frac{A_{1}^{(k)}}{\sum_{j=2}^{m}\lambda_{j}}\bigg), \tag{15}$$

where $Z(s)$ is the partition function such that $\int_{a}\pi^{*}(a|s)=1$. ∎
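
The algebra between the stationarity condition and the closed form in the proof above can be spelled out as follows. This is a sketch of the intermediate steps; the shorthand $\Lambda=\sum_{j=2}^{m}\lambda_{j}$ is introduced here for readability and is assumed to be positive.

```latex
\begin{aligned}
0 &= -A_{1}^{(k)}+\sum_{i=2}^{m}\lambda_{i}\big(1+\log\pi^{*}(a|s)-\log\pi_{\theta_{i}}(a|s)\big), \\
\Lambda\,\log\pi^{*}(a|s) &= A_{1}^{(k)}-\Lambda+\sum_{i=2}^{m}\lambda_{i}\log\pi_{\theta_{i}}(a|s), \\
\pi^{*}(a|s) &= e^{-1}\prod_{i=2}^{m}\big(\pi_{\theta_{i}}(a|s)\big)^{\lambda_{i}/\Lambda}\exp\bigg(\frac{A_{1}^{(k)}}{\Lambda}\bigg).
\end{aligned}
```

The factor $e^{-1}$ does not depend on $a$, so it is absorbed into the partition function $Z(s)$. With a single auxiliary policy ($m=2$), as in the live experiments, the result reduces to $\pi^{*}(a|s)\propto\pi_{\theta_{2}}(a|s)\exp\big(A_{1}^{(k)}/\lambda_{2}\big)$.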

Appendix B The Two-Stage Constrained Actor-Critic Algorithm

Stage One: For each auxiliary response $i=2,\dots,m$, learn a policy to optimize the response $i$, with $\pi_{\theta_{i}}$ denoting the actor and $V_{\phi_{i}}$ the critic. While not converged, at iteration $k$:
$$\begin{split}\phi_{i}^{(k+1)}\leftarrow&\arg\min_{\phi}E_{\pi_{\theta_{i}^{(k)}}}\Big[\big(r_{i}(s,a)+\gamma_{i}V_{\phi_{i}^{(k)}}(s^{\prime})-V_{\phi}(s)\big)^{2}\Big],\\ \theta_{i}^{(k+1)}\leftarrow&\arg\max_{\theta}E_{\pi_{\theta_{i}^{(k)}}}\Big[A_{i}^{(k)}\log\big(\pi_{\theta}(a|s)\big)\Big].\end{split}$$
Stage Two: For the main response, learn a policy to both optimize the main response and restrict its domain close to the policies $\{\pi_{\theta_{i}}\}_{i=2}^{m}$ of the auxiliary responses, with $\pi_{\theta_{1}}$ denoting the actor and $V_{\phi_{1}}$ the critic. While not converged, at iteration $k$:
$$\begin{split}\phi_{1}^{(k+1)}\leftarrow&\arg\min_{\phi}E_{\pi_{\theta_{1}^{(k)}}}\Big[\big(r_{1}(s,a)+\gamma_{1}V_{\phi_{1}^{(k)}}(s^{\prime})-V_{\phi}(s)\big)^{2}\Big],\\ \theta_{1}^{(k+1)}\leftarrow&\arg\max_{\theta}E_{\pi_{\theta_{1}^{(k)}}}\Big[\frac{\prod_{i=2}^{m}\Big(\pi_{\theta_{i}}(a|s)\Big)^{\frac{\lambda_{i}}{\sum_{j=2}^{m}\lambda_{j}}}}{\pi_{\theta_{1}^{(k)}}(a|s)}\\ &\qquad\quad\quad\quad\times\exp\bigg(\frac{A_{1}^{(k)}}{\sum_{j=2}^{m}\lambda_{j}}\bigg)\log\pi_{\theta}(a|s)\Big].\end{split}$$
Output: the constrained policy $\pi_{1}$.
Algorithm 1 Two-Stage Constrained Actor-Critic (TSCAC)
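
The stage-two actor update of Algorithm 1 amounts to a weighted log-likelihood objective, and the weights can be sketched in NumPy as below. This is an illustration under assumptions, not the released implementation: the function names, array shapes, and toy numbers are hypothetical, and the weights are computed in log space only for numerical convenience.

```python
import numpy as np


def stage_two_weights(aux_log_probs, behavior_log_prob, advantage, lambdas):
    """Per-sample weight inside the stage-two actor objective.

    aux_log_probs:     shape (m-1, batch), log pi_{theta_i}(a|s) for i = 2..m
    behavior_log_prob: shape (batch,), log pi_{theta_1^{(k)}}(a|s)
    advantage:         shape (batch,), A_1^{(k)}
    lambdas:           shape (m-1,), multipliers lambda_2..lambda_m
    """
    lambdas = np.asarray(lambdas, dtype=float)
    lam_sum = lambdas.sum()
    log_weight = (
        (lambdas[:, None] * aux_log_probs).sum(axis=0) / lam_sum  # product of powers
        - behavior_log_prob                                       # importance ratio
        + advantage / lam_sum                                     # exp(A / sum of lambdas)
    )
    return np.exp(log_weight)


def stage_two_actor_loss(weights, new_log_prob):
    """Negative weighted log-likelihood; minimizing it maximizes the objective."""
    return -np.mean(weights * new_log_prob)


aux = np.log(np.array([[0.5, 0.25]]))       # one auxiliary policy (m = 2)
behavior = np.log(np.array([0.5, 0.25]))    # current main policy, same probabilities
adv = np.array([2.0 * np.log(2.0), 0.0])    # hypothetical advantages
weights = stage_two_weights(aux, behavior, adv, lambdas=[2.0])
loss = stage_two_actor_loss(weights, behavior)
```

In a gradient-based implementation the weights would typically be treated as constants, so that only $\log\pi_{\theta}(a|s)$ carries gradient; that detail is an assumption here.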

Appendix C Offline Experiments on TripAdvisor

For our code, please refer to https://github.com/AIDefender/TSCAC.

Table 4. Performance of different algorithms on TripAdvisor
Algorithms         Service  Business  Cleanliness  Check-in  Value  Rooms  Location  Overall Rating
BC                 3.38     -1.86     3.57         -0.73     3.32   2.92   2.93*     3.92
Wide&Deep          3.41*    -1.86     3.62*        -0.75     3.36*  2.96   2.88      3.98
RCPO               3.40     -1.82     3.61         -0.71     3.34   2.95   2.91      3.97
RCPO-Multi-Critic  3.41*    -1.82     3.62*        -0.68     3.35   2.97*  2.87      3.99*
Pareto             3.36     -1.79*    3.57         -0.62*    3.29   2.93   2.86      3.95
TSCAC              3.43     -1.82     3.64         -0.68     3.37   3.00   2.98      3.99

Results marked with * denote the best performance among all baseline methods in each response dimension; the data in the last row are marked in bold when TSCAC achieves the best performance.

In this section, we evaluate our approach on another public dataset via extensive offline learning simulations. We demonstrate the effectiveness of our approach as compared to existing baselines in both achieving the main goal and balancing the auxiliaries.

Dataset

We consider a hotel-review dataset named TripAdvisor, a standard dataset for studying policy optimization in recommender systems with multiple responses (Chen et al., 2021). In this dataset, customers not only provide an overall rating for hotels but also score hotels in multiple aspects including service, business, cleanliness, check-in, value, rooms, and location (Alam et al., 2016). Because the dataset contains both the main objective and other responses, it can also be used to evaluate constrained policy optimization in recommender systems. Reviews provided by the same user are concatenated chronologically to form a trajectory; we filter trajectories with length smaller than 20. In total, we have 20277 customers, 150 hotels, and 257932 reviews.

MDP

A trajectory tracks a customer's hotel-reviewing history. For each review, we have state $s_t$: the customer ID and the last three reviewed hotel IDs together with the corresponding multi-aspect review scores; action $a_t$: the currently reviewed hotel ID; reward $r_t$: a vector of the eight scores the customer provided for the reviewed hotel (service, business, cleanliness, check-in, value, rooms, location, and overall rating); discount factor $\gamma$: 0.99. We set the main goal to be maximizing the cumulative overall rating and treat the others as auxiliaries.
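
The MDP above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' preprocessing code: the names `Transition`, `build_transitions`, and `discounted_returns` are hypothetical, and skipping the first three reviews of a trajectory (instead of padding the history) is a choice made only for this sketch.

```python
from dataclasses import dataclass
import numpy as np

GAMMA = 0.99  # discount factor stated in the text

@dataclass
class Transition:
    state: tuple        # (customer ID, last three hotel IDs, their score vectors)
    action: int         # currently reviewed hotel ID
    reward: np.ndarray  # eight scores given to the reviewed hotel

def build_transitions(customer_id, reviews, history=3):
    """Turn one customer's reviews into transitions.

    reviews: chronologically ordered list of (hotel_id, scores) pairs,
             where scores holds the eight aspect scores.
    Assumption of this sketch: steps without a full history are skipped.
    """
    transitions = []
    for t in range(history, len(reviews)):
        past = reviews[t - history:t]
        state = (customer_id,
                 tuple(hotel for hotel, _ in past),
                 tuple(tuple(scores) for _, scores in past))
        hotel_id, scores = reviews[t]
        transitions.append(
            Transition(state, hotel_id, np.asarray(scores, dtype=float)))
    return transitions

def discounted_returns(rewards, gamma=GAMMA):
    """Per-dimension discounted return sum_t gamma**t * r_t for a (T, 8) array."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return discounts @ rewards
```

The last entry of the returned vector corresponds to the overall rating (the main goal) if the scores are stored in the order listed in the text; the other seven entries are the auxiliary returns.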

Performance

Table 4 presents the results of different algorithms in terms of the eight scores. Our TSCAC algorithm performs the best among all algorithms: for the main goal, TSCAC achieves the highest overall rating of 3.99; for the auxiliary goals, TSCAC also ranks highest on 5 out of 7 scores (service, cleanliness, value, rooms, location). The Pareto algorithm indeed learns a Pareto-optimal solution that achieves the best performance on the check-in and business scores, but this does not fit the setting here, where the main goal is to optimize the overall rating. RCPO-Multi-Critic outperforms RCPO on 6 scores, which validates the effect of multi-critic estimation compared with joint-critic estimation. RCPO-Multi-Critic achieves the same best overall rating as our approach, but it sacrifices much on the other scores; in particular, its location score is even lower than that of the BC algorithm.
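
The counts quoted above can be checked directly against the values reported in Table 4. The short script below is only a reader-side sanity check; the helper names `columns_where_best` and `count_wins` are hypothetical, and it assumes that a higher score is better in every column and that ties count as best.

```python
COLUMNS = ["Service", "Business", "Cleanliness", "Check-in",
           "Value", "Rooms", "Location", "Overall Rating"]

# Scores copied from Table 4.
TABLE4 = {
    "BC":                [3.38, -1.86, 3.57, -0.73, 3.32, 2.92, 2.93, 3.92],
    "Wide&Deep":         [3.41, -1.86, 3.62, -0.75, 3.36, 2.96, 2.88, 3.98],
    "RCPO":              [3.40, -1.82, 3.61, -0.71, 3.34, 2.95, 2.91, 3.97],
    "RCPO-Multi-Critic": [3.41, -1.82, 3.62, -0.68, 3.35, 2.97, 2.87, 3.99],
    "Pareto":            [3.36, -1.79, 3.57, -0.62, 3.29, 2.93, 2.86, 3.95],
    "TSCAC":             [3.43, -1.82, 3.64, -0.68, 3.37, 3.00, 2.98, 3.99],
}

def columns_where_best(name):
    """Columns in which `name` attains the highest score (ties count as best)."""
    return [col for i, col in enumerate(COLUMNS)
            if TABLE4[name][i] >= max(row[i] for row in TABLE4.values())]

def count_wins(a, b):
    """Number of columns in which algorithm `a` strictly beats algorithm `b`."""
    return sum(x > y for x, y in zip(TABLE4[a], TABLE4[b]))

print(columns_where_best("TSCAC"))
print(count_wins("RCPO-Multi-Critic", "RCPO"))
```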

Appendix D Offline Evaluation in terms of other metrics

We also evaluate the performance of TSCAC and baseline methods in terms of the Discounted Cumulative Gain (DCG) measure, as shown in Table 5.

Table 5. Performance of different algorithms in terms of DCG.

| Algorithm | Click↑ (e+2) | Like↑ | Comment↑ | Hate↓ (e-2) | WatchTime↑ (e+3) |
|---|---|---|---|---|---|
| BC | 4.617 | 8.492 | 2.320 | 8.137 | 1.079 |
| Wide&Deep | 4.576 | 8.158 | 2.274 | 7.503 | 1.073 |
| | -0.88% | -3.93% | -1.97% | -7.79%* | -0.56% |
| DeepFM | 4.594 | 8.439 | 2.274 | 8.334 | 1.083 |
| | -0.50% | -0.63% | -1.97% | 2.43% | 0.41%* |
| RCPO | 4.603 | 8.537 | 2.282 | 8.336 | 1.080 |
| | -0.30% | 0.53%* | -1.60% | 2.45% | 0.12% |
| RCPO-Multi-Critic | 4.569 | 8.168 | 2.272 | 7.538 | 1.069 |
| | -1.03% | -3.81% | -2.09% | -7.36% | -0.88% |
| Pareto | 4.605 | 8.494 | 2.304 | 8.449 | 1.074 |
| | -0.26%* | 0.02% | -0.69%* | 3.83% | -0.48% |
| TSCAC | **4.617** | **8.652** | **2.378** | 8.036 | 1.083 |
| | 0.01% | **1.88%** | **2.49%** | -1.24% | 0.39% |

↑: higher is better; ↓: lower is better.
The number in brackets is the unit of the column. The first row for each algorithm is the DCG score.
The percentage in the second row is the performance gap between the algorithm and the BC algorithm.
Numbers marked with * denote the best performance among all baseline methods in each response dimension.
The last row is marked in bold when TSCAC achieves the best performance in each response dimension.
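
For readers unfamiliar with the metric, the sketch below shows the standard form of Discounted Cumulative Gain and the relative gap to a baseline used in the second row of each algorithm. The exact DCG variant used in the evaluation is not spelled out in this appendix, so the logarithmic discount here is an assumption, and the helper names `dcg` and `relative_gap` are hypothetical.

```python
import numpy as np

def dcg(gains):
    """Standard DCG: sum_i gains[i] / log2(i + 2), with i the 0-based rank.

    Assumption: the common log2 position discount; the paper's exact
    variant may differ.
    """
    gains = np.asarray(gains, dtype=float)
    ranks = np.arange(len(gains))
    return float(np.sum(gains / np.log2(ranks + 2)))

def relative_gap(score, baseline):
    """Percentage gap to the baseline: (score - baseline) / baseline * 100."""
    return (score - baseline) / baseline * 100.0

# Example with the Like column of Table 5: TSCAC (8.652) against BC (8.492).
print(round(relative_gap(8.652, 8.492), 2))
```

Gaps recomputed from the rounded scores in Table 5 can differ from the reported percentages in the last digit, presumably because the reported values were computed from unrounded scores.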