License: arXiv.org perpetual non-exclusive license
arXiv:2303.11959v2 [q-fin.TR] 22 Dec 2023

Optimizing Trading Strategies in Quantitative Markets using Multi-Agent Reinforcement Learning

Abstract

Quantitative markets are characterized by swift dynamics and abundant uncertainties, making the pursuit of profit-driven stock trading actions inherently challenging. Within this context, Reinforcement Learning (RL) — which operates on a reward-centric mechanism for optimal control — has surfaced as a potentially effective solution to the intricate financial decision-making conundrums presented. This paper delves into the fusion of two established financial trading strategies, namely the constant proportion portfolio insurance (CPPI) and the time-invariant portfolio protection (TIPP), with the multi-agent deep deterministic policy gradient (MADDPG) framework. As a result, we introduce two novel multi-agent RL (MARL) methods: CPPI-MADDPG and TIPP-MADDPG, tailored for probing strategic trading within quantitative markets. To validate these innovations, we implemented them on a diverse selection of 100 real-market shares. Our empirical findings reveal that the CPPI-MADDPG and TIPP-MADDPG strategies consistently outpace their traditional counterparts, affirming their efficacy in the realm of quantitative trading.

Index Terms—  Quantitative trading, multi-agent reinforcement learning, constant proportion portfolio insurance, time-invariant portfolio protection

1 Introduction

Compared with traditional trading methods, quantitative trading is widely known for its high-frequency, algorithmic, and automated nature, which is difficult for human traders to achieve in such a complex and dynamic stock market [1, 2]. In the quantitative market, massive noisy signals from stochastic trading behaviors and all kinds of unforeseeable social events make predicting the market state difficult [3, 4]. Human traders are also easily affected by these events and by their own physical and psychological conditions, which makes irrational trading decisions nearly inevitable [5, 6]. Therefore, individuals and institutions from different research fields have started to explore more effective ways of handling these problems.

Over the past years, with the development of artificial intelligence techniques, reinforcement learning (RL) has emerged as an efficient method for making decisions in dynamic environments with uncertainties [7]. RL is formalized through the Markov decision process (MDP). By interacting with the environment, the RL agent, i.e., the decision maker, iteratively updates its strategy according to the rewards, which serve as guidance toward the expected target; the goal of the RL agent is hence to maximize the total reward [8]. Following the MDP formulation, researchers in finance have built specifically designed RL architectures to cope with different financial problems. A deep RL method combined with knowledge distillation was proposed to improve the training reliability in the trading of currency pairs [9]. To investigate the stock portfolio selection problem, a hypergraph-based RL method was designed to learn the policy function of generating appropriate trading actions [10]. In addition, a policy-based RL framework for stock portfolio management was introduced and its performance was compared with other trading strategies [11].

Meanwhile, instead of focusing only on a single centralized agent that interacts with the environment, researchers have found that many scenarios, such as multi-robot control and multi-player games, are better described as multi-agent systems (MAS), in which more than one agent is involved [12]. The framework of multi-agent RL (MARL) was hence established to handle decision-making and optimal control problems with multiple agents inside a common environment [13]. Similar to RL, MARL also concentrates on sequential decision-making, in which each agent takes actions according to its own policy, which can naturally be represented by neural networks [14]. However, the situation becomes more complex since the dynamics of each agent are also perceived by the others as changes in the environment [15]. To manage a portfolio under continuous market changes, a MARL-based system was proposed to maximize the return [16]. Similarly, an MAS stock market simulator was designed to assess market activity and reproduce market metrics [17].

In this work, we integrate an MARL approach named multi-agent deep deterministic policy gradient (MADDPG) with two established trading strategies, namely constant proportion portfolio insurance (CPPI) and time-invariant portfolio protection (TIPP), to study how this MAS architecture behaves in quantitative markets. The rest of this work is organized as follows. Section 2 introduces the system model, in which we discuss the MAS model and present how MADDPG is combined with the CPPI and TIPP strategies, respectively. The numerical experiments and results are presented and analyzed in Section 3. Finally, conclusions are drawn in Section 4.

2 System model

In this section, we map the problem of strategic trading in quantitative markets into a MARL task. The principle of MADDPG is introduced first, and then we integrate the MADDPG approach into CPPI and TIPP strategies, respectively.

2.1 Framework of MARL for strategic trading

A sequential decision-making problem in the multi-agent scenario can be described as a stochastic game, which can be defined by a set of key elements $\langle N,\mathbb{S},\{\mathbb{A}^{i}\}_{i\in\{1,\cdots,N\}},P,\{R^{i}\}_{i\in\{1,\cdots,N\}},\gamma\rangle$, where $N$ is the number of agents. At time $t$, under a shared state $S_{t}\in\mathbb{S}$, each agent takes its action $a^{i}\in\mathbb{A}^{i}$ simultaneously. The joint action $\mathbf{a}=\{a^{1},\cdots,a^{N}\}$ causes the environment to change according to the dynamics $P:\mathbb{S}\times\boldsymbol{\mathbb{A}}\rightarrow\Delta(\mathbb{S})$, where $\boldsymbol{\mathbb{A}}:=\mathbb{A}^{1}\times\cdots\times\mathbb{A}^{N}$. After that, each agent receives its individual reward $r^{i}$ according to its reward function $R^{i}:\mathbb{S}\times\boldsymbol{\mathbb{A}}\times\mathbb{S}\rightarrow\mathbb{R}$. $\gamma$ is the discount factor that represents the value of time.

In our model, $N$ agents employ diverse strategies to trade in quantitative markets. They aim to optimize their returns while ensuring portfolio diversity, thereby distributing risks among all agents. At every step, agents observe a shared state $\bm{s}=\{s_{1},\cdots,s_{N}\}$, where the individual state $s=[p,h,b]$ is a vector that includes the prices of $D$ stocks $p\in\mathbb{R}_{+}^{D}$, the shares held $h\in\mathbb{Z}_{+}^{D}$, and the remaining balance $b\in\mathbb{R}_{+}^{D}$. Then each agent determines an action $a$, which is a vector representing the weight of each stock in the portfolio, according to its policy. After taking the joint action, the number of shares of each agent is modified and their portfolios are updated. Each agent then receives a reward based on the asset value change.
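As a rough illustration of the state and reward described above, the following Python sketch builds one agent's state $s=[p,h,b]$ and computes a reward as the change in total asset value. The names `AgentState` and `reward`, the use of a single scalar cash balance, and the omission of transaction costs are assumptions made for illustration; they are not taken from the authors' implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AgentState:
    """One agent's state s = [p, h, b] (hypothetical container)."""
    prices: List[float]   # p: current price of each of the D stocks
    holdings: List[int]   # h: number of shares held of each stock
    balance: float        # b: remaining cash (a scalar here, by assumption)

    def total_asset(self) -> float:
        """Cash plus the market value of all held shares."""
        return self.balance + sum(p * h for p, h in zip(self.prices, self.holdings))


def reward(prev: AgentState, curr: AgentState) -> float:
    """Reward as the change in total asset value between two steps."""
    return curr.total_asset() - prev.total_asset()


before = AgentState(prices=[10.0, 20.0], holdings=[5, 2], balance=100.0)
after = AgentState(prices=[11.0, 19.0], holdings=[5, 2], balance=100.0)
print(reward(before, after))  # 5 * 1.0 + 2 * (-1.0) = 3.0
```

With unchanged holdings, the reward is driven purely by price moves; a fuller environment would also update `holdings` and `balance` from the chosen portfolio weights.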

2.2 Strategies of CPPI and TIPP

Constant Proportion Portfolio Insurance (CPPI) is a type of portfolio insurance in which the investor sets a floor based on their asset, then structures asset allocation around the trading decision [18]. As shown in Fig. 1, the total asset $A$ is separated into two parts, the protection floor $F$ and the cushion $C$, in which the floor $F$ is the minimum guarantee used for protecting the basis of the total asset and a multiple of the cushion, $k*C$, is used as the risky asset $E$,

$$E=k*C=k*(A-F),\tag{1}$$

where the risk factor $k$ measures the level of risk, and a higher value denotes a more aggressive trading strategy. By comparison, the Time-Invariant Portfolio Protection (TIPP) strategy is a variation of CPPI in which the protection floor $F_{t}$ at time step $t$ is not fixed but changes over time according to a percentage of the total asset $A$ and the previous floor $F_{t-1}$,

$$\begin{split}F_{t}&=\max\{\phi A_{t},F_{t-1}\},\\ E_{t}&=k(A_{t}-F_{t}),\end{split}\tag{2}$$

where $\phi$ is the floor percentage. As the total asset $A$ increases, the guaranteed amount rises accordingly, while the guarantee remains unchanged if the portfolio value falls.
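To make the difference between the two floors concrete, the following Python sketch evaluates Eq. (1) and Eq. (2) on a short asset path. The function names and the numerical values ($k=2$, $\phi=0.75$, an initial floor of 75) are illustrative assumptions and are not the settings used in the experiments.

```python
from typing import Tuple


def cppi_exposure(asset: float, floor: float, k: float) -> float:
    """Eq. (1): risky exposure E = k * (A - F) with a fixed floor F."""
    return k * (asset - floor)


def tipp_step(asset: float, prev_floor: float, phi: float, k: float) -> Tuple[float, float]:
    """Eq. (2): ratchet the floor upward, then compute the risky exposure."""
    floor = max(phi * asset, prev_floor)
    return floor, k * (asset - floor)


# Illustrative asset path: the portfolio rises to 120 and then falls back to 110.
floor = 75.0
for asset in [100.0, 120.0, 110.0]:
    floor, e_tipp = tipp_step(asset, floor, phi=0.75, k=2.0)
    e_cppi = cppi_exposure(asset, 75.0, k=2.0)
    print(asset, floor, e_tipp, e_cppi)
```

In this example the TIPP floor rises from 75 to 90 when the asset reaches 120 and stays at 90 after the fall to 110, so the TIPP exposure (40) is smaller than the CPPI exposure (70) computed with the fixed floor.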

Fig. 1: The principle of CPPI strategy for agent $i$.

2.3 Multi-Agent Deep Deterministic Policy Gradient with Insurance Strategy

We adopt MADDPG [19] to train our agents. The approach is tailored to the quantitative trading scenario, as presented in Fig. 2. MADDPG is a multi-agent version of the actor-critic method for continuous action sets with a deterministic policy, where each agent has an actor network $\pi_{\theta}$ parameterized by $\theta$ and a critic network $Q_{\phi}$ parameterized by $\phi$. Each agent learns its optimal policy by updating the parameters of its policy network to directly maximize the objective function, i.e., the cumulative discounted return, $J(\theta_{i})=\mathbb{E}_{s\sim P,\bm{a}\sim\bm{\pi}_{\theta}}[\sum_{t\geq 0}\gamma^{t}R_{t}^{i}]$, and the update direction for agent $i$ is given by the gradient of the cumulative discounted return,

$$\nabla_{\theta_{i}}J(\theta_{i})=\mathbb{E}_{s\sim\mathcal{D}}\bigg[\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a_{i}|s)\cdot\nabla_{a_{i}}Q_{\phi_{i}}(s,a_{1},\cdots,a_{N})\big|_{a_{i}=\pi_{\theta_{i}}(s)}\bigg],\tag{3}$$

where $\mathcal{D}$ is the experience replay buffer containing tuples $(s,s^{\prime},a_{1},\cdots,a_{N},r_{1},\cdots,r_{N})$ that are stored throughout training. To combine the policy with the insurance strategy, we use CPPI or TIPP to adjust the output actions instead of using raw outputs as the actions. The feasibility of such an adjustment also benefits from the off-policy nature of MADDPG.
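The text does not spell out how the insurance strategy adjusts the raw actions. One plausible reading, sketched below in Python, is to cap the total weight placed on risky stocks at the fraction $E/A$ allowed by CPPI or TIPP and to rescale the raw weights proportionally when the cap is exceeded. The function `adjust_action`, the proportional rescaling rule, and the clamping of the cap to $[0,1]$ are all assumptions for illustration, not the authors' stated procedure.

```python
from typing import List


def adjust_action(raw_weights: List[float], asset: float, exposure: float) -> List[float]:
    """Scale stock weights so that their sum does not exceed exposure / asset.

    Assumes asset > 0 and non-negative raw weights; any weight that is
    removed is implicitly held as cash.
    """
    cap = min(max(exposure / asset, 0.0), 1.0)  # allowed risky fraction in [0, 1]
    total = sum(raw_weights)
    if total <= cap:
        return list(raw_weights)                # constraint already satisfied
    scale = cap / total
    return [w * scale for w in raw_weights]


# Raw actor output invests everything; the insurance rule allows only 50% at risk.
print(adjust_action([0.5, 0.3, 0.2], asset=100.0, exposure=50.0))  # [0.25, 0.15, 0.1]
```

Because the adjusted action, not the raw actor output, is what gets executed and stored, the learning update relies on replayed transitions, which is where the off-policy property mentioned above becomes useful.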

The centralized critic networks are updated by approximating the true action-value function using temporal-difference learning. Further, in order to avoid all agents moving towards the same strategy, we set the correlation between agents as part of the loss function to achieve the purpose of portfolio selection. The loss function for agent i𝑖iitalic_i can be expressed as,

$$\begin{split}\mathcal{L}(\phi_{i})&=\lambda\,\mathbb{E}_{s,\bm{a},\bm{r},s^{\prime}}\bigg[(Q_{\phi_{i}}(s,a_{1},\cdots,a_{N})-y)^{2}\bigg]+(1-\lambda)\sum_{i=1,i\neq j}^{K}\mathrm{Corr}(a_{i},a_{j})^{2},\\ y&=r_{i}+\gamma Q_{\phi^{\prime}_{i}}(s^{\prime},a^{\prime}_{1},\cdots,a^{\prime}_{N})\big|_{a^{\prime}_{j}=\pi_{\theta^{\prime}_{i}}(s)},\end{split}\tag{4}$$

where $Q_{\phi^{\prime}_{i}}$ and $\pi_{\theta^{\prime}_{i}}$ are target networks with delayed parameters $\phi^{\prime}_{i}$ and $\theta^{\prime}_{i}$, $a_{i}$ is the action vector showing the positional confidence vector of agent $i$ under the restriction of CPPI or TIPP, and $\lambda$ is the hyperparameter controlling the balance between the two terms. The pseudocode of the proposed algorithm is shown in Alg. 1. Notice that the rule-based policy $\Phi$ here can be either CPPI or TIPP, which outputs the constraints of the actions under the insurance strategy.
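The structure of the loss in Eq. (4) can be sketched in plain Python as a weighted sum of a mean squared TD error and a squared-correlation penalty. The sketch reads the penalty as a sum over the other agents $j\neq i$ and uses the Pearson correlation between action vectors; both choices, as well as the function names and the toy numbers, are assumptions for illustration. In practice the Q-values would come from neural networks rather than fixed lists.

```python
from math import sqrt
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / sqrt(vx * vy)


def td_target(r_i: float, gamma: float, q_next: float) -> float:
    """y = r_i + gamma * Q_target(s', a'_1, ..., a'_N)."""
    return r_i + gamma * q_next


def critic_loss(q_values: List[float], targets: List[float],
                actions: List[List[float]], i: int, lam: float) -> float:
    """lam * mean squared TD error + (1 - lam) * sum over j != i of Corr(a_i, a_j)^2."""
    mse = sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)
    penalty = sum(pearson(actions[i], a_j) ** 2
                  for j, a_j in enumerate(actions) if j != i)
    return lam * mse + (1.0 - lam) * penalty


actions = [
    [0.5, 0.25, 0.25, 0.0],    # agent 0
    [0.0, 0.25, 0.25, 0.5],    # agent 1: mirror image of agent 0
    [0.25, 0.25, 0.25, 0.25],  # agent 2: uniform allocation
]
print(critic_loss([1.0, 2.0], [1.5, 2.5], actions, i=0, lam=0.5))  # 0.625
```

Squaring the correlation penalizes strongly negative as well as strongly positive correlation, so agents 0 and 1 above incur the maximum penalty even though their allocations are opposite.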

3 Experiments

We conduct the experiments on a real-world stock trading environment provided by FinRL [20]. We compare the performance of our proposed CPPI-MADDPG and TIPP-MADDPG against MADDPG, MADQN, and the Universal Portfolio (UP). The experimental results show that combining MADDPG with an insurance strategy provides a substantial advantage in the realm of quantitative trading.

3.1 Dataset and Settings

In our experiment, we selected 100 stocks listed on the Shenzhen Stock Exchange, sourced via Tushare. These stocks have codes ranging from 000010.SZ to 300813.SZ. When combined with cash as a risk-free asset, the potential array of investment products expands significantly. For training, we utilized data spanning from January 1st, 2018 to December 31st, 2020. The testing set comprises data from January 1st, 2021 to December 31st, 2021. To align our analysis with real-world market conditions, constraints were applied to the data, including non-negative balance maintenance and the incorporation of transaction costs. We initialized with a set cash amount, aiming to maximize profits using the aforementioned trading strategies.

Fig. 2: A schematic of MADDPG in the quantitative market environment.
Algorithm 1 MADDPG with insurance strategy
1: Initialize the rule-based policy $\Phi$ according to the insurance strategy.
2: Initialize $Q_{\phi_{i}}$, $\pi_{\theta_{i}}$, $Q_{\phi^{\prime}_{i}}$, $\pi_{\theta^{\prime}_{i}}$ for all $i\in\{1,\cdots,N\}$.
3: while training not finished do
4:     Initialize initial state $s$ and a random process $\mathcal{N}$ for action exploration.
5:     for each episode do
6:         For each agent select action $a_{i}=\pi_{\theta_{i}}(s)+\mathcal{N}_{t}$.
7:         if $a_{i}\notin\Phi(s)$ then
8:             adjust $a_{i}$ with $\Phi(s)$.
9:         end if
10:         Execute joint action $\bm{a}$ and observe reward $\bm{r}$ and next state $s^{\prime}$.
11:         Store experience $\langle s,\bm{a},\bm{r},s^{\prime}\rangle$ in replay buffer $\mathcal{D}$.
12:         Sample a minibatch of $K$ experiences from $\mathcal{D}$.
13:         for each agent do
14:             Update the critic $Q_{\phi_{i}}$ by minimizing Eq. (4).
15:             Update the actor $\pi_{\theta_{i}}$ using Eq. (3).
16:         end for
17:         Update target networks for each agent:
            $\phi^{\prime}_{i}\leftarrow\tau\phi_{i}+(1-\tau)\phi^{\prime}_{i}$, $\theta^{\prime}_{i}\leftarrow\tau\theta_{i}+(1-\tau)\theta^{\prime}_{i}$.
18:     end for
19: end while
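Step 17 of Algorithm 1 is a soft update that moves each target network a fraction $\tau$ of the way toward its online counterpart. A minimal Python sketch follows, with parameters represented as flat lists of floats; the function name and the value $\tau=0.25$ are illustrative assumptions (in practice $\tau$ is usually much smaller and the parameters are network tensors).

```python
from typing import List


def soft_update(target: List[float], online: List[float], tau: float) -> List[float]:
    """Return tau * online + (1 - tau) * target, element-wise."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


target_params = [0.0, 1.0]
online_params = [1.0, 0.0]
print(soft_update(target_params, online_params, tau=0.25))  # [0.25, 0.75]
```

A small $\tau$ keeps the TD target in Eq. (4) slowly varying, which is the usual motivation for delayed target parameters.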

3.2 Results

Fig. 3: The asset allocations with MADDPG, CPPI-MADDPG, and TIPP-MADDPG strategies.
Fig. 4: The performances of portfolios with different strategies (units of Total Asset and Time Step: $10^{3}$ RMB and day).

The changes in total assets over time under the different trading strategies are presented in Fig. 4, where only trading days are used for the algorithm implementation. In the first 80 trading days, the profits gained by all strategies are nearly the same, except for Universal Portfolios (UP), which increases in a stable but slow manner. Afterward, TIPP-MADDPG grows markedly faster than CPPI-MADDPG between the 80th and 160th trading days. One of the significant risks with CPPI is the "gap risk": if the risky asset falls dramatically in value in a very short period, the portfolio might not be rebalanced in time to prevent it from breaching the floor value. TIPP, with its time-based adjustment, does not have this risk to the same extent.

We then compare the efficacy of our proposed trading strategies against UP, MADQN, and MADDPG using three metrics: Annual Return (AR), Sharpe Ratio (SR), and Maximum Drawdown (MaxD, the maximum drop in portfolio value from peak to trough). For MADQN, we discretize the action space used by MADDPG. UP is a standard portfolio method, which optimizes based on the correlation of diverse stock returns. The results for AR, SR, and MaxD are detailed in Table 1.
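The paper does not state the exact formulas used for these metrics. The Python sketch below shows one common way to compute them from a daily series of total asset values, assuming 252 trading days per year, a zero risk-free rate for the Sharpe ratio, and drawdown measured relative to the running peak. These conventions and the toy asset curve are assumptions for illustration, so the numbers need not match the procedure behind Table 1.

```python
from math import sqrt
from statistics import mean, stdev
from typing import List

TRADING_DAYS = 252  # assumed number of trading days per year


def daily_returns(values: List[float]) -> List[float]:
    """Simple day-over-day returns of the asset curve."""
    return [b / a - 1.0 for a, b in zip(values, values[1:])]


def annual_return(values: List[float]) -> float:
    """Geometric return over the whole curve, scaled to one year."""
    periods = len(values) - 1
    return (values[-1] / values[0]) ** (TRADING_DAYS / periods) - 1.0


def sharpe_ratio(values: List[float], risk_free_daily: float = 0.0) -> float:
    """Annualized mean excess daily return divided by its sample standard deviation."""
    excess = [r - risk_free_daily for r in daily_returns(values)]
    return mean(excess) / stdev(excess) * sqrt(TRADING_DAYS)


def max_drawdown(values: List[float]) -> float:
    """Largest relative drop from a running peak to a later trough."""
    peak, worst = values[0], 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst


curve = [100.0, 110.0, 99.0, 120.0, 108.0]
print(max_drawdown(curve))  # 0.1: both 110 -> 99 and 120 -> 108 are 10% drops
```

Higher AR and SR are better, while a lower MaxD indicates better downside protection, which is the property the insurance strategies are designed to improve.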

From the numerical results, we can see that UP struggles with real-time data, underperforming in our test set. MADQN's limitation is its restricted exploration capacity, making it less adept in a stock market rife with uncertainties. Both CPPI-MADDPG and TIPP-MADDPG can revert to the baseline MADDPG strategy under certain parameters, allowing them to be tailored to individual investor risk preferences. For profitability, the TIPP strategy mandates a greater capital guarantee than CPPI, leading to a more assertive portfolio approach. As a result, TIPP-MADDPG achieves the highest annual return, whereas CPPI-MADDPG achieves the highest SR and the lowest MaxD among the RL-based methods (UP has the lowest MaxD overall, but also the lowest annual return).

To further study how each agent has made a series of trading decisions over time in the test phase, we visualize the general trading behavior for each agent on 100 shares with MADDPG, CPPI-MADDPG, and TIPP-MADDPG, respectively. As shown in Fig. 3, the heatmap presents how the agents with different strategies choose to allocate the asset. The agents with MADDPG prefer the relatively uniform allocation while the assets allocated by those with CPPI-MADDPG and TIPP-MADDPG are more sparse.

Table 1: Comparison of Different Strategies
Strategy       AR                  SR                MaxD
UP             3.36%               1.03              3.86%
MADQN          6.47%               1.71              9.43%
MADDPG         8.22%               1.99              $\mathbf{12.26\%}$
CPPI-MADDPG    7.76%               $\mathbf{2.18}$   6.60%
TIPP-MADDPG    $\mathbf{9.68\%}$   2.09              9.02%

4 Conclusion

In this study, we introduced two augmented MADDPG algorithms: CPPI-MADDPG and TIPP-MADDPG. Our goal was to elucidate the advantages these financial trading strategies offer when integrated with MARL for quantitative trading. We subjected both methods to rigorous testing using genuine financial data from the Shenzhen Stock Exchange. The empirical outcomes underscore the efficacy of our approaches. We believe these findings highlight the potential for future deployments of our methods in quantitative markets. Given the significant outcomes and the evolving nature of quantitative markets, we are optimistic about the broader applications and adaptations of our methodologies.

References

  • [1] Ankit Thakkar and Kinjal Chaudhari, “A comprehensive survey on deep neural networks for stock market: The need, challenges, and future directions,” Expert Systems with Applications, vol. 177, pp. 114800, 2021.
  • [2] Marcus G Daniels, J Doyne Farmer, László Gillemot, Giulia Iori, and Eric Smith, “Quantitative model of price diffusion and market friction based on trading as a mechanistic random process,” Physical Review Letters, vol. 90, no. 10, pp. 108102, 2003.
  • [3] Xin Guo, Tze Leung Lai, Howard Shek, and Samuel Po-Shing Wong, Quantitative trading: Algorithms, Analytics, Data, Models, Optimization, Chapman and Hall/CRC, 2017.
  • [4] Tobias J Moskowitz, Yao Hua Ooi, and Lasse Heje Pedersen, “Time series momentum,” Journal of financial economics, vol. 104, no. 2, pp. 228–250, 2012.
  • [5] Huyên Pham, Continuous-time Stochastic Control and Optimization with Financial Applications, vol. 61, Springer Science & Business Media, 2009.
  • [6] Wendell H Fleming and Tao Pang, “An application of stochastic control theory to financial economics,” SIAM Journal on Control and Optimization, vol. 43, no. 2, pp. 502–531, 2004.
  • [7] Bo An, Shuo Sun, and Rundong Wang, “Deep reinforcement learning for quantitative trading: Challenges and opportunities,” IEEE Intelligent Systems, vol. 37, no. 2, pp. 23–26, 2022.
  • [8] Richard S Sutton and Andrew G Barto, Reinforcement learning: An introduction, MIT press, 2018.
  • [9] Avraam Tsantekidis, Nikolaos Passalis, and Anastasios Tefas, “Diversity-driven knowledge distillation for financial trading using deep reinforcement learning,” Neural Networks, vol. 140, pp. 193–202, 2021.
  • [10] Xiaojie Li, Chaoran Cui, Donglin Cao, Juan Du, and Chunyun Zhang, “Hypergraph-based reinforcement learning for stock portfolio selection,” in 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 4028–4032.
  • [11] Huanming Zhang, Zhengyong Jiang, and Jionglong Su, “A deep deterministic policy gradient-based strategy for stocks portfolio management,” in 2021 IEEE 6th International Conference on Big Data Analytics (ICBDA), 2021, pp. 230–238.
  • [12] Kaiqing Zhang, Zhuoran Yang, and Tamer Başar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,” Handbook of Reinforcement Learning and Control, pp. 321–384, 2021.
  • [13] Lucian Busoniu, Robert Babuska, and Bart De Schutter, “A comprehensive survey of multiagent reinforcement learning,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 2, pp. 156–172, 2008.
  • [14] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [15] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al., “Grandmaster level in StarCraft II using multi-agent reinforcement learning,” Nature, vol. 575, no. 7782, pp. 350–354, 2019.
  • [16] Jinho Lee, Raehyun Kim, Seok-Won Yi, and Jaewoo Kang, “MAPS: Multi-agent reinforcement learning-based portfolio management system,” arXiv preprint arXiv:2007.05402, 2020.
  • [17] Johann Lussange, Ivan Lazarevich, Sacha Bourgeois-Gironde, Stefano Palminteri, and Boris Gutkin, “Modelling stock markets by multi-agent reinforcement learning,” Computational Economics, vol. 57, no. 1, pp. 113–147, 2021.
  • [18] Sven Balder, Michael Brandl, and Antje Mahayni, “Effectiveness of CPPI strategies under discrete-time trading,” Journal of Economic Dynamics and Control, vol. 33, no. 1, pp. 204–220, 2009.
  • [19] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in 31st Conference on Neural Information Processing Systems (NeurIPS), 2017, vol. 30.
  • [20] Xiao-Yang Liu, Hongyang Yang, Jiechao Gao, and Christina Dan Wang, “FinRL: Deep reinforcement learning framework to automate trading in quantitative finance,” ACM International Conference on AI in Finance (ICAIF), 2021.