Optimizing Trading Strategies in Quantitative Markets using Multi-Agent Reinforcement Learning

Abstract

Quantitative markets are characterized by swift dynamics and abundant uncertainties, which makes profit-driven stock trading inherently challenging. Reinforcement learning (RL), which learns a control policy from reward signals, has emerged as a potentially effective approach to such financial decision-making problems. This paper combines two established financial trading strategies, namely constant proportion portfolio insurance (CPPI) and time-invariant portfolio protection (TIPP), with the multi-agent deep deterministic policy gradient (MADDPG) framework. The result is two multi-agent RL (MARL) methods, CPPI-MADDPG and TIPP-MADDPG, designed for strategic trading in quantitative markets. To validate them, we applied both methods to a selection of 100 real-market stocks. Our empirical findings show that the CPPI-MADDPG and TIPP-MADDPG strategies consistently outperform their traditional counterparts, supporting their efficacy in quantitative trading.

Index Terms— Quantitative trading, multi-agent reinforcement learning, constant proportion portfolio insurance, time-invariant portfolio protection

1 Introduction

Compared with traditional trading methods, quantitative trading is known for being high-frequency, algorithmic, and automated, which is difficult for human traders to achieve in a complex and dynamic stock market [1, 2]. In the quantitative market, massive noisy signals from stochastic trading behaviors and unforeseeable social events make predicting the market state difficult [3, 4]. Human traders are also easily affected by these events as well as by their physical and psychological conditions, which makes irrational trading decisions nearly inevitable [5, 6]. Therefore, individuals and institutions from different research fields have started to explore more effective ways of handling these problems.

Over the past years, with the development of artificial intelligence techniques, reinforcement learning (RL) has emerged as an efficient method for making decisions in dynamic environments with uncertainties [7]. RL is formalized as a Markov decision process (MDP). By interacting with the environment, the RL agent, i.e., the decision maker, iteratively updates its strategy according to the rewards, which act as guidance toward the target; the goal of the agent is therefore to maximize the total reward [8]. Building on the MDP, researchers in finance have designed dedicated RL architectures for different financial problems. A deep RL method combined with knowledge distillation was proposed to improve training reliability in the trading of currency pairs [9]. To investigate the stock portfolio selection problem, a hypergraph-based RL method was designed to learn a policy function that generates appropriate trading actions [10]. In addition, a policy-based RL framework for stock portfolio management was introduced and compared with other trading strategies [11].

Meanwhile, instead of focusing only on a single centralized agent that interacts with the environment, researchers have found that many scenarios, such as multi-robot control and multi-player games, are better modeled as multi-agent systems (MAS), in which more than one agent is involved [12]. The framework of multi-agent RL (MARL) was therefore established to handle decision-making and optimal control problems with multiple agents in a common environment [13]. Like RL, MARL addresses sequential decision-making, in which each agent acts according to its own policy, which can naturally be represented by a neural network [14]. However, the situation becomes more complex because, from each agent's perspective, the behavior of the other agents is part of the changing environment [15]. To manage a portfolio under continuous market changes, a MARL-based system was proposed to maximize the return [16]. Similarly, a MAS stock market simulator was designed to assess market activity and reproduce market metrics [17].

In this work, we integrate a MARL approach named multi-agent deep deterministic policy gradient (MADDPG) into two established trading strategies, namely constant proportion portfolio insurance (CPPI) and time-invariant portfolio protection (TIPP), to study how this MAS architecture behaves in quantitative markets. The rest of this work is organized as follows. Section 2 introduces the system model, in which we discuss the MAS model and present how MADDPG is combined with the CPPI and TIPP strategies, respectively. The numerical experiments and results are presented and analyzed in Section 3. Finally, conclusions are drawn in Section 4.

2 System model

In this section, we map the problem of strategic trading in quantitative markets into a MARL task. The principle of MADDPG is introduced first, and then we integrate the MADDPG approach into CPPI and TIPP strategies, respectively.

2.1 Framework of MARL for strategic trading

A sequential decision-making problem in the multi-agent scenario can be described as a stochastic game, defined by the tuple $\langle N,\mathbb{S},\{\mathbb{A}^{i}\}_{i\in\{1,\cdots,N\}},P,\{R^{i}\}_{i\in\{1,\cdots,N\}},\gamma\rangle$ , where $N$ is the number of agents. At time $t$ , under a shared state $S_{t}\in\mathbb{S}$ , each agent takes its action $a^{i}\in\mathbb{A}^{i}$ simultaneously. The joint action $\mathbf{a}=\{a^{1},\cdots,a^{N}\}$ causes the environment to change according to the dynamics $P:\mathbb{S}\times\boldsymbol{\mathbb{A}}\rightarrow\Delta(\mathbb{S})$ , where $\boldsymbol{\mathbb{A}}:=\mathbb{A}^{1}\times\cdots\times\mathbb{A}^{N}$ . Each agent then receives its individual reward $r^{i}$ according to its reward function $R^{i}:\mathbb{S}\times\boldsymbol{\mathbb{A}}\times\mathbb{S}\rightarrow\mathbb{R}$ . The discount factor $\gamma$ represents the time value of rewards.

In our model, $N$ agents employ diverse strategies to trade in quantitative markets. They aim to optimize their returns while ensuring portfolio diversity, thereby distributing risks among all agents. At every step, the agents observe a shared state $\bm{s}=\{s_{1},\cdots,s_{N}\}$ , where each individual state $s=[p,h,b]$ is a vector that contains the prices of $D$ stocks $p\in\mathbb{R}_{+}^{D}$ , the shares held $h\in\mathbb{Z}_{+}^{D}$ , and the remaining balance $b\in\mathbb{R}_{+}^{D}$ . Each agent then determines an action $a$ according to its policy, where $a$ is a vector representing the weight of each stock in the portfolio. After the joint action is taken, the number of shares held by each agent changes and the portfolios are updated. Each agent then receives a reward based on the change in its asset value.

2.2 Strategies of CPPI and TIPP

Constant proportion portfolio insurance (CPPI) is a type of portfolio insurance in which the investor sets a floor based on their asset and then structures the asset allocation around the trading decision [18]. As shown in Fig. 1, the total asset $A$ is separated into two parts, the protection floor $F$ and the cushion $C$ . The floor $F$ is the minimum guarantee that protects the base of the total asset, and a multiple of the cushion, $k*C$ , is used as the risky asset $E$ ,

$$E=k*C=k*(A-F), \tag{1}$$

where the risk factor $k$ measures the risk appetite, and a higher value denotes a more aggressive trading strategy. By comparison, time-invariant portfolio protection (TIPP) is a variation of CPPI in which the protection floor $F_{t}$ at time step $t$ is not fixed but changes over time according to a percentage of the total asset $A_{t}$ and the previous floor $F_{t-1}$ ,

$$\begin{split}F_{t}&=\max\{\phi A_{t},F_{t-1}\},\\ E_{t}&=k(A_{t}-F_{t}),\end{split} \tag{2}$$

where $\phi$ is the floor percentage. As the total asset $A$ increases, the guaranteed amount rises accordingly, whereas the guarantee remains unchanged if the portfolio value falls.
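As a minimal illustration of Eqs. (1) and (2), the following Python sketch computes the risky exposure under CPPI and one floor update under TIPP. The function names, the clipping of the exposure to the range $[0, A]$ (no leverage, no short exposure), and the numbers in the usage part are assumptions made for this sketch and are not taken from the experiments.

```python
def cppi_exposure(total_asset: float, floor: float, k: float) -> float:
    """Risky exposure E = k * (A - F), following Eq. (1).

    Clipping E to [0, A] is an assumption of this sketch: it rules out
    leverage and negative exposure when the asset falls below the floor.
    """
    cushion = total_asset - floor
    return min(max(k * cushion, 0.0), total_asset)


def tipp_step(total_asset: float, prev_floor: float, phi: float, k: float):
    """One TIPP update, following Eq. (2).

    The floor is ratcheted up to phi * A_t when that exceeds the previous
    floor, and the exposure is then sized from the new cushion.
    """
    floor = max(phi * total_asset, prev_floor)
    exposure = cppi_exposure(total_asset, floor, k)
    return floor, exposure


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print("CPPI exposure:", cppi_exposure(100.0, 80.0, 3.0))
    floor = 80.0
    for asset in (100.0, 120.0, 110.0):
        floor, exposure = tipp_step(asset, floor, phi=0.8, k=3.0)
        print(f"A={asset:.1f} floor={floor:.1f} exposure={exposure:.1f}")
```

In the usage loop the floor rises when the asset grows from 100 to 120 and stays at its higher level when the asset drops back to 110, which is the ratchet behavior described above.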

Fig. 1: The principle of the CPPI strategy for agent $i$ .

2.3 Multi-Agent Deep Deterministic Policy Gradient with Insurance Strategy

We adopt MADDPG [19] to train our agents. The approach is adapted to the quantitative trading scenario, as presented in Fig. 2. MADDPG is a multi-agent version of the actor-critic method for continuous action sets with a deterministic policy, where each agent has an actor network $\pi_{\theta}$ parameterized by $\theta$ and a critic network $Q_{\phi}$ parameterized by $\phi$ . Here $\phi$ denotes the critic parameters and is distinct from the floor percentage $\phi$ in Eq. (2). Each agent learns its optimal policy by updating the parameters of its policy network to directly maximize the objective function, i.e., the cumulative discounted return $J(\theta_{i})=\mathbb{E}_{s\sim P,\bm{a}\sim\bm{\pi}_{\theta}}[\sum_{t\geq 0}\gamma^{t}R_{t}^{i}]$ . The update direction for agent $i$ is the gradient of the cumulative discounted return,

$$\nabla_{\theta_{i}}J(\theta_{i})=\mathbb{E}_{s\sim\mathcal{D}}\Big[\nabla_{\theta_{i}}\log\pi_{\theta_{i}}(a_{i}|s)\cdot\nabla_{a_{i}}Q_{\phi_{i}}(s,a_{1},\cdots,a_{N})|_{a_{i}=\pi_{\theta_{i}}(s)}\Big], \tag{3}$$

where $\mathcal{D}$ is the experience replay buffer containing tuples $(s,s^{\prime},a_{1},\cdots,a_{N},r_{1},\cdots,r_{N})$ that are stored throughout training. To combine the policy with the insurance strategy, we use CPPI or TIPP to adjust the actor outputs rather than using the raw outputs as actions. Such an adjustment is feasible because MADDPG is an off-policy method.
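The exact adjustment rule is not spelled out here, so the following Python sketch shows only one plausible form of it: the raw actor output is treated as a vector of stock weights and rescaled so that the total risky fraction does not exceed $E/A$, with the remainder held as cash. The function name `adjust_weights`, the proportional rescaling, and the treatment of negative outputs are all assumptions of this sketch.

```python
def adjust_weights(raw_weights, total_asset, exposure):
    """Rescale stock weights so the risky fraction is at most E / A.

    Hypothetical projection rule: the source only states that the raw
    actor output is adjusted by CPPI or TIPP. Returns the adjusted stock
    weights and the weight left in cash.
    """
    if total_asset <= 0:
        raise ValueError("total_asset must be positive")
    # Assumption: negative outputs (short positions) are clipped to zero.
    weights = [max(w, 0.0) for w in raw_weights]
    # Allowed risky fraction of the portfolio, kept inside [0, 1].
    cap = min(max(exposure / total_asset, 0.0), 1.0)
    risky = sum(weights)
    if risky > cap and risky > 0.0:
        scale = cap / risky
        weights = [w * scale for w in weights]
    cash = 1.0 - sum(weights)
    return weights, cash


if __name__ == "__main__":
    # Hypothetical raw action over three stocks, with A = 100 and E = 60.
    stock_weights, cash_weight = adjust_weights([0.5, 0.3, 0.2], 100.0, 60.0)
    print(stock_weights, cash_weight)
```

Because the adjusted action, not the raw one, is what gets executed and stored in the replay buffer, the critic is trained on the behavior that actually occurred, which is consistent with the off-policy argument above.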

The centralized critic networks are updated by approximating the true action-value function using temporal-difference learning. Further, to prevent all agents from converging to the same strategy, we include the correlation between the agents' actions in the loss function, which encourages diversified portfolio selection. The loss function for agent $i$ can be expressed as,

$$\begin{split}\mathcal{L}(\phi_{i})&=\lambda\,\mathbb{E}_{s,\bm{a},\bm{r},s^{\prime}}\Big[(Q_{\phi_{i}}(s,a_{1},\cdots,a_{N})-y)^{2}\Big]+(1-\lambda)\sum_{j=1,j\neq i}^{N}\mathrm{Corr}(a_{i},a_{j})^{2},\\ y&=r_{i}+\gamma Q_{\phi^{\prime}_{i}}(s^{\prime},a^{\prime}_{1},\cdots,a^{\prime}_{N})|_{a^{\prime}_{j}=\pi_{\theta^{\prime}_{j}}(s^{\prime})},\end{split} \tag{4}$$

where $Q_{\phi^{\prime}_{i}}$ and $\pi_{\theta^{\prime}_{i}}$ are target networks with delayed parameters $\phi^{\prime}_{i}$ and $\theta^{\prime}_{i}$ , $a_{i}$ is the action vector, i.e., the position-confidence vector of agent $i$ under the restriction of CPPI or TIPP, and $\lambda$ is the hyperparameter that balances the two terms. The pseudocode of the proposed algorithm is shown in Alg. 1. Note that the rule-based policy $\Phi$ can be either CPPI or TIPP, and it outputs the constraints on the actions under the insurance strategy.
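To make the structure of the loss in Eq. (4) concrete, the sketch below evaluates it in plain Python for given critic outputs and actions. It reads the penalty as the sum of squared Pearson correlations between the action vector of agent $i$ and those of the other agents, which is an interpretation; the helper names, the handling of constant action vectors, and the use of a single joint action instead of a minibatch average are likewise assumptions of this sketch.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        # Assumption: a constant vector is treated as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def td_target(reward, gamma, next_q):
    """Temporal-difference target y = r_i + gamma * Q'(s', a')."""
    return reward + gamma * next_q


def critic_loss(q_values, targets, actions, agent, lam):
    """lam * mean squared TD error + (1 - lam) * correlation penalty.

    q_values and targets are the critic outputs and TD targets of one
    agent over a minibatch; actions[j] is the action vector of agent j.
    """
    td_error = sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)
    penalty = sum(
        pearson(actions[agent], other) ** 2
        for j, other in enumerate(actions)
        if j != agent
    )
    return lam * td_error + (1.0 - lam) * penalty


if __name__ == "__main__":
    # Hypothetical numbers: three agents, three stocks, minibatch of two.
    acts = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
    print(critic_loss([1.0, 2.0], [1.0, 4.0], acts, agent=0, lam=0.5))
```

Squaring the correlation penalizes perfectly aligned and perfectly opposed allocations equally, so the term is smallest when the agents' allocations are uncorrelated.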

3 Experiments

We conduct the experiments on a real-world stock trading environment provided by FinRL [20]. We compare the performance of our proposed CPPI-MADDPG and TIPP-MADDPG against MADDPG, MADQN, and the Universal Portfolio (UP). The experimental results show that combining MADDPG with an insurance strategy provides a substantial advantage in quantitative trading.

3.1 Dataset and Settings

In our experiment, we selected 100 stocks listed on the Shenzhen Stock Exchange, sourced via Tushare. These stocks have codes ranging from 000010.SZ to 300813.SZ. Cash is included as a risk-free asset, which significantly expands the set of possible portfolios. For training, we used data from January 1st, 2018 to December 31st, 2020. The testing set comprises data from January 1st, 2021 to December 31st, 2021. To align our analysis with real-world market conditions, we applied constraints including a non-negative balance and transaction costs. Each run starts from a fixed initial cash amount, and the aim is to maximize profits using the aforementioned trading strategies.

Algorithm 1 MADDPG with insurance strategy

1: Initialize the rule-based policy $\Phi$ according to the insurance strategy.
2: Initialize $Q_{\phi_{i}}$ , $\pi_{\theta_{i}}$ , $Q_{\phi^{\prime}_{i}}$ , $\pi_{\theta^{\prime}_{i}}$ for all $i\in\{1,\cdots,N\}$
3: while training not finished do
4:     Initialize initial state $s$ and a random process $\mathcal{N}$ for action exploration.
5:     for each episode do
6:         For each agent select action $a_{i}=\pi_{\theta_{i}}(s)+\mathcal{N}_{t}$
7:         if $a_{i}\notin\Phi(s)$ then
8:             adjust $a_{i}$ with $\Phi(s)$
9:         end if
10:        Execute joint action $\bm{a}$ and observe reward $\bm{r}$ and next state $s^{\prime}$
11:        Store experience $\langle s,\bm{a},\bm{r},s^{\prime}\rangle$ in replay buffer $\mathcal{D}$
12:        Sample a minibatch of $K$ experiences from $\mathcal{D}$
13:        for each agent do
14:            Update the critic $Q_{\phi_{i}}$ by minimizing Eq.(4).
15:            Update the actor $\pi_{\theta_{i}}$ using Eq.(3).
16:        end for
17:        Update target networks for each agent: $\phi^{\prime}_{i}\leftarrow\tau\phi_{i}+(1-\tau)\phi^{\prime}_{i}$ , $\theta^{\prime}_{i}\leftarrow\tau\theta_{i}+(1-\tau)\theta^{\prime}_{i}$ .
18:    end for
19: end while
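The target-network update in step 17 of Alg. 1 is a soft (Polyak) update. A minimal Python sketch follows, in which parameters are represented as flat lists of floats; this representation, the function name `soft_update`, and the value of $\tau$ in the usage part are assumptions made for illustration.

```python
def soft_update(target, online, tau):
    """Return tau * online + (1 - tau) * target, element by element.

    Mirrors step 17 of Alg. 1. Parameters are flat lists of floats here,
    which is a simplification of real network weights.
    """
    if len(target) != len(online):
        raise ValueError("parameter lists must have the same length")
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


if __name__ == "__main__":
    # Hypothetical parameters; with a small tau the target moves slowly.
    target_params = [0.0, 1.0]
    online_params = [1.0, 1.0]
    for _ in range(3):
        target_params = soft_update(target_params, online_params, tau=0.1)
        print(target_params)
```

A small $\tau$ keeps the target networks close to their previous values, so the temporal-difference target in Eq. (4) changes slowly between updates.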

3.2 Results

The changes in total assets under the different trading strategies over time are presented in Fig. 4, where only trading days are used. In the first 80 trading days, the profits gained by all strategies are nearly the same, except for the Universal Portfolio (UP), which increases steadily but slowly. Afterward, TIPP-MADDPG rises sharply between the 80th and 160th trading days compared with CPPI-MADDPG. One significant risk with CPPI is the "gap risk": if the risky asset falls dramatically in value over a very short period, the portfolio might not be rebalanced in time to prevent it from breaching the floor value. TIPP, with its time-based adjustment, is exposed to this risk to a lesser extent.

We then compare the efficacy of our proposed trading strategies against UP, MADQN, and MADDPG using three metrics: Annual Return (AR), Sharpe Ratio (SR), and Maximum Drawdown (MaxD, the maximum drop in portfolio value from peak to trough). For MADQN, we use a discretized version of the action space used in MADDPG. UP is a standard portfolio method that optimizes based on the correlation of diverse stock returns. The results for AR, SR, and MaxD are detailed in Table 1.
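For readers who wish to reproduce such metrics from a series of daily total-asset values, a minimal Python sketch is given next. The exact formulas behind Table 1 are not stated here, so the conventions used (252 trading days per year, geometric annualization, a zero risk-free rate, and the sample standard deviation) are assumptions of this sketch, as are the function names.

```python
import math

TRADING_DAYS = 252  # assumed number of trading days per year


def daily_returns(values):
    """Simple returns between consecutive portfolio values."""
    return [values[t] / values[t - 1] - 1.0 for t in range(1, len(values))]


def annual_return(values, periods=TRADING_DAYS):
    """Geometrically annualized return over the whole series."""
    n = len(values) - 1
    return (values[-1] / values[0]) ** (periods / n) - 1.0


def sharpe_ratio(values, risk_free=0.0, periods=TRADING_DAYS):
    """Annualized Sharpe ratio of the daily excess returns."""
    excess = [r - risk_free / periods for r in daily_returns(values)]
    if len(excess) < 2:
        raise ValueError("need at least three portfolio values")
    mean = sum(excess) / len(excess)
    var = sum((r - mean) ** 2 for r in excess) / (len(excess) - 1)
    if var == 0.0:
        raise ValueError("returns have zero variance")
    return mean / math.sqrt(var) * math.sqrt(periods)


def max_drawdown(values):
    """Largest relative drop from a running peak to a later trough."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst


if __name__ == "__main__":
    # Hypothetical asset curve, for illustration only.
    curve = [100.0, 120.0, 90.0, 110.0]
    print(annual_return(curve), sharpe_ratio(curve), max_drawdown(curve))
```

With these conventions, a higher AR and SR are better, whereas a lower MaxD indicates a smaller worst-case loss.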

From the numerical results, we can see that UP struggles with real-time data, underperforming in our test set. MADQN’s limitation is its restricted exploration capacity, making it less adept in a stock market rife with uncertainties. Both CPPI-MADDPG and TIPP-MADDPG can revert to the baseline MADDPG strategy under certain parameters, allowing them to be tailored to individual investor risk preferences. For profitability, the TIPP strategy mandates a greater capital guarantee than CPPI, leading to a more assertive portfolio approach. This results in TIPP excelling in annual return, whereas CPPI outshines in SR and MaxD metrics.

To further study how each agent has made a series of trading decisions over time in the test phase, we visualize the general trading behavior for each agent on 100 shares with MADDPG, CPPI-MADDPG, and TIPP-MADDPG, respectively. As shown in Fig. 3, the heatmap presents how the agents with different strategies choose to allocate the asset. The agents with MADDPG prefer the relatively uniform allocation while the assets allocated by those with CPPI-MADDPG and TIPP-MADDPG are more sparse.

Table 1: Comparison of Different Strategies

Strategy	AR	SR	MaxD
UP	$3.36\%$	$1.03$	$3.86\%$
MADQN	$6.47\%$	$1.71$	$9.43\%$
MADDPG	$8.22\%$	$1.99$	$\mathbf{12.26\%}$
CPPI-MADDPG	$7.76\%$	$\mathbf{2.18}$	$6.60\%$
TIPP-MADDPG	$\mathbf{9.68\%}$	$2.09$	$9.02\%$

4 Conclusion

In this study, we introduced two augmented MADDPG algorithms: CPPI-MADDPG and TIPP-MADDPG. Our goal was to elucidate the advantages these financial trading strategies offer when integrated with MARL for quantitative trading. We subjected both methods to rigorous testing using genuine financial data from the Shenzhen Stock Exchange. The empirical outcomes underscore the efficacy of our approaches. We believe these findings highlight the potential for future deployments of our methods in quantitative markets. Given the significant outcomes and the evolving nature of quantitative markets, we are optimistic about the broader applications and adaptations of our methodologies.

References

[1] Ankit Thakkar and Kinjal Chaudhari, “A comprehensive survey on deep neural networks for stock market: The need, challenges, and future directions,” Expert Systems with Applications, vol. 177, pp. 114800, 2021.
[2] Marcus G Daniels, J Doyne Farmer, László Gillemot, Giulia Iori, and Eric Smith, “Quantitative model of price diffusion and market friction based on trading as a mechanistic random process,” Physical Review Letters, vol. 90, no. 10, pp. 108102, 2003.
[3] Xin Guo, Tze Leung Lai, Howard Shek, and Samuel Po-Shing Wong, Quantitative trading: Algorithms, Analytics, Data, Models, Optimization, Chapman and Hall/CRC, 2017.
[4] Tobias J Moskowitz, Yao Hua Ooi, and Lasse Heje Pedersen, “Time series momentum,” Journal of financial economics, vol. 104, no. 2, pp. 228–250, 2012.
[5] Huyên Pham, Continuous-time Stochastic Control and Optimization with Financial Applications, vol. 61, Springer Science & Business Media, 2009.
[6] Wendell H Fleming and Tao Pang, “An application of stochastic control theory to financial economics,” SIAM Journal on Control and Optimization, vol. 43, no. 2, pp. 502–531, 2004.
[7] Bo An, Shuo Sun, and Rundong Wang, “Deep reinforcement learning for quantitative trading: Challenges and opportunities,” IEEE Intelligent Systems, vol. 37, no. 2, pp. 23–26, 2022.
[8] Richard S Sutton and Andrew G Barto, Reinforcement learning: An introduction, MIT press, 2018.
[9] Avraam Tsantekidis, Nikolaos Passalis, and Anastasios Tefas, “Diversity-driven knowledge distillation for financial trading using deep reinforcement learning,” Neural Networks, vol. 140, pp. 193–202, 2021.
[10] Xiaojie Li, Chaoran Cui, Donglin Cao, Juan Du, and Chunyun Zhang, “Hypergraph-based reinforcement learning for stock portfolio selection,” in 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 4028–4032.
[11] Huanming Zhang, Zhengyong Jiang, and Jionglong Su, “A deep deterministic policy gradient-based strategy for stocks portfolio management,” in 2021 IEEE 6th International Conference on Big Data Analytics (ICBDA), 2021, pp. 230–238.
[12] Kaiqing Zhang, Zhuoran Yang, and Tamer Başar, “Multi-agent reinforcement learning: A selective overview of theories and algorithms,” Handbook of Reinforcement Learning and Control, pp. 321–384, 2021.
[13] Lucian Busoniu, Robert Babuska, and Bart De Schutter, “A comprehensive survey of multiagent reinforcement learning,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 2, pp. 156–172, 2008.
[14] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[15] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al., “Grandmaster level in StarCraft II using multi-agent reinforcement learning,” Nature, vol. 575, no. 7782, pp. 350–354, 2019.
[16] Jinho Lee, Raehyun Kim, Seok-Won Yi, and Jaewoo Kang, “MAPS: Multi-agent reinforcement learning-based portfolio management system,” arXiv preprint arXiv:2007.05402, 2020.
[17] Johann Lussange, Ivan Lazarevich, Sacha Bourgeois-Gironde, Stefano Palminteri, and Boris Gutkin, “Modelling stock markets by multi-agent reinforcement learning,” Computational Economics, vol. 57, no. 1, pp. 113–147, 2021.
[18] Sven Balder, Michael Brandl, and Antje Mahayni, “Effectiveness of CPPI strategies under discrete-time trading,” Journal of Economic Dynamics and Control, vol. 33, no. 1, pp. 204–220, 2009.
[19] Ryan Lowe, Yi Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in 31st Conference on Neural Information Processing Systems (NeurIPS), 2017, vol. 30.
[20] Xiao-Yang Liu, Hongyang Yang, Jiechao Gao, and Christina Dan Wang, “FinRL: Deep reinforcement learning framework to automate trading in quantitative finance,” ACM International Conference on AI in Finance (ICAIF), 2021.