ElegantRL Demo: Stock Trading Using DDPG (Part I)
Tutorial for Deep Deterministic Policy Gradient Algorithm (DDPG)
This article by Steven Li, Xiao-Yang Liu, and Yiyan Zeng describes the implementation of a stock trading application [1] using the Deep Deterministic Policy Gradient (DDPG) algorithm in ElegantRL. Stock trading plays a crucial role in investment, and it is challenging to develop an automated agent that can trade profitably in the dynamic stock market. To demonstrate the outstanding performance of ElegantRL, we show how to train an effective trading agent.
Please check the introductory article of the ElegantRL library and the Medium blog of [1].
In Part I, we discuss the DDPG algorithm along with the problem formulation of the stock trading task. In Part II, we talk about the implementation details of the application and its easy-to-customize features. After reading these articles, you may design your own stock trading agent and start earning money!
The Stock Trading Task
Stock trading is considered one of the hottest topics in machine learning, since a profitable AI agent is irresistible to almost everyone. With an intelligent trading agent automatically managing a stock account, the only thing you need to do is open a portfolio, lie on the couch, and count the dollars.
Deep Reinforcement Learning (DRL) has proven its ability in the trading field. In this article, the goal is to show that a DDPG agent trained on stock market data can make profits in backtesting.
Problem Formulation
There are two core components of a DRL approach: the trading agent and the environment. The interaction between the trading agent and the market environment is shown in Fig 2:
- The agent observes the current state of the market environment.
- In the current state, the agent makes an action according to its policy.
- The environment steps forward based on the action according to its state-transition dynamics, producing a transition, and then generates a reward.
- The agent receives the reward and uses the transitions to update its policy.
Formally, we model stock trading as a Markov Decision Process (MDP), and formulate the trading objective as maximization of expected return:
- State s = [b, p, h]: a vector that includes the remaining balance b, stock prices p, and stock shares h. Both p and h are vectors of dimension D, where D denotes the number of stocks.
- Action a: a vector of actions over D stocks. The allowed actions on each stock include selling, buying, or holding, which result in decreasing, increasing, or no change of the stock shares in h, respectively.
- Reward r(s, a, s’): The asset value change of taking action a at state s and arriving at new state s’.
- Policy π(s): The trading strategy at state s, which is a probability distribution over actions at state s.
- Q-function Q(s, a): the expected return (cumulative reward) of taking action a at state s and then following policy π.
- State-transition: After taking the action a, the number of shares h is modified, as shown in Fig 3, and the new portfolio value is the sum of the remaining balance and the total value of the stocks held; a minimal sketch of this bookkeeping follows this list.
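To make the formulation concrete, here is a minimal sketch of the state, state-transition, and reward bookkeeping described above, written in plain NumPy. The variable names and the three-stock toy example are illustrative assumptions, not ElegantRL's actual environment (which is covered in Part II).

```python
import numpy as np

D = 3                                    # number of stocks (toy example)
b = 1_000_000.0                          # remaining balance
p = np.array([30.0, 45.0, 120.0])        # stock prices, dimension D
h = np.array([100, 50, 10])              # shares held, dimension D

state = np.concatenate(([b], p, h))      # s = [b, p, h]

def step(b, p, h, a, p_next):
    """Apply action a (shares to buy (+) / sell (-)), then observe new prices p_next."""
    a = np.clip(a, -h, None)             # cannot sell more shares than we hold
    cost = float(a @ p)                  # cash spent (negative when net selling)
    b_next, h_next = b - cost, h + a
    # reward r(s, a, s') = change in total asset value (balance + value of holdings)
    reward = (b_next + h_next @ p_next) - (b + h @ p)
    return b_next, h_next, reward

b, h, r = step(b, p, h, a=np.array([10, -20, 0]), p_next=np.array([31.0, 44.0, 121.0]))
```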
Why do we choose the DDPG algorithm?
The Deep Deterministic Policy Gradient (DDPG) algorithm [3] is a model-free, off-policy algorithm under the actor-critic framework, and it can be viewed as a combination of Deep Q-Network (DQN) and Policy Gradient. The main reasons why we choose the DDPG algorithm are the following:
- It is simple compared to other state-of-the-art (SOTA) algorithms and serves as a good example of the DRL algorithms in ElegantRL. Thanks to this simplicity, the user can focus more on the stock trading strategy and select the best algorithm through backtesting.
- Unlike DQN, it is able to deal with a continuous rather than discrete action space, and thus can trade over a large set of stocks.
At this point, one may ask: why use a continuous action space for stock trading?
We first define the state space and action space of a stock trading example, assuming that our portfolio has 30 stocks in total:
- State Space: We use a 181-dimensional vector consisting of seven parts of information to represent the state space of the multiple-stock trading environment: [b, p, h, M, R, C, X], where b is the balance, p is the vector of stock prices, h is the vector of shares held, M is the Moving Average Convergence Divergence (MACD), R is the Relative Strength Index (RSI), C is the Commodity Channel Index (CCI), and X is the Average Directional Index (ADX).
- Action Space: As a recap, we have three types of actions for a single stock: selling, buying, and holding. We use a negative value for selling, a positive value for buying, and zero for holding. In this case, the action space for each stock is defined as {-k, …, -1, 0, 1, …, k}, where k is the maximum number of shares to buy or sell in each transaction. A quick dimension check is sketched after this list.
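As a quick sanity check on these dimensions (a back-of-the-envelope sketch, assuming 30 stocks and one value per stock for each technical indicator):

```python
D = 30                        # number of stocks in the portfolio
# state = [b, p, h, M, R, C, X]: the balance (1 value) plus six per-stock vectors
state_dim = 1 + 6 * D         # = 181
action_dim = D                # one buy/sell/hold decision per stock
print(state_dim, action_dim)  # 181 30
```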
Back to the question: the number of stocks we trade is an integer, and the number of shares we sell, buy, or hold is also an integer. Intuitively, the action space of stock trading seems more naturally discrete than continuous. However, the size of the action space grows exponentially with the number of stocks D.
For example, consider 5 stocks and assume the agent is allowed to sell, buy, or hold up to 50 shares of each stock in one transaction. In this case, the size of the discrete action space is (50+50+1)⁵, which is approximately 10¹⁰. If we increase the number of stocks from 5 to 30, the size of the action space grows to about 10⁶⁰ (see the quick calculation below). Since it is almost impossible to represent such a large action space discretely, we treat the action space of stock trading as continuous.
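The quick calculation behind these numbers (an illustrative snippet, under the same assumption of up to k = 50 shares per transaction):

```python
from math import log10

# A discretized action space has (2k + 1) choices per stock and D stocks in total.
def discrete_action_space_size(D, k=50):
    return (2 * k + 1) ** D

print(round(log10(discrete_action_space_size(D=5))))   # 10, i.e. roughly 10^10
print(round(log10(discrete_action_space_size(D=30))))  # 60, i.e. roughly 10^60
```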
Besides dealing with a continuous action space, DDPG also has the following advantages:
- Actor-Critic framework: DDPG follows the standard actor-critic framework, which contains an Actor network for generating actions and a Critic network for estimating the expected return. Such a standard framework is clear for beginners to get started with; a minimal network sketch follows this list.
- Reusability of familiar tricks: DDPG applies many tricks already used in DQN, such as experience replay, frozen target networks, and exploration noise, so the building blocks of DDPG are easy to understand.
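The following PyTorch snippet sketches what such an actor-critic pair can look like. It is a minimal illustration in the spirit of DDPG [3], not ElegantRL's actual network classes; the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: maps a state to an action vector in [-1, 1]^action_dim."""
    def __init__(self, state_dim, action_dim, mid_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, mid_dim), nn.ReLU(),
            nn.Linear(mid_dim, action_dim), nn.Tanh(),  # bounded actions
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Q-function: maps a (state, action) pair to a scalar Q-value."""
    def __init__(self, state_dim, action_dim, mid_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, mid_dim), nn.ReLU(),
            nn.Linear(mid_dim, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat((state, action), dim=1))
```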
Algorithm Details and Pseudocode:
The DDPG algorithm [3] can be divided into four components:
- Initialization: initializes variables and networks.
- Sampling: obtains transitions through the interactions between the Actor network (policy) and the environment.
- Computing: computes the related variables, such as the target Q-value.
- Updating: updates the Actor and Critic networks based on the loss function.
Initialization:
- We initialize the Actor and Critic (Q-function) networks. The Actor network outputs the action for each stock, i.e., the number of shares to buy, sell, or hold. The Critic network estimates the expected return from the current state and action.
- We make a copy of each network as its target network; see the sketch below.
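A minimal initialization sketch, reusing the Actor and Critic classes from the snippet above (the learning rates are assumed values, not ElegantRL's defaults):

```python
import copy
import torch

state_dim, action_dim = 181, 30          # dimensions from the 30-stock example
actor, critic = Actor(state_dim, action_dim), Critic(state_dim, action_dim)

# Target networks start as exact copies and are never trained by gradient descent.
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
for net in (actor_target, critic_target):
    for param in net.parameters():
        param.requires_grad = False

actor_optim = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_optim = torch.optim.Adam(critic.parameters(), lr=1e-3)
```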
Sampling:
- Given the state, we use the Actor network to output the corresponding actions.
- We follow the actions output by the deterministic policy (perturbed by exploration noise during training) and modify the number of shares we hold for each stock, so that the stock market environment steps forward.
- We observe the change in the asset value of our account (the balance plus the total value of the stocks), from which the reward is computed.
- We store the transition (s, a, s’, r) in the replay buffer for future training, as sketched below.
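A sampling sketch that continues from the networks above. It assumes a gym-style stock trading environment `env` (to be defined in Part II) whose `step` accepts actions in [-1, 1] and scales them to share amounts; the noise scale and rollout length are illustrative.

```python
import numpy as np
import torch

replay_buffer = []        # simple list of (s, a, r, s', done) tuples
noise_std = 0.1           # Gaussian exploration noise scale (assumed value)

state = env.reset()
for _ in range(200):
    with torch.no_grad():
        action = actor(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    action = action.squeeze(0).numpy()
    action = np.clip(action + noise_std * np.random.randn(*action.shape), -1.0, 1.0)
    next_state, reward, done, _ = env.step(action)
    replay_buffer.append((state, action, reward, next_state, done))
    state = env.reset() if done else next_state
```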
Computing:
- We randomly sample a batch from the replay buffer.
- We compute the target value, which is used in minimizing the mean-squared Bellman error (MSBE); see the sketch below.
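A sketch of the target-value computation. Here `s`, `a`, `r`, `s_next`, and `done` are assumed to be batched tensors already sampled from the replay buffer; the discount factor is an illustrative value.

```python
import torch
import torch.nn.functional as F

gamma = 0.99              # discount factor (assumed value)

with torch.no_grad():     # targets are treated as constants
    next_action = actor_target(s_next)
    target_q = r + gamma * (1.0 - done) * critic_target(s_next, next_action).squeeze(1)

current_q = critic(s, a).squeeze(1)
critic_loss = F.mse_loss(current_q, target_q)   # mean-squared Bellman error
```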
Updating:
- We update the Critic network using gradient descent on MSBE with the target value.
- We update the Actor network using gradient ascent to find the action policy that maximizes the return.
- The Actor target network and the Critic target network are synchronized through a soft update, as sketched below.
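An updating sketch that continues from the target-value computation above; the soft-update rate tau is an assumed value.

```python
tau = 0.005                                # soft-update rate (assumed value)

# Critic: gradient descent on the MSBE.
critic_optim.zero_grad()
critic_loss.backward()
critic_optim.step()

# Actor: gradient ascent on Q(s, actor(s)), done by descending its negative.
actor_loss = -critic(s, actor(s)).mean()
actor_optim.zero_grad()
actor_loss.backward()
actor_optim.step()

# Soft update: target <- tau * online + (1 - tau) * target.
for target_net, online_net in ((actor_target, actor), (critic_target, critic)):
    for t_param, param in zip(target_net.parameters(), online_net.parameters()):
        t_param.data.mul_(1.0 - tau).add_(tau * param.data)
```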
The detailed implementation of the stock market environment, the realization of the training process, and the performance evaluation will be available soon in Part II. Interested users can test the Jupyter Notebook in FinRL; we will provide a similar one in ElegantRL soon.
References:
[1] Z. Xiong, Xiao-Yang Liu, S. Zhong, Hongyang Yang, A. Walid. Practical deep reinforcement learning approach for stock trading. NeurIPS Workshop on Challenges and Opportunities for AI in Financial Services: the Impact of Fairness, Explainability, Accuracy, and Privacy, 2018.
[2] H. Yang, Xiao-Yang Liu (co-primary), S. Zhong, A. Walid. Deep reinforcement learning for automated stock trading: an ensemble strategy. ACM International Conference on AI in Finance, 2020.
[3] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. ICLR, 2016.
[4] OpenAI spinning up in deep RL, Deep Deterministic Policy Gradient.