Using Recurrent Neural Networks for Actor-Critic Methods

PPO is an on-policy actor-critic deep reinforcement learning algorithm: on-policy algorithms use the current network / policy to create trajectories. We are programming reinforcement learning agents to perform financial portfolio optimization, so it's important to use sequential models that capture temporal information about the market. Recurrent policies have a different architecture from traditional fully connected networks, with some additional features added on.

March 10, 2022

Outline

I. Introduction

II. PPO Explained

III. Recurrent PPO

IV. LSTM

V. Hidden States

VI. Masks

VII. Rollout Buffer & Trajectory Sampling

Introduction

Training recurrent models for reinforcement learning adds a little complexity to the implementation and production of the model, and there are several implementation details to get right when using a recurrent architecture. Deep RL is particularly hard since the data distribution changes whenever the network is updated, so more elaborate architectures are needed to gather information about the environment. In this article we will implement the PPO algorithm with LSTM layers and discuss some of these implementation details. The goal of using a recurrent network for RL is to give the agent more encoded information in addition to the current observation.

PPO is an on-policy actor-critic deep reinforcement learning algorithm: on-policy algorithms use the current network / policy to create trajectories. LSTM-based architectures use memory to build richer representations in partially observable environments and to gather information from the past. We use reinforcement learning agents to perform financial portfolio optimization, so it's important to use sequential models that capture temporal information about the market.

PPO was introduced by Schulman et al. as an advancement over TRPO, which uses a trust region for policy optimization to achieve stability and generalizable hyper-parameters. PPO has a simpler objective function, which makes it easier to implement and more generalizable than other policy gradient methods.

In brief, PPO optimizes a 'surrogate' loss function that penalizes large steps away from the old policy. The objective of the PPO algorithm is to find the parameters that minimize this expected loss.
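For reference, the clipped surrogate objective from Schulman et al. can be written as follows (PPO maximizes this objective, i.e. minimizes its negative; \hat{A}_t is the advantage estimate and \epsilon the clipping range):

L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}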

Here the loss is built from the advantage of the actions generated by the actor network, where the advantage is computed from the output of the value function. If the advantage is positive for a given state and action, the update increases the probability of taking that action in similar states.

There are lots of great articles and a great video about PPO and policy gradients.

Recurrent Networks in Proximal Policy Optimization

LSTM State and LSTM Cell

This article concentrates on implementing recurrent networks for on-policy algorithms.

The LSTM hidden state acts as a memory for the network: learned gates, which are themselves small feed-forward layers, decide which information should flow from past time steps to the current one. Sequence length is an additional hyperparameter in RNN-based architectures: you pick how many time steps you want your model to learn from. For example, if your environment has 10 input dimensions and you want to feed your network the last 8 time steps, you prepare batches with an input tensor of shape (batch_size, sequence_length, input_size).
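As a minimal sketch (the history length and feature count are illustrative assumptions), building such windows from a flat observation history could look like this:

import torch

seq_len, n_features = 8, 10              # last 8 time steps, 10 market features
history = torch.randn(250, n_features)   # e.g. one year of daily observations

# Slide a window of length seq_len over the history to build training inputs
windows = history.unfold(0, seq_len, 1)          # (243, n_features, seq_len)
windows = windows.permute(0, 2, 1).contiguous()  # (batch, seq_len, n_features)
print(windows.shape)                             # torch.Size([243, 8, 10])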

PyTorch provides two LSTM options, nn.LSTMCell and nn.LSTM. They are not hugely different, but the choice changes the implementation of the model. An LSTM cell takes a 2-D input of size B x F (batch, features) and is stepped manually over time, whereas nn.LSTM takes a 3-D input of shape L x B x F, or B x L x F with the handy batch_first option, where L is the sequence length.
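A rough comparison of the two options (shapes only; the sizes are illustrative):

import torch
import torch.nn as nn

input_size, hidden_size, batch, seq_len = 10, 64, 32, 8
x_seq = torch.randn(seq_len, batch, input_size)

# nn.LSTMCell: 2-D input (B x F), stepped manually over the sequence
cell = nn.LSTMCell(input_size, hidden_size)
h = torch.zeros(batch, hidden_size)
c = torch.zeros(batch, hidden_size)
for t in range(seq_len):
    h, c = cell(x_seq[t], (h, c))            # one time step at a time

# nn.LSTM: 3-D input, L x B x F by default or B x L x F with batch_first=True
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
out, (hn, cn) = lstm(x_seq.transpose(0, 1))  # whole sequence in one call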

Implementation details matter when building deep reinforcement learning agents, so this is a gathering of implementation choices from across the internet. The PPO algorithm consists of 3 stages. First, the current policy generates trajectories and stores them in a buffer. Second, the stored trajectories are used to calculate the loss (shown above) for the policy and value networks. Finally, the network parameters are updated with SGD to maximize reward / minimize loss. According to Andrychowicz et al. (2020), on-policy parameter updates benefit from mini-batch gradient steps compared to large-batch updates. When using an LSTM, the inputs include the LSTM hidden state as well as the observation from the environment, so we add the hidden states to the buffer along with states / actions / rewards to compute the loss efficiently.
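The sketch below shows what passing the hidden state alongside the observation might look like; the layer sizes and the shared LSTM body are assumptions for illustration, not the article's exact model:

import torch
import torch.nn as nn

class RecurrentActorCritic(nn.Module):
    # Illustrative actor-critic with a shared LSTM body
    def __init__(self, obs_dim=10, hidden_size=64, n_actions=4):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden_size, batch_first=True)
        self.policy_head = nn.Linear(hidden_size, n_actions)
        self.value_head = nn.Linear(hidden_size, 1)

    def forward(self, obs_seq, hidden):
        # obs_seq: (batch, seq_len, obs_dim); hidden: tuple of (1, batch, hidden_size)
        features, hidden = self.lstm(obs_seq, hidden)
        return self.policy_head(features), self.value_head(features), hidden

net = RecurrentActorCritic()
obs = torch.randn(32, 8, 10)
h0 = (torch.zeros(1, 32, 64), torch.zeros(1, 32, 64))
logits, values, h1 = net(obs, h0)   # the hidden state goes in and comes back out with the outputs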

Hidden States

There are various methods for handling hidden states in RL. In off-policy RL there is a catch: the hidden states stored in the replay buffer when the trajectory was created no longer match the hidden states the newly updated network would produce. When using on-policy algorithms like PPO, this stale-state issue is largely eliminated, since the trajectory is discarded after each parameter update of the network. However, Andrychowicz et al. (2020) argue that even in on-policy algorithms the advantage values can become stale over the course of a single update, and they suggest that advantages should be recalculated before each mini-batch. So we have to refresh advantages and recalculate hidden states.
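As an illustration of that refresh step, here is a generic GAE recomputation (the standard formula, not code from the article) that can be rerun with up-to-date value estimates before each mini-batch:

import torch

def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    # Generalized Advantage Estimation over a stored trajectory of length T.
    # Rerunning this with fresh `values` keeps advantages from going stale mid-update.
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        next_nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        gae = delta + gamma * lam * next_nonterminal * gae
        advantages[t] = gae
    return advantages

adv = compute_gae(torch.rand(16), torch.rand(16), torch.zeros(16), last_value=0.0)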

Initialization of the hidden and cell state affects the performance of the agent, and different approaches pay off in different environments. In our project, where each episode has a fixed length (the trading days), we prefer zero-state initialization at the start of each episode for every epoch, so the gradients can flow from the end of the episode back to the beginning and each day is treated as unique. However, in some settings you can reuse the last episode's hidden state for the new episode to increase training speed and performance.

In their R2D2 paper, Kapturowski et al. describe 'representational drift' leading to 'recurrent state staleness': a stored recurrent state generated by a sufficiently old network can differ significantly from the state a more recent version of the network would produce. They try various initialization techniques, such as carrying the last hidden state of one batch over to the next batch or starting the hidden state from zeros. This problem gets more complicated in offline learning.

(Figure: ways to use the hidden state in a reinforcement learning setting.)

Masks

Masks are needed when using LSTM layers because we have to remove the effect of zero-padded sequence steps when constructing the loss.

For example, if we pick a sequence length of 16 and the trajectory segment ends at the 8th time step, we fill the remaining 8 steps with zeros and mask them out when constructing the loss.
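A minimal sketch of such a mask (the per-step loss here is a stand-in for the real PPO loss terms):

import torch

seq_len, valid_steps = 16, 8          # segment ended at step 8 of a 16-step sequence

per_step_loss = torch.randn(seq_len)  # stand-in for the per-time-step PPO loss
mask = torch.zeros(seq_len)
mask[:valid_steps] = 1.0              # 1 for real steps, 0 for the padded tail

# Average the loss over real steps only, so padding contributes nothing
masked_loss = (per_step_loss * mask).sum() / mask.sum()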

Rollout Buffer & Preparing the Batch

We store the trajectories and hidden states generated through the actor's interaction with the environment in the LSTM buffer. The trajectories stored in the buffer are then transformed into mini-batches of data for training the neural network. We use padded sequences, which divide the trajectories into fixed-length inputs. By integrating architectures like LSTMs (Long Short-Term Memory units) into the learning pipeline, we can efficiently handle such temporally extended patterns.
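A small sketch of the padding step (the trajectory lengths and feature count are illustrative assumptions):

import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical trajectories of unequal length, 10 features per step
trajectories = [torch.randn(12, 10), torch.randn(7, 10), torch.randn(16, 10)]

# Pad to a common length and keep a mask marking the real (non-padded) steps
padded = pad_sequence(trajectories, batch_first=True)       # (3, 16, 10)
lengths = torch.tensor([t.shape[0] for t in trajectories])
mask = (torch.arange(padded.shape[1])[None, :] < lengths[:, None]).float()  # (3, 16)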

REFERENCES

Kamal Ndousse. Stale hidden states in PPO-LSTM - https://kam.al/blog/ppo_stale_states/

Nikolas Pitsillos. A PPO+LSTM Guide - https://npitsillos.github.io/blog/2021/recurrent-ppo/

Bram Bakker. Reinforcement learning by backpropagation through an LSTM model/critic - https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.115.8633&rep=rep1&type=pdf

Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, Will Dabney. Recurrent Experience Replay in Distributed Reinforcement Learning - https://openreview.net/pdf?id=r1lyTjAqYX