# Machine Learning: Applying LSTM to Spreads

**Introduction**

Here we examine long short-term memory (LSTM) neural network models, their application to spread prices, and how viable they are for making price predictions. Spread prices are good candidates for this technique because they tend to show mean-reverting patterns that a neural net can pick up. LSTM also seems appropriate because its cells are designed to retain information over longer sequences of data.

Recurrent neural nets (RNNs) have a problem with short-term memory: they have a hard time carrying information from earlier time steps to later ones. During training, gradients determine how much each weight is updated, and as the error is propagated further back in time the gradient shrinks toward zero. Network layers that receive a very small gradient stop learning. These are usually the earlier layers, and because they do not learn, an RNN can forget what it saw earlier in a long sequence, which gives it a short-term memory. LSTM models are designed to mitigate this vanishing gradient problem.
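The shrinking gradient can be illustrated with a toy calculation. The sketch below is not the article's model: it assumes a scalar RNN `h_t = tanh(w * h_(t-1) + x)` with a hypothetical weight `w` and constant input `x`, and tracks how the gradient with respect to earlier hidden states decays as it is propagated back in time.

```python
import math

def gradient_magnitudes(n_steps, w=0.5, x=1.0):
    """Return |d h_T / d h_(T-k)| for k = 1..n_steps in a scalar tanh RNN.

    w and x are illustrative assumptions, not values from the article.
    """
    # Forward pass: store every hidden state.
    h = 0.0
    hidden = []
    for _ in range(n_steps):
        h = math.tanh(w * h + x)
        hidden.append(h)

    # Backward pass: each step multiplies the gradient by
    # d h_t / d h_(t-1) = (1 - h_t^2) * w, the tanh derivative times the weight.
    grad = 1.0
    magnitudes = []
    for h_t in reversed(hidden):
        grad *= (1.0 - h_t ** 2) * w
        magnitudes.append(abs(grad))
    return magnitudes

grads = gradient_magnitudes(50)
print(f"1 step back:   {grads[0]:.3e}")
print(f"50 steps back: {grads[-1]:.3e}")
```

Because every factor is smaller than 1 in magnitude here, the product decays geometrically, which is why the earliest steps contribute almost nothing to the weight update.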

**Cell architecture of RNNs versus LSTM**

Batches of input get passed through the “cells” of an RNN for training. The RNN processes the input sequence one step at a time, squashing the data through the tanh activation function. This keeps the values flowing through the network between -1 and 1 so that larger values don’t dominate the training and estimates. The architecture of the Keras LSTM is more involved: it introduces a cell state that holds information in the cell itself, plus gates that decide which previous information to drop and which to keep, operating on the concatenation of the previous hidden state and the current input. The main components are the forget, input and output gates, which respectively decide what information to keep, what is relevant from the current step, and what the next hidden state should be. Both sigmoid and tanh functions are used to decide which data is relevant. An overview of the cell structures can be found below, and you can see that the LSTM units are more complex in nature, with additional data controls sitting inside the cell itself.
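To make the gate mechanics concrete, here is a minimal NumPy sketch of a single LSTM step using the standard textbook equations. It is an illustration only: the function name `lstm_step`, the stacked weight matrix `W`, and the gate ordering are assumptions made for this example and do not reflect how Keras lays out its weights internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step (illustrative layout, not Keras internals).

    W: shape (4 * n_hidden, n_hidden + n_input), gates stacked row-wise.
    b: shape (4 * n_hidden,).
    """
    n = h_prev.shape[0]
    # Concatenate previous hidden state and current input, then project once.
    z = W @ np.concatenate([h_prev, x_t]) + b

    f = sigmoid(z[0:n])            # forget gate: how much of the old cell state to keep
    i = sigmoid(z[n:2 * n])        # input gate: how much of the new candidate to admit
    g = np.tanh(z[2 * n:3 * n])    # candidate values from the current step
    o = sigmoid(z[3 * n:4 * n])    # output gate: how much of the cell state to expose

    c_t = f * c_prev + i * g       # updated cell state
    h_t = o * np.tanh(c_t)         # next hidden state
    return h_t, c_t

# Run a short random sequence through the cell (sizes are arbitrary).
rng = np.random.default_rng(0)
n_input, n_hidden = 3, 4
W = rng.normal(scale=0.1, size=(4 * n_hidden, n_hidden + n_input))
b = np.zeros(4 * n_hidden)
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in rng.normal(size=(5, n_input)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```

The cell state `c_t` is updated additively rather than being pushed through a squashing function at every step, which is the property that helps gradients survive over longer sequences.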

**Configuration Settings of LSTM models**

There are various configuration settings that need to be set for the LSTM process to run. Here we review some of the key ones. The **state** is the cell state that gets passed to the next step, while the **sample** size is the length of the input sequence. The **batch size** controls the number of samples to work through before the internal weights are updated. The data must be input in batch format. A larger batch size gives a smoother gradient estimate per update, which can reduce the number of gradient updates needed for convergence. The number of **epochs** controls the number of passes through the training dataset. For predictions, we set **n_timesteps** equal to the number of consecutive predictions to run after training on each sample. The following code can be used to transform your data from time-series form into batch form input.
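The original listing is not reproduced here, so what follows is a minimal sketch of one common way to do this transformation, not necessarily the author's code. The helper `make_windows` and the parameter `n_lookback` (the input sequence length) are hypothetical names; `n_timesteps` is used as described above, as the number of consecutive values to predict. The output `X` has the three-dimensional shape `(samples, timesteps, features)` that Keras recurrent layers expect.

```python
import numpy as np

def make_windows(series, n_lookback, n_timesteps=1):
    """Slice a 1-D series into overlapping (input, target) windows.

    Returns X with shape (n_samples, n_lookback, 1) and
    y with shape (n_samples, n_timesteps).
    """
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - n_lookback - n_timesteps + 1
    if n_samples <= 0:
        raise ValueError("series is too short for the requested window sizes")

    # Each input window is followed immediately by its target window.
    X = np.stack([series[i:i + n_lookback] for i in range(n_samples)])
    y = np.stack([series[i + n_lookback:i + n_lookback + n_timesteps]
                  for i in range(n_samples)])

    # Add a trailing feature axis: one feature (the spread price) per step.
    return X[..., np.newaxis], y

X, y = make_windows(np.arange(10), n_lookback=3, n_timesteps=2)
print(X.shape, y.shape)
```

With ten observations, a look-back of three and two prediction steps, this yields six samples; the first input window is `[0, 1, 2]` with target `[3, 4]`.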

**Optimization settings of LSTM**

The model is trained and its parameters are updated according to the loss function we choose to measure and the optimizer type. Several loss functions can be selected, including RMSE, MAE, MSE and logcosh, which is the logarithm of the hyperbolic cosine of the prediction error. log(cosh(x)) is approximately equal to (x ** 2) / 2 for small x and to abs(x) - log(2) for large x. One benefit of using it is that it is not as strongly affected by the occasional outlier among the predictions. The main optimizers that can be used are stochastic gradient descent (SGD) and Adam.
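The two approximations for log(cosh(x)) can be checked numerically. The sketch below is a standalone illustration in plain Python, not the Keras implementation; the function names are chosen for this example. It uses the identity log(cosh(x)) = |x| + log(1 + exp(-2|x|)) - log(2), which avoids overflow for large errors.

```python
import math

def logcosh(x):
    """Numerically stable log(cosh(x))."""
    a = abs(x)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

def logcosh_loss(y_true, y_pred):
    """Mean log-cosh of the prediction errors."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    return sum(logcosh(e) for e in errors) / len(errors)

def mse_loss(y_true, y_pred):
    """Mean squared error, for comparison."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Small error: close to x**2 / 2. Large error: close to abs(x) - log(2).
print(logcosh(0.01), 0.01 ** 2 / 2)
print(logcosh(20.0), 20.0 - math.log(2.0))

# A single outlier inflates MSE far more than log-cosh.
y_true = [0.0, 0.0, 0.0, 0.0]
y_pred = [0.1, -0.1, 0.1, 10.0]
print(mse_loss(y_true, y_pred), logcosh_loss(y_true, y_pred))
```

The loss behaves quadratically near zero, like MSE, and linearly in the tails, like MAE, which is why one large miss moves it much less than it moves MSE.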

**Applications to Futures Spreads**

We capture data using a client API and store the history at a fixed-length time increment to create a time series for the underlying outright contracts. This data is the same as what we see in production. To create a calendar spread, apply the formula OR1 - B*OR2, where B is the beta coefficient and OR1 and OR2 are the underlying outright contracts. The weighted midpoint (WMP), or clearing price, is the weighted average of the bid and ask prices.
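As a rough sketch of the spread construction, the snippet below applies OR1 - B*OR2 to two price series. The function names and sample prices are hypothetical, and the weighting scheme for the WMP is not specified in the text, so `bid_weight` is an assumed parameter (0.5 gives the plain midpoint).

```python
def weighted_midpoint(bid, ask, bid_weight=0.5):
    """Weighted average of bid and ask.

    bid_weight is an assumption for illustration; the article does not
    state how the weights are chosen.
    """
    if not 0.0 <= bid_weight <= 1.0:
        raise ValueError("bid_weight must be between 0 and 1")
    return bid_weight * bid + (1.0 - bid_weight) * ask

def calendar_spread(or1, or2, beta):
    """Spread series OR1 - beta * OR2 from two aligned outright price series."""
    if len(or1) != len(or2):
        raise ValueError("price series must be aligned and of equal length")
    return [p1 - beta * p2 for p1, p2 in zip(or1, or2)]

# Made-up prices for two outright contracts, one value per time increment.
front = [weighted_midpoint(b, a) for b, a in [(100.0, 102.0), (101.0, 103.0)]]
back = [weighted_midpoint(b, a) for b, a in [(49.0, 51.0), (50.0, 52.0)]]
print(calendar_spread(front, back, beta=2.0))
```

Both series must be sampled on the same time increments before the spread is computed, otherwise the subtraction mixes prices from different moments.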

For training purposes, I construct the following calendar spread on coffee futures.

Training over 30 epochs, we see that after 10 epochs the measured loss, in this case MSE, is minimized.

Plotting the batch model predictions against the actual data shows that the estimates deviate slightly from the actual spread price but overall do a decent job of predicting direction. After aligning the time series with the output of the predictions, the model predicts the direction over the next hour with 60% accuracy. Further study is needed to see whether we can capture larger directional moves, as smaller increments would be difficult to capture in an automated trading environment. The following is a plot of the predictions and the sample code for scoring the batch data.
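The scoring listing is not reproduced here, so the following is only a sketch of one way to measure directional accuracy, under stated assumptions: the function name and arguments are hypothetical, direction is measured relative to the last observed price of each sample, and samples with no realised move are skipped. The author's actual scoring method may differ.

```python
def directional_accuracy(last_prices, actual_prices, predicted_prices):
    """Fraction of samples where the predicted move has the same sign
    as the realised move, both measured from the last observed price."""
    hits = 0
    total = 0
    for last, actual, pred in zip(last_prices, actual_prices, predicted_prices):
        actual_move = actual - last
        pred_move = pred - last
        if actual_move == 0:
            continue  # no realised direction to score against
        total += 1
        # Same sign means the product is positive; a flat prediction is a miss.
        if actual_move * pred_move > 0:
            hits += 1
    if total == 0:
        raise ValueError("no samples with a non-zero realised move")
    return hits / total

# Made-up values: two correct calls, one wrong call, one flat sample skipped.
last = [1.0, 1.0, 1.0, 1.0]
actual = [2.0, 0.0, 2.0, 1.0]
pred = [1.5, 0.5, 0.5, 3.0]
print(directional_accuracy(last, actual, pred))
```

The three lists must be aligned so that each prediction is compared with the actual price at the same future timestamp, which corresponds to the alignment step described above.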