
Transformers vs. LSTM for Stock Price Time Series Prediction | by Michael May | Jun, 2023
So I figured I should continue my discussion of the stock time series problem I began in my first blog post. In that first post, we used LSTM and CNN-LSTM model architectures to predict future stock prices using only previous prices as inputs, and we did fairly well. Our mean absolute percentage error was at or below three percent and things were looking good. I mentioned in that post that I'd discuss ways to improve this figure, and initially I had meant to do so by creating an ensemble LSTM model, using LSTM nets with different hyperparameters and then averaging their predictions to arrive at (hopefully) something better. But I have to admit I got a little sidetracked pursuing a different, and I think more interesting, line of inquiry. In a single word, that new approach is Transformers.
What’s a Transformer Model Architecture?
Before we get into the specifics of the problem, I thought a little exposition would be nice. It should be noted, however, that transformers are a high-level machine learning concept that I’m just getting my feet wet in, and therefore I’m far from a subject matter expert. But that being said, teaching (or trying to) is an integral part of learning, so in the spirit of progress, here it goes.
At a very high level, a transformer is a feed-forward neural network architecture that leverages a “self-attention mechanism” to reduce training times and parameter counts while remaining highly predictive. How I’ve been conceptualizing this is as follows: transformers can do the work of RNNs without all that recurrence slowing things down and adding parameters. It works like this: a transformer’s multi-head attention mechanism (implemented by tf.keras.layers.MultiHeadAttention in TensorFlow) allows the model to keep track of every data point relevant to a specific other data point. Put in the context of our problem, if we’re trying to predict tomorrow’s stock price, today’s price is obviously relevant, but other prices are as well. For instance, this week’s price action could be very similar to something that happened three months ago, and therefore it’s beneficial if our model can “remember” that price action. This need for “memory” was one reason we chose recurrent neural networks and LSTM (long short-term memory) models in the first place. By allowing earlier data points to be fed back into the model, RNNs can weight previous values, giving us this memory effect. What makes transformers different is that their self-attention mechanism gives them this same “memory” effect without all the recurrence: the data is fed in once, and self-attention keeps track of the relevant data points at each step (it weights that price action from three months ago that looks like what we’re currently seeing, say). This means transformers train much faster than RNNs, since we only feed the data in once, and they require many fewer parameters, a win-win. For a more cogent and informed explanation of transformers I’ll link the original paper describing them at the bottom of this post. But for now, on to our problem.
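To make that a little more concrete, here is a minimal, self-contained sketch (not part of the original code) of tf.keras.layers.MultiHeadAttention attending over a toy five-day price sequence; the prices are made up and only the shapes matter:

import tensorflow as tf

# A toy batch with one sequence: 5 timesteps, 1 feature (say, 5 daily closing prices)
prices = tf.constant([[[101.0], [102.5], [101.8], [103.2], [104.0]]])  # shape (1, 5, 1)

# Self-attention: the sequence attends to itself
mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=4)
context, scores = mha(prices, prices, return_attention_scores=True)

print(context.shape)  # (1, 5, 1): one context-aware vector per timestep
print(scores.shape)   # (1, 2, 5, 5): per head, how strongly each day attends to every other day

That (5, 5) score matrix is exactly the “which day should I pay attention to” bookkeeping described above.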
The Stock Time Series Problem
The question this investigation is trying to answer can be formulated simply as, “Can we predict stock prices over the next five days, using nothing but previous stock prices?” As I’ve alluded to above, we have already partially answered this question with a tentative yes. However, the models we’ve built, while surprisingly adept, are far from perfect or even particularly useful: being 3.5% off on average isn’t enough to feel confident trading on. Part of this is because the problem is simply hard; the data is noisy and very nearly random. They don’t call the stock market a random walk for nothing, it turns out. But even still we’re undeterred, and so we’re going to plow ahead and tick off the following steps in pursuit of a better model (a sketch of the assumed data windowing follows the list):
- Establish a reliable baseline LSTM model to evaluate our transformer’s performance against.
- Build a transformer model
- Train and evaluate our models
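A quick note on the data before we dive in: the ETL class that feeds the models below comes from the first post and isn’t reproduced here, so here is a rough sketch of the kind of sliding-window preparation it is assumed to do (make_windows and the synthetic series are my own placeholders, not the original code):

import numpy as np

def make_windows(prices, n_in=5, n_out=5):
    """Slide over a 1-D price series to build (samples, 5, 1) inputs and (samples, 5) targets."""
    X, y = [], []
    for i in range(len(prices) - n_in - n_out + 1):
        X.append(prices[i:i + n_in])
        y.append(prices[i + n_in:i + n_in + n_out])
    return np.array(X).reshape(-1, n_in, 1), np.array(y)

# A stand-in series so the sketch runs end to end; swap in real closing prices here
series = np.sin(np.linspace(0, 20, 500)) + 100
X, y = make_windows(series)
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]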
Part One: The LSTM
Let’s build our baseline model. From previous experimentation, I’ve found that a three-layer network, an LSTM layer with 200 neurons followed by two dense layers with 50 neurons and one output neuron per predicted day respectively, works very well for this problem. Here’s the code to implement it:
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

# ETL is the data-loading class from the first post (see the linked notebook);
# it exposes X_train, y_train, X_test, and y_test arrays.
def build_lstm(etl: "ETL", epochs=25, batch_size=32) -> tuple[tf.keras.Model, tf.keras.callbacks.History]:
    """
    Builds, compiles, and fits our LSTM baseline model.
    """
    n_timesteps, n_features, n_outputs = 5, 1, 5
    callbacks = [tf.keras.callbacks.EarlyStopping(patience=10,
                                                  restore_best_weights=True)]
    model = Sequential()
    model.add(LSTM(200, activation='relu',
                   input_shape=(n_timesteps, n_features)))
    model.add(Dense(50, activation='relu'))
    model.add(Dense(n_outputs))
    print('compiling baseline model...')
    model.compile(optimizer='adam', loss='mse', metrics=['mae', 'mape'])
    print('fitting model...')
    history = model.fit(etl.X_train, etl.y_train,
                        batch_size=batch_size,
                        epochs=epochs,
                        validation_data=(etl.X_test, etl.y_test),
                        verbose=1,
                        callbacks=callbacks)
    return model, history
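The etl argument is an instance of that ETL class; if you don’t have it handy, any object exposing the same four arrays will do, as in this usage sketch (SimpleNamespace here is just a stand-in, not the original code):

from types import SimpleNamespace

# Stand-in for the ETL object: anything exposing these four arrays works
data = SimpleNamespace(X_train=X_train, y_train=y_train,
                       X_test=X_test, y_test=y_test)

lstm_model, lstm_history = build_lstm(data, epochs=25, batch_size=32)

The returned lstm_model and the data object are reused in the snippets that follow.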
The model summary for this network puts it at a little over 171,000 parameters, a figure worth keeping in mind for the comparison ahead.
So this is the baseline we’re going to use to evaluate our transformer. We’ll discuss its results in the evaluation section, but for now let’s talk about our transformer architecture.
Part Two: The Transformer
For our transformer architecture we will be using the construction recommended in the Keras documentation (linked at the end of the post). That setup is built for classification, so we’ll make a small change: we’ll switch the final output layer’s activation function from softmax to ReLU, and our loss function to mean squared error. Apart from that, I’ve set the hyperparameters to what I found works best for this problem through experimentation. Here’s what we get:
from tensorflow.keras import layers

def transformer_encoder(inputs, head_size, num_heads, ff_dim,
                        dropout=0, attention_axes=None):
    """
    Creates a single transformer block.
    """
    # Attention and normalization part
    x = layers.LayerNormalization(epsilon=1e-6)(inputs)
    x = layers.MultiHeadAttention(
        key_dim=head_size, num_heads=num_heads, dropout=dropout,
        attention_axes=attention_axes
    )(x, x)
    x = layers.Dropout(dropout)(x)
    res = x + inputs

    # Feed-forward part
    x = layers.LayerNormalization(epsilon=1e-6)(res)
    x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(x)
    x = layers.Dropout(dropout)(x)
    x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
    return x + res
def build_transformer(head_size,
                      num_heads,
                      ff_dim,
                      num_trans_blocks,
                      mlp_units, dropout=0, mlp_dropout=0,
                      attention_axes=None) -> tf.keras.Model:
    """
    Creates the final model by stacking several transformer blocks.
    """
    n_timesteps, n_features, n_outputs = 5, 1, 5
    inputs = tf.keras.Input(shape=(n_timesteps, n_features))
    x = inputs
    for _ in range(num_trans_blocks):
        x = transformer_encoder(x, head_size, num_heads, ff_dim, dropout,
                                attention_axes)
    x = layers.GlobalAveragePooling1D(data_format="channels_first")(x)
    for dim in mlp_units:
        x = layers.Dense(dim, activation="relu")(x)
        x = layers.Dropout(mlp_dropout)(x)
    outputs = layers.Dense(n_outputs, activation='relu')(x)
    return tf.keras.Model(inputs, outputs)
transformer = build_transformer(head_size=128, num_heads=4, ff_dim=2,
                                num_trans_blocks=4, mlp_units=[256],
                                mlp_dropout=0.10, dropout=0.10,
                                attention_axes=1)

transformer.compile(
    loss="mse",
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    metrics=["mae", "mape"],
)

callbacks = [tf.keras.callbacks.EarlyStopping(patience=10,
                                              restore_best_weights=True)]

t_hist = transformer.fit(data.X_train, data.y_train, batch_size=32,
                         epochs=25, validation_data=(data.X_test, data.y_test),
                         verbose=1, callbacks=callbacks)
This code builds a stack of the encoder block above (repeated num_trans_blocks times), followed by global average pooling and the dense head.
Finally, in total our transformer has 17,205 parameters, a little over one tenth of our LSTM’s. Now let’s get training.
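If you’d like to verify those counts on your own run, Keras reports them directly (lstm_model being the baseline returned by build_lstm above):

# A quick sanity check on the two models' sizes
print(f"LSTM parameters:        {lstm_model.count_params():,}")
print(f"Transformer parameters: {transformer.count_params():,}")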
Part Three: Training and Evaluation
We’ll train both nets for 25 epochs with an Adam optimizer. First, let’s discuss our baseline model’s performance on the test data. What I’d like to note first is that the LSTM is remarkably consistent; that is, if you train it 10 consecutive times it will give you predictions that perform very nearly the same on the test set in all ten cases. It also trains more quickly than I expected, 143 seconds in all. It does well on inference time too, taking only twenty-two seconds. This is made all the more impressive since, in this comparison at least, the LSTM is the heavyweight, weighing in at a portly 171,000+ parameters as mentioned before.
In terms of its predictive capability we find the LSTM to be skillful (not terribly surprising; it’s our baseline for a reason). It scores a MAPE of 2.44% on our test set. All in all, the LSTM is a consistent, easily trained model which is skillful at predicting stock time series data. Its one limitation is that it’s large, and thus can’t easily be scaled. However, when it comes to stock prices there really isn’t enough data to build extremely deep models; if you did, you’d actually start to lose performance. Thus, I don’t think the heavy parameter count really hurts the LSTM very much.
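For reference, the MAPE figures in this section come from evaluating on the held-out windows, along these lines (a rough sketch; your scores and timings will certainly differ):

import time

# Score the trained baseline on the held-out windows and time the forward pass
start = time.time()
loss, mae, mape = lstm_model.evaluate(data.X_test, data.y_test, verbose=0)
print(f"LSTM test MAPE: {mape:.2f}% (evaluation took {time.time() - start:.1f}s)")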
Next, let’s discuss the transformer. The first thing I’d like to note is that the transformer is more unstable than the LSTM. What do I mean by that? I trained the same models (read: same hyperparameters) many, many times. The result? I trained the best model I’ve built to predict stock prices, coming in with a MAPE of 2.37%. I built another with a MAPE of 2.41%. Both of these, you’ll note, are better than our baseline, which consistently weighed in around 2.45%-2.5%. But unfortunately, the transformer was not able to replicate that performance consistently, even with the same hyperparameters as those two golden training runs. What’s more, it wasn’t like the transformer was only doing slightly worse, either. There were times its MAPE topped out over 3%. This is a problem if we’re trying to build and then retrain models every month, or quarter, say. You might have a solid transformer model, retrain it, and be left with a pile of junk. So in that respect the transformer was not ideal.
So, where did the transformer really shine? Parameter count. It weighs in at a little over a tenth the number of parameters of the LSTM. This meant the transformer trained faster than the LSTM, taking only 138 seconds to the LSTM’s 143. But surprisingly, inference was faster on the LSTM: the transformer took 25 seconds in all.
Finally, in terms of the relative variance of the predictions vs. the actual values, in other words how much uncertainty is baked into our predictions, the LSTM edged out the transformer: 2.4% to 2.6%.
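If you want a comparable number from your own runs, the standard deviation of the relative prediction errors is a reasonable proxy (my own rough sketch, not a built-in Keras metric):

import numpy as np

def relative_spread(model, X, y):
    """Standard deviation of the relative prediction errors, as a rough uncertainty proxy."""
    preds = model.predict(X, verbose=0)
    return 100 * np.std((preds - y) / y)

print(f"LSTM:        {relative_spread(lstm_model, data.X_test, data.y_test):.1f}%")
print(f"Transformer: {relative_spread(transformer, data.X_test, data.y_test):.1f}%")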
Conclusion
First of all, it should be noted that the transformer’s performance was really impressive. With only ~17,000 parameters it was able to keep up with an LSTM with over 170,000. That’s no mean feat. However, it was unpredictable and unstable in the face of retraining. I don’t know if this is a fundamental feature of transformers or a problem with my implementation and usage of them, but it was definitely very noticeable. This may not be an issue if you’re training a large language model; English doesn’t change so rapidly that you have to put out a new model every couple of weeks. But for predicting stock prices, the ability to retrain as new data flows in, without having to worry that your previously very skillful model is now merely mediocre, becomes essential. For that reason, and the fact that our parameter savings didn’t speed up training times all that much, I think for my money I’d still go with the LSTM. Plus, we don’t have to scale these models to billions of parameters, the regime where transformers really shine. 1B parameters vs. 10B is a very different choice than 17,000 vs. 170,000.
So while I really enjoyed working with transformers for the first time, I don’t think they’re terribly well suited to this particular problem. However, while I was experimenting a thought occurred to me which I think I’ll explore in a subsequent post: by combining the Transformer and LSTM architectures, can we get the best of both worlds? More reproducibility, lower parameter counts, and better predictions? For that, stay tuned. Thanks for reading!
Links:
The Original Paper on Transformers: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
My Colab Notebook with the source code: https://github.com/maym5/lstm_vs_transformer/blob/main/lstm_vs__transformer.ipynb
Keras Transformer docs:
https://keras.io/examples/timeseries/timeseries_transformer_classification/