BERT is the Word: Predicting Stock Prices with Language Models

Mark Babbe
11 min read · Mar 16, 2019


Project By: Mark Babbe, Cory Nguyen, Won Lee, and Hanny Noueilaty

Source: https://www.huffingtonpost.com/entry/the-future-of-machine-learning-in-finance_us_58d55c99e4b06c3d3d3e6d42

Introduction

Machine learning and neural networks continue to revolutionize industries all around the globe, and the finance industry is no exception. In our class project we decided to implement a neural network with the hopes of understanding how information in SEC 8-K forms could be utilized in predicting the overall movement of the S&P 500. In reality, the success of such a model in the real world and its applications would depend on the infrastructure, datasets, and processing/action time. That said, we found that publicly available 8-K forms could provide valuable insights in determining the overall movement of markets.

We were inspired by a Huffington Post article titled “The Future of Machine Learning in Finance” as well as a blog post on Medium from a fellow data scientist. We thank Yusef Aktan for his blog post, which was a helpful starting point.

Data Collection & Preprocessing

For data collection, we were fortunate enough to have access to Aktan’s work on his GitHub, which allowed us to build on top of his hard work. From the start, we had access to a scraper and a Jupyter notebook for data preprocessing. With this as our starting point, we began modifying the code to see if we could improve how the data preprocessing stage was handled while addressing any bugs that inadvertently occur when programming scripts change hands. For instance, we quickly ran into problems with the scraper meant to retrieve 8-K filings from the SEC EDGAR database. Attaching a timer to the requests.get function alone would not solve the issue, since access to the documents was not the problem. We attributed the cause, more likely than not, to how the text data was stored on the SEC EDGAR database, which proved to be an interesting reminder of how quickly data sources can change. Because the problem could not be isolated to a single point, we handled the issue by wrapping a standard timer class and function inside the scraper. After that, we were able to retrieve the text data without hindrance.
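To illustrate, here is a minimal sketch of the kind of guard we put around the scraper’s HTTP calls; the function name, URL handling, and retry logic are our own simplification for this post, not the exact code from the scraper.

```python
import requests

# Hypothetical sketch of the guard around the scraper's HTTP calls.
# The real scraper builds each URL from the filing's EDGAR index entry.
def fetch_filing(url, timeout=10, retries=3):
    """Fetch one 8-K document, giving up politely if EDGAR is slow or flaky."""
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            if attempt == retries - 1:
                return None  # skip this document rather than stall the whole run
    return None
```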

We faced a similar issue with data sources when it came to obtaining daily stock prices. At first, we tried to retrieve data from Alpha Vantage through their documented API using our own API key, but Alpha Vantage limits how much data can be requested per day without a premium membership. We instead turned to Quandl, another source of economic and financial data, so that we could use their Wiki dataset. It should be noted that many people moved from Quandl to Alpha Vantage after the Wiki dataset was discontinued in April 2018. With some degree of irony, we moved in the opposite direction, from Alpha Vantage to Quandl’s Wiki dataset, since we only needed data up to April 2018 to compare our approach to the original developer’s.
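For anyone curious, pulling adjusted daily prices from the Wiki dataset with the quandl Python package looks roughly like the sketch below; the ticker, date range, and column names are illustrative, and it assumes a (free) Quandl API key.

```python
import quandl

# Illustrative only: ticker, dates, and API key are placeholders.
quandl.ApiConfig.api_key = "YOUR_API_KEY"

# The Wiki dataset was discontinued in April 2018, but its history up to that
# point is exactly the window we needed to compare with the original work.
prices = quandl.get("WIKI/AAPL",
                    start_date="2013-01-01",
                    end_date="2018-03-27")

# Adjusted open/close are what we later use for normalized price changes.
print(prices[["Adj. Open", "Adj. Close"]].tail())
```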

Ultimately, we ended up with approximately 19,000 documents, slightly fewer than Aktan started with, since the scraper had to skip a few hundred documents. At this point the dataset was around 3 gigabytes in size, with some documents having a massive word count. In fact, a handful of documents contained close to three million words despite our timer. It became evident that the data we had procured was not clean and needed to be reduced to its necessary components. This part proved to be tricky, yet the solution turned out to be rather simple. As obtained, the text documents contained header information, such as the accession number, that would not help us. Toward the end of this header, however, there was a field called “Check the appropriate box.” Using regular expressions, we were able to eliminate all of the text leading up to this phrase. Still, our text data contained extraneous content. To find the most relevant information, we looked at the second page, where we identified a recurring pattern: the word “On” usually preceded the start of the actual information divulged by the corporation in question.

Using regular expressions, we searched for that particular word in each document, allowing us to obtain higher-quality data. The price for this, however, was a decrease in the number of documents we could use: Aktan had 17,000 after data preprocessing, while we were down to 11,000. On the other hand, our file size shrank to 22 megabytes after we handled the documents with excessive word counts. Most importantly, we ended up with a more polished set of documents, reduced to their necessary parts.
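A simplified sketch of the two cleaning passes described above is shown here; the exact regular expressions in our notebook handle a few more edge cases.

```python
import re

def trim_8k_text(raw_text):
    """Strip boilerplate from a raw 8-K filing, keeping the disclosed content.

    Simplified version of the two passes described above.
    """
    # Pass 1: drop everything up to and including the "Check the appropriate box"
    # field, which marks the end of the filing's header boilerplate.
    match = re.search(r"Check the appropriate box.*?\n", raw_text, flags=re.IGNORECASE)
    if match:
        raw_text = raw_text[match.end():]

    # Pass 2: most filings begin the actual disclosure with a sentence starting
    # with "On" (e.g. "On March 1, 2018, the Company announced ..."), so start
    # the kept text there when that pattern is present.
    match = re.search(r"\bOn\b", raw_text)
    if match:
        raw_text = raw_text[match.start():]

    return raw_text
```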

Atkan’s Original Text Document Summary:

Our Team’s Text Document Summary:

At the end of our data preprocessing stage, we ended up with a data composition that differed from Aktan’s. For instance, we used the difference between the adjusted opening and adjusted closing price to compute the normalized price change for each stock, while Aktan used the difference between the unadjusted opening price and the adjusted closing price, which might account for the difference. For transparency’s sake, we should note that the Wiki dataset may calculate adjusted stock prices differently from Alpha Vantage. On the other hand, there should be no difference between Aktan’s index price changes and ours, since adjusted and unadjusted prices are identical for the GSPC and VIX: as indices, they do not need to account for dividends, stock splits, and so on. Because the choice of adjusted or unadjusted prices does not matter for the index, we felt comfortable using adjusted prices when computing the price change for individual stocks as well. Additionally, we slightly altered the code used to obtain these price changes, which may have also played a part.
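As an illustration (not our exact notebook code), the open-to-close change we computed from the Wiki columns looks roughly like this; the DataFrame contents and column names are the hypothetical ones from the Quandl sketch above.

```python
import pandas as pd

# Hypothetical slice of Wiki prices for one ticker (in practice this comes
# straight from the quandl.get call shown earlier).
stock = pd.DataFrame(
    {"Adj. Open": [100.0, 101.5, 99.0], "Adj. Close": [101.0, 100.0, 100.5]},
    index=pd.to_datetime(["2018-03-01", "2018-03-02", "2018-03-05"]),
)

def daily_pct_change(df, open_col="Adj. Open", close_col="Adj. Close"):
    """Open-to-close percent change for each trading day."""
    return (df[close_col] - df[open_col]) / df[open_col] * 100.0

stock_change = daily_pct_change(stock)
print(stock_change)

# For the indices (^GSPC, ^VIX) adjusted and unadjusted prices coincide,
# so the same calculation works on plain "Open"/"Close" columns.
```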

All of this put together meant that we ended up with different numbers across the three classes: up, stay, and down. “Up” is any increase of more than 1%, “stay” is anything between −1% and 1%, and “down” is any decrease of more than 1%. The bulk of our data, around 66%, fell into the “stay” class, while the remaining 33% was split roughly evenly between up and down. Aktan, on the other hand, had around 58% in the “up” class, 18% in the “down” class, and around 24% in the “stay” class. Unlike Aktan, we did not use historical price trends in our processing, though we have gathered them for a possible future project. Lastly, like our predecessor, we decided to use oversampling to deal with the class imbalance problem, which would later improve our model’s accuracy significantly.
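A minimal sketch of the labeling step, assuming a hypothetical DataFrame with one row per filing and a norm_change column holding the normalized price change:

```python
import pandas as pd

def label_signal(change_pct):
    """Bucket a normalized percent change into the three classes."""
    if change_pct > 1.0:
        return "up"
    if change_pct < -1.0:
        return "down"
    return "stay"

# Hypothetical example: one row per filing, with its normalized price change.
df = pd.DataFrame({"norm_change": [2.3, -0.4, 0.7, -1.8, 0.1]})
df["signal"] = df["norm_change"].apply(label_signal)
print(df["signal"].value_counts(normalize=True))  # ~66% "stay" in our real data
```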

Above is the signal distribution of the raw data. Note that the distribution differs from Aktan’s due to differences in how we computed it.

BERT

Of all of the options available to us in this analysis, we wanted to use the BERT (Bidirectional Encoder Representations from Transformers) model. BERT is the latest pre-trained language model introduced by researchers at Google AI Language just last year (published in this paper). The main benefit of using this model is that it captures the context of words within a document. In Aktan’s article, he instead used embeddings from Stanford’s GloVe (Global Vectors for Word Representation); the flaw we noticed is that GloVe embeddings are static, so they do not take into account the context in which a word appears within an SEC Form 8-K. We felt that by capturing context, our model could better understand the implications of these documents and arrive at conclusions with better predictive power.

Prior to BERT, models had trouble capturing the contextual information of a word because they could only analyze the words preceding it. These are known as directional models because they read text inputs sequentially. With BERT, however, this obstacle is overcome through two key methods: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).

Training Strategies

The reason past models could never use bidirectional methods is that the word being predicted would, in effect, already see itself when fed into the first layer. Conditioning on both the previous and following words was simply not feasible until Masked Language Modeling. In MLM, 15% of the words in each sequence are masked, and the model attempts to predict them based on the context of the non-masked words. This way, the model is able to use both sides of a word (hence bidirectional) as context for the word being analyzed. An example of this is shown below:

Source: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
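Purely to illustrate the masking idea, here is a toy sketch; it is not BERT’s actual pre-training code, which also sometimes replaces masked positions with random or unchanged tokens.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Toy illustration of MLM: hide ~15% of tokens for the model to predict."""
    masked, targets = [], []
    for token in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            targets.append(token)      # the model is trained to recover these
        else:
            masked.append(token)
            targets.append(None)       # no loss is computed for unmasked tokens
    return masked, targets

tokens = "the company announced record quarterly earnings".split()
print(mask_tokens(tokens))
```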

Another key method in bidirectional training is Next Sentence Prediction. When BERT is being trained, the model is fed pairs of sentences and predicts whether the second sentence actually follows the first. To train this, 50% of the inputs are genuinely subsequent pairs, while the remaining 50% use random sentences from the corpus. This helps the model learn sentence-level relationships throughout the training process and ultimately helps it distinguish between the two cases. An example is shown below:

Source: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
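Again purely as a toy illustration of how such pairs could be constructed (not BERT’s actual pre-training code):

```python
import random

def make_nsp_example(sentences, corpus):
    """Toy illustration of NSP: build one (sentence A, sentence B, label) example."""
    i = random.randrange(len(sentences) - 1)
    sent_a = sentences[i]
    if random.random() < 0.5:
        return sent_a, sentences[i + 1], "IsNext"      # the true next sentence
    return sent_a, random.choice(corpus), "NotNext"    # a random sentence instead

document = ["The company filed an 8-K.", "It announced record earnings.",
            "Shares rose after the news."]
corpus = ["Penguins live in the Southern Hemisphere.", "The weather was mild."]
print(make_nsp_example(document, corpus))
```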

Something to keep in mind is that both of these methods (MLM and NSP) are trained simultaneously in BERT, with the overall goal of minimizing their combined loss. Additionally, training at this scale was made practical by Tensor Processing Units (TPUs), which deliver the necessary performance while still allowing models to be debugged and edited quickly. If you are interested, you can read more here.

BERT Implementation

BERT already comes with a specific tokenizer for preprocessing text. First, we need to specify a maximum sequence length for the inputs to the model. Because of our hardware limitations, we played it safe and allowed a maximum sequence length of only 128 tokens. This means our model would only see roughly the first hundred words of each text, which made the preprocessing crucial and explains why we would rather cut our data by almost 50% than keep the useless text. Furthermore, we lose a few more positions to special tokens, as the BERT tokenizer adds a [CLS] token at the beginning of the input and a [SEP] token at the end of each sequence in a pair.
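As a sketch of the tokenization step, shown here with the Hugging Face transformers tokenizer rather than our exact tooling:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "On March 1, 2018, the Company announced ..."  # a cleaned 8-K snippet

# Truncate/pad every document to 128 tokens; [CLS] and [SEP] are added
# automatically and count toward that budget.
encoded = tokenizer(text,
                    max_length=128,
                    truncation=True,
                    padding="max_length",
                    return_tensors="pt")

print(encoded["input_ids"].shape)  # torch.Size([1, 128])
```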

Once the data passes through the model, we have the option of extracting the pooled output or the sequence output. For a classification task such as ours (up, down, or stay), we used the pooled output, which is the output vector corresponding to the [CLS] token. On top of this output vector, an additional layer is added to adapt the model to our specific task. This process is known as “fine-tuning,” and in our case, we added a fully connected dense layer.
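A minimal sketch of this fine-tuning setup, written in PyTorch with the Hugging Face transformers library purely for illustration (not necessarily the exact tooling we used):

```python
import torch
from torch import nn
from transformers import BertModel

class BertSignalClassifier(nn.Module):
    """BERT + a single fully connected layer over the [CLS] pooled output."""

    def __init__(self, num_classes=3):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs.pooler_output   # vector for the [CLS] token
        return self.classifier(pooled)   # logits for up / stay / down
```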

Results

Unlike the original project, we did not separate our training and test data by random sampling; rather, we wanted to avoid using information from the “future,” so to speak, and potentially biasing our results. To better represent the real world, we split the data by time with a 90/10 split, where the test set took the most recent data and the training set took the rest.
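In code, the time-based split is as simple as the sketch below (the DataFrame and its date column are hypothetical):

```python
import pandas as pd

# Hypothetical labeled filings, one row per document.
df = pd.DataFrame({
    "date": pd.to_datetime(["2014-05-01", "2015-07-09", "2016-02-18",
                            "2017-03-30", "2018-01-12"]),
    "text": ["..."] * 5,
    "signal": ["up", "stay", "down", "stay", "up"],
}).sort_values("date")

split_idx = int(len(df) * 0.9)
train_df = df.iloc[:split_idx]   # the older 90% of filings
test_df = df.iloc[split_idx:]    # the most recent 10%, never seen in training
```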

After fitting our BERT + single-layer MLP model on the processed text data, it produced a loss of 0.448 and an out-of-sample accuracy of 60.39%. This is impressive in that it outperforms Aktan’s GloVe + single-layer MLP (which had about 45% accuracy on a test data set). However, the result is also disappointing, as the baseline accuracy on our test data is 63%, which could be achieved by labeling all data as “stay.”

Class imbalance proved to be the main issue we faced at this point, so we used oversampling. Within the original training data set, we resampled the data with replacement to create a new, balanced training set with the same number of rows labeled as “up,” “down,” or “stay.” With this resampled data, we re-trained the BERT + single layer MLP model with processed text data. The model produced a loss of 1.18 in the final step of training, but had an out of sample accuracy of 71%.
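A minimal sketch of that resampling step, reusing the hypothetical train_df and signal column from the earlier sketches:

```python
import pandas as pd
from sklearn.utils import resample

# Oversample each class in the *training* set only, up to the size of the
# largest class, sampling with replacement; the test set is left untouched.
largest = train_df["signal"].value_counts().max()

balanced_parts = [
    resample(group, replace=True, n_samples=largest, random_state=42)
    for _, group in train_df.groupby("signal")
]
balanced_train = pd.concat(balanced_parts).sample(frac=1, random_state=42)  # shuffle
```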

This result is a major improvement over our previous model. It also exceeds the baseline test accuracy of 66%, which can be obtained by choosing only “stay” for the test data set. Furthermore, our model outperformed all of the models described in Aktan’s blog with only one hour of training on the Google Colaboratory GPU runtime.

Our Test Data Signal Composition:

These results are surprising, as we did not expect to outperform the original blog’s models. We consider this result as evidence of the effectiveness of Google’s BERT model for contextual language processing. Since BERT considers the context of the tokens, it was able to store more information relevant to the stock market into each of its word embeddings. This single deviation from using GloVe’s word embeddings allowed us to get better results from our model, and it highlights the idea that context is key in natural language processing. From Aktan’s models, we already knew that the SEC Form 8-K texts were useful in predicting the S&P 500, and our project supports this idea.

We expect that we can improve this model by using some of the ideas that we did not carry over from Aktan’s original blog. For instance, Aktan’s model takes in the SEC Form 8-K text as well as some financial data as its inputs. Once we extract the CLS token’s vector from BERT, we could pass it through a network together with other financial data vectors, in an attempt to improve our out-of-sample accuracy. Another idea would be to use an RNN, specifically an LSTM, on the sequence output vectors that we did not extract from BERT to see if that improves our out-of-sample accuracy. This approach would also allow us to look at the different weights in the network as they pertain to certain samples of data and pinpoint exactly which tokens are the most important within the Form 8-K text.

In closing, we found this project to be a success. While the improvement is only 5 percentage points over the baseline, there is evidence that our network is using the text to correctly predict some rises and falls within the S&P 500. Our improvements over Aktan’s model, despite using similar data, also highlight the effectiveness of BERT over GloVe for handling text data at the present time. Finally, the code for this project can be found here (https://github.com/markbabbe/BERT-Stock-Prediction-Using-NLP).

And to answer the burning question on everyone’s mind, yes. Yes we have heard. That BERT is in fact, the word.
