Linear Regression | Data-Driven Weekly Forecasting of Dow Jones Stocks Using Machine Learning

As a secondary method and a benchmark for the CatBoost regression, we hard-coded a linear regressor to predict future stock prices. The method draws on our collected data and iterates over each stock’s full price history. It fits a best-fit line to 4 consecutive days of a stock’s close prices and then predicts the price 5 work-days after the last day used to fit the line. This mirrors the CatBoost model, which also makes predictions 5 work-days out. The math behind the model is straightforward: the goal is to find the line that minimizes the sum of squared vertical distances between the line and each of the four points.
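For reference, the fit can be written out explicitly. This is a sketch that assumes the regressor uses ordinary least squares, the usual meaning of a best-fit line; the symbols x_i (day index 1 to 4), y_i (close price), m (slope), and b (intercept) are notation chosen here for illustration.

```latex
\min_{m,\,b} \; \sum_{i=1}^{4} \bigl( y_i - (m x_i + b) \bigr)^2

m = \frac{\sum_{i=1}^{4} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{4} (x_i - \bar{x})^2},
\qquad
b = \bar{y} - m\,\bar{x}

\hat{y}_{\text{pred}} = m \cdot 9 + b
```

The last line evaluates the fitted line at x = 9, which is five work-days past the fourth data point at x = 4.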

The Math

The following diagram depicts an example calculation on Goldman Sachs stock. The prediction is based on the points (1, 241.57), (2, 243.13), (3, 241.32), and (4, 244.90), where the first coordinate is the day index and the second is that day’s close price.

Looking five days past the fourth day of data collection (x = 9), this example’s prediction is 248.047, a predicted upward movement.
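The Goldman Sachs example can be checked with a few lines of code. The sketch below is written in Python rather than the project’s MATLAB, assumes an ordinary least-squares fit, and uses a hypothetical helper name (fit_line) that does not come from the original code. Under those assumptions it reproduces the 248.047 prediction.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b (hypothetical helper)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    b = y_mean - m * x_mean
    return m, b


# Four consecutive close prices from the Goldman Sachs example.
days = [1, 2, 3, 4]
closes = [241.57, 243.13, 241.32, 244.90]

slope, intercept = fit_line(days, closes)

# Evaluate the line five work-days past day 4, i.e. at x = 9.
prediction = slope * 9 + intercept
print(round(prediction, 3))  # 248.047
```

The slope works out to about 0.818 per day and the intercept to about 240.685, so the line rises from day 1 through day 9.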

The Procedure

This method begins in the same manner as the CatBoost algorithm: data collection. Bloomberg Terminals were used to access the price history. Because this method’s prediction algorithm requires only four days of information at a time, the LR data set was limited to a little over two and a half years (January 3, 2017 – September 6, 2019). From here, the data was organized in an Excel sheet and separated by individual stock. Twenty-four Dow Jones stocks were used. Within the Excel workbook, a function was set up to determine each day’s up/down movement for every stock (only whether the stock moved up or down, not the size of the move). This provided a reference against which the method’s efficacy could be measured. A main function of the linear regression model is to determine whether a future date will see an increase or decrease in a stock’s price based on four days’ prices.
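The up/down labeling step can be sketched as follows. This is an illustration in Python, not the Excel function used in the project; the function name daily_movement is hypothetical, the 1/0 encoding follows the up-is-1, down-is-0 convention used later in the procedure, and treating an unchanged close as "down" is an assumption, since the source does not say how flat days were handled.

```python
def daily_movement(closes):
    """Label each day after the first: 1 if the close rose, else 0.

    Assumption: an unchanged close is labeled 0 (down).
    """
    return [1 if today > prev else 0
            for prev, today in zip(closes, closes[1:])]


sample = [241.57, 243.13, 241.32, 244.90]
labels = daily_movement(sample)
print(labels)  # [1, 0, 1]
```

Only the direction is kept, which matches the description above: the size of the move is discarded.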

From here, we turned to MATLAB. The LR method begins by importing close prices from the Excel document. For each stock, 674 days of close prices were imported. For the five-day prediction, this meant we created 665 predictions – the first prediction is made on day 665 (674 – 4 days’ worth of data – 5 days until prediction). Within a while loop, the code iterates over the data set four days at a time, fitting a best-fit line to these four points in the form y = mx + b. The line’s value at x = 9 is extracted and placed in a “results” vector. This value is compared to the four data points that defined the best-fit line, and the stock’s predicted up/down movement is determined from this comparison. The movement is stored as a binary value (up is 1, down is 0) in a separate matrix.
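The loop described above can be sketched as follows. This is a Python illustration, not the project’s MATLAB code, and it rests on several labeled assumptions: the fit is ordinary least squares, the window advances one day at a time, a prediction is kept only when its target day exists in the data, and "up" means the predicted price exceeds the last close in the window. The source does not spell out the exact comparison rule or loop bounds, so the names and conventions here are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + b (hypothetical helper)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    return m, y_mean - m * x_mean


def rolling_predictions(closes, window=4, horizon=5):
    """Slide a four-day window over closes and predict five days ahead."""
    xs = list(range(1, window + 1))   # day indices 1..4 within each window
    target_x = window + horizon       # x = 9
    results, moves = [], []
    start = 0
    # Assumption: stop once the target day would fall past the data.
    while start + window + horizon <= len(closes):
        ys = closes[start:start + window]
        m, b = fit_line(xs, ys)
        predicted = m * target_x + b
        results.append(predicted)
        # Assumption: "up" means above the last close in the window.
        moves.append(1 if predicted > ys[-1] else 0)
        start += 1
    return results, moves


# Synthetic, perfectly linear prices (not real market data).
prices = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0]
results, moves = rolling_predictions(prices)
print(results, moves)
```

With the synthetic series, the two windows that have a target day predict 18.0 and 19.0, which are exactly the prices five days after each window ends, and both movements are labeled 1.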

At this point, the method splits into two paths. One function subtracts the actual close prices from the “results” vector of predicted close prices in order to calculate each prediction’s error. The MATLAB code produces two graphs from these values: one plots the error alone as a function of time, and the other overlays the predicted prices on the actual prices. Both graphs show the error; they simply present it in different ways. These charts can be found under the “Results” tab.
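The error calculation amounts to an element-wise subtraction, sketched here in Python with made-up numbers. The function name prediction_errors is hypothetical; the sign convention (predicted minus actual) follows the description above, so a positive error means the model overshot.

```python
def prediction_errors(predicted, actual):
    """Return predicted minus actual for each aligned pair of prices."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return [p - a for p, a in zip(predicted, actual)]


# Illustrative values only, not results from the project.
predicted = [248.0, 250.5]
actual = [247.0, 251.0]
errors = prediction_errors(predicted, actual)
print(errors)  # [1.0, -0.5]
```

Plotting this vector against time would give the first kind of graph described above; plotting predicted and actual prices on the same axes would give the second.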

The other function compares the predicted up/down movement vector with the actual up/down movement. The predicted movement vector is copied from MATLAB and imported into our original Excel document, where predicted and actual movements were compared for each of the 665 days.
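One natural way to summarize this comparison is the fraction of days on which the predicted and actual movements agree. The sketch below shows that calculation in Python with made-up vectors. The project performed the comparison in Excel, and the source does not state which summary statistic was used, so the hit-rate metric and the name directional_accuracy are assumptions.

```python
def directional_accuracy(predicted_moves, actual_moves):
    """Fraction of days where predicted and actual up/down labels match."""
    if len(predicted_moves) != len(actual_moves):
        raise ValueError("movement vectors must have the same length")
    if not predicted_moves:
        raise ValueError("movement vectors must not be empty")
    hits = sum(1 for p, a in zip(predicted_moves, actual_moves) if p == a)
    return hits / len(predicted_moves)


# Illustrative labels only (1 = up, 0 = down), not project results.
predicted_moves = [1, 0, 1, 1]
actual_moves = [1, 1, 1, 0]
accuracy = directional_accuracy(predicted_moves, actual_moves)
print(accuracy)  # 0.5
```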