Results

Machine learning is an iterative process, and the results of one modeling iteration should be used to inform future iterations. In practice, this meant that we had to make significant adjustments to our algorithm configurations and training set. Below we document our preliminary and final iterations to illustrate the process and provide insights on performance.

Project Process Overview

Performance Criteria

Two criteria were used to judge performance of models:

  1. Mean Absolute Error (MAE) of predicted vs. actual temperature. The physical-model forecast baseline results are given to us in terms of MAE, so we use this statistic to compare the performance of our data-driven models directly to the baseline.
  2. Spread of error. An advantage of existing forecasts over data-driven models is that they incorporate some human intuition and interpretation; because they receive a pass of human common sense, existing forecasts rarely run the risk of being wildly inaccurate. When we allow a data-driven model to make all the predictions, we run the risk of outliers producing skewed or anomalous results. This would be unacceptable in a production setting: if a personal weather forecaster was right 90% of the time but off by 100 degrees the other 10% of the time, people would rarely use or trust it. To judge this, we evaluated the spread of the error of predicted vs. actual temperature using its standard deviation. This statistic is not available in the forecast data set, but it allows a quick quantitative comparison between models. Because standard deviation assumes a Gaussian error distribution, we also used the interquartile range (IQR) as a second measure of spread, as illustrated in the sketch following this list.
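
For reference, the sketch below shows one way these three statistics can be computed from arrays of predicted and actual temperatures. It assumes NumPy and scikit-learn, which is consistent with the tooling implied by our model names but not necessarily our exact implementation, and the array names are illustrative placeholders.

```python
# Illustrative sketch: computing MAE, error standard deviation, and error IQR
# for a set of predictions. Array names (y_actual, y_predicted) are placeholders.
import numpy as np
from sklearn.metrics import mean_absolute_error

def error_statistics(y_actual, y_predicted):
    errors = np.asarray(y_predicted) - np.asarray(y_actual)  # signed error
    mae = mean_absolute_error(y_actual, y_predicted)          # accuracy
    std_dev = errors.std()                                    # spread (assumes roughly Gaussian errors)
    q25, q75 = np.percentile(errors, [25, 75])
    iqr = q75 - q25                                           # spread, no distributional assumption
    return {"mae": mae, "std_dev": std_dev, "iqr": iqr}
```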

As described in the Human Forecast Data Acquisition section, we obtained MAE statistics for existing physical-model-based forecasts of minimum and maximum temperature in St. Louis. These statistics are shown in the figure below. Because the goal of our project is to prove that forecasting does not need to be expensive to be accurate, these statistics are held as the gold standard and used as the benchmark for our performance.

First Iteration

Intra-Model Performance Analysis

In our first iteration, we built linear regression (‘linreg’), random forest (‘rf’), stochastic gradient descent (‘sgd’), and ridge (‘ridge’) models and used them to predict maximum (‘max’) and minimum (‘min’) surface dry-bulb temperatures at 24 and 168 hours out. Models were trained against subsets of varying size, ranging from 100 to 100,000 random samples. A model’s name is a concatenation of the algorithm and training set size, e.g. ‘linreg100’ corresponds to a linear regression forecaster trained against 100 rows. Every model was then tested against a reserved subset of 100,000 previously unseen rows, and accuracy measures (error and absolute error) were tabulated.
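
The model-building loop for this iteration can be sketched roughly as follows. This is a simplified illustration assuming scikit-learn, which matches the algorithm names above; the placeholder data, reduced row counts, and default hyperparameters are assumptions for readability rather than our exact configuration (the real pipeline used training subsets of up to 100,000 rows and a 100,000-row reserved test set).

```python
# Simplified sketch of the first-iteration model-building loop: four algorithms,
# training subsets of increasing size, all evaluated on one reserved test set.
# Random placeholder data stands in for the prepared historical weather features,
# and the row counts are scaled down from the real 100,000-row sets.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge, SGDRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(40_000, 100))              # placeholder feature matrix (>100 columns in practice)
y = rng.normal(loc=55, scale=20, size=40_000)   # placeholder target, e.g. 24-hour max temperature

X_train, y_train = X[:20_000], y[:20_000]       # pool from which training subsets are drawn
X_test, y_test = X[20_000:], y[20_000:]         # reserved set of previously unseen rows

algorithms = {"linreg": LinearRegression, "rf": RandomForestRegressor,
              "sgd": SGDRegressor, "ridge": Ridge}

results = {}
for name, Algorithm in algorithms.items():
    for size in (100, 1_000, 10_000):
        model = Algorithm().fit(X_train[:size], y_train[:size])
        predictions = model.predict(X_test)
        # Model names concatenate algorithm and training-set size, e.g. 'linreg100'.
        results[f"{name}{size}"] = mean_absolute_error(y_test, predictions)
```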

The results of the models, broken out by training set size, model, and statistic computed, were plotted and can be found in Appendix B. The raw data table can be found in Appendix C. The models measured as ‘best’ by either smallest standard deviation or smallest MAE are also summarized in Table 2. Every model that minimized MAE also minimized standard deviation for its algorithm-projection-operation combination. The best models are visualized and compared in Figure 12.

Model         Operation  Projection (hrs)  MAE    Std Dev  IQR
linreg100000  max        24                12.59  23.29    16.13
linreg100000  max        168               17.54  28.00    23.63
linreg100000  min        24                 3.29   4.25     5.31
linreg100000  min        168                4.01   5.12     6.60
rf10000       max        24                 8.13  16.23     8.78
rf10000       max        168               13.06  23.04    14.01
rf10000       min        24                 3.28   4.26     5.27
rf10000       min        168                4.06   5.27     6.51
ridge1000     max        24                12.58  37.93    14.65
ridge100000   max        168               17.47  27.87    23.61
ridge100000   min        24                 3.28   4.23     5.31
ridge100000   min        168                4.02   5.13     6.62
sgd10000      max        24                10.67  29.67    10.22
sgd1000       max        168               19.09  30.50    25.81
sgd10000      min        24                 3.29   4.66     5.19
sgd100000     min        168                4.01   5.13     6.60

Table 2 – Top Performing Models in 1st Iteration. For each combination of operation, projection, and modeling algorithm, models were built with training sets of varying sizes. This table shows the model that performed best for each operation-projection-algorithm combination. Performance is evaluated by MAE, a measure of accuracy, and by standard deviation and interquartile range (IQR), measures of spread and reliability. The models with the lowest MAE were also highly performant on the measures of spread.

Figure 12 – Performance Comparison of First Iteration’s Best Performing Models. The MAE of the top performing model for each algorithm-operation-projection combination is plotted. The data reveal significantly better results from the random forest models in predicting the historically difficult-to-predict maximum temperatures. All of the models show fairly consistent performance in predicting minimum temperature.

As seen in Table 2, the top performing models do not differ greatly in standard deviation. With that analysis complete, we felt confident selecting, for each target variable, the model that minimized MAE as the ‘best’ model. These results are summarized in Table 3.

Top Performing Model  Target Variable                      MAE    Std Dev
rf10000               168 Hour Projected Max Temperature   13.06  23.04
rf10000               24 Hour Projected Max Temperature     8.13  16.23
sgd100000             168 Hour Projected Min Temperature    4.01   5.13
ridge100000           24 Hour Projected Min Temperature     3.28   4.23

Table 3 – First Iteration’s Top 4 Models. These are the models with the smallest MAE for their operation-projection combination. Their accuracy follows the same pattern as the baseline.

Comparison of Preliminary Iteration to Human Forecast Baseline

We compared the results of our preliminary iteration to the baseline as visualized below in Table 4 and Figure 13:

Top Performing Model  Target Variable                      Model MAE  Baseline MAE
rf10000               168 Hour Projected Max Temperature   13.06      6.12
rf10000               24 Hour Projected Max Temperature     8.13      2.61
sgd100000             168 Hour Projected Min Temperature    4.01      4.50
ridge100000           24 Hour Projected Min Temperature     3.28      2.29

Table 4 – Comparison of 1st Iteration Top Performing Models to Baseline Performance. Our models had higher MAE than the corresponding baseline across the board, with the exception of the 168-hour minimum temperature projection.

Figure 13 – Visualization of Performance Comparison Between 1st Iteration Top Performing Models and Baseline. Our models had higher MAE than the corresponding baseline across the board, with the exception of the 168-hour minimum temperature projection.

As visualized, only one of the four top-performing models was able to beat the baseline. This indicates that the models as currently configured are not up to specification and will need to be tuned to achieve the objectives of the project.

First Iteration Tuning Exploration

Analysis of the dataset, a review of the literature, and discussion with our advisor, Dr. Trobaugh, led to the identification of several areas where further exploration could result in tuning for the models.

  1. Tuning Area 1: Predicting the Past is an Invalid Use Case
    One suggestion from Dr. Trobaugh was that our models were being skewed by older data: training on the climate of the mid-20th century would lead to inaccuracy in predicting the climate of the 21st century, and vice versa. Moreover, predicting the past is not a realistic use case. The intra-model performance analysis of this iteration shows that we do not need to worry much about shrinking the training set, as performance levels off asymptotically around 10,000 training rows.

Tune 1: Train the models on a training set drawn from the recent past and test them against a test set from the near future, to provide a more realistic use case.

  2. Tuning Area 2: Curse of Dimensionality
    Blum and Langley’s article ‘Selection of Relevant Features … in Machine Learning’ introduces the idea commonly known as the ‘Curse of Dimensionality’ [15]. The article explores the limitations of using increasing numbers of independent features to decide the value of a dependent feature in modeling. As the dimensionality (column count) of the data increases, modeling algorithms require more objects (rows) to build an accurate model. Additionally, all models are influenced by collinearity in the data, a risk that increases with dimensionality. Our training set for the first iteration contained over 100 dimensions, a high number. Because so many of those dimensions are representations of temperature, there is an obvious risk of multicollinearity.

Tune 2: Use a dimensionality reduction analysis such as Principal Component Analysis to identify whether there is collinearity in the data, as sketched below. If there is, implement feature-reduction pre-processing before passing the training data to the model-building algorithms.
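
The sketch below illustrates the kind of PCA-based collinearity check described in Tune 2: if a small number of principal components captures most of the variance in the feature matrix, the original columns carry largely redundant information. The scikit-learn usage, the scaling step, and the random placeholder matrix are assumptions, not our exact implementation.

```python
# Sketch of a PCA-based collinearity check. `X` is a random placeholder standing
# in for the real >100-column feature matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(5_000, 100))  # placeholder feature matrix

X_scaled = StandardScaler().fit_transform(X)   # scale so no single column dominates
pca = PCA().fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# If only a handful of components reach ~95% of the variance, the features are
# highly collinear and feature reduction is warranted before model building.
components_needed = int(np.searchsorted(cumulative, 0.95)) + 1
print(f"{components_needed} components explain 95% of the variance")
```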

We implemented Tune 1 and Tune 2 both separately and in conjunction to create the final iteration results.

Final Iteration Results

Tuning Implementation

We implemented Tune 1 by partitioning the dataset into 4 equal partitions, corresponding to roughly 15-year segments, and then training and testing models on those segments. While training and test records are still drawn randomly out of a partition, they are chronologically near each other.
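
A minimal sketch of this partitioning logic is shown below. The hourly placeholder DataFrame, the "date" column name, and the pandas/NumPy usage are assumptions for illustration rather than our exact implementation.

```python
# Sketch of Tune 1: split the records into 4 chronologically coherent partitions
# of roughly 15 years each, then draw training and test rows from within each
# partition. The hourly placeholder DataFrame stands in for the real data set.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("1960-01-01", "2019-12-31", freq="H")
df = pd.DataFrame({"date": dates,
                   "max_temp": rng.normal(loc=55, scale=20, size=len(dates))})

df = df.sort_values("date").reset_index(drop=True)
partitions = np.array_split(df, 4)        # four roughly 15-year segments

for partition in partitions:
    # Training and test rows are still drawn at random, but only from within
    # the same partition, so they are chronologically near each other.
    train = partition.sample(n=10_000, random_state=0)
    test = partition.drop(train.index)
```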

We implemented Tune 2 by preprocessing the entire data set with a Principal Component Analysis (PCA) algorithm configured to collapse the X-variables to 5 components. This had the effect of shrinking our X-set from > 100 variables to exactly 5.
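
A sketch of this preprocessing step as a scikit-learn pipeline follows. The 5-component setting matches the description above; the placeholder data, the scaling step, and the default random-forest settings are assumptions rather than our exact configuration.

```python
# Sketch of Tune 2: collapse the >100 X-variables to exactly 5 principal
# components before fitting a model. Placeholder data stands in for the real set.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 100))              # placeholder feature matrix
y = rng.normal(loc=55, scale=20, size=10_000)   # placeholder target temperature

model = make_pipeline(
    StandardScaler(),                 # put all columns on a comparable scale
    PCA(n_components=5),              # shrink the X-set from >100 columns to 5
    RandomForestRegressor(random_state=0),
)
model.fit(X, y)                       # fits the scaler, PCA, and forest in sequence
predictions = model.predict(X[:5])    # new rows pass through the same scaler and PCA
```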

We implemented both tunes together by partitioning the dataset and then running PCA on each partition.

Intra-Model Performance Analysis

Models were trained against subsets of 10,000 random samples. Models were built using Tune 1, Tune 2, and a combination of the two. Any model built using only Tune 2 was tested against 30,000 rows. Any model built using Tune 1 was tested against every record in its partition that was not used for training. Untuned models, as described previously, were tested against 100,000 rows. Accuracy measures (error, absolute error, IQR, and standard deviation) were tabulated.

The models measured as ‘best’ by a combination of minimizing MAE, IQR, standard deviation, and the ratio of MAE to test set size are summarized in Table 5, and the intra-model performance is plotted in Figure 15.

Operation  Interval (hrs)  Tune     Model Algorithm  MAE    Error IQR  Error Std Dev  Test Set Size
max        24              both     rf                1.92   2.61       2.72          106453
max        24              tune1    rf                1.91   2.60       2.69          106453
max        24              tune2    rf                4.00   5.17       6.39           30000
max        24              untuned  rf                8.13   8.78      16.23          100000
max        168             both     rf                4.32   6.15       5.98          106487
max        168             tune1    rf                4.07   5.01       6.34          106484
max        168             tune2    rf                5.73   7.81       8.53           30000
max        168             untuned  rf               13.06  14.01      23.04          100000
min        24              both     rf                1.79   2.53       2.51          106487
min        24              tune1    rf                1.79   2.54       2.50          106487
min        24              tune2    rf                2.34   3.51       3.15           30000
min        24              untuned  rf                3.28   5.27       4.26          100000
min        168             both     rf                1.92   2.69       2.69          106487
min        168             tune1    rf                1.94   2.73       2.72          106487
min        168             tune2    rf                2.81   4.24       3.75           30000
min        168             untuned  linreg            4.01   6.60       5.12          100000

Table 5 – Results of the Final Iteration Model Building Process. Every tune was found to lead to a drastic improvement in MAE over the untuned case. For this iteration we also tracked the size of each model’s test set. Because of the way we cut up the data for the tunes, some models were tested on significantly smaller reserved sets; a smaller test set often yields a smaller MAE because fewer outlier conditions are encountered. Using either Tune 1 or both tunes was shown to be the most effective at minimizing MAE. Interestingly, only one of the models in the table was not a random forest.

 

Figure 15 – Visual comparison of tuned model performance. Every tune was found to lead to a drastic improvement in MAE over the untuned case. PCA was not shown to lead to as drastic an improvement in performance as training and testing models in chronologically coherent segments. Interestingly, the top performing model for every target was based on a Random Forest algorithm.

Comparison of Final Iteration to Human Forecast Baseline

The best models’ MAEs are plotted next to the corresponding baseline performance in Figure 16.

Figure 16 – Visual comparison of the best tuned model for each target variable to the baseline. For each target variable, a model was built that outperformed the baseline in MAE. Every top performing model was based on a random forest algorithm and had been tuned to train and test on chronologically coherent records.

As visualized, the tuned models were able to substantially outperform the baseline for every target variable. Because of the efforts taken to purify and analyze the data both before and after model building, we feel confident claiming that this generation of models reaches at least parity with existing physical-model forecasts.