Over the last century, weather forecasting techniques have been completely overhauled. Technological developments have made it possible to simulate the atmosphere more quickly and accurately than ever before, and advances in machine learning have improved the process further. Combining machine learning with numerical simulation allows a mapping between predictions and effects, which makes weather forecasting possible [1]. Techniques well suited to this prediction task include Decision Trees [2] and Linear Regression [3], which various reports have found effective. A study from Stanford University used a linear regression model and a variation on a functional regression model, and its results were comparable to those of professional forecasting services, which suggests the approach is viable [4]. Additionally, companies like Dark Sky, whose founders have engineering and computer science backgrounds, have teamed up with meteorologists and a team from Microsoft to combine machine learning with geophysical modeling and change the public’s perception of how weather is displayed and understood [5]. The transition to machine learning for weather forecasting is already underway, but the main challenge remains that forecasts are generally expensive to produce and often inaccurate. We therefore plan to use free government data and common machine learning techniques to develop an accurate and inexpensive model.

Problem Statement

Weather forecasting is the task of predicting the state of the atmosphere at a specified location and future time. It is currently performed with expensive physical models built by scientists, and the resulting forecasts are often inaccurate. The data show that forecast deviation grows with the projection interval, and that maximum temperature is harder to predict than minimum temperature. To put the numbers in perspective, the colloquial definition of ‘room temperature’ is between 20 and 23.5 degrees Celsius, a range of 3.5 degrees. Taken in aggregate, existing models are unable to perform within that range.

The figure below shows the mean absolute deviation (MAD) of forecasts as a function of how many days in advance each forecast was made. The data come from The Weather Channel’s 10-day forecasts for August through February of 2015.
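As a rough illustration of how a per-interval MAD could be computed, the sketch below groups forecast records by lead time and averages the absolute difference between forecast and observed temperature. It assumes that MAD here means the mean of |forecast - observed|. The function name `mad_by_lead_time`, the record layout, and the sample values are hypothetical and are not taken from The Weather Channel data.

```python
from collections import defaultdict


def mad_by_lead_time(records):
    """Return {days_out: mean absolute deviation} for forecast records.

    Each record is a (days_out, forecast_temp, observed_temp) tuple.
    This layout is an assumption made for illustration only.
    """
    abs_error_sums = defaultdict(float)
    counts = defaultdict(int)
    for days_out, forecast, observed in records:
        abs_error_sums[days_out] += abs(forecast - observed)
        counts[days_out] += 1
    # Average the absolute errors within each lead-time group.
    return {d: abs_error_sums[d] / counts[d] for d in sorted(abs_error_sums)}


# Made-up sample values (degrees Celsius), not real forecast data.
sample = [
    (1, 21.0, 20.0),
    (1, 18.5, 19.0),
    (7, 25.0, 20.0),
    (7, 15.0, 19.0),
]
result = mad_by_lead_time(sample)
print(result)  # {1: 0.75, 7: 4.5}
```

Grouping by lead time before averaging keeps the one-day and seven-day errors separate, which is what allows the deviation to be examined per projection interval.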

The MAD figure shows a clear trend toward increasing deviation with increased projection interval, culminating in deviations well outside a range that a person can comfortably plan for: up to 5 K seven days out. The following figure is a histogram from the same source that illustrates the increasing spread in forecast discrepancy with increased projection interval. One day out, we observe a fairly tight, almost Gaussian distribution. By nine days out, the distribution looks nearly uniform [6]. One implication of these data is that current forecasting appears to be an out-of-control process.

In the figure, the vertical axis represents frequency, and the chart is segmented by projection interval. The data show significant spread at every interval, and the distribution becomes less Gaussian as the forecast interval increases, which is characteristic of an out-of-control process [6].

We consider these visualizations to demonstrate a need for improved forecasting. Physical models have been developed over centuries and rely on expensive computing power, instruments, and expert interpretation. Even so, forecasts often fall significantly short of reality and display an undesirable level of variability. We believe a machine learning approach can reduce these inaccuracies because it is relatively robust to natural perturbations and does not require a complete understanding of the physical processes that govern the atmosphere. It is also relatively cheap and can be applied to many kinds of forecasting problems. Our intent is therefore to use the low cost of machine learning to produce accurate weather forecasts. We will test this idea with a proof-of-concept software package run against years of free government records. Our design will use common algorithms, and we will measure both the cost and the accuracy of their predictions. By comparing our results with those of existing physical models, we aim to achieve a similar level of accuracy using free software and data.
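To make the evaluation idea concrete, the following is a minimal sketch of one way to fit a simple linear regression and score it against a reference forecast. This is not the proposed software package. It assumes a single input feature (today's maximum temperature) and uses a persistence forecast (tomorrow equals today) as a stand-in baseline, since no physical model output is available here. All names and temperature values are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_absolute_error(predictions, targets):
    """Mean of |prediction - target| over paired values."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


# Made-up daily maximum temperatures (degrees Celsius), not real records.
temps = [20.0, 22.0, 21.0, 23.0, 24.0, 22.0, 21.0, 23.0, 25.0, 24.0, 22.0, 23.0]
today, tomorrow = temps[:-1], temps[1:]

split = 8  # first 8 pairs for fitting, the rest held out for evaluation
slope, intercept = fit_simple_linear_regression(today[:split], tomorrow[:split])

model_preds = [slope * x + intercept for x in today[split:]]
persistence_preds = today[split:]  # baseline: tomorrow equals today

model_mae = mean_absolute_error(model_preds, tomorrow[split:])
persistence_mae = mean_absolute_error(persistence_preds, tomorrow[split:])
print(f"regression MAE: {model_mae:.2f}, persistence MAE: {persistence_mae:.2f}")
```

Holding out the last few days keeps the accuracy estimate separate from the data used for fitting. The same comparison could be repeated with published forecasts in place of the persistence baseline.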