Physical Model Baseline Performance

Weather forecasting is the task of predicting the state of the atmosphere at a future time and a specified location but is currently performed via expensive physical models created by human scientists and are usually inaccurate. Figure 1 illustrates the mean absolute error of existing physical models across the U.S. in predicting maximum, minimum, and actual surface temperature at 1 and 7-day projection intervals.

Figure 1- Mean Absolute Error in Kelvin of the physical model baseline. This chart displays the MAE of forecasts for maximum and minimum daily temperature 1 and 7 days out, as forecasted by existing physical models and interpreted by meteorologists. Because these are the scores of existing models, they are the baseline and gold standard we compare out results against. The data show a trend toward increased variation with increasing projection interval, and more difficulty in predicting maximum temperature than minimum.

Human forecast accuracy data were acquired from the Verification Team of the Digital Forecast Services Branch of the National Meteorological Development Laboratory – a division of NOAA. This team produces automated reports on forecasts made by field stations throughout the U.S. stored in the National Digital Forecast Database. Scores are available for a variety of features, projection intervals, and locations. A consistently offered statistic is mean absolute error (MAE).

To make use of these data, we used a multiprocessing-based script which queries the NDFS Verification website to retrieve text files with forecast accuracy data. We read in these text files, filter out stations that are not specified in the configuration file (i.e., St. Louis), and store the MAE. The data are then consolidated into a spreadsheet and saved to disk for future manual analysis.

Weather Data Acquisition

Weather data were acquired from the Climate Data Online Project from the National Centers for Environmental Information, a group under the National Oceanic and Atmospheric Administration (part of the U.S. Department of Commerce). Data were obtained as hourly weather readings from Lambert Airport (STL) starting January 1st, 1945 and going through August of 2017. We requested and downloaded data in 10-year increments, and delivered as 90 column comma separated value (CSV) files.

Input File Restructuring

NOAA delivered data in the form of multiple separate de-normalized comma separated value files. To join the files together, a script was written which programmatically opened the different input files and then concatenated them into one large file.

The script then applied a series of transformations to the data, filtering out uninformative columns, dropping nulls, and coercing text into numeric data. The data were then saved to disc as a CSV file.

Data Loading

To give persistent storage and consistent programmatic access to data for concurrent users / processes, we loaded our data into a Microsoft SQL Server database. To achieve this, the merged input file was read into memory by a Python script, and then converted into a table uploaded to the database. This step was essentially a direct replication from local flat file storage into database storage.

A main objective of the project was to build an analytics solution that could be run on cheap, easily available hardware and software (i.e. personal computers). Due to the massive scale of data involved, this meant we had to take care to optimize performance in our implementation. Another main objective was to create a high-quality solution with high reusability, implying we should maximize modularity.

To this end, we implemented a normalized database design. At a high level, database normalization allows for optimized storage and machine processing. It does this by reducing redundancy in data by organizing columns and tables in the most atomic manner possible, such that a change in one attribute of one record does not risk compromising the integrity of the other attributes of the record. It also improves database performance minimizing the number of columns added to the database schema when a new attribute is defined.

The resulting database schema can be found in Appendix A.

Feature Calculation

After filtering out empty, redundant, or otherwise low-value columns, the dataset contains roughly a dozen columns. These relations represent our knowledge of the weather at one particular hour in time. These records lack knowledge of the past; facts like what the weather is typically like for that day of the year, or what the weather has been like for the preceding week. These would all presumably make good variables in a model. These relations also lack any knowledge of the future. There are no candidates for a ‘target’ or ‘Y’ variable.

To solve these problems and make each record atomic in containing knowledge of both its past and future for effective modeling, we performed a feature calculation step. This step was coordinated by a Python script which accessed lists of projection intervals, features, and arithmetic operations in the system configuration file. The Python script then accessed SQL script templates, injected the values read from the configuration file, and sent the script to the database to execute the calculations.

This process was made with a design emphasis on modularity so that we could add and remove both past and future features without altering the database schema. It was also made with an emphasis on normalization to control size, as the chronological_fact table contains more records than any other table by a factor of 100.

Data Export

For use as a training set the data needed to be exported into a single frame of data. We chose to implement this by extracting the complete set of data from the database and exporting it to a “flat file” – a CSV containing all the data needed in one single, many column table. The flat file option allowed us to keep the data on our local hard drive and distribute the file so that multiple computers could do modeling on the data.

To achieve this, we used a combination of SQL and Python to unite raw weather data (the useful columns from the original records), historic date data (such as average temperature for this day-month combination), and chronological data (such as 24-hour future maximum dry bulb temperature). This process effectively performed an in-memory denormalization – undoing the work of normalization done when the database was created because in the training set our focus is to optimize reporting performance, whereas before we worked to optimize storage performance and data integrity.

Dataset Shape


Initial Columns Columns Added to Database Computed Columns Training Set X Variables Training Set Y Variables
STATION STATION day day future min drybulbtemp 24
STATION_NAME STATION_NAME month month future min drybulbtemp 168
ELEVATION ELEVATION date_avg_dewpoint current_drybulbtemp future max drybulbtemp 24
LATITUDE LATITUDE date_avg_drybulbtemp current_wetbulbtemp future max drybulbtemp 168
LONGITUDE LONGITUDE date_avg_humidity current_dewpoint
DATE DATE date_avg_windspeed current_windspeed
REPORTTPYE HOURLYDRYBULBTEMPC past max drybulbtemp 1 current_stationpressure
HOURLYSKYCONDITIONS HOURLYWETBULBTEMPC past max humidity 1 current_sealevelpressure
HOURLYVISIBILITY HOURLYDewPointTempC past max windspeed 1 date_avg_dewpoint
HOURLYPRSENTWEATHERTYPE HOURLYRelativeHumidity past max dewpoint 1 date_avg_drybulbtemp
HOURLYDRYBULBTEMPF HOURLYWindSpeed past max drybulbtemp 3 date_avg_humidity
HOURLYDRYBULBTEMPC HOURLYWindDirection past max humidity 3 date_avg_windspeed
HOURLYWETBULBTEMPF HOURLYStationPressure past max windspeed 3 past max drybulbtemp 1
HOURLYWETBULBTEMPC HOURLYSeaLevelPressure past max dewpoint 3 past max humidity 1
HOURLYDewPointTempF past max drybulbtemp 24 past max windspeed 1
HOURLYDewPointTempC past max humidity 24 past max dewpoint 1
HOURLYRelativeHumidity past max windspeed 24 past max drybulbtemp 3
HOURLYWindSpeed past max dewpoint 24 past max humidity 3
HOURLYWindDirection past max drybulbtemp 48 past max windspeed 3
HOURLYWindGustSpeed past max humidity 48 past max dewpoint 3
HOURLYStationPressure past max windspeed 48 past max drybulbtemp 24
HOURLYPressureTendency past max dewpoint 48 past max humidity 24
HOURLYPressureChange past max drybulbtemp 72 past max windspeed 24
HOURLYSeaLevelPressure past max humidity 72 past max dewpoint 24
HOURLYPrecip past max windspeed 72 past max drybulbtemp 48
HOURLYAltimeterSetting past max dewpoint 72 past max humidity 48
DAILYMaximumDryBulbTemp past max drybulbtemp 96 past max windspeed 48
DAILYMinimumDryBulbTemp past max humidity 96 past max dewpoint 48
DAILYAverageDryBulbTemp past max windspeed 96 past max drybulbtemp 72
DAILYDeptFromNormalAverageTemp past max dewpoint 96 past max humidity 72
DAILYAverageRelativeHumidity past max drybulbtemp 120 past max windspeed 72
DAILYAverageDewPointTemp past max humidity 120 past max dewpoint 72
DAILYAverageWetBulbTemp past max windspeed 120 past max drybulbtemp 96
DAILYHeatingDegreeDays past max dewpoint 120 past max humidity 96
DAILYCoolingDegreeDays past max drybulbtemp 144 past max windspeed 96
DAILYSunrise past max humidity 144 past max dewpoint 96
DAILYSunset past max windspeed 144 past max drybulbtemp 120
DAILYWeather past max dewpoint 144 past max humidity 120
DAILYPrecip past max drybulbtemp 168 past max windspeed 120
DAILYSnowfall past max humidity 168 past max dewpoint 120
DAILYSnowDepth past max windspeed 168 past max drybulbtemp 144
DAILYAverageStationPressure past max dewpoint 168 past max humidity 144
DAILYAverageSeaLevelPressure past max drybulbtemp 720 past max windspeed 144
DAILYAverageWindSpeed past max humidity 720 past max dewpoint 144
DAILYPeakWindSpeed past max windspeed 720 past max drybulbtemp 168
PeakWindDirection past max dewpoint 720 past max humidity 168
DAILYSustainedWindSpeed past min drybulbtemp 1 past max windspeed 168
DAILYSustainedWindDirection past min humidity 1 past max dewpoint 168
MonthlyMaximumTemp past min windspeed 1 past max drybulbtemp 720
MonthlyMinimumTemp past min dewpoint 1 past max humidity 720
MonthlyMeanTemp past min drybulbtemp 3 past max windspeed 720
MonthlyAverageRH past min humidity 3 past max dewpoint 720
MonthlyDewpointTemp past min windspeed 3 past min drybulbtemp 1
MonthlyWetBulbTemp past min dewpoint 3 past min humidity 1
MonthlyAvgHeatingDegreeDays past min drybulbtemp 24 past min windspeed 1
MonthlyAvgCoolingDegreeDays past min humidity 24 past min dewpoint 1
MonthlyStationPressure past min windspeed 24 past min drybulbtemp 3
MonthlySeaLevelPressure past min dewpoint 24 past min humidity 3
MonthlyAverageWindSpeed past min drybulbtemp 48 past min windspeed 3
MonthlyTotalSnowfall past min humidity 48 past min dewpoint 3
MonthlyDeptFromNormalMaximumTemp past min windspeed 48 past min drybulbtemp 24
MonthlyDeptFromNormalMinimumTemp past min dewpoint 48 past min humidity 24
MonthlyDeptFromNormalAverageTemp past min drybulbtemp 72 past min windspeed 24
MonthlyDeptFromNormalPrecip past min humidity 72 past min dewpoint 24
MonthlyTotalLiquidPrecip past min windspeed 72 past min drybulbtemp 48
MonthlyGreatestPrecip past min dewpoint 72 past min humidity 48
MonthlyGreatestPrecipDate past min drybulbtemp 96 past min windspeed 48
MonthlyGreatestSnowfall past min humidity 96 past min dewpoint 48
MonthlyGreatestSnowfallDate past min windspeed 96 past min drybulbtemp 72
MonthlyGreatestSnowDepth past min dewpoint 96 past min humidity 72
MonthlyGreatestSnowDepthDate past min drybulbtemp 120 past min windspeed 72
MonthlyDaysWithGT90Temp past min humidity 120 past min dewpoint 72
MonthlyDaysWithLT32Temp past min windspeed 120 past min drybulbtemp 96
MonthlyDaysWithGT32Temp past min dewpoint 120 past min humidity 96
MonthlyDaysWithLT0Temp past min drybulbtemp 144 past min windspeed 96
MonthlyDaysWithGT001Precip past min humidity 144 past min dewpoint 96
MonthlyDaysWithGT010Precip past min windspeed 144 past min drybulbtemp 120
MonthlyDaysWithGT1Snow past min dewpoint 144 past min humidity 120
MonthlyMaxSeaLevelPressureValue past min drybulbtemp 168 past min windspeed 120
MonthlyMaxSeaLevelPressureDate past min humidity 168 past min dewpoint 120
MonthlyMaxSeaLevelPressureTime past min windspeed 168 past min drybulbtemp 144
MonthlyMinSeaLevelPressureValue past min dewpoint 168 past min humidity 144
MonthlyMinSeaLevelPressureDate past min drybulbtemp 720 past min windspeed 144
MonthlyMinSeaLevelPressureTime past min humidity 720 past min dewpoint 144
MonthlyTotalHeatingDegreeDays past min windspeed 720 past min drybulbtemp 168
MonthlyTotalCoolingDegreeDays past min dewpoint 720 past min humidity 168
MonthlyDeptFromNormalHeatingDD past average dewpoint 1 past min windspeed 168
MonthlyDeptFromNormalCoolingDD past average drybulbtemp 1 past min dewpoint 168
MonthlyTotalSeasonToDateHeatingDD past average drybulbtemp 3 past min drybulbtemp 720
MonthlyTotalSeasonToDateCoolingDD past average humidity 1 past min humidity 720
past average windspeed 1 past min windspeed 720
past average humidity 3 past min dewpoint 720
past average windspeed 3 past average dewpoint 1
past average dewpoint 3 past average drybulbtemp 1
past average drybulbtemp 24 past average drybulbtemp 3
past average humidity 24 past average humidity 1
past average windspeed 24 past average windspeed 1
past average dewpoint 24 past average humidity 3
past average drybulbtemp 48 past average windspeed 3
past average humidity 48 past average dewpoint 3
past average windspeed 48 past average drybulbtemp 24
past average dewpoint 48 past average humidity 24
past average drybulbtemp 72 past average windspeed 24
past average humidity 72 past average dewpoint 24
past average windspeed 72 past average drybulbtemp 48
past average dewpoint 72 past average humidity 48
past average drybulbtemp 96 past average windspeed 48
past average humidity 96 past average dewpoint 48
past average windspeed 96 past average drybulbtemp 72
past average dewpoint 96 past average humidity 72
past average drybulbtemp 120 past average windspeed 72
past average humidity 120 past average dewpoint 72
past average windspeed 120 past average drybulbtemp 96
past average dewpoint 120 past average humidity 96
past average drybulbtemp 144 past average windspeed 96
past average humidity 144 past average dewpoint 96
past average windspeed 144 past average drybulbtemp 120
past average dewpoint 144 past average humidity 120
past average drybulbtemp 168 past average windspeed 120
past average humidity 168 past average dewpoint 120
past average windspeed 168 past average drybulbtemp 144
past average dewpoint 168 past average humidity 144
past average drybulbtemp 720 past average windspeed 144
past average humidity 720 past average dewpoint 144
past average windspeed 720 past average drybulbtemp 168
past average dewpoint 720 past average humidity 168
future min drybulbtemp 24 past average windspeed 168
future min drybulbtemp 168 past average dewpoint 168
future average drybulbtemp 24 past average drybulbtemp 720
future average drybulbtemp 168 past average humidity 720
future max drybulbtemp 24 past average windspeed 720
future max drybulbtemp 168 past average dewpoint 720