Modeling Algorithms

Choice of Supervised Learning

There are two principal families of machine learning: supervised and unsupervised learning.

As explained by Zhu and Goldberg, both forms of learning examine a dataset X consisting of n samples x. Each x represents an object which can be decomposed into a feature vector x = {x_1, x_2, …, x_D}, where D is the length of the feature vector. Each attribute x_i represents some fact about the object and is called a feature. In the case of our dataset, each x in X is a weather reading taken on the hour at a weather station, combined with a number of historical facts about that hour [8].

Unsupervised learning takes the dataset X and creates insights based on those data alone. Common unsupervised learning tasks include clustering, outlier identification, and dimensionality (feature) reduction [8].

Supervised learning considers not only the dataset X but also a label set Y, composed of scalar labels y. A supervised learning algorithm examines the set of pairs {(x_i, y_i)} over some range of indices i and builds a function F: X → Y so that the labels Y of a future set of unlabeled values X can be predicted. This is desirable when y is difficult or impossible to compute directly. In the context of our project, we seek to compute future temperatures. Barring supernatural powers, it is impossible to compute future weather with 100% accuracy; as such, we attempt to build a model that can predict the label of future weather [8].

In the context of our project, unsupervised learning does not provide a direct means to achieve our goal of competing with existing forecasting methods. Forecasting is an inherently labeled process, and thus we chose to use supervised learning.

Choice of Algorithms

We chose four learning algorithms with which to build our model: Multiple Linear Regression, Ridge Regression, Random Forest Regression, and Stochastic Gradient Descent Regression. This decision was guided by the flowchart in Figure 6, created by scikit-learn, the open-source software group which created the library used in our project’s code [9].

Figure 6 – SKLEARN Learning Algorithm Cheat Sheet and Flow Chart for Algorithm Selection. This flowchart outlines the four main families of machine learning algorithms, with the top two families representing supervised learning and the bottom two representing unsupervised learning. This flowchart guided our selection of the modeling algorithms used for this project.

When choosing a learning algorithm for machine learning, there are four main families of models. Classification models predict categorical labels for data (discrete values). Clustering models find similarities between groups of data points. Dimensionality reduction models are used for exploratory analysis of data to gain insights. The fourth and final family, regression algorithms, was most pertinent to our goal. Regression models predict continuous-valued variables. The choice and effectiveness of regression algorithms are characterized by:

  1. Size of the data set
  2. Expectation of number of important variables
  3. Collinearity in the data

We used four algorithms to evaluate efficacy, listed below, described in more detail in the following pages, and sketched in code after the list:

  1. Multiple linear regression
  2. Ridge regression
  3. Stochastic gradient descent regression
  4. Random forest regression
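
As a point of reference, the sketch below shows how these four algorithms might be instantiated through scikit-learn’s common estimator interface. The hyperparameter values are illustrative placeholders rather than our final configuration, apart from the 1,000 iterations used for stochastic gradient descent, which is described later.

    # Minimal sketch: the four regression algorithms as scikit-learn estimators.
    # Hyperparameter values are illustrative placeholders, not final settings.
    from sklearn.linear_model import LinearRegression, Ridge, SGDRegressor
    from sklearn.ensemble import RandomForestRegressor

    models = {
        "multiple_linear": LinearRegression(),
        "ridge": Ridge(alpha=1.0),                    # alpha plays the role of the ridge constant
        "sgd": SGDRegressor(max_iter=1000),           # 1,000 passes, as described later
        "random_forest": RandomForestRegressor(n_estimators=100),
    }

    # Every estimator exposes the same interface:
    #   model.fit(X_train, y_train)
    #   y_pred = model.predict(X_test)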

Detailed Algorithm Definitions

Multiple Linear Regression

Multiple linear regression is an expansion of the simple linear regression model in which the model has two or more predictors.

In a multiple linear regression, we take a dependent variable Y with n independent variables and attempt to fit a linear model of the form:

Y = \beta + \alpha_1 X_1 + \alpha_2 X_2 + \cdots + \alpha_n X_n

where alpha_i is the slope of the independent variable X_i as it relates to Y, and beta is the Y-intercept.

A linear model is built by an algorithm which finds the optimal solution to the dataset via the method of least squares, i.e. minimizing the objective function below [10]:

z = \sum_{j=1}^{m} \left( Y_j - \beta - \sum_{i=1}^{n} \alpha_i X_{ij} \right)^{2}

where z is the sum of squared errors over the m training samples, leading to this algorithm’s name of “least-squares” method.

As an example, Figure 7 below displays a 2-variable case of multiple linear regression. In Figure 7 automobile horsepower and weight are covariates, with fuel efficiency in MPG as the dependent variable. The linear model trains against the data and produces a plane equation which attempts to fit the data. It is more difficult to visualize this phenomenon in higher dimensions.

Figure 7 – Multiple linear regression visualization with car weight and horsepower as covariates and fuel efficiency as the dependent variable. The regression trains on the two covariates and produces a planar model that accounts for as much data as possible [11].

In the case of our project, Y is the maximum or minimum 24- or 168-hour projected temperature from a particular point in time, and each X_i is a feature describing the weather at that point in time, such as wind speed, historical average temperature, and humidity.
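
As a brief illustration of how such a model is fit in code, the sketch below trains scikit-learn’s LinearRegression on synthetic data standing in for our hourly weather features; the feature meanings in the comments are hypothetical examples, not our actual column set.

    # Minimal sketch of multiple linear regression with scikit-learn.
    # The synthetic features stand in for hourly weather readings
    # (e.g. wind speed, humidity, historical average temperature).
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))                         # three stand-in weather features
    y = 20 + X @ np.array([1.5, -0.8, 3.0]) + rng.normal(scale=0.5, size=500)

    model = LinearRegression().fit(X, y)                  # least-squares fit
    print(model.intercept_)                               # the intercept beta
    print(model.coef_)                                    # the slopes alpha_i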

Ridge Regression

Ridge regression is a technique which augments multiple linear regression to account for multicollinearity. Multicollinearity is the existence of near-linear relationships between independent variables, which can lead to large errors in ordinary linear regression. To solve this problem, the ridge model adds a degree of bias to the regression which reduces standard errors. To begin ridge regression, both the independent and dependent variables must be standardized, as all calculations are based on standardized variables. The model then modifies the multiple linear regression by adding a regularization term. The minimization problem turns from

\min_{\beta, A} \; (y - \beta - xA)^{T}(y - \beta - xA)

to

\min_{\beta, A} \; (y - \beta - xA)^{T}(y - \beta - xA) + c\,A^{T}A,

whose solution is the ridge estimator \hat{A} = (x^{T}x + cI)^{-1}x^{T}y,
where y is the vector of all dependent-variable points, x is the matrix of all covariate (independent) points, beta is the y-intercept, A is the vector of slopes in the linear model, c is the ridge constant to be determined, and I is the identity matrix whose dimension equals the number of covariates in x [12].

The ridge estimator is biased, and the amount of bias depends on the biasing parameter c. The algorithm consolidates multicollinear variables into new components which are orthogonal to the planes of collinearity, and uses these as variables to build a new model which is proven to be at least as accurate as the least-squares regression’s model. In the final steps, the regression coefficients are adjusted back to the original scale [12].

In the context of our project, the ridge regression’s x and y variables are the same as those of the linear regression. Ridge becomes effective because of the amount of collinearity inherent in our dataset; many variables, for example wet bulb temperature, yesterday’s temperature, and dew point, can be expected to have a nearly 1:1 relationship with each other. This creates challenges for an ordinary linear regression because the solutions to the minimization problem become non-unique, making the optimal solution harder to find. Ridge produces a more robust solution.
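
The behaviour described above can be seen in a small, hypothetical example: the sketch below builds two nearly identical features (in the spirit of wet bulb temperature versus dew point) and compares ordinary least squares with scikit-learn’s Ridge. The data and the alpha value are illustrative only.

    # Minimal sketch of ridge regression on nearly collinear features.
    # In practice the features would also be standardized first (e.g. StandardScaler).
    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=300)
    x2 = x1 + rng.normal(scale=0.01, size=300)   # nearly a 1:1 copy of x1
    X = np.column_stack([x1, x2])
    y = 3 * x1 + rng.normal(scale=0.1, size=300)

    print(LinearRegression().fit(X, y).coef_)    # coefficients can blow up in opposite directions
    print(Ridge(alpha=1.0).fit(X, y).coef_)      # the ridge penalty keeps them stable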

Random Forest Regression

Random forest regression is an estimator that utilizes decision trees on various samples of a data set to improve predictive accuracy and control over-fitting. It is a valuable method for classification and regression of a data set through the construction of decision trees. In comparison to standard trees, where the best split among all variables determines how each node is split, a random forest splits each node using the best among a subset of randomly chosen predictors at that node. This method is user-friendly and performs well compared to other classifiers. For expansive data sets such as ours, deep trees are grown which can lead to irregular patterns, so the ability to control over-fitting is valuable for this technique.

Random forests start with the concept of bootstrap aggregating (known as ‘bagging’). Rather than build a single model, the algorithm creates a set of bootstrap samples – random subsets of the dataset drawn with replacement. Because each sample is different, each model will be different. This has the effect of producing a robust regressor, rather than one that is overfit to the training set. The algorithm then builds an ensemble of models [13].
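
A minimal sketch of one bootstrap draw, assuming a generic feature matrix and label vector in place of our weather dataset:

    # Bootstrap aggregating ("bagging"): each sample draws |D| rows from the
    # dataset with replacement, so every sample (and hence every model) differs.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))              # stand-in dataset D
    y = rng.normal(size=1000)

    idx = rng.integers(0, len(X), size=len(X))   # |D| indices drawn with replacement
    X_boot, y_boot = X[idx], y[idx]              # one bootstrap sample of D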

In a plain decision-tree model, the ensemble is made up of decision trees which iterate through all variables and split the sample on the best available variable at each step, as illustrated in the pseudocode in Figure 8 below [13].

Figure 8 – Decision tree building algorithm. The algorithm iterates through each available feature and computes the impurity (ineffectiveness) of a split along that dimension. If the split would be more pure than the previously most pure split, that feature is recorded as the best available feature to split on. The algorithm ends by returning the name of the best dimension to split on. The decision tree is built by creating these splits at every leaf node to form new leaves, terminating when the maximum depth is reached, when the accuracy of the leaf-node models falls below the accuracy of the parent node’s model, or when every dimension has been used to form a split [13].

Splits are made according to a splitting function Imp (usually a measure called the Gini index). At each resulting leaf node a model is built (usually a simple weighted average), and then the algorithm constructs further splits down the tree until building further branches creates a less reliable model than a parent node [13].
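
To make the split search concrete, the sketch below implements a simplified version of the loop in Figure 8 for a regression tree, using the weighted variance of the child nodes as the impurity measure Imp (variance plays the role that the Gini index plays for classification trees). This is our own illustrative code, not the reference’s implementation.

    # Simplified sketch of the best-split search from Figure 8 (regression case).
    import numpy as np

    def best_split(X, y):
        """Return (feature index, threshold) of the purest axis-aligned split."""
        best_dim, best_thr, best_imp = None, None, np.inf
        for d in range(X.shape[1]):                   # iterate over every feature
            for thr in np.unique(X[:, d]):            # candidate thresholds along that feature
                left, right = y[X[:, d] <= thr], y[X[:, d] > thr]
                if len(left) == 0 or len(right) == 0:
                    continue
                # impurity Imp: size-weighted variance of the two child nodes
                imp = (len(left) * left.var() + len(right) * right.var()) / len(y)
                if imp < best_imp:                    # more pure than the best split so far
                    best_dim, best_thr, best_imp = d, thr, imp
        return best_dim, best_thr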

This concept is illustrated below in Figure 9, which is a sample decision tree regressor built to predict the survival of a Titanic passenger.

Figure 9 – Titanic survival probability decision tree. In this example tree, splits are formed which divide up the sample set according to the dimension which creates the most pure division. Then, each node builds a simple model to predict passenger survival based on the subset of the data as split so far. If the node’s model is less accurate than the parent node, the node is dropped [14].

The random forest model takes the decision tree one step further by implementing a technique called subspace sampling. The random forest only considers a subspace of the independent features at a time, making its splits based only on that subset. This introduces additional diversity into the ensemble by forcing differences between models. Additionally, this step decreases computational cost, especially in high-dimensional training sets (like ours).

Figure 10 summarizes the random forest ensemble building algorithm in pseudocode.

Figure 10 – Random Forest Pseudocode. The ensemble regressor is built by creating T models, building each model by drawing |D| data points with replacement from D. It then selects d features at random and reduces the dimensionality of the sample, and trains a tree model, adding it to the ensemble [13].

Once all of the bagged, subspaced decision tree models have been built, they are collected into one super-model called the ensemble regressor. Independent variables are then fed to the ensemble regressor, which obtains a forecast from each individual model and averages the results to produce a predicted value for the dependent variable [13].

The random forest model is adept at handling numerical features and, unlike linear models, random forests are able to capture non-linear interactions between the features and the target [13].

In the context of our project, a split would be formed along an independent variable interval. For example, a split could be formed at ‘month is less than 6?’, or ‘humidity is above 35.4%?’.
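
The sketch below shows how this maps onto scikit-learn’s RandomForestRegressor: n_estimators is the number of bagged trees and max_features is the size of the random feature subspace examined at each split. The data and hyperparameter values are illustrative.

    # Minimal sketch of a random forest regressor in scikit-learn.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))                  # stand-in for hourly weather features
    y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=1000)

    forest = RandomForestRegressor(
        n_estimators=100,      # T bagged trees in the ensemble
        max_features="sqrt",   # d randomly chosen features considered per split
    )
    forest.fit(X, y)
    pred = forest.predict(X[:5])                     # average of the 100 trees' forecasts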

Stochastic Gradient Descent

A gradient descent algorithm first relies on the assumption that a learning system is defined by the goal of minimizing a function J(w), called the ‘expected risk function’, decomposed as:

J(w) = \int Q(z, w)\, dP(z)

The minimization variable w represents the part of the learning system which must be adapted in response to events z observed in the real world. The loss function Q(z, w) measures the performance of the learning system using w to estimate the signal produced by event z. The algorithm assumes that each event z is a random, independent observation drawn from an unknown probability distribution dP(z). By definition, we do not and cannot know dP(z) [15].

J(w) cannot be minimized directly because the ground-truth distribution dP(z) is unknown. Gradient-based models compute an approximation of J(w) by using a finite training set of L independent observations z_1, z_2, …, z_L, approximating J(w) as:

\hat{J}_{L}(w) = \frac{1}{L} \sum_{i=1}^{L} Q(z_i, w)

Typical gradient descent models iterate through samples to create successive estimates of the optimal parameter w. With each iteration of data, the loss function’s performance on the new data point is used to inform an adjustment of w, weighted by a learning-rate constant which determines the influence of the new data point, as well as by the average of the gradients of the loss function over the entire training set [15].

The above algorithm requires huge computational resources with a large training set like ours. The stochastic gradient descent algorithm improves on gradient descent by dropping the operation of averaging the gradients of the loss over the complete training set. Instead, the algorithm computes the gradient locally, from the small randomly drawn subset of the data it is currently considering. Because the resulting estimate is local to that region of the data, it is likely to be poor, and noisier than a reliable forecasting trend [15].
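
Written out in the notation above (with \gamma_t standing for the learning-rate constant, a symbol we introduce here), the two update rules differ only in how much data each step touches:

    % Gradient descent: each step averages the gradient of the loss over all L samples.
    w_{t+1} = w_t - \gamma_t \, \frac{1}{L} \sum_{i=1}^{L} \nabla_w Q(z_i, w_t)

    % Stochastic gradient descent: each step uses the gradient at a single,
    % randomly chosen sample (or small random subset), giving a cheap but noisy update.
    w_{t+1} = w_t - \gamma_t \, \nabla_w Q(z_t, w_t)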

This problem is solved by iterating the algorithm many times (1,000 in our configuration), continuously updating the estimate. The end result is many poor estimates which, compiled together, lead to a local minimum of the estimated loss function (an optimal solution). In our scenario, the estimated parameter w defines a linear model that predicts temperature, and the loss function is the error in the predicted temperature. A visualization of this process is shown in Figure 11 [16].

Figure 11 – Stochastic Gradient Descent Visualization. The figure illustrates the algorithm which iterates many times and continuously updates the estimate with each pass. This leads to a local-minimum in the estimated loss function from the compiled bad estimates [16].

Stochastic gradient descent is extremely useful in our case, as it is especially successful with many variables. The model supports different loss functions and penalties for fitting linear models. It is adept at handling peculiarities in the data set and is a popular choice for training linear regression models.
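
A minimal sketch of how such a model might be set up with scikit-learn’s SGDRegressor, using a pipeline that standardizes the features first (SGD is sensitive to feature scale); the data are synthetic stand-ins and, apart from max_iter=1000 mirroring the configuration above, the settings are illustrative.

    # Minimal sketch of stochastic gradient descent regression in scikit-learn.
    import numpy as np
    from sklearn.linear_model import SGDRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5000, 8))            # stand-in for hourly weather features
    y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=5000)

    sgd = make_pipeline(
        StandardScaler(),                     # scale features before SGD
        SGDRegressor(loss="squared_error",    # squared-error loss (error in predicted temperature)
                     penalty="l2",
                     max_iter=1000),          # 1,000 passes, matching our configuration
    )
    sgd.fit(X, y)
    pred = sgd.predict(X[:5])                 # forecasts from the learned linear model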