Machine Learning

The most basic machine learning model is multiple linear regression. This method fits a linear relationship between several independent variables and the dependent variable, here the future percent change in price. It often works well as a baseline model, but it assumes that the independent variables contribute additively (they do not interact unless interaction terms are added explicitly), that the residuals are approximately normally distributed, and that each variable can be transformed to have a linear relationship with the dependent variable.
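As a rough illustration of such a baseline, the sketch below fits a multiple linear regression by ordinary least squares with NumPy. The data are synthetic stand-ins, and the three features, the coefficient values, and the noise level are all assumptions made for the example rather than anything taken from real prices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 samples of 3 hypothetical features.
n_samples, n_features = 200, 3
X = rng.normal(size=(n_samples, n_features))
true_coef = np.array([0.5, -0.2, 0.1])      # assumed for illustration
true_intercept = 0.3                        # assumed for illustration
y = X @ true_coef + true_intercept + rng.normal(scale=0.05, size=n_samples)

# Prepend a column of ones so the fit includes an intercept term.
X_design = np.column_stack([np.ones(n_samples), X])

# Ordinary least squares: minimise the sum of squared residuals.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coef = beta[0], beta[1:]

# Share of the variance in y explained by the linear fit.
residuals = y - X_design @ beta
r_squared = 1.0 - residuals.var() / y.var()

print("intercept:", intercept)
print("coefficients:", coef)
print("R^2:", r_squared)
```

Because the synthetic target really is linear in the features, the fit recovers the assumed coefficients closely. On real price data the same code runs unchanged, but a low R^2 would be the expected outcome for a baseline.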

When a data set is too complex for multiple linear regression, researchers often turn to a random forest to capture the interactions between variables. A random forest is a collection of decision trees. Each decision tree recursively splits the training data into smaller subsets until each subset consists of similar samples. Each tree then routes a new sample to one of these subsets, and the forest averages the predictions of all its trees, which reduces variance and helps prevent overfitting. The critical parameters to tune in a random forest regression are the number of decision trees and the minimum size of the subsets (leaves) in each tree. The issue with random forests is that they are not designed for time series data: each tree splits a continuous input at thresholds, which effectively buckets it into discrete ranges, and the model has no notion of temporal order, so any dependence on past values has to be supplied explicitly as lagged features.
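To make the mechanics concrete, here is a minimal from-scratch sketch of a random forest regressor in NumPy. It is a simplified illustration under stated assumptions, not a production implementation: the function names (`build_tree`, `fit_forest`, and so on) are hypothetical, the data are synthetic, and the target is deliberately built with an interaction between two features. The two tuning parameters discussed above appear as `n_trees` and `min_leaf`.

```python
import numpy as np

def build_tree(X, y, min_leaf, n_try, rng):
    """Recursively split the data; a leaf is stored as the mean target."""
    if len(y) < 2 * min_leaf or np.all(y == y[0]):
        return float(y.mean())
    best = None
    # Each split considers only a random subset of the features.
    for j in rng.choice(X.shape[1], size=n_try, replace=False):
        for t in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= t
            n_left = int(left.sum())
            n_right = len(y) - n_left
            if n_left < min_leaf or n_right < min_leaf:
                continue
            # Total squared error around each side's own mean.
            sse = y[left].var() * n_left + y[~left].var() * n_right
            if best is None or sse < best[0]:
                best = (sse, j, t, left)
    if best is None:
        return float(y.mean())
    _, j, t, left = best
    return (j, t,
            build_tree(X[left], y[left], min_leaf, n_try, rng),
            build_tree(X[~left], y[~left], min_leaf, n_try, rng))

def predict_tree(node, x):
    """Walk from the root to a leaf and return the leaf's mean target."""
    while isinstance(node, tuple):
        j, t, low, high = node
        node = low if x[j] <= t else high
    return node

def fit_forest(X, y, n_trees=25, min_leaf=5, seed=0):
    """Grow each tree on a bootstrap resample of the training data."""
    rng = np.random.default_rng(seed)
    n_try = max(1, int(np.sqrt(X.shape[1])))
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))
        forest.append(build_tree(X[idx], y[idx], min_leaf, n_try, rng))
    return forest

def predict_forest(forest, X):
    """Average the predictions of all trees for each sample."""
    return np.array([np.mean([predict_tree(tree, x) for tree in forest])
                     for x in X])

# Synthetic data: feature 0 only matters when feature 1 is positive.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 4))
y = X[:, 0] * (X[:, 1] > 0) + rng.normal(scale=0.05, size=300)
X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

forest = fit_forest(X_train, y_train, n_trees=25, min_leaf=5)
pred = predict_forest(forest, X_test)
test_mse = float(np.mean((pred - y_test) ** 2))
baseline_mse = float(np.var(y_test))  # error of always predicting the test mean
print("forest MSE:", test_mse, "baseline MSE:", baseline_mse)
```

Raising `n_trees` smooths the averaged prediction at the cost of run time, while raising `min_leaf` makes each tree coarser and less prone to fitting noise. In practice a library implementation would normally replace this sketch.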

To cope with the temporal nature of asset prices, researchers often use time series analysis methods instead. Hidden Markov models are among the most established of these methods. They treat the observed data as the output of a sequence of unobserved (hidden) states and chain together the probabilistic transitions between those states in order to estimate the most likely current state. The most important parameters in this model are the type of emission model (multinomial, Gaussian, or Gaussian mixture) and the number of hidden states. Although they work well with time series data that follow obvious trends, they cannot handle as much information as other machine learning models and often overfit to past trends.
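The way a hidden Markov model chains transitions together can be sketched with the forward (filtering) recursion for Gaussian emissions. This is only the inference half of the method: the transition matrix, state means, and standard deviations below are hypothetical values chosen for illustration, whereas a real application would estimate them from data (for example with the Baum-Welch algorithm, which is omitted here). The two states are meant to suggest a calm and a turbulent regime.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution, evaluated per hidden state."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def forward_filter(obs, start_prob, trans, means, stds):
    """Return P(state at t | observations up to t) and the log-likelihood."""
    filtered = np.zeros((len(obs), len(start_prob)))
    log_lik = 0.0
    predicted = start_prob                  # state probabilities before seeing obs[0]
    for t, x in enumerate(obs):
        # Weight each state by how well it explains the observation.
        alpha = predicted * gaussian_pdf(x, means, stds)
        norm = alpha.sum()
        filtered[t] = alpha / norm
        log_lik += np.log(norm)
        # Push the state probabilities one step through the transition matrix.
        predicted = filtered[t] @ trans
    return filtered, log_lik

# Hypothetical two-state model: state 0 is calm, state 1 is turbulent.
means = np.array([0.05, -0.10])             # mean percent change per state
stds = np.array([0.5, 2.0])                 # volatility per state
start_prob = np.array([0.5, 0.5])
trans = np.array([[0.95, 0.05],             # trans[i, j] = P(next = j | current = i)
                  [0.10, 0.90]])

# Synthetic returns: 100 calm periods followed by 100 turbulent ones.
rng = np.random.default_rng(0)
true_states = np.repeat([0, 1], 100)
returns = rng.normal(means[true_states], stds[true_states])

filtered, log_lik = forward_filter(returns, start_prob, trans, means, stds)
print("mean P(turbulent), first half: ", filtered[:100, 1].mean())
print("mean P(turbulent), second half:", filtered[100:, 1].mean())
```

The number of hidden states corresponds to the length of `means`, and swapping `gaussian_pdf` for a different emission density is what changing the model type amounts to.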

There is, however, a class of time series model that can handle more information: the recurrent neural network. Long short-term memory (LSTM) networks are the most useful type of recurrent neural network for financial time series data. This model combines the hidden layers of a standard feedforward neural network with an additional memory layer that carries information along the sequence in order to predict future data. The important parameters of a general neural network are the type of solver, the number of hidden layers, and the number of neurons per layer. With recurrent neural networks, the number of past states to examine is an important parameter as well. Although recurrent neural networks are more powerful than the other methods mentioned above, they require the most data, and their parameters are very hard to tune without overfitting.
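The role of the number of past states can be seen in a small NumPy sketch of the forward pass through a single LSTM cell. Everything here is an illustrative assumption: the weights are random and untrained, so the output is not a meaningful forecast; the series is synthetic; and `make_windows` and `lstm_forward` are hypothetical helper names. Training, which a deep learning framework would normally handle, is left out entirely.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_windows(series, lookback):
    """Row i holds series[i : i + lookback]; the target is the next value."""
    X = np.stack([series[i:i + lookback]
                  for i in range(len(series) - lookback)])
    y = series[lookback:]
    return X, y

def lstm_forward(window, W, U, b):
    """Run one LSTM cell over a window of scalars; return the last hidden state."""
    hidden = U.shape[1]
    h = np.zeros(hidden)                    # hidden state (short-term output)
    c = np.zeros(hidden)                    # cell state (long-term memory)
    for x in window:
        z = W * x + U @ h + b               # all four gates at once
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g                   # forget old memory, write new candidate
        h = o * np.tanh(c)                  # expose part of the memory
    return h

rng = np.random.default_rng(0)
series = rng.normal(scale=0.01, size=60)    # synthetic percent changes
lookback, hidden = 10, 8                    # past states examined, neurons in the layer
X, y = make_windows(series, lookback)

# Untrained random weights, for shape illustration only.
W = rng.normal(scale=0.1, size=4 * hidden)
U = rng.normal(scale=0.1, size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)
w_out = rng.normal(scale=0.1, size=hidden)

h_last = lstm_forward(X[0], W, U, b)
prediction = float(w_out @ h_last)          # linear read-out of the final state
print("windows:", X.shape, "targets:", y.shape, "prediction:", prediction)
```

A longer `lookback` gives the cell more history per sample but leaves fewer training windows, which is one reason this parameter interacts with the amount of data available.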