Random forest classification trees, implemented in PySpark, will be the backbone of generating recommendations. The developer sets a return threshold, and each security receives a 1 or a 0 in the binary label column called performed, depending on whether it meets that threshold.
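As a minimal sketch of the labeling step described above (the threshold value and function name here are hypothetical, not from the project code):

```python
# Hypothetical sketch: deriving the binary "performed" label
# from raw weekly returns using a chosen return threshold.
THRESHOLD = 0.02  # e.g. require at least a 2% weekly return

def label_performed(weekly_return, threshold=THRESHOLD):
    """Return 1 if the security met the return threshold, else 0."""
    return 1 if weekly_return >= threshold else 0

returns = [0.035, -0.01, 0.02, 0.004]
labels = [label_performed(r) for r in returns]
print(labels)  # -> [1, 0, 1, 0]
```

In the actual pipeline this labeling would be applied as a column transformation on a Spark DataFrame rather than a Python list, but the rule is the same.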
Above is an example of a single decision tree estimating the likelihood that a particular passenger survived the sinking of the Titanic. A random forest is made up of many such decision trees and falls under a class of techniques known as ensemble learning.
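The ensemble idea can be sketched in a few lines of plain Python: each tree casts a 0/1 vote and the forest takes the majority. The three "trees" below are hypothetical hand-written stand-ins for trained models, using Titanic-style features:

```python
# Minimal ensemble sketch: majority vote over several decision rules.
# These toy "trees" are hypothetical stand-ins for trained trees.
from collections import Counter

def tree_a(x): return 1 if x["fare"] > 50 else 0
def tree_b(x): return 1 if x["age"] < 18 else 0
def tree_c(x): return 1 if x["fare"] > 20 and x["age"] < 40 else 0

def forest_predict(x, trees=(tree_a, tree_b, tree_c)):
    votes = Counter(t(x) for t in trees)
    return votes.most_common(1)[0][0]  # the majority class

passenger = {"fare": 72.0, "age": 29}
print(forest_predict(passenger))  # two of three trees vote 1
```

In a real random forest the trees are learned from bootstrapped samples and random feature subsets, which is what makes the ensemble more robust than any single tree.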
A paper used for guidance in this project is linked: Understanding Random Forests: From Theory to Practice. Anyone iterating on this project is encouraged to give it a read.
A possible variation of this project would be to use regression trees to predict exact week-by-week security returns, rather than classifying whether a security will meet or fall short of a threshold.
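The difference between the two framings can be sketched with hypothetical stand-in trees: a regression forest averages each tree's predicted return instead of taking a majority vote over 0/1 labels.

```python
# Hypothetical sketch of the regression-tree variation: each toy
# "tree" emits a predicted weekly return, and the forest averages them.
def tree_a(features): return 0.018
def tree_b(features): return 0.025
def tree_c(features): return 0.011

def forest_predict_return(features, trees=(tree_a, tree_b, tree_c)):
    preds = [t(features) for t in trees]
    return sum(preds) / len(preds)  # mean of the tree outputs

print(forest_predict_return({}))  # -> 0.018
```

In PySpark this would amount to swapping the classifier for the corresponding regression estimator and fitting against the raw return column instead of the binary label.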
Common Factor Discovery
A K-Means clustering algorithm was used in an attempt to discover distinguishing factors among stocks predicted to perform well. While it did not surface any such factors in this project, it remains a recommended method: Spark ships accessible clustering packages, and K-Means is easy to implement relative to other algorithms.
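For intuition, here is a minimal Lloyd's-algorithm K-Means sketch in plain Python (the toy data and feature choice are hypothetical; the project itself would use Spark's clustering package). Each point might represent a stock's feature vector, such as (volatility, average volume):

```python
# Minimal K-Means (Lloyd's algorithm) sketch on hypothetical 2-D points.
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    for _ in range(iters):
        # Assign each point to its nearest centroid (squared distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # Move each centroid to the mean of its assigned points.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(dim) / len(cl)
                                     for dim in zip(*cl))
    return centroids, clusters

points = [(0.1, 1.0), (0.2, 1.1), (5.0, 9.0), (5.2, 8.8)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))  # two centroids, one per natural group
```

Inspecting which stocks land in the same cluster as the strong performers is one way to hunt for the common factors this section describes.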