Spark is a versatile distributed computing framework: a developer can write code in Java, Scala, Python, R, or SQL, depending on both personal preference and the required packages. As this project made extensive use of pipelining and machine learning packages, Spark was the ideal platform.
Databricks is a cloud data analytics platform built around Spark that enables rapid, iterative analyses and distributed system management. It also supports automated data import from various sources and export of results to the business intelligence tool of the developer’s choice. In this case, Tableau was the chosen export destination.
Python / Pandas
Python and the Pandas package were used to clean and format the data before loading it into Databricks. Pandas is a common tool for reformatting comma-separated value (CSV) spreadsheets, the format in which Bloomberg Terminal exports data. It also has a deep pedigree within the financial industry, as its development started at AQR Capital Management, a preeminent hedge fund.
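As a rough illustration of this kind of cleaning step, the sketch below tidies a small CSV sample with Pandas. It is not the project's actual code: the sample data, the column names (Date, Ticker, PX_LAST, Volume), and the helper clean_export are all assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical sample standing in for a CSV export; the column names
# and values are assumptions for illustration, not real project data.
RAW_CSV = """Date,Ticker,PX_LAST,Volume
01/03/2017,AAA,100.5,"1,200"
01/04/2017,AAA,,"1,350"
01/03/2017,BBB,20.25,"900"
01/03/2017,BBB,20.25,"900"
"""


def clean_export(csv_text: str) -> pd.DataFrame:
    """Return a tidy DataFrame from raw CSV text (hypothetical helper)."""
    df = pd.read_csv(io.StringIO(csv_text))
    # Normalize headers to lowercase so later code can rely on them.
    df.columns = [c.strip().lower() for c in df.columns]
    # Parse dates with an explicit format to avoid day/month ambiguity.
    df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")
    # Volumes arrive as strings with thousands separators; make them integers.
    df["volume"] = (
        df["volume"].astype(str).str.replace(",", "", regex=False).astype("int64")
    )
    # Drop exact duplicate rows and rows with no price.
    df = df.drop_duplicates().dropna(subset=["px_last"])
    return df.sort_values(["ticker", "date"]).reset_index(drop=True)


cleaned = clean_export(RAW_CSV)
print(cleaned)
```

The cleaned frame could then be written out with cleaned.to_csv(..., index=False) and uploaded to Databricks; the exact upload route depends on the workspace setup.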
Bloomberg Terminal is unmatched in its cataloging of both historical and real-time financial data. Washington University students are fortunate to have access to multiple terminals at each of the business libraries.
Tableau is a leading data visualization application for roles ranging from business analyst to data scientist. It provides powerful visualization tools with a gentle learning curve. Anyone seeking to perform well in this sort of project should consider knowledge of Tableau a prerequisite. A useful tutorial on linking a data source hosted in Databricks can be found here.