CSE427s Cloud Computing with Big Data Applications

This course provides a comprehensive introduction to applied parallel computing using the MapReduce programming model, which facilitates large-scale data management and processing. There will be an emphasis on hands-on experience working with Hadoop, an open-source software framework written in Java for distributed storage and processing of very large data sets on computer clusters. Further, we will derive and discuss various algorithms to tackle big data applications and make use of related big data analysis tools from the Hadoop ecosystem, such as Pig, Hive, Impala, and Apache Spark, to solve problems faced by enterprises today. Check the Roadmap for more detailed information.

Prerequisites: CSE 131 (solid background in programming with Java), CSE 247, and CSE 330 (RDBMS/SQL, Python, and AWS).

This class counts towards the Certificate in Data Mining and Machine Learning as an applications course.

The content of this class is derived largely from the Cloudera Developer Training for MapReduce, the Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop, and the Cloudera Developer Training for Apache Spark, which are made available to Washington University through the Cloudera Academic Partnership program. Further materials are adapted from the “Mining of Massive Data Sets” book and class taught at Stanford by Jure Leskovec.

Fall 2019