This is an inactive course webpage. Find the one for your current semester.
This course provides a comprehensive introduction to applied parallel computing using the MapReduce programming model, which facilitates large-scale data management and processing. There will be an emphasis on hands-on experience with the Hadoop architecture, an open-source software framework written in Java for distributed storage and processing of very large data sets on computer clusters. Further, we will make use of related big data technologies from the Hadoop ecosystem of tools, such as Hive, Impala, and Pig, to develop analytics and solve problems faced by enterprises today.
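To give a flavor of the MapReduce model the course is built around, here is a minimal plain-Java sketch of the classic word-count example. This uses only the standard library, not the actual Hadoop API; the class and method names are illustrative, and the shuffle/reduce step is simulated in memory rather than distributed across a cluster.

```java
import java.util.*;
import java.util.stream.*;

// Illustrative sketch of the MapReduce word-count pattern in plain Java.
// In real Hadoop, the map and reduce functions run on different cluster
// nodes and the framework performs the shuffle between them.
public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for each word in a line of input.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1));
    }

    // Shuffle + reduce phase: group the emitted pairs by key (word)
    // and sum the values to get a total count per word.
    static Map<String, Integer> mapReduce(List<String> lines) {
        return lines.stream()
                    .flatMap(WordCountSketch::map)
                    .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        Collectors.summingInt(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        List<String> input = List.of("to be or not to be",
                                     "to do is to be");
        Map<String, Integer> counts = mapReduce(input);
        System.out.println(counts.get("to")); // 4
        System.out.println(counts.get("be")); // 3
    }
}
```

The key idea this sketch illustrates is that the programmer writes only the per-record map function and the per-key reduce function; the framework handles partitioning, shuffling, and fault tolerance.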
Prerequisites: CSE 247, CSE 131 (or a solid background in programming with Java), and CSE 330 (or basic knowledge of relational database management systems (RDBMS) and SQL).
This class counts towards the Certificate in Data Mining and Machine Learning as an applications course.
The content of this class is derived largely from the Cloudera Developer Training for Apache Hadoop and the Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop, which are made available to Washington University through the Cloudera Academic Partnership program. Further materials are adapted from the “Mining of Massive Data Sets” book and the class taught at Stanford by Jure Leskovec.
Instructor: Marion Neumann
Office: Jolley Hall Room 403
Office Hours: Thursday 11:30am-12:30pm; moved to Tuesday in the first week of May (Tuesday, May 3, 10-11:30am)!
Please ask any questions related to the course materials and homework problems on Piazza. Other students might have the same questions or may be able to provide a quick answer. Any posting of solutions to assignments (in the form of source or pseudo code) will result in a grade of zero for that particular problem/assignment for ALL students.
TA Office Hours
Grades on BB
Resources and HowTos