CSE 427s – Cloud Computing with Big Data Applications


  • Grading office hours
    • TUE Dec 19 3-4pm in Jolley 222 (MN)
    • All hw and quiz grades will be on BB on MON Dec 18.
    • All hw and quiz grading issues need to be fixed before TUE Dec 19 4pm!
    • Please come to my office hours – no Piazza or email!
  • The final project is up!
    • THU Nov 16: introduction
    • TUE Nov 21: milestone 1
    • TUE Nov 28: milestone 2 (WED Nov 29 for project 3 – cf. sign-in sheet)
    • THU Dec 14: final submission
    • no automatic extensions on any of these deadlines
  • All [R] Quizzes are linked from the course calendar – just click on the Socrative image!
  • Submission instructions clarified! Starting with hw6, follow those instructions to make sure you receive maximum credit!
    • group submission on Gradescope required for groups (-10% if your group’s submission is not a group submission)
    • wustlkey indicating the SVN repository for code submission required for both group and non-group submissions (no credit for a code submission if the wustlkey is not provided on the submission page for the respective problem linked on Gradescope)
  • Check out these VM troubleshooting tips whenever you run into issues with your VM!

TA Office Hours:

MON 4-6pm in Jolley 224 (Hongyi)
TUE 10am-12pm in Jolley 408 (Ruben)
WED 1-3pm in Jolley 224 (John)
THU 6:30-8:30pm in Urbauer 116 (Bofei)
FRI 10am-12pm in Urbauer 116 (Yu)
SAT 10am-12pm in Urbauer 116 (Yutong)


This course provides a comprehensive introduction to applied parallel computing using the MapReduce programming model for large-scale data management and processing. There will be an emphasis on hands-on experience working with the Hadoop architecture, an open-source software framework written in Java for distributed storage and processing of very large data sets on computer clusters. Further, we will derive and discuss various algorithms to tackle big data applications and make use of related big data analysis tools from the Hadoop ecosystem, such as Pig, Hive, Impala, and Apache Spark, to solve problems faced by enterprises today. Check the Roadmap for more detailed information.

Prerequisites: CSE 131 (solid background in programming with Java), CSE 247, and CSE 330 (basic knowledge of relational databases (RDBMS), SQL, and AWS).

This class counts towards the Certificate in Data Mining and Machine Learning as an applications course.

The content of this class is derived largely from the Cloudera Developer Training for MapReduce, the Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop, and the Cloudera Developer Training for Apache Spark, which are made available to Washington University through the Cloudera Academic Partnership program. Further materials are adapted from the “Mining of Massive Data Sets” book and class taught at Stanford by Jure Leskovec.

Instructor: Marion Neumann
Office: Jolley Hall Room 222
Office Hours: TUE 3-4pm (or individual appointment* – avoid drop ins w/o appointment)

*request individual appointments via email and allow 2-3 days for reply and scheduling

Section 1: 1-2:30pm

Section 2: 4-5:30pm

Please ask any questions related to the course materials and homework problems on Piazza. Other students might have the same questions or be able to provide a quick answer. Sign up using your wustl email address here: piazza.com/wustl/fall2017/cse427s
Any postings of (partial) solutions to problems (written or in the form of source or pseudo code) will result in a grade of zero for that particular problem for ALL students.



Course Calendar and Reading

Homework Assignments

Grades on BB

Resources and HowTos