**This is an inactive course webpage**
Lectures and Labs
- Lecture: THU 4-5:20pm in Wilson 214
- Lab – Section 1: TUE 1-2:20pm in Eads 016
- Lab – Section 2: TUE 4-5:20pm in January 110
Instructor: Marion Neumann
Office: Jolley Hall Room 222
Contact: Please use Piazza!
Office Hours: TUE 3-4pm or individual appointment (request via email – allow for 1-2 days to reply and schedule)
Please avoid dropping in without an appointment outside my office hours.
Head TA: Jonathan C (takes care of all grading issues – contact via Piazza or in his office hours)
TAs: Alexis, Arushee, Jordie, Kevin, Lorenzo, Patrick, Steven, Wentao, Zhibo
TA Office Hours:
- Monday 2:30-4pm in Sever 300: Alexis, Jonathan
- Wednesday 2:30-4:30pm in Lopata 201: Lorenzo, Patrick
- Friday 10am-2pm in Lopata 201: Wentao, Jordie, Steven
- Sunday 9-11am in Rudolph 201: Kevin, Zhibo
This course provides a comprehensive introduction to applied parallel computing using the MapReduce programming model facilitating large scale data management and processing. There will be an emphasis on hands-on experience working with the Hadoop architecture, an open-source software framework written in Java for distributed storage and processing of very large data sets on computer clusters. Further, we will derive and discuss various algorithms to tackle big data applications and make use of related big data analysis tools from the Hadoop ecosystem, such as Pig, Hive, Impala, and Apache Spark to solve problems faced by enterprises today. Check the Roadmap for more detailed information.
Prerequisites: CSE 131 (solid background in programming with Java), CSE 247, and CSE 330 (basic knowledge of relational databases (RDBMS), SQL, and AWS). Use this prerequisite check list if you are not sure.
This class counts towards the Certificate in Data Mining and Machine Learning as an applications course.
This class uses materials from the Cloudera Developer Training for MapReduce, the Cloudera Data Analyst Training: Using Pig, Hive, and Impala with Hadoop, and the Cloudera Developer Training for Apache Spark, which are made available to Washington University through the Cloudera Academic Partnership program. Further contents are based on the “Mining of Massive Data Sets” book and class taught at Stanford by Jure Leskovec.
Lectures and Labs
Lectures will be held every THU. Labs will be held on TUE. Lab sessions may be replaced by lectures. Any change in locations will be announced on the course webpage.
Course Content
Find a list of topics to be covered on the course Roadmap!
Homework Assignments
We will have weekly homework assignments that can be worked on in groups of two students. They will be assigned after each lecture on THU and will be due one week later at 4pm. Each homework assignment will be graded and its score counts towards the total grade. It is every student’s responsibility to meet the submission requirements and deadlines. Late submissions will not be accepted for any reason (see also: Late Policy below). Submissions that do not follow the instructions provided on the assignment or the homework website will receive a score penalty. All homework assignments will be weighted equally and the total homework grade will contribute 40% towards your total course performance.
Regrade Requests
Any regrade requests and claims of missing scores for any graded work will have to be made within one week of the grade announcement. We will not take any regrade requests made after this one-week period for any reason. Grade announcements and grading comments will be provided via Gradescope. Grades will be maintained on Canvas.
Makeup Homework
There will be one additional makeup homework assigned in the last week of classes that can be used to replace the lowest homework assignment score.
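As a rough illustration of the replacement rule, here is a minimal Python sketch. The function name `homework_average` is hypothetical, and the sketch assumes that the makeup score is only used when it is higher than the lowest regular score; the exact procedure used in the course may differ.

```python
def homework_average(scores, makeup=None):
    """Equal-weight average of homework scores (percentages).

    Assumption: a makeup score replaces the lowest regular score
    only if it is higher than that score.
    """
    scores = list(scores)
    if not scores:
        raise ValueError("at least one homework score is required")
    if makeup is not None and makeup > min(scores):
        # Replace the first occurrence of the lowest score.
        scores[scores.index(min(scores))] = makeup
    return sum(scores) / len(scores)
```

For example, with scores of 80, 60, and 100 and a makeup score of 90, the 60 is replaced and the average rises from 80 to 90.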
In-class Exam
There will be one written in-class exam contributing 30% towards your total course performance. Date:
- exam: THU 21 Nov 2019 4-5:20pm in the lecture
- no final exam
Final Project
There will be a final project assigned in the 2nd half of the course. Due date:
- FRI 13 Dec 2019 at 6pm (no extension)
The final project contributes 20% towards your total course performance. It will be graded on a 0-100% score scale, and grades will be assigned based on both its implementation component and its conceptual component (a report motivating, describing, and analyzing the project and a discussion of its experimental results and impact).
Non-curricular Activities
We cannot offer accommodation for examinations and given deadlines for non-curricular activities outside your Wash U commitments. This includes job interviews or flying home early. I understand that you may decide to miss a scheduled exam date for these reasons, but you will need to weigh the consequences when making such a decision.

Lab and Active Learning Quizzes
Quizzes will be given in lectures and labs to encourage your own thinking and enhance your learning process. We will use Socrative (http://www.socrative.com) to distribute and record quizzes. Students will need to bring a Wi-Fi-enabled device (laptop, tablet, smartphone, …).
- Lab quizzes will be graded and contribute 10% towards the final grade for the course. The two lowest-scoring recap quizzes will be dropped.
- In-class Active Learning quizzes will be recorded for participation. A participation rate above 70% may result in a grade bump for borderline final course scores (less than 1% away from a cutoff). Active Learning quizzes only count if your answers are meaningful; quizzes with empty or nonsense answers do not count towards participation.
- There are no make-ups for missed quizzes.
Grading Summary
40% homework assignments
10% lab quizzes
30% in-class exam
20% final project (implementation component and conceptual component)
It is not possible to achieve a higher percentage on any individual grade component than listed above through bonus or extra credit problems.
Final course grades will be assigned using the following straight scale:
Letter Grade | Cutoff Percentage |
---|---|
A | >= 93% |
A- | >= 90% |
B+ | >= 87% |
B | >= 83% |
B- | >= 80% |
C+ | >= 77% |
C | >= 73% |
C- | >= 70% |
D+ | >= 67% |
D | >= 63% |
D- | >= 60% |
F | < 60% |
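The weights and cutoffs above can be combined as in the following Python sketch. The names `WEIGHTS`, `CUTOFFS`, `course_percentage`, and `letter_grade` are made up for illustration. Capping each component at 100% is one reading of the no-extra-credit rule, and the sketch applies no rounding, which is an assumption since the page does not say how scores are rounded.

```python
# Component weights from the grading summary (fractions of the total score).
WEIGHTS = {"homework": 0.40, "lab_quizzes": 0.10, "exam": 0.30, "project": 0.20}

# Letter-grade cutoffs from the straight scale, highest first.
CUTOFFS = [
    (93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"), (67, "D+"), (63, "D"), (60, "D-"),
]


def course_percentage(scores):
    """Weighted total; each component percentage is capped at 100 (assumption)."""
    return sum(WEIGHTS[name] * min(scores[name], 100.0) for name in WEIGHTS)


def letter_grade(percentage):
    """Return the first letter whose cutoff the percentage meets, else F."""
    for cutoff, letter in CUTOFFS:
        if percentage >= cutoff:
            return letter
    return "F"
```

For instance, component scores of 92% (homework), 85% (lab quizzes), 78% (exam), and 95% (project) give a total of about 87.7%, which falls in the B+ range.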
Late Policy
Your homework assignments must be turned in on time. We cannot accept any late submissions or submissions that do not follow the instructions on the assignment. It is your responsibility to follow the submission instructions exactly. You get an automatic three-day extension on every homework deadline (this does not include the final project deadlines).
WARNING: There is absolutely NO extension to this extension for ANY reason!
Collaboration Policy
You are encouraged to discuss the course materials with other students. Discussing the material, and the general form of solutions to the labs is a key part of the class. Since, for many of the assignments, there is no single “right” answer, talking to other students and to the TAs is a good thing. However, everything that you turn in should be your own work, unless we tell you otherwise. If you talk about assignment solutions with another student, then you need to explicitly tell us on the hand-in. You are not allowed to copy answers/code or parts of answers/code from anyone else, or from material you find on the internet. This will be considered as willful cheating, and will be dealt with according to the official collaboration policy stated below:
Academic Integrity
Unless explicitly instructed otherwise, everything that you turn in for this course must be your own work. If you willfully misrepresent someone else’s work as your own, you are guilty of cheating. Cheating, in any form, will not be tolerated in this class.
Check out these questions and answers in the CSE FAQ.
There is zero tolerance of Academic Dishonesty. I will be actively searching for academic dishonesty on all homework assignments, quizzes, and exams. If you are guilty of cheating on any assignment or exam, you will receive an F (failed) in the course and be referred to the School of Engineering Discipline Committee. In severe cases, this can lead to expulsion from the University, as well as possible deportation for international students. If you copy from anyone in the class or anyone that has previously taken this or other classes at Washington University covering the same topics, both parties will be penalized, regardless of which direction the information flowed. This is your only warning.
Please refer to the University Undergraduate Academic Integrity Policy for more information. If you suspect that you may be entering an ambiguous situation, it is your responsibility to clarify it before the professor or TAs detect it. If in doubt, please ask us.
Mental Health
Mental Health Services professional staff members work with students to resolve personal and interpersonal difficulties, many of which can affect the academic experience. These include conflicts with or worry about friends or family, concerns about eating or drinking patterns, and feelings of anxiety and depression. See: http://shs.wustl.edu/MentalHealth
If you have any problems with the workload of this class, please come and talk to me. The earlier we talk the better.
Accommodations based upon sexual assault
The University is committed to offering reasonable academic accommodations to students who are victims of sexual assault. Students are eligible for accommodation regardless of whether they seek criminal or disciplinary action. Depending on the specific nature of the allegation, such measures may include but are not limited to: implementation of a no-contact order, course/classroom assignment changes, and other academic support services and accommodations. If you need to request such accommodations, please direct your request to Kim Webb (kim_webb@wustl.edu), Director of the Relationship and Sexual Violence Prevention Center. Ms. Webb is a confidential resource; however, requests for accommodations will be shared with the appropriate University administration and faculty. The University will maintain as confidential any accommodations or protective measures provided to an individual student so long as it does not impair the ability to provide such measures.
If a student comes to me to discuss or disclose an instance of sexual assault, sex discrimination, sexual harassment, dating violence, domestic violence or stalking, or if I otherwise observe or become aware of such an allegation, I will keep the information as private as I can, but as a faculty member of Washington University, I am required to immediately report it to my Department Chair or Dean or directly to Ms. Jessica Kennedy, the University's Title IX Coordinator. If you would like to speak with the Title IX Coordinator directly, Ms. Kennedy can be reached at (314) 935-3118, jwkennedy@wustl.edu, or by visiting her office in the Women's Building. Additionally, you can report incidents or complaints to Tamara King, Associate Dean for Students and Director of Student Conduct, or by contacting WUPD at (314) 935-5555 or your local law enforcement agency.
You can also speak confidentially and learn more about available resources at the Relationship and Sexual Violence Prevention Center by calling (314) 935-8761 or visiting the 4th floor of Seigle Hall.
Bias Reporting
The University has a process through which students, faculty, staff and community members who have experienced or witnessed incidents of bias, prejudice or discrimination against a student can report their experiences to the University's Bias Report and Support System (BRSS) team. See: http://brss.wustl.edu
Part I – MapReduce (about 5 weeks)
- Big Data and Big Data Analysis
- Cloud Computing
- Distributed Storage
- HDFS
- Distributed Computation
- MapReduce Programming Paradigm
- Hadoop MapReduce
- Running a MapReduce Job
- Job Execution
- Writing MapReduce Programs
- Implementing Mappers, Reducers, and Drivers in Java
- Reusing Objects and Map-only Jobs
- Job Configuration
- MapReducing Algorithms
- Sorting and Searching
- Inverted Index
- Secondary Sort
Part II – Big Data Analysis and Applications (about 8 weeks)
- Recommendation Engines
- Introduction (Top-N list, Frequently Bought Together)
- Content-based Recommendation
- Collaborative Filtering
- Final Project Topic: Netflix Recommendation Challenge
- [Data Analysis with Pig]
- Basics: Pig Latin, Loading Data, Data Types
- Example Data Analysis Task: ETL
- Multi-Dataset Operations
- Data Analysis with Hive
- Introduction: Hive vs Traditional Databases
- HiveQL Syntax and Built-in Functions
- Hive Data Management
- Final Project Topic: Text Processing with Hive
- Data Analysis with Spark
- RDDs
- Interactive Spark Shell and PySpark
- Example Application: ETL
- Spark Applications – Python Programs using Spark
- Example Iterative Algorithm: PageRank
- Final Project Topic: Geolocation Clustering with Spark
- Applications
- Link Analysis: PageRank
- Text-Mining: TF-IDF, Word-Co-occurrence, and N-grams
- [Introduction to Impala]
- Interactive Data Analysis with Impala
- Discussion: [Impala vs] Hive vs [Pig vs] Spark vs MapReduce vs RDBMS
Note: [ ] indicate optional topics
More Big Data Applications (we might touch upon)
- Large-scale Machine Learning
- Classification (k-NN in MapReduce, Perceptron in Spark)
- Final Project Topic: Clustering (k-means in MapReduce, Spark)
- Naive Bayes and Linear Regression in MapReduce
- Principal Component Analysis
- Matrix multiplication via MapReduce
- PCA and SVD
- Finding Similar Items
- Document Retrieval
- Big input/feature spaces
- Locality Sensitive Hashing
- Social Network Analysis
- Social Networks as Graphs
- Clustering/Community Detection/Partitioning
- Finding Triangles using MapReduce
- Beyond Social Networks: Introduction to Graph-based Machine Learning
Date | Topic | Materials |
---|---|---|
27 Aug | Syllabus, Course Overview | |
29 Aug | Introduction | |
3 Sept | Lab1 – Distributed Storage | |
5 Sept | Cloud Computing; MR1: MapReduce (MR) | |
10 Sept | Lab2 – MR2: Running a MR Job | |
12 Sept | MR3: Job Execution | |
17 Sept | Lab3 – MR4: Writing & Testing a MR Program | |
19 Sept | MR3: Job Execution cont.; MR5: Job Configuration | |
24 Sept | Lab4 – MR6: Optimizing MR Programs | |
26 Sept | Application: Recommendation Systems; RS1: Collaborative Filtering | |
1 Oct | MR7: Custom Key/Value Types; Lab 5 – Use Case: Secondary Sort | |
3 Oct | RS1: Collaborative Filtering contd. | |
8 Oct | MR8: Practical Development; Lab 6 – RS2: Top-N-List Recommendations | |
10 Oct | RS3: Co-occurrence based Recommendation | |
15 Oct | FALL BREAK – no lab | |
17 Oct | MR9: MR Workflows & Beyond MR; Database Operations on HDFS | |
22 Oct | Lab 7 – Data Management with Hive | |
24 Oct | SP1: Spark | |
29 Oct | Lab 8 – Flume & Spark Shell | |
31 Oct | SP2: More Spark; Application: PageRank | |
5 Nov | SP3 – RDD Persistence; FINAL PROJECT milestone 1 | |
7 Nov | Application: Text Mining & Sentiment Analysis | |
12 Nov | Lab 10 – PageRank for Real; FINAL PROJECT milestone 2 | |
14 Nov | Model Parameters, Choices, and Evaluation; FINAL PROJECT milestone 3 | |
19 Nov | EXAM Review and Questions; EXAM Preparation | |
21 Nov | ***In-Class EXAM*** | |
26 Nov | Lab 11 – Cloud Execution on EMR; FINAL PROJECT milestone 4 | EMR links for final projects |
28 Nov | THANKSGIVING – no class | |
3 Dec | Lab 12 – Work on Project | Exam regrades may be made in the lab sessions. |
5 Dec | Application: Large-scale Classification or Application: Large-Scale Social Network Analysis | |
13 Dec | FINAL PROJECT due at 6pm – no extension! | |
All homework submissions must be made via Gradescope. Sign-up will be managed via Canvas.
SUBMISSION INSTRUCTIONS
(violations will result in a penalty on the hw grade)
Find a tutorial on submitting a PDF to Gradescope HERE or watch this video.
- Match Pages: In Gradescope every page needs to be matched to the problems it contains.
- A penalty of 10% of the assignment score applies if the pages are not matched or are matched incorrectly.
- [tentative] Include wustlkey: Each page needs to include one wustlkey indicating the SVN repository used for code submission.
- required for individual and group submissions
- no credit for code submission if wustlkey is not provided on the submission page for the respective problem
- regrades will be taken but a penalty of -10% of the problem score will be applied
- Gradescope Group Submission: In Gradescope both group members need to be added to the submission.
- -5% penalty for all team members if your group’s submission is not a Gradescope group submission listing both team members
- Find a tutorial on how to add a group member to your submission in the second half of this video.
- 08/29 hw1
- due: THU 09/05/2019 at 4pm
- submit pdf via Gradescope
- submit hw reflection via SVN repository commit
- submit hw rating using this link
- 09/05 hw2
- due: THU 09/12/2019 at 4pm
- submit pdf via Gradescope
- submit hw reflection via SVN repository commit
- submit hw rating using this link
- 09/12 hw3
- due: THU 09/19/2019 at 4pm
- submit pdf via Gradescope
- submit hw reflection via SVN repository commit
- submit hw rating using this link
- 09/19 hw4
- due: THU 09/26/2019 at 4pm
- submit pdf to Homework 4 assignment via Gradescope
- submit zip to Homework 4 Code assignment via Gradescope
- submit hw reflection via SVN repository commit
- submit hw rating using this link
- 09/26 hw5
- due: THU 10/03/2019 at 4pm
- submit pdf to Homework 5 assignment via Gradescope
- submit zip to Homework 5 Code assignment via Gradescope
- submit hw reflection via SVN repository commit
- submit hw rating using this link
- 10/03 hw6
- due: THU 10/10/2019 at 4pm
- submit pdf to Homework 6 assignment via Gradescope
- submit zip to Homework 6 Code assignment via Gradescope
- submit hw reflection via SVN repository commit
- submit hw rating using this link
- 10/10 hw7
- due: THU 10/24/2019 at 4pm (2 weeks!!!)
- submit pdf to Homework 7 assignment via Gradescope
- submit zip to Homework 7 Code assignment via Gradescope
- submit hw reflection via SVN repository commit
- submit hw rating using this link
- 10/24 hw8
- due: THU 10/31/2019 at 4pm
- submit pdf to Homework 8 assignment via Gradescope
- submit hw reflection via SVN repository commit
- submit hw rating using this link
- 10/31 hw9
- due: THU 11/07/2019 at 4pm
- submit pdf to Homework 9 assignment via Gradescope
- submit zip to Homework 9 Code assignment via Gradescope
- submit hw reflection via SVN repository commit
- submit hw rating using this link
Books
- Mining of Massive Data Sets by Jure Leskovec, Anand Rajaraman, Jeff Ullman (available for free online http://mmds.org)
- Hadoop: The Definitive Guide (4th edition) by Tom White
- electronic copy through the Wash U library for viewing online
- Data Analytics with Hadoop – An Introduction for Data Scientists by Benjamin Bengfort, Jenny Kim
- electronic copy through the Wash U library for viewing online
Optional Book:
- Data Algorithms: Recipes for Scaling Up with Hadoop and Spark by Mahmoud Parsian
Cloudera Course VM
We will use a pre-configured virtual machine in the course.
- Download and install a virtualization program to run the virtual machine.
- VirtualBox (recommended for all platforms Mac, Linux, Windows)
- VMWare (possible for Windows OS – limited instructor and TA support)
- Download the VM matching your virtualization software from HERE.
- System requirements: To be able to run the VM on your laptop you need at least 4GB RAM which is the minimum recommended memory as indicated here.
- If your laptop/computer does not support these requirements, please contact me asap! We can provide you with a rental laptop for the semester.
- Set up the VM.
- Check out these VM trouble-shooting tips whenever you run into issues with your VM!
- Working with the VM and optional set up.
- The username for the CentOS operating system running in the VM is cloudera and the password (if you should need it) is cloudera as well.
- Attention Windows users: make friends with the Linux terminal and consider this cheat sheet for useful shell commands.
- Optional: to install software on CentOS running in your VM use the terminal application yum
- e.g., sudo yum install htop (if you want to install htop)
- e.g., sudo yum install subversion (if you want to install svn)
- here is a tutorial on yum
- Optional but useful: set up a shared folder with your host machine: here are the instructions.
Gradescope
We will use Gradescope for all homework grading. Find a tutorial on submitting a PDF to Gradescope HERE. To sign up use entry code TBA.
SVN
We will be using SVN to distribute code stubs and data, as well as to collect code solutions. Please see this tutorial about accessing your repository.
The path to your SVN repository is:
https://svn.seas.wustl.edu/repositories/<wustlkey>/cseXXX_fl18
You need to substitute your own wustlkey (e.g. m.neumann) in place of <wustlkey>, your course number (e.g. 427s) in place of XXX, and the respective abbreviation for the semester and year (e.g. fl18 for fall 2018).
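The substitution described above can be written out as a small Python helper. This is only an illustrative sketch: the function name `repo_url` and the constant `SVN_ROOT` are hypothetical, and the example values are the ones given in the text.

```python
SVN_ROOT = "https://svn.seas.wustl.edu/repositories"


def repo_url(wustlkey, course_number, semester):
    """Fill in the <wustlkey>/cseXXX_<semester> pattern of the repository path."""
    return f"{SVN_ROOT}/{wustlkey}/cse{course_number}_{semester}"


# Example values taken from the text: wustlkey m.neumann, course 427s, fall 2018.
print(repo_url("m.neumann", "427s", "fl18"))
# prints https://svn.seas.wustl.edu/repositories/m.neumann/cse427s_fl18
```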
If you wish to access your files from your own computer, you can use SVN via the terminal (Mac, Linux) on your host machine, or you can install TortoiseSVN (Windows) or SmartSVN (Windows, Mac, Linux) on your host OS.
Verifying your repository commits
To verify that your work was committed successfully, enter the URL of your repository (https://svn.seas.wustl.edu/repositories/<wustlkey>/cseXXX_fl18) in a web browser. You will see all the files that are currently in the repository (mind browser caching).
AWS Account
Towards the end of the semester we will be using AWS to execute our programs on a “real” cloud. Follow these instructions to create an account and get educational credit (only possible if you have not applied for educational credit before).
Eclipse
Notes on how to use Eclipse can be found here. Notes on how to use Eclipse to test MapReduce programs locally are here.
Regex
A reference about regular expressions can be found here.
Please ask any questions related to the course materials and homework problems on Piazza. Other students might have the same questions or be able to provide a quick answer.
Any public postings of (partial or full) solutions to homework problems (written or in form of source or pseudo code) will result in a grade of zero for that particular problem for ALL students in the course.