**This is an inactive course webpage**

Overview

The growth in available data is a challenge for many companies, but it is also an opportunity for those that can master the vast and varied data available to them. This growth includes traditional structured data as well as unstructured data created by both people and machines. It is essential for analysts to be comfortable with the new technologies and tools being developed to store, retrieve, analyze, and report on these data resources. This course introduces students to the technologies currently deployed to overcome the challenges of Big Data.

Today’s data environment benefits greatly from reduced costs of collecting, storing, and processing data. This enables companies to maintain ever-increasing data repositories. Initially, structured data was stored focusing on transactions and interactions with customers, suppliers, financial services, transportation services, and extended supply chain partners. More recently, social media data was added to the available data repository. This data is often inherently unstructured and multi-media, including Facebook or Twitter posts, user reviews, blogs, pictures, and videos. Add information from Internet of Things sensors on vehicles, machines, fixed-points, inventory, and medical devices, or pictures and videos from internal cameras, and we find massive volume and variety in data available to today’s analysts.

Hadoop and related technologies supported by the Apache Foundation are the current standard for storing vast amounts of heterogeneous data across commodity servers. This course introduces students to current projects supported by the Apache Foundation, including Hadoop, YARN, MapReduce, Sqoop, Hive, Pig, and Spark. Each of these plays a unique role in developing clusters of commodity servers, managing vast amounts of structured and unstructured data, parallel processing, organizing data for analysis, and developing queries for reports. Through hands-on examples using relevant data, students develop competencies in these technologies and come to understand the challenges and opportunities of Big Data.

Prerequisites: MGT560G

Lectures and Labs

  • Lectures will be held on FRI (Oct 19, Oct 26, Nov 2) 8:30-11:30am in Emerson Auditorium in Knight Hall (Room 110).
  • Labs will be held in two sections:
    • Section 1 meets at
      • 1-4pm FRI on Oct 19, Oct 26, and Nov 2 in Eads 016
    • Section 2 meets at
      • 1-4pm on SAT Oct 20 in Bauer 330
      • 9am-12pm on SAT Oct 27 in Bauer 330
      • 9am-12pm on SAT Nov 3 in Bauer 210

Instructor: Marion Neumann
Office: Jolley Hall Room 222
Contact: Please use Piazza!
Office Hours: TUE 11:30am-12:30pm and 3-4pm

TAs: Jonathan (Head TA), Eric, Erik, Lan, Jonny, Steven, Yachen, Zac
TA Office Hours:
  • WED 12-2pm (Erik, Jonathan) in Urbauer 215
  • THU 2-4pm (Lan) in Jolley 224
  • SUN 1-3pm (Eric, Jonny) in Jolley 408

Syllabus

Grading

There are several components in the grade:

  • lecture attendance and participation: 10%
  • 3 assignments submitted in groups of two via Gradescope (20% each): 60%
  • 3 lab quizzes* submitted and graded via Socrative (10% each): 30%

* lab quizzes may be broken down into smaller components aligning with the covered materials. All lab quizzes given in one of the three sessions will count as 10%.

Lecture Attendance and Participation

Participation: Active participation is essential to the success of this course. Quizzes will be given in the lectures to encourage your participation and independent thinking. This will enhance your learning process and facilitate discussions with your peers as well as among the entire class. We will use Socrative (http://www.socrative.com) to distribute and record quizzes. You will need to bring a Wi-Fi-enabled device (laptop, tablet, smartphone, …). To perform well on this metric, your submitted answers and comments should be clear and meaningful (i.e., not simply reiterating obvious facts or submitting blank or nonsense answers). The participation score will count for 10% of your total course performance.

Please bring your name card and laptop to each class.

Short Assignments (in groups of 2 students)

There are three hands-on assignments, focusing on implementing the applications covered in the course. Assignments will cover the following applications: Sqoop, Hive, Pig, and Spark. Each assignment will count for 20% of your total course performance.

Lab Quizzes

In each lab session, we will give a quiz that needs to be completed within the given time period. Lab quizzes will be graded, and the score will be based on the number of correct answers. We will use Socrative (http://www.socrative.com) to distribute and record quizzes. Each lab quiz will count for 10% of your total course performance.

Please bring your name card and laptop to each lab.

Grading Summary

Final course grades will be assigned using the following straight scale:

Letter Grade    Cutoff Percentage
A               >= 93%
A-              >= 90%
B+              >= 87%
B               >= 83%
B-              >= 80%
C+              >= 77%
C               >= 73%
C-              >= 70%
D+              >= 67%
D               >= 63%
D-              >= 60%
F               < 60%
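
The component weights and the straight scale above can be combined into a single calculation. The following is a minimal Python sketch, not an official grade calculator: the function names `course_percentage` and `letter_grade` are hypothetical, component scores are assumed to be on a 0-100 scale, and no rounding is applied because the syllabus does not specify any.

```python
# Weights taken from the grading components listed in this syllabus.
WEIGHTS = {"participation": 0.10, "assignments": 0.60, "lab_quizzes": 0.30}

# Straight scale from the table above: (letter, minimum percentage),
# ordered from the highest cutoff to the lowest.
CUTOFFS = [
    ("A", 93), ("A-", 90), ("B+", 87), ("B", 83), ("B-", 80),
    ("C+", 77), ("C", 73), ("C-", 70), ("D+", 67), ("D", 63), ("D-", 60),
]


def course_percentage(participation, assignments, lab_quizzes):
    """Combine component scores (each 0-100) into a weighted course total.

    `assignments` and `lab_quizzes` are lists of scores. Every assignment
    carries the same weight, as does every lab quiz, so averaging within
    each component reproduces the stated 20% and 10% per item.
    """
    return (
        WEIGHTS["participation"] * participation
        + WEIGHTS["assignments"] * sum(assignments) / len(assignments)
        + WEIGHTS["lab_quizzes"] * sum(lab_quizzes) / len(lab_quizzes)
    )


def letter_grade(percentage):
    """Return the first letter whose cutoff the percentage meets, else F."""
    for letter, cutoff in CUTOFFS:
        if percentage >= cutoff:
            return letter
    return "F"


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    total = course_percentage(100, [90, 85, 95], [80, 90, 100])
    print(total, letter_grade(total))
```

Keeping the cutoffs in a list ordered from highest to lowest lets the lookup stop at the first match, which mirrors how the table is read.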

Late Policy

Your homework assignments must be turned in on time. We cannot accept any late submissions or submissions that do not follow the instructions on the assignment. It is your responsibility to follow the submission instructions exactly.

Collaboration Policy

You are encouraged to discuss the course materials with other students. Discussing the material and the general form of solutions to the labs is a key part of the class. Since, for many of the assignments, there is no single “right” answer, talking to other students and to the TAs is a good thing. However, everything that you turn in should be your own work, unless we tell you otherwise. If you talk about assignment solutions with another student, then you need to explicitly tell us on the hand-in. You are not allowed to copy answers/code or parts of answers/code from anyone else, or from material you find on the internet. This will be considered willful cheating and will be dealt with according to the official collaboration policy stated below.

Academic Dishonesty

Unless explicitly instructed otherwise, everything that you turn in for this course must be your own work. If you willfully misrepresent someone else’s work as your own, you are guilty of cheating. Cheating, in any form, will not be tolerated in this class.

There is zero tolerance of Academic Dishonesty. I will be actively searching for academic dishonesty on all homework assignments, quizzes, and exams. If you are guilty of cheating on any assignment or exam, you will receive an F (failed) in the course. If you copy from anyone that is currently taking this class or has previously taken this or other classes at Washington University covering the same topics, both parties will be penalized, regardless of which direction the information flowed. This is your only warning.

Code Of Conduct

The purpose of Olin’s Code of Conduct is to clarify expectations about academic and professional behavior. The Code is meant to encourage and clarify appropriate academic, classroom, interpersonal, and extra-curricular etiquette that is expected of each individual by their peers, the faculty and the institution. It is also intended to help describe the overall environment of excellence and professionalism that members of the Olin community seek to establish and to continually enhance. It is the responsibility of each member of the Olin community to uphold the spirit, as well as the principles, of the Code.

Please refer to the publication Integrity Matters: Olin Business School Code of Conduct for specific responsibilities, guidelines and procedures regarding academic integrity.

Olin’s Code of Conduct as it relates to Academic Matters

The following is a summary of the Code as it applies to Academic matters:

Student Academic Violations. It is dishonest and a violation of student academic integrity if you:

  1. Plagiarize – You commit plagiarism by taking someone else’s ideas, words or other types of product and presenting them as your own. You can avoid plagiarism by using proper methods of documentation and acknowledgement.
  2. Cheat on an examination – You must not receive or provide any unauthorized assistance on an examination. During an examination you may use only material authorized by the faculty.
  3. Copy or collaborate on assignments without permission – It is dishonest to collaborate with others when completing graded assignments or tests, performing laboratory experiments, writing and/or documenting computer programs, writing papers or reports and completing problem sets (unless expressly discussed in class).
  4. Fabricate or falsify data or records – It is dishonest to fabricate or falsify data in laboratory experiments, research papers, reports or other circumstances; fabricate source material in a bibliography or “works cited” list; or provide false information on a resume or other document in connection with academic efforts. It is also dishonest to take data developed by someone else and present them as your own.
  5. Engage in other forms of deceit or dishonesty that violate the spirit of the Code.

If you have any questions regarding the definition of allowable behavior, it is your responsibility to ask for clarification prior to engaging in the collaboration.

Olin’s Code of Conduct as it relates to Professional Behavior

Expectations – Professional Standards of Conduct: Olin students are expected to conduct themselves at all times in a professional manner. Professional behavior includes, but is not limited to, the following:

  • Attendance: Students are expected to attend each class session. Students who must miss a session for any reason should make every effort to notify the instructor prior to the class meeting. Students should never register for courses scheduled in conflict with one another.
  • Punctuality: Students are expected to arrive and be seated prior to the start of each class session. They should display their name cards in all classes at all times.
  • Behavior: Classroom interaction will be conducted in a spirited manner but always while displaying professional courtesy and personal respect.
  • Preparation: Students are expected to complete the readings, case preparations and other assignments prior to each class session and be prepared to actively participate in class discussion.
  • Distractions:
    • Exiting and Entering: Students are expected to remain in the classroom for the duration of the class session unless an urgent need arises or prior arrangements have been made with the professor.
    • Laptop, PDA, and Other Electronic Device Usage: Students are expected to not use laptops, PDAs, and other electronic devices in classrooms unless with the instructor’s consent and for activities directly related to the class session. Accessing email or the Internet during class is not permitted as they can be distracting for peers and faculty.
    • Cellular Phone and Pager Usage: Students are expected to keep their mobile phones and pagers turned off or have them set on silent/vibrate during class. Answering phones or pagers while class is in session is not permitted.
    • Other distractions: Those identified by individual instructors, such as eating in the classroom.

Course Outline and Materials (Lectures, Labs, Assignments, and Reading)

Module 1 – Distributed Computing

LECTURE 1
  • Defining Big Data and Data Products
  • Big Data Workflow and ETL
  • What is Cloud Computing?
  • Meet Hadoop
  • Distributed Storage
  • Distributed Processing
    • MapReduce Programming Paradigm
    • Resource Management and Job Execution
  • slides (Module 1)
LAB SESSION 1
HOMEWORK 1
  • Using Pig for ETL
  • Data Analysis with Pig
  • Assignment hw1: due FRI Oct 26 at 8:30am
READING 1
  • Data Analytics with Hadoop
    • Chapter 1: The Age of the Data Product
    • Chapter 2: An Operating System for Big Data
    • Chapter 8: Analytics with Higher-Level APIs (Pig)
    • Chapter 10: Data Lakes, Data Ingestion
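
As a preview of the MapReduce programming paradigm listed under Lecture 1, the sketch below counts words with separate map, shuffle, and reduce steps. It is an illustration only: it runs in a single Python process rather than on a Hadoop cluster, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big clusters"]))
```

On a real cluster, the mappers and reducers would run in parallel on different machines and the framework would perform the grouping step; here all three steps are plain function calls so the data flow is easy to follow.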

——————————————————-

Module 2 – Tools and Workflows

LECTURE 2
  • Relational Operations with Pig
  • Data Mining with Hive (HIVE REFERENCE)
    • Hive Meta-Store
    • Hive Query Language
    • Intro to HBase
  • Text Processing & Text Mining  (HIVE TEXT REFERENCE)
  • slides (Module 2)
LAB SESSION 2
HOMEWORK 2
READING 2
  • Data Analytics with Hadoop
    • Chapter 6: Data Mining and Warehousing (Hive)
    • Chapter 7: Data Ingestion

——————————————————-

Module 3 – Beyond Batch Processing

LECTURE 3
  • Beyond Batch Processing
    • Interactive Data Analysis
    • Iterative Data Processing
    • Stream Processing
  • Introduction to Spark
  • Application: Clustering
  • Conclusion: Data Product Lifecycle
  • slides (Module 3)
LAB SESSION 3
HOMEWORK 3
READING 3
  • Data Analytics with Hadoop
    • Chapter 7: Data Ingestion – Flume
    • Chapter 4: In-Memory Computing with Spark
    • Chapter 9: Clustering
    • Chapter 10: Data Product Lifecycle

Resources and HowTos

Course Book

  • Data Analytics with Hadoop – An Introduction for Data Scientists by Benjamin Bengfort, Jenny Kim

Additional Books

  • Mining of Massive Data Sets by Jure Leskovec, Anand Rajaraman, Jeff Ullman (available for free online http://mmds.org)
  • [optional] Hadoop: The Definitive Guide (4th edition) by Tom White

Cloudera Course VM

We will use a pre-configured virtual machine that runs Hadoop in the course.

  1. Download and install a virtualization program to run the virtual machine.
    • VirtualBox (recommended for all platforms: Mac, Linux, Windows)
    • VMware (possible for Windows OS)
  2. Download the VM matching your virtualization software from HERE.
    • System requirements: To run the VM on your laptop, you need at least 4GB of RAM, which is the minimum recommended memory as indicated here. If your laptop/computer does not meet these requirements, please contact me asap.
  3. Set up the VM.
    • Here is a tutorial for VirtualBox.
    • Here is a tutorial for VMware.
  4. Do Lab0 (requires a working VM).
  5. Check out these VM trouble-shooting tips whenever you run into issues with your VM!
  6. Working with the VM and optional set up.
    • The username for the CentOS operating system running in the VM is cloudera and the password (if you should need it) is cloudera as well.
    • Attention Windows users: make friends with the Linux terminal and consider this cheat sheet for useful shell commands.
    • Optional: to install software on CentOS running in your VM use the terminal application yum
      • e.g., sudo yum install htop (if you want to install htop)
      • e.g., sudo yum install subversion (if you want to install svn)
      • here is a tutorial on yum
    • Optional but useful: set up a shared folder with your host machine: here are the instructions.

AWS Account

We will be using AWS to execute our programs on a “real” cloud. Follow these instructions to create an account and get educational credit (only possible if you have not applied for educational credit before).

Gradescope

We will use Gradescope for all homework grading. Find a tutorial on submitting a PDF to Gradescope HERE. To sign up use entry code TBA.

Regex

A reference about regular expressions can be found here.

Please ask any questions related to the course materials and homework problems on Piazza. I cannot promise to monitor Piazza 24/7, but other students might have the same questions and/or are able to provide a quick answer.
Any public postings of (partial or full) solutions to homework problems (written or in form of source or pseudo code) will result in a grade of zero for that particular problem for ALL students in the course.