MGT 560M Outline and Materials (FL18)

This is an inactive course webpage.

Module 1 – Distributed Computing

Lecture 1
  • Defining Big Data and Data Products
  • Big Data Workflow and ETL
  • What is Cloud Computing?
  • Meet Hadoop
  • Distributed Storage
  • Distributed Processing
    • MapReduce Programming Paradigm
    • Resource Management and Job Execution
  • slides (Module 1)
Lab Session 1
Homework 1
  • Using Pig for ETL
  • Data Analysis with Pig
  • Assignment hw1: due FRI Oct 26 at 8:30am
Reading 1
  • Data Analytics with Hadoop
    • Chapter 1: The Age of the Data Product
    • Chapter 2: An Operating System for Big Data
    • Chapter 8: Analytics with Higher-Level APIs (Pig)
    • Chapter 10: Data Lakes, Data Ingestion

——————————————————-

Module 2 – Tools and Workflows

Lecture 2
  • Relational Operations with Pig
  • Data Mining with Hive (HIVE REFERENCE)
    • Hive Meta-Store
    • Hive Query Language
    • Intro to HBase
  • Text Processing & Text Mining (HIVE TEXT REFERENCE)
  • slides (Module 2)
Lab Session 2
Homework 2
Reading 2
  • Data Analytics with Hadoop
    • Chapter 6: Data Mining and Warehousing (Hive)
    • Chapter 7: Data Ingestion

——————————————————-

Module 3 – Beyond Batch Processing

Lecture 3
  • Beyond Batch Processing
    • Interactive Data Analysis
    • Iterative Data Processing
    • Stream Processing
  • Introduction to Spark
  • Application: Clustering
  • Conclusion: Data Product Lifecycle
  • slides (Module 3)
Lab Session 3
Homework 3
Reading 3
  • Data Analytics with Hadoop
    • Chapter 7: Data Ingestion – Flume
    • Chapter 4: In-Memory Computing with Spark
    • Chapter 9: Clustering
    • Chapter 10: Data Product Lifecycle