This is an inactive course webpage.
Module 1 – Distributed Computing
Lecture 1
- Defining Big Data and Data Products
- Big Data Workflow and ETL
- What is Cloud Computing?
- Meet Hadoop
- Distributed Storage
- Distributed Processing
- MapReduce Programming Paradigm
- Resource Management and Job Execution
- slides (Module 1)
Lab Session 1
Homework 1
- Using Pig for ETL
- Data Analysis with Pig
- Assignment hw1: due FRI Oct 26 at 8:30am
Reading 1
- Data Analytics with Hadoop
- Chapter 1: The Age of the Data Product
- Chapter 2: An Operating System for Big Data
- Chapter 8: Analytics with Higher-Level APIs (Pig)
- Chapter 10: Data Lakes, Data Ingestion
——————————————————-
Module 2 – Tools and Workflows
Lecture 2
- Relational Operations with Pig
- Data Mining with Hive (HIVE REFERENCE)
- Hive Meta-Store
- Hive Query Language
- Intro to HBase
- Text Processing & Text Mining (HIVE TEXT REFERENCE)
- slides (Module 2)
Lab Session 2
Homework 2
- Sentiment Analysis with Hive
- Assignment hw2: due FRI Nov 2 at 8:30am
Reading 2
- Data Analytics with Hadoop
- Chapter 6: Data Mining and Warehousing (Hive)
- Chapter 7: Data Ingestion
——————————————————-
Module 3 – Beyond Batch Processing
Lecture 3
- Beyond Batch Processing
- Interactive Data Analysis
- Iterative Data Processing
- Stream Processing
- Introduction to Spark
- Application: Clustering
- Conclusion: Data Product Lifecycle
- slides (Module 3)
Lab Session 3
- Lab 6: Data Ingest with Flume
- Lab 7: Exploring RDDs using the Spark Shell
- Lab 8: Cloud Execution for Real: Using Amazon EMR
Homework 3
- Writing a Spark Application in Python
- Perform Geolocation Clustering on EMR
- Assignment hw3: due FRI Nov 9 at 8:30am
Reading 3
- Data Analytics with Hadoop
- Chapter 7: Data Ingestion – Flume
- Chapter 4: In-Memory Computing with Spark
- Chapter 9: Clustering
- Chapter 10: Data Product Lifecycle