CSE 427s Calendar (FL19)

This is an inactive course webpage.

If you are wondering where we are going: here is a Roadmap!

_______

Topic

Materials

27 Aug Syllabus, Course Overview
29 Aug Introduction

  • What is Big Data?
  • Big Data Characteristics
  • Big Data Applications
  • Processing Patterns
  • HTDG 4th ed:
    • Ch1 pp 3-14, Data & Meet Hadoop
    • Ch2 pp 19-22, Weather Example
3 Sept  Lab1 – Distributed Storage

  • Distributed File Systems
  • HDFS
  • HTDG 4th ed
    • Ch3 pp 43-47, HDFS
    • Ch3 pp 51-52, Basic FS Operations
5 Sept Cloud Computing

  • History
  • Basic Concept
  • Main Components

MR1:  MapReduce (MR)

  • Distributed Computing
  • MapReduce Data Flow
  • HTDG 4th ed
    • Ch2 pp 22-24, MapReduce
10 Sept Lab2 – MR2: Running a MR Job

  • MapReduce Processes
  • WordCount Example
  • Java Implementation
  • Job Submission
  • HTDG 4th ed
    • Ch2 pp 24-30, Java MapReduce
    • Ch6 pp 160-168, Running a MR Job
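
For orientation, the WordCount data flow can be sketched in a few lines of plain Python. This is only an illustration of the map, shuffle-and-sort, and reduce steps, not the Java implementation used in the lab; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and no Hadoop API is involved.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle and sort: group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped))

if __name__ == "__main__":
    print(word_count(["the cat", "The hat"]))
```

In a real job the mapper and reducer run on different nodes and the framework performs the shuffle; here all three steps run in one process.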
12 Sept MR3: Job Execution

  • MR Job Execution
  • YARN
  • Data Locality
  • Shuffle and Sort
  • Combiners
  • HTDG 4th ed
    • Ch2 pp 30-37, Scaling Out & Combiner
    • Ch4 pp 79-81, YARN 
    • Ch4 pp 85-86, Scheduling in YARN 
    • Ch7 pp 197-200, Shuffle and Sort
17 Sept Lab3 – MR4: Writing & Testing a MR Program

  • Serialization
  • Driver
  • Map-only Job
  • Reuse Objects
  • Testing Locally
  • Notes on using Eclipse
  • Hadoop 2.6.0 CDH 5.14.0 API
  • HTDG 4th ed
    • Ch2 pp 26-27, Driver
    • Ch5 pp 109-115, Serialization
    • Ch8 pp 220-223,232-236, Input Formats
    • Ch6 p 141, Developing MR Programs
19 Sept MR3: Job Execution cont.

  • Combiners

MR5: Job Configuration

  • ToolRunner
  • Passing Parameters
  • Distributed Cache
  • HTDG 4th ed
    • Ch6 pp 141-148, Configuration API
    • Ch6 pp 148-152, ToolRunner
    • Ch9 pp 273-279, Distributed Cache
24 Sept Lab4 – MR6: Optimizing MR Programs

  • Partitioner
  • Use Case: Global Sort
  • [optional] Log File Analysis
  • HTDG 4th ed
    • Ch9 pp 255-262, Sorting
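
The idea behind the global sort use case can be sketched as follows, assuming a range partitioner driven by pre-chosen split points. The function name `total_order_sort` and the split points are hypothetical, and the sketch is plain Python rather than a Hadoop Partitioner: each key goes to the partition whose range contains it, every partition is sorted on its own, and reading the partitions in order yields a globally sorted result.

```python
from bisect import bisect_right

def total_order_sort(keys, split_points):
    """Range-partition keys by sorted split points, then sort each partition."""
    partitions = [[] for _ in range(len(split_points) + 1)]
    for key in keys:
        # Keys below the first split point land in partition 0, and so on.
        partitions[bisect_right(split_points, key)].append(key)
    # Each "reducer" sorts only its own partition.
    return [sorted(partition) for partition in partitions]

if __name__ == "__main__":
    parts = total_order_sort([5, 1, 9, 3, 7], split_points=[4, 8])
    print(parts)
    # Concatenating the partitions in order gives the global sort.
    print([key for part in parts for key in part])
```

How evenly the work is spread depends entirely on the split points; badly chosen ones send most keys to a single partition.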
26 Sept Application: Recommendation Systems

  • Long Tail
  • Recommendation Tasks

RS1: Collaborative Filtering

  • Utility Matrix
  • Similarity Measures
  • MMDS Ch9
    • Ch9.1: A Model for RS
    • Ch9.3: Collaborative Filtering
    • Ch9.5: Netflix Challenge
  • WIRED article:  The Long Tail
1 Oct MR7: Custom Key/Value Types

  • Writables & WritableComparables

Lab 5 – Use Case: Secondary Sort

  • Custom WritableComparable
  • Custom Partitioner
  • Custom SortComparator
  • Custom GroupComparator
  • HTDG 4th ed
    • Ch5 pp 109-126, Serialization
    • Ch9 pp 262-266, Secondary Sort
  • Data Algorithms: pp 1-12, Secondary Sort
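
As a rough sketch of how the four custom pieces of a secondary sort fit together, the plain-Python example below partitions on the natural key only, sorts on the composite (natural key, value) key, and groups on the natural key again. It assumes in-memory records and a hypothetical function name `secondary_sort`; it does not use the Hadoop classes from the lab.

```python
from itertools import groupby

def secondary_sort(records, num_partitions=2):
    """Return {natural_key: values in sorted order} for (key, value) records."""
    partitions = [[] for _ in range(num_partitions)]
    for natural_key, value in records:
        # Partitioner: use the natural key only, so all values for one key
        # reach the same partition.
        index = hash(natural_key) % num_partitions
        # The composite key is the (natural_key, value) tuple itself.
        partitions[index].append((natural_key, value))

    grouped = {}
    for partition in partitions:
        # Sort comparator: order by the full composite key.
        partition.sort()
        # Grouping comparator: one reduce call per natural key.
        for key, group in groupby(partition, key=lambda record: record[0]):
            grouped[key] = [value for _, value in group]
    return grouped

if __name__ == "__main__":
    data = [("2019", 30), ("2018", 5), ("2019", 10), ("2018", 7)]
    print(secondary_sort(data))
```

The point of the pattern is that the values arrive at the reducer already ordered, so the reducer does not need to buffer and sort them itself.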
3 Oct RS1: Collaborative Filtering contd.

  • MR Program
  • Challenges
  • No Lecture QUIZ
  • slides: see previous lecture
  • MMDS Ch9
    • Ch9.3: Collaborative Filtering
8 Oct MR8: Practical Development

  • Incremental Development
  • Debugging
  • [optional] Unit Testing, Logging

Lab 6 – RS2: Top-N-List Recommendations

  • Top-N-List in MapReduce
10 Oct RS3: Co-occurrence based Recommendation

  • Frequently bought together
  • Customers who bought this item also bought…
  • Communication Cost
  • Pairs and Stripes
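
The difference between the two co-occurrence patterns can be sketched in plain Python, assuming each input record is one basket of items. The function names `pairs` and `stripes` are hypothetical: the first counts one record per ordered item pair, the second builds one map of neighbors per item and merges the maps, which usually means fewer but larger intermediate records.

```python
from collections import Counter, defaultdict
from itertools import permutations

def pairs(baskets):
    """Pairs pattern: count each ordered (item, neighbor) pair separately."""
    counts = Counter()
    for basket in baskets:
        for item, neighbor in permutations(sorted(set(basket)), 2):
            counts[(item, neighbor)] += 1
    return counts

def stripes(baskets):
    """Stripes pattern: one neighbor-count map per item, merged by key."""
    merged = defaultdict(Counter)
    for basket in baskets:
        items = set(basket)
        for item in items:
            stripe = Counter(other for other in items if other != item)
            merged[item].update(stripe)  # element-wise sum of the stripes
    return merged

if __name__ == "__main__":
    baskets = [["milk", "bread"], ["milk", "bread", "eggs"]]
    print(pairs(baskets)[("milk", "bread")])
    print(dict(stripes(baskets)["milk"]))
```

Both functions produce the same counts; they differ only in how many intermediate records they generate, which is where the communication cost comes in.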
15 Oct  FALL BREAK – no lab
17 Oct MR9: MR Workflows & Beyond MR

  • MR Workflows
  • DAGs
  • Database Operations
  • Latency

Database Operations on HDFS

  • Selection, Projection
  • Union, Intersection, Difference
  • Grouping & Aggregation
  • Joins
  • Quick Intro to Hive & Pig
  • slides:  MR Workflows
  • HTDG 4th ed,
    • Ch6 pp 177-179: MR Workflows
  • MMDS Ch2.3.3-8 Relational-Algebra Operations
  • HTDG 4th ed,
    • Ch9 pp 268-273: Joins
    • Ch17 pp 471-484: Hive (Intro, An Example, The Metastore, Comparison with Traditional Databases)
  • [optional] HTDG 4th ed,
    • Ch16 pp 423-431: Pig (Intro, An Example, Comparison with Databases)
22 Oct Lab 7 – Data Management with Hive

  • Hive Syntax, Data Types, and Basic Operations
  • Hive Metastore
  • Creating Tables
  • Querying Tables
  • Complex Field Types
  • Hive Reference Slides: Hive1
  • HTDG 4th ed,
    • Ch9 pp 268-273: Joins
    • Ch17 pp 471-484: Hive (Intro, An Example, The Metastore, Comparison with Traditional Databases)
24 Oct SP1: Spark

  • Introduction
  • RDDs
  • Spark Shell
  • Lazy Execution
  • Pair RDDs
  • MapReduce in Spark
29 Oct Lab 8 – Flume & Spark Shell

  • Data Ingest
    • Sqoop (hw9)
    • Flume (Lab 8)
  • Using the Spark Shell
  • Creating Pair RDDs
  • HTDG 4th ed, Ch14
    • pp 381-384: Intro to Flume
  • HTDG 4th ed, Ch15
    • p 401: Intro to Sqoop
31 Oct SP2: More Spark

  • Writing Spark Programs
  • Spark Job Execution

Application: PageRank

FINAL PROJECT Introduction

  • HTDG 4th ed, Ch19
    • pp 565-570: Jobs & Stages
    • pp 571-574: Spark on YARN
5 Nov SP3 – RDD Persistence

  • RDD Lineage
  • RDD Persistence
  • Checkpointing

Lab 9 PageRank

FINAL PROJECT milestone 1

  • HTDG 4th ed, Ch19
    • pp 560-561: Persistence
7 Nov Application: Text Mining & Sentiment Analysis

  • TF-IDF
  • word co-occurrence
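
As a small worked example of TF-IDF, the sketch below scores every term in every document. It assumes whitespace tokenization, non-empty documents, term frequency normalized by document length, and the unsmoothed weight log(N / df); other variants are common, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return scores

if __name__ == "__main__":
    for doc_scores in tf_idf(["big data", "big cluster"]):
        print(doc_scores)
```

A term that occurs in every document, such as "big" in this input, gets a score of zero, which is the intended effect of the IDF factor.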
12 Nov Lab 10 – PageRank for Real

  • Write the PageRank Spark application
  • Analyze the application
  • Draw the DAG
  • Compute rankings for a subset of the real webgraph

FINAL PROJECT milestone 2

  • slides: cf. lecture Oct 31 and Lab9
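
Before writing the Spark application, the iteration itself can be tried out on a toy graph. The sketch below is plain Python rather than Spark, and it makes several assumptions that are not taken from the lab handout: a damping factor of 0.85, a fixed number of iterations, and a graph in which every page has at least one out-link and every link target is itself a key of `links`. The function name `pagerank` is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Iterative PageRank on a dict {page: [pages it links to]}."""
    pages = sorted(links)
    n_pages = len(pages)
    ranks = {page: 1.0 / n_pages for page in pages}
    for _ in range(iterations):
        # Each page splits its current rank evenly among its out-links.
        contribs = {page: 0.0 for page in pages}
        for page, outlinks in links.items():
            share = ranks[page] / len(outlinks)
            for target in outlinks:
                contribs[target] += share
        # New rank: teleport term plus damped sum of received contributions.
        ranks = {
            page: (1 - damping) / n_pages + damping * contribs[page]
            for page in pages
        }
    return ranks

if __name__ == "__main__":
    toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, rank in sorted(pagerank(toy_graph).items()):
        print(page, round(rank, 4))
```

The two inner steps correspond to the contribution and aggregation stages of the distributed version, so the same toy graph can serve as a check for the Spark output.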
14 Nov Model Parameters, Choices, and Evaluation

  • recommendations
  • prediction/classification
  • clustering

FINAL PROJECT milestone 3

  • MMDS Ch12
    • 12.1.1 Training Sets
    • 12.1.4 Testing (Model Evaluation)
19 Nov EXAM Review and Questions

EXAM Preparation

  • open discussion based on your questions
  • work on previous exam problems in groups
  • We will not present solutions to previous exams.
  • If you have worked on the problems and you have any questions or got stuck, I am happy to help and discuss!
21 Nov ***In-Class EXAM***
26 Nov  Lab 11 – Cloud Execution on EMR

  • Amazon Elastic MapReduce (EMR)
  • Create S3 bucket
  • Launch EC2 instances
  • Launch Hadoop cluster

FINAL PROJECT milestone 4

EMR links for final projects
28 Nov  THANKSGIVING – no class
3 Dec Lab 12 – Work on Project

  • coordinate next steps with your team members
  • start project report template
  • ask us clarifying questions
Exam regrade requests may be made during the lab sessions.
5 Dec Application: Large-scale Classification or

Application: Large-Scale Social Network Analysis

  • MMDS
    • Ch12 Large-scale Machine Learning
    • Ch10 Mining Social Network Graphs
13 Dec FINAL PROJECT due at 6pm – no extension!

(*) indicates more advanced reading for the interested student