CSE 427s Roadmap

Part I – MapReduce (about 5 weeks)

  1. Big Data and Big Data Analysis
  2. Cloud Computing
    • Distributed Storage
    • HDFS
    • Distributed Computation
    • MapReduce Programming Paradigm
  3. Hadoop MapReduce
    • Running a MapReduce Job
    • Job Execution
    • Writing MapReduce Programs
      • Implementing Mappers, Reducers, and Drivers in Java
      • Reusing Objects and Map-only Jobs
    • Job Configuration
  4. MapReducing Algorithms
    • Sorting and Searching
    • Inverted Index
    • Secondary Sort

Part II – Big Data Analysis and Applications (about 8 weeks)

  1. Recommendation Engines
    • Introduction (Top-N list, Frequently Bought Together)
    • Content-based Recommendation
    • Collaborative Filtering
    • Final Project Topic: Netflix Recommendation Challenge
  2. [Data Analysis with Pig]
    • Basics: Pig Latin, Loading Data, Data Types
    • Example Data Analysis Task: ETL
    • Multi-Dataset Operations
  3. Data Analysis with Hive
    • Introduction: Hive vs Traditional Databases
    • HiveQL Syntax and Built-in Functions
    • Hive Data Management
    • Final Project Topic: Text Processing with Hive
  4. Data Analysis with Spark
    • RDDs
    • Interactive Spark Shell and PySpark
    • Example Application: ETL
    • Spark Applications – Python Programs using Spark
    • Example Iterative Algorithm: PageRank
    • Final Project Topic: Geolocation Clustering with Spark
  5. Applications
  6. [Introduction to Impala]
    • Interactive Data Analysis with Impala
  7. Discussion: [Impala vs] Hive vs [Pig vs] Spark vs MapReduce vs RDBMS

Note: Topics in [ ] are optional

More Big Data Applications (which we might touch upon)

  • Large-scale Machine Learning
    • Classification (k-NN in MapReduce, Perceptron in Spark)
    • Final Project Topic: Clustering (k-means in MapReduce and Spark)
    • Naive Bayes and Linear Regression in MapReduce
  • Finding Similar Items
    • Document Retrieval
    • Big Input/Feature Spaces
    • Locality Sensitive Hashing
  • Social Network Analysis
    • Social Networks as Graphs
    • Clustering/Community Detection/Partitioning
    • Finding Triangles using MapReduce
    • Beyond Social Networks: Introduction to Graph-based Machine Learning