Part I – MapReduce (about 5 weeks)
- Big Data and Big Data Analysis
- Cloud Computing
- Distributed Storage
- HDFS
- Distributed Computation
- MapReduce Programming Paradigm
- Hadoop MapReduce
- Running a MapReduce Job
- Job Execution
- Writing MapReduce Programs
- Implementing Mappers, Reducers, and Drivers in Java
- Reusing Objects and Map-only Jobs
- Job Configuration
- MapReduce Algorithms
- Sorting and Searching
- Inverted Index
- Secondary Sort
Part II – Big Data Analysis and Applications (about 8 weeks)
- Recommendation Engines
- Introduction (Top-N list, Frequently Bought Together)
- Content-based Recommendation
- Collaborative Filtering
- Final Project Topic: Netflix Recommendation Challenge
- [Data Analysis with Pig]
- Basics: Pig Latin, Loading Data, Data Types
- Example Data Analysis Task: ETL
- Multi-Dataset Operations
- Data Analysis with Hive
- Introduction: Hive vs Traditional Databases
- HiveQL Syntax and Built-in Functions
- Hive Data Management
- Final Project Topic: Text Processing with Hive
- Data Analysis with Spark
- RDDs
- Interactive Spark Shell and PySpark
- Example Application: ETL
- Spark Applications – Python Programs using Spark
- Example Iterative Algorithm: PageRank
- Final Project Topic: Geolocation Clustering with Spark
- Applications
- Link Analysis: PageRank
- Text Mining: TF-IDF, Word Co-occurrence, and N-grams
- [Introduction to Impala]
- Interactive Data Analysis with Impala
- Discussion: [Impala vs] Hive vs [Pig vs] Spark vs MapReduce vs RDBMS
Note: topics in [ ] are optional
More Big Data Applications (topics we might touch upon)
- Large-scale Machine Learning
- Classification (k-NN in MapReduce, Perceptron in Spark)
- Final Project Topic: Clustering (k-means in MapReduce, Spark)
- Naive Bayes and Linear Regression in MapReduce
- Principal Component Analysis
- Matrix Multiplication via MapReduce
- PCA and SVD
- Finding Similar Items
- Document Retrieval
- Big Input/Feature Spaces
- Locality-Sensitive Hashing
- Social Network Analysis
- Social Networks as Graphs
- Clustering/Community Detection/Partitioning
- Finding Triangles using MapReduce
- Beyond Social Networks: Introduction to Graph-based Machine Learning