Lab 6 – Top-N-List Recommendations

In today’s lab and this week’s homework, we will tackle our first real-world big data application: recommending products to users! Before we dive into the implementation, let’s discuss best practices for developing MapReduce (MR) solutions.

Part 1: Practical Development (25min)

Designing, writing, and testing MapReduce programs is a non-trivial implementation and debugging process. To speed up your development time, it is important to carefully follow a defensive and incremental development strategy.

This will become increasingly important as we look into more complex big data applications. Let’s draw a chart to outline this process:

Also check out these slides.
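
As an illustration only (the chart and slides remain the reference), here is a minimal Python sketch of what a "defensive" map function tested on a toy input might look like. The record format `user,item,rating` and the names `parse_rating` and `map_ratings` are assumptions made for this example; they are not taken from the lab materials.

```python
from collections import Counter

def parse_rating(line):
    """Parse a hypothetical 'user,item,rating' record.

    Returns (item, rating) or None if the record is malformed.
    """
    parts = line.strip().split(",")
    if len(parts) != 3:
        return None
    user, item, rating = parts
    if not user or not item:
        return None
    try:
        value = float(rating)
    except ValueError:
        return None
    return item, value

def map_ratings(lines):
    """Map step: emit (item, 1) per valid record and count the bad ones."""
    bad = Counter()
    pairs = []
    for line in lines:
        parsed = parse_rating(line)
        if parsed is None:
            # Defensive: skip and count bad input instead of crashing the job.
            bad["malformed"] += 1
            continue
        pairs.append((parsed[0], 1))
    return pairs, bad

# Incremental: run the map step alone on a tiny input before writing a reducer.
toy_input = ["u1,A,5", "u2,A,3", "garbage", "u3,B,four", "u1,B,4"]
pairs, bad = map_ratings(toy_input)
print(pairs)
print(dict(bad))
```

Counting skipped records, rather than silently dropping them, makes it easier to notice when a parsing assumption is wrong once the job runs on the full data.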

Part 2: Top-N-List Example (20min)

  1. Go over the lab slides to learn how an MR program that recommends the most popular items (Top-N-List/popularity-based recommendation) works.
  2. Now, apply the Top-N-List approach to the given sample data (cf. slide 10) using pen & paper – you will need the answers to do the quiz.

This is essentially step one of the incremental development strategy: when designing your job(s), always “execute” the parts of your program on a toy example using pen and paper. Now, you are prepared to do the quiz and the implementation part.
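
The pen-and-paper execution can also be mirrored in a few lines of plain Python. The sketch below simulates the map, shuffle, and reduce phases in memory and then picks the top N items. It assumes that "popularity" means the number of ratings an item received, and it uses made-up toy data (not the data from slide 10); the function names are hypothetical and the lab slides remain the reference for the actual job design.

```python
import heapq
from collections import defaultdict

def mapper(record):
    """Map: for each (user, item) rating event, emit (item, 1)."""
    _user, item = record
    yield item, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(item, counts):
    """Reduce: sum the counts for one item."""
    return item, sum(counts)

def top_n(item_counts, n):
    """Select the n items with the highest count; ties broken by item id."""
    return heapq.nsmallest(n, item_counts, key=lambda kv: (-kv[1], kv[0]))

# Made-up toy input: (user, item) pairs.
toy = [("u1", "A"), ("u2", "A"), ("u3", "A"),
       ("u1", "B"), ("u2", "B"), ("u3", "C")]

mapped = [pair for record in toy for pair in mapper(record)]
counts = [reducer(item, values) for item, values in shuffle(mapped).items()]
result = top_n(counts, 2)
print(result)  # [('A', 3), ('B', 2)]
```

Comparing such a simulation with your hand-computed answer is a cheap check before running anything on a cluster.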

Part 3: Quiz (15min)

Do the Quiz.

Part 4: Top-N-List of Most Popular Movies (20min + hw7)

Let’s analyze movie rating data and compute a list of the most popular movies!

The data we use is a subset of the training data from the Netflix Prize. The Netflix Prize aimed to substantially improve the accuracy of predictions about how much someone is going to enjoy a movie based on their movie preferences. The competition was run by Netflix, and on September 21, 2009, a $1 million Grand Prize was awarded to the winning team.

Usage Agreement:

By downloading and using the dataset from the Netflix Prize you agree to all of the following:

  • I agree to the terms specified in Netflix Prize Rules (cf. README).
  • I agree to delete this dataset once the project has been completed.
  • I will not redistribute or use this data in any form outside this class.

If you accept this usage agreement, use your WUSTLKEY and WUSTLKEY PASSWORD to download the data from HERE.

Do NOT add this data to your SVN repository at any time!

The format of this dataset is described in the description.txt file included in the zip file you just downloaded.

Download the step by step lab instructions for this part HERE.