Lab 4 – Partitioner & Global Sort

Part 1: Partitioner & Global Sort (35min)

Let’s discuss the Hadoop default Partitioner and how to write your own custom Partitioner. A global sort requires a non-default partitioner, because the default one does not preserve key order across reducers. Let’s explore an example application: sorting all weather recordings by temperature. Here are the slides.
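To make the contrast concrete, here is a minimal standalone sketch in plain Java, with no Hadoop dependency. `hashPartition` uses the same arithmetic as Hadoop's default `HashPartitioner`. `rangePartition` shows one common way to get a global sort: send each temperature range to its own reducer, so the reducer outputs read in partition order are sorted overall. The class name, method names, and split points are made up for illustration, and the slides may take a different approach.

```java
import java.util.Arrays;

// Standalone sketch (no Hadoop dependency) of two partitioning strategies.
class PartitionSketch {

    // Same arithmetic as Hadoop's HashPartitioner: mask off the sign bit,
    // then take the remainder modulo the number of reducers.
    static int hashPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Range partitioning: splitPoints must be sorted ascending. Partition i
    // receives the keys below splitPoints[i] that are not below
    // splitPoints[i - 1]; the last partition receives everything else.
    static int rangePartition(int temperature, int[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, temperature);
        // Found at index pos: the key equals a split point, so it belongs to
        // the partition to the right. Not found: pos encodes the insertion point.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        int[] splitPoints = {0, 20}; // hypothetical boundaries, 3 reducers
        for (int t : new int[] {-5, 0, 12, 20, 31}) {
            System.out.println(t + " -> range partition " + rangePartition(t, splitPoints)
                    + ", hash partition " + hashPartition(t, 3));
        }
    }
}
```

With hash partitioning, neighboring temperatures land on arbitrary reducers, so each output file is sorted only within itself. With range partitioning, every key in partition 0 is smaller than every key in partition 1, and so on.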

Part 2: Quiz (15min)

Today’s quiz is mainly a recap of last lecture, so those slides (Job Configuration) might be helpful. If you have any questions about any of these materials, don’t hesitate to ask us! We are happy to help you understand them.

Part 3: WordCount with Partitioner (30min)

This is essentially Problem 2 on hw4. If you already completed this problem or want to work on it outside of class, skip to Part 4.

Implement a Partitioner for the WordCount MapReduce program that assigns positive, negative, and neutral words to different reducers. The stubs are in the following directory:

 ~/workspace/partitioner/src/

In Eclipse, right-click on the stubs package and click Refresh to see the files.

IMPLEMENTATION INSTRUCTIONS:
  1. Implement a Partitioner using the stub SentimentPartitioner.java. Your Partitioner should send each key-value pair to one of three Reducers based on whether the key appears in the list of positive words (positive-words.txt), in the list of negative words (negative-words.txt), or in neither of them. The files containing the word lists are in your SVN folder. Make sure you ignore all lines in the .txt files that start with a semicolon (;). Use Distributed Cache to access the files in the Partitioner.
  2. Look at the SentimentPartitionerTest.java program. You can use it to test whether your Partitioner works correctly. It contains an example test for the correct assignment of one positive word. Add the three tests specified in the stub file to the program, then run it to test your Partitioner.
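The core decision logic from step 1 can be sketched in plain Java before wiring it into Hadoop. This is only a sketch under stated assumptions: the class and method names are hypothetical, the reducer numbering (0 = positive, 1 = negative, 2 = neutral) is a guess that the stub file may override, and skipping blank lines is an extra assumption. In the real Partitioner the lines would come from the word-list files in the Distributed Cache rather than from in-memory lists.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Standalone sketch of the sentiment routing logic (no Hadoop dependency).
class SentimentSketch {

    // Hypothetical reducer numbering; the stub file may prescribe another.
    static final int POSITIVE = 0;
    static final int NEGATIVE = 1;
    static final int NEUTRAL = 2;

    // Builds a word set from the lines of a word-list file, skipping
    // comment lines that start with ';' and (by assumption) blank lines.
    static Set<String> loadWords(List<String> lines) {
        Set<String> words = new HashSet<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith(";")) {
                continue;
            }
            words.add(trimmed);
        }
        return words;
    }

    // Chooses the reducer for a word: positive list first, then negative,
    // otherwise neutral.
    static int partitionFor(String word, Set<String> positive, Set<String> negative) {
        if (positive.contains(word)) {
            return POSITIVE;
        }
        if (negative.contains(word)) {
            return NEGATIVE;
        }
        return NEUTRAL;
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for the two word-list files.
        Set<String> positive = loadWords(Arrays.asList("; a comment line", "", "love", "great"));
        Set<String> negative = loadWords(Arrays.asList("; a comment line", "hate", "awful"));
        for (String word : new String[] {"love", "hate", "table"}) {
            System.out.println(word + " -> reducer " + partitionFor(word, positive, negative));
        }
    }
}
```

Loading each list into a set once keeps the per-record lookup cheap, which matters because the Partitioner is called for every key-value pair the mappers emit.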

Part 4: Log File Analysis (30min)

Let’s do some log file analysis. Your task is to process a web log file and report, for each month of the year, the IP addresses seen and the number of hits from each address.

To achieve this you will write a Partitioner and use it with 12 Reducers, each of which is responsible for processing the data for a particular month. Reducer 0 processes January hits, Reducer 1 processes February hits, and so on.
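The month-to-reducer mapping described above can be sketched as a simple lookup in plain Java. This assumes the month reaches the Partitioner as a three-letter English abbreviation such as "Jan", which is how it appears in many web server log timestamps; the lab instructions may define the key differently, and the class and method names here are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Standalone sketch of a month-based partition lookup (no Hadoop dependency).
class MonthSketch {

    // Index in this list doubles as the reducer number: Jan -> 0, ..., Dec -> 11.
    static final List<String> MONTHS = Arrays.asList(
            "Jan", "Feb", "Mar", "Apr", "May", "Jun",
            "Jul", "Aug", "Sep", "Oct", "Nov", "Dec");

    // Returns the reducer index for a month abbreviation, or throws if the
    // value is not one of the twelve expected abbreviations.
    static int monthPartition(String month) {
        int index = MONTHS.indexOf(month);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown month: " + month);
        }
        return index;
    }

    public static void main(String[] args) {
        for (String month : new String[] {"Jan", "Feb", "Dec"}) {
            System.out.println(month + " -> reducer " + monthPartition(month));
        }
    }
}
```

Failing loudly on an unknown month is a design choice for the sketch: a malformed log line shows up as an error instead of silently landing in some reducer's output.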

Download the lab instructions HERE.