Lab 5 – Secondary Sort

In this lab we will investigate an interesting use case of MapReduce: Secondary Sort. Secondary sort comes in handy if you want to group your records by one feature and sort by a different feature.

In order to use Hadoop MapReduce to perform a secondary sort, the developer needs to

  • influence the grouping and sorting mechanism using custom SortComparators and custom GroupComparators
  • implement and use a custom WritableComparable as key

We will also see a couple of example applications for a secondary sort.

Part 1: Reading (10min)

Go over the lab slides. (Friendly reminder: all lab contents are course material and hence exam relevant.) For this lab we recommend to do the practical part first, since it will help to answer the quiz questions.

Part 2: Secondary Sort for Family Record Analysis (45min)

We are given a data set of family records with full names (last name and first name) and birth date and our task is to find the youngest person with each last name. This could be useful for getting random representative samples for a phone survey. Today’s practical part will be largely problem 1 and problem 2 on this week’s homework (hw6).

Download the lab instructions HERE.

A different application of the same kind is Customer Data Analysis, where you want to find for instance the youngest/younger customers who bought a certain category of products for targeted marketing!

Part 3: Other Applications (10min)

WEATHER Data ANALYSIS

For this example we are given temperature data from a large number of weather stations. And we want to sort the temperature in increasing order for each month in each year. The input format looks like this:

year, month, day, temperature

2015, 01, 01, 5
2015, 01, 01, 35
2015, 01, 01, 16
2015, 01, 01, 11

2001, 08, 20, 50
2001, 08, 21, 52

The output should then look like this:

2015-01: 5, 11, 16, 35, …

2001-08: 50, 52, …

A more detailed description of the problem and the algorithm can be found HERE (pages 1-12).

If you are interested, you can find a MapReduce implementation HERE.

A different application of the same problem is that you are given measurements that are the outcome of a large scientific experiment (e.g., a traffic simulation or wind channel experiment) and you want so sort the values of a target variable grouped by sensor or any other parameter.

Text Mining

A very prominent task in text mining applications is computing the word co-occurrence of pairs of words in a large collection of documents. These co-occurrence counts can be converted into relative frequencies using “order inversion”. Relative frequencies qualify better for tasks like document classificationinformation retrieval, search, or plagiarism detection as they are normalized co-occurrence counts.

Relative frequencies of word co-occurrences can be computed by partitioning and grouping a composite key similar as we do in secondary sort.

NOTE: here we are not really interested in a sorted list.

A more detailed description of the problem and the algorithm computing relative frequencies using secondary sort can be found HERE (pages 57-62).

Part 4: Quiz (15min)

Complete the QUIZ on Gradescope.