Part 1: Let’s talk about the Basic API and the Driver (15min)
To implement a MapReduce program, we need to understand the input and output data formats and how data is serialized as it passes between the stages of a job. Let’s look at:
- InputFormat and OutputFormat
- the Java objects that represent keys and values
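As a rough illustration of what "serialization" means for key and value objects: Hadoop's Writable interface asks a type to implement write(DataOutput) and readFields(DataInput). The sketch below mirrors that contract in plain Java so it runs without Hadoop on the classpath. The class WordLengthRecord and its helpers toBytes and fromBytes are hypothetical names made up for this example, and the class deliberately does not implement the real Hadoop interface.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical record type; it only mimics the write/readFields contract.
class WordLengthRecord {
    private String firstChar = "";
    private int length;

    // Overwrite the fields so that one instance can be reused many times.
    void set(String firstChar, int length) {
        this.firstChar = firstChar;
        this.length = length;
    }

    String getFirstChar() { return firstChar; }
    int getLength() { return length; }

    // Serialize the fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeUTF(firstChar);
        out.writeInt(length);
    }

    // Deserialize the fields in the same order, replacing the current state.
    void readFields(DataInput in) throws IOException {
        firstChar = in.readUTF();
        length = in.readInt();
    }

    // Helper (assumption, not a Hadoop call): serialize a record to bytes.
    static byte[] toBytes(WordLengthRecord record) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            record.write(new DataOutputStream(buffer));
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper (assumption, not a Hadoop call): fill an existing record from bytes.
    static void fromBytes(WordLengthRecord target, byte[] bytes) {
        try {
            target.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WordLengthRecord source = new WordLengthRecord();
        WordLengthRecord reused = new WordLengthRecord(); // filled again on each pass
        for (String word : new String[] {"Now", "time"}) {
            source.set(word.substring(0, 1), word.length());
            fromBytes(reused, toBytes(source));
            System.out.println(reused.getFirstChar() + " " + reused.getLength());
        }
    }
}
```

Note that readFields overwrites an existing instance instead of creating a new one, which is why a single object can be refilled for every record.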
The Driver configures and submits a MapReduce Job. We can, for example, specify:
- job name
- input and output paths
- input and output data format
- which Mapper and Reducer to use
- the key and value types they consume and produce
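The settings listed above might map onto Driver code roughly as follows. This is a sketch, not the lab solution: it assumes the newer org.apache.hadoop.mapreduce API and a Hadoop version that provides Job.getInstance, and it needs the Hadoop libraries on the classpath to compile. The class name SketchDriver and the job name are made up, and the base Mapper and Reducer classes (which pass records through unchanged) stand in for the classes you will write in the lab.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Hypothetical driver; names are placeholders for illustration only.
class SketchDriver {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: SketchDriver <input dir> <output dir>");
            System.exit(-1);
        }

        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(SketchDriver.class);
        job.setJobName("Sketch Driver");                      // job name

        FileInputFormat.setInputPaths(job, new Path(args[0])); // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path

        job.setInputFormatClass(TextInputFormat.class);       // input data format
        job.setOutputFormatClass(TextOutputFormat.class);     // output data format

        // Placeholders: the base classes forward every record unchanged.
        job.setMapperClass(Mapper.class);
        job.setReducerClass(Reducer.class);

        // TextInputFormat yields (byte offset, line) pairs, and the
        // pass-through classes keep those types all the way to the output.
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Submit the job and wait; the exit code reflects success or failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

In your own Driver you would replace the two placeholder classes with your Mapper and Reducer and adjust the four key and value classes to match what they emit.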
Part 2: Reading & Quiz (20min)
Skim through the rest of the lab slides. You will learn…
- what a map-only job is and how to create one
- how to reuse objects
- how to test your jobs locally without submitting them to a (pseudo) cluster
Now, you are prepared to do the quiz and the lab! Finish the QUIZ before starting the practical part.
Part 3: Writing and Testing a Java MapReduce Program (40min)
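Write a MapReduce program that reads text input and computes, for each starting character, the average length of all words that begin with that character. Your result should be case-sensitive, so words starting with "N" and words starting with "n" are averaged separately.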
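Before writing the Hadoop classes, it may help to see the intended logic in plain Java. The sketch below is not the lab solution and uses no Hadoop classes: a "map" step emits (first character, word length) for every word, the pairs are grouped by key, and a "reduce" step averages each group. The class name AverageWordLengthSketch, its method names, and the choice to split words on non-word characters are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java walk-through of the map, group and reduce steps.
class AverageWordLengthSketch {

    // "Map" step: for each word, record its length under its first character.
    static void map(String line, Map<String, List<Integer>> grouped) {
        for (String word : line.split("\\W+")) {   // tokenization is an assumption
            if (word.isEmpty()) {
                continue;                          // split can yield an empty first token
            }
            String key = word.substring(0, 1);     // no case folding: "N" and "n" differ
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(word.length());
        }
    }

    // "Reduce" step: average all lengths collected for one key.
    static double reduce(List<Integer> lengths) {
        long sum = 0;
        for (int length : lengths) {
            sum += length;
        }
        return (double) sum / lengths.size();
    }

    // Runs both steps over all lines; the TreeMap plays the role of the sort and group phase.
    static Map<String, Double> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            map(line, grouped);
        }
        Map<String, Double> averages = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            averages.put(entry.getKey(), reduce(entry.getValue()));
        }
        return averages;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Now is definitely not the time", "No news");
        for (Map.Entry<String, Double> entry : run(lines).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

For the two sample lines, "Now" and "No" are averaged under "N" (2.5), while "not" and "news" are averaged under "n" (3.5), which is the case-sensitive behavior the task asks for.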
You will use Eclipse for this lab. Make sure Eclipse is configured correctly. (This was done when you ran the course setup script as instructed in Lab2.)
Note that running your MapReduce programs locally for testing and debugging will save you a lot of development time. The advantage is that the job is not submitted to the (pseudo) cluster but runs directly on the client. If you use Eclipse to do so, you do not even have to compile and jar your classes manually.
Download the step by step lab instructions for this part HERE.
If you are not familiar with Eclipse, consult these notes on how to use Eclipse.