Lab 8 – Flume & Spark Shell


  1. Download the following file inside your VM:
  2. Copy the file to ~/training_materials/dev1/scripts/ overwriting the existing version.
  3. Run the new spark setup script using the following command in the terminal:
    • ~/training_materials/dev1/scripts/
  4. Restart the VM (it must be fully powered off and then started again!)

If you get any errors, let us know. Without a successful setup, you will most likely not be able to run the Spark-related parts smoothly.

Part I: Data Ingest with Flume (20min)

Here is a basic introduction to data ingest using Flume and Sqoop (you will use Sqoop in this week’s homework for some more data ingest).

Now, let’s use Flume to prepare the data for this week’s homework. Make sure you understand the following concepts (exam relevant):

  • what Flume does in the big data management and processing pipeline
  • how to configure, use, and start a Flume agent
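
To make the second point concrete, here is a minimal sketch of what a Flume agent configuration looks like: one source, one channel, and one sink wired together in a properties file. The agent name (agent1), the component names (src1, ch1, sink1), and both paths are placeholders chosen for illustration, not the values used in the lab instructions. The helper function only renders the text so the structure is easy to see.

```python
def render_agent_config(agent, spool_dir, hdfs_path):
    """Return the text of a Flume properties file with one source, channel, and sink."""
    lines = [
        # Declare the components that belong to this agent.
        f"{agent}.sources = src1",
        f"{agent}.channels = ch1",
        f"{agent}.sinks = sink1",
        "",
        "# Source: watch a local directory for newly arrived files",
        f"{agent}.sources.src1.type = spooldir",
        f"{agent}.sources.src1.spoolDir = {spool_dir}",
        f"{agent}.sources.src1.channels = ch1",
        "",
        "# Channel: buffer events in memory between source and sink",
        f"{agent}.channels.ch1.type = memory",
        f"{agent}.channels.ch1.capacity = 10000",
        f"{agent}.channels.ch1.transactionCapacity = 1000",
        "",
        "# Sink: write events to HDFS as plain text",
        f"{agent}.sinks.sink1.type = hdfs",
        f"{agent}.sinks.sink1.hdfs.path = {hdfs_path}",
        f"{agent}.sinks.sink1.hdfs.fileType = DataStream",
        f"{agent}.sinks.sink1.channel = ch1",
    ]
    return "\n".join(lines) + "\n"


# Placeholder values, NOT the paths from the lab instructions.
config_text = render_agent_config("agent1", "/tmp/spooldir_example", "/tmp/flume_example")
print(config_text)
```

Note that a source lists its channels (plural), while a sink reads from exactly one channel (singular). Once the text is saved to a file, an agent is typically started with something like `flume-ng agent --conf-file <your file> --name agent1`, where the name must match the prefix used in the file.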

Since we don’t have streaming data arriving out of nowhere, we will simulate data arriving from a log file server and then use a Flume agent to put this data into HDFS. Find the step-by-step lab instructions HERE.
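
The idea of simulating arriving data can be sketched in a few lines of Python: files are delivered one at a time into a directory that a Flume agent could watch. This is only an illustration under assumptions. The function name drip_files, the staging directory, and the sample file names are made up, and the lab instructions may use a different mechanism. Each file is copied to a staging directory first and then moved into place, so a watching agent never sees a half-written file.

```python
import shutil
import tempfile
import time
from pathlib import Path


def drip_files(source_dir, spool_dir, staging_dir, delay_seconds=1.0):
    """Deliver files from source_dir into spool_dir one at a time, in name order."""
    source, spool, staging = Path(source_dir), Path(spool_dir), Path(staging_dir)
    spool.mkdir(parents=True, exist_ok=True)
    staging.mkdir(parents=True, exist_ok=True)
    delivered = []
    for log_file in sorted(p for p in source.iterdir() if p.is_file()):
        staged = staging / log_file.name
        shutil.copy(log_file, staged)                          # slow part happens outside spool_dir
        shutil.move(str(staged), str(spool / log_file.name))   # file appears complete in spool_dir
        delivered.append(log_file.name)
        time.sleep(delay_seconds)                              # pause to mimic data trickling in
    return delivered


if __name__ == "__main__":
    # Small self-contained demo with made-up files in a temporary directory.
    with tempfile.TemporaryDirectory() as root:
        src = Path(root) / "incoming"
        src.mkdir()
        (src / "day1.log").write_text("first batch\n")
        (src / "day2.log").write_text("second batch\n")
        names = drip_files(src, Path(root) / "spool", Path(root) / "staging", delay_seconds=0.0)
        print(names)
```

The staging directory should sit on the same filesystem as the spool directory, so that the move is a quick rename rather than another copy.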

In this week’s homework, you will use the Spark shell to transform the log file data ingested in Part I, so make sure you successfully complete it.

Part II: Using the Spark Shell (40min)

  • Load and view data
  • Transform data (and appreciate Spark’s low latency)
  • Create pair RDDs (needed for quiz problems #7 and #8)
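
As a rough preview of the last point, a pair RDD is simply an RDD whose elements are (key, value) tuples. The sketch below shows the mapping step in plain Python so it runs without a cluster. The sample lines and the choice of the first field as the key are assumptions for illustration and do not reflect the actual format of the lab data. The commented lines at the end show how the same function would be used in the pyspark shell, where sc is predefined.

```python
def to_pair(line):
    """Map one log line to a (key, 1) pair, keyed on the first space-separated field."""
    fields = line.split(" ")
    return (fields[0], 1)


def count_by_key(pairs):
    """Local stand-in for what reduceByKey(lambda a, b: a + b) computes on a pair RDD."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals


# Made-up sample lines, NOT the real lab data.
sample_lines = [
    "10.0.0.1 GET /index.html",
    "10.0.0.2 GET /about.html",
    "10.0.0.1 GET /contact.html",
]

pairs = [to_pair(line) for line in sample_lines]
print(pairs[0])             # ('10.0.0.1', 1)
print(count_by_key(pairs))  # {'10.0.0.1': 2, '10.0.0.2': 1}

# In the pyspark shell the same idea looks like this (the path is hypothetical):
#   logs = sc.textFile("/path/to/weblogs")
#   pairs = logs.map(to_pair)
#   counts = pairs.reduceByKey(lambda a, b: a + b)
#   counts.take(5)
```

Transformations such as map and reduceByKey are lazy, so nothing is computed until an action like take is called.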

Note: it is sufficient to choose one language, Python or Scala, for all labs and homework problems. If you aren’t sure which one to use, explore both and make up your mind after this part.

Find the step-by-step lab instructions HERE.

Part III: Quiz (15min)

Quiz problems #1-#6 are general Flume and Spark questions. For #7 and #8 you will need to complete the lab part on how to create pair RDDs above. #8 is graded as a bonus problem.