Part I: Create A Spark Application (20min)

To be able to analyze our Spark implementation and eventually deploy it on the (pseudo) cluster, let’s first create a Python module / Spark application. You can use the WordCount.py or WordCount.scala provided in ~/training_materials/dev1/examples as a template. Rename your module PageRank.py or PageRank.scala. Now let’s modify the program so that we can deploy it on the cluster.

STEP 1: COMMAND-LINE ARGUMENTS
  • Instead of hardcoding the path and the number of PageRank iterations, we want to provide those on the command line. Change the command-line argument check to expect 3 arguments in addition to the program name itself:
    • number of iterations
    • input path
    • output path
  • Hint: len(sys.argv) should be 4
  • Now read the variable n from the command line. Hint: be aware that n is an integer, so convert it. Also load the data from the input path provided on the command line instead of a hardcoded path. A minimal sketch of the resulting argument handling follows this list.
  • Test your modifications by executing the Spark application (in local mode) on the toy data from lab9 (pagelinks.txt).
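
A minimal sketch of what the argument handling could look like in PageRank.py, assuming a structure similar to the WordCount.py template (the variable names n, input_path, and output_path are just placeholders):

    import sys
    from pyspark import SparkContext

    if __name__ == "__main__":
        # Program name plus three arguments: iterations, input path, output path.
        if len(sys.argv) != 4:
            sys.exit("Usage: PageRank.py <iterations> <input-path> <output-path>")

        n = int(sys.argv[1])       # number of PageRank iterations (convert to int)
        input_path = sys.argv[2]   # e.g. pagelinks.txt
        output_path = sys.argv[3]  # where Step 2 will store the result

        sc = SparkContext(appName="PageRank")
        lines = sc.textFile(input_path)
        # ... build the links and ranks RDDs and iterate, as in the lab 9 shell session ...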

QUIZ: How many jobs and stages does your application run after Step 1

  • for 1 iteration?
  • for 5 iterations?
STEP 2: PRINT OUTS

So far we have run this PageRank implementation in the shell and only for 10 iterations. Depending on the graph structure, we will have to run the algorithm for more iterations than that, and on the cluster. This means print statements are, first, inefficient and, second, not useful, since their output won’t necessarily appear on your screen.

  • The print statements use take(10) inside the for loop. That means they actually trigger a computation in each iteration, and that computation starts from scratch each time!
    • Comment them out!
  • Before storing the result, let’s sort by PageRank value in descending order.
  • Add a saveAsTextFile statement to trigger job execution. Store the result at the output location provided on the command line (see Step 1). A sketch of these final steps follows this list.
    • Now we execute everything once,
    • and we store the result.
  • Test your modifications by executing the Spark application (in local mode or on the cluster) on the toy data from lab9 (pagelinks.txt).
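
A hedged sketch of these final steps, assuming ranks is the (page, rank) pair RDD after the last iteration and output_path comes from Step 1:

    # Sort by the rank value (the second element of each pair) in descending order,
    # then write the result to the output path given on the command line.
    # saveAsTextFile is the action that triggers the actual job execution.
    sorted_ranks = ranks.sortBy(lambda page_rank: page_rank[1], ascending=False)
    sorted_ranks.saveAsTextFile(output_path)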

QUIZ: How many jobs and stages does your application now run (after completing Step 2)

  • for 1 iteration?
  • for 5 iterations?
  • for 15 iterations?
STEP 3: RDD LINEAGE & CHECKPOINTING

The RDD lineage keeps track of which parent data each transformation needs during lazy execution, so over many iterations it can get very long. To avoid maintaining overly long lineages and facing expensive recovery from them (say, for 100 iterations or more), we can checkpoint some intermediate data to HDFS.

  • Add a checkpoint to your PageRank program after every 10 iterations, starting at the 10th iteration (not the first!). A sketch of this follows the questions below.
  • Is checkpointing an action or transformation?
  • What does checkpointing do?
  • Where is the checkpoint data stored?
  • What happens to the RDD lineage after checkpointing?
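
A sketch of the checkpointing logic, assuming sc, links, ranks, and computeContribs already exist as in the lab 9 shell session (your iteration body may differ; the checkpoint directory name is only an example):

    # The checkpoint directory must be set once, before the first checkpoint;
    # on the cluster this should be a path in HDFS.
    sc.setCheckpointDir("pagerank_checkpoints")

    for i in range(1, n + 1):
        contribs = links.join(ranks).flatMap(
            lambda page_links_rank: computeContribs(page_links_rank[1][0],
                                                    page_links_rank[1][1]))
        ranks = contribs.reduceByKey(lambda a, b: a + b) \
                        .mapValues(lambda s: 0.15 + 0.85 * s)

        # Checkpoint every 10th iteration, starting at the 10th (not the first).
        # checkpoint() only marks the RDD; the data is written to the checkpoint
        # directory when the next action runs, and the lineage is truncated there.
        if i % 10 == 0:
            ranks.checkpoint()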

QUIZ: Did the number of jobs or stages change after completing Step 3 (when running 15 iterations)?

STEP 4 [OPTIONAL]: AVOID RE-COMPUTATIONS

Note that the computeContribs function will be executed for every page in every iteration. Look at it more closely and identify computations that can be performed once instead of in every iteration.

  • Can you come up with a better data representation of the links so that you can avoid these unnecessary computations? One possible direction is sketched after this list.
  • Change your Spark program accordingly and test it. The results should not change, but on a bigger graph it should execute faster.
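
One possible direction, not necessarily the intended solution: if computeContribs divides a page’s rank by its number of outgoing links on every call, that out-degree can be computed once and stored together with the neighbour list, and the links RDD can be cached since it never changes between iterations (parse_link below stands for whatever parsing your program already does):

    # Hypothetical alternative representation of the links RDD:
    # each page maps to (list of neighbours, out-degree), computed once and cached.
    links = lines.map(parse_link) \
                 .groupByKey() \
                 .mapValues(lambda neighbours: (list(neighbours), len(neighbours))) \
                 .cache()

    def computeContribs(neighbours_and_count, rank):
        # The out-degree comes from the precomputed pair instead of being
        # recomputed for every page in every iteration.
        neighbours, num_neighbours = neighbours_and_count
        for neighbour in neighbours:
            yield (neighbour, rank / num_neighbours)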

QUIZ [BONUS]: Draw the DAG of the PageRank application (after completing Steps 1-3) and indicate the jobs and the stages/places where RDDs are repartitioned. Submit your drawing on a piece of paper at the end of the lab and indicate in the Gradescope quiz question whether you did so.

Part II: PageRank on a Real Webgraph

Let’s execute our PageRank on a subset of a real webgraph (california.txt) provided in your lab9 folder (in the SVN repository). It was constructed by expanding a 200-page response set to the search engine query "california". Note that this data was collected some time ago, so a number of the links will be broken by now. The graph has 9,664 nodes and 16,149 edges. Put the data into HDFS.

Formatting of the data (a parsing sketch follows this list):

  • lines starting with ### are comments
  • lines starting with n are nodes, in the form nodeID url (separated by a space)
  • lines starting with e are links, in the form nodeID nodeID (separated by a space)
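
A sketch of how this format could be parsed into two RDDs, assuming the file has been loaded with sc.textFile(input_path) and that fields are separated by single spaces:

    lines = sc.textFile(input_path)

    # nodes: (nodeID, url) pairs, used at the end to look up the urls.
    nodes = lines.filter(lambda l: l.startswith("n ")) \
                 .map(lambda l: l.split(" ")) \
                 .map(lambda fields: (fields[1], fields[2]))

    # links: (sourceID, <targetIDs>) adjacency lists for the PageRank loop.
    # Comment lines starting with ### are dropped by the filters.
    links = lines.filter(lambda l: l.startswith("e ")) \
                 .map(lambda l: l.split(" ")) \
                 .map(lambda fields: (fields[1], fields[2])) \
                 .groupByKey() \
                 .cache()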

Implementation:

  1. Modify your PageRank implementation to load and parse this data into RDDs. You want to create a nodes RDD to look up the urls at the end and a links RDD for your PageRank computation (see the parsing sketch above).
  2. Add a checkpoint to your PageRank program after every 50 iterations (instead of every 10).
  3. Then run the PageRank implementation for 100 iterations with all your modifications from the previous steps.
  4. The output of your job should be the urls of the 10 most important and the 10 least important webpages, together with their ranks (a sketch of this follows the list).
  5. Look them up in a web browser and check that your algorithm behaves correctly. Does your result make sense intuitively?
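
A possible way to produce that output, assuming ranks holds (nodeID, rank) pairs after the final iteration and nodes is the (nodeID, url) RDD from the parsing sketch above:

    # Attach the url to each rank: join gives (nodeID, (rank, url)),
    # which we reshape into (rank, url) pairs.
    ranked_urls = ranks.join(nodes).map(lambda kv: (kv[1][0], kv[1][1]))

    top10 = ranked_urls.top(10)             # the 10 highest-ranked pages
    bottom10 = ranked_urls.takeOrdered(10)  # the 10 lowest-ranked pages

    # top() and takeOrdered() return plain Python lists on the driver, so
    # printing (or writing them to a file) is fine here, unlike the
    # per-iteration take(10) removed in Step 2.
    for rank, url in top10 + bottom10:
        print(url, rank)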