LEARNING OBJECTIVES
  • Understand RDD persistence and its benefits.
  • Understand RDD lineage and the usage and effects of checkpointing.
  • Be able to map the given PageRank implementation to the algorithm discussed in class.
  • Adapt the implementation to add teleportation and persist the appropriate RDD.
  • Turn the implementation into a Spark application.
  • Understand how jobs and stages are created, and use this understanding to make your implementation suitable for Big Data execution by removing anything that makes it inefficient.

Part I: Persistent RDDs (25min)

Recall the iterative PageRank algorithm to rank webpages. Check the slides from last lecture if you need a refresher.

It essentially multiplies the Google matrix G with a PageRank vector v iteratively, while avoiding the explicit computation of G. Note that we can represent G by a (source, list of destinations) data structure called links and iteratively join it with the uniformly initialized ranks RDD. The only information that changes over the iterations are the ranks; the links RDD stays fixed.

EFFICIENCY CONSIDERATION: For an efficient iterative computation it therefore makes sense to keep the links data in the compute nodes’ memory across the iterations. This way we save the time it takes to load it from disk (or, in the worst case, to shuffle it across the network). In Spark this is achieved by persisting the RDD that holds the links data.
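
To make this concrete, here is a minimal PySpark sketch (not the lab's actual code; the file name pagelinks.txt and the variable names lines, links, and ranks are illustrative) of the links/ranks representation and of keeping links in memory:

    # Minimal sketch, assuming one "source destination" pair per input line.
    lines = sc.textFile("pagelinks.txt")

    # (source, [destination, destination, ...]) pairs; this RDD is reused in every
    # iteration, so we persist it (persist() defaults to MEMORY_ONLY for RDDs).
    links = lines.map(lambda line: tuple(line.split())).groupByKey().persist()

    # Uniformly initialized ranks: (page, 1.0) for every source page.
    ranks = links.mapValues(lambda dests: 1.0)

    # Each iteration then joins the fixed links RDD with the current ranks RDD.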

Go over the SP3 slides to understand the concepts of RDD lineage, persistence, and checkpointing.

Part II: PageRank in Spark (20min)

Let’s look at a Spark implementation of PageRank. Find the pyspark and scalaspark commands in your SVN repository in the lab9 folder. Look at the code in your favorite editor in the VM. HERE is illustrative documentation of this implementation.

  • Does this Spark application implement teleportation?
    • If not, add teleportation to the implementation (a sketch of one possible way, covering teleportation and persistence, follows this list).
    • Why is teleportation required?
  • Persist the links RDD in memory.
    • Why is persisting the link data beneficial?
    • How do you make the link data persistent?
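
A sketch of one possible way to add teleportation and to persist the links RDD, assuming a damping factor d = 0.85 and ranks initialized to 1.0 per page; contribs and links are the names used in the classic Spark PageRank example:

    # Sketch only: teleportation with damping factor d = 0.85.
    # Without teleportation the update is just the sum of incoming contributions:
    #   ranks = contribs.reduceByKey(lambda a, b: a + b)
    # With teleportation every page additionally receives a uniform "random jump" share:
    d = 0.85
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda contrib_sum: (1 - d) + d * contrib_sum)

    # Persist the links RDD, since it is reused unchanged in every iteration's join:
    links.persist()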

Once you are familiar with the code and have made the above changes, run it in the Spark shell using the toy data pagelinks.txt provided in ~/training_materials/dev1/data/spark_toydata.

  • to run a .pyspark application within the pyspark shell use: execfile("PageRank.pyspark")
  • to run a .scalaspark application within the scalaspark shell use: :load PageRank.scalaspark

Which page in the toy webgraph is the most/least important? Remember it or take a note.

Part III: Challenges of Iterative Algorithms (20min)

To be able to analyze our Spark implementation, let's first turn it into a standalone module. You can use the WordCount.py or WordCount.scala provided in ~/training_materials/dev1/examples as a template. Rename your module PageRank.py or PageRank.scala. Now let's modify the program so that we can deploy it on the cluster.

  1. COMMANDLINE ARGUMENTS
    • Instead of hardcoding the path and the number of PageRank iterations, we want to provide them on the command line. Change the command-line input check so that the program takes 3 arguments in addition to the program name itself:
      • number of iterations
      • input path
      • output path
    • Hint: len(sys.argv) should be 4
    • Now, set the variable n to be read from the command line (Hint: be aware that n is an integer). Load the data from the provided input path instead of a hardcoded path. A sketch of one possible argument handling follows below.
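
    A sketch of how the argument handling might look (the module layout follows the WordCount.py template; the variable names input_path and output_path are illustrative):

      import sys
      from pyspark import SparkContext

      if __name__ == "__main__":
          if len(sys.argv) != 4:
              sys.stderr.write("Usage: PageRank.py <iterations> <input path> <output path>\n")
              sys.exit(1)

          n = int(sys.argv[1])        # number of iterations; note the conversion to int
          input_path = sys.argv[2]
          output_path = sys.argv[3]

          sc = SparkContext(appName="PageRank")
          lines = sc.textFile(input_path)   # read from the provided path, not a hardcoded one
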
  2. PRINT OUTS
    We ran this PageRank implementation in the shell and only for 10 iterations. Depending on the graph structure we will have to run the algorithm longer than that and on the cluster. This means print statements are, first, inefficient and, second, not useful, since they won't necessarily appear on your screen.
    • The print statements use take(10) in the for loop. That means they actually trigger a computation in each iteration that starts from scratch each time!
      • Comment them out!
    • Add a saveAsTextFile statement after the loop terminates to trigger job execution. Read the location where to store the result from the command line (using sys.argv). A sketch follows after this list.
      • Now we execute everything only once,
      • and we store the result.
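
    A sketch of what the loop might look like after these changes (computeContribs is the function from the provided code; the teleportation constants follow the earlier sketch):

      for i in range(n):
          contribs = links.join(ranks).flatMap(
              lambda page_dests_rank: computeContribs(page_dests_rank[1][0],
                                                      page_dests_rank[1][1]))
          ranks = contribs.reduceByKey(lambda a, b: a + b) \
                          .mapValues(lambda s: 0.15 + 0.85 * s)
          # print ranks.take(10)   # commented out: take() is an action and would
          #                        # re-run the whole lineage in every iteration

      # One action after the loop: triggers a single job and stores the result.
      ranks.saveAsTextFile(output_path)
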
  3. RDD LINEAGE & CHECKPOINTINGThe RDD lineage keeps track of which parent data is needed for each transformation during lazy execution. So, over the iterations this can get very long. To avoid maintaining too long lineages and having to face expensive recovery from those lineages (say for 100 iterations or more), we can checkpoint some intermediate data to HDFS.
    • Add a checkpoint after every 10 iterations to your PageRank program (see the sketch after this item).
    • What does checkpointing do? Where is the checkpoint data stored? What happens to the RDD lineage?
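
    A sketch of the checkpointing step (the directory name "checkpoints" is illustrative; on the cluster it should be an HDFS path):

      sc.setCheckpointDir("checkpoints")       # where checkpointed RDD data is written

      for i in range(n):
          contribs = links.join(ranks).flatMap(
              lambda page_dests_rank: computeContribs(page_dests_rank[1][0],
                                                      page_dests_rank[1][1]))
          ranks = contribs.reduceByKey(lambda a, b: a + b)
          if (i + 1) % 10 == 0:
              ranks.checkpoint()               # mark ranks for checkpointing...
              ranks.count()                    # ...an action forces the checkpoint to be
                                               # written now and truncates the lineage here
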
  4. [OPTIONAL] AVOID RE-COMPUTATIONS
    Note that the computeContribs function will be executed for every page in every iteration. Look at it more closely and identify computations that can be performed once instead of in every iteration.
    • Can you come up with a better data representation of links so that you can avoid these unnecessary computations? (One possible direction is sketched after this item.)
    • Change your Spark program accordingly and test it. The results should not change, but on a bigger graph the program should execute faster.
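
    One possible direction, sketched with a hypothetical helper value (dests_and_count): store the out-degree together with the destination list once when links is built, so computeContribs does not have to recount it for every page in every iteration:

      # Build links once as (source, (list of destinations, out-degree)) and persist it.
      links = lines.map(lambda line: tuple(line.split())) \
                   .groupByKey() \
                   .mapValues(lambda dests: (list(dests), len(dests))) \
                   .persist()

      def computeContribs(dests_and_count, rank):
          dests, num_dests = dests_and_count
          for dest in dests:
              yield (dest, rank / num_dests)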

Part IV: Quiz (10min)

Do the quiz.