CSE 427s Resources and HowTos



  • Mining of Massive Data Sets by Jure Leskovec, Anand Rajaraman, Jeff Ullman (available for free online http://mmds.org)
  • Hadoop: The Definitive Guide (4th edition) by Tom White
  • Data Analytics with Hadoop – An Introduction for Data Scientists by Benjamin Bengfort, Jenny Kim

Optional Book: 

  • Data Algorithms: Recipes for Scaling Up with Hadoop and Spark by Mahmoud Parsian

Cloudera Course VM

We will use a pre-configured virtual machine in the course.

  1. Download and install a virtualization program to run the virtual machine.
    • VirtualBox (recommended for all platforms: Mac, Linux, Windows)
    • VMware (possible for Windows OS – limited instructor and TA support)
  2. Download the VM matching your virtualization software from HERE.
    • System requirements: to run the VM on your laptop you need at least 4GB RAM, which is the minimum recommended memory as indicated here.
    • If your laptop/computer does not meet these requirements, please contact me asap! We can provide you with a rental laptop for the semester.
  3. Set up the VM.
    • Here is a tutorial for VirtualBox.
    • Here is a tutorial for VMware.
  4. Check out these VM trouble-shooting tips whenever you run into issues with your VM!
  5. Working with the VM and optional set up.
    • The username for the CentOS operating system running in the VM is cloudera, and the password (should you need it) is cloudera as well.
    • Attention Windows users: make friends with the Linux terminal and consider this cheat sheet for useful shell commands.
    • Optional: to install software on CentOS running in your VM, use the yum package manager in the terminal
      • e.g., sudo yum install htop (if you want to install htop)
      • e.g., sudo yum install subversion (if you want to install svn)
      • here is a tutorial on yum
    • Optional but useful: set up a shared folder with your host machine: here are the instructions.


We will use Gradescope for all homework grading. Find a tutorial on submitting a PDF to Gradescope HERE. To sign up, use entry code TBA.


We will be using SVN to distribute code stubs and data, as well as to collect code solutions. Please see this tutorial about accessing your repository.

The path to your SVN repository is:


You need to substitute your own wustlkey (e.g. m.neumann) in place of <wustlkey>, your course number (e.g. 427s) in place of XXX, and the respective abbreviation for the semester and year (e.g. fl18 for fall 2018).

If you wish to access your files from your own computer, you can use SVN via the terminal (Mac, Linux) on your host machine, or you can install an SVN client such as Tortoise (Windows) or SmartSVN (Windows, Mac, Linux) on your host OS.

Verifying your repository commits
To verify that your work was committed successfully, enter the URL of your repository (https://svn.seas.wustl.edu/repositories/<wustlkey>/cseXXX_fl18) in a web browser. You will see all the files that are currently in the repository (mind browser caching, and reload the page if recent commits are missing).
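
The placeholder substitution described above can be sketched in a few lines of Python. This is only an illustration: the function name repository_url and its parameter names are made up for this example, and the URL pattern is copied from the verification URL quoted above, with the semester abbreviation assumed to take the place of fl18.

```python
def repository_url(wustlkey: str, course: str, semester: str) -> str:
    """Fill the placeholders in the repository URL pattern.

    wustlkey -- your own WUSTL key, e.g. "m.neumann"
    course   -- the course number that replaces XXX, e.g. "427s"
    semester -- semester and year abbreviation, e.g. "fl18" for fall 2018
    """
    # Catch the common mistake of leaving a placeholder unfilled.
    for value in (wustlkey, course, semester):
        if not value or "<" in value or "XXX" in value:
            raise ValueError(f"placeholder not filled in: {value!r}")
    return (
        "https://svn.seas.wustl.edu/repositories/"
        f"{wustlkey}/cse{course}_{semester}"
    )


if __name__ == "__main__":
    # Example values taken from the text above.
    print(repository_url("m.neumann", "427s", "fl18"))
```

The printed URL is the one to open in a browser when verifying commits, and it should also be usable as the argument to svn checkout in a terminal.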

AWS Account

Towards the end of the semester we will be using AWS to execute our programs on a “real” cloud. Follow these instructions to create an account and get educational credit (only possible if you have not applied for educational credit before).


Notes on how to use Eclipse can be found here. Notes on how to use Eclipse to test MapReduce programs locally are here.


A reference about regular expressions can be found here.