MGT 560M Resources and HowTos

Course Book

  • Data Analytics with Hadoop – An Introduction for Data Scientists by Benjamin Bengfort, Jenny Kim

Additional Books

  • Mining of Massive Data Sets by Jure Leskovec, Anand Rajaraman, Jeff Ullman (available for free online)
  • [optional] Hadoop: The Definitive Guide (4th edition) by Tom White

Cloudera Course VM

In the course, we will use a pre-configured virtual machine that runs Hadoop.

  1. Download and install a virtualization program to run the virtual machine.
    • VirtualBox (recommended for all platforms: Mac, Linux, Windows)
    • VMware (possible for Windows OS)
  2. Download the VM matching your virtualization software from HERE.
    • System requirements: to run the VM on your laptop you need at least 4 GB of RAM, which is the minimum recommended memory as indicated here. If your laptop or computer does not meet this requirement, please contact me as soon as possible.
  3. Set up the VM.
    • Here is a tutorial for VirtualBox.
    • Here is a tutorial for VMware.
  4. Do Lab0 (requires a working VM).
  5. Check out these VM troubleshooting tips whenever you run into issues with your VM!
  6. Working with the VM and optional set up.
    • The username for the CentOS operating system running in the VM is cloudera, and the password (should you need it) is cloudera as well.
    • Attention Windows users: make friends with the Linux terminal and consider this cheat sheet for useful shell commands.
    • Optional: to install software on CentOS running in your VM, use the terminal application yum
      • e.g., sudo yum install htop (if you want to install htop)
      • e.g., sudo yum install subversion (if you want to install svn)
      • here is a tutorial on yum
    • Optional but useful: set up a shared folder with your host machine: here are the instructions.
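
The optional yum installs from step 6 can be sketched as a small shell helper. This is only an illustration: the function name yum_hint is made up for this example, and it assumes that the htop package provides the htop command and that the subversion package provides the svn command. It prints the yum command instead of running it, so you can try it without sudo.

```shell
# Sketch only: yum_hint is a hypothetical helper, not part of the course VM.
# Usage: yum_hint <command> <package>
# Prints the yum command for <package> unless <command> is already on the PATH.
yum_hint() {
  cmd="$1"
  pkg="$2"
  if command -v "$cmd" >/dev/null 2>&1; then
    echo "$cmd is already installed"
  else
    echo "sudo yum install $pkg"
  fi
}

# The two optional tools mentioned above (command name, then package name).
yum_hint htop htop
yum_hint svn subversion
```

Copy a printed sudo yum install line into the terminal to actually install the package; the password for sudo, if requested, is the cloudera password mentioned above.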

AWS Account

We will be using AWS to execute our programs on a “real” cloud. Follow these instructions to create an account and get educational credit (only possible if you have not applied for educational credit before).


We will use Gradescope for all homework grading. Find a tutorial on submitting a PDF to Gradescope HERE. To sign up, use entry code TBA.


A reference on regular expressions can be found here.