The final project counts for 20% of your overall grade and is graded out of 100%. All topics are designed as group work for teams of 3-4 students (4 is best)! Working in a team is mandatory. The projects are graded on a per-team basis. On this page you will find information about:

Topics (choose one)

  • Project 1: Collaborative Filtering using the Netflix Data
  • Project 2: Large-scale Text Processing and Sentiment Analysis in Hive
  • Project 3: Geo-location Clustering using the k-means Algorithm

Milestones and Deadlines

  • milestone 1 (topic [this choice is binding!], group members [this choice is binding!], assign roles to group members)
    • due TUE 11/05/2019 (5:30pm – no extension!!)
    • fill in Excel sheet in lab session (no email or Piazza)
  • milestone 2 (problem understanding, data preparation, routing)
    • due TUE 11/12/2019 (5:30pm – no extension!!)
    • short oral presentation (to instructor or TA)
    • sign up for a slot HERE (first come, first served)
  • milestone 3 (model parameters, model choices, and evaluation)
    • due THU 11/14/2019 in the lecture (5:30pm – no extension!!)
    • submit worksheet at the end of the lecture
  • milestone 4 (EMR execution of wordcount or other toy program)
    • due TUE 11/26/2019 in the lab session (5:30pm – no extension!!)
    • show successful execution to TA or instructor by the end of the lab
  • project submission (implementation, project report)
    • due 12/13/2019 (6pm – no extension!!)
    • submit code via SVN repository commit
    • submit report via Gradescope
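
Milestone 4 asks for a toy program such as word count. As a rough illustration only (not a required solution), the sketch below mimics the map, shuffle, and reduce phases in plain Python so the logic can be checked locally before moving to EMR. The function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for this example, and an actual EMR run would use Spark, MapReduce, or Hive instead.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Hypothetical helper chaining the three phases on an in-memory input."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    sample = ["big data is big", "data in the cloud"]  # made-up input
    for word, count in sorted(word_count(sample).items()):
        print(word, count)
```

Keeping the three phases as separate functions makes it easier to see which part would become the mapper and which the reducer once the same logic is ported to a cluster framework.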

Team Member Roles and Responsibilities

  • project manager
    • communication/coordination among all team members
    • [final submission] manages code submission to SVN repository
    • [final submission] manages report submission via Gradescope
    • [milestone 1] needs to be present to submit milestone 1 (see above)
    • [milestone 2] needs to present to submit milestone 2 (see above)
    • set up and maintain team repo (optional)
    • [report/write-up] template
    • [report/write-up] motivation and introduction – coordinate with key user
    • [report/write-up] documentation of approach – coordinate with developers
    • [report/write-up] conclusion – coordinate with key user
  • key user
    • data preprocessing
    • [milestone 2] needs to present to submit milestone 2 (see above)
    • find and preprocess Big data application/real-world dataset
    • [report/write-up] documentation of real world data
    • assists developer cloud in executing implementation on real-world data
    • [report/write-up] documentation of results
  • developer local
    • implementation
    • [milestone 3] needs to present to submit milestone 3 (see above)
    • [report/write-up] documentation of implementation
    • testing locally
    • testing pseudo-cluster together with developer cloud
  • developer cloud
    • assists developer local
    • testing pseudo-cluster
    • [milestone 4] needs to present to submit milestone 4 (see above)
    • cloud execution
    • [report/write-up] documentation of cloud execution
    • assists key user with documentation of results

Cloud Execution: Amazon EMR

All three projects have a Big data and cloud execution portion. If you took CSE330, you should be familiar with AWS. To use Amazon Web Services, we need some credit.

  • CLASSROOM ACCOUNT: I am currently working on getting classroom credit for all cse427s students. (Update: You should have gotten an email with instructions on how to join! There will be $50 of credit for every student in the class.)
  • PERSONAL CREDIT: You can also apply for personal credit by setting up an AWS Educate Account; this credit is independent of the duration of this course. If you already applied for CSE330, you will not be able to get another round of credit. (Trick: use your wustlkey@email.wustl.edu email address for a second account …)

AWS Educate Account

To use Amazon EMR, you may want to create an AWS Educate account using your wustl email. You’ll only have limited resources, as this is just an educational version; however, that should be fine for the purpose of our projects. Be careful with the resources you use: there can be (hopefully refundable) costs associated!

The tutorials on this site may also be helpful: https://aws.amazon.com/education/edu-getting-started-videos/.

EMR help

Grading Rubric

  • milestone 1 (5%)
  • milestone 2 (10%)
    • project management
      • conceptual understanding (3%)
      • understanding of roles and responsibilities of team members (2%)
    • technical (check project instructions for more details)
      • data preprocessing and other steps completed (5%)
  • milestone 3 (5%)
    • understand the model parameters, model choices, and quality evaluation process for your project
  • milestone 4 (5%)
    • demonstrate that you have mastered cloud execution on Amazon EMR using a toy Spark/MapReduce/Hive program
  • final submission (75%)
    • [write-up] motivation – what is your project about, and why is the application important in the real world and in the context of Big Data Analysis and Cloud Computing? (10%)
    • [write-up] documentation of approach – how did you approach the problem (5%)
    • small data/pseudo cluster – did it work?
      • [code] implementation (15%)
      • [write-up] results/discussion – what’s next? (10%)
    • Big data application/dataset
      • which dataset/application – creativity (5%)
      • [write-up] description (5%)
      • [code] implementation/execution (5%)
      • [write-up] results/discussion (10%)
    • [write-up] final conclusion/lessons learned/future work (10%)

Motivation, documentation, results, and discussion are all part of the project report. The report should be readable by an informed outsider and should not require the reader to look at or run any of your code.
IMPORTANT: The grading of the report (55%) will weigh heavily on cleanliness (structure), readability (logical flow, writing style), and presentation (figures, tables, plots, etc.)!
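
The percentages in the rubric above can be cross-checked with a few lines of Python. This is only a sanity check of the numbers listed on this page. The labels are shorthand invented for this sketch, and counting the dataset-creativity item toward the report is an assumption, made because it is the reading under which the report share equals the stated 55%.

```python
# Rubric percentages copied from the list above; labels are shorthand.
milestones = {"milestone 1": 5, "milestone 2": 10, "milestone 3": 5, "milestone 4": 5}

# Sub-items of milestone 2 (conceptual understanding, roles, technical).
milestone_2_parts = [3, 2, 5]

# (label, percent, kind): kind is "code" for items tagged [code], else "report".
final_submission = [
    ("motivation", 10, "report"),
    ("documentation of approach", 5, "report"),
    ("small data: implementation", 15, "code"),
    ("small data: results/discussion", 10, "report"),
    ("big data: dataset choice/creativity", 5, "report"),  # assumption, see lead-in
    ("big data: description", 5, "report"),
    ("big data: implementation/execution", 5, "code"),
    ("big data: results/discussion", 10, "report"),
    ("conclusion/lessons learned/future work", 10, "report"),
]

final_total = sum(pct for _, pct, _ in final_submission)
report_total = sum(pct for _, pct, kind in final_submission if kind == "report")
code_total = sum(pct for _, pct, kind in final_submission if kind == "code")
overall_total = sum(milestones.values()) + final_total

print("final submission:", final_total)
print("report share:", report_total)
print("code share:", code_total)
print("overall:", overall_total)
```

Under these assumptions, the final submission splits into a report share and a code share that together account for everything not covered by the four milestones.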

Time Management and Getting Help

First, please note that all three projects are BIG data analysis projects. That means they involve a lot of implementation and debugging work, and the execution time alone (once your programs are running) will also be BIG!

Time management is key!

Also, note that we cannot grant extensions on the deadlines, as the projects have to be graded before the final grading deadline!

If you need help with your project, the best source is the responsible TA(s) or the instructor. The office hours listed below apply while classes are in session. Any updates on the office hours for the week after classes have finished will be posted on the course webpage.

  • Project 1:
    • Jonathan (office hours: MON 2:30-4pm in Sever 300)
    • Wentao (office hours: FRI 10-11:30am in Lopata 201)
    • MN (office hours: TUE 3-4pm in Jolley 222)
  • Project 2:
    • Kevin (office hours: SUN 9-11am in Rudolph 201)
    • MN (office hours: TUE 3-4pm in Jolley 222)
  • Project 3:
    • Patric and Lorenzo (office hours: WED 2:30-4:30pm in Lopata 201)
    • Kevin (office hours: SUN 9-11am in Rudolph 201)
    • Jordie (office hours: FRI 10:30am-12:30pm in Lopata 201)
    • MN (office hours: TUE 3-4pm in Jolley 222)
  • Amazon EMR for all projects
    • Wentao (office hours: FRI 10-11:30am in Lopata 201)

You can also use Piazza for project-related discussions. Make sure to use the tag for your project (fp-1, fp-2, or fp-3), so that the responsible TAs and other groups working on the same project can easily find your posts.