**This is an inactive course webpage**

Labs: TUE 2:30-4pm (Section 1) and 4-5:30pm (Section 2) in Eads 016
Lectures: THU 4-5:30pm in Crow 201

Instructor: Marion Neumann
Office: Jolley Hall Room 222
Contact: Please use Piazza!
Office Hours: THU 3-4pm during lecture weeks; for all other times, check announcements above or on Piazza. Individual appointments are possible (request via email – allow 2-3 days for a reply/scheduling).
Please avoid random drop-ins outside my office hours.

Head TA:
Jonathan – manages all Gradescope/Canvas grades –> use Piazza tag grades
Alexis, Amanda, Erik, Harrison, Luxiao, Michael, Steven, Yushu, Zac

TA Office Hours (during lecture weeks):

  • Monday 4:30-6:30pm in Jolley 431 (Michael and Amanda)
  • Wednesday 11:30am-1pm in Jolley 431 (Zac)
  • Friday 4-6pm in Jolley 431 (Steven)
  • Sunday 4-6pm in Lopata Hall 302 (Erik)

Since this is a pilot offering of a brand new course we appreciate any feedback you have for us! Tell us what you like, don’t like, or could be improved (and how if you have any ideas). Use this Anonymous Feedback Form.


Homework assignments
Homework assignments will be assigned concurrently with the lecture/lab sessions covering the respective materials. Due dates and submission instructions will be indicated on the course webpage under homework assignments. It is every student’s responsibility to meet the submission requirements and deadlines. We cannot accept late submissions or submissions that do not follow the submission instructions, for any reason (see also Late Policy below).

We will drop the lowest-scoring homework, and each remaining homework assignment will be weighted equally. The total grade achieved for the homework assignments (one drop, no make-ups) will contribute 40% towards your total course performance.

Regrade Requests
Any regrade requests, claims of missing scores, or grade discrepancies will have to be made within one week of the grade announcement. We will not take any regrade requests after this one-week period for any reason. Grade announcements will be made on Piazza and grading comments will be provided in your SVN repository or via Gradescope. All grades will be maintained on Canvas. It is the student’s responsibility to verify that all grades on Canvas are accurate. Regrade requests should be submitted exclusively via Gradescope, and grade discrepancies should be reported via Piazza (using the grades tag).

Lab Participation
You are expected to actively participate in the labs. Lab participation contributes 20% to your total course performance. Successful (full score) lab participation goes beyond attending the lab. It will be determined based on the following components:

  • active participation in small group discussions
  • lab progress/completion (assessed via end-of-lab demos to instructor/TA)
  • lab quizzes

There are no make-ups for any missed labs or quizzes. The lowest lab score will be dropped (one drop).

There will be 2 exams contributing 20% each towards your total course performance. The dates are

  • Midterm: March 7 2019 (in-class)
  • Final: May 8 2019 6-8pm (scheduled by university)

Grading Summary
40%  hw assignments
20%  lab participation
20%  midterm
20%  final exam

It is not possible to achieve a higher percentage on any individual grade component than listed above through bonus or extra credit problems.
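The weights above can be combined as in the following minimal sketch. It assumes every score is expressed as a percentage between 0 and 100 and that lab scores, like homework scores, are weighted equally after the drop (the page only states this for homework). The function names and the example scores are made up for illustration; this is not the official grade calculation.

```python
def drop_lowest_mean(scores):
    """Average the scores after dropping the single lowest one."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to drop one")
    kept = sorted(scores)[1:]
    return sum(kept) / len(kept)


def course_percentage(homework, labs, midterm, final):
    """Combine component percentages (0-100) using the weights listed above."""
    components = [
        (0.40, drop_lowest_mean(homework)),  # hw assignments, one drop
        (0.20, drop_lowest_mean(labs)),      # lab participation, one drop
        (0.20, midterm),
        (0.20, final),
    ]
    # No component can count for more than its listed percentage,
    # so each one is capped at 100 before weighting.
    return sum(weight * min(score, 100.0) for weight, score in components)


# Made-up scores, purely for illustration.
example = course_percentage(
    homework=[90, 80, 100, 70],
    labs=[100, 50, 90],
    midterm=80,
    final=85,
)
print(round(example, 2))
```

With these example scores, the homework average after the drop is 90 and the lab average after the drop is 95.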

Final course grades will be assigned using the following straight scale:

Letter Grade    Cutoff Percentage
F               < 60%

The passing grade is C- or better (70%).

Late Policy
Your homework assignments must be turned in on time. There are absolutely no makeup quizzes or assignments for any reason, including missed deadlines.

Collaboration Policy
You are encouraged to discuss the course material with other students. Discussing the material and the general form of solutions to the labs is a key part of the class. Since, for many of the assignments, there is no single “right” answer, talking to other students and to the TAs is a good thing. However, everything that you turn in should be your own work, unless we tell you otherwise. If you talk about assignments with another student, then you need to explicitly tell us on the hand-in by providing their name(s) and student ID(s). You are not allowed to copy answers or parts of answers from anyone else, or from material you find on the Internet. This will be considered willful cheating and will be dealt with according to the official collaboration policy:

Academic Integrity
Unless explicitly instructed otherwise, everything that you turn in for this course must be your own work. If you willfully misrepresent someone else’s work as your own, you are guilty of cheating. Cheating, in any form, will not be tolerated in this class.

Check out these questions and answers in the CSE FAQ.

There is zero tolerance of Academic Dishonesty. I will be actively searching for academic dishonesty on all homework assignments, quizzes, and exams. If you are guilty of cheating on any assignment or exam, you will receive an F in the course and be referred to the School of Engineering Discipline Committee. In severe cases, this can lead to expulsion from the University, as well as possible deportation for international students. If you copy from anyone in the class, both parties will be penalized, regardless of which direction the information flowed. This is your only warning.

Please refer to the University Undergraduate Academic Integrity Policy for more information. If you suspect that you may be entering an ambiguous situation, it is your responsibility to clarify it before the professor or TAs detect it. If in doubt, please ask.

Providing/Posting Solutions
Providing your course work (written or code) in any form to others is a violation of the academic integrity policy. If you provide your solutions to someone else in the course or post them publicly online, you are guilty of violating our academic integrity policy. Such a case will be treated the same way as described above, and prosecution will take place even after you have finished the course or graduated from Wash U.

Mental Health
Mental Health Services professional staff members work with students to resolve personal and interpersonal difficulties, many of which can affect the academic experience. These include conflicts with or worry about friends or family, concerns about eating or drinking patterns, and feelings of anxiety and depression. See: http://shs.wustl.edu/MentalHealth

Accommodations based upon sexual assault
The University is committed to offering reasonable academic accommodations to students who are victims of sexual assault. Students are eligible for accommodation regardless of whether they seek criminal or disciplinary action. Depending on the specific nature of the allegation, such measures may include but are not limited to: implementation of a no-contact order, course/classroom assignment changes, and other academic support services and accommodations. If you need to request such accommodations, please direct your request to Kim Webb (kim_webb@wustl.edu), Director of the Relationship and Sexual Violence Prevention Center. Ms. Webb is a confidential resource; however, requests for accommodations will be shared with the appropriate University administration and faculty. The University will maintain as confidential any accommodations or protective measures provided to an individual student so long as it does not impair the ability to provide such measures.

If a student comes to me to discuss or disclose an instance of sexual assault, sex discrimination, sexual harassment, dating violence, domestic violence or stalking, or if I otherwise observe or become aware of such an allegation, I will keep the information as private as I can, but as a faculty member of Washington University, I am required to immediately report it to my Department Chair or Dean or directly to Ms. Jessica Kennedy, the University’s Title IX Coordinator. If you would like to speak with the Title IX Coordinator directly, Ms. Kennedy can be reached at (314) 935-3118, jwkennedy@wustl.edu, or by visiting the Title IX office in Umrath Hall. Additionally, you can report incidents or complaints to Tamara King, Associate Dean for Students and Director of Student Conduct, or by contacting WUPD at (314) 935-5555 or your local law enforcement agency. See: Title IX

You can also speak confidentially and learn more about available resources at the Relationship and Sexual Violence Prevention Center by calling (314) 935-8761 or visiting the 4th floor of Seigle Hall. See: RSVP Center

Bias Reporting 
The University has a process through which students, faculty, staff and community members who have experienced or witnessed incidents of bias, prejudice or discrimination against a student can report their experiences to the University’s Bias Report and Support System (BRSS) team. See: http://brss.wustl.edu

Center for Diversity and Inclusion (CDI):
The Center for Diversity and Inclusion (CDI) supports and advocates for undergraduate, graduate, and professional school students from underrepresented and/or marginalized populations, creates collaborative partnerships with campus and community partners, and promotes dialogue and social change. One of the CDI’s strategic priorities is to cultivate and foster a supportive campus climate for students of all backgrounds, cultures and identities.
See: diversityinclusion.wustl.edu/

Course calendar and reading




15 Jan Syllabus

Group Activity: What is data science?

17 Jan Lecture 1 – Data Science

  • What is data science?
  • What is machine learning?
  • DS workflow
  • Data Representation
  • slides: Introduction
  • worksheet 1
  • [DSFS] Ch1
    • What is Data Science?
  • [DSFS] Ch11 (p141-142)
    • Modeling
    • What is Machine Learning?
  • [PDSH] Preface (p xi-xii)
    • What is Data Science?
    • Why Python?
  • [PDSH] Ch5 (p331-342)
    • What is Machine Learning?
  • 9 Data Science Problems
22 Jan Lab 1 – Plant Species Classification

  • Data Exploration with Python
  • NumPy Arrays
  • materials: Lab1
  • [DSFS] Ch2 Python
    • The Basics (p15-26)
  • [PDSH] Ch1
    • IPython (all about notebooks, skip stuff about the shell)
  • [PDSH] Ch2 NumPy  (p33-63, p78-85)
    • Data Types in Python
    • Basics of NumPy Arrays
    • Computation on NumPy Arrays
    • Aggregations
    • Fancy Indexing
24 Jan Lecture 2 – Exploratory Data Analysis

  • Data Types
  • Data Representation
  • Dataset Statistics
  • Visualization
  • slides: EDA
  • worksheet 2
  • [DSFS] Ch3 Visualizing Data
  • [DSFS] Ch10 (p121-132)
    • Exploring your Data
    • Cleaning and Munging
    • Manipulating Data
  • [PDSH] Ch4
    • General Matplotlib Tips (p217-221)
    • Scatter plots (p233-237)
    • Histograms (p245-247)
29 Jan Lab 2 – Analyzing the MoMA Data

  • EDA Process
  • Posing Data Questions
  • Answering Data Questions
  • Pandas DataFrames
  • materials: Lab2
  • [PDSH] Ch3 Pandas (p97-114)
    • Pandas Objects
    • Data Indexing and Selection
  • more answers
31 Jan Lecture 3 – Sentiment Analysis

  • Working with Text Data
  • Scraping Data from the Web
  • Sentiment Prediction
  • Error Rate and Accuracy
  • slides: Sentiment Analysis
  • worksheet 3
  • [DSFS] Ch4
    • Vectors (p49-53)
  • [DSFS] Ch9
    • Reading Files (p105-108)
    • Using APIs (p114-117) 
    • Example: Twitter APIs (p117-120)
  • [DSFS] Ch20 (p239-244)
    • Word Clouds
    • n-gram Models 
5 Feb Lab 3 – Analyzing Movie Reviews

  • Rule-Based Sentiment Prediction
  • Sentiment Classifier
  • Evaluation and Model Comparison
  • materials: Lab3
  • [PDSH] Ch5 (p343-359)
    • Introducing Scikit-Learn
7 Feb Lecture 4 – Regression

  • Least-Squares Method
  • Linear vs Polynomial Regression
  • Model Complexity
  • RMSE and MAE
12 Feb Lab 4 – Predicting Housing Prices

  • Implement 1D Linear Regression
  • Data Exploration
  • Evaluation via RMSE
  • materials: Lab4
  • [DSFS] Ch14 Multiple Regression (p179-183)
    • The Model
    • Least Squares Model
    • Fitting the Model
    • Interpreting the Model
    • Goodness of Fit 
14 Feb  Lecture 5 – Logistic Regression

  • Decision Boundary
  • Probabilistic Classifier
  • Sigmoid/Logistic Function
  • Likelihood
  • Confusion Matrix
  • slides: see lecture below
  • [DSFS] Ch16 Logistic Regression
  • [PDSH] Ch5 Scikit-Learn
    • Classification on Digits (p357)
19 Feb Lab 5 – Detecting Breast Cancer

  • Implement Logistic Regression
  • Evaluation using Confusion Matrix
  •  materials: Lab5
  • [DSFS] Ch16 Logistic Regression
  • [DSFS] Ch11
    • Correctness (p145-147)
21 Feb Lecture 5 – Logistic Regression Revisited

  • Model Parameters and Decision Boundary
  • Likelihood

Lecture 6 – Evaluation and Learning Principles

  • Noise
  • Overfitting
  • Model Selection
  • Learning Curve
  • Sampling Bias


  • slides: Learning Principles
  • worksheet 6
  • [DSFS] Ch11 (p142-147)
    • Overfitting and Underfitting
    • Correctness
  • [PDSH] Ch5
    • Model Validation [ignore everything on cross-validation] (p359-361)
    • Selecting the Best Model (p363)
    • Learning Curves (p370-373)
    • Basis Function Regression (p392-396)
    • Regularization (p396-398)
26 Feb Lab 5 – Detecting Breast Cancer

  • Applying LR to Breast Cancer Prediction
  • Evaluation using Confusion Matrix
  •  materials:
    • cf. Lab5 above
    • add this code to a new cell at the very end
28 Feb  no class
5 Mar Study for Midterm Exam

  • Review
  • Discuss questions/confusion with peers, TAs, instructor

 What to study? 


How to study?

  • Revisit Worksheets
    • do problems again (yes, rewrite your answers!)
    • practice helps to remember the stuff
    • the more practice, the better you will remember
  • Revisit Labs
    • ignore coding
    • focus on Write up! problems and conceptual stuff
  • Practice Retrieval of knowledge
    • quiz your neighbor
    • explain concepts to your neighbor
    • or yourself (aloud!)
    • …believe me it helps you retain your knowledge!
  • Create Note Cards or Summary Sheets
    • encoding concepts in your own writing helps you learn and retain your knowledge!
7 Mar Midterm EXAM in Crow 204

closed book – no notes – no crib/cheat sheet

 …CAUTION: room change!!!!
12 Mar
14 Mar
Spring Break
19 Mar Lab 6 – Ethical Thinking for Data Science

  • Why are ethics important in DS?
  • Example Scenarios
21 Mar Lecture 7 – Clustering

  • Clustering Problem
  • Similarity Measures
  • k-means Algorithm
  • slides: Clustering
  • worksheet 7
  • [DSFS] Ch19 Clustering (p225-232)
    • The Idea
    • The Model
    • Example: Meetups
    • Choosing k
    • Example: Clustering Colors
  • [PDSH] Ch5 (p462-479)
    • k-Means Clustering 
26 Mar Lab 7 – Clustering

  • Explore the k-Means Algorithm
  • Choose k
  • Application
  • materials: Lab7
  • reading: cf. Lecture 7
28 Mar Lecture 8 – Similarity-based Learning

  • k-Nearest Neighbor Model
  • Cross-Validation
  • Input Transformations
  • slides: kNN (annotated)
  • worksheet 8
  • [DSFS] Ch12 k-NN
  • [PDSH] Ch5 Hyperparameters and Model Validation
    • Thinking about Model Validation [cross-validation] (p359-362)
2 Apr Lab 8 – k-NN

  • Explore the k-NN Algorithm
  • Data Scaling
  • Data Standardization
  • materials: Lab8
  • reading: cf. Lecture 8
4 Apr Lecture 9 – Feature Engineering

  • Feature selection
  • Feature learning
  • Quick Intro to
    • Decision Trees
    • Random Forests
    • Neural Networks
  • slides: Feature Engineering (annotated)
  • worksheet 9
  • [DSFS]
    • Ch10 Dimensionality Reduction (p134-139)
    • Ch17 What is a Decision Tree? (p201-203)
    • Ch17 Random Forest (p211)
  • [PDSH]  Ch5 Feature Engineering (p375-381)
9 Apr Lab 9 – Feature Learning

  • Explore a pre-trained NN
  • Feature Learning
  • kNN Retrieval
  • materials: Lab9
  • reading: cf. Lecture 9
11 Apr Lecture 10 – Data Engineering

  • More Insights into Neural Networks
  • Data Augmentation
  • Outlier Detection
16 Apr Lab 10 – Gesture Recognition
18 Apr  Lab 10 – Wrap-up

  • Train our NN
  • Evaluating the trained NN

Lecture 11 – Topic Models

23 Apr Lab 11 – Organizing Text Data

  • Topic Model
  • Wikipedia Data
  • Text Features
  • LDA

Course evaluation: HERE

  • Let’s avoid sampling bias! To do so, we need everyone to fill out the evals! Thanks for taking the time.
  • Incentive: this will count as the graded part of the lab quiz for today’s lab.
25 Apr Semester Review

Pilot Offering Feedback Form

Towards Data Science: What to Study Next? 

8 May  Final EXAM

  • 6-7pm in Crow 201
Homework assignments
  1. Code submission
    • use the following filename: hwX_<your wustlkey>.ipynb
      • for example: hw1_mneumann.ipynb
    • submit the Python notebook (.ipynb file) via file upload
    • do not create a zip file
    • do not add or delete cells in the notebook unless instructed otherwise
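A quick way to double-check a filename before uploading is sketched below. The pattern assumes that a WUSTL key consists only of letters and digits, which this page does not state, and the function name is made up for illustration.

```python
import re

# Assumption: a WUSTL key contains only letters and digits.
HW_FILENAME = re.compile(r"hw(\d+)_([A-Za-z0-9]+)\.ipynb")


def is_valid_hw_filename(filename):
    """Return True if filename looks like hwX_<your wustlkey>.ipynb."""
    return HW_FILENAME.fullmatch(filename) is not None


print(is_valid_hw_filename("hw1_mneumann.ipynb"))      # the example name from above
print(is_valid_hw_filename("hw1_mneumann.ipynb.zip"))  # zip files are not accepted
```

The check only covers the naming rule; it cannot tell whether cells were added to or deleted from the notebook.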
  • 04/19 hw10
  • 04/09 hw9
  • 04/02 hw8
  • 03/26 hw7
    • to get started, finish parts 2 and 3 in Lab7 (those are relevant for/part of hw7)
    • due: TUE 04/02 at 2:30pm
    • extended: WED 04/03 at 2:30pm
    • submit via Gradescope
  • 03/19 hw6
  • 02/26 hw5
    • due: TUE 03/05 at 2:30pm
    • [PDSH]  Ch5 Machine Learning
      • Introducing Scikit-Learn: Ch5 (p343-359)
    • [DSFS] Ch17: Decision Trees
      • What is a Decision Tree? (p201-203)
    • [PDSH]  Ch5 Machine Learning
      • Motivating Random Forests: Decision Trees (p421-426)
    • submit via Gradescope
  • 02/12 hw4
    • due: TUE 02/19 at 2:30pm
    • [PDSH] Ch5 Machine Learning
      • Introducing Scikit-Learn: Ch5 (p343-359)
    • submit via Gradescope
  • 02/05 hw3
    • due: TUE 02/12 at 2:30pm
    • [DSFS] Ch9 Getting Data
      • Reading Files (p105-108)
      • Using APIs (p114-117)
      • Example: Using the Twitter APIs (p117-120)
    • submit via Gradescope
  • 01/24 hw2
    • due: TUE 02/05 at 2:30pm
    • [PDSH] Ch2 NumPy  (p33-63, p78-85)
      • Data Types in Python
      • Basics of NumPy Arrays
      • Computation on NumPy Arrays
      • Aggregations
      • Fancy Indexing
    • submit via Gradescope
  • 01/15 hw1
    • due: TUE 01/22 at 2:30pm
    • [DSFS] Ch2 Python Crash Course
      • The Basics (p15-26)
    • [PDSH] Ch1 IPython and Jupyter Notebooks
      •  all about notebooks, skip stuff about the shell
    • [PDSH] Ch2 NumPy  (p33-63, p78-85)
      • Data Types in Python
      • Basics of NumPy Arrays
      • Computation on NumPy Arrays
      • Aggregations
      • Fancy Indexing
    • submit via Gradescope
Resources and HowTos


There isn’t really a course book for this class. But the following books will be useful. Check them out whenever you see references in the slides or on the course calendar.


We will be using Python and NumPy, SciPy, scikit-learn, pandas, and Matplotlib for the course. All those packages are included in the Anaconda package.


Downloading the Anaconda package will give you access to all packages and toolboxes that will be used in this class. It’s recommended that you go with the newest version.

  1. Download the Anaconda Distribution with the latest version of Python: https://www.anaconda.com/download/#macos
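Once Anaconda is installed, a short check like the following sketch can confirm that the packages used in this class are available. The function name is made up for illustration; the only non-obvious detail is that scikit-learn is imported under the name sklearn.

```python
import importlib.util

# Import names of the packages used in this class;
# scikit-learn is imported as "sklearn".
REQUIRED = ("numpy", "scipy", "sklearn", "pandas", "matplotlib")


def check_packages(names=REQUIRED):
    """Return a dict mapping each import name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

If a package shows up as MISSING, the Python interpreter being run is probably not the one that came with Anaconda.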

Getting Started with Jupyter Notebooks

You can get the notebook server running with the following methods

  1. You can use the user interface to open the notebook
  2. Or you can open the notebook via terminal by running the following command: jupyter notebook

Please ask any questions related to the course materials and homework problems on Piazza. Other students might have the same questions or are able to provide a quick answer.
Any public postings of (partial or full) solutions to homework problems (written or in form of source or pseudo code) will result in a grade of zero for that particular problem for ALL students in the course.