Teaching Data Science

A syllabus for General Assembly

Michael Selik
mike@selik.org

Abstract

This is an overview of Data Science at General Assembly. This course will take us through the basics of probability, statistics, and the scientific method. We will also survey common machine learning methods and discuss techniques for handling big data. If you do the assignments with gusto, by the end of the course you will have a portfolio of projects that demonstrates your ability to do the job of a data scientist. There are many statistics and machine learning teaching resources available for free on the internet. This course is unique in its ambition to provide depth in statistical reasoning, breadth in machine learning, and practical application of theory.

Table of Contents

  1. Introduction
  2. Schedule
  3. Recommended texts
  4. Course requirements
  5. Grading and evaluation

Introduction

Data science has become popular recently, though I think it’s a misleading term. First of all, “data science” is redundant: you can’t do science without data, and data is useless without science. I think data science became popular because data mining was popular some years ago and people gradually realized that data mining was unsatisfactory because it lacked science. But let’s go ahead and call ourselves data scientists, because it’s even harder to explain what an econometrician is.

We are calling ourselves data scientists to emphasize three things. First, we call ourselves data scientists to emphasize that we are using larger amounts of data than was ever possible before. The magnitude of the dataset that qualifies as “big” differs in each field, but the general rule is that we now have so much more data that there is a qualitative difference in what can be done with the data and what information the data provides. The old methods are not just slow, they are now impossible. The old (small) data often had fewer data points than the number of variables we now examine. Because we have bigger data, we can now look at bigger problems.

Second, we call ourselves data scientists to emphasize that we are doing science. It’s easiest to contrast this with data mining. If you go out using statistics and machine learning to crunch numbers on some big data, you’ll quickly find statistically significant correlations. Many of these correlations will be surprising. You might find that the stock market responds to the length of skirts. You might find that the rise in global temperatures corresponds to a decline in pirate populations. If you use these correlations to make predictions, well, I wouldn’t make any bets.

Data scientists use the scientific method to construct a falsifiable hypothesis and test that hypothesis using statistical or machine learning techniques applied to systematically collected data. The data need not have been collected specifically for the science experiment, but science relies on knowing the method of collection. We’ll get to that part later.

Third, we call ourselves data scientists because we want jobs. The phrase originated as a job title. Over the past few decades, people have been doing data science under different names. Large organizations have always had trouble recognizing the particular mix of skills that makes a good data scientist. Human resources needs keywords. It wasn’t that long ago that Wall Street discovered computational chemists make good financiers. Data science brings us all under one roof. A few years ago, Hal Varian said that statistician would be the sexy job of the coming decade. Let’s make that true.

You all know how to code, so we could jump straight into the big data and machine learning fun, but first I’m going to teach you the fundamentals of probability and statistics. You can go learn how to implement a particular algorithm from Wikipedia or some other text, or more likely just download a package that’ll do things for you. I want to make sure you don’t miss the high-level syntax for the compiled bytecode. By the end of the class, you’ll have practiced not just executing algorithms on big data, but critiquing the appropriateness of the data, the methods applied, the interpretation of the results, and most importantly the purpose of the questions asked.

Schedule

We will have two two-hour lectures each week, on Mondays and Thursdays. Homework will be assigned at Monday’s lecture and will be due before Thursday’s lecture. For students who want to get their hands dirty with larger and messier data, we will have weekly practicum assignments, discussed after Thursday’s lecture and due on Monday.

  1. Probability (Oct 9)
  2. Statistics (Oct 11)
  3. Covariance (Oct 15)
  4. Regression (Oct 18)
  5. Confidence (Oct 22)
  6. Causality (Oct 25)
  7. Nearest Neighbors (Oct 29)
  8. Curse of Dimensionality (Nov 1)
  9. Bias vs Variance (Nov 5)
  10. Recommendations (Nov 8)
  11. Clustering (Nov 12)
  12. Logistic Regression (Nov 15)
  13. Hill Climbing (Nov 19)
  14. Thanksgiving (Nov 22)
  15. Regularization (Nov 26)
  16. Efficiency vs Fragility (Nov 29)
  17. Exploration vs Exploitation (Dec 3)
  18. Decision Trees, Neural Networks, Support Vector Machines (Dec 6)
  19. Big Data (Dec 10)
  20. Evidence-Based Management (Dec 13)

1. Probability

Oct 9

We will start by exploring the purpose of data science — why one should use these methods and what types of problems they can solve.

Next we will set up our scientific computing environment by installing Python and several useful libraries (numpy, scipy, matplotlib, pandas, statsmodels, sklearn, ipython, and their dependencies).

We will finish the day with a short prediction exercise, exploring the rationale behind expected value and various methods of measuring prediction error.

Homework: Complete the setup of your programming environment, if you were unable to during class. Code a set of simulations using the standard random module.
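
To give a flavor of the homework, here is a minimal sketch using only the standard random module; the die-rolling setup and the choice of error metrics are illustrative, not the exact assignment.

    import random

    # Simulate rolling two six-sided dice many times.
    rolls = [random.randint(1, 6) + random.randint(1, 6) for _ in range(10000)]

    # The expected value of the sum is 7; use it as a constant prediction.
    prediction = 7

    # Two common ways of measuring prediction error.
    mae = sum(abs(r - prediction) for r in rolls) / len(rolls)    # mean absolute error
    mse = sum((r - prediction) ** 2 for r in rolls) / len(rolls)  # mean squared error

    print("mean roll:", sum(rolls) / len(rolls))
    print("MAE:", mae, "MSE:", mse)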

2. Statistics

Oct 11

Using the generated data from your homework, and some publicly available data, we will explore how to describe probability distributions using visualizations and summary statistics. We will discuss various types of distributions and the real phenomena that create them.
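
For example, a short sketch of describing a distribution with pandas and matplotlib; the exponential sample here stands in for whatever data you generated or downloaded.

    import random

    import matplotlib.pyplot as plt
    import pandas as pd

    # Stand-in data: a right-skewed (exponential) sample.
    data = pd.Series([random.expovariate(1.0) for _ in range(1000)])

    # Summary statistics: count, mean, standard deviation, min, quartiles, max.
    print(data.describe())

    # A histogram is usually the quickest way to see the shape of a distribution.
    data.hist(bins=30)
    plt.xlabel("value")
    plt.ylabel("frequency")
    plt.show()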

Practicum: Analyze data from the New York City Metropolitan Transportation Authority.

3. Covariance

Oct 15

Interactions are what make life interesting. The first two lectures gave us a good understanding of the probability distribution of one variable. In the third lecture we will look at the joint probability distributions of two variables. We will discuss conditional expectation, conditional variance, and Bayes’ theorem.
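
As a concrete example of Bayes’ theorem, here is a small sketch with made-up numbers for a diagnostic test; only the formula is the point.

    def bayes(prior, likelihood, false_positive_rate):
        """P(A|B) = P(B|A) * P(A) / P(B), with P(B) expanded by total probability."""
        evidence = likelihood * prior + false_positive_rate * (1 - prior)
        return likelihood * prior / evidence

    # Hypothetical numbers: a 95% sensitive test with a 5% false-positive rate
    # for a condition with 1% prevalence.
    print(bayes(prior=0.01, likelihood=0.95, false_positive_rate=0.05))  # about 0.16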

For an exercise in calculating conditional probabilities, we will use Hadoop to examine how word choice changed over time based on the Google Ngram Dataset.
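
To sketch the flavor of that exercise, here is a rough Hadoop Streaming mapper and reducer in Python that estimate P(word | year) for one example word. The tab-separated field layout assumed here (ngram, year, match_count, ...) follows the published 1-gram files, but check the version you download; the two scripts would be handed to Hadoop Streaming as the -mapper and -reducer programs.

    #!/usr/bin/env python
    # mapper.py -- for each 1-gram record, emit: year TAB match_count TAB is_target
    import sys

    TARGET = "science"  # example word; substitute whatever you are investigating

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        ngram, year, match_count = fields[0], fields[1], fields[2]
        is_target = 1 if ngram.lower() == TARGET else 0
        print("%s\t%s\t%d" % (year, match_count, is_target))

    #!/usr/bin/env python
    # reducer.py -- input arrives sorted by year; estimate P(target word | year).
    import sys

    current_year, total, target_total = None, 0, 0

    def emit(year, total, target_total):
        if year is not None and total > 0:
            print("%s\t%.10f" % (year, target_total / float(total)))

    for line in sys.stdin:
        year, match_count, is_target = line.rstrip("\n").split("\t")
        if year != current_year:
            emit(current_year, total, target_total)
            current_year, total, target_total = year, 0, 0
        total += int(match_count)
        if is_target == "1":
            target_total += int(match_count)

    emit(current_year, total, target_total)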

Homework: Continue exploring the Google Ngram Dataset. Use conditional probability to find interesting historical phenomena that you can share with the class.

4. Regression

Oct 18

Linear regression is the workhorse of statistical analysis. We will discuss the rationale for and mechanism of ordinary least squares linear regression. This will include measuring goodness-of-fit; statistical significance will be left for the next lecture.
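
A minimal sketch of ordinary least squares with statsmodels; the synthetic data here just stands in for a real dataset.

    import numpy as np
    import statsmodels.api as sm

    # Synthetic data: y depends linearly on x, plus noise.
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=200)
    y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=200)

    # statsmodels does not add an intercept automatically.
    X = sm.add_constant(x)
    results = sm.OLS(y, X).fit()

    print(results.params)    # estimated intercept and slope
    print(results.rsquared)  # goodness-of-fit (R-squared)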

Practicum: Prepare the USDA National Nutrient Database for Standard Reference for analysis.

5. Confidence

Oct 22

When we express a belief, that a random variable has a particular distribution or that two random variables are correlated, we can measure our confidence that the evidence supports our beliefs. This lecture discusses confidence and statistical significance in the context of linear regression.
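
With statsmodels, most of these quantities come straight out of the fitted results object. A short sketch, reusing the same kind of synthetic data as the regression sketch above:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=200)
    y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=200)
    results = sm.OLS(y, sm.add_constant(x)).fit()

    print(results.pvalues)         # p-value for each coefficient
    print(results.conf_int(0.05))  # 95% confidence intervals
    print(results.summary())       # full table of t-statistics, p-values, intervals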

Homework: Explore the Statistical Abstract of the United States. Construct a hypothesis that, when tested, would provide actionable information for the government, a business, a non-profit, or yourself. Test that hypothesis using linear regression. Write an analysis of your results. Include visualizations and a discussion of why the analysis was interesting.

6. Causality

Oct 25

We all know that correlation does not imply causation. This lecture explores what allows us to infer causality. We will discuss some of the assumptions made by ordinary least squares regression and how to avoid unobserved variable problems and selection bias.

Practicum: Analyze the USDA nutrient database with linear regression.

7. Nearest neighbors

Oct 29

We introduce the nearest neighbors family of algorithms with k-nearest neighbors and contrast it with a high-dimensional linear regression model.
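
A rough sketch of that comparison with scikit-learn, on synthetic one-dimensional data where the true relationship is nonlinear (the high-dimensional comparison comes in class):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.neighbors import KNeighborsRegressor

    # Synthetic data: a nonlinear relationship with a little noise.
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(300, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

    knn = KNeighborsRegressor(n_neighbors=5).fit(X, y)
    ols = LinearRegression().fit(X, y)

    X_new = np.array([[0.5], [2.0]])
    print("k-NN:", knn.predict(X_new))  # follows the local curve
    print("OLS: ", ols.predict(X_new))  # forced through a single straight line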

Homework: TBD

8. The Curse of Dimensionality

Nov 1

Nearest neighbors looks great in low-dimensional space, but quickly runs into trouble as the number of variables increases. We will explore the behavior of k-nearest neighbors in many dimensions and contrast it again with linear regression. To compare the success of our algorithms, we will use cross-validation.
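
A sketch of the kind of experiment we will run, using cross_val_score from scikit-learn on synthetic data with a growing number of features; the exact dimensions and models are illustrative.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(0)
    n = 500

    for d in (1, 5, 25, 100):
        # A linear signal spread evenly across d features, plus noise.
        X = rng.normal(size=(n, d))
        y = X.sum(axis=1) + rng.normal(scale=1.0, size=n)
        knn = cross_val_score(KNeighborsRegressor(n_neighbors=5), X, y, cv=5).mean()
        ols = cross_val_score(LinearRegression(), X, y, cv=5).mean()
        print("d=%3d  k-NN R^2=%.2f  OLS R^2=%.2f" % (d, knn, ols))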

Practicum: Train a predictive model of which movies pass the Bechdel Test. Collect data from the Open Movie Database API.

9. Bias versus Variance

Nov 5

Continuing the comparison of linear models with nearest neighbors, we will formalize the balance of assumptions and complexity as a tradeoff between bias and variance.

Homework: TBD

10. Recommendations

Nov 8

A popular use of nearest neighbors is the collaborative filtering family of recommendation systems. We will implement a simple movie recommendation algorithm using MovieLens movie ratings data. Through this exercise we will explore different distance measurements and how they affect the nearest neighbors algorithm.
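
To make the role of the distance measure concrete, here is a small sketch of the neighbor-finding step with scikit-learn; the tiny ratings matrix is made up, not MovieLens.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    # Rows are users, columns are movies, values are ratings (0 = unrated).
    ratings = np.array([
        [5, 4, 0, 1, 0],
        [4, 5, 1, 0, 0],
        [0, 1, 5, 4, 4],
        [1, 0, 4, 5, 3],
    ])

    # The query point (user 0) is in the training set, so it appears as its own
    # nearest neighbor with distance 0; the second entry is the most similar other user.
    for metric in ("euclidean", "cosine"):
        nn = NearestNeighbors(n_neighbors=2, metric=metric).fit(ratings)
        distances, neighbors = nn.kneighbors(ratings[:1])
        print(metric, "-> neighbors of user 0:", neighbors[0], distances[0])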

Practicum: TBD

11. Clustering

Nov 12

K-means is a simple clustering algorithm. We will discuss unsupervised learning by experimenting with k-means. Following on the last lecture’s movie recommendation system, we will use k-means to reduce its complexity.
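
For example, a minimal scikit-learn sketch on synthetic two-dimensional points; the MovieLens application will use the same interface on higher-dimensional data.

    import numpy as np
    from sklearn.cluster import KMeans

    # Synthetic 2-D points scattered around three true centers.
    rng = np.random.default_rng(0)
    centers = np.array([[0, 0], [5, 5], [0, 5]])
    X = np.vstack([c + rng.normal(scale=0.6, size=(100, 2)) for c in centers])

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(kmeans.cluster_centers_)  # should land near the true centers
    print(kmeans.labels_[:10])      # cluster assignment for the first few points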

Homework: TBD

12. Logistic Regression

Nov 15

We will look at the problems of linear probability models — heteroskedasticity and violating the rules of probability — and solve those problems with logistic regression.
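
A minimal sketch with statsmodels, on synthetic data whose log-odds really are linear in x; note that the predicted values stay inside [0, 1], unlike a linear probability model.

    import numpy as np
    import statsmodels.api as sm

    # Synthetic binary outcome: log-odds linear in x.
    rng = np.random.default_rng(0)
    x = rng.normal(size=500)
    p = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))
    y = rng.binomial(1, p)

    X = sm.add_constant(x)
    logit = sm.Logit(y, X).fit()
    print(logit.params)          # estimated intercept and slope
    print(logit.predict(X)[:5])  # predicted probabilities, always in [0, 1]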

Practicum: Write a proposal for your final project.

13. Hill climbing

Nov 19

We have been using optimization as just part of the regression process. Now we will examine it more closely with hill climbing and randomized hill climbing algorithms.
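
As a preview, a bare-bones randomized hill climbing sketch in plain Python; the objective function is a toy.

    import random

    def hill_climb(f, x, step=0.1, iterations=1000):
        """Randomized hill climbing: propose a random step, keep it only if it improves f."""
        best = f(x)
        for _ in range(iterations):
            candidate = x + random.uniform(-step, step)
            value = f(candidate)
            if value > best:
                x, best = candidate, value
        return x, best

    # Maximize a simple concave function; the true optimum is at x = 2.
    f = lambda x: -(x - 2) ** 2
    print(hill_climb(f, x=random.uniform(-10, 10)))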

Homework: TBD

14. National Holiday

Nov 22 (Thanksgiving)

15. Regularization

Nov 26

With our new optimization methods, we will explore how different ways of measuring error affect linear regression in high dimensions.
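
A short scikit-learn sketch of the effect we will study: with many features and few observations, L1 and L2 penalties rein in the coefficients. The data and penalty strengths are illustrative.

    import numpy as np
    from sklearn.linear_model import Lasso, LinearRegression, Ridge

    # Fifty features, but only the first two actually matter.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 50))
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

    for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
        model.fit(X, y)
        big = np.sum(np.abs(model.coef_) > 0.1)
        print(type(model).__name__, "coefficients larger than 0.1:", big)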

Homework: Get started on your final project.

16. Efficiency versus Fragility

Nov 29

We will continue our investigation of optimization by looking at hierarchical optimization and constrained optimization. The curse of dimensionality expresses itself again here. The bias / variance tradeoff also returns as a balance between efficiency and fragility.
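
For a taste of constrained optimization, here is a small scipy sketch; the objective and constraint are toys chosen so the answer is easy to check by hand.

    import numpy as np
    from scipy.optimize import minimize

    # Minimize (x - 1)^2 + (y - 2)^2 subject to x + y <= 2.
    objective = lambda v: (v[0] - 1) ** 2 + (v[1] - 2) ** 2
    constraint = {"type": "ineq", "fun": lambda v: 2 - v[0] - v[1]}  # must stay >= 0

    result = minimize(objective, x0=np.zeros(2), method="SLSQP", constraints=[constraint])
    print(result.x)  # roughly [0.5, 1.5]: pushed onto the constraint boundary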

Practicum: Get preliminary results for your final project.

17. Exploration versus Exploitation

Dec 3

This lecture will scratch the surface of reinforcement learning with a simple Markov decision process.
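
To make that concrete, here is a tiny value iteration sketch for a made-up two-state Markov decision process; the transition probabilities and rewards are arbitrary.

    # States: 0 and 1. Actions: "stay" or "move".
    # transitions[state][action] = list of (probability, next_state, reward)
    transitions = {
        0: {"stay": [(1.0, 0, 0.0)], "move": [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
        1: {"stay": [(1.0, 1, 2.0)], "move": [(1.0, 0, 0.0)]},
    }
    gamma = 0.9           # discount factor
    V = {0: 0.0, 1: 0.0}  # value estimates

    # Value iteration: repeatedly back up the best expected discounted reward.
    for _ in range(100):
        V = {
            state: max(
                sum(p * (reward + gamma * V[next_state])
                    for p, next_state, reward in outcomes)
                for outcomes in actions.values()
            )
            for state, actions in transitions.items()
        }

    print(V)  # state 1 is worth more; state 0 gets most of its value by moving there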

Homework: Write a formal report for your final project.

18. Decision Trees, Neural Networks, and Support Vector Machines

Dec 6

There are many machine learning algorithms, but trees, neural networks, and support vector machines are some of the most popular. We will briefly survey the mechanism of each algorithm and try to mimic each one’s behavior with linear regression.
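
For instance, a quick cross-validated comparison with scikit-learn on a small built-in dataset; neural networks live in other libraries or, in newer scikit-learn releases, in sklearn.neural_network.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    for model in (DecisionTreeClassifier(random_state=0), SVC(kernel="rbf")):
        score = cross_val_score(model, X, y, cv=5).mean()
        print(type(model).__name__, "accuracy: %.2f" % score)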

19. Big Data

Dec 10

Scaling these algorithms requires a discussion of asymptotic complexity. We will revisit the MapReduce algorithm for parallelization and talk about the fast-querying concerns of data warehouses.
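
As a reminder of the pattern, here is a toy map/reduce word count using only the standard library; a real job would run on Hadoop, but the map and reduce steps have the same shape.

    from collections import Counter
    from functools import reduce
    from multiprocessing import Pool

    def mapper(chunk):
        """Map step: count the words in one chunk of text."""
        return Counter(chunk.split())

    def reducer(a, b):
        """Reduce step: merge two partial counts."""
        a.update(b)
        return a

    chunks = ["big data big models", "big data small models", "data beats models"]

    if __name__ == "__main__":
        with Pool(2) as pool:
            partial = pool.map(mapper, chunks)      # map phase, in parallel
        print(reduce(reducer, partial, Counter()))  # reduce phase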

Homework: Prepare final project presentation.

20. Evidence-based Management

Dec 13

Students will present their final projects. General Assembly will invite hiring managers and other interested people from the community. We will introduce the presentations with a short discussion of the characteristics of a good data scientist and why science is profitable.

Recommended texts

These textbooks are currently available for free download.

Some not-free texts that I also recommend are:

  • Russell and Norvig. Artificial Intelligence.
  • Wooldridge. Introductory Econometrics.
  • Winkelmann and Boes. Analysis of Microdata.

Requirements

I am working hard to provide a complete data science course and I expect students to work hard with me. Also, I am going to have fun teaching and I expect students to have fun learning with me. Students should be familiar with programming, though professional experience is not necessary.

Most of the code examples will be written in Python. Some examples will be in R. It’s not necessary to know either language before the class, but it is necessary to learn them quickly. Don’t worry, both R and Python are easy to learn.

Unfortunately, setting up Python on your computer with all the necessary library packages is not a trivial task. I drafted a guide for setting up Python on OS X. You may choose to use a service such as Python Anywhere instead. In fact, Python Anywhere would be very useful for collaborating with your classmates, because you can share your console in real time.

Grading and Evaluation

  • Up to 50 points for weekly assignments (5 points each)
  • Up to 10 points for in-class quizzes (1 point each)
  • Up to 25 points for final paper
  • Up to 25 points for final presentation (practicum only)

Grades reflect the following criteria:

  • A: Instructor recommends for work as a data scientist
  • B: Instructor recommends for work in a data-driven organization
  • C: Instructor suggests more study
  • F: Student showed little thought or effort

I will use a machine learning algorithm to cluster the students using each assignment and quiz as a separate variable. I will then assign a letter grade to each cluster.

Weekly assignments

Each week, students will be responsible for a short paper (one or two pages) describing an analysis of publicly available data. The assignments will cover the same topics as that week’s lecture. Most weeks, the instructor will demonstrate a version of the assignment in class, usually using a subset of the same data that the students will use.

Students may form teams and work on the assignments together. However, teams will be expected to accomplish more than an individual and will be graded more strictly.

Students are encouraged to publish their work as a blog post.

Final paper

Students in the practicum section of the course will, with the instructor’s guidance, choose a project complex enough to demonstrate the student’s ability to succeed as a professional data scientist. Ideally this should cover all the basic skills of a data scientist and at least one area of machine learning. Students may form teams to take on larger projects, but expectations for accomplishment will be correspondingly larger as well.

Students in the lecture-only section of the course will be assigned a final project topic that will require a broad set of skills from the course, but will not involve intensive data collection and transformation work.

Final presentation

Because good communication of results is integral to the success of a data scientist, the students in the practicum section will present their projects as part of the final lecture. Hiring managers and other interested people from the data community will be invited to attend.

In-class exercises

Students will not be graded on the in-class exercises; I mention them here only to note that students are encouraged to work together on them.

Rubric

If we were to break down the skills of a data scientist into problem articulation, data exploration, data transformation, model construction, model validation, and storytelling, then the components of these skills could be further classified into indicators of proficiency and expertise. These indicators will guide the grading of the weekly assignments, final paper, and final presentation. Most assignments are intended to demonstrate only a few indicators.

Problem articulation

Indicators of proficiency:

  • Can explain a problem in terms of

    • Metric(s) to be improved
    • Threshold(s) for success
    • Possible actions to solve the problem
    • Information necessary and sufficient for choosing the correct action
  • Can explain the benefit of solving the problem

Indicators of expertise:

  • Can explain a problem such that the data required for the solution can be feasibly collected
  • Can explain the different outcomes that might occur with different problem phrasing

Data exploration

Indicators of proficiency:

  • Can visualize variables with histograms and scatterplots
  • Can calculate summary statistics

Indicators of expertise:

  • Can make reasonable hypotheses about the generating mechanism
    • Type of distribution
    • Relationship to other variables

Data transformation

Indicators of proficiency:

  • Can use standard tools to gather data from public datasets
  • Can use standard tools to reformat, filter, and combine datasets

Indicators of expertise:

  • Can efficiently reformat, filter, and combine large datasets (100+ GB)
  • Can reasonably reconcile misspellings and variant phrasings
  • Can efficiently impute rare missing values

Model construction

Indicators of proficiency:

  • Can use standard tools to implement/execute
    • A multivariate linear regression
    • A logistic regression
    • A clustering algorithm
    • A Markov decision process

Indicators of expertise:

  • Can compare the assumptions made by different techniques
  • Can explain why one technique is better for a problem than another
  • Can explain standard techniques in terms of estimating probability distributions

Model validation

Indicators of proficiency:

  • Can explain the assumptions made by the model
  • Can explain possible causes of selection bias
  • Can explain possible causes of endogeneity

Indicators of expertise:

  • Can test the model for breaking assumptions
  • Can test for overfitting

Storytelling

Indicators of proficiency:

  • Can interpret results of regression and clustering
    • Statistical significance
    • Goodness-of-fit
    • Implications for hypotheses
    • Implications for motivating problem

Indicators of expertise:

  • Can effectively present results in multimedia to a variety of audiences