Michael Selik
mike@selik.org
Data science has become popular recently. I think the term is misleading. First of all, data science is redundant: you can't do science without data, and data is useless without science. I think data science became popular because data mining was popular some years ago and people gradually realized that data mining was unsatisfactory because it lacked science. But let's go ahead and call ourselves data scientists, because it's even harder to explain what an econometrician is.
We call ourselves data scientists for three reasons. First, to emphasize that we are using larger amounts of data than was ever possible before. The magnitude of a dataset that qualifies as "big" differs in each field, but the general rule is that we now have so much more data that there is a qualitative difference in what can be done with the data and what information the data provides. The old methods are not just slow; they are now impossible. The old (small) datasets often had fewer data points than the number of variables we now examine. Because we have bigger data, we can now look at bigger problems.
Second, we call ourselves data scientists to emphasize that we are doing science. It’s easiest to contrast this with data mining. If you go out using statistics and machine learning to crunch numbers on some big data, you’ll quickly find statistically significant correlations. Many of these correlations will be surprising. You might find that the stock market responds to the length of skirts. You might find that the rise in global temperatures corresponds to a decline in pirate populations. If you use these correlations to make predictions, well, I wouldn’t make any bets.
Data scientists use the scientific method to construct a falsifiable hypothesis and test that hypothesis using statistical or machine learning techniques applied to systematically collected data. The data need not have been collected specifically for the science experiment, but science relies on knowing the method of collection. We’ll get to that part later.
Third, we call ourselves data scientists because we want jobs. The phrase originated as a job title. Over the past few decades, people have been doing data science under different names. Large organizations have always had trouble recognizing the particular mix of skills that make a good data scientist. Human resources needs keywords. It wasn’t that long ago that Wall Street discovered computational chemists make good financiers. Data science brings us all under one roof. A few years ago, Hal Varian said that statisticians will be sexy. Let’s make that true.
You all know how to code, so we could jump straight into the big data and machine learning fun, but first I’m going to teach you the fundamentals of probability and statistics. You can go learn how to implement a particular algorithm from Wikipedia or some other text, or more likely just download a package that’ll do things for you. I want to make sure you don’t miss the high-level syntax for the compiled bytecode. By the end of the class, you’ll have practiced not just executing algorithms on big data, but critiquing the appropriateness of the data, the methods applied, the interpretation of the results, and most importantly the purpose of the questions asked.
We will have two two-hour lectures each week, on Mondays and Thursdays. Homework will be assigned in Monday's lecture and due before Thursday's lecture. For students who want to get their hands dirty with larger and messier data, we will have weekly practicum assignments, discussed after Thursday's lecture and due on Monday.
Oct 9
We will start by exploring the purpose of data science — why one should use data science methods and common types of problems that can be solved with data science.
Next we will set up our scientific computing environment by installing Python and several useful libraries (numpy, scipy, matplotlib, pandas, statsmodels, sklearn, ipython, and their dependencies).
We will finish the day with a short prediction exercise, exploring the rationale behind expected value and various methods of measuring prediction error.
Homework: Complete the setup of your programming environment, if you were unable to during class. Code a set of simulations using the standard random module.
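One possible shape for this homework (the specific simulations are up to you; the die-roll setup below is just an assumed example): simulate a random process with the standard random module, estimate its expected value, and compare two common ways of measuring the error of a constant prediction.

```python
import random

# Simulate rolls of a fair six-sided die; the expected value is 3.5.
rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(100_000)]

def mse(prediction, observations):
    """Mean squared error of a single constant prediction."""
    return sum((x - prediction) ** 2 for x in observations) / len(observations)

def mae(prediction, observations):
    """Mean absolute error of a single constant prediction."""
    return sum(abs(x - prediction) for x in observations) / len(observations)

sample_mean = sum(rolls) / len(rolls)
print(round(sample_mean, 2))   # close to the expected value of 3.5

# Among all constant predictions, the sample mean minimizes squared error.
assert mse(sample_mean, rolls) <= mse(3.0, rolls)
```

Squared error and absolute error reward different predictions, which is part of what the exercise is meant to surface.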
Oct 11
Using the generated data from your homework, and some publicly available data, we will explore how to describe probability distributions using visualizations and summary statistics. We will discuss various types of distributions and the real phenomena that create them.
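As a sketch of the kind of summary we will compute (the samples below are simulated stand-ins, not the homework or public data): compare a symmetric distribution with a right-skewed one using a few summary statistics.

```python
import random
import statistics

# Simulated examples of two distribution shapes (assumed, for illustration):
# a symmetric normal and a right-skewed exponential.
rng = random.Random(1)
normal = [rng.gauss(0, 1) for _ in range(10_000)]
skewed = [rng.expovariate(1.0) for _ in range(10_000)]

for name, sample in [("normal", normal), ("exponential", skewed)]:
    print(name,
          round(statistics.mean(sample), 2),
          round(statistics.median(sample), 2),
          round(statistics.stdev(sample), 2))

# For the symmetric distribution, the mean and median nearly coincide;
# for the right-skewed one, the mean sits above the median.
```

Noticing when mean and median disagree is often the first clue about which real phenomenon generated the data.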
Practicum: Analyze data from New York's Metropolitan Transportation Authority (MTA).
Oct 15
Interactions are what make life interesting. The first two lectures gave us a good understanding of the probability distribution of one variable. In the third lecture we will look at the joint probability distributions of two variables. We will discuss conditional expectation, conditional variance, and Bayes’ theorem.
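A worked example of Bayes' theorem (the numbers are illustrative, not from the Ngram exercise): how likely is a condition given a positive test result?

```python
# A test is 99% sensitive and 95% specific for a condition with 1% prevalence.
p_condition = 0.01
p_pos_given_condition = 0.99    # sensitivity
p_pos_given_healthy = 0.05      # false positive rate (1 - specificity)

# Total probability of a positive result, across both groups.
p_pos = (p_pos_given_condition * p_condition
         + p_pos_given_healthy * (1 - p_condition))

# Bayes: P(condition | positive) = P(positive | condition) P(condition) / P(positive)
p_condition_given_pos = p_pos_given_condition * p_condition / p_pos
print(round(p_condition_given_pos, 3))   # 0.167
```

Even with an accurate test, a rare condition stays fairly unlikely after one positive result, which is why conditioning matters.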
For an exercise in calculating conditional probabilities, we will use Hadoop to examine how word choice changed over time based on the Google Ngram Dataset.
Homework: Continue exploring the Google Ngram Dataset. Use conditional probability to find interesting historical phenomena that you can share with the class.
Oct 18
Linear regression is the workhorse of statistical analysis. We will discuss the rationale for and mechanism of ordinary least squares linear regression. This will include measuring goodness-of-fit; statistical significance will be left for the next lecture.
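A minimal sketch of one-variable ordinary least squares, computed from the closed-form solution rather than a library (the data points are made up for illustration):

```python
def ols(xs, ys):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]   # roughly y = 2x
intercept, slope = ols(xs, ys)

# Goodness of fit: R^2 compares residual error to total variation.
y_bar = sum(ys) / len(ys)
predictions = [intercept + slope * x for x in xs]
ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
ss_tot = sum((y - y_bar) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot
print(round(slope, 2), round(r_squared, 3))
```

In class we will lean on statsmodels for this; the point of the closed form is to see what the library is doing.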
Practicum: Prepare the USDA National Nutrient Database for Standard Reference for analysis.
Oct 22
When we express a belief, such as that a random variable has a particular distribution or that two random variables are correlated, we can measure our confidence that the evidence supports that belief. This lecture discusses confidence and statistical significance in the context of linear regression.
Homework: Explore the Statistical Abstract of the United States. Construct a hypothesis that, when tested, would provide actionable information for the government, a business, a non-profit, or yourself. Test that hypothesis using linear regression. Write an analysis of your results. Include visualizations and a discussion of why the analysis was interesting.
Oct 25
We all know that correlation does not imply causation. This lecture explores what allows us to infer causality. We will discuss some of the assumptions made by ordinary least squares regression and how to avoid unobserved variable problems and selection bias.
Practicum: Analyze the USDA nutrient database with linear regression.
Oct 29
We introduce the nearest neighbors family of algorithms with k-nearest neighbors and contrast this with a high-dimensional linear regression model.
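The mechanism is simple enough to fit in a few lines. A toy k-nearest-neighbors classifier on 2-D points (the training points and labels below are invented for illustration):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of ((x, y), label); returns the majority label
    among the k training points closest to the query."""
    neighbors = sorted(train,
                       key=lambda item: (item[0][0] - query[0]) ** 2
                                      + (item[0][1] - query[1]) ** 2)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_predict(train, (1, 1)))   # "a"
print(knn_predict(train, (5, 4)))   # "b"
```

Unlike linear regression, there is no training step at all: the "model" is the data itself plus a distance function.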
Homework: TBD
Nov 1
Nearest neighbors looks great in low-dimensional space, but quickly runs into trouble as the number of variables increases. We will explore the behavior of k-nearest neighbors in many dimensions and contrast it again with linear regression. To compare the success of our algorithms, we will use cross-validation.
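A sketch of k-fold cross-validation in pure Python (the "model" here is just the training mean, an assumed placeholder for whatever predictor we are evaluating):

```python
def k_fold_mse(data, k=5):
    """Split data into k folds; for each fold, fit on the rest
    (here, fitting = taking the mean) and score on the held-out fold."""
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        prediction = sum(train) / len(train)        # "fit" on training folds
        errors.append(sum((x - prediction) ** 2 for x in test) / len(test))
    return sum(errors) / k

data = [float(x) for x in range(100)]
print(round(k_fold_mse(data), 1))
```

The same loop works for k-nearest neighbors or linear regression; only the fit-and-predict step in the middle changes.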
Practicum: Train a predictive model of which movies pass the Bechdel Test. Collect data from the Open Movie Database API.
Nov 5
Continuing the comparison of linear models with nearest neighbors, we will formalize the balance of assumptions and complexity as a tradeoff between bias and variance.
Homework: TBD
Nov 8
A popular use of nearest neighbors is the collaborative filtering family of recommendation systems. We will implement a simple movie recommendation algorithm using MovieLens movie ratings data. Through this exercise we will explore different distance measurements and how they affect the nearest neighbors algorithm.
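To preview how the choice of distance matters, here are two common measures applied to toy user-rating vectors (the users and ratings are invented; the real exercise uses MovieLens data):

```python
import math

def euclidean(u, v):
    """Straight-line distance between two rating vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 minus cosine similarity: small when vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = (math.sqrt(sum(a * a for a in u))
             * math.sqrt(sum(b * b for b in v)))
    return 1 - dot / norms

alice = [5, 4, 1]   # ratings for three movies
bob   = [4, 5, 1]
carol = [2, 1, 5]

# Alice and Bob rate similarly; Carol has opposite taste.
assert euclidean(alice, bob) < euclidean(alice, carol)
assert cosine_distance(alice, bob) < cosine_distance(alice, carol)
```

Cosine distance ignores how generous a rater is overall and looks only at the pattern of their ratings, which often suits recommendation problems better than euclidean distance.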
Practicum: TBD
Nov 12
K-means is a simple clustering algorithm. We will discuss unsupervised learning by experimenting with k-means. Following on the last lecture’s movie recommendation system, we will use k-means to reduce its complexity.
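A minimal one-dimensional k-means (Lloyd's algorithm) showing the assign-then-update loop; in practice we would call sklearn's KMeans, and the data below are made up:

```python
import random

def k_means(points, k, iterations=20, seed=0):
    """Alternate between assigning points to the nearest center
    and moving each center to the mean of its cluster."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                                   # assignment step
            nearest = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]    # update step
                   for i, c in enumerate(clusters)]
    return sorted(centers)

points = [1.0, 1.2, 0.8, 9.9, 10.1, 10.0]
print(k_means(points, k=2))   # centers near 1.0 and 10.0
```

For the recommendation system, clustering users first means each prediction only has to search within one cluster instead of the whole dataset.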
Homework: TBD
Nov 15
We will look at the problems of linear probability models — heteroskedasticity and violating the rules of probability — and solve those problems with logistic regression.
Practicum: Write a proposal for your final project.
Nov 19
We have been using optimization as just part of the regression process. Now we will examine it more closely with hill climbing and randomized hill climbing algorithms.
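A sketch of randomized hill climbing on a one-dimensional objective (the objective function and step size are assumed for illustration):

```python
import random

def hill_climb(objective, start, step=0.1, iterations=10_000, seed=0):
    """Propose random nearby points; keep a move only if it improves
    the objective. Greedy, so it can get stuck on local maxima."""
    rng = random.Random(seed)
    best_x, best_y = start, objective(start)
    for _ in range(iterations):
        candidate = best_x + rng.uniform(-step, step)
        y = objective(candidate)
        if y > best_y:
            best_x, best_y = candidate, y
    return best_x

# Maximize a concave function with a known peak at x = 3.
peak = hill_climb(lambda x: -(x - 3) ** 2, start=0.0)
print(round(peak, 2))   # close to 3.0
```

On a concave objective like this one, hill climbing finds the peak reliably; the interesting behavior appears when the objective has many local maxima.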
Homework: TBD
Thanksgiving
Nov 26
With our new optimization methods, we will explore how different ways of measuring error affect linear regression in high dimensions.
Homework: Get started on your final project.
Nov 29
We will continue our investigation of optimization by looking at hierarchical optimization and constrained optimization. The curse of dimensionality expresses itself again here. The bias / variance tradeoff also returns as a balance between efficiency and fragility.
Practicum: Get preliminary results for your final project.
Dec 3
This lecture will scratch the surface of reinforcement learning with a simple Markov decision process.
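A tiny Markov decision process solved by value iteration (the states, transitions, and rewards below are invented for illustration, not from the lecture):

```python
# transitions[state][action] = list of (probability, next_state, reward)
transitions = {
    "poor": {"work": [(1.0, "rich", 10.0)],
             "rest": [(1.0, "poor", 1.0)]},
    "rich": {"work": [(0.5, "rich", 10.0), (0.5, "poor", 10.0)],
             "rest": [(1.0, "rich", 5.0)]},
}
gamma = 0.9   # discount factor for future rewards

# Value iteration: repeatedly apply the Bellman optimality update.
values = {s: 0.0 for s in transitions}
for _ in range(200):
    values = {
        s: max(sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
               for outcomes in actions.values())
        for s, actions in transitions.items()
    }
print({s: round(v, 1) for s, v in values.items()})
```

Because the discount factor is below one, the update is a contraction and the values converge regardless of the starting guess.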
Homework: Write a formal report for your final project.
Dec 6
There are many machine learning algorithms, but trees, neural networks, and support vector machines are some of the most popular. We will have a brief survey of the mechanism of each algorithm and try to mimic its behavior with linear regression.
Dec 10
Scaling these algorithms requires a discussion of asymptotic complexity. We will revisit the MapReduce algorithm for parallelization and talk about the fast-querying concerns of data warehouses.
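The MapReduce pattern can be sketched on a single machine with the classic word-count example; Hadoop's contribution is running the same map, shuffle, and reduce steps across many machines (the documents here are toy data):

```python
from collections import defaultdict
from itertools import chain

def mapper(document):
    """Map step: emit a (word, 1) pair for every word."""
    for word in document.lower().split():
        yield word, 1

def reducer(word, counts):
    """Reduce step: combine all counts for one word."""
    return word, sum(counts)

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Shuffle step: group the mapped pairs by key.
grouped = defaultdict(list)
for word, count in chain.from_iterable(map(mapper, documents)):
    grouped[word].append(count)

counts = dict(reducer(w, c) for w, c in grouped.items())
print(counts["the"], counts["fox"])   # 3 2
```

The map and reduce functions never see more than one document or one key at a time, which is exactly what makes the pattern easy to parallelize.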
Homework: Prepare final project presentation.
Dec 13
Students will present their final projects. General Assembly will invite hiring managers and other interested people from the community. We will introduce the presentations with a short discussion of the characteristics of a good data scientist and why science is profitable.
These textbooks are currently available for free download.
Some not-free texts that I also recommend are:
I am working hard to provide a complete data science course and I expect students to work hard with me. Also, I am going to have fun teaching and I expect students to have fun learning with me. Students should be familiar with programming, though professional experience is not necessary.
Most of the code examples will be written in Python. Some examples will be in R. It’s not necessary to know either language before the class, but it is necessary to learn them quickly. Don’t worry, both R and Python are easy to learn.
Unfortunately, setting up Python on your computer with all the necessary library packages is not a trivial task. I drafted a guide for setting up Python on OS X. You may choose to use a service such as PythonAnywhere instead. In fact, PythonAnywhere would be very useful for collaborating with your classmates, because you can share your console in real time.
Grades reflect the following criteria:
I will use a machine learning algorithm to cluster the students using each assignment and quiz as a separate variable. I will then assign a letter grade to each cluster.
Each week, students will be responsible for a short paper (one or two pages) describing an analysis of publicly available data. The assignments will cover the same topics as that week’s lecture. Most weeks, the instructor will demonstrate a version of the assignment in class, usually using a subset of the same data that the students will use.
Students may form teams and work on the assignments together. However, teams will be expected to accomplish more than an individual and will be graded more strictly.
Students are encouraged to publish their work as a blog post.
Students in the practicum section of the course will, with the instructor’s guidance, choose a project complex enough to demonstrate the student’s ability to succeed as a professional data scientist. Ideally this should cover all the basic skills of a data scientist and at least one area of machine learning. Students may form teams to take on larger projects, but expectations for accomplishment will be correspondingly larger as well.
Students in the lecture-only section of the course will be assigned a final project topic that will require a broad set of skills from the course, but will not involve intensive data collection and transformation work.
Because good communication of results is integral to the success of a data scientist, the students in the practicum section will present their projects as part of the final lecture. Hiring managers and other interested people from the data community will be invited to attend.
Students will not be graded on the in-class exercises. I mention them here only to note that students are encouraged to work together on these exercises.
If we were to break down the skills of a data scientist into problem articulation, data exploration, data transformation, model construction, model validation, and storytelling, then the components of these skills could be further classified into indicators of proficiency and expertise. These indicators will guide the grading of the weekly assignments, final paper and final presentation. Most assignments are intended to demonstrate only a few indicators.
Problem articulation
Indicators of proficiency:
Can explain a problem in terms of
Can explain the benefit of solving the problem
Indicators of expertise:
Data exploration
Indicators of proficiency:
Indicators of expertise:
Data transformation
Indicators of proficiency:
Indicators of expertise:
Model construction
Indicators of proficiency:
Indicators of expertise:
Model validation
Indicators of proficiency:
Indicators of expertise:
Storytelling
Indicators of proficiency:
Indicators of expertise: