Why Data Science?

How to know when you need science

Michael Selik
mike@selik.org


Yesterday, I attended a talk by Hilary Mason at a CTO School meetup. Among other things, she responded to a question from the audience, “How do I know when I need data science?” She replied, “When you feel guilty about your decisions.”

At its core, data science is just science. It is a philosophy that the world follows rules and those rules can be known. Data science is an epistemology. We believe that knowledge comes from systematic analysis of observed events. There is no easy definition for what can be accepted as known. I know it when I see it. The clearest way to describe it is in contrast to case study analysis, which a data scientist would view as just an anecdote. Anecdotes are still interesting. They are useful for generating hypotheses and learning about a topic, but one should not mistake a hypothesis for knowledge. Someone who shares the philosophy of science would feel guilty making decisions based on anecdotes without data, without evidence.

Data science is the particular flavor of science that uses certain tools for systematic analysis. Most common are statistics and statistical machine learning. Some practitioners use other forms of computational modeling, such as agent-based models. Perhaps more defining of data scientists is the taste for large and complicated datasets, with too many observations, too many variables, or too much mess. These problems require such tools as parallelization to handle many observations, dimensionality reduction and regularization to handle many variables, and natural language processing or interpolation to handle the mess.

In all data science, the language of probability is dominant. We like to measure how confident we are in our knowledge. When we see a pattern in reality, we want to know the likelihood that we are being fooled by randomness, seeing ghosts in the data. We are not afraid to make uninformed decisions, but then we experiment, gather evidence, and update our beliefs.
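
To make that concrete, here is a minimal sketch in Python of the two habits just described: asking how likely a pattern is under pure chance, and updating a belief as evidence arrives. The scenario (8 wins in 10 trials), the function names, and the choice of a Beta prior are all illustrative assumptions, not a prescribed method.

```python
from math import comb


def chance_of_fluke(successes, trials, p_null=0.5):
    """Probability of at least `successes` wins in `trials` tries
    if only chance (win rate `p_null`) were at work.
    This is a one-sided exact binomial tail probability."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def update_belief(prior_a, prior_b, successes, failures):
    """Beta-Binomial update: a Beta(prior_a, prior_b) belief about a
    win rate becomes Beta(prior_a + successes, prior_b + failures)."""
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


if __name__ == "__main__":
    # Hypothetical experiment: 8 wins out of 10 trials.
    # How often would a coin flip do at least this well?
    print(chance_of_fluke(8, 10))  # 0.0546875

    # Start from an uninformed belief, Beta(1, 1), then update on the evidence.
    a, b = update_belief(1, 1, successes=8, failures=2)
    print(beta_mean(a, b))  # 0.75
```

Starting from Beta(1, 1) mirrors the willingness to make an uninformed decision first. The belief moves only as far as the evidence warrants, and another round of experiments would simply be another call to `update_belief`.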

Informing decisions is the heart of data science. As the blend of software engineering with statistics and the scientific method, data science follows the beat of practicality. The answer to a data science research question should deliver actionable information, new knowledge that enables a choice that previously was a guess.