At parties you always get the inevitable question: “So… what do you do?” When I explain that I am a data scientist, the person I am talking to often replies: “But what exactly is data science?” An understandable question, since data science has received a lot of media attention recently and people are starting to see more and more of its possibilities. Although visionaries claim that data scientists have the sexiest job of the 21st century, for many people it remains a vague term, especially since it is often used interchangeably with two related (and also quite vague) terms: big data and business intelligence (BI). Because I truly believe that data science will play a major part in the business environments of the near future, I think it is worth devoting some time and effort to clarifying a few fundamental differences and similarities between these terms.
Increase in Google search queries for ‘data science’ (source: www.google.nl/trends)
In my opinion, big data has been one of the most ambiguously used concepts of recent years. The metaphor comparing big data with teenage sex captures this nicely (and is certainly worth googling if you have not heard it before). The Data Science Department of Berkeley asked 50 experts for their definition of big data. This resulted in an interesting variety of opinions, the core of which could be summarized as:
'The collecting and combining of (mostly voluminous) structured and unstructured datasets from a large variety of internal and external data sources that need to be stored and processed at a high velocity.'
So how does this relate to data science? A striking metaphor (I just love metaphors) for the difference between big data and data science is that of raw materials and the blacksmith. Providing the blacksmith with more ore does not add any value if the blacksmith lacks the skills and tools to forge, draw, bend, punch and weld the raw material. In the same way, just sitting on a big pile of data does not give you a competitive edge. To extract actual value from your (big) data, you need the appropriate data science skills and toolset.
So do you first need big data in order to do data science? The answer is no. Data science techniques applied to a few not-so-big datasets from a simple SQL data warehouse can already give your organization a huge advantage (in contrast to just sitting on all the data in the world). However, the more (relevant) datasets you combine, the more added value a data science approach will generate.
To clarify the difference between BI and data science, I would like to introduce a commonly used framework that classifies data analytics practices into a sequence of four stages. Every organization that uses data can place its practices in at least one of these stages. Each successive stage requires a more specialized skillset but also adds a lot more value to your organization (if executed correctly):
• Descriptive: what happened to Y (such as revenue, profit, customer experience)?
• Diagnostic: why did this happen to Y?
• Predictive: what will happen to Y if I change X (such as price, assortment, marketing expenditures)?
• Action: what to do with X if I want to optimize Y (automated decision making)?
The descriptive stage can be seen as looking at historical data, such as sales reports or scorecards, without explaining why certain changes occurred. Looking at the right data is already very valuable, since the human mind can see certain patterns and relations quite well. However, the human mind is never totally objective, and there are limits to the amount of data it can successfully process.
To move up to the diagnostic stage it is necessary to find relations in your data. As said before, this can be done either by just looking at a lot of numbers or by visualizing your data in an intuitive way. Although I still sometimes see people focusing on the former, I truly believe that an intuitive visualization outperforms a table with a lot of numbers any day. Another way to find relations in your data is by using statistics, machine learning or data mining techniques. This is also the stage where BI stops (intuitively visualizing patterns) and data science begins (finding statistically significant patterns).
The human brain is trained to recognize patterns through intuitive visualizations rather than finding them in a plain table filled with numbers.
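To make "finding statistically significant patterns" a bit more concrete, here is a minimal sketch of one possible check: a permutation test on the correlation between price and demand. The numbers are made up for illustration, and the function names (`pearson`, `permutation_p_value`) are my own; this is just one of many ways to test whether a pattern could be a coincidence.

```python
import random

# Hypothetical weekly observations (illustrative numbers, not real data).
prices = [8, 9, 10, 11, 12, 13, 14, 15]
demand = [120, 112, 101, 88, 80, 75, 61, 50]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Share of random reshuffles that correlate at least as strongly as the data."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

r = pearson(prices, demand)
p_value = permutation_p_value(prices, demand)
print(f"correlation: {r:.3f}, permutation p-value: {p_value:.4f}")
```

A small p-value means that randomly reshuffled data almost never shows a relation this strong, which is what separates a statistically supported pattern from one that merely looks convincing in a chart.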
Once you find a statistical relation between, for example, price and demand, it can be used to predict what will happen if you or your competitor change prices. Although this requires considerable knowledge of data science techniques, it also adds much more value to the quality of your decision making than just looking at what happened in the past.
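As a rough illustration of this predictive step, the sketch below fits a straight line through a handful of price and demand observations and uses it to estimate demand at a price that has not been tried yet. The observations are invented, a linear relation is an assumption made for simplicity, and the helper names (`fit_line`, `predict_demand`) are hypothetical.

```python
# Hypothetical observations: (price in euros, units sold per week).
observations = [(8.0, 120), (9.0, 112), (10.0, 101), (11.0, 88), (12.0, 80)]

def fit_line(points):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict_demand(price, intercept, slope):
    """Estimated units sold at the given price under the fitted line."""
    return intercept + slope * price

intercept, slope = fit_line(observations)
estimate = predict_demand(10.5, intercept, slope)
print(f"each extra euro changes demand by about {slope:.1f} units")
print(f"estimated demand at a price of 10.50: {estimate:.1f} units")
```

In practice the relation is rarely this clean, and a real model would also account for things like seasonality and competitor prices, but the principle of learning a relation from the past and applying it to a new situation is the same.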
The next and final stage is reached when your organization (or your data science algorithm) knows how to set your prices, adjust your planning and recommend the right products in order to optimize your sales (or any other desired objective). For many years, computational power was a bottleneck for such algorithms. Nowadays, however, it is possible to scale up with distributed processing or even to let algorithms run in the cloud on a dedicated server platform somewhere else in the world. This gives data scientists the freedom to be more creative in developing, combining and fine-tuning their algorithms into something very valuable for an organization and its customers. I think a great example is the Spotify Discover Weekly recommendation algorithm. The algorithm understands my taste in music, knows all the songs in the world and sends me a list of 30 songs on a weekly basis: a list of 30 different artists, still unknown to me, that I will probably like according to my behavioral pattern. I have always wanted such a friend!
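To give a feeling for what automated decision making can look like in its simplest form, the sketch below picks the price that maximizes expected revenue under an assumed linear demand model. The coefficients, the price range and the function names are all hypothetical; a real system would use a far richer model and objective.

```python
# Assumed demand model: units sold = INTERCEPT + SLOPE * price.
# Both coefficients are hypothetical values chosen for illustration.
INTERCEPT = 204.2
SLOPE = -10.4

def expected_demand(price):
    """Expected units sold at a given price (never below zero)."""
    return max(0.0, INTERCEPT + SLOPE * price)

def expected_revenue(price):
    """Expected revenue is price times expected units sold."""
    return price * expected_demand(price)

# Candidate prices from 5.00 to 15.00 in steps of 0.05, built from
# integer cents to avoid floating point drift.
candidate_prices = [cents / 100 for cents in range(500, 1501, 5)]

# The "action": choose the candidate with the highest expected revenue.
best_price = max(candidate_prices, key=expected_revenue)
print(f"recommended price: {best_price:.2f}")
print(f"expected revenue:  {expected_revenue(best_price):.2f}")
```

A simple grid search is enough here because there is only one decision variable. With many products, constraints and interacting decisions, this is exactly where the extra computational power mentioned above becomes important.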
So data science is the science (and art) of extracting value from your data by using predictive models (from the fields of statistics, machine learning and data mining). And the data does not necessarily need to be big in order to extract such value from it. We can expect the attention data science receives to increase in the years to come. I therefore hope this explanation helps fellow data scientists avoid talking about work too much at weekend parties (although I actually think that most of them really don’t mind).