Everyone has probably heard about the data scientist, often called the sexiest job of the 21st century. However, data science is more than just the data scientist's job: the field spans from the C-level to operations to IT. Therefore, we need experts in multiple domains to implement data science successfully in a business and make it work on all these levels.
A data science team consists of three roles: the data translator, the data scientist and the data engineer, each specializing in a specific subfield. In our blog “Data science is a team sport” we briefly explain these three roles and their place in the data science team. You can also read in depth about the data translator and the data scientist in our previous blogs. In this blog I will dive further into the role of the data engineer.
A common misconception is that the data engineer is the person who only collects, connects and cleans data for the data scientist. We will show that the world of a data engineer is much broader and more complex. The data engineer bridges the gap between the data scientist and your IT department and is responsible for setting up the architecture of the entire data science solution.
So what makes a good data engineer? A data engineer is often someone with a degree in computer science that preferably included a bit of statistics and machine learning. Knowledge of statistics and machine learning is a big advantage, since it greatly improves the data engineer's understanding of the results produced by data scientists.
Furthermore, the data engineer is someone who likes to explore the state of the art in computer science and is able to quickly grasp the essentials of new technologies. The data engineer knows all about IT infrastructures and software development. Preferably, a data engineer has in-depth knowledge of how data stores and distributed systems work internally (for example: query planners, indices, or distributed computing). This is important since the data engineer will continuously be diving into new data stores to collect, connect and clean more data. As more and more data becomes available, an increasing amount of computing power is required to process it, so he or she should also be familiar with distributed systems.
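To give a feeling for what knowing the internals of a data store means, here is a minimal sketch of a query planner reacting to an index. It uses SQLite from the Python standard library purely as an illustration; the table `events`, the index `idx_events_user` and the helper `query_plan` are made-up names, and other data stores will report their plans differently.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the planner's description of how it would execute `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of every plan row.
    return " | ".join(row[-1] for row in rows)


# Hypothetical example table, filled with 10,000 rows for 100 users.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, payload TEXT)"
)
conn.executemany(
    "INSERT INTO events (user_id, payload) VALUES (?, ?)",
    [(i % 100, f"event-{i}") for i in range(10_000)],
)

lookup = "SELECT payload FROM events WHERE user_id = ?"

# Without an index the planner has to scan the whole table.
plan_without_index = query_plan(conn, lookup, (42,))

# With an index on user_id the planner can search the index instead.
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
plan_with_index = query_plan(conn, lookup, (42,))

print("without index:", plan_without_index)
print("with index:   ", plan_with_index)
```

The exact wording of the plan depends on the SQLite version, but the first plan should describe a scan of `events` and the second a search that uses `idx_events_user`.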
You are probably a good data engineer if you answer yes to all of the following questions:
The data engineer works in close cooperation with your IT department and is responsible for designing and setting up the architecture of the entire data solution. He or she programs the data science algorithms efficiently and makes the solution actually work in your daily operations. Next to programming and implementing the solution, the data engineer is the infrastructure expert who makes sure that the data is retrieved via secured networks, and who cleans and prepares the data for the data scientist.
It might seem like the data engineer is very similar to a software engineer. Indeed, a large overlap exists, and since specialized education for data engineering hardly exists, at least in the Netherlands, a software engineer could be suitable for the job as well. The differences are mostly in the areas of focus. I will give some examples to clarify my statement.
The data engineer thinks of data in a different way. For a data engineer every piece of data might be relevant currently, or in the near future, and should not easily be discarded. A software engineer is often only concerned with data required by the software that is written.
A software engineer often has much more freedom in choosing an IT infrastructure on which to build the new software. A data engineer, however, often has to integrate the data science solution into an existing IT infrastructure.
For a software engineer it is important that small parts of the data are quickly accessible in a reliable state. The data engineer, however, often requires large parts of the data set at once, which calls for a totally different tactic in making data accessible.
Of course, many more differences exist, enough to fill a blog of their own. For now, let’s dig into the huge variety the data engineer runs into.
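The difference between the two access patterns can be sketched in a few lines. This is only an illustration under assumptions: the `orders` table and the functions `get_order` and `total_amount` are hypothetical, and SQLite stands in for whatever data store your organization actually uses.

```python
import sqlite3

# Hypothetical example data: 1,000 orders with amounts 1.0 .. 1000.0.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders (id, amount) VALUES (?, ?)",
    [(i, float(i)) for i in range(1, 1001)],
)


def get_order(conn, order_id):
    """Typical application access: fetch one small record by its key."""
    return conn.execute(
        "SELECT id, amount FROM orders WHERE id = ?", (order_id,)
    ).fetchone()


def total_amount(conn, chunk_size=250):
    """Typical analytical access: stream the whole table in chunks."""
    cursor = conn.execute("SELECT amount FROM orders")
    total = 0.0
    while True:
        chunk = cursor.fetchmany(chunk_size)
        if not chunk:
            break
        total += sum(amount for (amount,) in chunk)
    return total


print(get_order(conn, 7))
print(total_amount(conn))
```

The point lookup benefits from keys and indices, while the bulk read benefits from sequential access and chunking so the full data set never has to fit in memory at once.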
The hard, and simultaneously fun, thing about being a data engineer is that every organization (and even every department within it) has different software systems and uses different programming languages and data sources. This variety in IT infrastructures makes the work different, and thereby challenging, every day!
Since the data engineer is the person that brings the data science solution to life in operations, he or she should be able to design the data science solution in such a way that it fits this operational IT landscape. By doing this, he or she bridges the gap between the data scientist and the IT department.
Data science these days is hardly standardized, which makes the solutions different every time. Data scientists require many different tools, which eventually need to be chained together to form a working end-to-end solution. Thus, the data engineer should generalize their ideas about IT infrastructures, software development and data stores. This means that the data engineer should not think in terms of specific implementations or variants, for example Scala or MongoDB, but rather in terms of general ideas such as functional programming or document databases.
The data engineer should learn ideas and concepts instead of specifics
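One way to picture this is code that depends on the concept of a document database rather than on one product. The sketch below is an assumption-laden illustration, not a recommended design: `DocumentStore`, `InMemoryDocumentStore` and `save_prediction` are hypothetical names, and a real implementation would wrap the client library of whichever document database is in use.

```python
from typing import Any, Dict, Optional, Protocol


class DocumentStore(Protocol):
    """The general idea: store and retrieve documents by key."""

    def put(self, key: str, document: Dict[str, Any]) -> None: ...

    def get(self, key: str) -> Optional[Dict[str, Any]]: ...


class InMemoryDocumentStore:
    """One specific variant; a wrapper around a real database would be another."""

    def __init__(self) -> None:
        self._docs: Dict[str, Dict[str, Any]] = {}

    def put(self, key: str, document: Dict[str, Any]) -> None:
        # Store a copy so later changes by the caller do not leak in.
        self._docs[key] = dict(document)

    def get(self, key: str) -> Optional[Dict[str, Any]]:
        doc = self._docs.get(key)
        return dict(doc) if doc is not None else None


def save_prediction(store: DocumentStore, customer_id: str, score: float) -> None:
    """Pipeline code that only knows the concept, not the product behind it."""
    store.put(customer_id, {"customer_id": customer_id, "score": score})


store = InMemoryDocumentStore()
save_prediction(store, "c-1", 0.87)
print(store.get("c-1"))
```

Swapping the in-memory variant for another implementation leaves `save_prediction` untouched, which is the practical payoff of thinking in concepts.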
So now you know who makes the data science solution come to life within your organization: the data engineer. The data engineer is an architect and an expert in software development, data stores and IT infrastructures who works in close cooperation with the data scientist and your IT department to get the job done.