You have most likely read about the difference between a data engineer and a data scientist. I always thought the difference was obvious: data engineers prepare the data so that data scientists can analyze it.
However, my view of this distinction shifted after I began working in data science.
Everything begins with data in data science. Your model is only as good as the data fed into it. Junk in, junk out! A data scientist cannot perform magic to yield a valuable product without the right data.
The right data isn’t always easily accessible to data scientists. Typically, it will be up to a data scientist to turn the raw data into the right format.
Unless you work at a giant tech company that has separate teams for data engineers and data scientists, you must have the skills to handle some data engineering tasks. These tasks cover a broad range of operations, and I will elaborate on them in the rest of the article.
What is the difference anyway?
My opinion is that a data scientist should be both a data scientist and a data engineer.
This may sound like a debatable statement. However, I would like to emphasize that my opinion was different before I began working as a data scientist. I used to think of data engineers and data scientists as completely separate roles.
In the rest of this post, I will try to explain what I mean.
For instance, data engineers perform a set of operations known as ETL (extract, transform, load). It covers the procedures for collecting data from one or more sources, applying some transformations, and then loading the result into a different destination.
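To make the three steps concrete, here is a minimal sketch of an ETL run using only the Python standard library. The sample CSV text, the `sales` table, and the function names are all made up for illustration; a real pipeline would read from actual files, APIs, or databases.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file, API, or database.
RAW_CSV = """customer,amount
alice, 10.50
bob,not_a_number
carol,7
"""


def extract(raw_text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: tidy up names, convert amounts, and drop rows that fail to parse."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount is not a number
        cleaned.append((row["customer"].strip().title(), amount))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the destination table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database as a stand-in destination
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT customer, amount FROM sales ORDER BY customer"
).fetchall()
print(loaded)
```

Keeping each step in its own function makes it easy to swap one source or destination for another without touching the transformation logic.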
I would definitely not be surprised if a data scientist is expected to perform ETL operations. Data science is still evolving and most companies do not have clearly separated data engineer and data scientist roles. As a result, a data scientist should be able to perform some data engineering tasks.
If you expect to only work on running machine learning algorithms with ready-to-use data, you will face the harsh truth soon after you start working as a data scientist.
You may have to write some stored procedures in SQL to preprocess the client data. It is also possible that you receive the client data from a few different sources. It will be your job to extract and combine them. Then, you will need to load them into a single destination. In order to write efficient stored procedures, you need extensive SQL skills.
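As a rough illustration of combining client data from two sources with SQL, the sketch below joins two tables into a single one. The table names and rows are invented, and SQLite is used only because it ships with Python. SQLite does not support stored procedures, so a plain SQL statement run from Python stands in for one here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two hypothetical sources: a CRM export and a billing export.
conn.executescript(
    """
    CREATE TABLE crm_customers (customer_id INTEGER, name TEXT);
    CREATE TABLE billing_invoices (customer_id INTEGER, amount REAL);
    INSERT INTO crm_customers VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol');
    INSERT INTO billing_invoices VALUES (1, 20.0), (1, 5.0), (2, 12.5);
    """
)

# Preprocessing step: one row per customer, with customers that have
# no invoices kept (LEFT JOIN) and given a total of 0.
PREPROCESS_SQL = """
CREATE TABLE customer_spend AS
SELECT c.customer_id,
       c.name,
       COALESCE(SUM(i.amount), 0.0) AS total_spend
FROM crm_customers AS c
LEFT JOIN billing_invoices AS i ON i.customer_id = c.customer_id
GROUP BY c.customer_id, c.name
"""
conn.execute(PREPROCESS_SQL)

rows = conn.execute(
    "SELECT name, total_spend FROM customer_spend ORDER BY customer_id"
).fetchall()
print(rows)
```

In a database that supports stored procedures, the same statement could be wrapped in one and scheduled to run whenever new client data arrives.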
The transform part of ETL involves many data cleaning and manipulation steps. SQL may not be the best choice if you work with large-scale data. Distributed computing is a better alternative in such cases. Therefore, a data scientist should also be familiar with distributed computing.
Your best friend in distributed computing might be Spark. It is an analytics engine used for large-scale data processing. We can distribute both data and computations over clusters to achieve a substantial performance increase.
If you are familiar with Python and SQL, you won’t have a hard time getting used to Spark. You can use Spark features through PySpark, the Python API for Spark.
When it comes to working with clusters, the optimal environment is the cloud. There are various cloud providers, but AWS, Azure, and Google Cloud Platform (GCP) lead the way.
Although the PySpark code is the same for all cloud providers, how you set up the environment and create clusters differs between them. They allow you to create clusters using either scripts or the user interface.
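To show why Python and SQL knowledge carries over, here is a small sketch that computes the same aggregation twice: once with the PySpark DataFrame API and once with a SQL query. The sample orders and the function name are hypothetical, and the code assumes PySpark (and the Java runtime it needs) is installed.

```python
# Hypothetical sample data: (customer, amount) pairs.
SAMPLE_ORDERS = [("alice", 10.0), ("bob", -3.0), ("alice", 5.5), ("carol", 7.0)]


def spend_per_customer(spark, orders=SAMPLE_ORDERS):
    """Return (DataFrame API result, SQL result) for total positive spend per customer."""
    # Imported here so this file still loads where PySpark is not installed.
    from pyspark.sql import functions as F

    df = spark.createDataFrame(orders, ["customer", "amount"])

    # DataFrame API: method chaining that will feel familiar to Python users.
    via_api = (
        df.filter(F.col("amount") > 0)
        .groupBy("customer")
        .agg(F.sum("amount").alias("total_spend"))
    )

    # The same query in SQL, run against a temporary view of the DataFrame.
    df.createOrReplaceTempView("orders")
    via_sql = spark.sql(
        "SELECT customer, SUM(amount) AS total_spend "
        "FROM orders WHERE amount > 0 GROUP BY customer"
    )
    return via_api, via_sql
```

With PySpark available, a local session can be created with `SparkSession.builder.master("local[*]").getOrCreate()` and passed in as `spark`. Spark evaluates lazily, so nothing is computed until an action such as `.show()` or `.collect()` is called on one of the results.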
Distributed computing over clusters is a whole different world. It is nothing like doing analysis on your own computer, and it has very different dynamics. Evaluating cluster performance and choosing the optimal number of workers for a cluster will be your main concerns.
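The habit of measuring a job at several worker counts can be practiced on a small scale. The sketch below is not a Spark cluster: it uses a local thread pool, and a short sleep stands in for the real work on each partition. The timings therefore only illustrate the measuring approach, and all names are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def process_partition(partition):
    """Stand-in for per-partition work: a short sleep simulates I/O, then a sum."""
    time.sleep(0.02)
    return sum(partition)


def run_job(partitions, num_workers):
    """Process all partitions on `num_workers` threads; return (total, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        total = sum(pool.map(process_partition, partitions))
    return total, time.perf_counter() - start


# Split the numbers 0..799 into 8 partitions of 100 values each.
partitions = [list(range(i, i + 100)) for i in range(0, 800, 100)]

timings = {}
for num_workers in (1, 2, 4, 8):
    total, seconds = run_job(partitions, num_workers)
    timings[num_workers] = seconds
    print(f"{num_workers} workers: total={total}, {seconds:.3f}s")
```

Adding workers stops helping once there are as many workers as partitions. That is one reason why the optimal number of workers depends on how the data is split, and why more is not always better.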
Long story short, data processing will be a substantial part of your job as a data scientist. By substantial, I mean more than 80% of your time. Data processing is not just cleaning and manipulating the data. It also involves ETL operations, which are usually thought of as the job of a data engineer.
I strongly recommend getting familiar with ETL tools and concepts. It would be of great help if you have a chance to practice them.
It would be naive to assume you will only work on machine learning algorithms as a data scientist. That is an important task too, but it will only consume a small part of your time.