As a data scientist who has worked in the field since 2014, I can confidently tell you that data science is not going to disappear. It’s sad seeing people talk about how data science will “go extinct” 10 years from now. The typical reason people give is that emerging AutoML tools will eliminate the need for data science practitioners to create algorithms.
These claims are especially sad because they dissuade some aspiring data scientists from pursuing the career, or from studying the discipline deeply enough to excel at it. Frankly, the comments are a disservice to the data science community. They also don’t make sense, because demand for data scientists continues to increase every year.
I intend to present the main reasons why the field is not going extinct anytime soon. In another post, I will offer advice to help you stay on the right side of this profession over the next 10 years, InshaAllah.
Let’s begin with science. I don’t need to prove that science has been here for a long time. The essence of science is using data to learn something. We observe the world around us (collect data) and then use the data to build a model (traditionally called a theory), which summarizes those observations. We build models in order to help us tackle problems.
Data science has precisely the same essence. In data science, we collect data, learn from the data by building models, and then employ those models to solve problems. Over the years, various disciplines have developed and refined tools to do this, and depending on the field’s focus, people have used different names for those tools and procedures. One term that has gained a lot of traction is data science.
However, there is a crucial difference between the past and the present: the amount of data accessible to us, and the computational power we have to work with it. When we had only a couple of data points in a couple of dimensions, we could plot them on a piece of paper, fit a straight line (a regression) by hand, and try to spot patterns among them. Now, we can quickly and cheaply gather large amounts of data from multiple sources, with many features, and we can even automate the gathering itself. It is not humanly possible to produce a best-fit line (or to cluster) by hand across a large number of dimensions and data points.
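To make the scale difference concrete, here is a minimal sketch (with synthetic, made-up data) of fitting a “best-fit line” across 50 dimensions at once — a task no one could do with pencil and paper, but which a machine finishes instantly via ordinary least squares:

```python
import numpy as np

# Hypothetical illustration: synthetic data far beyond pencil-and-paper scale.
rng = np.random.default_rng(0)
n_samples, n_features = 1_000, 50

X = rng.normal(size=(n_samples, n_features))   # 50-dimensional inputs
true_coefs = rng.normal(size=n_features)       # the "true line" we hope to recover
y = X @ true_coefs + rng.normal(scale=0.1, size=n_samples)  # noisy observations

# Ordinary least squares: the high-dimensional analogue of fitting a
# straight line on paper.
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)

# With plenty of data, the fit recovers the true coefficients closely.
print(np.allclose(coefs, true_coefs, atol=0.05))
```

The point is not the ten lines of code, but that the abundance of data and compute changes what counts as a routine analysis.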
If anything, we will gather larger amounts of ever more diverse data, and we will need more creative methods to deal with the issues that arise when analyzing it.
Building a Model Is Only a Small Part of the Work That Gets Done on Projects
Several “Automated Machine Learning” (AutoML) tools are gaining traction, and some of them will probably help democratize data science. However, the majority of these tools speed up only one step: testing and deploying machine learning algorithms on data that has already been cleaned.
Obtaining clean data and getting it into a model is far from trivial.
In fact, a number of data science surveys have pointed out how much of a data scientist’s time goes to gathering and cleaning data. For example, Anaconda’s annual survey found that data scientists spend approximately two-thirds of their working time loading data, cleaning it, and generating visualizations of it. Only about 23% of a data scientist’s time is spent training models, selecting the best one, and scoring data with it. My own experience as a data scientist has matched this time allocation.
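As a toy sketch of the kind of work that consumes those two-thirds, consider this hypothetical messy table (the column names and values are invented for illustration). Before any model sees the data, someone has to deduplicate rows, coerce types, normalize text, and decide how to handle missing values:

```python
import pandas as pd

# Hypothetical messy input: duplicated rows, a non-numeric age,
# inconsistent city spellings, and a missing salary.
raw = pd.DataFrame({
    "age":    ["34", "n/a", "29", "29"],
    "salary": [55000.0, 61000.0, None, None],
    "city":   [" NYC", "nyc ", "Boston", "Boston"],
})

clean = (
    raw.drop_duplicates()                                # remove repeated rows
       .assign(
           age=lambda d: pd.to_numeric(d["age"], errors="coerce"),
           city=lambda d: d["city"].str.strip().str.upper(),
       )
       .dropna(subset=["age"])                           # drop unparseable ages
)
# Impute the missing salary with the median of the remaining rows.
clean["salary"] = clean["salary"].fillna(clean["salary"].median())

print(len(clean))  # rows that survive cleaning
```

Every one of these steps involves a judgment call (is “n/a” missing or zero? is a duplicate row an error or a repeat observation?), which is exactly the part AutoML does not automate.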