The data warehouse architecture has been part of the IT industry for at least three decades. The history of data warehousing began with helping business leaders gain analytical insights by collecting data from operational databases (transactional systems with historical data) into centralized warehouses, which could then be used for decision support and business intelligence (BI).
This well-known approach is referred to as the first generation of data analytics platforms. In the picture below, we can see what it looked like.
While this kind of architecture worked well for several years, data requirements then started to change dramatically. We are talking about tremendous volumes of data generated by IoT devices, social media, mobile phones, and more. In addition, we also had to consider different unstructured datasets such as video, audio, and documents. As a matter of fact, traditional data warehouse architectures could no longer address these requirements.
This gave rise to the second generation of data analytics platforms, built on top of new technologies that, for simplicity, we group together and call Big Data. Without going into the specifics of this approach, we have to understand that this architecture tries to solve new problems using new technologies mixed with old ones.
The data lake was introduced as a cheaper way to store data regardless of its structure, while the need for analytical insights was still addressed by data warehouses and BI tools. In the picture below, we can see what it looked like.
As we can notice in the picture above, the second-generation architecture is extremely hard to maintain, for several reasons. It is not my intention in this article to list all the challenges of this approach, but a group of academics at the University of California, Berkeley noticed an opportunity in the big data analytics world. After years of research, creating and collaborating on open-source data technologies (the most famous being Apache Spark), they founded a new company called Databricks, whose mission is to provide big data analytics and AI tools to businesses through a simpler, more integrated data analytics platform.
At Databricks, a new architecture pattern has emerged that promises to address all the issues we have seen with the older architectures: the Lakehouse architecture, which we can see in the picture below.
In a Lakehouse architecture, a separate data warehouse is no longer necessary: classical BI and reporting can be produced by querying the data lake directly, and advanced techniques such as machine learning and data science can work against that same layer.
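To make that idea concrete, here is a minimal, hypothetical sketch in plain Python. The data and function names are illustrative, not a Databricks API: in a real lakehouse, an engine such as Apache Spark would query an open table format on cheap storage. The point is only that a single raw layer feeds both a BI-style aggregation and feature preparation for machine learning.

```python
# Hypothetical sketch: one shared "lake" of raw records serving both
# BI reporting and ML feature preparation. Illustrative only; a real
# lakehouse uses a query engine over files, not an in-memory list.

# Raw purchase events as they landed in the lake.
lake = [
    {"user": "ana", "product": "book",  "amount": 30.0},
    {"user": "bob", "product": "book",  "amount": 20.0},
    {"user": "ana", "product": "phone", "amount": 500.0},
]

def bi_revenue_by_product(records):
    """Classical BI: aggregate revenue per product for a report."""
    totals = {}
    for r in records:
        totals[r["product"]] = totals.get(r["product"], 0.0) + r["amount"]
    return totals

def ml_features_per_user(records):
    """Data science: derive per-user features from the same raw layer."""
    features = {}
    for r in records:
        f = features.setdefault(r["user"], {"orders": 0, "spend": 0.0})
        f["orders"] += 1
        f["spend"] += r["amount"]
    return features

print(bi_revenue_by_product(lake))  # {'book': 50.0, 'phone': 500.0}
print(ml_features_per_user(lake))
```

Both consumers read the same records, so there is no separate warehouse copy to load and keep in sync, which is the maintenance burden the Lakehouse pattern aims to remove.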
In the next blog posts we will go deeper into this approach and the technologies behind it.
If you want to know more, I will give you some links to study:
- Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics
- Inmon vs Kimball — the great data warehousing debate
- A Guide To The Data Lake — Modern Batch Data Warehousing
What do you think? Does this mean the end of data warehousing? We at Answers 2 Analytics would love to hear your thoughts on this.