The shortcomings of data lakes have had more to do with existing solutions and associated use cases than the data lake vision itself.
This coming October will mark 10 years since the data lake concept was introduced, and the term coined, by James Dixon, then CTO of Pentaho. Since then, there have been benefits and drawbacks. The journey has produced improvements; however, data lakes have certainly fallen short of their potential.
This article will suggest that such shortcomings have had more to do with existing solutions and associated use cases than the data lake vision itself. It will also postulate not only that the next decade will see significant expansion in data lake usage, but that innovation will reinvent and accelerate it. To conceptualize this brave new world, let’s first see how we got here.
Early Years: Hadoop Storage
Early in the Hype Cycle, data lakes were associated with the rise and fall of the Apache Hadoop platform, which, sadly, helped coin the term data swamp. Though much has been written on the demise of Hadoop, the reality probably lies somewhere between the original hype and outright failure.
This author’s take is that the Hadoop platform was inappropriately used for far too many database scenarios and use cases. As a result, many equate the data lake philosophy with Hadoop headaches. The truth is, Hadoop is a data platform that has lake components, such as the Hadoop Distributed File System (HDFS). But let’s be clear: Not all lake repositories are created equal. And for that matter, not all data platforms are equal either.
Later Years: Cloud Storage
Since the early days of Hadoop-based data lakes, there has been a rapid (and now massive) migration from HDFS to cloud storage-backed lakes. There are two primary reasons for this.
First, cloud storage services (think object storage in the cloud) require no provisioning (pure SaaS) and are expressly secure (built-in RBAC), extremely durable (11 nines of durability) and truly cost-efficient, none of which was true of HDFS.
Second, the leading cloud providers (Amazon, Microsoft and Google) have all built and promoted their own cloud storage services as foundational platforms for building out cloud offerings. As a result, there has been an ever-increasing stream of solutions and services integrating with cloud object storage (e.g., Amazon S3) and a resurgence in data lake thinking and architectures.
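To put the 11-nines durability figure cited above in perspective, here is a back-of-the-envelope calculation in Python. It is only a sketch: it assumes the figure is an annual, per-object durability and that object losses are independent, and the function names are hypothetical.

```python
def expected_annual_losses(num_objects: int, nines: int = 11) -> float:
    """Expected number of objects lost per year at the given durability.

    Assumption: "11 nines" means 99.999999999% annual durability per object,
    i.e. a 1e-11 chance of losing any single object in a given year.
    """
    annual_loss_probability = 10.0 ** -nines
    return num_objects * annual_loss_probability


def years_per_lost_object(num_objects: int, nines: int = 11) -> float:
    """Average number of years between single-object losses."""
    return 1.0 / expected_annual_losses(num_objects, nines)


if __name__ == "__main__":
    stored = 10_000_000  # hypothetical lake holding ten million objects
    years = years_per_lost_object(stored)
    print(f"Roughly one object lost every {years:,.0f} years")
```

Under these assumptions, a lake holding ten million objects would expect to lose a single object about once every 10,000 years.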
With that said, lakes in the cloud are still, by definition, only centralized repositories. Sure, the time, cost and complexity have been greatly reduced due to cloud-first thinking; however, one still needs to leverage a data platform to derive value. Over the last five years, data analytics solutions have retrofitted their architectures to use cloud storage. Even Hadoop offerings have extended the platform to connect to object-based storage. A few new solutions that lean heavily on cloud-based storage have entered the market, such as Amazon Athena (which is built on Presto) and Snowflake. And with their popularity, data lake thinking is incrementally changing.
Evolution: Incremental Approach
So far this lake story has outlined an evolution, not a revolution; the data lake proposition from a decade ago has yet to be realized. It is true that leveraging cloud-based lakes as centralized repositories has significantly helped with data management, but the core mission was to expand and streamline the transformation of raw data into actionable information, where time to results is reduced, decisions can be made and insights can be derived. And with the big bang of big data, the cost-effective transformation of data is more important than ever.
The first cloud lakes focused on simple cloud archiving. From there, lakes began to stage big data workloads through extract, transform and load (ETL) processing into analytic databases. As Hadoop platforms became more popular, they began to replace these databases for more unstructured analytics. However, much of the complexity associated with data swamps is a direct result of this unstructured technique. Since then, some new solutions have leveraged cloud storage not as a data lake but as a cost-efficient backing store, though it should be noted that this reduction in storage cost has been offset by additional memory and compute costs.
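The staging pattern described above, in which raw lake files pass through extract, transform and load (ETL) steps into an analytic database, can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the sample data, column names and function names are hypothetical, an in-memory string stands in for a file in object storage, and SQLite stands in for the analytic database.

```python
import csv
import io
import sqlite3

# Hypothetical raw file as it might sit in a lake. In practice this would be
# read from object storage rather than from an in-memory string.
RAW_EVENTS = """user_id,event,amount
1,purchase,19.99
2,view,
1,purchase,5.01
3,purchase,not-a-number
"""


def extract(raw_text):
    """Extract: parse the raw CSV text into dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: keep purchases only and drop rows with unparsable amounts."""
    cleaned = []
    for row in rows:
        if row["event"] != "purchase":
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # malformed amounts are discarded in this sketch
        cleaned.append((int(row["user_id"]), amount))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS purchases (user_id INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", records)
    conn.commit()


def run_pipeline(raw_text=RAW_EVENTS):
    """Run extract, transform and load end to end; return the connection."""
    conn = sqlite3.connect(":memory:")  # stand-in for an analytic database
    load(transform(extract(raw_text)), conn)
    return conn
```

Even in this toy form, the shape of the problem is visible: every raw record must be parsed, validated and copied into a second system before a single query can run against it.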
Revolution: Reimagined Approach
However, this is not an evolutionary tale but a revolutionary one: an origin story for a completely new data lake approach, not just cloud storage per se. The assertion is that deriving operational and business insights should be as fast and agile as simply streaming data into a lake-based repository. It’s a viewpoint that not only allows for lake-type thinking, but promotes and empowers it.
This reimagination begins with a hybrid approach that would dynamically fuse a data lake repository with a database fabric. It involves a data lake and a database platform coming together to truly reduce time, cost and complexity for diverse datasets and use cases.
Today’s solutions will certainly continue to incrementally evolve. But there is an assumption that data lakes and data platforms are and will always be two separate solutions and concepts. This viewpoint is a direct result of the current technology’s inability to fuse these aspects together — databases were designed to run on block storage, not cloud storage.
And don’t get me started on leveraging memory-heavy solutions for solving big data problems: Expensive and complex memory/caching databases will never keep pace with the ever-growing tsunami of operational, business and machine-generated data. Finally, the move to use AI to simplify data refining and make analytical associations and predictions is right on point. However, it still does not address the dominant issues of time, cost and complexity in scaling big data.
What is needed is the confidence to move away from incremental thinking and its 30-year-old science and look toward reimagined data structures and algorithms that are designed specifically to fuse cloud storage and databases.