When it comes to data, today’s CIOs are being pulled in all directions. They’re asked to provide an effective data platform that supports machine learning and other data-driven innovation, but they must also continue to support traditional business intelligence (BI) tools and weekly data reports. Successful CIOs have figured out how to address both of these needs without making tradeoffs, and the answer lies in understanding the relative strengths of data warehouses and data lakes.
Much has been said about whether data lakes will replace data warehouses, but I view them as complementary. Data warehouses enable fast, complex queries across historical structured data. They help businesses learn from the past, so a retailer can understand what products have sold well in a particular region, for example. Data lakes focus on learning about the present using streaming analytics — and the future using predictive analytics and machine learning. They’re a repository for a wide variety of unstructured and semi-structured data — think video, social streams and IoT sensor data, for example. And they support the technologies data scientists use today for machine learning, such as the Python language and multiple open-source query engines.
So how does a CIO build a cohesive architecture that brings these two platforms together?
With a data warehouse, data must pass through a standardized ETL (extract, transform, load) process before it becomes available to users, and that process takes time. With a data lake, what we think of as ETL happens at the moment the data is read or consumed. Through continuous data engineering, data is kept in an efficiently queryable form, so users can access it directly from the data lake. This is critical because data can become stale in a matter of hours or even minutes; organizations often don't have the luxury of a long ETL process before data reaches users.
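To make schema-on-read concrete, here is a minimal Python sketch using pandas. The sample event records are invented, and an in-memory buffer stands in for a file in cloud object storage; the point is that structure is applied only when the data is consumed, so a new field (`page`) shows up without any upfront schema migration:

```python
import io

import pandas as pd

# Raw, semi-structured events as they might land in a data lake.
# In a real lake this would be a file in object storage; a StringIO
# buffer stands in here so the sketch is self-contained.
raw_events = io.StringIO(
    '{"user": "a1", "event": "click", "ts": "2024-05-01T10:00:00"}\n'
    '{"user": "b2", "event": "view", "ts": "2024-05-01T10:00:05", "page": "/home"}\n'
)

# Schema-on-read: no upfront ETL step. Types and columns are inferred
# at the moment of consumption, not at load time.
df = pd.read_json(raw_events, lines=True)
df["ts"] = pd.to_datetime(df["ts"])

# The second record introduced a "page" field; it simply appears as a
# new column, with a missing value for the earlier record.
clicks = df[df["event"] == "click"]
print(len(clicks))  # 1
```

The warehouse equivalent would have required altering a table schema before the second record could even be loaded.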
A modern data architecture accelerates return on investment by combining a data warehouse and a data lake, federating relational and non-relational data stores into a single, cohesive architecture. This enables new practices that complement the core data warehouse without replacing it: the warehouse remains the right platform for the standardized data behind BI reports, dashboards and OLAP (online analytical processing), while the data lake supports newer use cases such as streaming analytics and machine learning.
To visualize this, imagine a cloud object store as the bottom layer of this modern data architecture. Data from all sources resides here, including the structured data from traditional business apps and the unstructured data for your data lake: the clickstreams, images, server logs, IoT feeds and other data required for machine learning and advanced analytics. Structured data is fed from the object store into the data warehouse for BI workloads, while the rest is made available via a data lake platform for streaming analytics and machine learning.
In this way, the modern data platform becomes the single ingestion point for all new data. Organizations can transform and process data, formerly destined only for the data warehouse, with a schema-on-read approach directly from the data lake.
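The flow above can be reduced to a toy routing rule. This is only a sketch of the idea, not any particular product's API, and the source names are hypothetical: every dataset lands in the object store first, structured sources are additionally loaded into the warehouse, and everything else is served from the lake.

```python
# Toy routing rule for the architecture described above. Every dataset
# lands in the object store (the single ingestion point); structured
# sources are additionally loaded into the warehouse for BI, while the
# rest stays queryable from the data lake.
STRUCTURED_SOURCES = {"orders", "customers"}  # hypothetical source names


def route(source: str) -> list[str]:
    destinations = ["object_store"]  # everything lands here first
    if source in STRUCTURED_SOURCES:
        destinations.append("data_warehouse")  # BI reports, dashboards, OLAP
    else:
        destinations.append("data_lake")  # streaming analytics, ML
    return destinations


print(route("orders"))       # ['object_store', 'data_warehouse']
print(route("clickstream"))  # ['object_store', 'data_lake']
```

The essential property is that the object store appears in every route: there is one ingestion point, and the warehouse and lake are downstream consumers of it.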
Enhancing an existing data warehouse with a cloud data lake provides freedom of choice. There's no need to compromise between technical fit and cost when managing all types of data. When an integrated data environment leverages a cloud-native data platform, companies get the best approach to managing all types of data from the perspective of cost, performance and flexibility. This integrated environment can support data engineering, data analytics and machine learning services, which ultimately yield trusted data sets for the business.
I would add that CIOs should aim to build “open” data lakes to avoid being locked into proprietary data formats or proprietary systems, which reduces flexibility and drives up costs. With an open data lake, data is stored in open formats, meaning formats based on open standards and developed through a community-driven process, such as Parquet or ORC, and is accessed through open, standards-based interfaces. This philosophy should extend to every aspect of the data lake, including data management, data storage, data processing and data access.
In this way, CIOs can address the needs of all their users in the cloud without having to support and manage multiple data repositories. Data scientists get what they need to develop, test and deploy machine learning applications, while business users keep the traditional reports and dashboards they rely on. The overall objective is driving innovation while keeping costs to a minimum, and this dual strategy of combining your data warehouse with an open data lake is the way to achieve it.