Lakehouse Trifecta — Delta Lake, Apache Iceberg & Apache Hudi
What is a Data Lakehouse?
A data Lakehouse can be defined as a modern data platform built from a combination of a data lake and a data warehouse. It is a new, open data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the ACID transactions and data management of data warehouses, enabling batch, streaming, BI, AI, and ML workloads on all data on the same platform.
The modern Lakehouse melds characteristics of data lakes and data warehouses into a single entity in support of customers’ advanced analytics and data science workloads.
Advantages:
● Unlimited scale data repository for mixed data types: structured, semi-structured and unstructured.
● Access to raw source data for data science and machine learning models.
● Rapid access for analytics and BI reporting, removing data staleness.
● ACID transactions, version history, time travel.
● High performance and concurrency support, record-level mutations (updates, deletes, etc.) across large-scale datasets.
● Metadata Scalability and support for schema evolution and schema enforcement.
● A universal semantic layer with enterprise-grade governance and security.
● Flexibility to use open standard file formats and support for multiple engines to run different types of workloads.
● Reduced multi-hop data copies and redundant storage costs.
When building a data lakehouse, the most consequential decision is choosing the right storage format. The outcome will have a direct effect on its performance, usability, and compatibility.
Delta Lake, Apache Iceberg, and Apache Hudi are the current best-in-breed open-source storage formats designed for the Lakehouse.
Open Storage Formats
Open storage formats are instrumental for getting the scalability benefits of the data lake and the underlying object store, while at the same time getting the data quality and governance associated with data warehouses. These are the building blocks of any Lakehouse platform.
Major Storage Formats:
A storage format (sometimes also called a table format) is essentially an additional metadata layer on top of existing open-source file formats, e.g., ORC, Parquet, Avro, etc. Selecting a suitable file format is very important when building a data lake table, and the same holds true when choosing a storage format for a Lakehouse. The major open storage formats currently available are:
1. Delta Lake Storage Format
Delta Lake is an open-source storage format that enables building a Lakehouse architecture on top of data lakes.
Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing on top of data lakes, such as S3, ADLS, GCS, and HDFS.
Delta Lake is deeply integrated with Spark for reading and writing, but accessing data from other engines like Presto, Flink, Kafka Connect, Snowflake, AWS Athena, Redshift, etc. is also possible. The major Delta 2.0 upgrade was released in 2022, and Delta 3.0 was released as a Beta version in mid-2023.
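As a quick illustration of the Spark integration, below is a minimal PySpark sketch; it assumes a SparkSession already configured with the delta-spark package, and the table path /tmp/delta/events is a hypothetical example.
df = spark.range(0, 1000).withColumnRenamed("id", "event_id")
# Write the DataFrame as a Delta table (Parquet data files plus a _delta_log directory)
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")
# Read it back like any other Spark data source
events = spark.read.format("delta").load("/tmp/delta/events")
events.show(5)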
The Delta Lake table format consists of two components:
● Data files: These are the files that store the data in columnar Parquet format. Each data file has a unique ID and contains metadata such as file size, row count, column statistics, partition values, etc.
● Transaction log: This file (Delta log) records all the changes made to a table over time. Each entry in the transaction log has a unique version number and contains metadata such as operation type, timestamp, user ID, etc. The transaction log also references one or more data files that contain the actual data. Checkpoints summarize all changes to the table up to that point minus transactions that cancel each other out.
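To see the transaction log in action, the table history can be inspected through the DeltaTable API. Below is a minimal sketch, reusing the hypothetical /tmp/delta/events path from the earlier example.
from delta.tables import DeltaTable
# Each row of the history corresponds to a version committed to the _delta_log,
# including the operation type, timestamp, and operation metrics.
deltaTable = DeltaTable.forPath(spark, "/tmp/delta/events")
deltaTable.history().select("version", "timestamp", "operation").show(truncate=False)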
Delta Lake’s benefits:
● Schema evolution: Delta Lake supports adding, dropping, renaming, reordering, or updating columns without breaking existing queries or requiring a full table rewrite. It also supports schema enforcement to ensure data quality and consistency.
● ACID transactions: Delta Lake supports atomic operations such as append, overwrite, delete, merge, etc., that guarantee isolation, consistency, and durability of data. It also supports optimistic concurrency control that prevents conflicts and ensures the serializability of transactions.
● Time-travel: Delta Lake supports accessing historical versions of a table by using version numbers or timestamps. This enables features such as auditing, rollbacks, replication, etc.
● Data compaction: Delta Lake supports compacting small files into larger ones to improve read performance and reduce storage costs. It also supports optimizing the layout of data files by sorting them on specified columns.
● Data quality: Delta Lake supports validating and enforcing data quality rules at various levels. For example, it can check schema compatibility, data type constraints, uniqueness constraints, etc. It can also perform data cleansing and deduplication using merge or delete operations.
● Data governance: Delta Lake supports tracking and auditing data changes at various levels. For example, it can record the history of table operations, versions, files, etc. It can also provide lineage and provenance information by using metadata tables.
● Z-order: A technique to co-locate related information in the same set of files. This co-locality is automatically exploited by Delta Lake's data-skipping algorithms and dramatically reduces the amount of data a query needs to read (a short usage sketch follows this list).
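Below is a minimal sketch of time travel and Z-ordering through Spark SQL. It assumes a Delta-enabled SparkSession, a registered Delta table with the hypothetical name events, a hypothetical column event_id, and Spark/Delta versions recent enough to support this SQL syntax.
# Time travel: query the table as of an older version or timestamp
spark.sql("SELECT COUNT(*) FROM events VERSION AS OF 3").show()
spark.sql("SELECT COUNT(*) FROM events TIMESTAMP AS OF '2023-01-01'").show()
# Compact small files and co-locate rows on a frequently filtered column
spark.sql("OPTIMIZE events ZORDER BY (event_id)")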
The Delta Lake table format supports only Parquet as the underlying file format. One can easily convert existing Parquet tables to Delta using Spark SQL or DataFrame code. The following is an example of converting a Parquet table to Delta using PySpark.
from delta.tables import *
# Convert an unpartitioned Parquet table at path <path-to-table>
deltaTable = DeltaTable.convertToDelta(spark, "parquet.`<path-to-table>`")
# Convert a partitioned Parquet table at path <path-to-table>, partitioned by an integer column named 'part'
partitionedDeltaTable = DeltaTable.convertToDelta(spark, "parquet.`<path-to-table>`", "part int")
Once converted, the data files should be read and written in Delta format only, and existing ETL pipelines should be refactored accordingly. However, for tables in another file format like ORC, Avro, etc., one needs to completely rewrite the data to convert to Delta. Multi-table transactions are not supported in Delta Lake, and primary/foreign key support is missing in the open-source version.
Major Contributors to Delta Lake:
Databricks is the major contributor to Delta Lake OSS Project.
[Note: This info is based on contributions to each project’s core repository on GitHub, measuring contributions which are issues/pull requests and commits in the GitHub repository. Activity or code merges that occur in other upstream or private repositories are not factored in since there is no visibility into that activity.]
2. Iceberg Storage Format
Iceberg is an open-source storage format that was originally developed by Netflix to address issues in Apache Hive. After its initial development in 2018, Netflix donated Iceberg to the Apache Software Foundation as a completely open-source, openly managed project. It remedies many of the shortcomings of its predecessor and has quickly become one of the most popular table formats for building a Lakehouse.
The Iceberg storage format consists of three components:
● Data files: These are the actual files that store the data in any row-oriented or columnar format, such as Avro, ORC, or Parquet. Each data file has a unique ID and contains metadata such as file size, row count, column statistics, partition values, etc.
● Manifests: These are metadata files that list all the data files that belong to a table or a partition. Each manifest has a unique ID and contains metadata such as manifest size, file count, partition spec, etc.
● Snapshots: These are metadata files representing a consistent table view at a time. Each snapshot has a unique ID and contains metadata such as snapshot timestamp, operation type, summary stats, etc. Snapshots also reference one or more manifests that contain the actual data files.
These components are organized hierarchically, enabling efficient metadata management and query optimization. For example, Iceberg can quickly prune irrelevant data files using the column statistics stored in the manifests. It can also track changes to a table by using the snapshots stored in the version history.
Iceberg’s benefits:
● Schema evolution: Iceberg supports full schema evolution (adding, dropping, renaming, reordering, or updating columns) without breaking existing queries or requiring a full table rewrite. It also supports evolving partitioning schemes without affecting data layout or query performance.
● ACID transactions: Iceberg supports atomic operations such as append, overwrite, delete, merge, etc., that guarantee isolation, consistency, and durability of data. It also supports optimistic concurrency control that prevents conflicts and ensures the serializability of transactions.
● Time-travel: Iceberg supports accessing historical versions of a table by using snapshot IDs or timestamps. This enables features such as auditing, rollbacks, replication, etc.
● Hidden partitioning: Iceberg derives partition values from column values using partition specs, so queries do not need to reference partition columns directly. This simplifies query syntax and lets queries benefit from partition pruning without users needing to know the physical partition layout.
● Incremental processing: Iceberg supports reading only new or modified data files using snapshot IDs or timestamps. This enables incremental backups, change data capture (CDC), streaming ingestion, etc.
● Metadata tables: Iceberg exposes various metadata tables that provide information about a table’s structure, history, partitions, files, etc. These tables can be queried using SQL or Spark APIs to perform schema validation, data quality checks, lineage analysis, etc.
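Below is a minimal sketch of Iceberg time travel and metadata-table queries through Spark SQL. It assumes an Iceberg-enabled SparkSession, a catalog with the hypothetical name iceberg, a hypothetical table db.sample, and Spark/Iceberg versions recent enough to support the VERSION AS OF syntax.
# Time travel to a specific snapshot ID or timestamp
spark.sql("SELECT * FROM iceberg.db.sample VERSION AS OF 1234567890").show()
spark.sql("SELECT * FROM iceberg.db.sample TIMESTAMP AS OF '2023-01-01 00:00:00'").show()
# Metadata tables: snapshot history and the data files behind the table
spark.sql("SELECT snapshot_id, committed_at, operation FROM iceberg.db.sample.snapshots").show()
spark.sql("SELECT file_path, record_count FROM iceberg.db.sample.files").show()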
The transaction model in Iceberg is snapshot based. A snapshot is a complete list of the files that make up the table at a point in time. As Iceberg tables grow and have many operations performed against them, it is a good idea to optimize them from time to time. Optimization not only compacts small files into larger ones for better read performance, but also cleans up metadata, which speeds up queries because less metadata needs to be read (see the maintenance sketch below).
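Iceberg exposes such maintenance as Spark stored procedures. A minimal sketch follows, again using the hypothetical iceberg catalog and db.sample table.
# Compact small data files into larger ones
spark.sql("CALL iceberg.system.rewrite_data_files(table => 'db.sample')")
# Rewrite manifests so the metadata tree stays small and well clustered
spark.sql("CALL iceberg.system.rewrite_manifests('db.sample')")
# Expire old snapshots so unreferenced metadata and data files can be cleaned up
spark.sql("CALL iceberg.system.expire_snapshots(table => 'db.sample', older_than => TIMESTAMP '2023-01-01 00:00:00')")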
One can easily migrate existing Parquet, ORC, and Avro tables to Iceberg using Spark or Hive.
spark-shell --packages org.apache.iceberg:iceberg-spark3-runtime:0.13.0 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive \
  --conf spark.sql.catalog.iceberg=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.iceberg.type=hadoop \
  --conf spark.sql.catalog.iceberg.warehouse=$PWD/iceberg-warehouse \
  --conf spark.sql.warehouse.dir=$PWD/hive-warehouse
# Migrate db.sample in the current catalog to an Iceberg table
CALL catalog_name.system.migrate('db.sample')
Major Contributors to Apache Iceberg:
Netflix, Apple, AWS, Cloudera etc. are major contributors to Iceberg OSS.
3. Hudi Storage Format
Apache Hudi is another open storage format that can support a Data Lakehouse. It was initially developed by Uber and donated to the Apache Software Foundation in 2017.
The Hudi storage format consists of three components:
● Timeline: At its core, Hudi maintains a timeline of all actions performed on the table at different instants of time that helps provide instantaneous views of the table, while also efficiently supporting retrieval of data in the order in which it was written. The timeline is akin to a redo/transaction log, found in databases, and consists of a set of timeline-instants (Commits, Cleans, Compaction, Rollback etc.)
● Data Files: Hudi organizes a table into a folder structure under a table-basepath on DFS. If the table is partitioned by some columns, then there are additional table-partitions under the base path, which are folders containing data files for that partition, very similar to Hive tables. Hudi stores ingested data in two different storage formats. The actual formats used are pluggable but require the following characteristics -
Read-Optimized columnar storage format (RO-Format). The default is Apache Parquet.
Write-Optimized row-based storage format (WO-Format). The default is Apache Avro.
● Index: Hudi provides efficient upserts, by mapping a record-key + partition-path combination consistently to a file-id, via an indexing mechanism. This mapping between record key and file group/file id, never changes once the first version of a record has been written to a file group. In short, the mapped file group contains all versions of a group of records.
Hudi supports two table types: copy-on-write and merge-on-read.
Copy on Write (CoW) — Data is stored in a columnar format (Parquet/ORC), and each update creates a new version of files during a write. CoW is the default storage type.
Merge on Read (MoR) — Data is stored using a combination of columnar (Parquet/ORC) and row-based (Avro) formats. Updates are logged to row-based delta files and are compacted as needed to create new versions of the columnar files.
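A minimal PySpark sketch of writing a Hudi table is shown below; the table name, columns, and path are hypothetical, and it assumes a SparkSession launched with the Hudi Spark bundle. The table type option switches between CoW and MoR.
df = spark.createDataFrame(
    [(1, "2023-01-01 10:00:00", 9.5, "nyc"), (2, "2023-01-01 10:05:00", 4.2, "sfo")],
    ["trip_id", "ts", "fare", "city"],
)
hudi_options = {
    "hoodie.table.name": "trips",                           # target table name
    "hoodie.datasource.write.recordkey.field": "trip_id",   # record key (acts as primary key)
    "hoodie.datasource.write.partitionpath.field": "city",  # partition column
    "hoodie.datasource.write.precombine.field": "ts",       # picks the latest record on upsert
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",  # or "MERGE_ON_READ"
    "hoodie.datasource.write.operation": "upsert",
}
(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("/tmp/hudi/trips"))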
Hudi’s benefits:
● ACID transactions, Time travel and rollback.
● Supports partial schema evolution, including column add, drop, rename, and type promotion.
● Hudi tables have Primary key and indexing.
● Supports common columnar file formats Parquet and ORC.
● Supports read/write by major data lake engines — Spark and Flink.
● Update, Delete and Insert commands are supported.
● Hudi provides three logical views for accessing the data:
1. Read-optimized view — Provides latest committed dataset from CoW tables and the latest compacted dataset from MoR tables.
2. Incremental view — Provides a change stream between two actions out of a CoW dataset to feed downstream jobs.
3. Real-time view — Provides the latest committed data from a MoR table by merging the columnar and row-based files inline.
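For instance, the incremental view can be consumed from Spark by setting an incremental query type and a beginning instant time. A minimal sketch follows, with a hypothetical commit time, reusing the Hudi table path from the write sketch above.
incremental_read_options = {
    "hoodie.datasource.query.type": "incremental",
    # Pull only records committed after this instant on the Hudi timeline
    "hoodie.datasource.read.begin.instanttime": "20230101000000",
}
changes = (spark.read.format("hudi")
           .options(**incremental_read_options)
           .load("/tmp/hudi/trips"))
changes.show(5)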
Hudi supports two modes when migrating Parquet tables using Spark.
1. METADATA_ONLY: In this mode, record-level metadata alone is generated for each source record and stored in the new bootstrap location.
2. FULL_RECORD: In this mode, record-level metadata is generated for each source record, and both the original record and the metadata are copied to the new bootstrap location.
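A hedged sketch of a METADATA_ONLY bootstrap through the Spark datasource is shown below. The table name, paths, and the selector class are assumptions based on Hudi's bootstrap configuration options and should be verified against the Hudi version in use.
# Hedged sketch: bootstrap an existing Parquet table into Hudi (hypothetical names and paths).
bootstrap_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.operation": "bootstrap",
    "hoodie.bootstrap.base.path": "/data/parquet/trips",  # existing Parquet table location
    # METADATA_ONLY selector: generate Hudi metadata and leave the original files in place;
    # a FULL_RECORD selector would copy the records as well. Class name is an assumption.
    "hoodie.bootstrap.mode.selector":
        "org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector",
}
(spark.createDataFrame([], "trip_id string, ts bigint")  # empty frame; data comes from the base path
 .write.format("hudi")
 .options(**bootstrap_options)
 .mode("overwrite")
 .save("/tmp/hudi/trips_bootstrapped"))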
Major Contributors to Apache Hudi:
Uber, Alibaba etc. are major contributors to Hudi OSS.
Delta vs Iceberg vs Hudi — Our View
Community Momentum
Needless to say, it is critical for an open-source project to stay vibrant through community contributions. The community can make or break development momentum, ecosystem adoption, and the objectivity of the platform. Delta Lake leads the pack in awareness and popularity, with almost 9M active downloads per month as per the latest stats.
GitHub stars are a vanity metric that represents popularity more than contribution.
Delta vs Iceberg vs Hudi — Performance Comparison
The Databeans team conducted a TPC-DS benchmark test in early 2022 to compare the load performance and query performance of all three storage formats; a synopsis follows.
Environment setup:
The benchmark used Hudi 0.11.1 with the CoW table type, Delta 1.2.0, and Iceberg 0.13.1, with the environment components listed in the table below.
Overall Performance:
Load Performance:
Query Performance:
Delta vs Iceberg vs Hudi — Our Recommendations
Go with Delta Lake if -
• Most of your workload is on Spark, you expect relatively low write throughput, and you are looking for better query performance.
• Your existing large tables are unlikely to require changes to their partitioning columns. However, Liquid Clustering has been introduced in Delta 3.0 (Beta release), which can remove the need for partitioning in the first place.
Go with Iceberg if -
• Your tables have a large volume of metadata (more than 10k partitions).
• You are looking for full schema evolution support with changes in partitioning scheme.
Go with Hudi if -
• You are looking for efficient incremental data processing and indexing.
• Your use cases are mostly streaming-oriented. Hudi provides many out-of-the-box capabilities for streaming.
Our Final Thoughts
The adoption of the Lakehouse is not yet fully crystallized, but it appears to be gaining momentum as more and more customers express interest in the architecture. The trend of using cloud object storage as a data lake is not going to diminish, and there are clear performance and efficiency advantages in bringing ACID transaction support, schema enforcement, and data governance functionalities to that data, rather than having to export it into external data-warehousing environments for analysis. Lakehouse architectures are here to stay.
Though Delta Lake was the first open Lakehouse format and is still the most popular and active one, Iceberg and Hudi are quickly catching up in the race. However, Databricks UniForm could turn out to be a game changer, as it could end the cold war by allowing all three formats to work together in a democratized fashion. We also expect this fascinating rivalry to continue, if only to enrich the entire data ecosystem and fraternity!
References:
https://iceberg.apache.org/docs/latest/