Notice

Apache Hudi vs Delta Lake vs Apache Iceberg - Lakehouse Feature Comparison

Thanks to the original author!

This article is reposted with the permission of the original publisher. If you wish to repost it, please contact the publisher by email: info@onehouse.ai

The reposted article follows:

Apache Hudi vs Delta Lake vs Apache Iceberg - Lakehouse Feature Comparison

Written by

Kyle Weller

Introduction

With the growing popularity of the lakehouse, there has been rising interest in analyzing and comparing the open source projects at the core of this data architecture: Apache Hudi, Delta Lake, and Apache Iceberg.

Most comparison articles currently published seem to evaluate these projects merely as table/file formats for traditional append-only workloads, overlooking qualities and features that are critical for modern data lake platforms that must support update-heavy workloads with continuous table management. This article dives into greater depth to highlight the technical differentiators of Apache Hudi and how it is a full-fledged data lake platform, steps ahead of the rest.

This article is periodically updated to keep up with the fast-moving landscape. The last update was in January 2023; it refreshed the feature comparison matrix, added statistics about community adoption, and referenced recent industry benchmarks.

Feature Comparisons

First, let's look at an overall feature comparison. As you read, notice how the Hudi community has invested heavily in comprehensive platform services on top of the lake storage format. While formats are critical for standardization and interoperability, table/platform services give you a powerful toolkit to easily develop and manage your data lake deployments.


The comparison below covers Apache Hudi as of v0.12.2, Delta Lake as of v2.2.0, and Apache Iceberg as of v1.1.0. In the tables, ✓ means supported, ⚠ means partial support, and ✗ means not supported.

Read/write features

| Feature | Apache Hudi | Delta Lake | Apache Iceberg |
|---|---|---|---|
| ACID Transactions | ✓ | ✓ | ✓ |
| Copy-On-Write (can I version and rewrite columnar files?) | ✓ Writes | ✓ Writes | ✓ Writes |
| Merge-On-Read (can I efficiently amortize updates without rewriting the whole file?) | ✓ Merge-On-Read | ✗ | ⚠ Limited functionality; cannot balance merge performance for queries, and requires manual compaction maintenance |
| Efficient Bulk Load (can I efficiently lay out the initial load into the table?) | ✓ Bulk_Insert | ✗ | ✗ |
| Efficient merge writes with record-level indexes (can I avoid merging all base files against all incoming update/delete records?) | ✓ Over 4 types of indexing | ✗ Bloom filter index still proprietary | ✗ Metadata indexing is for tracking statistics |
| Bootstrap (can I upgrade data in place into the system without rewriting the data?) | ✓ Bootstrap | ✓ Convert to Delta | ✓ Table migration |
| Incremental Query (can I obtain a change stream for a given time window on the table?) | ✓ Incremental Query | ⚠ CDF experimental mode in 2.0.0 | ✗ Can only incrementally read appends |
| Time Travel (can I query the table as of a point in time?) | ✓ Time Travel | ✓ Time Travel | ✓ Time Travel |
| Managed Ingestion (can I ingest data streams from popular sources, with no/low code?) | ✓ Hudi DeltaStreamer | ✗ | ✗ |
| Concurrency (can I run different writers and table services against the table at the same time?) | ✓ OCC with non-blocking table services | ⚠ OCC only | ⚠ OCC only |
| Primary Keys (can I define primary keys like regular database tables?) | ✓ Primary Keys | ✗ | ✗ |
| Column Statistics and Data Skipping (can queries benefit from file pruning based on predicates from any column, without reading data file footers?) | ✓ Column stats in metadata; HFile column stats index adds up to 50x perf | ✓ Column stats in parquet checkpoint | ✓ Column stats in avro manifest |
| Data Skipping based on built-in functions (can queries perform data skipping based on functions defined on column values, in addition to literal column values?) | ⚠ With the column stats index, Hudi can effectively prune files based on column predicates and order-preserving functions on columns | ✓ Logical predicates on a source or generated column will prune files during query execution | ✓ Iceberg can transform table data to partition values and maintain the relationship, while also collecting stats on columns |
| Partition Evolution (can I keep changing the partition structure of the table as I go?) | ✗ Hudi takes a different approach, with coarse-grained partitions and fine-grained clustering that can be evolved asynchronously without rewriting data | ✗ Delta Lake also considers more complex partitioning an anti-pattern | ✓ Partition evolution lets you change partitions as your data evolves; old data stays in old partitions, new data gets new partitions, with uneven performance across them |
| Data Deduplication (can I insert data without introducing duplicates?) | ✓ Record key uniqueness, precombine utility customizations, merge, drop dupes from inserts | ⚠ Merge only | ⚠ Merge only |

Table Services

| Feature | Apache Hudi | Delta Lake | Apache Iceberg |
|---|---|---|---|
| File Sizing (can I configure a single standard file size to be enforced across any writes to the table automatically?) | ✓ Automated file size tuning | ⚠ OPTIMIZE command open sourced in 2.0, but automation still proprietary | ⚠ Manual maintenance |
| Compaction (merge changelogs with updates/deletes from MoR writes) | ✓ Managed compaction | ✗ File sizing only; no MoR, so no compaction of deletes/changes | ⚠ Delete compaction is manual maintenance |
| Cleaning (do older versions of files get automatically removed from storage?) | ✓ Managed cleaning service | ⚠ VACUUM is a manual operation for data, managed for the transaction log | ⚠ Expiring snapshots is a manual operation |
| Index Management (can I build new indexes on the table?) | ✓ Async multi-modal indexing subsystem | ✗ | ✗ |
| Linear Clustering (can I linearly co-locate certain data close together for performance?) | ✓ Automated clustering that can be evolved for perf tuning, user-defined partitioners | ✗ | ⚠ You can force writers to sort as they write |
| Multidimensional Z-Order/Space-Curve Clustering (can I sort high-cardinality data with space curves for performance?) | ✓ Z-order + Hilbert curves with auto async clustering | ⚠ Z-order through manual maintenance | ⚠ Z-order through manual maintenance |
| Schema Evolution (can I adjust the schema of my table?) | ⚠ Schema evolution for add, reorder, drop, rename, update (Spark only) | ✓ Schema evolution for add, reorder, drop, rename, update | ✓ Schema evolution for add, reorder, drop, rename, update |
| Scalable Metadata Management (can the table metadata scale with my data sizes?) | ✓ Hudi MoR-based metadata table with HFile format for 100x faster lookups, self-managed like any Hudi table | ⚠ Parquet txn log checkpoints, significantly slower lookups | ⚠ Avro manifest files, significantly slower and need maintenance as you scale |

Platform Support

| Feature | Apache Hudi | Delta Lake | Apache Iceberg |
|---|---|---|---|
| CLI (can I manage my tables with a CLI?) | ✓ CLI | ✗ | ✗ |
| Data Quality Validation (can I define quality conditions to be checked and enforced?) | ✓ Pre-commit validators | ✓ Delta constraints | ✗ |
| Pre-commit Transformers (can I transform data before commit while I write?) | ✓ Transformers | ✗ | ✗ |
| Commit Notifications (can I get a callback notification on successful commit?) | ✓ Commit notifications | ✗ | ✗ |
| Failed Commit Safeguards (how am I protected from partial and failed write operations?) | ✓ Automated marker mechanism | ⚠ Manual configs | ✗ Orphaned files need manual maintenance; failed commits can corrupt the table |
| Monitoring (can I get metrics and monitoring out of the box?) | ✓ MetricsReporter for automated monitoring | ✗ | ✗ |
| Savepoint and Restore (can I save a snapshot of the data and then restore the table back to that form?) | ✓ Savepoint command to save specific versions; restore command with time-travel versions or savepoints | ⚠ Restore command with time-travel versions; all versions must be preserved within the VACUUM retention (e.g., to restore to 6 months ago, you must retain 6 months of versions, or DIY) | ✗ DIY |

Ecosystem Support

| Engine | Apache Hudi | Delta Lake | Apache Iceberg |
|---|---|---|---|
| Apache Spark | Read + Write | Read + Write | Read + Write |
| Apache Flink | Read + Write | Read + Write | Read + Write |
| Presto | Read | Read | Read + Write |
| Trino | Read | Read + Write | Read + Write |
| Hive | Read | Read | Read + Write |
| DBT | Read + Write | Read + Write | ✗ |
| Kafka Connect | Write | ✗ Proprietary only | ✗ |
| Kafka | Write | Write | ✗ |
| Pulsar | Write | Write | Write |
| Debezium | Write | Write | Write |
| Kyuubi | Read + Write | ✗ | Read + Write |
| ClickHouse | Read | Read | ✗ |
| Apache Impala | Read + Write | ✗ | Read + Write |
| AWS Athena | Read | Read | Read + Write |
| AWS EMR | Read + Write | Read + Write | Read + Write |
| AWS Redshift | Read | Read | ✗ |
| AWS Glue | Read + Write | Read + Write | Read + Write |
| Google BigQuery | Read | ✗ | Read |
| Google DataProc | Read + Write | Read + Write | Read + Write |
| Azure Synapse | Read + Write | Read + Write | ✗ |
| Azure HDInsight | Read + Write | Read + Write | ✗ |
| Databricks | Read + Write | Read + Write | Read + Write |
| Snowflake | ✗ | Read | Read + Write |
| Vertica | Read | Read | ✗ |
| Apache Doris | Read | ✗ | Read |
| Starrocks | Read | Preview | Read |
| Dremio | ✗ | Read (with limitations) | Read + Write (with limitations) |

Community Momentum

Equally important to the features and capabilities of an open source project is its community. The community can make or break the development momentum, ecosystem adoption, and objectiveness of the platform. Below is a comparison of the Hudi, Delta, and Iceberg communities:

GitHub Stars

GitHub stars are a vanity metric that represents popularity more than contribution. Delta Lake leads the pack in awareness and popularity.

GitHub Watchers and Forks

Watchers and forks are a closer indication of engagement with and usage of the projects.

GitHub Contributors

In December 2022, almost 90 unique authors contributed to Apache Hudi, more than twice as many as Iceberg and three times as many as Delta Lake.

GitHub PRs and Issues

In December 2022, Hudi and Iceberg merged about the same number of PRs, while roughly twice as many PRs were opened in Hudi.

Contribution Diversity

Apache Hudi and Apache Iceberg both draw contributions from a strongly diverse community.

(Charts: contribution diversity for Apache Hudi, Apache Iceberg, and Delta Lake.)

TPC-DS Performance Benchmarks

Performance benchmarks are rarely representative of real-life workloads, and we strongly encourage the community to run their own analysis against their own data. Nonetheless, these benchmarks can serve as an interesting data point while you start your research into choosing a lakehouse platform. Below are references to relevant benchmarks:

Databeans and Onehouse

Databeans worked with Databricks to publish a benchmark used in their Data+AI Summit keynote in June 2022, but misconfigured an obvious out-of-the-box setting. Onehouse corrected the benchmark here:
Apache Hudi vs Delta Lake - Transparent TPC-DS Lakehouse Performance Benchmarks

Brooklyn Data and Onehouse

In November 2022, Databricks asked Brooklyn Data to publish a benchmark of Delta vs Iceberg:
Setting the Table: Benchmarking Open Table Formats

Onehouse added Apache Hudi and published the code in the Brooklyn Data GitHub repo:
https://github.com/brooklyn-data/delta/pull/2

A clear pattern emerges from these benchmarks: Delta and Hudi are comparable, while Apache Iceberg consistently trails behind as the slowest of the projects. Performance isn't the only factor you should consider, but performance does translate into cost savings that add up throughout your pipelines.

A note on running TPC-DS benchmarks:

One key thing to remember when running TPC-DS benchmarks comparing Hudi, Delta, and Iceberg is that Delta and Iceberg are optimized for append-only workloads by default, while Hudi is optimized for mutable workloads by default. Out of the box, Hudi uses the `upsert` write operation, which naturally carries a write overhead compared to plain inserts. Without this knowledge, you may be comparing apples to oranges. Change this one out-of-the-box configuration to `bulk_insert` for a fair assessment: Write Operations | Apache Hudi
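
For illustration, here is a minimal PySpark sketch of that one-line change; the bucket paths, table name, and key fields are hypothetical placeholders, not values from any published benchmark code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-tpcds-load").getOrCreate()

# Placeholder source data for the load phase of a benchmark.
df = spark.read.parquet("s3://my-bucket/tpcds/store_sales/")

hudi_options = {
    "hoodie.table.name": "store_sales",
    "hoodie.datasource.write.recordkey.field": "ss_ticket_number",
    "hoodie.datasource.write.precombine.field": "ss_sold_date_sk",
    # Hudi defaults to "upsert", which is tuned for mutable workloads and
    # pays a record-indexing cost on write; "bulk_insert" is the fair
    # apples-to-apples setting for an append-only TPC-DS load.
    "hoodie.datasource.write.operation": "bulk_insert",
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("overwrite")
   .save("s3://my-bucket/lake/store_sales/"))
```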

Feature Highlights

Building a data lake platform is more than just checking boxes of feature availability. Let's pick a few of the differentiating features above and dive into the use cases and real benefits in plain English.

Incremental Pipelines


The majority of data engineers today feel like they have to choose between streaming and old-school batch ETL pipelines. Apache Hudi has pioneered a new paradigm called incremental pipelines. Out of the box, Hudi tracks all changes (appends, updates, deletes) and exposes them as change streams. With record-level indexes, you can leverage these change streams more efficiently, avoiding recomputation and processing only the changes incrementally. While other data lake platforms may offer a way to consume changes incrementally, Hudi is designed from the ground up to enable incrementalization efficiently, which results in cost-efficient ETL pipelines at lower latencies.
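
As a hedged sketch of what one step of an incremental pipeline looks like in practice, the PySpark snippet below reads only the records committed after a given instant; the table path and instant timestamp are placeholder values.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental").getOrCreate()

# Pull only the records that changed after a given commit instant
# (instants use Hudi's yyyyMMddHHmmss timestamp format).
changes = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20230101000000")
    .load("s3://my-bucket/lake/orders/"))

# Downstream ETL now processes just the delta instead of rescanning the table.
changes.createOrReplaceTempView("orders_changes")
spark.sql("SELECT count(*) AS changed_rows FROM orders_changes").show()
```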

Databricks recently developed a similar feature called Change Data Feed, which it held proprietary until it was finally released to open source in Delta Lake 2.0. Iceberg has an incremental read, but it only lets you read incremental appends, not the updates and deletes that are essential for true change data capture and transactional data.

Concurrency Control


ACID transactions and concurrency control are key characteristics of a lakehouse, but how do current designs actually stack up against real-world workloads? Hudi, Delta, and Iceberg all support optimistic concurrency control (OCC), in which writers check whether they have overlapping files; if a conflict exists, they fail the operation and retry. In Delta Lake, for example, this was until recently just a JVM-level lock held on a single Apache Spark driver node, which meant you had no OCC outside of a single cluster.

While this may work fine for append-only, immutable datasets, optimistic concurrency control struggles in real-world scenarios, where frequent updates and deletes are needed, whether because of the data loading pattern or to reorganize the data for query performance. Oftentimes, it's not practical to take writers offline for table management to keep the table healthy and performant. Apache Hudi's concurrency control is more granular than that of other data lake platforms (file level), and with a design optimized for multiple small updates/deletes, the possibility of conflict can be reduced to negligible in most real-world cases. You can read more in this blog about how you can run asynchronous table services even in multi-writer scenarios, without needing to pause writers. This is very close to the level of concurrency supported by standard databases.
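
Below is a minimal sketch of what enabling Hudi's multi-writer OCC looks like, assuming a ZooKeeper-based lock provider; the ZooKeeper endpoint and table details are placeholders.

```python
# Options each concurrent writer adds on top of its normal Hudi write config.
occ_options = {
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    # Leave failed writes for the cleaner instead of eagerly rolling back,
    # which is required when several writers may be in flight at once.
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
    "hoodie.write.lock.zookeeper.url": "zk-host",        # placeholder endpoint
    "hoodie.write.lock.zookeeper.port": "2181",
    "hoodie.write.lock.zookeeper.lock_key": "orders",
    "hoodie.write.lock.zookeeper.base_path": "/hudi/locks",
}

# Each writer then merges these into its write, e.g.:
# df.write.format("hudi").options(**hudi_options, **occ_options)...
```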

Merge On Read

Any good database system supports different trade-offs between write and query performance. The Hudi community has made some seminal contributions in terms of defining these concepts for data lake storage across the industry. Hudi, Delta, and Iceberg all write and store data in parquet files. When updates occur, these parquet files are versioned and rewritten. This write pattern is what the industry now calls Copy On Write (CoW). This model works well for optimizing query performance, but can be limiting for write performance and data freshness. In addition to CoW, Apache Hudi supports another table storage layout called Merge On Read (MoR). MoR stores data using a combination of columnar parquet files and row-based Avro log files. Updates can be batched up in log files that are later compacted into new parquet files, synchronously or asynchronously, to balance maximum query performance with lower write amplification.


Thus, for near real-time streaming workloads, Hudi can use the more efficient row-oriented format, while for batch workloads it uses the vectorizable column-oriented format, seamlessly merging the two when required. Many users turn to Apache Hudi since it is the only project with this capability, which allows them to achieve unmatched write performance and E2E data pipeline latencies.
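
The snippet below sketches this in PySpark: writing a MoR table with inline compaction every few delta commits, then choosing the read trade-off per query. Table names, paths, and fields are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-mor").getOrCreate()

# Placeholder batch of incoming updates.
updates = spark.read.parquet("s3://my-bucket/staging/events_updates/")

mor_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "event_ts",
    # Fold row-based log files into columnar parquet every 5 delta commits;
    # compaction can instead be scheduled asynchronously by a separate job.
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
}

(updates.write.format("hudi")
    .options(**mor_options)
    .mode("append")
    .save("s3://my-bucket/lake/events/"))

# Readers pick the trade-off per query: "snapshot" merges logs for freshness,
# "read_optimized" reads only compacted base files for speed.
fresh = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "snapshot")
    .load("s3://my-bucket/lake/events/"))
```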

Partition Evolution

One feature often highlighted for Apache Iceberg is hidden partitioning, which unlocks what is called partition evolution. The basic idea is that when your data starts to evolve, or you just aren't getting the performance you need out of your current partitioning scheme, partition evolution allows you to update the partitions for new data without rewriting it. When you evolve your partitions, old data is left in the old partitioning scheme and only new data is partitioned under the new one. A table partitioned multiple ways pushes complexity onto the user and cannot guarantee consistent performance if the user is unaware of the evolution history.


Apache Hudi takes a different approach to adjusting data layout as your data evolves: Clustering. You can choose a coarse-grained partition strategy, or even leave the table unpartitioned, and use a more fine-grained clustering strategy within each partition. Clustering can be run synchronously or asynchronously and can be evolved without rewriting any data. This approach is comparable to Snowflake's micro-partitioning and clustering strategy.
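
As a rough illustration, the options below sketch the coarse-partition-plus-clustering approach described above; the partition and sort columns are hypothetical, and clustering can equally be run as a separate async job instead of inline.

```python
# Coarse partitioning plus fine-grained clustering, evolvable without
# rewriting historical data. Column names are placeholders.
clustering_options = {
    "hoodie.datasource.write.partitionpath.field": "event_date",  # coarse-grained
    # Schedule clustering inline every 4 commits; alternatively run it
    # asynchronously via hoodie.clustering.async.enabled on a separate job.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    # The fine-grained layout inside partitions; changing this later
    # re-clusters data without restructuring the table's partitions.
    "hoodie.clustering.plan.strategy.sort.columns": "city,event_type",
}

# Merge into the normal Hudi write, e.g.:
# df.write.format("hudi").options(**hudi_options, **clustering_options)...
```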

Multi-Modal Indexing


Indexing is an integral component of databases and data warehouses, yet it is largely absent in data lakes. In recent releases, Apache Hudi created a first-of-its-kind high-performance indexing subsystem for the lakehouse that we call the Hudi multi-modal index. Apache Hudi offers an asynchronous indexing mechanism that lets you build and change indexes without impacting write latency. This indexing mechanism is extensible and scalable to support any popular index technique, such as bloom filters, hash, bitmap, R-tree, etc.

These indexes are stored in the Hudi metadata table, which lives in cloud storage next to your data. In this new release, the metadata is written in optimized, indexed file formats, which results in 10-100x performance improvements for point lookups versus the generic file formats of Delta or Iceberg. When tested against real-world workloads, this new indexing subsystem delivers 10-30x improvements in overall query performance.
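
The sketch below shows the kind of configuration involved, assuming Hudi 0.11+ option names; the table path and predicate are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-data-skipping").getOrCreate()

# Writer side: maintain column-stats and bloom-filter indexes in the
# metadata table alongside the data (merge these into your write options).
index_options = {
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
    "hoodie.metadata.index.bloom.filter.enable": "true",
}

# Reader side: let the planner prune files using the column stats index.
pruned = (spark.read.format("hudi")
    .option("hoodie.metadata.enable", "true")
    .option("hoodie.enable.data.skipping", "true")
    .load("s3://my-bucket/lake/events/")
    .where("event_id = 'e-123'"))
pruned.explain()
```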

Ingestion Tools 

What sets a data platform apart from a data format are the operational services available. A differentiator for Apache Hudi is its powerful ingestion utility, DeltaStreamer. DeltaStreamer is battle-tested and used in production to build some of the largest data lakes on the planet today. It is a standalone utility that lets you incrementally ingest upstream changes from a wide variety of sources such as DFS, Kafka, database changelogs, S3 events, JDBC, and more.

Iceberg has no managed ingestion utility, and Delta Autoloader remains a proprietary Databricks feature that only supports cloud storage sources such as S3.

Use Cases - Examples from the community

Feature comparisons and benchmarks can help newcomers orient themselves on the available technology choices, but it is more important to size up your own use cases and workloads to find the right fit for your data architecture. All three of these technologies, Hudi, Delta, and Iceberg, have different origin stories and advantages for certain use cases. Iceberg was born at Netflix and was designed to overcome cloud storage scale problems such as file listings. Delta was born at Databricks and has deep integrations and accelerations when using the Databricks Spark runtime. Hudi was born at Uber to power petabyte-scale data lakes in near real time, with painless table management.

From years of real-world comparison evaluations in the community, Apache Hudi routinely holds a technical advantage once workloads mature beyond simple append-only inserts. Once you start processing many updates, adding real concurrency, or attempting to reduce the E2E latency of your pipelines, Apache Hudi stands out as the industry leader in performance and feature set.

Here are a few examples and stories from community members who independently evaluated Apache Hudi and decided to use it:

Amazon package delivery system

“One of the biggest challenges ATS faced was handling data at petabyte scale with the need for constant inserts, updates, and deletes with minimal time delay, which reflects real business scenarios and package movement to downstream data consumers.”

“In this post, we show how we ingest data in real time in the order of hundreds of GBs per hour and run inserts, updates, and deletes on a petabyte-scale data lake using Apache Hudi tables loaded using AWS Glue Spark jobs and other AWS server-less services including AWS Lambda, Amazon Kinesis Data Firehose, and Amazon DynamoDB”

ByteDance/TikTok

“In our scenario, the performance challenges are huge. The maximum data volume of a single table reaches 400PB+, the daily volume increase is PB level, and the total data volume reaches EB level.”

“The throughput is relatively large. The throughput of a single table exceeds 100 GB/s, and the single table needs PB-level storage. The data schema is complex. The data is highly dimensional and sparse. The number of table columns ranges from 1,000 to 10,000+. And there are a lot of complex data types.”

“When making the decision on the engine, we examine three of the most popular data lake engines, Hudi, Iceberg, and DeltaLake. These three have their own advantages and disadvantages in our scenarios. Finally, Hudi is selected as the storage engine based on Hudi's openness to the upstream and downstream ecosystems, support for the global index, and customized development interfaces for certain storage logic.”

Walmart


From video transcription:

“Okay, so what is it that enables this for us, and why do we really like the Hudi features that have unlocked this in other use cases? We like the optimistic concurrency or MVCC controls that are available to us. We've done a lot of work around asynchronous compaction, and we're in the process of moving to asynchronous rather than inline compaction on our merge-on-read tables.

We also want to reduce latency, so we leverage merge-on-read tables significantly because that enables us to append data much faster. We also love the native support for deletion. It's something we had to build custom frameworks for, for things like CCPA and GDPR, where somebody would put in a service desk ticket and we'd have to build an automation flow to remove records from HDFS; this comes out of the box for us.

Row versioning is really critical. Obviously a lot of our pipelines have out-of-order data, and we need the latest records to show up, so we provide version keys as part of our framework for all upserts into the Hudi tables.

The fact that customers can pick and choose how many versions of a row to keep, be able to provide snapshot queries, and get incremental updates like what's been updated in the last five hours is really powerful for a lot of users.”

Robinhood

“Robinhood has a genuine need to keep data freshness low for the Data Lake. Many of the batch processing pipelines that used to run on daily cadence after or before market hours had to be run at hourly or higher frequency to support evolving use-cases. It was clear we needed a faster ingestion pipeline to replicate online databases to the data-lake.”

“We are using Apache Hudi to incrementally ingest changelogs from Kafka to create data-lake tables. Apache Hudi is a unified Data Lake platform for performing both batch and stream processing over Data Lakes. Apache Hudi comes with a full-featured out-of-box Spark based ingestion system called Deltastreamer with first-class Kafka integration, and exactly-once writes. Unlike immutable data, our CDC data have a fairly significant proportion of updates and deletes. Hudi Deltastreamer takes advantage of its pluggable, record-level indexes to perform fast and efficient upserts on the Data Lake table.”

Zendesk

“The Data Lake pipelines consolidate the data from Zendesk’s highly distributed databases into a data lake for analysis.

Zendesk uses Amazon Database Migration Service (AWS DMS) for change data capture (CDC) from over 1,800 Amazon Aurora MySQL databases in eight AWS Regions. It detects transaction changes and applies them to the data lake using Amazon EMR and Hudi.

Zendesk ticket data consists of over 10 billion events and petabytes of data. The data lake files in Amazon S3 are transformed and stored in Apache Hudi format and registered on the AWS Glue catalog to be available as data lake tables for analytics querying and consumption via Amazon Athena.”

GE Aviation

“The introduction of a more seamless Apache Hudi experience within AWS has been a big win for our team. We’ve been busy incorporating Hudi into our CDC transaction pipeline and are thrilled with the results. We’re able to spend less time writing code managing the storage of our data, and more time focusing on the reliability of our system. This has been critical in our ability to scale. Our development pipeline has grown beyond 10,000 tables and more than 150 source systems as we approach another major production cutover.”

A Community that Innovates

Finally, given how quickly lakehouse technologies are evolving, it's important to consider where open source innovation in this space has come from. Below are a few foundational ideas and features that originated in Hudi and are now being adopted by the other projects.

| Hudi OSS Community Innovation | Equivalent Feature |
|---|---|
| Transactional updates (March 2017) | Delta OSS (April 2019) |
| Merge On Read (Oct 2017) | Iceberg (Aug 2021, v2 format approval) |
| Incremental Queries (March 2017) | Delta Change Feed OSS 2.x (June 2022) |
| Z-order/Hilbert Space Curves (Dec 2021) | Delta OSS 2.x (June 2022) |

In fact, outside of table metadata (file listings, column stats) support, the Hudi community has pioneered most of the other critical features that make up today's lakehouses. The community has supported over 1,500 user issues and 5,500+ Slack support threads over the last four years, and it is rapidly growing stronger, with an ambitious vision ahead. Users can consider this track record of innovation a leading indicator of the future that lies ahead.

Conclusion

When choosing the technology for your lakehouse, it is important to perform an evaluation against your own use cases. Feature comparison spreadsheets and benchmarks should not be the final deciding factor, so we hope this blog post simply provides a starting point and reference in your decision-making process. Apache Hudi is innovative, battle-hardened, and here to stay. Join us on the Hudi Slack, where you can ask questions and collaborate with a vibrant community from around the globe.

If you would like a 1:1 consultation to dive deep into your use cases and architecture, feel free to reach out at info@onehouse.ai. At Onehouse, we have decades of experience designing, building, and operating some of the largest distributed data systems in the world. We recognize these technologies are complex and rapidly evolving, and it is likely we missed a feature or misread the documentation in some of the comparisons above. Please drop a note to info@onehouse.ai if you see any comparison that needs correction, so we can keep the facts in this article accurate.

Update Notes

8/11/22 - Original publish date
1/11/23 - Refreshed feature comparisons; added community stats and benchmarks
1/12/23 - Databricks contributed a few minor corrections