Notice

Apache Hudi vs Delta Lake vs Apache Iceberg - Lakehouse Feature Comparison

Thanks to the original author!

This article is reposted with the permission of the original publisher. If you wish to repost it, please contact the publisher by email: info@onehouse.ai

The reposted article follows:

Apache Hudi vs Delta Lake vs Apache Iceberg - Lakehouse Feature Comparison

Written by

Kyle Weller

Introduction

With the growing popularity of the lakehouse, there has been rising interest in analyzing and comparing the open source projects at the core of this data architecture: Apache Hudi, Delta Lake, and Apache Iceberg.

Most comparison articles currently published seem to evaluate these projects merely as table/file formats for traditional append-only workloads, overlooking qualities and features that are critical for modern data lake platforms that must support update-heavy workloads with continuous table management. This article dives into greater depth to highlight the technical differentiators of Apache Hudi and how it is a full-fledged data lake platform, steps ahead of the rest.

This article is periodically updated to keep up with the fast-moving landscape. The last update was in January 2023; it refreshed the feature comparison matrix, added statistics about community adoption, and referenced recent industry benchmarks.

Feature Comparisons

First, let's look at an overall feature comparison. As you read, notice how the Hudi community has invested heavily in comprehensive platform services on top of the lake storage format. While formats are critical for standardization and interoperability, table/platform services give you a powerful toolkit to easily develop and manage your data lake deployments.


The comparison below covers Apache Hudi as of v0.12.2, Delta Lake as of v2.2.0, and Apache Iceberg as of v1.1.0. In the tables, ✓ means supported, ⚠ means partial support, and ✗ means not supported.

Read/write features

| Feature | Apache Hudi | Delta Lake | Apache Iceberg |
|---|---|---|---|
| ACID Transactions | ✓ | ✓ | ✓ |
| Copy-On-Write (can I version and rewrite columnar files?) | ✓ Writes | ✓ Writes | ✓ Writes |
| Merge-On-Read (can I efficiently amortize updates without rewriting the whole file?) | ✓ Merge-On-Read | ✗ | ⚠ Limited functionality; cannot balance merge performance for queries, and requires manual compaction maintenance |
| Efficient Bulk Load (can I efficiently lay out the initial load into the table?) | ✓ Bulk_Insert | ✗ | ✗ |
| Efficient merge writes with record-level indexes (can I avoid merging all base files against all incoming update/delete records?) | ✓ Over 4 types of indexing | ✗ Bloom filter index still proprietary | ✗ Metadata indexing is for tracking statistics |
| Bootstrap (can I upgrade data in place into the system without rewriting the data?) | ✓ Bootstrap | ✓ Convert to Delta | ✓ Table migration |
| Incremental Query (can I obtain a change stream for a given time window on the table?) | ✓ Incremental Query | ⚠ CDF experimental mode in 2.0.0 | ✗ Can only incrementally read appends |
| Time Travel (can I query the table as of a point in time?) | ✓ Time Travel | ✓ Time Travel | ✓ Time Travel |
| Managed Ingestion (can I ingest data streams from popular sources, with no/low code?) | ✓ Hudi DeltaStreamer | ✗ | ✗ |
| Concurrency (can I run different writers and table services against the table at the same time?) | ✓ OCC with non-blocking table services | ⚠ OCC only | ⚠ OCC only |
| Primary Keys (can I define primary keys like regular database tables?) | ✓ Primary Keys | ✗ | ✗ |
| Column Statistics and Data Skipping (can queries benefit from file pruning based on predicates from any column, without reading data file footers?) | ✓ Column stats in metadata; HFile column stats index adds up to 50x perf | ✓ Column stats in parquet checkpoint | ✓ Column stats in avro manifest |
| Data Skipping based on built-in functions (can queries perform data skipping based on functions defined on column values, in addition to literal column values?) | ⚠ With the column stats index, Hudi can effectively prune files based on column predicates and order-preserving functions on columns | ✓ Logical predicates on a source or generated column will prune files during query execution | ✓ Iceberg can transform table data to partition values and maintain the relationship, while also collecting stats on columns |
| Partition Evolution (can I keep changing the partition structure of the table as I go?) | ✗ Hudi takes a different approach, with coarse-grained partitions and fine-grained clustering that can be evolved asynchronously without rewriting data | ✗ Delta Lake also considers more complex partitioning an anti-pattern | ✓ Partition evolution lets you change partitions as your data evolves; old data stays in old partitions, new data gets new partitions, with uneven performance across them |
| Data Deduplication (can I insert data without introducing duplicates?) | ✓ Record key uniqueness, precombine utility customizations, merge, drop dupes from inserts | ⚠ Merge only | ⚠ Merge only |

Table Services

| Feature | Apache Hudi | Delta Lake | Apache Iceberg |
|---|---|---|---|
| File Sizing (can I configure a single standard file size to be enforced across any writes to the table automatically?) | ✓ Automated file size tuning | ⚠ OPTIMIZE command open sourced in 2.0, but automation still proprietary | ⚠ Manual maintenance |
| Compaction (merge changelogs with updates/deletes from MoR writes) | ✓ Managed compaction | ✗ File sizing only; no MoR, so no compaction of deletes/changes | ⚠ Delete compaction is manual maintenance |
| Cleaning (do older versions of files get automatically removed from storage?) | ✓ Managed cleaning service | ⚠ VACUUM is a manual operation for data, managed for the transaction log | ⚠ Expiring snapshots is a manual operation |
| Index Management (can I build new indexes on the table?) | ✓ Async multi-modal indexing subsystem | ✗ | ✗ |
| Linear Clustering (can I linearly co-locate certain data close together for performance?) | ✓ Automated clustering that can be evolved for perf tuning, user-defined partitioners | ✗ | ⚠ You can force writers to sort as they write |
| Multidimensional Z-Order/Space-Curve Clustering (can I sort high-cardinality data with space curves for performance?) | ✓ Z-order + Hilbert curves with auto async clustering | ⚠ Z-order through manual maintenance | ⚠ Z-order through manual maintenance |
| Schema Evolution (can I adjust the schema of my table?) | ⚠ Schema evolution for add, reorder, drop, rename, update (Spark only) | ✓ Schema evolution for add, reorder, drop, rename, update | ✓ Schema evolution for add, reorder, drop, rename, update |
| Scalable Metadata Management (can the table metadata scale with my data sizes?) | ✓ Hudi MoR-based metadata table with HFile format for 100x faster lookups, self-managed like any Hudi table | ⚠ Parquet txn log checkpoints, significantly slower lookups | ⚠ Avro manifest files, significantly slower and need maintenance as you scale |

Platform Support

| Feature | Apache Hudi | Delta Lake | Apache Iceberg |
|---|---|---|---|
| CLI (can I manage my tables with a CLI?) | ✓ CLI | ✗ | ✗ |
| Data Quality Validation (can I define quality conditions to be checked and enforced?) | ✓ Pre-commit validators | ✓ Delta constraints | ✗ |
| Pre-commit Transformers (can I transform data before commit while I write?) | ✓ Transformers | ✗ | ✗ |
| Commit Notifications (can I get a callback notification on successful commit?) | ✓ Commit notifications | ✗ | ✗ |
| Failed Commit Safeguards (how am I protected from partial and failed write operations?) | ✓ Automated marker mechanism | ⚠ Manual configs | ✗ Orphaned files need manual maintenance; failed commits can corrupt the table |
| Monitoring (can I get metrics and monitoring out of the box?) | ✓ MetricsReporter for automated monitoring | ✗ | ✗ |
| Savepoint and Restore (can I save a snapshot of the data and then restore the table back to that form?) | ✓ Savepoint command to save specific versions; restore command with time-travel versions or savepoints | ⚠ Restore command with time-travel versions; all versions must be preserved within the VACUUM retention (e.g., to restore to 6 months ago, you must retain 6 months of versions, or DIY) | ✗ DIY |

Ecosystem Support

| Engine | Apache Hudi | Delta Lake | Apache Iceberg |
|---|---|---|---|
| Apache Spark | Read + Write | Read + Write | Read + Write |
| Apache Flink | Read + Write | Read + Write | Read + Write |
| Presto | Read | Read | Read + Write |
| Trino | Read | Read + Write | Read + Write |
| Hive | Read | Read | Read + Write |
| DBT | Read + Write | Read + Write | ✗ |
| Kafka Connect | Write | ✗ Proprietary only | ✗ |
| Kafka | Write | Write | ✗ |
| Pulsar | Write | Write | Write |
| Debezium | Write | Write | Write |
| Kyuubi | Read + Write | ✗ | Read + Write |
| ClickHouse | Read | Read | ✗ |
| Apache Impala | Read + Write | ✗ | Read + Write |
| AWS Athena | Read | Read | Read + Write |
| AWS EMR | Read + Write | Read + Write | Read + Write |
| AWS Redshift | Read | Read | ✗ |
| AWS Glue | Read + Write | Read + Write | Read + Write |
| Google BigQuery | Read | ✗ | Read |
| Google DataProc | Read + Write | Read + Write | Read + Write |
| Azure Synapse | Read + Write | Read + Write | ✗ |
| Azure HDInsight | Read + Write | Read + Write | ✗ |
| Databricks | Read + Write | Read + Write | Read + Write |
| Snowflake | ✗ | Read | Read + Write |
| Vertica | Read | Read | ✗ |
| Apache Doris | Read | ✗ | Read |
| Starrocks | Read | Preview | Read |
| Dremio | ✗ | Read (with limitations) | Read + Write (with limitations) |

Community Momentum

Equally important to the features and capabilities of an open source project is its community. The community can make or break the development momentum, ecosystem adoption, and objectiveness of the platform. Below is a comparison of the Hudi, Delta, and Iceberg communities:

GitHub Stars

GitHub stars are a vanity metric that represents popularity more than contribution. Delta Lake leads the pack in awareness and popularity.

GitHub Watchers and Forks

Watchers and forks are a closer indication of engagement with and usage of the projects.

GitHub Contributors

In December 2022, almost 90 unique authors contributed to Apache Hudi, more than twice as many as Iceberg and three times as many as Delta Lake.

GitHub PRs and Issues

In December 2022, Hudi and Iceberg merged about the same number of PRs, while roughly twice as many PRs were opened in Hudi.

Contribution Diversity

Apache Hudi and Apache Iceberg both draw contributions from a strongly diverse community.

(Charts: contribution diversity for Apache Hudi, Apache Iceberg, and Delta Lake.)

TPC-DS Performance Benchmarks

Performance benchmarks are rarely representative of real-life workloads, and we strongly encourage the community to run their own analysis against their own data. Nonetheless, these benchmarks can serve as an interesting data point while you start your research into choosing a lakehouse platform. Below are references to relevant benchmarks:

Databeans and Onehouse

Databeans worked with Databricks to publish a benchmark used in their Data+AI Summit keynote in June 2022, but misconfigured an obvious out-of-the-box setting. Onehouse corrected the benchmark here:
Apache Hudi vs Delta Lake - Transparent TPC-DS Lakehouse Performance Benchmarks

Brooklyn Data and Onehouse

In November 2022, Databricks asked Brooklyn Data to publish a benchmark of Delta vs Iceberg:
Setting the Table: Benchmarking Open Table Formats

Onehouse added Apache Hudi and published the code in the Brooklyn Data GitHub repo:
https://github.com/brooklyn-data/delta/pull/2

A clear pattern emerges from these benchmarks: Delta and Hudi are comparable, while Apache Iceberg consistently trails behind as the slowest of the projects. Performance isn't the only factor you should consider, but performance does translate into cost savings that add up throughout your pipelines.

A note on running TPC-DS benchmarks:

One key thing to remember when running TPC-DS benchmarks comparing Hudi, Delta, and Iceberg is that Delta and Iceberg are optimized for append-only workloads by default, while Hudi is optimized for mutable workloads by default. Out of the box, Hudi uses the `upsert` write operation, which naturally carries a write overhead compared to plain inserts. Without this knowledge, you may be comparing apples to oranges. Change this one out-of-the-box configuration to `bulk_insert` for a fair assessment: Write Operations | Apache Hudi
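
For illustration, here is a minimal PySpark sketch of that one-line change; the bucket paths, table name, and key fields are hypothetical placeholders, not values from any published benchmark code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-tpcds-load").getOrCreate()

# Placeholder source data for the load phase of a benchmark.
df = spark.read.parquet("s3://my-bucket/tpcds/store_sales/")

hudi_options = {
    "hoodie.table.name": "store_sales",
    "hoodie.datasource.write.recordkey.field": "ss_ticket_number",
    "hoodie.datasource.write.precombine.field": "ss_sold_date_sk",
    # Hudi defaults to "upsert", which is tuned for mutable workloads and
    # pays a record-indexing cost on write; "bulk_insert" is the fair
    # apples-to-apples setting for an append-only TPC-DS load.
    "hoodie.datasource.write.operation": "bulk_insert",
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("overwrite")
   .save("s3://my-bucket/lake/store_sales/"))
```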

Feature Highlights

Building a data lake platform is more than just checking boxes of feature availability. Let's pick a few of the differentiating features above and dive into the use cases and real benefits in plain English.

Incremental Pipelines


The majority of data engineers today feel like they have to choose between streaming and old-school batch ETL pipelines. Apache Hudi has pioneered a new paradigm called incremental pipelines. Out of the box, Hudi tracks all changes (appends, updates, deletes) and exposes them as change streams. With record-level indexes, you can leverage these change streams more efficiently, avoiding recomputation and processing only the changes incrementally. While other data lake platforms may offer a way to consume changes incrementally, Hudi is designed from the ground up to enable incrementalization efficiently, which results in cost-efficient ETL pipelines at lower latencies.
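
As a hedged sketch of what one step of an incremental pipeline looks like in practice, the PySpark snippet below reads only the records committed after a given instant; the table path and instant timestamp are placeholder values.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental").getOrCreate()

# Pull only the records that changed after a given commit instant
# (instants use Hudi's yyyyMMddHHmmss timestamp format).
changes = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20230101000000")
    .load("s3://my-bucket/lake/orders/"))

# Downstream ETL now processes just the delta instead of rescanning the table.
changes.createOrReplaceTempView("orders_changes")
spark.sql("SELECT count(*) AS changed_rows FROM orders_changes").show()
```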

Databricks recently developed a similar feature called Change Data Feed, which it held proprietary until it was finally released to open source in Delta Lake 2.0. Iceberg has an incremental read, but it only lets you read incremental appends, not the updates and deletes that are essential for true change data capture and transactional data.

Concurrency Control


ACID transactions and concurrency control are key characteristics of a lakehouse, but how do current designs actually stack up against real-world workloads? Hudi, Delta, and Iceberg all support optimistic concurrency control (OCC), in which writers check whether they have overlapping files; if a conflict exists, they fail the operation and retry. In Delta Lake, for example, this was until recently just a JVM-level lock held on a single Apache Spark driver node, which meant you had no OCC outside of a single cluster.

While this may work fine for append-only, immutable datasets, optimistic concurrency control struggles in real-world scenarios, where frequent updates and deletes are needed, whether because of the data loading pattern or to reorganize the data for query performance. Oftentimes, it's not practical to take writers offline for table management to keep the table healthy and performant. Apache Hudi's concurrency control is more granular than that of other data lake platforms (file level), and with a design optimized for multiple small updates/deletes, the possibility of conflict can be reduced to negligible in most real-world cases. You can read more in this blog about how you can run asynchronous table services even in multi-writer scenarios, without needing to pause writers. This is very close to the level of concurrency supported by standard databases.
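
Below is a minimal sketch of what enabling Hudi's multi-writer OCC looks like, assuming a ZooKeeper-based lock provider; the ZooKeeper endpoint and table details are placeholders.

```python
# Options each concurrent writer adds on top of its normal Hudi write config.
occ_options = {
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    # Leave failed writes for the cleaner instead of eagerly rolling back,
    # which is required when several writers may be in flight at once.
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider",
    "hoodie.write.lock.zookeeper.url": "zk-host",        # placeholder endpoint
    "hoodie.write.lock.zookeeper.port": "2181",
    "hoodie.write.lock.zookeeper.lock_key": "orders",
    "hoodie.write.lock.zookeeper.base_path": "/hudi/locks",
}

# Each writer then merges these into its write, e.g.:
# df.write.format("hudi").options(**hudi_options, **occ_options)...
```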

Merge On Read

Any good database system supports different trade-offs between write and query performance. The Hudi community has made some seminal contributions in terms of defining these concepts for data lake storage across the industry. Hudi, Delta, and Iceberg all write and store data in parquet files. When updates occur, these parquet files are versioned and rewritten. This write pattern is what the industry now calls Copy On Write (CoW). This model works well for optimizing query performance, but can be limiting for write performance and data freshness. In addition to CoW, Apache Hudi supports another table storage layout called Merge On Read (MoR). MoR stores data using a combination of columnar parquet files and row-based Avro log files. Updates can be batched up in log files that are later compacted into new parquet files, synchronously or asynchronously, to balance maximum query performance with lower write amplification.


Thus, for near real-time streaming workloads, Hudi can use the more efficient row-oriented format, while for batch workloads it uses the vectorizable column-oriented format, seamlessly merging the two when required. Many users turn to Apache Hudi since it is the only project with this capability, which allows them to achieve unmatched write performance and E2E data pipeline latencies.
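
The snippet below sketches this in PySpark: writing a MoR table with inline compaction every few delta commits, then choosing the read trade-off per query. Table names, paths, and fields are hypothetical placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-mor").getOrCreate()

# Placeholder batch of incoming updates.
updates = spark.read.parquet("s3://my-bucket/staging/events_updates/")

mor_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "event_ts",
    # Fold row-based log files into columnar parquet every 5 delta commits;
    # compaction can instead be scheduled asynchronously by a separate job.
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
}

(updates.write.format("hudi")
    .options(**mor_options)
    .mode("append")
    .save("s3://my-bucket/lake/events/"))

# Readers pick the trade-off per query: "snapshot" merges logs for freshness,
# "read_optimized" reads only compacted base files for speed.
fresh = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "snapshot")
    .load("s3://my-bucket/lake/events/"))
```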

Partition Evolution

One feature often highlighted for Apache Iceberg is hidden partitioning, which unlocks what is called partition evolution. The basic idea is that when your data starts to evolve, or you just aren't getting the performance you need out of your current partitioning scheme, partition evolution allows you to update the partitions for new data without rewriting it. When you evolve your partitions, old data is left in the old partitioning scheme and only new data is partitioned under the new one. A table partitioned multiple ways pushes complexity onto the user and cannot guarantee consistent performance if the user is unaware of the evolution history.


Apache Hudi takes a different approach to adjusting data layout as your data evolves: Clustering. You can choose a coarse-grained partition strategy, or even leave the table unpartitioned, and use a more fine-grained clustering strategy within each partition. Clustering can be run synchronously or asynchronously and can be evolved without rewriting any data. This approach is comparable to Snowflake's micro-partitioning and clustering strategy.
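
As a rough illustration, the options below sketch the coarse-partition-plus-clustering approach described above; the partition and sort columns are hypothetical, and clustering can equally be run as a separate async job instead of inline.

```python
# Coarse partitioning plus fine-grained clustering, evolvable without
# rewriting historical data. Column names are placeholders.
clustering_options = {
    "hoodie.datasource.write.partitionpath.field": "event_date",  # coarse-grained
    # Schedule clustering inline every 4 commits; alternatively run it
    # asynchronously via hoodie.clustering.async.enabled on a separate job.
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",
    # The fine-grained layout inside partitions; changing this later
    # re-clusters data without restructuring the table's partitions.
    "hoodie.clustering.plan.strategy.sort.columns": "city,event_type",
}

# Merge into the normal Hudi write, e.g.:
# df.write.format("hudi").options(**hudi_options, **clustering_options)...
```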

Multi-Modal Indexing


Indexing is an integral component of databases and data warehouses, yet it is largely absent in data lakes. In recent releases, Apache Hudi created a first-of-its-kind high-performance indexing subsystem for the lakehouse that we call the Hudi multi-modal index. Apache Hudi offers an asynchronous indexing mechanism that lets you build and change indexes without impacting write latency. This indexing mechanism is extensible and scalable to support any popular index technique, such as bloom filters, hash, bitmap, R-tree, etc.

These indexes are stored in the Hudi metadata table, which lives in cloud storage next to your data. In this new release, the metadata is written in optimized, indexed file formats, which results in 10-100x performance improvements for point lookups versus the generic file formats of Delta or Iceberg. When tested against real-world workloads, this new indexing subsystem delivers 10-30x improvements in overall query performance.
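
The sketch below shows the kind of configuration involved, assuming Hudi 0.11+ option names; the table path and predicate are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-data-skipping").getOrCreate()

# Writer side: maintain column-stats and bloom-filter indexes in the
# metadata table alongside the data (merge these into your write options).
index_options = {
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
    "hoodie.metadata.index.bloom.filter.enable": "true",
}

# Reader side: let the planner prune files using the column stats index.
pruned = (spark.read.format("hudi")
    .option("hoodie.metadata.enable", "true")
    .option("hoodie.enable.data.skipping", "true")
    .load("s3://my-bucket/lake/events/")
    .where("event_id = 'e-123'"))
pruned.explain()
```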

Ingestion Tools 

What sets a data platform apart from a data format are the operational services available. A differentiator for Apache Hudi is its powerful ingestion utility, DeltaStreamer. DeltaStreamer is battle-tested and used in production to build some of the largest data lakes on the planet today. It is a standalone utility that lets you incrementally ingest upstream changes from a wide variety of sources such as DFS, Kafka, database changelogs, S3 events, JDBC, and more.

Iceberg has no managed ingestion utility, and Delta Autoloader remains a proprietary Databricks feature that only supports cloud storage sources such as S3.

Use Cases - Examples from the community

Feature comparisons and benchmarks can help newcomers orient themselves on the available technology choices, but it is more important to size up your own use cases and workloads to find the right fit for your data architecture. All three of these technologies, Hudi, Delta, and Iceberg, have different origin stories and advantages for certain use cases. Iceberg was born at Netflix and was designed to overcome cloud storage scale problems such as file listings. Delta was born at Databricks and has deep integrations and accelerations when using the Databricks Spark runtime. Hudi was born at Uber to power petabyte-scale data lakes in near real time, with painless table management.

From years of real-world comparison evaluations in the community, Apache Hudi routinely holds a technical advantage once workloads mature beyond simple append-only inserts. Once you start processing many updates, adding real concurrency, or attempting to reduce the E2E latency of your pipelines, Apache Hudi stands out as the industry leader in performance and feature set.

Here are a few examples and stories from community members who independently evaluated Apache Hudi and decided to use it:

Amazon package delivery system

“One of the biggest challenges ATS faced was handling data at petabyte scale with the need for constant inserts, updates, and deletes with minimal time delay, which reflects real business scenarios and package movement to downstream data consumers.”

“In this post, we show how we ingest data in real time in the order of hundreds of GBs per hour and run inserts, updates, and deletes on a petabyte-scale data lake using Apache Hudi tables loaded using AWS Glue Spark jobs and other AWS server-less services including AWS Lambda, Amazon Kinesis Data Firehose, and Amazon DynamoDB”

ByteDance/TikTok

“In our scenario, the performance challenges are huge. The maximum data volume of a single table reaches 400PB+, the daily volume increase is PB level, and the total data volume reaches EB level.”

“The throughput is relatively large. The throughput of a single table exceeds 100 GB/s, and the single table needs PB-level storage. The data schema is complex. The data is highly dimensional and sparse. The number of table columns ranges from 1,000 to 10,000+. And there are a lot of complex data types.”

“When making the decision on the engine, we examine three of the most popular data lake engines, Hudi, Iceberg, and DeltaLake. These three have their own advantages and disadvantages in our scenarios. Finally, Hudi is selected as the storage engine based on Hudi's openness to the upstream and downstream ecosystems, support for the global index, and customized development interfaces for certain storage logic.”

Walmart


From video transcription:

“Okay, so what is it that enables this for us, and why do we really like the Hudi features that have unlocked this in other use cases? We like the optimistic concurrency or MVCC controls that are available to us. We've done a lot of work around asynchronous compaction, and we're in the process of moving to asynchronous rather than inline compaction on our merge-on-read tables.

We also want to reduce latency, so we leverage merge-on-read tables significantly because that enables us to append data much faster. We also love the native support for deletion. It's something we had to build custom frameworks for, for things like CCPA and GDPR, where somebody would put in a service desk ticket and we'd have to build an automation flow to remove records from HDFS; this comes out of the box for us.

Row versioning is really critical. Obviously a lot of our pipelines have out-of-order data, and we need the latest records to show up, so we provide version keys as part of our framework for all upserts into the Hudi tables.

The fact that customers can pick and choose how many versions of a row to keep, be able to provide snapshot queries, and get incremental updates like what's been updated in the last five hours is really powerful for a lot of users.”

Robinhood

“Robinhood has a genuine need to keep data freshness low for the Data Lake. Many of the batch processing pipelines that used to run on daily cadence after or before market hours had to be run at hourly or higher frequency to support evolving use-cases. It was clear we needed a faster ingestion pipeline to replicate online databases to the data-lake.”

“We are using Apache Hudi to incrementally ingest changelogs from Kafka to create data-lake tables. Apache Hudi is a unified Data Lake platform for performing both batch and stream processing over Data Lakes. Apache Hudi comes with a full-featured out-of-box Spark based ingestion system called Deltastreamer with first-class Kafka integration, and exactly-once writes. Unlike immutable data, our CDC data have a fairly significant proportion of updates and deletes. Hudi Deltastreamer takes advantage of its pluggable, record-level indexes to perform fast and efficient upserts on the Data Lake table.”

Zendesk

“The Data Lake pipelines consolidate the data from Zendesk’s highly distributed databases into a data lake for analysis.

Zendesk uses Amazon Database Migration Service (AWS DMS) for change data capture (CDC) from over 1,800 Amazon Aurora MySQL databases in eight AWS Regions. It detects transaction changes and applies them to the data lake using Amazon EMR and Hudi.

Zendesk ticket data consists of over 10 billion events and petabytes of data. The data lake files in Amazon S3 are transformed and stored in Apache Hudi format and registered on the AWS Glue catalog to be available as data lake tables for analytics querying and consumption via Amazon Athena.”

GE Aviation

“The introduction of a more seamless Apache Hudi experience within AWS has been a big win for our team. We’ve been busy incorporating Hudi into our CDC transaction pipeline and are thrilled with the results. We’re able to spend less time writing code managing the storage of our data, and more time focusing on the reliability of our system. This has been critical in our ability to scale. Our development pipeline has grown beyond 10,000 tables and more than 150 source systems as we approach another major production cutover.”

A Community that Innovates

Finally, given how quickly lakehouse technologies are evolving, it's important to consider where open source innovation in this space has come from. Below are a few foundational ideas and features that originated in Hudi and are now being adopted by the other projects.

| Hudi OSS Community Innovation | Equivalent Feature |
|---|---|
| Transactional updates (March 2017) | Delta OSS (April 2019) |
| Merge On Read (Oct 2017) | Iceberg (Aug 2021, v2 format approval) |
| Incremental Queries (March 2017) | Delta Change Feed OSS 2.x (June 2022) |
| Z-order/Hilbert Space Curves (Dec 2021) | Delta OSS 2.x (June 2022) |

In fact, outside of table metadata (file listings, column stats) support, the Hudi community has pioneered most of the other critical features that make up today's lakehouses. The community has supported over 1,500 user issues and 5,500+ Slack support threads over the last four years, and it is rapidly growing stronger, with an ambitious vision ahead. Users can consider this track record of innovation a leading indicator of the future that lies ahead.

Conclusion

When choosing the technology for your lakehouse, it is important to perform an evaluation against your own use cases. Feature comparison spreadsheets and benchmarks should not be the final deciding factor, so we hope this blog post simply provides a starting point and reference in your decision-making process. Apache Hudi is innovative, battle-hardened, and here to stay. Join us on the Hudi Slack, where you can ask questions and collaborate with a vibrant community from around the globe.

If you would like a 1:1 consultation to dive deep into your use cases and architecture, feel free to reach out at info@onehouse.ai. At Onehouse, we have decades of experience designing, building, and operating some of the largest distributed data systems in the world. We recognize these technologies are complex and rapidly evolving, and it is likely we missed a feature or misread the documentation in some of the comparisons above. Please drop a note to info@onehouse.ai if you see any comparison that needs correction, so we can keep the facts in this article accurate.

Update Notes

8/11/22 - Original publish date
1/11/23 - Refreshed feature comparisons; added community stats and benchmarks
1/12/23 - Databricks contributed a few minor corrections