Hudi Admin CLI User Guide, with Detailed Steps

fanxinglanyu, 2022-05-25 17:54:53. Category: Big Data. Tags: Big Data, Hudi, Admin CLI, Spark, Hive.

Original work by fanxinglanyu, published on the 51CTO blog. Copyright belongs to the author; contact the author for permission before reposting.

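Contents

Hudi Admin CLI User Guide
0 Common Issues

0.1 Common Metrics
0.2 Troubleshooting

0.2.1 Missing Records
0.2.2 Duplicates
0.2.3 Spark Failures

1 Getting Started
2 Viewing Hudi Table Information
3 Listing All Available Commands
4 Viewing Commits

4.1 Inspecting Commits
4.2 Drilling Into a Specific Commit

4.2.1 Understanding How Writes Spread Across Partitions
4.2.2 Drilling Down to File-Level Granularity

5 File System View

5.1 Viewing the Dataset's File Slices

6 Statistics
7 Compaction

7.1 Before Compacting
7.2 Manual Compaction
7.3 Unscheduling Compaction
7.4 Repairing Compaction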

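Hudi Admin CLI User Guide

0 Common Issues

0.1 Common Metrics

Once the Hudi client is configured with the correct dataset name and metrics environment, it emits the following Graphite metrics, which help when debugging a Hudi dataset:

Commit duration - the time taken to successfully commit a batch of records
Rollback duration - likewise, the time taken to undo the partial data left behind by a failed commit (this happens automatically after every failed write)
File-level metrics - the number of files added, versioned, and deleted (cleaned) in each commit
Record-level metrics - the total number of records inserted/updated per commit
Partition-level metrics - the number of partitions updated (very useful for understanding sudden spikes in commit duration)

These metrics can then be plotted in standard tools such as Grafana.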

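0.2 Troubleshooting

The sections below are generally helpful for debugging Hudi failures. The following metadata fields are added to every record and can be retrieved through standard Hadoop SQL engines (Hive/Presto/Spark), which makes it easier to gauge how severe a problem is.

_hoodie_record_key - treated as the primary key within each DFS partition; the basis of all updates/inserts

_hoodie_commit_time - the last commit that touched this record

_hoodie_file_name - the actual name of the file containing the record (very useful for tracking down duplicates)

_hoodie_partition_path - the path, relative to basePath, that identifies the partition containing this record

Note that, for now, Hudi assumes the application passes the same deterministic partition path for a given recordKey. In other words, uniqueness of the recordKey (primary key) is guaranteed only within each partition.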

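0.2.1 Missing Records

Use the admin commands described in this guide to check whether any write errors occurred during the window in which the record could have been written. If you do find errors, the record was not actually written by Hudi but was handed back to the application, which decides what to do with it.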

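0.2.2 Duplicates

First, make sure the query accessing the Hudi dataset is itself correct, and then confirm that the duplicates are real.

If they are, use the metadata fields above to identify the physical files and partitions that contain the records.

If the duplicate records live in files under different partition paths, your application is generating different partition paths for the same recordKey. Please fix your application.

If the duplicate records live in multiple files under the same partition path, please report the issue on the mailing list. This should not happen. You can repair the data with the records deduplicate command (exposed as repair deduplicate in more recent Hudi CLI versions).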
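The two duplicate cases above can be told apart mechanically once the metadata fields have been fetched through a query engine. The following is a minimal sketch, not part of Hudi: `classify_duplicates` and the row layout (plain dicts keyed by the `_hoodie_*` field names) are assumptions made for illustration.

```python
from collections import defaultdict


def classify_duplicates(rows):
    """Classify record keys that appear in more than one file.

    rows: iterable of dicts holding the Hudi metadata fields
    _hoodie_record_key, _hoodie_partition_path and _hoodie_file_name
    (hypothetical layout, e.g. the result of a SQL query).
    Returns {record_key: "cross-partition" | "same-partition"}.
    Keys repeated inside a single file are not flagged by this sketch.
    """
    locations = defaultdict(set)
    for row in rows:
        locations[row["_hoodie_record_key"]].add(
            (row["_hoodie_partition_path"], row["_hoodie_file_name"])
        )

    report = {}
    for key, locs in locations.items():
        if len(locs) < 2:
            continue  # key lives in exactly one file: not a duplicate here
        partitions = {partition for partition, _ in locs}
        # More than one partition path means the application produced
        # different partition paths for the same recordKey.
        report[key] = "cross-partition" if len(partitions) > 1 else "same-partition"
    return report
```

A "cross-partition" result points at the application's partition path logic, while "same-partition" is the case the text says should be reported.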

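0.2.3 Spark Failures

A typical upsert() DAG looks like the figure below. Note that the Hudi client caches intermediate RDDs so that it can intelligently profile the workload and size files and Spark parallelism. Also, the Spark UI shows sortByKey twice because the probe job is displayed as well, but it is only a single sort.

[Figure: typical upsert() DAG in the Spark UI]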

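Broadly, there are two steps.

Index lookup, to identify the files to be changed:

Job 1 : Triggers the read of the input data, converts it to HoodieRecord objects, and then derives the target partition paths from the input records.
Job 2 : Loads the set of file names that need to be checked.
Job 3 & 4 : Performs the actual lookup by joining the RDDs from 1 and 2 above, after smartly sizing the Spark join parallelism.
Job 5 : Produces an RDD of recordKeys tagged with their locations.

Actual writing of the data:

Job 6 : Lazily joins the records with the recordKeys (locations) to produce the final set of HoodieRecords, which now carry the file/partition path information for each record (null for inserts). The workload is then profiled again to determine file sizes.
Job 7 : Actually writes the data (updates + inserts + inserts turned into updates to maintain file size)

Depending on where the exception originates (Hudi/Spark), the DAG description above can be used to pinpoint the actual problem. The most commonly encountered failures are caused by transient YARN/DFS failures.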
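The two phases above (tag each record with its location, then split the workload into updates and inserts) can be illustrated with a toy model. This is only a sketch in plain Python, not Hudi's implementation: `tag_location`, `split_workload` and the dict-based index are hypothetical names chosen for the example.

```python
def tag_location(incoming, index):
    """Phase 1 (index lookup): attach the known file location to each record.

    incoming: list of (record_key, payload) pairs.
    index: dict mapping record_key -> file id for records already stored.
    A location of None marks the record as an insert.
    """
    return [(key, payload, index.get(key)) for key, payload in incoming]


def split_workload(tagged):
    """Phase 2 (write planning): separate updates from inserts."""
    updates = [(key, payload, loc) for key, payload, loc in tagged if loc is not None]
    inserts = [(key, payload) for key, payload, loc in tagged if loc is None]
    return updates, inserts
```

If a failure appears before any record has been tagged, it belongs to the lookup phase; if it appears afterwards, it belongs to the write phase.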

1 Getting Started

$ cd /usr/app/hudi-0.8.0/
$ cd hudi-cli && ./hudi-cli.sh

Connect to a Hudi dataset:

connect --path /user/hive/warehouse/test_increment_hudi5_mor

Here /user/hive/warehouse/test_increment_hudi5_mor is the base path of the table on HDFS.

2 Viewing Hudi Table Information

desc


3 Listing All Available Commands

help

All available commands:

* bootstrap index showmapping - Show bootstrap index mapping
* bootstrap index showpartitions - Show bootstrap indexed partitions
* bootstrap run - Run a bootstrap action for current Hudi table
* clean showpartitions - Show partition level details of a clean
* cleans refresh - Refresh table metadata
* cleans run - run clean
* cleans show - Show the cleans
* clear - Clears the console
* cls - Clears the console
* commit rollback - Rollback a commit
* commits compare - Compare commits with another Hoodie table
* commit show_write_stats - Show write stats of a commit
* commit showfiles - Show file level details of a commit
* commit showpartitions - Show partition level details of a commit
* commits refresh - Refresh table metadata
* commits show - Show the commits
* commits showarchived - Show the archived commits
* commits sync - Compare commits with another Hoodie table
* compaction repair - Renames the files to make them consistent with the timeline as dictated by Hoodie metadata. Use when compaction unschedule fails partially.
* compaction run - Run Compaction for given instant time
* compaction schedule - Schedule Compaction
* compaction show - Shows compaction details for a specific compaction instant
* compaction showarchived - Shows compaction details for a specific compaction instant
* compactions show all - Shows all compactions that are in active timeline
* compactions showarchived - Shows compaction details for specified time window
* compaction unschedule - Unschedule Compaction
* compaction unscheduleFileId - UnSchedule Compaction for a fileId
* compaction validate - Validate Compaction
* connect - Connect to a hoodie table
* create - Create a hoodie table if not present
* date - Displays the local date and time
* desc - Describe Hoodie Table properties
* exit - Exits the shell
* export instants - Export Instants and their metadata from the Timeline
* hdfsparquetimport - Imports Parquet table to a hoodie table

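4 Viewing Commits

4.1 Inspecting Commits

A commit is the task of upserting or inserting a batch of records.

Commits provide basic atomicity guarantees: only committed data is available to queries. Each commit has a monotonically increasing string/number called the commit number. Typically, this is the time at which the commit was started.

Source database table (6 changes):

[Screenshot: source database table]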

commits show --sortBy "Total Bytes Written" --desc true --limit 10

Hudi shell:

[Screenshot: commits show output in the Hudi shell]

Kafka Tool:

[Screenshot: Kafka Tool view]

List the files on HDFS with the hadoop command:

hadoop fs -ls /user/hive/warehouse/test_increment_hudi5_mor/2021/07/29/


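4.2 Drilling Into a Specific Commit

At the start of each write, Hudi also writes an .inflight commit to the .hoodie folder. The timestamp there can be used to estimate how long an in-progress commit has been running.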

hadoop fs -ls /user/hive/warehouse/test_increment_hudi5_mor/.hoodie/*.inflight
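As a rough illustration of that estimate, the sketch below derives the elapsed time from the name of an .inflight file. It assumes the file name begins with a 14-digit instant time in yyyyMMddHHmmss form (as in the commit 20210728164031 used elsewhere in this guide) and that the caller supplies "now" in the same time zone as the writer; `inflight_elapsed_seconds` is a hypothetical helper, not a Hudi API.

```python
import posixpath
from datetime import datetime

# Assumed instant-time layout: yyyyMMddHHmmss, e.g. 20210728164031.
INSTANT_FORMAT = "%Y%m%d%H%M%S"


def inflight_elapsed_seconds(path, now):
    """Return the seconds elapsed since the instant encoded in an .inflight file name.

    path: a path as printed by `hadoop fs -ls`, ending in a name such as
          20210728164031.inflight or 20210728164031.deltacommit.inflight.
    now:  a naive datetime in the same time zone as the writer.
    """
    file_name = posixpath.basename(path)
    instant = file_name.split(".", 1)[0]  # text before the first dot
    started = datetime.strptime(instant, INSTANT_FORMAT)
    return (now - started).total_seconds()
```

For example, calling it with a file named 20210728164031.deltacommit.inflight and a "now" of 2021-07-28 16:45:31 yields 300.0 seconds.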


4.2.1 Understanding How Writes Spread Across Partitions

commit showpartitions --commit 20210728164031 --sortBy "Total Bytes Written" --desc true --limit 10


Issue encountered: this query works for some commit timestamps but not for others.

4.2.2 Drilling Down to File-Level Granularity

commit showfiles --commit 20210728164031 --sortBy "Partition Path"


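5 File System View

Hudi views each partition as a collection of file groups, with each file group containing a list of file slices in commit order (see Concepts). The commands below let users view the file slices of a dataset.

5.1 Viewing the Dataset's File Slices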

show fsview all


Show the latest file slices:

show fsview latest --partitionPath "2021/07/29"
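The layout behind these two commands (partition, then file group, then file slices in commit order) can be modelled with nested dictionaries. The sketch below is an illustration only: `build_fs_view` and `latest_slices` are hypothetical helpers, and each file is reduced to a (partition path, file id, commit time) tuple rather than a real Hudi file name.

```python
from collections import defaultdict


def build_fs_view(files):
    """Group files into partition -> file group -> commit-ordered slices.

    files: iterable of (partition_path, file_id, commit_time) tuples, where
    commit_time is a fixed-width timestamp string such as "20210728164031",
    so string order matches chronological order.
    """
    view = defaultdict(lambda: defaultdict(list))
    for partition, file_id, commit_time in files:
        view[partition][file_id].append(commit_time)
    for groups in view.values():
        for slices in groups.values():
            slices.sort()  # oldest slice first, newest last
    return view


def latest_slices(view, partition):
    """Return the newest slice of every file group in one partition."""
    return {file_id: slices[-1] for file_id, slices in view.get(partition, {}).items()}
```

In this model, listing every slice corresponds to the "all" view, while `latest_slices` corresponds to the "latest" view for one partition.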


6 Statistics

Because Hudi directly manages file sizes for the DFS dataset, these statistics give an overall picture of how Hudi is doing.

stats filesizes --partitionPath 2021/07/29 --sortBy "95th" --desc true --limit 10


If Hudi writes are taking longer than usual, watching the write amplification metric can help uncover anomalies.
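To make the metric concrete, the sketch below computes a write amplification factor under the assumption that it is defined as the number of records actually written divided by the number of records upserted in a commit. `write_amplification` is a hypothetical helper, and returning 0.0 when nothing was upserted is a choice made for this example.

```python
def write_amplification(records_upserted, records_written):
    """Return records written per record upserted (assumed definition).

    A value well above 1 means each incoming record caused many existing
    records to be rewritten, for example when small updates touch large files.
    """
    if records_upserted == 0:
        return 0.0  # nothing was upserted, so no ratio can be formed
    return records_written / records_upserted
```

For instance, a commit that upserts 100 records while rewriting 250 records in total has a factor of 2.5.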


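7 Compaction

To limit the growth of .commit files on DFS, Hudi archives older .commit files (with due respect to the cleaner policy) into a commits.archived file. This is a sequence file containing a commitNumber => json mapping, with the raw information about each commit (the same information that is summarized above).

7.1 Before Compacting

To understand the lag between compaction and the writer applications, use the following command to list all pending compactions.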

compactions show all


Inspect a specific compaction plan:

compaction show --instant 20210728164031


Validate compaction: check that all the files required for the compaction are present and valid.

compaction validate --instant 20210728164031

No run output is available for this command yet.

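7.2 Manual Compaction

To schedule or run compaction manually, use the commands below. They use the Spark launcher to perform the compaction operations. Note: make sure no other application is scheduling compaction for this dataset at the same time.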

help compaction schedule


help compaction run

Help text for compaction run:

Keyword:                   compaction run
Description:               Run Compaction for given instant time
 Keyword:                  parallelism
   Help:                   Parallelism for hoodie compaction
   Mandatory:              true
   Default if specified:   '__NULL__'
   Default if unspecified: '__NULL__'

 Keyword:                  schemaFilePath
   Help:                   Path for Avro schema file
   Mandatory:              true
   Default if specified:   '__NULL__'
   Default if unspecified: '__NULL__'

 Keyword:                  sparkMemory
   Help:                   Spark executor memory
   Mandatory:              false
   Default if specified:   '__NULL__'
   Default if unspecified: '4G'

 Keyword:                  retry
   Help:                   Number of retries
   Mandatory:              false
   Default if specified:   '__NULL__'
   Default if unspecified: '1'

 Keyword:                  compactionInstant
   Help:                   Base path for the target hoodie table
   Mandatory:              false
   Default if specified:   '__NULL__'
   Default if unspecified: '__NULL__'

 Keyword:                  propsFilePath
   Help:                   path to properties file on localfs or dfs with configurations for hoodie client for compacting
   Mandatory:              false
   Default if specified:   '__NULL__'
   Default if unspecified: ''

 Keyword:                  hoodieConfigs
   Help:                   Any configuration that can be set in the properties file can be passed here in the form of an array
   Mandatory:              false
   Default if specified:   '__NULL__'
   Default if unspecified: ''

* compaction run - Run Compaction for given instant time

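To asynchronously execute a specific compaction from the Hudi CLI (per the help text above, --parallelism and --schemaFilePath are mandatory, and --tableName is not among the listed options):

hudi:trips->compaction run --parallelism <parallelism> --schemaFilePath <schema_file_path> --compactionInstant <InstantTime>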

7.3 Unscheduling Compaction

Unschedule compaction for a single file ID:

hoodie:trips->compaction unscheduleFileId --fileId <FileUUID>
....
No File renames needed to unschedule file from pending compaction. Operation successful.

In other cases, the entire compaction plan needs to be reverted. The following CLI command supports this:

hoodie:trips->compaction unschedule --compactionInstant <compactionInstant>
.....
No File renames needed to unschedule pending compaction. Operation successful.

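7.4 Repairing Compaction

The compaction unscheduling operations above can sometimes fail partially (for example, when DFS is temporarily unavailable). After a partial failure, the compaction operation may be inconsistent with the state of the file slices. When you run compaction validate, you will see any invalid compaction operations. In such cases the repair command comes to the rescue: it rearranges the file slices so that no files are lost and the file slices are consistent with the compaction plan.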

compaction repair --instant 20210728164031
