1.MSCK REPAIR TABLE 一般用法

MSCK REPAIR TABLE命令主要是用来解决通过hdfs dfs -put或者hdfs api写入hive分区表的数据在hive中无法被查询到的问题。
我们都知道hive服务中有一个叫hive metastore的服务,这个服务主要是存储一些meta元数据信息,比如数据库名,表名或者表的分区等等信息。

如果不是通过hive的insert等插入语句,很多分区信息在metastore中是没有的,如果插入分区数据量很多的话,你用 ALTER TABLE table_name ADD PARTITION 一个个分区添加十分麻烦。这时候**MSCK REPAIR TABLE**就派上用场了。

具体语法如下:

MSCK REPAIR TABLE table_name;

原理相当简单,执行后,Hive会检测如果HDFS目录下存在但表的metastore中不存在的partition元信息,更新到metastore中。

案例说明:

#当前没有partition元信息
hive> show partitions ew_rcb_customer;
OK
Time taken: 0.104 seconds
#创建两个分区目录
hive> dfs -mkdir /user/hive/warehouse/ew_rcb_customer/syncdate=20211025;
hive> dfs -mkdir /user/hive/warehouse/ew_rcb_customer/syncdate=20211026;
#使用MSCK修复分区
hive> msck repair table ew_rcb_customer;
OK
Partitions not in metastore:	ew_rcb_customer=20211025
Partitions not in metastore:	ew_rcb_customer=20211026
Repair: Added partition to metastore ew_rcb_customer:syncdate=20211025
Repair: Added partition to metastore ew_rcb_customer:syncdate=20211026
Time taken: 0.286 seconds, Fetched: 2 row(s)
#再次查看,发现已经成功更新元信息
hive> show partitions ew_rcb_customer;
OK
syncdate=20211025
syncdate=20211026

这种用法我们都很熟悉,这里就不再赘述。

但是MSCK REPAIR TABLE这个命令能否删除metastore已经不存在hdfs上的表分区信息呢,答案是否定的,所以才引发本文第2部分来处理
come on …


2. msck repair 增加清理metastore中已经不在hdfs上的分区信息

发现Fix Version/s: 3.0.0, 2.4.0, 3.1.0 这几个版本的hive才支持这个功能

附上官网的链接:Recover Partitions (MSCK REPAIR TABLE)

msck repair table 通常用于新分区作为目录加载到HDFS或S3的环境中,用户希望批量创建缺失的分区。但是,目前它只支持添加缺失的分区。如果有任何分区在metastore中存在,但在FileSystem上没有,它也应该删除它们,以便真正修复表元数据。

Hive stores a list of partitions for each table in its metastore. If, however, new partitions are directly added to HDFS (say by using hadoop fs -put command) or removed from HDFS, the metastore (and hence Hive) will not be aware of these changes to partition information unless the user runs ALTER TABLE table_name ADD/DROP PARTITION commands on each of the newly added or removed partitions, respectively.

Hive在metastore中存储每个表的分区列表。然而,如果新的分区直接添加到HDFS(比如通过使用hadoop fs——命令)或删除从HDFS的metastore(因此hive)不会意识到这些变化分区信息,除非用户分别运行ALTER TABLE table_name上添加/删除分区命令每个新添加或删除分区。

However, users can run a metastore check command with the repair table option:
然而,用户可以使用修复表option选项运行metastore检查命令:

MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS];

which will update metadata about partitions to the Hive metastore for partitions for which such metadata doesn’t already exist. The default option for MSC command is ADD PARTITIONS. With this option, it will add any partitions that exist on HDFS but not in metastore to the metastore. The DROP PARTITIONS option will remove the partition information from metastore, that is already removed from HDFS. The SYNC PARTITIONS option is equivalent to calling both ADD and DROP PARTITIONS. See HIVE-874 and HIVE-17824 for more details. When there is a large number of untracked partitions, there is a provision to run MSCK REPAIR TABLE batch wise to avoid OOME (Out of Memory Error). By giving the configured batch size for the property hive.msck.repair.batch.size it can run in the batches internally. The default value of the property is zero, it means it will execute all the partitions at once. MSCK command without the REPAIR option can be used to find details about metadata mismatch metastore.

对于不存在分区元数据的分区,会更新分区元数据到Hive metastore。

  • MSC命令的默认选项是“ADD PARTITIONS”。通过这个选项,它会将所有HDFS上存在但不存在metastore的分区添加到metastore中。
  • DROP PARTITIONS选项将从metastore中删除已经从HDFS中删除的分区信息。
  • SYNC PARTITIONS选项相当于同时调用ADD和DROP PARTITIONS。

详情请参阅HIVE-874和HIVE-17824。当有大量未跟踪的分区时,有一个规定是运行MSCK REPAIR TABLE批处理以避免OOME(内存不足错误)。通过为属性hive.msck.repair.batch.size提供配置的批大小,它可以在内部的批中运行。属性的默认值是0,这意味着它将一次执行所有分区。
不带REPAIR选项的MSCK命令可用于查找元数据mismatch metastore的详细信息。