Background
I currently own a data product end to end. It executes user-submitted Impala SQL through Hive-JDBC to power datasets, data monitoring, data push, and similar features. The Hive-JDBC version is 2.1.1-cdh6.2.1:
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>2.1.1-cdh6.2.1</version>
</dependency>
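For context, submitting a user's SQL boils down to a plain JDBC round trip against Impala's HiveServer2-compatible endpoint. The snippet below is a minimal, illustrative sketch rather than the product's actual code: the host name is a placeholder, port 21050 matches the connection URLs that appear in the error logs further down, and authentication parameters (which depend on the cluster) are omitted.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaQuickStart {
    public static void main(String[] args) throws Exception {
        // Register the driver shipped with hive-jdbc 2.1.1-cdh6.2.1.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder host; 21050 is the port seen in the logs below.
        String url = "jdbc:hive2://impalad-host:21050/default";
        try (Connection con = DriverManager.getConnection(url);
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("select 1")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1));
            }
        }
    }
}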
Errors
ParseException: Syntax error in line 34: sum(enable). Encountered: ENABLE. Expected: ALL…
SQL that ran fine on the previous CDH version started failing after the cluster was upgraded to CDH 7.1.7. The full error message:
ParseException: Syntax error in line 34:\nsum(enable) as 辅助\n ^\nEncountered: ENABLE\nExpected: ALL, CASE, CAST, DATE, DEFAULT, DISTINCT, EXISTS, FALSE, IF, INTERVAL, LEFT, NOT, NULL, REPLACE, RIGHT, TRUNCATE, TRUE, IDENTIFIER\n\nCAUSED BY: Exception: Syntax error
In other words, enable is now treated as a reserved keyword and triggers a syntax error.
Solution: quote the identifier with backticks:
sum(`enable`)
Note: adding set ENABLE_EXPR_REWRITES = 0; as the first line of the SQL does not fix this.
For background on set ENABLE_EXPR_REWRITES = 0;, see the earlier write-up on set ENABLE_EXPR_REWRITES = 0 and the "GROUP BY expression must not contain aggregate functions" problem.
default.url_decode() unknown for database default. Currently this db has 4 functions.
The SQL contains the url_decode() function and fails to execute. This is not a hive-jdbc problem: the same SQL also fails when run from the visual query console.
TransmitData() to ip:27000 failed: Network error: Client connection negotiation failed: client connection to ip:27000: connect: Connection refused (error 111)
A TCP communication problem between nodes; retry manually. Reference: TransmitData 27000 Network-error.
Failed to close HDFS file. The DiskSpace quota of is exceeded
The full error message:
Failed to close HDFS file: hdfs://ppdhdpha/user/hive/warehouse/lkzc.db/jh_100701_cdd_noapply_cutdown_user/_impala_insert_staging//.0.parq\\nError(255): Unknown error 255\\nRoot cause: RemoteException: The DiskSpace quota of /user/hive/warehouse/lkzc.db is exceeded: quota = 65970697666560 B = 60 TB but diskspace consumed = 65971183008777 B = 60.00 TB
The lkzc business line's database has exceeded its space quota, so no more data can be written; the _impala_insert_staging path in the error above is the giveaway.
Solution: drop large, unused tables to free up space.
The library hdfs:/udf/hive.udf-1.0.2.jar last modified time does not match the expected last modified time. Run ‘refresh functions <db name>’.
A SQL statement submitted to the big-data platform via hive-jdbc fails with:
java.lang.Exception: java.sql.SQLException: The library hdfs:///udf/hive.udf-1.0.2.jar last modified time 1670929674 does not match the expected last modified time 1653377949. Run 'refresh functions <db name>'.
A rather puzzling error. It shows up only occasionally, and re-running the SQL succeeds.
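The error message itself suggests the remedy. A hedged sketch, assuming you know which database owns the UDF jar (the class and method names below are made up for illustration): run refresh functions on that database over the same connection, then resubmit the SQL.

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class UdfRefresh {
    // Run the statement that the error message itself recommends, then
    // resubmit the original SQL. "db" is the database that owns the UDF jar.
    static void refreshFunctions(Connection con, String db) throws SQLException {
        try (Statement st = con.createStatement()) {
            st.execute("refresh functions " + db);
        }
    }
}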
Failed to close HDFS file. Unable to close file because the last blockBP does not have enough number of replicas.
Error message:
java.sql.SQLException: Failed to close HDFS file: hdfs:/user/hive/warehouse/lkzc.db/jh_061201_ffhywkt_chuda/_impala_insert_staging/
Error(255): Unknown error 255
Root cause: IOException: Unable to close file because the last blockBP does not have enough number of replicas.
Failed due to unreachable impalad(s): hadoopcbd008156.ppdgdsl.com:2200
The SQL lands on one particular node of the Impala cluster, and that node happens to be unreachable because of a transient network problem or the like. There is no great fix for now:
- Retry manually: resubmit the SQL and it will most likely (though not certainly) land on a different node;
- Add automatic retry logic to the application, triggered only for specific error types and capped at a small number of attempts, say 1-2 (see the sketch after this list). Platform users submit well over ten thousand SQL statements, so the extra load that automatic retries put on the Impala cluster has to be taken into account.
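A minimal sketch of such client-side retry logic, assuming transient errors are recognized by substring matching on the exception message (the class name and the marker list below are illustrative assumptions, not the platform's actual classification):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class RetryingExecutor {
    // Error fragments treated as transient and therefore worth a retry.
    private static final String[] RETRYABLE = {
            "Failed due to unreachable impalad",
            "TransmitData() to",
            "Sender timed out waiting for receiver fragment instance"
    };

    // Execute the SQL, retrying at most maxRetries extra times on transient
    // errors. The caller closes the statement via rs.getStatement().
    static ResultSet executeWithRetry(Connection con, String sql, int maxRetries) throws SQLException {
        for (int attempt = 0; ; attempt++) {
            try {
                Statement st = con.createStatement();
                return st.executeQuery(sql);
            } catch (SQLException e) {
                if (attempt >= maxRetries || !isRetryable(e)) {
                    throw e;
                }
                // otherwise fall through and resubmit once more
            }
        }
    }

    private static boolean isRetryable(SQLException e) {
        String msg = String.valueOf(e.getMessage());
        for (String marker : RETRYABLE) {
            if (msg.contains(marker)) {
                return true;
            }
        }
        return false;
    }
}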
org.apache.hive.service.cli.HiveSQLException: Invalid or unknown query handle
SQLException: Cancelled
The full error message:
java.sql.SQLException: Cancelled
at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:381)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:260)
at org.apache.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:473)
at com.alibaba.druid.pool.DruidPooledStatement.executeQuery(DruidPooledStatement.java:308)
SQL submitted to the big-data platform via hive-jdbc fails with nothing more than Cancelled; there is no further log output. Cause: the platform nodes are short on resources and simply drop the task. Note that this is different from the "Cancelled from Impala's debug web interface" error below. Solution: retry and optimize the SQL; there does not seem to be a better option.
Cancelled in ScannerContext
Cancelled from Impala’s debug web interface
The full error message:
java.sql.SQLException: Cancelled from Impala's debug web interface
at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:381)
at org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:356)
at com.alibaba.druid.pool.DruidPooledResultSet.next(DruidPooledResultSet.java:68)
Solution: retry and optimize the SQL; there does not seem to be a better option.
Invalid query handle: xxxx.yyyy
The error message:
ERROR c.a.druid.pool.DruidPooledStatement - clearResultSet error
org.apache.hive.service.cli.HiveSQLException: Invalid query handle: d84d9133d8a6ce9c:9a77cd100000000
at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:266)
at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:252)
at org.apache.hive.jdbc.HiveStatement.closeStatementIfNeeded(HiveStatement.java:210)
at org.apache.hive.jdbc.HiveStatement.closeClientOperation(HiveStatement.java:221)
at org.apache.hive.jdbc.HiveQueryResultSet.close(HiveQueryResultSet.java:308)
at com.alibaba.druid.pool.DruidPooledResultSet.close(DruidPooledResultSet.java:86)
at com.alibaba.druid.pool.DruidPooledStatement.clearResultSet(DruidPooledStatement.java:206)
at com.alibaba.druid.pool.DruidPooledStatement.close(DruidPooledStatement.java:514)
The code where the error is raised:
finally {
    if (ps != null) {
        // HiveSQLException("Invalid query handle: ...") is thrown here: the
        // server has already killed the query, so closing the statement asks
        // it to clean up a handle it no longer knows about.
        ps.close();
    }
    if (con != null) {
        con.close();
    }
}
The exception surfaces at Statement.close(). The query was killed for exceeding its time budget: platform resources are limited, so user SQL cannot be allowed to run indefinitely. The kill policy weighs three factors: amount of data scanned, execution time, and completion progress.
On the data-volume factor, strongly avoid select *: list only the columns you actually need, and add a limit clause (for example limit 10000) to cap the number of rows returned.
SQLException: Sender timed out waiting for receiver fragment instance: , dest node: 66
The full error message:
java.sql.SQLException: Sender timed out waiting for receiver fragment instance: 394c696029ddcce6:a51b7cab000007cc, dest node: 66
at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:381)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:260)
at org.apache.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:473)
at com.alibaba.druid.pool.DruidPooledStatement.executeQuery(DruidPooledStatement.java:308)
Sometimes a retry succeeds, sometimes it fails. Logging into the Hadoop cluster and checking the task scheduling logs showed that this SQL does a full scan of a table with more than 12,000 partitions, reading 20.1 TB of data.
Solution: retry and optimize the SQL; there does not seem to be a better option.
Failed to open HDFS file. No such file or directory. Root cause: RemoteException: File does not exist:
The task fails with the following error:
java.sql.SQLException: Disk I/O error: Failed to open HDFS file hdfs://ppdhdpha/user/hive/warehouse/test.db/chengzhangquanyi_huolizhiguoqi_chuda/2c43254ab60211d3-cf0e47d200000235_298950249_data.0.
Error(2): No such file or directory
Root cause: RemoteException: File does not exist: /user/hive/warehouse/test.db/chengzhangquanyi_huolizhiguoqi_chuda/2c43254ab60211d3-cf0e47d200000235_298950249_data.0.
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:85)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:75)
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:152)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:735)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:415)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:869)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:815)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2675)
Cause: the table's underlying data file no longer exists. Things to try: run REFRESH on the table, or INVALIDATE METADATA (see the sketch below).
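A hedged sketch of that remediation over the same JDBC connection (the helper class is illustrative; REFRESH reloads file and block metadata for a table whose data changed underneath Impala, while INVALIDATE METADATA is the heavier option that discards the cached table entry entirely):

import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class MetadataRefresh {
    // Refresh Impala's metadata for a table whose files were rewritten or
    // removed, then retry the failing query.
    static void refresh(Connection con, String qualifiedTable) throws SQLException {
        try (Statement st = con.createStatement()) {
            st.execute("refresh " + qualifiedTable);
            // If a plain refresh is not enough (e.g. the table was dropped
            // and re-created), invalidate the cached metadata instead:
            // st.execute("invalidate metadata " + qualifiedTable);
        }
    }
}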
Error(s) deleting partition directories. First error (of 37) was: Hdfs op. Input/output error
java.sql.SQLException: Error(s) deleting partition directories. First error (of 37) was: Hdfs op (DELETE hdfs://ppdhdpha/user/hive/warehouse/cszc.db/zmj_sop_m0_base_snp/1d460f3a4d87ea14-a4c4521100000091_870332460_data.0.) failed, error was: hdfs://ppdhdpha/user/hive/warehouse/cszc.db/zmj_sop_m0_base_snp/1d460f3a4d87ea14-a4c4521100000091_870332460_data.0.
Error(5): Input/output error
Solution: retry and optimize the SQL; there does not seem to be a better option.
java.sql.SQLException: Admission for query exceeded timeout 180000ms in pool root.report. Queued reason: queue is not empty (size 5); queued queries are executed first.
The full error message:
ERROR c.alibaba.druid.pool.DruidDataSource - create connection SQLException, url: jdbc:hive2://111.222.333.66:21050/default, errorCode 0, state HY000
java.sql.SQLException: Admission for query exceeded timeout 180000ms in pool root.report. Queued reason: queue is not empty (size 4); queued queries are executed first.
at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:381)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:260)
at org.apache.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:473)
at com.alibaba.druid.pool.DruidAbstractDataSource.validateConnection(DruidAbstractDataSource.java:1434)
at com.alibaba.druid.pool.DruidAbstractDataSource.createPhysicalConnection(DruidAbstractDataSource.java:1733)
at com.alibaba.druid.pool.DruidDataSource$CreateConnectionThread.run(DruidDataSource.java:2838)
SQL submitted to the platform has to queue for admission; anything that waits in the queue for more than 3 minutes (180000 ms) is discarded.
GetConnectionTimeoutException & Rejected query from pool root.report: queue full, limit=200, num_queued=200.
The full error message:
java.lang.Exception: java.lang.Exception: com.alibaba.druid.pool.GetConnectionTimeoutException: wait millis 120000, active 1, maxActive 150, creating 0, createErrorCount 2
at com.xy.cloudiview.common.util.JdbcUtil.getConnection(JdbcUtil.java:136)
at com.xy.cloudiview.datasetsubscript.business.service.impl.TableWarnServiceImpl.getSqlCount(TableWarnServiceImpl.java:244)
at com.xy.cloudiview.datasetsubscript.business.service.impl.TableWarnServiceImpl.executeTableWarnJob(TableWarnServiceImpl.java:66)
at com.xy.cloudiview.datasetsubscript.business.xxljob.IviewTableWarnJobHandler.execute(IviewTableWarnJobHandler.java:45)
at com.ppdai.job.core.thread.JobThread.run(JobThread.java:142)
Caused by: java.lang.Exception: com.alibaba.druid.pool.GetConnectionTimeoutException: wait millis 120000, active 1, maxActive 150, creating 0, createErrorCount 2
at com.xy.cloudiview.common.util.JdbcUtil.getConnection(JdbcUtil.java:122)
... 4 common frames omitted
Caused by: com.alibaba.druid.pool.GetConnectionTimeoutException: wait millis 120000, active 1, maxActive 150, creating 0, createErrorCount 2
at com.alibaba.druid.pool.DruidDataSource.getConnectionInternal(DruidDataSource.java:1773)
at com.alibaba.druid.pool.DruidDataSource.getConnectionDirect(DruidDataSource.java:1427)
at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:1407)
at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:1397)
at com.alibaba.druid.pool.DruidDataSource.getConnection(DruidDataSource.java:100)
at com.xy.cloudiview.common.util.JdbcUtil.getConnection(JdbcUtil.java:118)
... 4 common frames omitted
Caused by: java.sql.SQLException: Rejected query from pool root.report: queue full, limit=200, num_queued=200.
at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:381)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:260)
at org.apache.hive.jdbc.HiveStatement.executeQuery(HiveStatement.java:473)
at com.alibaba.druid.pool.DruidAbstractDataSource.validateConnection(DruidAbstractDataSource.java:1434)
at com.alibaba.druid.pool.DruidAbstractDataSource.createPhysicalConnection(DruidAbstractDataSource.java:1733)
at com.alibaba.druid.pool.DruidDataSource$CreateConnectionThread.run(DruidDataSource.java:2838)
The big-data platform caps SQL submissions: at most 150 statements run at once, up to 200 can queue, and the queue timeout is 5 minutes. SQL submitted beyond those limits either times out waiting for a connection or is rejected because the queue is full, as above.
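For reference, a minimal sketch of client-side Druid pool settings consistent with the numbers in the log above ("wait millis 120000, ... maxActive 150"); the actual platform configuration may differ, the host is a placeholder, and the Impala-side admission limit (200 queued queries in pool root.report) lives in the cluster's admission-control config, not here:

import com.alibaba.druid.pool.DruidDataSource;

class ImpalaPool {
    static DruidDataSource build() {
        DruidDataSource ds = new DruidDataSource();
        ds.setDriverClassName("org.apache.hive.jdbc.HiveDriver");
        ds.setUrl("jdbc:hive2://impalad-host:21050/default"); // placeholder host
        ds.setMaxActive(150);   // at most 150 connections handed out at once
        ds.setMaxWait(120_000); // wait up to 120 s for a free connection before
                                // GetConnectionTimeoutException
        ds.setValidationQuery("select 1"); // executed by validateConnection() in the stack trace
        return ds;
    }
}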
ImpalaRuntimeException: Error making ‘updateTableColumnStatistics’ RPC to Hive Metastore: CAUSED BY: MetaException: Could not create “increment”/“table” value-generation container SEQUENCE_TABLE since autoCreate flags do not allow it.
ImpalaRuntimeException: Error making ‘dropTable’ RPC to Hive Metastore: \nCAUSED BY: MetaException: One or more instances could not be deleted
Failed to close HDFS file: hdfs://ppdhdpha/user/hive/warehouse/xkyy.db/cdjk_apibg_willa/_impala_insert_staging/.0.parq\nError(255): Unknown error 255\nRoot cause: RemoteException: The DiskSpace quota of /user/hive/warehouse/xkyy.db is exceeded: quota = 43980465111040 B = 40 TB but diskspace consumed = 43981119305382 B = 40.00 TB
File ‘hdfs://ppdcdpha/user/hive/warehouse/usrt.db/willa_liubiao_daily’ has an invalid Parquet version number: 336\n\n. Please check that it is a valid Parquet file. This error can also occur due to stale metadata. If you believe this is a valid Parquet file, try running refresh usrt.willa_liubiao_daily
Table does not exist
There are two kinds of error messages:
- Table does not exist:
- Could not resolve table reference
What exactly is the difference between the two? I have not figured that out yet.
Decompressor: invalid compressed length. Data is likely corrupt.
The failing query depends on the table rhino_man_checked_hitrule, whose storage file is corrupt; the table needs to be regenerated.
Expected to read a compressed text file of size 5553487 bytes. But only read 5401212 bytes. This may indicate data file corruption. (file: hdfs://hdpha/user/hive/warehouse/rpt.db/rhino_man_checked_hitrule/000000_0.snappy).
As above, the failing query depends on the table rhino_man_checked_hitrule, whose storage file is corrupt; the table needs to be regenerated.
References
- Sender timed out waiting for receiver fragment instance: , dest node: 66
- Invalid query handle

Neither reference turned out to be of much practical help.