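Component startup on a production cluster fails because the JDK cannot be found
- Preface
- Main text
- Symptoms
- Troubleshooting
- Analysis and fix
- Closing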
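Preface
Chengdu finally went back to the office this week. Last week, while working from home, I helped colleagues migrate the Ambari metadata database of each cluster online. Because the migration did not change any components, we did not run start/stop tests on the service components of every cluster afterwards and only spot-checked. With more than 30 clusters, for most of them we just watched host monitoring and agent reporting for a while and left it at that.
Yesterday afternoon a colleague suddenly invited me to a meeting: service start/stop was misbehaving on one cluster, and even restarting Spark Worker failed. I joined them to take a look. The problem turned out to be caused by human oversight, and this post records it.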
Main text
Symptoms
The symptom is simple: starting Spark Worker from the web UI failed:
The error in the log was as follows:
Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/stack-hooks/before-ANY/scripts/hook.py", line 41, in <module>
    BeforeAnyHook().execute()
  File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/script.py", line 352, in execute
    method(env)
  File "/var/lib/ambari-agent/cache/stack-hooks/before-ANY/scripts/hook.py", line 35, in hook
    setup_java()
  File "/var/lib/ambari-agent/cache/stack-hooks/before-ANY/scripts/shared_initialization.py", line 214, in setup_java
    __setup_java(custom_java_home=params.java_home, custom_jdk_name=params.jdk_name)
  File "/var/lib/ambari-agent/cache/stack-hooks/before-ANY/scripts/shared_initialization.py", line 227, in __setup_java
    raise Fail(format("Unable to access {java_exec}. Confirm you have copied jdk to this host."))
resource_management.core.exceptions.Fail: Unable to access /usr/java/jdk/bin/java. Confirm you have copied jdk to this host.
Error: Error: Unable to run the custom hook script ['/usr/bin/python', '/var/lib/ambari-agent/cache/stack-hooks/before-ANY/scripts/hook.py', 'ANY', '/var/lib/ambari-agent/data/command-9414.json', '/var/lib/ambari-agent/cache/stack-hooks/before-ANY', '/var/lib/ambari-agent/data/structured-out-9414.json', 'INFO', '/var/lib/ambari-agent/tmp', 'PROTOCOL_TLSv1_2', '']
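Troubleshooting
The key log line is "Unable to access /usr/java/jdk/bin/java. Confirm you have copied jdk to this host." At face value, the configured JDK path does not exist on the server. I logged in to the failing node, and ls confirmed that the path was indeed missing:
This cluster has been running for a long time, so a badly installed JDK was unlikely, but I verified anyway. Below is a screenshot from one of the nodes; I also checked every node with ansible, and all of them have a JDK:
So the JDK is present, but its actual install path differs from the path Ambari calls. A cluster that has run this long should not be exposing such a problem only now. The failing script is /var/lib/ambari-agent/cache/stack-hooks/before-ANY/scripts/shared_initialization.py, a hook that runs before every operation, so I looked at the code around the failure: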
def setup_java():
  """
  Install jdk using specific params.
  Install ambari jdk as well if the stack and ambari jdk are different.
  """
  import params
  __setup_java(custom_java_home=params.java_home, custom_jdk_name=params.jdk_name)
  if params.ambari_java_home and params.ambari_java_home != params.java_home:
    __setup_java(custom_java_home=params.ambari_java_home, custom_jdk_name=params.ambari_jdk_name)

def __setup_java(custom_java_home, custom_jdk_name):
  """
  Installs jdk using specific params, that comes from ambari-server
  """
  import params
  java_exec = format("{custom_java_home}/bin/java")
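The excerpt stops right after java_exec is built, so the exact condition that leads to the Fail is not shown here. As a rough illustration only, the sketch below assumes the hook simply checks whether <java_home>/bin/java exists as a file. The Fail class and the check_java_home function are hypothetical stand-ins written for this post, not Ambari's own code.

```python
import os


class Fail(Exception):
    """Hypothetical stand-in for resource_management.core.exceptions.Fail."""


def check_java_home(java_home):
    """Return the java executable path, or raise Fail if it is missing.

    Assumption: the real hook performs an existence check of this kind;
    the original excerpt does not show the condition.
    """
    java_exec = os.path.join(java_home, "bin", "java")
    if not os.path.isfile(java_exec):
        raise Fail(
            "Unable to access %s. Confirm you have copied jdk to this host."
            % java_exec
        )
    return java_exec
```

Calling check_java_home("/usr/java/jdk") on a node where that directory does not exist would raise the same message seen in the log, which is consistent with the JDK being installed somewhere else on the host.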
The docstring is explicit: this setting comes from ambari-server. When a service start invokes the hook, the hook imports params.py, which reads java_home. The code is as follows:
from resource_management.libraries.script import Script
from resource_management.libraries.functions import default
from resource_management.libraries.functions import format
from resource_management.libraries.functions import conf_select
config = Script.get_config()
tmp_dir = Script.get_tmp_dir()
stack_root = Script.get_stack_root()
architecture = get_architecture()
dfs_type = default("/clusterLevelParams/dfs_type", "")
artifact_dir = format("{tmp_dir}/AMBARI-artifacts/")
jdk_name = default("/ambariLevelParams/jdk_name", None)
java_home = config['ambariLevelParams']['java_home']
java_version = expect("/ambariLevelParams/java_version", int)
jdk_location = config['ambariLevelParams']['jdk_location']
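One way to see which java_home the agent was actually handed is to look at the command file passed to the hook (the traceback shows /var/lib/ambari-agent/data/command-9414.json as one of its arguments). The sketch below assumes that this file is a JSON document containing the same ambariLevelParams dictionary that params.py reads; the helper name java_home_from_command and the sample payload are made up for illustration.

```python
import json


def java_home_from_command(command_json_text):
    """Return ambariLevelParams.java_home from a command JSON document, or None."""
    command = json.loads(command_json_text)
    return command.get("ambariLevelParams", {}).get("java_home")


# Made-up payload shaped like the keys params.py reads; not real cluster data.
sample = '{"ambariLevelParams": {"java_home": "/usr/java/jdk", "jdk_name": null}}'
print(java_home_from_command(sample))  # prints /usr/java/jdk
```

On a real agent host, the text would come from reading the command-*.json file of the failed operation instead of the sample string.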
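Fetching this setting through the API showed that the value was wrong:
Next, I logged in to the server hosting ambari-server and checked the JDK path in the ambari.properties configuration file:
The JDK path there was indeed wrong.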
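The same check can be scripted so that it can be repeated on every ambari-server after a migration. This is a minimal sketch, assuming the JDK path is stored under the java.home key in key=value form; the helper names read_java_home and java_home_is_usable are hypothetical.

```python
import os


def read_java_home(properties_text):
    """Return the value of java.home from properties-style text, or None.

    Only the simple key=value form is handled; comments start with # or !.
    """
    for raw in properties_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith("!"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "java.home":
            return value.strip()
    return None


def java_home_is_usable(java_home):
    """True when <java_home>/bin/java exists and is executable."""
    java_exec = os.path.join(java_home, "bin", "java")
    return os.path.isfile(java_exec) and os.access(java_exec, os.X_OK)
```

In practice the text would be read from the server's ambari.properties (commonly /etc/ambari-server/conf/ambari.properties, which is an assumption about the install layout), and java_home_is_usable would be run on the agent hosts, since that is where the hook looks for the executable.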
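Analysis and fix
Thinking back to the migration a few days earlier, my guess is that when the operator ran ambari-server setup to configure the database, they specified the wrong JDK path and accepted the default, which overwrote the existing setting. Because no services were started or stopped afterwards, the problem stayed hidden. The fix was to run setup again, set JAVA_HOME correctly, and restart ambari-server:
Restarting Spark Worker again then succeeded: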
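Closing
Yet another failure caused by human oversight. Afterwards I rechecked my operations document (I was worried I had written it wrong), and it does state that the correct path must be specified:
So, it's hard to put into words~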