Production cluster component startup fails: JDK not found

  • Preface
  • Main
  • Symptom
  • Troubleshooting
  • Analysis & Fix
  • Conclusion


Preface

Chengdu finally went back to the office this week. While working from home last week, I helped colleagues migrate the Ambari metadata database for each of our clusters. Since the migration didn't touch any service components, we didn't run full start/stop tests of every service afterwards, only spot checks. With 30-odd clusters, for most of them we just watched the host monitoring and agent heartbeats for a while and moved on.

Yesterday afternoon a colleague suddenly pulled me into a meeting: one cluster's services wouldn't start or stop properly, and even restarting the Spark Worker failed. I joined them to take a look. In the end the problem came down to human error; this post is a record of the incident.

Main

Symptom

The symptom itself was simple: starting the Spark Worker from the Ambari web UI failed:

(screenshot: Spark Worker start failure in the Ambari UI)

The error log:

Traceback (most recent call last):
  File "/var/lib/ambari-agent/cache/stack-hooks/before-ANY/scripts/hook.py", line 41, in <module>
    BeforeAnyHook().execute()
  File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/script.py", line 352, in execute
    method(env)
  File "/var/lib/ambari-agent/cache/stack-hooks/before-ANY/scripts/hook.py", line 35, in hook
    setup_java()
  File "/var/lib/ambari-agent/cache/stack-hooks/before-ANY/scripts/shared_initialization.py", line 214, in setup_java
    __setup_java(custom_java_home=params.java_home, custom_jdk_name=params.jdk_name)
  File "/var/lib/ambari-agent/cache/stack-hooks/before-ANY/scripts/shared_initialization.py", line 227, in __setup_java
    raise Fail(format("Unable to access {java_exec}. Confirm you have copied jdk to this host."))
resource_management.core.exceptions.Fail: Unable to access /usr/java/jdk/bin/java. Confirm you have copied jdk to this host.
Error: Error: Unable to run the custom hook script ['/usr/bin/python', '/var/lib/ambari-agent/cache/stack-hooks/before-ANY/scripts/hook.py', 'ANY', '/var/lib/ambari-agent/data/command-9414.json', '/var/lib/ambari-agent/cache/stack-hooks/before-ANY', '/var/lib/ambari-agent/data/structured-out-9414.json', 'INFO', '/var/lib/ambari-agent/tmp', 'PROTOCOL_TLSv1_2', '']


Troubleshooting

The key line is Unable to access /usr/java/jdk/bin/java. Confirm you have copied jdk to this host. — the immediate impression is that the configured JDK path doesn't exist on the server. Logging in to the failing node, an ls confirmed the path was indeed missing:

(screenshot: ls on the node shows /usr/java/jdk does not exist)


This cluster had been running for a long time, so the JDK could hardly have been installed wrong, but I verified anyway. Below is a screenshot from one node; an ansible run across all nodes confirmed every host does have a JDK:

(screenshot: a JDK is installed on the node, but under a different path)


So the JDK exists, but its actual install path differs from the path Ambari is calling. A cluster that has been running this long shouldn't only be hitting this now. The failing script is /var/lib/ambari-agent/cache/stack-hooks/before-ANY/scripts/shared_initialization.py, a hook that runs before every operation. Here is the part of the script that raises the error:

def setup_java():
  """
  Install jdk using specific params.
  Install ambari jdk as well if the stack and ambari jdk are different.
  """
  import params
  __setup_java(custom_java_home=params.java_home, custom_jdk_name=params.jdk_name)
  if params.ambari_java_home and params.ambari_java_home != params.java_home:
    __setup_java(custom_java_home=params.ambari_java_home, custom_jdk_name=params.ambari_jdk_name)

def __setup_java(custom_java_home, custom_jdk_name):
  """
  Installs jdk using specific params, that comes from ambari-server
  """
  import params
  java_exec = format("{custom_java_home}/bin/java")
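
The check that raises the error can be sketched as follows — a simplified stand-in for the hook's logic, where Fail is a plain exception rather than the real resource_management class:

```python
import os

class Fail(Exception):
    """Stand-in for resource_management.core.exceptions.Fail."""

def check_java_exec(java_home):
    # Mirrors the hook's check: resolve {java_home}/bin/java and fail if absent
    java_exec = os.path.join(java_home, "bin", "java")
    if not os.path.isfile(java_exec):
        raise Fail("Unable to access %s. Confirm you have copied jdk to this host." % java_exec)
    return java_exec
```

Pointed at a java_home that does not exist on the host (as in this incident), the call produces exactly the message seen in the log above.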

The docstring is fairly explicit: this configuration comes from ambari-server. When a service start invokes the hook, it imports params.py, which reads in java_home as follows:

from resource_management.libraries.script import Script
from resource_management.libraries.functions import default
from resource_management.libraries.functions import format
from resource_management.libraries.functions import conf_select
from resource_management.libraries.functions.get_architecture import get_architecture
from resource_management.libraries.functions.expect import expect

config = Script.get_config()
tmp_dir = Script.get_tmp_dir()

stack_root = Script.get_stack_root()

architecture = get_architecture()

dfs_type = default("/clusterLevelParams/dfs_type", "")

artifact_dir = format("{tmp_dir}/AMBARI-artifacts/")
jdk_name = default("/ambariLevelParams/jdk_name", None)
java_home = config['ambariLevelParams']['java_home']
java_version = expect("/ambariLevelParams/java_version", int)
jdk_location = config['ambariLevelParams']['jdk_location']
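
What params.py actually reads is the command JSON that ambari-server ships to the agent (the command-9414.json in the traceback above). A trimmed, hypothetical excerpt shows the lookup — the key names follow params.py, the values are illustrative:

```python
import json

# Hypothetical excerpt of /var/lib/ambari-agent/data/command-*.json;
# the real file carries far more keys.
command = json.loads("""
{
  "ambariLevelParams": {
    "jdk_name": "jdk-8u112-linux-x64.tar.gz",
    "java_home": "/usr/java/jdk",
    "java_version": "8"
  }
}
""")

java_home = command['ambariLevelParams']['java_home']    # hard lookup, as in params.py
jdk_name = command['ambariLevelParams'].get('jdk_name')  # default(..., None) equivalent
print(java_home)
```

So whatever java.home the server carries is pushed verbatim to every agent on every operation — there is no per-host fallback.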

Fetching this property through the Ambari REST API showed it was indeed wrong:

(screenshot: API response showing the wrong java_home)


Next, logging in to the ambari-server host, I checked the jdk path in the ambari.properties config file:

(screenshot: java.home in ambari.properties pointing at the wrong path)


The jdk path here was the wrong one.
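
A quick way to catch this class of drift is to compare java.home in ambari.properties against what actually exists on disk. A minimal sketch, using a sample of the file's key=value format (the values are illustrative, not the cluster's real ones):

```python
import os

# Sample content in the style of /etc/ambari-server/conf/ambari.properties
sample = """\
java.home=/usr/java/jdk
jdk.name=jdk-8u112-linux-x64.tar.gz
"""

def read_property(text, key):
    # ambari.properties is plain key=value, one entry per line
    for line in text.splitlines():
        if line.startswith(key + "="):
            return line.split("=", 1)[1]
    return None

java_home = read_property(sample, "java.home")
# On a real server, also verify the binary behind the configured path:
print(java_home, os.path.isfile(os.path.join(java_home, "bin", "java")))
```

Running such a comparison right after any ambari-server setup would have flagged the mismatch before any service restart did.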

Analysis & Fix

Thinking back to the migration a few days earlier, the likely cause is that while running ambari-server setup to reconfigure the database, the operator accepted the default JDK path instead of specifying the correct one, silently overwriting the existing setting. Since we never ran service restarts afterwards, the problem stayed hidden. The fix: run ambari-server setup again, set JAVA_HOME correctly, and restart ambari-server:

(screenshot: re-running ambari-server setup with the correct JAVA_HOME)
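
As I understand it, for this property the setup run boils down to rewriting java.home in ambari.properties (alongside validating the JDK choice). The rewrite can be simulated on a sample line — the target path below is an example, not the cluster's real one:

```python
import re

before = "java.home=/usr/java/jdk\n"
# Simulate choosing a custom JDK path during `ambari-server setup`
after = re.sub(r"^java\.home=.*$",
               "java.home=/usr/local/java/jdk1.8.0_112",
               before, flags=re.M)
print(after.strip())
```

After the restart, agents pick up the corrected path on the next command they receive.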


Restarting the Spark Worker again, everything worked:

(screenshot: Spark Worker restarted successfully)

Conclusion

Yet another incident caused by human error. Afterwards I went back and checked my runbook (worried I might have written it wrong myself), and it does explicitly call out that the JDK path must be specified correctly:

(screenshot: the runbook step calling out the JDK path)


So... what can I say~