Monitoring Elasticsearch cluster status with a simple command


How it works:

Use curl to query any ES node for the cluster health it reports. A healthy cluster must report the status green.

curl -sXGET http://serverip:9200/_cluster/health/?pretty


{
  "cluster_name" : "yunva-es",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 7,
  "number_of_data_nodes" : 6,
  "active_primary_shards" : 66,
  "active_shards" : 132,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}


The front end is protected by nginx authentication, so the request has to log in.

curl command format with user credentials:

curl -u username:password -sXGET http://serverip:9200/_cluster/health/?pretty | grep "status"|awk -F '[ "]+' '{print $4}'
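The grep/awk pipeline above depends on the pretty-printed layout of the response. As a minimal sketch, the same extraction can be wrapped in small functions and checked against a saved response before it goes into the agent configuration. The function names `es_status_from_json` and `es_is_green` are hypothetical and not part of the original setup.

```shell
#!/bin/bash
# Hypothetical helper: read a pretty-printed _cluster/health response on stdin
# and print the value of the "status" field (green, yellow or red).
es_status_from_json() {
    grep '"status"' | awk -F '[ "]+' '{print $4}'
}

# Hypothetical helper: print 1 when the status is green, 0 otherwise
# (the same result the grep -c 'green' variant produces).
es_is_green() {
    if [ "$(es_status_from_json)" = "green" ]; then
        echo 1
    else
        echo 0
    fi
}
```

It could be used as `curl -u username:password -sXGET http://serverip:9200/_cluster/health/?pretty | es_is_green`.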


1. Modify the zabbix configuration on the client:

vim /etc/zabbix/zabbix_agentd.conf


UserParameter=es_status,curl -u elkadmin:hSeC7ENeirAAPzv047m4 -sXGET http://serverip/_cluster/health/?pretty | grep "status"|awk -F '[ "]+' '{print $4}'|grep -c 'green'


Restart zabbix-agent for the configuration to take effect

service zabbix-agent restart


Test from the zabbix-server side

zabbix_get -s ip -p 10050 -k es_status


2. Add the corresponding monitoring in the zabbix web interface:


Add a monitoring item

Configuration --> Hosts --> find the corresponding host, open Items --> Create item


Create a trigger:

Name

es_status_check

es_cluster_status is not green


3. Monitor the process on every node of the es cluster and restart it automatically if it dies

Configure the process monitoring item


Configure the trigger


Configure the action; see the reference

 


Script run by the action:

/usr/local/zabbix-agent/scripts/start_es.sh

#!/bin/bash
# Load the environment (PATH, JAVA_HOME); sudo does not provide it
source /etc/profile
# If elasticsearch is already running, kill it first
pids=$(ps -ef | grep elasticsearch | grep -v grep | awk '{print $2}')
if [ -n "$pids" ]; then
    /bin/kill $pids
fi
# Start it in the background as the non-root user yunva
su yunva -c "cd /data/elasticsearch-5.0.1/bin && /bin/bash elasticsearch &"
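The kill step relies on picking the PID column out of `ps -ef` output. A minimal sketch of that step as a standalone function, which can be tried on canned `ps` output before it is trusted to kill anything; the name `pids_matching` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read `ps -ef` style output on stdin and print the PID
# (second column) of every line that mentions the given name, skipping the
# grep process itself.
pids_matching() {
    local name="$1"
    grep -- "$name" | grep -v grep | awk '{print $2}'
}
```

For example, `ps -ef | pids_matching elasticsearch` lists the PIDs the script would kill.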



Run it:

sudo /bin/bash /usr/local/zabbix-agent/scripts/start_es.sh

Error:

which: no java in (/sbin:/bin:/usr/sbin:/usr/bin)

Could not find any executable java binary. Please install java in your PATH or set JAVA_HOME


Solution:

Add the following line to the script so that PATH and JAVA_HOME are loaded:

source /etc/profile
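The error above shows a PATH limited to the system directories, so java installed elsewhere is not found. As a sketch, the script could verify that java is reachable right after sourcing /etc/profile and fail with a clear message otherwise. The function name `find_java` is an assumption for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the path of a usable java binary, preferring
# JAVA_HOME, or return non-zero when none can be found.
find_java() {
    if [ -n "$JAVA_HOME" ] && [ -x "$JAVA_HOME/bin/java" ]; then
        echo "$JAVA_HOME/bin/java"
    elif command -v java >/dev/null 2>&1; then
        command -v java
    else
        return 1
    fi
}
```

In the start script this might look like `find_java >/dev/null || { echo "java not found" >&2; exit 1; }` placed before the `su` line.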


Running elasticsearch as the root user


Error:

can not run elasticsearch as root


The following methods found online do not work for elasticsearch 5.1:

Method 1:

Add the parameter -Des.insecure.allow.root=true when starting elasticsearch. The full command is:


./elasticsearch -Des.insecure.allow.root=true  

Method 2:

Open the elasticsearch launch script with vi and add the following line before the ES_JAVA_OPTS variable is used:


ES_JAVA_OPTS="-Des.insecure.allow.root=true"  


Working solution: start elasticsearch as a non-root user:

su yunva -c "cd /data/elasticsearch-5.0.1/bin && /bin/bash elasticsearch &"


Script that automatically restarts the kibana service:

cat /usr/local/zabbix/scripts/restart_kibana.sh

#!/bin/bash
# If kibana is running, kill it
pids=$(ps -ef | grep kibana | grep -v grep | awk '{print $2}')
if [ -n "$pids" ]; then
    /bin/kill $pids
fi

# start it

 

# A brief summary



1. Edit zabbix_agentd.conf to enable remote shell commands
# egrep -v '^#|^$' /usr/local/zabbix_agents_3.2.0/scripts/conf/zabbix_agentd.conf

EnableRemoteCommands=1
UnsafeUserParameters=1

# Allow the zabbix user to run sudo without a tty and without a password

echo 'Defaults:zabbix !requiretty' >> /etc/sudoers
echo 'zabbix ALL=(ALL) NOPASSWD: ALL' >> /etc/sudoers
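`NOPASSWD: ALL` lets the zabbix user run any command as root. A narrower variant, sketched below under the assumption that only a few commands need sudo, generates one rule per command. The function name `zabbix_sudoers_rules` and the idea of writing to a drop-in file are assumptions, not part of the original setup.

```shell
#!/bin/bash
# Hypothetical helper: print sudoers lines that let the zabbix user run only
# the listed commands without a password, instead of NOPASSWD: ALL.
zabbix_sudoers_rules() {
    echo 'Defaults:zabbix !requiretty'
    local cmd
    for cmd in "$@"; do
        echo "zabbix ALL=(ALL) NOPASSWD: $cmd"
    done
}
```

For example, `zabbix_sudoers_rules /bin/find "/bin/bash /usr/local/zabbix_agents_3.2.0/scripts/start_es.sh"` prints rules for the two commands used in this article. On systems whose sudoers file includes /etc/sudoers.d, the output could be saved there and checked with `visudo -cf` before use.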


2. Monitoring commands

UserParameter=es_status,curl -sXGET http://172.16.0.230:9200/_cluster/health/?pretty | grep "status"|awk -F '[ "]+' '{print $4}'|grep -c 'green'
UserParameter=es_debug,sudo /bin/find /opt/elasticsearch-5.6.15 -name 'hs_err_pid*.log' -o -name 'java_pid*.hprof'|wc -l
# The default port check may misbehave; replace it with the following command
UserParameter=net.tcp.listen.grep[*],grep -q $(printf '%04X.00000000:0000.0A' $1) /proc/net/tcp ;if [ $? -eq 0 ];then echo 1;else grep -q "$(printf '%04X 00000000000000000000000000000000:0000 0A' $1)" /proc/net/tcp6;if [ $? -eq 0 ];then echo 1;else echo 0;fi;fi
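The port check works because /proc/net/tcp lists the local port in hexadecimal (9200 is 23F0) and marks listening sockets with state 0A. A minimal sketch of the same idea as a readable function follows; `port_is_listening` is a hypothetical name, and unlike the grep above it accepts any local address rather than only the wildcard address.

```shell
#!/bin/bash
# Hypothetical helper: return 0 when the given port appears in LISTEN state
# (st column 0A) in a /proc/net/tcp style table, 1 otherwise.
# $1 = port number, $2 = table file (defaults to /proc/net/tcp).
port_is_listening() {
    local port_hex
    port_hex="$(printf '%04X' "$1")"
    # Columns: sl, local_address (ADDR:PORT in hex), rem_address, st, ...
    awk -v p="$port_hex" '
        { n = split($2, a, ":") }
        n == 2 && a[2] == p && $4 == "0A" { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "${2:-/proc/net/tcp}"
}
```

Because the address and port are separated by a single colon in both tables, the same function can be pointed at /proc/net/tcp6, for example `port_is_listening 9200 /proc/net/tcp6 && echo 1 || echo 0`.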

3. Script that restarts elasticsearch when a Java out-of-memory error occurs

# vim /usr/local/zabbix_agents_3.2.0/scripts/start_es.sh

#!/bin/bash
# Load the environment (PATH, JAVA_HOME)
source /etc/profile


# Delete the files produced by the Java crash
/usr/bin/sudo /bin/find /opt/elasticsearch-5.6.15 -name 'hs_err_pid*.log' -o -name 'java_pid*.hprof' | xargs rm -f

# Kill elasticsearch if it is running, then start it again
pids=$(ps -ef | grep elasticsearch | grep -v grep | awk '{print $2}')
if [ -n "$pids" ]; then
    /bin/kill -9 $pids
fi
# Start it as a daemon under the elasticsearch user
su elasticsearch -c "cd /opt/elasticsearch-5.6.15/bin && /bin/bash elasticsearch -d"
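The es_debug item and the cleanup step both look for JVM crash logs and heap dumps. As a sketch, the counting can be isolated in a function and tried on a scratch directory; the name `count_jvm_crash_files` is hypothetical, and the `-type f` filter is an addition not present in the original command.

```shell
#!/bin/bash
# Hypothetical helper: count JVM crash logs (hs_err_pid*.log) and heap dumps
# (java_pid*.hprof) below a directory, which is the value es_debug reports.
count_jvm_crash_files() {
    find "$1" \( -name 'hs_err_pid*.log' -o -name 'java_pid*.hprof' \) -type f \
        | wc -l | tr -d ' '
}
```

Running `count_jvm_crash_files /opt/elasticsearch-5.6.15` should print 0 on a healthy node, which matches the `es_debug error` trigger firing on any non-zero value.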


4. Monitoring template

<?xml version="1.0" encoding="UTF-8"?>
<zabbix_export>
<version>3.2</version>
<date>2019-08-09T06:14:18Z</date>
<groups>
<group>
<name>Templates</name>
</group>
</groups>
<templates>
<template>
<template>es_cluster_monitor</template>
<name>es_cluster_monitor</name>
<description/>
<groups>
<group>
<name>Templates</name>
</group>
</groups>
<applications/>
<items>
<item>
<name>es_debug</name>
<type>0</type>
<snmp_community/>
<multiplier>0</multiplier>
<snmp_oid/>
<key>es_debug</key>
<delay>30</delay>
<history>90</history>
<trends>365</trends>
<status>0</status>
<value_type>3</value_type>
<allowed_hosts/>
<units/>
<delta>0</delta>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel>0</snmpv3_securitylevel>
<snmpv3_authprotocol>0</snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol>0</snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<formula>1</formula>
<delay_flex/>
<params/>
<ipmi_sensor/>
<data_type>0</data_type>
<authtype>0</authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link>0</inventory_link>
<applications/>
<valuemap/>
<logtimefmt/>
</item>
<item>
<name>es_status</name>
<type>0</type>
<snmp_community/>
<multiplier>0</multiplier>
<snmp_oid/>
<key>es_status</key>
<delay>30</delay>
<history>90</history>
<trends>365</trends>
<status>0</status>
<value_type>3</value_type>
<allowed_hosts/>
<units/>
<delta>0</delta>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel>0</snmpv3_securitylevel>
<snmpv3_authprotocol>0</snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol>0</snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<formula>1</formula>
<delay_flex/>
<params/>
<ipmi_sensor/>
<data_type>0</data_type>
<authtype>0</authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link>0</inventory_link>
<applications/>
<valuemap/>
<logtimefmt/>
</item>
<item>
<name>es_9200</name>
<type>0</type>
<snmp_community/>
<multiplier>0</multiplier>
<snmp_oid/>
<key>net.tcp.listen.grep[9200]</key>
<delay>30</delay>
<history>90</history>
<trends>365</trends>
<status>0</status>
<value_type>3</value_type>
<allowed_hosts/>
<units/>
<delta>0</delta>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel>0</snmpv3_securitylevel>
<snmpv3_authprotocol>0</snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol>0</snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<formula>1</formula>
<delay_flex/>
<params/>
<ipmi_sensor/>
<data_type>0</data_type>
<authtype>0</authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link>0</inventory_link>
<applications/>
<valuemap/>
<logtimefmt/>
</item>
<item>
<name>es_9300</name>
<type>0</type>
<snmp_community/>
<multiplier>0</multiplier>
<snmp_oid/>
<key>net.tcp.listen.grep[9300]</key>
<delay>30</delay>
<history>90</history>
<trends>365</trends>
<status>0</status>
<value_type>3</value_type>
<allowed_hosts/>
<units/>
<delta>0</delta>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel>0</snmpv3_securitylevel>
<snmpv3_authprotocol>0</snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol>0</snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<formula>1</formula>
<delay_flex/>
<params/>
<ipmi_sensor/>
<data_type>0</data_type>
<authtype>0</authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link>0</inventory_link>
<applications/>
<valuemap/>
<logtimefmt/>
</item>
<item>
<name>es_process</name>
<type>0</type>
<snmp_community/>
<multiplier>0</multiplier>
<snmp_oid/>
<key>proc.num[,,all,elasticsearch]</key>
<delay>30</delay>
<history>90</history>
<trends>365</trends>
<status>0</status>
<value_type>3</value_type>
<allowed_hosts/>
<units/>
<delta>0</delta>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel>0</snmpv3_securitylevel>
<snmpv3_authprotocol>0</snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol>0</snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<formula>1</formula>
<delay_flex/>
<params/>
<ipmi_sensor/>
<data_type>0</data_type>
<authtype>0</authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link>0</inventory_link>
<applications/>
<valuemap/>
<logtimefmt/>
</item>
</items>
<discovery_rules/>
<httptests/>
<macros/>
<templates/>
<screens/>
</template>
</templates>
<triggers>
<trigger>
<expression>{cms_uts_es:es_status.last(0)}&lt;&gt;1 and {cms_uts_es:es_status.last(1)}&lt;&gt;1 and {cms_uts_es:es_status.last(2)}&lt;&gt;1</expression>
<recovery_mode>0</recovery_mode>
<recovery_expression/>
<name>es cluster is not green</name>
<correlation_mode>0</correlation_mode>
<correlation_tag/>
<url/>
<status>0</status>
<priority>0</priority>
<description/>
<type>0</type>
<manual_close>0</manual_close>
<dependencies/>
<tags/>
</trigger>
<trigger>
<expression>{cms_uts_es:proc.num[,,all,elasticsearch].last(0)}&lt;1 and {cms_uts_es:proc.num[,,all,elasticsearch].last(1)}&lt;1 and {cms_uts_es:proc.num[,,all,elasticsearch].last(2)}&lt;1</expression>
<recovery_mode>0</recovery_mode>
<recovery_expression/>
<name>es process was down</name>
<correlation_mode>0</correlation_mode>
<correlation_tag/>
<url/>
<status>0</status>
<priority>0</priority>
<description/>
<type>0</type>
<manual_close>0</manual_close>
<dependencies/>
<tags/>
</trigger>
<trigger>
<expression>{cms_uts_es:net.tcp.listen.grep[9200].max(#2)}=0</expression>
<recovery_mode>0</recovery_mode>
<recovery_expression/>
<name>es_9200 port down</name>
<correlation_mode>0</correlation_mode>
<correlation_tag/>
<url/>
<status>0</status>
<priority>0</priority>
<description/>
<type>0</type>
<manual_close>0</manual_close>
<dependencies/>
<tags/>
</trigger>
<trigger>
<expression>{cms_uts_es:net.tcp.listen.grep[9300].max(#2)}=0</expression>
<recovery_mode>0</recovery_mode>
<recovery_expression/>
<name>es_9300 port down</name>
<correlation_mode>0</correlation_mode>
<correlation_tag/>
<url/>
<status>0</status>
<priority>0</priority>
<description/>
<type>0</type>
<manual_close>0</manual_close>
<dependencies/>
<tags/>
</trigger>
<trigger>
<expression>{cms_uts_es:es_debug.last()}&lt;&gt;0</expression>
<recovery_mode>0</recovery_mode>
<recovery_expression/>
<name>es_debug error</name>
<correlation_mode>0</correlation_mode>
<correlation_tag/>
<url/>
<status>0</status>
<priority>0</priority>
<description/>
<type>0</type>
<manual_close>0</manual_close>
<dependencies/>
<tags/>
</trigger>
</triggers>
</zabbix_export>