MHA 邮件提醒及MHA损坏的环境修复恢复

原创

wx5d4124a358e8a 2022-09-02 23:57:50 博主文章分类：MySQL ©著作权

©著作权归作者所有：来自51CTO博客作者wx5d4124a358e8a的原创作品，请联系作者获取转载授权，否则将追究法律责任

主库宕机后系统发邮件提醒要再Manager所在的机器进行配置，本例在db03中

1上传发邮件脚本给脚本增加执行权限并测试发邮件

[root@db03 ~]# cd /usr/local/bin #上传发邮件脚本
[root@db03 bin]# chmod +x * #给脚本增加可执行权限
[root@db03 bin]# ll
总用量 92
-rwxr-xr-x. 1 root root 2226 3月 5 22:34 master_ip_failover
-rwxr-xr-x. 1 root root 35 12月 27 2017 send
-rwxr-xr-x. 1 root root 80213 9月 30 2009 sendEmail
-rwxr-xr-x. 1 root root 203 4月 19 2019 testpl
[root@db03 bin]# vim testpl #配置自己的发件箱smtp和收件箱
[root@db03 bin]# ./testpl #测试发邮件

发送邮件脚本下载

链接：https://pan.baidu.com/s/11hQa39LsRScn3O6xVPxPdQ
提取码：abcd

2将发邮件脚本路径写入到Manager配置文件中

[root@db03 bin]# vim /etc/mha/appl.cnf
[root@db03 bin]# cat /etc/mha/appl.cnf
[server default]
manager_log=/var/log/mha/appl/manager
manager_workdir=/var/log/mha/appl
master_binlog_dir=/data/binlog
password=mha
ping_interval=2
repl_password=123
repl_user=repl
ssh_user=root
user=mha
master_ip_failover_script=/usr/local/bin/master_ip_failover
report_script=/usr/local/bin/send

3重启MHA服务

[root@db03 bin]# masterha_stop --conf=/etc/mha/appl.cnf< /dev/null > /var/log/mha/appl/manager.log 2>&1nohup masterha_manager --conf=/etc/mha/appl.cnf --remove_dead_master_conf --ignore_last_failover </dev/null> /var/log/mha/appl/manager.log 2>&1 &
[1] 1245

4在db03中进行Node节点的ssh连接状态检查和复制环境检查和MHA状态检查并查看主机是哪台：

masterha_check_ssh --conf=/etc/mha/appl.cnf #检查互信状态
masterha_check_repl --conf=/etc/mha/appl.cnf #检查主从状态
[root@db03 tools]# masterha_check_status --conf=/etc/mha/appl.cnf #查看状态查看主库

5停主库测试邮件通知

[root@db02 ~]# /etc/init.d/mysqld stop
Shutting down MySQL............. SUCCESS! 
[root@db02 ~]#

6检查db01后发现vip已经飘过来了，然后登录到数据库查看一下是否自动变为主库了：

[root@db01 ~]# ip add
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 00:0c:29:2d:7c:19 brd ff:ff:ff:ff:ff:ff
inet 10.0.0.201/24 brd 10.0.0.255 scope global noprefixroute eth0
valid_lft forever preferred_lft forever
inet6 fe80::b23b:2bba:e233:1170/64 scope link tentative noprefixroute dadfailed
valid_lft forever preferred_lft forever
inet6 fe80::3e78:8d00:2f7:d39f/64 scope link noprefixroute
valid_lft forever preferred_lft forever
[root@db01 ~]# mysql -uroot -p123456
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 20
Server version: 5.7.26-log MySQL Community Server (GPL)

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

db01 [(none)]>show slave status \G
Empty set (0.00 sec)

db01 [(none)]>

7检查另外一个从库db03是不是指针指向了新的主库：

[root@db03 bin]# mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 19
Server version: 5.7.26-log MySQL Community Server (GPL)

Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

db03 [(none)]>show slave status \G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.0.0.201
                  Master_User: repl
                  Master_Port: 3306

8环境恢复(MHA是一次性的高可用，用过一次就不能再用了，因此需要修复)

如何排查和解决MHA的故障

8.1首先检查Manager所在的db03机器中的MHA 的运行状态，发现已经宕掉了：

[root@db03 bin]# masterha_check_status --conf=/etc/mha/appl.cnf
appl is stopped(2:NOT_RUNNING).

8.2判断哪台机器是主机并启动数据库：

(1)登录每天机器数据库，查看主从状态，没有状态的数据库指定是主库show slave status \G

(2)登录从库数据库看看指针指向那台机器，那么这台机器就是主库

(3)在manager机器中通过日志查看change master to执行语句（看最后一条）：

[root@db03 bin]# grep -i 'change master to' /var/log/mha/appl/manager
Thu Mar  4 21:45:00 2021 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER T MASTER_HOST='10.0.0.202', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Tue Mar  9 20:18:39 2021 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER T MASTER_HOST='10.0.0.201', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
[root@db03 bin]#

(4)启动宕掉的哪台机器数据库db02，并将其加入到一主两从这个环境里：

[root@db02 ~]# mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.7.26-log MySQL Community Server (GPL)

Copyright (c) 2000, 2019, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

db02 [(none)]>show slave status \G
Empty set (0.00 sec)

db02 [(none)]>CHANGE MASTER TO MASTER_HOST='10.0.0.201', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='123';
Query OK, 0 rows affected, 2 warnings (0.45 sec)

db02 [(none)]>start slave;
Query OK, 0 rows affected (0.10 sec)

db02 [(none)]>show slave status \G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 10.0.0.201
Master_User: repl
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000011
Read_Master_Log_Pos: 234
Relay_Log_File: db02-relay-bin.000002
Relay_Log_Pos: 367
Relay_Master_Log_File: mysql-bin.000011
Slave_IO_Running: Yes
Slave_SQL_Running: Yes

(5)检查vip是否已在主库中设置(未发现):

[root@db01 ~]# ip add
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:2d:7c:19 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.201/24 brd 10.0.0.255 scope global noprefixroute eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::b23b:2bba:e233:1170/64 scope link tentative noprefixroute dadfailed 
       valid_lft forever preferred_lft forever
    inet6 fe80::3e78:8d00:2f7:d39f/64 scope link noprefixroute 
       valid_lft forever preferred_lft forever

（6）到Manager机器中查看高可用脚本里的vip网段

[root@db03 ~]# cd /usr/local/bin/
[root@db03 bin]# vim master_ip_failover

#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';

use Getopt::Long;

my (
    $command,          $ssh_user,        $orig_master_host, $orig_master_ip,
    $orig_master_port, $new_master_host, $new_master_ip,    $new_master_port
);

my $vip = '10.0.0.55/24'; #配置一个闲置的网段
my $key = '1';
my $ssh_start_vip = "/sbin/ifconfig ens33:$key $vip"; #换成自己的网卡名称
my $ssh_stop_vip = "/sbin/ifconfig ens33:$key down"; #换成自己的网卡名称
my $ssh_Bcast_arp= "/sbin/arping -I ens33 -c 3 -A 10.0.0.55"; #换成自己的网卡名称

(7)把vip手动添加到主库中：

[root@db01 ~]# yum install -y net-tools
[root@db01 ~]# ifconfig eth0:1 10.0.0.55/24

[root@db01 ~]# ip add
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 00:0c:29:2d:7c:19 brd ff:ff:ff:ff:ff:ff
inet 10.0.0.201/24 brd 10.0.0.255 scope global noprefixroute eth0
valid_lft forever preferred_lft forever
inet 10.0.0.55/24 brd 10.0.0.255 scope global secondary eth0:1
valid_lft forever preferred_lft forever
inet6 fe80::b23b:2bba:e233:1170/64 scope link tentative noprefixroute dadfailed
valid_lft forever preferred_lft forever
inet6 fe80::3e78:8d00:2f7:d39f/64 scope link noprefixroute
valid_lft forever preferred_lft forever
[root@db01 ~]#

（8）检查Manager所在的机器db03中的配置文件，看看之前宕掉的故障机器的节点是否在，如果不在就加进来(不在)：

[root@db03 bin]# cat /etc/mha/appl.cnf
[server default]
manager_log=/var/log/mha/appl/manager
manager_workdir=/var/log/mha/appl
master_binlog_dir=/data/binlog
master_ip_failover_script=/usr/local/bin/master_ip_failover
password=mha
ping_interval=2
repl_password=123
repl_user=repl
report_script=/usr/local/bin/send
ssh_user=root
user=mha

[server1]
hostname=10.0.0.201
port=3306

[server3]
hostname=10.0.0.203
port=3306
[root@db03 bin]# vim /etc/mha/appl.cnf
[root@db03 bin]# cat /etc/mha/appl.cnf
[server default]
manager_log=/var/log/mha/appl/manager
manager_workdir=/var/log/mha/appl
master_binlog_dir=/data/binlog
master_ip_failover_script=/usr/local/bin/master_ip_failover
password=mha
ping_interval=2
repl_password=123
repl_user=repl
report_script=/usr/local/bin/send
ssh_user=root
user=mha

[server1]
hostname=10.0.0.201
port=3306

[server2]
hostname=10.0.0.202
port=3306

（9）在Manager所在的机器中进行启动预检查：

[root@db03 bin]# masterha_check_ssh --conf=/etc/mha/appl.cnf #检查互信
[root@db03 bin]# masterha_check_repl --conf=/etc/mha/appl.cnf #检查主从

（10）启动MHA并检查状态：

[root@db03 bin]# nohup masterha_manager --conf=/etc/mha/appl.cnf --remove_dead_master_conf --ignore_last_failover </dev/null> /var/log/mha/appl/manager.log 2>&1 &
[1] 3187
[root@db03 bin]# masterha_check_status --conf=/etc/mha/appl.cnf
appl (pid:3187) is running(0:PING_OK), master:10.0.0.201
[root@db03 bin]#