1. Overview

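    MHA is currently one of the more widely used MySQL high-availability solutions. It can complete a failover within 30 seconds while preserving data consistency as far as possible, which reduces the impact of a failure.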

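    MHA consists of MHA Manager and MHA Node. The Manager monitors the state of each MySQL replication group and periodically checks the group's master. If the master fails, the Manager promotes a suitable slave to be the new master, and the remaining slaves of the failed master are automatically repointed to the new master.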

2. Preparing the environment

2.1 Architecture

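    This setup uses 3 identically configured hosts. Any one of them can run the MHA Manager (here h168, the host of the group 1 master). MySQL replication is deployed on all of them, three replication groups in total.
    The groups are laid out crosswise: each host runs the master of one group and one slave of each of the other two groups. Each group has its own virtual IP (VIP) for client access, and all groups replicate in semi-synchronous mode. Figure 1 shows the cluster layout: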


Figure 1: MHA cluster architecture


    The tables below describe what is deployed on each host and how the database instances are laid out (Table 1 and Table 2):


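Table 1: Host roles

| IP            | Hostname | Description                                  |
|---------------|----------|----------------------------------------------|
| 192.168.1.168 | h168     | MHA Manager, D1-Master, D2-Slave1, D3-Slave1 |
| 192.168.1.169 | h169     | MHA-Node, D2-Master, D1-Slave1, D3-Slave2    |
| 192.168.1.170 | h170     | MHA-Node, D3-Master, D1-Slave2, D2-Slave2    |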

Table 2: Master/slave design

| Instance | IP            | Config file | Port | server_id | VIP           |
|----------|---------------|-------------|------|-----------|---------------|
| D1-M     | 192.168.1.168 | 3306-M.cnf  | 3306 | 33060     | 192.168.1.171 |
| D1-S1    | 192.168.1.169 | 3306-S1.cnf | 3306 | 33061     | 192.168.1.171 |
| D1-S2    | 192.168.1.170 | 3306-S2.cnf | 3306 | 33062     | 192.168.1.171 |
| D2-M     | 192.168.1.169 | 3307-M.cnf  | 3307 | 33070     | 192.168.1.172 |
| D2-S1    | 192.168.1.168 | 3307-S1.cnf | 3307 | 33071     | 192.168.1.172 |
| D2-S2    | 192.168.1.170 | 3307-S2.cnf | 3307 | 33072     | 192.168.1.172 |
| D3-M     | 192.168.1.170 | 3308-M.cnf  | 3308 | 33080     | 192.168.1.173 |
| D3-S1    | 192.168.1.168 | 3308-S1.cnf | 3308 | 33081     | 192.168.1.173 |
| D3-S2    | 192.168.1.169 | 3308-S2.cnf | 3308 | 33082     | 192.168.1.173 |
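
The rows of Table 2 follow a regular pattern, which the sketch below reproduces. The rule is inferred from the table, not stated by the author: group n uses port 3305+n and VIP 192.168.1.(170+n), server_id is the port times 10 plus the role index, and the slaves sit on the remaining hosts in ascending order. The function name print_layout is made up for this illustration.

```shell
#!/usr/bin/env bash
# Print the instance layout of Table 2 from the naming rule it appears to follow.
# Assumed rule: port = 3305+n, VIP = 192.168.1.(170+n),
# server_id = port*10 + role index (0 = M, 1 = S1, 2 = S2).
print_layout() {
  local hosts=(192.168.1.168 192.168.1.169 192.168.1.170)
  local roles=(M S1 S2)
  local n i port order
  for n in 1 2 3; do
    port=$((3305 + n))
    # Master host first, then the remaining hosts in ascending order.
    order=("$((n - 1))")
    for i in 0 1 2; do
      [ "$i" -ne "$((n - 1))" ] && order+=("$i")
    done
    for i in 0 1 2; do
      printf 'D%d-%s %s %s-%s.cnf %d %d 192.168.1.%d\n' \
        "$n" "${roles[$i]}" "${hosts[${order[$i]}]}" \
        "$port" "${roles[$i]}" "$port" "$((port * 10 + i))" "$((170 + n))"
    done
  done
}

print_layout
```

Generating the layout from a rule like this makes it easier to spot a mistyped port or server_id before any instance is started.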

Table 3: Directory layout

    Using the first replication group as an example:

| Item           | D1-M (h168) | D1-S1 (h169) | D1-S2 (h170) |
|----------------|-------------|--------------|--------------|
| Root directory | /app/mysql/3306-M | /app/mysql/3306-S1 | /app/mysql/3306-S2 |
| Config file    | /app/mysql/conf/3306-M.cnf | /app/mysql/conf/3306-S1.cnf | /app/mysql/conf/3306-S2.cnf |
| data           | /app/mysql/3306-M/data | /app/mysql/3306-S1/data | /app/mysql/3306-S2/data |
| pid            | /app/mysql/3306-M/mysqld.pid | /app/mysql/3306-S1/mysqld.pid | /app/mysql/3306-S2/mysqld.pid |
| socket         | /app/mysql/3306-M/mysql.sock | /app/mysql/3306-S1/mysql.sock | /app/mysql/3306-S2/mysql.sock |
| Slow query log | /app/mysql/3306-M/logs/slow-log/mysql-slow.log | /app/mysql/3306-S1/logs/slow-log/mysql-slow.log | /app/mysql/3306-S2/logs/slow-log/mysql-slow.log |
| Binlog         | /app/mysql/3306-M/logs/bin-log/mysql-bin | /app/mysql/3306-S1/logs/bin-log/mysql-bin | /app/mysql/3306-S2/logs/bin-log/mysql-bin |
| Relay log      | /app/mysql/3306-M/logs/relay-log/relay-bin | /app/mysql/3306-S1/logs/relay-log/relay-bin | /app/mysql/3306-S2/logs/relay-log/relay-bin |
| mysqld log     | /app/mysql/3306-M/logs/mysqld.log | /app/mysql/3306-S1/logs/mysqld.log | /app/mysql/3306-S2/logs/mysqld.log |
| Start script   | /app/mysql/start_3306-M.sh | /app/mysql/start_3306-S1.sh | /app/mysql/start_3306-S2.sh |

2.2 Basic host settings
  1. Edit hosts (on every host)

    As root, edit /etc/hosts with vim and add the following entries:

[root@h168 webapp]# cat /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.1.168 	h168
192.168.1.169 	h169
192.168.1.170 	h170

    Restart the network service with systemctl restart network, then disconnect and reconnect the terminal for the change to take effect.
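
Because the same three entries have to be added on every host, a small idempotent helper avoids duplicate lines when the step is repeated. This is only a sketch: add_host_entry is a hypothetical name, and it has to run as root when pointed at /etc/hosts.

```shell
#!/usr/bin/env bash
# add_host_entry FILE IP NAME
# Appends "IP<TAB>NAME" to FILE unless NAME already appears as a host name.
add_host_entry() {
  local file=$1 ip=$2 name=$3
  if ! grep -qE "[[:space:]]${name}([[:space:]]|\$)" "$file" 2>/dev/null; then
    printf '%s\t%s\n' "$ip" "$name" >> "$file"
  fi
}

# Example (as root):
#   add_host_entry /etc/hosts 192.168.1.168 h168
#   add_host_entry /etc/hosts 192.168.1.169 h169
#   add_host_entry /etc/hosts 192.168.1.170 h170
```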

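  2. Set up SSH trust between hosts (on every host)

    Run the following commands as webapp (the non-root account that owns MySQL) to set up trust, including from each host to itself.

ssh-keygen -t rsa	# press Enter at every prompt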
ssh-copy-id -i ~/.ssh/id_rsa.pub webapp@h168
ssh-copy-id -i ~/.ssh/id_rsa.pub webapp@h169
ssh-copy-id -i ~/.ssh/id_rsa.pub webapp@h170

    Verify the setup once it is done, to avoid problems later.
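
One way to do that verification is to attempt a non-interactive login to every host. The sketch below assumes the webapp account and the host names from this article. check_ssh_trust is a hypothetical helper, and the SSH_CMD override exists only so the function can be exercised without real hosts.

```shell
#!/usr/bin/env bash
# check_ssh_trust HOST...
# Tries a key-based login to each host. BatchMode=yes makes ssh fail
# instead of prompting for a password when the key is not accepted.
check_ssh_trust() {
  local ssh_cmd=${SSH_CMD:-ssh} h rc=0
  for h in "$@"; do
    if "$ssh_cmd" -o BatchMode=yes -o ConnectTimeout=5 "webapp@$h" true; then
      echo "OK   $h"
    else
      echo "FAIL $h"
      rc=1
    fi
  done
  return "$rc"
}

# Example (run as webapp on each of the three hosts):
#   check_ssh_trust h168 h169 h170
```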

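  3. Create directories

    Create /app and /app/soft on every host

# As root: create /app and /app/soft (soft holds the installation packages)
mkdir -p /app/soft
# As root: change the owner and group of /app and everything below it
chown -R webapp:webapp /app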

    Directory layout on h168:

[webapp@h168 app]$ mkdir -p /app/mha/conf	# MHA configuration files
[webapp@h168 app]$ mkdir -p /app/mha/logs	# MHA Manager logs
[webapp@h168 app]$ mkdir -p /app/mha/scripts	# MHA failover scripts
[webapp@h168 app]$ mkdir -p /app/mha/workdir	# MHA working directory
[webapp@h168 app]$ mkdir -p /app/mysql/conf	# MySQL configuration files
[webapp@h168 app]$ mkdir -p /app/mysql/3306-M/data    # MySQL instance directories start here
[webapp@h168 app]$ mkdir -p /app/mysql/3306-M/logs/slow-log
[webapp@h168 app]$ mkdir -p /app/mysql/3306-M/logs/bin-log
[webapp@h168 app]$ mkdir -p /app/mysql/3306-M/logs/relay-log
[webapp@h168 app]$ mkdir -p /app/mysql/3307-S1
[webapp@h168 app]$ mkdir -p /app/mysql/3308-S1
[webapp@h168 app]$ cp -r /app/mysql/3306-M/* /app/mysql/3307-S1/
[webapp@h168 app]$ cp -r /app/mysql/3306-M/* /app/mysql/3308-S1/

[webapp@h168 mysql]# tree /app/
/app/
├── mha
│   ├── conf
│   ├── logs
│   ├── scripts
│   └── workdir
├── mysql
│   ├── 3306-M
│   │   ├── data
│   │   └── logs
│   │       ├── bin-log
│   │       ├── relay-log
│   │       └── slow-log
│   ├── 3307-S1
│   │   ├── data
│   │   └── logs
│   │       ├── bin-log
│   │       ├── relay-log
│   │       └── slow-log
│   ├── 3308-S1
│   │   ├── data
│   │   └── logs
│   │       ├── bin-log
│   │       ├── relay-log
│   │       └── slow-log
│   └── conf
└── soft

26 directories, 0 files

    Directory layout on h169:

[webapp@h169 app]$ mkdir -p /app/mha/conf	# MHA configuration files
[webapp@h169 app]$ mkdir -p /app/mha/logs	# MHA Manager logs
[webapp@h169 app]$ mkdir -p /app/mha/scripts	# MHA failover scripts
[webapp@h169 app]$ mkdir -p /app/mha/workdir	# MHA working directory
[webapp@h169 app]$ mkdir -p /app/mysql/conf	# MySQL configuration files
[webapp@h169 app]$ mkdir -p /app/mysql/3307-M/data    # MySQL instance directories start here
[webapp@h169 app]$ mkdir -p /app/mysql/3307-M/logs/slow-log
[webapp@h169 app]$ mkdir -p /app/mysql/3307-M/logs/bin-log
[webapp@h169 app]$ mkdir -p /app/mysql/3307-M/logs/relay-log
[webapp@h169 app]$ mkdir -p /app/mysql/3306-S1
[webapp@h169 app]$ mkdir -p /app/mysql/3308-S2
[webapp@h169 app]$ cp -r /app/mysql/3307-M/* /app/mysql/3306-S1/
[webapp@h169 app]$ cp -r /app/mysql/3307-M/* /app/mysql/3308-S2/

[webapp@h169 mysql]# tree /app
/app
├── mha
│   ├── conf
│   ├── logs
│   ├── scripts
│   └── workdir
├── mysql
│   ├── 3306-S1
│   │   ├── data
│   │   └── logs
│   │       ├── bin-log
│   │       ├── relay-log
│   │       └── slow-log
│   ├── 3307-M
│   │   ├── data
│   │   └── logs
│   │       ├── bin-log
│   │       ├── relay-log
│   │       └── slow-log
│   ├── 3308-S2
│   │   ├── data
│   │   └── logs
│   │       ├── bin-log
│   │       ├── relay-log
│   │       └── slow-log
│   └── conf
└── soft

26 directories, 0 files

    Directory layout on h170:

[webapp@h170 app]$ mkdir -p /app/mha/conf	# MHA configuration files
[webapp@h170 app]$ mkdir -p /app/mha/logs	# MHA Manager logs
[webapp@h170 app]$ mkdir -p /app/mha/scripts	# MHA failover scripts
[webapp@h170 app]$ mkdir -p /app/mha/workdir	# MHA working directory
[webapp@h170 app]$ mkdir -p /app/mysql/conf	# MySQL configuration files
[webapp@h170 app]$ mkdir -p /app/mysql/3308-M/data    # MySQL instance directories start here
[webapp@h170 app]$ mkdir -p /app/mysql/3308-M/logs/slow-log
[webapp@h170 app]$ mkdir -p /app/mysql/3308-M/logs/bin-log
[webapp@h170 app]$ mkdir -p /app/mysql/3308-M/logs/relay-log
[webapp@h170 app]$ mkdir -p /app/mysql/3306-S2
[webapp@h170 app]$ mkdir -p /app/mysql/3307-S2
[webapp@h170 app]$ cp -r /app/mysql/3308-M/* /app/mysql/3306-S2/
[webapp@h170 app]$ cp -r /app/mysql/3308-M/* /app/mysql/3307-S2/

[webapp@h170 mysql]# tree /app
/app
├── mha
│   ├── conf
│   ├── logs
│   ├── scripts
│   └── workdir
├── mysql
│   ├── 3306-S2
│   │   ├── data
│   │   └── logs
│   │       ├── bin-log
│   │       ├── relay-log
│   │       └── slow-log
│   ├── 3307-S2
│   │   ├── data
│   │   └── logs
│   │       ├── bin-log
│   │       ├── relay-log
│   │       └── slow-log
│   ├── 3308-M
│   │   ├── data
│   │   └── logs
│   │       ├── bin-log
│   │       ├── relay-log
│   │       └── slow-log
│   └── conf
└── soft

26 directories, 0 files
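
The three listings differ only in the instance names, so the same tree can be produced by one function. A minimal sketch, assuming the layout shown above; make_instance_dirs is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# make_instance_dirs ROOT INSTANCE...
# Creates the MHA directories plus data/ and logs/* for every instance.
make_instance_dirs() {
  local root=$1 inst
  shift
  for inst in "$@"; do
    mkdir -p "$root/mysql/$inst/data" \
             "$root/mysql/$inst/logs/slow-log" \
             "$root/mysql/$inst/logs/bin-log" \
             "$root/mysql/$inst/logs/relay-log"
  done
  mkdir -p "$root/mysql/conf" "$root/soft" \
           "$root/mha/conf" "$root/mha/logs" \
           "$root/mha/scripts" "$root/mha/workdir"
}

# Example for h168 (as webapp, after /app has been chowned):
#   make_instance_dirs /app 3306-M 3307-S1 3308-S1
```

Run against an empty root, this yields the same 26 directories that tree reports above.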

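3. Install and configure MySQL and replication

3.1 Install MySQL
  1. Extract the tarball and rename the directory:
tar -zxvf mysql-5.7.24-linux-glibc2.12-x86_64.tar.gz -C /app/mysql/ && mv /app/mysql/mysql-5.7.24-linux-glibc2.12-x86_64/ /app/mysql/mysql-5.7.24
  2. Edit the database configuration files
        The 3306-M configuration file serves as the reference; for the other 8 files only the port and directories need to change, so they are not listed one by one.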
h168
vim   /app/mysql/conf/3306-M.cnf
vim   /app/mysql/conf/3307-S1.cnf
vim   /app/mysql/conf/3308-S1.cnf

h169
vim    /app/mysql/conf/3307-M.cnf
vim    /app/mysql/conf/3306-S1.cnf
vim    /app/mysql/conf/3308-S2.cnf

h170
vim    /app/mysql/conf/3308-M.cnf
vim    /app/mysql/conf/3306-S2.cnf
vim    /app/mysql/conf/3307-S2.cnf
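
Since the nine files differ only in port, role and server_id, they can be generated from one template. The sketch below is not the author's configuration file: the paths come from Table 3, while the option set is an assumed minimum for GTID replication on MySQL 5.7. The semi-synchronous plugin settings and all tuning parameters are deliberately left out, and gen_mycnf is a hypothetical helper.

```shell
#!/usr/bin/env bash
# gen_mycnf PORT ROLE SERVER_ID
# Prints a my.cnf for one instance. Paths follow Table 3; the options
# are an assumed minimum, not the author's file.
gen_mycnf() {
  local port=$1 role=$2 server_id=$3
  local root="/app/mysql/${port}-${role}"
  cat <<EOF
[mysqld]
port = ${port}
server_id = ${server_id}
basedir = /app/mysql/mysql-5.7.24
datadir = ${root}/data
socket = ${root}/mysql.sock
pid-file = ${root}/mysqld.pid
log-error = ${root}/logs/mysqld.log
slow_query_log = 1
slow_query_log_file = ${root}/logs/slow-log/mysql-slow.log
log-bin = ${root}/logs/bin-log/mysql-bin
relay-log = ${root}/logs/relay-log/relay-bin
log_slave_updates = 1
gtid_mode = ON
enforce_gtid_consistency = ON
EOF
}

# Example: print the file for D2-S1 on h168
gen_mycnf 3307 S1 33071
```

Redirecting the output to /app/mysql/conf/3307-S1.cnf gives a starting point that can then be edited by hand.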
  3. Initialize the databases (9 instances)
        Again, the initialization of a single instance is shown as an example.
[webapp@h168 bin]$ cd /app/mysql/mysql-5.7.24/bin
[webapp@h168 bin]$ ./mysqld --user=webapp --basedir=/app/mysql/mysql-5.7.24/ --datadir=/app/mysql/3306-M/data --initialize
./mysqld: error while loading shared libraries: libaio.so.1: cannot open shared object file: No such file or directory
[webapp@h168 bin]$

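    If the error above appears, install libaio (which provides libaio.so.1) to resolve it. Installing it with yum still produced the error here, apparently because yum installed the 32-bit package by default while this test system is 64-bit, so the package was installed with rpm instead, which fixed the error (required on every host).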

[root@h168 soft]# wget http://mirror.centos.org/centos/7/os/x86_64/Packages/libaio-0.3.109-13.el7.x86_64.rpm

[root@h168 soft]# rpm -ivh libaio-0.3.109-13.el7.x86_64.rpm

[root@h168 soft]# chown -R webapp:webapp /app/ 
[root@h168 soft]# su webapp

Now run the initialization:

[webapp@h168 bin]$ cd /app/mysql/mysql-5.7.24/bin
[webapp@h168 bin]$ ./mysqld --user=webapp --basedir=/app/mysql/mysql-5.7.24/ --datadir=/app/mysql/3306-M/data --initialize

A temporary password is printed and no errors appear, so the initialization succeeded.
  4. Write the MySQL start script
[webapp@h168 bin]$ cd /app/mysql/
[webapp@h168 mysql]$ cat << EOF > start_3306-M.sh
nohup ./mysql-5.7.24/bin/mysqld --defaults-file=./conf/3306-M.cnf --user=webapp 2>&1 &
EOF
[webapp@h168 mysql]$ chmod +x start_3306-M.sh
[webapp@h168 mysql]$ ./start_3306-M.sh
  5. Change the root password
        Initialization generates a random initial password that is hard to remember, so reset it.
[webapp@h168 bin]$ cd /app/mysql/mysql-5.7.24/bin
[webapp@h168 bin]$ ./mysql -uroot  -p -P3306 -S /app/mysql/3306-M/mysql.sock

mysql> SET PASSWORD = PASSWORD('123456');
Query OK, 0 rows affected, 1 warning (10.03 sec)
  6. Write a MySQL client script
[webapp@h168 bin]$ cd /app/mysql/
[webapp@h168 mysql]$ cat << EOF > client_3306-M.sh
mysql-5.7.24/bin/mysql -uroot  -p123456 -P3306 -S /app/mysql/3306-M/mysql.sock
EOF
[webapp@h168 mysql]$ chmod +x client_3306-M.sh

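Repeat the steps above for the other 8 instances, one by one. The sequence is the same each time; remember to change and double-check the directories and script names!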

3.2 Configure replication

    Replication can be set up in two ways: by binlog file and position (POS), or by GTID. This article uses GTID-based replication. GTID is described below:

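How GTID works:
1. GTID stands for global transaction identifier.
2. Each GTID maps one-to-one to a transaction and is globally unique.
3. A GTID is executed only once on a given server, which prevents repeated execution from corrupting data or breaking master/slave consistency.
4. GTID replaces the traditional replication method: replication is no longer started with MASTER_LOG_FILE + MASTER_LOG_POS but with MASTER_AUTO_POSITION=1.
5. It is supported since MySQL 5.6.5 and matured after MySQL 5.6.10.
6. With traditional replication the slave does not need binlog enabled, but with GTID the slave's binlog must be enabled so that executed GTIDs are recorded (mandatory in MySQL 5.6; from MySQL 5.7 executed GTIDs are also stored in the mysql.gtid_executed table).
Process:
1. When a transaction is executed and committed on the master, a GTID is generated and written to the binlog together with it.
2. After the binlog is transferred to the slave and stored in its relay log, the slave reads the GTID and sets the gtid_next variable to it, which tells the slave which GTID to execute next.
3. The SQL thread takes the GTID from the relay log and checks whether the slave's own binlog already contains it.
4. If it does, the transaction has already been executed and the slave skips it.
5. If it does not, the slave executes the transaction and records the GTID in its own binlog. Before reading and executing the transaction it first checks whether another session holds the same GTID, so that it is not executed twice.
6. While applying the events, the slave checks for a primary key; if there is none it uses a secondary index, and if there is no index at all it falls back to a full scan.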
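
Steps 3 to 5 of the process amount to an "apply only if not yet executed" check. The toy model below illustrates just that rule. It is not how mysqld stores or compares GTIDs: the plain text file stands in for the slave's record of executed GTIDs, and apply_gtid is a made-up name.

```shell
#!/usr/bin/env bash
# apply_gtid EXECUTED_FILE GTID
# Toy model: a transaction is applied only if its GTID is not yet
# recorded in EXECUTED_FILE (a stand-in for the slave's executed set).
apply_gtid() {
  local executed_file=$1 gtid=$2
  if grep -qxF "$gtid" "$executed_file" 2>/dev/null; then
    echo "skip $gtid"
  else
    echo "$gtid" >> "$executed_file"
    echo "apply $gtid"
  fi
}

# Example: replaying the same GTID twice applies it once and then skips it.
#   apply_gtid /tmp/executed.txt 0b1e8302-ea15-11e8-93f4-000c29bc90dc:1
#   apply_gtid /tmp/executed.txt 0b1e8302-ea15-11e8-93f4-000c29bc90dc:1
```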


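    Because these are freshly built databases, no mysqldump backup is needed. When setting up replication for an existing database, take a backup and import it on the slaves first so that the data is consistent.

  1. Create the replication user repl and grant it to the slaves for replication (run this on the other 8 instances too)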
GRANT REPLICATION SLAVE , REPLICATION CLIENT ON *.* TO 'repl'@'192.168.1.%' IDENTIFIED BY  '123456';
flush privileges;
  2. On the two slaves of each master, run the statements below to start replication

Run on h169 (D1-S1) and h170 (D1-S2):

CHANGE MASTER TO MASTER_HOST='192.168.1.168',MASTER_PORT=3306,MASTER_USER='repl',MASTER_PASSWORD='123456',MASTER_AUTO_POSITION=1;
start slave;

Run on h168 (D2-S1) and h170 (D2-S2):

CHANGE MASTER TO MASTER_HOST='192.168.1.169',MASTER_PORT=3307,MASTER_USER='repl',MASTER_PASSWORD='123456',MASTER_AUTO_POSITION=1;
start slave;

Run on h168 (D3-S1) and h169 (D3-S2):

CHANGE MASTER TO MASTER_HOST='192.168.1.170',MASTER_PORT=3308,MASTER_USER='repl',MASTER_PASSWORD='123456',MASTER_AUTO_POSITION=1;
start slave;

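Because there are several instances and several replication groups, pay close attention to MASTER_PORT: each slave must specify the port of its own master. If it is omitted, the default 3306 is used, which causes a lot of trouble.
  3. Check the replication status on every slave: show slave status\G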
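
To reduce the risk of pairing the wrong host and port, the statement can be generated from the group number. This sketch assumes the addressing of Table 2 (group n: master 192.168.1.(167+n), port 3305+n) and the repl credentials used above; change_master_sql is a hypothetical helper.

```shell
#!/usr/bin/env bash
# change_master_sql GROUP
# Prints the CHANGE MASTER TO and START SLAVE statements for group 1, 2 or 3.
change_master_sql() {
  local n=$1
  case "$n" in
    1|2|3) ;;
    *) echo "group must be 1, 2 or 3" >&2; return 1 ;;
  esac
  printf "CHANGE MASTER TO MASTER_HOST='192.168.1.%d',MASTER_PORT=%d,MASTER_USER='repl',MASTER_PASSWORD='123456',MASTER_AUTO_POSITION=1;\nSTART SLAVE;\n" \
    "$((167 + n))" "$((3305 + n))"
}

# Example: statements for the slaves of group 2
change_master_sql 2
```

The output can be pasted into the mysql client of the matching slave instance.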

mysql> show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.1.170
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000001
          Read_Master_Log_Pos: 2786
               Relay_Log_File: relay-bin.000002
                Relay_Log_Pos: 2999
        Relay_Master_Log_File: mysql-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 2786
              Relay_Log_Space: 3200
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 33062
                  Master_UUID: 0b1e8302-ea15-11e8-93f4-000c29bc90dc
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: 0b1e8302-ea15-11e8-93f4-000c29bc90dc:1-6,
b985084d-e9b1-11e8-b0e3-000c299e0f9f:1-6
            Executed_Gtid_Set: 0b1e8302-ea15-11e8-93f4-000c29bc90dc:1-6,
6324223f-ea10-11e8-800e-000c299e0f9f:1-7,
b985084d-e9b1-11e8-b0e3-000c299e0f9f:1-6
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:
1 row in set (0.00 sec)

ERROR:
No query specified
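The trailing "ERROR: No query specified" is harmless; it is caused by the extra semicolon after \G. When the following two fields are both Yes, replication is healthy.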
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
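
With nine instances, checking these two fields by eye gets tedious. A minimal sketch of a filter that reads the output of show slave status\G on standard input and succeeds only when both threads report Yes; replication_ok is a hypothetical name.

```shell
#!/usr/bin/env bash
# replication_ok: reads "show slave status\G" output on stdin.
# Exit status 0 only if Slave_IO_Running and Slave_SQL_Running are both Yes.
replication_ok() {
  awk -F': *' '
    /^ *Slave_IO_Running:/  { io  = $2; sub(/[ \t\r]+$/, "", io) }
    /^ *Slave_SQL_Running:/ { sql = $2; sub(/[ \t\r]+$/, "", sql) }
    END { exit !(io == "Yes" && sql == "Yes") }
  '
}

# Example (prompts for the root password):
#   /app/mysql/mysql-5.7.24/bin/mysql -uroot -p -S /app/mysql/3306-S1/mysql.sock \
#     -e 'show slave status\G' | replication_ok && echo "replication OK"
```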

4. Deploy MHA

  1. Install the Perl modules required by MHA on all nodes (h168, h169, h170)
yum -y install perl-DBD-MySQL
yum -y install perl-Config-Tiny
yum -y install perl-Log-Dispatch
yum -y install perl-Parallel-ForkManager
yum -y install perl-DBI.x86_64 perl-ExtUtils-CBuilder
yum -y install perl-ExtUtils-MakeMaker perl-CPAN
yum -y install perl-Mail-Sender
  2. Install MHA Node on all nodes (including the node that will run the Manager)

    Upload the downloaded mha4mysql-node-0.56.tar and extract it into /app/soft/mha4mysql-node-0.56.
Switch to the root user:

[root@h168 mha4mysql-node-0.56]# cd /app/soft/mha4mysql-node-0.56
[root@h168 mha4mysql-node-0.56]# perl Makefile.PL
*** Module::AutoInstall version 1.03
*** Checking for Perl dependencies...
[Core Features]
- DBI        ...loaded. (1.627)
- DBD::mysql ...loaded. (4.023)
*** Module::AutoInstall configuration finished.
Checking if your kit is complete...
Looks good
Writing Makefile for mha4mysql::node

[root@h168 mha4mysql-node-0.56]# make
cp lib/MHA/BinlogManager.pm blib/lib/MHA/BinlogManager.pm
cp lib/MHA/BinlogPosFindManager.pm blib/lib/MHA/BinlogPosFindManager.pm
cp lib/MHA/BinlogPosFinderXid.pm blib/lib/MHA/BinlogPosFinderXid.pm
cp lib/MHA/BinlogHeaderParser.pm blib/lib/MHA/BinlogHeaderParser.pm
cp lib/MHA/BinlogPosFinder.pm blib/lib/MHA/BinlogPosFinder.pm
cp lib/MHA/BinlogPosFinderElp.pm blib/lib/MHA/BinlogPosFinderElp.pm
cp lib/MHA/NodeUtil.pm blib/lib/MHA/NodeUtil.pm
cp lib/MHA/SlaveUtil.pm blib/lib/MHA/SlaveUtil.pm
cp lib/MHA/NodeConst.pm blib/lib/MHA/NodeConst.pm
cp bin/filter_mysqlbinlog blib/script/filter_mysqlbinlog
/usr/bin/perl "-Iinc" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/filter_mysqlbinlog
cp bin/apply_diff_relay_logs blib/script/apply_diff_relay_logs
/usr/bin/perl "-Iinc" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/apply_diff_relay_logs
cp bin/purge_relay_logs blib/script/purge_relay_logs
/usr/bin/perl "-Iinc" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/purge_relay_logs
cp bin/save_binary_logs blib/script/save_binary_logs
/usr/bin/perl "-Iinc" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/save_binary_logs
Manifying blib/man1/filter_mysqlbinlog.1
Manifying blib/man1/apply_diff_relay_logs.1
Manifying blib/man1/purge_relay_logs.1
Manifying blib/man1/save_binary_logs.1

[root@h168 mha4mysql-node-0.56]# make install
Installing /usr/local/share/perl5/MHA/BinlogManager.pm
Installing /usr/local/share/perl5/MHA/BinlogPosFindManager.pm
Installing /usr/local/share/perl5/MHA/BinlogPosFinderXid.pm
Installing /usr/local/share/perl5/MHA/BinlogHeaderParser.pm
Installing /usr/local/share/perl5/MHA/BinlogPosFinder.pm
Installing /usr/local/share/perl5/MHA/BinlogPosFinderElp.pm
Installing /usr/local/share/perl5/MHA/NodeUtil.pm
Installing /usr/local/share/perl5/MHA/SlaveUtil.pm
Installing /usr/local/share/perl5/MHA/NodeConst.pm
Installing /usr/local/share/man/man1/filter_mysqlbinlog.1
Installing /usr/local/share/man/man1/apply_diff_relay_logs.1
Installing /usr/local/share/man/man1/purge_relay_logs.1
Installing /usr/local/share/man/man1/save_binary_logs.1
Installing /usr/local/bin/filter_mysqlbinlog
Installing /usr/local/bin/apply_diff_relay_logs
Installing /usr/local/bin/purge_relay_logs
Installing /usr/local/bin/save_binary_logs
Appending installation info to /usr/lib64/perl5/perllocal.pod

After installation the following scripts are available; they are invoked by MHA Manager.

[webapp@h168 mha4mysql-node-0.56]$ cd /usr/local/bin
[webapp@h168 bin]$ ll
total 44
-r-xr-xr-x 1 root root 16367 Nov 17 13:02 apply_diff_relay_logs	# identifies differential relay log events and applies them to the other slaves
-r-xr-xr-x 1 root root  4807 Nov 17 13:02 filter_mysqlbinlog	# strips unnecessary ROLLBACK events (no longer used by MHA)
-r-xr-xr-x 1 root root  8261 Nov 17 13:02 purge_relay_logs	# purges relay logs
-r-xr-xr-x 1 root root  7525 Nov 17 13:02 save_binary_logs	# saves and copies the master's binary logs
  3. Install MHA Manager

    Any node can be chosen for the Manager; here it is installed on h168. Upload mha4mysql-manager-0.56.tar.gz and extract it into /app/soft/mha4mysql-manager-0.56. Installation works the same way as for Node.
Switch to the root user:

cd /app/soft/mha4mysql-manager-0.56
perl Makefile.PL
make
make install

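Main scripts used by the Manager:

# Check MHA's SSH configuration
masterha_check_ssh
# Check MySQL replication
masterha_check_repl
# Start MHA
masterha_manager
# Check the current MHA running state
masterha_check_status
# Detect whether the master is down
masterha_master_monitor
# Control failover (automatic or manual)
masterha_master_switch
# Add or remove server entries in the configuration
masterha_conf_host

  4. Create the monitoring user monitor (run on the three masters)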
mysql> grant all privileges on *.* to 'monitor'@'192.168.1.%' identified  by '123456';
Query OK, 0 rows affected (0.00 sec)

mysql> flush  privileges;
Query OK, 0 rows affected (0.01 sec)
  5. Create the configuration files and scripts
    The MHA configuration directory is /app/mha/conf, with the following files:
[webapp@h168 conf]$ tree /app/mha/conf/
/app/mha/conf/
├── d1-3306.cnf
├── d1-3306-switch-back.cnf
├── d2-3307.cnf
├── d2-3307-switch-back.cnf
├── d3-3308.cnf
└── d3-3308-switch-back.cnf

0 directories, 6 files
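
The contents of these files are not reproduced in this article. As a rough idea of what a per-group file could contain, the sketch below prints a minimal configuration for group n. Every value is an assumption derived from the paths and host names used here, not the author's actual file, and gen_mha_group_conf is a hypothetical helper.

```shell
#!/usr/bin/env bash
# gen_mha_group_conf GROUP
# Prints an assumed minimal MHA application config for group 1, 2 or 3.
gen_mha_group_conf() {
  local n=$1 port=$((3305 + $1)) h i=1
  cat <<EOF
[server default]
manager_workdir=/app/mha/workdir/d${n}-${port}
master_ip_failover_script=/app/mha/scripts/master_ip_failover_d${n}
master_ip_online_change_script=/app/mha/scripts/master_ip_online_change_d${n}
EOF
  # One [serverN] block per host; all instances of a group share one port.
  for h in h168 h169 h170; do
    printf '\n[server%d]\nhostname=%s\nport=%d\n' "$i" "$h" "$port"
    i=$((i + 1))
  done
}

# Example: gen_mha_group_conf 1 > /app/mha/conf/d1-3306.cnf
gen_mha_group_conf 1
```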

The MHA failover script directory is /app/mha/scripts, with the following files:

[webapp@h168 scripts]$ tree /app/mha/scripts/
/app/mha/scripts/
├── master_ip_failover_d1
├── master_ip_failover_d2
├── master_ip_failover_d3
├── master_ip_online_change_d1
├── master_ip_online_change_d2
└── master_ip_online_change_d3

0 directories, 6 files

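Adjust the configuration and script files accurately for each replication group, following the layout above, and make the scripts executable!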

[webapp@h168 scripts]$ chmod +x /app/mha/scripts/*
[webapp@h168 scripts]$ ll
total 60
-rwxr-xr-x 1 webapp webapp  4850 Nov 15 10:39 master_ip_failover_d1
-rwxr-xr-x 1 webapp webapp  4850 Nov 17 17:34 master_ip_failover_d2
-rwxr-xr-x 1 webapp webapp  4850 Nov 17 17:35 master_ip_failover_d3
-rwxr-xr-x 1 webapp webapp 10215 Nov 15 10:39 master_ip_online_change_d1
-rwxr-xr-x 1 webapp webapp 10215 Nov 17 17:35 master_ip_online_change_d2
-rwxr-xr-x 1 webapp webapp 10215 Nov 17 17:35 master_ip_online_change_d3
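  6. The MHA global configuration file

MHA reads /etc/masterha_default.cnf by default; the settings shared by the three replication groups have been moved into this default file. Create masterha_default.cnf under /etc and change its owner and group to the user that runs MHA.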

[webapp@h168 etc]$ sudo su
[sudo] password for webapp:
[root@h168 etc]# vim masterha_default.cnf
[root@h168 etc]# cd /etc
[root@h168 etc]# chown webapp:webapp masterha_default.cnf
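
The shared settings themselves are not listed in this article. The values below are inferred from the accounts created earlier and from the Manager log output (monitor user, webapp as SSH user, 3-second ping interval); treat them as an assumed sketch, not the author's file. gen_mha_default_conf is a hypothetical helper.

```shell
#!/usr/bin/env bash
# gen_mha_default_conf
# Prints an assumed set of defaults shared by all three replication groups.
gen_mha_default_conf() {
  cat <<'EOF'
[server default]
user=monitor
password=123456
ssh_user=webapp
repl_user=repl
repl_password=123456
ping_interval=3
EOF
}

# Example (as root): gen_mha_default_conf > /etc/masterha_default.cnf
gen_mha_default_conf
```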
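  7. NIC permissions and virtual IP

The webapp account is not allowed to bind a virtual IP, so set the setuid bit ("s") on /sbin/ifconfig now to avoid problems later. (This has to be done on all three nodes h168, h169 and h170.)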

[webapp@h168 app]$ sudo su
[sudo] password for webapp:
[root@h168 app]# chmod u+s /sbin/ifconfig
[root@h168 app]# exit

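After that, manually bind the virtual IP of each replication group once on the host of its master (h168, h169, h170). The alias interface is named ens33:<master port>. The example below binds the VIP 192.168.1.171 on h168.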

[webapp@h168 scripts]$ ifconfig ens33:3306 192.168.1.171/24
[webapp@h168 scripts]$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:9e:0f:9f brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.168/24 brd 192.168.1.255 scope global ens33
       valid_lft forever preferred_lft forever
    inet 192.168.1.171/24 brd 192.168.1.255 scope global secondary ens33:3306
       valid_lft forever preferred_lft forever
    inet6 fe80::e6a:57e1:79b0:6099/64 scope link
       valid_lft forever preferred_lft forever
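
The failover scripts move this alias from the old master to the new one. Their source is not shown in this article, so the following is only a dry-run sketch of the two commands such a script typically issues, using the naming rule above (alias ens33:<port>, VIP 192.168.1.171 for port 3306 and so on). vip_for_port and vip_move_cmds are hypothetical names, and the commands are printed, not executed.

```shell
#!/usr/bin/env bash
# vip_for_port PORT: VIP assumed for a master port (3306 -> 192.168.1.171).
vip_for_port() {
  echo "192.168.1.$(( $1 - 3306 + 171 ))"
}

# vip_move_cmds PORT OLD_HOST NEW_HOST
# Prints (does not run) the commands that take the alias down on the old
# master and bring it up on the new one.
vip_move_cmds() {
  local port=$1 old=$2 new=$3 vip
  vip=$(vip_for_port "$port")
  echo "ssh webapp@$old /sbin/ifconfig ens33:$port down"
  echo "ssh webapp@$new /sbin/ifconfig ens33:$port $vip/24"
}

# Example: group 1 fails over from h168 to h169
vip_move_cmds 3306 h168 h169
```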
  8. Check SSH connectivity between the MHA nodes
    Check the SSH connectivity of all three replication groups in turn.
[webapp@h168 app]$ masterha_check_ssh --conf=/app/mha/conf/d1-3306.cnf
Sat Nov 17 18:08:54 2018 - [info] Reading default configuration from /etc/masterha_default.cnf..
Sat Nov 17 18:08:54 2018 - [info] Reading application default configuration from /app/mha/conf/d1-3306.cnf..
Sat Nov 17 18:08:54 2018 - [info] Reading server configuration from /app/mha/conf/d1-3306.cnf..
Sat Nov 17 18:08:54 2018 - [info] Starting SSH connection tests..
Sat Nov 17 18:08:55 2018 - [debug]
Sat Nov 17 18:08:54 2018 - [debug]  Connecting via SSH from webapp@h168(192.168.1.168:22) to webapp@h169(192.168.1.169:22)..
Sat Nov 17 18:08:54 2018 - [debug]   ok.
Sat Nov 17 18:08:54 2018 - [debug]  Connecting via SSH from webapp@h168(192.168.1.168:22) to webapp@h170(192.168.1.170:22)..
Sat Nov 17 18:08:55 2018 - [debug]   ok.
Sat Nov 17 18:08:55 2018 - [debug]
Sat Nov 17 18:08:54 2018 - [debug]  Connecting via SSH from webapp@h169(192.168.1.169:22) to webapp@h168(192.168.1.168:22)..
Sat Nov 17 18:08:55 2018 - [debug]   ok.
Sat Nov 17 18:08:55 2018 - [debug]  Connecting via SSH from webapp@h169(192.168.1.169:22) to webapp@h170(192.168.1.170:22)..
Sat Nov 17 18:08:55 2018 - [debug]   ok.
Sat Nov 17 18:08:56 2018 - [debug]
Sat Nov 17 18:08:55 2018 - [debug]  Connecting via SSH from webapp@h170(192.168.1.170:22) to webapp@h168(192.168.1.168:22)..
Sat Nov 17 18:08:55 2018 - [debug]   ok.
Sat Nov 17 18:08:55 2018 - [debug]  Connecting via SSH from webapp@h170(192.168.1.170:22) to webapp@h169(192.168.1.169:22)..
Sat Nov 17 18:08:56 2018 - [debug]   ok.
Sat Nov 17 18:08:56 2018 - [info] All SSH connection tests passed successfully.
  9. Check the status of the three replication groups
[webapp@h168 scripts]$ masterha_check_repl --conf=/app/mha/conf/d1-3306.cnf
Sat Nov 17 18:24:56 2018 - [info] Reading default configuration from /etc/masterha_default.cnf..
Sat Nov 17 18:24:56 2018 - [info] Reading application default configuration from /app/mha/conf/d1-3306.cnf..
Sat Nov 17 18:24:56 2018 - [info] Reading server configuration from /app/mha/conf/d1-3306.cnf..
Sat Nov 17 18:24:56 2018 - [info] MHA::MasterMonitor version 0.56.
Sat Nov 17 18:24:57 2018 - [info] GTID failover mode = 1
Sat Nov 17 18:24:57 2018 - [info] Dead Servers:
Sat Nov 17 18:24:57 2018 - [info] Alive Servers:
Sat Nov 17 18:24:57 2018 - [info]   h168(192.168.1.168:3306)
Sat Nov 17 18:24:57 2018 - [info]   h169(192.168.1.169:3306)
Sat Nov 17 18:24:57 2018 - [info]   h170(192.168.1.170:3306)
Sat Nov 17 18:24:57 2018 - [info] Alive Slaves:
Sat Nov 17 18:24:57 2018 - [info]   h169(192.168.1.169:3306)  Version=5.7.24-log (oldest major version between slaves) log-bin:enabled
Sat Nov 17 18:24:57 2018 - [info]     GTID ON
Sat Nov 17 18:24:57 2018 - [info]     Replicating from 192.168.1.168(192.168.1.168:3306)
Sat Nov 17 18:24:57 2018 - [info]   h170(192.168.1.170:3306)  Version=5.7.24-log (oldest major version between slaves) log-bin:enabled
Sat Nov 17 18:24:57 2018 - [info]     GTID ON
Sat Nov 17 18:24:57 2018 - [info]     Replicating from 192.168.1.168(192.168.1.168:3306)
Sat Nov 17 18:24:57 2018 - [info] Current Alive Master: h168(192.168.1.168:3306)
Sat Nov 17 18:24:57 2018 - [info] Checking slave configurations..
Sat Nov 17 18:24:57 2018 - [info]  read_only=1 is not set on slave h169(192.168.1.169:3306).
Sat Nov 17 18:24:57 2018 - [info]  read_only=1 is not set on slave h170(192.168.1.170:3306).
Sat Nov 17 18:24:57 2018 - [info] Checking replication filtering settings..
Sat Nov 17 18:24:57 2018 - [info]  binlog_do_db= , binlog_ignore_db=
Sat Nov 17 18:24:57 2018 - [info]  Replication filtering check ok.
Sat Nov 17 18:24:57 2018 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Sat Nov 17 18:24:57 2018 - [info] Checking SSH publickey authentication settings on the current master..
Sat Nov 17 18:24:57 2018 - [info] HealthCheck: SSH to h168 is reachable.
Sat Nov 17 18:24:57 2018 - [info]
h168(192.168.1.168:3306) (current master)
 +--h169(192.168.1.169:3306)
 +--h170(192.168.1.170:3306)

Sat Nov 17 18:24:57 2018 - [info] Checking replication health on h169..
Sat Nov 17 18:24:57 2018 - [info]  ok.
Sat Nov 17 18:24:57 2018 - [info] Checking replication health on h170..
Sat Nov 17 18:24:57 2018 - [info]  ok.
Sat Nov 17 18:24:57 2018 - [info] Checking master_ip_failover_script status:
Sat Nov 17 18:24:57 2018 - [info]   /app/mha/scripts/master_ip_failover_d1 --command=status --ssh_user=webapp --orig_master_host=h168 --orig_master_ip=192.168.1.168 --orig_master_port=3306
        inet 192.168.1.171  netmask 255.255.255.0  broadcast 192.168.1.255
INFO: VIP 192.168.1.171 found on Master
Sat Nov 17 18:24:58 2018 - [info]  OK.
Sat Nov 17 18:24:58 2018 - [warning] shutdown_script is not defined.
Sat Nov 17 18:24:58 2018 - [info] Got exit code 0 (Not master dead).

MySQL Replication Health is OK.

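This OK means that replication is healthy.

  10. Start MHA Manager monitoring

Three Managers have to be started, one per replication group. The start scripts are similar, so only one is shown here.

[webapp@h168 mha]$ mkdir manager_cmd
[webapp@h168 mha]$ cd manager_cmd
[webapp@h168 manager_cmd]$ vim start_mha_manager_d1.sh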
nohup masterha_manager --conf=/app/mha/conf/d1-3306.cnf --ignore_last_failover < /dev/null > /app/mha/logs/manager-d1.log 2>&1 &
[webapp@h168 manager_cmd]$ chmod +x start_mha_manager_d1.sh
[webapp@h168 manager_cmd]$ ./start_mha_manager_d1.sh

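Startup parameters (descriptions quoted from another source):
--remove_dead_master_conf: after a failover, the entry of the old master is removed from the configuration file.
--manager_log: where the log is written.
--ignore_last_failover: by default, if MHA detects consecutive outages less than 8 hours apart, it does not fail over again; this limit avoids a ping-pong effect. The parameter tells MHA to ignore the file left behind by the previous failover. By default, after a failover MHA creates app1.failover.complete in its working directory (/data in the quoted example); if that file exists when the next failover is due, the failover is refused unless the file was deleted manually after the first one. For convenience, --ignore_last_failover is set here.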
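
The 8-hour rule can be pictured as a check on the age of that marker file. The sketch below only mimics the behaviour as described in the text; MHA performs its own check internally, and failover_allowed is a made-up helper.

```shell
#!/usr/bin/env bash
# failover_allowed MARKER_FILE
# Exit status 0 if no marker exists or if it is at least 8 hours old.
failover_allowed() {
  local marker=$1
  [ ! -e "$marker" ] && return 0
  # find prints the path only if it was modified less than 480 minutes ago.
  [ -z "$(find "$marker" -mmin -480)" ]
}

# Example:
#   failover_allowed /app/mha/workdir/d1-3306.failover.complete \
#     && echo "failover would proceed" || echo "failover would be refused"
```

The marker path in the example is a guess at where the file would land in this setup.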

After the three Managers have started, check the status of each:

[webapp@h168 scripts]$ masterha_check_status --conf=/app/mha/conf/d1-3306.cnf
d1-3306 (pid:4974) is running(0:PING_OK), master:h168
[webapp@h168 scripts]$ masterha_check_status --conf=/app/mha/conf/d2-3307.cnf
d2-3307 (pid:5161) is running(0:PING_OK), master:h169
[webapp@h168 scripts]$ masterha_check_status --conf=/app/mha/conf/d3-3308.cnf
d3-3308 (pid:5176) is running(0:PING_OK), master:h170
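
The three checks can be wrapped in a loop that flags any group whose Manager is not in the PING_OK state. A minimal sketch, assuming the status line format shown above; status_is_ok and check_all_managers are hypothetical names.

```shell
#!/usr/bin/env bash
# status_is_ok: reads masterha_check_status output on stdin and succeeds
# only if it reports a running Manager with PING_OK.
status_is_ok() {
  grep -qF 'is running(0:PING_OK)'
}

# check_all_managers: runs the check for the three groups of this setup.
check_all_managers() {
  local c rc=0
  for c in d1-3306 d2-3307 d3-3308; do
    if masterha_check_status --conf="/app/mha/conf/$c.cnf" | status_is_ok; then
      echo "OK   $c"
    else
      echo "FAIL $c"
      rc=1
    fi
  done
  return "$rc"
}

# Example (on h168 as webapp): check_all_managers
```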

Stopping a Manager:

[webapp@h168 scripts]$ masterha_stop --conf=/app/mha/conf/d1-3306.cnf
Stopped d1-3306 successfully.

Check the Manager log; output like the following means that the Manager started normally.

[webapp@h168 logs]$ cd /app/mha/logs
[webapp@h168 logs]$ ls
manager-d1.log  manager-d2.log  manager-d3.log
[webapp@h168 logs]$ tail -100f manager-d1.log
Sat Nov 17 19:24:09 2018 - [info] Reading default configuration from /etc/masterha_default.cnf..
Sat Nov 17 19:24:09 2018 - [info] Reading application default configuration from /app/mha/conf/d1-3306.cnf..
Sat Nov 17 19:24:09 2018 - [info] Reading server configuration from /app/mha/conf/d1-3306.cnf..
Sat Nov 17 19:24:09 2018 - [info] MHA::MasterMonitor version 0.56.
Sat Nov 17 19:24:10 2018 - [info] GTID failover mode = 1
Sat Nov 17 19:24:10 2018 - [info] Dead Servers:
Sat Nov 17 19:24:10 2018 - [info] Alive Servers:
Sat Nov 17 19:24:10 2018 - [info]   h168(192.168.1.168:3306)
Sat Nov 17 19:24:10 2018 - [info]   h169(192.168.1.169:3306)
Sat Nov 17 19:24:10 2018 - [info]   h170(192.168.1.170:3306)
Sat Nov 17 19:24:10 2018 - [info] Alive Slaves:
Sat Nov 17 19:24:10 2018 - [info]   h169(192.168.1.169:3306)  Version=5.7.24-log (oldest major version between slaves) log-bin:enabled
Sat Nov 17 19:24:10 2018 - [info]     GTID ON
Sat Nov 17 19:24:10 2018 - [info]     Replicating from 192.168.1.168(192.168.1.168:3306)
Sat Nov 17 19:24:10 2018 - [info]   h170(192.168.1.170:3306)  Version=5.7.24-log (oldest major version between slaves) log-bin:enabled
Sat Nov 17 19:24:10 2018 - [info]     GTID ON
Sat Nov 17 19:24:10 2018 - [info]     Replicating from 192.168.1.168(192.168.1.168:3306)
Sat Nov 17 19:24:10 2018 - [info] Current Alive Master: h168(192.168.1.168:3306)
Sat Nov 17 19:24:10 2018 - [info] Checking slave configurations..
Sat Nov 17 19:24:10 2018 - [info]  read_only=1 is not set on slave h169(192.168.1.169:3306).
Sat Nov 17 19:24:10 2018 - [info]  read_only=1 is not set on slave h170(192.168.1.170:3306).
Sat Nov 17 19:24:10 2018 - [info] Checking replication filtering settings..
Sat Nov 17 19:24:10 2018 - [info]  binlog_do_db= , binlog_ignore_db=
Sat Nov 17 19:24:10 2018 - [info]  Replication filtering check ok.
Sat Nov 17 19:24:10 2018 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Sat Nov 17 19:24:10 2018 - [info] Checking SSH publickey authentication settings on the current master..
Sat Nov 17 19:24:10 2018 - [info] HealthCheck: SSH to h168 is reachable.
Sat Nov 17 19:24:10 2018 - [info]
h168(192.168.1.168:3306) (current master)
 +--h169(192.168.1.169:3306)
 +--h170(192.168.1.170:3306)

Sat Nov 17 19:24:10 2018 - [info] Checking master_ip_failover_script status:
Sat Nov 17 19:24:10 2018 - [info]   /app/mha/scripts/master_ip_failover_d1 --command=status --ssh_user=webapp --orig_master_host=h168 --orig_master_ip=192.168.1.168 --orig_master_port=3306
        inet 192.168.1.171  netmask 255.255.255.0  broadcast 192.168.1.255
INFO: VIP 192.168.1.171 found on Master
Sat Nov 17 19:24:10 2018 - [info]  OK.
Sat Nov 17 19:24:10 2018 - [warning] shutdown_script is not defined.
Sat Nov 17 19:24:10 2018 - [info] Set master ping interval 3 seconds.
Sat Nov 17 19:24:10 2018 - [info] Set secondary check script: masterha_secondary_check -s 192.168.1.169 -s 192.168.1.170 --user=monitor --master_host=h168 --master_ip=192.168.1.168 --master_port=3306
Sat Nov 17 19:24:10 2018 - [info] Starting ping health check on h168(192.168.1.168:3306)..
Sat Nov 17 19:24:10 2018 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..

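This completes the MHA-based MySQL high-availability setup. Testing follows.

5. Test MHA

  1. Simulate a master failure
    Stop the 3306-M instance on h168 and check the virtual IPs: ens33:3306 disappears from h168 and shows up on h169.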
[webapp@h169 mysql]$ ifconfig
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.169  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::e6a:57e1:79b0:6099  prefixlen 64  scopeid 0x20<link>
        inet6 fe80::ca67:d327:9f88:5a61  prefixlen 64  scopeid 0x20<link>
        ether 00:0c:29:f0:82:85  txqueuelen 1000  (Ethernet)
        RX packets 2655317  bytes 3554516293 (3.3 GiB)
        RX errors 0  dropped 4  overruns 0  frame 0
        TX packets 188456  bytes 16039467 (15.2 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens33:3306: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.171  netmask 255.255.255.0  broadcast 192.168.1.255
        ether 00:0c:29:f0:82:85  txqueuelen 1000  (Ethernet)

ens33:3307: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.172  netmask 255.255.255.0  broadcast 192.168.1.255
        ether 00:0c:29:f0:82:85  txqueuelen 1000  (Ethernet)

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1  (Local Loopback)
        RX packets 171  bytes 29920 (29.2 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 171  bytes 29920 (29.2 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Watch the failover in the Manager log /app/mha/logs/manager-d1.log. The complete failover log is included below to show how MHA performs the switch.

Sat Nov 17 20:17:02 2018 - [warning] Got error on MySQL connect ping: DBI connect(';host=192.168.1.168;port=3306;mysql_connect_timeout=1','monitor',...) failed: Can't connect to MySQL server on '192.168.1.168' (111) at /usr/local/share/perl5/MHA/HealthCheck.pm line 97.
2003 (Can't connect to MySQL server on '192.168.1.168' (111))
Sat Nov 17 20:17:02 2018 - [info] Executing secondary network check script: masterha_secondary_check -s 192.168.1.169 -s 192.168.1.170 --user=monitor --master_host=h168 --master_ip=192.168.1.168 --master_port=3306  --user=webapp  --master_host=h168  --master_ip=192.168.1.168  --master_port=3306 --master_user=monitor --master_password=123456 --ping_type=CONNECT
Sat Nov 17 20:17:02 2018 - [info] Executing SSH check script: exit 0
Sat Nov 17 20:17:02 2018 - [info] HealthCheck: SSH to h168 is reachable.
Monitoring server 192.168.1.169 is reachable, Master is not reachable from 192.168.1.169. OK.
Monitoring server 192.168.1.170 is reachable, Master is not reachable from 192.168.1.170. OK.
Sat Nov 17 20:17:02 2018 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Sat Nov 17 20:17:05 2018 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.1.168' (111))
Sat Nov 17 20:17:05 2018 - [warning] Connection failed 2 time(s)..
Sat Nov 17 20:17:08 2018 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.1.168' (111))
Sat Nov 17 20:17:08 2018 - [warning] Connection failed 3 time(s)..
Sat Nov 17 20:17:11 2018 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '192.168.1.168' (111))
Sat Nov 17 20:17:11 2018 - [warning] Connection failed 4 time(s)..
Sat Nov 17 20:17:11 2018 - [warning] Master is not reachable from health checker!
Sat Nov 17 20:17:11 2018 - [warning] Master h168(192.168.1.168:3306) is not reachable!
Sat Nov 17 20:17:11 2018 - [warning] SSH is reachable.
Sat Nov 17 20:17:11 2018 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /app/mha/conf/d1-3306.cnf again, and trying to connect to all servers to check server status..
Sat Nov 17 20:17:11 2018 - [info] Reading default configuration from /etc/masterha_default.cnf..
Sat Nov 17 20:17:11 2018 - [info] Reading application default configuration from /app/mha/conf/d1-3306.cnf..
Sat Nov 17 20:17:11 2018 - [info] Reading server configuration from /app/mha/conf/d1-3306.cnf..
Sat Nov 17 20:17:12 2018 - [info] GTID failover mode = 1
Sat Nov 17 20:17:12 2018 - [info] Dead Servers:
Sat Nov 17 20:17:12 2018 - [info]   h168(192.168.1.168:3306)
Sat Nov 17 20:17:12 2018 - [info] Alive Servers:
Sat Nov 17 20:17:12 2018 - [info]   h169(192.168.1.169:3306)
Sat Nov 17 20:17:12 2018 - [info]   h170(192.168.1.170:3306)
Sat Nov 17 20:17:12 2018 - [info] Alive Slaves:
Sat Nov 17 20:17:12 2018 - [info]   h169(192.168.1.169:3306)  Version=5.7.24-log (oldest major version between slaves) log-bin:enabled
Sat Nov 17 20:17:12 2018 - [info]     GTID ON
Sat Nov 17 20:17:12 2018 - [info]     Replicating from 192.168.1.168(192.168.1.168:3306)
Sat Nov 17 20:17:12 2018 - [info]   h170(192.168.1.170:3306)  Version=5.7.24-log (oldest major version between slaves) log-bin:enabled
Sat Nov 17 20:17:12 2018 - [info]     GTID ON
Sat Nov 17 20:17:12 2018 - [info]     Replicating from 192.168.1.168(192.168.1.168:3306)
Sat Nov 17 20:17:12 2018 - [info] Checking slave configurations..
Sat Nov 17 20:17:12 2018 - [info]  read_only=1 is not set on slave h169(192.168.1.169:3306).
Sat Nov 17 20:17:12 2018 - [info]  read_only=1 is not set on slave h170(192.168.1.170:3306).
Sat Nov 17 20:17:12 2018 - [info] Checking replication filtering settings..
Sat Nov 17 20:17:12 2018 - [info]  Replication filtering check ok.
Sat Nov 17 20:17:12 2018 - [info] Master is down!
Sat Nov 17 20:17:12 2018 - [info] Terminating monitoring script.
Sat Nov 17 20:17:12 2018 - [info] Got exit code 20 (Master dead).
Sat Nov 17 20:17:12 2018 - [info] Reading default configuration from /etc/masterha_default.cnf..
Sat Nov 17 20:17:12 2018 - [info] Reading application default configuration from /app/mha/conf/d1-3306.cnf..
Sat Nov 17 20:17:12 2018 - [info] Reading server configuration from /app/mha/conf/d1-3306.cnf..
Sat Nov 17 20:17:12 2018 - [info] MHA::MasterFailover version 0.56.
Sat Nov 17 20:17:12 2018 - [info] Starting master failover.
Sat Nov 17 20:17:12 2018 - [info]
Sat Nov 17 20:17:12 2018 - [info] * Phase 1: Configuration Check Phase..
Sat Nov 17 20:17:12 2018 - [info]
Sat Nov 17 20:17:13 2018 - [info] GTID failover mode = 1
Sat Nov 17 20:17:13 2018 - [info] Dead Servers:
Sat Nov 17 20:17:13 2018 - [info]   h168(192.168.1.168:3306)
Sat Nov 17 20:17:13 2018 - [info] Checking master reachability via MySQL(double check)...
Sat Nov 17 20:17:13 2018 - [info]  ok.
Sat Nov 17 20:17:13 2018 - [info] Alive Servers:
Sat Nov 17 20:17:13 2018 - [info]   h169(192.168.1.169:3306)
Sat Nov 17 20:17:13 2018 - [info]   h170(192.168.1.170:3306)
Sat Nov 17 20:17:13 2018 - [info] Alive Slaves:
Sat Nov 17 20:17:13 2018 - [info]   h169(192.168.1.169:3306)  Version=5.7.24-log (oldest major version between slaves) log-bin:enabled
Sat Nov 17 20:17:13 2018 - [info]     GTID ON
Sat Nov 17 20:17:13 2018 - [info]     Replicating from 192.168.1.168(192.168.1.168:3306)
Sat Nov 17 20:17:13 2018 - [info]   h170(192.168.1.170:3306)  Version=5.7.24-log (oldest major version between slaves) log-bin:enabled
Sat Nov 17 20:17:13 2018 - [info]     GTID ON
Sat Nov 17 20:17:13 2018 - [info]     Replicating from 192.168.1.168(192.168.1.168:3306)
Sat Nov 17 20:17:13 2018 - [info] Starting GTID based failover.
Sat Nov 17 20:17:13 2018 - [info]
Sat Nov 17 20:17:13 2018 - [info] ** Phase 1: Configuration Check Phase completed.
Sat Nov 17 20:17:13 2018 - [info]
Sat Nov 17 20:17:13 2018 - [info] * Phase 2: Dead Master Shutdown Phase..
Sat Nov 17 20:17:13 2018 - [info]
Sat Nov 17 20:17:13 2018 - [info] Forcing shutdown so that applications never connect to the current master..
Sat Nov 17 20:17:13 2018 - [info] Executing master IP deactivation script:
Sat Nov 17 20:17:13 2018 - [info]   /app/mha/scripts/master_ip_failover_d1 --orig_master_host=h168 --orig_master_ip=192.168.1.168 --orig_master_port=3306 --command=stopssh --ssh_user=webapp
Sat Nov 17 20:17:13 2018 - [info]  done.
Sat Nov 17 20:17:13 2018 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Sat Nov 17 20:17:13 2018 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Sat Nov 17 20:17:13 2018 - [info]
Sat Nov 17 20:17:13 2018 - [info] * Phase 3: Master Recovery Phase..
Sat Nov 17 20:17:13 2018 - [info]
Sat Nov 17 20:17:13 2018 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Sat Nov 17 20:17:13 2018 - [info]
Sat Nov 17 20:17:13 2018 - [info] The latest binary log file/position on all slaves is mysql-bin.000002:194
Sat Nov 17 20:17:13 2018 - [info] Retrieved Gtid Set: b985084d-e9b1-11e8-b0e3-000c299e0f9f:1-10
Sat Nov 17 20:17:13 2018 - [info] Latest slaves (Slaves that received relay log files to the latest):
Sat Nov 17 20:17:13 2018 - [info]   h169(192.168.1.169:3306)  Version=5.7.24-log (oldest major version between slaves) log-bin:enabled
Sat Nov 17 20:17:13 2018 - [info]     GTID ON
Sat Nov 17 20:17:13 2018 - [info]     Replicating from 192.168.1.168(192.168.1.168:3306)
Sat Nov 17 20:17:13 2018 - [info]   h170(192.168.1.170:3306)  Version=5.7.24-log (oldest major version between slaves) log-bin:enabled
Sat Nov 17 20:17:13 2018 - [info]     GTID ON
Sat Nov 17 20:17:13 2018 - [info]     Replicating from 192.168.1.168(192.168.1.168:3306)
Sat Nov 17 20:17:13 2018 - [info] The oldest binary log file/position on all slaves is mysql-bin.000002:194
Sat Nov 17 20:17:13 2018 - [info] Retrieved Gtid Set: b985084d-e9b1-11e8-b0e3-000c299e0f9f:1-10
Sat Nov 17 20:17:13 2018 - [info] Oldest slaves:
Sat Nov 17 20:17:13 2018 - [info]   h169(192.168.1.169:3306)  Version=5.7.24-log (oldest major version between slaves) log-bin:enabled
Sat Nov 17 20:17:13 2018 - [info]     GTID ON
Sat Nov 17 20:17:13 2018 - [info]     Replicating from 192.168.1.168(192.168.1.168:3306)
Sat Nov 17 20:17:13 2018 - [info]   h170(192.168.1.170:3306)  Version=5.7.24-log (oldest major version between slaves) log-bin:enabled
Sat Nov 17 20:17:13 2018 - [info]     GTID ON
Sat Nov 17 20:17:13 2018 - [info]     Replicating from 192.168.1.168(192.168.1.168:3306)
Sat Nov 17 20:17:13 2018 - [info]
Sat Nov 17 20:17:13 2018 - [info] * Phase 3.3: Determining New Master Phase..
Sat Nov 17 20:17:13 2018 - [info]
Sat Nov 17 20:17:13 2018 - [info] Searching new master from slaves..
Sat Nov 17 20:17:13 2018 - [info]  Candidate masters from the configuration file:
Sat Nov 17 20:17:13 2018 - [info]  Non-candidate masters:
Sat Nov 17 20:17:13 2018 - [info] New master is h169(192.168.1.169:3306)
Sat Nov 17 20:17:13 2018 - [info] Starting master failover..
Sat Nov 17 20:17:13 2018 - [info]
From:
h168(192.168.1.168:3306) (current master)
 +--h169(192.168.1.169:3306)
 +--h170(192.168.1.170:3306)

To:
h169(192.168.1.169:3306) (new master)
 +--h170(192.168.1.170:3306)
Sat Nov 17 20:17:13 2018 - [info]
Sat Nov 17 20:17:13 2018 - [info] * Phase 3.3: New Master Recovery Phase..
Sat Nov 17 20:17:13 2018 - [info]
Sat Nov 17 20:17:13 2018 - [info]  Waiting all logs to be applied..
Sat Nov 17 20:17:13 2018 - [info]   done.
Sat Nov 17 20:17:13 2018 - [info] Getting new master's binlog name and position..
Sat Nov 17 20:17:13 2018 - [info]  mysql-bin.000001:3556
Sat Nov 17 20:17:13 2018 - [info]  All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='h169 or 192.168.1.169', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', MASTER_PASSWORD='xxx';
Sat Nov 17 20:17:13 2018 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000001, 3556, 7ecf9cc8-ea14-11e8-b778-000c29f08285:1-6,
b985084d-e9b1-11e8-b0e3-000c299e0f9f:1-10
Sat Nov 17 20:17:13 2018 - [info] Executing master IP activate script:
Sat Nov 17 20:17:13 2018 - [info]   /app/mha/scripts/master_ip_failover_d1 --command=start --ssh_user=webapp --orig_master_host=h168 --orig_master_ip=192.168.1.168 --orig_master_port=3306 --new_master_host=h169 --new_master_ip=192.168.1.169 --new_master_port=3306 --new_master_user='monitor' --new_master_password='123456'
Set read_only=0 on the new master.
Sat Nov 17 20:17:15 2018 - [info]  OK.
Sat Nov 17 20:17:15 2018 - [info] ** Finished master recovery successfully.
Sat Nov 17 20:17:15 2018 - [info] * Phase 3: Master Recovery Phase completed.
Sat Nov 17 20:17:15 2018 - [info]
Sat Nov 17 20:17:15 2018 - [info] * Phase 4: Slaves Recovery Phase..
Sat Nov 17 20:17:15 2018 - [info]
Sat Nov 17 20:17:15 2018 - [info]
Sat Nov 17 20:17:15 2018 - [info] * Phase 4.1: Starting Slaves in parallel..
Sat Nov 17 20:17:15 2018 - [info]
Sat Nov 17 20:17:15 2018 - [info] -- Slave recovery on host h170(192.168.1.170:3306) started, pid: 8024. Check tmp log /app/mha/workdir/h170_3306_20181117201712.log if it takes time..
Sat Nov 17 20:17:26 2018 - [info]
Sat Nov 17 20:17:26 2018 - [info] Log messages from h170 ...
Sat Nov 17 20:17:26 2018 - [info]
Sat Nov 17 20:17:15 2018 - [info]  Resetting slave h170(192.168.1.170:3306) and starting replication from the new master h169(192.168.1.169:3306)..
Sat Nov 17 20:17:15 2018 - [info]  Executed CHANGE MASTER.
Sat Nov 17 20:17:15 2018 - [info]  Slave started.
Sat Nov 17 20:17:25 2018 - [info]  gtid_wait(7ecf9cc8-ea14-11e8-b778-000c29f08285:1-6,
b985084d-e9b1-11e8-b0e3-000c299e0f9f:1-10) completed on h170(192.168.1.170:3306). Executed 7 events.
Sat Nov 17 20:17:26 2018 - [info] End of log messages from h170.
Sat Nov 17 20:17:26 2018 - [info] -- Slave on host h170(192.168.1.170:3306) started.
Sat Nov 17 20:17:26 2018 - [info] All new slave servers recovered successfully.
Sat Nov 17 20:17:26 2018 - [info]
Sat Nov 17 20:17:26 2018 - [info] * Phase 5: New master cleanup phase..
Sat Nov 17 20:17:26 2018 - [info]
Sat Nov 17 20:17:26 2018 - [info] Resetting slave info on the new master..
Sat Nov 17 20:17:26 2018 - [info]  h169: Resetting slave info succeeded.
Sat Nov 17 20:17:26 2018 - [info] Master failover to h169(192.168.1.169:3306) completed successfully.
Sat Nov 17 20:17:26 2018 - [info]

----- Failover Report -----

d1-3306: MySQL Master failover h168(192.168.1.168:3306) to h169(192.168.1.169:3306) succeeded

Master h168(192.168.1.168:3306) is down!

Check MHA Manager logs at h168 for details.

Started automated(non-interactive) failover.
Invalidated master IP address on h168(192.168.1.168:3306)
Selected h169(192.168.1.169:3306) as a new master.
h169(192.168.1.169:3306): OK: Applying all logs succeeded.
h169(192.168.1.169:3306): OK: Activated master IP address.
h170(192.168.1.170:3306): OK: Slave started, replicating from h169(192.168.1.169:3306)
h169(192.168.1.169:3306): Resetting slave info succeeded.
Master failover to h169(192.168.1.169:3306) completed successfully.
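
To see how long the failover took, the timestamps in the log can be compared. This is a rough sketch, assuming the timestamp layout shown above and taking the first "Got error on MySQL connect" line as the start and the timestamped "completed successfully" line as the end; the helper names are illustrative.

```python
from datetime import datetime

TS_FORMAT = "%a %b %d %H:%M:%S %Y"  # e.g. "Sat Nov 17 20:17:02 2018"


def parse_ts(line):
    """Parse the timestamp that precedes ' - [level]' on a manager log line."""
    stamp = line.split(" - [", 1)[0]
    return datetime.strptime(stamp, TS_FORMAT)


def failover_seconds(log_lines):
    """Seconds from the first failed ping to the end of the failover, or None."""
    start = end = None
    for line in log_lines:
        if " - [" not in line:
            continue  # report lines and topology diagrams carry no timestamp
        if start is None and "Got error on MySQL connect" in line:
            start = parse_ts(line)
        if "completed successfully" in line:
            end = parse_ts(line)
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


# Three lines taken from the log above (the first one shortened).
SAMPLE_LOG = [
    "Sat Nov 17 20:17:02 2018 - [warning] Got error on MySQL connect ping: ...",
    "Sat Nov 17 20:17:13 2018 - [info] New master is h169(192.168.1.169:3306)",
    "Sat Nov 17 20:17:26 2018 - [info] Master failover to h169(192.168.1.169:3306) completed successfully.",
]

print(failover_seconds(SAMPLE_LOG))  # 24.0
```

For this run the result is 24 seconds, which is inside the 30-second figure quoted in the overview.
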
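  2. Recover the failed database and rejoin it to the cluster
    Once the 3306-M instance on h168 has been repaired and started again, running masterha_check_repl reports the following errors: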
Mon Nov 19 14:46:00 2018 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln653] There are 2 non-slave servers! MHA manages at most one non-slave server. Check configurations.
Mon Nov 19 14:46:00 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations.  at /usr/local/share/perl5/MHA/MasterMonitor.pm line 326.
Mon Nov 19 14:46:00 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.

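The restarted instance is not replicating from anything, so MHA sees two non-slave servers. Run CHANGE MASTER on the recovered instance so that it becomes a slave of the new Master: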

mysql> CHANGE MASTER TO MASTER_HOST='192.168.1.169',MASTER_PORT=3306,MASTER_USER='repl',MASTER_PASSWORD='123456',MASTER_AUTO_POSITION=1;
Query OK, 0 rows affected, 2 warnings (0.01 sec)

mysql> start slave;
Query OK, 0 rows affected (0.01 sec)

Check the topology again:

[webapp@h168 mysql]$ masterha_check_repl --conf=/app/mha/conf/d1-3306.cnf

Once the new topology is reported and verified as healthy, the recovery is complete.
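
During the failover the manager log also records the statement the other slaves should use ("All other slaves should start replication from here. Statement should be: ..."). The sketch below pulls that statement out of the log text and fills in the two placeholders MHA leaves in it. It assumes the log line format shown earlier; the function name is illustrative, and the password is inserted verbatim without any SQL escaping.

```python
import re


def change_master_from_log(log_text, host, password):
    """Return the CHANGE MASTER statement suggested in the manager log, or None."""
    found = re.search(r"Statement should be: (CHANGE MASTER TO .*?;)", log_text)
    if found is None:
        return None
    stmt = found.group(1)
    # MHA prints MASTER_HOST='h169 or 192.168.1.169' and masks the password as 'xxx'.
    stmt = re.sub(r"MASTER_HOST='[^']*'", lambda _: "MASTER_HOST='%s'" % host, stmt)
    stmt = re.sub(r"MASTER_PASSWORD='[^']*'",
                  lambda _: "MASTER_PASSWORD='%s'" % password, stmt)
    return stmt


# The relevant line from the failover log above.
LOG_LINE = (
    "Sat Nov 17 20:17:13 2018 - [info]  All other slaves should start replication "
    "from here. Statement should be: CHANGE MASTER TO MASTER_HOST='h169 or "
    "192.168.1.169', MASTER_PORT=3306, MASTER_AUTO_POSITION=1, MASTER_USER='repl', "
    "MASTER_PASSWORD='xxx';"
)

print(change_master_from_log(LOG_LINE, "192.168.1.169", "123456"))
```

The printed statement carries the same host, port, user and auto-position settings as the one typed by hand above.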

  3. Promote the recovered old Master back to Master

This step uses the file d1-3306_switch-back.cnf. It is similar to d1-3306.cnf, except that the line candidate_master=1 is added under the old Master (h168). The configuration file is as follows:

[server default]
client_bindir=/app/mysql/mysql-5.7.24/bin/
client_libdir=/app/mysql/mysql-5.7.24/lib/

manager_workdir=/app/mha/workdir
remote_workdir=/app/mha/workdir

master_ip_failover_script=/app/mha/scripts/master_ip_failover_d1
master_ip_online_change_script=/app/mha/scripts/master_ip_online_change_d1

secondary_check_script=masterha_secondary_check -s 192.168.1.169 -s 192.168.1.170 --user=monitor --master_host=h168 --master_ip=192.168.1.168 --master_port=3306


[server2]
hostname=h169
ip=192.168.1.169
master_binlog_dir=/app/mysql/3306-S1/logs/bin-log
port=3306
ssh_port=22


[server1]
hostname=h168
ip=192.168.1.168
ssh_port=22
master_binlog_dir=/app/mysql/3306-M/logs/bin-log
port=3306
candidate_master=1


[server3]
hostname=h170
ip=192.168.1.170
master_binlog_dir=/app/mysql/3306-S2/logs/bin-log
port=3306
ssh_port=22
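
Before switching, it can help to confirm which host the file actually marks as a candidate. The sketch below reads the file with Python's configparser, on the assumption that it is plain INI as shown above; the function name and the shortened sample text are illustrative.

```python
import configparser


def candidate_masters(cnf_text):
    """Return the hostnames of all [serverN] sections with candidate_master=1."""
    # interpolation=None keeps any '%' in values from being treated specially.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cnf_text)
    hosts = []
    for section in parser.sections():
        if section == "server default":
            continue  # global settings, not a server entry
        if parser.get(section, "candidate_master", fallback="0") == "1":
            hosts.append(parser.get(section, "hostname"))
    return hosts


# Shortened copy of d1-3306_switch-back.cnf (assumption: same key names).
SAMPLE_CNF = """\
[server default]
manager_workdir=/app/mha/workdir

[server2]
hostname=h169
port=3306

[server1]
hostname=h168
port=3306
candidate_master=1

[server3]
hostname=h170
port=3306
"""

print(candidate_masters(SAMPLE_CNF))  # ['h168']
```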

Check the cluster topology with the new configuration file by running masterha_check_repl --conf=/app/mha/conf/d1-3306_switch-back.cnf. If it reports no problems, start the switchover:

(1) First, on the new Master (h169), run:

mysql> FLUSH NO_WRITE_TO_BINLOG TABLES;
Query OK, 0 rows affected (2.43 sec)

(2) On h168, run the following command to make the old Master the Master again:

masterha_master_switch --conf=/app/mha/conf/d1-3306_switch-back.cnf --master_state=alive --orig_master_is_new_slave

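The command asks for confirmation along the way, and the switch takes a little while. If no errors are reported and the follow-up checks pass, the switch back has succeeded.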

6. Common Problems and Fixes

  1. Master and slave are inconsistent; show slave status reports the following:
Last_Errno: 1008
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'b985084d-e9b1-11e8-b0e3-000c299e0f9f:12' at master log mysql-bin.000003, end_log_pos 513. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 358
              Relay_Log_Space: 1596
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1008
               Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'b985084d-e9b1-11e8-b0e3-000c299e0f9f:12' at master log mysql-bin.000003, end_log_pos 513. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 33060

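As a stopgap you can skip the failing transaction, but by then the master and slave data already differ, so the slave has to be re-pointed at the Master and resynchronized. Note that reset master also deletes the slave's own binary logs and empties its gtid_executed set: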

mysql> stop slave;
Query OK, 0 rows affected (0.00 sec)

mysql> reset slave;
Query OK, 0 rows affected (0.00 sec)

mysql> reset master;
Query OK, 0 rows affected (0.00 sec)

mysql> CHANGE MASTER TO MASTER_HOST='192.168.1.168',MASTER_PORT=3306,MASTER_USER='repl',MASTER_PASSWORD='123456',MASTER_AUTO_POSITION=1;
Query OK, 0 rows affected (0.00 sec)

mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

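mysql> show slave status\G

To skip only the failed transaction instead (b985084d-e9b1-11e8-b0e3-000c299e0f9f:12 in the error above), commit an empty transaction under its GTID:

mysql> stop slave;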
Query OK, 0 rows affected (0.00 sec)

mysql> SET @@SESSION.GTID_NEXT = 'b985084d-e9b1-11e8-b0e3-000c299e0f9f:12';
Query OK, 0 rows affected (0.00 sec)

mysql> BEGIN; COMMIT;
Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.00 sec)

mysql> SET @@SESSION.GTID_NEXT = AUTOMATIC;
Query OK, 0 rows affected (0.00 sec)

mysql> start slave;
Query OK, 0 rows affected (0.01 sec)
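
To find out which transactions a slave is missing, the Executed_Gtid_Set values of master and slave can be compared. MySQL can do this on the server with GTID_SUBTRACT(); the following is only an offline sketch of the same idea, with hypothetical example values and illustrative function names. It expands every interval into individual numbers, so it suits small sets like the ones in this demo and not production-sized ranges.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-10,uuid2:1-6:9' into {uuid: set of transaction numbers}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        ids = result.setdefault(uuid.lower(), set())
        for interval in intervals:
            low, _, high = interval.partition("-")
            ids.update(range(int(low), int(high or low) + 1))
    return result


def missing_on_slave(master_set, slave_set):
    """Return {uuid: sorted numbers} executed on the master but not on the slave."""
    master = parse_gtid_set(master_set)
    slave = parse_gtid_set(slave_set)
    missing = {}
    for uuid, ids in master.items():
        gap = ids - slave.get(uuid, set())
        if gap:
            missing[uuid] = sorted(gap)
    return missing


# Hypothetical values: the slave stopped just before transaction 12.
MASTER = "b985084d-e9b1-11e8-b0e3-000c299e0f9f:1-12"
SLAVE = "b985084d-e9b1-11e8-b0e3-000c299e0f9f:1-11"

print(missing_on_slave(MASTER, SLAVE))
```

With these inputs the only gap is transaction 12, the one named in the error message.
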
  2. The virtual IP cannot be found
Mon Nov 19 09:44:01 2018 - [info] Checking replication health on h169..
Mon Nov 19 09:44:01 2018 - [info]  ok.
Mon Nov 19 09:44:01 2018 - [info] Checking replication health on h170..
Mon Nov 19 09:44:01 2018 - [info]  ok.
Mon Nov 19 09:44:01 2018 - [info] Checking master_ip_failover_script status:
Mon Nov 19 09:44:01 2018 - [info]   /app/mha/scripts/master_ip_failover_d1 --command=status --ssh_user=webapp --orig_master_host=h168 --orig_master_ip=192.168.1.168 --orig_master_port=3306
CRITICAL: VIP 192.168.1.171 not found on Master!
Mon Nov 19 09:44:01 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln226]  Failed to get master_ip_failover_script status with return code 1:0.
Mon Nov 19 09:44:01 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln424] Error happened on checking configurations.  at /usr/local/bin/masterha_check_repl line 48.
Mon Nov 19 09:44:01 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln523] Error happened on monitoring servers.
Mon Nov 19 09:44:01 2018 - [info] Got exit code 1 (Not master dead).

MySQL Replication Health is NOT OK!

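Because this demo runs on virtual machines, the virtual IP disappears every time a VM is resumed from suspend and has to be bound again by hand.

Fix: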

[webapp@h168 mysql]$ ifconfig ens33:3306 192.168.1.171/24
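
When a host holds several masters, every VIP has to be rebound. The sketch below only prints the commands and does not run them. It assumes the port-to-VIP mapping from Table 2 and the ens33:<port> alias naming seen in the ifconfig output earlier; the function name is illustrative.

```python
# Port -> VIP mapping taken from Table 2.
VIP_BY_PORT = {
    3306: "192.168.1.171",
    3307: "192.168.1.172",
    3308: "192.168.1.173",
}


def rebind_commands(master_ports, nic="ens33", prefix_len=24):
    """Build one ifconfig command per master port hosted on this machine."""
    return [
        "ifconfig %s:%d %s/%d" % (nic, port, VIP_BY_PORT[port], prefix_len)
        for port in master_ports
    ]


# After the failover test, h169 is master for both the 3306 and 3307 groups.
for command in rebind_commands([3306, 3307]):
    print(command)
```

For a single port 3306 this produces the same command as the manual fix above.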