MySQL修改binlog位置造成从库复制故障处理

1. 故障背景

22日晚上8点25分,某套MySQL系统的备库复制进程宕了。登录到服务器检查,发现复制进程出现问题。

2. 故障诊断

检查备库的Slave信息

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: xxx.xxx.xx.87
                  Master_User: repl_user
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-master-bin.000031
          Read_Master_Log_Pos: 51480
               Relay_Log_File: xxxxxxxxx01-relay-bin.000060
                Relay_Log_Pos: 51707
        Relay_Master_Log_File: mysql-master-bin.000031
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 51480
              Relay_Log_Space: 51974
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find first log file name in binary log index file'
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 2
                  Master_UUID: 23d9408c-f66b-11e5-8d6a-001a4aef334a
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 170222 20:25:27
     Last_SQL_Error_Timestamp: 
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 
                Auto_Position: 0
         Replicate_Rewrite_DB: 
                 Channel_Name: 
           Master_TLS_Version: 
 1 row in set (0.00 sec)

复制进程的Slave_IO_Running线程已停,这个线程是从主库获取binlog日志,从Last_IO_Error中也明确显示无法获取日志信息了。
检查主库:

mysql> show master status;
+-------------------------+----------+--------------+------------------+-------------------+
| File                    | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+-------------------------+----------+--------------+------------------+-------------------+
| mysql-master-bin.000002 |    21783 |              |                  |                   |
+-------------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)

发现主库的日志序号变成:mysql-master-bin.000002,猜测可能有人在重做复制。于是到维护群里看了下,发现维护人员在修改binlog日志的目录。
在主库上将参数log-bin,从

log-bin=/var/lib/mysql/mysql-master-bin.log

修改为

log-bin=/data/mysqldata/mysql-master-bin.log

并重启主库。由于没有修改复制和修改mysql-master-bin.index,导致修改binlog的位置后,binlog的序号重新生成了。

3. 解决故障

重做备库当然可以解决这个问题,不过可以先看看这个问题的严重程度,看看有没有更简单的处理办法

mysql> show slave status\G
*************************** 1. row ***************************
                  Master_Host: xxx.xxx.xx.87
                  Master_User: repl_user
              Master_Log_File: mysql-master-bin.000031
          Read_Master_Log_Pos: 51480
...
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
...
          Exec_Master_Log_Pos: 51480
...

首先确认下备库读到主库以前binlog的什么位置,当前读取并应用的位置是 mysql-master-bin.000031 : 51480
在主库上检查一下这个binlog的大小

[root@xxxxxxxxx02 mysqldata]# ls -l /var/lib/mysql
总用量 1902604
-rw-rw---- 1 mysql mysql  12582912 3月  30 2016 ibdata1
-rw-rw---- 1 mysql mysql  50331648 3月  30 2016 ib_logfile0
-rw-rw---- 1 mysql mysql  50331648 3月  30 2016 ib_logfile1
drwx--x--x 2 mysql mysql      4096 3月  30 2016 mysql
-rw-r----- 1 mysql mysql 846297237 2月  14 09:41 mysql-master-bin.000027
-rw-r----- 1 mysql mysql   6878760 2月  14 10:39 mysql-master-bin.000028
-rw-r----- 1 mysql mysql 730089162 2月  22 20:15 mysql-master-bin.000029
-rw-r----- 1 mysql mysql     37214 2月  22 20:17 mysql-master-bin.000030
-rw-r----- 1 mysql mysql     51503 2月  22 20:24 mysql-master-bin.000031
-rw-r----- 1 mysql mysql       195 2月  22 20:17 mysql-master-bin.index

实际在主库上,mysql-master-bin.000031日志的大小是51503,如果把备库的复制进程读取的位置直接修改,那会丢失23个字节的数据。
业务在修改binlog的位置前,肯定已经把业务全停了,所以猜测这23个字节的数据可能是重启造成的。
MySQL的复制,其实是逻辑复制,所以丢失23个字节,应该影响不大。最差的情况,如果修改备库的复制过程有问题,再重做备库也不迟。

检查当前binlog的index文件中的内容

# cat /data/mysqldata/mysql-master-bin.index
/data/mysqldata/mysql-master-bin.000001
/data/mysqldata/mysql-master-bin.000002
/data/mysqldata/mysql-master-bin.000003

修改备库复制进程如下:

mysql> change master to 
master_log_file='mysql-master-bin.000001', 
master_log_pos=0;

然后重启复制进程,问题解决,在备库上也没有碰到逻辑错误。

4. 正确修改binlog位置的操作

4.1 主库设置为readonly

关闭master的数据写入,等待从库与主库同步完成

root@localhost[test]> SET @@global.read_only = ON;
Query OK, 0 rows affected (0.01 sec)

4.2 检查数据全部同步到从库

检查主库

root@localhost[(none)]> show master status;
+-------------------+----------+--------------+------------------+----------------------------------------+
| File              | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                      |
+-------------------+----------+--------------+------------------+----------------------------------------+
| master-bin.000009 |      456 |              |                  | 05d54147-c68e-11e6-9c5a-001a4a1ee9ff:1 |
+-------------------+----------+--------------+------------------+----------------------------------------+
1 row in set (0.00 sec)

检查备库

root@localhost[(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 132.228.187.45
                  Master_User: repl_user
              Master_Log_File: master-bin.000009
          Read_Master_Log_Pos: 456
......
        Relay_Master_Log_File: master-bin.000009
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
......
          Exec_Master_Log_Pos: 456
......

4.3 停止从库的slave进程

root@localhost[(none)]> stop slave;
Query OK, 0 rows affected (0.02 sec)

4.4 停止主库

# mysqladmin -u root -p shutdown

4.5 修改主库参数

# vi /etc/my.cnf
log_bin                               = /data/logbin/master-bin
log-bin-index                         = /data/logbin/master-bin.index

4.6 复制原来的binlog和binlog.index文件到新的位置

$ mv master-bin.* /data/logbin/ 

修改master-bin.index文件中,确认所有binlog的路径正确

4.7 启动主库并恢复写入

# service mysql.server start

关闭只读模式

root@localhost[(none)]> SET @@global.read_only = OFF;
Query OK, 0 rows affected (0.00 sec)

4.8 备库开启slave

root@localhost[(none)]> start slave;
Query OK, 0 rows affected (0.01 sec)

4.9 测试可以继续同步

主库测试:

root@localhost[(none)]> use test
Database changed
root@localhost[test]> insert into books values(6,'magic');
Query OK, 1 row affected (0.04 sec)

root@localhost[test]> show master status;
+-------------------+----------+--------------+------------------+----------------------------------------+
| File              | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                      |
+-------------------+----------+--------------+------------------+----------------------------------------+
| master-bin.000010 |      459 |              |                  | 05d54147-c68e-11e6-9c5a-001a4a1ee9ff:1 |
+-------------------+----------+--------------+------------------+----------------------------------------+
 1 row in set (0.00 sec)

检查备库:

root@localhost[(none)]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 132.228.187.45
                  Master_User: repl_user
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: master-bin.000010
          Read_Master_Log_Pos: 459
......
        Relay_Master_Log_File: master-bin.000010
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
......
          Exec_Master_Log_Pos: 459

root@localhost[(none)]> use test;
Database changed
root@localhost[test]> select * from books;
+------+-------+
| id   | name  |
+------+-------+
|    1 | a     |
|    5 | it    |
|    6 | magic |
+------+-------+
 3 rows in set (0.00 sec)

查看主库的binlog.index文件

$ cat master-bin.index
./master-bin.000003
./master-bin.000004
./master-bin.000005
./master-bin.000006
./master-bin.000007
./master-bin.000008
./master-bin.000009
/data/logbin/master-bin.000010

关于紫砂壶

感悟技术人生
此条目发表在MySQL故障诊断分类目录,贴了标签。将固定链接加入收藏夹。