A MySQL Data Loss Incident Caused by an Abnormal keepalived VIP Failover

1. Background

The MySQL maintenance team reported that in the early hours of June 24 they found data loss on one platform's MySQL database. The instance itself was healthy, but the data was as of June 12. A backup taken around 23:00 on June 23 (made with mysqldump) was used to restore, and the data came back. We now need to analyze why the data was lost. Note that the whole handling process was carried out by the application project team, not the MySQL maintenance team, so some of the operations are not fully known to the latter.

MySQL dual-master replication IPs:
xxx.xxx.248.43
xxx.xxx.248.44

keepalived floating IP (VIP): XXX.XXX.248.244

2. Checking the log-error file on 43

2016-06-12 14:43:14 30206 [ERROR] Slave SQL: Slave SQL thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable. Error_code: 1205
2016-06-12 14:43:14 30206 [Warning] Slave: Lock wait timeout exceeded; try restarting transaction Error_code: 1205
2016-06-12 14:43:14 30206 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000008' position 44758969
2016-06-24 12:03:26 30206 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
2016-06-24 12:03:26 30206 [Note] Slave I/O thread killed while reading event
2016-06-24 12:03:26 30206 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.000008', position 559704364
2016-06-24 14:36:10 30206 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='xxx.xxx.248.44', master_port= 3306, master_log_file='mysql-bin.000008', master_log_pos= 559704364, master_bind=''. New state master_host='xxx.xxx.248.44', master_port= 3306, master_log_file='mysql-bin.000009', master_log_pos= 838784666, master_bind=''.

From June 12 to June 24 there is almost nothing in this log. The MySQL maintenance team confirmed that this database takes very little write traffic and is used mainly for queries, so its log-error file is normally quiet. However, at 14:43 on June 12, the slave SQL thread on 43 aborted. From that point on, 43 no longer applied the binlogs from 44.

3. Checking the log-error file on 44

On 44, there is nothing significant in the log from June 23 to June 25, which shows that at the failure time reported by the application (the early hours of June 24), the database on 44 itself was fine.

4. Checking the OS log on 44

Jun 23 20:51:32 xxxxx034 Keepalived_vrrp[4854]: VRRP_Script(check_run) timed out
Jun 23 20:51:34 xxxxx034 Keepalived_vrrp[4854]: VRRP_Instance(mysql_inst) Entering FAULT STATE
Jun 23 20:51:34 xxxxx034 Keepalived_vrrp[4854]: VRRP_Instance(mysql_inst) removing protocol VIPs.
Jun 23 20:51:35 xxxxx034 Keepalived_healthcheckers[4853]: Netlink reflector reports IP xxx.xxx.248.244 removed
Jun 23 20:51:35 xxxxx034 Keepalived_vrrp[4854]: VRRP_Instance(mysql_inst) Now in FAULT state
Jun 23 20:51:35 xxxxx034 Keepalived_vrrp[4854]: VRRP_Group(VG1) Syncing instances to FAULT state
Jun 23 20:51:35 xxxxx034 Keepalived_vrrp[4854]: VRRP_Script(check_run) succeeded
Jun 23 20:51:35 xxxxx034 avahi-daemon[3470]: Withdrawing address record for xxx.xxx.248.244 on bond0.
Jun 23 20:51:35 xxxxx034 Keepalived_vrrp[4854]: Kernel is reporting: Group(VG1) UP
Jun 23 20:51:35 xxxxx034 Keepalived_vrrp[4854]: VRRP_Group(VG1) Leaving FAULT state
Jun 23 20:51:35 xxxxx034 Keepalived_vrrp[4854]: VRRP_Instance(mysql_inst) Entering BACKUP STATE
Jun 23 20:51:35 xxxxx034 Keepalived_vrrp[4854]: VRRP_Group(VG1) Syncing instances to BACKUP state

The architecture turns out to include keepalived, an HA tool that manages a floating IP (VIP) and keeps it online on one node of the cluster.
The OS log on 44 shows that at around 20:51 on the evening of June 23, keepalived's check_run script timed out, so the MySQL health check failed; keepalived judged the MySQL instance to be in a fault state and moved the VIP xxx.xxx.248.244 from 44 over to 43.
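Here a single timed-out check was enough to put the instance into FAULT and move the VIP. One mitigation is to require several consecutive failures before failing over. The sketch below extends the vrrp_script stanza shown in section 7; fall and rise are standard vrrp_script options, while the explicit timeout option is only available in newer keepalived releases:

```
vrrp_script check_run {
    script "/etc/keepalived/check_mysqld.sh"
    interval 2
    timeout 10     # fail the check only after 10s (newer keepalived only)
    fall 3         # require 3 consecutive failures before entering FAULT
    rise 2         # require 2 consecutive successes before recovering
}
```

Note that this only reduces spurious failovers. It does not address the other half of this incident: check_mysqld.sh only verifies that the local mysqld answers and says nothing about replication health, so a node whose replication broke days earlier (as on 43) can still receive the VIP.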

5. Checking the OS log on 43

Jun 23 20:51:36 xxxxx033 Keepalived_vrrp[1120]: VRRP_Instance(mysql_inst) Sending gratuitous ARPs on bond0 for xxx.xxx.248.244
Jun 23 20:51:36 xxxxx033 Keepalived_healthcheckers[1119]: Netlink reflector reports IP xxx.xxx.248.244 added
Jun 23 20:51:36 xxxxx033 avahi-daemon[3607]: Registering new address record for xxx.xxx.248.244 on bond0.IPv4.
Jun 23 20:51:38 xxxxx033 ntpd[24044]: Listening on interface #11 bond0, xxx.xxx.248.244#123 Enabled

On host 43, the OS log confirms that the VIP xxx.xxx.248.244 arrived at around 20:51 on the evening of June 23.

6. Analysis of the sequence of events

On June 12, the slave SQL thread of the MySQL instance on host 43 aborted, so 43 stopped applying the binlogs shipped from host 44. Because the VIP xxx.xxx.248.244 stayed on 44 until June 23, the application kept connecting to 44 and noticed nothing.
On June 23, the check_run health check on 44 timed out; keepalived concluded that the MySQL instance was down and moved the VIP from 44 to 43. The application then connected to 43, where replication had been broken since the 12th, so all data written after June 12 appeared to be lost, and the application team restored from the backup.

7. Analyzing the check_run timeout

The keepalived configuration file on host 44:

# pwd
/etc/keepalived
# cat keepalived.conf
! Configuration File for keepalived

global_defs {
   router_id LVS_DEVEL
}

vrrp_script check_run {
    script "/etc/keepalived/check_mysqld.sh"
    interval 2
}

vrrp_sync_group VG1 {
    group {
        mysql_inst
    }
}

vrrp_instance mysql_inst {
    state MASTER
    interface bond0
    virtual_router_id 52
    priority 100
    advert_int 1
    nopreempt

    track_script {
        check_run
    }

    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        xxx.xxx.248.244
    }
}

check_run is simply a shell script, /etc/keepalived/check_mysqld.sh:

# cat /etc/keepalived/check_mysqld.sh
#!/bin/bash  
  
MYSQL="/usr/bin/mysql -S /var/lib/mysql/mysql.sock"
MYSQL_USER=root
MYSQL_PORT=3306
MYSQL_PASSWORD=xxxxxx
LOG_FILE=/var/log/keepalived/check_mysql.log
CHECK_TIME=3
#mysql is working MYSQL_OK is 1 , mysql down MYSQL_OK is 0  
MYSQL_OK=1

if [ ! -f "$LOG_FILE" ]; then
    mkdir -p /var/log/keepalived
    echo `date --date=today +"%Y-%M-%d %H:%m:%S"` -  [INFO] - logfile not exist, create it now. >> $LOG_FILE
fi
  
  
function check_mysql_helth (){
  $MYSQL -u$MYSQL_USER -p$MYSQL_PASSWORD -P $MYSQL_PORT -e "show status;" >/dev/null 2>&1
  if [ $? = 0 ] ;then
    MYSQL_OK=1
  else
    MYSQL_OK=0
  fi
  return $MYSQL_OK
}
  
while [ $CHECK_TIME -ne 0 ]
do
  let "CHECK_TIME -= 1"
  check_mysql_helth
  if [ $MYSQL_OK = 1 ] ; then
    CHECK_TIME=0
    #echo `date --date=today +"%Y-%M-%d %H:%m:%S"` -  [INFO] - mysql available: success[$MYSQL_OK] >> $LOG_FILE  
    exit 0
  fi  
  if [ $MYSQL_OK -eq 0 ] && [ $CHECK_TIME -eq 0 ]
  then
    killall keepalived
    echo `date --date=today +"%Y-%M-%d %H:%m:%S"` -  [INFO] - mysql invaild. keepalived stop. >> $LOG_FILE
    exit 1
  fi
  sleep 1
done

Each failed check is logged to /var/log/keepalived/check_mysql.log:

# ls -l /var/log/keepalived/check_mysql.log
-rw-r--r-- 1 root root 191 Jan 27  2015 /var/log/keepalived/check_mysql.log
# cat /var/log/keepalived/check_mysql.log
2015-20-23 17:01:49 - [INFO] - logfile not exist, create it now.
2015-52-27 15:01:14 - [INFO] - mysql invaild. keepalived stop.
2015-52-27 15:01:16 - [INFO] - mysql invaild. keepalived stop.

The modification time of this file on 44 and the dates inside it look abnormal, but they are explained by the date format string in the script: +"%Y-%M-%d %H:%m:%S" swaps month and minute (in date(1), %M is the minute and %m is the month), so "2015-20-23 17:01:49" really means year 2015, day 23, at minute 20. More importantly, the file has not been written since January 2015. The script's failure path (which runs killall keepalived and writes a log line) did not execute on June 23; the check hung and was timed out by keepalived rather than returning failure.
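The swapped specifiers can be reproduced directly with date; the first invocation uses the script's format, the second is the corrected one:

```shell
# As found in check_mysqld.sh: %M is the MINUTE and %m is the MONTH, so this
# prints something like "2015-20-23 17:01:49" (year-minute-day hour-month-second).
date +"%Y-%M-%d %H:%m:%S"

# Corrected format, with month and minute in their proper places:
date +"%Y-%m-%d %H:%M:%S"
```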

Meanwhile, the MySQL log-error file on 44 contains nothing relevant, and the instance was not restarted on the 23rd. Below is the log on 44 around 20:51 on June 23; there is no sign of the instance going down.

2016-06-23 20:50:06 11684 [Warning] IP address 'xxx.xxx.156.93' could not be resolved: Temporary failure in name resolution
2016-06-23 20:50:06 11684 [Warning] IP address 'xxx.xxx.156.93' could not be resolved: Temporary failure in name resolution
2016-06-23 20:50:06 11684 [Warning] IP address 'xxx.xxx.156.93' could not be resolved: Temporary failure in name resolution
2016-06-23 20:50:06 11684 [Warning] IP address 'xxx.xxx.156.93' could not be resolved: Temporary failure in name resolution
2016-06-23 21:08:38 11684 [ERROR] Slave SQL: Could not execute Write_rows event on table hive.serdes; Duplicate entry '6596706' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.000010, end_log_pos 312284152, Error_code: 1062
2016-06-23 21:08:38 11684 [Warning] Slave: Duplicate entry '6596706' for key 'PRIMARY' Error_code: 1062
2016-06-23 21:08:38 11684 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000010' position 312283922
2016-06-24 12:03:36 11684 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
2016-06-24 12:03:36 11684 [Note] Slave I/O thread killed while reading event
2016-06-24 12:03:36 11684 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.000010', position 331769782

8. Summary

This failure was the combined result of MySQL replication breaking and keepalived then abnormally failing the VIP over to the stale node. Going forward, we need to strengthen monitoring of MySQL replication, so that a broken replication stream can no longer go unnoticed for days.
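As a starting point, replication health can be checked from the shell. The sketch below is hypothetical (user, password, and socket must be adapted to the environment); the parsing is factored into a function so it can be exercised on sample text without a live server:

```shell
#!/bin/bash
# Minimal sketch of a replication-health check (hypothetical, not the team's
# actual monitoring). Prints OK when both replication threads are running,
# BROKEN otherwise.

check_slave_output() {
  # Reads the text of "SHOW SLAVE STATUS\G" on stdin.
  local status
  status=$(cat)
  if printf '%s\n' "$status" | grep -q 'Slave_IO_Running: Yes' &&
     printf '%s\n' "$status" | grep -q 'Slave_SQL_Running: Yes'; then
    echo "OK"
  else
    echo "BROKEN"
  fi
}

# Intended use from cron (assumes mysql client and credentials; not run here):
#   mysql -uroot -pxxxxxx -S /var/lib/mysql/mysql.sock \
#     -e 'SHOW SLAVE STATUS\G' | check_slave_output
```

Had a check like this been alerting, the replication break on June 12 would have been noticed long before the VIP moved on June 23.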
