MySQL 主主復(fù)制失敗
Mysql 主主復(fù)制失敗故障描述架構(gòu)信息節(jié)點(diǎn)信息故障分析同步AipPRD2的從環(huán)境同步AIPPRD1的從環(huán)境
故障描述
  | 原因描述 | 因?yàn)闄C(jī)柜PDU老化, 導(dǎo)致整個(gè)機(jī)柜掉電. | 
  | 故障時(shí)間 | 20160923-10:09 | 
  | 發(fā)現(xiàn)時(shí)間 | 20160929-13:56 | 
架構(gòu)信息
  | Tomcat | Memcache | Keepalive | Mysql主主復(fù)制 | 
節(jié)點(diǎn)信息
  | 序號 | 節(jié)點(diǎn)名稱 | IP地址 | 報(bào)錯(cuò)信息 | 
  | 1 | aipprd1 | 10.66.1.52 | Got fatal error 1236 from master when reading data from binary log: ‘binlog truncated in the middle of event; consider out of disk space on master; the first event ‘mysql-bin.000084’ at 91941417, the last event read from ‘/aip/mysql/data/log/mysql-bin.000084’ at 91941783, the last byte read from ‘/aip/mysql/data/log/mysql-bin.000084’ at 91942912.’ | 
  | 2 | aipprd2 | 10.66.1.51 | Got fatal error 1236 from master when reading data from binary log: ‘binlog truncated in the middle of event; consider out of disk space on master; the first event ‘mysql-bin.000082’ at 6369026, the last event read from ‘/aip/mysql/data/log/mysql-bin.000082’ at 6369026, the last byte read from ‘/aip/mysql/data/log/mysql-bin.000082’ at 6369280.’ | 
故障分析
由于Zabbix的Mysql監(jiān)控腳本的緣故, 沒有觸發(fā)事件, 所以直至20160929檢查Zabbix日志的時(shí)候, 才發(fā)現(xiàn)該故障, 當(dāng)時(shí)Keepalive的VIP在aipprd1上, 這個(gè)節(jié)點(diǎn)上, 數(shù)據(jù)庫是對外服務(wù)的;那么首先以aipprd1為主, 先將aipprd2的從環(huán)境同步起來;待aipprd2從環(huán)境同步完成后, 再將aipprd1的從環(huán)境同步起來.
同步AIPPRD2的從環(huán)境
檢查aipprd2的Slave狀態(tài) 根據(jù)代碼區(qū)的描述, Last_Errno的錯(cuò)誤代碼為1062, 需要手動(dòng)修改Position.
[root@aipprd2 ~]# mysql -uzabbixmoniter -ppassw0rd -hlocalhost -e "show slave status/G;"*************************** 1. row ***************************           Slave_IO_State:               Master_Host: 10.66.1.52              Master_User: root              Master_Port: 3388            Connect_Retry: 60          Master_Log_File: mysql-bin.000082      Read_Master_Log_Pos: 6369026           Relay_Log_File: mysql-relay-bin.000257            Relay_Log_Pos: 8999    Relay_Master_Log_File: mysql-bin.000082         Slave_IO_Running: No        Slave_SQL_Running: No          Replicate_Do_DB:       Replicate_Ignore_DB:        Replicate_Do_Table:    Replicate_Ignore_Table:   Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table:                Last_Errno: 1062               Last_Error: Error 'Duplicate entry '93FF91EF92866D23E80E4A57D55ED538-n1.tomcat604' for key 'PRIMARY'' on query. Default database: 'aipprd'. Query: 'INSERT INTO eahttpsession (     sessionid, username, account,      createtime, loginip,userid,explorer,userDomain,computerName,computerUserName)   VALUES ('93FF91EF92866D23E80E4A57D55ED538-n1.tomcat604', '李花', 'XS003_4200',      '2016-09-23 12:10:33', , '10752920','MSIE 7.0','','','')'             Skip_Counter: 0      Exec_Master_Log_Pos: 6367510          Relay_Log_Space: 11099          Until_Condition: None           Until_Log_File:             Until_Log_Pos: 0       Master_SSL_Allowed: No       Master_SSL_CA_File:        Master_SSL_CA_Path:           Master_SSL_Cert:         Master_SSL_Cipher:            Master_SSL_Key:     Seconds_Behind_Master: NULLMaster_SSL_Verify_Server_Cert: No            Last_IO_Errno: 1236            Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'binlog truncated in the middle of event; consider out of disk space on master; the first event 'mysql-bin.000082' at 6369026, the last event read from '/aip/mysql/data/log/mysql-bin.000082' at 6369026, the last byte read from '/aip/mysql/data/log/mysql-bin.000082' at 6369280.'           Last_SQL_Errno: 1062           Last_SQL_Error: Error 'Duplicate entry '93FF91EF92866D23E80E4A57D55ED538-n1.tomcat604' for key 'PRIMARY'' on query. Default database: 'aipprd'. Query: 'INSERT INTO eahttpsession (     sessionid, username, account,      createtime, loginip,userid,explorer,userDomain,computerName,computerUserName)   VALUES ('93FF91EF92866D23E80E4A57D55ED538-n1.tomcat604', '李花', 'XS003_4200',      '2016-09-23 12:10:33', , '10752920','MSIE 7.0','','','')'Replicate_Ignore_Server_Ids:          Master_Server_Id: 1在aipprd2上按照Last_IO_Error中的Posfile和Pos修改. 按照報(bào)錯(cuò)給出的提示Posfile和Pos修改后, 報(bào)錯(cuò)依舊.
mysql> slave stop;mysql> CHANGE MASTER TO master_host='10.66.1.52', master_port=3388, master_user='root',     master_passWord='passw0rd', master_log_file='mysql-bin.000082', master_log_pos=6369026;mysql> slave start;在aipprd2檢查Mysql的日志 Mysql的日志中記錄了Crash開始的時(shí)間, 并給出了建議, Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000082' position 6367510 
160923 12:35:22 [Note] Starting crash recovery...160923 12:35:22 [Note] Crash recovery finished.160923 12:36:27 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000082' position 6367510在aipprd2上按照日志中建議的Posfile和Pos修改. 按照日志中建議的Posfile和Pos修改后, 報(bào)錯(cuò)依舊.
mysql> slave stop;mysql> CHANGE MASTER TO master_host='10.66.1.52', master_port=3388, master_user='root', master_password='passw0rd', master_log_file='mysql-bin.000082', master_log_pos=6367510;mysql> slave start;在aipprd1上檢查posfile 首先檢查show slave status/G;中給出的pos, 發(fā)現(xiàn)日志中根本不存在.
[root@aipprd1 log]# mysqlbinlog --no-defaults --start-position=6369026 mysql-bin.000083 /*!40019 SET @@session.max_insert_delayed_threads=0*/;/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;DELIMITER /*!*/;# at 4#160923 12:38:59 server id 1  end_log_pos 107  Start: binlog v 4, server v 5.5.24-log           created 160923 12:38:59 at startupROLLBACK/*!*/;BINLOG '47HkVw8BAAAAZwAAAGsAAAAAAAQANS41LjI0LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADjseRXEzgNAAgAEgAEBAQEEgAAVAAEGggAAAAICAgCAA=='/*!*/;ERROR: Error in Log_event::read_log_event(): 'read error', data_len: 912208946, event_type:      71ERROR: Could not read entry at offset 6369026: Error in log format or read error.DELIMITER # End of log fileROLLBACK /* added by mysqlbinlog */;/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;此后檢查Mysql日志中建議的pos, 發(fā)現(xiàn)posfile中是存在此記錄的, 而此posfile的最后一個(gè)pos是6368660, 而show slave status/G;的pos是6369026, 顯然不存在日志文件中. 
[root@aipprd1 log]# mysqlbinlog --no-defaults  --start-position=6367510 mysql-bin.000082 /*!40019 SET @@session.max_insert_delayed_threads=0*/;/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;DELIMITER /*!*/;# at 4#160923 10:39:15 server id 1  end_log_pos 107  Start: binlog v 4, server v 5.5.24-log created 160923 10:39:15 at startup# Warning: this binlog is either in use or was not closed properly.ROLLBACK/*!*/;BINLOG '05XkVw8BAAAAZwAAAGsAAAABAAQANS41LjI0LWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADTleRXEzgNAAgAEgAEBAQEEgAAVAAEGggAAAAICAgCAA=='/*!*/;    # at 6367510    # at 6367585    # at 6367963    # at 6367990    # at 6368073    # at 6368163    # at 6368275    # at 6368358    # at 6368558    # at 6368585    # at 6368660    # End of log fileROLLBACK /* added by mysqlbinlog */;/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;最后在mysql-bin.000083中檢查show slave status/G;提示pos:6369026, 也不存在.
[root@aipprd1 log]# mysqlbinlog --no-defaults  --start-position=6369026 mysql-bin.000083 在aipprd2上重新發(fā)起修改posfile和pos操作 檢查aipprd1上的日志, 既然mysql-bin.000082日志末不存在pos:6369026, 且mysql-bin.000083為下一個(gè)日志, 那么重新發(fā)起修改posfile和pos的操作.
[root@aipprd1 log]# lltotal 526908-rw-rw---- 1 mysql mysql 104858346 Sep 19 23:04 mysql-bin.000079-rw-rw---- 1 mysql mysql 104859112 Sep 22 13:01 mysql-bin.000080-rw-rw---- 1 mysql mysql  42558345 Sep 23 10:23 mysql-bin.000081-rw-rw---- 1 mysql mysql   6369280 Sep 23 12:13 mysql-bin.000082-rw-rw---- 1 mysql mysql 104857953 Sep 26 15:02 mysql-bin.000083-rw-rw---- 1 mysql mysql  70844542 Sep 27 17:42 mysql-bin.000084-rw-rw---- 1 mysql mysql 104858026 Sep 29 15:48 mysql-bin.000085-rw-rw---- 1 mysql mysql    297019 Sep 29 15:54 mysql-bin.000086-rw-rw---- 1 mysql mysql       296 Sep 29 15:48 mysql-bin.log.index-rw-rw---- 1 mysql mysql       107 Sep 27 18:05 mysql-relay-bin.000182-rw-rw---- 1 mysql mysql        43 Sep 27 18:05 mysql-relay-bin.index重新修改posfile為mysql-bin.000083, pos為0, 啟動(dòng)Slave后, 現(xiàn)在同步正常.
mysql> slave stop;mysql> CHANGE MASTER TO master_host='10.66.1.52', master_port=3388, master_user='root', master_password='passw0rd', master_log_file='mysql-bin.000083', master_log_pos=0;mysql> slave start;mysql> show slave status/G;*************************** 1. row ***************************           Slave_IO_State: Waiting for master to send event              Master_Host: 10.66.1.52              Master_User: root              Master_Port: 3388            Connect_Retry: 60          Master_Log_File: mysql-bin.000086      Read_Master_Log_Pos: 18585           Relay_Log_File: mysql-relay-bin.000002            Relay_Log_Pos: 8119286    Relay_Master_Log_File: mysql-bin.000083         Slave_IO_Running: Yes        Slave_SQL_Running: Yes          Replicate_Do_DB:       Replicate_Ignore_DB:        Replicate_Do_Table:    Replicate_Ignore_Table:   Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table:                Last_Errno: 0               Last_Error:              Skip_Counter: 0      Exec_Master_Log_Pos: 8119140          Relay_Log_Space: 280580804          Until_Condition: None           Until_Log_File:             Until_Log_Pos: 0       Master_SSL_Allowed: No       Master_SSL_CA_File:        Master_SSL_CA_Path:           Master_SSL_Cert:         Master_SSL_Cipher:            Master_SSL_Key:     Seconds_Behind_Master: 521370Master_SSL_Verify_Server_Cert: No            Last_IO_Errno: 0            Last_IO_Error:            Last_SQL_Errno: 0           Last_SQL_Error: Replicate_Ignore_Server_Ids:          Master_Server_Id: 11 row in set (0.00 sec)在同步的過程中, 發(fā)現(xiàn)有幾個(gè)Last_SQL_Error: Error ‘Duplicate entry 1026的SQL Error, 這個(gè)是因?yàn)橹貜?fù)主鍵導(dǎo)致Slave停止工作, 執(zhí)行以下操作解決(如果有多條重復(fù)的主鍵, 需要執(zhí)行多次):
mysql> slave stop;mysql> set GLOBAL SQL_SLAVE_SKIP_COUNTER=1;mysql> slave start;還有另一種辦法就是修改mysql配置文件/etc/my.cnf在[mysqld]下加一行slave_skip_errors = 1062 ,保存后重啟mysql, mysql slave可以正常同步了.
同步AIPPRD1的從環(huán)境
檢查aipprd1的Slave狀態(tài)
mysql> show slave status/G;*************************** 1. row ***************************           Slave_IO_State:               Master_Host: 10.66.1.51              Master_User: root              Master_Port: 3388            Connect_Retry: 60          Master_Log_File: mysql-bin.000084      Read_Master_Log_Pos: 91941783           Relay_Log_File: mysql-relay-bin.000002            Relay_Log_Pos: 296    Relay_Master_Log_File: mysql-bin.000084         Slave_IO_Running: No        Slave_SQL_Running: Yes          Replicate_Do_DB:       Replicate_Ignore_DB:        Replicate_Do_Table:    Replicate_Ignore_Table:   Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table:                Last_Errno: 0               Last_Error:              Skip_Counter: 0      Exec_Master_Log_Pos: 91941783          Relay_Log_Space: 452          Until_Condition: None           Until_Log_File:             Until_Log_Pos: 0       Master_SSL_Allowed: No       Master_SSL_CA_File:        Master_SSL_CA_Path:           Master_SSL_Cert:         Master_SSL_Cipher:            Master_SSL_Key:     Seconds_Behind_Master: NULLMaster_SSL_Verify_Server_Cert: No            Last_IO_Errno: 1236            Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'binlog truncated in the middle of event; consider out of disk space on master; the first event 'mysql-bin.000084' at 91941417, the last event read from '/aip/mysql/data/log/mysql-bin.000084' at 91941783, the last byte read from '/aip/mysql/data/log/mysql-bin.000084' at 91942912.'           Last_SQL_Errno: 0           Last_SQL_Error: Replicate_Ignore_Server_Ids:          Master_Server_Id: 21 row in set (0.00 sec)在aipprd2上檢查日志文件 在aipprd1上檢查show slave status/G;后, 提示需要修改posfile為mysql-bin.000084和pos91942912, 因?yàn)樵?code>aipprd2同步完成后, 實(shí)際同步的數(shù)據(jù)是從aipprd1過來的, 這些數(shù)據(jù)在aipprd1上本身就存在的.
[root@aipprd2 log]# lltotal 653832-rw-rw---- 1 mysql mysql 104857774 Sep 21 13:12 mysql-bin.000083-rw-rw---- 1 mysql mysql  91942912 Sep 23 10:21 mysql-bin.000084-rw-rw---- 1 mysql mysql   6295552 Sep 23 12:10 mysql-bin.000085-rw-rw---- 1 mysql mysql      8250 Sep 27 17:39 mysql-bin.000086-rw-rw---- 1 mysql mysql 104858014 Sep 29 17:20 mysql-bin.000087-rw-rw---- 1 mysql mysql 104857971 Sep 29 19:18 mysql-bin.000088-rw-rw---- 1 mysql mysql  55744926 Sep 29 20:23 mysql-bin.000089-rw-rw---- 1 mysql mysql       259 Sep 29 19:18 mysql-bin.log.index-rw-rw---- 1 mysql mysql  70844718 Sep 29 15:35 mysql-relay-bin.000005-rw-rw---- 1 mysql mysql 104857729 Sep 29 15:44 mysql-relay-bin.000006-rw-rw---- 1 mysql mysql       708 Sep 29 15:44 mysql-relay-bin.000007-rw-rw---- 1 mysql mysql       199 Sep 29 15:44 mysql-relay-bin.000008-rw-rw---- 1 mysql mysql  25188420 Sep 30 08:40 mysql-relay-bin.000009-rw-rw---- 1 mysql mysql       215 Sep 29 18:42 mysql-relay-bin.index在aipprd2上檢查show master status;后, 記錄Posfile為mysql-bin.000089, 既然aipprd1的數(shù)據(jù)為最新的, 且aipprd2已經(jīng)從aipprd1后同步完成了(通過檢查show slave status/G;中的Seconds_Behind_Master:, 如果此項(xiàng)值很小, 應(yīng)該是同步完成了.), 那么兩邊的數(shù)據(jù)應(yīng)該差不多的.
mysql> show master status;+------------------+-----------+--------------+------------------+| File             | Position  | Binlog_Do_DB | Binlog_Ignore_DB |+------------------+-----------+--------------+------------------+| mysql-bin.000089 | 104341982 |              |                  |+------------------+-----------+--------------+------------------+1 row in set (0.00 sec)在aipprd1上發(fā)起修改posfile和pos操作 所以在此用posfile的mysql-bin.000089和pos的0來修改, 啟動(dòng)Slave后, 開始同步.
mysql> slave stop;Query OK, 0 rows affected (0.11 sec)mysql> CHANGE MASTER TO master_host='10.66.1.51', master_port=3388, master_user='root', master_password='passw0rd', master_log_file='mysql-bin.000089', master_log_pos=0;Query OK, 0 rows affected (0.06 sec)mysql> slave start;Query OK, 0 rows affected (0.00 sec)mysql> show slave status/G;*************************** 1. row ***************************            Slave_IO_State: Waiting for master to send event              Master_Host: 10.66.1.51              Master_User: root              Master_Port: 3388            Connect_Retry: 60          Master_Log_File: mysql-bin.000089      Read_Master_Log_Pos: 55744926           Relay_Log_File: mysql-relay-bin.000002            Relay_Log_Pos: 253    Relay_Master_Log_File: mysql-bin.000089         Slave_IO_Running: Yes        Slave_SQL_Running: Yes          Replicate_Do_DB:       Replicate_Ignore_DB:        Replicate_Do_Table:    Replicate_Ignore_Table:   Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table:                Last_Errno: 0               Last_Error:              Skip_Counter: 0      Exec_Master_Log_Pos: 55744926          Relay_Log_Space: 409          Until_Condition: None           Until_Log_File:             Until_Log_Pos: 0       Master_SSL_Allowed: No       Master_SSL_CA_File:        Master_SSL_CA_Path:           Master_SSL_Cert:         Master_SSL_Cipher:            Master_SSL_Key:     Seconds_Behind_Master: 0Master_SSL_Verify_Server_Cert: No            Last_IO_Errno: 0            Last_IO_Error:            Last_SQL_Errno: 0           Last_SQL_Error: Replicate_Ignore_Server_Ids:          Master_Server_Id: 21 row in set (0.00 sec)