
Thursday, January 19, 2017

Troubleshooting row-based replication issues

Hi,

I would like to describe a problem: one of our slave servers had a permanent replication delay. The setup had the following topology:
master --STATEMENT binlog--> slave1 --ROW binlog--> slave2
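For the second hop to exist at all, the intermediate slave has to re-log replicated events into its own binary log in row format. A minimal sketch of the relevant my.cnf fragment for slave1 (server-id and log file names here are hypothetical examples, not the real configuration):

```ini
[mysqld]
# Hypothetical values for illustration only
server-id         = 2
log_bin           = db-xxx-bin
# Write events received from the master into slave1's own binary log
log_slave_updates = 1
# Events slave1 logs are written in row format, regardless of the
# format they arrived in from the master
binlog_format     = ROW
```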

Slave2 had a replication delay that was continuously increasing.


One of the slave threads in the processlist:
*************************** 3. row ***************************
         User: system user
         Host:
           db: NULL
      Command: Connect
         Time: 8935374
        State: Reading event from the relay log
         Info: NULL
    Rows_sent: 0
Rows_examined: 0
           Id: 5
Part of SHOW ENGINE INNODB STATUS:

---TRANSACTION 4675333, ACTIVE 17629 sec fetching rows
mysql tables in use 1, locked 1
28346 lock struct(s), heap size 2618920, 1921370 row lock(s), undo log entries 3220
MySQL thread id 5, OS thread handle 0x7f8c8060e700, query id 21 Reading event from the relay log


Hm, what is going on? There were no queries in the processlist, so I checked the relay log positions reported by SHOW SLAVE STATUS:

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: X.X.X.X
                  Master_User: XXXXX
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: db-xxx-bin.001368
          Read_Master_Log_Pos: 9430623
               Relay_Log_File: db-xxx-relay.001192
                Relay_Log_Pos: 409141308
        Relay_Master_Log_File: db-xxx-bin.001290
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes

sudo mysqlbinlog -v  --base64-output=DECODE-ROWS --start-position=409141308  db-xxx-relay.001192:

SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
SET @@session.time_zone='SYSTEM'/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 409141389
#160618 20:06:24 server id 701020101  end_log_pos 409141521 CRC32 0x04f7b62f    Table_map: `db`.`TABLE` mapped to number 1002
# at 409141685
#160618 20:06:24 server id 701020101  end_log_pos 409149640 CRC32 0xaad0c198    Delete_rows: table id 1002
# at 409149804
#160618 20:06:24 server id 701020101  end_log_pos 409157762 CRC32 0x7f27178b    Delete_rows: table id 1002
# at 409157926
#160618 20:06:24 server id 701020101  end_log_pos 409165890 CRC32 0x118c26ad    Delete_rows: table id 1002
# at 409166054
#160618 20:06:24 server id 701020101  end_log_pos 409173977 CRC32 0x137acdbd    Delete_rows: table id 1002
# at 409174141
.....

So here we can see that table `TABLE` is being cleaned up by some DELETE. I should add that this table contains a lot of records. Let's find the actual SQL (the -v --base64-output=DECODE-ROWS options decode the row events into pseudo-SQL):
...
#160618 20:06:24 server id 701020101  end_log_pos 928334665 CRC32 0x790815e0    Delete_rows: table id 1002 flags: STMT_END_F
### DELETE FROM `db`.`TABLE`
### WHERE
###   @1='data1'
###   @2='data2'
###   @3='data3'
###   @4=86561281
###   @5='data3'
###   @6=2016-09-13 14:00:00
###   @7=1111
###   @8='Friday'
###   @9='14:30'
###   @10='14:00'
###   @11=14
###   @12='May'
###   @13='14:10'
###   @14='14:13:57'
###   @15=NULL
###   @16=2016-06-13 14:43:57
###   @17=2016
..... 
So there is a separate DELETE event for each row, with a WHERE clause that matches on every column. As I mentioned, the table is pretty big, and that is why we got the replication delay: the table had been created without a primary key. A single statement like `DELETE FROM TABLE WHERE column <> 'data1'` on the master therefore produced one row event per deleted row in slave1's binary log, and slave2, having no index to locate each row, had to perform a full table scan for every one of those events. I'll fix it by creating an appropriate primary key on the master server.
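A sketch of the fix. The decoded output above hides the real column names, so the identifier below is hypothetical; the point is to pick a column (or combination of columns) that is genuinely unique:

```sql
-- Hypothetical column name `id` (perhaps the integer seen as @4 above);
-- verify uniqueness before adding the key on the master.
ALTER TABLE `db`.`TABLE` ADD PRIMARY KEY (`id`);
```

With a primary key in place, the slave's SQL thread can find each row to delete with an index lookup instead of a full scan. On MySQL 5.6 and later there is also a workaround that avoids a schema change: the `slave_rows_search_algorithms` variable can be set to include `HASH_SCAN`, which lets the slave apply a batch of row events from one event group in a single table scan rather than one scan per row:

```sql
SET GLOBAL slave_rows_search_algorithms = 'INDEX_SCAN,HASH_SCAN';
```

The proper fix, though, remains the primary key: tables without one are a known source of pathological row-based replication performance.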





