
Thursday, January 19, 2017

Troubleshooting row-based replication issues

Hi,

I would like to describe a problem: one of our slave servers had a permanent replication delay. The setup had the following topology:
master --STATEMENT binlog--> slave1 --ROW binlog--> slave2
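For the second hop to exist at all, the intermediate slave has to re-log replicated events into its own binary log in row format. A minimal sketch of the relevant my.cnf fragment for slave1 (server-id and log file names here are hypothetical examples, not the real configuration):

```ini
[mysqld]
# Hypothetical values for illustration only
server-id         = 2
log_bin           = db-xxx-bin
# Write events received from the master into slave1's own binary log
log_slave_updates = 1
# Events slave1 logs are written in row format, regardless of the
# format they arrived in from the master
binlog_format     = ROW
```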

Slave2 had a replication delay that was continuously increasing.


One of the slave threads in the processlist:
*************************** 3. row ***************************
         User: system user
         Host:
           db: NULL
      Command: Connect
         Time: 8935374
        State: Reading event from the relay log
         Info: NULL
    Rows_sent: 0
Rows_examined: 0
           Id: 5
Part of SHOW ENGINE INNODB STATUS:

---TRANSACTION 4675333, ACTIVE 17629 sec fetching rows
mysql tables in use 1, locked 1
28346 lock struct(s), heap size 2618920, 1921370 row lock(s), undo log entries 3220
MySQL thread id 5, OS thread handle 0x7f8c8060e700, query id 21 Reading event from the relay log


Hm, what is going on? There were no queries in the processlist, so I checked the relay log positions reported by SHOW SLAVE STATUS:

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: X.X.X.X
                  Master_User: XXXXX
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: db-xxx-bin.001368
          Read_Master_Log_Pos: 9430623
               Relay_Log_File: db-xxx-relay.001192
                Relay_Log_Pos: 409141308
        Relay_Master_Log_File: db-xxx-bin.001290
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes

sudo mysqlbinlog -v  --base64-output=DECODE-ROWS --start-position=409141308  db-xxx-relay.001192:

SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
SET @@session.time_zone='SYSTEM'/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 409141389
#160618 20:06:24 server id 701020101  end_log_pos 409141521 CRC32 0x04f7b62f    Table_map: `db`.`TABLE` mapped to number 1002
# at 409141685
#160618 20:06:24 server id 701020101  end_log_pos 409149640 CRC32 0xaad0c198    Delete_rows: table id 1002
# at 409149804
#160618 20:06:24 server id 701020101  end_log_pos 409157762 CRC32 0x7f27178b    Delete_rows: table id 1002
# at 409157926
#160618 20:06:24 server id 701020101  end_log_pos 409165890 CRC32 0x118c26ad    Delete_rows: table id 1002
# at 409166054
#160618 20:06:24 server id 701020101  end_log_pos 409173977 CRC32 0x137acdbd    Delete_rows: table id 1002
# at 409174141
.....

So here we can see that table `TABLE` is being cleaned up by some DELETE. I should add that this table contains a lot of records. Let's find the actual SQL (the -v --base64-output=DECODE-ROWS options decode the row events into pseudo-SQL):
...
#160618 20:06:24 server id 701020101  end_log_pos 928334665 CRC32 0x790815e0    Delete_rows: table id 1002 flags: STMT_END_F
### DELETE FROM `db`.`TABLE`
### WHERE
###   @1='data1'
###   @2='data2'
###   @3='data3'
###   @4=86561281
###   @5='data3'
###   @6=2016-09-13 14:00:00
###   @7=1111
###   @8='Friday'
###   @9='14:30'
###   @10='14:00'
###   @11=14
###   @12='May'
###   @13='14:10'
###   @14='14:13:57'
###   @15=NULL
###   @16=2016-06-13 14:43:57
###   @17=2016
..... 
So there is a separate DELETE event for each row, with a WHERE clause that matches on every column. As I mentioned, the table is pretty big, and that is why we got the replication delay: the table had been created without a primary key. A single statement like `DELETE FROM TABLE WHERE column <> 'data1'` on the master therefore produced one row event per deleted row in slave1's binary log, and slave2, having no index to locate each row, had to perform a full table scan for every one of those events. I'll fix it by creating an appropriate primary key on the master server.
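A sketch of the fix. The decoded output above hides the real column names, so the identifier below is hypothetical; the point is to pick a column (or combination of columns) that is genuinely unique:

```sql
-- Hypothetical column name `id` (perhaps the integer seen as @4 above);
-- verify uniqueness before adding the key on the master.
ALTER TABLE `db`.`TABLE` ADD PRIMARY KEY (`id`);
```

With a primary key in place, the slave's SQL thread can find each row to delete with an index lookup instead of a full scan. On MySQL 5.6 and later there is also a workaround that avoids a schema change: the `slave_rows_search_algorithms` variable can be set to include `HASH_SCAN`, which lets the slave apply a batch of row events from one event group in a single table scan rather than one scan per row:

```sql
SET GLOBAL slave_rows_search_algorithms = 'INDEX_SCAN,HASH_SCAN';
```

The proper fix, though, remains the primary key: tables without one are a known source of pathological row-based replication performance.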





