Resyncing a MySQL replica in a GTID setup gives error HA_ERR_FOUND_DUPP_KEY with MASTER_AUTO_POSITION
Hello
For years I have used a procedure based on the page
http://web.archive.org/web/20230323121432/https://blog.pythian.com/mysql-streaming-xtrabackup-slave-recovery/ to sync a new replica or resync a broken replica, without issue. It uses xtrabackup to save and send the MySQL data to the slave, apply the redo log, and restart the slave with MASTER_AUTO_POSITION. Simple and effective.
xtrabackup is convenient for us because of the size of the databases we have to transfer. We run it like this: "xtrabackup --backup --stream=xbstream --parallel=$((NB_PROC/2))"
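For context, the restore side is along these lines (host names, the port, and the paths below are placeholders, not our exact commands):
==============================================
# on the replica: listen and unpack the incoming stream (nc syntax varies by version)
nc -l -p 9999 | xbstream -x -C /var/lib/mysql

# on the source: stream the backup across
xtrabackup --backup --stream=xbstream --parallel=$((NB_PROC/2)) | nc replica-host 9999

# on the replica: apply the redo log, fix ownership, start mysqld
xtrabackup --prepare --target-dir=/var/lib/mysql
chown -R mysql:mysql /var/lib/mysql
systemctl start mysqld

# on the replica: point it at the master with GTID auto-positioning
mysql> CHANGE MASTER TO MASTER_HOST='master-host', MASTER_USER='repl',
    ->   MASTER_PASSWORD='replica_password', MASTER_AUTO_POSITION=1;
mysql> START SLAVE;
==============================================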
mysql: 5.7.42
xtrabackup: 2.4.29
Lately, when I start the slave, I get an error like this:
==============================================
Last_Errno: 1062
Last_Error: Could not execute Write_rows event on table db_2.webhook_message_status; Duplicate entry '3304591' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.001615, end_log_pos 5021
Skip_Counter: 0
Exec_Master_Log_Pos: 4692
Relay_Log_Space: 143514642
...
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1062
Last_SQL_Error: Could not execute Write_rows event on table db_2.webhook_message_status; Duplicate entry '3304591' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log mysql-bin.001615, end_log_pos 5021
Replicate_Ignore_Server_Ids:
Master_Server_Id: 189087
Master_UUID: 144c0164-3223-11ef-8319-74563c5c838d
Master_Info_File: mysql.slave_master_info
==============================================
Looking on the slave, I confirm the row with id 3304591 is already there:
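(The check was a simple primary-key lookup, reconstructed here from the output below.)
==============================================
mysql> SELECT * FROM db_2.webhook_message_status WHERE id = 3304591;
==============================================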
==============================================
+---------+---------------------+---------+--------+-----------------------------+---------------------+
| id | message_external_id | site_id | status | context | created_at |
+---------+---------------------+---------+--------+-----------------------------+---------------------+
| 3304591 | xxxxxxxxxxxxxxxxxxx | 483 | read | {"code":200,"title":"read"} | 2024-09-19 07:52:00 |
+---------+---------------------+---------+--------+-----------------------------+---------------------+
==============================================
So it seems the slave no longer knows how to position itself properly.
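To see the mismatch, one can compare what the replica thinks it has executed with what xtrabackup recorded at backup time (standard GTID introspection, nothing specific to the blog post; the path is just where our restored datadir lives):
==============================================
# on the freshly restored replica
mysql> SELECT @@GLOBAL.gtid_executed;

# coordinates recorded by xtrabackup at backup time
cat /var/lib/mysql/xtrabackup_binlog_info
==============================================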
As a workaround, I take the content of xtrabackup_binlog_info:
==============================================
mysql-bin.001615 73610932 144c0164-3223-11ef-8319-74563c5c838d:1-14071690
==============================================
and run this:
==============================================
mysql> reset master;
mysql> set global GTID_PURGED="144c0164-3223-11ef-8319-74563c5c838d:1-14071690";
mysql> start slave;
==============================================
and now it works
==============================================
Master_Server_Id: 189087
Master_UUID: 144c0164-3223-11ef-8319-74563c5c838d
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
...
Retrieved_Gtid_Set: 144c0164-3223-11ef-8319-74563c5c838d:14047201-14248907
Executed_Gtid_Set: 144c0164-3223-11ef-8319-74563c5c838d:1-14248907
Auto_Position: 1
==============================================
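To reassure myself about completeness, the only check I know of is GTID_SUBSET: take the master's executed GTID set and verify it is contained in the replica's set (the set below is the one from this incident, used just as an example):
==============================================
-- on the master
mysql> SELECT @@GLOBAL.gtid_executed;

-- on the replica: returns 1 if everything the master has executed is also executed here
mysql> SELECT GTID_SUBSET('144c0164-3223-11ef-8319-74563c5c838d:1-14248907',
    ->                    @@GLOBAL.gtid_executed) AS all_present;
==============================================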
As an alternative solution, I use pt-slave-restart to skip all errors with code 1062 (see the command below); after a while the slave is eventually in sync, but I don't feel confident about the replica's data integrity.
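For reference, the invocation is roughly this (run on the replica; connection options omitted):
==============================================
# keep restarting the SQL thread, skipping only duplicate-key errors (1062)
pt-slave-restart --error-numbers=1062
==============================================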
Do you have any idea what could be the cause of this problem? It worked fine for the last 3 years without an issue, and we have not changed the major version of MySQL or of any tool involved.
Is the "workaround" I'm doing fine? Do I have all the master's data on the slave?
best