InnoDB Cluster Distributed Recovery seemingly stuck forever
Posted by: Raphaël Galmiche
Date: April 07, 2021 09:13AM
Hello, I'm currently running a 3-node single-primary 8.0.23 InnoDB Cluster and I'm encountering issues with replication and distributed recovery.
One of my nodes is stuck forever in the RECOVERING state and never becomes ONLINE. There is pretty much no transfer occurring between the donor (in my case the primary) and the broken node stuck in recovery.
Topology:
mysql0 (10.0.20.20) RECOVERING (donor: mysql2)
mysql1 (10.0.20.21) ONLINE (secondary)
mysql2 (10.0.20.22) ONLINE (primary)
All nodes are Debian 10 servers with 8 cores and 15 GB of RAM, connected to a private VLAN.
Here's the Cluster.status() output from mysqlsh:
{
    "clusterName": "abcd",
    "defaultReplicaSet": {
        "name": "default",
        "primary": "10.0.20.22:3306",
        "ssl": "REQUIRED",
        "status": "OK",
        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure. 1 member is not active.",
        "topology": {
            "10.0.20.20:3306": {
                "address": "10.0.20.20:3306",
                "mode": "R/O",
                "readReplicas": {},
                "recoveryStatusText": "Recovery in progress",
                "role": "HA",
                "status": "RECOVERING",
                "version": "8.0.23"
            },
            "10.0.20.21:3306": {
                "address": "10.0.20.21:3306",
                "mode": "R/O",
                "readReplicas": {},
                "replicationLag": null,
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.23"
            },
            "10.0.20.22:3306": {
                "address": "10.0.20.22:3306",
                "mode": "R/W",
                "readReplicas": {},
                "replicationLag": null,
                "role": "HA",
                "status": "ONLINE",
                "version": "8.0.23"
            }
        },
        "topologyMode": "Single-Primary"
    },
    "groupInformationSourceMember": "10.0.20.22:3306"
}
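(For reference, I get that output by connecting with MySQL Shell and calling status() on the Cluster object, roughly as below; the admin account name is just a placeholder.)
mysqlsh --js clusteradmin@10.0.20.22:3306
var cluster = dba.getCluster()   // fetch the Cluster handle from the metadata
cluster.status()                 // prints the JSON above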
Log from mysql0:
2021-04-07T14:41:54.470289Z 343 [System] [MY-013587] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' is starting.'
2021-04-07T14:41:54.472905Z 345 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2021-04-07T14:41:56.656107Z 343 [System] [MY-011511] [Repl] Plugin group_replication reported: 'This server is working as secondary member with primary member address 10.0.20.22:3306.'
2021-04-07T14:41:57.656438Z 0 [System] [MY-013471] [Repl] Plugin group_replication reported: 'Distributed recovery will transfer data using: Incremental recovery from a group donor'
2021-04-07T14:41:57.656682Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 10.0.20.22:3306, 10.0.20.21:3306, 10.0.20.20:3306 on view 16178027833824519:9.'
As you can see, mysql0's log indicates that mysql2 is the donor and a distributed recovery is in progress; however, no amount of waiting brings the node back to an ONLINE state.
When checking performance_schema.replication_connection_status there is seemingly no activity (note that the recovery channel is OFF):
mysql> SELECT * FROM performance_schema.replication_connection_status\G
*************************** 1. row ***************************
CHANNEL_NAME: group_replication_applier
GROUP_NAME: c86adab9-8573-11eb-bd74-fa163ed5368e
SOURCE_UUID: c86adab9-8573-11eb-bd74-fa163ed5368e
THREAD_ID: NULL
SERVICE_STATE: ON
COUNT_RECEIVED_HEARTBEATS: 0
LAST_HEARTBEAT_TIMESTAMP: 0000-00-00 00:00:00.000000
RECEIVED_TRANSACTION_SET: c86adab9-8573-11eb-bd74-fa163ed5368e:1-6375586
LAST_ERROR_NUMBER: 0
LAST_ERROR_MESSAGE:
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00.000000
LAST_QUEUED_TRANSACTION:
LAST_QUEUED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
LAST_QUEUED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
LAST_QUEUED_TRANSACTION_START_QUEUE_TIMESTAMP: 0000-00-00 00:00:00.000000
LAST_QUEUED_TRANSACTION_END_QUEUE_TIMESTAMP: 0000-00-00 00:00:00.000000
QUEUEING_TRANSACTION:
QUEUEING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
QUEUEING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
QUEUEING_TRANSACTION_START_QUEUE_TIMESTAMP: 0000-00-00 00:00:00.000000
*************************** 2. row ***************************
CHANNEL_NAME: group_replication_recovery
GROUP_NAME:
SOURCE_UUID:
THREAD_ID: NULL
SERVICE_STATE: OFF
COUNT_RECEIVED_HEARTBEATS: 0
LAST_HEARTBEAT_TIMESTAMP: 0000-00-00 00:00:00.000000
RECEIVED_TRANSACTION_SET:
LAST_ERROR_NUMBER: 0
LAST_ERROR_MESSAGE:
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00.000000
LAST_QUEUED_TRANSACTION:
LAST_QUEUED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
LAST_QUEUED_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
LAST_QUEUED_TRANSACTION_START_QUEUE_TIMESTAMP: 0000-00-00 00:00:00.000000
LAST_QUEUED_TRANSACTION_END_QUEUE_TIMESTAMP: 0000-00-00 00:00:00.000000
QUEUEING_TRANSACTION:
QUEUEING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
QUEUEING_TRANSACTION_IMMEDIATE_COMMIT_TIMESTAMP: 0000-00-00 00:00:00.000000
QUEUEING_TRANSACTION_START_QUEUE_TIMESTAMP: 0000-00-00 00:00:00.000000
mysql> show slave status for CHANNEL 'group_replication_recovery'\G
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: <NULL>
Master_User: mysql_innodb_cluster_1
Master_Port: 0
Connect_Retry: 60
Master_Log_File:
Read_Master_Log_Pos: 4
Relay_Log_File: mysql0-relay-bin-group_replication_recovery.000001
Relay_Log_Pos: 4
Relay_Master_Log_File:
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 0
Relay_Log_Space: 513
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 0
Master_UUID:
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Master_Retry_Count: 1
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set: c86adab9-8573-11eb-bd74-fa163ed5368e:1-6235867
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name: group_replication_recovery
Master_TLS_Version: TLSv1,TLSv1.1,TLSv1.2,TLSv1.3
Master_public_key_path:
Get_master_public_key: 0
Network_Namespace:
I checked for resource starvation and all the nodes have enough CPU, RAM and disk available. All the nodes are able to communicate with each other.
I tried stopping and restarting group_replication directly on mysql0: it leaves and rejoins the cluster, then gets stuck in RECOVERING forever again. I also tried restarting the mysql process and then the whole host: same result, the node joins the cluster and remains stuck forever.
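For clarity, by "stopping and restarting group_replication directly" I mean the standard statements, run on mysql0:
mysql> STOP GROUP_REPLICATION;
mysql> START GROUP_REPLICATION;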
The only solution I have found is to force-remove the node (cluster.removeInstance(ip, {force: true})) and add it back into the cluster while forcing the recovery method to be CLONE instead of distributed recovery. As you can imagine this is not ideal: it takes a lot of time and requires manual intervention every time a node encounters an issue.
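In mysqlsh the workaround looks roughly like this (using mysql0's address):
cluster.removeInstance("10.0.20.20:3306", {force: true})
cluster.addInstance("10.0.20.20:3306", {recoveryMethod: "clone"})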
This is a recurring issue.
I'm out of ideas and all suggestions are welcome :)
Thank you.