ram/swap maxing out in innodb-cluster
Posted by: Florian Winkler
Date: January 04, 2022 08:22AM
Date: January 04, 2022 08:22AM
Hi everyone,
This has also been sent as a bug report: https://bugs.mysql.com/bug.php?id=106032&thanks=4
We have set up an 3 node innodb cluster with mysql 8.0.27.
1 node primary
2 nodes secondary RO
> cluster.status()
{
"clusterName": "testCluster",
"defaultReplicaSet": {
"name": "default",
"primary": "db01-002.########:3306",
"ssl": "REQUIRED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"db01-002.########:3306": {
"address": "db01-002.########:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": null,
"role": "HA",
"status": "ONLINE",
"version": "8.0.27"
},
"db02-002.########:3306": {
"address": "db02-002.########:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": null,
"role": "HA",
"status": "ONLINE",
"version": "8.0.27"
},
"db03-002.########:3306": {
"address": "db03-002.########:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": null,
"role": "HA",
"status": "ONLINE",
"version": "8.0.27"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "db01-002.########:3306"
}
The 3 nodes are virtualized - as a managed service:
we have root access to the 3 virtual machines, but no access to the hypervisor/networking/etc.
It's a minimal setup, no clients accessing the databases on the cluster yet.
The problem: we guess that there are short network disconnects/lags and one of the nodes loses contact to the other ones. We cant see anything obvious with our monitoring and ping statistics. The node not seeing one or two other nodes lasts only some seconds and it reconnects after this event and everything looks good - cluster state etc. Except for ram/swap usage. Whenever this occurs, ram and swap start to grow till exhausted.
log of event:
2021-12-31T23:28:59.677783+01:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address db03-002.########:3306 has become unreachable.'
2021-12-31T23:29:13.362366+01:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address db03-002.########:3306 is reachable again.'
2021-12-31T23:31:44.909802+01:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address db03-002.########:3306 has become unreachable.'
2021-12-31T23:31:44.916173+01:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address db01-002.########:3306 has become unreachable.'
2021-12-31T23:31:44.916746+01:00 0 [ERROR] [MY-011496] [Repl] Plugin group_replication reported: 'This server is not able to reach a majority of members in the group. This server will now block all updates. The server will remain blocked for the next 15 seconds. Unless contact with the majority is restored, after this time the member will error out and leave the group. It is possible to use group_replication_force_members to force a new group membership.'
2021-12-31T23:31:44.927991+01:00 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Shutting down an outgoing connection. This happens because something might be wrong on a bi-directional connection to node db03-002.########:33061. Please check the connection status to this member'
2021-12-31T23:31:44.942143+01:00 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Shutting down an outgoing connection. This happens because something might be wrong on a bi-directional connection to node db01-002.########:33061. Please check the connection status to this member'
2021-12-31T23:31:44.961775+01:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address db03-002.########:3306 is reachable again.'
2021-12-31T23:31:44.961813+01:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address db01-002.########:3306 is reachable again.'
2021-12-31T23:31:44.967994+01:00 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
2021-12-31T23:31:51.825487+01:00 0 [ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
2021-12-31T23:31:51.827340+01:00 0 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
2021-12-31T23:31:51.854404+01:00 149104 [System] [MY-013373] [Repl] Plugin group_replication reported: 'Started auto-rejoin procedure attempt 1 of 3'
2021-12-31T23:32:21.868086+01:00 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Timeout while waiting for the group communication engine to exit!'
2021-12-31T23:32:21.872116+01:00 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member has failed to gracefully leave the group.'
2021-12-31T23:32:51.891498+01:00 149104 [Warning] [MY-013659] [Repl] Plugin group_replication reported: 'Timeout while waiting for a view change event during the leave step before a auto-rejoin attempt.'
2021-12-31T23:35:00.127143+01:00 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
2021-12-31T23:35:00.760316+01:00 149104 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Automatically adding IPv4 localhost address to the allowlist. It is mandatory that it is added.'
2021-12-31T23:35:00.763983+01:00 149104 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Automatically adding IPv6 localhost address to the allowlist. It is mandatory that it is added.'
2021-12-31T23:35:00.981936+01:00 149119 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 429560479, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2021-12-31T23:35:07.702481+01:00 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Shutting down an outgoing connection. This happens because something might be wrong on a bi-directional connection to node db01-002.########:33061. Please check the connection status to this member'
2021-12-31T23:35:07.910540+01:00 149104 [System] [MY-013375] [Repl] Plugin group_replication reported: 'Auto-rejoin procedure attempt 1 of 3 finished. Member was able to join the group.'
2021-12-31T23:35:08.923772+01:00 0 [System] [MY-013471] [Repl] Plugin group_replication reported: 'Distributed recovery will transfer data using: Incremental recovery from a group donor'
2021-12-31T23:35:08.937563+01:00 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to db01-002.########:3306, db02-002.########:3306, db03-002.########:3306 on view 16400802118655702:17.'
PS: We can rule out the OS, as we had the same issue on CentOS 8 also using mysql 8.0.27 Community Edition.
How to repeat:
Short network disconnects between clients to trigger Auto-rejoin procedure
This has also been sent as a bug report: https://bugs.mysql.com/bug.php?id=106032&thanks=4
We have set up an 3 node innodb cluster with mysql 8.0.27.
1 node primary
2 nodes secondary RO
> cluster.status()
{
"clusterName": "testCluster",
"defaultReplicaSet": {
"name": "default",
"primary": "db01-002.########:3306",
"ssl": "REQUIRED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"db01-002.########:3306": {
"address": "db01-002.########:3306",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"replicationLag": null,
"role": "HA",
"status": "ONLINE",
"version": "8.0.27"
},
"db02-002.########:3306": {
"address": "db02-002.########:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": null,
"role": "HA",
"status": "ONLINE",
"version": "8.0.27"
},
"db03-002.########:3306": {
"address": "db03-002.########:3306",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLag": null,
"role": "HA",
"status": "ONLINE",
"version": "8.0.27"
}
},
"topologyMode": "Single-Primary"
},
"groupInformationSourceMember": "db01-002.########:3306"
}
The 3 nodes are virtualized - as a managed service:
we have root access to the 3 virtual machines, but no access to the hypervisor/networking/etc.
It's a minimal setup, no clients accessing the databases on the cluster yet.
The problem: we guess that there are short network disconnects/lags and one of the nodes loses contact to the other ones. We cant see anything obvious with our monitoring and ping statistics. The node not seeing one or two other nodes lasts only some seconds and it reconnects after this event and everything looks good - cluster state etc. Except for ram/swap usage. Whenever this occurs, ram and swap start to grow till exhausted.
log of event:
2021-12-31T23:28:59.677783+01:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address db03-002.########:3306 has become unreachable.'
2021-12-31T23:29:13.362366+01:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address db03-002.########:3306 is reachable again.'
2021-12-31T23:31:44.909802+01:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address db03-002.########:3306 has become unreachable.'
2021-12-31T23:31:44.916173+01:00 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address db01-002.########:3306 has become unreachable.'
2021-12-31T23:31:44.916746+01:00 0 [ERROR] [MY-011496] [Repl] Plugin group_replication reported: 'This server is not able to reach a majority of members in the group. This server will now block all updates. The server will remain blocked for the next 15 seconds. Unless contact with the majority is restored, after this time the member will error out and leave the group. It is possible to use group_replication_force_members to force a new group membership.'
2021-12-31T23:31:44.927991+01:00 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Shutting down an outgoing connection. This happens because something might be wrong on a bi-directional connection to node db03-002.########:33061. Please check the connection status to this member'
2021-12-31T23:31:44.942143+01:00 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Shutting down an outgoing connection. This happens because something might be wrong on a bi-directional connection to node db01-002.########:33061. Please check the connection status to this member'
2021-12-31T23:31:44.961775+01:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address db03-002.########:3306 is reachable again.'
2021-12-31T23:31:44.961813+01:00 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address db01-002.########:3306 is reachable again.'
2021-12-31T23:31:44.967994+01:00 0 [Warning] [MY-011498] [Repl] Plugin group_replication reported: 'The member has resumed contact with a majority of the members in the group. Regular operation is restored and transactions are unblocked.'
2021-12-31T23:31:51.825487+01:00 0 [ERROR] [MY-011505] [Repl] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
2021-12-31T23:31:51.827340+01:00 0 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
2021-12-31T23:31:51.854404+01:00 149104 [System] [MY-013373] [Repl] Plugin group_replication reported: 'Started auto-rejoin procedure attempt 1 of 3'
2021-12-31T23:32:21.868086+01:00 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Timeout while waiting for the group communication engine to exit!'
2021-12-31T23:32:21.872116+01:00 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member has failed to gracefully leave the group.'
2021-12-31T23:32:51.891498+01:00 149104 [Warning] [MY-013659] [Repl] Plugin group_replication reported: 'Timeout while waiting for a view change event during the leave step before a auto-rejoin attempt.'
2021-12-31T23:35:00.127143+01:00 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
2021-12-31T23:35:00.760316+01:00 149104 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Automatically adding IPv4 localhost address to the allowlist. It is mandatory that it is added.'
2021-12-31T23:35:00.763983+01:00 149104 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Automatically adding IPv6 localhost address to the allowlist. It is mandatory that it is added.'
2021-12-31T23:35:00.981936+01:00 149119 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 429560479, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.
2021-12-31T23:35:07.702481+01:00 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Shutting down an outgoing connection. This happens because something might be wrong on a bi-directional connection to node db01-002.########:33061. Please check the connection status to this member'
2021-12-31T23:35:07.910540+01:00 149104 [System] [MY-013375] [Repl] Plugin group_replication reported: 'Auto-rejoin procedure attempt 1 of 3 finished. Member was able to join the group.'
2021-12-31T23:35:08.923772+01:00 0 [System] [MY-013471] [Repl] Plugin group_replication reported: 'Distributed recovery will transfer data using: Incremental recovery from a group donor'
2021-12-31T23:35:08.937563+01:00 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to db01-002.########:3306, db02-002.########:3306, db03-002.########:3306 on view 16400802118655702:17.'
PS: We can rule out the OS, as we had the same issue on CentOS 8 also using mysql 8.0.27 Community Edition.
How to repeat:
Short network disconnects between clients to trigger Auto-rejoin procedure
Subject
Views
Written By
Posted
Sorry, you can't reply to this topic. It has been closed.
Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.