I am running MySQL 5.6.41 on AWS RDS. I have one multi-AZ master and one multi-AZ read replica. It is a vanilla configuration MySQL configuration, so the binlog is in `MIXED` format and the replication process on the replica is single threaded.
When I attach a new replica, it is completes initializing with ~1 hour of lag, and it takes *10 hours* to catch up. When I turn the binlog format to `ROW` on the replica, the lag starts increasing indefinitely if `sync_binlog=1`.
I have tracked this down to the `SQL_THREAD` in particular on the replica that is struggling to keep up.
I'm trying to work out why replication is so slow, given that no system resources appear to be saturated.
The write IOPS on the master does not exceed 2,500/s during short-lived spikes. The average is more like 470/s. Read IOPS are very low: no more than 100/s. We have a *baseline* capacity of 2,700/s IOPS because we have a 900GiB General Purpose SSD. See AWS docs on this:
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html. I also don't think the disk is the bottleneck because the DiskQueueDepth doesn't exceed 1.
I don't believe the issue is saturating the network because the `IO_THREAD` on the replica is keeping up with the master.
The CPU and memory usage on the replica seem ok. CPU doesn't go above 8% utilization on an 8 vCPU instance, and it has 8 GiB freeable memory.
Does anyone have any ideas for how I can track this down further?