I have a headache with GTID replication under MySQL 5.7; in that every now and again we start to get long lag times (detected by examining a pt-heartbeat controlled table) which drop suddenly every now and gain.
(See
https://imagebin.ca/v/4fbUIkRbzRhM for a graph of lag in seconds over time)
Running a show slave status shows that both Slave IO and Slave SQL threads are running, Slave SQL status is more often than not "System lock" and Retrieved GTID set and Executed GTID set are increasing however the lag despite its jumps is just not getting down.
I'm at a loss to explain the issue in detail as I simply don't know how to proceed at this point - Using iotop, I can see that the main MySQL process is heavily IO bound (95%+), but outside of that, CPU load is low.
We do run a number of apparently unsafe operations on the master - a far number of LOAD DATA INFILEs, a lot of TRUNCATEs and quite a few ALTER TABLEs all under script control however up until recently this has not been an issue.
The computer in question was recently (~7 days ago) rebuilt with Ubuntu 18.04.2 LTS and there is nothing else that this machine (the slave) is used for.
Please feel free to ask for more information - I'm simply unsure as to what to provide at this stage and I'd really appreciate some help.
Many thanks.