MySQL Forums
Forum List  »  InnoDB

Re: Redundant InnoDB lock
Posted by: Jakub Lopuszanski
Date: February 17, 2020 06:06AM

Hello!

Thanks for interesting question :)

Indeed, locking the row in secondary index for key 'b' with b=3,a=3 itself (instead of just the gap before it) seems unnecessary.
It doesn't cause any correctness issues though, so I would not use the word "bug".

Please allow me to elaborate, to make sure we both understand the situation.

MySQL has a handy tool for analysing scenarios involving more than one client, called mysql-test-run.pl, or mtr for short - you can find it in mysql-test subdirectory.
So, I've adapted your scenarios to mtr scripts, so I can reproduce them easily.


--echo "#"
--echo "# SCENARIO WITH PRIMARY KEY (a)"
--echo "#"

CREATE TABLE foo (
a INT NOT NULL,
b INT NOT NULL,
c CHAR(1),
PRIMARY KEY (a),
KEY (b)
) ENGINE=InnoDB;

INSERT INTO foo VALUES (1,1,'A'), (3,3,'B'), (5,5,'C'), (7,7,'D'), (9,9,'E');

--connect (c1, localhost, root,,)
START TRANSACTION;
SELECT * FROM foo WHERE a < 2 FOR UPDATE;
--echo "# I'd expect exclusive locks by c1 on PRIMARY KEY row a=1 and gap before it, and gap before row a=3"
--connection default
SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data FROM performance_schema.data_locks WHERE lock_type='RECORD' AND object_name='foo';

--connect (c2, localhost, root,,)
START TRANSACTION;
DELETE FROM foo WHERE a = 3;

--echo "# I'd expect additional exclusive lock by c2 on PRIMARY KEY row a=3 (which is delete marked and not yet purged, so read views can see state of DB before uncommitted c2)."
--connection default
SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data FROM performance_schema.data_locks WHERE lock_type='RECORD' AND object_name='foo';

--connection c1
COMMIT;

--echo "# I'd expect only the exclusive lock by c2 on PRIMARY row a=3."
--connection default
SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data FROM performance_schema.data_locks WHERE lock_type='RECORD' AND object_name='foo';

--connection c2
COMMIT;

--connection default
--disconnect c2
--disconnect c1
DROP TABLE foo;


As you can see I'm querying the performance_schema.data_locks to see what data locks are being held.
This table is documented at https://dev.mysql.com/doc/mysql-perfschema-excerpt/8.0/en/data-locks-table.html.

It's important to understand several things when interpreting content of data_locks table:
- the foo table has two indexes (PRIMARY and KEY(b)), which can be thought of as two separate collections of rows
- the PRIMARY KEY can be imagined as a sorted list of (a,b,c) tuples: (1,1,'A'), (3,3,'B'), (5,5,'C'), (7,7,'D'), (9,9,'E')
- the secondary KEY(b) can be imagined as a sorted list of (b,a) tuples: (1,1), (3,3), (5,5), (7,7), (9,9)
- if we imagine these sorted tuples as laying on an axis, then a record lock can have several types:
- "REC_NOT_GAP" - such lock protects just the tuple itself
- "GAP" - such lock protects just the gap before (to the left of) the tuples
- "", the default - such lock protects both: the gap before and the tuple itself
- locks can be e"X"clusive or "S"shared. (Currently this doesn't matter for the gap before record - even "X" is treated as "S" for the gap part of a lock)

If you place the above test in mysql-test/suites/innodb/t/ and run it with
./mtr --record forum.test
it will generate a mysql-test/suites/innodb/r/forum.result file, in which you should see following fragments:


"# I'd expect exclusive locks by c1 on PRIMARY KEY row a=1 and gap before it, and gap before row a=3"
SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data FROM performance_schema.data_locks WHERE lock_type='RECORD' AND object_name='foo';
engine_transaction_id index_name lock_mode lock_status lock_data
1816 PRIMARY X GRANTED 1
1816 PRIMARY X,GAP GRANTED 3
START TRANSACTION;

Indeed the observation matches the expectation: X,GAP on 3 means that the row itself is not protected, just the gap spanning range between 1 and 3 (non-inclusive).
And as expected, this allows the c2's DELETE to succeed:

"# I'd expect additional exclusive lock by c2 on PRIMARY KEY row a=3 (which is delete marked and not yet purged, so read views can see state of DB before uncommitted c2)."
SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data FROM performance_schema.data_locks WHERE lock_type='RECORD' AND object_name='foo';
engine_transaction_id index_name lock_mode lock_status lock_data
1817 PRIMARY X,REC_NOT_GAP GRANTED 3
1816 PRIMARY X GRANTED 1
1816 PRIMARY X,GAP GRANTED 3

Here c1=1816 and c2=1817 (actual transaction ids may vary from run to run, obviously).
And we see that c2 successfully got GRANTED the X,REC_NOT_GAP lock on the row a=3, despite c1 having X,GAP lock on a=3.
Cool.

Now, let's move on to the next scenario.
I've modified it a bit to clarify one potential confusion.
You've wrote:
> Locking the index entry with b = 3 depends only on the uniqueness of the index used.
I'll demonstrate that locking happens even if the index on 'b' is UNIQUE. What matters is rather that it is a secondary (=not PRIMARY) index.

--echo "#"
--echo "# SCENARIO WITH UNIQUE KEY (b)"
--echo "#"

CREATE TABLE foo (
a INT NOT NULL,
b INT NOT NULL,
c CHAR(1),
PRIMARY KEY (a),
UNIQUE KEY (b)
) ENGINE=InnoDB;

INSERT INTO foo VALUES (1,1,'A'), (3,3,'B'), (5,5,'C'), (7,7,'D'), (9,9,'E');

--connect (c1, localhost, root,,)
START TRANSACTION;
SELECT * FROM foo WHERE b < 2 FOR UPDATE;
--echo "# I'd expect exclusive locks by c1 on KEY 'b' row b=1,a=1 and gap before it, and gap before row b=3,a=3, and exclusive lock on PRIMARY KEY row a=1,"
--echo "# instead the lock on b=3,a=3 covers not only gap, but also record"
--connection default
SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data FROM performance_schema.data_locks WHERE lock_type='RECORD' AND object_name='foo';

--connect (c2, localhost, root,,)
START TRANSACTION;
--echo "# because c1 latched the INDEX 'b' row b=3,a=3, this has to wait:"
--send
DELETE FROM foo WHERE b = 3;

# one could use SET DEBUG_SYNC='lock_wait_will_wait SIGNAL c2_will_wait' before DELETE in c2, and SET DEBUG_SYNC='now WAIT_FOR c2_will_wait' here instead of sleep,
# but this wouldn't work on release build of mysqld
--sleep 1
--echo "# I'd expect additional exclusive WAITING lock by c2 on KEY 'b' row b=3,a=3"
--connection default
SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data FROM performance_schema.data_locks WHERE lock_type='RECORD' AND object_name='foo';

--connection c1
COMMIT;

--echo "# I'd expect exclusive lock by c2 on KEY 'b' row b=3,a=3, and exclusive lock on PRIMARY KEY row a=3"
--connection default
SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data FROM performance_schema.data_locks WHERE lock_type='RECORD' AND object_name='foo';

--connection c2
--reap
COMMIT;

--connection default
--disconnect c2
--disconnect c1
DROP TABLE foo;

Let's focus on the most intriguing part of the result:

"# I'd expect exclusive locks by c1 on KEY 'b' row b=1,a=1 and gap before it, and gap before row b=3,a=3, and exclusive lock on PRIMARY KEY row a=1,"
"# instead the lock on b=3,a=3 covers not only gap, but also record"
SELECT engine_transaction_id,index_name,lock_mode,lock_status,lock_data FROM performance_schema.data_locks WHERE lock_type='RECORD' AND object_name='foo';
engine_transaction_id index_name lock_mode lock_status lock_data
1841 b X GRANTED 1, 1
1841 b X GRANTED 3, 3
1841 PRIMARY X,REC_NOT_GAP GRANTED 1

Here, the (b=3,a=3) tuple is latched together with the gap before it - this is the default in our code.
It takes extra development effort to implement (and prove correctness of) locking in REC_NOT_GAP or GAP mode.
Note that this locking happens even though the KEY (b) is UNIQUE.

The logic of choosing correct lock mode was recently (8.0.18) fixed for PRIMARY KEY, and this is why you don't see problems with it for PRIMARY KEY.
The logic is now contained in https://github.com/mysql/mysql-server/blob/8.0/storage/innobase/row/row0sel.cc#L4290 row_compare_row_to_range().
You can see an explicit `if (index == clust_index` condition in it:
https://github.com/mysql/mysql-server/blob/8.0/storage/innobase/row/row0sel.cc#L4362
(clust_index is short for clustered index, which is a different way of saying "PRIMARY KEY").

If I remove this `index == clust_index && ` from the condition, recompile, and run the test again, I see a change in the locks taken:
engine_transaction_id index_name lock_mode lock_status lock_data
1841 b X GRANTED 1, 1
-1841 b X GRANTED 3, 3
1841 PRIMARY X,REC_NOT_GAP GRANTED 1
+1841 b X,GAP GRANTED 3, 3

This is probably a change you would welcome: only the gap before (b=3,a=3) - but not the tuple itself - is locked.
As expected, it follows that c2 can successfully lock and remove the row with (b=3,a=3).

What's wrong with such a change?
It took more than a month to implement, verify and test the previous change for PRIMARY KEY.
The `if` in question is "guarded" by this comment:

/* Following heuristics are meant to avoid locking the row itself, or even
the gap before it, in case when the row is "after the end of range". The
difficulty here is in that the index itself can be declared as either
ascending or descending, separately for each column, and cursor can be
PAGE_CUR_G(E) or PAGE_CUR_L(E) etc., and direction of scan can be 0,
ROW_SEL_NEXT or ROW_SEL_PREV, and this might be a secondary index (with
duplicates). So we limit ourselves just to the cases, which are at the
same common, tested, actionable and easy to reason about.
In particular we only handle cases where we iterate the index in its
natural order. */

What is it all about? If you declare the index as DESC then it has tuples sorted in the descending order:
(9,9), (7,7), (5,5), (3,3), (1,1)
This changes what "gap before record" mean: now it means values larger than the record.
So, for example, now X,GAP on (3,3) would protect (4,4), but not (2,2).
This clearly changes some decisions to be made, as now we would probably prefer to only X lock (1,1) in default mode.
But wait, there is more. The scan itself can be ascending or descending depending on the ORDER BY clause.
Moreover, the sign of inequality can be either "sharp" as in b<3 or "soft" as in b<=3.
Exponential explosion of cases escalates quickly...
But wait, there is more, if the index is non-unique, we have additional fun, of having, say (b=3,a=5),(b=3,a=3),(b=3,a=2) in it,
and having to figure out which (if any) of the "siblings" to lock, which gaps, etc.
Also, the index can be defined on multiple columns (KEY (b,c)) while the query can reference just some prefix of it (WHERE b<3), or even a prefix of a column if it is a string.
I'm sure how NULL values (in particular, in "tiebreaker" columns) should be handled for various sort orders.
Part (perhaps large) of the problem is that row_compare_row_to_range() is used inside row_search_mvcc() function, which spans 1500+ lines of code starting from:
https://github.com/mysql/mysql-server/blob/8.0/storage/innobase/row/row0sel.cc#L4416
- not a toy puzzle to analyse.
You may want to look into ./mysql-test/suite/innodb/t/lock_end_of_range.test which demonstrates some (~400) cases we've tested for PRIMARY KEY to make "sure" it works as it should.

So, there would be some non-trivial development effort, to "just" remove this one conjunct from the `if`.
As I wrote above: I wouldn't use the word "bug", because there is no data loss, wrong answer, etc. - just less concurrency than one could imagine in a different implementation (which might have other issues in other scenarios).
And I'm reluctant to even use the word "inefficiency", given that "number of resources locked" is just one axis, some of others being "simplicity of design" and "speed of development".
I'd love to see a well tested simple contribution for this "issue", though.
As an encouragement, I've run the full ./mtr test suite with the "index==clust_index&&" removed, and it seems to pass most tests, except for:
Failing test(s): sys_vars.innodb_monitor_reset_all_basic sys_vars.innodb_monitor_enable_basic innodb.monitor i_main.update sys_vars.innodb_monitor_reset_basic sys_vars.innodb_monitor_disable_basic

So, now it "only" remains to prove somehow the optimization is really the correct thing to do (or find and handle counterexamples) and fix those test cases and/or code logic.

Options: ReplyQuote


Subject
Views
Written By
Posted
227
February 10, 2020 06:04AM
Re: Redundant InnoDB lock
96
February 17, 2020 06:06AM
68
February 17, 2020 10:59AM
62
February 18, 2020 06:42PM


Sorry, only registered users may post in this forum.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.