
Re: Advice on partitioning
Posted by: Rick James
Date: April 27, 2015 02:22PM

(Sorry, I am going to be a bit long-winded and repetitious...)

As I read the numbers, you are aiming at having about 10TB on each shard? And 100 shards. And 300 billion rows in the table (adding up all shards and/or partitions).

10TB on a machine, and modest randomness in the records touched, implies most queries will be I/O-bound. RAID striping and/or SSDs will help some. Increasing the number of shards also helps.

Assuming all the queries have "AND userid = <constant>", all the PARTITION methods listed will limit activity to one partition. But that does not necessarily have any benefit. It seems like a given shard will have enough connections from enough users to effectively need to hit all 100 partitions nearly all the time. That is, scattered blocks of indexes and data will be cached (in buffer_pool), but there will be a lot of churn in the cache.
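
A quick way to verify that pruning actually happens under any of the candidate schemes is EXPLAIN PARTITIONS. A sketch, assuming a hypothetical `emails` table with PARTITION BY HASH(userid) PARTITIONS 100 (your names will differ):

    EXPLAIN PARTITIONS
        SELECT * FROM emails WHERE userid = 12345;
    -- The 'partitions' column should name exactly one partition (here p45,
    -- since 12345 MOD 100 = 45); a long list means pruning did not happen.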

Let's look at a 'point' query into this table. With sharding, there are 3B rows (per shard); with sharding _and_ partitioning, there would be 30M rows per partition.

1. Decide which shard (outside the scope of this discussion) and connect to that shard.
2. (if partitioned), decide which partition.
3. Drill down a BTree of 3B (or 30M) rows. Such a BTree would be about 5 (or 4) levels. The cost of one level trades off with the cost of picking the partition. I do not know which is faster, but I suspect it is a minor difference.
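
(Back-of-envelope for those depths, assuming ~100 entries per 16KB index block; you can even let MySQL do the arithmetic:)

    SELECT  CEIL(LOG(100, 3000000000)) AS levels_3B_rows,   -- 5
            CEIL(LOG(100, 30000000))   AS levels_30M_rows;  -- 4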

Conclusion: A 'point query' does not benefit from partitioning.

Now let's look at a range query. If you are using KEY(`ownerId`,`parentID`,`LastModifiedDate`) and are looking for the latest 20 emails in a specified 'folder',...

You must have WHERE ownerId=constant AND parentID=constant ORDER BY LastModifiedDate DESC LIMIT 20

Note that parentID needs to be specified in order to get to the date. Is that the way the query works?
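
Spelled out as a sketch (the table name `emails` is my invention; adjust to your schema):

    -- Relies on KEY(ownerId, parentID, LastModifiedDate)
    SELECT  *
        FROM  emails
        WHERE  ownerId = 12345             -- the user
          AND  parentID = 6789             -- the folder
        ORDER BY  LastModifiedDate DESC
        LIMIT  20;
    -- With '=' on the first two index columns, the ORDER BY can walk the
    -- index backward: no filesort, and it stops after 20 entries.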

A user with a million emails really needs to have them in the same partition in the same shard. So far, all the proposals do that.

Let's walk through the action.

1. Get to shard
2. (if partitioned) Get to partition. Let's count this as a disk hit.
3. Drill into index BTree to the end of the emails for that owner+parent. This is likely to have the same 5 or 4 levels, etc; so the timing argument above applies.
4. Linearly scan that index. With, say, 100 entries per index block, all 20 records will be in one, maybe two blocks. Very cheap.
5. For each of the 20 index records, look up the data from the table. This is the costly part. Because of the low cacheability, etc., this is likely to take 20 non-cached reads. (Spoiler alert: later, I will describe how to improve on this.)

Total blocks read (approx): 5 + 2 + 20 = 27; number of non-cached blocks to read from disk: 1 + 1 + 20 = 22. Estimated time to perform the SELECT of 20 emails: 22 times the avg time for a disk hit. Let's say 220ms for spinning disk (even if striped) or 22ms for SSDs.

Dozens to hundreds of users could be serviced per second per shard (depending on spinning vs SSD, and on striping). Is this adequate? If not, then I suggest you start worrying about it now.

Let me digress for a minute and talk about loading the data. To load 10TB onto one shard and to have all the indexes/partitions set up will take time. Assuming that you ingest 10B emails per year (10TB), that would require INSERTing about 300 rows per second (10 billion / ~31.5 million seconds in a year). (I'm ignoring spikes.) Does this sound like what you are planning for? Can you do 300/sec? Does Partitioning help or not? Let's think about it. Let's say

* 10 rows of the table can fit in one InnoDB block (of 16KB). (A side note: it _may_ be beneficial to shrink the block size to 8KB or 4KB, especially if point-queries dominate; see the config sketch after this list.)
* 100+ rows of an index can fit in a block.
* Each data block and each index block needs to eventually be written to disk. (But hopefully only once it is full.)
* If the index updates are too random, then blocks will be flushed to disk and will need to be reread before all 10 (or 100) of their rows have been added. (Think of the read-modify-write cycle of updating a BTree, and the attempt to cache the reads/writes.) (I'll ignore block splits as being a small percentage.)
* Let's say inserts are pretty random on userId.
* Let's say the PRIMARY KEY starts with an AUTO_INCREMENT.
* Let's say there are 3 secondary indexes _and_ each starts with userId.
* Let's ignore redo log, double-write buffer, etc. (We should not.)
* Note: PARTITIONing is not a factor in any of this computation.
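
(The block-size side note, made concrete. This is a my.cnf setting, fixed when the data directory is initialized; it cannot be changed on an existing instance (MySQL 5.6+):)

    [mysqld]
    innodb_page_size = 8192   # 8KB instead of the default 16KB; 4096 is also allowed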

Conclusions:
* 1/10 actual disk write per row INSERTed. (Effectively appending to the 'end' of the table or partition.)
* 3/100 index-block writes per row (3 secondary indexes × 1/100 each).
* ?? reads per row, average.

So, at least 0.13 writes per INSERT. Times 300 rows/sec = at least 40 disk hits per second.

These 40 will compete with the reads needed for the SELECTs. That's about half of what a non-striped, non-SSD drive can handle. So, no panic. (But check the assumption of 10B emails/year, and my computations.)

If your 10TB is sitting on a SAN (eg, NetApp), there is some added latency to add into my computations.

But... if each user looks at the latest 20 emails each time he receives an email, then we need to compute based on (0.13 + 22) disk hits per incoming email -- over 6,000 disk hits per second at 300 emails/sec. This is likely to overwhelm even a striped set of SSDs.

What I am pointing out in this lengthy discussion is
* the size of the task,
* the randomness of the accesses,
* partitioning should not matter (in the actions mentioned so far).

Now, back to the benchmarking of PARTITION schemes. To reliably judge the techniques, one needs to
* try a big enough dataset -- especially one several times as big as innodb_buffer_pool_size. Also, keep in mind that an empty VARCHAR(500) occupies only 2 bytes. (I have been using 1KB for the estimated row size.) One could also shrink innodb_buffer_pool_size to compensate.
* Do both INSERTs and SELECTs.
* Do them with reasonable randomness.
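
For "reasonable randomness", something along these lines (the ranges are placeholders; match them to your data):

    -- Pick a random user and folder per iteration, so the benchmark does not
    -- hammer one hot, fully-cached user:
    SET @u := FLOOR(1 + RAND() * 1000000);
    SET @f := FLOOR(1 + RAND() * 10);
    SELECT  *
        FROM  emails
        WHERE  ownerId = @u
          AND  parentID = @f
        ORDER BY  LastModifiedDate DESC
        LIMIT  20;
    -- Interleave these with INSERTs at the projected ~300 rows/sec.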

Some other details:

"Deleting" a row by setting a flag is much like a point-query, but needs to log the action, hence will involve more writes. (I don't have a good feel for what metric to apply.)

Marking as read/unread is similar to a logical delete. Again, worse than a point-query.
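
Both sketched, with my invented flag columns:

    -- Logical delete: a point-query plus a write (data block, redo log,
    -- and any index that happens to contain the flag):
    UPDATE  emails
        SET  Deleted = 1
        WHERE  ownerId = 12345 AND id = 987654321;
    -- Marking read/unread is the same shape, just a different column:
    UPDATE  emails
        SET  isRead = 1
        WHERE  ownerId = 12345 AND id = 987654321;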

UPDATE ... SET parentId=... for one row is like a point query, plus a write. But it is done both for the data and for each index containing parentId. Hence it is more costly than a logical Delete. How often do you expect this to happen?

That Update, but with a set of, say, 50 emails -- Well, multiply by 50. This will be slow enough for a user to notice (and want to complain). So, you should do something to push it into the background. There is actually some caching here -- the user probably just recently brought up the "recent emails" list.
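
A sketch of that bulk move (the id list is shortened for readability; in practice it would hold ~50 ids):

    -- One data-row write per email, plus maintenance of every index
    -- containing parentId -- times 50. Push this into the background.
    UPDATE  emails
        SET  parentId = 777               -- the target folder
        WHERE  ownerId = 12345
          AND  id IN (101, 102, 103);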

Some size notes. (Smaller --> more cacheable --> less I/O --> faster and more capacity):
* Do you really need microsecond resolution on the DATETIMEs? That's 8 bytes vs 5.
* Can you put the several flags into a single SET (or other bit-oriented structure)?
* int unsigned DEFAULT '0', -- make it NOT NULL? NULLs take an extra bit.
* `Deleted` int unsigned -- 4 bytes for a flag?
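
Rolled together, a sketch of the shrunken datatypes (names invented; the PK choice is discussed next):

    CREATE TABLE emails_slim (
      id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
      ownerId   INT UNSIGNED NOT NULL,
      parentId  INT UNSIGNED NOT NULL DEFAULT 0,   -- NOT NULL: no null bit
      LastModifiedDate  DATETIME NOT NULL,         -- 5 bytes; DATETIME(6) would be 8
      flags     SET('deleted','read','flagged') NOT NULL DEFAULT '',
                                                   -- 1 byte covers up to 8 flags
      PRIMARY KEY (id)
    ) ENGINE=InnoDB;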

Remember the 22 disk hits? The following will shrink that down to 5:
PRIMARY KEY(userId, id)
INDEX(id)
Notes:
* AUTO_INCREMENT (for `id`) needs nothing more than "some index starting with id".
* InnoDB PK is clustered with the data. Hence the emails for the user will be clustered (in id order) together. A user with one mailbox will have 20 emails clustered in 2-3 blocks, not scattered across 20.

(You may decide that this is better: PRIMARY KEY(userId, parentId, id))
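
In DDL form, a sketch (again with invented names; keep your other columns and secondary indexes):

    CREATE TABLE emails (
      id        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
      userId    INT UNSIGNED NOT NULL,
      parentId  INT UNSIGNED NOT NULL,
      LastModifiedDate  DATETIME NOT NULL,
      -- ... other columns ...
      PRIMARY KEY (userId, id),  -- clusters each user's emails together on disk
      INDEX (id)                 -- all AUTO_INCREMENT needs: an index starting with id
    ) ENGINE=InnoDB;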

A certain large photo-sharing site recently went through the agony of changing their main PK in a similar way. The benefit was a 40% improvement in general.

Once the PK _starts_ with the proposed PARTITION KEY, there is even less need for partitioning.

Summary.
* I have argued that PARTITIONing will not buy much, if any, speed.
* Some datatypes could be shrunk.
* Sizing analysis is on the tight side.
* (This may be the most important advice) Change the PRIMARY KEY to get clustering that benefits the typical queries.

To someone who argues that Oracle (etc) can do a better job, I will counter that they have the same number of disk hits and caching possibilities, hence cannot necessarily do much better.

(I am available for short term consulting. ora at rjweb.org )
