MySQL Forums
Forum List  »  Newbie

Re: Partitioning or Not Partitioning?
Posted by: Rick James
Date: November 08, 2014 05:10PM

> single customer ... for example the last 3 months

That smells like the classic Data Warehouse problem and the classic solution of Summary Tables. See
http://mysql.rjweb.org/doc.php/datawarehouse

Summary tables will significantly improve system performance. It provides an optimization that PARTITION cannot.

> in the customer single table, I can use partitioning by date

A Rule of Thumb: Don't partition unless you have at least a million rows. Will each customer have a million rows? If so, we can continue to discuss the multi-table + PARTITION option.

Partitioning on date provides _no_ benefit over an equivalent compound index that includes date. (At least not in the vague example you give. Please formulate a real SELECT, and provide the corresponding CREATE TABLE. For discussing it, I don't care whether it has partitions or not.)

5000 rows/second = 40B (B as in billion!) rows in 3 months. That would be several terabytes. Do you have that much disk space?

Furthermore, is that 5000/second _per_ customer? If so, you are talking about PETABYTES! If so, we need to talk about sharding, partitioning, ingestion speed, etc, etc.

Even if it is just terabytes, we need to look very seriously at the number of indexes on the table. Even one secondary index could make it _impossible_ to INSERT 5000 rows/second, whether partitioned or not, whether using SSDs or not, etc.

I repeat, please provide SHOW CREATE TABLE. Even it it is not the final version.

Another thing to watch carefully in terabyte-sized tables: Datatypes. Don't use BIGINT if INT will do. Don't use INT if MEDIUMINT will do. Etc. Normalize where that would save space in the Fact table and disk space overall. Use ENUM or TINYINT UNSIGNED instead of VARCHAR when appropriate. Use ascii, not utf8 for hex string; or pack into BINARY.

Shaving even 1 byte off each row in a 40B-row table saves 40GB. Smaller -> more cacheable -> faster.

> security risk is better than 1 only table

Does the customer create the SQL by himself? Or does he interact with a UI that could have the power to restrict him to his data?

If you are worried about security, you should not let the user write the SQL.

RAID: These provide striping, which is beneficial for your case: RAID-0, -5, -10. RAID-1 is just mirroring; not beneficial (for this task).

If you have 4TB of data on a machine with 32GB of RAM, less than 1% of the data can be cached. For users throwing arbitrary queries (3 months / 1 day / last month / ...), the cache is almost useless. Again, I extol the virtues of Summary Tables.

I have built systems with hundreds of GB of data, plus a UI that provided near-instantaneous responses for hundreds of semi-arbitrary queries. Again, the secret was Summary Tables. The "raw data" (the Fact table) was almost never touched. The Summary tables totaled only a few GB, so they mostly stayed cached.

I say "semi-arbitrary queries" because, in real life, there are only a few queries that a user "needs". He wants to see the details (Fact table access) for a few items -- hitting the disk on the PRIMARY KEY a few dozen times is acceptable. He wants some courser details for the past day or week -- a moderately fine-grained Summary Table that is well cached handles this. He wants some gross statistics for the past year or two (courser Summary Tables). He almost never wants details from 7 months ago. I do _not_ let him build SQL, other than allowing him to say "how many days/months to go back in time", which category/subcategory to probe, and a few other things.

The antique details languish on disk, not cluttering the cache (buffer_pool). The stuff from the past week is mostly cached, ready for repeated probing. This week's data gradually slides off to disk, as new data comes in.

Options: ReplyQuote


Subject
Written By
Posted
Re: Partitioning or Not Partitioning?
November 08, 2014 05:10PM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.