Hi Rick,
Many thanks for your prompt feedback and the great hints.
As a matter of fact, I started studying MySQL partitioning from your blog (http://mysql.rjweb.org/doc.php/partitionmaint).
Again, great piece of comprehensive knowledge!
Please find my answers inline.
Rick James Wrote:
-------------------------------------------------------
> 96 PARTITIONs -- that's about the practical limit.
> (There are inefficiencies when having too many
> partitions; 5.7 / 10.1 may fix some of them.)
[Miko] Ok, thanks for the hint.
>
> > Estimated number of records inserted per hour:
> 20M (peak 50M)
>
> 50M INSERTs/hr = 14K/sec => several performance
> issues. First, some questions:
> * Disks? Rotating? SSDs? RAIDed?
> * Batch insert? LOAD DATA? Multi-threaded?
> * How random are the indexes?
[Miko] Yes, 14k per sec at peak.
Storage: a decent high-end RAID10 spread over 48 FC disks, approx. 8k+ IOPS (50/50 reads/writes, random/sequential).
I've tested the maximum insert rate: it can handle up to 18k per sec, and the CPU seemed to be the bottleneck rather than the I/O (I have rarely seen the CPU, not the I/O, be the bottleneck with this kind of operation).
Inserts: unfortunately not batch/multi-valued inserts, just a JDBC stream from MongoDB that looks like:
BEGIN;
insert into Table1 (col1, col2, ..., col19) values (val1, val2, ..., val19) ON DUPLICATE KEY UPDATE col1 = col1;
insert into Table1 (col1, col2, ..., col19) values (val1, val2, ..., val19) ON DUPLICATE KEY UPDATE col1 = col1;
...
COMMIT;
(col20 is not listed here; it is the partition key, and its value is auto-generated when a record is inserted)
About the indexes:
Index 1: a TIMESTAMP column (DEFAULT CURRENT_TIMESTAMP)
Index 2: a timestamp stored as VARCHAR ('YYYY:MM:DD:HH24:MI:SS:MS')
Index 3 (planned): a BIGINT stored as VARCHAR; high cardinality, 10M+ possible distinct values
> You really ought to have a PRIMARY KEY, even if it
> is an artificial AUTO_INCREMENT. (If you don't
> specify a PK, InnoDB gives you a _hidden_ 6-byte
> int-like thingie.)
[Miko] Yes! But such an artificial AUTO_INCREMENT would have to be part of the PK together with the partition key, and it would have to be a BIGINT (8 bytes) for such a large number of records. I am aware that the implicit 6-byte key is added as long as there is neither a UK nor a PK. In fact I did think about such an artificial AUTO_INCREMENT key, but saw no real reason to add one (that is definitely my lack of experience in this area showing).
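Just to check my understanding, this is roughly what such a PK would look like (layout abbreviated; `id` and the partition names are made up, col20 is my real partition key):

CREATE TABLE Table1 (
  id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  col1  INT UNSIGNED NOT NULL,
  -- ... col2 through col19 ...
  col20 TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id, col20)   -- every PK/UK must include the partition key
) ENGINE=InnoDB
PARTITION BY RANGE (UNIX_TIMESTAMP(col20)) (
  PARTITION p2015080100 VALUES LESS THAN (UNIX_TIMESTAMP('2015-08-01 01:00:00')),
  PARTITION p2015080101 VALUES LESS THAN (UNIX_TIMESTAMP('2015-08-01 02:00:00'))
);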
> > 5.2.10-MariaDB
> > Change version to mariadb10.0.x to support
> better partition pruning (it is almost not
> supported now)
>
> Good.
[Miko] Ok, thanks again.
> > RANGE partitioning
>
> Good. See my blog for more discussion:
>
http://mysql.rjweb.org/doc.php/partitionmaint
[Miko] As I said before, a great source of information; I look at it on a regular basis.
> > Add an additional index key
>
> Caution -- each INDEX must be updated for every
> row INSERTed. If they are random, then this could
> lead to I/O thrashing.
[Miko] Yes, I know. It definitely does not help with inserts!
> > Enable replication (and bin log)
>
> Fine. Note that the longer you wait, the bigger
> the dump/reload will be to initialize the Slave.
> And the gotchas may bite you.
[Miko] Yes, but my headache is how much replication slows down the insertion process. In fact, to speed up inserts I decided to disable the binlog; now I have to enable it again. But this is a matter of testing and measuring the performance degradation (which I am going to carry out in the near future).
> > Use temp tables / heap storage engine to copy in
> data from a single partition of Table1 and play
> with inserted records (do aggregations and store
> results into other tables)
>
> Excellent! (I won't need to lecture on such.)
> I found this to be especially efficient:
> 1. Load batches of rows into a MEMORY table in a
> _separate_ database.
> 2. Aggregate from that table into the Summary
> tables
> 3. Load from that table into the big Fact table.
> Furthermore,
> * Use RBR (Row Based Replication)
> * binlog_ignore_database = the database with the
> MEMORY table.
> This will cut down significantly on the
> replication traffic.
[Miko] That's a great recommendation! I will test it very soon and keep you posted!
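To make sure I test the right thing, here is my reading of it in SQL (names are placeholders: database `staging`, MEMORY table `staging.batch` with just two sample columns, and `Summary` with a UNIQUE key on `hr`):

CREATE DATABASE IF NOT EXISTS staging;
CREATE TABLE staging.batch (
  ts  TIMESTAMP NOT NULL,
  val INT UNSIGNED NOT NULL
) ENGINE=MEMORY;

-- 1. Load a batch of rows into the MEMORY table:
INSERT INTO staging.batch (ts, val) VALUES
  ('2015-08-01 10:00:01', 7),
  ('2015-08-01 10:00:02', 9);

-- 2. Aggregate from it into the Summary table:
INSERT INTO Summary (hr, cnt, total)
SELECT DATE_FORMAT(ts, '%Y-%m-%d %H:00:00'), COUNT(*), SUM(val)
  FROM staging.batch
 GROUP BY 1
ON DUPLICATE KEY UPDATE cnt = cnt + VALUES(cnt), total = total + VALUES(total);

-- 3. Load from it into the big fact table, then empty the stage:
INSERT INTO Table1 (col20, col14) SELECT ts, val FROM staging.batch;
DELETE FROM staging.batch;

and in my.cnf (if I read the manual right, the actual option name is binlog_ignore_db):

binlog_format    = ROW
binlog_ignore_db = staging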
> > I'm aware of costly JDBC inserts (no batch
> inserts, autocommit, etc.)
>
> You _will_ need to improve on that in order to
> achieve 50M inserts/hour. 90% of the cost of a
> single-row INSERT is overhead that could be
> amortized if you batch the inserts.
[Miko] I wish I could... This "JDBC stream" comes from a 3rd party; it would be hard to change anything in it unless there is a blocking point (and, unfortunately for me, performance degradation is not considered a blocking point).
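The one lever I might have without touching their code is the JDBC URL: if they happen to use Connector/J with addBatch()/executeBatch() somewhere, this setting makes the driver rewrite batches into multi-row INSERTs (untested against their stack, just an idea; host and database names are made up):

jdbc:mysql://dbhost:3306/mydb?rewriteBatchedStatements=true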
> > and "ON DUPLICATE KEY UPDATE"
>
> If you can gather thousand(s) of rows at a time in
> the MEMORY table, then about 2 SQLs (an
> INSERT..SELECT..LEFT JOIN, and a UPDATE..JOIN)
> will
> * replace the IODKU, and
> * be 2 COMMITs (autocommit) instead of
> thousand(s)
> (If you want the details on the 2 queries, I'll
> scratch my head and reconstruct them. Or maybe I
> will finish my blog on such.)
[Miko] Sounds tempting. Could you please expand on that a little bit more or point me to some further reading in your blog / on the Internet?
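Meanwhile, my guess at the two statements, so you can correct me (`staging.batch` and the unique key column `uk` are placeholders):

-- 1) INSERT the rows whose key is not yet in the target:
INSERT INTO Table1 (uk, col1)
SELECT s.uk, s.col1
  FROM staging.batch s
  LEFT JOIN Table1 t ON t.uk = s.uk
 WHERE t.uk IS NULL;

-- 2) UPDATE the rows whose key already exists:
UPDATE Table1 t
  JOIN staging.batch s ON s.uk = t.uk
   SET t.col1 = s.col1;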
> > just partition drops.
>
> After 96 (or 36) hours?
[Miko] I've developed a piece of software: it takes its input from a "management" table, creates partitions ahead of time, and drops old partition(s) automatically once particular thresholds are reached (the size of the table, for example). It works well so far.
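For reference, the statements it generates look like this (hourly partitions; the partition names here are made up):

-- create the next hourly partition ahead of time:
ALTER TABLE Table1 ADD PARTITION (
  PARTITION p2015080113 VALUES LESS THAN (UNIX_TIMESTAMP('2015-08-01 14:00:00'))
);
-- drop the oldest one once a threshold (e.g. table size) is exceeded:
ALTER TABLE Table1 DROP PARTITION p2015072813;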
> > `col14` smallint(5) unsigned DEFAULT '0',
>
> Two more lectures averted -- SMALLINT (and other
> sizes), and UNSIGNED.
> What about "NOT NULL"?
[Miko] Point taken! I will add the "NOT NULL" clause. Some of the INT-type columns were previously part of the PK, which is why they are already "NOT NULL".
> > ### my.cnf
>
> (SHOW VARIABLES is better. Possibly not necessary
> at the moment. But if you would like tuning
> advice, provide SHOW VARIABLES and SHOW GLOBAL
> STATUS.)
>
> > query_cache_size=8M
>
> Hardly worth having. And, since you have a high
> ingestion rate, you probably have a high QC purge
> rate. Recommend
> query_cache_size=0
> query_cache_type=OFF
[Miko] Ok. Thanks!
> > lower_case_table_names=1
>
> Windows?
[Miko] No, it is Linux.
> > table_cache=1024
> -->
> table_open_cache=5000
> Note that each partition is effectively a "table".
[Miko] Ok, I will change it; it was set up initially for 24 partitions. Thanks!
>
> > innodb_flush_log_at_trx_commit=0
>
> Aha, you already found that one. Recommend 2 over
> 0; hardly any slower, possibly more secure.
[Miko] Ok. Thanks!
> > insert into Table1(col1,col2,..., col19) values
> (val1, val2,..., val19) ON DUPLICATE KEY UPDATE
> col1 = col1
>
> Ouch. IODKU needs a UNIQUE (or PRIMARY) key in
> order to check for "DUPLICATE KEY"!
[Miko] So basically, provided that there is neither a PK nor any UK, such syntax does not result in performance degradation on inserts. Is that correct?
> Did you think about INSERT IGNORE?
[Miko] Well, as I said, this stream comes from the 3rd party; it will be (very) hard to change anything inside it.
> After you upgrade, be sure to set
> innodb_buffer_pool_instances = 8 -- or some value
> > 1
[Miko] Ok, thanks. I need to read a little more to get a better understanding of the configuration details.
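For my own notes, the my.cnf changes agreed in this thread so far:

query_cache_size = 0
query_cache_type = OFF
table_open_cache = 5000               # each partition is effectively a "table"
innodb_flush_log_at_trx_commit = 2    # 2 instead of 0
innodb_buffer_pool_instances = 8      # after the upgrade to 10.0.x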
> > Average record size 209B
> > 800 million of records in the table (number of
> partitions: 36)
> > 64GB RAM,
> > innodb_buffer_pool_size=32768M
> > Buffer pool hit rate 1000 / 1000
>
> --> 167 GB for the table. (Plus indexes?)
> However,... Before the table gets to 32GB (re
> buffer_pool), you _may_ start being I/O bound.
> This is where the randomness of the indexes, and
> other issues, comes into play.
> The hit rate suggests that you have not hit the
> problem _yet_.
[Miko] Hmm, that sounds scary. As of now the database size is 60GB (data + indexes); I have not observed any issues, but the insert rate is indeed very low (AVG 500 per sec, MAX 1500 per sec).
> Oh, one more question -- Are the records inserted
> in roughly timestamp order? That is, is the
> 'latest' partition hot while the others are cold?
[Miko] Yes, exactly! The partition key is generated for every inserted record (`col20` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP), so inserts always land in the 'latest' partition.
> Another thing to note: in 5.2 (I think), INSERT
> does no PARTITION pruning. Another reason to
> upgrade. (No, I don't know for a fact that 10.0's
> INSERT will prune.)
[Miko] Definitely!
>
> If all the activity is in one partition, then the
> "working set" is only 1.7GB (167GB/96), which is
> much less than 32GB. This implies that even a
> 'random' index won't lead to thrashing.
[Miko] Yes! The "true" working set is relatively small.
> Too much information? Probably not. And I love
> expounding on scaling/performance of MySQL.
[Miko] No, not at all! Each and every hint is highly appreciated!
> Can any of those VARCHARs be normalized in order
> to shrink the disk footprint? The normalization
> is a couple of SQLs while the data is still in the
> MEMORY table -- very cheap and efficient.
[Miko] Yes, if normalised it would be much better in terms of footprint, I/O, and so on. As I said, the insert "stream" comes from the 3rd party.
Theoretically the data could be loaded into a temp/MEMORY table, and triggers/procedures could check the varchars, update the "dictionary" table whenever a new varchar appears, replace the varchars in the temp/MEMORY table with the appropriate hashes from the "dictionary" table, and so on.
It could be done, but I am not sure whether it makes sense.
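If I ever try it, the dictionary step would be roughly this (all names hypothetical; `dict.name` has a UNIQUE index, and I used a plain AUTO_INCREMENT id instead of a hash):

-- 1) register any strings not seen before:
INSERT IGNORE INTO dict (name)
SELECT DISTINCT name FROM staging.batch;

-- 2) swap the strings for their dictionary ids:
UPDATE staging.batch s
  JOIN dict d ON d.name = s.name
   SET s.name_id = d.id;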