MySQL Forums
Forum List  »  Performance

Re: RAM consideration/ IO usage
Posted by: Rick James
Date: January 16, 2015 02:14PM

You have a big task. But it should be possible using stock MySQL.
We can communicate over this forum, or you could hire me to advise you on the architecture. You obviously have issues with hardware (disk and RAM), ingestion speed, query speed, PARTITIONing, lat/lng coding, etc. I have experience with all those. Contact me at ora (at) rjweb.org. If your schema is proprietary, there will be a communication problem through this forum.

Meanwhile, some thoughts...

> 14 * (10^9) records at the end of two years.
> expecting 40,000 insert requests per second

7E9 rows/year is an average of only about 220 rows per second, but your 40K/sec peak cannot be achieved on ordinary hardware without doing at least one of several things. Please provide the SHOW CREATE TABLE. And explain how (batched, multi-threaded, etc) you are doing the INSERTs.

Are you expecting huge bursts? Otherwise where does 40K come from? And, what if the 40K were spread out over, say, a minute?

The table's indexes are critically important to ingestion speed. This is, indirectly, a reason for using summary tables. (I will discuss that in more depth after seeing more details.)

Can MySQL handle 40K inserts/sec? It depends. I have a blog (as yet unpublished) on high-speed ingestion. I'll get that to you.
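Meanwhile, a rough sketch of what "batched" means (table and column names here are invented; substitute your own):

```sql
-- Hypothetical sketch: one multi-row INSERT instead of many single-row INSERTs.
-- Batches of 100-1000 rows can be roughly 10x faster than row-at-a-time,
-- and several client threads can run such batches in parallel.
INSERT INTO events (dt, device_id, lat, lng, val)
VALUES
  ('2015-01-16 14:00:01', 101, 37.7749, -122.4194, 3.2),
  ('2015-01-16 14:00:01', 102, 40.7128,  -74.0060, 1.7),
  ('2015-01-16 14:00:02', 103, 51.5074,   -0.1278, 9.9);
  -- ... a few hundred rows per statement
```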

> How should I account for this in the hardware configuration/mysql configuration.

You cannot solve it simply by hardware. (More on hardware below.)

> 1. We are considering aggregating data which are older than 3 months, which will then result in a BAD schema as you say it - * huge datasets that get purged of old data daily/monthly .

On the contrary. Aggregations are usually one-tenth the size of the raw data. My philosophy is to purge old raw data, but don't purge aggregations. Please give details on the aggregations you need.
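To make "aggregation" concrete, here is a hypothetical hourly summary table (all names invented; the real design depends on the reports you need):

```sql
-- Hypothetical sketch: an hourly rollup of a raw `events` table.
CREATE TABLE events_hourly (
  hr        DATETIME NOT NULL,       -- the hour, e.g. '2015-01-16 14:00:00'
  device_id INT UNSIGNED NOT NULL,
  ct        INT UNSIGNED NOT NULL,   -- COUNT(*) of raw rows in that hour
  sum_val   DOUBLE NOT NULL,         -- keep SUMs, so averages can be computed later
  PRIMARY KEY (device_id, hr)
) ENGINE=InnoDB;

-- Roll up one hour of raw data; re-runnable because of ON DUPLICATE KEY:
INSERT INTO events_hourly (hr, device_id, ct, sum_val)
  SELECT '2015-01-16 14:00:00', device_id, COUNT(*), SUM(val)
    FROM events
    WHERE dt >= '2015-01-16 14:00:00'
      AND dt <  '2015-01-16 15:00:00'
    GROUP BY device_id
  ON DUPLICATE KEY UPDATE ct = VALUES(ct), sum_val = VALUES(sum_val);
```

A rollup like this is typically one-tenth the size of the raw data (or smaller), which is why you can keep it forever while purging the raw rows.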

> older than 3 months

There are multiple ways to aggregate. "Older than xx days" is the least desirable (in my opinion). I usually do it hourly or daily, or as part of the high-speed ingestion process. (We can discuss this in more depth.)

Purging old data should (because of the table size) be done using PARTITIONing. Read my blog on such:
http://mysql.rjweb.org/doc.php/partitionmaint
Huge DELETEs are too costly for your dataset size. (I have a blog on that, but I don't think it will be needed.)
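The monthly purge then looks roughly like this (partition names invented; see the blog for the full pattern):

```sql
-- Hypothetical sketch: purging a month via partition maintenance.
-- DROP PARTITION is a near-instant metadata operation, versus a
-- multi-hour DELETE that would bloat the undo logs.
ALTER TABLE events DROP PARTITION p201301;          -- discard the oldest month

-- Carve the next month's partition out of the catch-all "future" partition:
ALTER TABLE events REORGANIZE PARTITION pFuture INTO
  (PARTITION p201503 VALUES LESS THAN (TO_DAYS('2015-04-01')),
   PARTITION pFuture VALUES LESS THAN MAXVALUE);
```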

> Should I pick an instance with high RAM value, or look at IOPS options. Since our system requires mostly INSERTS, should IOPS be the important factor?

I must see the schema before answering that. Also, I must see the tentative SELECT statements. And the ingestion process needs discussing.

> 4. We do store lat/long values for some events, and our hope was to use "find nearest" feature of mysql.. Why is this BAD schema?

If the dataset were small enough to be cached, you could burn CPU cycles to do it.
But your dataset is too big, and the traditional ways would be painfully I/O-bound.
This is a partial answer to your RAM vs IOPS question -- "it depends".

http://mysql.rjweb.org/doc.php/latlng
explains briefly why the traditional indexes are poor. Then it explains a very efficient way -- both I/O and CPU. Pending further details, I estimate that you will be able to do 20 "find nearest" queries/second, even on cheap drives.
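For context, the "traditional" approach that the blog improves on is roughly a bounding box plus a distance sort (sketch below; column names invented, and ST_Distance_Sphere requires MySQL 5.7). On a multi-TB table this scans far too many rows and becomes I/O-bound:

```sql
-- Hypothetical sketch of the traditional (slow-at-scale) find-nearest query:
SELECT id,
       ST_Distance_Sphere(POINT(lng, lat), POINT(-122.42, 37.77)) AS meters
  FROM events
  WHERE lat BETWEEN 37.77 - 0.1 AND 37.77 + 0.1   -- bounding box, ~11km
    AND lng BETWEEN -122.42 - 0.1 AND -122.42 + 0.1
  ORDER BY meters
  LIMIT 10;
```

An index on (lat) helps only with the first range; every matching row in the box must still be fetched and sorted.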

> where I defined an index in the TIME column.

I assume you mean the DATETIME or TIMESTAMP datatype. (TIME is a nearly useless datatype.)

> 5. We do have SELECTS with count(*) but within a TIME RANGE, where I defined an index in the TIME column. This should not result in high I/O usage right?

Again, the "devil is in the details". (CREATE TABLE, SELECT)
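Still, the general shape is promising. A range COUNT(*) like the sketch below (names invented) can be answered entirely from the index -- low I/O if the range is narrow, but a wide range still touches many index blocks:

```sql
-- Hypothetical sketch: with INDEX(dt), this is an index-only range scan.
SELECT COUNT(*)
  FROM events
  WHERE dt >= '2015-01-01'
    AND dt <  '2015-01-08';
```

Note the closed-open range (>= start, < end); it avoids off-by-one errors at midnight and works unchanged for DATETIME(6).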

> 6. How should I test performance in MYSQL.. Should I load up the database with dummy data matching our forecast, and then execute inserts/ and Selects?

That would be good. However, it may take nearly 2 years to load 2 years' worth of dummy data. The loading will go fast until you exceed the buffer_pool, then it will slow down. So let's discuss an alternative:
* Let X = amount of RAM on the eventual machine.
* innodb_buffer_pool_size on that machine should be about 0.70 * X.
* For testing, however, let's set the buffer_pool to 1% of that.
* Load a dummy test up to 1% of the target amount. (Only 16GB)
* Now test. This won't be a perfect simulation, but it will help discover some of the issues you _will_ hit.
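As a config fragment, the test box might look like this (all numbers are made-up examples for a 1% simulation; plug in your own X):

```ini
# Eventual machine: X = 160GB RAM  ->  buffer_pool = ~112GB (0.70 * X)
# Test machine:     buffer_pool = ~1% of that, with ~1% of the data loaded
[mysqld]
innodb_buffer_pool_size        = 1G
innodb_log_file_size           = 256M
innodb_flush_log_at_trx_commit = 2   # common for bulk-load tests; an assumption here
```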

Critical things to test:
* 40K/sec ingestion _after_ the dataset is bigger than the buffer_pool
* PARTITION maintenance
* SELECTs (find nearest, etc)
* Because of "find nearest", there is the question of whether to populate 1% of the land area, or populate 1% of each area. Probably wise to do some of each, then do the SELECTs in dense areas and in sparse areas.
* Aggregation -- Design what to do, build maintenance scripts, etc. It will be a lot faster to test with 16GB than with 1.6TB.

Some metrics you get from testing (eg, aggregation timings) need to be multiplied by 100 to estimate the eventual configuration. Most metrics, however, map directly to the eventual config -- precisely because the buffer_pool was artificially shrunk by the same factor as the data.

The testing can be done on any handy machine. CPU speeds are not much different, and you won't be using much RAM. The drives, on the other hand, may need scaling: if you test on non-RAIDed, non-SSD devices but plan to use RAID striping and/or SSDs in production, the I/O numbers will not carry over directly.

> 7. I am being advised that MySQL will not scale

It probably will scale. However, the 'obvious' ways of doing things will not handle your case. Occasionally I encounter a different product that works in some niche market by doing something that MySQL does not do. I cannot speak for the one you mentioned. However, based on what you have said, I believe MySQL can be made to work.

Summary:
Disks -- Recommend RAID striping and/or SSDs.
RAM -- depends on 'clustering' (working set) of the operations you need to perform. If the working set is big, adding RAM won't help.
Cores/CPUs -- Rarely matters.
PARTITIONing -- split the raw data by months (~26 partitions for 2 years) or 2-week intervals (~55 partitions) and use "BY RANGE(timestamp or datetime)"
Version -- Oracle 5.6 or 5.7; MariaDB 10.1; Percona 5.6.
Engine -- InnoDB.
Code -- Database "layer" between the app and MySQL.
Datatypes -- minimize size wherever possible and safe.
Character set -- utf8 (or utf8mb4) only where required.
"Find nearest" -- see my blog; it will be orders of magnitude faster than a casual implementation.
Summarization -- need more details.
innodb_buffer_pool_size (the most important tunable) -- depends on working set, RAM; see discussion above.
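Putting the partitioning recommendation into DDL (a hypothetical sketch; the real PRIMARY KEY depends on your queries, and the partition key must be part of it):

```sql
-- Hypothetical sketch: monthly partitions BY RANGE on the datetime column.
CREATE TABLE events (
  dt        DATETIME NOT NULL,
  device_id INT UNSIGNED NOT NULL,
  lat       DOUBLE,
  lng       DOUBLE,
  val       DOUBLE,
  PRIMARY KEY (device_id, dt)          -- invented; must include dt for partitioning
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(dt)) (
  PARTITION p201501 VALUES LESS THAN (TO_DAYS('2015-02-01')),
  PARTITION p201502 VALUES LESS THAN (TO_DAYS('2015-03-01')),
  -- ... one per month, ~26 for two years ...
  PARTITION pFuture VALUES LESS THAN MAXVALUE
);
```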
