MySQL Forums
Forum List  »  Newbie

Re: Partitioning or Not Partitioning?
Posted by: Rick James
Date: November 21, 2014 08:26PM

Is `dati_rilevati` the "history" table? If so, what is the CREATE TABLE for the "real time" table?

> (for example 25 rows per second that correspond to 25 different value like temperature, voltage, ampere ecc..., so 1500 value per minute)

If you can make that one row with 25 columns for the measurements, you can do the INSERT a lot faster. Also, the dataset size will be smaller -- possibly 1/3 the size.

Perhaps that is what you are doing??
> and than I delete all data and restart for other 60 second

> I am very happy to know the data quantity for one month or one year because I need to buy an hardware that it will able to do this

There are two issues: dataset size (and you are willing to buy bigger disks), and ability to perform the INSERTs (which I am mostly focusing on). MySQL has a limit of how many rows it can INSERT per second.

> Yes, I have two kind of information, like I explained before . One information in real time, that is produced by a select in the "real_time" table of every client, in witch I consider the last datetime, and one information witch is produced by some SELECT query that consider a single value (for example temperature) in a time period, for example between last month to today.

Let's work on spelling out the SELECTs for each thing you need.

By "real time", do you mean the last value read? In which case something like this might work?:
SELECT * FROM real_time_123.
(assuming it is client #123)

For finding the all the temps for October 30th:
SELECT data, valore FROM dati_rilevati_123
WHERE idvariable = 'temperature'
AND data >= '2014-10-30'
AND data < '2014-10-30' + INTERVAL 1 DAY;
Note that I have taken the liberty of changing idvariable to
ENUM (temperature, voltage, ampere, ...)

That SELECT needs
INDEX(idvariable, data)

If you have 25 columns for 25 metrics, then it looks like:
SELECT data, val_temperature FROM dati_rilevati_123
WHERE data >= '2014-10-30'
AND data < '2014-10-30' + INTERVAL 1 DAY;
That needs INDEX(data)

If I then change to have all clients in the same table:
SELECT data, val_temperature FROM dati_rilevati
WHERE data >= '2014-10-30'
AND data < '2014-10-30' + INTERVAL 1 DAY
AND cliente = 123;
That needs INDEX(cliente, data);

Performance will be really bad without an appropriate INDEX.

Please continue in this direction when explaining what SELECTs you envision.

Meanwhile...
> `drl_idvariabile` integer (11),
> `drl_nomevariabile` varchar (50),

Are those in a 1-to-1 relationship? If so, that pair should be in a separate table (or switch to an ENUM as I mentioned above). That VARCHAR wastes a lot of space, and hence speed. Put only the smaller field (idvariable) in the history table.

Some calculations:
Assuming 2000 clients, and data kept for 1 year.
Plan A: (each reading is a separate row) 25 rows per second per client = 5000 INSERTs/second (as you said); 150 billion rows; 15TB.
Plan B: (1 table with 25 measurements as columns): 200 INSERTs/sec; 6 billion rows; 1TB of data.

I think that putting 15TB on a single system is neither easy, nor cheap; 1TB is easy and cheap.
I think that 5000 INSERTs/sec may cause a lot of hassle; 200 should be no problem.

OK, maybe I a leaving out a step -- Is this correct?
Data older than, say, one month, is summarize: 60 temperature readings are turned into 1 average reading for the minute? If so, then the table splits into two.

Plan A (second-by-second for one month): 1.2TB; still 5000 INSERTs/sec, 13B rows plus
Plan A (minute-by-minute for one year): 0.3TB; 83 INSERTs/sec, 3B rows
Plan A (total): 1.5TB (not bad), 5083 INSERTs/sec (still bad)

Plan B (second-by-second for one month): 0.1TB; still 200 INSERTs/sec, 1B rows, plus
Plan B (minute-by-minute for one year): 0.1TB; 1 INSERT/sec, 0.2B rows
Plan B (total): 0.2TB (good), 201 INSERTs/sec (good)

Clearly, I am pushing you toward Plan B, especially with the by-minute summarization.

The by-the-second data must be purged -- PARTITIONing is excellent for that.
The by-the-minute data is either kept forever (beyond the 1 year I calculated for) or it is purged. If purged, then, again, PARTITIONing is good.
That is, regardless of which "Plan", Partitioning is called for if you need to purge old data.

We still need to discuss the interactions between PARTITION and the INDEXes and the SELECTs. Please continue with the writing the SELECTs.

Options: ReplyQuote


Subject
Written By
Posted
Re: Partitioning or Not Partitioning?
November 21, 2014 08:26PM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.