MySQL Forums
Forum List  »  Partitioning

Re: Does partitioning make sense for this application?
Posted by: Rick James
Date: October 26, 2012 07:41PM

90GB/day? OK, we will have to pull out all the stops!

Task 1: Shrink the data as much as possible
* Smaller datatypes
* Normalization

| seq   | int(10) unsigned      | Will you run out of the 4 Billion here?  If so, consider BIGINT UNSIGNED
| stime | double(18,6) unsigned | That's 11 bytes; DOUBLE is only 8
| saddr | varchar(64)           | IP-address, see below
| sport | varchar(10)           | The sample data would fit into SMALLINT UNSIGNED (0..65535, 2 bytes)
| daddr | varchar(64)           | IP
| dport | varchar(10)           | Maybe SMALLINT UNSIGNED?
| pkts  | bigint(20)            | Samples don't need BIGINT (8 bytes); see below
| bytes | bigint(20)            | Ditto
| state | varchar(32)           | Normalize; see below
| proto | varchar(16)           | ENUM or Normalize; see below

Are you using utf8? Probably don't need it.

IP-addresses. Your samples show IPv4. IPv6 is upon us. Suggest you plan for that. See this page, more for the datatype discussion than for the code:
http://mysql.rjweb.org/doc.php/ipranges
You should probably end up with BINARY(16) or VARBINARY(16).

By "normalize", I mean create another table with a bi-directional mapping between an number and the string. The number would go into your main table. It would be AUTO_INCREMENT in the other table.

Numbers should be INT/MEDIUMINT/SMALLINT/TINYINT UNSIGNED, depending on how many distinct values you will end up with. BIGINT is almost never needed.

PROTO might best be ENUM('udp', 'tcp', ...) -- there are not many choices. 1 byte.

Consider NOT NULL for each field. If you do need NULL values, think through why you need it. (This is a minor issue.)

Which Engine are you using? I recommend InnoDB, even though its disk footprint is bigger. If there is a crash (eg, power failure), InnoDB will self-recover; MyISAM would probably need an hours-long "REPAIR TABLE".

120M rows/day = 1500/second. I assume the streaming would be relatively steady.

Batch the INSERTs -- either LOAD DATA or single INSERT statements with 100-1000 rows each.

(We have not even gotten to PARTITIONing yet.)

> I would assume that adding a second day would double these query times.
Probably not.
* A "point" query (fetch a single row using an index) will not slow down noticeably. Well, there will be a quantum bump when you go from being fully cached in RAM to the data being too big to be cached.
* A query scanning, say, one second's worth of data (~1500 rows) will not be much different than a point query.
* What was the 9-second query?

> Since the stime column is indexed, it would make sense to partition on that
* Not necessarily. There is no connection between being indexed and being a good PARTITION key.
* DECIMAL may not be a partition key (unless there is something new). DOUBLE cannot either.

> hundreds of partitions
Not a good idea (for various obscure reasons). Also there is a hard limit (until 5.6) of 1024 partitions.

> stime has duplicate values
Never trust a time to be UNIQUE ! Hence it cannot be a PRIMARY KEY. But it does not have to be by itself. Is there some pair of fields that are UNIQUE?

> capability of creating a new table each day
Also not a good idea. (But I won't throw it out immediately.)
100K tables causes the OS to groan and slow down.

> does not do any indexing
If it inserts into the table, it is inserting. I see some kind of INDEX in the DESCRIBE. PLEASE use SHOW CREATE TABLE, not the less-descriptive "DESCRIBE"!

Some generic PARTITION tips:
http://mysql.rjweb.org/doc.php/ricksrots#partitioning
Some generic Data Warehousing tips (and DW is what you are into):
http://mysql.rjweb.org/doc.php/ricksrots#data_warehouse

Before I can begin to discuss PARTITIONing and Summary Table advice, I need to see all the queries that will hit the table.

Options: ReplyQuote


Subject
Views
Written By
Posted
Re: Does partitioning make sense for this application?
1539
October 26, 2012 07:41PM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.