MySQL Forums
Forum List  »  Partitioning

Does partitioning make sense for this application?
Posted by: Paul Schmehl
Date: October 25, 2012 08:46PM

I'm designing an application for internal use. I'm a senior security analyst, not a DBA and not a programmer. Such is life at an edu. :-)

I've written the basic application in PHP (5.4.7) using MySQL (5.5.2) as the backend. The purpose of the application is to speed up searches of data stored in binary form. An average day is about 90GB of data. The database will contain just the critical elements we search on: one day of data creates a 9GB table with over 120 million rows, and the indexes are almost 2GB.

Here's the schema:
+-------+-----------------------+------+-----+---------+-------+
| Field | Type                  | Null | Key | Default | Extra |
+-------+-----------------------+------+-----+---------+-------+
| seq   | int(10) unsigned      | YES  |     | NULL    |       |
| stime | double(18,6) unsigned | NO   | MUL | NULL    |       |
| saddr | varchar(64)           | NO   | MUL | NULL    |       |
| sport | varchar(10)           | NO   |     | NULL    |       |
| daddr | varchar(64)           | NO   | MUL | NULL    |       |
| dport | varchar(10)           | NO   |     | NULL    |       |
| pkts  | bigint(20)            | YES  |     | NULL    |       |
| bytes | bigint(20)            | YES  |     | NULL    |       |
| state | varchar(32)           | YES  |     | NULL    |       |
| proto | varchar(16)           | NO   |     | NULL    |       |
+-------+-----------------------+------+-----+---------+-------+
10 rows in set (0.00 sec)

stime, saddr and daddr are indexed, because those are the elements we normally search for.

Here's what the data looks like:

+-----------+-------------------+----------------+-------+---------------+-------+------+-------+-------+-------+
| seq       | stime             | saddr          | sport | daddr         | dport | pkts | bytes | state | proto |
+-----------+-------------------+----------------+-------+---------------+-------+------+-------+-------+-------+
| 105807958 | 1349121595.971430 | 10.174.37.162  | 63460 | 68.71.220.62  | 443   | 2    | 120   | A_A   | tcp   |
| 105807959 | 1349121595.971592 | 10.21.16.47    | 53600 | 64.208.241.65 | 80    | 75   | 68566 | PA_PA | tcp   |
| 105807960 | 1349121595.971961 | 129.110.31.40  | 39605 | 217.70.185.0  | 53    | 1    | 87    | INT   | udp   |
| 105807961 | 1349121595.972484 | 74.125.225.134 | 80    | 10.21.1.146   | 61692 | 1    | 60    | _RA   | tcp   |
| 105807962 | 1349121595.973084 | 10.21.16.178   | 53433 | 74.125.225.99 | 80    | 5    | 2519  | PA_PA | tcp   |
| 105807963 | 1349121595.974383 | 10.190.104.195 | 59254 | 66.94.240.25  | 80    | 2    | 120   | A_FA  | tcp   |
| 105807964 | 1349121595.974762 | 10.190.101.245 | 62995 | 75.102.13.140 | 80    | 89   | 48592 | A_PA  | tcp   |
| 105807965 | 1349121595.974790 | 10.110.143.140 | 54014 | 64.4.44.48    | 1863  | 3    | 182   | PA_PA | tcp   |
| 105807966 | 1349121595.974874 | 10.21.21.50    | 15326 | 121.9.201.100 | 17788 | 2    | 241   | INT   | udp   |
| 105807967 | 1349121595.975586 | 10.170.40.197  | 49993 | 70.20.201.40  | 18879 | 8    | 668   | A_PA  | tcp   |
+-----------+-------------------+----------------+-------+---------------+-------+------+-------+-------+-------+
10 rows in set (0.08 sec)

The stime column is the Unix timestamp, with six digits after the decimal point representing the microseconds.
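For illustration, here's how an stime value maps back to a calendar day (the table name `flows` is just a placeholder, and the wall-clock result depends on the server's time zone):

```sql
-- The integer part of stime is plain Unix time; FROM_UNIXTIME()
-- converts it to a readable date/time, which is what a daily
-- partitioning scheme would effectively group on.
SELECT stime,
       FROM_UNIXTIME(FLOOR(stime)) AS capture_time,
       DATE(FROM_UNIXTIME(FLOOR(stime))) AS capture_day
FROM flows
LIMIT 1;
```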

Eventually the data will be streamed in realtime into the db. So each day another 120 million rows will be added to the database.

Most of the queries I've written return a result in less than a second. Some can take up to 9 seconds. I would assume that adding a second day would double these query times.

So I'm looking into partitioning. For my application, the most logical way to partition would be by day. Generally, when we do searches, they are for discrete time intervals within a 24-hour period, so most searches would only traverse one partition.

Does this make sense?

Since the stime column is indexed, it would make sense to partition on that (if I'm reading the docs correctly). I believe those times are unique, so this could be made a primary key.
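Here's a sketch of what daily RANGE partitioning on stime might look like. One caveat I ran into in the docs: MySQL only accepts FLOOR() in a partitioning expression when the column is an exact numeric type, so stime would apparently need to become DECIMAL(18,6) rather than DOUBLE. Table and partition names are illustrative; the boundary values are midnight UTC for each day:

```sql
-- Sketch: one RANGE partition per day on the integer part of stime.
-- NOTE: requires stime as DECIMAL(18,6), since FLOOR() over a
-- FLOAT/DOUBLE column is not allowed in a partitioning expression.
CREATE TABLE flows (
  seq   INT UNSIGNED,
  stime DECIMAL(18,6) NOT NULL,
  saddr VARCHAR(64) NOT NULL,
  sport VARCHAR(10) NOT NULL,
  daddr VARCHAR(64) NOT NULL,
  dport VARCHAR(10) NOT NULL,
  pkts  BIGINT,
  bytes BIGINT,
  state VARCHAR(32),
  proto VARCHAR(16) NOT NULL,
  KEY idx_stime (stime),
  KEY idx_saddr (saddr),
  KEY idx_daddr (daddr)
)
PARTITION BY RANGE (FLOOR(stime)) (
  -- 86400 seconds per day; 1351209600 = 2012-10-26 00:00:00 UTC
  PARTITION p20121025 VALUES LESS THAN (1351209600),
  PARTITION p20121026 VALUES LESS THAN (1351296000),
  PARTITION pmax      VALUES LESS THAN MAXVALUE
);
```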

The problem is, I would need to either create hundreds of partitions in advance somehow or, better, create each day's partition on the fly the day before.

Is it possible to do this? Can anyone give me some hints on how to do this?
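One approach I've seen sketched, assuming a table like the one above with a trailing MAXVALUE catch-all partition: a nightly event that splits the next day's range out of the catch-all with REORGANIZE PARTITION. This needs event_scheduler=ON, and every name and time here is illustrative, not tested:

```sql
-- Sketch: each night, carve tomorrow's partition out of pmax.
DELIMITER //
CREATE EVENT add_tomorrows_partition
  ON SCHEDULE EVERY 1 DAY STARTS '2012-10-26 23:00:00'
DO
BEGIN
  -- Upper boundary: midnight at the start of the day after tomorrow.
  SET @cutoff = UNIX_TIMESTAMP(DATE(NOW() + INTERVAL 2 DAY));
  SET @sql = CONCAT(
    'ALTER TABLE flows REORGANIZE PARTITION pmax INTO (',
    'PARTITION p', DATE_FORMAT(NOW() + INTERVAL 1 DAY, '%Y%m%d'),
    ' VALUES LESS THAN (', @cutoff, '), ',
    'PARTITION pmax VALUES LESS THAN MAXVALUE)');
  PREPARE s FROM @sql;
  EXECUTE s;
  DEALLOCATE PREPARE s;
END//
DELIMITER ;
```

Reorganizing an empty pmax should be cheap, since no rows have to move until data actually lands past the last real partition.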

EDIT: Unfortunately, stime has duplicate values, so it cannot be a primary key. Only seq is unique, but we don't search on that.
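If a primary key is wanted at all, a composite key might be a way around the duplicates: MySQL requires every unique key on a partitioned table to include the partitioning column, and (stime, seq) would satisfy that. A sketch (table name assumed; seq would implicitly become NOT NULL):

```sql
-- stime alone has duplicates, but (stime, seq) is unique, and it
-- still contains the partitioning column as MySQL requires.
ALTER TABLE flows ADD PRIMARY KEY (stime, seq);
```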

EDIT2: I should have mentioned that the inserts will be done by the same program that creates the log files. It has the capability of creating a new table each day. It also assigns the data type to each field but does not do any indexing. So I would have to write something to create the indexes later (which is not a big deal).
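Building the indexes after the day's bulk load is done would just be one statement per table; a sketch, with an assumed per-day table name:

```sql
-- Add the three search indexes after the load, in a single ALTER
-- so the table is only rebuilt once.
ALTER TABLE flows_20121025
  ADD INDEX idx_stime (stime),
  ADD INDEX idx_saddr (saddr),
  ADD INDEX idx_daddr (daddr);
```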

My concern is that with uniquely named daily tables, my code has to be a lot more sophisticated. Searching over multiple days would require stitching many tables together with UNIONs, whereas a partitioned table would not.
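To make that concrete, here's the same three-day search both ways (table names and the example address are illustrative):

```sql
-- Per-day tables: the application has to know which tables exist
-- and glue the results together.
SELECT * FROM flows_20121023 WHERE daddr = '64.208.241.65'
UNION ALL
SELECT * FROM flows_20121024 WHERE daddr = '64.208.241.65'
UNION ALL
SELECT * FROM flows_20121025 WHERE daddr = '64.208.241.65';

-- One partitioned table: a single statement, and the stime range
-- lets the optimizer prune down to the three matching partitions.
SELECT * FROM flows
WHERE daddr = '64.208.241.65'
  AND stime >= 1350950400   -- 2012-10-23 00:00:00 UTC
  AND stime <  1351209600;  -- 2012-10-26 00:00:00 UTC
```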

I could use the program to create a new table each month and partition that, which would mean 12 tables per year with no more than 31 partitions in each table. I'm not sure how much data we will be keeping. Right now we keep about 25 days on a 1TB drive. That consumes about 950GB of data. The same time frame in the db would be 95GB.



Edited 2 time(s). Last edit at 10/25/2012 09:33PM by Paul Schmehl.

