MySQL Forums

Re: Does partitioning make sense for this application?
Posted by: Paul Schmehl
Date: October 26, 2012 09:55PM

After doing additional reading, I decided to create a covering index over the three columns we would most frequently search so I could test it. Response seems slower than when I had individually indexed the columns, but I'll have to test with the same query to be sure.

Here's a typical query:
select * from argData_2012_10_02 where saddr between '10.21.21.0' and '10.21.30.255' and daddr='8.8.8.8' and dport='53';

Here's the return:
20655 rows in set (2 min 27.82 sec)

Here's the show create table for the covering index:

argData_2012_10_02 | CREATE TABLE `argData_2012_10_02` (
`seq` int(10) unsigned NOT NULL DEFAULT '0',
`stime` double(18,6) unsigned NOT NULL,
`saddr` varchar(64) NOT NULL,
`sport` varchar(10) NOT NULL,
`daddr` varchar(64) NOT NULL,
`dport` varchar(10) NOT NULL,
`pkts` bigint(20) DEFAULT NULL,
`bytes` bigint(20) DEFAULT NULL,
`state` varchar(32) DEFAULT NULL,
`proto` varchar(16) NOT NULL,
KEY `stime` (`stime`,`saddr`,`daddr`) USING BTREE
) ENGINE=MyISAM DEFAULT CHARSET=latin1

I dropped that and recreated separate indexes on saddr and daddr and reran the query. Here are the results:

20655 rows in set (33.73 sec)

Clearly much quicker.
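
For anyone following along, the change amounts to something like this (a sketch; the composite key was named stime in the SHOW CREATE TABLE above):

ALTER TABLE argData_2012_10_02
  DROP KEY `stime`,
  ADD KEY saddr_idx (saddr),
  ADD KEY daddr_idx (daddr);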

Here's the show create table for the separate indexes:

argData_2012_10_02 | CREATE TABLE `argData_2012_10_02` (
`seq` int(10) unsigned NOT NULL DEFAULT '0',
`stime` double(18,6) unsigned NOT NULL,
`saddr` varchar(64) NOT NULL,
`sport` varchar(10) NOT NULL,
`daddr` varchar(64) NOT NULL,
`dport` varchar(10) NOT NULL,
`pkts` bigint(20) DEFAULT NULL,
`bytes` bigint(20) DEFAULT NULL,
`state` varchar(32) DEFAULT NULL,
`proto` varchar(16) NOT NULL,
KEY `saddr_idx` (`saddr`),
KEY `daddr_idx` (`daddr`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1

I wonder if I should have separate indexes on saddr and daddr, and then a covering index on saddr and sport and another on daddr and dport?

EDIT: So I created a covering index for saddr and sport and ran the query again. Here's the explain:
mysql> explain select * from argData_2012_10_02 where saddr between '10.21.21.0' and '10.21.30.255' and daddr='8.8.8.8' and dport='53';
+----+-------------+--------------------+------+----------------------------------+-----------+---------+-------+-------+-------------+
| id | select_type | table              | type | possible_keys                    | key       | key_len | ref   | rows  | Extra       |
+----+-------------+--------------------+------+----------------------------------+-----------+---------+-------+-------+-------------+
|  1 | SIMPLE      | argData_2012_10_02 | ref  | saddr_idx,daddr_idx,cover_source | daddr_idx | 66      | const | 62012 | Using where |
+----+-------------+--------------------+------+----------------------------------+-----------+---------+-------+-------+-------------+
1 row in set (0.01 sec)

So MySQL chose daddr_idx, I assume because it involved the most discrete lookup (an equality match on a single address rather than a range scan on saddr).

The time for return was:
20655 rows in set (0.82 sec)

Much faster than the single indexes. Now I'm going to add a covering index for daddr and dport and see what happens.
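
Adding that key amounts to something like this (a sketch, same form as cover_source in the output below):

ALTER TABLE argData_2012_10_02 ADD KEY cover_dst (daddr, dport) USING BTREE;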

I'll be back.

EDIT: After creating a covering index for daddr and dport, the results were disappointing.

20655 rows in set (36.88 sec)

Oddly, MySQL still used daddr_idx.

mysql> explain select * from argData_2012_10_02 where saddr between '10.21.21.0' and '10.21.30.255' and daddr='8.8.8.8' and dport='53';
+----+-------------+--------------------+------+--------------------------------------------+-----------+---------+-------+-------+-------------+
| id | select_type | table              | type | possible_keys                              | key       | key_len | ref   | rows  | Extra       |
+----+-------------+--------------------+------+--------------------------------------------+-----------+---------+-------+-------+-------------+
|  1 | SIMPLE      | argData_2012_10_02 | ref  | saddr_idx,daddr_idx,cover_source,cover_dst | daddr_idx | 66      | const | 62012 | Using where |
+----+-------------+--------------------+------+--------------------------------------------+-----------+---------+-------+-------+-------------+

Here's the show create table:
argData_2012_10_02 | CREATE TABLE `argData_2012_10_02` (
`seq` int(10) unsigned NOT NULL DEFAULT '0',
`stime` double(18,6) unsigned NOT NULL,
`saddr` varchar(64) NOT NULL,
`sport` varchar(10) NOT NULL,
`daddr` varchar(64) NOT NULL,
`dport` varchar(10) NOT NULL,
`pkts` bigint(20) DEFAULT NULL,
`bytes` bigint(20) DEFAULT NULL,
`state` varchar(32) DEFAULT NULL,
`proto` varchar(16) NOT NULL,
KEY `saddr_idx` (`saddr`),
KEY `daddr_idx` (`daddr`),
KEY `cover_source` (`saddr`,`sport`) USING BTREE,
KEY `cover_dst` (`daddr`,`dport`) USING BTREE
) ENGINE=MyISAM DEFAULT CHARSET=latin1 |

So it looks like covering indexes are much more useful when you're searching ranges of values than distinct ones.

The table was created by the open source program that will be doing the inserts. I've been thinking I need to create the tables first so I can control the data types. sport and dport can be names (like http), which is why he uses varchar, but I'm suppressing those so they will only be numbers. Since there are only 65535 possible ports, SMALLINT UNSIGNED would work fine there.
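
If I do end up creating the tables myself, something like this is what I have in mind for the port columns (a sketch; it only works once the name suppression guarantees numeric values):

ALTER TABLE argData_2012_10_02
  MODIFY sport SMALLINT UNSIGNED NOT NULL,
  MODIFY dport SMALLINT UNSIGNED NOT NULL;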

Proto would have to be an ENUM with a lot of values:
cat /etc/services | egrep -v '(^#)' | wc -l
9886

EDIT: This was plain boneheaded. I had ports in my head but wrote about proto. You're probably correct about ENUM for these.

select distinct(proto) from argData_2012_10_02;
+-------+
| proto |
+-------+
| tcp |
| udp |
| rtp |
| udt |
| esp |
| rtcp |
| ospf |
| gre |
| ipv6 |
+-------+

There might be a few more, but that's basically what we're seeing over a 24 hour period.
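
If proto does go to an ENUM, it would just be something like this (a sketch; the value list would need to include anything else that shows up):

ALTER TABLE argData_2012_10_02
  MODIFY proto ENUM('tcp','udp','rtp','udt','esp','rtcp','ospf','gre','ipv6') NOT NULL;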

bytes should definitely be a BIGINT. An unsigned INT tops out around 4.29 billion, which as a byte count is only about 4 GB, and that's considered a relatively small download in edu.

pkts I'm not sure about. I think INT would be OK, but who knows.

"| stime | double(18,6) unsigned | That's 11 bytes; DOUBLE is only 8"

Are you sure about that? I calculate 8 bytes.

Good point about IPV6. I'll have to give that some thought.

I can't batch the inserts. They're all done by the program. I'm working on the front end to display the results of searching all that data (so only selects.)

"* DECIMAL may not be a partition key (unless there is something new). DOUBLE cannot either."

Guess I'd better read the partitioning docs more closely. :-)

My latest idea (after much more reading) is to create monthly tables and partition those by days.
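
Something like this is what I'm picturing (a very rough sketch: columns trimmed, names and partition boundaries only illustrative, and it assumes stime gets stored as an INT UNSIGNED unix timestamp, since a DOUBLE can't be used as the partition key):

CREATE TABLE argData_2012_11 (
  seq   INT UNSIGNED NOT NULL DEFAULT 0,
  stime INT UNSIGNED NOT NULL,           -- unix timestamp instead of DOUBLE
  saddr VARCHAR(64) NOT NULL,
  daddr VARCHAR(64) NOT NULL,
  dport SMALLINT UNSIGNED NOT NULL,
  KEY daddr_dport_idx (daddr, dport)
) ENGINE=MyISAM
PARTITION BY RANGE (stime DIV 86400) (   -- days since 1970-01-01
  PARTITION p20121101 VALUES LESS THAN (15646),  -- through 2012-11-01 (day 15645)
  PARTITION p20121102 VALUES LESS THAN (15647),  -- 2012-11-02
  -- ...one partition per day...
  PARTITION pmax VALUES LESS THAN MAXVALUE
);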

"Is there some pair of fields that are UNIQUE?"

Not likely, except for seq, which is unique. But that would never be searched. The only reason I included it is that it was unique and I thought I might be able to figure out some way to use it. For example, tying a start date -> end date range to seq BETWEEN x AND y might speed up searches over time ranges.
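
Something like this is what I have in mind (seq values purely hypothetical, and it only helps if seq is indexed and increases monotonically with time):

SELECT * FROM argData_2012_10_02
WHERE seq BETWEEN 1000000 AND 2000000
  AND daddr = '8.8.8.8' AND dport = '53';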



Edited 5 time(s). Last edit at 10/27/2012 11:29AM by Paul Schmehl.
