90GB/day? OK, we will have to pull out all the stops!
Task 1: Shrink the data as much as possible
* Smaller datatypes:

| Column | Current type | Suggestion |
|---|---|---|
| seq | int(10) unsigned | Will you run out of the 4 billion here? If so, consider BIGINT UNSIGNED |
| stime | double(18,6) unsigned | That's 11 bytes; plain DOUBLE is only 8 |
| saddr | varchar(64) | IP address; see below |
| sport | varchar(10) | The sample data would fit into SMALLINT UNSIGNED (0..65535, 2 bytes) |
| daddr | varchar(64) | IP address; see below |
| dport | varchar(10) | Maybe SMALLINT UNSIGNED? |
| pkts | bigint(20) | Samples don't need BIGINT (8 bytes); see below |
| bytes | bigint(20) | Ditto |
| state | varchar(32) | Normalize; see below |
| proto | varchar(16) | ENUM or normalize; see below |
Are you using utf8? You probably don't need it for data like this; CHARACTER SET ascii (or latin1) is smaller for these columns.
IP addresses: your samples show IPv4, but IPv6 is upon us, so plan for it now. See this page, more for the datatype discussion than for the code:
You should probably end up with BINARY(16) or VARBINARY(16).
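As a sketch of what the application-side packing could look like (using Python's stdlib `socket.inet_pton`; the helper name `ip_to_bin16` is my own), storing IPv4 as IPv4-mapped IPv6 so a single BINARY(16) column handles both:

```python
import socket

def ip_to_bin16(ip: str) -> bytes:
    """Pack an IPv4 or IPv6 address string into 16 bytes for a BINARY(16) column."""
    try:
        # Already IPv6? inet_pton returns exactly 16 bytes.
        return socket.inet_pton(socket.AF_INET6, ip)
    except OSError:
        # IPv4: store as an IPv4-mapped IPv6 address (::ffff:a.b.c.d).
        return b"\x00" * 10 + b"\xff\xff" + socket.inet_pton(socket.AF_INET, ip)

print(ip_to_bin16("192.0.2.1").hex())   # 00000000000000000000ffffc0000201
print(len(ip_to_bin16("2001:db8::1")))  # 16
```

(MySQL 5.6.3 and later can do this server-side with `INET6_ATON()` / `INET6_NTOA()`.)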
By "normalize", I mean create another table with a bi-directional mapping between a number and the string. The number goes into your main table; in the other table it is an AUTO_INCREMENT id.
Numbers should be INT/MEDIUMINT/SMALLINT/TINYINT UNSIGNED, depending on how many distinct values you will end up with. BIGINT is almost never needed.
PROTO might best be ENUM('udp', 'tcp', ...) -- there are not many choices. 1 byte.
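A minimal sketch of such a normalization table for `state` (table and column names here are my assumptions, not your schema):

```sql
CREATE TABLE states (
  state_id TINYINT UNSIGNED NOT NULL AUTO_INCREMENT,
  state    VARCHAR(32) NOT NULL,
  PRIMARY KEY (state_id),
  UNIQUE KEY (state)
) ENGINE=InnoDB;

-- Get-or-create the id for a string, then store state_id in the main table:
INSERT INTO states (state) VALUES ('CON')
  ON DUPLICATE KEY UPDATE state_id = LAST_INSERT_ID(state_id);
SELECT LAST_INSERT_ID();
```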
Consider NOT NULL for each field. If you do need NULL values, think through why you need it. (This is a minor issue.)
Which Engine are you using? I recommend InnoDB, even though its disk footprint is bigger. If there is a crash (e.g., power failure), InnoDB will self-recover; MyISAM would probably need an hours-long REPAIR TABLE.
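Putting the suggestions above together, a sketch of the shrunken schema (the table name and exact ENUM values are assumptions; adjust to your data):

```sql
CREATE TABLE flows (
  seq   INT UNSIGNED NOT NULL,       -- or BIGINT UNSIGNED if > 4 billion
  stime DOUBLE NOT NULL,             -- 8 bytes
  saddr VARBINARY(16) NOT NULL,      -- packed IPv4/IPv6
  sport SMALLINT UNSIGNED NOT NULL,  -- 0..65535
  daddr VARBINARY(16) NOT NULL,
  dport SMALLINT UNSIGNED NOT NULL,
  pkts  INT UNSIGNED NOT NULL,
  bytes INT UNSIGNED NOT NULL,
  state TINYINT UNSIGNED NOT NULL,   -- id from the normalization table
  proto ENUM('udp','tcp','icmp') NOT NULL
) ENGINE=InnoDB;
```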
120M rows/day ≈ 1,400/second (120,000,000 / 86,400). I assume the streaming is relatively steady.
Batch the INSERTs -- either LOAD DATA or single INSERT statements with 100-1000 rows each.
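Both batching styles, sketched (table name, file name, and values are illustrative only):

```sql
-- Option 1: load a pre-built batch file in one shot:
LOAD DATA LOCAL INFILE 'batch.csv'
  INTO TABLE flows
  FIELDS TERMINATED BY ',';

-- Option 2: multi-row INSERTs, a few hundred rows per statement:
INSERT INTO flows (sport, dport, pkts, bytes)
VALUES (443, 51514, 10, 1500),
       (53,  40000,  2,  120);
```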
(We have not even gotten to PARTITIONing yet.)
> I would assume that adding a second day would double these query times.
* A "point" query (fetch a single row using an index) will not slow down noticeably. Well, there will be a sudden step when you go from the data being fully cached in RAM to the data being too big to cache.
* A query scanning, say, one second's worth of data (~1500 rows) will not be much different than a point query.
* What was the 9-second query?
> Since the stime column is indexed, it would make sense to partition on that
* Not necessarily. There is no connection between being indexed and being a good PARTITION key.
* DECIMAL cannot be a partition key (unless something new has been added), and neither can DOUBLE.
> hundreds of partitions
Not a good idea (for various obscure reasons). Also, there is a hard limit of 1024 partitions (raised to 8192 in 5.6.7).
> stime has duplicate values
Never trust a time to be UNIQUE! Hence stime cannot be the PRIMARY KEY by itself. But a PRIMARY KEY need not be a single column. Is there some pair of columns that is UNIQUE together?
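For example, if (stime, seq) happens to be unique together (an assumption; verify against your data before doing this), a composite key solves it:

```sql
ALTER TABLE flows                  -- table name assumed
  ADD PRIMARY KEY (stime, seq);
```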
> capability of creating a new table each day
Also not a good idea. (But I won't throw it out immediately.)
100K tables causes the OS to groan and slow down.
> does not do any indexing
If it inserts into a table that has an index, it is indexing. I see some kind of INDEX in the DESCRIBE. PLEASE use SHOW CREATE TABLE, not the less-descriptive DESCRIBE!
Some generic PARTITION tips:
Some generic Data Warehousing tips (and DW is what you are into):
Before I can begin to discuss PARTITIONing and Summary Table advice, I need to see all the queries that will hit the table.