Re: 500 mil. rows table partitioning
Rick, thank you for the reply!
Sorry, I was on a trip and couldn't reply to you promptly.
I think I need to provide a bit more background information. It should address the following questions:
>Does that mean that the data could be a single table with about 83 columns? Looks like it would have `time` as the PRIMARY KEY. Is there some reason not to do that? It would be soooo much faster.
>Could one of those be derived from the other?
>Converting to a single table with 83+ columns will shrink the data significantly because you can get rid of variableid, etc. And you can make use TINYINT UNSIGNED (1 byte) instead of FLOAT (4 bytes) for some of the `value`s.
Data comes from ~140 different sites (siteid). Measurements on each site occur every 30 min, but may span different time ranges. For example, we have sites that were active in the past and don't do measurements now. Sites may have gaps in their measurements of up to several months, and of course different sites can have measurements with the same time stamp. Each site measures its own set of variables; some measure only 5-8, some up to 20-25. Each variable is unique and can't be derived from the others. Although there are only 83 core variables (variableid), there can be a virtually countless number of series, because each variable can be measured at different heights (offsetid), measured several times for redundancy (repeat), and corrected in several ways (extensionid), and the combinations of all these differ from site to site. Those are the reasons it can't be organized into a single table with `time` as the PRIMARY KEY and 83 columns.
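To make the point concrete, each stored series is identified by the whole dimension tuple, not by variableid alone. A hedged sketch (just illustrative, not something I've run on the full table) that counts the distinct series per site:

```sql
-- Each series is the combination of variable, height (offsetid),
-- redundancy index (repeat), and correction (extensionid):
SELECT siteid,
       COUNT(DISTINCT variableid, offsetid, `repeat`, extensionid) AS n_series
FROM L2data
GROUP BY siteid;
```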
No data will be dropped. For sites that are still active I will keep getting new measurements, and sometimes data for old sites has to be updated. Old values must be archived. I assume it will be easiest to copy a site's old data to an archive DB, delete all data for that site, and reload it with the new/corrected values.
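Roughly, the per-site refresh I have in mind would look like this (the archive database name and the file path are placeholders, assuming the archive table has the same structure):

```sql
-- Sketch only: archive, purge, then reload one site's data
INSERT INTO archive.L2data SELECT * FROM L2data WHERE siteid = 47;
DELETE FROM L2data WHERE siteid = 47;
LOAD DATA INFILE '/path/to/site47_corrected.csv' INTO TABLE L2data;
```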
Now back to the rest of the questions. I took your advice and added an index to the table:
CREATE INDEX indx ON L2data (time, siteid, variableid, `repeat`, extensionid, offsetid, flag, value);
So far:
>SHOW CREATE TABLE tbl\G -- engine, indexes;
*************************** 1. row ***************************
Table: L2data
Create Table: CREATE TABLE `L2data` (
`extensionid` smallint(6) NOT NULL,
`flag` varchar(5) NOT NULL,
`offsetid` int(11) NOT NULL,
`repeat` tinyint(4) NOT NULL,
`siteid` int(11) NOT NULL,
`time` decimal(10,6) NOT NULL,
`value` float NOT NULL,
`variableid` smallint(6) NOT NULL,
KEY `indx` (`time`,`siteid`,`variableid`,`repeat`,`extensionid`,`offsetid`,`flag`,`value`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
>SHOW TABLE STATUS LIKE 'tbl'\G -- sizes
************************** 1. row ***************************
Name: L2data
Engine: MyISAM
Version: 10
Row_format: Dynamic
Rows: 414169342
Avg_row_length: 27
Data_length: 11510227132
Max_data_length: 281474976710655
Index_length: 15804361728
Data_free: 0
Auto_increment: NULL
Create_time: 2011-07-13 09:05:29
Update_time: 2011-07-13 09:13:01
Check_time: 2011-07-13 15:59:28
Collation: utf8_general_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)
>EXPLAIN SELECT
EXPLAIN SELECT 'US-IB2' AS site, t0.time,
       t0.value AS APAR_0,    t1.value AS APAR_fill_0,
       t2.value AS APARpct_0, t3.value AS APARpct_fill_0,
       t4.value AS FH2O_0,    t5.value AS FH2O_fill_0
FROM L2data t0
INNER JOIN L2data_part_286m t1 ON t0.time = t1.time
INNER JOIN L2data t2 ON t0.time = t2.time
INNER JOIN L2data t3 ON t0.time = t3.time
INNER JOIN L2data_part_286m t4 ON t0.time = t4.time
INNER JOIN L2data t5 ON t0.time = t5.time
WHERE t0.siteid = 47 AND t0.variableid = 1 AND t0.repeat = 0 AND t0.extensionid = 3 AND t0.offsetid = 3 AND t0.flag = 0
  AND t1.siteid = 47 AND t1.variableid = 1 AND t1.repeat = 0 AND t1.extensionid = 1 AND t1.offsetid = 3 AND t1.flag = 0
  AND t2.siteid = 47 AND t2.variableid = 6 AND t2.repeat = 0 AND t2.extensionid = 3 AND t2.offsetid = 3 AND t2.flag = 0
  AND t3.siteid = 47 AND t3.variableid = 6 AND t3.repeat = 0 AND t3.extensionid = 1 AND t3.offsetid = 3 AND t3.flag = 0
  AND t4.siteid = 47 AND t4.variableid = 5 AND t4.repeat = 0 AND t4.extensionid = 3 AND t4.offsetid = 3 AND t4.flag = 0
  AND t5.siteid = 47 AND t5.variableid = 5 AND t5.repeat = 0 AND t5.extensionid = 1 AND t5.offsetid = 3 AND t5.flag = 0
ORDER BY t0.time\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: t1
type: ALL
possible_keys: siteid_indx
key: NULL
key_len: NULL
ref: NULL
rows: 5413262
Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: t0
type: ref
possible_keys: indx
key: indx
key_len: 18
ref: Global.t1.time,const,const,const,const,const
rows: 1183396
Extra: Using where; Using index
*************************** 3. row ***************************
id: 1
select_type: SIMPLE
table: t2
type: ref
possible_keys: indx
key: indx
key_len: 18
ref: Global.t0.time,const,const,const,const,const
rows: 1183396
Extra: Using where; Using index
*************************** 4. row ***************************
id: 1
select_type: SIMPLE
table: t3
type: ref
possible_keys: indx
key: indx
key_len: 18
ref: Global.t0.time,const,const,const,const,const
rows: 1183396
Extra: Using where; Using index
*************************** 5. row ***************************
id: 1
select_type: SIMPLE
table: t5
type: ref
possible_keys: indx
key: indx
key_len: 18
ref: Global.t1.time,const,const,const,const,const
rows: 1183396
Extra: Using where; Using index
*************************** 6. row ***************************
id: 1
select_type: SIMPLE
table: t4
type: ALL
possible_keys: siteid_indx
key: NULL
key_len: NULL
ref: NULL
rows: 5413262
Extra: Using where; Using join buffer
6 rows in set (0.00 sec)
>SHOW VARIABLES LIKE '%buffer%'\G -- cache size;
*************************** 1. row ***************************
Variable_name: bulk_insert_buffer_size
Value: 8388608
*************************** 2. row ***************************
Variable_name: innodb_buffer_pool_size
Value: 8388608
*************************** 3. row ***************************
Variable_name: innodb_log_buffer_size
Value: 1048576
*************************** 4. row ***************************
Variable_name: join_buffer_size
Value: 131072
*************************** 5. row ***************************
Variable_name: key_buffer_size
Value: 8384512
*************************** 6. row ***************************
Variable_name: myisam_sort_buffer_size
Value: 8388608
*************************** 7. row ***************************
Variable_name: net_buffer_length
Value: 16384
*************************** 8. row ***************************
Variable_name: preload_buffer_size
Value: 32768
*************************** 9. row ***************************
Variable_name: read_buffer_size
Value: 131072
*************************** 10. row ***************************
Variable_name: read_rnd_buffer_size
Value: 262144
*************************** 11. row ***************************
Variable_name: sort_buffer_size
Value: 2097144
*************************** 12. row ***************************
Variable_name: sql_buffer_result
Value: OFF
12 rows in set (0.00 sec)
>how much RAM do you have?
Currently I'm using a test machine with just 8GB, but we are in the process of ordering a new server specifically for this project, and it will have at least 16GB. How much would you recommend putting in?
>How time is represented?
Time is represented as a decimal year, for example: 2010.234567
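For reference, here is a small Python sketch of how such a decimal-year value maps back to a calendar timestamp. It assumes the fractional part is a fraction of that calendar year's actual length (366 days in leap years), which is how our values are generated:

```python
from datetime import datetime, timedelta
import calendar

def decimal_year_to_datetime(t):
    """Convert a decimal year like 2010.234567 to a datetime.

    Assumes the fraction is of the calendar year's actual length
    (366 days in leap years, 365 otherwise).
    """
    year = int(t)
    days_in_year = 366 if calendar.isleap(year) else 365
    seconds_into_year = (t - year) * days_in_year * 86400
    return datetime(year, 1, 1) + timedelta(seconds=seconds_into_year)

# 2010.0 is midnight on Jan 1, 2010; 2010.5 falls in early July
print(decimal_year_to_datetime(2010.0))
print(decimal_year_to_datetime(2010.5))
```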
>How much disk do you have? How big is the table now? Allow 2x-4x the disk space after adding an index and/or changing to InnoDB.
Disk space is not a problem; I will have at least 2TB. Right now the table is 10.7GB and the index is 14.7GB.
Thank you very much,
Misha