Re: Database Design for huge set of data
Posted by: Rick James
Date: March 17, 2009 10:22AM

That's about 200GB of data if you use MyISAM. Be sure to use the MAX_ROWS option on CREATE TABLE. The two indexes will also take about that much space. And if you ever need to do an ALTER TABLE, you will need room for a complete copy. So I hope you have more than 800GB of disk space in a single logical volume. RAID striping would be helpful.
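
Roughly like this -- the datatypes and the MAX_ROWS number are just guesses; set MAX_ROWS comfortably above the row count you expect:

CREATE TABLE main (
   id1        BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
   id2        INT UNSIGNED    NOT NULL,
   the_double DOUBLE          NOT NULL,
   `time`     DATETIME        NOT NULL,
   PRIMARY KEY (id1),
   INDEX (id2)                -- this is the index discussed below
) ENGINE=MyISAM
  MAX_ROWS=2000000000;        -- lets MyISAM pick a pointer size big enough for that many rows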

Is id1 AUTO_INCREMENT? Is id2 effectively random; that is, does it have a lot of different values that jump around? If so, the index on id2 will kill performance. Think of it this way -- when you add a row, it needs to update the index for id2, but the index block it needs to update is not likely to be cached. So that will be a disk hit (read, modify, and eventually rewrite). On an IDE drive, you can get only 60 (or so) disk hits/sec. So you will be limited to about 60 inserts per second.

Panic? Well, let's dig into why you need an index on id2. I claim, even without knowing more details, that you don't need it. Instead, you need to build summary table(s) and have the id2 index on them. Here's how it goes:

You stream the data into the main table; it increments the auto_increment id1 and effectively appends to the end of the data file and the end of the id1 index. Very little disk I/O -- on the order of one disk hit per 50 rows. This lets you insert over 1000 rows/second. I recommend you also batch the data -- LOAD DATA INFILE, or batched INSERTs. Your data arrives at about 240 rows/sec. Is it steady? Or bursty? LOAD/batching may or may not be necessary.
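
Either flavor works; for example (the values, file name, and column list are just illustrative):

-- Batched INSERT: one statement per second or so of data
INSERT INTO main (id2, the_double, `time`)
   VALUES (17, 1.23, '2009-03-17 10:00:01'),
          (42, 4.56, '2009-03-17 10:00:01'),
          (93, 7.89, '2009-03-17 10:00:01');   -- ~240 rows per statement

-- Or LOAD DATA: write the batch to a file, then
LOAD DATA INFILE '/tmp/batch.csv'
   INTO TABLE main
   FIELDS TERMINATED BY ','
   (id2, the_double, `time`);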

But what about id2? Build a summary table. One approach: every hour, pick up where you left off an hour ago (remember the last id1 you processed) and do something like this:
-- @leftoff = the last id1 processed by the previous run
--            (keep it in your script, or in a little one-row table)
SELECT @maxid := MAX(id1) FROM main;
INSERT INTO summary_by_id2
   SELECT id2, COUNT(*), SUM(the_double), ...
      FROM main
      WHERE id1 > @leftoff AND id1 <= @maxid
      GROUP BY id2;
INSERT INTO summary_by_minute
   SELECT LEFT(`time`, 16), COUNT(*), SUM(the_double), ...
      FROM main
      WHERE id1 > @leftoff AND id1 <= @maxid
      GROUP BY LEFT(`time`, 16);
SET @leftoff = @maxid;   -- remember for the next run
The summary table(s) would have index(es) that make reporting easy and fast; you don't touch 'main' for reports.
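
For instance, the summary tables might look something like this (cnt, sum_double, and the datatypes are just placeholders -- add whatever aggregates your reports need):

CREATE TABLE summary_by_id2 (
   id2        INT UNSIGNED NOT NULL,
   cnt        INT UNSIGNED NOT NULL,
   sum_double DOUBLE       NOT NULL,
   INDEX (id2)                       -- the id2 index lives here, not on main
) ENGINE=MyISAM;

CREATE TABLE summary_by_minute (
   `minute`   CHAR(16)     NOT NULL, -- 'YYYY-MM-DD HH:MM', i.e. LEFT(time, 16)
   cnt        INT UNSIGNED NOT NULL,
   sum_double DOUBLE       NOT NULL,
   INDEX (`minute`)
) ENGINE=MyISAM;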

OTOH, if you batch your input (e.g., collect for 1 second, then INSERT 240 rows at once), the batching program could simultaneously compute the row(s) needed for the summary tables and do those inserts.

There are more details to consider, such as whether to have unique keys or allow duplicates; I prefer allowing dups rather than updating existing summary rows. The reports deal with the dups.
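
That is, a report simply re-aggregates the summary rows, so any duplicates collapse on their own. Using the sketch column names from above (the WHERE clause is only an example):

SELECT id2,
       SUM(cnt)        AS total_rows,
       SUM(sum_double) AS total_double
   FROM summary_by_id2
   WHERE id2 BETWEEN 1000 AND 1999   -- whatever range the report wants
   GROUP BY id2;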

Are you using PARTITION? That will make it easy and fast to DELETE (via DROP) the oldest day's data. Optionally you could keep the old rows in the summary table(s) long after the raw data is gone.
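
A sketch of what that looks like, assuming you partition main by day on the `time` column. Two caveats: the partitioning column must be part of every PRIMARY/UNIQUE key, so the PK would become (id1, `time`); and repartitioning rebuilds the table, so set this up before it gets huge if you can:

ALTER TABLE main DROP PRIMARY KEY, ADD PRIMARY KEY (id1, `time`);

ALTER TABLE main
   PARTITION BY RANGE (TO_DAYS(`time`)) (
      PARTITION p20090316 VALUES LESS THAN (TO_DAYS('2009-03-17')),
      PARTITION p20090317 VALUES LESS THAN (TO_DAYS('2009-03-18')),
      PARTITION pmax      VALUES LESS THAN MAXVALUE
   );

-- Purging the oldest day is a quick file drop, not a slow DELETE:
ALTER TABLE main DROP PARTITION p20090316;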

With the summary tables, and with the id2 index removed from main, you are down to more like 650GB -- 200GB raw data, 100GB for index(id1), maybe 50GB for the summary tables and their indexes, plus 300GB of elbow room for ALTER.

Also, check out InfoBright. They can do the job in a much smaller footprint.

InnoDB -- plan on 3x more disk space.
