Questions about performance for huge table (100M+ rows)
Recently, I was hired to build a website that is somewhat different from the ones I have made so far. Specifically, it will have an extremely large database, while I have only worked with rather small databases whose performance was never an issue. Actually, only one table is giving me trouble: the one that will have in excess of 100M rows, with 100k+ SELECTs and UPDATEs per day in a 1:1 ratio and an unknown number of INSERTs.
I have been running benchmarks for the last 2-3 days, but the results have been pretty inconclusive. So I've come here to ask for a bit of help from someone who has more experience with big tables than I do.
Here is a sample of my current structure:
CREATE TABLE `test` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `name` varchar(50) collate utf8_unicode_ci NOT NULL,
  `special` tinyint(1) NOT NULL default '0',
  `content` text collate utf8_unicode_ci NOT NULL,
  `hash` varchar(40) collate utf8_unicode_ci NOT NULL,
  `ip` varchar(15) collate utf8_unicode_ci NOT NULL,
  `added` int(10) unsigned NOT NULL,
  `updated` int(10) unsigned NOT NULL,
  `hits` int(10) unsigned NOT NULL default '0',
  PRIMARY KEY (`id`),
  KEY `hash` (`hash`),
  KEY `name` (`name`)
);
The field names are not important at the moment and are just here to give you a general idea. Also, more fields will be added later on.
Anyway, here are the questions I have. If you have any other tips, please feel free to share them, as I appreciate all the help.
1. First off, I'm not sure which storage engine to use. So far I've only worked with MyISAM, but its table-level locking could be a problem: for every SELECT that fetches more than one row, there will be an equal number of UPDATEs on the same rows. I was thinking of switching to InnoDB since it has row-level locking, but I don't have any experience with it and only limited time to test it out (a rough sketch of the switch is below).
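For reference, this is how I assume the conversion itself would be done; just a sketch, and I haven't tried it on a table of this size:

ALTER TABLE `test` ENGINE=InnoDB;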
2. Since I need to check whether a row with the same `content` field (which could be very long) already exists before an INSERT, instead of creating a FULLTEXT index I added another field, `hash`, which holds the SHA-1 hash of `content` and has a regular index (the duplicate check I have in mind is sketched below). Will using `hash` give me better performance than a FULLTEXT index on `content`, or is it just unnecessary?
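A minimal sketch of that duplicate check, assuming the hash is precomputed in the application (MySQL's SHA1() function would also work):

SELECT `id` FROM `test` WHERE `hash` = SHA1('...the new content...') LIMIT 1;
-- or, with the hash computed in the application:
SELECT `id` FROM `test` WHERE `hash` = ? LIMIT 1;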
3. The `name` is actually the base-36 encoded `id`, but it can also hold a custom value, which is why all the SELECTs are done using `name` instead of `id`. That is what's keeping me from splitting the table into smaller ones (e.g. based on the first letter of `hash`, like table_a...table_z and table_0...table_9). I also thought of having one table with just `id`, `name` and `hash` (which would be used for most SELECTs) and another with the rest of the fields (where fewer SELECTs and almost all UPDATEs would be done), with `id` as a foreign key; a rough sketch of that layout is below. In any case, would splitting the table be a good idea, and if so, what would you suggest?
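Roughly what I have in mind for that split, with hypothetical table names (`test_lookup` / `test_data`); the FOREIGN KEY would only be enforced if I end up on InnoDB:

CREATE TABLE `test_lookup` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `name` varchar(50) collate utf8_unicode_ci NOT NULL,
  `hash` varchar(40) collate utf8_unicode_ci NOT NULL,
  PRIMARY KEY (`id`),
  KEY `hash` (`hash`),
  KEY `name` (`name`)
);

CREATE TABLE `test_data` (
  `id` int(10) unsigned NOT NULL,
  `special` tinyint(1) NOT NULL default '0',
  `content` text collate utf8_unicode_ci NOT NULL,
  `ip` varchar(15) collate utf8_unicode_ci NOT NULL,
  `added` int(10) unsigned NOT NULL,
  `updated` int(10) unsigned NOT NULL,
  `hits` int(10) unsigned NOT NULL default '0',
  PRIMARY KEY (`id`),
  FOREIGN KEY (`id`) REFERENCES `test_lookup` (`id`)
);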
4. The `name` and `hash` are effectively unique (a SHA-1 collision takes on the order of 2^63 operations at best, which doesn't really worry me). The reason I'm using a regular index instead of UNIQUE is that UNIQUE seems to index the field no matter what its content is, and some rows can be empty (which would mean NULL for the indexed fields if they were UNIQUE). I'm thinking I can get a smaller index, and therefore higher speed, since a regular index seems to contain only distinct rows. I'm guessing this is true, but I could be mistaken. (The UNIQUE variant I decided against is sketched below for reference.)
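For completeness, the UNIQUE variant I'm weighing would look roughly like this; just a sketch, and it assumes `hash` is made nullable so that empty rows store NULL instead of an empty string:

ALTER TABLE `test`
  MODIFY `hash` varchar(40) collate utf8_unicode_ci DEFAULT NULL,
  DROP INDEX `hash`,
  ADD UNIQUE KEY `hash` (`hash`);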
5. Almost all of the fields can be used for ordering (except `ip`). Should I index them all, or will that be more of a drag on INSERT and UPDATE performance, considering that the results that need ordering can be cached for up to 24 hours? (A typical ordered query is sketched below.)
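The kind of ordered query I'd be caching, roughly (a hypothetical example, not the final query):

SELECT `name`, `hits` FROM `test` ORDER BY `hits` DESC LIMIT 100;
-- as far as I understand, without an index on `hits` this needs a filesort over the whole table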
Those are my top-priority issues, but like I said, any other tips are more than welcome.
Big thanks ahead, even if only for reading this large post :)