Re: When do I start... partitioning?
Posted by:
Rick James
Date: October 10, 2010 11:23AM
Do not freeze anything in your schema design until it is big enough and busy enough to know what your 'real' problems are going to be.
"But wait!", you say, "Then it will be too difficult to make the change!".
So, the more important thing to do _now_ is to design the overall architecture to allow for change without bringing down the sight.
Start with "dual master, single writer" and a small number of slaves. And use LVM on Linux (or other) for snapshots.
Dual master provides you a way to perform a costly ALTER through the system, even on huge tables, with only seconds or minutes of downtime.
LVM provides virtually zero downtime for taking a dump; this may become important.
Back to partitioning. Well, if you want to be Facebook size, "sharding" is what you are really after. This is much messier. And it is something to think about today...
What is the biggest table going to be? For example... Suppose you have images (bulky) together with metadata (size, owner, tags) about the images, then do vertical partitioning of that table now. That is, have a separate table with nothing but an id and the image. The metadata would be in the other (much smaller) table.
Probably you can keep the metadata in a single machine, but not the bulky data. So, sharding of the bulky data is required. But now JOINs are not possible. This impacts your _code_, not just the schema.
This leads to an important design decision -- Create a database "layer". This can (and should) be done from day one.
1. "front end" clients with "business logic", but NO SQL statements.
2. "database layer" that takes requests from the frontends, constructs and performs SQL queries.
3. The database, itself.
The abstraction you pick between #1 and #2 is critical (and will likely evolve). Do NOT mimic what the SQL will say. It is ok to have "opaque" objects that represent items in the database -- To the frontend, it would be a string. Do the db layer it would be an AUTO_INCREMENT id, or GUID, or whatever is chosen for the link to the table or tables (as in my suggestion of parallel partitioning). The layer knows whether it can get the answer in a simple query, needs a JOIN, or even needs to connect to multiple servers to get the answer; the frontend does not care.
Ordered list of things to build in:
1. One db machine with LVM -- LVM will let you clone the machine to get replication.
2. Database layer
3. Dual Master
4. readwrite / readonly separation -- frontend does not care; layer will decide to send queries to slave instead of master
5. Slaves
6. Sharding -- this will involve more dual-master pairs, plus slaves.
7. IPO