Re: sharding steps
Posted by: Rick James
Date: September 10, 2011 10:54AM

First thing to ask: Do you have only one huge table (plus smaller ones) and it is only indexed by the PRIMARY KEY? If so, then sharding is "easy". You devise a way to turn the PK into a host (more below) and spread that huge table across multiple machines. Your code deals with connecting to the right machine before issuing the INSERT/SELECT/...

Any ancillary tables are either kept on all machines (then you have issues of syncing them) or kept on one machine.

How to split among machines?
* Pure hash -- hash the key, map that to a host
* Pure dictionary -- have a lookup table that maps key to host
* Hybrid -- hash to, say, 4096 values; map (many-to-one) those values to machines.

A maintenance issue: If too many items land on one machine, or those items are too 'busy', then you need to migrate items off that machine to other machine(s) or a new machine. 'Pure hash' does not allow such; 'pure dictionary' lets you migrate individual items as needed; 'hybrid' requires that you migrate batches in unison (usually this is 'good enough' and 'fast enough').

If you have two huge tables, you could shard each one independently.

JOINs -- now things get messy. You would need to simulate the JOIN in your code.

Fetching more than one row from the sharded table -- Another messy situation. Some options:
* Multiple 1-row fetches.
* Code on each shard to get what it can for the query, plus centralized code to stitch the results together.

See Spider; NDB Cluster.

If you would like to describe your application further, I can get in to specifics. Need to know something about the sizes of the important tables, selects, how the data is loaded, etc.

Think out of the box. Do you really need all the data accessible via SELECT? That is, can you restructure your application and/or data such that it will fit on a 'normal' server?

Even if you are reinventing Flikr/Facebook/YouTube/etc., those are not really sharded databases. They simply have the items scattered across servers. Plus, there is a central, unsharded, database containing an 'index' of the items. (Yeah, yeah, it is more complicated than that.)

Numeric data can usually be summarized into a database small enough to sit on a normal server.

Options: ReplyQuote


Subject
Written By
Posted
September 09, 2011 12:42PM
Re: sharding steps
September 10, 2011 10:54AM
September 10, 2011 12:53PM
September 11, 2011 10:22AM
September 11, 2011 10:56AM
September 11, 2011 12:21PM
September 11, 2011 12:24PM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.