MySQL :: Re: appropriate database design and query

New Topic

Re: appropriate database design and query

Posted by: Rick James
Date: January 15, 2009 10:10AM

The problem boils down to a massive computational and/or disk I/O bottleneck.

Plan A: 8 tables with lots of columns. This is already a compromise between Plans B and C.

Plan B: 8*5*4*4*4 tables -- impractical

Plan C: One table with N*8*5*4*4*4 rows. (N is number of images.) This will lead to fetching a LOT of rows.

First your solution has to collect the candidate images. Then it checks each one for the metric you have devised. Presumably most images are eliminated before hitting the ORDER BY and delivering the results.

If collecting the candidate images means reading ALL of the data, then why bother with a database; just have a flat file in whatever format is most efficient for parsing.

Ok, so what can a database engine (eg MySQL) do for you? Properly designing the schema and the SELECT, the WHERE clause can eliminate most of the images, thereby cutting down the computational burden. But it does this at a risk of aggravating the I/O burden. In big systems (which I assume this is) my mantra is "count the disk hits". More below.

Once you have whittled the list if candidate images to, say, < 10% of the total, you now have <10% of the computational load. If you can get it down to 1%, that's even better. (Going beyond that may put too much burden on getting there, so let's assume your goal is to get to 1-10% of the images before calculating the metric.)

So, how do we eliminate 90% of the disk hits? Here are some tools:
* Good index (difficult in this case)
* "Clustering" the data -- that is, trying to get consecutive rows to be all useful or all not useful.
* Note that there are (usually) two ways things are clustered: The index, and the data.

If the data is ordered by one of the measurements, then the desired rows for the +/-20 has all the rows clustered for that one measurement. But not for the rest. It is not easy to devise an index that deals with more than one simultaneous range. (A typical example is latitude & longitude -- you can sort on one, but the other is a mess.)

Now enters the Index, which is a separate file (at least in MyISAM). An Index is _usually_ much smaller than the data (remember -- count the disk hits). Let's say we have to hit 20% of the index rows, but extra columns in the index let us eliminate 80% of the rows. We are down to 4% -- this is the fraction of the data rows we have to fetch. 20% of a small index file plus 4% of a large data file -- this adds up to maybe 10% overall.

That leads to having an index with two of the measurements (plus a pointer to the data). My suggested solution extended this to 3. Not knowing the details (distribution of values, etc), I cannot judge what the 'right' balance is.

As for your 8 tables... IF some other scheme works out better, you can do the UNION to merge them, then spray the data into some other structure.

As for BLOBs -- well that is just any set of bytes that the database engine does not care about. That is, a BLOB cannot be indexed, etc. But I am not asking for that. Instead, I envision it to look like:

123,234,345,436,44,55,66,...321

and your application does a 'split on comma' to get the array of measurements, then does the '<20' calculations.

As an aside: INT takes 4 bytes; ',234' takes 4 bytes. That is the BLOB approach is about as compact as if you could have had a zillion separate columns. Splitting on comma is probably faster (CPU-wise) that locating a zillion columns and fetching the values, etc.

One more thing about the BLOB -- If you compress it, it will shrink by about 60% (count the disk hits). Elapsed time for uncompressing is usually less than the elapsed time for fetching a bigger blob. (Do the compress/uncompress in you application, therby also saving the network traffic. Yeah, you are probably running a local socket, but still there is a cost.)

Do I think my approach is nirvana? No. But think out of the box. Something based on my approach may be the most efficient.

Navigate: Previous Message• Next Message

Options: Reply• Quote

Subject

Written By

Posted

appropriate database design and query

gaurav mishra

January 14, 2009 11:10AM

Re: appropriate database design and query

Rick James

January 14, 2009 08:55PM

Re: appropriate database design and query

gaurav mishra

January 15, 2009 01:36AM

Re: appropriate database design and query

hector vass

January 15, 2009 03:10AM

Re: appropriate database design and query

Rick James

January 15, 2009 10:10AM

Re: appropriate database design and query

Shishir Mittal

January 19, 2009 02:31PM

Re: appropriate database design and query

Rick James

January 19, 2009 05:25PM

Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.