MySQL Forums
Forum List  »  Newbie

Re: appropriate database design and query
Posted by: Shishir Mittal
Date: January 19, 2009 02:31PM

Hello,
I am part of the team working on the above said project with Gaurav Mishra, the author of this post.
Sorry for the delayed reply from our end. Actually, we are a newbie in this field and it took some time for us to understand the replies posted.

@Rick: Thanks a lot for such a long reply and making us understand the depth of the problem and its possible solutions.
As advised, we have clustered the images into 4 groups (we can even form more clusters if you suggest) each with around 10,000 images on the basis of probabilistic distribution of values of a particular value of 2 of the columns.
Each column can have a value between 0 and 20000, but the distribution is non uniform. To be precise, for most of the columns more than 50% images of the images have values < 40.
For clustering we chose two columns which have 50% images with values between 0 and 100 for a set of 40,000 images and other 50 % between 100 and 20000.
I hope this also gives you an idea of about how sparse our database is.
Corresponding to each cluster there are 8 tables and in each table there are 321 columns (320*8 features and 1 image id). Lets name them as TxCy (0<=x<=7 and 0<=y<=319)
Ultimately we need just top 200 images (from among 10000 images according to present design) which have least differences with the query image.
So suppose after processing query image we determine that it belongs to cluster 6.
Then we require a query to produce the result
select imageId from T0 order by summation of abs(query.TxCy - dbImage.TxCy) asc limit 200
What should be the SQL query to get the desired result?
Would the query time be less if we consider the absolute differences of say only 5 most important columns? If yes, then may be we can perform the query based on 5 most important columns for the input query image (Note that importance of columns varies with the image) but then there wouldn't be any use of the indexing.
We set imageId as the primary key and indexed the tables on the basis of two Columns which were used for clustering.
The query time in this design was greatly reduced :)
Still can we know the exact number of clusters and the number of tables we should have based on some mathematics rather than experimentation?

Now we have two set of designs for 4 clusters each with 8 tables
A)Each table with 320 columns with primary key as imageId and query for minimum of summation of abs difference of 5 most important columns. (no indexing as the 5 most important columns vary with image).

B)Each table with 4 columns (imageId, 2 fixed columns, rest entries as comma separated in blob) with imageId as primary key, indexing on the two columns and query for min summation of abs difference of just 2 columns.

In case A, we can directly obtain the top 200 queries.
But in case B, to improve the query results we need to take in say 1000 query results and process them (obtaining the values for all of them by separating the commas) to get the 5 important column values and then short list the required the 200 queries based on the 5 values.

Which is a better design?


@Hector: We tried the masterpiece code written by you. Ultimately, it not practical to implement it for this case, because of the number of rows in the design suggested.
Also the query time increased a bit compared to our earlier design of 8 tables each with 320 columns with no clustering.
But, your code is simply a piece of art. I learned quite a few tricks from it.
Using iterations just from a SQL is something pretty new from me.
Thanks for sharing the code with us!

Options: ReplyQuote


Subject
Written By
Posted
Re: appropriate database design and query
January 19, 2009 02:31PM


Sorry, you can't reply to this topic. It has been closed.

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.