
Re: word index - is there a better solution?
Posted by: Craig Logan
Date: June 05, 2010 01:45PM

here's the CREATE TABLE:

CREATE TABLE `words_in_docs` (
  `word_id` mediumint(7) unsigned NOT NULL default '0',
  `word_sort` mediumint(6) unsigned NOT NULL default '0',
  `doc_id` int(9) unsigned NOT NULL default '0',
  `word_location` tinyint(1) unsigned NOT NULL default '0',
  `lang` char(3) character set ascii NOT NULL default 'en',
  KEY `word_id` (`word_id`),
  KEY `word_sort` (`word_sort`),
  KEY `doc_id` (`doc_id`),
  KEY `word_location` (`word_location`),
  KEY `lang` (`lang`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

word_id -> the word id, which translates to the actual word in a word list table
word_sort -> the word's position in the indexed document
doc_id -> points to the respective document
word_location -> whether the word is located in the title, the description, or the entire document
lang -> the document language

the usual query on this table will look something like this:
SELECT doc_id FROM words_in_docs WHERE lang='en' AND word_location=0 AND word_id IN(...)
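as an aside, if that stays the dominant access pattern, one composite key matching the filter order would probably serve it better than the five single-column keys above. just a sketch on my part, untested at this scale:

-- hypothetical covering index for the typical query; MySQL will
-- generally use only one of the single-column keys, while this one
-- matches lang and word_location exactly, ranges over the word_id
-- IN list, and carries doc_id so the lookup never hits the base rows
ALTER TABLE words_in_docs
  ADD KEY idx_lang_loc_word (lang, word_location, word_id, doc_id);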

############

you are probably right about the language column. I guess we'll have around 20 languages, so a tinyint would do... using char makes the database more human-readable, though. given the sheer size of the table, I may have to let that idea go, however.

on the other hand, I could easily make it a char(2), which would make the right choice a little less obvious.
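if we did go the tinyint route, it would look something like this (the lookup table name and layout are made up, just to illustrate the idea):

-- hypothetical: keep the human-readable codes in a tiny lookup table
-- and store only a 1-byte id in the big table
CREATE TABLE languages (
  lang_id tinyint unsigned NOT NULL,
  iso_code char(3) character set ascii NOT NULL,
  PRIMARY KEY (lang_id)
) ENGINE=InnoDB;

-- words_in_docs would then carry `lang_id tinyint unsigned` instead
-- of `lang char(3)`, saving 2 bytes per row and per index entry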

as for the primary key, I could throw in a multi-column index over word_id, doc_id and word_sort, but that would be just for the sake of having a primary key. I will have to look into what you say about MySQL tacking the primary key onto every secondary key... that seems odd.
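for reference, the key I have in mind would be something like this (sketch only; I'm assuming word_sort makes each occurrence unique):

-- with InnoDB this becomes the clustered index, and its three
-- columns would indeed be appended to every secondary index entry,
-- which is the overhead being discussed
ALTER TABLE words_in_docs
  ADD PRIMARY KEY (word_id, doc_id, word_sort);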

############

yesterday, I did some calculations, prompted by the news that the documents won't just be one or two pages (as had been indicated until then) but also complete reports and the like, sometimes hundreds of pages long.

so with an estimated 1,000 unique words per document across 500,000,000 documents:
1,000 * 500,000,000 = 500,000,000,000 rows
no way, I'm not Google

bottom line: if we want to do anything with word sort, we'll have to do it totally differently. we will also have to reconsider keeping a complete word index over the entire document.
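to put rough numbers on that second point (the 50 is pure guesswork on my part): if we only indexed, say, 50 distinct title and description words per document, that's 50 * 500,000,000 = 25,000,000,000 rows. still huge, but a twentieth of the full-text figure.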

############

I thought about sharding. in fact, I did something roughly similar a while back with a massive product index (2B rows)... I had a script that would create plenty of identical tables on different machines as they were needed, and the software was built smart enough to know which server to query. the problem there was that we constantly had servers offline for various reasons, and then large chunks of the product index were unavailable. right now, we're investigating a MySQL Cluster solution for this project.

so what I could do for the current project is take either the doc_id or the word_id and make the search engine smart enough to look for a certain range of ids on a certain DB server. this would also have the benefit of leaving the other id free for table partitioning. it would, however, massively increase the hardware demand.
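the partitioning half of that might look like this (sketch only; the shard naming, the partition count, and which id goes where are all made up for illustration):

-- hypothetical: doc_id ranges are routed to servers by the
-- application, and within each server the table is hash-partitioned
-- on word_id so a word lookup only has to scan one partition
CREATE TABLE words_in_docs_shard (
  `word_id` mediumint unsigned NOT NULL,
  `word_sort` mediumint unsigned NOT NULL,
  `doc_id` int unsigned NOT NULL,
  `word_location` tinyint unsigned NOT NULL,
  `lang` char(3) character set ascii NOT NULL default 'en',
  KEY `idx_lookup` (`lang`,`word_location`,`word_id`,`doc_id`)
) ENGINE=InnoDB
PARTITION BY HASH (word_id)
PARTITIONS 16;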

############

the software will make massive use of caching (in fact, I want at least 1 TB of cache available).

other than that, I am looking for a way to reduce the anticipated number of rows. I have no problem admitting that this project is orders of magnitude larger than anything I've done before, and I don't want to fall face-down in the dirt.

I'd really give something to have a look at the DB layout Google is using.


