MySQL Forums

Re: How to set up an extremely large database
Posted by: Steve Doe
Date: October 22, 2016 07:01AM

Steve Doe Wrote:
-------------------------------------------------------
> I'm trying to create my own word information
> database: descriptions, synonyms, antonyms,
> homonyms, idioms, etc. There are an estimated
> 1,025,110 words in the English language.
>
> Combine that with a stylometry database of
> 800,000 stories with about 150,000 descriptive tags
> and phrases per story.
>
> How much storage space should be available for it?
> How should it be set up and managed, and do you
> have any other advice?
>
> Thanks
> Steve

I'm still learning, and the first thing I have learned is that I shouldn't use MySQL for this. I should use an inverted index.

The inverted index is a central component of a typical search engine indexing algorithm. The goal of a search engine implementation is to optimize the speed of one query: find the documents where word X occurs. The usual starting point is a forward index, which stores the list of words for each document. Answering the query from the forward index alone would require iterating sequentially through every document and every word in it to check for a match. The time, memory, and processing resources needed for such a scan are often unrealistic at scale. So the forward index is inverted: instead of listing the words per document, the inverted index lists the documents per word.

With the inverted index created, the query can be resolved by jumping straight to the word's entry (via random access by word id) and reading off its list of documents.
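
To make that concrete for myself, here is a minimal sketch in Python of building a forward index, inverting it, and comparing the two lookups. The function names and the three tiny sample documents are my own illustration, not from any library, and a real engine would also handle punctuation, stemming, and on-disk storage.

```python
from collections import defaultdict


def build_forward_index(documents):
    """Map each document id to the list of words it contains."""
    return {doc_id: text.lower().split() for doc_id, text in documents.items()}


def invert(forward_index):
    """Turn {doc_id: [words]} into {word: sorted list of doc_ids}."""
    inverted = defaultdict(set)
    for doc_id, words in forward_index.items():
        for word in words:
            inverted[word].add(doc_id)
    return {word: sorted(doc_ids) for word, doc_ids in inverted.items()}


def search_forward(forward_index, word):
    """Slow path: scan every document and every word in it."""
    return sorted(doc_id for doc_id, words in forward_index.items() if word in words)


def search_inverted(inverted_index, word):
    """Fast path: a single dictionary lookup on the word."""
    return inverted_index.get(word, [])


# Hypothetical sample data, only for illustration.
documents = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick thinking saves the day",
}

forward = build_forward_index(documents)
inverted = invert(forward)

print(search_inverted(inverted, "quick"))  # [1, 3]
```

Both search functions return the same answer, but `search_forward` touches every word of every document while `search_inverted` does one lookup, which is the whole point of inverting.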

As for space, here's an example based on indexing human DNA. Since the human genome contains more than 3 billion base pairs, and the index has to store a DNA substring for every entry plus a 32-bit integer for the index position itself, the storage requirement for such an inverted index would probably be in the tens of gigabytes.
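
A rough back-of-the-envelope check of those numbers, and of my own case, might look like the sketch below. It assumes one uncompressed 32-bit integer (4 bytes) per index entry and counts nothing else: no stored substrings, no word dictionary, no overhead. Those are my simplifying assumptions, so treat the results as an order-of-magnitude guide only; real engines typically compress their lists.

```python
BYTES_PER_ID = 4  # assumption: one uncompressed 32-bit integer per index entry


def raw_postings_bytes(num_postings, bytes_per_id=BYTES_PER_ID):
    """Bytes needed for the integer ids alone, ignoring all other overhead."""
    return num_postings * bytes_per_id


def to_gb(num_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_bytes / 1e9


# DNA example: one entry per base-pair position, about 3 billion of them.
dna_positions = 3_000_000_000
dna_bytes = raw_postings_bytes(dna_positions)

# My case: 800,000 stories with about 150,000 tags and phrases each.
stories = 800_000
tags_per_story = 150_000
story_postings = stories * tags_per_story
story_bytes = raw_postings_bytes(story_postings)

print(f"DNA positions alone: {to_gb(dna_bytes):.0f} GB")   # 12 GB
print(f"Story/tag postings:  {to_gb(story_bytes):.0f} GB")  # 480 GB
```

So the integers alone come to about 12 GB for the DNA case, which fits "tens of gigabytes" once the substrings are added, and to roughly 480 GB for the story tags before any compression.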

