How to speed up the building of huge (>15 mln entries) normalized database?
Hello everybody,
I have a huge text file of 20 GB. I'm extracting all the possible "word-sense" pairs from it. One word may have many senses. I need to build a dictionary that lists, for each word, all its possible senses and their quantities. According to my preliminary estimates, I'm going to have at least 15 million dictionary entries.
I use JDBC to access MySQL.
My database structure is the following:
1) Words table (words)
+----------+----------------+------+-----+---------+----------------+
| Field    | Type           | Null | Key | Default | Extra          |
+----------+----------------+------+-----+---------+----------------+
| word_id  | int(11)        | NO   | PRI | NULL    | auto_increment |
| word     | varbinary(255) | YES  | UNI | NULL    |                |
| rate     | int(11)        | YES  |     | NULL    |                |
+----------+----------------+------+-----+---------+----------------+
2) Senses table (senses)
+----------+----------------+------+-----+---------+----------------+
| Field    | Type           | Null | Key | Default | Extra          |
+----------+----------------+------+-----+---------+----------------+
| sense_id | int(11)        | NO   | PRI | NULL    | auto_increment |
| sense    | varbinary(255) | YES  | UNI | NULL    |                |
+----------+----------------+------+-----+---------+----------------+
3) The dictionary (dictionary)
+----------+---------+------+-----+---------+-------+
| Field    | Type    | Null | Key | Default | Extra |
+----------+---------+------+-----+---------+-------+
| word_id  | int(11) | NO   | PRI |         |       |
| sense_id | int(11) | NO   | PRI |         |       |
| quantity | int(11) | YES  |     | NULL    |       |
+----------+---------+------+-----+---------+-------+
I fill the database as follows:
1) Process the text dump, keeping all the extracted information in a hash table (key = word, value = list of senses and their quantities); see the first sketch after this list. When I am about to run out of RAM, I start filling in the dictionary.
2) Create a batch to fill the words and the senses tables (just INSERT IGNORE, so duplicates are skipped). Execute this batch.
3) For each entry in the hash table, get the corresponding word_id and sense_id from the words and senses tables, then create a batch for filling in the dictionary table (a PreparedStatement version is in the second sketch below):
stDictUpdate.addBatch(
    "INSERT INTO dictionary(word_id,sense_id,quantity)"
    + " VALUES ('"+nLemId+"','"+nSenId+"','"+nQuantity+"')"
    + " ON DUPLICATE KEY UPDATE quantity=quantity+"+nQuantity+";");
4) Execute this batch.
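To make step 1 concrete, here is a minimal sketch of the in-memory accumulation (the class and method names are made up for illustration, not my real code):

import java.util.HashMap;
import java.util.Map;

// Step 1: accumulate word -> (sense -> quantity) counts in RAM.
public class PairCounter {

    private final Map<String, Map<String, Integer>> counts =
            new HashMap<String, Map<String, Integer>>();

    // Called once per (word, sense) pair extracted from the dump.
    public void addPair(String word, String sense) {
        Map<String, Integer> senses = counts.get(word);
        if (senses == null) {
            senses = new HashMap<String, Integer>();
            counts.put(word, senses);
        }
        Integer quantity = senses.get(sense);
        senses.put(sense, quantity == null ? 1 : quantity + 1);
    }

    public Map<String, Map<String, Integer>> getCounts() {
        return counts;
    }
}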
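And here is roughly what steps 2-4 would look like with PreparedStatements instead of the string concatenation above (a sketch against the schema above; conn is an open java.sql.Connection, and the senses table would be filled the same way as the words table):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Steps 2-4 with PreparedStatements. Table and column names are the
// ones from the schema above; everything else is for illustration.
public class DictionaryLoader {

    // Step 2: batch the unique words into the words table
    // (INSERT IGNORE skips duplicates on the unique word column).
    public void insertWords(Connection conn, Iterable<String> words)
            throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
                "INSERT IGNORE INTO words (word) VALUES (?)");
        try {
            for (String w : words) {
                ps.setString(1, w);
                ps.addBatch();
            }
            ps.executeBatch();
        } finally {
            ps.close();
        }
    }

    // Steps 3-4: batch the resolved (word_id, sense_id, quantity)
    // triples into the dictionary table; each triple is an int[3].
    public void insertDictionary(Connection conn, Iterable<int[]> triples)
            throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO dictionary (word_id, sense_id, quantity)"
                + " VALUES (?, ?, ?)"
                + " ON DUPLICATE KEY UPDATE quantity = quantity + VALUES(quantity)");
        try {
            conn.setAutoCommit(false);   // one commit per batch, not per row
            for (int[] t : triples) {
                ps.setInt(1, t[0]);      // word_id
                ps.setInt(2, t[1]);      // sense_id
                ps.setInt(3, t[2]);      // quantity
                ps.addBatch();
            }
            ps.executeBatch();
            conn.commit();
        } finally {
            ps.close();
        }
    }
}

As far as I know, the Connector/J option rewriteBatchedStatements=true in the JDBC URL makes the driver rewrite such batches into multi-row INSERTs, which may matter at this scale.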
All this, especially step 3, is extremely slow. Is there any way to speed up the dictionary construction by restructuring the queries, restructuring the database itself, and so on?
Thanks in advance for your help.