How to efficiently update word frequencies?
I am trying to build a large index of English word frequencies based on the English Wikipedia.
Wikipedia is indexed in about 227 thousand blocks; each block (except possibly the last) contains 100 pages.
I'm using SpaCy to find the base form of each word and its part of speech.
It is not the word alone but the pair (word, pos) that must be unique (case sensitive). For example, the index must contain both ('name', 'VERB') and ('name', 'NOUN'), each with its own frequency.
I have a table words:
CREATE TABLE `words` (
`word` varchar(45) NOT NULL,
`pos` varchar(6) NOT NULL,
`count` bigint unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`word`,`pos`)
)
Each block yields several thousand words, each with its count.
How can I do the following:
- if the pair (word, pos) does not exist in the table: add the pair to the table with the count from the block
- if it exists: set count for that (word, pos) pair to the table's count plus the block's count
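To make the two rules above concrete, here is a rough sketch of what I imagine the direct variant would look like, assuming MySQL's INSERT ... ON DUPLICATE KEY UPDATE is the right tool. The two rows are made-up sample values, not real counts.

```sql
-- Sketch only: the two rows are made-up sample values from one block.
INSERT INTO `words` (`word`, `pos`, `count`)
VALUES
  ('name', 'VERB', 3),
  ('name', 'NOUN', 12)
-- If the primary key (word, pos) already exists, add the new count
-- to the stored one instead of inserting a second row.
-- VALUES(col) is deprecated since MySQL 8.0.20 in favour of a row alias,
-- but it is still accepted.
ON DUPLICATE KEY UPDATE
  `count` = `count` + VALUES(`count`);
```

In practice each block would send several thousand rows in one multi-row INSERT like this.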
Is it better to update the main table directly, or to first fill a small table block_words with the same structure and then add its rows to the main table with a single SQL query?
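For comparison, this is roughly how I picture the second variant. It is only a sketch under my own assumptions: block_words is the hypothetical staging table mentioned above, and how it gets filled for each block is left open.

```sql
-- Hypothetical staging table with the same structure as `words`.
CREATE TABLE `block_words` LIKE `words`;

-- (bulk-load the (word, pos, count) rows of one block into `block_words` here)

-- Merge the whole block into the main table in a single statement.
INSERT INTO `words` (`word`, `pos`, `count`)
SELECT `word`, `pos`, `count`
FROM `block_words`
ON DUPLICATE KEY UPDATE
  `words`.`count` = `words`.`count` + VALUES(`count`);

-- Empty the staging table before the next block.
TRUNCATE TABLE `block_words`;
```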