Help with search engine-like behavior
Posted by:
Date: December 14, 2010 06:52PM

Hello, I'm having trouble wrapping my brain around this problem.

This might turn out a little lengthy but I just want to give you a clear idea of what I've done and what the problem is.

I have a large product database (~1 million items from various vendors) and want to let users search in free text (e.g. "yellow gold 18 in. necklace w/ blue topaz") and get the proper items back. There's obviously a fair bit of work involved, and a good deal of it has already been done in my DB and on paper.

The first step is a process of "normalization." The user can type many variations that mean the same thing (e.g. "eighteen inches", "18in", "18 in.", '18"'). There are also several shorthand notations for various words, plus different tenses and plural forms. This process distills all the tokens into a standard format.
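To make the normalization step concrete, here is a minimal sketch in Python (chosen for brevity; the same logic ports to PHP). The lookup tables `NUMBER_WORDS` and `ABBREVIATIONS`, the canonical form `18in`, and the function name `normalize` are all assumptions for illustration, not what is actually in my DB. Tenses and plural forms are left out.

```python
import re

# Hypothetical lookup tables; the real ones would live in the database.
NUMBER_WORDS = {"sixteen": "16", "eighteen": "18", "twenty": "20"}
ABBREVIATIONS = {"w/": "with", "blk": "black", "lg": "large"}

# A number followed by an inch marker: 18in, 18 in., 18", 18 inches.
# The lookahead keeps words such as "inside" from matching.
INCH_RE = re.compile(r'(\d+(?:\.\d+)?)\s*(?:inches|inch|in\.?|")(?=\s|$)')

def normalize(query):
    """Return the query as a list of tokens in one standard format."""
    text = query.lower()
    # Replace spelled-out numbers first so "eighteen inches" becomes "18 inches".
    for word, digits in NUMBER_WORDS.items():
        text = re.sub(r"\b%s\b" % re.escape(word), digits, text)
    # Collapse every inch notation into the assumed canonical form "<n>in".
    text = INCH_RE.sub(r"\1in", text)
    # Expand shorthand tokens; unknown tokens pass through unchanged.
    return [ABBREVIATIONS.get(token, token) for token in text.split()]

print(normalize("yellow gold 18 in. necklace w/ blue topaz"))
```

With these tables, all four inch variants from the paragraph above reduce to the single token `18in`.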

Then I get the token_id and part of speech for each token.

Then the goal is to find compound nouns and adjectives (e.g. "dog collar"). This isn't done yet, but I think I can handle it.

Next, I look for the following pattern: (ADJ* NOUN+)+. That itself I can do; what comes after it is where I start to have problems. In PHP I intend to form an array of arrays, keyed on the noun, with the adjectives associated with that noun as the values of the inner array. For example, the query I used above would become 3 groups:

['gold']=>[0]=>'yellow'
['necklace']=>[0]=>'18"'
['topaz']=>[0]=>'blue'
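The grouping above can be sketched as follows, in Python for brevity (a dict of lists plays the role of the PHP array of arrays). The function name `group_adjectives` and the tag names `ADJ`, `NOUN` and `PREP` are assumptions for illustration. The sketch also assumes compound nouns have already been merged into single tokens, so each noun simply takes the adjectives that came directly before it.

```python
def group_adjectives(tagged):
    """Group adjectives under the noun that follows them.

    tagged is a list of (token, part_of_speech) pairs.
    """
    groups = {}
    pending = []  # adjectives seen since the last noun
    for token, pos in tagged:
        if pos == "ADJ":
            pending.append(token)
        elif pos == "NOUN":
            # Attach the waiting adjectives to this noun, then start over.
            groups.setdefault(token, []).extend(pending)
            pending = []
        else:
            # Any other part of speech breaks the pattern;
            # adjectives left dangling before it are dropped.
            pending = []
    return groups

tagged_query = [
    ("yellow", "ADJ"), ("gold", "NOUN"),
    ('18"', "ADJ"), ("necklace", "NOUN"),
    ("w/", "PREP"),
    ("blue", "ADJ"), ("topaz", "NOUN"),
]
print(group_adjectives(tagged_query))
# {'gold': ['yellow'], 'necklace': ['18"'], 'topaz': ['blue']}
```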

In practice I'll probably use the token_ids. The idea is to group the adjectives with their associated nouns. I'm leaving finding the "principal" noun for later (in this case 'necklace'), although it may not be as hard as I think: most likely the last noun encountered is the main one, excluding prepositional phrases.
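A rough sketch of that "last noun, excluding prepositional phrases" guess, in Python for brevity. It assumes the first preposition starts a phrase that runs to the end of the query, which is a simplification; `principal_noun` and the tag names are hypothetical.

```python
def principal_noun(tagged):
    """Return the last noun seen before the first preposition, or None."""
    head = None
    for token, pos in tagged:
        if pos == "PREP":
            break  # assumed: everything after this belongs to the phrase
        if pos == "NOUN":
            head = token
    return head

tagged_query = [
    ("yellow", "ADJ"), ("gold", "NOUN"),
    ('18"', "ADJ"), ("necklace", "NOUN"),
    ("w/", "PREP"),
    ("blue", "ADJ"), ("topaz", "NOUN"),
]
print(principal_noun(tagged_query))  # necklace
```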

Okay so that's the search side of things... I'm in the process of analyzing the product titles to construct similar groupings.

Now the problem. How do I store the groupings? For example, take a search for "large black cheesey sunglasses with pink lenses". This would include the grouping:

['sunglasses']=> {
[0]=>'large'
[1]=>'black'
[2]=>'cheesey'}

The query should be the equivalent of 'WHERE noun1 LIKE BINARY "sunglasses" AND (adj LIKE BINARY "black" OR ...) AND noun2 LIKE...'. I'd later rank the results based on how many adjectives matched each noun, etc.
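To make the intended query and ranking concrete, here is a sketch using Python's built-in sqlite3 module rather than MySQL. It assumes one row per (product, noun, adjective) pair in a hypothetical table `product_term`. That layout, the sample rows and the `search` function are all made up for illustration, and the sketch handles a single noun with a non-empty adjective list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout: one row per (product, noun, adjective) pair.
conn.execute("CREATE TABLE product_term (product_id INTEGER, noun TEXT, adj TEXT)")
conn.executemany(
    "INSERT INTO product_term VALUES (?, ?, ?)",
    [
        (1, "sunglasses", "black"),  # made-up sample rows
        (1, "sunglasses", "large"),
        (1, "lenses", "pink"),
        (2, "sunglasses", "black"),
        (2, "lenses", "green"),
        (3, "necklace", "gold"),
    ],
)

def search(conn, noun, adjectives):
    """Rank products having this noun by how many adjectives matched."""
    placeholders = ", ".join("?" for _ in adjectives)
    sql = (
        "SELECT product_id, COUNT(DISTINCT adj) AS matched "
        "FROM product_term "
        "WHERE noun = ? AND adj IN (%s) "
        "GROUP BY product_id "
        "ORDER BY matched DESC, product_id" % placeholders
    )
    return conn.execute(sql, [noun] + list(adjectives)).fetchall()

print(search(conn, "sunglasses", ["large", "black", "cheesey"]))
# [(1, 2), (2, 1)]
```

Product 1 matches two of the three adjectives and product 2 matches one, so product 1 ranks first. A product with the noun but none of the adjectives is not returned at all.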

I guess I'm looking for a more elegant way than storing a massive number of individual noun/adj combinations (e.g. for sunglasses (SG): SG=>black, SG=>blk lg, SG=>lg, SG=>lg cheesey, etc.). And this is one product. There will be many more groupings regarding sunglasses.

Any ideas?

If this isn't a good approach and another one comes to mind, would you care to share it? I'm kinda stumped on this one, and it's the last thing preventing me from officially opening my site.

Thanks for your help.

