Re: Zip Code Proximity search
Anyway, I tried using the precalculation version, with really good results:
In Germany, there are around 8,300 different zip codes.
I created 3 range tables:
Zip range 0 to 100 km has 5,757,601 records, 94.9 MB
Zip range 100 to 150 km has 5,671,204 records, 92.6 MB
Zip range 150 to 250 km (seldom used) has 14,223,809 records, 231.8 MB.
Table structure:
CREATE TABLE `zip100` (
  `zip1` mediumint(5) unsigned zerofill NOT NULL default '00000',
  `distance` tinyint(3) unsigned NOT NULL default '0',
  `zip2` mediumint(5) unsigned zerofill NOT NULL default '00000',
  `land1` enum('D','CH','A','NL') NOT NULL default 'D',
  `land2` enum('D','CH','A','NL') NOT NULL default 'D',
  -- covering index: lookups on (land1, zip1) are answered from the index alone
  KEY `land` (`land1`, `zip1`, `distance`, `zip2`, `land2`)
) TYPE = MYISAM PACK_KEYS = 1;
I only need distances up to 250 km, so an unsigned tinyint (0-255) is sufficient.
I'm not sure whether PACK_KEYS noticeably influences performance - maybe someone knows and could post it here?
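One way to measure it yourself (slow on tables this size, since each ALTER rebuilds the table): rebuild with and without packed keys and compare index size and query timings:

ALTER TABLE zip100 PACK_KEYS = 0;
SHOW TABLE STATUS LIKE 'zip100';  -- note Index_length, then re-run your queries
ALTER TABLE zip100 PACK_KEYS = 1;
SHOW TABLE STATUS LIKE 'zip100';  -- compare Index_length and timings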
At present there are no zips other than from Germany in the database.
Values were inserted using a Perl script that fetches its data from the opengeodb project. This usually took around 40-50 minutes on my 1800 MHz developer machine with 512 MB RAM.
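The Perl script isn't shown here, but just as an illustration, the precalculation could also be done in SQL alone. This is a minimal sketch assuming a hypothetical zip_coords table (one row per zip with latitude/longitude, e.g. imported from opengeodb) and using the spherical law of cosines for the great-circle distance:

-- Hypothetical source table (not part of the setup above):
-- CREATE TABLE zip_coords (
--   land enum('D','CH','A','NL') NOT NULL default 'D',
--   zip  mediumint(5) unsigned zerofill NOT NULL default '00000',
--   lat  double NOT NULL,
--   lon  double NOT NULL,
--   PRIMARY KEY (land, zip)
-- );

-- All pairs up to 100 km; 6371 km is the mean earth radius, and
-- LEAST(1, ...) guards ACOS against rounding errors just above 1.
INSERT INTO zip100 (zip1, distance, zip2, land1, land2)
SELECT a.zip,
       ROUND(6371 * ACOS(LEAST(1,
           COS(RADIANS(a.lat)) * COS(RADIANS(b.lat)) *
           COS(RADIANS(b.lon) - RADIANS(a.lon)) +
           SIN(RADIANS(a.lat)) * SIN(RADIANS(b.lat))))),
       b.zip, a.land, b.land
FROM zip_coords a, zip_coords b
WHERE 6371 * ACOS(LEAST(1,
          COS(RADIANS(a.lat)) * COS(RADIANS(b.lat)) *
          COS(RADIANS(b.lon) - RADIANS(a.lon)) +
          SIN(RADIANS(a.lat)) * SIN(RADIANS(b.lat)))) <= 100;
-- The other two tables would use "BETWEEN 100 AND 150" and
-- "BETWEEN 150 AND 250" instead.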
Optimizing:
Make sure nothing accesses the tables while optimizing, or they might get corrupted.
Tables were physically sorted using "myisamchk --sort-records=1 <table>"
Then I used "myisampack <table>" for compression (usually around 30%). Again, I'm not sure if compression noticeably influences performance, will check on this...
After compression, I executed "myisamchk -rqSa <table>".
Now, what do we get from this?
"Select land2, zip2 from zip100 where land1='D' and zip1=74585" will get you: 1105 rows in set (0.01 sec)
This can now be used in a join with the user table.
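For example, assuming a hypothetical users table with land and zip columns (the real user table isn't shown here), finding everyone within 50 km of D-74585 could look like this - the distance column also lets you narrow the radius below the table's 100 km limit:

SELECT u.*
FROM zip100 z
JOIN users u ON u.land = z.land2 AND u.zip = z.zip2
WHERE z.land1 = 'D'
  AND z.zip1 = 74585
  AND z.distance <= 50;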
On my developer machine - Windows XP Pro, completely untuned, lacking memory like hell, and with no MySQL buffers optimized - results from the zip100 table are returned in around 0.00 - 0.03 seconds. The zip250 table takes longer, usually 0.07 - 0.09 seconds.
This can still be optimized a lot, since on my machine the data has to be read from hard disk. With tuned buffers and enough memory to hold the tables, results should come back at maximum read speed with no sorting needed, I suppose.
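For whoever wants to try the tuning: the MyISAM key cache would be the first knob to turn. A sketch, with purely illustrative sizes (and note that LOAD INDEX INTO CACHE requires MySQL 4.1 or later):

-- Grow the MyISAM key cache (here 256 MB) so the packed indexes fit in RAM:
SET GLOBAL key_buffer_size = 268435456;

-- Preload the index blocks so even the first queries run warm (MySQL 4.1+):
LOAD INDEX INTO CACHE zip100, zip250;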