Hi folks,
I am the maintainer of Fedora Tracker (www.fedoratracker.org), a search engine for the package repositories of the Fedora Project Linux distribution. I've run into a problem that has me stuck, and I hope someone here can help with it.
Basically, the back-end component of the tracker reads the XML files that describe each package in a repository and stores the information for each package in a MySQL database. Recently, though, an RPM showed up in one of the repositories that seems to have some Unicode escapes in it (trademark and copyright symbols, I think; Unicode is really not something I've dealt with a lot). They are causing the MySQLdb module to crash, and I can't figure out how to get Python to translate them into something inoffensive. I tried adding:
    if type(q) == types.UnicodeType:
        q = q.encode("utf-8", "replace")
to deal with it, but it didn't work. And anyway, it was really just a blind guess. Like I said, I'll be the first to admit that character encoding is not my strong suit.
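To make the attempted guard concrete, here is a minimal standalone sketch. It is written for Python 3, where the text type is `str` and the byte type is `bytes`, so it only approximates the snippet above and is not the tracker's code; `to_utf8_bytes` is a hypothetical helper name. It shows that a guard of this shape only touches text values and passes byte strings through unchanged.

```python
def to_utf8_bytes(q):
    """Encode text to UTF-8 bytes; return anything else untouched."""
    if isinstance(q, str):
        # Same idea as q.encode("utf-8", "replace") in the snippet above.
        return q.encode("utf-8", "replace")
    return q


text_value = "Intel\u00ae"       # text containing a non-ASCII character
bytes_value = b"Intel\xc2\xae"   # the byte sequence that appears in the query

encoded_text = to_utf8_bytes(text_value)     # text is encoded to bytes
passed_through = to_utf8_bytes(bytes_value)  # bytes are returned as-is
```

So if the query reaches the guard as a byte string rather than as text, the guard does nothing, which might be why it made no difference here.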
Anyway, here is the complete query being executed (note the "Intel\xc2\xae" in
the description field):
"INSERT INTO package_fedora_5 SET `name` = 'ipw2100-kmdl-2.6.17-1.2174_FC5',
`version` = '1.2.0', `release` = '41.rhfc5.at', `url` =
'http://ipw2100.sourceforge.net/', `dlurl` =
'http://dl.atrpms.net/fc5-x86_64/atrpms/stable/ipw2100-kmdl-2.6.17-1.2174_FC5-1.2.0-41.rhfc5.at.x86_64.rpm',
`description` = 'This package contains kernel drivers for the Intel\xc2\xae
PRO/Wireless 2100.\n\n\nThis package contains the ipw2100-kmdl-2.6.17-1.2174_FC5
kernel modules for the Linux kernel
package:\nkernel-2.6.17-1.2174_FC5.x86_64.rpm.', `rpmgroup` = 'System
Environment/Kernel', `vendor` = 'ATrpms.net', `packager` = 'ATrpms
http://ATrpms.net/', `prein` = 'NULL', `postin` = 'NULL', `preun` = 'NULL',
`postun` = 'NULL', `arch` = 'x86_64', `checksum` =
'sha:bf3ba4e450021eac031a6e3412980d051ff059f6', `changelog` = 'NULL', `fileList`
= '', `package_id` = NULL, `repo_id` = 5, `epoch` = 0, `numfiles` = 0"
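As a quick check of what those two bytes are, the sketch below decodes them as UTF-8. It is written for Python 3 and assumes the description is UTF-8 encoded, which is a guess based on the byte values and not something confirmed from the repository metadata.

```python
import unicodedata

raw = b"Intel\xc2\xae"          # bytes copied from the description field
decoded = raw.decode("utf-8")   # assumption: the data is UTF-8

# Look up the Unicode name of the last character in the decoded text.
symbol_name = unicodedata.name(decoded[-1])
```

Under that assumption the two bytes are a single character, the registered sign (U+00AE).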
And here is the resulting crash (note: "0xc2"):
Traceback (most recent call last):
  File "./tracker-process.py", line 110, in ?
    db.updateRepo(r)
  File "/home/brads/www/trackerBE.py", line 759, in updateRepo
    ret = self.storeRpmInfo(pkg,storeMe.repo_id,storeMe.version,storeMe.url)
  File "/home/brads/www/trackerBE.py", line 852, in storeRpmInfo
    self.execute(query)
  File "/home/brads/www/trackerBE.py", line 304, in execute
    res = self.cursor.execute(q)
  File "/home/brads/pymods/MySQLdb/cursors.py", line 146, in execute
    query = query.encode(charset)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 353: ordinal not in range(128)
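The same kind of error can be reproduced outside the tracker. The sketch below is written for Python 3 and assumes the failure comes from the query's bytes being decoded with the `ascii` codec somewhere inside the driver's `encode` call; that is only a reading of the traceback, not something verified against the MySQLdb source.

```python
raw = b"Intel\xc2\xae PRO/Wireless 2100"   # fragment of the description field

try:
    raw.decode("ascii")                    # what the error message suggests is happening
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with UTF-8 instead succeeds (assuming the data really is UTF-8).
as_text = raw.decode("utf-8")
```

If that reading is right, the problem is less about the symbol itself and more about a byte string being treated as ASCII text on its way to the database.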
This is currently breaking repo processing, which means that if it's not dealt
with in the next couple of days it will affect Fedora Tracker's ability to keep
up with the FC6 release, so any help would be greatly appreciated.
The relevant code is here if anyone wants to see it in context:
http://fedoratracker.cvs.sourceforge.net/fedoratracker/fedoratracker/trackerBE.py?revision=1.50&view=markup
Thanks!
--Brad