xml and unicode problems
Sandy Norton
sandskyfly at hotmail.com
Wed Mar 26 04:49:03 EST 2003
Hi,
I'm writing a web harvester that reads an XML config file with
keywords and URLs, then scans the links on the target URLs to check
whether they contain the specified keywords.
This worked fine until I wanted to make it Unicode compatible. To do
that, I encoded the text of the keyword strings (and the links) using
the following function:
escape = lambda x: xml.sax.saxutils.escape(x).encode('UTF-8')
When I run it, the harvester mostly works, but every now and then I
get an error like the one below. Do I have to manually register a
Unicode codec at the beginning of the script or something? If so, how
do I do that?
Traceback (most recent call last):
  File "C:\Engines\python\dist\src\lib\threading.py", line 411, in __bootstrap
    self.run()
  File "menews.py", line 293, in run
    current_links = self.parsePage(page)
  File "menews.py", line 342, in parsePage
    links = p.getLinks()
  File "menews.py", line 268, in getLinks
    if self.txtprocessor.containsAny( link, self.section_keywords ):
  File "menews.py", line 126, in containsAny
    txt = self.normalize(txt)
  File "menews.py", line 123, in normalize
    return escape(" ".join(result).strip())
  File "menews.py", line 69, in <lambda>
    escape = lambda x: xml.sax.saxutils.escape(x).encode('UTF-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 0: ordinal not in range(128)
Would appreciate any hints as to a solution, code available upon
request.
Oh... I'm running CVS Python on WinXP, by the way.
Cheers,
Sandy
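
A minimal sketch of one likely cause, written for Python 3 and not taken from the original script. In Python 2, calling .encode('UTF-8') on a byte string first decodes it implicitly with the ASCII codec, and that step fails on a byte such as 0x93. The sketch assumes the harvested text is Windows-1252 (where 0x93 is a left curly quote); the helper name escape_to_utf8 and the source_encoding parameter are hypothetical, and the real pages may well use a different encoding.

```python
from xml.sax.saxutils import escape as xml_escape


def escape_to_utf8(text, source_encoding="cp1252"):
    """Escape XML special characters and return UTF-8 encoded bytes.

    source_encoding is an assumption: it should be whatever encoding
    the harvested page actually uses.
    """
    # Decode raw bytes explicitly instead of relying on an implicit
    # ASCII decode, which is what raises UnicodeDecodeError.
    if isinstance(text, bytes):
        text = text.decode(source_encoding)
    return xml_escape(text).encode("utf-8")


# Bytes as they might arrive from a Windows-1252 page (hypothetical sample).
raw = b"\x93quoted\x94 & more"

# Decoding the same bytes as ASCII reproduces the error in the traceback.
try:
    raw.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

result = escape_to_utf8(raw)
```

Decoding once at the point where the page is read, and encoding only when writing output, would keep every string in between as Unicode; no codec registration should be needed for UTF-8 or cp1252, since both ship with Python.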