XML and Unicode problems

Wed Mar 26 04:49:03 EST 2003

Hi,

I'm writing a web harvester that reads an XML config file containing
keywords and URLs, then scans the links on the target URLs to check
whether they contain the specified keywords.

This worked fine until I wanted to make it Unicode compatible. To do
that, I encoded the text of the keyword strings (and the links) using
the following function:

escape = lambda x: xml.sax.saxutils.escape(x).encode('UTF-8')

When I run it, the harvester mostly works, but every now and then I
get an error like the one below. Do I have to manually register a
Unicode codec at the beginning of the script or something? If so, how
do I do that?

Traceback (most recent call last):
  File "C:\Engines\python\dist\src\lib\threading.py", line 411, in __bootstrap
    self.run()
  File "menews.py", line 293, in run
    current_links = self.parsePage(page)
  File "menews.py", line 342, in parsePage
    links = p.getLinks()
  File "menews.py", line 268, in getLinks
    if self.txtprocessor.containsAny( link, self.section_keywords ):
  File "menews.py", line 126, in containsAny
    txt = self.normalize(txt)
  File "menews.py", line 123, in normalize
    return escape(" ".join(result).strip())
  File "menews.py", line 69, in <lambda>
    escape = lambda x: xml.sax.saxutils.escape(x).encode('UTF-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 0: ordinal not in range(128)
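
A possible reading of the traceback, offered as a sketch rather than a
confirmed diagnosis: in Python 2, calling .encode('UTF-8') on a byte
string (rather than a unicode object) first decodes it implicitly as
ASCII, which fails on a byte such as 0x93. The helper below, with the
hypothetical name escape_utf8 and a hypothetical source_encoding
parameter, decodes bytes explicitly before escaping. The cp1252 default
is an assumption (0x93 is a left curly quote in Windows-1252); the real
encoding of the harvested pages may differ.

```python
from xml.sax.saxutils import escape as xml_escape


def escape_utf8(text, source_encoding="cp1252"):
    """Escape XML special characters and return UTF-8 encoded bytes.

    source_encoding is an assumption for illustration; use whatever
    encoding the harvested pages actually declare.
    """
    if isinstance(text, bytes):
        # Decode explicitly instead of relying on an implicit ASCII decode.
        text = text.decode(source_encoding)
    return xml_escape(text).encode("UTF-8")


# The byte from the traceback cannot be decoded as ASCII...
try:
    b"\x93".decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)

# ...but it decodes under the assumed encoding and then encodes to UTF-8.
result = escape_utf8(b"\x93Tom & Jerry\x94")
```

No codec registration is involved in this sketch; UTF-8 and cp1252 both
ship with the standard library.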

I would appreciate any hints toward a solution; code is available upon
request.

Oh, by the way, I'm running CVS Python on Windows XP.

Cheers,

Sandy