XML with Unicode: what am I doing wrong?

Diez B. Roggisch deetsNOSPAM at web.de
Wed Feb 2 13:48:47 EST 2005


> I started out working in the context of elementtidy, but now I am
> running into trouble in general Python-XML areas, so I thought I'd toss
> the question out here. The code below is fairly self-explanatory. I have
> a small HTML snippet that is UTF-8 encoded and is not 7-bit ASCII
> compatible. I use Tidy to convert it to XHTML, and this particular setup
> returns a unicode instance rather than a string.
> 
> import _elementtidy as et
> from xml.parsers import expat
> 
> data = unicode(open("snippetWithUnicode.html").read(), "utf-8")
> html = et.fixup(data)[0]
> parser = expat.ParserCreate()
> parser.Parse(html)
> 
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xb5' in
> position 542: ordinal not in range(128)
> 
> If I set my default encoding to utf8 in sitecustomize.py, it works just
> fine. I'm thinking that I can't be the only one trying to pass unicode
> to expat... Is there something else I need to do here?

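You are confusing unicode with UTF-8. Expat can parse the latter, which
is a byte encoding; the former is Python's internal text representation.
Passing a unicode object to something that needs a byte string triggers
an implicit conversion with the default encoding, and that fails because
the default is ASCII and your data contains non-ASCII characters.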

Do this:

parser.Parse(html.encode('utf-8'))
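
To see the effect in isolation, here is a minimal sketch of the same
idea. The one-line document in `html` is a made-up stand-in for the Tidy
output, and the `chunks` list is only there to collect the parsed text;
neither comes from your code.

```python
from xml.parsers import expat

# Hypothetical stand-in for the Tidy output: a text (unicode) object
# containing a non-ASCII character (U+00B5, the micro sign).
html = u"<p>5 \u00b5m</p>"

chunks = []
parser = expat.ParserCreate()
# Expat may deliver character data in several pieces, so collect them all.
parser.CharacterDataHandler = chunks.append

# Encode explicitly so that expat receives UTF-8 bytes and no implicit
# conversion with the default encoding is needed.
parser.Parse(html.encode("utf-8"), True)

text = u"".join(chunks)
```

With no XML declaration in the input, expat treats the bytes as UTF-8,
so the encoding chosen in encode() matches what the parser expects.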

-- 
Regards,

Diez B. Roggisch


