elementtree w/utf8

Diez B. Roggisch deets at nospam.web.de
Thu Oct 25 17:44:32 EDT 2007


Tim Arnold wrote:
> Hi, I'm getting the by-now-familiar error:
> return codecs.charmap_decode(input,errors,decoding_map)
> UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 4615: ordinal not in range(128)
> 
> the html file I'm working with is in utf-8, I open it with codecs, try to 
> feed it to TidyHTMLTreeBuilder, but no luck. Here's my code:
> from elementtree import ElementTree as ET
> from elementtidy import TidyHTMLTreeBuilder
> 
>             fd = codecs.open(htmfile,encoding='utf-8')
>             tidyTree = TidyHTMLTreeBuilder.TidyHTMLTreeBuilder(encoding='utf-8')
>             tidyTree.feed(fd.read())
>             self.tree = tidyTree.close()
>             fd.close()
> 
> what am I doing wrong? Thanks in advance.

Being too clever for your own good, sorry to say. 
TidyHTMLTreeBuilder takes the encoding for a reason: it expects a 
byte string that it will decode itself.

But you decode first, creating a unicode object. When you feed that to 
the feed method, which expects a byte string, Python attempts a 
conversion to a byte string using the default encoding. That is ASCII 
here, as the traceback shows, and it cannot represent u'\xa9'.
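
To make the failure concrete, here is a minimal sketch of what happens to 
the already-decoded text. The explicit encode("ascii") call is my stand-in 
for the conversion Python does implicitly, and the sample bytes are made up 
for illustration; they are not taken from the original file.

```python
# The two bytes C2 A9 are the UTF-8 encoding of the copyright sign, U+00A9.
raw = b"\xc2\xa9 2007"

# Decoding (which is what reading through codecs.open does) yields text.
text = raw.decode("utf-8")

# Stand-in for the implicit conversion back to a byte string via ASCII.
try:
    text.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = str(exc)

# Shows the same "ordinal not in range(128)" complaint as the traceback.
print(error)
```

The undecoded bytes in raw never go through that step, which is why 
handing them over directly avoids the error.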

Opening the file with the plain built-in open() instead of codecs.open() 
should do the trick, because feed() then gets the raw bytes.
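
Roughly like this, as an untested sketch. build_tree is a helper name I 
made up; the builder is passed in so the snippet does not depend on 
elementtidy being installed, and I am assuming the builder only needs 
feed() and close() as in your code.

```python
def build_tree(path, builder):
    """Feed the raw, undecoded bytes of the file at `path` to `builder`."""
    # Binary mode: read() returns a byte string, nothing is decoded here.
    with open(path, "rb") as fd:
        builder.feed(fd.read())
    # The builder decodes the bytes itself, using the encoding it was given.
    return builder.close()


# Intended use, assuming elementtidy is available (not executed here):
#   from elementtidy import TidyHTMLTreeBuilder
#   builder = TidyHTMLTreeBuilder.TidyHTMLTreeBuilder(encoding='utf-8')
#   tree = build_tree(htmfile, builder)
```

The encoding is stated only once, on the builder, so the decoding happens 
in exactly one place.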

diez


