elementtree w/utf8

Tim Arnold tim.arnold at sas.com
Fri Oct 26 13:15:30 EDT 2007


"Marc 'BlackJack' Rintsch" <bj_666 at gmx.net> wrote in message 
news:5ocgedFm1hl5U5 at mid.uni-berlin.de...
> On Thu, 25 Oct 2007 17:15:36 -0400, Tim Arnold wrote:
>
>> Hi, I'm getting the by-now-familiar error:
>> return codecs.charmap_decode(input,errors,decoding_map)
>> UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 4615: ordinal not in range(128)
>>
>> The HTML file I'm working with is in UTF-8. I open it with codecs and try to
>> feed it to TidyHTMLTreeBuilder, but no luck. Here's my code:
>> from elementtree import ElementTree as ET
>> from elementtidy import TidyHTMLTreeBuilder
>>
>>             fd = codecs.open(htmfile,encoding='utf-8')
>>             tidyTree = TidyHTMLTreeBuilder.TidyHTMLTreeBuilder(encoding='utf-8')
>>             tidyTree.feed(fd.read())
>>             self.tree = tidyTree.close()
>>             fd.close()
>>
>> What am I doing wrong? Thanks in advance.
>
> You feed decoded data to `TidyHTMLTreeBuilder`.  As the `encoding`
> argument suggests, this class wants bytes, not unicode.  `codecs.open`
> has already decoded the file, and decoding it a second time doesn't work.
>
> Ciao,
> Marc 'BlackJack' Rintsch

Well, now that you say it, it seems so obvious...
Some day I will get the hang of this encode/decode stuff. When I read about 
it, I'm fine: it makes sense, and it's maybe even a little boring. And then I 
write stuff like the above!
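
For the archive, here is a minimal sketch of what I take the fix to be: read
the file as raw bytes and let the builder do the one and only decode. It is
written for Python 3 and uses the standard library's ET.XMLParser as a
stand-in for TidyHTMLTreeBuilder, since elementtidy may not be installed.
The names parse_as_bytes, implicit_ascii_step and demo are made up for
illustration and are not part of either library.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

COPYRIGHT = "\u00a9"  # the character named in the traceback (u'\xa9')


def implicit_ascii_step(text):
    """Show the hidden step behind the error.

    Python 2, when asked to decode an already-decoded string, first
    encodes it to ASCII. That encode is what fails on non-ASCII text.
    Returns the error message, or None if the text is pure ASCII.
    """
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        return str(exc)
    return None


def parse_as_bytes(path):
    """Read the file in binary mode so the parser performs the only decode."""
    with open(path, "rb") as fd:
        # Assumption: XMLParser stands in for TidyHTMLTreeBuilder here.
        parser = ET.XMLParser(encoding="utf-8")
        parser.feed(fd.read())  # bytes in, not text
        return parser.close()   # root element of the parsed tree


def demo():
    """Write a small UTF-8 file, parse it from bytes, and return the results."""
    with tempfile.NamedTemporaryFile("wb", suffix=".htm", delete=False) as tmp:
        tmp.write(("<p>" + COPYRIGHT + " 2007</p>").encode("utf-8"))
    try:
        root = parse_as_bytes(tmp.name)
    finally:
        os.remove(tmp.name)
    return root.text, implicit_ascii_step(COPYRIGHT + " 2007")


if __name__ == "__main__":
    text, error = demo()
    print(repr(text))
    print(error)
```

In my original snippet the equivalent change would presumably be a plain
binary open of htmfile in place of codecs.open, leaving the encoding='utf-8'
argument on the builder as it is.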

Thanks to you and Diez for straightening me out.
--Tim




