Treating a unicode string as latin-1

Diez B. Roggisch deets at nospam.web.de
Thu Jan 3 13:13:49 EST 2008


Duncan Booth wrote:
> Fredrik Lundh <fredrik at pythonware.com> wrote:
> 
>> ET has already decoded the CP1252 data for you.  If you want UTF-8, all 
>> you need to do is to encode it:
>>
>> >>> u'Bob\x92s Breakfast'.encode('utf8')
>> 'Bob\xc2\x92s Breakfast'
>>
> I think he is claiming that the encoding information in the file is 
> incorrect and therefore it has been decoded incorrectly.
> 
> I would think it more likely that he wants to end up with u'Bob\u2019s 
> Breakfast' rather than u'Bob\x92s Breakfast' although u'Dog\u2019s dinner' 
> seems a probable consequence.

If that's the case, he should read the file as a byte string, decode it 
with the correct encoding, re-encode it (probably into a StringIO), and 
then feed that to the parser.
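
A minimal sketch of that approach, written for current Python 3 (where 
io.BytesIO plays the role of StringIO for bytes). The sample document, 
the assumption that it is declared as ISO-8859-1 while really being 
CP1252, and the variable names are all made up for illustration; the 
actual file in this thread may differ.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical input: the XML declaration claims ISO-8859-1, but the
# bytes are assumed to be CP1252 (0x92 is a right single quote there).
raw = (b"<?xml version='1.0' encoding='iso-8859-1'?>"
       b"<name>Bob\x92s Breakfast</name>")

# Decode with the encoding the data really uses, then re-encode as UTF-8.
utf8_bytes = raw.decode("cp1252").encode("utf-8")

# The declaration inside the data still names the wrong encoding, so
# tell the parser explicitly to treat the input as UTF-8.
parser = ET.XMLParser(encoding="utf-8")
tree = ET.parse(io.BytesIO(utf8_bytes), parser=parser)

text = tree.getroot().text
print(repr(text))
```

If the tree has already been parsed with the wrong encoding, 
text.encode("latin-1").decode("cp1252") on the affected strings should 
give the same result for this case, but fixing the bytes before parsing 
avoids touching every node.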

Diez


