Unicode error

John Machin sjmachin at lexicon.net
Sat Jul 24 18:37:26 EDT 2010


dirknbr <dirknbr <at> gmail.com> writes:

> I have kind of developped this but obviously it's not nice, any better
> ideas?
> 
>         try:
>             text=texts[i]
>             text=text.encode('latin-1')
>             text=text.encode('utf-8')
>         except:
>             text=' '

As Steven has pointed out, if the .encode('latin-1') works, the result is thrown
away. That would actually be very fortunate, given what you seem to have intended.

It appears that your goal was to encode the text in latin1 if possible,
otherwise in UTF-8, with no indication of which encoding was used. Your second
posting confirmed that you were doing this in a loop, ending up with the
possibility that your output file would have records with mixed encodings.

Did you consider what a programmer writing code to READ your output file would
need to do, e.g. attempt to decode each record as UTF-8 with a fall-back to
latin1? Did you consider what would be the result of sending a stream of
mixed-encoding text to a display device?
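
To make the reader's burden concrete, here is a rough sketch of the guessing
code such a file would force on them. It is written in Python 3 syntax, and
decode_record and the sample records are made-up names for illustration only,
not anything from your program.

```python
def decode_record(raw):
    """Guess the encoding of one record: try UTF-8, fall back to latin1."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # latin1 maps every byte to a character, so this never raises,
        # which also means it can silently return the wrong text.
        return raw.decode('latin-1')

# Hypothetical file contents: the same word written in two encodings.
records = ['caf\u00e9'.encode('latin-1'), 'caf\u00e9'.encode('utf-8')]
decoded = [decode_record(r) for r in records]

# The guess is not reliable: this latin1 record happens to be valid UTF-8,
# so it comes back as different text from what was written.
written = '\u00c3\u00a9'
misread = decode_record(written.encode('latin-1'))
```

The last two lines are the real problem: a record written in latin1 can also
be valid UTF-8, in which case the fall-back never triggers and the reader gets
the wrong characters with no error.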

As already advised, the short answer for avoiding all of that hassle is: just
encode in UTF-8.
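
A minimal sketch of that advice, again in Python 3 syntax and with made-up
sample data standing in for your texts list:

```python
# Hypothetical sample data: ASCII, latin1-range, and non-latin1 text.
texts = ['plain ascii', 'caf\u00e9', '\u4e2d\u6587']

# UTF-8 can represent any Unicode text (lone surrogates aside), so no
# try/except and no fall-back encoding is needed.
encoded = [text.encode('utf-8') for text in texts]

# Whoever reads the output decodes every record the same way.
round_tripped = [raw.decode('utf-8') for raw in encoded]
```

Every record comes out in one known encoding, so nothing is replaced by a
blank and the reader has nothing to guess.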
