[Tutor] Unicode? UTF-8? UTF-16? WTF-8? ;)

Wed Sep 5 13:52:57 CEST 2012

Ray Jones wrote:

>> You can work around that by specifying the appropriate encoding
>> explicitly:
>>
>> $ python tmp2.py iso-8859-5 | cat
>> �
>> $ python tmp2.py latin1 | cat
>> Traceback (most recent call last):
>>   File "tmp2.py", line 4, in <module>
>>     print u"Я".encode(encoding)
>> UnicodeEncodeError: 'latin-1' codec can't encode character u'\u042f' in
>> position 0: ordinal not in range(256)
>>
> But doesn't that entail knowing in advance which encoding you will be
> working with? How would you automate the process while reading existing
> files?

If you don't *know* the encoding you *have* to guess. For instance, you could
default to UTF-8 and fall back to Latin-1 if you get an error. Decoding
non-UTF-8 data with a UTF-8 decoder is likely to fail, whereas Latin-1 will
always "succeed" because every possible byte maps to exactly one code point.
The result, however, may not make sense. Think

import codecs

for line in codecs.open("lol_cat.jpg", encoding="latin1"):
    print line.rstrip()
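
For the guessing strategy itself (try UTF-8 first, fall back to Latin-1), a
minimal sketch might look like the following. The function name
decode_with_fallback and its encodings parameter are made up for illustration;
they are not part of the codecs module. It works on a byte string already read
from the file, and it is written so that it should run on both Python 2 and
Python 3.

```python
def decode_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes data.

    Hypothetical helper: the name and the default encoding order are
    assumptions chosen for this example.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # This guess was wrong; try the next candidate.
            continue
    # Only reachable if Latin-1 (or another "accepts everything"
    # encoding) is not among the candidates.
    raise ValueError("none of the candidate encodings could decode the data")


# Valid UTF-8 is recognised by the first attempt.
text, used = decode_with_fallback(u"\u042f".encode("utf-8"))

# The same character encoded as ISO-8859-5 is a single byte that is not
# valid UTF-8, so the Latin-1 fallback kicks in. It "succeeds", but the
# decoded character is not the one that was originally written.
wrong_text, wrong_used = decode_with_fallback(u"\u042f".encode("iso-8859-5"))
```

The second call shows the catch: the fallback never raises, so a wrong guess
goes unnoticed unless you check the result some other way.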