Changing filenames from Greeklish => Greek (subprocess complain)

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sun Jun 9 09:13:39 EDT 2013


On Sun, 09 Jun 2013 02:38:13 -0700, Νικόλαος Κούρας wrote:

> s = 'α'
> s = s.encode('iso-8859-7').decode('utf-8')
> 
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 0:
> unexpected end of data
> 
> Why this error? because 'a' ordinal value > 127 ?


Look at it this way... consider encoding and decoding to be like 
translating from one language to another.

Suppose you start with the English word "street". You encode it to German 
by looking it up in an English-To-German dictionary:

street -> Straße

Then you decode the German by looking "Straße" up in a German-To-English 
dictionary:

Straße -> street

and everything is good. But suppose that after encoding the English to 
German, you get confused, and think that it is Italian, not German. So 
when it comes to decoding, you try to look up 'Straße' in an Italian-To-
English dictionary, and discover that there is no such thing as letter ß 
in Italian. So you cannot look the word up, and you get frustrated and 
shout "this is rubbish, there's no such thing as ß, that's not a letter!"

Not in Italian, but it is a perfectly good letter in German. But you're 
looking it up in the wrong dictionary.

Same thing with UTF-8. You encoded the string 'α' by looking it up in the 
"Unicode To ISO-8859-7 bytes" dictionary. Then you try to decode it by 
looking for those bytes in the "UTF-8 bytes To Unicode" dictionary. But 
you can't find byte 0xe1 on its own in UTF-8 bytes, so Python shouts 
"this is rubbish, there's no such thing as 0xe1 on its own in UTF-8!" and 
raises UnicodeDecodeError.
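
Here is a minimal sketch of the same mistake in Python 3, for anyone who 
wants to see the bytes involved. The variable names (text, raw, error, 
roundtrip) are only for illustration. It encodes 'α' with ISO-8859-7, 
shows that UTF-8 refuses the lone byte, and then decodes with the 
matching codec:

```python
text = 'α'

# Look the character up in the "Unicode To ISO-8859-7 bytes" dictionary.
raw = text.encode('iso-8859-7')
print(raw)  # b'\xe1'

# Wrong dictionary: in UTF-8, 0xe1 only ever starts a multi-byte sequence,
# so a lone 0xe1 is incomplete and the decoder gives up.
error = None
try:
    raw.decode('utf-8')
except UnicodeDecodeError as err:
    error = err
    print(err)

# Right dictionary: decode with the same codec that did the encoding.
roundtrip = raw.decode('iso-8859-7')
```

Decoding with the codec that produced the bytes gives back the string 
you started with.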


Sometimes you don't get an exception. Suppose that you are encoding from 
French to German:

qui -> die  (both words mean "who" in English)


Now if you get confused, and decode the word 'die' by looking it up in an 
English-To-French dictionary, instead of German-To-French, you get:

die -> mourir

So instead of getting 'qui' back again, you get 'mourir'. This is like 
mojibake: the results are garbage, but there is no exception raised to 
warn you.
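
The silent case can be sketched in Python 3 by going the other way: 
encode 'α' as UTF-8, then decode the bytes as ISO-8859-7. Every byte 
happens to be a valid ISO-8859-7 character, so nothing complains. Again, 
the variable names are only for illustration:

```python
original = 'α'

# Encode with UTF-8: 'α' becomes the two bytes 0xce 0xb1.
utf8_bytes = original.encode('utf-8')

# Wrong dictionary, but both bytes exist in ISO-8859-7, so no exception
# is raised. The result is two unrelated characters: 'Ξ±'.
garbled = utf8_bytes.decode('iso-8859-7')

# The damage is reversible here only because no byte was lost: undo the
# wrong decode, then decode with the right codec.
recovered = garbled.encode('iso-8859-7').decode('utf-8')
```

One character went in and two different ones came out, with no error 
anywhere to tell you that something went wrong.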


-- 
Steven


