Processing text data with different encodings

Peter Otten __peter__ at web.de
Tue Jun 28 07:31:08 EDT 2016


Michael Welle wrote:

> With your help, I fixed logging. Somehow I had in mind that the
> logging module would do the right thing if I didn't specify the encoding.

The default encoding depends on the environment (and platform):

$ touch tmp.txt
$ python3 -c 'print(open("tmp.txt").encoding)'
UTF-8
$ LANG=C python3 -c 'print(open("tmp.txt").encoding)'
ANSI_X3.4-1968
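
If you don't want to depend on the environment you can pass the encoding 
explicitly, e. g. open("tmp.txt", encoding="utf-8") or 
logging.FileHandler("tmp.log", encoding="utf-8"). Then the result no longer 
varies:

$ python3 -c 'print(open("tmp.txt", encoding="utf-8").encoding)'
utf-8
$ LANG=C python3 -c 'print(open("tmp.txt", encoding="utf-8").encoding)'
utf-8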

> Well, setting the encoding explicitly to utf-8 changes the behaviour.
> 
> If I use decode('windows-1252') on a bit of text I still have trouble
> understanding what's happening. For instance, there is a u umlaut in the
> 1252-encoded portion of the input text. That character is 0xfc in hex.
> After applying .decode('windows-1252') and logging it, the log contains
> a mangled character with hex codes 0xc3 0x20. If I do the same with
> .decode('utf-8'), the result is a working u umlaut with 0xfc in the log.
> 
> On the other hand, if I try the following in the interactive
> interpreter:
> 
> Here I have a few bytes that can be interpreted as a 1252-encoded string,
> and I ask the interpreter to show me the string, right?
> 
> >>> e=b'\xe4'
> >>> e.decode('1252')
> 'ä'
> 
> Now, I can't do this, because 0xe4 isn't valid utf-8:
> >>> e.decode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 0:
> unexpected end of data
> 
> But why is it different in my actual script? I guess my assumption that
> what I am reading from sys.stdin.buffer is the same as what is in the
> file that I pipe into the script is wrong?

The situation is simple: a string consists of code points, but a file can 
only contain bytes. When reading a string from a file the bytes need to be 
decoded, and before writing a string to a file it must be encoded.

Which byte sequence denotes a specific code point depends on the encoding.
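
Taking the u umlaut from your input as an example, the same code point 
produces different bytes under different encodings:

>>> "ü".encode("cp1252")
b'\xfc'
>>> "ü".encode("utf-8")
b'\xc3\xbc'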

This is always the case. For example, if you look at a UTF-8-encoded file 
with an editor that expects cp1252 you will see:

>>> in_the_file = "ä".encode("utf-8")
>>> in_the_file
b'\xc3\xa4'
>>> what_the_editor_shows = in_the_file.decode("cp1252")
>>> print(what_the_editor_shows)
Ã¤
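
If what the editor shows were saved back to disk the mojibake would become 
part of the data (continuing the session above):

>>> what_the_editor_shows.encode("utf-8")
b'\xc3\x83\xc2\xa4'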

On the other hand, if you look at a cp1252-encoded file and decode the data 
as UTF-8 you will likely get an error, because the byte

>>> "ä".encode("cp1252")
b'\xe4'

alone is not valid UTF-8. As part of a longer sequence the data may still 
be ambiguous. If you were to write an a-umlaut followed by two euro signs 
using cp1252

>>> in_the_file = 'ä€€'.encode("cp1252")
>>> in_the_file
b'\xe4\x80\x80'

an editor expecting UTF-8 would show

>>> in_the_file.decode("utf-8")
'䀀'
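
That is probably what happened in your script, too: assuming the log file 
ends up UTF-8-encoded, the u umlaut you decoded from the cp1252 input is 
written as two bytes, the first being the 0xc3 you saw in the log:

>>> b'\xfc'.decode("cp1252")   # the u umlaut byte from the cp1252 input
'ü'
>>> "ü".encode("utf-8")        # what a UTF-8 log file stores
b'\xc3\xbc'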




