Processing text data with different encodings

Steven D'Aprano steve at pearwood.info
Tue Jun 28 07:26:28 EDT 2016


On Tue, 28 Jun 2016 08:17 pm, Michael Welle wrote:

> After a bit more 'fiddling' I found out that all test cases work if I
> use .decode('utf-8') on the incoming bytes. In my first approach I tried
> to find out what I was looking at and then used a specific .decode, e.g.
> .decode('windows-1252'). Those were the trouble makers.

Remember that chardet's detection is based on statistics and heuristics and
cannot be considered 100% reliable. Normally I would expect that chardet
would guess two or three encodings. If the first fails, you might want to
check the others.

Also remember that chardet works best with large amounts of text, like an
entire webpage. If you pass it a single byte, or even a few bytes, the
results will likely be no better than whatever encoding the chardet
developer decided to use as the default:

"If there's not enough data to guess, just return Win-1252, because that's
pretty common..."
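
For what it's worth, chardet also reports how confident it is in its
guess, and it's worth checking that before trusting the result. A minimal
sketch (the 0.8 threshold is my arbitrary choice, not chardet's):

import chardet

raw = b'Gr\xfc\xdfe aus M\xfcnchen'  # incoming bytes (example)
guess = chardet.detect(raw)  # e.g. {'encoding': 'windows-1252',
                             #       'confidence': 0.73}
if guess['encoding'] is None or guess['confidence'] < 0.8:
    # too little data, or too ambiguous -- try UTF-8 first instead
    pass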



> With your help, I fixed logging. Somehow I had in mind that the
> logging module would do the right thing if I didn't specify the encoding.
> Well, setting the encoding explicitly to utf-8 changes the behaviour.

I would expect that logging will do the right thing if you pass it text
strings and have set the encoding to UTF-8.
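
A minimal sketch, assuming you log to a file (the file name is just an
example):

import logging

# Tell the handler to encode log records as UTF-8 when writing.
handler = logging.FileHandler('app.log', encoding='utf-8')
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.info('u umlaut: ü')  # pass text; the handler does the encoding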



> If I use decode('windows-1252') on a bit of text 

You cannot decode text. Text is ENCODED to bytes, and bytes are DECODED to
text.
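
Python 3 enforces the direction: str has an encode method, bytes has a
decode method, and not the other way around:

py> 'ü'.encode('windows-1252')
b'\xfc'
py> b'\xfc'.decode('windows-1252')
'ü'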

> I still have trouble understanding what's happening.

> For instance, there is a u umlaut in the 
> 1252 encoded portion of the input text.

You don't know that. If the input is *bytes*, then all you know is the byte
values. What they mean is anyone's guess unless the name of the encoding is
transmitted separately. You can be reasonably sure that the bytes are
mostly ASCII, because it's email and nobody sends email in EBCDIC, so if
you see a byte 0x41, you can be sure it represents an 'A'. But outside of
the ASCII range, you're on shaky ground.

If the specified encoding is correct, then everything works well: the email
says it is UTF-8, and sure enough it is UTF-8. But if the specified
encoding is wrong, you're in trouble.
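
If you have the raw message, the stdlib email package will tell you what
charset the message *claims* to use. A rough sketch, for a simple
non-multipart message piped in on stdin:

import email
import sys

raw_bytes = sys.stdin.buffer.read()      # the whole message, as bytes
msg = email.message_from_bytes(raw_bytes)
declared = msg.get_content_charset()     # e.g. 'utf-8', or None if absent
body = msg.get_payload(decode=True)      # bytes, transfer encoding undone
text = body.decode(declared or 'ascii', errors='replace')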

You only think the encoding is Windows-1252 because chardet has guessed
that. But chardet is not infallible, and maybe it has got it wrong. If your
input mixes bytes from many different encodings, that is especially likely
to confuse it.


> That character is 0xfc in hex. 

No. Byte 0xFC represents ü only if your guess about the encoding is
correct. If the encoding truly is Windows-1252, or Latin-1, then byte 0xFC
will mean ü. (The same goes for a number of other encodings.) If the source
is Western European, that might be a good guess.

But if the encoding actually is (let's say):

- ISO-8859-5 (Cyrillic), then the byte represents ќ

- ISO-8859-7 (Greek), then the byte represents ό

- MacRoman (Apple Macintosh), then the byte represents ¸

(That last one is not a comma, but a cedilla.)
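
You can watch the byte change meaning in the interpreter:

py> b'\xfc'.decode('windows-1252')
'ü'
py> b'\xfc'.decode('iso-8859-5')
'ќ'
py> b'\xfc'.decode('iso-8859-7')
'ό'
py> b'\xfc'.decode('mac_roman')
'¸'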


> After applying .decode('windows-1252') and logging it, the log contains
> a mangled character with hex codes 0xc3 0x20. 

I think you are misunderstanding what you are looking at. How are you seeing
that? 0x20 will be a space in most encodings.

(1) How are you opening the log file in Python? Do you specify an encoding?

(2) How are you writing to the log file?

(3) What are you using to read the log file outside of Python? How do you
know the hex codes?


I don't know any way you can start with the character ü and write it to a
file and get bytes 0xc3 0x20. Maybe somebody else will think of something,
but to me, that seems impossible.


> If I do the same with 
> .decode('utf-8'), the result is a working u umlaut with 0xfc in the log.

That suggests that you have opened the log file using Latin-1 or
Windows-1252 as the encoding. You shouldn't do that. Unless you have a good
reason to do otherwise (in other words, for experts only) you should always
use UTF-8 for writing.
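
The mismatch is easy to reproduce in the interpreter: encode ü as UTF-8,
then decode those bytes as Latin-1, which is effectively what a log file
opened with the wrong encoding does:

py> 'ü'.encode('utf-8')
b'\xc3\xbc'
py> b'\xc3\xbc'.decode('latin-1')
'Ã¼'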


> On the other hand, if I try the following in the interactive
> interpreter:
> 
> Here I have a few bytes that can be interpreted as a 1252 encoded string
> and I command the interpreter to show me the string, right?
> 
>>>> e=b'\xe4'

That's ONE byte, not a few.

>>>> e.decode('1252')
> 'ä'

Right -- that means that byte 0xE4 represents ä in Windows-1252, also in
Latin-1 and some others. But:

py> e.decode('iso-8859-7')  # Greek
'δ'
py> e.decode('iso-8859-8')  # Hebrew
'ה'
py> e.decode('iso-8859-6')  # Arabic
'ل'
py> e.decode('MacRoman')  # old Macintosh
'‰'
py> e.decode('iso-8859-5')  # Cyrillic
'ф'

So if you find a byte 0xE4 in a file, and don't know where it came from, you
don't know what it means. If you can guess it came from Russia, then it
might be a ф. If you think it came from a Macintosh prior to OS X, then it
probably means a per-mille sign ‰.


> Now, I can't do this, because 0xe4 isn't valid utf-8:
>>>> e.decode('utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 0:
> unexpected end of data

Correct.


> But why is it different in my actual script? 

Without seeing your script, it's hard to say what you are actually doing.


> I guess the assumption that 
> what I am reading from sys.stdin.buffer is the same as what is in the
> file that I pipe into the script is wrong?

I wouldn't rule that out, but more likely the issue lies elsewhere, in your
own code.
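
One way to check that assumption is to dump the first few bytes you
actually receive and compare them with a hex dump of the file (from xxd or
od, say). A minimal sketch, with illustrative file names:

import sys

data = sys.stdin.buffer.read()  # raw bytes; nothing has been decoded yet
print(repr(data[:40]))          # compare with the file's hex dump

Run it as "python3 check.py < input.txt" and compare against
"xxd input.txt | head".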



-- 
Steven
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



