Processing text data with different encodings

Steven D'Aprano steve at pearwood.info
Tue Jun 28 10:52:22 EDT 2016


On Tue, 28 Jun 2016 10:30 pm, Michael Welle wrote:

> I look at the hex values of the bytes, get the win-1252 table and
> translate the bytes to chars. If the result makes sense, it's win-1252
> (and maybe others, if the tables overlap). So in that sense I know what
> I have. At least for these experiments, when I can control the input.

So let me see if I understand what you are doing.

You create an input file for testing, let's call it "test.txt". In that test
file, you control the input, so you place an ü and save it using
Windows-1252 encoding. Then you open the test file and see a byte 0xFC.

Then you open an email, encoded using a completely unknown encoding, but
claiming to be UTF-8, and see a byte 0xFC. And from this you conclude that
the encoding must be Windows-1252 and the unknown character must be ü.

Is that right?
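For what it's worth, the scenario above is easy to check in an interactive session (a quick sketch, not the OP's actual code):

```python
# The test-file side: 'ü' saved as Windows-1252 is the single byte 0xFC.
assert 'ü'.encode('windows-1252') == b'\xfc'

# Decoding that byte as Windows-1252 round-trips cleanly...
assert b'\xfc'.decode('windows-1252') == 'ü'

# ...but a lone 0xFC can never occur in valid UTF-8, so an email
# "claiming to be UTF-8" that contains it is mislabelled.
try:
    b'\xfc'.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc.reason)  # invalid start byte
```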

Maybe I'm misunderstanding you, but frankly from your description I have
very little confidence that the unknown encoding is Windows-1252,
especially given your earlier comment:

    "you will find THREE OR FOUR different encodings in one email. 
    I think at the sending side they just glue different text 
    fragments from different sources together without thinking 
    about the encoding"

But I'm not going to argue any more. Maybe I've misunderstood you, and what
you have done makes perfect sense. Maybe there's only one encoding, 1252,
not three or four. Windows-1252 is a very common encoding, so perhaps you
are right. The worst that will happen is that you will get mojibake.
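To illustrate that worst case: mojibake is what you get when bytes written in one encoding are decoded with another, e.g. (word chosen for illustration):

```python
# UTF-8 bytes for 'für', wrongly decoded as Windows-1252: the
# two-byte sequence 0xC3 0xBC comes out as two characters.
utf8_bytes = 'für'.encode('utf-8')        # b'f\xc3\xbcr'
print(utf8_bytes.decode('windows-1252'))  # fÃ¼r  -- classic mojibake
```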

So I will accept that the encoding is Windows-1252.


>>> After applying .decode('windows-1252') and logging it, the log contains
>>> a mangled character with hex codes 0xc3 0x20.
>>
>> I think you are misunderstanding what you are looking at. How are you
>> seeing that? 0x20 will be a space in most encodings.
>
> Again, I used a hex editor and hd.

hd? Also known as hexdump?

http://www.unix.com/man-page/Linux/1/hd/

I ask because there is also another tool, hdtool, sometimes called hd, used
for formatting hard disks. I assume you're not using that :-)


>> (1) How are you opening the log file in Python? Do you specify an
>> encoding?
>
> Well, I use the logging module. In my very first posting I didn't
> specify an encoding. Later I changed the encoding to utf-8. The details
> of opening the log file can be found somewhere in the logging module

The mystery 0xC3 0x20 hex codes -- what encoding were you using at the time
you logged them? I can see no way to get 0xC3 0x20 out of UTF-8: it is an
invalid sequence of bytes. If the logger was using UTF-8, that implies
either a bug in the logger, or disk corruption.
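If no encoding was passed, logging.FileHandler falls back to the platform
default, which on Windows is often cp1252 rather than UTF-8. A minimal
sketch of pinning it down (file and logger names are illustrative):

```python
import logging

# Explicit encoding; without encoding=..., FileHandler opens the file
# with the platform default encoding.
handler = logging.FileHandler('debug.log', encoding='utf-8')
log = logging.getLogger('demo')
log.addHandler(handler)
log.warning('ü')   # the file now ends with the bytes 0xC3 0xBC 0x0A
```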

Are you *sure* it is 0x20? If it were 0xC3 0xBC that would make perfect
sense. But 0xC3 0x20 is invalid UTF-8.
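To spell out why:

```python
# 0xC3 announces a two-byte UTF-8 sequence; the next byte must be a
# continuation byte in the range 0x80-0xBF.  0x20 (space) is not.
try:
    b'\xc3\x20'.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc.reason)  # invalid continuation byte

# 0xC3 0xBC, on the other hand, is exactly UTF-8 for 'ü'.
assert b'\xc3\xbc'.decode('utf-8') == 'ü'
```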


>> (2) How are you writing to the log file?
>
> I use the provided functions for writing records with the given
> severity.

Are you passing it a text string? A bytes string? Something else?


>> (3) What are you using to read the log file outside of Python? How do you
>> know the hex codes?
> 
> I use emacs, hexlify-mode and hd.

Okay.



>> I don't know any way you can start with the character ü and write it to a
>> file and get bytes 0xc3 0x20. Maybe somebody else will think of
>> something, but to me, that seems impossible.

That comment still stands.


>>> If I do the same with
>>> .decode('utf-8'), the result is a working u umlaut with 0xfc in the log.

But that is, I think, impossible. You must be misinterpreting what you are
seeing, or confusing it with output written to the log while it used a
different encoding.

py> 'ü'.encode('utf-8')
b'\xc3\xbc'

not 0xFC.


More to follow...




-- 
Steven
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.



