what encoding is this? How can I tell? How can I translate?

Skip Montanaro skip at pobox.com
Mon Sep 24 17:02:14 EDT 2001


(My apologies that this has only a little to do with Python.  It's really
more about character set encodings and possibly broken mailers, but since I
deal with that sort of stuff in Python, this group seems as relevant as any
other I would stumble upon.)

I have a multipart/mixed email message that consists of a bunch of
text/plain chunks.  I have a little Python script that pulls out an
interesting chunk using multifile.MultiFile and decodes it using
mimetools.decode.  That is then pushed through another little script to
normalize the line endings.  I'm left with a tab-separated file that
contains a mysterious-looking character.  What XEmacs displays as a capital
"O" with a tilde over it (ordinal 213, hex 0xd5) is evidently supposed to be
an apostrophe, so I could hack together a filter to convert it, but a quick
scan for "d5" in the Python encodings directory suggests the encoding is
mac_latin2 (not sure what that is officially).  Looking at the raw mail
message, there's no indication in the message header or in any of the
headers of the individual chunks what the character set is supposed to be.
The message was generated using the Mac version of Outlook Express
v. 5.02.2002.  Is this a Mac OE bug that the various MIME chunks don't
contain charset attributes?
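
To make the situation concrete, here is a minimal sketch of the check
described above.  It is written for current Python 3 and uses the standard
email package in place of multifile/mimetools, which current Python 3 no
longer ships.  The sample message RAW, the helper text_parts and the
CANDIDATES list are all made up for illustration; only the byte 0xd5 and the
missing charset attribute come from the real message.

```python
import email

# Hypothetical stand-in for the real mail: one text/plain chunk with no
# charset attribute, carrying the puzzling byte 0xd5 as quoted-printable.
RAW = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="XX"\r\n'
    b"\r\n"
    b"--XX\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"don=D5t\r\n"
    b"--XX--\r\n"
)


def text_parts(raw):
    """Return (declared charset, decoded payload bytes) per text/plain part."""
    msg = email.message_from_bytes(raw)
    found = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            # get_content_charset() returns None when no charset is declared.
            found.append((part.get_content_charset(),
                          part.get_payload(decode=True)))
    return found


# Codecs to try on the mystery byte; the list is a guess, not exhaustive.
CANDIDATES = ["latin-1", "mac_roman", "mac_latin2"]

for charset, data in text_parts(RAW):
    print("declared charset:", charset)
    for name in CANDIDATES:
        char = b"\xd5".decode(name)
        print("%-10s 0xd5 -> U+%04X %r" % (name, ord(char), char))
```

Under latin-1 the byte is the O-with-tilde that XEmacs shows, while both Mac
codecs in the list map it to U+2019 (a right single quotation mark), so the
byte alone cannot tell mac_latin2 and mac_roman apart.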

How do I translate this to latin-1 (I can probably assume that mail from
this particular non-technical person will always be similarly encoded)?
Based on what I saw at the bottom of codecs.py I tried this:

    import codecs, sys
    sys.stdin = codecs.EncodedFile(sys.stdin, "latin1", "mac-latin2")
    sys.stdout.write(sys.stdin.read())

but got

    Traceback (most recent call last):
      File "/home/skip/tmp/decode.py", line 3, in ?
        sys.stdout.write(sys.stdin.read())
      File "/usr/local/lib/python2.1/codecs.py", line 417, in read
        data, bytesencoded = self.encode(data, self.errors)
    UnicodeError: Latin-1 encoding error: ordinal not in range(256)

which seemed odd, because the ordinal 213 character is the only character
above ordinal 127.
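
One possible explanation, sketched below in current Python 3 rather than the
2.1 used above: if the Mac codec maps byte 0xd5 to U+2019 (ordinal 8217),
then the decoded character lies outside Latin-1's 0-255 range, which would
account for "ordinal not in range(256)" even though the input byte itself is
only 213.  The helper mac_to_latin1, the PUNCT table and the default of
mac_roman are assumptions for illustration, not a known fix for this mailer.

```python
# Typographic quotes that Latin-1 cannot represent, mapped to ASCII stand-ins.
PUNCT = {
    0x2018: "'",   # left single quotation mark
    0x2019: "'",   # right single quotation mark
    0x201C: '"',   # left double quotation mark
    0x201D: '"',   # right double quotation mark
}


def mac_to_latin1(data, source="mac_roman"):
    """Decode Mac-encoded bytes and re-encode them as Latin-1.

    Curly quotes are replaced with ASCII quotes first; anything else that
    Latin-1 lacks becomes "?" instead of raising an error.
    """
    text = data.decode(source)
    text = text.translate(PUNCT)
    return text.encode("latin-1", errors="replace")


sample = b"don\xd5t"

# A direct decode/encode round trip fails, much like the traceback above.
try:
    sample.decode("mac_roman").encode("latin-1")
    failure = None
except UnicodeEncodeError as exc:
    failure = exc

print("direct conversion failed:", failure is not None)
print("with substitution:", mac_to_latin1(sample))
```

Passing source="mac_latin2" instead should behave the same way for this
particular byte; which Mac codec the mailer really used would still need to
be confirmed against other non-ASCII characters in the mail.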

Thx,

-- 
Skip Montanaro (skip at pobox.com)
http://www.mojam.com/
http://www.musi-cal.com/



