[Mailman-Users] Encoding issues when importing archives

Tue May 22 21:37:32 EDT 2018

Mark Sapiro writes:

 > > content = content.encode(decoding)
 > > 
 > > UnicodeEncodeError: 'gb2312' codec can't encode character '\ufffd' in position 3131: illegal multibyte sequence
 > > 
 > > Apparently the offending attachments are specified as gb2312 (a common
 > > Chinese encoding).
 > > 
 > > Is there something I can do to somehow preprocess the archive mboxes, or
 > > otherwise re-encode the attachments?
 > 
 > Possibly there is, but this is a bug in the hyperkitty_import process.

Technically, it's a bug in common Chinese MUAs.  We can work around it
if we want to, of course, and I think we do.

<tl;dr endsat="Whew!">
The backstory is that Chinese (simplified, aka mainland) has three
major encoding standards: GB 2312, GBK, and GB 18030.  GBK is not
really an encoding, it's an encoding schema which says "future Chinese
encodings shall be supersets of GB 2312" but doesn't assign any new
characters, and GB 18030 is not only a superset of GB 2312 that
actually defines the new characters compatibly with GBK, but it is
also a superset of Unicode that folds Unicode into the GBK code space
algorithmically (GB 2312 and Unicode are incompatible in page 0).

Whew!

So, because GB 18030 is backward compatible with GB 2312, a lot of
Chinese MUAs get away with incorrectly labeling the extended character
set "GB 2312", and you get the error above.  The same thing happens
with Shift JIS, by the way.
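
For illustration only, here is a minimal sketch of that backward compatibility using Python's stock "gb2312" and "gb18030" codecs.  The sample strings and helper names are my own assumptions, not anything from the OP's archive.

```python
# Sketch: GB 18030 agrees with GB 2312 on ordinary simplified hanzi,
# but also covers characters that GB 2312 lacks.

def fits_gb2312(text):
    """Return True if every character of text exists in GB 2312."""
    try:
        text.encode("gb2312")
    except UnicodeEncodeError:
        return False
    return True


def strict_decode_ok(raw, label):
    """Return True if raw decodes under label without any error."""
    try:
        raw.decode(label)
    except UnicodeDecodeError:
        return False
    return True


plain = "中文"        # common simplified characters, present in GB 2312
extended = "朱镕基"   # the middle character is not in GB 2312

# What a sloppy MUA effectively sends: GB 18030 bytes labeled "GB2312".
mislabeled = extended.encode("gb18030")
```

With these definitions, plain encodes to the same bytes under both codecs, while mislabeled decodes strictly only as GB 18030.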

OTOH, for that exact reason, we can do what Webencodings does, and
promote GB 2312 claims, and *decode* with GB 18030.  I think this is
safe, as there's really no alternative encoding to worry about, and
since this stuff is presumably all text/plain or text/html, we should be
OK on security stuff (although I guess in theory it could be source
code or executable scripts that are doing something sneaky).
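
A rough sketch of what such a promotion could look like in Python follows.  The PROMOTIONS table and both function names are hypothetical; this is not what hyperkitty_import or Webencodings actually does internally, just the idea.

```python
import codecs

# Hypothetical promotion table.  Only the GB 2312 -> GB 18030 entry comes
# from the discussion above; anything else would be a separate decision.
PROMOTIONS = {"gb2312": "gb18030"}


def promoted_codec(label):
    """Map a claimed charset label to the codec we actually decode with."""
    try:
        name = codecs.lookup(label).name   # normalizes case and aliases
    except LookupError:
        return label                       # unknown label: leave it alone
    return PROMOTIONS.get(name, name)


def decode_payload(raw, label):
    """Decode raw bytes, treating a GB 2312 claim as GB 18030."""
    return raw.decode(promoted_codec(label))
```

For example, decode_payload("朱镕基".encode("gb18030"), "GB2312") gives back the original string, where a strict GB 2312 decode would raise.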

(On the other hand, I *am* worried about the fact that there is a
REPLACEMENT CHARACTER in the content at this point.  Presumably that's
because we *decoded* the original mail with errors=who-gives-a-fsck,
which is not appropriate here---we can be almost sure that the content
is *not* corrupt, rather it's mislabeled.)
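
To make that worry concrete, here is a sketch of how a U+FFFD could end up in the content, under the same presumption that the decode step used errors="replace".  I have not checked the actual import code, and the helper names are made up.

```python
def decode_lossily(raw, label):
    """Decode with errors='replace': undecodable bytes become U+FFFD."""
    return raw.decode(label, errors="replace")


def can_encode(text, label):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(label)
    except UnicodeEncodeError:
        return False
    return True


# GB 18030 bytes that a MUA labeled as GB 2312 (sample text is assumed).
raw = "朱镕基".encode("gb18030")

# The lossy decode "succeeds", but the text now contains U+FFFD ...
text = decode_lossily(raw, "gb2312")

# ... and U+FFFD has no GB 2312 encoding, so re-encoding with the claimed
# charset fails with a UnicodeEncodeError like the one in the OP's report.
reencodable = can_encode(text, "gb2312")
```

Note that the original characters are already gone by the time the encode fails, which is why the fix belongs at the decode step.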

The OP can do a poor man's version, by going through the existing mbox
and case-insensitively regexp-replacing r"=\?GB2312\?" with
"=?GB18030?", and r'charset=("?)GB2312' with r'charset=\1GB18030'.
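
A minimal sketch of that poor man's version, working on raw bytes so that no decoding of the mbox is needed at all.  The function name is hypothetical, and note that it blindly rewrites any matching text, including text that merely appears in a message body.

```python
import re

# Encoded-word prefix in headers, e.g. "=?gb2312?B?...?=".
_ENCODED_WORD = re.compile(rb"=\?GB2312\?", re.IGNORECASE)
# charset parameter in Content-Type, with or without an opening quote.
_CHARSET_PARAM = re.compile(rb'charset=("?)GB2312', re.IGNORECASE)


def promote_gb2312_labels(data):
    """Relabel GB 2312 claims as GB 18030 in raw mbox bytes."""
    data = _ENCODED_WORD.sub(b"=?GB18030?", data)
    # \1 puts back the optional opening quote captured by the pattern.
    return _CHARSET_PARAM.sub(rb"charset=\1GB18030", data)
```

To use it, read the mbox in binary mode, pass the bytes through promote_gb2312_labels(), and write the result to a new file rather than over the original, so the import can be retried from a pristine copy if something goes wrong.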

I'm still jet-lagged from PyCon, so I'm not going to do more now, and
if you want some Python code to do this, please feel free to ping me
on or off list.

 > It would help if you file an issue at
 > <https://gitlab.com/mailman/hyperkitty/issues/new> with enough
 > information for us to reproduce it.

print("""
Subject: nothing to see here: =?GB2312?Q?=FF=FD?=

Oops!
""")

should do the trick. ;-)

I'll be looking for this issue, or you can assign it to me.

Steve