[Mailman-Developers] Mailman CVS sends out Japanese template mails in EUC-JP

Ben Gertzfield che@debian.org
Tue, 11 Sep 2001 11:37:42 +0900


>>>>> "BAW" == Barry A Warsaw <barry@zope.com> writes:
>>>>> "BG" == Ben Gertzfield <che@debian.org> writes:

    BG> How should we approach modifying the virgin queue?  I can hack
    BG> in conversion to ISO-2022-JP and adding the headers, but that
    BG> seems wrong somehow.  Maybe have each language supply its own
    BG> special "incoming mail charset conversion", "outgoing mail
    BG> charset conversion", and "header additions" modules?  I know
    BG> Japanese needs to convert incoming mails to EUC before they're
    BG> archived, and back to ISO-2022-JP when they go back out to the
    BG> list.

    BAW> We only have this problem for messages that Mailman
    BAW> generates, right?  IOW, for messages sent to the list by
    BAW> members, we're adhering to least-munging principles, so if
    BAW> someone sends a message to the list all bolluxed up, tough
    BAW> luck.

Unfortunately, it's not that simple.  

All the Japanese web pages for Mailman are in the 8-bit EUC-JP
encoding, which is right and proper.  

However, when mail comes in to a list, we need to convert it from
the 7-bit ISO-2022-JP encoding format to EUC-JP before we archive
it.  Otherwise, the archive web pages will have the static bits in
EUC-JP, and the email contents in ISO-2022-JP!  This is a mess.

Also, we need to deal with de-MIMEifying the headers before they
get archived, or the Subject/To/From/etc. lines will be completely
unreadable Base-64 in the archives.
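
The header de-MIMEifying step can be sketched with the `email.header`
module from today's Python standard library (the descendant of mimelib).
This is only an illustration under assumptions: `readable_header` is a
hypothetical helper name, not an existing Mailman function, and the
sample Subject is made up.

```python
from email.header import Header, decode_header, make_header


def readable_header(raw_value):
    """Decode RFC 2047 encoded-words (base64 or quoted-printable) to text.

    Hypothetical helper for illustration; not part of Mailman.
    """
    return str(make_header(decode_header(raw_value)))


# Build an encoded Subject roughly the way a Japanese mail client sends it.
subject = "日本語の件名"
encoded = Header(subject, charset="iso-2022-jp").encode()

# What the archiver would want: readable text, then EUC-JP bytes for the page.
decoded = readable_header(encoded)
archived = decoded.encode("euc_jp")
```

The encoded form is pure ASCII on the wire; only the decoded form is
something a reader of the archive pages could make sense of.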

The kconv Python module, available at
http://tomigaya.shibuya.tokyo.jp/~mak/kconv/index.html , will handle
the ISO-2022-JP <-> EUC-JP conversion.  There's a pure Python version,
so I think we should bundle it.  (It would suck to make ANOTHER module
required to download separately.)
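
As a rough sketch of the two-way conversion, the following uses the
`iso2022_jp` and `euc_jp` codecs that ship with current Python as a
stand-in for kconv (kconv's own API is not shown here, so none of its
calls are assumed).  The function names and the sample text are
hypothetical.

```python
def jis_to_euc(data):
    """Incoming direction: 7-bit ISO-2022-JP mail bytes -> 8-bit EUC-JP."""
    return data.decode("iso2022_jp").encode("euc_jp")


def euc_to_jis(data):
    """Outgoing direction: 8-bit EUC-JP bytes -> 7-bit ISO-2022-JP."""
    return data.decode("euc_jp").encode("iso2022_jp")


sample = "日本語のメール"
wire = sample.encode("iso2022_jp")   # what arrives at the list
stored = jis_to_euc(wire)            # what the archive pages would hold
resent = euc_to_jis(stored)          # what goes back out to the list

wire_is_7bit = all(b < 0x80 for b in wire)
stored_is_8bit = any(b >= 0x80 for b in stored)
```

For text that both charsets can represent, the round trip gives back
the original bytes, which is what makes the convert-in, convert-out
scheme workable for plain single-charset messages.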

Here are a few possible solutions.  We need to apply these both to
virgin-birth messages and to incoming/outgoing list messages.

1) Store all Japanese web pages and templates as ISO-2022-JP.  This is
   somewhat less than optimal; while it totally removes the archiving
   problem from the picture, many web browsers have a hard time
   auto-detecting ISO-2022-JP.  Also, 7-bit encoding may make the headers
   and web pages slightly bigger (not that big of a deal).  Finally,
   there are some half-width characters that I believe are not legal in
   ISO-2022-JP, so there are some things we would not be able to store in
   the web pages and templates.  This solution does not deal at all with
   the Content-Type issue (we need to add the charset=iso-2022-jp flag),
   nor does it deal with other languages, which have similar issues.

2) Add a set of Handlers/{lang}/ subdirectories, with Incoming.py and
   Outgoing.py modules in each of them.  The Incoming module (if it
   existed for the lang) would do all the munging necessary to make a
   message suitable for archiving.  This includes de-MIMEifying the
   headers if needed; they'll come in base64 or quoted-printable
   for various languages.  I believe mimelib will deal with this.

   The Outgoing module would convert the message back to the outgoing
   7-bit charset, as well as base-64 encode the headers if needed.
   Also, it would need to add a Content-Type: text/plain; charset=foo
   header.

   Problems: We can't blindly convert the entire message from one charset
   to another and expect things like MIME attachments to stay sane.  MIME
   will have to be handled carefully; what happens to a message that has
   multiple text/plain sections with different charsets? *shudder*

   We also can't blindly add the Content-Type header, unless we're
   positive we know what charset the message came in with.

   Final problem: Unicode.  M$ Outlook sends email in UTF-8 by
   default.  The kconv module will deal with this, but we need to be
   aware that people *are* sending email encoded with this format.  If
   we wanted to be extra studly, we could offer the option to
   automatically convert from UTF-8 to ISO-2022-JP, but this
   conversion is hardly a bijection, so if a user used multiple languages
   in one message, they're SOL.  Also, this will fully screw up PGP
   signatures, but I doubt any Outlook users are using PGP. *grin*

   Solution: The proper thing to do is to break each message up into
   its constituent MIME sections and munge each one on its own, dealing
   with whatever charset each section says it is, if it's text/*.  There
   isn't a framework for this yet via the Handlers method, as far as
   I know.  So we'll have to come up with one.
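
The per-section approach described under 2) might look something like
the sketch below.  It uses the `email` package from today's Python
standard library rather than any existing Handlers framework, and it
makes several assumptions: `archive_text_parts` and `ARCHIVE_CHARSET`
are hypothetical names, the archive charset is taken to be EUC-JP, and
parts that declare no charset are treated as ASCII.

```python
from email import message_from_bytes

ARCHIVE_CHARSET = "euc_jp"  # assumption: archive pages are stored as EUC-JP


def archive_text_parts(raw_message, fallback_charset="ascii"):
    """Return [(content_type, archive_bytes)] for every text/* section.

    Hypothetical helper for illustration; not an existing Mailman handler.
    """
    msg = message_from_bytes(raw_message)
    converted = []
    for part in msg.walk():
        # multipart/* containers and non-text attachments are left alone.
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # undo base64 / quoted-printable
        charset = part.get_content_charset() or fallback_charset
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Charset name unknown to Python: fall back rather than fail.
            text = payload.decode(fallback_charset, errors="replace")
        converted.append(
            (part.get_content_type(),
             text.encode(ARCHIVE_CHARSET, errors="replace"))
        )
    return converted


# A made-up message with two text sections in different charsets
# plus a binary attachment that must not be touched.
raw = b"\n".join([
    b"MIME-Version: 1.0",
    b'Content-Type: multipart/mixed; boundary="BOUNDARY"',
    b"",
    b"--BOUNDARY",
    b"Content-Type: text/plain; charset=iso-2022-jp",
    b"",
    "日本語".encode("iso2022_jp"),
    b"--BOUNDARY",
    b"Content-Type: text/plain; charset=utf-8",
    b"Content-Transfer-Encoding: 8bit",
    b"",
    "こんにちは".encode("utf-8"),
    b"--BOUNDARY",
    b"Content-Type: application/octet-stream",
    b"Content-Transfer-Encoding: base64",
    b"",
    b"AAECAw==",
    b"--BOUNDARY--",
    b"",
])
parts = archive_text_parts(raw)
```

Each text section is decoded using the charset it declares for itself,
so a message mixing ISO-2022-JP and UTF-8 sections is handled section
by section.  Characters the archive charset cannot represent are
replaced rather than raising an error, which is lossy and is exactly
the multi-language problem noted above.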

I like solution 2) a lot more, even though it's going to take a lot of
work.  We really can't ship MM2.1 with half-assed language support;
it's awesome that we localized the web pages, but people will go off
their rockers when things like Base64'd Subject/From lines don't work,
and the welcome mails go out in an unreadable format.

I'm going to start working on 2) now.  I'll send in a prototype patch
soon.

Ben

-- 
Brought to you by the letters T and J and the number 18.
"A yonker is a young man."
Debian GNU/Linux maintainer of Gimp and GTK+ -- http://www.debian.org/