[Mailman-Developers] Proper solution to Mailman CVS's Japanese problems

Ben Gertzfield che@debian.org
Tue, 25 Sep 2001 20:33:03 +0900


I had a flash of inspiration and realized the correct way to solve
Mailman's Japanese problems once and for all.

Kikuchi-san's solution, which converts all incoming emails to EUC-JP
for archiving/admin purposes and then converts them back to
ISO-2022-JP, is fine for users who neither use an external archiver
nor depend on PGP-signed mails coming through intact.  However, it is
not the best solution.

Currently, the Japanese localized messages for Mailman are stored in
EUC-JP (an 8-bit encoding).  This presents a problem, as Japanese
emails are inherently ISO-2022-JP, a very different (7-bit) encoding,
and displaying an EUC-JP web page with an ISO-2022-JP encoded email on
it just leads to garbage.
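
As a quick illustration of how different the two encodings are, here
is a small sketch.  It is not part of the original proposal; it
assumes a modern Python 3 with its standard "euc_jp" and "iso2022_jp"
codecs, and the sample string is arbitrary.

```python
# Sketch only: compare the EUC-JP and ISO-2022-JP encodings of one string.
text = "日本語"

euc = text.encode("euc_jp")      # 8-bit: bytes with the high bit set
jis = text.encode("iso2022_jp")  # 7-bit: escape sequences plus ASCII-range bytes

print(euc)
print(jis)

# ISO-2022-JP never uses the high bit; EUC-JP does for Japanese text.
print(all(b < 0x80 for b in jis))
print(any(b >= 0x80 for b in euc))

# Reading ISO-2022-JP bytes as if they were EUC-JP does not give the
# original text back, which is why mixing the two on one page fails.
misread = jis.decode("euc_jp", errors="replace")
print(misread == text)
```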

The "proper" solution is to store all of Mailman's Japanese web pages
internally in ISO-2022-JP.  This way, we can include emails almost
verbatim (see below) and never convert their charsets, which makes
external archiving of Japanese mail a reality and avoids the risk of
corrupting PGP-signed messages by converting the encoding back and
forth.

In addition, with this solution, we won't need to add an extra
dependency on the kconv Python module, which Kikuchi-san's method
requires for conversion back and forth between ISO-2022-JP <-> EUC-JP.

The one catch is that, as ISO-2022-JP is a 7-bit encoding, it can
contain the bytes for <, >, and & -- even within the non-ASCII parts
of the text.  These are familiar to all of us as the three special
characters that must be represented in HTML documents as the entities
&lt; &gt; and &amp; respectively.  Unfortunately, you can't just take
ISO-2022-JP text and globally search-and-replace these three
characters, as they must remain verbatim in the non-ASCII sections of
the encoded text, or the Japanese will be rendered as garbage!
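
To make the catch concrete, here is a small sketch (again assuming
Python 3's standard "iso2022_jp" codec, which is not something the
proposal itself depends on).  The three hiragana are chosen because
their second bytes in JIS X 0208 are &, < and > respectively;
naive_escape is a hypothetical helper that does the blind
search-and-replace described above.

```python
# Sketch only: blind HTML escaping corrupts ISO-2022-JP text.
text = "うぜぞ"
raw = text.encode("iso2022_jp")

def naive_escape(data):
    """Blindly replace the three special bytes, ignoring JIS state."""
    return (data.replace(b"&", b"&amp;")
                .replace(b"<", b"&lt;")
                .replace(b">", b"&gt;"))

# All three special bytes occur inside the Japanese segment.
print(b"&" in raw, b"<" in raw, b">" in raw)

broken = naive_escape(raw)

# False: the escaped bytes no longer decode to the original text.
print(broken.decode("iso2022_jp", errors="replace") == text)
```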

I have coded up a solution to this problem.  The following module I've
named JisEscape (the name can be changed :) will take any JIS-encoded
text (ISO-2022-JP, the Japanese email charset, is a subset of JIS) and
escape <, >, and & within *ONLY* the ASCII and ASCII-like segments of
the text, while leaving them as-is in the Japanese-encoded segments.

We can use this to process any messages/multipart message segments
that have charset=iso-2022-jp, and display them properly escaped in
the Mailman admin pages, as well as the Pipermail archives.  Since
this solution does not convert the message's charset at all, it works
perfectly even with external archiving solutions, like Hypermail.

I'm Cc:ing Kikuchi-san on this mail, because I believe he's the one
who did most of the work translating Mailman's messages and web pages
into Japanese.  Kikuchi-san, what do you think of this solution?

Also, Barry, I hope this solution makes sense to you.  We can keep the
localized Japanese messages that Mailman outputs at the shell in
EUC-JP, or change them to ISO-2022-JP; I'm not sure where in the
Mailman source we specify the encoding for the 'ja' messages, but it
should be possible to specify that somewhere.  In my experience, most
Japanese-aware terminals can understand both EUC-JP and ISO-2022-JP,
but I'm not certain of that. (At least, krxvt supports them both. :)

Module follows.  I'm willing to also do the work of integrating this
with Mailman.  It won't take much; instead of blindly HTML-escaping
every message, we'll only do so if the message [or message part] is
not ISO-2022-JP, and if it is ISO-2022-JP, we'll run it through this
module.  I can also convert the HTML pages and internal messages to
ISO-2022-JP.
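
A rough sketch of that dispatch, written for a modern Python 3, might
look like the following.  The names escape_part and jis_html_escape
are hypothetical and are not existing Mailman functions; the compact
jis_html_escape here is only a stand-in for the module below (it
handles the common ESC sequences but ignores SO/SI and the NEC and
JIS X 0208:1997 escapes), and it assumes the message body is held as a
string of raw 7-bit characters.

```python
import html
import re

# Escape sequences that select a single-byte, ASCII-like character set.
_ASCII_LIKE = {"\x1b(B", "\x1b(J", "\x1b(H"}

# Common ISO-2022-JP designations (assumption: a reduced set, see above).
_ESCAPE_RE = re.compile(r"(\x1b\([BJHI]|\x1b\$[@B]|\x1b\$\(D)")

def jis_html_escape(text):
    """Escape <, > and & only inside the ASCII-like segments."""
    out = []
    escapable = True  # text before any escape sequence is treated as ASCII
    # With one capturing group, split() alternates text and escape sequences.
    for i, piece in enumerate(_ESCAPE_RE.split(text)):
        if i % 2:
            escapable = piece in _ASCII_LIKE
            out.append(piece)
        elif escapable:
            out.append(html.escape(piece, quote=False))
        else:
            out.append(piece)
    return "".join(out)

def escape_part(text, charset):
    """Pick the escaping strategy from the part's declared charset."""
    if charset and charset.lower() == "iso-2022-jp":
        return jis_html_escape(text)
    return html.escape(text, quote=False)

raw = "うぜぞ <b>".encode("iso2022_jp").decode("ascii")
escaped = escape_part(raw, "ISO-2022-JP")
print(escaped.encode("ascii").decode("iso2022_jp"))
```

The charset check is deliberately the only branch point, so parts in
any other charset keep going through the ordinary escaping path.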

I'm not sure whether this module should be made part of Python proper
or not, although it would be extremely useful for anyone who is
dealing with Japanese email and the web as a general solution.  Barry,
if you think it's proper, should I submit it upstream to get it
integrated into Python proper?  Right now, the cgi.escape() function
does something similar to this, but is not JIS-aware.

Ben

#!/usr/bin/env python

# JisEscape.py
#
# Written 2001-09-25 by Ben Gertzfield <ben@gmo.jp>
#
# Copyright (C) 1998,1999,2000,2001 by the Free Software Foundation, Inc.
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
# 
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
# 
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software 
# Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

import sys

class JisEscape:
    """Safely HTML escape JIS (Japanese) encoded text.

    JIS (of which ISO-2022-JP, the encoding method used for Japanese email,
    is a subset) is a 7-bit encoding that encompasses several Japanese
    character sets, as well as ASCII.

    To properly display JIS on the World Wide Web, the usual HTML
    substitutions (&lt; for <, &gt; for >, and &amp; for &) must be
    treated with care; inside JIS-encoded text, if <, >, or & appear
    inside a part of the document in a Japanese character set, they
    MUST be left verbatim.

    Because JIS text switches between ASCII-like character sets and
    Japanese character sets by means of escape codes, we must track
    the current state and escape <, >, and & only inside the ASCII-like
    segments, while leaving the Japanese segments verbatim.

    This module will safely prepare text containing (for example) a
    Japanese email for display on the Web.
    """
    
    ESC = "\x1b"
    SO = "\x0e"
    SI = "\x0f"

    # Possible states for JIS-encoded text.
    NON_JIS          = 0  # Encoding not specified
    JIS_ROMAN        = 1  # modified ASCII: \ = yen, ~ = overbar
    ASCII            = 2  # ASCII
    HANKAKU          = 3  # half-width katakana (JIS X 0201-1997)
    JIS_C_6226_1978  = 4  # old JIS (JIS C 6226-1978)
    JIS_X_0208_1983  = 5  # updated JIS (JIS X 0208-1983)
    JIS_X_0208_1997  = 6  # modern JIS (JIS X 0208:1997)
    JIS_X_0212_1990  = 7  # JIS supplement (JIS X 0212-1990)
    NEC_KANJI        = 8  # NEC kanji
    HANKAKU_START    = 9  # shift to half-width katakana (keep old state)
    HANKAKU_END      = 10 # return to old state

    escapes = { ESC + "(J"              : JIS_ROMAN,
                ESC + "(H"              : JIS_ROMAN,
                ESC + "(B"              : ASCII,
                ESC + "(I"              : HANKAKU,
                ESC + "$@"              : JIS_C_6226_1978,
                ESC + "$B"              : JIS_X_0208_1983,
                ESC + "&@" + ESC + "$B" : JIS_X_0208_1997,
                ESC + "$(D"             : JIS_X_0212_1990,
                ESC + "K"               : NEC_KANJI,
                SO                      : HANKAKU_START,
                SI                      : HANKAKU_END,
                }

    def __init__(self, text):
        """Create the JIS object, initialized to the value of text.

        As JIS text does not necessarily start in any specific character
        set, the object's state is initialized to the non-JIS state.  This
        can be updated by checkJisEscape().
        """

        self.text = text
        self.state = self.NON_JIS
        self.last_state = self.NON_JIS

    def HTMLEscape(self):
        """Return the JIS encoded text with special HTML entities escaped.

        This function returns a string with the special HTML entities
        (<, >, and &) safely escaped as &lt; &gt; and &amp; respectively.

        It is aware of JIS encoding rules, and as such will not escape
        these characters when they are inside Japanese-encoded segments
        of the text.  However, as JIS encoded text can contain ASCII
        (and ASCII-like) segments, it will escape these three characters
        within the non-Japanese segments of the string.
        """

        out = ""
        count = 0

        while (count < len(self.text)):
            # Check if the current character is a control character.
            if self.isControl(self.text[count]):
                # If so, does it start a JIS escape?
                length = self.checkJisEscape(self.text[count:])
                if length:
                    # Yes: copy the escape to the output string and skip
                    # forward that many characters.
                    out = out + self.text[count:count + length]
                    count = count + length
                else:
                    # No: just copy and move forward one character.
                    out = out + self.text[count]
                    count = count + 1
            else:
                # If it's not a control character, are we in a part
                # of the string that's ASCII-like (i.e. HTML escapable?)
                if self.htmlEscapableState():
                    if self.text[count] == '<':
                        out = out + '&lt;'
                    elif self.text[count] == '>':
                        out = out + '&gt;'
                    elif self.text[count] == '&':
                        out = out + '&amp;'
                    else:
                        out = out + self.text[count]
                else:
                    out = out + self.text[count]

                count = count + 1

        return out

    def isControl(self, char):
        """Return true if char is a control character."""
        if 0x00 <= ord(char) <= 0x1f:
            return 1
        # Check for DEL as well.
        elif char == chr(0x7f):
            return 1
        else:
            return 0
    
    def checkJisEscape(self, substring):
        """Update the state of the JIS text if substring is a JIS escape.

        Checks if substring starts with a JIS escape, and if so,
        updates the object's state (and last_state, if needed)
        attributes to reflect the current character set specified.

        Returns the length of the escape that the substring starts with,
        or None if the substring does not start with a JIS escape.
        """

        # Check each of the possible JIS escapes.
        for e in self.escapes.keys():
            # If substring begins with one of them..
            if substring[0:len(e)] == e:
                # First, check if we're ending a half-width katakana
                # segment -- we better have started one first!
                if self.escapes[e] == self.HANKAKU_END:
                    if self.state == self.HANKAKU_START:
                        self.state = self.last_state
                        self.last_state = self.NON_JIS
                    else:
                        continue
                else:
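                    # Are we starting a half-width katakana state?
                    # Save the old state in last_state if so, unless
                    # we are already shifted (a repeated SO must not
                    # overwrite the saved state).
                    if (self.escapes[e] == self.HANKAKU_START
                        and self.state != self.HANKAKU_START):
                        self.last_state = self.state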

                    # In any case, update the object's JIS state.
                    self.state = self.escapes[e]

                return len(e)

        # if we've gotten here, substring does not begin with an escape.
        return None

    def htmlEscapableState(self):
        """Return true if the object is currently in an HTML-escapable state.

        JIS text includes text in both Japanese and ASCII-like
        character sets; this function returns true when the object is
        in one of the ASCII-like charsets, signifying that the special
        HTML characters <, >, and & should be escaped.
        """
        if (self.state == self.JIS_ROMAN or self.state == self.ASCII
            or self.state == self.NON_JIS):
            return 1
        else:
            return 0

# If called as a standalone script, just HTML escape JIS-encoded stdin.
if __name__ == '__main__':
    text = sys.stdin.read()
    jis = JisEscape(text)
    # Use write() rather than print so no extra trailing newline is added.
    sys.stdout.write(jis.HTMLEscape())