[Mailman-Developers] I18n proposal

Ben Gertzfield che@debian.org
Wed, 21 Nov 2001 18:29:41 +0900


>>>>> "Mikhail" == Mikhail Zabaluev <mhz@alt-linux.org> writes:

    Mikhail> Hello, I'd like to see some I18n issues in Mailman
    Mikhail> addressed prior to the 2.1 release. Basically, these are
    Mikhail> bugs or misfeatures related to transformation of
    Mikhail> MIME-encoded messages.

I am working actively with Barry now on Mailman's i18n issues. See
my recent patches in the archives.

    Mikhail> The most serious bug I see here is that messages encoded
    Mikhail> in base64 still get decorated with plaintext.

Headers or bodies?  Are you talking about the footer tacked on to the
end of messages?  If so, it would be simple with the new message
structure to make the footer a separate text part.  Though I don't
see how plain text added after the closing boundary could get
corrupted; could you post an example of a corrupted message?
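
To make the "separate text part" idea concrete, here is a minimal
sketch using Python's standard email package.  The function name
add_footer_part and its parameters are made up for illustration; this
is not Mailman's actual code, just one way the structure could look.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def add_footer_part(body_text, body_charset, footer_text,
                    footer_charset="us-ascii"):
    """Build a multipart/mixed message with the body and the footer
    as two independent text/plain parts.

    Hypothetical helper: each part carries its own charset and its
    own transfer encoding, so a base64 body is never followed by raw
    footer text inside the same part.
    """
    outer = MIMEMultipart("mixed")
    outer.attach(MIMEText(body_text, "plain", body_charset))
    outer.attach(MIMEText(footer_text, "plain", footer_charset))
    return outer


msg = add_footer_part("Привет", "koi8-r", "-- list footer")
body_part, footer_part = msg.get_payload()
```

Because each part has its own Content-Type and
Content-Transfer-Encoding headers, the footer does not have to match
whatever encoding the poster's mail client chose for the body.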

    Mikhail> No, wait -- there still is an implicit assumption that
    Mikhail> message bodies and the decoration text share the same
    Mikhail> character set. Thus the decorations should be recoded
    Mikhail> from what character set they are assumed to be in (ASCII?
    Mikhail> ISO8859-1? UTF-8? Selectable per list?) into the
    Mikhail> character set of the message.

I'll work on addressing this now that we have some code that actually
deals with character set issues.
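
As a rough sketch of the recoding step Mikhail describes, assuming the
footer's own charset is known (say, from a per-list setting): decode
the footer to Unicode, then encode it into the message's charset.  The
name recode_footer is hypothetical, and returning None on failure is
just one possible policy.

```python
def recode_footer(footer_bytes, footer_charset, message_charset):
    """Recode a footer from its own charset into the message's charset.

    Returns the recoded bytes, or None when the footer contains
    characters the message charset cannot represent.  In that case the
    caller would need another strategy, such as attaching the footer
    as a separate part.
    """
    text = footer_bytes.decode(footer_charset)
    try:
        return text.encode(message_charset)
    except UnicodeEncodeError:
        return None


footer = "Список рассылки".encode("utf-8")
koi8_footer = recode_footer(footer, "utf-8", "koi8-r")
latin1_footer = recode_footer(footer, "utf-8", "iso-8859-1")  # None
```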

    Mikhail> Another problem is encoded messages in archives. Heck,
    Mikhail> look at this list's archive to see what I'm talking
    Mikhail> about. Those should also be decoded and have character
    Mikhail> set converted to some uniform one. I'd suggest UTF-8, but
    Mikhail> many browsers and text viewers still don't grok this
    Mikhail> charset, so it'd better be selectable as well.

I talked with Barry about this today.  My solution is to "guess" the
archive's character set by picking whichever one is most common among
its messages, and to declare that as the charset in the HTML.  For any
messages with multi-language subjects or bodies, the main language
will be left in that character set, and the parts in other languages
will be encoded as HTML numeric character references (Unicode code
points).
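
A minimal sketch of that approach, with made-up function names
(pick_archive_charset, to_archive_html) and assuming each message's
declared charset has already been collected.  It leans on Python's
"xmlcharrefreplace" error handler, which turns any character the
target charset cannot represent into a numeric character reference.

```python
from collections import Counter


def pick_archive_charset(message_charsets, default="us-ascii"):
    """Return the most frequently declared charset, lowercased.

    Messages with no declared charset (None or "") are ignored.
    """
    counts = Counter(c.lower() for c in message_charsets if c)
    return counts.most_common(1)[0][0] if counts else default


def to_archive_html(text, archive_charset):
    """Encode text in the archive charset; characters outside it
    become numeric character references such as &#1055;.

    This only handles the charset problem.  Escaping of '<', '>' and
    '&' for HTML would still have to be done separately.
    """
    return text.encode(archive_charset, errors="xmlcharrefreplace")


charset = pick_archive_charset(["ISO-8859-1", "iso-8859-1", "koi8-r", None])
html_bytes = to_archive_html("café Привет", charset)
```

Here the Latin-1 text stays in the archive's own encoding, and only
the Cyrillic word is written out as references.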

This will require Python Unicode codecs for all our languages, which
do not exist for KOI-8, Big5, or GB, as far as I know.
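
Whether a given Python installation has a codec for a charset can be
checked with codecs.lookup(), which raises LookupError for names it
does not know.  A small sketch; the helper name missing_codecs is made
up, and the charset names are only examples.

```python
import codecs


def missing_codecs(charsets):
    """Return the charset names this Python has no codec for."""
    missing = []
    for name in charsets:
        try:
            codecs.lookup(name)
        except LookupError:
            missing.append(name)
    return missing


unavailable = missing_codecs(["koi8-r", "big5", "gb2312"])
```

The result depends on the Python version and on any extra codec
packages installed, so it is worth running against the interpreter
that will actually host the list.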

Ben

-- 
Brought to you by the letters B and S and the number 14.
"It is sad. *Campers* cannot *dance*. Not even a *party*."
Debian GNU/Linux maintainer of Gimp and Nethack -- http://www.debian.org/