[Mailman-i18n] HTML entities (é) in es, it, no translations

Martin von Loewis loewis@informatik.hu-berlin.de
31 Jan 2002 14:24:24 +0100


Ben Gertzfield <che@debian.org> writes:

> Actually, to be precise, HTML 4.01's native encoding is Unicode,
> which Latin-1 happens to be a (very small) subset of.

To be really precise, HTML 4.01's "document character set" is the
"Universal Character Set" (as defined in ISO 10646), see

http://www.w3.org/TR/html4/charset.html

The character encoding is a different matter (Unicode is not a
character encoding); it is transmitted as part of the HTTP
response. As the document above points out, the default encoding, if
none is specified, is Latin-1 (it also points out that it is bad to
rely on that).

> Unfortunately, as much as I'd like, we can't make *everything* 
> Unicode, because a lot of older browsers still don't support it.

That is completely irrelevant; Unicode is *not* a character
encoding. In this context, it is a Python internal datatype. When
producing an HTML document, strings of that type need to be encoded in
the target document encoding (which definitely will *not* be Unicode,
but perhaps a Unicode encoding, such as UTF-8, or some other
encoding).
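
To make the distinction concrete, here is a minimal sketch. It assumes
a Python 3 interpreter, where str plays the role of the Unicode
datatype; the variable names are purely illustrative:

```python
# Text as an abstract sequence of Unicode characters (U+00E9 is e-acute).
text = "caf\u00e9"

# The same text serialized in two different character encodings.
as_utf8 = text.encode("utf-8")          # b'caf\xc3\xa9'
as_latin1 = text.encode("iso-8859-1")   # b'caf\xe9'

# A character the target encoding lacks can be written as a numeric
# character reference, which refers to the document character set.
as_ascii = text.encode("ascii", errors="xmlcharrefreplace")  # b'caf&#233;'
```

The byte strings differ, but a browser that is told the right
encoding recovers the same characters from each of them.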

> Which East Asian ones are missing?  Mailman CVS works beautifully
> for me with Japanese, and the screenshot I sent earlier today shows
> Chinese (both simplified and traditional) working in email.

Python does not currently include codecs for iso-2022-jp, gb2312,
big5, euc-jp, shift-jis. Since Mailman leaves all strings as-is, and
never mixes encodings, it can let them pass through unmodified. There
are a number of pitfalls, though:

- On mailing lists, people may use different encodings; some of the
  common combinations might be:
  European languages: ISO-8859-1, ISO-8859-15 (for the Euro), UTF-8
  Japanese: ISO-2022-JP, eucJP, shift-jis, UTF-8
  Chinese: gb2312, big5

  This is probably an archive problem only; however, when Mailman adds
  a footer, the result is garbage if the footer encoding differs from
  the message body encoding.

- To analyse the subject, Mailman needs to strip off the
  subject_prefix from the incoming message. If the message uses a
  MIME-encoded header, the subject prefix may be base64
  encoded. Currently, Mailman fails to strip the prefix in this
  case. There is a patch on SF that tries to decode the subject. If
  the encoding is not known to Python, this will still fail.

- To produce HTML pages, Mailman needs to quote markup characters. For
  some encodings (e.g. iso-2022-jp), the byte for an HTML markup
  character such as '<' may also occur as part of a multi-byte
  sequence. For these encodings, Mailman currently performs no quoting
  at all. This is incorrect if an iso-2022-jp message contains a true
  '<' character, which would need to be converted to '&lt;'.
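
The last two pitfalls can be sketched as follows. This is not
Mailman's code: strip_prefix and quote_for_html are hypothetical
helpers, and the sketch assumes a Python 3 interpreter that ships an
iso-2022-jp codec. The idea in both cases is to decode to Unicode
first, operate on characters, and only then encode again:

```python
import html
from email.header import decode_header, make_header

def strip_prefix(subject_header: str, prefix: str) -> str:
    """Decode a possibly MIME-encoded Subject, then drop the prefix."""
    # decode_header yields (bytes, charset) chunks; make_header joins
    # them into a Header whose str() is the decoded text. A charset
    # without a codec would still fail at this step.
    subject = str(make_header(decode_header(subject_header)))
    if subject.startswith(prefix):
        subject = subject[len(prefix):].lstrip()
    return subject

def quote_for_html(raw: bytes, source_encoding: str,
                   target_encoding: str = "utf-8") -> bytes:
    """Quote markup on decoded characters, not on raw bytes."""
    text = raw.decode(source_encoding)
    return html.escape(text).encode(target_encoding,
                                    errors="xmlcharrefreplace")

# "[list] hello", base64-encoded as a MIME header word.
subject = strip_prefix("=?utf-8?b?W2xpc3RdIGhlbGxv?=", "[list] ")
# subject == "hello"

# ESC $ B switches to JIS X 0208, where the two bytes "<j" form one
# kanji; ESC ( B switches back to ASCII, where "<" is a real
# less-than sign. Only the second one may be quoted.
raw = b"\x1b$B<j\x1b(B < 3"
quoted = quote_for_html(raw, "iso-2022-jp")
```

A byte-level replacement of '<' on raw would have corrupted the kanji;
working on the decoded string quotes exactly one character.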

> The Japanese codec is in a good state and will be easy enough to
> ship; the Chinese ones are only available in CVS that I know of, so
> we will need to make a proper distribution.

I'd encourage you to have a look at the iconv codec also. If the
system iconv is powerful enough (e.g. glibc on Linux), all encodings
of the world would be supported with that single codec.

Regards,
Martin