[Mailman-i18n] "Funny" characters in real names?

Barry A. Warsaw barry@python.org
Tue, 17 Sep 2002 16:58:38 -0400


To follow up, I believe I have this working now.  Here's how it works.

First, the only change to the MemberAdaptor API is that real names can
now be Unicode strings as well as 8-bit strings.  If they're 8-bit
then they'll contain only ascii characters.

When a real name is entered into a web form, we'll first attempt to
convert it to us-ascii.  If that succeeds, we know the real name is
ascii only and we'll store it in the membership database as an 8-bit
ascii-only-containing string.

If the conversion fails, we'll convert the real name to Unicode using
the charset of the context's language (i.e. list preferred if we're
looking at an admin page, user preferred if we're looking at an
options page, and form value if we're looking at the subscribe page --
all with appropriate fallbacks to Something Sensible).  We'll also do
html entity replacement (e.g. &#246; -> ö).  We'll store this Unicode
string as the member's real name in the membership database, but we
don't store the charset because...

...when we need to get a printable version of the member name, we yank
out either the ascii string or the Unicode string.  If it's ascii,
we're done.  If it's Unicode, then we try to encode it to the
charset of the web page we're printing (for cgi), or to the charset of
the outgoing email message.
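
As a rough sketch of the web-form input side described above (not the
code being checked in), the logic might look like the following.  It is
written for modern Python 3, where every str is already Unicode, so the
"8-bit ascii string vs. Unicode string" distinction collapses into a
single return type.  The function name canonicalize_realname and its
parameters are hypothetical, and html.unescape is a stand-in for the
entity replacement step (it also handles named entities such as &amp;).

```python
import html


def canonicalize_realname(raw: bytes, charset: str = "iso-8859-1") -> str:
    """Turn the bytes submitted in a web form into a stored real name.

    `charset` stands for the charset of the context's language
    (list preferred, user preferred, or form value).
    """
    try:
        # Step 1: if the name is pure us-ascii, keep it as-is.
        return raw.decode("us-ascii")
    except UnicodeDecodeError:
        pass
    # Step 2: otherwise decode using the context charset.  "replace" is
    # an assumption here, standing in for "fallbacks to Something Sensible".
    text = raw.decode(charset, "replace")
    # Step 3: replace html entities such as &#246; with the real character.
    return html.unescape(text)
```

Note that no charset is stored alongside the result; the caller only
keeps the returned string.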

For output web pages, if the encoding fails, we'll convert chars > 127
to html entities (e.g. ö -> &#246;) so in most cases we'll still see
the name with the proper characters.  For this case, think about a
user who selects Spanish, enters a ñ in their name, and then switches
their preferred language to English (us-ascii).  You'd like their name
to still show up correctly.
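
The web output path can be sketched along these lines, again in modern
Python 3 and under the assumption that the stored name is a str.  The
helper name realname_for_web is made up for illustration; the
"xmlcharrefreplace" error handler is a standard codec feature that
emits numeric character references for anything it cannot encode.

```python
def realname_for_web(name: str, page_charset: str) -> bytes:
    """Encode a stored real name for a page served in `page_charset`."""
    try:
        # Normal case: the page charset can represent the whole name.
        return name.encode(page_charset)
    except UnicodeEncodeError:
        # Fallback: keep the ascii characters and turn everything above
        # 127 into a numeric html entity, e.g. ñ becomes &#241;.
        return name.encode("us-ascii", "xmlcharrefreplace")
```

For the Spanish-to-English example, a name containing ñ rendered on a
us-ascii page takes the fallback branch, so the browser still shows the
right character.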

For email, if the name has non-ascii characters in it, we'll use the
email.Header.Header class to convert the To string to an RFC-compliant
format.  If that fails we fall back on encoding to us-ascii replacing
non-ascii characters with `?'s.
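
Here is a minimal sketch of that email path using the modern Python 3
spelling of the same class, email.header.Header.  The function name
realname_for_header is hypothetical, and the exact set of exceptions
caught is an assumption about what "if that fails" covers.

```python
from email.header import Header


def realname_for_header(name: str, charset: str) -> str:
    """Return a header-safe form of a real name for an outgoing message."""
    try:
        # Pure ascii names need no special treatment.
        name.encode("us-ascii")
        return name
    except UnicodeEncodeError:
        pass
    try:
        # Produce an RFC 2047 encoded-word in the message charset.
        return Header(name, charset).encode()
    except (UnicodeError, LookupError):
        # Last resort: us-ascii with `?' in place of each non-ascii char.
        return name.encode("us-ascii", "replace").decode("us-ascii")
```

With charset "iso-8859-1", a name that contains a character outside
Latin-1 (say, a Polish Ł) cannot be encoded, so it drops through to the
`?' replacement branch.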

This seems to work fairly well (with some ugly changes also necessary
to the logging system), with one minor kludge.  I want to allow
non-ASCII characters in real names for English lists.  I'm nervous
about changing the default charset for English from us-ascii because
I'm superstitious about unintended side-effects.  So I'm making a
couple of special cases for us-ascii.  When decoding a string from a
web form, if the default charset would be us-ascii, I'll use
iso-8859-1 instead.  Then when encoding a name in an email header, if
the charset is us-ascii, again, I'll use iso-8859-1.  This seems like
a practical compromise, if a bit ugly.  Feedback is welcome.
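
The us-ascii special case amounts to a small charset substitution
applied in both directions.  A sketch in modern Python 3, with the
helper name effective_charset invented for illustration:

```python
def effective_charset(charset: str) -> str:
    """Substitute iso-8859-1 wherever us-ascii would otherwise be used."""
    if charset.lower() in ("us-ascii", "ascii"):
        return "iso-8859-1"
    return charset


# Decoding a web form value on an English (us-ascii) list:
name = b"Mu\xf1oz".decode(effective_charset("us-ascii"))
```

The list's default charset itself stays us-ascii; only the name
decoding and header encoding steps consult the substitute.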

I'm about to check all this stuff in.  Testing will be /greatly/
appreciated!

-Barry