[Mailman-i18n] "Funny" characters in real names?

Barry A. Warsaw barry@python.org
Tue, 17 Sep 2002 17:34:36 -0400


>>>>> "BG" == Ben Gertzfield <che@debian.org> writes:

    >> To follow up, I believe I have this working now.  Here's how it
    >> works.

    BG> Thanks for the excellent explanation and implementation,
    BG> Barry.

Took me two days.  I still say Unicode is something everyone wants
until they get it. :)

    BG> I'll test this when it's checked in.  Some comments below..

Excellent!

    >> First, the only change to the MemberAdaptor API is that real
    >> names can now be Unicode strings as well as 8-bit strings.  If
    >> they're 8-bit then they'll contain only ascii characters.

    BG> ASCII is by definition 7-bit, Barry.  Did you mean ISO-8859-1
    BG> here?

Sorry, I meant "normal" Python strings (sometimes called "8-bit
strings") but which contain only 7-bit ascii characters.  Those
beasties I don't convert to Python unicode strings.

    >> When a real name is entered into a web form, we'll first
    >> attempt to convert it to us-ascii.  If that succeeds, we know
    >> the real name is ascii only and we'll store it in the
    >> membership database as an 8-bit ascii-only-containing string.

    BG> Again, I assume you mean ISO-8859-1 instead of ascii here.

Same thing here.  We do name.encode('us-ascii') and catch any
UnicodeError that might occur.  If no error occurs, we know we have a
string with 7-bit ascii characters in it, so we store that as an 8-bit
Python string, not as a unicode Python string.
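
A minimal sketch of that check, written for modern Python 3 rather than
the Python 2 of this thread.  The function name is_ascii_only is
hypothetical, not part of Mailman's API:

```python
def is_ascii_only(name):
    """Return True if name contains only 7-bit ASCII characters."""
    try:
        # Encoding raises UnicodeEncodeError (a UnicodeError subclass)
        # as soon as a character outside the 7-bit range is seen.
        name.encode('us-ascii')
    except UnicodeError:
        return False
    return True
```

A name that passes this test would be kept as a plain string; anything
else would go down the Unicode path.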

    >> If the conversion fails, we'll convert the real name to Unicode
    >> using the charset of the context's language (i.e. list
    >> preferred if we're looking at an admin page, user preferred if
    >> we're looking at an options page, and form value if we're
    >> looking at the subscribe page -- all with appropriate fallbacks
    >> to Something Sensible).  We'll also do html entity replacement
    >> (e.g. &#246; -> ö).  We'll store this Unicode string as the
    >> member's real name in the membership database, but we don't
    >> store the charset because...

    BG> This is a good thing.  Note that some browsers might (I
    BG> haven't checked this) incorrectly send the entity &#246; for
    BG> whatever character is at position 246 in the user's default
    BG> character set, not character 246 in Unicode.  This might be
    BG> something to look out for, but I don't know if it's important.

I don't know what else to do.  Note that you could literally type
&#246; into the web form and it would have the same effect.  This is
probably an 80/20 solution.
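
The decode-then-replace step could be sketched like this in modern
Python 3.  It assumes the form value arrives as raw bytes and that the
charset has already been chosen from the page context; the name
to_unicode_name and the order of the two steps are assumptions, not
Mailman's actual code:

```python
import html


def to_unicode_name(raw, charset):
    """Decode a submitted real name and expand HTML character references."""
    # Decode with the context's charset; undecodable bytes become U+FFFD
    # instead of raising, so a bad submission cannot break the form.
    text = raw.decode(charset, 'replace')
    # html.unescape turns references such as &#246; into the Unicode
    # character they name, treating the number as a Unicode code point.
    return html.unescape(text)
```

Because the number is always read as a Unicode code point, a browser
that sent position 246 of some other character set would end up with
the wrong character, which is the case Ben describes.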

    BG> Everything else looks good.  The kludge to assume iso-8859-1
    BG> on us-ascii pages is unfortunately a generally good one, as
    BG> that will make the most people happy.  I hate to do it,
    BG> though!

Me too!  It means that names in other charsets will be screwed on
English lists, but again, I think this is the best we can do for a
practical 80/20 solution.
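
A rough sketch of the fallback and of how it mangles other charsets,
again in modern Python 3.  The helper name effective_charset is
hypothetical and the set of aliases it checks is an assumption:

```python
def effective_charset(page_charset):
    """Pick the charset used to decode non-ASCII input from a page."""
    # On us-ascii pages, assume the browser really sent iso-8859-1.
    if page_charset.lower() in ('us-ascii', 'ascii'):
        return 'iso-8859-1'
    return page_charset


def decode_name(raw, page_charset):
    """Decode raw form bytes using the fallback rule above."""
    return raw.decode(effective_charset(page_charset), 'replace')
```

With this rule, the iso-8859-1 byte 0xF6 submitted on a us-ascii page
decodes to the intended character, while the same character submitted
as UTF-8 (bytes 0xC3 0xB6) decodes to two unrelated Latin-1 characters.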

Thanks for the feedback.
-Barry