[Mailman-i18n] "Funny" characters in real names?
Barry A. Warsaw
barry@python.org
Tue, 17 Sep 2002 17:34:36 -0400
>>>>> "BG" == Ben Gertzfield <che@debian.org> writes:
>> To follow up, I believe I have this working now. Here's how it
>> works.
BG> Thanks for the excellent explanation and implementation,
BG> Barry.
Took me two days. I still say Unicode is something everyone wants
until they get it. :)
BG> I'll test this when it's checked in. Some comments below..
Excellent!
>> First, the only change to the MemberAdaptor API is that real
>> names can now be Unicode strings as well as 8-bit strings. If
>> they're 8-bit then they'll contain only ascii characters.
BG> ASCII is by definition 7-bit, Barry. Did you mean ISO-8859-1
BG> here?
Sorry, I meant "normal" Python strings (sometimes called "8-bit
strings") but which contain only 7-bit ascii characters. Those
beasties I don't convert to Python unicode strings.
>> When a real name is entered into a web form, we'll first
>> attempt to convert it to us-ascii. If that succeeds, we know
>> the real name is ascii only and we'll store it in the
>> membership database as an 8-bit ascii-only-containing string.
BG> Again, I assume you mean ISO-8859-1 instead of ascii here.
Same thing here. We do name.encode('us-ascii') and catch any
UnicodeError that might occur. If no error occurs, we know we have a
string with 7-bit ascii characters in it, so we store that as an 8-bit
Python string, not as a unicode Python string.
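
A minimal sketch of that check, written for Python 3 (where every str is
already Unicode, so the 8-bit/unicode storage split described above does
not carry over directly). The helper name is_ascii_only is hypothetical
and is not taken from the Mailman source:

```python
def is_ascii_only(name: str) -> bool:
    """Return True if every character of name fits in 7-bit us-ascii."""
    try:
        # Encoding raises UnicodeEncodeError (a UnicodeError subclass)
        # as soon as a non-ASCII character is encountered.
        name.encode('us-ascii')
    except UnicodeError:
        return False
    return True
```

Under this sketch, a name such as "Barry Warsaw" would pass the check,
while a name containing an o-umlaut would fail it and take the Unicode
path instead.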
>> If the conversion fails, we'll convert the real name to Unicode
>> using the charset of the context's language (i.e. list
>> preferred if we're looking at an admin page, user preferred if
>> we're looking at an options page, and form value if we're
>> looking at the subscribe page -- all with appropriate fallbacks
>> to Something Sensible). We'll also do html entity replacement
>> (e.g. &#246; -> ö). We'll store this Unicode string as the
>> member's real name in the membership database, but we don't
>> store the charset because...
BG> This is a good thing. Note that some browsers might (I
BG> haven't checked this) incorrectly send the entity &#246; for
BG> whatever character is at position 246 in the user's default
BG> character set, not character 246 in Unicode. This might be
BG> something to look out for, but I don't know if it's important.
I don't know what else to do. Note that you could literally type
&#246; into the web form and it would have the same effect. This is
probably an 80/20 solution.
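
As a rough illustration of the decode-then-replace-entities step, here
is a Python 3 sketch. It assumes the form value arrives as raw bytes and
uses the standard library's html.unescape, which is not necessarily how
Mailman does the replacement; the function name real_name_from_form and
its parameters are hypothetical:

```python
import html


def real_name_from_form(raw: bytes, charset: str) -> str:
    """Decode a submitted real name and expand HTML character references.

    charset stands in for the charset of the context's language.
    """
    # Undecodable bytes become U+FFFD rather than raising an error.
    text = raw.decode(charset, 'replace')
    # Turns numeric references such as &#246; into the character itself.
    return html.unescape(text)
```

With this sketch, typing the reference into the form and sending the
raw iso-8859-1 byte both produce the same one-character result, which is
the "same effect" mentioned above.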
BG> Everything else looks good. The kludge to assume iso-8859-1
BG> on us-ascii pages is unfortunately a generally good one, as
BG> that will make the most people happy. I hate to do it,
BG> though!
Me too! It means that names in other charsets will be screwed on
English lists, but again, I think this is the best we can do for a
practical 80/20 solution.
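
To make the trade-off concrete, here is a small Python 3 sketch of the
fallback and of what it costs. The helper effective_charset is
hypothetical, and the UTF-8 submission is just an assumed example of a
"name in another charset":

```python
def effective_charset(page_charset: str) -> str:
    """Charset used to decode form input for a page (hypothetical helper)."""
    # The kludge: treat us-ascii pages as if they were iso-8859-1.
    if page_charset.lower() in ("us-ascii", "ascii"):
        return "iso-8859-1"
    return page_charset


# Assumed scenario: a browser submits UTF-8 bytes on an English page.
submitted = "J\u00f6rg".encode("utf-8")
decoded = submitted.decode(effective_charset("us-ascii"))
# decoded now holds two wrong Latin-1 characters in place of the o-umlaut.
```

The decode never fails, because every byte is valid in iso-8859-1, so
the damage is silent: the name is stored, just with the wrong
characters.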
Thanks for the feedback.
-Barry