From barry@zope.com  Fri Sep 13 20:09:25 2002
From: barry@zope.com (Barry A. Warsaw)
Date: Fri, 13 Sep 2002 15:09:25 -0400
Subject: [Mailman-i18n] "Funny" characters in real names?
Message-ID: <15746.14309.516823.293632@anthem.wooz.org>

Take a look at SF bug # 601082

http://sf.net/tracker/index.php?func=detail&aid=601082&group_id=103&atid=100103

Does anybody have opinions or other ideas of how to handle this?

-Barry


From che@debian.org  Fri Sep 13 21:21:25 2002
From: che@debian.org (Ben Gertzfield)
Date: Fri, 13 Sep 2002 13:21:25 -0700
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org>
Message-ID: <3D8248C5.5040205@debian.org>

Barry A. Warsaw wrote:

>Take a look at SF bug # 601082
>
>http://sf.net/tracker/index.php?func=detail&aid=601082&group_id=103&atid=100103
>
>Does anybody have opinions or other ideas of how to handle this?
>  
>

When submitting an HTML form, the character set used for the submitted 
data is the same as the one specified in the HTML or header of the 
original form's page.

So we do know what character set the user originally used, and can store 
that with their user data.  Then when we send out a personalized 
message, we can easily encode the To: header.

Ben


From barry@zope.com  Fri Sep 13 21:44:57 2002
From: barry@zope.com (Barry A. Warsaw)
Date: Fri, 13 Sep 2002 16:44:57 -0400
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org>
 <3D8248C5.5040205@debian.org>
Message-ID: <15746.20041.343041.501116@anthem.wooz.org>

>>>>> "BG" == Ben Gertzfield <che@debian.org> writes:

    BG> When submitting an HTML form, the character set used for the
    BG> submitted data is the same as the one specified in the HTML or
    BG> header of the original form's page.

    BG> So we do know what character set the user originally used, and
    BG> can store that with their user data.  Then when we send out a
    BG> personalized message, we can easily encode the To: header.

If they were subscribed via email, we'd already have the encoded form
of their real name.

What's left are the command line and mass subscribe page (both the
text box and the file upload).  In these cases should we simply reject
addresses with non-ascii real names?  That'd mean they'd have to be
encoded prior to being subscribed.

-Barry


From che@debian.org  Fri Sep 13 23:41:42 2002
From: che@debian.org (Ben Gertzfield)
Date: Fri, 13 Sep 2002 15:41:42 -0700
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org>	<3D8248C5.5040205@debian.org> <15746.20041.343041.501116@anthem.wooz.org>
Message-ID: <3D8269A6.7070701@debian.org>

Barry A. Warsaw wrote:

>>>>>> "BG" == Ben Gertzfield <che@debian.org> writes:
>
>    BG> When submitting an HTML form, the character set used for the
>    BG> submitted data is the same as the one specified in the HTML or
>    BG> header of the original form's page.
>
>  
>
>If they were subscribed via email, we'd already have the encoded form
>of their real name.
>
>What's left are the command line and mass subscribe page (both the
>text box and the file upload).  In these cases should we simply reject
>addresses with non-ascii real names?  That'd mean they'd have to be
>encoded prior to being subscribed.
>  
>

As far as the command-line goes, we should probably reject non-ASCII
real names, yes.  (It MIGHT be possible to parse the various
LANG/LC_ALL/LC_CTYPE environment variables and guess the character set,
but that's a pain.)

The mass subscribe page case should be the same as any other HTML form, 
right?  Whatever character set the original form's page used is what all 
the real names' character sets get set to.

Ben


From barry@zope.com  Sat Sep 14 00:25:02 2002
From: barry@zope.com (Barry A. Warsaw)
Date: Fri, 13 Sep 2002 19:25:02 -0400
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org>
 <3D8248C5.5040205@debian.org>
 <15746.20041.343041.501116@anthem.wooz.org>
 <3D8269A6.7070701@debian.org>
Message-ID: <15746.29646.398343.137890@anthem.wooz.org>

>>>>> "BG" == Ben Gertzfield <che@debian.org> writes:

    BG> As far as the command-line goes, we should probably reject
    BG> non-ASCII real names, yes.

+1
    
    BG> (It MIGHT be possible to parse the
    BG> various LANG/LC_CHARSET environment variables and guess the
    BG> character set, but that's a pain.)

-1

    BG> The mass subscribe page case should be the same as any other
    BG> HTML form, right?  Whatever character set the original form's
    BG> page used is what all the real names' character sets get set
    BG> to.

That simply means that you couldn't mass subscribe users with
different charsets in their real names, but I'm perfectly fine with
that limitation, especially since I don't see a way around it. :)

Ok, I think I'll try to whip this up.  Thanks for the feedback.

-Barry


From barry@zope.com  Sat Sep 14 00:52:16 2002
From: barry@zope.com (Barry A. Warsaw)
Date: Fri, 13 Sep 2002 19:52:16 -0400
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org>
 <3D8248C5.5040205@debian.org>
Message-ID: <15746.31280.719924.780772@anthem.wooz.org>

>>>>> "BG" == Ben Gertzfield <che@debian.org> writes:

    BG> When submitting an HTML form, the character set used for the
    BG> submitted data is the same as the one specified in the HTML or
    BG> header of the original form's page.

I must be dense because I'm not quite seeing how this will work.

I visit the mass subscribe page and in the text box, I enter a funny
name, e.g. barry@python.org (Barry Wârsaw)
My list is conducted in English.

Now when I look at all the data submitted by the form, I don't see
anything immediately useful in either the cgi environment or in the
form data.  Here are some excerpts:

CONTENT_TYPE: multipart/form-data; boundary=---------------------------527473093431726113359136092

Hmm, nothing there.

HTTP_ACCEPT_CHARSET: ISO-8859-1, utf-8;q=0.66, *;q=0.66

That doesn't really help us does it?  That's telling us what charsets
the browser will accept, right?  Not the same thing.

Now for the form data, I'll see a section like:

-----------------------------527473093431726113359136092
Content-Disposition: form-data; name="subscribees"

warsaw@wooz.org (Barry W&#226;rsaw)
-----------------------------527473093431726113359136092

This doesn't tell me enough either does it?

So I'm at a loss as to where to suck the information out of the form
data or the environment to figure out what charset the form was posted
in.

-Barry


From tkikuchi@is.kochi-u.ac.jp  Sat Sep 14 03:01:59 2002
From: tkikuchi@is.kochi-u.ac.jp (Tokio Kikuchi)
Date: Sat, 14 Sep 2002 11:01:59 +0900
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org>	<3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org>
Message-ID: <3D829897.8020608@is.kochi-u.ac.jp>

Hi,

> So I'm at a loss as to where to suck the information out of the form
> data or the environment to figure out what charset the form was posted
> in.

Let's make it simple;

1. Use the user's preferred language as a first guess and
    try to encode.
2. If it fails, strip off the real name from the recipient header.

You will receive an unreadable To: header if you set the preferred
language to French and use Japanese characters in the real name, but
you can correct it on the user's options page.
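
The two steps above can be sketched roughly as follows. This is a
modern Python 3 illustration, not Mailman code: the function name
recipient_header is hypothetical, and the charset argument stands in
for whatever charset the user's preferred language maps to.

```python
from email.utils import formataddr

def recipient_header(realname, address, charset):
    """Build a To: value, dropping the real name if it cannot be encoded.

    `charset` is assumed to be the charset of the user's preferred
    language (hypothetical parameter for this sketch).
    """
    try:
        # Step 1: first guess, try the preferred language's charset.
        realname.encode(charset)
    except (UnicodeError, LookupError):
        # Step 2: encoding failed (or the charset is unknown), so
        # strip the real name and send to the bare address.
        return address
    # formataddr RFC 2047-encodes the name only if it is non-ASCII.
    return formataddr((realname, address), charset)
```

A Japanese name paired with a Latin-1 charset falls through to the bare
address here, which mirrors the "strip off the real name" fallback.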


-- 
Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/


From loewis@informatik.hu-berlin.de  Sun Sep 15 18:38:05 2002
From: loewis@informatik.hu-berlin.de (Martin v. =?iso-8859-1?q?L=F6wis?=)
Date: 15 Sep 2002 19:38:05 +0200
Subject: [Mailman-i18n] "Funny" characters in real names?
In-Reply-To: <15746.31280.719924.780772@anthem.wooz.org>
References: <15746.14309.516823.293632@anthem.wooz.org>
 <3D8248C5.5040205@debian.org>
 <15746.31280.719924.780772@anthem.wooz.org>
Message-ID: <j4r8fvrw5e.fsf@informatik.hu-berlin.de>

barry@zope.com (Barry A. Warsaw) writes:

> I must be dense because I'm not quite seeing how this will work.
[...]
> This doesn't tell me enough either does it?

You are running into one of the most awful oddities of HTTP and
i18n. In short, the encoding of the page that contained the form is
used to encode the form contents :-(

The RFC says the browser SHOULD declare the encoding for each field in
the per-field MIME header of multipart/form-data message. None of the
browsers does that. I filed bug reports for all of them, and Mozilla
people responded that they can't do that because many CGI scripts
break when they get a charset= (it won't fit their regexp).

The RFC says, as a fall-back, the browser should use the encoding of
the HTML page which contained the form. Mailman doesn't declare a
charset in the administrative pages, but it should.

It may happen that the user enters a character which cannot be
represented in the charset of the page. In this case, Mozilla sends a
'?' (question mark), so you can only tell that there was a character,
but not which one. Internet Exploder sends a HTML entity, which gives
you more information, but is indistinguishable from the case where the
user entered an ampersand-digits sequence.
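
The two browser behaviours described here have close analogues in
Python's codec error handlers, which makes the difference easy to see
in isolation. The sketch below is Python 3 and purely illustrative;
the function names are made up for this example and it is not browser
or Mailman code.

```python
def question_mark_style(text, page_charset):
    """Replace each unencodable character with '?'.

    Analogous to the Mozilla behaviour described above: you can tell
    a character was there, but not which one.
    """
    return text.encode(page_charset, 'replace')

def entity_style(text, page_charset):
    """Replace each unencodable character with a numeric entity.

    Analogous to the IE behaviour described above: the character is
    recoverable, but the result looks exactly like a user who typed
    the ampersand-digits sequence by hand.
    """
    return text.encode(page_charset, 'xmlcharrefreplace')
```

For the name "Barry Wârsaw" on a us-ascii page, the first variant
yields "Barry W?rsaw" and the second "Barry W&#226;rsaw".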

For Mailman, this gives two options:

1. Each administrative page should be encoded in the list's "native"
   charset. This will allow adding names in that charset.

2. Each page should be encoded in UTF-8. This will allow entering
   arbitrary names, but will require recoding to the list's charset
   later (or using UTF-8 in the To: fields as well).

Actually, it appears that mailman already does 1, in the HTTP
header. Barry, what is the charset of your admin pages?

Regards,
Martin


From barry@zope.com  Sun Sep 15 23:34:36 2002
From: barry@zope.com (Barry A. Warsaw)
Date: Sun, 15 Sep 2002 18:34:36 -0400
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org>
 <3D8248C5.5040205@debian.org>
 <15746.31280.719924.780772@anthem.wooz.org>
 <j4r8fvrw5e.fsf@informatik.hu-berlin.de>
Message-ID: <15749.2812.524261.962949@anthem.wooz.org>

Thanks Martin, and everyone, I think I know what to do now.

>>>>> "MvL" == Martin v Löwis <loewis@informatik.hu-berlin.de> writes:

    MvL> The RFC says, as a fall-back, the browser should use the
    MvL> encoding of the HTML page which contained the form. Mailman
    MvL> doesn't declare a charset in the administrative pages, but it
    MvL> should.

I think we'll make things simple and assume a list's preferred
language doesn't change between form presentation and submission.  So
if the page has a `lang' key (i.e. the listinfo page, or options
page), we'll use that, otherwise we'll fall back to the list's
preferred language.

If that language's charset is ascii or there are only ascii characters
in the name, we'll simply store the name unencoded in the user
database.  Otherwise, we'll convert the name to Unicode, and store that
along with the charset.  Then for email headers, we'll use the Header
class to encode the name in an RFC conformant way.  I don't think this
will be a huge amount of work, although it will require some changes
to the MemberAdaptor API.

(For command line, we'll still insist on ascii in the names, unless
there's a hue and cry -- or patch <wink> -- for something better.)

    MvL> It may happen that the user enters a character which cannot
    MvL> be represented in the charset of the page. In this case,
    MvL> Mozilla sends a '?' (question mark), so you can only tell
    MvL> that there was a character, but not which one. Internet
    MvL> Exploder sends a HTML entity, which gives you more
    MvL> information, but is undistinguishable from the case where the
    MvL> user entered an ampersand-digits sequence.

We won't handle these specially.  If that's what the browser gives us,
that's what we'll use.

    MvL> For Mailman, this gives two options:

    MvL> 1. Each administrative page should be encoded in the list's
    MvL> "native" charset. This will allow to add names in that
    MvL> charset.

    MvL> 2. Each page should be encoded in UTF-8. This will allow to
    MvL> enter arbitrary names, but will require recoding to the
    MvL> list's charset later (or using UTF-8 in the To: fields as
    MvL> well).

    MvL> Actually, it appears that mailman already does 1, in the HTTP
    MvL> header. Barry, what is the charset of your admin pages?

I had tried iso-8859-1 and us-ascii.  In us-ascii I got the HTML
entity, but in iso-8859-1 I got the actual character.

Let's go with #1.  Thanks.
-Barry


From barry@python.org  Tue Sep 17 21:58:38 2002
From: barry@python.org (Barry A. Warsaw)
Date: Tue, 17 Sep 2002 16:58:38 -0400
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org>
 <3D8248C5.5040205@debian.org>
 <15746.31280.719924.780772@anthem.wooz.org>
 <j4r8fvrw5e.fsf@informatik.hu-berlin.de>
 <15749.2812.524261.962949@anthem.wooz.org>
Message-ID: <15751.38782.74609.163241@anthem.wooz.org>

To follow up, I believe I have this working now.  Here's how it works.

First, the only change to the MemberAdaptor API is that real names can
now be Unicode strings as well as 8-bit strings.  If they're 8-bit
then they'll contain only ascii characters.

When a real name is entered into a web form, we'll first attempt to
convert it to us-ascii.  If that succeeds, we know the real name is
ascii only and we'll store it in the membership database as an 8-bit
ascii-only-containing string.

If the conversion fails, we'll convert the real name to Unicode using
the charset of the context's language (i.e. list preferred if we're
looking at an admin page, user preferred if we're looking at an
options page, and form value if we're looking at the subscribe page --
all with appropriate fallbacks to Something Sensible).  We'll also do
html entity replacement (e.g. &#246; -> ö).  We'll store this Unicode
string as the member's real name in the membership database, but we
don't store the charset because...

...when we need to get a printable version of the member name, we yank
out either the ascii string or the Unicode string.  If it's ascii,
we're done.  If it's Unicode, then we try to encode it to the
charset of the web page we're printing (for cgi), or to the charset of
the outgoing email message.
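
The input side of this scheme can be sketched as below. This is a
Python 3 illustration under stated assumptions, not the code that was
checked in: the names canonicalize_real_name and ENTITY_RE are
hypothetical, and because Python 3 has a single text type, the
"ascii string versus Unicode string" distinction collapses into one
decode step followed by entity replacement.

```python
import re

# Matches decimal numeric character references such as &#246;
ENTITY_RE = re.compile(r'&#(\d+);')

def _entity_to_char(match):
    code = int(match.group(1))
    # Leave out-of-range references untouched rather than raising.
    return chr(code) if code <= 0x10FFFF else match.group(0)

def canonicalize_real_name(raw, charset):
    """Turn raw form bytes into text.

    `charset` is assumed to be the charset of the context's language
    (list, user, or form value, as described above). Undecodable
    bytes become U+FFFD instead of raising.
    """
    text = raw.decode(charset, 'replace')
    # Replace numeric entities with the characters they name.
    return ENTITY_RE.sub(_entity_to_char, text)
```

Note that this treats a literally typed &#246; the same as a
browser-generated one, which matches the behaviour described in the
thread.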

For output web pages, if the encoding fails, we'll convert chars > 127
to html entities (e.g. ö -> &#246;) so in most cases we'll still see
the name with the proper characters.  For this case, think about a
user who selects Spanish, enters a ñ in their name, and then switches
their preferred language to English (us-ascii).  You'd like their name
to still show up correctly.
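
A rough Python 3 sketch of this web-output fallback follows. It swaps
in the standard 'xmlcharrefreplace' error handler, which replaces only
the characters the page charset cannot represent; for a us-ascii page
that is the same as converting every char > 127. The function name is
hypothetical.

```python
def name_for_web_page(name, page_charset):
    """Encode a stored name for a page in `page_charset`.

    Characters the charset cannot represent are emitted as numeric
    HTML entities, so the browser still renders them correctly.
    """
    return name.encode(page_charset, 'xmlcharrefreplace')
```

With this, the Spanish example above still renders: a stored "Señor"
written to a us-ascii page comes out as "Se&#241;or".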

For email, if the name has non-ascii characters in it, we'll use the
email.Header.Header class to convert the To string to an RFC-compliant
format.  If that fails we fall back on encoding to us-ascii replacing
non-ascii characters with `?'s.
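
The email path might look roughly like this in present-day Python 3,
where the class lives at email.header.Header (the thread refers to the
older email.Header.Header spelling). The function name encode_to_name
is hypothetical and this is a sketch, not the checked-in code.

```python
from email.header import Header

def encode_to_name(name, charset):
    """Return an RFC 2047-safe form of `name` for a To: header.

    If the name cannot be encoded in `charset`, fall back to us-ascii
    with '?' standing in for each non-ascii character.
    """
    try:
        return Header(name, charset).encode()
    except (UnicodeError, LookupError):
        return name.encode('us-ascii', 'replace').decode('us-ascii')
```

One difference worth knowing: the current Header class silently
switches to UTF-8 when asked to encode non-ascii text as us-ascii, so
the '?' fallback only triggers for charsets such as iso-8859-1 that
genuinely cannot hold the name.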

This seems to work fairly well (with some ugly changes also necessary
to the logging system), with one minor kludge.  I want to allow
non-ASCII characters in real names for English lists.  I'm nervous
about changing the default charset for English from us-ascii because
I'm superstitious about unintended side-effects.  So I'm making a
couple of special cases for us-ascii.  When decoding a string from a
web form, if the default charset would be us-ascii, I'll use
iso-8859-1 instead.  Then when encoding a name in an email header, if
the charset is us-ascii, again, I'll use iso-8859-1.  This seems like
a practical compromise, if a bit ugly.  Feedback is welcome.
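
The us-ascii special case can be isolated in one small helper, sketched
here in Python 3. The name effective_charset is hypothetical; the
sketch only shows the substitution itself, not where Mailman applies
it.

```python
import codecs

def effective_charset(charset):
    """Map us-ascii (under any alias) to iso-8859-1, else pass through."""
    try:
        # codecs.lookup normalizes aliases: 'us-ascii', 'ASCII', ...
        normalized = codecs.lookup(charset).name
    except LookupError:
        # Unknown charset: leave it for the caller to deal with.
        return charset
    return 'iso-8859-1' if normalized == 'ascii' else charset
```

Applying the same helper on both the decoding and the encoding side
keeps the two special cases described above consistent with each other.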

I'm about to check all this stuff in.  Testing will be /greatly/
appreciated!

-Barry


From che@debian.org  Tue Sep 17 22:22:51 2002
From: che@debian.org (Ben Gertzfield)
Date: Tue, 17 Sep 2002 14:22:51 -0700
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org>	<3D8248C5.5040205@debian.org>	<15746.31280.719924.780772@anthem.wooz.org>	<j4r8fvrw5e.fsf@informatik.hu-berlin.de>	<15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org>
Message-ID: <3D879D2B.5080600@debian.org>

Barry A. Warsaw wrote:

>To follow up, I believe I have this working now.  Here's how it works.
>

Thanks for the excellent explanation and implementation, Barry.  I'll
test this when it's checked in.  Some comments below..

>First, the only change to the MemberAdaptor API is that real names can
>now be Unicode strings as well as 8-bit strings.  If they're 8-bit
>then they'll contain only ascii characters.
>
>

ASCII is by definition 7-bit, Barry.  Did you mean ISO-8859-1 here?

>When a real name is entered into a web form, we'll first attempt to
>convert it to us-ascii.  If that succeeds, we know the real name is
>ascii only and we'll store it in the membership database as an 8-bit
>ascii-only-containing string.
>
>

Again, I assume you mean ISO-8859-1 instead of ascii here.

>If the conversion fails, we'll convert the real name to Unicode using
>the charset of the context's language (i.e. list preferred if we're
>looking at an admin page, user preferred if we're looking at an
>options page, and form value if we're looking at the subscribe page --
>all with appropriate fallbacks to Something Sensible).  We'll also do
>html entity replacement (e.g. &#246; -> ö).  We'll store this Unicode
>string as the member's real name in the membership database, but we
>don't store the charset because...
>
>

This is a good thing.  Note that some browsers might (I haven't checked
this) incorrectly send the entity &#246; for whatever character is at
position 246 in the user's default character set, not character 246 in
Unicode.  This might be something to look out for, but I don't know if
it's important.

Everything else looks good.  The kludge to assume iso-8859-1 on us-ascii
pages is unfortunately a generally good one, as that will make the most
people happy.  I hate to do it, though!

Ben


From barry@python.org  Tue Sep 17 22:34:36 2002
From: barry@python.org (Barry A. Warsaw)
Date: Tue, 17 Sep 2002 17:34:36 -0400
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org>
 <3D8248C5.5040205@debian.org>
 <15746.31280.719924.780772@anthem.wooz.org>
 <j4r8fvrw5e.fsf@informatik.hu-berlin.de>
 <15749.2812.524261.962949@anthem.wooz.org>
 <15751.38782.74609.163241@anthem.wooz.org>
 <3D879D2B.5080600@debian.org>
Message-ID: <15751.40940.902092.18578@anthem.wooz.org>

>>>>> "BG" == Ben Gertzfield <che@debian.org> writes:

    >> To follow up, I believe I have this working now.  Here's how it
    >> works.

    BG> Thanks for the excellent explanation and implementation,
    BG> Barry.

Took me two days.  I still say Unicode is something everyone wants
until they get it. :)

    BG> I'll test this when it's checked in.  Some comments below..

Excellent!

    >> First, the only change to the MemberAdaptor API is that real
    >> names can now be Unicode strings as well as 8-bit strings.  If
    >> they're 8-bit then they'll contain only ascii characters.

    BG> ASCII is by definition 7-bit, Barry.  Did you mean ISO-8859-1
    BG> here?

Sorry, I meant "normal" Python strings (sometimes called "8-bit
strings") but which contain only 7-bit ascii characters.  Those
beasties I don't convert to Python unicode strings.

    >> When a real name is entered into a web form, we'll first
    >> attempt to convert it to us-ascii.  If that succeeds, we know
    >> the real name is ascii only and we'll store it in the
    >> membership database as an 8-bit ascii-only-containing string.

    BG> Again, I assume you mean ISO-8859-1 instead of ascii here.

Same thing here.  We do name.encode('us-ascii') and catch any
UnicodeError that might occur.  If no error occurs, we know we have a
string with 7-bit ascii characters in it, so we store that as an 8-bit
Python string, not as a unicode Python string.

    >> If the conversion fails, we'll convert the real name to Unicode
    >> using the charset of the context's language (i.e. list
    >> preferred if we're looking at an admin page, user preferred if
    >> we're looking at an options page, and form value if we're
    >> looking at the subscribe page -- all with appropriate fallbacks
    >> to Something Sensible).  We'll also do html entity replacement
    >> (e.g. &#246; -> ö).  We'll store this Unicode string as the
    >> member's real name in the membership database, but we don't
    >> store the charset because...

    BG> This is a good thing.  Note that some browsers might (I
    BG> haven't checked this) incorrectly send the entity &#246; for
    BG> whatever character is at position 246 in the user's default
    BG> character set, not character 246 in Unicode.  This might be
    BG> something to look out for, but I don't know if it's important.

I don't know what else to do.  Note that you could literally type
&#246; into the web form and it would have the same effect.  This is
probably an 80/20 solution.

    BG> Everything else looks good.  The kludge to assume iso-8859-1
    BG> on us-ascii pages is unfortunately a generally good one, as
    BG> that will make the most people happy.  I hate to do it,
    BG> though!

Me too!  It means that names in other charsets will be screwed on
English lists, but again, I think this is the best we can do for a
practical 80/20 solution.

Thanks for the feedback.
-Barry


From tkikuchi@is.kochi-u.ac.jp  Wed Sep 18 01:40:56 2002
From: tkikuchi@is.kochi-u.ac.jp (Tokio Kikuchi)
Date: Wed, 18 Sep 2002 09:40:56 +0900
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org>	<3D8248C5.5040205@debian.org>	<15746.31280.719924.780772@anthem.wooz.org>	<j4r8fvrw5e.fsf@informatik.hu-berlin.de>	<15749.2812.524261.962949@anthem.wooz.org>	<15751.38782.74609.163241@anthem.wooz.org>	<3D879D2B.5080600@debian.org> <15751.40940.902092.18578@anthem.wooz.org>
Message-ID: <3D87CB98.8010908@is.kochi-u.ac.jp>

Thanks Barry.

Barry A. Warsaw wrote:
>>>>>> "BG" == Ben Gertzfield <che@debian.org> writes:
> 
>     BG> Thanks for the excellent explanation and implementation,
>     BG> Barry.
> 
> Took me two days.  I still say Unicode is something everyone wants
> until they get it. :)

I had to add "import japanese" in Utils.py. Otherwise, I got an
unknown charset error when I entered Japanese in the fullname field.

And "korean"? Where should I put these imports?

-- 
Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/


From barry@python.org  Wed Sep 18 02:51:53 2002
From: barry@python.org (Barry A. Warsaw)
Date: Tue, 17 Sep 2002 21:51:53 -0400
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org>
 <3D8248C5.5040205@debian.org>
 <15746.31280.719924.780772@anthem.wooz.org>
 <j4r8fvrw5e.fsf@informatik.hu-berlin.de>
 <15749.2812.524261.962949@anthem.wooz.org>
 <15751.38782.74609.163241@anthem.wooz.org>
 <3D879D2B.5080600@debian.org>
 <15751.40940.902092.18578@anthem.wooz.org>
 <3D87CB98.8010908@is.kochi-u.ac.jp>
Message-ID: <15751.56377.321564.625249@anthem.wooz.org>

>>>>> "TK" == Tokio Kikuchi <tkikuchi@is.kochi-u.ac.jp> writes:

    TK> I had to add "import japanese" in Utils.py. Otherwise, I got
    TK> unknown charset error when I entered japanese in fullname.

Do you have a traceback?

Can you send me a valid Japanese name string so that I can try it
myself?  I would have thought that the codecs package would have
imported the Japanese codecs automatically.  If not, we have to figure
out the right way to hook it up so it works.

If adding the import is the only way to do it, then we'll go that
route, but I'd like to try to reproduce the problem here.

    TK> And, "korean" ? Where should I put them ?

We probably have to handle this the same way we handle Japanese, but
I'm not sure what the right way to do that is at the moment.

Thanks,
-Barry


From tkikuchi@is.kochi-u.ac.jp  Wed Sep 18 03:39:44 2002
From: tkikuchi@is.kochi-u.ac.jp (Tokio Kikuchi)
Date: Wed, 18 Sep 2002 11:39:44 +0900
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org>	<3D8248C5.5040205@debian.org>	<15746.31280.719924.780772@anthem.wooz.org>	<j4r8fvrw5e.fsf@informatik.hu-berlin.de>	<15749.2812.524261.962949@anthem.wooz.org>	<15751.38782.74609.163241@anthem.wooz.org>	<3D879D2B.5080600@debian.org>	<15751.40940.902092.18578@anthem.wooz.org>	<3D87CB98.8010908@is.kochi-u.ac.jp> <15751.56377.321564.625249@anthem.wooz.org>
Message-ID: <3D87E770.9000903@is.kochi-u.ac.jp>


Barry A. Warsaw wrote:
>>>>>> "TK" == Tokio Kikuchi <tkikuchi@is.kochi-u.ac.jp> writes:
> 
>     TK> I had to add "import japanese" in Utils.py. Otherwise, I got
>     TK> unknown charset error when I entered japanese in fullname.
> 
> Do you have a traceback?

Here it is.

Traceback:

Traceback (most recent call last):
   File "/home/mailman3/scripts/driver", line 82, in run_main
     main()
   File "/home/mailman3/Mailman/Cgi/subscribe.py", line 94, in main
     process_form(mlist, doc, cgidata, language)
   File "/home/mailman3/Mailman/Cgi/subscribe.py", line 113, in process_form
     fullname = Utils.canonstr(fullname, lang)
   File "/home/mailman3/Mailman/Utils.py", line 731, in canonstr
     return unicode(newstr, charset, 'replace')
LookupError: unknown encoding

> 
> Can you send me a valid Japanese name string so that I can try it
> myself?  

Well, you must first configure your browser to be Japanese-capable.
Here is my name in euc-jp and unicode:
'\xb5\xc6\xc3\xcf\xbb\xfe\xc9\xd7'
u'\u83ca\u5730\u6642\u592b'

> I would have thought that the codecs package would have
> imported the Japanese codecs automatically.  If not, we have to figure
> out the right way to hook it up so it works.

Me too...
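
The LookupError in the traceback above means the codec registry had no
entry for the requested charset at the time of the call. A check along
the following lines makes that condition explicit. This is a Python 3
sketch in which the CJK codecs ship with the standard library, so no
extra import is needed there; the names charset_is_known and
EUC_JP_NAME are hypothetical.

```python
import codecs

def charset_is_known(charset):
    """Return True if Python can find a codec for `charset`."""
    try:
        codecs.lookup(charset)
    except LookupError:
        return False
    return True

# The euc-jp bytes quoted in the message above.
EUC_JP_NAME = b'\xb5\xc6\xc3\xcf\xbb\xfe\xc9\xd7'
```

Decoding EUC_JP_NAME with 'euc-jp' should give the Unicode string
quoted alongside it in the message.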

-- 
Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/


From tkikuchi@is.kochi-u.ac.jp  Wed Sep 18 04:45:04 2002
From: tkikuchi@is.kochi-u.ac.jp (Tokio Kikuchi)
Date: Wed, 18 Sep 2002 12:45:04 +0900
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org>	<3D8248C5.5040205@debian.org>	<15746.31280.719924.780772@anthem.wooz.org>	<j4r8fvrw5e.fsf@informatik.hu-berlin.de>	<15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org>
Message-ID: <3D87F6C0.7020809@is.kochi-u.ac.jp>

Barry,

Here is another bug, which I think I have tracked down.
On the admin/<list>/member page, if a user has set their name
in Japanese, you get an error when you click the change button.
This is because admin.py fails to convert the full name to Unicode.

Here is the patch:

--- /home/mailman/src/mailman/Mailman/Cgi/admin.py      Wed Sep 18 08:47:39 2002
+++ Cgi/admin.py        Wed Sep 18 12:36:29 2002
@@ -1367,6 +1367,7 @@
                  pass

              newname = cgidata.getvalue(user+'_realname', '')
+            newname = Utils.canonstr(newname, mlist.preferred_language)
              mlist.setMemberName(user, newname)

              newlang = cgidata.getvalue(user+'_language')


-- 
Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/


From barry@python.org  Wed Sep 18 06:41:35 2002
From: barry@python.org (Barry A. Warsaw)
Date: Wed, 18 Sep 2002 01:41:35 -0400
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org>
 <3D8248C5.5040205@debian.org>
 <15746.31280.719924.780772@anthem.wooz.org>
 <j4r8fvrw5e.fsf@informatik.hu-berlin.de>
 <15749.2812.524261.962949@anthem.wooz.org>
 <15751.38782.74609.163241@anthem.wooz.org>
 <3D879D2B.5080600@debian.org>
 <15751.40940.902092.18578@anthem.wooz.org>
 <3D87CB98.8010908@is.kochi-u.ac.jp>
 <15751.56377.321564.625249@anthem.wooz.org>
 <3D87E770.9000903@is.kochi-u.ac.jp>
Message-ID: <15752.4623.531823.282922@anthem.wooz.org>

Thanks for the feedback.  I think I've got the Japanese support
working in cvs now.  But then, I'm pretty tired, so please double
check!

-Barry


From loewis@informatik.hu-berlin.de  Wed Sep 18 07:08:06 2002
From: loewis@informatik.hu-berlin.de (Martin v. =?iso-8859-1?q?L=F6wis?=)
Date: 18 Sep 2002 08:08:06 +0200
Subject: [Mailman-i18n] "Funny" characters in real names?
In-Reply-To: <15751.40940.902092.18578@anthem.wooz.org>
References: <15746.14309.516823.293632@anthem.wooz.org>
 <3D8248C5.5040205@debian.org>
 <15746.31280.719924.780772@anthem.wooz.org>
 <j4r8fvrw5e.fsf@informatik.hu-berlin.de>
 <15749.2812.524261.962949@anthem.wooz.org>
 <15751.38782.74609.163241@anthem.wooz.org>
 <3D879D2B.5080600@debian.org>
 <15751.40940.902092.18578@anthem.wooz.org>
Message-ID: <j4ofavde49.fsf@informatik.hu-berlin.de>

barry@python.org (Barry A. Warsaw) writes:

>     BG> Again, I assume you mean ISO-8859-1 instead of ascii here.
> 
> Same thing here.  We do name.encode('us-ascii') and catch any
> UnicodeError that might occur.  

That should be name.decode, or unicode(name, "us-ascii"). name.encode
unfortunately works as well, but it first decodes using the system
default encoding, then encodes using us-ascii - very close to what you
want, but not the precise same thing.
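
In present-day Python 3 this ambiguity cannot arise, because bytes
objects have no encode method at all, so the check has to be written
as a decode. A minimal sketch, with a hypothetical function name:

```python
def is_ascii_only(raw):
    """Return True if the byte string `raw` contains only 7-bit ascii."""
    try:
        # Decode, not encode: we are going from bytes to text.
        raw.decode('us-ascii')
    except UnicodeDecodeError:
        return False
    return True
```

This tests the bytes directly, without an intermediate step through
the system default encoding.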

Regards,
Martin


From loewis@informatik.hu-berlin.de  Wed Sep 18 07:09:26 2002
From: loewis@informatik.hu-berlin.de (Martin v. =?iso-8859-1?q?L=F6wis?=)
Date: 18 Sep 2002 08:09:26 +0200
Subject: [Mailman-i18n] "Funny" characters in real names?
In-Reply-To: <3D879D2B.5080600@debian.org>
References: <15746.14309.516823.293632@anthem.wooz.org>
 <3D8248C5.5040205@debian.org>
 <15746.31280.719924.780772@anthem.wooz.org>
 <j4r8fvrw5e.fsf@informatik.hu-berlin.de>
 <15749.2812.524261.962949@anthem.wooz.org>
 <15751.38782.74609.163241@anthem.wooz.org>
 <3D879D2B.5080600@debian.org>
Message-ID: <j4k7ljde21.fsf@informatik.hu-berlin.de>

Ben Gertzfield <che@debian.org> writes:

> This is a good thing.  Note that some browsers might (I haven't
> checked this) incorrectly send the entity &#246; for whatever character
> is at position 246 in the user's default character set, not character
> 246 in Unicode.  This might be something to look out for, but I don't
> know if it's important.

I'm not aware of any browser that does that. IE is the only one that
sends HTML entities at all if you get an unsupported character.

> Everything else looks good.  The kludge to assume iso-8859-1 on
> us-ascii pages is unfortunately a generally good one, as that will
> make the most people happy.  

I think this kludge does not help at all, see my other message.

Regards,
Martin


From loewis@informatik.hu-berlin.de  Wed Sep 18 07:13:15 2002
From: loewis@informatik.hu-berlin.de (Martin v. =?iso-8859-1?q?L=F6wis?=)
Date: 18 Sep 2002 08:13:15 +0200
Subject: [Mailman-i18n] "Funny" characters in real names?
In-Reply-To: <15751.38782.74609.163241@anthem.wooz.org>
References: <15746.14309.516823.293632@anthem.wooz.org>
 <3D8248C5.5040205@debian.org>
 <15746.31280.719924.780772@anthem.wooz.org>
 <j4r8fvrw5e.fsf@informatik.hu-berlin.de>
 <15749.2812.524261.962949@anthem.wooz.org>
 <15751.38782.74609.163241@anthem.wooz.org>
Message-ID: <j4fzw7ddvo.fsf@informatik.hu-berlin.de>

barry@python.org (Barry A. Warsaw) writes:

> This seems to work fairly well (with some ugly changes also necessary
> to the logging system), with one minor kludge.  I want to allow
> non-ASCII characters in real names for English lists.  I'm nervous
> about changing the default charset for English from us-ascii because
> I'm superstitious about unintended side-effects.  So I'm making a
> couple of special cases for us-ascii.  When decoding a string from a
> web form, if the default charset would be us-ascii, I'll use
> iso-8859-1 instead.  Then when encoding a name in an email header, if
> the charset is us-ascii, again, I'll use iso-8859-1.  This seems like
> a practical compromise, if a bit ugly.  Feedback is welcome.

Do you already send the page that has the form in iso-8859-1, or do
you use latin-1 only when interpreting form data?

If the latter, I think you gain nothing: the web browser will not
transmit latin-1 data if the form was us-ascii, so decoding the data
with latin-1 will work, but will not make it possible to submit latin-1
data.
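
The underlying point, that form data has to be decoded with the charset
the form page was served in, can be illustrated with a small sketch.
This uses current Python 3 and urllib.parse.parse_qs rather than the
cgi module of the time; the field name "realname" and the sample value
are assumptions for illustration only.

```python
from urllib.parse import parse_qs

# A browser that rendered the form page as iso-8859-1 submits the
# character o-umlaut as the single byte 0xF6, percent-encoded as %F6.
submitted = "realname=P%F6stal"

# Decoding with the charset the page was served in recovers the name.
as_latin1 = parse_qs(submitted, encoding="iso-8859-1")["realname"][0]

# Decoding with a different charset does not: 0xF6 on its own is not
# valid UTF-8, so parse_qs (errors="replace" by default) yields U+FFFD.
as_utf8 = parse_qs(submitted, encoding="utf-8")["realname"][0]

print(repr(as_latin1), repr(as_utf8))
```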

On encoding Unicode names in email messages: I hope you have a general
fallback to UTF-8. If all else fails, UTF-8 will still work, and DTRT.

Regards,
Martin


From tkikuchi@is.kochi-u.ac.jp  Wed Sep 18 07:17:43 2002
From: tkikuchi@is.kochi-u.ac.jp (Tokio Kikuchi)
Date: Wed, 18 Sep 2002 15:17:43 +0900
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org>	<3D8248C5.5040205@debian.org>	<15746.31280.719924.780772@anthem.wooz.org>	<j4r8fvrw5e.fsf@informatik.hu-berlin.de>	<15749.2812.524261.962949@anthem.wooz.org>	<15751.38782.74609.163241@anthem.wooz.org>	<3D879D2B.5080600@debian.org>	<15751.40940.902092.18578@anthem.wooz.org>	<3D87CB98.8010908@is.kochi-u.ac.jp>	<15751.56377.321564.625249@anthem.wooz.org>	<3D87E770.9000903@is.kochi-u.ac.jp> <15752.4623.531823.282922@anthem.wooz.org>
Message-ID: <3D881A87.7030001@is.kochi-u.ac.jp>

Thanks a lot, Barry.

It's afternoon in Japan now and you are tired at midnight.
Then, it's morning in Europe ... ;-)

Good night.

Barry A. Warsaw wrote:
> Thanks for the feedback.  I think I've got the Japanese support
> working in cvs now.  But then, I'm pretty tired, so please double
> check!
> 
> -Barry
> 
> _______________________________________________
> Mailman-i18n mailing list
> Mailman-i18n@python.org
> http://mail.python.org/mailman/listinfo/mailman-i18n
> 
> 


-- 
Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/


From loewis@informatik.hu-berlin.de  Wed Sep 18 07:17:44 2002
From: loewis@informatik.hu-berlin.de (Martin v. =?iso-8859-1?q?L=F6wis?=)
Date: 18 Sep 2002 08:17:44 +0200
Subject: [Mailman-i18n] "Funny" characters in real names?
In-Reply-To: <15751.40940.902092.18578@anthem.wooz.org>
References: <15746.14309.516823.293632@anthem.wooz.org>
 <3D8248C5.5040205@debian.org>
 <15746.31280.719924.780772@anthem.wooz.org>
 <j4r8fvrw5e.fsf@informatik.hu-berlin.de>
 <15749.2812.524261.962949@anthem.wooz.org>
 <15751.38782.74609.163241@anthem.wooz.org>
 <3D879D2B.5080600@debian.org>
 <15751.40940.902092.18578@anthem.wooz.org>
Message-ID: <j4bs6vddo7.fsf@informatik.hu-berlin.de>

barry@python.org (Barry A. Warsaw) writes:

> Took me two days.  I still say Unicode is something everyone wants
> until they get it. :)

Seriously, I think it's slightly different: End users don't want
Unicode; they should not care. Software developers become scared
easily when confronted with Unicode. However, considering the
alternatives (preserving the original encoding all the time, and
having to combine strings in different encodings), Unicode does
simplify processing, IMO.

Regards,
Martin


From quique@sindominio.net  Wed Sep 18 07:38:05 2002
From: quique@sindominio.net (Quique)
Date: Wed, 18 Sep 2002 08:38:05 +0200 (CEST)
Subject: [Mailman-i18n] 'Funny' characters in real names?
In-Reply-To: <15751.40940.902092.18578@anthem.wooz.org>
References: <15746.14309.516823.293632@anthem.wooz.org>
 <3D8248C5.5040205@debian.org>
 <15746.31280.719924.780772@anthem.wooz.org>
 <j4r8fvrw5e.fsf@informatik.hu-berlin.de>
 <15749.2812.524261.962949@anthem.wooz.org>
 <15751.38782.74609.163241@anthem.wooz.org>
 <3D879D2B.5080600@debian.org>
 <15751.40940.902092.18578@anthem.wooz.org>
Message-ID: <35199.199.228.142.9.1032331085.squirrel@www.sindominio.net>

Barry A. Warsaw said:

>     BG> Everything else looks good.  The kludge to assume iso-8859-1
>     BG> on us-ascii pages is unfortunately a generally good one, as
>     BG> that will make the most people happy.  I hate to do it,
>     BG> though!
>
> Me too!  It means that names in other charsets will be screwed on
> English lists, but again, I think this is best we can do for a
> practical 80/20 solution.

from my ignorance:
what about using iso-8859-15 instead of -1?

it seems to add some forgotten french, finnish, estonian and czech letters.

cheers,
 quique


-- 
Torres más altas han caído.


From loewis@informatik.hu-berlin.de  Wed Sep 18 08:52:38 2002
From: loewis@informatik.hu-berlin.de (Martin v. =?iso-8859-1?q?L=F6wis?=)
Date: 18 Sep 2002 09:52:38 +0200
Subject: [Mailman-i18n] 'Funny' characters in real names?
In-Reply-To: <35199.199.228.142.9.1032331085.squirrel@www.sindominio.net>
References: <15746.14309.516823.293632@anthem.wooz.org>
 <3D8248C5.5040205@debian.org>
 <15746.31280.719924.780772@anthem.wooz.org>
 <j4r8fvrw5e.fsf@informatik.hu-berlin.de>
 <15749.2812.524261.962949@anthem.wooz.org>
 <15751.38782.74609.163241@anthem.wooz.org>
 <3D879D2B.5080600@debian.org>
 <15751.40940.902092.18578@anthem.wooz.org>
 <35199.199.228.142.9.1032331085.squirrel@www.sindominio.net>
Message-ID: <j41y7rbupl.fsf@informatik.hu-berlin.de>

"Quique" <quique@sindominio.net> writes:

> from my ignorance:
> what about using iso-8859-15 instead of -1?
>
> it seems to add some forgotten french, finnish, estonian and czech
> letters.

It doesn't really matter, IMO: administrators can always turn the
default encoding to -15 if they want to. However, it is likely that
some email readers will have difficulty displaying those
characters when confronted with them, so it is unclear what you gain.

I'm personally quite unhappy with iso-8859-15: it was invented at a
time when Unicode was already there, and the world didn't really need
any more character sets.

Regards,
Martin


From tkikuchi@is.kochi-u.ac.jp  Wed Sep 18 09:00:41 2002
From: tkikuchi@is.kochi-u.ac.jp (Tokio Kikuchi)
Date: Wed, 18 Sep 2002 17:00:41 +0900
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org>	<3D8248C5.5040205@debian.org>	<15746.31280.719924.780772@anthem.wooz.org>	<j4r8fvrw5e.fsf@informatik.hu-berlin.de>	<15749.2812.524261.962949@anthem.wooz.org>	<15751.38782.74609.163241@anthem.wooz.org>	<3D879D2B.5080600@debian.org>	<15751.40940.902092.18578@anthem.wooz.org>	<3D87CB98.8010908@is.kochi-u.ac.jp>	<15751.56377.321564.625249@anthem.wooz.org>	<3D87E770.9000903@is.kochi-u.ac.jp> <15752.4623.531823.282922@anthem.wooz.org> <3D881A87.7030001@is.kochi-u.ac.jp>
Message-ID: <3D8832A9.8060302@is.kochi-u.ac.jp>


Tokio Kikuchi wrote:


> Good night.
> 
> Barry A. Warsaw wrote:
> 
>> Thanks for the feedback.  I think I've got the Japanese support
>> working in cvs now.  But then, I'm pretty tired, so please double
>> check!

Well, things are not so easy. I get errors here and there.
It's 5 in Japan and I have to feed my kid and puppy.

later,

-- 
Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/


From tkikuchi@is.kochi-u.ac.jp  Wed Sep 18 13:58:11 2002
From: tkikuchi@is.kochi-u.ac.jp (Tokio Kikuchi)
Date: Wed, 18 Sep 2002 21:58:11 +0900
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org>	<3D8248C5.5040205@debian.org>	<15746.31280.719924.780772@anthem.wooz.org>	<j4r8fvrw5e.fsf@informatik.hu-berlin.de>	<15749.2812.524261.962949@anthem.wooz.org>	<15751.38782.74609.163241@anthem.wooz.org>	<3D879D2B.5080600@debian.org>	<15751.40940.902092.18578@anthem.wooz.org>	<3D87CB98.8010908@is.kochi-u.ac.jp>	<15751.56377.321564.625249@anthem.wooz.org>	<3D87E770.9000903@is.kochi-u.ac.jp> <15752.4623.531823.282922@anthem.wooz.org> <3D881A87.7030001@is.kochi-u.ac.jp> <3D8832A9.8060302@is.kochi-u.ac.jp>
Message-ID: <3D887863.8070808@is.kochi-u.ac.jp>


Tokio Kikuchi wrote:


> Well, things are not so easy. I get errors here and there.

I think I've tracked one. Here is a patch.

--- /home/mailman/src/mailman/Mailman/Cgi/confirm.py    Wed Sep 18 08:47:39 2002
+++ Cgi/confirm.py      Wed Sep 18 21:50:37 2002
@@ -200,6 +200,7 @@
      password = userdesc.password
      digest = userdesc.digest
      lang = userdesc.language
+    name = Utils.uncanonstr(name, lang)
      title = _('Confirm subscription request')
      doc.SetTitle(title)
      i18n.set_language(lang)

Without this, you get

admin(79627): [----- Mailman Version: 2.1b3+ -----]
admin(79627): [----- Traceback ------]
admin(79627): Traceback (most recent call last):
admin(79627):   File "/home/mailman3/scripts/driver", line 82, in run_main
admin(79627):     main()
admin(79627):   File "/home/mailman3/Mailman/Cgi/confirm.py", line 155, in main
admin(79627):     print doc.Format()
admin(79627):   File "/home/mailman3/Mailman/htmlformat.py", line 331, in Format
admin(79627):     output.append(Container.Format(self, indent))
(snip)
admin(79627):   File "/home/mailman3/Mailman/htmlformat.py", line 188, in FormatRow
admin(79627):     output = output + self.FormatCell(row, i, indent + 2)
admin(79627): UnicodeError: ASCII decoding error: ordinal not in range(128)

-- 
Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/


From tkikuchi@is.kochi-u.ac.jp  Wed Sep 18 14:27:54 2002
From: tkikuchi@is.kochi-u.ac.jp (Tokio Kikuchi)
Date: Wed, 18 Sep 2002 22:27:54 +0900
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org>	<3D8248C5.5040205@debian.org>	<15746.31280.719924.780772@anthem.wooz.org>	<j4r8fvrw5e.fsf@informatik.hu-berlin.de>	<15749.2812.524261.962949@anthem.wooz.org>	<15751.38782.74609.163241@anthem.wooz.org>	<3D879D2B.5080600@debian.org>	<15751.40940.902092.18578@anthem.wooz.org>	<3D87CB98.8010908@is.kochi-u.ac.jp>	<15751.56377.321564.625249@anthem.wooz.org>	<3D87E770.9000903@is.kochi-u.ac.jp> <15752.4623.531823.282922@anthem.wooz.org> <3D881A87.7030001@is.kochi-u.ac.jp> <3D8832A9.8060302@is.kochi-u.ac.jp> <3D887863.8070808@is.kochi-u.ac.jp>
Message-ID: <3D887F5A.50108@is.kochi-u.ac.jp>

> 
> I think I've tracked one. Here is a patch.

tracked two :-) patch merged.

--- /home/mailman/src/mailman/Mailman/Cgi/confirm.py    Wed Sep 18 08:47:39 2002
+++ Cgi/confirm.py      Wed Sep 18 22:17:49 2002
@@ -200,6 +200,7 @@
      password = userdesc.password
      digest = userdesc.digest
      lang = userdesc.language
+    name = Utils.uncanonstr(name, lang)
      title = _('Confirm subscription request')
      doc.SetTitle(title)
      i18n.set_language(lang)
@@ -314,6 +315,7 @@
              overrides = UserDesc(fullname=cgidata.getvalue('realname', None),
                                   digest=digest, lang=lang)
              userdesc += overrides
+            userdesc.fullname = Utils.canonstr(userdesc.fullname, userdesc.language)
              op, addr, pw, digest, lang = mlist.ProcessConfirmation(
                  cookie, userdesc)
          except Errors.MMNeedApproval:

traceback for the second

admin(79816): [----- Mailman Version: 2.1b3+ -----]
admin(79816): [----- Traceback ------]
admin(79816): Traceback (most recent call last):
admin(79816):   File "/home/mailman3/scripts/driver", line 82, in run_main
admin(79816):     main()
admin(79816):   File "/home/mailman3/Mailman/Cgi/options.py", line 598, in main
admin(79816):     options_page(mlist, doc, user, cpuser, userlang)
admin(79816):   File "/home/mailman3/Mailman/Cgi/options.py", line 616, in options_page
admin(79816):     fullname = Utils.uncanonstr(mlist.getMemberName(user), userlang)
admin(79816):   File "/home/mailman3/Mailman/Utils.py", line 755, in uncanonstr
admin(79816):     return s.encode(charset, 'strict')
admin(79816): TypeError: _japanese_codecs_euc_jp_encode() argument 1 must be unicode, not string

Good Night!
-- 
Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/


From barry@python.org  Thu Sep 19 04:38:41 2002
From: barry@python.org (Barry A. Warsaw)
Date: Wed, 18 Sep 2002 23:38:41 -0400
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org>
 <3D8248C5.5040205@debian.org>
 <15746.31280.719924.780772@anthem.wooz.org>
 <j4r8fvrw5e.fsf@informatik.hu-berlin.de>
 <15749.2812.524261.962949@anthem.wooz.org>
 <15751.38782.74609.163241@anthem.wooz.org>
 <3D879D2B.5080600@debian.org>
 <15751.40940.902092.18578@anthem.wooz.org>
 <3D87CB98.8010908@is.kochi-u.ac.jp>
 <15751.56377.321564.625249@anthem.wooz.org>
 <3D87E770.9000903@is.kochi-u.ac.jp>
 <15752.4623.531823.282922@anthem.wooz.org>
 <3D881A87.7030001@is.kochi-u.ac.jp>
 <3D8832A9.8060302@is.kochi-u.ac.jp>
 <3D887863.8070808@is.kochi-u.ac.jp>
 <3D887F5A.50108@is.kochi-u.ac.jp>
Message-ID: <15753.18113.512461.154971@anthem.wooz.org>

Thanks.  I modified this slightly, so please double check me!

-Barry


From barry@python.org  Thu Sep 19 05:07:40 2002
From: barry@python.org (Barry A. Warsaw)
Date: Thu, 19 Sep 2002 00:07:40 -0400
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org>
 <3D8248C5.5040205@debian.org>
 <15746.31280.719924.780772@anthem.wooz.org>
 <j4r8fvrw5e.fsf@informatik.hu-berlin.de>
 <15749.2812.524261.962949@anthem.wooz.org>
 <15751.38782.74609.163241@anthem.wooz.org>
 <3D879D2B.5080600@debian.org>
 <15751.40940.902092.18578@anthem.wooz.org>
 <j4ofavde49.fsf@informatik.hu-berlin.de>
Message-ID: <15753.19852.142162.352140@anthem.wooz.org>

>>>>> "MvL" == Martin v Löwis <loewis@informatik.hu-berlin.de> writes:

    >> BG> Again, I assume you mean ISO-8859-1 instead of ascii here.
    >> Same thing here.  We do name.encode('us-ascii') and catch any
    >> UnicodeError that might occur.

    MvL> That should be name.decode, or unicode(name,
    MvL> "us-ascii"). name.encode unfortunately works as well, but it
    MvL> first decodes using the system default encoding, then encodes
    MvL> using us-ascii - very close to what you want, but not the
    MvL> precise same thing.

Martin, sometimes this Unicode stuff makes my head hurt. ;)

I don't think I can use name.decode() because that's a Python 2.2-ism
and we need to stick to Python 2.1.

I don't think I can use unicode(name, 'us-ascii') because what if name
is already a Unicode string?  This'll give me a TypeError.

So it seems like name.encode('us-ascii') is my only choice.  What am I
missing?

-Barry


From loewis@informatik.hu-berlin.de  Thu Sep 19 08:37:38 2002
From: loewis@informatik.hu-berlin.de (Martin v. =?iso-8859-1?q?L=F6wis?=)
Date: 19 Sep 2002 09:37:38 +0200
Subject: [Mailman-i18n] "Funny" characters in real names?
In-Reply-To: <15753.19852.142162.352140@anthem.wooz.org>
References: <15746.14309.516823.293632@anthem.wooz.org>
 <3D8248C5.5040205@debian.org>
 <15746.31280.719924.780772@anthem.wooz.org>
 <j4r8fvrw5e.fsf@informatik.hu-berlin.de>
 <15749.2812.524261.962949@anthem.wooz.org>
 <15751.38782.74609.163241@anthem.wooz.org>
 <3D879D2B.5080600@debian.org>
 <15751.40940.902092.18578@anthem.wooz.org>
 <j4ofavde49.fsf@informatik.hu-berlin.de>
 <15753.19852.142162.352140@anthem.wooz.org>
Message-ID: <j4k7litoot.fsf@informatik.hu-berlin.de>

barry@python.org (Barry A. Warsaw) writes:

> Martin, sometimes this Unicode stuff makes my head hurt. ;)

In an application that deals with multiple charsets on a regular basis
(such as mailman), I recommend not to mix byte strings and Unicode
strings. This can be achieved by
- converting all byte strings that represent text data to Unicode
  at the earliest possible point in processing,
- converting all Unicode strings back to byte strings just before
  output.

If most data is likely ASCII, it is tempting to use byte strings for
pure-ASCII, and Unicode for everything else. Try to resist this
temptation.

If you follow this strategy, you find that processing becomes much
simpler.

> So it seems like name.encode('us-ascii') is my only choice.  What am I
> missing?

If you are following the above strategy, you will know whether name is
Unicode or byte string. If it is Unicode, .encode is fine. If it is a
byte string, unicode(name,'ascii') will work.
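
The decode-early, encode-late strategy described above could be
sketched as follows. This is written for current Python 3, where str
plays the role of the Unicode string and bytes the role of the byte
string; the three function names are hypothetical and only mark the
input boundary, the internal processing and the output boundary.

```python
def read_name(raw_bytes, charset):
    """Input boundary: bytes from the outside world become str at once."""
    return raw_bytes.decode(charset)

def greeting(name):
    """Internal processing only ever sees str, never bytes."""
    return "Hello, " + name + "!"

def write_out(text, charset):
    """Output boundary: str becomes bytes just before it leaves."""
    return text.encode(charset)

name = read_name(b"P\xf6stal", "iso-8859-1")   # decoded on input
out = write_out(greeting(name), "utf-8")       # encoded on output
print(out)
```

Because every value between the two boundaries has one known type, no
code in the middle needs to ask whether it holds bytes or text.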

I admit that the strategy has two problems:

1. In some cases, it might be impossible to generate a Unicode string
   for text data. In MIME, the encoding may not be specified, or it
   may be unknown to mailman, or the data may fail to convert.

   In these cases, it may be acceptable to "force" the data to
   Unicode: If there is no encoding, guess latin-1. If the string
   fails to convert, convert it with "replace". If the encoding is
   unknown, replace all non-printable characters with question marks.

   Whether this is acceptable depends on how frequent the problem
   occurs and whose fault that is (e.g. an unknown encoding should be
   added to Mailman).

2. When converting an application that used to be byte-oriented to
   Unicode, adding conversions at all required places might be too
   much effort, or breakage because of incorrect data might be
   unacceptable.

   In these cases, I recommend adding type tests at strategic places,
   and papering over any incorrect data.

E.g. in this case, you could write a function

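import types

# DEBUG and DebugError are assumed to be defined by the application.
def unicode_is_pure_ascii(text):
    # Expected case: text is already a Unicode string.
    if type(text) is types.UnicodeType:
        try:
            text.encode("ascii")
            return 1
        except UnicodeError:
            return 0
    # A byte string slipped through: complain loudly when debugging,
    # otherwise paper over it.
    if DEBUG:
        raise DebugError, "string not unicode:"+repr(text)
    try:
        unicode(text, "ascii")
        return 1
    except UnicodeError:
        return 0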

If you expect name to be a byte string, the function would be
bytes_are_ascii, of course.
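
The "force the data to Unicode" fallbacks from point 1 above might look
roughly like this. It is a sketch in current Python 3, not Mailman
code; the function name is invented, and treating bytes 32..126 as the
printable ones is an assumption of this example.

```python
def force_to_text(data, charset=None):
    """Decode bytes to str without ever raising.

    - no charset given: guess latin-1 (every byte maps to a character)
    - charset unknown to Python: keep printable ASCII, '?' for the rest
    - data invalid for the charset: decode with errors="replace"
    """
    if charset is None:
        return data.decode("latin-1")
    try:
        return data.decode(charset)
    except LookupError:
        # Unknown encoding name.
        return "".join(chr(b) if 32 <= b < 127 else "?" for b in data)
    except UnicodeError:
        # Known encoding, but the bytes are not valid in it.
        return data.decode(charset, "replace")
```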

Regards,
Martin


From tkikuchi@is.kochi-u.ac.jp  Fri Sep 20 03:06:05 2002
From: tkikuchi@is.kochi-u.ac.jp (Tokio Kikuchi)
Date: Fri, 20 Sep 2002 11:06:05 +0900
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org>	<3D8248C5.5040205@debian.org>	<15746.31280.719924.780772@anthem.wooz.org>	<j4r8fvrw5e.fsf@informatik.hu-berlin.de>	<15749.2812.524261.962949@anthem.wooz.org>	<15751.38782.74609.163241@anthem.wooz.org>	<3D879D2B.5080600@debian.org>	<15751.40940.902092.18578@anthem.wooz.org>	<3D87CB98.8010908@is.kochi-u.ac.jp>	<15751.56377.321564.625249@anthem.wooz.org>	<3D87E770.9000903@is.kochi-u.ac.jp>	<15752.4623.531823.282922@anthem.wooz.org>	<3D881A87.7030001@is.kochi-u.ac.jp>	<3D8832A9.8060302@is.kochi-u.ac.jp>	<3D887863.8070808@is.kochi-u.ac.jp>	<3D887F5A.50108@is.kochi-u.ac.jp> <15753.18113.512461.154971@anthem.wooz.org>
Message-ID: <3D8A828D.1000701@is.kochi-u.ac.jp>

Barry, sorry for the hard work, but I still get errors. I've sent
a comment for the cvs-checkins and here are two more.

=========
1. In admin.py, the name 'lang' is used elsewhere and causes a side effect.

diff -u ~/src/mailman/Mailman/Cgi/admin.py Mailman/Cgi/admin.py
--- /home/mailman/src/mailman/Mailman/Cgi/admin.py      Wed Sep 18 15:37:27 2002
+++ Mailman/Cgi/admin.py        Fri Sep 20 10:55:47 2002
@@ -904,11 +904,11 @@
                    MemberAdaptor.BYBOUNCE: _('B'),
                    }
      # Now populate the rows
-    lang = mlist.preferred_language
+    listlang = mlist.preferred_language
      for addr in members:
          link = Link(mlist.GetOptionsURL(addr, obscure=1),
                      mlist.getMemberCPAddress(addr))
-        fullname = Utils.uncanonstr(mlist.getMemberName(addr), lang)
+        fullname = Utils.uncanonstr(mlist.getMemberName(addr), listlang)
          name = TextBox(addr + '_realname', fullname, size=longest).Format()
          cells = [Center(CheckBox(addr + '_unsub', 'off', 0).Format()),
                   link.Format() + '<br>' +

=====
2. JapaneseCodecs raises LookupError instead of UnicodeError.

--- /home/mailman/src/mailman/Mailman/Utils.py  Wed Sep 18 15:37:26 2002
+++ Mailman/Utils.py    Fri Sep 20 10:38:50 2002
@@ -753,7 +753,7 @@
          charset = GetCharSet(lang)
      try:
          return s.encode(charset, 'strict')
-    except UnicodeError:
+    except (UnicodeError, LookupError):
          a = []
          for c in s:
              o = ord(c)
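
For illustration, the idea behind the second patch (treat a missing or
failing codec the same way as an unencodable character) can be sketched
in current Python 3. This is not the Mailman function: the name is
invented, and falling back to numeric character references is an
assumption about what a web page would want.

```python
def encode_for_page(text, charset):
    """Encode text for an HTML page in the given charset.

    Falls back to ASCII with numeric character references when the
    charset cannot represent the text (UnicodeError) or the codec
    cannot be found (LookupError).
    """
    try:
        return text.encode(charset, "strict")
    except (UnicodeError, LookupError):
        return text.encode("ascii", "xmlcharrefreplace")

print(encode_for_page("P\xf6stal", "us-ascii"))
```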

-- 
Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/


From barry@zope.com  Sat Sep 21 18:00:43 2002
From: barry@zope.com (Barry A. Warsaw)
Date: Sat, 21 Sep 2002 13:00:43 -0400
Subject: [Mailman-i18n] Unicode in headers
Message-ID: <15756.42427.421288.925121@anthem.wooz.org>

I've been trying to fix the outstanding problems with "funny"
characters in real names in Mailman[*] and along the way I ran into a
situation that I /think/ needs to be addressed in the email package.
I'm not sure this is a good fix, let alone the right fix so I wanted
to get some feedback from these two mailing lists.

Say I create a Header instance like so:

    from email.Header import Header
    h = Header(u'[P\xf6stal]', 'us-ascii')
    s = str(h)

what would you expect the value of s to be?

It's a bit of a trick question because in the current version, the
str(h) will raise a UnicodeError since the h.encode() will be a
unicode string containing non-ascii characters.

But I think this may not be the right thing to do.  For one thing,
we're saying we want the header to be in the us-ascii character set.
For another, the RFCs state that headers need to be ascii characters
and we should encode them if necessary.  OTOH, what we're doing /is/ a
bit bogus since the value is clearly not in the requested character
set.  But OTOOH, I don't think we should have to check the value
and do a bunch of coercion before we create the Header instance.

My proposal is to do a type check in Header.__str__() so that if the
value of self.encode() returns a unicode string, we will coerce it to
an 8-bit string like so:

    def __str__(self):
        """A synonym for self.encode().
        Guarantees that the return value contains only ASCII characters.
        """
        s = self.encode()
        if isinstance(s, type(u'')):
            return s.encode(str(self._charset), 'replace')
        return s

Here's a new test case that fails without this change, but succeeds
with it (with no regressions).

    def test_unicode_value(self):
        eq = self.assertEqual
        v = u'[P\xf6stal]'
        h = Header(v, 'us-ascii')
        eq(str(h), '[P?stal]')

In the interest of doing what's most useful, I'd like to make this
change, but I still don't trust my judgement about things unicode, so
I'd like to get some other opinions.

If we don't do this, then we'll probably have to add some defense in
Generator._write_headers(), which wants to do

            text = '%s: %s' % (h, v)

That'll raise the UnicodeError in this situation, and because this can
be fairly widely removed from what might be considered the real error,
it's difficult to debug.

-Barry

[*] BTW, Martin, Ben, Tokio and others have been very helpful here.
Thanks!  And I hope to have fixes in place soon.


From Dan@feld.cvut.cz  Sat Sep 21 21:12:28 2002
From: Dan@feld.cvut.cz (Dan Ohnesorg)
Date: Sat, 21 Sep 2002 22:12:28 +0200
Subject: [Mailman-i18n] Unicode in headers
In-Reply-To: <15756.42427.421288.925121@anthem.wooz.org>
References: <15756.42427.421288.925121@anthem.wooz.org>
Message-ID: <20020921201228.GB1717@ohnesorg.cz>

On Sat, Sep 21, 2002 at 01:00:43PM -0400, Barry A. Warsaw wrote:

>     from email.Header import Header
>     h = Header(u'[P\xf6stal]', 'us-ascii')
>     s = str(h)
>
> what would you expect the value of s to be?

Something like =?utf-8?Q?P=C3=B6stal?=

According to the RFC we should find the simplest encoding, which is UTF-8.

In Czech we use ISO-8859-2 and we check if there are only ASCII characters =
we are using ascii, or if there are some other characters we are using
ISO-8859-2. So the way can be:

- are there only ASCII characters = OK let it be
- are there only characters from locale preferred encoding = use locale
encoding
- in other cases, use UTF.
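
The three steps above can be sketched in a few lines of current
Python 3. The function name and the default of ISO-8859-2 as the
"locale preferred encoding" are assumptions for illustration.

```python
def choose_charset(text, preferred="iso-8859-2"):
    """Return the first charset that can represent text exactly.

    Order of preference: us-ascii, then the locale's preferred
    charset, then UTF-8 as the catch-all.
    """
    for charset in ("us-ascii", preferred):
        try:
            text.encode(charset)
            return charset
        except UnicodeError:
            continue
    return "utf-8"

print(choose_charset("Jino\u010dansk\u00e1"))
```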


cheers
dan


-- 
  -----------------------------------------------------------
 / Dan Ohnesorg                              Dan@ohnesorg.cz \
<  Jinočanská 7                        252 19  Rudná u Prahy  >
 \ tel: +420 311 679679 +420 311 679976 fax: +420 311 679311 /
  -----------------------------------------------------------


From loewis@informatik.hu-berlin.de  Sat Sep 21 22:08:36 2002
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Sat, 21 Sep 2002 23:08:36 +0200 (CEST)
Subject: [Mailman-i18n] Unicode in headers
In-Reply-To: <15756.42427.421288.925121@anthem.wooz.org> from "Barry A. Warsaw" at "Sep 21, 2002 01:00:43 pm"
Message-ID: <200209212108.g8LL8avs018452@paros.informatik.hu-berlin.de>

>     from email.Header import Header
>     h = Header(u'[P\xf6stal]', 'us-ascii')
>     s = str(h)
[...]
> But I think this may not be the right thing to do.  For one thing,
> we're saying we want the header to be in the us-ascii character set.

I think you are confusing issues here: You are *not* saying that you
want the header to be in us-ascii. Instead, (to quote the docstring)

        Specify both s's character set, and the default character set by
        setting the charset argument to a Charset object 

You need this argument to specify the encoding of the string *you are
passing*, not (primarily) of the resulting Header. Since the argument
is a Unicode string and not a byte string, the encoding argument is
superfluous.

Now, the documentation also says that it uses the argument as the "default
character set". By that, it does *not* mean that the entire header is going
to be encoded in that encoding. Instead, it means that this value is used
if later append calls do not declare an encoding.

> My proposal is to do a type check in Header.__str__() so that if the
> value of self.encode() returns a unicode string, we will coerce it to
> an 8-bit string like so:

This is evil. You are losing data without any need.

Instead, I propose the following procedure:
- if a Unicode argument is passed to Header.__init__ or Header.append,
  take the encoding only as a hint. As an argument to __init__, also
  record it as the default for later .append calls.
- when encoding the header, encode all Unicode strings with the hint.
  If that fails, encode them as UTF-8.
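
A rough sketch of this procedure, written against the email.header
module of current Python 3 rather than the email package under
discussion. The two helper names are invented; the second exists only
to check that nothing is lost on the round trip.

```python
from email.header import Header, decode_header

def encode_header_value(text, hint="us-ascii"):
    """Encode text as an RFC 2047 header value.

    The hint charset is tried first; UTF-8 is the fallback when the
    hint cannot represent the text.
    """
    try:
        text.encode(hint)
        charset = hint
    except UnicodeError:
        charset = "utf-8"
    return Header(text, charset).encode()

def decode_header_value(value):
    """Inverse helper, used here only to check the round trip."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "us-ascii")
        parts.append(chunk)
    return "".join(parts)

print(encode_header_value("[P\xf6stal]", "us-ascii"))
```

Unlike coercing with 'replace', the original name can be recovered
from the encoded header.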

Regards,
Martin


From tkikuchi@is.kochi-u.ac.jp  Sun Sep 22 06:52:08 2002
From: tkikuchi@is.kochi-u.ac.jp (Tokio Kikuchi)
Date: Sun, 22 Sep 2002 14:52:08 +0900
Subject: [Mailman-i18n] Unicode in headers
References: <15756.42427.421288.925121@anthem.wooz.org> <20020921201228.GB1717@ohnesorg.cz>
Message-ID: <3D8D5A88.9080105@is.kochi-u.ac.jp>


Dan Ohnesorg wrote:
> On Sat, Sep 21, 2002 at 01:00:43PM -0400, Barry A. Warsaw wrote:
> 
> 
>>    from email.Header import Header
>>    h = Header(u'[P\xf6stal]', 'us-ascii')
>>    s = str(h)
>>
>>what would you expect the value of s to be?


> - are there only ASCII characters = OK let it be
> - are there only characters from locale preferred encoding = use locale
> encoding

I like this idea but how do you define email's preferred language.
In mailman, it will be mm_cfg.DEFAULT_SERVER_LANGUAGE but ,,,

> - in other cases, use UTF.

I think UTF-8 is OK. Older MUA won't break.

-- 
Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/


From Dan@feld.cvut.cz  Sun Sep 22 16:38:50 2002
From: Dan@feld.cvut.cz (Dan Ohnesorg)
Date: Sun, 22 Sep 2002 17:38:50 +0200
Subject: [Mailman-i18n] Unicode in headers
In-Reply-To: <3D8D5A88.9080105@is.kochi-u.ac.jp>
References: <15756.42427.421288.925121@anthem.wooz.org> <20020921201228.GB1717@ohnesorg.cz> <3D8D5A88.9080105@is.kochi-u.ac.jp>
Message-ID: <20020922153850.GA11870@ohnesorg.cz>

On Sun, Sep 22, 2002 at 02:52:08PM +0900, Tokio Kikuchi wrote:

> >- are there only ASCII characters = OK let it be
> >- are there only characters from locale preferred encoding = use locale
> >encoding
>
> I like this idea but how do you define email's preferred language.
> In mailman, it will be mm_cfg.DEFAULT_SERVER_LANGUAGE but ,,,

This is solved very well in mutt, in file sendlib.c, line 800 and above. I am
also including the comments from the file. In mutt I have a variable which
holds a list of encodings in order of preference:

/*
 * Find the best charset conversion of the file from fromcode into one
 * of the tocodes. If successful, set *tocode and CONTENT *info and
 * return the number of characters converted inexactly. If no
 * conversion was possible, return -1.
 *
 * We convert via UTF-8 in order to avoid the condition -1(EINVAL),
 * which would otherwise prevent us from knowing the number of inexact
 * conversions. Where the candidate target charset is UTF-8 we avoid
 * doing the second conversion because iconv_open("UTF-8", "UTF-8")
 * fails with some libraries.
 *
 * We assume that the output from iconv is never more than 4 times as
 * long as the input for any pair of charsets we might be interested
 * in.
 */


/*
 * Find the first of the fromcodes that gives a valid conversion and
 * the best charset conversion of the file into one of the tocodes. If
 * successful, set *fromcode and *tocode to dynamically allocated
 * strings, set CONTENT *info, and return the number of characters
 * converted inexactly. If no conversion was possible, return -1.
 *
 * Both fromcodes and tocodes may be colon-separated lists of charsets.
 * However, if fromcode is zero then fromcodes is assumed to be the
 * name of a single charset even if it contains a colon.
 */
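
As a loose Python 3 analogue of what these comments describe (not a
port of mutt's C code), one can try each candidate charset in order and
count the characters that do not survive the conversion. The function
name and the return convention are assumptions of this sketch.

```python
def best_charset(text, candidates):
    """Pick the candidate that converts text with the fewest inexact
    characters.

    candidates is a colon-separated list in order of preference.
    Returns (charset, inexact_count), or (None, -1) if no candidate
    is usable. Earlier candidates win ties.
    """
    best = (None, -1)
    for charset in candidates.split(":"):
        try:
            encoded = text.encode(charset, "replace")
            decoded = encoded.decode(charset)
        except LookupError:
            continue  # codec not available; try the next candidate
        # Each unencodable character comes back as '?'.
        inexact = sum(1 for a, b in zip(text, decoded) if a != b)
        if best[0] is None or inexact < best[1]:
            best = (charset, inexact)
        if inexact == 0:
            break  # an exact conversion cannot be improved on
    return best

print(best_charset("Rudn\xe1", "us-ascii:iso-8859-2:utf-8"))
```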

cheers
dan


-- 
  -----------------------------------------------------------
 / Dan Ohnesorg                              Dan@ohnesorg.cz \
<  Jinočanská 7                        252 19  Rudná u Prahy  >
 \ tel: +420 311 679679 +420 311 679976 fax: +420 311 679311 /
  -----------------------------------------------------------


From barry@zope.com  Sun Sep 22 18:30:20 2002
From: barry@zope.com (Barry A. Warsaw)
Date: Sun, 22 Sep 2002 13:30:20 -0400
Subject: [Mailman-i18n] Unicode in headers
References: <15756.42427.421288.925121@anthem.wooz.org>
 <200209212108.g8LL8avs018452@paros.informatik.hu-berlin.de>
Message-ID: <15757.65068.972135.810802@anthem.wooz.org>

>>>>> "MvL" == Martin von Loewis <loewis@informatik.hu-berlin.de> writes:

    MvL> You need this argument to specify the encoding of the string
    MvL> *you are passing*, not (primarily) of the resulting
    MvL> Header. Since the argument is a Unicode string and not a byte
    MvL> string, the encoding argument is superfluous.

D'oh, of course you're right Martin.

    >> My proposal is to do a type check in Header.__str__() so that
    >> if the value of self.encode() returns a unicode string, we will
    >> coerce it to an 8-bit string like so:

    MvL> This is evil. You are losing data without any need.

    MvL> Instead, I propose the following procedure: - if a Unicode
    MvL> argument is passed to Header.__init__ or Header.append,
    MvL>   take the encoding only as a hint. As an argument to
    MvL> __init__, also record it as the default for later .append
    MvL> calls.
    MvL> - when encoding the header, encode all Unicode strings with
    MvL> the hint.  If that fails, encode them as UTF-8.

Alternatively, we could try to provoke a UnicodeError early, at the
__init__ or .append call by doing something like:

    def append(self, s, charset=None):
        # ...
        # Encoding check.  Better to know now whether we'll have an encoding
        # error than when we try to str'ify the header.  Let UnicodeErrors
        # percolate to the caller.
        if _isunicode(s):
            s.encode(str(charset))
        else:
            unicode(s, str(charset))
        self._chunks.append((s, charset))

In other words, the caller is claiming that the string being passed in
is encoded with the given character set (or the default if None is
used).  Fine, let's check that here since it will be easier to debug
if the UnicodeError is raised now, rather than when the Generator
tries to print the message header.

I think I could live with that, and will work out a different
algorithm in Mailman.

-Barry


From loewis@informatik.hu-berlin.de  Mon Sep 23 10:18:45 2002
From: loewis@informatik.hu-berlin.de (Martin v. =?iso-8859-1?q?L=F6wis?=)
Date: 23 Sep 2002 11:18:45 +0200
Subject: [Mailman-i18n] Unicode in headers
In-Reply-To: <15757.65068.972135.810802@anthem.wooz.org>
References: <15756.42427.421288.925121@anthem.wooz.org>
 <200209212108.g8LL8avs018452@paros.informatik.hu-berlin.de>
 <15757.65068.972135.810802@anthem.wooz.org>
Message-ID: <j4znu9gj2i.fsf@informatik.hu-berlin.de>

barry@zope.com (Barry A. Warsaw) writes:

> Alternatively, we could try to provoke a UnicodeError early, at the
> __init__ or .append call by doing something like:

I see no reason to provoke a UnicodeError at all. An exception should
only be raised if the library cannot correctly process the data being
passed, or if the requested processing is ambiguous.

In this case, neither is the case: there is a perfectly correct and
meaningful processing of the data. If you raise an exception, the
application would need to deal with it just in the same way as I
propose.

> I think I could live with that, and will work out a different
> algorithm in Mailman.

I think users of the email package will find it more acceptable if no
exception is raised.

Regards,
Martin