From barry@zope.com Fri Sep 13 20:09:25 2002 From: barry@zope.com (Barry A. Warsaw) Date: Fri, 13 Sep 2002 15:09:25 -0400 Subject: [Mailman-i18n] "Funny" characters in real names? Message-ID: <15746.14309.516823.293632@anthem.wooz.org> Take a look at SF bug # 601082 http://sf.net/tracker/index.php?func=detail&aid=601082&group_id=103&atid=100103 Does anybody have opinions or other ideas of how to handle this? -Barry From che@debian.org Fri Sep 13 21:21:25 2002 From: che@debian.org (Ben Gertzfield) Date: Fri, 13 Sep 2002 13:21:25 -0700 Subject: [Mailman-i18n] "Funny" characters in real names? References: <15746.14309.516823.293632@anthem.wooz.org> Message-ID: <3D8248C5.5040205@debian.org> Barry A. Warsaw wrote: >Take a look at SF bug # 601082 > >http://sf.net/tracker/index.php?func=detail&aid=601082&group_id=103&atid=100103 > >Does anybody have opinions or other ideas of how to handle this? > > When submitting an HTML form, the character set used for the submitted data is the same as the one specified in the HTML or header of the original form's page. So we do know what character set the user originally used, and can store that with their user data. Then when we send out a personalized message, we can easily encode the To: header. Ben From barry@zope.com Fri Sep 13 21:44:57 2002 From: barry@zope.com (Barry A. Warsaw) Date: Fri, 13 Sep 2002 16:44:57 -0400 Subject: [Mailman-i18n] "Funny" characters in real names? References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> Message-ID: <15746.20041.343041.501116@anthem.wooz.org> >>>>> "BG" == Ben Gertzfield writes: BG> When submitting an HTML form, the character set used for the BG> submitted data is the same as the one specified in the HTML or BG> header of the original form's page. BG> So we do know what character set the user originally used, and BG> can store that with their user data. Then when we send out a BG> personalized message, we can easily encode the To: header. 
If they were subscribed via email, we'd already have the encoded form of their real name. What's left are the command line and mass subscribe page (both the text box and the file upload). In these cases should we simply reject addresses with non-ascii real names? That'd mean they'd have to be encoded prior to being subscribed. -Barry From che@debian.org Fri Sep 13 23:41:42 2002 From: che@debian.org (Ben Gertzfield) Date: Fri, 13 Sep 2002 15:41:42 -0700 Subject: [Mailman-i18n] "Funny" characters in real names? References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.20041.343041.501116@anthem.wooz.org> Message-ID: <3D8269A6.7070701@debian.org> Barry A. Warsaw wrote: >>>>>>"BG" == Ben Gertzfield writes: >>>>>> >>>>>> > > BG> When submitting an HTML form, the character set used for the > BG> submitted data is the same as the one specified in the HTML or > BG> header of the original form's page. > > > >If they were subscribed via email, we'd already have the encoded form >of their real name. > >What's left are the command line and mass subscribe page (both the >text box and the file upload). In these cases should we simply reject >addresses with non-ascii real names? That'd mean they'd have to be >encoded prior to being subscribed. > > As far as the command-line goes, we should probably reject non-ASCII real names, yes. (It MIGHT be possible to parse the various LANG/LC_CHARSET environment variables and guess the character set, but that's a pain.) The mass subscribe page case should be the same as any other HTML form, right? Whatever character set the original form's page used is what all the real names' character sets get set to. Ben From barry@zope.com Sat Sep 14 00:25:02 2002 From: barry@zope.com (Barry A. Warsaw) Date: Fri, 13 Sep 2002 19:25:02 -0400 Subject: [Mailman-i18n] "Funny" characters in real names? 
References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.20041.343041.501116@anthem.wooz.org> <3D8269A6.7070701@debian.org> Message-ID: <15746.29646.398343.137890@anthem.wooz.org> >>>>> "BG" == Ben Gertzfield writes: BG> As far as the command-line goes, we should probably reject BG> non-ASCII real names, yes. +1 BG> (It MIGHT be possible to parse the BG> various LANG/LC_CHARSET environment variables and guess the BG> character set, but that's a pain.) -1 BG> The mass subscribe page case should be the same as any other BG> HTML form, right? Whatever character set the original form's BG> page used is what all the real names' character sets get set BG> to. That simply means that you couldn't mass subscribe users with different charsets in their real names, but I'm perfectly fine with that limitation, especially since I don't see a way around it. :) Ok, I think I'll try to whip this up. Thanks for the feedback. -Barry From barry@zope.com Sat Sep 14 00:52:16 2002 From: barry@zope.com (Barry A. Warsaw) Date: Fri, 13 Sep 2002 19:52:16 -0400 Subject: [Mailman-i18n] "Funny" characters in real names? References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> Message-ID: <15746.31280.719924.780772@anthem.wooz.org> >>>>> "BG" == Ben Gertzfield writes: BG> When submitting an HTML form, the character set used for the BG> submitted data is the same as the one specified in the HTML or BG> header of the original form's page. I must be dense because I'm not quite seeing how this will work. I visit the mass subscribe page and in the text box, I enter a funny name, e.g. barry@python.org (Barry Wârsaw) My list is conducted in English. Now when I look at all the data submitted by the form, I don't see anything immediately useful in either the cgi environment or in the form data. 
Here are some excerpts: CONTENT_TYPE: multipart/form-data; boundary=---------------------------527473093431726113359136092 Hmm, nothing there. HTTP_ACCEPT_CHARSET: ISO-8859-1, utf-8;q=0.66, *;q=0.66 That doesn't really help us does it? That's telling us what charsets the browser will accept, right? Not the same thing. Now for the form data, I'll see a section like: -----------------------------527473093431726113359136092 Content-Disposition: form-data; name="subscribees" warsaw@wooz.org (Barry Wârsaw) -----------------------------527473093431726113359136092 This doesn't tell me enough either does it? So I'm at a loss as to where to suck the information out of the form data or the environment to figure out what charset the form was posted in. -Barry From tkikuchi@is.kochi-u.ac.jp Sat Sep 14 03:01:59 2002 From: tkikuchi@is.kochi-u.ac.jp (Tokio Kikuchi) Date: Sat, 14 Sep 2002 11:01:59 +0900 Subject: [Mailman-i18n] "Funny" characters in real names? References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> Message-ID: <3D829897.8020608@is.kochi-u.ac.jp> Hi, > So I'm at a loss as to where to suck the information out of the form > data or the environment to figure out what charset the form was posted > in. Let's make it simple; 1. Use the user's preferred language as a first guess and try to encode. 2. If it fails, strip off the real name from the recipient header. You will receive unreadable To: header if you set the preferred language as French and use Japanese character in real name, but you can correct it on the users option page. -- Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp http://weather.is.kochi-u.ac.jp/ From loewis@informatik.hu-berlin.de Sun Sep 15 18:38:05 2002 From: loewis@informatik.hu-berlin.de (Martin v. =?iso-8859-1?q?L=F6wis?=) Date: 15 Sep 2002 19:38:05 +0200 Subject: [Mailman-i18n] "Funny" characters in real names? 
In-Reply-To: <15746.31280.719924.780772@anthem.wooz.org> References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> Message-ID: barry@zope.com (Barry A. Warsaw) writes: > I must be dense because I'm not quite seeing how this will work. [...] > This doesn't tell me enough either does it? You are running into one of the most awful oddities of HTTP and i18n. In short, the encoding of the page that contained the form is used to encode the form contents :-( The RFC says the browser SHOULD declare the encoding for each field in the per-field MIME header of the multipart/form-data message. None of the browsers does that. I filed bug reports for all of them, and the Mozilla people responded that they can't do that because many CGI scripts break when they get a charset= (it won't fit their regexp). The RFC says, as a fall-back, the browser should use the encoding of the HTML page which contained the form. Mailman doesn't declare a charset in the administrative pages, but it should. It may happen that the user enters a character which cannot be represented in the charset of the page. In this case, Mozilla sends a '?' (question mark), so you can only tell that there was a character, but not which one. Internet Exploder sends an HTML entity, which gives you more information, but is indistinguishable from the case where the user entered an ampersand-digits sequence. For Mailman, this gives two options: 1. Each administrative page should be encoded in the list's "native" charset. This will allow adding names in that charset. 2. Each page should be encoded in UTF-8. This will allow entering arbitrary names, but will require recoding to the list's charset later (or using UTF-8 in the To: fields as well). Actually, it appears that mailman already does 1, in the HTTP header. Barry, what is the charset of your admin pages? Regards, Martin From barry@zope.com Sun Sep 15 23:34:36 2002 From: barry@zope.com (Barry A. 
Warsaw) Date: Sun, 15 Sep 2002 18:34:36 -0400 Subject: [Mailman-i18n] "Funny" characters in real names? References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> Message-ID: <15749.2812.524261.962949@anthem.wooz.org> Thanks Martin, and everyone, I think I know what to do now. >>>>> "MvL" == Martin v Löwis writes: MvL> The RFC says, as a fall-back, the browser should use the MvL> encoding of the HTML page which contained the form. Mailman MvL> doesn't declare a charset in the administrative pages, but it MvL> should. I think we'll make things simple and assume a list's preferred language doesn't change between form presentation and submission. So if the page has a `lang' key (i.e. the listinfo page, or options page), we'll use that, otherwise we'll fall back to the list's preferred language. If that language's charset is ascii or there are only ascii characters in the name, we'll simply store the name unencoded in the user database. Otherwise, we'll convert the name to Unicode, and store that along with the charset. Then for email headers, we'll use the Header class to encode the name in an RFC conformant way. I don't think this will be a huge amount of work, although it will require some changes to the MemberAdaptor API. (For command line, we'll still insist on ascii in the names, unless there's a hue and cry -- or patch -- for something better.) MvL> It may happen that the user enters a character which cannot MvL> be represented in the charset of the page. In this case, MvL> Mozilla sends a '?' (question mark), so you can only tell MvL> that there was a character, but not which one. Internet MvL> Exploder sends an HTML entity, which gives you more MvL> information, but is indistinguishable from the case where the MvL> user entered an ampersand-digits sequence. We won't handle these specially. If that's what the browser gives us, that's what we'll use. 
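The storage and header-encoding plan described above can be sketched roughly as follows. This is a minimal illustration written for a modern Python 3 interpreter, not the code that went into Mailman: the helper names `canonical_name` and `encode_realname` are hypothetical, and the only piece taken from the message itself is the use of the `Header` class for the email header.

```python
from email.header import Header


def canonical_name(raw: bytes, charset: str) -> str:
    """Decode a submitted real name (hypothetical helper).

    ASCII-only names decode as us-ascii; anything else is decoded with
    the charset assumed for the page that carried the form.
    """
    try:
        return raw.decode("us-ascii")
    except UnicodeDecodeError:
        # 'replace' substitutes undecodable bytes instead of raising.
        return raw.decode(charset, "replace")


def encode_realname(name: str, charset: str) -> str:
    """Return the name in a form that is safe to place in a To: header."""
    try:
        name.encode("us-ascii")
        return name  # plain ASCII needs no encoded-word
    except UnicodeEncodeError:
        # Header produces an RFC 2047 encoded-word in the given charset.
        # It raises if the name cannot be represented in that charset.
        return Header(name, charset).encode()


# The example name from earlier in the thread, as iso-8859-1 bytes.
name = canonical_name(b"Barry W\xe2rsaw", "iso-8859-1")
header_value = "%s <%s>" % (encode_realname(name, "iso-8859-1"),
                            "barry@python.org")
print(header_value)
```

The sketch leaves out quoting of ASCII names that contain special characters, and it assumes the caller already knows which charset the form page was served in.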
MvL> For Mailman, this gives two options: MvL> 1. Each administrative page should be encoded in the list's MvL> "native" charset. This will allow adding names in that MvL> charset. MvL> 2. Each page should be encoded in UTF-8. This will allow MvL> entering arbitrary names, but will require recoding to the MvL> list's charset later (or using UTF-8 in the To: fields as MvL> well). MvL> Actually, it appears that mailman already does 1, in the HTTP MvL> header. Barry, what is the charset of your admin pages? I had tried iso-8859-1 and us-ascii. In us-ascii I got the HTML entity, but in iso-8859-1 I got the actual character. Let's go with #1. Thanks. -Barry From barry@python.org Tue Sep 17 21:58:38 2002 From: barry@python.org (Barry A. Warsaw) Date: Tue, 17 Sep 2002 16:58:38 -0400 Subject: [Mailman-i18n] "Funny" characters in real names? References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> <15749.2812.524261.962949@anthem.wooz.org> Message-ID: <15751.38782.74609.163241@anthem.wooz.org> To follow up, I believe I have this working now. Here's how it works. First, the only change to the MemberAdaptor API is that real names can now be Unicode strings as well as 8-bit strings. If they're 8-bit then they'll contain only ascii characters. When a real name is entered into a web form, we'll first attempt to convert it to us-ascii. If that succeeds, we know the real name is ascii only and we'll store it in the membership database as an 8-bit ascii-only-containing string. If the conversion fails, we'll convert the real name to Unicode using the charset of the context's language (i.e. list preferred if we're looking at an admin page, user preferred if we're looking at an options page, and form value if we're looking at the subscribe page -- all with appropriate fallbacks to Something Sensible). We'll also do html entity replacement (e.g. &#246; -> ö). 
We'll store this Unicode string as the member's real name in the membership database, but we don't store the charset because... ...when we need to get a printable version of the member name, we yank out either the ascii string or the Unicode string. If it's ascii, we're done. If it's Unicode, then we try to encode it to the charset of the web page we're printing (for cgi), or to the charset of the outgoing email message. For output web pages, if the encoding fails, we'll convert chars > 127 to html entities (e.g. ö -> &#246;) so in most cases we'll still see the name with the proper characters. For this case, think about a user who selects Spanish, enters a ñ in their name, and then switches their preferred language to English (us-ascii). You'd like their name to still show up correctly. For email, if the name has non-ascii characters in it, we'll use the email.Header.Header class to convert the To string to an RFC-compliant format. If that fails we fall back on encoding to us-ascii replacing non-ascii characters with `?'s. This seems to work fairly well (with some ugly changes also necessary to the logging system), with one minor kludge. I want to allow non-ASCII characters in real names for English lists. I'm nervous about changing the default charset for English from us-ascii because I'm superstitious about unintended side-effects. So I'm making a couple of special cases for us-ascii. When decoding a string from a web form, if the default charset would be us-ascii, I'll use iso-8859-1 instead. Then when encoding a name in an email header, if the charset is us-ascii, again, I'll use iso-8859-1. This seems like a practical compromise, if a bit ugly. Feedback is welcome. I'm about to check all this stuff in. Testing will be /greatly/ appreciated! -Barry From che@debian.org Tue Sep 17 22:22:51 2002 From: che@debian.org (Ben Gertzfield) Date: Tue, 17 Sep 2002 14:22:51 -0700 Subject: [Mailman-i18n] "Funny" characters in real names? 
References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> <15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org> Message-ID: <3D879D2B.5080600@debian.org> Barry A. Warsaw wrote: >To follow up, I believe I have this working now. Here's how it works. > Thanks for the excellent explanation and implementation, Barry. I'll test this when it's checked in. Some comments below.. >First, the only change to the MemberAdaptor API is that real names can >now be Unicode strings as well as 8-bit strings. If they're 8-bit >then they'll contain only ascii characters. > > ASCII is by definition 7-bit, Barry. Did you mean ISO-8859-1 here? >When a real name is entered into a web form, we'll first attempt to >convert it to us-ascii. If that succeeds, we know the real name is >ascii only and we'll store it in the membership database as an 8-bit >ascii-only-containing string. > > Again, I assume you mean ISO-8859-1 instead of ascii here. >If the conversion fails, we'll convert the real name to Unicode using >the charset of the context's language (i.e. list preferred if we're >looking at an admin page, user preferred if we're looking at an >options page, and form value if we're looking at the subscribe page -- >all with appropriate fallbacks to Something Sensible). We'll also do >html entity replacement (e.g. &#246; -> ö). We'll store this Unicode >string as the member's real name in the membership database, but we >don't store the charset because... > > This is a good thing. Note that some browsers might (I haven't checked this) incorrectly send the entity &#246; for whatever character is at position 246 in the user's default character set, not character 246 in Unicode. This might be something to look out for, but I don't know if it's important. Everything else looks good. 
The kludge to assume iso-8859-1 on us-ascii pages is unfortunately a generally good one, as that will make the most people happy. I hate to do it, though! Ben From barry@python.org Tue Sep 17 22:34:36 2002 From: barry@python.org (Barry A. Warsaw) Date: Tue, 17 Sep 2002 17:34:36 -0400 Subject: [Mailman-i18n] "Funny" characters in real names? References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> <15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org> <3D879D2B.5080600@debian.org> Message-ID: <15751.40940.902092.18578@anthem.wooz.org> >>>>> "BG" == Ben Gertzfield writes: >> To follow up, I believe I have this working now. Here's how it >> works. BG> Thanks for the excellent explanation and implementation, BG> Barry. Took me two days. I still say Unicode is something everyone wants until they get it. :) BG> I'll test this when it's checked in. Some comments below.. Excellent! >> First, the only change to the MemberAdaptor API is that real >> names can now be Unicode strings as well as 8-bit strings. If >> they're 8-bit then they'll contain only ascii characters. BG> ASCII is by definition 7-bit, Barry. Did you mean ISO-8859-1 BG> here? Sorry, I meant "normal" Python strings (sometimes called "8-bit strings") but which contain only 7-bit ascii characters. Those beasties I don't convert to Python unicode strings. >> When a real name is entered into a web form, we'll first >> attempt to convert it to us-ascii. If that succeeds, we know >> the real name is ascii only and we'll store it in the >> membership database as an 8-bit ascii-only-containing string. BG> Again, I assume you mean ISO-8859-1 instead of ascii here. Same thing here. We do name.encode('us-ascii') and catch any UnicodeError that might occur. 
If no error occurs, we know we have a string with 7-bit ascii characters in it, so we store that as an 8-bit Python string, not as a unicode Python string. >> If the conversion fails, we'll convert the real name to Unicode >> using the charset of the context's language (i.e. list >> preferred if we're looking at an admin page, user preferred if >> we're looking at an options page, and form value if we're >> looking at the subscribe page -- all with appropriate fallbacks >> to Something Sensible). We'll also do html entity replacement >> (e.g. &#246; -> ö). We'll store this Unicode string as the >> member's real name in the membership database, but we don't >> store the charset because... BG> This is a good thing. Note that some browsers might (I BG> haven't checked this) incorrectly send the entity &#246; for BG> whatever character is at position 246 in the user's default BG> character set, not character 246 in Unicode. This might be BG> something to look out for, but I don't know if it's important. I don't know what else to do. Note that you could literally type &#246; into the web form and it would have the same effect. This is probably an 80/20 solution. BG> Everything else looks good. The kludge to assume iso-8859-1 BG> on us-ascii pages is unfortunately a generally good one, as BG> that will make the most people happy. I hate to do it, BG> though! Me too! It means that names in other charsets will be screwed on English lists, but again, I think this is the best we can do for a practical 80/20 solution. Thanks for the feedback. -Barry From tkikuchi@is.kochi-u.ac.jp Wed Sep 18 01:40:56 2002 From: tkikuchi@is.kochi-u.ac.jp (Tokio Kikuchi) Date: Wed, 18 Sep 2002 09:40:56 +0900 Subject: [Mailman-i18n] "Funny" characters in real names? 
References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> <15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org> <3D879D2B.5080600@debian.org> <15751.40940.902092.18578@anthem.wooz.org> Message-ID: <3D87CB98.8010908@is.kochi-u.ac.jp> Thanks Barry. Barry A. Warsaw wrote: >>>>>>"BG" == Ben Gertzfield writes: > > BG> Thanks for the excellent explanation and implementation, > BG> Barry. > > Took me two days. I still say Unicode is something everyone wants > until they get it. :) I had to add "import japanese" in Utils.py. Otherwise, I got unknown charset error when I entered japanese in fullname. And, "korean" ? Where should I put them ? -- Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp http://weather.is.kochi-u.ac.jp/ From barry@python.org Wed Sep 18 02:51:53 2002 From: barry@python.org (Barry A. Warsaw) Date: Tue, 17 Sep 2002 21:51:53 -0400 Subject: [Mailman-i18n] "Funny" characters in real names? References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> <15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org> <3D879D2B.5080600@debian.org> <15751.40940.902092.18578@anthem.wooz.org> <3D87CB98.8010908@is.kochi-u.ac.jp> Message-ID: <15751.56377.321564.625249@anthem.wooz.org> >>>>> "TK" == Tokio Kikuchi writes: TK> I had to add "import japanese" in Utils.py. Otherwise, I got TK> unknown charset error when I entered japanese in fullname. Do you have a traceback? Can you send me a valid Japanese name string so that I can try it myself? I would have thought that the codecs package would have imported the Japanese codecs automatically. If not, we have to figure out the right way to hook it up so it works. If adding the import is the only way to do it, then we'll go that route, but I'd like to try to reproduce the problem here. TK> And, "korean" ? Where should I put them ? 
We probably have to handle this the same way we handle Japanese, but I'm not sure what the right way to do that is at the moment. Thanks, -Barry From tkikuchi@is.kochi-u.ac.jp Wed Sep 18 03:39:44 2002 From: tkikuchi@is.kochi-u.ac.jp (Tokio Kikuchi) Date: Wed, 18 Sep 2002 11:39:44 +0900 Subject: [Mailman-i18n] "Funny" characters in real names? References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> <15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org> <3D879D2B.5080600@debian.org> <15751.40940.902092.18578@anthem.wooz.org> <3D87CB98.8010908@is.kochi-u.ac.jp> <15751.56377.321564.625249@anthem.wooz.org> Message-ID: <3D87E770.9000903@is.kochi-u.ac.jp> Barry A. Warsaw wrote: >>>>>>"TK" == Tokio Kikuchi writes: >>>>> > > TK> I had to add "import japanese" in Utils.py. Otherwise, I got > TK> unknown charset error when I entered japanese in fullname. > > Do you have a traceback? Here is it. Traceback: Traceback (most recent call last): File "/home/mailman3/scripts/driver", line 82, in run_main main() File "/home/mailman3/Mailman/Cgi/subscribe.py", line 94, in main process_form(mlist, doc, cgidata, language) File "/home/mailman3/Mailman/Cgi/subscribe.py", line 113, in process_form fullname = Utils.canonstr(fullname, lang) File "/home/mailman3/Mailman/Utils.py", line 731, in canonstr return unicode(newstr, charset, 'replace') LookupError: unknown encoding > > Can you send me a valid Japanese name string so that I can try it > myself? Well, you must first configure your browser japanese capable. Here is my name in euc-jp and unicode: '\xb5\xc6\xc3\xcf\xbb\xfe\xc9\xd7' u'\u83ca\u5730\u6642\u592b' > I would have thought that the codecs package would have > imported the Japanese codecs automatically. If not, we have to figure > out the right way to hook it up so it works. Me too... 
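The two forms of the name quoted above can be checked against each other, and the `LookupError` in the traceback can be anticipated by asking whether a codec is registered before decoding. A small sketch, assuming a modern Python 3 interpreter where an `euc-jp` codec ships with the standard library (at the time of this thread the Japanese codecs had to be imported separately, as described above); the helper name `charset_known` is hypothetical:

```python
import codecs

# Byte string and code points exactly as quoted in the message above.
euc_jp_name = b"\xb5\xc6\xc3\xcf\xbb\xfe\xc9\xd7"
unicode_name = "\u83ca\u5730\u6642\u592b"


def charset_known(charset: str) -> bool:
    """Report whether a codec is registered for this charset name."""
    try:
        codecs.lookup(charset)
        return True
    except LookupError:
        # This is the error class seen in the traceback above.
        return False


# Same decoding step as the unicode(newstr, charset, 'replace') call
# shown in the traceback, written in Python 3 terms.
decoded = euc_jp_name.decode("euc-jp", "replace")
print(charset_known("euc-jp"), decoded == unicode_name)
```

Checking the codec up front lets a caller fall back to another charset instead of failing in the middle of a form submission.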
-- Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp http://weather.is.kochi-u.ac.jp/ From tkikuchi@is.kochi-u.ac.jp Wed Sep 18 04:45:04 2002 From: tkikuchi@is.kochi-u.ac.jp (Tokio Kikuchi) Date: Wed, 18 Sep 2002 12:45:04 +0900 Subject: [Mailman-i18n] "Funny" characters in real names? References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> <15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org> Message-ID: <3D87F6C0.7020809@is.kochi-u.ac.jp> Barry, Here is another bug. I think I could track down. On the admin//member page, if a user have set his name in japanese, we get an error when you click the change button. It is because admin.py fails to encode the full name in unicode. Here is the patch: --- /home/mailman/src/mailman/Mailman/Cgi/admin.py Wed Sep 18 08:47:39 2002 +++ Cgi/admin.py Wed Sep 18 12:36:29 2002 @@ -1367,6 +1367,7 @@ pass newname = cgidata.getvalue(user+'_realname', '') + newname = Utils.canonstr(newname, mlist.preferred_language) mlist.setMemberName(user, newname) newlang = cgidata.getvalue(user+'_language') -- Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp http://weather.is.kochi-u.ac.jp/ From barry@python.org Wed Sep 18 06:41:35 2002 From: barry@python.org (Barry A. Warsaw) Date: Wed, 18 Sep 2002 01:41:35 -0400 Subject: [Mailman-i18n] "Funny" characters in real names? References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> <15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org> <3D879D2B.5080600@debian.org> <15751.40940.902092.18578@anthem.wooz.org> <3D87CB98.8010908@is.kochi-u.ac.jp> <15751.56377.321564.625249@anthem.wooz.org> <3D87E770.9000903@is.kochi-u.ac.jp> Message-ID: <15752.4623.531823.282922@anthem.wooz.org> Thanks for the feedback. I think I've got the Japanese support working in cvs now. But then, I'm pretty tired, so please double check! 
-Barry From loewis@informatik.hu-berlin.de Wed Sep 18 07:08:06 2002 From: loewis@informatik.hu-berlin.de (Martin v. =?iso-8859-1?q?L=F6wis?=) Date: 18 Sep 2002 08:08:06 +0200 Subject: [Mailman-i18n] "Funny" characters in real names? In-Reply-To: <15751.40940.902092.18578@anthem.wooz.org> References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> <15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org> <3D879D2B.5080600@debian.org> <15751.40940.902092.18578@anthem.wooz.org> Message-ID: barry@python.org (Barry A. Warsaw) writes: > BG> Again, I assume you mean ISO-8859-1 instead of ascii here. > > Same thing here. We do name.encode('us-ascii') and catch any > UnicodeError that might occur. That should be name.decode, or unicode(name, "us-ascii"). name.encode unfortunately works as well, but it first decodes using the system default encoding, then encodes using us-ascii - very close to what you want, but not the precise same thing. Regards, Martin From loewis@informatik.hu-berlin.de Wed Sep 18 07:09:26 2002 From: loewis@informatik.hu-berlin.de (Martin v. =?iso-8859-1?q?L=F6wis?=) Date: 18 Sep 2002 08:09:26 +0200 Subject: [Mailman-i18n] "Funny" characters in real names? In-Reply-To: <3D879D2B.5080600@debian.org> References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> <15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org> <3D879D2B.5080600@debian.org> Message-ID: Ben Gertzfield writes: > This is a good thing. Note that some browsers might (I haven't > checked this) incorrectly send the entity &246; for whatever character > is at position 246 in the user's default character set, not character > 246 in Unicode. This might be something to look out for, but I don't > know if it's important. I'm not aware of any browser that does that. 
IE is the only one that sends HTML entities at all if you get an unsupported character. > Everything else looks good. The kludge to assume iso-8859-1 on > us-ascii pages is unfortunately a generally good one, as that will > make the most people happy. I think this kludge does not help at all, see my other message. Regards, Martin From loewis@informatik.hu-berlin.de Wed Sep 18 07:13:15 2002 From: loewis@informatik.hu-berlin.de (Martin v. =?iso-8859-1?q?L=F6wis?=) Date: 18 Sep 2002 08:13:15 +0200 Subject: [Mailman-i18n] "Funny" characters in real names? In-Reply-To: <15751.38782.74609.163241@anthem.wooz.org> References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> <15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org> Message-ID: barry@python.org (Barry A. Warsaw) writes: > This seems to work fairly well (with some ugly changes also necessary > to the logging system), with one minor kludge. I want to allow > non-ASCII characters in real names for English lists. I'm nervous > about changing the default charset for English from us-ascii because > I'm superstitious about unintended side-effects. So I'm making a > couple of special cases for us-ascii. When decoding a string from a > web form, if the default charset would be us-ascii, I'll use > iso-8859-1 instead. Then when encoding a name in an email header, if > the charset is us-ascii, again, I'll use iso-8859-1. This seems like > a practical compromise, if a bit ugly. Feedback is welcome. Do you already send the page that has the form in iso-8859-1, or do you use latin-1 only when interpreting form data? If the latter, I think you gain nothing: the web browser will not transmit latin-1 data if the form was us-ascii, so decoding the data with latin-1 will work, but not allow to transmit latin-1 data. On encoding Unicode names in email messages: I hope you have a general fallback to UTF-8. 
If all else fails, UTF-8 will still work, and DTRT. Regards, Martin From tkikuchi@is.kochi-u.ac.jp Wed Sep 18 07:17:43 2002 From: tkikuchi@is.kochi-u.ac.jp (Tokio Kikuchi) Date: Wed, 18 Sep 2002 15:17:43 +0900 Subject: [Mailman-i18n] "Funny" characters in real names? References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> <15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org> <3D879D2B.5080600@debian.org> <15751.40940.902092.18578@anthem.wooz.org> <3D87CB98.8010908@is.kochi-u.ac.jp> <15751.56377.321564.625249@anthem.wooz.org> <3D87E770.9000903@is.kochi-u.ac.jp> <15752.4623.531823.282922@anthem.wooz.org> Message-ID: <3D881A87.7030001@is.kochi-u.ac.jp> Thanks a lot, Barry. It's afternoon in Japan now and you are tired in midnight. Then, it's morning in Europe ... ;-) Good night. Barry A. Warsaw wrote: > Thanks for the feedback. I think I've got the Japanese support > working in cvs now. But then, I'm pretty tired, so please double > check! > > -Barry > > _______________________________________________ > Mailman-i18n mailing list > Mailman-i18n@python.org > http://mail.python.org/mailman/listinfo/mailman-i18n > > -- Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp http://weather.is.kochi-u.ac.jp/ From loewis@informatik.hu-berlin.de Wed Sep 18 07:17:44 2002 From: loewis@informatik.hu-berlin.de (Martin v. =?iso-8859-1?q?L=F6wis?=) Date: 18 Sep 2002 08:17:44 +0200 Subject: [Mailman-i18n] "Funny" characters in real names? In-Reply-To: <15751.40940.902092.18578@anthem.wooz.org> References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> <15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org> <3D879D2B.5080600@debian.org> <15751.40940.902092.18578@anthem.wooz.org> Message-ID: barry@python.org (Barry A. Warsaw) writes: > Took me two days. 
I still say Unicode is something everyone wants
> until they get it. :)

Seriously, I think it's slightly different: End users don't want
Unicode; they should not care. Software developers become scared
easily when confronted with Unicode. However, considering the
alternatives (preserving the original encoding all the time, and
having to combine strings in different encodings), Unicode does
simplify processing, IMO.

Regards,
Martin

From quique@sindominio.net Wed Sep 18 07:38:05 2002
From: quique@sindominio.net (Quique)
Date: Wed, 18 Sep 2002 08:38:05 +0200 (CEST)
Subject: [Mailman-i18n] 'Funny' characters in real names?
In-Reply-To: <15751.40940.902092.18578@anthem.wooz.org>
References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> <15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org> <3D879D2B.5080600@debian.org> <15751.40940.902092.18578@anthem.wooz.org>
Message-ID: <35199.199.228.142.9.1032331085.squirrel@www.sindominio.net>

Barry A. Warsaw wrote:
> BG> Everything else looks good. The kludge to assume iso-8859-1
> BG> on us-ascii pages is unfortunately a generally good one, as
> BG> that will make the most people happy. I hate to do it,
> BG> though!
>
> Me too! It means that names in other charsets will be screwed on
> English lists, but again, I think this is best we can do for a
> practical 80/20 solution.

from my ignorance:
what about using iso-8859-15 instead of -1?

it seems to add some forgotten French, Finnish, Estonian and Czech
letters.

cheers,
quique

-- 
Taller towers have fallen.

From loewis@informatik.hu-berlin.de Wed Sep 18 08:52:38 2002
From: loewis@informatik.hu-berlin.de (Martin v. =?iso-8859-1?q?L=F6wis?=)
Date: 18 Sep 2002 09:52:38 +0200
Subject: [Mailman-i18n] 'Funny' characters in real names?
In-Reply-To: <35199.199.228.142.9.1032331085.squirrel@www.sindominio.net> References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> <15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org> <3D879D2B.5080600@debian.org> <15751.40940.902092.18578@anthem.wooz.org> <35199.199.228.142.9.1032331085.squirrel@www.sindominio.net> Message-ID: "Quique" writes: > from my ignorance: > what about using iso-8859-15 instead of -1? > > it seems to add some forgotten french, finnish, estonian and czech > letters. It doesn't really matter, IMO: administrators can always turn the default encoding to -15 if they want to. However, it is likely that some email readers will have difficulties to represent those characters when confronted with them, so it is unclear what you gain. I'm personally quite unhappy with iso-8859-15: it was invented at a time when Unicode was already there, and the world didn't really need any more character sets. Regards, Martin From tkikuchi@is.kochi-u.ac.jp Wed Sep 18 09:00:41 2002 From: tkikuchi@is.kochi-u.ac.jp (Tokio Kikuchi) Date: Wed, 18 Sep 2002 17:00:41 +0900 Subject: [Mailman-i18n] "Funny" characters in real names? References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> <15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org> <3D879D2B.5080600@debian.org> <15751.40940.902092.18578@anthem.wooz.org> <3D87CB98.8010908@is.kochi-u.ac.jp> <15751.56377.321564.625249@anthem.wooz.org> <3D87E770.9000903@is.kochi-u.ac.jp> <15752.4623.531823.282922@anthem.wooz.org> <3D881A87.7030001@is.kochi-u.ac.jp> Message-ID: <3D8832A9.8060302@is.kochi-u.ac.jp> Tokio Kikuchi wrote: > Good night. > > Barry A. Warsaw wrote: > >> Thanks for the feedback. I think I've got the Japanese support >> working in cvs now. But then, I'm pretty tired, so please double >> check! 
Well, things are not so easy. I get errors here and there. It's 5 in Japan and I have to feed my kid and puppy. later, -- Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp http://weather.is.kochi-u.ac.jp/ From tkikuchi@is.kochi-u.ac.jp Wed Sep 18 13:58:11 2002 From: tkikuchi@is.kochi-u.ac.jp (Tokio Kikuchi) Date: Wed, 18 Sep 2002 21:58:11 +0900 Subject: [Mailman-i18n] "Funny" characters in real names? References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> <15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org> <3D879D2B.5080600@debian.org> <15751.40940.902092.18578@anthem.wooz.org> <3D87CB98.8010908@is.kochi-u.ac.jp> <15751.56377.321564.625249@anthem.wooz.org> <3D87E770.9000903@is.kochi-u.ac.jp> <15752.4623.531823.282922@anthem.wooz.org> <3D881A87.7030001@is.kochi-u.ac.jp> <3D8832A9.8060302@is.kochi-u.ac.jp> Message-ID: <3D887863.8070808@is.kochi-u.ac.jp> Tokio Kikuchi wrote: > Well, things are not so easy. I get errors here and there. I think I've tracked one. Here is a patch. 
--- /home/mailman/src/mailman/Mailman/Cgi/confirm.py Wed Sep 18 08:47:39 2002
+++ Cgi/confirm.py Wed Sep 18 21:50:37 2002
@@ -200,6 +200,7 @@
     password = userdesc.password
     digest = userdesc.digest
     lang = userdesc.language
+    name = Utils.uncanonstr(name, lang)
     title = _('Confirm subscription request')
     doc.SetTitle(title)
     i18n.set_language(lang)

Without this, you get

admin(79627): [----- Mailman Version: 2.1b3+ -----]
admin(79627): [----- Traceback ------]
admin(79627): Traceback (most recent call last):
admin(79627):   File "/home/mailman3/scripts/driver", line 82, in run_main
admin(79627):     main()
admin(79627):   File "/home/mailman3/Mailman/Cgi/confirm.py", line 155, in main
admin(79627):     print doc.Format()
admin(79627):   File "/home/mailman3/Mailman/htmlformat.py", line 331, in Format
admin(79627):     output.append(Container.Format(self, indent))
(snip)
admin(79627):   File "/home/mailman3/Mailman/htmlformat.py", line 188, in FormatRow
admin(79627):     output = output + self.FormatCell(row, i, indent + 2)
admin(79627): UnicodeError: ASCII decoding error: ordinal not in range(128)

-- 
Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/

From tkikuchi@is.kochi-u.ac.jp Wed Sep 18 14:27:54 2002
From: tkikuchi@is.kochi-u.ac.jp (Tokio Kikuchi)
Date: Wed, 18 Sep 2002 22:27:54 +0900
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> <15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org> <3D879D2B.5080600@debian.org> <15751.40940.902092.18578@anthem.wooz.org> <3D87CB98.8010908@is.kochi-u.ac.jp> <15751.56377.321564.625249@anthem.wooz.org> <3D87E770.9000903@is.kochi-u.ac.jp> <15752.4623.531823.282922@anthem.wooz.org> <3D881A87.7030001@is.kochi-u.ac.jp> <3D8832A9.8060302@is.kochi-u.ac.jp> <3D887863.8070808@is.kochi-u.ac.jp>
Message-ID: <3D887F5A.50108@is.kochi-u.ac.jp>

>
> I think I've tracked one.
Here is a patch.

tracked two :-) patch merged.

--- /home/mailman/src/mailman/Mailman/Cgi/confirm.py Wed Sep 18 08:47:39 2002
+++ Cgi/confirm.py Wed Sep 18 22:17:49 2002
@@ -200,6 +200,7 @@
     password = userdesc.password
     digest = userdesc.digest
     lang = userdesc.language
+    name = Utils.uncanonstr(name, lang)
     title = _('Confirm subscription request')
     doc.SetTitle(title)
     i18n.set_language(lang)
@@ -314,6 +315,7 @@
             overrides = UserDesc(fullname=cgidata.getvalue('realname', None),
                                  digest=digest, lang=lang)
             userdesc += overrides
+            userdesc.fullname = Utils.canonstr(userdesc.fullname, userdesc.language)
             op, addr, pw, digest, lang = mlist.ProcessConfirmation(
                 cookie, userdesc)
         except Errors.MMNeedApproval:

traceback for the second

admin(79816): [----- Mailman Version: 2.1b3+ -----]
admin(79816): [----- Traceback ------]
admin(79816): Traceback (most recent call last):
admin(79816):   File "/home/mailman3/scripts/driver", line 82, in run_main
admin(79816):     main()
admin(79816):   File "/home/mailman3/Mailman/Cgi/options.py", line 598, in main
admin(79816):     options_page(mlist, doc, user, cpuser, userlang)
admin(79816):   File "/home/mailman3/Mailman/Cgi/options.py", line 616, in options_page
admin(79816):     fullname = Utils.uncanonstr(mlist.getMemberName(user), userlang)
admin(79816):   File "/home/mailman3/Mailman/Utils.py", line 755, in uncanonstr
admin(79816):     return s.encode(charset, 'strict')
admin(79816): TypeError: _japanese_codecs_euc_jp_encode() argument 1 must be unicode, not string

Good Night!

-- 
Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/

From barry@python.org Thu Sep 19 04:38:41 2002
From: barry@python.org (Barry A. Warsaw)
Date: Wed, 18 Sep 2002 23:38:41 -0400
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> <15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org> <3D879D2B.5080600@debian.org> <15751.40940.902092.18578@anthem.wooz.org> <3D87CB98.8010908@is.kochi-u.ac.jp> <15751.56377.321564.625249@anthem.wooz.org> <3D87E770.9000903@is.kochi-u.ac.jp> <15752.4623.531823.282922@anthem.wooz.org> <3D881A87.7030001@is.kochi-u.ac.jp> <3D8832A9.8060302@is.kochi-u.ac.jp> <3D887863.8070808@is.kochi-u.ac.jp> <3D887F5A.50108@is.kochi-u.ac.jp>
Message-ID: <15753.18113.512461.154971@anthem.wooz.org>

Thanks. I modified this slightly, so please double check me!

-Barry

From barry@python.org Thu Sep 19 05:07:40 2002
From: barry@python.org (Barry A. Warsaw)
Date: Thu, 19 Sep 2002 00:07:40 -0400
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> <15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org> <3D879D2B.5080600@debian.org> <15751.40940.902092.18578@anthem.wooz.org>
Message-ID: <15753.19852.142162.352140@anthem.wooz.org>

>>>>> "MvL" == Martin v Löwis writes:

    >> BG> Again, I assume you mean ISO-8859-1 instead of ascii here.

    >> Same thing here. We do name.encode('us-ascii') and catch any
    >> UnicodeError that might occur.

    MvL> That should be name.decode, or unicode(name,
    MvL> "us-ascii"). name.encode unfortunately works as well, but it
    MvL> first decodes using the system default encoding, then encodes
    MvL> using us-ascii - very close to what you want, but not the
    MvL> precise same thing.

Martin, sometimes this Unicode stuff makes my head hurt. ;)

I don't think I can use name.decode() because that's a Python 2.2-ism
and we need to stick to Python 2.1.

I don't think I can use unicode(name, 'us-ascii') because what if name
is already a Unicode string?
This'll give me a TypeError. So it seems like name.encode('us-ascii') is my only choice. What am I missing? -Barry From loewis@informatik.hu-berlin.de Thu Sep 19 08:37:38 2002 From: loewis@informatik.hu-berlin.de (Martin v. =?iso-8859-1?q?L=F6wis?=) Date: 19 Sep 2002 09:37:38 +0200 Subject: [Mailman-i18n] "Funny" characters in real names? In-Reply-To: <15753.19852.142162.352140@anthem.wooz.org> References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> <15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org> <3D879D2B.5080600@debian.org> <15751.40940.902092.18578@anthem.wooz.org> <15753.19852.142162.352140@anthem.wooz.org> Message-ID: barry@python.org (Barry A. Warsaw) writes: > Martin, sometimes this Unicode stuff makes my head hurt. ;) In an application that deals with multiple charsets on a regular basis (such as mailman), I recommend not to mix byte strings and Unicode strings. This can be achieved by - converting all byte strings that represent text data to Unicode at the earliest possible point in processing, - converting all Unicode strings back to byte strings just before output. If most data is likely ASCII, it is tempting to use byte strings for pure-ASCII, and Unicode for everything else. Try to resist this temptation. If you follow this strategy, you find that processing becomes much simpler. > So it seems like name.encode('us-ascii') is my only choice. What am I > missing? If you are following the above strategy, you will know whether name is Unicode or byte string. If it is Unicode, .encode is fine. If it is a byte string, unicode(name,'ascii') will work. I admit that the strategy has two problems: 1. In some cases, it might be impossible to generate a Unicode string for text data. In MIME, the encoding may not be specified, or it may be unknown to mailman, or the data may fail to convert. 
   In these cases, it may be acceptable to "force" the data to
   Unicode: If there is no encoding, guess latin-1. If the string
   fails to convert, convert it with "replace". If the encoding is
   unknown, replace all non-printable characters with question marks.
   Whether this is acceptable depends on how frequently the problem
   occurs and whose fault that is (e.g. an unknown encoding should be
   added to Mailman).

2. When converting an application that used to be byte-oriented to
   Unicode, adding conversions at all required places might be too
   much effort, or breakage because of incorrect data might be
   unacceptable.

   In these cases, I recommend adding type tests at strategic places,
   and papering over any incorrect data. E.g. in this case, you could
   write a function

def unicode_is_pure_ascii(text):
    if type(text) is types.UnicodeType:
        try:
            text.encode("ascii")
            return 1
        except UnicodeError:
            return 0
    if DEBUG:
        raise DebugError, "string not unicode:"+repr(text)
    try:
        unicode(text,"ascii")
        return 1
    except UnicodeError:
        return 0

If you expect name to be a byte string, the function would be
bytes_are_ascii, of course.

Regards,
Martin

From tkikuchi@is.kochi-u.ac.jp Fri Sep 20 03:06:05 2002
From: tkikuchi@is.kochi-u.ac.jp (Tokio Kikuchi)
Date: Fri, 20 Sep 2002 11:06:05 +0900
Subject: [Mailman-i18n] "Funny" characters in real names?
References: <15746.14309.516823.293632@anthem.wooz.org> <3D8248C5.5040205@debian.org> <15746.31280.719924.780772@anthem.wooz.org> <15749.2812.524261.962949@anthem.wooz.org> <15751.38782.74609.163241@anthem.wooz.org> <3D879D2B.5080600@debian.org> <15751.40940.902092.18578@anthem.wooz.org> <3D87CB98.8010908@is.kochi-u.ac.jp> <15751.56377.321564.625249@anthem.wooz.org> <3D87E770.9000903@is.kochi-u.ac.jp> <15752.4623.531823.282922@anthem.wooz.org> <3D881A87.7030001@is.kochi-u.ac.jp> <3D8832A9.8060302@is.kochi-u.ac.jp> <3D887863.8070808@is.kochi-u.ac.jp> <3D887F5A.50108@is.kochi-u.ac.jp> <15753.18113.512461.154971@anthem.wooz.org>
Message-ID: <3D8A828D.1000701@is.kochi-u.ac.jp>

Barry, sorry for the hard work, but I still get errors. I've sent a
comment for the cvs-checkins and here are two more.

=========
1. In admin.py, the name 'lang' is used elsewhere and causes a side
   effect.

diff -u ~/src/mailman/Mailman/Cgi/admin.py Mailman/Cgi/admin.py
--- /home/mailman/src/mailman/Mailman/Cgi/admin.py Wed Sep 18 15:37:27 2002
+++ Mailman/Cgi/admin.py Fri Sep 20 10:55:47 2002
@@ -904,11 +904,11 @@
         MemberAdaptor.BYBOUNCE: _('B'),
         }
     # Now populate the rows
-    lang = mlist.preferred_language
+    listlang = mlist.preferred_language
     for addr in members:
         link = Link(mlist.GetOptionsURL(addr, obscure=1),
                     mlist.getMemberCPAddress(addr))
-        fullname = Utils.uncanonstr(mlist.getMemberName(addr), lang)
+        fullname = Utils.uncanonstr(mlist.getMemberName(addr), listlang)
         name = TextBox(addr + '_realname', fullname, size=longest).Format()
         cells = [Center(CheckBox(addr + '_unsub', 'off', 0).Format()),
                  link.Format() + '<br>' +

=====
2. JapaneseCodecs raises LookupError instead of UnicodeError.

--- /home/mailman/src/mailman/Mailman/Utils.py Wed Sep 18 15:37:26 2002
+++ Mailman/Utils.py Fri Sep 20 10:38:50 2002
@@ -753,7 +753,7 @@
     charset = GetCharSet(lang)
     try:
         return s.encode(charset, 'strict')
-    except UnicodeError:
+    except (UnicodeError, LookupError):
         a = []
         for c in s:
             o = ord(c)

-- 
Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/

From barry@zope.com Sat Sep 21 18:00:43 2002
From: barry@zope.com (Barry A. Warsaw)
Date: Sat, 21 Sep 2002 13:00:43 -0400
Subject: [Mailman-i18n] Unicode in headers
Message-ID: <15756.42427.421288.925121@anthem.wooz.org>

I've been trying to fix the outstanding problems with "funny"
characters in real names in Mailman[*] and along the way I ran into a
situation that I /think/ needs to be addressed in the email package.
I'm not sure this is a good fix, let alone the right fix so I wanted
to get some feedback from these two mailing lists.

Say I create a Header instance like so:

    from email.Header import Header
    h = Header(u'[P\xf6stal]', 'us-ascii')
    s = str(h)

what would you expect the value of s to be?

It's a bit of a trick question because in the current version, the
str(h) will raise a UnicodeError since the h.encode() will be a
unicode string containing non-ascii characters.

But I think this may not be the right thing to do. For one thing,
we're saying we want the header to be in the us-ascii character set.
For another, the RFCs state that headers need to be ascii characters
and we should encode them if necessary. OTOH, what we're doing /is/ a
bit bogus since the value is clearly not in the requested character
set. But OTOOH, I don't think we should have to check the value and
do a bunch of coercion before we create the Header instance.

My proposal is to do a type check in Header.__str__() so that if the
value of self.encode() returns a unicode string, we will coerce it to
an 8-bit string like so:

    def __str__(self):
        """A synonym for self.encode().

        Guarantees that the return value contains only ASCII characters.
        """
        s = self.encode()
        if isinstance(s, type(u'')):
            return s.encode(str(self._charset), 'replace')
        return s

Here's a new test case that fails without this change, but succeeds
with it (with no regressions).

    def test_unicode_value(self):
        eq = self.assertEqual
        v = u'[P\xf6stal]'
        h = Header(v, 'us-ascii')
        eq(str(h), '[P?stal]')

In the view of doing what's most useful, I'd like to make this change,
but I still don't trust my judgement about things unicode, so I'd like
to get some other opinions.

If we don't do this, then we'll probably have to add some defense in
Generator._write_headers(), which wants to do

    text = '%s: %s' % (h, v)

That'll raise the UnicodeError in this situation, and because this can
be fairly widely removed from what might be considered the real error,
it's difficult to debug.

-Barry

[*] BTW, Martin, Ben, Tokio and others have been very helpful here.
Thanks! And I hope to have fixes in place soon.

From Dan@feld.cvut.cz Sat Sep 21 21:12:28 2002
From: Dan@feld.cvut.cz (Dan Ohnesorg)
Date: Sat, 21 Sep 2002 22:12:28 +0200
Subject: [Mailman-i18n] Unicode in headers
In-Reply-To: <15756.42427.421288.925121@anthem.wooz.org>
References: <15756.42427.421288.925121@anthem.wooz.org>
Message-ID: <20020921201228.GB1717@ohnesorg.cz>

On Sat, Sep 21, 2002 at 01:00:43PM -0400, Barry A. Warsaw wrote:

> from email.Header import Header
> h = Header(u'[P\xf6stal]', 'us-ascii')
> s = str(h)
>
> what would you expect the value of s to be?

Something like

=?utf-8?Q?P=f6stal?=

According to the RFC we should find the most simple encoding, which is
UTF8.
In Czech we use ISO-8859-2 and we check if there are only ASCII
characters = we are using ascii, or if there are some other characters
we are using ISO-8859-2.

So the way can be:

- are there only ASCII characters = OK let it be
- are there only characters from locale preferred encoding = use locale
  encoding
- in other cases, use UTF.

cheers
dan
-- 
-----------------------------------------------------------
/ Dan Ohnesorg Dan@ohnesorg.cz \
< Jinočanská 7 252 19 Rudná u Prahy >
\ tel: +420 311 679679 +420 311 679976 fax: +420 311 679311 /
-----------------------------------------------------------

From loewis@informatik.hu-berlin.de Sat Sep 21 22:08:36 2002
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Sat, 21 Sep 2002 23:08:36 +0200 (CEST)
Subject: [Mailman-i18n] Unicode in headers
In-Reply-To: <15756.42427.421288.925121@anthem.wooz.org> from "Barry A. Warsaw" at "Sep 21, 2002 01:00:43 pm"
Message-ID: <200209212108.g8LL8avs018452@paros.informatik.hu-berlin.de>

> from email.Header import Header
> h = Header(u'[P\xf6stal]', 'us-ascii')
> s = str(h)
[...]
> But I think this may not be the right thing to do. For one thing,
> we're saying we want the header to be in the us-ascii character set.

I think you are confusing issues here: You are *not* saying that you
want the header to be in us-ascii. Instead, (to quote the docstring)

    Specify both s's character set, and the default character set by
    setting the charset argument to a Charset object

You need this argument to specify the encoding of the string *you are
passing*, not (primarily) of the resulting Header. Since the argument
is a Unicode string and not a byte string, the encoding argument is
superfluous.

Now, the documentation also says that it uses the argument as the
"default character set". By that, it does *not* mean that the entire
header is going to be encoded in that encoding. Instead, it means that
this value is used if later append calls do not declare an encoding.
> My proposal is to do a type check in Header.__str__() so that if the > value of self.encode() returns a unicode string, we will coerce it to > an 8-bit string like so: This is evil. You are losing data without any need. Instead, I propose the following procedure: - if a Unicode argument is passed to Header.__init__ or Header.append, take the encoding only as a hint. As an argument to __init__, also record it as the default for later .append calls. - when encoding the header, encode all Unicode strings with the hint. If that fails, encode them as UTF-8. Regards, Martin From tkikuchi@is.kochi-u.ac.jp Sun Sep 22 06:52:08 2002 From: tkikuchi@is.kochi-u.ac.jp (Tokio Kikuchi) Date: Sun, 22 Sep 2002 14:52:08 +0900 Subject: [Mailman-i18n] Unicode in headers References: <15756.42427.421288.925121@anthem.wooz.org> <20020921201228.GB1717@ohnesorg.cz> Message-ID: <3D8D5A88.9080105@is.kochi-u.ac.jp> Dan Ohnesorg wrote: > Dne Sat, Sep 21, 2002 at 01:00:43PM -0400, Barry A. Warsaw napsal: > > >> from email.Header import Header >> h = Header(u'[P\xf6stal]', 'us-ascii') >> s = str(h) >> >>what would you expect the value of s to be? > - are there only ASCII characters = OK let it be > - are there only characters from locale preferred encoding = use locale > encoding I like this idea but how do you define email's preferred language. In mailman, it will be mm_cfg.DEFAULT_SERVER_LANGUAGE but ,,, > - in other cases, use UTF. I think UTF-8 is OK. Older MUA won't break. 
-- 
Tokio Kikuchi, tkikuchi@ is.kochi-u.ac.jp
http://weather.is.kochi-u.ac.jp/

From Dan@feld.cvut.cz Sun Sep 22 16:38:50 2002
From: Dan@feld.cvut.cz (Dan Ohnesorg)
Date: Sun, 22 Sep 2002 17:38:50 +0200
Subject: [Mailman-i18n] Unicode in headers
In-Reply-To: <3D8D5A88.9080105@is.kochi-u.ac.jp>
References: <15756.42427.421288.925121@anthem.wooz.org> <20020921201228.GB1717@ohnesorg.cz> <3D8D5A88.9080105@is.kochi-u.ac.jp>
Message-ID: <20020922153850.GA11870@ohnesorg.cz>

On Sun, Sep 22, 2002 at 02:52:08PM +0900, Tokio Kikuchi wrote:

> >- are there only ASCII characters = OK let it be
> >- are there only characters from locale preferred encoding = use locale
> >encoding
>
> I like this idea but how do you define email's preferred language.
> In mailman, it will be mm_cfg.DEFAULT_SERVER_LANGUAGE but ,,,

It is solved very well in mutt, in the file sendlib.c, line 800 and
above. I am also sending the comments from the file. In mutt I have a
variable which holds a list of encodings in order of preference:

/*
 * Find the best charset conversion of the file from fromcode into one
 * of the tocodes. If successful, set *tocode and CONTENT *info and
 * return the number of characters converted inexactly. If no
 * conversion was possible, return -1.
 *
 * We convert via UTF-8 in order to avoid the condition -1(EINVAL),
 * which would otherwise prevent us from knowing the number of inexact
 * conversions. Where the candidate target charset is UTF-8 we avoid
 * doing the second conversion because iconv_open("UTF-8", "UTF-8")
 * fails with some libraries.
 *
 * We assume that the output from iconv is never more than 4 times as
 * long as the input for any pair of charsets we might be interested
 * in.
 */

/*
 * Find the first of the fromcodes that gives a valid conversion and
 * the best charset conversion of the file into one of the tocodes. If
 * successful, set *fromcode and *tocode to dynamically allocated
 * strings, set CONTENT *info, and return the number of characters
 * converted inexactly. If no conversion was possible, return -1.
 *
 * Both fromcodes and tocodes may be colon-separated lists of charsets.
 * However, if fromcode is zero then fromcodes is assumed to be the
 * name of a single charset even if it contains a colon.
 */

cheers
dan
-- 
-----------------------------------------------------------
/ Dan Ohnesorg Dan@ohnesorg.cz \
< Jinočanská 7 252 19 Rudná u Prahy >
\ tel: +420 311 679679 +420 311 679976 fax: +420 311 679311 /
-----------------------------------------------------------

From barry@zope.com Sun Sep 22 18:30:20 2002
From: barry@zope.com (Barry A. Warsaw)
Date: Sun, 22 Sep 2002 13:30:20 -0400
Subject: [Mailman-i18n] Unicode in headers
References: <15756.42427.421288.925121@anthem.wooz.org> <200209212108.g8LL8avs018452@paros.informatik.hu-berlin.de>
Message-ID: <15757.65068.972135.810802@anthem.wooz.org>

>>>>> "MvL" == Martin von Loewis writes:

    MvL> You need this argument to specify the encoding of the string
    MvL> *you are passing*, not (primarily) of the resulting
    MvL> Header. Since the argument is a Unicode string and not a byte
    MvL> string, the encoding argument is superfluous.

D'oh, of course you're right Martin.

    >> My proposal is to do a type check in Header.__str__() so that
    >> if the value of self.encode() returns a unicode string, we will
    >> coerce it to an 8-bit string like so:

    MvL> This is evil. You are losing data without any need.

    MvL> Instead, I propose the following procedure: - if a Unicode
    MvL> argument is passed to Header.__init__ or Header.append,
    MvL> take the encoding only as a hint. As an argument to
    MvL> __init__, also record it as the default for later .append
    MvL> calls.

    MvL> - when encoding the header, encode all Unicode strings with
    MvL> the hint. If that fails, encode them as UTF-8.

Alternatively, we could try to provoke a UnicodeError early, at the
__init__ or .append call by doing something like:

    def append(self, s, charset=None):
        # ...
        # Encoding check. Better to know now whether we'll have an encoding
        # error than when we try to str'ify the header. Let UnicodeErrors
        # percolate to the caller.
        if _isunicode(s):
            s.encode(str(charset))
        else:
            unicode(s, str(charset))
        self._chunks.append((s, charset))

In other words, the caller is claiming that the string being passed in
is encoded with the given character set (or the default if None is
used). Fine, let's check that here since it will be easier to debug
if the UnicodeError is raised now, rather than when the Generator
tries to print the message header.

I think I could live with that, and will work out a different
algorithm in Mailman.

-Barry

From loewis@informatik.hu-berlin.de Mon Sep 23 10:18:45 2002
From: loewis@informatik.hu-berlin.de (Martin v. =?iso-8859-1?q?L=F6wis?=)
Date: 23 Sep 2002 11:18:45 +0200
Subject: [Mailman-i18n] Unicode in headers
In-Reply-To: <15757.65068.972135.810802@anthem.wooz.org>
References: <15756.42427.421288.925121@anthem.wooz.org> <200209212108.g8LL8avs018452@paros.informatik.hu-berlin.de> <15757.65068.972135.810802@anthem.wooz.org>
Message-ID:

barry@zope.com (Barry A. Warsaw) writes:

> Alternatively, we could try to provoke a UnicodeError early, at the
> __init__ or .append call by doing something like:

I see no reason to provoke a UnicodeError at all. An exception should
only be raised if the library cannot correctly process the data being
passed, or if the requested processing is ambiguous.

In this case, neither is the case: there is a perfectly correct and
meaningful processing of the data. If you raise an exception, the
application would need to deal with it just in the same way as I
propose.

> I think I could live with that, and will work out a different
> algorithm in Mailman.

I think users of the email package will find it more acceptable if no
exception is raised.

Regards,
Martin
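
The procedure proposed earlier in this thread (treat the charset as a
hint, encode with it, and fall back to UTF-8 when that fails) can be
sketched roughly as below. This is only an illustration under stated
assumptions: it is written for present-day Python 3 rather than the
Python 2.1 the thread targets, and encode_with_fallback is a
hypothetical helper, not a function of the email package or of Mailman.

```python
def encode_with_fallback(text, hint="us-ascii", fallback="utf-8"):
    """Return (encoded_bytes, charset_used) for a text string.

    Hypothetical helper for illustration only. The hint is tried
    first; if the text cannot be represented in it, the text is
    encoded with the fallback instead of raising or replacing
    characters, so no data is lost.
    """
    try:
        return text.encode(hint), hint
    except (UnicodeError, LookupError):
        # UnicodeError: the text does not fit the hinted charset.
        # LookupError: the hinted codec name is unknown.
        return text.encode(fallback), fallback


# Pure ASCII text stays in the hinted charset.
ascii_name = encode_with_fallback("Postal")

# Text that fits the hint is encoded with the hint.
latin_name = encode_with_fallback("[P\xf6stal]", hint="iso-8859-1")

# Text that does not fit the hint falls back to UTF-8.
fallback_name = encode_with_fallback("[P\xf6stal]", hint="us-ascii")
```

The returned charset name is what a caller would need in order to label
the encoded value (for example in an RFC 2047 encoded-word); that
labelling step is deliberately left out of the sketch.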