From jim at zope.com Tue Nov 15 21:39:25 2005 From: jim at zope.com (Jim Fulton) Date: Tue, 15 Nov 2005 15:39:25 -0500 Subject: [I18n-sig] locale-specific string sorting for server applications? Message-ID: <1132087165.16795.8.camel@localhost.localdomain> We're looking at locale-specific sorting of strings in Zope. The locale module doesn't seem to help because: - It's not thread safe (yeah, we can hack around that), and - It seems to depend on unpredictable host data. In 2002, Martin von Löwis was interested in creating a Python wrapper around ICU for this purpose, but I don't see anything recent. Does anyone know if anything happened with this? Does anyone have any other suggestions? Jim -- Jim Fulton mailto:jim at zope.com Python Powered! CTO (540) 361-1714 http://www.python.org Zope Corporation http://www.zope.com http://www.zope.org From jim at zope.com Wed Nov 16 16:25:50 2005 From: jim at zope.com (Jim Fulton) Date: Wed, 16 Nov 2005 10:25:50 -0500 Subject: [I18n-sig] Anyone here? :) Message-ID: <1132154750.7962.20.camel@localhost.localdomain> Hi, I've noticed that: - Most of the email to this list recently has been spam. - The contact email for the SIG coordinator, Andy Robinson, doesn't seem to be valid anymore. - My recent question to the list has gone unanswered for almost a day! ;) Should the SIG be retired? If not, I suggest managing the mailing list a bit differently, holding messages from non-subscribers. If we decide to keep the sig, and someone answers my question, then I'd be willing to take over management of the list. Jim -- Jim Fulton mailto:jim at zope.com Python Powered! CTO (540) 361-1714 http://www.python.org Zope Corporation http://www.zope.com http://www.zope.org From RD6T-KJYM at asahi-net.or.jp Wed Nov 16 16:25:05 2005 From: RD6T-KJYM at asahi-net.or.jp (Tamito KAJIYAMA) Date: Thu, 17 Nov 2005 00:25:05 +0900 Subject: [I18n-sig] Anyone here? 
:) In-Reply-To: <1132154750.7962.20.camel@localhost.localdomain> (message from Jim Fulton on Wed, 16 Nov 2005 10:25:50 -0500) References: <1132154750.7962.20.camel@localhost.localdomain> Message-ID: <200511161525.jAGFP5h01333@bmdi0141.bmobile.ne.jp> Hi, I'd like to post a reply, although I have no idea with regard to your question on locale-aware sort. | Should the SIG be retired? Honestly to say, I'm less interested in this SIG than ever, since we have a rich set of Unicode codecs in the 2.4 series, and IMHO the largest part of Python i18n has been done with the current Unicode support. However, it seems to me a little bit early to break up the SIG, as there are other i18n-related topics (with regard to which I don't expect I could make some contribution, though). | If not, I suggest managing the mailing | list a bit differently, holding messages from non-subscribers. +1. -- KAJIYAMA, Tamito Jim Fulton writes: | | Hi, | | I've noticed that: | | - Most of the email to this list recently has been spam. | | - The contact email for the SIG coordinator, Andy Robinson, doesn't seem | to be valid anymore. | | - My recent question to the list has gone unanswered for almost | a day! ;) | | Should the SIG be retired? If not, I suggest managing the mailing | list a bit differently, holding messages from non-subscribers. | If we decide to keep the sig, and someone answers | my question , then I'd be willing to take over management of the | list. | | Jim | | -- | Jim Fulton mailto:jim at zope.com Python Powered! | CTO (540) 361-1714 http://www.python.org | Zope Corporation http://www.zope.com http://www.zope.org From jim at zope.com Wed Nov 16 17:07:36 2005 From: jim at zope.com (Jim Fulton) Date: Wed, 16 Nov 2005 11:07:36 -0500 Subject: [I18n-sig] locale-aware sorting (was Re: Anyone here? 
:) In-Reply-To: <200511161525.jAGFP5h01333@bmdi0141.bmobile.ne.jp> References: <1132154750.7962.20.camel@localhost.localdomain> <200511161525.jAGFP5h01333@bmdi0141.bmobile.ne.jp> Message-ID: <1132157256.7962.23.camel@localhost.localdomain> On Thu, 2005-11-17 at 00:25 +0900, Tamito KAJIYAMA wrote: > Hi, > > I'd like to post a reply, although I have no idea with regard to > your question on locale-aware sort. Was my question too terse? Or do you simply not have an answer? Jim -- Jim Fulton mailto:jim at zope.com Python Powered! CTO (540) 361-1714 http://www.python.org Zope Corporation http://www.zope.com http://www.zope.org From RD6T-KJYM at asahi-net.or.jp Wed Nov 16 17:01:21 2005 From: RD6T-KJYM at asahi-net.or.jp (Tamito KAJIYAMA) Date: Thu, 17 Nov 2005 01:01:21 +0900 Subject: [I18n-sig] locale-aware sorting (was Re: Anyone here? :) In-Reply-To: <1132157256.7962.23.camel@localhost.localdomain> (message from Jim Fulton on Wed, 16 Nov 2005 11:07:36 -0500) References: <1132157256.7962.23.camel@localhost.localdomain> Message-ID: <200511161601.jAGG1Lj01392@bmdi0141.bmobile.ne.jp> Hi again, Jim Fulton writes: | | On Thu, 2005-11-17 at 00:25 +0900, Tamito KAJIYAMA wrote: | > Hi, | > | > I'd like to post a reply, although I have no idea with regard to | > your question on locale-aware sort. | | Was my question too terse? Or do you simply not have an answer? | | Jim | | -- | Jim Fulton mailto:jim at zope.com Python Powered! | CTO (540) 361-1714 http://www.python.org | Zope Corporation http://www.zope.com http://www.zope.org The latter: I simply do not have an answer. -- KAJIYAMA, Tamito From mal at egenix.com Wed Nov 16 18:51:53 2005 From: mal at egenix.com (M.-A. Lemburg) Date: Wed, 16 Nov 2005 18:51:53 +0100 Subject: [I18n-sig] Anyone here? 
:) In-Reply-To: <200511161525.jAGFP5h01333@bmdi0141.bmobile.ne.jp> References: <1132154750.7962.20.camel@localhost.localdomain> <200511161525.jAGFP5h01333@bmdi0141.bmobile.ne.jp> Message-ID: <437B71B9.3060400@egenix.com> Tamito KAJIYAMA wrote: > Hi, > > I'd like to post a reply, although I have no idea with regard to > your question on locale-aware sort. > > | Should the SIG be retired? > > Honestly to say, I'm less interested in this SIG than ever, > since we have a rich set of Unicode codecs in the 2.4 series, > and IMHO the largest part of Python i18n has been done with the > current Unicode support. > > However, it seems to me a little bit early to break up the SIG, > as there are other i18n-related topics (with regard to which I > don't expect I could make some contribution, though). There's still a lot to do on the Unicode front: * Unicode collation support (this is what Jim's after) http://www.unicode.org/reports/tr10/ * Unicode compression http://www.unicode.org/reports/tr6/ * Adding Locale Data support: http://www.unicode.org/reports/tr35/tr35-5.html http://www.unicode.org/cldr/ This is a rather new option and would finally make Python independent from the libc locale support which causes much trouble in user-land. > | If not, I suggest managing the mailing > | list a bit differently, holding messages from non-subscribers. > > +1. Only if you are willing to administer the mailing list. Note that the spam that's coming in on python.org lists can make this a nasty task. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 16 2005) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ ::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 
:::: From jim at zope.com Wed Nov 16 19:02:00 2005 From: jim at zope.com (Jim Fulton) Date: Wed, 16 Nov 2005 13:02:00 -0500 Subject: [I18n-sig] Anyone here? :) In-Reply-To: <437B71B9.3060400@egenix.com> References: <1132154750.7962.20.camel@localhost.localdomain> <200511161525.jAGFP5h01333@bmdi0141.bmobile.ne.jp> <437B71B9.3060400@egenix.com> Message-ID: <1132164120.7962.30.camel@localhost.localdomain> On Wed, 2005-11-16 at 18:51 +0100, M.-A. Lemburg wrote: > Tamito KAJIYAMA wrote: > > Hi, > > > > I'd like to post a reply, although I have no idea with regard to > > your question on locale-aware sort. > > > > | Should the SIG be retired? > > > > Honestly to say, I'm less interested in this SIG than ever, > > since we have a rich set of Unicode codecs in the 2.4 series, > > and IMHO the largest part of Python i18n has been done with the > > current Unicode support. > > > > However, it seems to me a little bit early to break up the SIG, > > as there are other i18n-related topics (with regard to which I > > don't expect I could make some contribution, though). > > There's still a lot to do on the Unicode front: > > * Unicode collation support (this is what Jim's after) > > http://www.unicode.org/reports/tr10/ > > * Unicode compression > > http://www.unicode.org/reports/tr6/ > > * Adding Locale Data support: > > http://www.unicode.org/reports/tr35/tr35-5.html > http://www.unicode.org/cldr/ > > This is a rather new option and would finally make > Python independent from the libc locale support which > causes much trouble in user-land. What is a rather new option? CLDR? Hasn't ICU covered this ground already? In any case, is someone working on leveraging CLDR in Python? Are you aware of any work to wrap ICU besides PICU and the work Stephan Richter did leveraging some of the ICU data in Zope? > > | If not, I suggest managing the mailing > > | list a bit differently, holding messages from non-subscribers. > > > > +1. 
> > Only if you are willing to administer the mailing list. I am. > Note that the spam that's coming in on python.org lists > can make this a nasty task. Yeah. I do this for some Zope lists. It's a pain, but I think it's necessary for the list to be useful. Jim -- Jim Fulton mailto:jim at zope.com Python Powered! CTO (540) 361-1714 http://www.python.org Zope Corporation http://www.zope.com http://www.zope.org From tree at basistech.com Wed Nov 16 19:05:31 2005 From: tree at basistech.com (Tom Emerson) Date: Wed, 16 Nov 2005 13:05:31 -0500 Subject: [I18n-sig] Anyone here? :) In-Reply-To: <1132164120.7962.30.camel@localhost.localdomain> References: <1132154750.7962.20.camel@localhost.localdomain> <200511161525.jAGFP5h01333@bmdi0141.bmobile.ne.jp> <437B71B9.3060400@egenix.com> <1132164120.7962.30.camel@localhost.localdomain> Message-ID: <17275.29931.988846.609010@tiphares.basistech.net> Jim Fulton writes: > Are you aware of any work to wrap ICU besides PICU and the work > Stephan Richter did leveraging some of the ICU data in Zope? I've had an itch for a while to provide the ICU break iterators in Python, but each time I go to scratch it I get distracted. Is this something that people would find generally useful? -- Tom Emerson Basis Technology Corp. Software Architect http://www.basistech.com "You can't fake quality any more than you can fake a good meal." (W.S.B.) From barry at python.org Wed Nov 16 20:05:47 2005 From: barry at python.org (Barry Warsaw) Date: Wed, 16 Nov 2005 14:05:47 -0500 Subject: [I18n-sig] Anyone here? :) In-Reply-To: <1132164120.7962.30.camel@localhost.localdomain> References: <1132154750.7962.20.camel@localhost.localdomain> <200511161525.jAGFP5h01333@bmdi0141.bmobile.ne.jp> <437B71B9.3060400@egenix.com> <1132164120.7962.30.camel@localhost.localdomain> Message-ID: <1132167947.11600.89.camel@geddy.wooz.org> On Wed, 2005-11-16 at 13:02 -0500, Jim Fulton wrote: > > Only if you are willing to administer the mailing list. > > I am. 
> > > Note that the spam that's coming in on python.org lists > > can make this a nasty task. > > Yeah. I do this for some Zope lists. It's a pain, but I think > it's necessary for the list to be useful. Jim, I'm happy to make you the list administrator for this sig. Send a confirmation message just to me and I'll make it happen. -Barry From mal at egenix.com Wed Nov 16 21:40:45 2005 From: mal at egenix.com (M.-A. Lemburg) Date: Wed, 16 Nov 2005 21:40:45 +0100 Subject: [I18n-sig] Anyone here? :) In-Reply-To: <1132164120.7962.30.camel@localhost.localdomain> References: <1132154750.7962.20.camel@localhost.localdomain> <200511161525.jAGFP5h01333@bmdi0141.bmobile.ne.jp> <437B71B9.3060400@egenix.com> <1132164120.7962.30.camel@localhost.localdomain> Message-ID: <437B994D.8060401@egenix.com> Jim Fulton wrote: >>There's still a lot to do on the Unicode front: >> >>* Unicode collation support (this is what Jim's after) >> >> http://www.unicode.org/reports/tr10/ >> >>* Unicode compression >> >> http://www.unicode.org/reports/tr6/ >> >>* Adding Locale Data support: >> >> http://www.unicode.org/reports/tr35/tr35-5.html >> http://www.unicode.org/cldr/ >> >> This is a rather new option and would finally make >> Python independent from the libc locale support which >> causes much trouble in user-land. > > > What is a rather new option? CLDR? Yes, CLDR. > Hasn't ICU covered this ground already? No idea, I haven't looked at ICU yet - but from a glimpse at the site: http://www-306.ibm.com/software/globalization/icu/index.jsp it looks as if most things are available in ICU. 
The downside is its footprint: http://icu.sourceforge.net/charts/icu4c_footprint.html Not sure about the in-memory footprint, though. > In any case, is someone working on leveraging CLDR in Python? > > Are you aware of any work to wrap ICU besides PICU and the work > Stephan Richter did leveraging some of the ICU data in Zope? No. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 16 2005) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ ::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! :::: From andy at reportlab.com Wed Nov 16 21:43:45 2005 From: andy at reportlab.com (Andy Robinson) Date: Wed, 16 Nov 2005 20:43:45 +0000 Subject: [I18n-sig] Anyone here? :) In-Reply-To: <1132154750.7962.20.camel@localhost.localdomain> References: <1132154750.7962.20.camel@localhost.localdomain> Message-ID: <437B9A01.8080204@reportlab.com> Jim Fulton wrote: > Hi, > > I've noticed that: > > - Most of the email to this list recently has been spam. > > - The contact email for the SIG coordinator, Andy Robinson, doesn't seem > to be valid anymore. > > - My recent question to the list has gone unanswered for almost > a day! ;) > Hi all, How timely. I owe you all a large apology. I was under the mistaken belief that I had handed over the admin responsibilities about 2 years ago and unsubscribed; I certainly intended to. It turns out that the spam filtering service I have been subscribed to all that time (mailblocks.com, bought by AOL recently and they shut it down permanently today) was blocking the list altogether, although strangely most other mailman lists came through. I just saw this while training Thunderbird on a fresh can of spam... 
In any event I am extremely busy and not working in i18n these days, and I'll have to ask Barry to reset my admin password (he'll confirm that I lose them about once a year) so I would very much prefer if someone else took over. Sorry not to have offered a better service. Best Regards, Andy Robinson From jim at zope.com Wed Nov 16 21:59:54 2005 From: jim at zope.com (Jim Fulton) Date: Wed, 16 Nov 2005 15:59:54 -0500 Subject: [I18n-sig] Anyone here? :) In-Reply-To: <437B994D.8060401@egenix.com> References: <1132154750.7962.20.camel@localhost.localdomain> <200511161525.jAGFP5h01333@bmdi0141.bmobile.ne.jp> <437B71B9.3060400@egenix.com> <1132164120.7962.30.camel@localhost.localdomain> <437B994D.8060401@egenix.com> Message-ID: <1132174794.7962.53.camel@localhost.localdomain> On Wed, 2005-11-16 at 21:40 +0100, M.-A. Lemburg wrote: > Jim Fulton wrote: > >>There's still a lot to do on the Unicode front: > >> > >>* Unicode collation support (this is what Jim's after) > >> > >> http://www.unicode.org/reports/tr10/ > >> > >>* Unicode compression > >> > >> http://www.unicode.org/reports/tr6/ > >> > >>* Adding Locale Data support: > >> > >> http://www.unicode.org/reports/tr35/tr35-5.html > >> http://www.unicode.org/cldr/ > >> > >> This is a rather new option and would finally make > >> Python independent from the libc locale support which > >> causes much trouble in user-land. > > > > > > What is a rather new option? CLDR? > > Yes, CLDR. > > > Hasn't ICU covered this ground already? > > No idea, I haven't looked at ICU yet - but from a glimpse > at the site: > > http://www-306.ibm.com/software/globalization/icu/index.jsp > > it looks as if most things are available in ICU. > > The downside is its footprint: > > http://icu.sourceforge.net/charts/icu4c_footprint.html > > Not sure about the in-memory footprint, though. Looking a bit further, it seems that ICU uses and is the most popular library for using CLDR. > > In any case, is someone working on leveraging CLDR in Python? 
> > > > Are you aware of any work to wrap ICU besides PICU and the work > > Stephan Richter did leveraging some of the ICU data in Zope? > > No. Dang. Jim -- Jim Fulton mailto:jim at zope.com Python Powered! CTO (540) 361-1714 http://www.python.org Zope Corporation http://www.zope.com http://www.zope.org From amk at amk.ca Wed Nov 16 22:05:09 2005 From: amk at amk.ca (A.M. Kuchling) Date: Wed, 16 Nov 2005 16:05:09 -0500 Subject: [I18n-sig] Anyone here? :) In-Reply-To: <437B994D.8060401@egenix.com> References: <1132154750.7962.20.camel@localhost.localdomain> <200511161525.jAGFP5h01333@bmdi0141.bmobile.ne.jp> <437B71B9.3060400@egenix.com> <1132164120.7962.30.camel@localhost.localdomain> <437B994D.8060401@egenix.com> Message-ID: <20051116210509.GA6317@rogue.amk.ca> On Wed, Nov 16, 2005 at 09:40:45PM +0100, M.-A. Lemburg wrote: > No idea, I haven't looked at ICU yet - but from a glimpse > at the site: > http://www-306.ibm.com/software/globalization/icu/index.jsp > it looks as if most things are available in ICU. The Parrot VM uses ICU for its Unicode implementation. You could ask the Parrot implementors what their experience with ICU has been. (I don't believe they've had any problems with it, but don't follow Parrot that closely.) --amk From jim at zope.com Fri Nov 18 19:23:13 2005 From: jim at zope.com (Jim Fulton) Date: Fri, 18 Nov 2005 13:23:13 -0500 Subject: [I18n-sig] Python binding for ICU Message-ID: <1132338194.9002.15.camel@localhost.localdomain> Hey look: http://pyicu.osafoundation.org/ :) I think this is great news. I hope people who need this stuff bother to try it out, provide input, contribute, etc. (convince them to use PyRex rather than SWIG :). Jim -- Jim Fulton mailto:jim at zope.com Python Powered! 
CTO (540) 361-1714 http://www.python.org Zope Corporation http://www.zope.com http://www.zope.org From 2005 at kuarepoti-dju.net Fri Nov 25 17:07:42 2005 From: 2005 at kuarepoti-dju.net (Josef Spillner) Date: Fri, 25 Nov 2005 17:07:42 +0100 Subject: [I18n-sig] Format strings Message-ID: <20051125161303.53FC8A848DC@pomo.hostsharing.net> Hi, as pointed out by Adeodato Simó, there exists a discrepancy between the intuitive and the (C-legacy-based?) actual handling of format strings with unicode arguments: http://chistera.yi.org/~adeodato/blog/misc/44_utf8_printf.html This does indeed seem to be problematic, and I cannot see any legitimate reason for this. Can this be fixed for the upcoming version please? Josef From mal at egenix.com Fri Nov 25 19:14:53 2005 From: mal at egenix.com (M.-A. Lemburg) Date: Fri, 25 Nov 2005 19:14:53 +0100 Subject: [I18n-sig] Format strings In-Reply-To: <20051125161303.53FC8A848DC@pomo.hostsharing.net> References: <20051125161303.53FC8A848DC@pomo.hostsharing.net> Message-ID: <4387549D.9070606@egenix.com> Josef Spillner wrote: > Hi, > > as pointed out by Adeodato Simó, there exists a discrepancy between the > intuitive and the (C-legacy-based?) actual handling of format strings with > unicode arguments: > > http://chistera.yi.org/~adeodato/blog/misc/44_utf8_printf.html I don't see the relationship to Python in that posting... > This does indeed seem to be problematic, and I cannot see any legitimate > reason for this. Can this be fixed for the upcoming version please? Please give an example. Thanks, -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 25 2005) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ ::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 
:::: From 2005 at kuarepoti-dju.net Fri Nov 25 19:45:15 2005 From: 2005 at kuarepoti-dju.net (Josef Spillner) Date: Fri, 25 Nov 2005 19:45:15 +0100 Subject: [I18n-sig] Format strings In-Reply-To: <4387549D.9070606@egenix.com> References: <20051125161303.53FC8A848DC@pomo.hostsharing.net> <4387549D.9070606@egenix.com> Message-ID: <200511251945.16580.2005@kuarepoti-dju.net> El Viernes, 25. Noviembre 2005 19:14, escribió: > I don't see the relationship to Python in that posting... The following should demonstrate it:

# -*- coding: utf-8 -*-
print "'%2s'" % "a"
print "'%2s'" % "?"
print "'%2s'" % u"?"

In the second case, while the string literal is recognized as utf-8 (thus two bytes being one character in this case), it eats the two character format string alone and doesn't leave any space for the empty character. Note that if the file encoding is not given, then it would display as '??', which is correct under the circumstances. But in general, I don't see why line two in the example above cannot be like line three. It is not intuitive to only have one character printed as opposed to the two that are requested from the format string. Actually, a related question: why are string objects ASCII by default instead of the encoding specified at the beginning of the file? Are there any plans to merge the "unicode" string functionality into basic strings? Josef From martin at v.loewis.de Fri Nov 25 23:16:17 2005 From: martin at v.loewis.de (Martin v. Löwis) Date: Fri, 25 Nov 2005 23:16:17 +0100 Subject: [I18n-sig] Format strings In-Reply-To: <200511251945.16580.2005@kuarepoti-dju.net> References: <20051125161303.53FC8A848DC@pomo.hostsharing.net> <4387549D.9070606@egenix.com> <200511251945.16580.2005@kuarepoti-dju.net> Message-ID: <43878D31.40501@v.loewis.de> Josef Spillner wrote: > # -*- coding: utf-8 -*- > print "'%2s'" % "a" > print "'%2s'" % "?" > print "'%2s'" % u"?" 
> > In the second case, while the string literal is recognized as utf-8 (thus two > bytes being one character in this case), it eats the two character format > string alone and doesn't leave any space for the empty character. This is correct behaviour, and by design. > Note that if the file encoding is not given, then it would display as '??', > which is correct under the circumstances. It is correct either way. A byte string is a byte string is a byte string is a string of bytes is not a Unicode string. The string in the second print statement actually *has* two bytes, so that it takes two bytes of output is correct. Regards, Martin From 2005 at kuarepoti-dju.net Mon Nov 28 10:23:36 2005 From: 2005 at kuarepoti-dju.net (Josef Spillner) Date: Mon, 28 Nov 2005 10:23:36 +0100 Subject: [I18n-sig] Format strings In-Reply-To: <43878D31.40501@v.loewis.de> References: <20051125161303.53FC8A848DC@pomo.hostsharing.net> <200511251945.16580.2005@kuarepoti-dju.net> <43878D31.40501@v.loewis.de> Message-ID: <200511281023.38295.2005@kuarepoti-dju.net> El Viernes, 25. Noviembre 2005 23:16, escribió: > It is correct either way. A byte string is a byte string is a byte > string is a string of bytes is not a Unicode string. That was the second part of my question. If a programmer writes down a string, and the source file encoding is declared to be utf-8, why then is the string still not encoded in utf-8 by default? Why all the hassle of using u"..." instead of making it the default? There is a lot of python source code I maintain, and it would simplify coding a lot if this could be made the default. Josef From mal at egenix.com Mon Nov 28 12:55:40 2005 From: mal at egenix.com (M.-A. 
Lemburg) Date: Mon, 28 Nov 2005 12:55:40 +0100 Subject: [I18n-sig] Format strings In-Reply-To: <200511281023.38295.2005@kuarepoti-dju.net> References: <20051125161303.53FC8A848DC@pomo.hostsharing.net> <200511251945.16580.2005@kuarepoti-dju.net> <43878D31.40501@v.loewis.de> <200511281023.38295.2005@kuarepoti-dju.net> Message-ID: <438AF03C.3000903@egenix.com> Josef Spillner wrote: > El Viernes, 25. Noviembre 2005 23:16, escribió: > >>It is correct either way. A byte string is a byte string is a byte >>string is a string of bytes is not a Unicode string. > > > That was the second part of my question. If a programmer writes down a string, > and the source file encoding is declared to be utf-8, why then is the string > still not encoded in utf-8 by default? Because the source code encoding is only used to decode the Unicode literals in the source code into Unicode objects. Plain string literals do not have an encoding attached and are regarded as plain byte code strings. As a result, they are passed through the decoding mechanism by reencoding them after first decoding them to Unicode (using the source code encoding). > Why all the hassle of using u"..." instead of making it the default? This will happen in Python 3.0. > There is a lot of python source code I maintain, and it would simplify coding > a lot if this could be made the default. Indeed, but it potentially also breaks a lot of code since Python and the many extensions for it are not yet fully Unicode compatible. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Nov 28 2005) >>> Python/Zope Consulting and Support ... http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/ ________________________________________________________________________ ::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 
:::: From martin at v.loewis.de Mon Nov 28 20:46:15 2005 From: martin at v.loewis.de (Martin v. Löwis) Date: Mon, 28 Nov 2005 20:46:15 +0100 Subject: [I18n-sig] Format strings In-Reply-To: <200511281023.38295.2005@kuarepoti-dju.net> References: <20051125161303.53FC8A848DC@pomo.hostsharing.net> <200511251945.16580.2005@kuarepoti-dju.net> <43878D31.40501@v.loewis.de> <200511281023.38295.2005@kuarepoti-dju.net> Message-ID: <438B5E87.6000908@v.loewis.de> Josef Spillner wrote: > El Viernes, 25. Noviembre 2005 23:16, escribió: > >>It is correct either way. A byte string is a byte string is a byte >>string is a string of bytes is not a Unicode string. > > > That was the second part of my question. If a programmer writes down a string, > and the source file encoding is declared to be utf-8, why then is the string > still not encoded in utf-8 by default? But it is encoded in utf-8! Why do you say it isn't? "be encoded in UTF-8" is different from "be a Unicode string". Unicode strings are a separate data type (different from byte strings). "UTF-8" is a *byte* encoding, so a UTF-8 string is *not* a character string, but a byte string. > Why all the hassle of using u"..." instead of making it the default? > There is a lot of python source code I maintain, and it would simplify coding > a lot if this could be made the default. There is an undocumented -U option which makes all string literals Unicode strings. Please try this out - you will likely find that your application breaks. 
Regards, Martin From 2005 at kuarepoti-dju.net Wed Nov 30 15:36:21 2005 From: 2005 at kuarepoti-dju.net (Josef Spillner) Date: Wed, 30 Nov 2005 15:36:21 +0100 Subject: [I18n-sig] Format strings In-Reply-To: <438AF03C.3000903@egenix.com> References: <20051125161303.53FC8A848DC@pomo.hostsharing.net> <200511281023.38295.2005@kuarepoti-dju.net> <438AF03C.3000903@egenix.com> Message-ID: <200511301536.21811.2005@kuarepoti-dju.net> [I removed the CC:s since we're all subscribed I think.] El Lunes, 28. Noviembre 2005 12:55, escribió: > Plain string literals do not have an encoding attached and > are regarded as plain byte code strings. As a result, they are > passed through the decoding mechanism by reencoding them after > first decoding them to Unicode (using the source code encoding). But (my last remaining question, as it seems), the default encoding of unicode() is "ascii" instead of "utf-8" even for this particular source file which specifies utf-8 encoding. Would changing this to match the source file encoding break applications as well? Note that the documentation is not really helpful about this aspect. I'd like to advocate for an i18n paragraph in the tutorial even, where such behavioural aspects are put into relation with each other, and explained in the concept of modern (and legacy) runtime environment concepts. Or it'd be helpful to link to the Unicode HOWTO from the tutorial/module index. However, both of them contradict slightly, e.g. in the parameter description to unicode(). Compare: [All of its arguments should be 8-bit strings] vs. [if object is a Unicode string or subclass it will return that Unicode string] (actually it should say "Unicode object" below, right?) >> Why all the hassle of using u"..." instead of making it the default? >This will happen in Python 3.0. Ah, nice to know. >> There is a lot of python source code I maintain, and it would simplify >> coding a lot if this could be made the default. 
> Indeed, but it potentially also breaks a lot of code since Python > and the many extensions for it are not yet fully Unicode compatible. I just tested -U on my applications. It seems that the 'random' module is a large offender. Otherwise, it seems to work ok. Some PyGame oddities but those are actually present without -U as well, and I'm going to look into fixing the library. Is anyone coordinating the work, i.e. is there a "unicode compatibility status map" or anything similar? Josef From 2005 at kuarepoti-dju.net Wed Nov 30 15:39:31 2005 From: 2005 at kuarepoti-dju.net (Josef Spillner) Date: Wed, 30 Nov 2005 15:39:31 +0100 Subject: [I18n-sig] Format strings In-Reply-To: <438B5E87.6000908@v.loewis.de> References: <20051125161303.53FC8A848DC@pomo.hostsharing.net> <200511281023.38295.2005@kuarepoti-dju.net> <438B5E87.6000908@v.loewis.de> Message-ID: <200511301539.32170.2005@kuarepoti-dju.net> El Lunes, 28. Noviembre 2005 20:46, Martin v. Löwis escribió: > But it is encoded in utf-8! Why do you say it isn't? "be encoded in > UTF-8" is different from "be a Unicode string". Unicode strings are > a separate data type (different from byte strings). "UTF-8" is a > *byte* encoding, so an UTF-8 string is *not* a character string, > but a byte string. OK, sorry, my mistake. > There is an undocumented -U option which makes all string literals > Unicode strings. Please try this out - you will likely find that > your application breaks. See my other reply. It'd be helpful to have a chapter advertising this option for people to be able to prepare any necessary changes. Of course a prerequisite would be that at least the basic included modules work with it. It could also serve as an invitation for people to help migrating the modules. 
Josef From martin at v.loewis.de Wed Nov 30 23:52:50 2005 From: martin at v.loewis.de (Martin v. Löwis) Date: Wed, 30 Nov 2005 23:52:50 +0100 Subject: [I18n-sig] Format strings In-Reply-To: <200511301536.21811.2005@kuarepoti-dju.net> References: <20051125161303.53FC8A848DC@pomo.hostsharing.net> <200511281023.38295.2005@kuarepoti-dju.net> <438AF03C.3000903@egenix.com> <200511301536.21811.2005@kuarepoti-dju.net> Message-ID: <438E2D42.7040908@v.loewis.de> Josef Spillner wrote: > But (my last remaining question, as it seems), the default encoding of > unicode() is "ascii" instead of "utf-8" even for this particular source file > which specifies utf-8 encoding. > Would changing this to match the source file encoding break applications as > well? No. *That* would not be implementable (or, if somehow implemented, would break applications). In general, if you convert a Unicode string into a byte string, you cannot even be sure it originally came from source code. Say you do

a = u"Martin "
b = u"v. "
c = u"Löwis"
mvl = a+b+c

Now, the object mvl does not have any source code: so which encoding should be used to encode it? If you have an answer: how does that change if I have

mvl = mod1.a+mod2.b+mod3.c

> Note that the documentation is not really helpful about this aspect. I'd like > to advocate for an i18n paragraph in the tutorial even, where such > behavioural aspects are put into relation with each other, and explained in > the concept of modern (and legacy) runtime environment concepts. Contributions to the documentation are welcome. > Compare: > [All of its arguments should be 8-bit strings] > vs. > [if object is a Unicode string or subclass it will return that Unicode string] > (actually it should say "Unicode object" below, right?) I personally use "Unicode string" (type unicode) vs. "byte string" (type str). Both are strings. > Is anyone coordinating the work, i.e. is there a "unicode compatibility status > map" or anything similar? No. 
It is so far from actually working that nobody bothers to fix it. However, if you have specific contributions which improve the state (i.e. have no behaviour change if -U is not specified, but fix a bug when it is), those are appreciated. Regards, Martin
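
For readers following the format-string part of this thread, here is a minimal sketch of the byte-versus-character width difference discussed above. It is written in Python 3 syntax (bytes vs. str) rather than the Python 2.4 of the thread, and it assumes "ä" (U+00E4) as a stand-in for the non-ASCII character that the archive lost; the helper names pad_bytes and pad_text are made up for illustration.

```python
# Sketch only: Python 3 syntax (needs 3.5+ for bytes % formatting).
# U+00E4 is an assumed stand-in for the character lost from the archive.

text = "\u00e4"              # a text (Unicode) string: one character
raw = text.encode("utf-8")   # the same character as a UTF-8 byte string: two bytes


def pad_bytes(value):
    """Right-justify a byte string to width 2; the width counts bytes."""
    return b"'%2s'" % value


def pad_text(value):
    """Right-justify a text string to width 2; the width counts characters."""
    return "'%2s'" % value


# ascii() keeps the output printable on terminals without UTF-8 support.
print(len(raw), ascii(pad_bytes(raw)))     # two bytes already fill the field
print(len(text), ascii(pad_text(text)))    # one character, so one space of padding
print(ascii(pad_bytes(b"a")))              # a one-byte value is padded as expected
```

The encoding has to be named explicitly in encode(); as noted above, a string built at run time carries no source-file encoding that could be picked automatically.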