From martin@loewis.home.cs.tu-berlin.de Fri Sep 1 08:17:34 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Fri, 1 Sep 2000 09:17:34 +0200
Subject: [I18n-sig] Translating doc strings
Message-ID: <200009010717.JAA02431@loewis.home.cs.tu-berlin.de>

Now that Python 2 supports the gettext API and methodology, I'd like to start discussion on translating messages in the Python core and library proper. I see two different kinds of messages in the Python source: printed messages, which are produced in the course of running the interpreter (including, say, informative parameters to exceptions), and doc strings (which are not normally printed during program execution, but are instead retrieved by a developer).

I have produced patch 101320, which is available from

http://sourceforge.net/patch/?func=detailpatch&patch_id=101320&group_id=5470

The patch consists of a message catalog for Python doc strings, the beginnings of a German translation thereof, a compiled version of the German catalog, and makefile machinery to install the catalogs.

When discussing this patch with BeOpen, Barry and Guido raised concerns about the size of the catalog; Barry proposed to split it into pieces. Splitting the patch into pieces has its own problems: how to split, and should the pieces become their own textual domains?

How would users retrieve translations of the doc strings in the first place? I have proposed patch 101313

http://sourceforge.net/patch/?func=detailpatch&patch_id=101313&group_id=5470

which introduces a doc() function, so that users could write

>>> doc("".split)
S.split([sep [,maxsplit]]) -> Liste von Strings

Gib eine Liste der Worte im String S zurück, mit sep als Trennstring.
Wenn maxsplit angegeben ist, werden höchstens maxsplit Worte
abgetrennt. Wenn sep nicht angegeben ist, gelten beliebige
Whitespace-Strings als Trenner.

This interface has a number of advantages:

- you don't have to type print in front to get line breaks displayed properly
- you don't have to type _ four times
- it will transparently retrieve the translation if available

For this to work, all doc strings must be in a single textual domain. The implementation of the doc function will retrieve the __doc__ attribute of the argument and look for a translation.

With that approach, the next question is: what is the name of the textual domain, and how are translations managed? My proposal was "pylib"; Barry's "docstring". As for the management of translations, I'd like to ask the Free Translation Project for help. As soon as we've settled the technical issues, I'd like to submit a catalog for translation.

Comments?

Martin
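A minimal sketch of how such a doc() helper could work, assuming the single "pylib" domain proposed above; this illustrates the idea with the standard gettext module and is not the code of patch 101313 itself:

    import gettext

    try:
        _catalog = gettext.translation('pylib')   # domain name from the proposal
    except IOError:
        _catalog = gettext.NullTranslations()     # no catalog installed: show originals

    def doc(object):
        # Fetch the original docstring and print its translation,
        # falling back to the original text when none is available.
        docstring = getattr(object, '__doc__', None)
        if docstring:
            print _catalog.gettext(docstring)

With this in place, doc("".split) would print the translated docstring when a "pylib" catalog is installed, and the English original otherwise.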
From guido@beopen.com Fri Sep 1 16:54:46 2000
From: guido@beopen.com (Guido van Rossum)
Date: Fri, 01 Sep 2000 10:54:46 -0500
Subject: [I18n-sig] Translating doc strings
In-Reply-To: Your message of "Fri, 01 Sep 2000 09:17:34 +0200." <200009010717.JAA02431@loewis.home.cs.tu-berlin.de>
References: <200009010717.JAA02431@loewis.home.cs.tu-berlin.de>
Message-ID: <200009011554.KAA09534@cj20424-a.reston1.va.home.com>

> How would users retrieve translations of the doc strings in the first
> place? I have proposed patch 101313
>
> http://sourceforge.net/patch/?func=detailpatch&patch_id=101313&group_id=5470
>
> which introduces a doc() function, so that users could write
>
> >>> doc("".split)
> S.split([sep [,maxsplit]]) -> Liste von Strings
>
> Gib eine Liste der Worte im String S zurück, mit sep als Trennstring.
> Wenn maxsplit angegeben ist, werden höchstens maxsplit Worte
> abgetrennt. Wenn sep nicht angegeben ist, gelten beliebige
> Whitespace-Strings als Trenner.

I like the interface fine. (Some might prefer to call it help().)

> This interface has a number of advantages:
> - you don't have to type print in front to get line breaks displayed
>   properly
> - you don't have to type _ four times
> - it will transparently retrieve the translation if available

In an IDE, doc() could be replaced by something that pops up the docs in a separate window.

> For this to work, all doc strings must be in a single textual
> domain. The implementation of the doc function will retrieve the
> __doc__ attribute of the argument and look for a translation.

Hmm... This lumps together *all* documentation for *all* modules and packages. What about documentation for 3rd party packages? How will your doc() deal with unrelated objects that somehow have the same (probably brief) docstring but for which the translation (depending on context) should be different?

For functions, classes, methods and instances, the module name is easily accessible, e.g.:

>>> import rfc822
>>> m = rfc822.Message(open("/dev/null"))
>>> m.__class__.__name__
'Message'
>>> m.__class__.__module__
'rfc822'
>>>

(For submodules of packages, __module__ gives the full package name.)

--Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)

From martin@loewis.home.cs.tu-berlin.de Fri Sep 1 19:58:16 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Fri, 1 Sep 2000 20:58:16 +0200
Subject: [I18n-sig] Translating doc strings
In-Reply-To: <200009011554.KAA09534@cj20424-a.reston1.va.home.com> (message from Guido van Rossum on Fri, 01 Sep 2000 10:54:46 -0500)
References: <200009010717.JAA02431@loewis.home.cs.tu-berlin.de> <200009011554.KAA09534@cj20424-a.reston1.va.home.com>
Message-ID: <200009011858.UAA00812@loewis.home.cs.tu-berlin.de>

> Hmm... This lumps together *all* documentation for *all* modules and
> packages.

Yes, it would. In itself, I don't see it as a problem. In the lumped-together form, only translators see it. This will guarantee consistency of terminology (e.g. is it "Strings" or "Zeichenketten"; what is "Slicing"?).

> What about documentation for 3rd party packages?

That is indeed a problem.

> For functions, classes, methods and instances, the module name is
> easily accessible, e.g.:
>
> >>> import rfc822
> >>> m = rfc822.Message(open("/dev/null"))
> >>> m.__class__.__name__
> 'Message'
> >>> m.__class__.__module__
> 'rfc822'
> >>>

I see two problems with using the package name. Exactly how do you obtain it for functions? f.func_globals['__name__']? And for builtin functions? As for __module__: I know, it was my idea, after all :-)

The other problem is that this would lead to an inflation of hundreds of .mo files. I'd prefer to have one per product (in the Zope sense). One heuristic would be to use the catalog that _ is bound to, i.e.

    def module_of(symbol):
        ...  # as above

    def catalog_of(symbol):
        return sys.modules[module_of(symbol)]._.im_self

There could be an official protocol as well, of course, but a global catalog together with that convention might do.

Regards,
Martin

From pinard@iro.umontreal.ca Sat Sep 2 14:34:51 2000
From: pinard@iro.umontreal.ca (François Pinard)
Date: 02 Sep 2000 09:34:51 -0400
Subject: [I18n-sig] UTF-8 decoder in CVS still buggy
In-Reply-To: M.-A. Lemburg mal@lemburg.com's message of "Mon, 24 Jul 2000 10:26:25 +0200"
Message-ID:

[mal@lemburg.com]

> Please keep us informed of any quirks you may experience during this
> conversion. We can use some real life reports for the new Unicode
> support in Python to polish up the implementation and design.

Hi, people. I just recently subscribed to i18n-sig, and started to read the archives. Let me hope you will tolerate that I jump into some conversations without having matured all the background.

On the above topic, I did not check what Python exactly does, but I wanted to share that my `recode' program is not perfect in that area. In particular, there is a requirement for UTF-8 to be valid that the sequence be minimal, which `recode' currently does not check on input. Roughly said, a UTF-8 sequence is not valid if it could have been expressed in fewer bytes.

I've nothing against Python beating me at it! :-)

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard
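To make the minimality rule concrete, here is a rough sketch of the check a strict decoder has to perform (illustrative only; this is neither recode's nor Python's actual code):

    def is_overlong(code, nbytes):
        # Smallest code point that genuinely needs nbytes bytes in UTF-8;
        # anything smaller, encoded with nbytes bytes, is an invalid
        # "overlong" sequence.
        minimum = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}[nbytes]
        return nbytes > 1 and code < minimum

    # The two-byte sequence C0 80 decodes to code point 0, but 0 fits in
    # a single byte, so is_overlong(0, 2) is true and the sequence must
    # be rejected.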
From pinard@iro.umontreal.ca Sat Sep 2 14:49:14 2000
From: pinard@iro.umontreal.ca (François Pinard)
Date: 02 Sep 2000 09:49:14 -0400
Subject: [I18n-sig] Python Translation
In-Reply-To: Dinesh Nadarajah dindin2k@yahoo.com's message of "Mon, 10 Jul 2000 20:12:58 -0700 (PDT)"
Message-ID:

[dindin2k@yahoo.com]

> Is there any work/target towards translating Python to other
> languages, i.e. some sort of structure like the *.po files in KDE such
> that native languages can be substituted for the standard keywords?
> Are there any plans to port Python to other (human) languages?

I would not think there is. Some while ago, I wrote to Guido about i18n issues, and to my surprise, he replied quite strongly against the above suggestion, which I did not even make in my letter. So, I presumed the issue was rather hot for him, for him to read it where it was not written :-). Guido's main point is that it goes against source portability.

Yet, even if I do not remember having discussed this with Guido, I think it would be a good idea. Some shops develop in-house code never meant to be exported, and being able to use diacritics within identifiers, and even translated keywords, would locally help a lot, and not hurt anybody outside. For one of my contracts, I'm working in such a shop.

I had a very comfortable experience with such things when I was younger, which lasted for many years, using a French adaptation of a Pascal compiler. See `http://www.iro.umontreal.ca/~pinard/accents/bonjour.tar.gz' for some archived code from this period (better to like French and CDC machines! :-).

My point is that source portability might be a concern for some, but not for everybody, and I wish Python were open enough not to impose source portability where it has no meaning. If Python can be nationally comfortable, just let it be, and let users choose where their priorities are.

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard

From mal@lemburg.com Sat Sep 2 15:03:46 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Sat, 02 Sep 2000 16:03:46 +0200
Subject: [I18n-sig] UTF-8 decoder in CVS still buggy
References:
Message-ID: <39B108C2.F22A0660@lemburg.com>

François Pinard wrote:
>
> [mal@lemburg.com]
>
> > Please keep us informed of any quirks you may experience during this
> > conversion. We can use some real life reports for the new Unicode
> > support in Python to polish up the implementation and design.
>
> Hi, people. I just recently subscribed to i18n-sig, and started to
> read the archives. Let me hope you will tolerate that I jump into some
> conversations without having matured all the background.
>
> On the above topic, I did not check what Python exactly does, but I wanted to
> share that my `recode' program is not perfect in that area. In particular,
> there is a requirement for UTF-8 to be valid that the sequence be minimal,
> which `recode' currently does not check on input. Roughly said, a UTF-8
> sequence is not valid if it could have been expressed in fewer bytes.
>
> I've nothing against Python beating me at it! :-)

Could you give some examples? I'm not sure I understand what you mean by "could have been expressed with fewer bytes" -- perhaps a multi-byte encoding where the top-most bytes are 0?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From guido@beopen.com Sat Sep 2 16:46:35 2000
From: guido@beopen.com (Guido van Rossum)
Date: Sat, 02 Sep 2000 10:46:35 -0500
Subject: [I18n-sig] Python Translation
In-Reply-To: Your message of "02 Sep 2000 09:49:14 -0400."
References:
Message-ID: <200009021546.KAA02082@cj20424-a.reston1.va.home.com>

> [dindin2k@yahoo.com]
>
> > Is there any work/target towards translating Python to other
> > languages, i.e. some sort of structure like the *.po files in KDE such
> > that native languages can be substituted for the standard keywords?
> > Are there any plans to port Python to other (human) languages?

[pinard@iro.umontreal.ca]
> I would not think there is. Some while ago, I wrote to Guido about i18n
> issues, and to my surprise, he replied quite strongly against the above
> suggestion, which I did not even make in my letter. So, I presumed the
> issue was rather hot for him, for him to read it where it was not written :-).
> Guido's main point is that it goes against source portability.
>
> Yet, even if I do not remember having discussed this with Guido, I think
> it would be a good idea. Some shops develop in-house code never meant
> to be exported, and being able to use diacritics within identifiers, and
> even translated keywords, would locally help a lot, and not hurt anybody
> outside. For one of my contracts, I'm working in such a shop.
>
> I had a very comfortable experience with such things when I was younger,
> which lasted for many years, using a French adaptation of a Pascal compiler.
> See `http://www.iro.umontreal.ca/~pinard/accents/bonjour.tar.gz' for some
> archived code from this period (better to like French and CDC machines! :-).
>
> My point is that source portability might be a concern for some, but not for
> everybody, and I wish Python were open enough not to impose source portability
> where it has no meaning. If Python can be nationally comfortable, just
> let it be, and let users choose where their priorities are.

Let me restate my position. It's not a priority for me, and I believe that most in the Python community probably don't see it as a priority for themselves either. There is so much else to do that I don't see myself putting effort into it. But if it is a priority for you, I won't stop you! It would probably best be implemented as a custom translator. We're thinking about making the Python chain of command (input loop -> parser -> compiler -> optimizer -> bytecode interpreter -> runtime) more pluggable in future (post-2.0) versions, and an internationalization pass would easily plug in there.

--Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)

From "Fredrik Lundh"
Message-ID: <02bf01c014fb$2e66a6c0$766940d5@hagrid>

François Pinard wrote:
> Hi, people. I just recently subscribed to i18n-sig, and started to
> read the archives. Let me hope you will tolerate that I jump into some
> conversations without having matured all the background.
>
> On the above topic, I did not check what Python exactly does, but I wanted to
> share that my `recode' program is not perfect in that area. In particular,
> there is a requirement for UTF-8 to be valid that the sequence be minimal,
> which `recode' currently does not check on input. Roughly said, a UTF-8
> sequence is not valid if it could have been expressed in fewer bytes.

for security reasons, the UTF-8 codec gives you an "illegal encoding" error in this case.

mal wrote:
> Could you give some examples? I'm not sure I understand what you
> mean by "could have been expressed with fewer bytes" -- perhaps
> a multi-byte encoding where the top-most bytes are 0?

quoting RFC 2279:

Implementors of UTF-8 need to consider the security aspects of how they handle illegal UTF-8 sequences. It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.

A particularly subtle form of this attack could be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters. For example, a parser might prohibit the NUL character when encoded as the single-octet sequence 00, but allow the illegal two-octet sequence C0 80 and interpret it as a NUL character. Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal octet sequence 2F C0 AE 2E 2F.

From mal@lemburg.com Sat Sep 2 18:05:08 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Sat, 02 Sep 2000 19:05:08 +0200
Subject: [I18n-sig] UTF-8 decoder in CVS still buggy
References: <02bf01c014fb$2e66a6c0$766940d5@hagrid>
Message-ID: <39B13344.1DCCB05A@lemburg.com>

Fredrik Lundh wrote:
>
> François Pinard wrote:
> > On the above topic, I did not check what Python exactly does, but I wanted to
> > share that my `recode' program is not perfect in that area. In particular,
> > there is a requirement for UTF-8 to be valid that the sequence be minimal,
> > which `recode' currently does not check on input. Roughly said, a UTF-8
> > sequence is not valid if it could have been expressed in fewer bytes.
>
> for security reasons, the UTF-8 codec gives you an "illegal encoding"
> error in this case.
>
> mal wrote:
> > Could you give some examples? I'm not sure I understand what you
> > mean by "could have been expressed with fewer bytes" -- perhaps
> > a multi-byte encoding where the top-most bytes are 0?
>
> quoting RFC 2279:
>
> Implementors of UTF-8 need to consider the security aspects of how
> they handle illegal UTF-8 sequences. It is conceivable that in some
> circumstances an attacker would be able to exploit an incautious
> UTF-8 parser by sending it an octet sequence that is not permitted by
> the UTF-8 syntax.
>
> A particularly subtle form of this attack could be carried out
> against a parser which performs security-critical validity checks
> against the UTF-8 encoded form of its input, but interprets certain
> illegal octet sequences as characters. For example, a parser might
> prohibit the NUL character when encoded as the single-octet sequence
> 00, but allow the illegal two-octet sequence C0 80 and interpret it
> as a NUL character. Another example might be a parser which
> prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
> illegal octet sequence 2F C0 AE 2E 2F.

Hmm...

>>> unicode('\xC0\x80','utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: illegal encoding
>>> unicode('\x2F\x2E\x2E\x2F','utf-8')
u'/../'
>>> unicode('\x2F\xC0\xAE\x2E\x2F','utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: illegal encoding
>>>

... so what's buggy about the codec?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From "Fredrik Lundh"
References: <02bf01c014fb$2e66a6c0$766940d5@hagrid> <39B13344.1DCCB05A@lemburg.com>
Message-ID: <02ef01c01502$57479200$766940d5@hagrid>

mal wrote:
> >>> unicode('\xC0\x80','utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeError: UTF-8 decoding error: illegal encoding
> >>> unicode('\x2F\x2E\x2E\x2F','utf-8')
> u'/../'
> >>> unicode('\x2F\xC0\xAE\x2E\x2F','utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeError: UTF-8 decoding error: illegal encoding
> >>>
>
> ... so what's buggy about the codec?

nothing -- François posted under a misleading subject, without checking the code first.

(and I never write buggy code anyway ;-)

From pinard@iro.umontreal.ca Sat Sep 2 21:13:25 2000
From: pinard@iro.umontreal.ca (François Pinard)
Date: 02 Sep 2000 16:13:25 -0400
Subject: [I18n-sig] UTF-8 decoder in CVS still buggy
In-Reply-To: "Fredrik Lundh"'s message of "Sat, 2 Sep 2000 19:22:05 +0200"
References: <02bf01c014fb$2e66a6c0$766940d5@hagrid> <39B13344.1DCCB05A@lemburg.com> <02ef01c01502$57479200$766940d5@hagrid>
Message-ID:

[Fredrik Lundh]

> nothing -- François posted under a misleading subject,
> without checking the code first.

I wrote that I did not check the code, so I'm safe there. But it is also true that I did not check, nor change, the subject; I merely replied to a message.

> (and I never write buggy code anyway ;-)

Far from me the idea to suggest otherwise! :-)

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard

From tdickenson@geminidataloggers.com Mon Sep 4 09:17:04 2000
From: tdickenson@geminidataloggers.com (Toby Dickenson)
Date: Mon, 04 Sep 2000 09:17:04 +0100
Subject: [I18n-sig] Terminology gap
In-Reply-To: <39AE5F20.E68BC43F@lemburg.com>
References: <39AE5F20.E68BC43F@lemburg.com>
Message-ID:

On Thu, 31 Aug 2000 15:35:28 +0200, "M.-A. Lemburg" wrote:

>Toby Dickenson wrote:
>>
>> I've recently been updating my documentation to account for Unicode
>> issues, and have been troubled by the lack of a good name to describe
>> an object that can be *either* a "plain string" or a "unicode string".
>
>I usually use "8-bit string" and "Unicode object".
>
>> My best attempt so far is to call it a "string-like object", but that
>> feels too long for something so common.
>>
>> I would like to use the simple "string", but a quick poll of my local
>> developers suggests that this does not convey the unicode option.
>>
>> Does anyone have any suggestions?
>
>I think the accepted term is "string", since someday Python will
>have a string base class. Unicode objects and 8-bit strings will
>then be subclasses of this string class.

I think the more specific use of "string" will be a hard habit to break....

>>> type('')
<type 'string'>

Toby Dickenson
tdickenson@geminidataloggers.com

From andy@reportlab.com Mon Sep 4 10:00:11 2000
From: andy@reportlab.com (Andy Robinson)
Date: Mon, 4 Sep 2000 10:00:11 +0100
Subject: [I18n-sig] Terminology gap
In-Reply-To: <39AE5F20.E68BC43F@lemburg.com>
Message-ID:

> > My best attempt so far is to call it a "string-like object", but that
> > feels too long for something so common.
> >
> > I would like to use the simple "string", but a quick poll of my local
> > developers suggests that this does not convey the unicode option.
> >
> > Does anyone have any suggestions?
>
> I think the accepted term is "string", since someday Python will
> have a string base class. Unicode objects and 8-bit strings will
> then be subclasses of this string class.

I agree with MAL. "string" should refer to an interface; people doing i18n stuff could then write their own ones in future if needed. I cannot get at CVS this week, but I think we actually checked a UserString class into the standard library in order to clearly define the interface for string-like objects.

- Andy Robinson.
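Until such a base class exists, code that must accept either kind of string has to test for both concrete types. A tiny illustrative helper (hypothetical, not part of UserString or any library):

    import types

    def is_string_like(obj):
        # Accept both 8-bit strings and Unicode objects.
        return type(obj) in (types.StringType, types.UnicodeType)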
From andy@reportlab.com Mon Sep 4 10:00:16 2000
From: andy@reportlab.com (Andy Robinson)
Date: Mon, 4 Sep 2000 10:00:16 +0100
Subject: [I18n-sig] Python Translation
In-Reply-To: <200009021546.KAA02082@cj20424-a.reston1.va.home.com>
Message-ID:

> But if it is a priority for you, I won't stop you! It would probably
> best be implemented as a custom translator. We're thinking about
> making the Python chain of command (input loop -> parser -> compiler
> -> optimizer -> bytecode interpreter -> runtime) more pluggable in
> future (post-2.0) versions, and an internationalization pass would
> easily plug in there.

For inspiration on what can be done with pluggable parsers, check out Damian Conway's Lingua::Romana::Perligata. He built an alternate syntax and parser for Perl in Latin, getting a lot of help from the Monash classics department on the correct case endings to substitute for $, @ and all that stuff. Don't ask me why. (Sorry, I don't have a URL and am off line at the moment.)

BTW, I sat next to him at an author signing at which someone was volunteering to do the Klingon port and make Perl the official scripting language of the Klingon empire. It seems like there is More Than One Way to Say "die" in Klingon. We'd better watch out.

- Andy Robinson

From loewis@informatik.hu-berlin.de Mon Sep 4 14:11:41 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 15:11:41 +0200 (MET DST)
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: (message from François Pinard on 02 Sep 2000 11:59:14 -0400)
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de>
Message-ID: <200009041311.PAA27712@pandora.informatik.hu-berlin.de>

[Martin v. Löwis]
> > The textual domain of a module will relate to what _ binds to. Doc
> > strings won't be wrapped into _(); as a result, you can't use the
> > binding of _.

[François Pinard]
> "_(__doc__)" should work if the docstring shares the textual domain of
> the rest of the module, which looks like the correct thing to do in
> my eyes.

I don't see how this could work for doc strings of classes, methods and functions. Do you propose to write

    def foo():
        _("This does the foo thing.")
        pass

That won't work; the parser won't recognize it as a doc string.

Regards,
Martin

From loewis@informatik.hu-berlin.de Mon Sep 4 14:14:34 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 15:14:34 +0200 (MET DST)
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: (message from François Pinard on 02 Sep 2000 12:05:12 -0400)
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net>
Message-ID: <200009041314.PAA27902@pandora.informatik.hu-berlin.de>

[Barry A. Warsaw]
> So maybe for /docstrings/ there should be one domain, and then each module
> can have its own domain for its own additional translatable strings?

[François Pinard]
> I do not understand the advantage of doing this. Of course, if we do
> not need the translation of docstrings, these should not be collected
> for translation. But if they get collected, there is no reason to have
> a separate domain for them. It is just natural that they be part of the
> domain for the collection of modules they are part of.

How would you access the doc strings? Today, I do

>>> import httplib
>>> print httplib.HTTP.__doc__
This class manages a connection to an HTTP server.

Now, how do I get to the translation of this message?

Regards,
Martin

From loewis@informatik.hu-berlin.de Mon Sep 4 14:29:25 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 15:29:25 +0200 (MET DST)
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: (message from François Pinard on 02 Sep 2000 11:46:23 -0400)
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de>
Message-ID: <200009041329.PAA28928@pandora.informatik.hu-berlin.de>

[François Pinard]
> People might fear that the POT file is too time consuming to load all
> at once. If this is the case, then the problem lies in the implementation
> of the `gettext' interface. I repeated all along that it should be lazily
> evaluated, exactly to avoid that an insufficient implementation becomes
> an excuse to split a textual domain into many smaller ones.

I have started translating the Python doc strings into German, and have covered about 30% so far. Using the Python 2 gettext.py, I did not experience any noticeable delay in loading the mo file on my 300MHz machine. While I agree that lazy loading may become necessary, I think it is ok to implement the feature when the problem actually arises. I'm pretty certain you can implement lazy access without changing the existing API.

> People might fear that the PO file would take too much memory. On
> modern systems, there is no problem `mmap'ing a file, as virtual
> address space is more than enough to hold even big translation
> files. The Python difficulty, here, is that it is (nicely) portable
> to some less capable systems, where `mmap' has no equivalent.

The Python 2 mmap works on Unix and Win32. It probably is the best solution if available.

> In my opinion, the solution might then be for these systems to load
> the MO hash tables only, and then retrieve messages from disk.

If you load the hash tables, does this give enough information so that you can use two seek(2) calls only, on average? If so, it would probably be good if there was a) documentation for the hash table format, and/or b) an implementation of it in Python.
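For reference, the fixed-size header of a GNU .mo file already locates the string tables and the hash table, so reading it is cheap. A sketch of parsing it, based on the layout documented by GNU gettext (illustrative, not gettext.py's code):

    import struct

    def mo_header(filename):
        # First 28 bytes: magic, revision, number of strings, offset of
        # the originals table, offset of the translations table, hash
        # table size, hash table offset.
        data = open(filename, 'rb').read(28)
        if struct.unpack('<I', data[:4])[0] == 0x950412deL:
            fmt = '<7I'        # little-endian catalog
        elif struct.unpack('>I', data[:4])[0] == 0x950412deL:
            fmt = '>7I'        # big-endian catalog
        else:
            raise ValueError, 'not a GNU .mo file'
        magic, revision, nstrings, orig_off, trans_off, \
            hash_size, hash_off = struct.unpack(fmt, data)
        return nstrings, orig_off, trans_off, hash_size, hash_off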
> The last fear might be that the POT file might be too big for
> translators to handle.

That indeed is my concern. The largest catalog so far was Lynx (AFAICT), with 1100 messages. I guess gcc might also be pretty large.

> One of the goals of the Translation Project has been to promote a
> clean separation of responsibilities between software maintainers
> and national translators, as software maintainers spontaneously have
> a wide variety of (often contradictory) opinions about how (and even
> when!) translators should work :-). It is a difficult aspect of the
> overall thing, in fact.

I think for the Python docstring catalog, we can give some guidance - perhaps by shipping not all at once, but waiting for translators to complete the most interesting things first (like docstrings for the builtin core functions).

I'm certain it will take some time to get translations back, so if we want to have something in the next release (after 2.0), we should start today.

Regards,
Martin

From loewis@informatik.hu-berlin.de Mon Sep 4 14:42:57 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 15:42:57 +0200 (MET DST)
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: (message from François Pinard on 03 Sep 2000 16:04:41 -0400)
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net>
Message-ID: <200009041342.PAA29915@pandora.informatik.hu-berlin.de>

> I do not see, nor understand, why we should have special API provisions
> for Unicode. I thought a great effort had been put into the Unicode support
> design so that it would be as transparent as possible. Isn't making Unicode
> explicit going against this spirit?

In Python 2, Unicode strings are a separate type from byte strings. The catalog objects will have two methods, one for retrieving a byte string, as it appears in the mo file, and one for retrieving a Unicode string. It is then the application developer's choice whether his application can deal with Unicode messages on output or not. The core issue is that catalogs only map byte strings to byte strings.

> Should not "_(...)" return either a simple string or a Unicode string,
> depending solely on the goal language? Would not all the rest just fall
> out naturally from this choice? What is the problem that I do not
> see?

You can't be certain that the encoding of the catalog msgstrs is the same as the one of the user. For example, the catalog may use KOI-8, whereas the user's terminals are all in UTF-8. So you have to know the catalog's encoding. This, in turn, is only available if the catalog follows the convention of containing a valid Content-Type field in the translation of the empty string. Or, the Python installation may not have the converter from the .mo file's encoding to Unicode. Also, how would the goal language determine whether Unicode is a better representation for messages than some MBCS?

> Also, what means "GNUTranslations" above? What is especially "GNU" in
> the act of translating? Should not we just avoid any "GNU"
> references?

The format of the catalog files is defined by GNU gettext.

Regards,
Martin
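One way a Unicode-returning lookup could use the Content-Type convention just described (a sketch; the class and method names here are hypothetical, not necessarily what gettext.py will adopt):

    import gettext

    class UnicodeTranslations(gettext.GNUTranslations):

        def ugettext(self, message):
            # The catalog's charset is declared in the translation of the
            # empty string, e.g. "Content-Type: text/plain; charset=koi8-r".
            charset = 'ascii'
            for line in self.gettext('').split('\n'):
                line = line.lower()
                if line.startswith('content-type:') and line.find('charset=') >= 0:
                    charset = line.split('charset=')[1].strip()
            return unicode(self.gettext(message), charset)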
From loewis@informatik.hu-berlin.de Mon Sep 4 14:56:56 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 15:56:56 +0200 (MET DST)
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: (message from François Pinard on 03 Sep 2000 16:19:13 -0400)
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de>
Message-ID: <200009041356.PAA01550@pandora.informatik.hu-berlin.de>

> [Martin von Loewis]
>
> > Also, after discussion, I think we concluded that supporting alternative
> > locale categories is useless; the code should always assume LC_MESSAGES.

[François Pinard]
> The charset selection could also be part of the LANG specification (after
> a period), or implied by the LC_CTYPE value (which itself might be derived
> from LC_ALL). To make things a bit worse, many packages allow LANGUAGE
> to override LANG.

That was not the issue here. The question was whether dcgettext should be supported, which allows specifying a category other than LC_MESSAGES when looking for catalogs.

> LANGUAGE is an extension of LANG allowing fallback languages,
> something that people asked for when `gettext' was designed
> and which looked reasonable to us (yet Richard objected that we were
> losing time over this).

Yes, gettext.py supports this convention.

> I also wanted to stress another point. Regionalised translation files
> automatically fall back on non-regionalised files when available, on a
> message-per-message basis. For example, a typical `de_AT' (Austrian
> German) translation file contains only a few re-translations; the bulk
> of them is still kept within `de'.

The current gettext supports trying these in order. However, looking at the implementation, it seems both conventions are implemented incorrectly: the fall-backs are used when opening the catalog. When the catalog is there, but lookup finds that a message is not translated, it won't try the fall-backs. Instead, it will just return the English message.

In the case of LANGUAGE, I think this is acceptable: if you set it to de:sv, you may get German, Swedish, or English translations. However, in real life, you either get German or Swedish, since catalogs are likely full translations, or not present at all.

As for de_AT falling back to de on a per-message basis - gettext.py doesn't do that. As for 'a typical' de_AT file: I have a total of 2 de_AT files on my installation, whereas I have 211 de translations. So it seems that the typical de_AT translation is empty, in which case it would indeed fall back to de.

Regards,
Martin
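A per-message fallback of the kind described could be layered on top of the module, along these lines (a sketch that assumes gettext.translation() accepts a languages list; ChainedCatalog is an invented name):

    import gettext

    class ChainedCatalog:
        # ChainedCatalog('python', ['de_AT', 'de']) tries each catalog in
        # turn, so a regional file falls back per message, not per file.
        def __init__(self, domain, languages):
            self._catalogs = []
            for lang in languages:
                try:
                    self._catalogs.append(
                        gettext.translation(domain, languages=[lang]))
                except IOError:
                    pass                   # no such catalog; skip it

        def gettext(self, message):
            for catalog in self._catalogs:
                translated = catalog.gettext(message)
                if translated != message:  # crude "was it translated?" test
                    return translated
            return message                 # untranslated: return the original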
From loewis@informatik.hu-berlin.de Mon Sep 4 15:00:10 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 16:00:10 +0200 (MET DST)
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: (message from François Pinard on 03 Sep 2000 16:33:50 -0400)
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <14758.53732.459528.857102@anthem.concentric.net>
Message-ID: <200009041400.QAA02532@pandora.informatik.hu-berlin.de>

[François Pinard]
> Near the time of the beginnings of the Translation Project, the
> mentality was that a PO file could be used to translate from any
> original language to any goal language - the original language
> being, of course, the language used by the programmer. With only a
> few exceptions, I can say that almost all examples I saw or handled
> use English as the original language. But the spirit was open to
> the fact that people could program in their own national language, and
> _then_ have translation files towards English.
>
> Currently, this openness is getting reversed. Not only is the
> original language mandated to be English in the spirit of many,
> there are now pressures for the charset in use to be a small subset of
> ASCII, with some strange code already committed for parameterising
> ASCII to Unicode conversions (I've strong and probably biased
> opinions in that debate, so better not let me try to summarise it
> here :-). A sure thing is that it looks all wrong to me, as just
> giving in to highly pedantic complexity.
>
> So, not only would I like Python to do it better, but I would
> welcome it if Python allowed the original language to be based on
> either ASCII or Unicode, as transparently as possible, of
> course.

Isn't that limited by the structure of mo files? You'd somehow have to know what encoding to use when looking into the catalog - the content type only talks about the encoding of the translations.

Regards,
Martin

From pinard@iro.umontreal.ca Mon Sep 4 15:06:40 2000
From: pinard@iro.umontreal.ca (François Pinard)
Date: 04 Sep 2000 10:06:40 -0400
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 15:11:41 +0200 (MET DST)"
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <200009041311.PAA27712@pandora.informatik.hu-berlin.de>
Message-ID:

[Martin von Loewis]

> [François Pinard]
> > "_(__doc__)" should work if the docstring shares the textual domain of
> > the rest of the module, which looks like the correct thing to do in
> > my eyes.

> I don't see how this could work for doc strings of classes, methods
> and functions. Do you propose to write
>
>     def foo():
>         _("This does the foo thing.")
>         pass
>
> That won't work; the parser won't recognize it as a doc string.

Of course. The idea is to write:

    def foo():
        "This does the foo thing."
        pass

and at some later place:

    print _(foo.__doc__)

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard

From loewis@informatik.hu-berlin.de Mon Sep 4 15:16:29 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 16:16:29 +0200 (MET DST)
Subject: [I18n-sig] Re: Marking translatable strings
In-Reply-To: (message from François Pinard on 04 Sep 2000 08:45:37 -0400)
References: <200008280639.IAA24958@pandora.informatik.hu-berlin.de> <200008281626.SAA04073@pandora.informatik.hu-berlin.de> <14763.14225.909612.157094@anthem.concentric.net> <39B35742.24C18621@lemburg.com>
Message-ID: <200009041416.QAA04065@pandora.informatik.hu-berlin.de>

> I much prefer this as well, and `i' as a string modifier would be welcome.
> However, this requires a change to the Python interpreter. If we can
> obtain that this change be done, then that's wonderful. However, if such
> a change is out of the question for some reason, quote mangling is our best
> next choice for delayed strings. Be sure that if i"..." gets adopted in
> Python as a kind of "ignored" modifier, I'll modify the PO utils so it is the
> preferred form, and deprecate quote mangling soon after 2.0 is out.
>
> Another advantage of i"..." is that it could be used to segregate and
> mark doc-strings needing translation at run-time from those not really
> needing it. It's better than extracting either all of them, or none.

Is there any precedent of a large Python application that uses (or could use) that kind of lazy translation of strings?

Regards,
Martin

From pinard@iro.umontreal.ca Mon Sep 4 15:25:03 2000
From: pinard@iro.umontreal.ca (François Pinard)
Date: 04 Sep 2000 10:25:03 -0400
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 15:14:34 +0200 (MET DST)"
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de>
Message-ID:

> [Barry A. Warsaw]
> > So maybe for /docstrings/ there should be one domain, and then each module
> > can have its own domain for its own additional translatable strings?

> [François Pinard]
> > I do not understand the advantage of doing this. Of course, if we do
> > not need the translation of docstrings, these should not be collected
> > for translation. But if they get collected, there is no reason to have
> > a separate domain for them. It is just natural that they be part of the
> > domain for the collection of modules they are part of.

[Martin von Loewis]
> How would you access the doc strings? Today, I do
>
> >>> import httplib
> >>> print httplib.HTTP.__doc__
> This class manages a connection to an HTTP server.
>
> Now, how do I get to the translation of this message?

I do not imagine all the details, but I think the spirit of the thing is that at "import httplib" time, some function (or class instantiator) was called at the top level of the httplib module, to produce a translating function, which the httplib module soon assigned to the `_' variable, or to something else if the programmer did not like `_'. The httplib module transmitted its translation domain to the mechanism generating the translating function.

If it were systematic that `_' was assigned to, we could try to retrieve the function stored in the `_' global variable of `httplib', and then use it to translate any docstring from httplib. However, it would be nicer if the constraint of using `_' for the translating function did not exist, and if the choice was rather completely left at the discretion of the programmer. If we use `_' systematically in the documentation examples we produce, it is likely to become the popular choice, but let's avoid mandating it.

If we are not forcing `_', the doc() or help() function able to retrieve the translated docstring would have to be a bit more clever. I'm not familiar enough with the Python system variables to know exactly how to do this, but I have the feeling that it would not be hard to organise without having to make the API any less simple than it already is. The mechanism producing the translating function and the help() function (let me confess I have a preference for `help' over `doc' :-) could be designed so they collaborate, if a straightforward implementation of help() appears difficult.

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard
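One possible reading of this convention, as a sketch (the helper names are invented; it assumes a participating module binds its translating function to `_'):

    import sys

    def module_of(object):
        # Functions carry their defining module in func_globals;
        # classes and methods record it in __module__.
        if hasattr(object, 'func_globals'):
            return object.func_globals['__name__']
        if hasattr(object, 'im_class'):
            object = object.im_class
        return getattr(object, '__module__', None)

    def help(object):
        docstring = getattr(object, '__doc__', None)
        if not docstring:
            return
        module = sys.modules.get(module_of(object))
        translate = getattr(module, '_', None)
        if callable(translate):
            docstring = translate(docstring)
        print docstring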
From loewis@informatik.hu-berlin.de Mon Sep 4 15:45:20 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 16:45:20 +0200 (MET DST)
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: (message from François Pinard on 04 Sep 2000 10:25:03 -0400)
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de>
Message-ID: <200009041445.QAA06095@pandora.informatik.hu-berlin.de>

> I do not imagine all the details, but I think the spirit of the
> thing is that at "import httplib" time, some function (or class
> instantiator) was called at the top level of the httplib module, to
> produce a translating function, which the httplib module soon
> assigned to the `_' variable, or to something else if the programmer
> did not like `_'. The httplib module transmitted its translation
> domain to the mechanism generating the translating function.

Ok, so if _ is bound, all is well. That brings us back to square one: should we split the Python library into different textual domains? If yes, then how? *If* we decide to split that, it would be very easy to extract doc strings of different modules into different catalogs.

Even in that case, I guess there would be some code left that did not have its own textual domain. So there would still be the need for some kind of "fallback" domain for the docstrings. The proposed operation of the help function would then be that:

- if the module of the object (function, class, etc.) can be established, and has _ bound, then translate the doc string in the catalog associated with _;
- else, try to translate the doc string in the domain for Python doc strings ("pydoc"?).

However, you also brought up the point that the doc strings should use the same catalog as any other strings of the Python core, and that this should be a single domain (e.g. "python"). In that case, lookup would fall back to the "python" domain, and it would not matter whether _ was bound in any of the modules of the standard Python library.

Regards,
Martin

From pinard@iro.umontreal.ca Mon Sep 4 17:32:09 2000
From: pinard@iro.umontreal.ca (François Pinard)
Date: 04 Sep 2000 12:32:09 -0400
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 16:45:20 +0200 (MET DST)"
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de>
Message-ID:

[Martin von Loewis]

> > I do not imagine all the details, but I think the spirit of the
> > thing is that at "import httplib" time, some function (or class
> > instantiator) was called at the top level of the httplib module, to
> > produce a translating function, which the httplib module soon
> > assigned to the `_' variable, or to something else if the programmer
> > did not like `_'. The httplib module transmitted its translation
> > domain to the mechanism generating the translating function.

> Ok, so if _ is bound, all is well.

Not necessarily. `_' could be bound to a lot of things, not necessarily a translating function.

> That brings us back to square one: should we split the Python library
> into different textual domains?

I miss the logic of sliding over the snake, down to square one. I perceive the issues as rather orthogonal. How are they connected?

> If yes, then how? *If* we decide to split that, it would be very easy
> to extract doc strings of different modules into different catalogs.

Everything should be easy. It is just not "convenient" to handle a multiplicity of domains without very serious reasons to do so. Best is to use one textual (or translation) domain per distribution of a system or package.

> Even in that case, I guess there would be some code left that did not
> have its own textual domain. So there would still be the need for some
> kind of "fallback" domain for the doc strings.

Why should we use separate domains for doc strings?

> The proposed operation of the help function would then be that:
> - if the module of the object (function, class, etc.) can be
>   established, and has _ bound, then translate the doc string
>   in the catalog associated with _;

My feeling is that we should not rely on `_'. The variable used to hold the translating function should be left at the discretion of the user.

> - else, try to translate the doc string in the domain for Python
>   doc strings ("pydoc"?).

Why not just use the textual domain of a module to translate the doc strings it contains? It may well happen that, if the module comes with the Python distribution, it will have "python" for its textual domain. But it might come from anywhere, and we cannot predict the textual domain of a randomly imported module.

However, all modules holding translated strings should also get, right on initial import, a translating function out of their textual domain, and the mechanics producing that translating function might save a correspondence between the module and the textual domain for that module (unless we find something more straightforward). It should be possible to communicate with the mechanics to get a copy of the translating function for that module, and use that function to translate doc strings held within that module.

> However, you also brought up the point that the doc strings should use
> the same catalog as any other strings of the Python core,

I just checked the `To:' of your message to make sure, and indeed, you are writing to me :-). No, I'm pretty sure I never said that, or else, if I did, I surely was extremely tired! :-) Simplicity asks that doc strings share the textual domain of all other strings for the same module. Is there a need to do otherwise?

Keep happy!

P.S. - I slightly begin to fear that we will not have a full, clear consensus by the 4th of September... :-)

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard
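François's "save a correspondence" idea could look something like the following (a hypothetical sketch; install_domain and the registry are invented names, and a module would remain free to bind the returned function to any name it likes):

    import gettext

    _domain_registry = {}    # module name -> catalog, recorded at import time

    def install_domain(module_name, domain):
        # Called once at the top of a module; returns its translating
        # function and remembers which catalog the module uses.
        try:
            catalog = gettext.translation(domain)
        except IOError:
            catalog = gettext.NullTranslations()
        _domain_registry[module_name] = catalog
        return catalog.gettext

    def translate_docstring(module_name, docstring):
        # Used by help() to translate a docstring found in that module.
        catalog = _domain_registry.get(module_name)
        if catalog:
            return catalog.gettext(docstring)
        return docstring

A module would then start with something like _ = install_domain(__name__, 'httplib'), under whatever name it prefers.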
From pinard@iro.umontreal.ca Mon Sep 4 17:59:27 2000
From: pinard@iro.umontreal.ca (François Pinard)
Date: 04 Sep 2000 12:59:27 -0400
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 15:29:25 +0200 (MET DST)"
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <200009041329.PAA28928@pandora.informatik.hu-berlin.de>
Message-ID:

[Martin von Loewis]
> I have started translating the Python doc strings into German, and
> have covered about 30% so far. Using the Python 2 gettext.py, I did not
> experience any noticeable delay in loading the mo file on my 300MHz
> machine. While I agree that lazy loading may become necessary, I think
> it is ok to implement the feature when the problem actually arises.
> I'm pretty certain you can implement lazy access without changing the
> existing API.

Excellent. You just have to remember this if you ever read someone asking that we split textual domains into "smaller" or "more manageable" parts. We should then correct implementations and tools as needed, rather than give in to splitting, or multiplying, textual domains.

> The Python 2 mmap works on Unix and Win32. It probably is the best
> solution if available.

Wow! Good news. Our luck would be for it to work on Macintosh as well...

> > In my opinion, the solution might then be for these systems to load
> > the MO hash tables only, and then retrieve messages from disk.

> If you load the hash tables, does this give enough information so that
> you can use two seek(2) calls only, on average? If so, it would
> probably be good if there was a) documentation for the hash table
> format, and/or b) an implementation of it in Python.

We could use the compendium of all existing PO files in the Translation Project to establish statistics (I'm not rushing into doing this today! :-). My guess is that we could hold full hash tables in memory through a quick swallow, and that double hashing would later guarantee a single seek on average. The precise hash algorithm is only documented in the sources. Using it from GNU `gettext' would raise questions about how the GPL applies, but using the copy bought by the Danish UUG should be OK. Best might be to postpone this for now, according to the first quoted paragraph of this message.

> > The last fear might be that the POT file might be too big for
> > translators to handle.

> That indeed is my concern. The largest catalog so far was Lynx
> (AFAICT), with 1100 messages. I guess gcc might also be pretty large.

It should be the concern of translators, national teams, or the Translation Project, but surely not the concern of programmers acting on behalf of translators. At the start of the Translation Project, it was a recurrent difficulty that each and every programmer felt the need to decide how translators should work. Better to keep responsibilities well separated: everybody sleeps better, and is happier in the long run.

> I think for the Python docstring catalog, we can give some guidance -
> perhaps by shipping not all at once, but waiting for translators to
> complete the most interesting things first (like docstrings for
> the builtin core functions).

No, no, I don't think so. As programmers, we should just not interfere. Believe me, people do not need so much of our precious "guidance".

> I'm certain it will take some time to get translations back, so if
> we want to have something in the next release (after 2.0), we should
> start today.

This is another thing. You have to lose hope, _right now_, of ever keeping all translations synchronous with releases. Some teams, and only a few of them, react in a fast way, but most teams are slow. You will live in endless irritation, and might end up pretty disgusted, if you start trying to push and pull on teams. You have to calm your own soul, and become quiet. Consider, as a programmer, that your job is to internationalise your scripts (and maybe to comply, once in a while, when you receive reports about too much English grammar being burned into your run-time construction of strings), and then to accept translations from translators, almost blindly, without judging whether they are worth being distributed or not.

Linguistic matters are something to be discussed and resolved between the national team of translators for a language and the users of that language. You have to detach yourself, as a programmer, from all such concerns: they are not yours. The quality of your package is orthogonal to, and independent from, the quality of translations; this should be absolutely clear for everybody.

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard

From pinard@iro.umontreal.ca Mon Sep 4 18:08:08 2000
From: pinard@iro.umontreal.ca (François Pinard)
Date: 04 Sep 2000 13:08:08 -0400
Subject: [I18n-sig] Re: Marking translatable strings
In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 16:16:29 +0200 (MET DST)"
References: <200008280639.IAA24958@pandora.informatik.hu-berlin.de> <200008281626.SAA04073@pandora.informatik.hu-berlin.de> <14763.14225.909612.157094@anthem.concentric.net> <39B35742.24C18621@lemburg.com> <200009041416.QAA04065@pandora.informatik.hu-berlin.de>
Message-ID:

[Martin von Loewis]
> Is there any precedent of a large Python application that uses (or
> could use) that kind of lazy translation of strings?

A friend and I marked all of an older Mailman, and delayed translations are needed here and there, as for most other big programs. It is typical of many applications, anyway. I would not think that Python is very special in this particular aspect, compared to other languages. In my experience, delayed translations are not often needed on average, and yet are inescapable here and there, once in a while. In the case of Python, of course, all doc strings are inherently delayed, but maybe they are not necessarily always meant to be translated in every application. (I guess this may be debated. :-)

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard

From pinard@iro.umontreal.ca Mon Sep 4 18:19:04 2000
From: pinard@iro.umontreal.ca (François Pinard)
Date: 04 Sep 2000 13:19:04 -0400
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 15:56:56 +0200 (MET DST)"
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <200009041356.PAA01550@pandora.informatik.hu-berlin.de>
Message-ID:

[Martin von Loewis]
> However, looking at the implementation, it seems both conventions are
> implemented incorrectly: the fall-backs are used when opening the catalog.

This can be seen as an important optimisation, indeed.

> When the catalog is there, but lookup finds that a message is
> not translated, it won't try the fall-backs.

It most probably should, to respect the spirit of fall-backs.

> In the case of LANGUAGE, I think this is acceptable: if you set it to
> de:sv, you may get German, Swedish, or English translations. However,
> in real life, you either get German or Swedish, since catalogs are likely
> full translations, or not present at all.

The truth of experience is that for many teams, translations will lag behind releases, and you will often not have full translation files; a few holes will exist.
It is then more important that fall-backs are taken on a per message basis. > As for de_AT falling back to de on a per-message basis - gettext.py > doesn't do that. As for 'a typical' de_AT file: I have a total of 2 > de_AT files on my installation, whereas I have 211 de translations. > So it seems that the typical de_AT translation is empty, in which case > it would indeed fall back to de. Indeed :-). When `de_AT' does not even exist, no need to consider it. -- François Pinard http://www.iro.umontreal.ca/~pinard From pinard@iro.umontreal.ca Mon Sep 4 18:29:32 2000 From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=) Date: 04 Sep 2000 13:29:32 -0400 Subject: [I18n-sig] Re: gettext in the standard library In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 16:00:10 +0200 (MET DST)" References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <14758.53732.459528.857102@anthem.concentric.net> <200009041400.QAA02532@pandora.informatik.hu-berlin.de> Message-ID: > [François Pinard] > > So, not only would I like that Python does it better, but I would > > welcome if Python was allowing the original language to be based on > > either ASCII or Unicode, the most transparently as possible, of > > course. [Martin von Loewis] > Isn't that limited by the structure of mo files? You'd somehow have to > know what encoding to use when looking into the catalog - the content > type only talks about the encoding of the translations. It is surely a bit sad that the PO file header (the translation of the empty string) has no current provision to describe `msgstr' language and encoding. Yet, in practice, as long as the POT file is automatically derived from the sources, each `msgstr' is identical to how it appears in the sources, and consequently, it uses in the POT file the same encoding that in the source. So, it is likely that retrieving the `msgstr' at run-time will work. Problems would arise if the source strings were recoded, between string extraction by POT tools, and string usage for translation at run-time. Python will likely "internalise" or convert Unicode strings from UTF-8, and this is a change of representation. Maybe we could do similar changes in the POT extractors, so the match occurs. This might become difficult if the Python sources are coded in other things than UTF-8. But whatever means will exist for Python to do the conversion, POT extractors might have to be modified to use the same means. Matches shall occur. -- François Pinard http://www.iro.umontreal.ca/~pinard From loewis@informatik.hu-berlin.de Mon Sep 4 18:32:31 2000 From: loewis@informatik.hu-berlin.de (Martin von Loewis) Date: Mon, 4 Sep 2000 19:32:31 +0200 (MET DST) Subject: [I18n-sig] Re: Patch 101320: doc strings In-Reply-To: (message from =?ISO-8859-1?Q?Fran=E7ois?= Pinard on 04 Sep 2000 12:32:09 -0400) References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> Message-ID: <200009041732.TAA14636@pandora.informatik.hu-berlin.de> > > That brings us back to square one: Should we split the Python library > > into different textual domains? >=20 > I miss the logic of sliding over the snake, down to square one. 
I perceive
> the issues as rather orthogonal. How are they connected?

[me, paraphrasing]
Me: I propose a single domain for docstrings, 'pylib'.
Barry: This is too large, split it up.
Me: Then how do you access the individual strings?
Barry: Use the module name.
Me: This will give too many domains.
Barry: There might be some point in having a /docstring/ domain [for the python library]
You: Why should we have one? All docstrings should be in the same domain as the module.
Me: Then how do you access individual strings?
You: I don't know, but maybe you can use the binding of _.
Me: I propose a single domain for docstrings. Anybody proposing a different organisation?

> > Even in that case, I guess there would be some code left that did not
> > have its own textual domain. So there would still be the need for some
> > kind of "fallback" domain for the doc strings.
>
> Why should we use separate domains for doc strings?

I did not propose a *separate* domain for doc strings. I proposed that there is one well-known domain in which the doc strings of the core python library can be found. I don't care too much at this time whether it contains anything else - there are no other translatable strings in the Python sources at this point in time.

> My feeling is that we should not rely on `_'. The variable used to hold
> the translating function should be left at the discretion of the
> user.

Well, what else do you propose?

> > - else, try to translate the doc string in the domain for Python
> >   doc strings ("pydoc"?).
>
> Why not just use the textual domain of a module, to translate the doc
> strings it contains?

How do I find out the textual domain of a module? How do I find out the module of a builtin function?

> However, all modules holding translated strings should also get,
> right on initial import, a translating function out of their textual
> domain, and the mechanics producing that translating function might
> save a correspondence between the module and the textual domain for
> that module (unless we find something more straightforward).

So you propose that there be some kind of protocol to be observed by a module that wants to make "its" textual domain known. What is that protocol? I also propose a protocol: A module can announce its textual domain by binding _. It may choose not to bind _, or it may choose not to bind it to a catalog method. In either case, it does not follow the protocol, so anybody using that protocol may get some kind of failure.

> > However, you also brought the point that the doc strings should use
> > the same catalog as any other strings of the Python core,
>
> I just checked the `To:' of your message to make sure, and indeed, you
> are writing to me :-). No, I'm pretty sure I never said that, or else,
> if I did, I surely was extremely tired! :-) Simplicity asks that doc
> strings share the textual domain of all other strings for the same module.
> Is there a need to do otherwise?

Maybe my logic is somewhat flawed:
- Did you agree that doc strings of a module should use the same domain as all other strings of the module?
- Did you propose that a single package, distributed as a whole, should have a single textual domain?
- Do you agree that the Python core+libs is a single package?

From that, I'd conclude that you are in favour of having a single domain for the Python core+libs, which contains both doc strings and other translatable strings of Python core+libs.

> P.S.
- I slightly begin to fear that we will not have a full, clear
> consensus by the 4th of September... :-)

I've given up on having message catalogs in the Python 2.0 distribution. Since there is no point in having the catalog without any translations, this is not so urgent. What *is* urgent is to give the catalog to the translators.

Regards, Martin

From loewis@informatik.hu-berlin.de Mon Sep 4 18:44:42 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 19:44:42 +0200 (MET DST)
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: (message from =?ISO-8859-1?Q?Fran=E7ois?= Pinard on 04 Sep 2000 12:59:27 -0400)
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <200009041329.PAA28928@pandora.informatik.hu-berlin.de>
Message-ID: <200009041744.TAA15131@pandora.informatik.hu-berlin.de>

> > I'm certain it will take some time to get translations back, so if
> > we want to have something in the next release (after 2.0), we should
> > start today.
>
> This is another thing. You have to lose hope, _right now_, of ever keeping
> all translations synchronous with releases.

I never had this hope - this is the first thing the gettext manual told me a few years ago. However, would you then conclude to the contrary: Teams never finish, so we don't need to start?

> You will live with endless irritation, and might end up pretty disgusted,
> if you start trying to push and pull on teams.

I certainly won't push teams. At the moment, I'm pushing Python maintainers to grant me the freedom to release an already existing catalogue. As a translator, I'm always frustrated when my translations aren't used in released software (*). The reason for that is that quite a lot of translations tend to get fuzzy in a short time.

(*) In the German catalog of GNU grep, which I maintain, a number of option descriptions appear in English in grep 2.3, even though they had mostly-correct translations. It does not help at all that the manual says I should not worry - I did. In grep 2.4.2, everything is fine - mainly as a result of better coordination.

Regards, Martin

From pinard@iro.umontreal.ca Mon Sep 4 18:49:33 2000
From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=)
Date: 04 Sep 2000 13:49:33 -0400
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 15:42:57 +0200 (MET DST)"
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200009041342.PAA29915@pandora.informatik.hu-berlin.de>
Message-ID:

[Martin von Loewis]
> > I do not see, nor understand, why we should have special API provisions
> > for Unicode. I thought a great effort had been put into Unicode support
> > design so it would be as transparent as possible. Isn't making Unicode
> > explicit going against this spirit?

> In Python 2, unicode strings are a separate type from byte strings.
> The catalog objects will have two methods, one for retrieving a byte
> string, as it appears in the mo file, and one for retrieving a unicode
> string. It is then the application developer's choice whether his
> application can deal with Unicode messages on output or not.

You are merely re-stating that there is a special API for Unicode, here. I got this already! :-). My question is about why it is necessary.
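To make the two-method design concrete, here is a minimal sketch; the class and attribute names are illustrative only, not the actual gettext.py interface, and only the gettext/ugettext split itself is taken from Martin's description:

    class Catalog:
        def __init__(self, messages, charset):
            # messages maps msgid byte strings to msgstr byte strings,
            # exactly as they sit in the mo file; charset comes from the
            # Content-Type line of the PO header entry.
            self.messages = messages
            self.charset = charset

        def gettext(self, message):
            # Byte-string interface: return the msgstr as stored.
            return self.messages.get(message, message)

        def ugettext(self, message):
            # Unicode interface: decode the stored msgstr first.
            return unicode(self.gettext(message), self.charset)

An application whose output channels are Unicode-safe would bind _ to ugettext; one that writes raw bytes would bind it to gettext. Either way the return type is the same on every call.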
> You can't be certain that the encoding of the catalog msgstrs is the
> same as the one of the user. For example, the catalog may use KOI-8,
> whereas the user's terminals are all in UTF-8. So you have to know the
> catalog's encoding.

Yes, it is described in the PO file header (the translation of the empty string). The idea is to convert KOI-8 (or whatever) while retrieving the translation. Most of the time, the conversion will be to Unicode. In some very rare cases, like for the Netherlands, ASCII is sufficient. This all can be done automatically; I do not see why we need two APIs.

> the Python installation may not have the converter from the .mo file's
> encoding to Unicode.

I thought Python 2.0 was to come with a comprehensive set of conversion routines for doing such things. If we ever find that one is missing, we might try to add it, shouldn't we?

> Also, how would the goal language determine whether Unicode is a better
> representation for messages than some MBCS?

Oh, no doubt that this may lead to hot debates. I thought that Python was trying to give special treatment to Unicode. You might remember, I do not know, that I tried to warn people that Unicode is not the end of everything. I guess you are saying the same thing, here. :-) For translation purposes, I thought Python was to produce either ASCII or UTF-8 rather automatically on output. It is likely to produce a mix, as the original strings are written in ASCII most of the time, and do not all get translated. If something else is needed on output, I thought the intent was to override UTF-8 as an output encoding, yet still use Unicode internally, instead of any MBCS, taking advantage of all the magic Python 2.0 will have in that respect. Otherwise, you have to make your Python script aware of those encodings a lot more, and internationalisation becomes much more intrusive in your sources, while we wanted it to be as lightweight as possible.

> > Also, what does "GNUTranslations" above mean? What is especially "GNU"
> > in the act of translating? Should not we just avoid any "GNU"
> > references?

> The format of the catalog files is defined by GNU gettext.

Let's avoid "GNU" in the terminology, if we avoid the GPL. They usually go together! :-) And besides, I think we should not overly insist, in the documentation or in the API, on the fact that a particular `gettext' is used underneath.

-- François Pinard http://www.iro.umontreal.ca/~pinard

From loewis@informatik.hu-berlin.de Mon Sep 4 18:52:29 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 19:52:29 +0200 (MET DST)
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: (message from =?ISO-8859-1?Q?Fran=E7ois?= Pinard on 04 Sep 2000 13:19:04 -0400)
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <200009041356.PAA01550@pandora.informatik.hu-berlin.de>
Message-ID: <200009041752.TAA15488@pandora.informatik.hu-berlin.de>

> The truth of experience is that for many teams, translations will lag
> behind releases, and you will often not have full translation files; a
> few holes will exist. It is then more important that fall-backs are taken
> on a per-message basis.

I agree in principle. From a practical point of view: Do you know any user that actually has a LANGUAGE setting listing more than one language?
Even in the sv:de example, there is still a chance that neither the Swedish nor the German catalog has a translation, so the user would get three languages on her screen. I don't know anybody who'd prefer that over just falling back to English.

Regards, Martin

From loewis@informatik.hu-berlin.de Mon Sep 4 19:01:48 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 20:01:48 +0200 (MET DST)
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: (message from =?ISO-8859-1?Q?Fran=E7ois?= Pinard on 04 Sep 2000 13:29:32 -0400)
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <14758.53732.459528.857102@anthem.concentric.net> <200009041400.QAA02532@pandora.informatik.hu-berlin.de>
Message-ID: <200009041801.UAA15838@pandora.informatik.hu-berlin.de>

> Problems would arise if the source strings were recoded, between string
> extraction by POT tools, and string usage for translation at run-time.
> Python will likely "internalise" or convert Unicode strings from UTF-8,
> and this is a change of representation.

Currently, to put Unicode strings into source code, you'll have to use \u escapes in your source (e.g. print u"\u263A"). I'm not aware of any editor that transparently displays these beasts. So if you want to have non-English msgid strings using the Unicode standard (rather than Unicode objects), your best bet is probably to encode the Python source as UTF-8. As a result, you'll use byte strings as parameters to _, which is supported well by the API. [As a side note: I would have preferred if u"" strings had UTF-8 inside them. As it is, I doubt anybody will use them for things other than WHITE SMILING FACE]. With byte strings, Python won't do any internalisation, so at run time, you'll always have the same byte string that you got at extraction time.

Regards, Martin

From loewis@informatik.hu-berlin.de Mon Sep 4 19:13:57 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 20:13:57 +0200 (MET DST)
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: (message from =?ISO-8859-1?Q?Fran=E7ois?= Pinard on 04 Sep 2000 13:49:33 -0400)
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200009041342.PAA29915@pandora.informatik.hu-berlin.de>
Message-ID: <200009041813.UAA16309@pandora.informatik.hu-berlin.de>

> > In Python 2, unicode strings are a separate type from byte strings.
> > The catalog objects will have two methods, one for retrieving a byte
> > string, as it appears in the mo file, and one for retrieving a unicode
> > string. It is then the application developer's choice whether his
> > application can deal with Unicode messages on output or not.
>
> You are merely re-stating that there is a special API for Unicode, here.
> I got this already! :-). My question is about why it is necessary.

Which part do you deem unnecessary? The part returning a byte string, or the part returning a Unicode string?

> Yes, it is described in the PO file header (the translation of the empty
> string). The idea is to convert KOI-8 (or whatever) while retrieving
> the translation. Most of the time, the conversion will be to Unicode.
> In some very rare cases, like for the Netherlands, ASCII is sufficient.
> This all can be done automatically; I do not see why we need two APIs.
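What "done automatically" could mean in code, as a sketch: pull the charset out of the catalog's header entry (the translation of the empty string) and decode each msgstr with it on the way out. The function names are illustrative only, and the header parsing is deliberately crude:

    import re

    def catalog_charset(catalog):
        # The PO/MO header travels as the "translation" of the empty msgid;
        # its Content-Type line names the charset of all msgstrs.
        header = catalog.get('', '')
        match = re.search('charset=([-\w.]+)', header)
        if match:
            return match.group(1)
        return 'ascii'

    def translate(catalog, message):
        # Always hand back a Unicode string, whatever the file encoding.
        return unicode(catalog.get(message, message),
                       catalog_charset(catalog))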
So you are proposing that an application cannot tell in advance what the return type of _ will be? In some application, writing

    header = '\x01\x01'
    body = _('warning')
    message = header + body

Will this work or not? Answer: It depends. In the Netherlands, it will work; elsewhere, it won't.

> I thought Python 2.0 was to come with a comprehensive set of conversion
> routines for doing such things. If we ever find that one is missing,
> we might try to add it, shouldn't we?

I think it was decided not to include the JIS something tables in the Python 2 distribution, because they are too large to include.

> > Also, how would the goal language determine whether Unicode is a better
> > representation for messages than some MBCS?
>
> Oh, no doubt that this may lead to hot debates.

I did not really ask for an opinion, I asked for an algorithm:

    def mbcs_p(parameters):
        your code here

> For translation purposes, I thought Python was to produce either ASCII
> or UTF-8 rather automatically on output. It is likely to produce a mix,
> as the original strings are written in ASCII most of the time, and do
> not all get translated.

In Python 2.0, developers should be aware at all times whether they operate on Unicode strings or on byte strings. Python will try to do the right thing if there is a clear right thing, and try to raise exceptions whenever it is not so clear what the right thing would be. Having an API that sometimes returns Unicode strings and sometimes byte strings (depending on environment variables (!)) would be just terrible.

> If something else is needed on output, I thought the intent was to
> override UTF-8 as an output encoding, yet still use Unicode
> internally, instead of any MBCS, taking advantage of all the magic
> Python 2.0 will have in that respect.

Maybe it's a terminology issue: I consider UTF-8 an MBCS (multi-byte character set); UTF-8 strings are byte strings, not Unicode strings.

> Otherwise, you have to make your Python script aware of those encodings
> a lot more, and internationalisation becomes much more intrusive in your
> sources, while we wanted it to be as lightweight as possible.

I simply want to give users a choice. If they choose "let's try Unicode", they have the choice. If they find it all works, well. Otherwise, they can go for byte strings, with a different set of limitations.

Regards, Martin

From mal@lemburg.com Mon Sep 4 19:39:03 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 04 Sep 2000 20:39:03 +0200
Subject: [I18n-sig] Re: gettext in the standard library
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <14758.53732.459528.857102@anthem.concentric.net> <200009041400.QAA02532@pandora.informatik.hu-berlin.de> <200009041801.UAA15838@pandora.informatik.hu-berlin.de>
Message-ID: <39B3EC47.D54483D2@lemburg.com>

Martin von Loewis wrote:
>
> > Problems would arise if the source strings were recoded, between string
> > extraction by POT tools, and string usage for translation at run-time.
> > Python will likely "internalise" or convert Unicode strings from UTF-8,
> > and this is a change of representation.
>
> Currently, to put Unicode strings into source code, you'll have to use
> \u escapes in your source (e.g. print u"\u263A"). I'm not aware of any
> editor that transparently displays these beasts.
You could wrap the decoding processing into the _ function:

    def _(s):
        return unicode(s, "utf-8")

This would allow you not only to use translatable strings, but also any unicode string encoding you like, e.g. utf-8 or latin-1.

Once the "declare" statement is in place you should also be able to write:

    declare encoding = "utf-8"
    ... u"utf-8 encoded string" ...

in Python source code.

-- Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From loewis@informatik.hu-berlin.de Mon Sep 4 19:48:58 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 20:48:58 +0200 (MET DST)
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: <39B3EC47.D54483D2@lemburg.com> (mal@lemburg.com)
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <14758.53732.459528.857102@anthem.concentric.net> <200009041400.QAA02532@pandora.informatik.hu-berlin.de> <200009041801.UAA15838@pandora.informatik.hu-berlin.de> <39B3EC47.D54483D2@lemburg.com>
Message-ID: <200009041848.UAA17574@pandora.informatik.hu-berlin.de>

> You could wrap the decoding processing into the _ function:
>
>     def _(s):
>         return unicode(s, "utf-8")
>
> This would allow you not only to use translatable strings,
> but also any unicode string encoding you like, e.g. utf-8
> or latin-1.

Maybe I'm missing something here. How does the catalog come into play in this definition of _?

Regards, Martin

From mal@lemburg.com Mon Sep 4 20:39:07 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 04 Sep 2000 21:39:07 +0200
Subject: [I18n-sig] Re: gettext in the standard library
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <14758.53732.459528.857102@anthem.concentric.net> <200009041400.QAA02532@pandora.informatik.hu-berlin.de> <200009041801.UAA15838@pandora.informatik.hu-berlin.de> <39B3EC47.D54483D2@lemburg.com> <200009041848.UAA17574@pandora.informatik.hu-berlin.de>
Message-ID: <39B3FA5B.A8A4BF19@lemburg.com>

Martin von Loewis wrote:
>
> > You could wrap the decoding processing into the _ function:
> >
> >     def _(s):
> >         return unicode(s, "utf-8")
> >
> > This would allow you not only to use translatable strings,
> > but also any unicode string encoding you like, e.g. utf-8
> > or latin-1.
>
> Maybe I'm missing something here. How does the catalog come into play
> in this definition of _?

That was just an example of how you could add the decoding functionality to the _ function. You would of course also add a gettext.gettext call somewhere in there which translates the string first (possibly recoding it to some other encoding for the table lookup first).
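Spelled out, the combined wrapper being described could look like the following sketch; it assumes the catalog's msgids and msgstrs are both stored as UTF-8, which makes the recoding step mentioned above unnecessary:

    import gettext

    def _(s):
        # Translate the byte string first, then decode the result.
        return unicode(gettext.gettext(s), "utf-8")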
-- Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From pinard@iro.umontreal.ca Mon Sep 4 21:28:29 2000
From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=)
Date: 04 Sep 2000 16:28:29 -0400
Subject: [I18n-sig] Re: Patch 101320: doc strings
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de>
Message-ID:

[Martin von Loewis]
> Me: I propose a single domain for docstrings, 'pylib'.
> Barry: This is too large, split it up.
> Me: Then how do you access the individual strings?
> Barry: Use the module name.
> Me: This will give too many domains.
> Barry: There might be some point in having a /docstring/ domain [for
> the python library]

Oh, oh! So, I see that Barry is the bad guy, after all? :-)

> I did not propose a *separate* domain for doc strings. I proposed that
> there is one well-known domain in which the doc strings of the core
> python library can be found. I don't care too much at this time whether
> it contains anything else - there are no other translatable strings in
> the Python sources at this point in time.

OK, then, let's call it "python", and if other strings are needed from the Python distribution, let's also use that "python" domain for them as well. Very fine with me! :-)

> > My feeling is that we should not rely on `_'. The variable used to hold
> > the translating function should be left at the discretion of the user.
>
> Well, what else do you propose?

Nothing special. I suggest that we systematically use `_' in the documentation and examples, but that we also avoid forcing the issue in any way from within Python. Let's use a function in `locale', say, to get a Translations instance (say) given the textual domain. The language to use could be obtained from environment variables, as well as the search path for the MO file, unless such things get overridden by keyword arguments.

> > Why not just use the textual domain of a module, to translate the doc
> > strings it contains?
>
> How do I find out the textual domain of a module? How do I find out
> the module of a builtin function?

You ask the `locale' module to return you a Translations instance for that module, maybe through another keyword argument stating the name of the module for which you need a translator. This would only work if that module previously registered its domain name, by asking for the creation of a Translations instance for itself (and without specifying the keyword argument naming a module, of course).

> So you propose that there be some kind of protocol to be observed by a
> module that wants to make "its" textual domain known. What is that
> protocol?

Maybe, I do not know, something like:

    _ = locale.translator(TEXTUAL_DOMAIN)

after the overall doc string for the module, for each module? For all modules being part of the Python distribution, it would be:

    _ = locale.translator("python")

Of course, if a module does not need to translate any string explicitly, a mere:

    locale.translator("python")

would be sufficient, in which case the `_' variable gets undisturbed, of course.
The `locale.translator' function would call `locale.Translator()' if it finds that none exist yet for the textual domain "python" and the specified language sequence (which could be specified by keyword, but defaulting to LANGUAGE in the environment, or else LANG, or none).

> I also propose a protocol: A module can announce its textual domain by
> binding _. It may choose not to bind _, or it may choose not to bind it
> to a catalog method. In either case, it does not follow the protocol,
> so anybody using that protocol may get some kind of failure.

I understand, but I think we may avoid imposing `_'. Even if we expect it to be popular, it is best not to rely on it, if we can avoid doing so.

> - Did you agree that doc strings of a module should use the same
>   domain as all other strings of the module?

Sounds good to me.

> - Did you propose that a single package, distributed as a whole,
>   should have a single textual domain?

As far as possible, yes. It seems to be the right thing to do for most things so far. This is not an absolute, of course, but we should not start with the idea that splitting is necessary. If we later discover some exceptional property or condition that makes a sound and solid justification for it, it would be worth exploring, but as far as I know (and given I've not read all my mail yet :-), none has shown up yet. If we have many tens of thousands of doc strings, it might change the balance, I do not know.

> - Do you agree that the Python core+libs is a single package?

I'm much tempted to agree, yes.

> From that, I'd conclude that you are in favour of having a single
> domain for the Python core+libs, which contains both doc strings
> and other translatable strings of Python core+libs.

Yes, of course. But we cannot blindly assume that the textual domain for any module is "python", as Python relies on the run-time importation of many scripts from various sources. A good deal of modules will have "python" to start with, but modules could be added or overridden: the textual domain of a module should be registered by that module, and retrieved whenever appropriate.

> I've given up on having message catalogs in the Python 2.0
> distribution.

Do not lose hope yet. Who knows what will happen! :-) CNRI never confessed its true reasons, but now, we can tell it. If they made all that legalese noise and stuff, that was only a convoluted way to buy us more time for completing internationalisation specifications. :-)

> What *is* urgent is to give the catalog to the translators.

This, I deeply understand! The big work is mainly done by translators, and PO files are re-usable even when the API changes or fluctuates, or is postponed. So, the translation effort is usually best invested. But it gets frustrating for translators, at times. I remember the long years it took before `make' translations could start to work, for example. `bison' was not immediate either. And even now, `diffutils' and `bash' are not settled, while translations for those have existed for years. I thought that the `tar'/`cpio' saga was over, but it seems it has to be restarted, for reasons some of you might know :-). If we can get Python itself to be internationalised within a year, say, it would be good to publish its POT file now. But Python may also be seen as a package among others. Python could offer internationalisation methods for Python scripts, without being immediately internationalised itself.
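None of this machinery exists yet; purely as a sketch of the bookkeeping being proposed, with the module made explicit rather than guessed from the caller, and with load_catalog standing in for whatever would actually open the MO file:

    _domains = {}     # module name -> textual domain
    _catalogs = {}    # textual domain -> catalog object

    def translator(domain, module=None):
        # A module registers its domain as a side effect of asking for its
        # own translating function; help() can later look the domain up by
        # module name when translating doc strings.
        if module is not None:
            _domains[module] = domain
        if not _catalogs.has_key(domain):
            _catalogs[domain] = load_catalog(domain)   # hypothetical loader
        return _catalogs[domain].gettext

    def domain_of(module_name):
        # What a help() function would consult, defaulting to the domain
        # of the core distribution.
        return _domains.get(module_name, "python")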
-- François Pinard http://www.iro.umontreal.ca/~pinard

From pinard@iro.umontreal.ca Mon Sep 4 21:37:11 2000
From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=)
Date: 04 Sep 2000 16:37:11 -0400
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 19:52:29 +0200 (MET DST)"
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <200009041356.PAA01550@pandora.informatik.hu-berlin.de> <200009041752.TAA15488@pandora.informatik.hu-berlin.de>
Message-ID:

[Martin von Loewis]
> Even in the sv:de example, there is still a chance that neither the
> Swedish nor the German catalog has a translation, so the user would get
> three languages on her screen. I don't know anybody who'd prefer that
> over just falling back to English.

It is precisely because it was asked for that we did it. The idea did not come from us, but from users. I only know English and French, so this would not be useful to me. I guess most Americans know only one language, so their needs are even simpler than mine! But I gather that in Europe, many people have an extended culture, making me jealous (:-), and it is not uncommon for them to be comfortable with many languages. So, in a word, this specification for fall-backs is a service for the most cultured of our users. Let's admire them, and consider that they deserve it? :-)

-- François Pinard http://www.iro.umontreal.ca/~pinard

From loewis@informatik.hu-berlin.de Mon Sep 4 22:00:13 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 23:00:13 +0200 (MET DST)
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: <39B3FA5B.A8A4BF19@lemburg.com> (mal@lemburg.com)
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <14758.53732.459528.857102@anthem.concentric.net> <200009041400.QAA02532@pandora.informatik.hu-berlin.de> <200009041801.UAA15838@pandora.informatik.hu-berlin.de> <39B3EC47.D54483D2@lemburg.com> <200009041848.UAA17574@pandora.informatik.hu-berlin.de> <39B3FA5B.A8A4BF19@lemburg.com>
Message-ID: <200009042100.XAA22248@pandora.informatik.hu-berlin.de>

> > >     def _(s):
> > >         return unicode(s, "utf-8")

> That was just an example of how you could add the decoding
> functionality to the _ function.
>
> You would of course also add a gettext.gettext call
> somewhere in there which translates the string first
> (possibly recoding it to some other encoding for the
> table lookup first).

So it would be

    def _(s):
        return gettext.gettext(unicode(s, "utf-8"))

then??? There is no reason to do such a thing. First, you take a good UTF-8 string, transform it into a Unicode object; then gettext must encode the Unicode object into some byte string (possibly using UTF-8), as the msgids are stored as bytes on the disk (i.e. using some encoding). If you put UTF-8 in your source as msgid, you can *directly* invoke gettext, without needing to create a temporary Unicode object first. Even if there is some pragma utf-8 some day, it would still be more straightforward to write _("") than _(u"") as gettext would need some clue what byte encoding it needs to use, whereas the byte encoding is obvious in the first case.
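Written out under the assumption that both the Python source and the catalog msgids are UTF-8 byte strings (the literal below is just an example), the two call paths being contrasted compare like this:

    import gettext

    # Roundabout: decode the source string to a Unicode object, only to
    # have it re-encoded into bytes again for the catalog lookup.
    msg = gettext.gettext(unicode("caf\xc3\xa9", "utf-8").encode("utf-8"))

    # Direct: the UTF-8 byte string in the source already *is* the msgid
    # as it sits on disk, so it can be looked up as-is.
    msg = gettext.gettext("caf\xc3\xa9")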
Regards, Martin

From loewis@informatik.hu-berlin.de Mon Sep 4 22:23:34 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 23:23:34 +0200 (MET DST)
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: (message from =?ISO-8859-1?Q?Fran=E7ois?= Pinard on 04 Sep 2000 16:28:29 -0400)
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de>
Message-ID: <200009042123.XAA23066@pandora.informatik.hu-berlin.de>

> > > My feeling is that we should not rely on `_'. The variable used to hold
> > > the translating function should be left at the discretion of the user.
> >
> > Well, what else do you propose?
>
> Nothing special. I suggest that we systematically use `_' in the
> documentation and examples, but that we also avoid forcing the issue in
> any way from within Python. Let's use a function in `locale', say, to
> get a Translations instance (say) given the textual domain. The language
> to use could be obtained from environment variables, as well as the search
> path for the MO file, unless such things get overridden by keyword
> arguments.

That is indeed how the gettext.py API works: given a textual domain, you get a Translations instance, considering environment variables. The question is still how *doc* strings get translated, from outside the module. I.e. the help() function needs to determine what textual domain it is supposed to use when accessing the doc string of some object. The presence of some function expecting a textual domain does no good, as help() needs to find out what the textual domain is first.

> You ask the `locale' module to return you a Translations instance for
> that module, maybe through another keyword argument stating the name of
> the module for which you need a translator. This would only work if that
> module previously registered its domain name, by asking for the creation
> of a Translations instance for itself (and without specifying the keyword
> argument naming a module, of course).

I see. I doubt that *not* specifying the module name is acceptable, though - that locale function would need to know who its caller is. I feel doing that is too hacky to be accepted for the standard library.

> > So you propose that there be some kind of protocol to be observed by a
> > module that wants to make "its" textual domain known. What is that
> > protocol?
>
> Maybe, I do not know, something like:
>
>     _ = locale.translator(TEXTUAL_DOMAIN)
>
> after the overall doc string for the module, for each module?

I believe this would rather become

    _ = locale.translator(TEXTUAL_DOMAIN, module = __name__)

for the reason mentioned above. But yes, that might work. It would invalidate (or, rather, not support) prior art for binding _, though. Traditionally, Python programs (in GNOME specifically) do

    _ = gettext.gettext

With Barry's API, you do

    gettext.install(TEXTUAL_DOMAIN)

which puts _ into __builtins__, so individual modules won't even bind _ themselves.

> CNRI never confessed its true reasons, but now, we can tell it. If they
> made all that legalese noise and stuff, that was only a convoluted way
> to buy us more time for completing internationalisation specifications.
:-) :-)

> But it gets frustrating for translators, at times. I remember the long
> years it took before `make' translations could start to work

That's why I want to get some assurance that translations will indeed be used when done. I'd like to get some agreement on procedures among all interested people here, and I'd like to get some go-ahead from BeOpen that they'll consider including it when it's done. In any case, I'll push Python distributors and packagers (RedHat, Debian, ActiveState, ...) to include available catalogs even before they get into an official distribution. As they are plain data files and don't harm functionality, it's just a matter of file size to use them or to leave them. I also hope that the help module takes off, so that there is some convenient way to access the doc string translations.

> If we can get Python itself to be internationalised within a year, say,
> it would be good to publish its POT file now. But Python may also be seen
> as a package among others. Python could offer internationalisation methods
> for Python scripts, without being immediately internationalised itself.

That is certain. I believe Python 2 will be well-equipped already, having a gettext module, and xgettext and msgfmt utilities in 100% pure Python. For a full i18n process, only a msgmerge utility and a po-mode editor (in Tk?) would be missing. Of course, on many systems, GNU equivalents of these tools will be available now.

Regards, Martin

From pinard@iro.umontreal.ca Mon Sep 4 22:26:42 2000
From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=)
Date: 04 Sep 2000 17:26:42 -0400
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 20:13:57 +0200 (MET DST)"
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200009041342.PAA29915@pandora.informatik.hu-berlin.de> <200009041813.UAA16309@pandora.informatik.hu-berlin.de>
Message-ID:

[Martin von Loewis]
> > > In Python 2, unicode strings are a separate type from byte strings.
> > > The catalog objects will have two methods, one for retrieving a byte
> > > string, as it appears in the mo file, and one for retrieving a unicode
> > > string. It is then the application developer's choice whether his
> > > application can deal with Unicode messages on output or not.
> >
> > You are merely re-stating that there is a special API for Unicode, here.
> > I got this already! :-). My question is about why it is necessary.
>
> Which part do you deem unnecessary? The part returning a byte string,
> or the part returning a Unicode string?

Any part in which one has to make a distinction between both types of strings. Let's have the translator function returning a string. It is not important to know which kind of string. Python takes care of what needs care, anyway. It should be fairly transparent to the programmer, and our API should be just as transparent. Shouldn't it?

> So you are proposing that an application cannot tell in advance what
> the return type of _ will be? In some application, writing
>     header = '\x01\x01'
>     body = _('warning')
>     message = header + body

Perfect. No problem. Python will do something proper, whatever the type of string which `body' receives...

> I think it was decided not to include the JIS something tables in the
> Python 2 distribution, because they are too large to include.

Then, working with JIS translations would require that Japanese users fetch the JIS tables from other sources.
A script written for JIS will need such tables, wherever they come from. It would be nicer if Python was offering them, but... Hmph! :-)

> In Python 2.0, developers should be aware at all times whether they
> operate on Unicode strings or on byte strings. Python will try to do the
> right thing if there is a clear right thing, and try to raise exceptions
> whenever it is not so clear what the right thing would be.

I thought that every effort was made (at least for 1.6a1 and 1.6a2) so that developers should just _not_ be aware of the type of strings. Is 2.0 different? Or did I wholly miss the issue? It would make me sad... If I missed the issue, you may dismiss many things among what I wrote, as we are then not reasoning on the same grounds. If elegance has already been lost from the start, surely, there is no need for me to persist in trying to preserve it, and I'm a mere kibitzer :-(. Tell me before I make a fool of myself... Oh! It is too late already? :-)

> > If something else is needed on output, I thought the intent was to
> > override UTF-8 as an output encoding, yet still use Unicode internally,
> > instead of any MBCS, taking advantage of all the magic Python 2.0 will
> > have in that respect.
>
> Maybe it's a terminology issue: I consider UTF-8 an MBCS (multi-byte
> character set); UTF-8 strings are byte strings, not Unicode strings.

I thought that, by using some 8-bit API instead of some Unicode API for translation matters, you were intending to handle MBCS directly, all over, instead of relying on Unicode strings.

> > Otherwise, you have to make your Python script aware of those encodings
> > a lot more, and internationalisation becomes much more intrusive in your
> > sources, while we wanted it to be as lightweight as possible.
>
> I simply want to give users a choice. If they choose "let's try
> Unicode", they have the choice. If they find it all works, well.
> Otherwise, they can go for byte strings, with a different set of
> limitations.

Shouldn't we just have confidence that Python works? I would rather see programmers just using strings and then, playing interactively, or looking at their output, have a slight and momentary astonishment, saying: "Hey, things apparently turned Unicode at some point", be satisfied by the results anyway, and not bother much more about the issue. If we put unusual exceptions aside (like "English" translation, or the Netherlands), users' experience could be that things just happen to work in ASCII when no translation is requested, and just happen to use Unicode otherwise.

> > > Also, how would the goal language determine whether Unicode is a better
> > > representation for messages than some MBCS?
>
> I did not really ask for an opinion, I asked for an algorithm:
>     def mbcs_p(parameters):
>         your code here

If we get Unicode out of the translating routine, there should not be much more needed, except maybe a final encoding of the output stream. This, I feel we did not discuss enough yet (how to connect the translation function to the output stream encoding, as transparently as possible). But once again, maybe I missed so much of the whole point about Unicode and Python, that none of my remarks hold.
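For the "final encoding of the output stream" piece, one possibility, offered only as a sketch and not something the thread has settled, is to wrap the output stream once with a codecs StreamWriter, after which Unicode translations print without further ado:

    import sys, codecs

    encoding = "latin-1"   # however the application decides this

    # codecs.lookup returns (encoder, decoder, stream_reader, stream_writer);
    # the stream_writer class encodes Unicode on its way to the file.
    stream_writer = codecs.lookup(encoding)[3]
    sys.stdout = stream_writer(sys.stdout)

    print u'f\xf6n'        # now encodes to the chosen charset on output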
-- François Pinard http://www.iro.umontreal.ca/~pinard

From pinard@iro.umontreal.ca Mon Sep 4 22:32:54 2000
From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=)
Date: 04 Sep 2000 17:32:54 -0400
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 20:48:58 +0200 (MET DST)"
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <14758.53732.459528.857102@anthem.concentric.net> <200009041400.QAA02532@pandora.informatik.hu-berlin.de> <200009041801.UAA15838@pandora.informatik.hu-berlin.de> <39B3EC47.D54483D2@lemburg.com> <200009041848.UAA17574@pandora.informatik.hu-berlin.de>
Message-ID:

[Martin von Loewis]
> > You could wrap the decoding processing into the _ function:
> >     def _(s):
> >         return unicode(s, "utf-8")
> > This would allow you not only to use translatable strings,
> > but also any unicode string encoding you like, e.g. utf-8
> > or latin-1.
>
> Maybe I'm missing something here. How does the catalog come into play
> in this definition of _?

The conversion to Unicode strings would be done from within the translating function. This one might be a bound method of a class instance knowing a few things besides the textual domain. In particular, the instance would know the encoding to use from the PO file header, and so, the translating function should be able to do the proper conversion, transparently.

-- François Pinard http://www.iro.umontreal.ca/~pinard

From pinard@iro.umontreal.ca Mon Sep 4 22:37:35 2000
From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=)
Date: 04 Sep 2000 17:37:35 -0400
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: "M.-A. Lemburg"'s message of "Mon, 04 Sep 2000 20:39:03 +0200"
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <14758.53732.459528.857102@anthem.concentric.net> <200009041400.QAA02532@pandora.informatik.hu-berlin.de> <200009041801.UAA15838@pandora.informatik.hu-berlin.de> <39B3EC47.D54483D2@lemburg.com>
Message-ID:

[M.-A. Lemburg]
> Once the "declare" statement is in place you should also be able to write:
>     declare encoding = "utf-8"
>     ... u"utf-8 encoded string" ...
> in Python source code.

I'm not aware of that "declare" statement (or declaration?), but it sounds like it addresses a need. But if it exists as stated above, I predict for myself that I'll often forget the `u' prefix. :-).

Is that what you meant when saying that the programmer will have to be aware all the time whether they are using Unicode strings? It looks like it.

The POT extractors will have to be modified to know such conventions.

-- François Pinard http://www.iro.umontreal.ca/~pinard

From mal@lemburg.com Mon Sep 4 22:56:50 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 04 Sep 2000 23:56:50 +0200
Subject: [I18n-sig] Re: gettext in the standard library
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <14758.53732.459528.857102@anthem.concentric.net> <200009041400.QAA02532@pandora.informatik.hu-berlin.de> <200009041801.UAA15838@pandora.informatik.hu-berlin.de> <39B3EC47.D54483D2@lemburg.com>
Message-ID: <39B41AA2.A8454326@lemburg.com>

François Pinard wrote:
>
> [M.-A.
Lemburg]
>
> > Once the "declare" statement is in place you should also be able to write:
> >     declare encoding = "utf-8"
> >     ... u"utf-8 encoded string" ...
> > in Python source code.
>
> I'm not aware of that "declare" statement (or declaration?), but it sounds
> like it addresses a need. But if it exists as stated above, I predict for
> myself that I'll often forget the `u' prefix. :-).
>
> Is that what you meant when saying that the programmer will have to be
> aware all the time whether they are using Unicode strings? It looks like it.
>
> The POT extractors will have to be modified to know such conventions.

The "declare" statement will be a PEP for 2.1. Until then you'll have to stick to the _ function trick I posted to Martin. Note that you will still have to use the "u" string modifier to have the compiler trigger the conversion. There will probably also be a similar recoder for 8-bit string literals, but this will only work provided that the default encoding is set to something a little more capable than ASCII, e.g. utf-8 or latin-1.

-- Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From pinard@iro.umontreal.ca Mon Sep 4 23:32:59 2000
From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=)
Date: 04 Sep 2000 18:32:59 -0400
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 23:23:34 +0200 (MET DST)"
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <200009042123.XAA23066@pandora.informatik.hu-berlin.de>
Message-ID:

> [François Pinard]
> > You ask the `locale' module to return you a Translations instance for
> > that module, maybe through another keyword argument stating the name of
> > the module for which you need a translator. This would only work if that
> > module previously registered its domain name, by asking for the creation
> > of a Translations instance for itself (and without specifying the keyword
> > argument naming a module, of course).

[Martin von Loewis]
> I doubt that *not* specifying the module name is acceptable,
> though - that locale function would need to know who its caller is.
> I feel doing that is too hacky to be accepted for the standard library.
> [...] With Barry's API, you do
>     gettext.install(TEXTUAL_DOMAIN)
> which puts _ into __builtins__, so individual modules won't even bind
> _ themselves.

Hackery for hackery, I would prefer to see the function that creates the translating function seek out the calling module itself, as this would be really useful. As for the `gettext.install' function, it looks awkward. This would be the only case I know of, in the Python library, where a library function hacks a variable in the local name space. I do not doubt that it is clever, but cleverness alone does not make it attractive enough to look acceptable. I would suggest that we go without it. There is no need to have two ways of doing the same thing, with `gettext.install' being the questionable one.

> That is certain. I believe Python 2 will be well-equipped already,
> having a gettext module,

Yet, `gettext' is not an ideal name.
We should avoid using it, and avoid sticking too closely to the `gettext' API.

> and xgettext and msgfmt utilities in 100% pure Python.

Barry wrote `pygettext.py', but I'm not aware of any `msgfmt' program. The double hashing algorithm would have to be known for it to exist, and it would then not be a legalistic problem for the MO file reader.

> For a full i18n process, only a msgmerge utility and a po-mode editor
> (in Tk?) would be missing.

I'm starving to find some time for looking at Pango, but from the little I read about it, it looks especially promising as a basis for a rewritten PO mode.

-- François Pinard http://www.iro.umontreal.ca/~pinard

From martin@loewis.home.cs.tu-berlin.de Mon Sep 4 23:31:25 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Tue, 5 Sep 2000 00:31:25 +0200
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: (message from =?ISO-8859-1?Q?Fran=E7ois?= Pinard on 04 Sep 2000 17:26:42 -0400)
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200009041342.PAA29915@pandora.informatik.hu-berlin.de> <200009041813.UAA16309@pandora.informatik.hu-berlin.de>
Message-ID: <200009042231.AAA00904@loewis.home.cs.tu-berlin.de>

> Any part in which one has to make a distinction between both types of
> strings. Let's have the translator function returning a string.

In the specific implementation that is in Python 2.0, which kind of string should it return? It has to make a choice; just saying "I don't care" is a bad basis for an algorithm.

> It is not important to know which kind of string. Python takes care
> of what needs care, anyway.

No, it doesn't. It will in some cases, but won't in others.

> It should be fairly transparent to the programmer, and our API
> should be just as transparent. Shouldn't it?

It should, but I feel it isn't.

> >     header = '\x01\x01'
> >     body = _('warning')
> >     message = header + body
>
> Perfect. No problem. Python will do something proper, whatever the type
> of string which `body' receives...

>>> header = '\xFF\x01'
>>> body = u'warning'
>>> message = header + body
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII decoding error: ordinal not in range(128)

Is that proper? Is it what the user expected? If not, how should the user modify her code so it does what she wanted?

> I thought that every effort was made (at least for 1.6a1 and 1.6a2) so
> that developers should just _not_ be aware of the type of strings. Is 2.0
> different?

No, 2.0 is just the same as 1.6 in that area. I suggest you play around with the Unicode type somewhat before recommending that API functions should blindly return it...

> If I missed the issue, you may dismiss many things among what I wrote,
> as we are then not reasoning on the same grounds.

I don't know whether there is an issue. There are a number of cases where mixing byte strings and Unicode strings will cause runtime errors; it is not (and IMO shouldn't be) totally transparent.

> Shouldn't we just have confidence that Python works?

Well, I think I know how it works, and I believe that developers need to be fully aware of Unicode vs byte strings. They can still employ elegance where available, but I promise that handing out randomly either byte or Unicode strings will result in complaints.

> If we get Unicode out of the translating routine, there should not be much
> more needed, except maybe a final encoding of the output stream.
This,
> I feel we did not discuss enough yet (how to connect the translation
> function to the output stream encoding, as transparently as possible).

Indeed, this is the crucial issue. Unfortunately, we don't know how users would emit the messages. I know that passing them to Tkinter works well for Unicode strings, and I know passing byte strings to stdout works well. Other combinations don't work as well:

mira% echo $LANG
de_DE.ISO-8859-1
mira% python
Python 2.0b1 (#31, Aug 31 2000, 23:36:28) [GCC 2.95.2 19991024 (release)] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
Copyright 1995-2000 Corporation for National Research Initiatives (CNRI)
>>> unicode('fön','latin-1')
u'f\366n'
>>> print unicode('fön','latin-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)

So I'd rather not return a Unicode string representing an error message from gettext: the user expecting an error message may be surprised by the totally unrelated UnicodeError.

Regards, Martin

From martin@loewis.home.cs.tu-berlin.de Mon Sep 4 23:48:40 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Tue, 5 Sep 2000 00:48:40 +0200
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: (message from =?ISO-8859-1?Q?Fran=E7ois?= Pinard on 04 Sep 2000 18:32:59 -0400)
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <200009042123.XAA23066@pandora.informatik.hu-berlin.de>
Message-ID: <200009042248.AAA01054@loewis.home.cs.tu-berlin.de>

> Barry wrote `pygettext.py', but I'm not aware of any `msgfmt' program.

I'm aware of one, as I wrote it :-) See Tools/i18n/msgfmt.py in the Python CVS, or any upcoming 2.0b1 snapshot.

> The double hashing algorithm would have to be known for it to exist,
> and it would then not be a legalistic problem for the MO file reader.

This implementation of msgfmt does not generate the hash table, which, according to the GNU gettext manual, is a conforming implementation.

Regards, Martin

From pinard@iro.umontreal.ca Tue Sep 5 00:59:32 2000
From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=)
Date: 04 Sep 2000 19:59:32 -0400
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: "Martin v. Loewis"'s message of "Tue, 5 Sep 2000 00:48:40 +0200"
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <200009042123.XAA23066@pandora.informatik.hu-berlin.de> <200009042248.AAA01054@loewis.home.cs.tu-berlin.de>
Message-ID:

[Martin v. Loewis]
> > Barry wrote `pygettext.py', but I'm not aware of any `msgfmt' program.
>
> I'm aware of one, as I wrote it :-) See Tools/i18n/msgfmt.py in the
> Python CVS, or any upcoming 2.0b1 snapshot.

Thanks.

> > The double hashing algorithm would have to be known for it to exist,
> > and it would then not be a legalistic problem for the MO file reader.
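For orientation, the MO layout under discussion is small enough to sketch a reader for; like msgfmt.py, this sketch ignores the hash table entirely, and it assumes a well-formed file (error handling omitted):

    import struct

    def read_mo(filename):
        data = open(filename, "rb").read()
        # Header: magic, revision, number of strings N, offset of the
        # msgid table, offset of the msgstr table (hash fields skipped).
        if struct.unpack("<I", data[:4])[0] == 0x950412deL:
            order = "<"            # little-endian file
        else:
            order = ">"            # big-endian file (magic reads reversed)
        n, idoff, stroff = struct.unpack(order + "3I", data[8:20])
        catalog = {}
        for i in range(n):
            # Each table entry is a (length, offset) pair of 4-byte ints.
            ilen, ioff = struct.unpack(order + "2I",
                                       data[idoff + 8*i : idoff + 8*i + 8])
            slen, soff = struct.unpack(order + "2I",
                                       data[stroff + 8*i : stroff + 8*i + 8])
            catalog[data[ioff : ioff + ilen]] = data[soff : soff + slen]
        return catalog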
> This implementation of msgfmt does not generate the hash table, which,
> according to the GNU gettext manual, is a conforming implementation.

I wrote most of that manual, and I do not remember that :-). But it was quite a while ago, and we discussed _so_ many things at the time...

-- François Pinard http://www.iro.umontreal.ca/~pinard

From pinard@iro.umontreal.ca Tue Sep 5 01:44:12 2000
From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=)
Date: 04 Sep 2000 20:44:12 -0400
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: "Martin v. Loewis"'s message of "Tue, 5 Sep 2000 00:31:25 +0200"
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200009041342.PAA29915@pandora.informatik.hu-berlin.de> <200009041813.UAA16309@pandora.informatik.hu-berlin.de> <200009042231.AAA00904@loewis.home.cs.tu-berlin.de>
Message-ID:

[Martin v. Loewis]
> > Python takes care of what needs care, anyway.
>
> No, it doesn't. It will in some cases, but won't in others.
>
> > It should be fairly transparent to the programmer, and our API
> > should be just as transparent. Shouldn't it?
>
> It should, but I feel it isn't.

OK. My good prejudice for Unicode support in Python was a bit exaggerated, then.

> >>> header = '\xFF\x01'
> >>> body = u'warning'
> >>> message = header + body
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeError: ASCII decoding error: ordinal not in range(128)
> Is that proper?

Sounds proper to me.

> Is it what the user expected? If not, how should the user modify her
> code so it does what she wanted?

I do not know what the user wanted, so I cannot say how to modify the code. If she wants to play with bits and bytes, rather than strings, she would have to make explicit the conversions she wants. Python cannot guess them.

> I suggest you play around with the Unicode type somewhat before
> recommending that API functions should blindly return it...

Oh, I should surely read and try a lot more before saying anything. I was invited into this discussion only recently. Having today as a deadline did not give me enough time to be as careful as I usually like to be. So, I merely tried contributing my best given the circumstances, with my limited experience and knowledge. I think it was better that I risk a few suggestions and opinions, than stay silent and regret having said nothing. I hope I have been a bit useful, somewhat, despite all the noise I made :-).

> So I'd rather not return a Unicode string representing an error message
> from gettext: the user expecting an error message may be surprised by
> the totally unrelated UnicodeError.

I would have hoped that one could merely replace STRING by _(STRING), and get a working program. If I read you correctly, you say that it has more chance to work _if_ we avoid the Unicode string route, and mimic what we dumbly do in C.

Instead of:

    _ = locale.translator(DOMAIN)

could we have:

    _, _u = locale.translator(DOMAIN)

and use _(TEXT) or _u(TEXT) for the flat byte string out of the PO file, or the string converted to a Unicode string from the PO `msgstr' encoding? Or maybe:

    _, _e = locale.translator(DOMAIN)

with the above _u(TEXT) being rather written unicode(_(TEXT), _e) ? Or maybe even:

    _, _e, _u = locale.translator(DOMAIN)

But I'm not sure I like any of these things. Maybe nicer would be that `_` is the class instance itself, with a __call__ method for implementing _(TEXT).
One could then use _.charset or such to get then `msgstr' encoding, and the convenience: _.unicode(TEXT) would be equivalent to: unicode(_(TEXT), _.charset) Better ideas? I am still under the shock! :-) -- François Pinard http://www.iro.umontreal.ca/~pinard From pinard@iro.umontreal.ca Tue Sep 5 02:16:45 2000 From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=) Date: 04 Sep 2000 21:16:45 -0400 Subject: [I18n-sig] Re: Translating doc strings In-Reply-To: Martin v. Loewis martin@loewis.home.cs.tu-berlin.de's message of "Fri, 1 Sep 2000 09:17:34 +0200" Message-ID: [martin@loewis.home.cs.tu-berlin.de] > With that approach, the next question is: What is the name of the textual > domain, and how are translation managed? My proposal was "pylib"; Barry's > "docstring". Why not merely "python"? > As for management of translations, I'd like to ask the Free Translation > Project for help. As soon as we've settled the technical issues, I'd > like to submit a catalog for translation. You will be quite welcome, and have an accomplice within! :-) When you will feel that the time is proper, just write to me again. You may browse `http://www.iro.umontreal.ca/contrib/po/HTML/maintainers.html' if you want to know the questions we usually need answered, and I may open the translation domain as soon as the textual domain name is decided. -- François Pinard http://www.iro.umontreal.ca/~pinard From martin@loewis.home.cs.tu-berlin.de Tue Sep 5 07:44:44 2000 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 5 Sep 2000 08:44:44 +0200 Subject: [I18n-sig] Re: gettext in the standard library In-Reply-To: (message from =?ISO-8859-1?Q?Fran=E7ois?= Pinard on 04 Sep 2000 20:44:12 -0400) References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200009041342.PAA29915@pandora.informatik.hu-berlin.de> <200009041813.UAA16309@pandora.informatik.hu-berlin.de> <200009042231.AAA00904@loewis.home.cs.tu-berlin.de> Message-ID: <200009050644.IAA00730@loewis.home.cs.tu-berlin.de> > Instead of: > > _ = locale.translator(DOMAIN) > > could we have: > > _, _u = locale.translator(DOMAIN) Currently, you would write _ = locale.translator(DOMAIN).gettext for the first one, and cat = locale.translator(DOMAIN) _, _u = cat.gettext, cat.ugettext for the second one. However, I doubt many users would need both methods. They either trust that their output channels are unicode-safe or they don't. I'd even emagine cases where they do def _(msg): return cat.ugettext(msg).encode("utf-8") so they get UTF-8 even if the catalog uses some different encoding; that may be useful when they write to log files. Of course, in that case, they should really write logfile = codecs.open("logfilename","w",encoding="utf-8") to get a unicode-safe output channel. > Or maybe: > > _, _e = locale.translator(DOMAIN) That would be _e = cat.charset() > But I'm not sure I like any of these things. Maybe nicer would be > that `_` is the class instance itself, with a __call__ method for > implementing _(TEXT). One could then use _.charset or such to get > then `msgstr' encoding Maybe it's not nicer. __call__ is typically used to hide the fact that something is an instance object, so users can treat it as if it was a function. Now, if you say that users need to be aware that it is indeed an instance (since it exposes additional methods), they also need to understand how __call__ works for these instances. 
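[Putting the two halves together -- the catalog object with bound
methods from Martin's reply, and François's _.charset / _.unicode
convenience -- something like this could be built on the gettext module
as it stands. A sketch only: the Translator name is made up, and a real
version would take the charset from the catalog's PO header instead of
a constructor default.]

    import gettext

    class Translator:
        def __init__(self, domain, charset='iso-8859-1'):
            try:
                self._catalog = gettext.translation(domain)
            except IOError:
                # no catalog installed for this domain
                self._catalog = gettext.NullTranslations()
            self.charset = charset

        def __call__(self, message):
            # _(TEXT): the flat byte string out of the .mo file
            return self._catalog.gettext(message)

        def unicode(self, message):
            # _.unicode(TEXT) == unicode(_(TEXT), _.charset)
            return unicode(self(message), self.charset)

    _ = Translator('python')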
From martin@loewis.home.cs.tu-berlin.de  Tue Sep  5 07:51:16 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Tue, 5 Sep 2000 08:51:16 +0200
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: (message from =?ISO-8859-1?Q?Fran=E7ois?= Pinard on 04 Sep 2000 19:59:32 -0400)
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <200009042123.XAA23066@pandora.informatik.hu-berlin.de> <200009042248.AAA01054@loewis.home.cs.tu-berlin.de>
Message-ID: <200009050651.IAA00782@loewis.home.cs.tu-berlin.de>

> I wrote most of that manual, and I do not remember that :-). But it was
> quite a while ago, and we discussed _so_ many things at the time...

So I guess it's your text I'm quoting here:

    The size S of the hash table can be zero. In this case, the hash
    table itself is not contained in the MO file. Some people might
    prefer this because a precomputed hashing table takes disk space,
    and does not win *that* much speed.

I actually preferred this 'cause it's easier to implement :-)

It's interesting to notice that the description of the mo file format
needs 126 lines of English text, and that the implementation of a
generator needs only 194 lines of text (of which only 122 contain
actual Python code).

Regards,
Martin

From tdickenson@geminidataloggers.com  Tue Sep  5 12:19:42 2000
From: tdickenson@geminidataloggers.com (Toby Dickenson)
Date: Tue, 05 Sep 2000 12:19:42 +0100
Subject: [I18n-sig] ustr
In-Reply-To: <200007071244.HAA03694@cj20424-a.reston1.va.home.com>
References: <3965BBE5.D67DD838@lemburg.com> <200007071244.HAA03694@cj20424-a.reston1.va.home.com>
Message-ID: 

On Fri, 07 Jul 2000 07:44:03 -0500, Guido van Rossum wrote:

We debated a ustr function in July. Does anyone have this in hand? I
can prepare a patch if necessary.

>> Toby Dickenson wrote:
>>
>> > I'm just nearing the end of getting Zope to play well with unicode
>> > data. Most of the changes involved replacing a call to str, in
>> > situations where either a unicode or narrow string would be
>> > acceptable.
>>
>> > My best alternative is:
>>
>> >     def convert_to_something_stringlike(x):
>> >         if type(x) == type(u''):
>> >             return x
>> >         else:
>> >             return str(x)
>>
>> > This seems like a fundamental operation - would it be worth having
>> > something similar in the standard library?
>
> Marc-Andre Lemburg replied:
>
>> You mean: for Unicode return Unicode and for everything else
>> return strings ?
>>
>> It doesn't fit well with the builtins str() and unicode(). I'd
>> say, make this a userland helper.
>
> I think this would be helpful to have in the std library. Note that
> in JPython, you'd already use str() for this, and in Python 3000 this
> may also be the case. At some point in the design discussion for the
> current Unicode support we also thought that we wanted str() to do
> this (i.e. allow 8-bit and Unicode string returns), until we realized
> that there were too many places that would be very unhappy if str()
> returned a Unicode string!
>
> The problem is similar to a situation you have with numbers: sometimes
> you want a coercion that converts everything to float except it should
> leave complex numbers complex. In other words it coerces up to float
> but it never coerces down to float. Luckily you can write that as
> "x+0.0", which converts int and long to float with the same value
> while leaving complex alone.
>
> For strings there is no compact notation like "+0.0" if you want to
> convert to string or Unicode -- adding "" might work in Perl, but not
> in Python.
>
> I propose ustr(x) with the semantics given by Toby. Class support (an
> __ustr__ method, with fallbacks on __str__ and __unicode__) would also
> be handy.

Toby Dickenson
tdickenson@geminidataloggers.com

From bwarsaw@beopen.com  Tue Sep  5 20:00:27 2000
From: bwarsaw@beopen.com (Barry A. Warsaw)
Date: Tue, 5 Sep 2000 15:00:27 -0400 (EDT)
Subject: [I18n-sig] Terminology gap
References: <39AE5F20.E68BC43F@lemburg.com>
Message-ID: <14773.17099.703161.266580@anthem.concentric.net>

>>>>> "AR" == Andy Robinson writes:

  AR> I agree with MAL. "string" should refer to an interface; people
  AR> doing i18n stuff could then write their own ones in future if
  AR> needed. I cannot get at CVS this week, but I think we actually
  AR> checked in a UserString class into the standard library in order
  AR> to clearly define the interface for string-like objects.

The answer to that is yes, UserString.py is in the standard
distribution. It actually defines a UserString class, with the basic
interface, and a MutableString class which will mutate in place but
can't be used as a dictionary key.

-Barry

From bwarsaw@beopen.com  Wed Sep  6 04:36:37 2000
From: bwarsaw@beopen.com (Barry A. Warsaw)
Date: Tue, 5 Sep 2000 23:36:37 -0400 (EDT)
Subject: [I18n-sig] Re: Patch 101320: doc strings
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de>
Message-ID: <14773.48069.943223.602061@anthem.concentric.net>

Just to follow up on some ideas:

  FP> If it was systematic that `_' was assigned to, we could try to
  FP> retrieve the function stored in the `_' global variable of
  FP> `httplib', and then use it to translate any docstring from
  FP> httplib. However, it would be nicer if the constraint of using
  FP> `_' for the translating function did not exist, and if it was
  FP> rather completely left at the discretion of the programmer. If
  FP> we use `_' systematically in documentation examples we produce,
  FP> it is likely to become the popular choice, but let's avoid
  FP> mandating it.

Here's another suggestion. I'm not sure I like it but here goes
anyway.

Say we had an import hook that isn't installed by default (for Python
environments that don't care at all about i18n). If this import hook
is installed, though, it interposes a little extra functionality
whenever a module is imported for the first time.

What this hook does is import the module, then look to see if the
module has a '__domain__' attribute set. If it does, then the importer
uses that textual domain for that module's translations, locating the
.mo file using the "standard lookup algorithm". If __domain__ is not
set, then if the module's name can be determined, the import hook
tries to use that textual domain. If that can't be found, it falls
back on the textual domain "python".
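[Something like the following wrapper around the built-in import
machinery could implement that lookup chain. A sketch only: the
attribute poking and the identity fallback are assumptions, and a real
hook would have to deal with packages and with modules that define
their own _().]

    import __builtin__
    import gettext

    _real_import = __builtin__.__import__

    def _i18n_import(name, globals=None, locals=None, fromlist=None):
        module = _real_import(name, globals, locals, fromlist)
        # Try an explicit __domain__ first, then the module's own name,
        # then fall back on the catch-all "python" domain.
        domains = [getattr(module, '__domain__', None),
                   module.__name__, 'python']
        for domain in filter(None, domains):
            try:
                module._ = gettext.translation(domain).gettext
                break
            except IOError:
                pass            # no catalog for this domain; keep looking
        else:
            module._ = lambda message: message   # no catalog at all
        return module

    __builtin__.__import__ = _i18n_import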
So we can generate a big .po file containing the entirety of the core
libraries, but we can override individual modules as needed. This would
also work with 3rd party libraries, since the same import hook would
run when they are imported.

Caveats:

- Using the module's name as the textual domain may create conflicts.
  E.g. mypackage.foo.datetime and yourpackage.bar.datetime. One
  possible resolution is to first try the fully qualified name with
  period->underscore substitutions. If that isn't found, fall back to
  the rightmost module name.

- This is a lot of disk statting to do all these searches. And because
  the import hook will be written in Python, it means that i18n'd
  applications will all import much more slowly.

- I think it's still tricky to get modules to play nice, especially if
  you want to handle the situation where a Python user doesn't know
  about or care about i18n. How would you define a module's _()
  function to work in both cases? Would the import hook poke a new _()
  function into the module namespace, or perhaps delete one it finds
  there, assuming the one in builtins will still be there?

Maybe it's a dumb idea anyway.
-Barry

From bwarsaw@beopen.com  Wed Sep  6 04:43:25 2000
From: bwarsaw@beopen.com (Barry A. Warsaw)
Date: Tue, 5 Sep 2000 23:43:25 -0400 (EDT)
Subject: [I18n-sig] Re: Patch 101320: doc strings
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de>
Message-ID: <14773.48477.738019.157702@anthem.concentric.net>

>>>>> "MvL" == Martin von Loewis writes:

  MvL> Maybe my logic is somewhat flawed: - Did you agree that doc
  MvL> strings of a module should use the same domain as all other
  MvL> strings of the module?

Yes.

  MvL> - Did you propose that a single package, distributed as a
  MvL> whole, should have a single textual domain?

Yes.

  MvL> - Do you agree that the Python core+libs is a single package?

Not sure. I think you and Francois do, so I'll defer. One issue is for
3rd party modules, and for modules that migrate into the core. At the
very least, 3rd party modules will /not/ be in the "python" domain,
but if they are migrated into the core, that may change.

If I distribute a module independently, say using distutils, then I'm
going to want to mark the translatable strings, and possibly
distribute a .po file for my module. In that sense my single module is
a single package.

  MvL> I've given up on having message catalogs in the Python 2.0
  MvL> distribution. Since there is no point in having the catalog
  MvL> without any translations, this is not so urgent. What *is*
  MvL> urgent is to give the catalog to the translators.

I sent out a message about a file system layout for including the
files in the nondist tree of the CVS repository. Did you read that
message Martin? What did you think? Guido's amenable to that solution
for now.

-Barry

From bwarsaw@beopen.com  Wed Sep  6 04:47:19 2000
From: bwarsaw@beopen.com (Barry A. Warsaw)
Date: Tue, 5 Sep 2000 23:47:19 -0400 (EDT)
Subject: [I18n-sig] Re: Patch 101320: doc strings
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <200009042123.XAA23066@pandora.informatik.hu-berlin.de>
Message-ID: <14773.48711.148352.627828@anthem.concentric.net>

>>>>> "FP" == François Pinard writes:

  FP> As for the `gettext.install' function, it looks awkward. This
  FP> would be the only case I know, in the Python library, where a
  FP> library function hacks a variable in the local name space.

It doesn't. gettext.install() hacks the __builtin__ module's
namespace, which is the last namespace searched after locals and
globals. So if a module defines _(), that definition will override the
one put in __builtin__ by gettext.install().

-Barry

From bwarsaw@beopen.com  Wed Sep  6 04:53:23 2000
From: bwarsaw@beopen.com (Barry A. Warsaw)
Date: Tue, 5 Sep 2000 23:53:23 -0400 (EDT)
Subject: [I18n-sig] Re: Patch 101320: doc strings
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <200009041311.PAA27712@pandora.informatik.hu-berlin.de>
Message-ID: <14773.49075.60305.10297@anthem.concentric.net>

>>>>> "MvL" == Martin von Loewis writes:

  MvL> I don't see how this could work for doc strings of classes,
  MvL> methods and functions. Do you propose to write
  MvL> def foo():
  | _("This does the foo thing.")
  | pass
  MvL> That won't work; the parser won't recognize it as a doc string.

Martin's right. Fortunately docstrings are rarely used by the program
itself (they are mostly used by outside tools, like help()/doc() or
IDE's or interactive interpreters).

One place a docstring /is/ used by the program and needs to be
translated is for script help messages. In most of the executable
scripts I write, the file's docstring is the usage text, and I include
a function that prints the global __doc__. If that first string in the
file is wrapped in _('') it won't be a docstring. If it isn't wrapped,
it won't be translated. Two solutions: either the extractor needs to
be smarter (and xpot currently is, but pygettext isn't), or you can
hack around it like so:

    #! /usr/bin/env python
    __doc__ = _("blech, my module doc string")

A second place I've used class docstrings inside a program is to write
the error messages for exception classes as the class's docstring.
This can be done in other ways, but also either solution above would
work.

-Barry

From bwarsaw@beopen.com  Wed Sep  6 04:57:48 2000
From: bwarsaw@beopen.com (Barry A. Warsaw)
Date: Tue, 5 Sep 2000 23:57:48 -0400 (EDT)
Subject: [I18n-sig] Re: Patch 101320: doc strings
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <200009041329.PAA28928@pandora.informatik.hu-berlin.de>
Message-ID: <14773.49340.983150.382453@anthem.concentric.net>

>>>>> "MvL" == Martin von Loewis writes:

  MvL> If you load the hash tables, does this give enough information
  MvL> so that you can use two seek(2) calls only; on average? If so,
  MvL> it would probably be good if there was a) documentation for the
  MvL> hash table format, and/or b) an implementation of it in Python.

Documentation, please!
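[The header, at least, is easy to pick apart in a few lines. This
sketch follows the field layout described in the GNU gettext manual --
it is not the msgfmt.py or gettext.py code. Note that the hash table
size, field six, may legitimately be zero, as quoted earlier in the
thread.]

    import struct

    def mo_header(filename):
        # First 28 bytes: magic, file format revision, number of
        # strings, offset of the original-string table, offset of the
        # translation table, size S of the hash table (may be zero),
        # and offset of the hash table.
        data = open(filename, 'rb').read(28)
        magic = struct.unpack('<I', data[:4])[0]
        if magic == 0x950412deL:
            fmt = '<7I'            # little-endian MO file
        elif magic == 0xde120495L:
            fmt = '>7I'            # big-endian MO file
        else:
            raise ValueError('not a GNU .mo file')
        return struct.unpack(fmt, data)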
If so, it would be probably good if there was a) MvL> documentation for the hash table format, and/or b) an MvL> implementation of it in Python. Documentation, please! MvL> I'm certain it will take some time to get translations back, MvL> so if we want to have something in the next release (after MvL> 2.0), we should start today. I'd still like to investigate using distutils as the standard way to distribute the .mo files. -Barry From bwarsaw@beopen.com Wed Sep 6 05:03:03 2000 From: bwarsaw@beopen.com (Barry A. Warsaw) Date: Wed, 6 Sep 2000 00:03:03 -0400 (EDT) Subject: [I18n-sig] Re: Translating doc strings References: Message-ID: <14773.49655.533135.622916@anthem.concentric.net> >>>>> "FP" == writes: >> With that approach, the next question is: What is the name of >> the textual domain, and how are translation managed? My >> proposal was "pylib"; Barry's "docstring". FP> Why not merely "python"? I like it. If we are to go with a single translation file for the entire library (and I think I've now agreed with you on that :), then "python" is better as the textual domain. -Barry From pinard@iro.umontreal.ca Wed Sep 6 05:19:40 2000 From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=) Date: 06 Sep 2000 00:19:40 -0400 Subject: [I18n-sig] Re: Patch 101320: doc strings In-Reply-To: bwarsaw@beopen.com's message of "Tue, 5 Sep 2000 23:47:19 -0400 (EDT)" References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <200009042123.XAA23066@pandora.informatik.hu-berlin.de> <14773.48711.148352.627828@anthem.concentric.net> Message-ID: [Barry A. Warsaw] > >>>>> "FP" == ISO writes: > FP> As for `gettext.install' function, it looks awkward. This > FP> would be the only case I know, in the Python library, where a > FP> library function hacks a variable in the local name space. > It doesn't. gettext.install() hacks the __builtin__ module's > namespace, which is the last namespace search after locals and > globals. So if a module defines _(), that definition will override > the one put in __builtin__ by gettext.install(). Bizarrier and bizarrier! :-) What is the purpose of installing a definition of _() just meant to be overriden? It should not make sense for any module to use _() without defining it, as this is the way to associate that module to a textual domain. Each module ought make this association separately. -- François Pinard http://www.iro.umontreal.ca/~pinard From bwarsaw@beopen.com Wed Sep 6 05:34:31 2000 From: bwarsaw@beopen.com (Barry A. 
Warsaw) Date: Wed, 6 Sep 2000 00:34:31 -0400 (EDT) Subject: [I18n-sig] Re: Patch 101320: doc strings References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <200009042123.XAA23066@pandora.informatik.hu-berlin.de> <14773.48711.148352.627828@anthem.concentric.net> Message-ID: <14773.51543.773354.942958@anthem.concentric.net> >>>>> "I" == ISO writes: FP> What is the purpose of installing a definition of _() just FP> meant to be overriden? It should not make sense for any FP> module to use _() without defining it, as this is the way to FP> associate that module to a textual domain. Each module ought FP> make this association separately. Agreed, for modules. The documentation even recommends that modules never install(). gettext.install() is for application that have their own global text domains. You don't want to have to define _() in every file in the application. -Barry From just@letterror.com Wed Sep 6 07:57:37 2000 From: just@letterror.com (Just van Rossum) Date: Wed, 6 Sep 2000 07:57:37 +0100 Subject: [I18n-sig] Re: Patch 101320: doc strings In-Reply-To: <14773.51543.773354.942958@anthem.concentric.net> References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <200009042123.XAA23066@pandora.informatik.hu-berlin.de> <14773.48711.148352.627828@anthem.concentric.net> Message-ID: At 12:34 AM -0400 06-09-2000, Barry A. Warsaw wrote: >>>>>> "I" == ISO writes: > > FP> What is the purpose of installing a definition of _() just > FP> meant to be overriden? It should not make sense for any > FP> module to use _() without defining it, as this is the way to > FP> associate that module to a textual domain. Each module ought > FP> make this association separately. > >Agreed, for modules. The documentation even recommends that modules >never install(). > >gettext.install() is for application that have their own global text >domains. You don't want to have to define _() in every file in the >application. If such an application also uses exec in combination with compile(src, "", "single") (ie. wants to offer an interactive Python window), it's pretty much screwed, as this also uses __builtins__._... Just From martin@loewis.home.cs.tu-berlin.de Wed Sep 6 07:29:58 2000 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
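[Just's point is easy to demonstrate: code compiled in "single" mode
stores each expression result in __builtin__._, on top of whatever
install() put there. A contrived sketch -- the direct assignment below
stands in for gettext.install(), so it runs without any catalog
installed.]

    import __builtin__
    import gettext

    # roughly what gettext.install() does, minus the catalog lookup:
    __builtin__._ = gettext.NullTranslations().gettext
    print callable(__builtin__._)     # 1: _ is the translation method

    # what an interactive window does with each typed expression
    # (this also echoes the value, 42, to stdout):
    exec compile('40 + 2', '<input>', 'single')

    print callable(__builtin__._)     # 0: _ has been rebound to 42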
From martin@loewis.home.cs.tu-berlin.de  Wed Sep  6 07:29:58 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Wed, 6 Sep 2000 08:29:58 +0200
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: <14773.48477.738019.157702@anthem.concentric.net> (bwarsaw@beopen.com)
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <14773.48477.738019.157702@anthem.concentric.net>
Message-ID: <200009060629.IAA00884@loewis.home.cs.tu-berlin.de>

> Not sure. I think you and Francois do, so I'll defer. One issue is
> for 3rd party modules, and for modules that migrate into the core. At
> the very least, 3rd party modules will /not/ be in the "python"
> domain, but if they are migrated into the core, that may change.

Indeed, having a single textual domain for all extensions would not be
feasible; they certainly will have their own domain.

> I sent out a message about a file system layout for including the
> files in the nondist tree of the CVS repository. Did you read that
> message Martin?

Just to repeat the proposal here, it was

    nondist/i18n/
        po/
            docstrings.pot
            docstrings-de.po
        de/LC_MESSAGES/
            docstrings.mo

> What did you think?

I'd do (replace-regexp "docstrings" "python") now, but apart from
that: sounds good to me. I'll extract the strings from the official
2.0b1, then try to create this structure.

Regards,
Martin

From kajiyama@grad.sccs.chukyo-u.ac.jp  Wed Sep  6 12:09:41 2000
From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA)
Date: Wed, 6 Sep 2000 20:09:41 +0900
Subject: [I18n-sig] JapaneseCodecs-1.0 released
Message-ID: <200009061109.UAA02662@dhcp198.grad.sccs.chukyo-u.ac.jp>

Hi,

I released JapaneseCodecs-1.0, the latest version of my Unicode codecs
for Japanese character encodings (EUC-JP and Shift_JIS). It is
available at the following location:

    http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/python/

The whole code is refined so as to follow the proposal version 1.6,
and some possible bugs are also fixed. In addition, the codecs are
packaged using Distutils so that installation should be quite easy
(special thanks to the Distutils developers).

The character mapping tables have remained unchanged; they do not
include vendor-specific characters. Performance issues have also been
left open. These need addressing in future work.

Regards,

-- 
KAJIYAMA, Tamito

From bwarsaw@beopen.com  Wed Sep  6 14:10:35 2000
From: bwarsaw@beopen.com (Barry A. Warsaw)
Date: Wed, 6 Sep 2000 09:10:35 -0400 (EDT)
Subject: [I18n-sig] Re: Patch 101320: doc strings
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <14773.48477.738019.157702@anthem.concentric.net> <200009060629.IAA00884@loewis.home.cs.tu-berlin.de>
Message-ID: <14774.16971.243296.903895@anthem.concentric.net>

>>>>> "MvL" == Martin v Loewis writes:

  MvL> Just to repeat the proposal here, it was
  | nondist/i18n/
  |     po/
  |         docstrings.pot
  |         docstrings-de.po
  |     de/LC_MESSAGES/
  |         docstrings.mo

  >> What did you think?

  MvL> I'd do (replace-regexp "docstrings" "python") now,

Yes.

  MvL> but apart from that: sounds good to me. I'll extract the
  MvL> strings from the official 2.0b1, then try to create this
  MvL> structure.

I've added the directory structure to nondist, so please do an update.
You now have checkin privs so feel free to add the .pot, .po, and .mo
files when you have them ready. Also, could you write up a short
README file for nondist/i18n? I don't have time right now.

Thanks,
-Barry

From pinard@iro.umontreal.ca  Wed Sep  6 16:05:00 2000
From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=)
Date: 06 Sep 2000 11:05:00 -0400
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: bwarsaw@beopen.com's message of "Wed, 6 Sep 2000 00:34:31 -0400 (EDT)"
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <200009042123.XAA23066@pandora.informatik.hu-berlin.de> <14773.48711.148352.627828@anthem.concentric.net> <14773.51543.773354.942958@anthem.concentric.net>
Message-ID: 

[Barry A. Warsaw]

> gettext.install() is for applications that have their own global text
> domains. You don't want to have to define _() in every file in the
> application.

I would. It is a simple habit to define _() after the docstring, for
modules needing it, and it might also be a safer habit when you move
modules around, something which is more natural in Python than in other
languages.

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard

From pinard@iro.umontreal.ca  Wed Sep  6 16:07:31 2000
From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=)
Date: 06 Sep 2000 11:07:31 -0400
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: "Martin v. Loewis"'s message of "Wed, 6 Sep 2000 08:29:58 +0200"
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <14773.48477.738019.157702@anthem.concentric.net> <200009060629.IAA00884@loewis.home.cs.tu-berlin.de>
Message-ID: 

[Martin v. Loewis]

> having a single textual domain for all extensions would not be
> feasible; they certainly will have their own domain.

You mean, for the Python distribution? What do you mean by `not
feasible' and `certainly'? I do not understand the need for splitting.
What is it?

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard

From martin@loewis.home.cs.tu-berlin.de  Wed Sep  6 19:37:46 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Wed, 6 Sep 2000 20:37:46 +0200
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: <14774.16971.243296.903895@anthem.concentric.net> (bwarsaw@beopen.com)
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <14773.48477.738019.157702@anthem.concentric.net> <200009060629.IAA00884@loewis.home.cs.tu-berlin.de> <14774.16971.243296.903895@anthem.concentric.net>
Message-ID: <200009061837.UAA00749@loewis.home.cs.tu-berlin.de>

> I've added the directory structure to nondist, so please do an
> update. You now have checkin privs so feel free to add the .pot,
> .po, and .mo files when you have them ready. Also, could you write up
> a short README file for nondist/i18n? I don't have time right now.

Sure will.

Regards,
Martin

From martin@loewis.home.cs.tu-berlin.de  Wed Sep  6 19:42:27 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Wed, 6 Sep 2000 20:42:27 +0200
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: (message from =?ISO-8859-1?Q?Fran=E7ois?= Pinard on 06 Sep 2000 11:07:31 -0400)
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <14773.48477.738019.157702@anthem.concentric.net> <200009060629.IAA00884@loewis.home.cs.tu-berlin.de>
Message-ID: <200009061842.UAA00796@loewis.home.cs.tu-berlin.de>

> > having a single textual domain for all extensions would not be
> > feasible; they certainly will have their own domain.
>
> You mean, for the Python distribution?

No, not for the Python distribution. For extensions to Python: pyqt,
gnome-python, NumPy.

> What do you mean by `not feasible'

It is not feasible that everybody writing a Python library submits her
doc strings to the Python maintainers for inclusion into the python
textual domain.

> and `certainly'?

If anybody writing a Python library can't use the python domain, he'll
certainly create his own one.

> I do not understand the need for splitting. What is it?

It is the same reason why there isn't a single coordinated domain for
all free software.

Regards,
Martin

From kajiyama@grad.sccs.chukyo-u.ac.jp  Thu Sep  7 14:13:51 2000
From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA)
Date: Thu, 7 Sep 2000 22:13:51 +0900
Subject: [I18n-sig] sys.(set|get)_string_encoding in 1.6
Message-ID: <200009071313.WAA05858@dhcp198.grad.sccs.chukyo-u.ac.jp>

Hi,

I found that sys.(get|set)defaultencoding() defined in the Unicode
proposal version 1.6 were implemented under the different names
sys.(get|set)_string_encoding() in the 1.6 final release. Is this an
intended change? If so, why is this incompatibility introduced?

Thanks,

-- 
KAJIYAMA, Tamito

From mal@lemburg.com  Thu Sep  7 17:30:11 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 07 Sep 2000 18:30:11 +0200
Subject: [I18n-sig] sys.(set|get)_string_encoding in 1.6
References: <200009071313.WAA05858@dhcp198.grad.sccs.chukyo-u.ac.jp>
Message-ID: <39B7C293.70FD7E8A@lemburg.com>

Tamito KAJIYAMA wrote:
>
> Hi,
>
> I found that sys.(get|set)defaultencoding() defined in the
> Unicode proposal version 1.6 were implemented under the different
> names sys.(get|set)_string_encoding() in the 1.6 final release.
> Is this an intended change? If so, why is this incompatibility
> introduced?

These APIs were first introduced as an experiment in the CVS tree
under the names you find in the 1.6 release. They were meant to
provide an easy way to experiment with different default encodings.

After some discussions on python-dev the outcome was to keep the APIs
for use by site.py to set a locale dependent default encoding.

This idea was then retracted some weeks later and replaced with the
now standard ASCII default encoding which you find in both 1.6 and
2.0.

So to answer your question: the sys APIs in 1.6 are to be considered
undocumented features and should *not* be used.

I haven't followed the 1.6 release too closely and didn't even realize
that these APIs made it into the release version... things were moving
much too fast at the time and I was busy with 2.0. Sorry :-/

Python 2.0 will have the sys APIs which are documented in the
Misc/unicode.txt file:

    getdefaultencoding() -> string

        Return the current default string encoding used by the
        Unicode implementation.

    setdefaultencoding(encoding)

        Set the current default string encoding used by the Unicode
        implementation. Only available in site.py.

Also see the disabled code in site.py for details on how to reenable
the locale dependent default encodings.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:       http://www.lemburg.com/
Python Pages:   http://www.lemburg.com/python/

From kajiyama@grad.sccs.chukyo-u.ac.jp  Thu Sep  7 17:59:18 2000
From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA)
Date: Fri, 8 Sep 2000 01:59:18 +0900
Subject: [I18n-sig] sys.(set|get)_string_encoding in 1.6
In-Reply-To: <39B7C293.70FD7E8A@lemburg.com> (mal@lemburg.com)
References: <39B7C293.70FD7E8A@lemburg.com>
Message-ID: <200009071659.BAA06179@dhcp198.grad.sccs.chukyo-u.ac.jp>

"M.-A. Lemburg" writes:
|
| Tamito KAJIYAMA wrote:
| >
| > I found that sys.(get|set)defaultencoding() defined in the
| > Unicode proposal version 1.6 were implemented under the different
| > names sys.(get|set)_string_encoding() in the 1.6 final release.
| > Is this an intended change? If so, why is this incompatibility
| > introduced?
|
| These APIs were first introduced as an experiment in the CVS tree
| under the names you find in the 1.6 release. They were meant to
| provide an easy way to experiment with different default encodings.
|
| After some discussions on python-dev the outcome was to keep the
| APIs for use by site.py to set a locale dependent default encoding.
|
| This idea was then retracted some weeks later and replaced with the
| now standard ASCII default encoding which you find in both 1.6 and
| 2.0.

I see.

| So to answer your question: the sys APIs in 1.6 are to be considered
| undocumented features and should *not* be used.

Then, is there no way to set/get the default encoding in 1.6?

-- 
KAJIYAMA, Tamito

From keichwa@gmx.net  Fri Sep  8 11:08:31 2000
From: keichwa@gmx.net (Karl Eichwalder)
Date: 08 Sep 2000 12:08:31 +0200
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: =?iso-8859-1?q?Fran=E7ois?= Pinard's message of "04 Sep 2000 16:37:11 -0400"
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <200009041356.PAA01550@pandora.informatik.hu-berlin.de> <200009041752.TAA15488@pandora.informatik.hu-berlin.de>
Message-ID: 

> [Martin von Loewis]
> > I don't know anybody who'd prefer that
> > over just falling back to English.

Yes, there are quite a few (I'm told by native speakers -- personally,
I'm not familiar with these languages):

    br:fr_FR        Breton - French (France)
    gl:es_ES:pt_PT  Galician - Spanish (Spain) - Portuguese (Portugal)
    XX:ru           where XX stands for eastern European languages - Russian

François Pinard writes:

> But I got that in Europe, many people have an extended culture, making
> me jealous (:-), and it is not uncommon for them to be comfortable
> with many languages.

You simply have to if you wish to travel a bit ;) (unfortunately, my
active languages are rather limited).

-- 
work : ke@suse.de              | ------    ,__o
     : http://www.suse.de/~ke/ | ------  _-\_<,
home : keichwa@gmx.net         | ------ (*)/'(*)

From mal@lemburg.com  Fri Sep  8 12:56:08 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 08 Sep 2000 13:56:08 +0200
Subject: [I18n-sig] sys.(set|get)_string_encoding in 1.6
References: <39B7C293.70FD7E8A@lemburg.com> <200009071659.BAA06179@dhcp198.grad.sccs.chukyo-u.ac.jp>
Message-ID: <39B8D3D8.EF9C7738@lemburg.com>

Tamito KAJIYAMA wrote:
>
> "M.-A. Lemburg" writes:
> |
> | Tamito KAJIYAMA wrote:
> | >
> | > I found that sys.(get|set)defaultencoding() defined in the
> | > Unicode proposal version 1.6 were implemented under the different
> | > names sys.(get|set)_string_encoding() in the 1.6 final release.
> | > Is this an intended change? If so, why is this incompatibility
> | > introduced?
> |
> | These APIs were first introduced as an experiment in the CVS tree
> | under the names you find in the 1.6 release. They were meant to
> | provide an easy way to experiment with different default encodings.
> |
> | After some discussions on python-dev the outcome was to keep the
> | APIs for use by site.py to set a locale dependent default encoding.
> |
> | This idea was then retracted some weeks later and replaced with the
> | now standard ASCII default encoding which you find in both 1.6 and
> | 2.0.
>
> I see.
>
> | So to answer your question: the sys APIs in 1.6 are to be considered
> | undocumented features and should *not* be used.
>
> Then, is there no way to set/get the default encoding in 1.6?

No, there's no official way to do this. You could of course use the
undocumented APIs, but you should be careful not to create any Unicode
objects *before* setting the default in e.g. site.py. The same applies
to 2.0.

The reason is that Unicode objects cache their default encoded string
version but don't store the encoding this string uses. This could lead
to the cached version using a different encoding than the current
default encoding.

In any case I'd suggest not relying on the default encoding, but
instead using explicit calls to .encode() and unicode() to apply the
proper conversions -- this is always safe, uses less magic and is also
more portable across Python installations.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:       http://www.lemburg.com/
Python Pages:   http://www.lemburg.com/python/
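[Concretely, a two-line illustration of that advice; the encoding name
is arbitrary:]

    s = u'f\xf6n'
    data = s.encode('iso-8859-1')             # unicode -> bytes, named explicitly
    assert unicode(data, 'iso-8859-1') == s   # bytes -> unicode, same encoding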
From martin@loewis.home.cs.tu-berlin.de  Mon Sep 11 23:48:15 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Tue, 12 Sep 2000 00:48:15 +0200
Subject: [I18n-sig] Re: [4suite] Output encodings again
In-Reply-To: <39BC9737.C302C19C@fourthought.com> (message from Uche Ogbuji on Mon, 11 Sep 2000 02:26:31 -0600)
References: <871yzla17z.fsf@psyche.evansnet> <39BC9737.C302C19C@fourthought.com>
Message-ID: <200009112248.AAA00801@loewis.home.cs.tu-berlin.de>

[for i18n readers: the issue is to convert u"\u00A9\u01A9" to latin-1,
so that it comes out as "\251&#x1A9;"]

> Currently, on output to XML (and HTML), we first convert the UTF-8 that
> the DOM uses into Martin von Lowis's wchar type.

It may be time to slowly retire this type. It is still needed for 1.5
installations, but the 1.6/2.0 type has a comparable feature set yet an
interface that is here to stay; plus it offers quite some additional
features. Still, I believe it shares this problem with my type.

> So I'm rather at a loss as to how to efficiently escape such characters
> for XML output. I know I want to render them as &#???;, but every
> method I see for doing so is rather wasteful.

In principle, the approach should be to introduce new encodings. That
is, you get latin-1-xml, latin-2-xml, koi-8r-xml, utf-8-xml, and so on.
These encodings are the same as the original ones, except that they
have different error handling. This approach is possible both with my
type and with the 2.0 type - however, implementing these encodings is
quite some effort.

I'm sure you've thought of the approach to catch the exception, then
retry with a smaller string. That may not be too bad - it requires a
binary search to work efficiently. E.g.

    def latin1_xml(str):
        try:
            return str.encode("latin-1")
        except UnicodeError:
            if len(str) == 1:
                return "&#x%x;" % ord(str)
            m = len(str) / 2
            return latin1_xml(str[:m]) + latin1_xml(str[m:])

It could be implemented more efficiently if the UnicodeError told at
what offset exactly the problem occurred, or at least what character
was causing the problem, e.g.

    def latin1_xml(str):
        try:
            return str.encode("latin-1")
        except UnicodeError, e:
            m = str.find(e.bad_char)
            r = "&#x%x;" % ord(e.bad_char)
            return latin1_xml(str[:m]) + r + latin1_xml(str[m+1:])

I think such advanced error reporting could be useful; it is
questionable whether it could go into 2.0 if implemented. In any case,
it would probably be reasonable not to require a bad_char attribute in
every UnicodeError instance - perhaps UnicodeError must be further
subclassed:

    def latin1_xml(str):
        try:
            return str.encode("latin-1")
        except ConversionError, e:
            m = e.offset
            r = "&#x%x;" % ord(e.bad_char)
            return latin1_xml(str[:m]) + r + latin1_xml(str[m+1:])
        except UnicodeError:
            if len(str) == 1:
                return "&#x%x;" % ord(str)
            m = len(str) / 2
            return latin1_xml(str[:m]) + latin1_xml(str[m:])

Regards,
Martin

From mal@lemburg.com  Tue Sep 12 13:30:36 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 12 Sep 2000 14:30:36 +0200
Subject: [I18n-sig] Re: [XML-SIG] Re: [4suite] Output encodings again
References: <871yzla17z.fsf@psyche.evansnet> <39BC9737.C302C19C@fourthought.com> <200009112248.AAA00801@loewis.home.cs.tu-berlin.de>
Message-ID: <39BE21EB.842A066F@lemburg.com>

"Martin v. Loewis" wrote:
>
> [for i18n readers: the issue is to convert u"\u00A9\u01A9" to latin-1,
> so that it comes out as "\251&#x1A9;"]
>
> > Currently, on output to XML (and HTML), we first convert the UTF-8 that
> > the DOM uses into Martin von Lowis's wchar type.
>
> It may be time to slowly retire this type. It is still needed for
> 1.5 installations, but the 1.6/2.0 type has a comparable feature set
> yet an interface that is here to stay; plus it offers quite some
> additional features.
>
> Still, I believe it shares this problem with my type.
>
> > So I'm rather at a loss as to how to efficiently escape such characters
> > for XML output. I know I want to render them as &#???;, but every
> > method I see for doing so is rather wasteful.
>
> In principle, the approach should be to introduce new encodings. That
> is, you get latin-1-xml, latin-2-xml, koi-8r-xml, utf-8-xml, and so on.
>
> These encodings are the same as the original ones, except that they
> have different error handling. This approach is possible both with my
> type and with the 2.0 type - however, implementing these encodings is
> quite some effort.

It's not really all that hard to write codecs for Python 2.0.

You'll have to do two things:

1. write the codec by subclassing the base classes in codecs.py

2. write a search function which returns the needed constructors and
   functions.

You will then have to register the search function using the APIs in
codecs.py. After having done that, the codec will be accessible via
the usual 2.0 methods, e.g. .encode() and unicode().

Documentation is available in codecs.py itself, the various codecs in
the encodings/ package directory and Misc/unicode.txt. For a good
pure-Python implementation built using these techniques, have a look
at the Japanese codecs which were recently announced on the i18n-sig
list.

-- 
Marc-Andre Lemburg
________________________________________________________________________
Business:       http://www.lemburg.com/
Python Pages:   http://www.lemburg.com/python/

From martin@loewis.home.cs.tu-berlin.de  Wed Sep 13 12:11:08 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Wed, 13 Sep 2000 13:11:08 +0200
Subject: [I18n-sig] Re: [XML-SIG] Re: [4suite] Output encodings again
In-Reply-To: <39BE21EB.842A066F@lemburg.com> (mal@lemburg.com)
References: <871yzla17z.fsf@psyche.evansnet> <39BC9737.C302C19C@fourthought.com> <200009112248.AAA00801@loewis.home.cs.tu-berlin.de> <39BE21EB.842A066F@lemburg.com>
Message-ID: <200009131111.NAA00929@loewis.home.cs.tu-berlin.de>

> It's not really all that hard to write codecs for Python 2.0.
>
> You'll have to do two things:
> 1. write the codec by subclassing the base classes in codecs.py
> 2. write a search function which returns the needed constructors
>    and functions.

So how would I write a codec that converts all characters to Latin-1,
and converts those out of latin-1 to &#xxx; (instead of the
replacement character)? I'd need knowledge about what characters are
in Latin-1, and I'd need to do the conversion on a
character-by-character basis, right? And I can't possibly use any of
the _codecs helper functions?

This is certainly feasible if I want it for a single character set,
but not if I want to do it wholesale for the entire set of character
sets supported by Python 2.0.

Regards,
Martin

From mal@lemburg.com  Wed Sep 13 18:57:07 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 13 Sep 2000 19:57:07 +0200
Subject: [I18n-sig] Re: [XML-SIG] Re: [4suite] Output encodings again
References: <871yzla17z.fsf@psyche.evansnet> <39BC9737.C302C19C@fourthought.com> <200009112248.AAA00801@loewis.home.cs.tu-berlin.de> <39BE21EB.842A066F@lemburg.com> <200009131111.NAA00929@loewis.home.cs.tu-berlin.de>
Message-ID: <39BFBFF3.C7BDD1F4@lemburg.com>

"Martin v. Loewis" wrote:
>
> > It's not really all that hard to write codecs for Python 2.0.
> >
> > You'll have to do two things:
> > 1. write the codec by subclassing the base classes in codecs.py
> > 2. write a search function which returns the needed constructors
> >    and functions.
>
> So how would I write a codec that converts all characters to Latin-1,
> and converts those out of latin-1 to &#xxx; (instead of the
> replacement character)? I'd need knowledge about what characters are
> in Latin-1, and I'd need to do the conversion on a
> character-by-character basis, right?

Right.

> And I can't possibly use any of the _codecs helper functions?

You could play some tricks with the character mapping codec which is
used by all code page codecs. You will achieve better performance with
a native codec written in C though.

> This is certainly feasible if I want it for a single character set,
> but not if I want to do it wholesale for the entire set of character
> sets supported by Python 2.0.

This is probably not possible since there's no way to have the codecs
use e.g. a callback function to handle error situations. But the
situation is not all that bad: most codecs rely on the character
mapping codec and you could simply implement a new version of it which
does the XML escaping instead of raising errors.

-- 
Marc-Andre Lemburg
________________________________________________________________________
Business:       http://www.lemburg.com/
Python Pages:   http://www.lemburg.com/python/
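[A sketch along the lines MAL suggests: a "latin-1-xml" codec whose
encoder falls back on XML character references instead of raising
UnicodeError. The codec name follows Martin's naming scheme, but the
class and the character-by-character loop are illustrative assumptions;
a serious version would hook the charmap machinery in C.]

    import codecs

    class Latin1XMLCodec(codecs.Codec):
        def encode(self, input, errors='strict'):
            chunks = []
            for ch in input:
                if ord(ch) < 256:
                    chunks.append(chr(ord(ch)))       # fits in Latin-1
                else:
                    chunks.append('&#x%x;' % ord(ch)) # escape the rest
            return ''.join(chunks), len(input)
        def decode(self, input, errors='strict'):
            return unicode(input, 'latin-1'), len(input)

    class _Reader(Latin1XMLCodec, codecs.StreamReader):
        pass

    class _Writer(Latin1XMLCodec, codecs.StreamWriter):
        pass

    def _search(name):
        # the search function returns the needed constructors/functions
        if name == 'latin-1-xml':
            codec = Latin1XMLCodec()
            return codec.encode, codec.decode, _Reader, _Writer
        return None

    codecs.register(_search)

    print u'\xa9\u01a9'.encode('latin-1-xml')   # -> '\xa9&#x1a9;'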
From jpsc@users.sourceforge.net  Wed Sep 27 23:18:35 2000
From: jpsc@users.sourceforge.net (JP S-C)
Date: Wed, 27 Sep 2000 15:18:35 -0700 (PDT)
Subject: [I18n-sig] Python for the Visually Impaired
Message-ID: <20000927221835.26496.qmail@web2201.mail.yahoo.com>

Dear edu-sig and i18n-sig mailing lists,

The subject of this message is somewhere in between education and
internationalization, so I am writing to you both. I run a project
named Ocularis and am interested in collaborating with developers from
both SIGs, or with the SIGs themselves.

Ocularis, in brief, is a distribution of the Linux operating system
that aims to allow the visually impaired to communicate, work, and
express themselves through computers, as well as to install and
customize their system, independent of sighted assistance. The
development of Ocularis is already underway; all software is created
by volunteers and released under the GNU Public License. More detailed
information about Ocularis is included below.

The ocularis-desktop package (currently in version 0.0.1) focuses on
providing console-based applications that serve common functions. This
package is written completely in Python, a language which I believe
has a lot of potential for creating applications for the visually
impaired. In addition, I think that Python is also an ideal language
on many fronts, especially when it comes to programming, debugging,
and maintaining code non-visually. Other than the ocularis-desktop
package, there are also several developers working on other
subprojects of Ocularis that aim to provide better access to X,
including GTK-based applications.

I would love to discuss or hear ideas from anyone about Python's many
uses for and with the visually impaired. Thank you.

--JP Schnapper-Casteras
jpsc@users.sourceforge.net

Details about Ocularis:

The computing environment and suite of applications that are the goal
of Ocularis will be free software (see "www.gnu.org" for a definition
of free software) and will be based on Linux. The basic applications
that Ocularis will possess are a word processor, calendar, calculator,
basic accounting or finance application, file manager, Internet
browser, and e-mail client. All of these programs will run smoothly on
computers consisting of commonly available hardware costing less than
$500 that can be bought at almost any local computer store. In
comparison to current adaptive technology, this is both a drastic
price drop and an increase in the availability of the required
hardware.

Ocularis was started in response to research on current adaptive
technology, which culminated in the editorial "The Potential of Open
Source for the Visually Impaired" (available at the Ocularis web site,
"http://ocularis.sourceforge.net/").

For more information, please visit the Ocularis web site,
"http://ocularis.sourceforge.net/", or contact me directly.