From paul@prescod.net Fri Jun 2 04:20:48 2000
From: paul@prescod.net (Paul Prescod)
Date: Thu, 01 Jun 2000 22:20:48 -0500
Subject: [I18n-sig] Literal strings
Message-ID: <39372810.F9BFE796@prescod.net>

I am thinking about string literals. Not narrow strings in general, just
string literals in particular. I'm not sure where we left the issue of a
statement about the "encoding" of string literals. Here's my input.

I have a lot of code like this:

    if tagName=="foo":
        ...

I would like it to magically work with Unicode. Guido's proposal allows
it to magically work with Unicode-encoded ASCII, but not with the full
range of Unicode characters. I'm not entirely happy that my code will
crash and burn the first time someone pops in a cedilla.

What would be the consequences of a module-level pragma that allows the
literal strings in my module to be interpreted as *Unicode literals*
instead of ASCII literals? I usually know that all of the literals in my
program are raw ASCII, so even if they are interpreted as Unicode, they
will be "compatible with" raw ASCII input. The only thing that they
would not be compatible with is 8-bit binary goo, which they were never
intended to be compatible with anyhow.

I just want to add something at the top of my file like:

    #pragma I18N

and have my literal strings act as Unicode.

Now I could go through my code and change all of the literals to Unicode
literals by hand, but

a) that's really ugly, syntactically

b) I feel like I'll end up switching them all back when we just make
literal strings "wide" by default

c) I feel like I'm being penalized for making my program
internationalized

d) I have a lot of code, as we all do.

-- 
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
Simplicity does not precede complexity, but follows it.
 - http://www.cs.yale.edu/~perlis-alan/quotes.html

From paul@prescod.net Fri Jun 2 04:53:41 2000
From: paul@prescod.net (Paul Prescod)
Date: Thu, 01 Jun 2000 22:53:41 -0500
Subject: [I18n-sig] Re: [Python-Dev] ascii.py?
References: <200006012236.SAA03578@snark.thyrsus.com>
Message-ID: <39372FC5.DE1CE8EA@prescod.net>

"Eric S. Raymond" wrote:
> There has been a vast and echoing silence about the ascii.py module I
> posted here at Fred Drake's request. Is it really such a bad idea?

Without looking closely, or even being particularly knowledgeable (how's
that for a disclaimer!) my instinctive reaction was: "does the ASCII
subset of Unicode need its own module just before we add Unicode to the
language?"

It may be that there are some semantics of ASCII that are not captured
in the Unicode spec and thus are not generalizable. I'm pretty confident
that these ones ARE generalizable:

isalnum
isalpha
isascii
islower
isupper
isspace
isxdigit

How do Unicode users get this information from the famous Unicode
database and why not merge the Unicode and ASCII versions in 1.6?

-- 
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
Simplicity does not precede complexity, but follows it.
 - http://www.cs.yale.edu/~perlis-alan/quotes.html
Raymond" wrote: > > > > There has been a vast and echoing silence about the ascii.py module I > > posted here at Fred Drake's request. Is it really such a bad idea? > > Without looking closely, or even being particularly knowledgable (how's > that for a disclaimer!) my instinctive reaction was: "does the ASCII > subset of Unicode need its own module just before we add Unicode to the > language?" > > It may be that there are some semantics of ASCII that are not captured > in the Unicode spec. and thus are not generalizable. ascii.ctrl is one such. > I'm pretty > confident that these ones ARE generalizable: > > isalnum > isalpha > isascii > islower > isupper > isspace > isxdigit > > How do Unicode users get this information from the famous Unicode > database and why not merge the Unicode and ASCII versions in 1.6? Answer: ascii.py is not designed for text processing. I wrote it to package some functions useful for classifying *ASCII* data, especially in the context of roguelike programs that interpret keystrokes coming in through a curses interface. (Where this all touches ground is CML2, my replacement configuration system for the Linux kernel.) -- Eric S. Raymond ..every Man has a Property in his own Person. This no Body has any Right to but himself. The Labour of his Body, and the Work of his Hands, we may say, are properly his. .... The great and chief end therefore, of Mens uniting into Commonwealths, and putting themselves under Government, is the Preservation of their Property. -- John Locke, "A Treatise Concerning Civil Government" From mal@lemburg.com Fri Jun 2 09:02:35 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 02 Jun 2000 10:02:35 +0200 Subject: [I18n-sig] Re: [Python-Dev] ascii.py? References: <200006012236.SAA03578@snark.thyrsus.com> <39372FC5.DE1CE8EA@prescod.net> Message-ID: <39376A1B.10E45C7B@lemburg.com> Paul Prescod wrote: > > "Eric S. Raymond" wrote: > > > > There has been a vast and echoing silence about the ascii.py module I > > posted here at Fred Drake's request. Is it really such a bad idea? > > Without looking closely, or even being particularly knowledgable (how's > that for a disclaimer!) my instinctive reaction was: "does the ASCII > subset of Unicode need its own module just before we add Unicode to the > language?" > > It may be that there are some semantics of ASCII that are not captured > in the Unicode spec. and thus are not generalizable. I'm pretty > confident that these ones ARE generalizable: > > isalnum > isalpha > isascii > islower > isupper > isspace > isxdigit > > How do Unicode users get this information from the famous Unicode > database and why not merge the Unicode and ASCII versions in 1.6? Note that many of the above are already implemented as string|Unicode methods. The Unicode database is accessible via the unicodedata module. The specs for the used APIs and constants can be found in the Unicode database description file on www.unicode.org. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Jun 2 10:32:29 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 02 Jun 2000 11:32:29 +0200 Subject: [I18n-sig] Literal strings References: <39372810.F9BFE796@prescod.net> Message-ID: <39377F2D.B6FBBF71@lemburg.com> Paul Prescod wrote: > > I am thinking about string literals. Not narrow strings in general, just > string literals in particular. 
From mal@lemburg.com Fri Jun 2 10:32:29 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 02 Jun 2000 11:32:29 +0200
Subject: [I18n-sig] Literal strings
References: <39372810.F9BFE796@prescod.net>
Message-ID: <39377F2D.B6FBBF71@lemburg.com>

Paul Prescod wrote:
> I am thinking about string literals. Not narrow strings in general, just
> string literals in particular. I'm not sure where we left the issue of a
> statement about the "encoding" of string literals. Here's my input.
>
> I have a lot of code like this:
>
>     if tagName=="foo":
>         ...
>
> I would like it to magically work with Unicode. Guido's proposal allows
> it to magically work with Unicode-encoded ASCII, but not with the full
> range of Unicode characters. I'm not entirely happy that my code will
> crash and burn the first time someone pops in a cedilla.
>
> What would be the consequences of a module-level pragma that allows the
> literal strings in my module to be interpreted as *Unicode literals*
> instead of ASCII literals? I usually know that all of the literals in my
> program are raw ASCII, so even if they are interpreted as Unicode, they
> will be "compatible with" raw ASCII input. The only thing that they
> would not be compatible with is 8-bit binary goo, which they were never
> intended to be compatible with anyhow.
>
> I just want to add something at the top of my file like:
>
>     #pragma I18N
>
> and have my literal strings act as Unicode.
>
> Now I could go through my code and change all of the literals to Unicode
> literals by hand, but
>
> a) that's really ugly, syntactically
>
> b) I feel like I'll end up switching them all back when we just make
> literal strings "wide" by default
>
> c) I feel like I'm being penalized for making my program
> internationalized
>
> d) I have a lot of code, as we all do.

You can use the experimental command line flag -U to have the Python
compiler do this for you. The downside is that it does this for *all*
modules and this currently causes much of the standard lib to fail
(that's why it's experimental -- a future goal should be making the
standard lib work with and without -U).

The safest way to do this certainly is by fixing all instances to use
u"" instead of "" (not that hard, really). Even though this may look
strange at first, reading the code will immediately bring your attention
to the fact that you are dealing with Unicode here -- a #pragma at the
top won't get that much attention and a casual user might wonder where
the u"" strings in variable dumps originate from.

Note that there are plans to add a #pragma to allow specifying a Python
script encoding. Things haven't been sorted out, though. One way to do
this is by turning all "" string literals into u"" assuming the encoding
given in the #pragma e.g. Latin-1 or MacRoman -- this would be along the
lines of what you have in mind. The problem with this is that some
string literals might have to map to 8-bit strings, so for these you'd
need to write e.g. s"" or something similar.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
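For reference, a minimal sketch of the coercion behaviour the thread is
worried about -- assuming the Python 2.0-era rule that mixed
8-bit/Unicode comparisons decode the 8-bit string using the default
'ascii' encoding:

    # Comparing a plain literal against Unicode works while the data
    # stays within 7-bit ASCII ...
    tagName = u"foo"
    print tagName == "foo"      # 1 -- the 8-bit string coerces silently

    # ... but a byte outside ASCII makes the implicit coercion fail:
    tagName = u"fa\u00e7on"     # 'facon' with a cedilla
    tagName == "fa\xe7on"       # raises UnicodeError under the
                                # default 'ascii' coercion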
From pf@artcom-gmbh.de Fri Jun 2 11:39:09 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Fri, 2 Jun 2000 12:39:09 +0200 (MEST)
Subject: [I18n-sig] Literal strings
In-Reply-To: <39372810.F9BFE796@prescod.net> from Paul Prescod at "Jun 1, 2000 10:20:48 pm"
Message-ID:

Hi Paul,

Paul Prescod <paul@prescod.net>:
> I am thinking about string literals. Not narrow strings in general, just
> string literals in particular. I'm not sure where we left the issue of a
> statement about the "encoding" of string literals. Here's my input.
>
> I have a lot of code like this:
>
>     if tagName=="foo":
>         ...
>
> I would like it to magically work with Unicode. Guido's proposal allows
> it to magically work with Unicode-encoded ASCII, but not with the full
> range of Unicode characters. I'm not entirely happy that my code will
> crash and burn the first time someone pops in a cedilla.

A cedilla (ç) is a normal 8-bit character in ISO-Latin-1, so this may be
a bad example. We use such literals a lot and it didn't break anything.
Even with Guido's proposal it will only break things if you coerce such
a literal into Unicode without an explicit conversion.

Since my native language is German and since my English leaves a lot to
be desired (take my rants to python-dev as examples), we decided long
ago to use German as our "master language" in our company for our I18N
software. This works pretty well in Python 1.5.2. Here is an example of
what this looks like:

    tkMessageBox.askquestion(_("Löschen bestätigen"),
        _("Soll %s gelöscht werden?") % object_name)

'_()' in this context is a shortcut name pointing to the
'fintl.gettext()' function. This function possibly returns the literal
translated into English, French or Spanish depending on the language
environment. An additional tool (xgettext, now pygettext by Barry W.) is
used to extract all those literals and to deliver them to professional
translators, who translate these message strings into English, French ...

Additionally we adopted the style of using single quotes for all
literals that are normally invisible to a user of the software. Example:

    if hasattr(target, 'disable'):
        target.disable()

> What would be the consequences of a module-level pragma that allows the
> literal strings in my module to be interpreted as *Unicode literals*
> instead of ASCII literals? I usually know that all of the literals in my
> program are raw ASCII, so even if they are interpreted as Unicode, they
> will be "compatible with" raw ASCII input. The only thing that they
> would not be compatible with is 8-bit binary goo, which they were never
> intended to be compatible with anyhow.

Hmmmm.... I don't understand what you mean by your last sentence. Maybe
my ignorance comes from the situation that I can view, edit and print
any files containing ISO-Latin1 characters WYSIWYG without thinking
about it, and still don't know what kind of text editor and
keyboard/display equipment is required to work with those Unicode
characters with ord(ch) >= 256 in WYSIWYG. [I'm using Linux/X11/vim if
this matters]

> I just want to add something at the top of my file like:
>
>     #pragma I18N
>
> and have my literal strings act as Unicode.

There already was a long discussion about interpreter pragmas on
python-dev. I still prefer David Scherer's brilliant idea to (ab)use the
'global' statement at module level, if we ever introduce pragmas into
the 1.x series of Python. Please review the discussion (April 2000) in
the python-dev archives.

> Now I could go through my code and change all of the literals to Unicode
> literals by hand, but
>
> a) that's really ugly, syntactically

As always this is simply a matter of taste. And after a while you get
used to it.

> b) I feel like I'll end up switching them all back when we just make
> literal strings "wide" by default

I don't believe that this will happen in the 1.x series. This would
break just too many things and the memory penalty is just too harsh for
small systems.

> c) I feel like I'm being penalized for making my program
> internationalized

As long as your i18n effort doesn't hit Asian languages (for example
Chinese, Japanese) you can get away with narrow strings.
Unicode only comes into play if you have to deal with several different
languages at the same time. Even a Japanese translation is possible with
8-bit Python 1.5.2, as long as you don't need to display for example
umlauts and Japanese characters at the same time, and as long as the
Japanese translator uses the same character set as the production
platform. On Feb 9th, 2000, Andy Robinson wrote a very good explanation
of what character sets are used in Japan. Review this in the i18n
archive if interested. Brian Takashi Hooper was also a very helpful guy
concerning Japanese.

> d) I have a lot of code, as we all do.

If code can be modified automatically (and what you proposed can be done
with an only slightly more elaborate operation than a simple 's/"/u"/g'
replacement) this is IMO no argument.

Regards, Peter

From mal@lemburg.com Fri Jun 2 12:26:07 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 02 Jun 2000 13:26:07 +0200
Subject: [I18n-sig] Literal strings
References:
Message-ID: <393799CF.270D9BE4@lemburg.com>

Peter Funk wrote:
> Since my native language is German and since my English leaves a lot to
> be desired (take my rants to python-dev as examples), we decided long
> ago to use German as our "master language" in our company for our I18N
> software. This works pretty well in Python 1.5.2. Here is an example of
> what this looks like:
>
>     tkMessageBox.askquestion(_("Löschen bestätigen"),
>         _("Soll %s gelöscht werden?") % object_name)
>
> '_()' in this context is a shortcut name pointing to the
> 'fintl.gettext()' function. This function possibly returns the literal
> translated into English, French or Spanish depending on the language
> environment. An additional tool (xgettext, now pygettext by Barry W.) is
> used to extract all those literals and to deliver them to professional
> translators, who translate these message strings into English, French ...
>
> Additionally we adopted the style of using single quotes for all
> literals that are normally invisible to a user of the software. Example:
>
>     if hasattr(target, 'disable'):
>         target.disable()

Nice idea :-)

I'm currently using my own scheme for solving the NLS problem, but it
currently only works on a per-process basis. What I am looking for now
is a way to be able to set the language on a per-user (of a single
server process) basis.

Is the gettext approach useful for this too, i.e. does it allow fast
switching of the target language ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From pf@artcom-gmbh.de Fri Jun 2 13:48:10 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Fri, 2 Jun 2000 14:48:10 +0200 (MEST)
Subject: Translated messages and 'gettext' API (was Re: [I18n-sig] Literal strings)
In-Reply-To: <393799CF.270D9BE4@lemburg.com> from "M.-A. Lemburg" at "Jun 2, 2000 1:26: 7 pm"
Message-ID:

Hi,

[M.-A. Lemburg]:
> I'm currently using my own scheme for solving the NLS problem, but it
> currently only works on a per-process basis. What I am looking for now
> is a way to be able to set the language on a per-user (of a single
> server process) basis.
>
> Is the gettext approach useful for this too, i.e. does it allow fast
> switching of the target language ?

Not as is. Currently my module 'fintl.py' is simply a small wrapper
around MvL's 'intl' interface to the GNU gettext C library, if this is
available, and otherwise an emulator, in pure Python, which does the
same as the GNU gettext library does.
My goal was to 1. avoid GPL infection and 2. use the same API on
non-Unix platforms like WinXX and MacOS. But mailman seems to have a
similar problem: Juan Carlos Rey Anaya has taken the module 'gettext.py'
by James Henstridge and modified it to support dynamic loading of
message catalogs.

Based on a suggestion made by François Pinard in his mail to python-list
from 15 Jan 2000 20:15:08 I thought it would be a nice idea to replace
the current singleton pattern for locale and catalog setting with a
'Translator' class, from which you may create several instances. This is
trivial to implement, if you don't have to pay too much attention to
memory consumption and don't insist on being API compatible with GNU
gettext.

Of course this will introduce some additional complexity: either you
have to carry the "right" Translator instance around to all places where
messages are used, in order to access the right 'gettext' method, or you
have to expose some global default state, for example through the
following two functions:

    def switch_language(new_language):
        global _current_translator
        if new_language != _current_translator.language:
            if not _translators.has_key(new_language):
                _translators[new_language] = Translator(new_language)
            _current_translator = _translators[new_language]
            ...

    def query_language():
        return _current_translator.language

I'm not sure what is needed.

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)

From tanzer@swing.co.at Fri Jun 2 15:12:36 2000
From: tanzer@swing.co.at (Christian Tanzer)
Date: Fri, 02 Jun 2000 16:12:36 +0200
Subject: [I18n-sig] Literal strings
In-Reply-To: Your message of "Fri, 02 Jun 2000 12:39:09 +0200."
Message-ID:

pf@artcom-gmbh.de (Peter Funk) wrote:

> If code can be modified automatically (and what you proposed can be done
> with an only slightly more elaborate operation than a simple 's/"/u"/g'
> replacement) this is IMO no argument.

Unfortunately, it's not that simple:

--------------------------------------------------------------------------------
Python 1.5.2 (#5, Jan  4 2000, 11:37:02)  [GCC 2.7.2.1] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> #some_string
... 
>>> import re
>>> some_string='''
... Just to test an over-simplified regex: "first string".
... """Followed by another string
... spanning several lines.
... """
... '''
>>> print some_string

Just to test an over-simplified regex: "first string".
"""Followed by another string
spanning several lines.
"""

>>> print re.sub ('"','u"',some_string )

Just to test an over-simplified regex: u"first stringu".
u"u"u"Followed by another string
spanning several lines.
u"u"u"
--------------------------------------------------------------------------------

Do you really have `an only slightly more elaborate operation'? If so,
please post it.

Regards,
Christian

-- 
Christian Tanzer                         tanzer@swing.co.at
Glasauergasse 32                         Tel: +43 1 876 62 36
A-1130 Vienna, Austria                   Fax: +43 1 877 66 92
From pf@artcom-gmbh.de Fri Jun 2 16:30:49 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Fri, 2 Jun 2000 17:30:49 +0200 (MEST)
Subject: Replacing string literals with u"..." (was Re: [I18n-sig] Literal strings)
In-Reply-To: from Christian Tanzer at "Jun 2, 2000 4:12:36 pm"
Message-ID:

Hi,

[me:]
> > If code can be modified automatically (and what you proposed can be done
> > with an only slightly more elaborate operation than a simple 's/"/u"/g'
> > replacement) this is IMO no argument.

[Christian Tanzer]:
> Unfortunately, it's not that simple:
[...example of complicated string not repeated...]
> Do you really have `an only slightly more elaborate operation'? If so,
> please post it.

No, sorry. Indeed regular expressions seem not to be the right tool to
do this. But since the module 'tokenize' from the standard library is
able to identify all those forms of Python string literals, it should be
possible and not too hard to write a script which will identify all
string tokens using 'tokenize' and replace them with u-prefixed
versions.

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)
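A sketch of the script Peter describes -- assuming the old
callback-style tokenize API of that era; the function name and the
right-to-left splicing strategy are illustrative, not an existing tool:

    import tokenize, token

    def add_u_prefix(lines):
        # Collect the (row, col) start of every unprefixed string
        # literal; tokenize handles all quoting styles correctly.
        hits = []
        def eat(ttype, tok, start, end, line, hits=hits):
            if ttype == token.STRING and tok[:1] in ('"', "'"):
                hits.append(start)
        pos = [0]
        def readline(pos=pos, lines=lines):
            if pos[0] >= len(lines):
                return ''
            pos[0] = pos[0] + 1
            return lines[pos[0] - 1]
        tokenize.tokenize(readline, eat)
        # Splice a 'u' in front of each hit, right-to-left so that
        # earlier columns on the same row stay valid (rows are 1-based).
        hits.reverse()
        for row, col in hits:
            lines[row - 1] = lines[row - 1][:col] + 'u' + lines[row - 1][col:]
        return lines

    # Usage, e.g.:
    #   src = open('mymodule.py').readlines()
    #   open('mymodule.py.new', 'w').writelines(add_u_prefix(src))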
From brian@garage.co.jp Fri Jun 2 16:41:36 2000
From: brian@garage.co.jp (Brian Takashi Hooper)
Date: Sat, 03 Jun 2000 00:41:36 +0900
Subject: Translated messages and 'gettext' API (was Re: [I18n-sig] Literal strings)
In-Reply-To:
References: <393799CF.270D9BE4@lemburg.com>
Message-ID: <3937D5B0254.6274BRIAN@smtp.garage.co.jp>

Hi there,

[snip]
> [M.-A. Lemburg]:
> > I'm currently using my own scheme for solving the NLS problem, but it
> > currently only works on a per-process basis. What I am looking for now
> > is a way to be able to set the language on a per-user (of a single
> > server process) basis.
> >
> > Is the gettext approach useful for this too, i.e. does it allow fast
> > switching of the target language ?
>
> Not as is.

I also ran into this same problem and made a slightly expanded Python
implementation of gettext (based on Peter's fintl.py!) that adds a few
calls to allow the language to explicitly be set for each call, which
makes it a little more appropriate for applications where each thread,
or perhaps even each call, might have a different language preference.

I've also experimentally used interpositioning with a hacked version of
gettext, compiled as a .so, to enable a C version of the same stuff
(basically, just allowing an explicit language argument to dcgettext,
which if supplied is used instead of getting the language from the
environment).

Does this seem useful to anyone? If so, I'll put the code up somewheres
(actually, even if not, what the heck.)

-Brian

From mal@lemburg.com Fri Jun 2 20:05:18 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 02 Jun 2000 21:05:18 +0200
Subject: Translated messages and 'gettext' API (was Re: [I18n-sig] Literal strings)
References: <393799CF.270D9BE4@lemburg.com> <3937D5B0254.6274BRIAN@smtp.garage.co.jp>
Message-ID: <3938056E.6B707525@lemburg.com>

Brian Takashi Hooper wrote:
> Hi there,
>
> [snip]
> > [M.-A. Lemburg]:
> > > I'm currently using my own scheme for solving the NLS problem, but it
> > > currently only works on a per-process basis. What I am looking for now
> > > is a way to be able to set the language on a per-user (of a single
> > > server process) basis.
> > >
> > > Is the gettext approach useful for this too, i.e. does it allow fast
> > > switching of the target language ?
> >
> > Not as is.
>
> I also ran into this same problem and made a slightly expanded Python
> implementation of gettext (based on Peter's fintl.py!) that adds a few
> calls to allow the language to explicitly be set for each call, which
> makes it a little more appropriate for applications where each thread,
> or perhaps even each call, might have a different language preference.
>
> I've also experimentally used interpositioning with a hacked version of
> gettext, compiled as a .so, to enable a C version of the same stuff
> (basically, just allowing an explicit language argument to dcgettext,
> which if supplied is used instead of getting the language from the
> environment).
>
> Does this seem useful to anyone? If so, I'll put the code up somewheres
> (actually, even if not, what the heck.)

If I understand this right, Peter's version does not need the GPLed
gettext lib, right ? What are the license terms for the Python gettext
version and your modified one ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From paul@prescod.net Sat Jun 3 20:23:22 2000
From: paul@prescod.net (Paul Prescod)
Date: Sat, 03 Jun 2000 14:23:22 -0500
Subject: [I18n-sig] Literal strings
References: <39372810.F9BFE796@prescod.net> <39377F2D.B6FBBF71@lemburg.com>
Message-ID: <39395B2A.BE05860@prescod.net>

"M.-A. Lemburg" wrote:
> ....
> The safest way to do this certainly is by fixing all
> instances to use u"" instead of "" (not that hard, really).
> Even though this may look strange at first, reading the code
> will immediately bring your attention to the fact that you
> are dealing with Unicode here -- a #pragma at the top won't
> get that much attention and a casual user might wonder
> where the u"" strings in variable dumps originate from.

I guess that's our philosophical difference. I don't want to go around
thinking about the fact that I am using Unicode. I want to test it once
and then have it "just work."
> One way to do this is by turning
> all "" string literals into u"" assuming the encoding
> given in the #pragma e.g. Latin-1 or MacRoman -- this would
> be along the lines of what you have in mind.

Yes, this would probably be acceptable.

> The problem
> with this is that some string literals might have to map
> to 8-bit strings, so for these you'd need to write e.g.
> s"" or something similar.

Right, or call a conversion function.

-- 
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
Simplicity does not precede complexity, but follows it.
 - http://www.cs.yale.edu/~perlis-alan/quotes.html

From paul@prescod.net Sat Jun 3 20:24:34 2000
From: paul@prescod.net (Paul Prescod)
Date: Sat, 03 Jun 2000 14:24:34 -0500
Subject: [I18n-sig] Literal strings
References:
Message-ID: <39395B72.EED8E1B1@prescod.net>

Peter Funk wrote:
> > I would like it to magically work with Unicode. Guido's proposal allows
> > it to magically work with Unicode-encoded ASCII, but not with the full
> > range of Unicode characters. I'm not entirely happy that my code will
> > crash and burn the first time someone pops in a cedilla.
>
> A cedilla (ç) is a normal 8-bit character in ISO-Latin-1, so this may be
> a bad example.

Guido's proposal only auto-coerces 7-bit data.

> We use such literals a lot and it didn't break anything.
> Even with Guido's proposal it will only break things if you coerce such
> a literal into Unicode without an explicit conversion.

My code example showed an implicit coercion.

> There already was a long discussion about interpreter pragmas on
> python-dev. I still prefer David Scherer's brilliant idea to (ab)use the
> 'global' statement at module level, if we ever introduce pragmas into
> the 1.x series of Python. Please review the discussion (April 2000) in
> the python-dev archives.

I wasn't so concerned about the syntax so I didn't bother to look that up.

> > Now I could go through my code and change all of the literals to Unicode
> > literals by hand, but
> >
> > a) that's really ugly, syntactically
>
> As always this is simply a matter of taste. And after a while you get
> used to it.

They say that about Perl too. :) I don't believe them.

> > b) I feel like I'll end up switching them all back when we just make
> > literal strings "wide" by default
>
> I don't believe that this will happen in the 1.x series. This would
> break just too many things and the memory penalty is just too harsh for
> small systems.

We will see about the former. The latter is just not true because a
Unicode object could be internally implemented as an 8-bit string as
long as it implements the same external interface. We have often
discussed these "tagged Unicode objects" and have just not implemented
them yet.

> > c) I feel like I'm being penalized for making my program
> > internationalized
>
> As long as your i18n effort doesn't hit Asian languages (for example
> Chinese, Japanese) you can get away with narrow strings.

I work with XML so I don't know what language the input is in.

> Unicode only comes into play if you have to deal with several different
> languages at the same time.

Or if you are dealing with XML, or TKinter, or WebDAV or communicating
with Java or ...

> > d) I have a lot of code, as we all do.
>
> If code can be modified automatically (and what you proposed can be done
> with an only slightly more elaborate operation than a simple 's/"/u"/g'
> replacement) this is IMO no argument.

Actually, I haven't had any experience with source to source Python
transforms myself.
Wouldn't it mess up other things like comments and tabbing unless you
went to a great deal of work?

-- 
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
Simplicity does not precede complexity, but follows it.
 - http://www.cs.yale.edu/~perlis-alan/quotes.html

From brian@garage.co.jp Sun Jun 4 03:03:07 2000
From: brian@garage.co.jp (Brian Takashi Hooper)
Date: Sun, 04 Jun 2000 11:03:07 +0900
Subject: Translated messages and 'gettext' API (was Re: [I18n-sig] Literal strings)
In-Reply-To: <3938056E.6B707525@lemburg.com>
References: <3937D5B0254.6274BRIAN@smtp.garage.co.jp> <3938056E.6B707525@lemburg.com>
Message-ID: <3939B8DB4.6275BRIAN@smtp.garage.co.jp>

On Fri, 02 Jun 2000 21:05:18 +0200 "M.-A. Lemburg" wrote:

> Brian Takashi Hooper wrote:
> > Hi there,
> >
> > [snip]
> > > [M.-A. Lemburg]:
> > > > I'm currently using my own scheme for solving the NLS problem, but it
> > > > currently only works on a per-process basis. What I am looking for now
> > > > is a way to be able to set the language on a per-user (of a single
> > > > server process) basis.
> > > >
> > > > Is the gettext approach useful for this too, i.e. does it allow fast
> > > > switching of the target language ?
> > >
> > > Not as is.
> >
> > I also ran into this same problem and made a slightly expanded Python
> > implementation of gettext (based on Peter's fintl.py!) that adds a few
> > calls to allow the language to explicitly be set for each call, which
> > makes it a little more appropriate for applications where each thread,
> > or perhaps even each call, might have a different language preference.
> >
> > I've also experimentally used interpositioning with a hacked version of
> > gettext, compiled as a .so, to enable a C version of the same stuff
> > (basically, just allowing an explicit language argument to dcgettext,
> > which if supplied is used instead of getting the language from the
> > environment).
> >
> > Does this seem useful to anyone? If so, I'll put the code up somewheres
> > (actually, even if not, what the heck.)
>
> If I understand this right, Peter's version does not need the GPLed
> gettext lib, right ? What are the license terms for the Python gettext
> version and your modified one ?

Peter's fintl.py, and my modified version of fintl.py, are by themselves
freestanding modules: they do not require libintl or the gettext
library; they are Python reimplementations (of just the message
retrieval API). Peter's is free for any use and my module inherits that
license (is also free).

The .so I made which modifies the C gettext library is, obviously,
GPL'ed -- however, it doesn't seem like it would be too hard to, again,
make a free implementation which just understands GNU (and, if possible,
Solaris-style and other platforms if there are) .mo files and implements
only gettext, dgettext, etc.

--Brian

From paul@prescod.net Sun Jun 4 15:54:01 2000
From: paul@prescod.net (Paul Prescod)
Date: Sun, 04 Jun 2000 09:54:01 -0500
Subject: [I18n-sig] Codecs
Message-ID: <393A6D89.B9DC952F@prescod.net>

Should codecs be returned to the user as objects instead of tuples?
Today we have:

    (UTF8_encode, UTF8_decode,
     UTF8_streamreader, UTF8_streamwriter) = codecs.lookup('UTF-8')

    output = UTF8_streamwriter( open( '/tmp/output', 'wb') )

I think this would be a little simpler:

    output = codecs.lookup('UTF-8').stream_writer( open( '/tmp/output', 'wb') )

The object solution is more extensible, requires fewer "bogus"
assignments and does not require the user to remember the order of the
return values.
-- 
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
Simplicity does not precede complexity, but follows it.
 - http://www.cs.yale.edu/~perlis-alan/quotes.html

From brian@garage.co.jp Sun Jun 4 16:05:48 2000
From: brian@garage.co.jp (Brian Takashi Hooper)
Date: Mon, 05 Jun 2000 00:05:48 +0900
Subject: [I18n-sig] Codecs
In-Reply-To: <393A6D89.B9DC952F@prescod.net>
References: <393A6D89.B9DC952F@prescod.net>
Message-ID: <393A704C17D.DF54BRIAN@smtp.garage.co.jp>

This issue came up before on this list; I think Andy Robinson suggested
it before in the midst of a lot of other Unicode musings. One thing I
remember Andy mentioned was that a codec object could then offer methods
in addition to those required by the codec API, for example a method to
fix broken legacy-encoding input strings, etc.

Personally, I would be happier to get an object back from
codecs.lookup() -- one vote in favor, if it matters.

Are there any good reasons to prefer getting a tuple back from codecs.lookup()?

--Brian

On Sun, 04 Jun 2000 09:54:01 -0500 Paul Prescod wrote:

> Should codecs be returned to the user as objects instead of tuples?
> Today we have:
>
>     (UTF8_encode, UTF8_decode,
>      UTF8_streamreader, UTF8_streamwriter) = codecs.lookup('UTF-8')
>
>     output = UTF8_streamwriter( open( '/tmp/output', 'wb') )
>
> I think this would be a little simpler:
>
>     output = codecs.lookup('UTF-8').stream_writer( open( '/tmp/output', 'wb') )
>
> The object solution is more extensible, requires fewer "bogus"
> assignments and does not require the user to remember the order of the
> return values.
>
> -- 
> Paul Prescod - ISOGEN Consulting Engineer speaking for himself
> Simplicity does not precede complexity, but follows it.
>  - http://www.cs.yale.edu/~perlis-alan/quotes.html

From andy@reportlab.com Sun Jun 4 23:25:04 2000
From: andy@reportlab.com (Andy Robinson)
Date: Sun, 4 Jun 2000 23:25:04 +0100
Subject: [I18n-sig] Codecs
In-Reply-To: <393A6D89.B9DC952F@prescod.net>
Message-ID:

> Should codecs be returned to the user as objects instead of tuples?
> Today we have:
>
>     (UTF8_encode, UTF8_decode,
>      UTF8_streamreader, UTF8_streamwriter) = codecs.lookup('UTF-8')
>
>     output = UTF8_streamwriter( open( '/tmp/output', 'wb') )
>
> I think this would be a little simpler:
>
>     output = codecs.lookup('UTF-8').stream_writer( open( '/tmp/output', 'wb') )
>
> The object solution is more extensible, requires fewer "bogus"
> assignments and does not require the user to remember the order of the
> return values.

I suggested this a while back, for a different reason. Right now you get
four things back from lookup() relating to the given encoding. But in
many cases there may be other encoding-specific routines of great use,
and returning an object would give us a place to hang them;
codec.repair(...) and codec.validate(...), for example. There are
accepted and useful bits of code around to repair Shift-JIS or EUC data
in which one or two bytes are corrupt. We would also have a place to
hang language-specific routines.

So I would be very, very happy to see codecs.lookup return a 'codec
object' with the four attributes encode, decode, streamreader() and
streamwriter() rather than a tuple.

- Andy Robinson
From mal@lemburg.com Mon Jun 5 13:43:52 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 05 Jun 2000 14:43:52 +0200
Subject: [I18n-sig] Literal strings
References: <39372810.F9BFE796@prescod.net> <39377F2D.B6FBBF71@lemburg.com> <39395B2A.BE05860@prescod.net>
Message-ID: <393BA088.8C306FA8@lemburg.com>

Paul Prescod wrote:
> "M.-A. Lemburg" wrote:
> > ....
> > The safest way to do this certainly is by fixing all
> > instances to use u"" instead of "" (not that hard, really).
> > Even though this may look strange at first, reading the code
> > will immediately bring your attention to the fact that you
> > are dealing with Unicode here -- a #pragma at the top won't
> > get that much attention and a casual user might wonder
> > where the u"" strings in variable dumps originate from.
>
> I guess that's our philosophical difference. I don't want to go around
> thinking about the fact that I am using Unicode. I want to test it once
> and then have it "just work."

That won't always work... Unicode and strings are two different things
-- the first is explicitly there for text data while the second can hold
arbitrary data with no extra meta information attached.

...if it does work, then you're lucky ;-)

> > One way to do this is by turning
> > all "" string literals into u"" assuming the encoding
> > given in the #pragma e.g. Latin-1 or MacRoman -- this would
> > be along the lines of what you have in mind.
>
> Yes, this would probably be acceptable.
>
> > The problem
> > with this is that some string literals might have to map
> > to 8-bit strings, so for these you'd need to write e.g.
> > s"" or something similar.
>
> Right, or call a conversion function.

...but then you have the same problem as before: string literal
modifiers (the small 'u' or 's' in front of the literal) scattered
around in the source code.

Hmm, we need some more ideas in this area I guess...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From mal@lemburg.com Mon Jun 5 14:11:56 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 05 Jun 2000 15:11:56 +0200
Subject: [I18n-sig] Codecs
References: <393A6D89.B9DC952F@prescod.net> <393A704C17D.DF54BRIAN@smtp.garage.co.jp>
Message-ID: <393BA71C.349A7338@lemburg.com>

Brian Takashi Hooper wrote:
> This issue came up before on this list; I think Andy Robinson suggested
> it before in the midst of a lot of other Unicode musings. One thing I
> remember Andy mentioned was that a codec object could then offer methods
> in addition to those required by the codec API, for example a method to
> fix broken legacy-encoding input strings, etc.
>
> Personally, I would be happier to get an object back from
> codecs.lookup(), one vote in favor if it matters.
>
> Are there any good reasons to prefer getting a tuple back from codecs.lookup()?

Here are some:

* The tuple entries have two different flavours: the first two are
readily usable encode/decode APIs, while the last two point to factory
functions which can be used to create new objects.

* Tuples are much easier to create and query at C level than Python
objects having a certain interface.

* The tuples can easily be cached and this is what the codec registry
currently does to enhance performance. Object lookups are slower than
tuple entry lookups (ok, not so much an argument, because the conversion
itself is likely to cause much more overhead).
* There is quite a lot of code in the dist which already uses the tuple
value (all codecs, the codec registry, sample apps, etc.).

* Who's going to write the code and produce the patches ?

Note that you can easily add your own wrappers of codecs.lookup() which
then give you an object instead of the tuple.

The extensibility argument is a problem with the current solution, but
is there really such a great need for extra codec APIs ? (Please
remember that all codec writers would have to implement these new APIs
-- the more you put in there the more difficult and less attractive it
gets...)

> --Brian
>
> On Sun, 04 Jun 2000 09:54:01 -0500 Paul Prescod wrote:
>
> > Should codecs be returned to the user as objects instead of tuples?
> > Today we have:
> >
> >     (UTF8_encode, UTF8_decode,
> >      UTF8_streamreader, UTF8_streamwriter) = codecs.lookup('UTF-8')
> >
> >     output = UTF8_streamwriter( open( '/tmp/output', 'wb') )
> >
> > I think this would be a little simpler:
> >
> >     output = codecs.lookup('UTF-8').stream_writer( open( '/tmp/output', 'wb') )
> >
> > The object solution is more extensible, requires fewer "bogus"
> > assignments and does not require the user to remember the order of the
> > return values.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From mal@lemburg.com Mon Jun 5 14:53:09 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 05 Jun 2000 15:53:09 +0200
Subject: [I18n-sig] Re: Translated messages and 'gettext'
References: <3937D5B0254.6274BRIAN@smtp.garage.co.jp> <3938056E.6B707525@lemburg.com> <3939B8DB4.6275BRIAN@smtp.garage.co.jp>
Message-ID: <393BB0C5.C4323929@lemburg.com>

[gettext and changing languages on the fly]
> Peter's fintl.py, and my modified version of fintl.py, are by themselves
> freestanding modules: they do not require libintl or the gettext
> library; they are Python reimplementations (of just the message
> retrieval API). Peter's is free for any use and my module inherits that
> license (is also free).
>
> The .so I made which modifies the C gettext library is, obviously,
> GPL'ed -- however, it doesn't seem like it would be too hard to, again,
> make a free implementation which just understands GNU (and, if possible,
> Solaris-style and other platforms if there are) .mo files and implements
> only gettext, dgettext, etc.

Hmm, wouldn't it make sense to come up with one standard gettext.py
module which implements all the needed functionality in Python and can
use the wrapped GNU libintl.a optionally if available ?

Wish list:

The module should ideally support all major gettext and similar
l10n-formats and allow changing languages.

Peter's translation object approach seems to fit this best: it could use
mixin classes for the different l10n formats (gettext .mo files, locale
message files, resource files, etc.) and provide the needed caching and
lookup engine as base class.

It would probably be easiest to have one language per instance and
perhaps a translation object factory which implements caching objects
for the different languages currently in use.

...just some thoughts (got no time for this :-()

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
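A minimal sketch of that factory-plus-cache idea -- the Translator stub
below stands in for Peter's fintl-style class and is an assumption, not
existing code:

    class Translator:
        # One language per instance; the catalog would be filled from a
        # .mo file in a real implementation.
        def __init__(self, language):
            self.language = language
            self.catalog = {}
        def gettext(self, message):
            # Fall back to the untranslated message if there is no entry.
            return self.catalog.get(message, message)

    _translators = {}

    def get_translator(language):
        # Factory with caching: at most one Translator per language,
        # so switching languages per user/request stays cheap.
        if not _translators.has_key(language):
            _translators[language] = Translator(language)
        return _translators[language]

    # Per-request use in a multi-user server process:
    #   _ = get_translator(user_language).gettext
    #   label = _("Delete")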
From mal@lemburg.com Mon Jun 5 14:37:37 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 05 Jun 2000 15:37:37 +0200
Subject: [I18n-sig] Codecs
References:
Message-ID: <393BAD21.38C23FF9@lemburg.com>

Andy Robinson wrote:
> > Should codecs be returned to the user as objects instead of tuples?
> > Today we have:
> >
> >     (UTF8_encode, UTF8_decode,
> >      UTF8_streamreader, UTF8_streamwriter) = codecs.lookup('UTF-8')
> >
> >     output = UTF8_streamwriter( open( '/tmp/output', 'wb') )
> >
> > I think this would be a little simpler:
> >
> >     output = codecs.lookup('UTF-8').stream_writer( open( '/tmp/output', 'wb') )
> >
> > The object solution is more extensible, requires fewer "bogus"
> > assignments and does not require the user to remember the order of the
> > return values.
>
> I suggested this a while back, for a different reason. Right now you get
> four things back from lookup() relating to the given encoding. But in
> many cases there may be other encoding-specific routines of great use,
> and returning an object would give us a place to hang them;
> codec.repair(...) and codec.validate(...), for example. There are
> accepted and useful bits of code around to repair Shift-JIS or EUC data
> in which one or two bytes are corrupt. We would also have a place to
> hang language-specific routines.
>
> So I would be very, very happy to see codecs.lookup return a 'codec
> object' with the four attributes encode, decode, streamreader() and
> streamwriter() rather than a tuple.

(Please also see my other post on the subject...)

The tuple design was chosen for speed and because of its simplicity...
please remember that much of the codec registry stuff is written in C
and should be easily accessible and manageable from there.

Note that things like "validate" and "repair" can be handled by
providing new error handling codes and then checking the
encoding/decoding calls for exceptions.

New functionality can easily be added to the stream read/writer objects
which are returned by the factory functions given in the tuple -- these
also allow keeping state and can work on string-like objects via
StringIO.

Perhaps all we need is a simpler interface for codecs.lookup() ? ...
Something like:

    encoder = codecs.encoder('utf-8')
    # ditto for .decoder, .streamwriter, .streamreader

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
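A sketch of that "validate falls out of strict error handling" point;
the function below is illustrative, not a proposed codec API:

    def validate(data, encoding):
        # Strict decoding already detects corrupt input: if the bytes
        # do not form legal data for the given encoding, the codec
        # raises an error instead of returning a Unicode object.
        try:
            unicode(data, encoding)
            return 1
        except UnicodeError:
            return 0

    # e.g. validate(japanese_bytes, 'utf-8') -> 0 for truncated sequences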
From pf@artcom-gmbh.de Mon Jun 5 15:52:44 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Mon, 5 Jun 2000 16:52:44 +0200 (MEST)
Subject: [I18n-sig] Re: Translated messages and 'gettext'
In-Reply-To: <393BB0C5.C4323929@lemburg.com> from "M.-A. Lemburg" at "Jun 5, 2000 3:53: 9 pm"
Message-ID:

Hi,

M.-A. Lemburg:
> Hmm, wouldn't it make sense to come up with one standard gettext.py
> module which implements all the needed functionality in Python and can
> use the wrapped GNU libintl.a optionally if available ?

Yes: this is what I have in mind.

> Wish list:
>
> The module should ideally support all major gettext and similar
> l10n-formats and allow changing languages.

At the moment I see no heavy need for binary formats other than the GNU
gettext .mo file format. However it may be useful for people who want to
embed Python into a larger project. So I will try to design an easy way
to plug in readers for other formats.

> Peter's translation object approach seems to fit this best: it could use
> mixin classes for the different l10n formats (gettext .mo files, locale
> message files, resource files, etc.) and provide the needed caching and
> lookup engine as base class.
> It would probably be easiest to have one language per instance and
> perhaps a translation object factory which implements caching objects
> for the different languages currently in use.
>
> ...just some thoughts (got no time for this :-()

I will save your suggestions here and I will *try* to realize them in
time for inclusion into Python 1.6 final.

Regards, Peter

From paul@prescod.net Mon Jun 5 16:21:11 2000
From: paul@prescod.net (Paul Prescod)
Date: Mon, 05 Jun 2000 10:21:11 -0500
Subject: [I18n-sig] Codecs
References: <393A6D89.B9DC952F@prescod.net> <393A704C17D.DF54BRIAN@smtp.garage.co.jp> <393BA71C.349A7338@lemburg.com>
Message-ID: <393BC567.F5FA18EF@prescod.net>

"M.-A. Lemburg" wrote:
> ...
> > Are there any good reasons to prefer getting a tuple back from codecs.lookup()?
>
> Here are some:
>
> * The tuple entries have two different flavours: the first two are
> readily usable encode/decode APIs, while the last two point to factory
> functions which can be used to create new objects.

Right, and with an object syntax you can only deal with the properties
you are interested in, not with all four, all of the time.

> * Tuples are much easier to create and query at C level than Python
> objects having a certain interface.

I don't see that as very important!

> * The tuples can easily be cached and this is what the codec registry
> currently does to enhance performance. Object lookups are slower than
> tuple entry lookups (ok, not so much an argument, because the conversion
> itself is likely to cause much more overhead).

I agree that this is not much of an argument. :)

> * There is quite a lot of code in the dist which already uses the tuple
> value (all codecs, the codec registry, sample apps, etc.).
>
> * Who's going to write the code and produce the patches ?

These two are important arguments but we need to decide what we want
before we start deciding whether it is doable.

> The extensibility argument is a problem with the current solution, but
> is there really such a great need for extra codec APIs ?

I don't know yet. If we knew now, we'd add them now. :)

> (Please remember that all codec writers would have to implement these
> new APIs -- the more you put in there the more difficult and less
> attractive it gets...)

I think that Andy was thinking that codecs might be a useful place to
"hang" arbitrary encoding-related methods -- whether or not they are
standardized. Python is dynamically typed so we don't need to conform to
a restrictive interface definition.

Anyhow, more than the extensibility, returning structured objects is
just more Pythonic. I hate having to remember the position of tuple
return values.

> encoder = codecs.encoder('utf-8')
> # ditto for .decoder, .streamwriter, .streamreader

That might be an acceptable compromise on the syntactic issue....but....

It doesn't seem much more work to just make a version of "lookup" that
wraps tuples in objects. If we took this half-step then we could decide
to move to "full objects" in the future and break a lot less code.

-- 
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
Simplicity does not precede complexity, but follows it.
 - http://www.cs.yale.edu/~perlis-alan/quotes.html
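That half-step can be sketched in a few lines -- a hypothetical wrapper
over the tuple interface, not a shipped API; only codecs.lookup() itself
is assumed:

    import codecs

    class CodecObject:
        # Wraps the 4-tuple from codecs.lookup() so callers use
        # attribute names instead of remembering tuple positions.
        def __init__(self, encoding):
            (self.encode, self.decode,
             self.streamreader, self.streamwriter) = codecs.lookup(encoding)

    # The earlier example then reads:
    #   utf8 = CodecObject('utf-8')
    #   output = utf8.streamwriter(open('/tmp/output', 'wb'))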
From andy@reportlab.com Mon Jun 5 16:49:38 2000
From: andy@reportlab.com (Andy Robinson)
Date: Mon, 5 Jun 2000 16:49:38 +0100
Subject: [I18n-sig] Codecs
In-Reply-To: <393BA71C.349A7338@lemburg.com>
Message-ID:

Replying to MAL slightly out of order:

> Note that you can easily add your own wrappers of codecs.lookup() which
> then give you an object instead of the tuple.
>
> The extensibility argument is a problem with the current solution, but
> is there really such a great need for extra codec APIs ? (Please
> remember that all codec writers would have to implement these new APIs
> -- the more you put in there the more difficult and less attractive it
> gets...)

I'm proposing a place to put non-standard extensions. The whole point is
that these are things which are useful for multi-byte codecs and
non-European languages, but will certainly not exist for all codecs.
These could be exposed as functions within the relevant codec module,
but it seems clean if the codecs module provides the lookup
functionality, and the particular codec can provide new 'services'
itself.

> Here are some:
>
> * The tuple entries have two different flavours: the first two are
> readily usable encode/decode APIs, while the last two point to factory
> functions which can be used to create new objects.
>
> * Tuples are much easier to create and query at C level than Python
> objects having a certain interface.
>
> * The tuples can easily be cached and this is what the codec registry
> currently does to enhance performance. Object lookups are slower than
> tuple entry lookups (ok, not so much an argument, because the conversion
> itself is likely to cause much more overhead).
>
> * There is quite a lot of code in the dist which already uses the tuple
> value (all codecs, the codec registry, sample apps, etc.).
>
> * Who's going to write the code and produce the patches ?

I did argue for this originally at least twice but got ignored by
everyone. Now there is some support, I'll make another bid. If the only
issue is the work involved, then we should first decide if it is the
right thing, then see if we can find the resources to write the patch.

Anyone else got opinions?

- Andy Robinson

From mal@lemburg.com Mon Jun 5 18:38:38 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 05 Jun 2000 19:38:38 +0200
Subject: [I18n-sig] Codecs
References:
Message-ID: <393BE59E.D8D85336@lemburg.com>

Andy Robinson wrote:
> Replying to MAL slightly out of order:
>
> > Note that you can easily add your own wrappers of codecs.lookup() which
> > then give you an object instead of the tuple.
> >
> > The extensibility argument is a problem with the current solution, but
> > is there really such a great need for extra codec APIs ? (Please
> > remember that all codec writers would have to implement these new APIs
> > -- the more you put in there the more difficult and less attractive it
> > gets...)
>
> I'm proposing a place to put non-standard extensions. The whole point is
> that these are things which are useful for multi-byte codecs and
> non-European languages, but will certainly not exist for all codecs.
> These could be exposed as functions within the relevant codec module,
> but it seems clean if the codecs module provides the lookup
> functionality, and the particular codec can provide new 'services'
> itself.

That's already possible via the stream writer/reader object. The two
extra functions encode/decode are really only there to enhance
performance of the builtin encoding machinery (which only needs
stateless converters).

You can easily add new methods to the stream writer and reader objects.
They also allow you to keep state -- which a simple entry in a codec
registry object would not.

Perhaps I'm missing something ?

> > Here are some:
> >
> > * The tuple entries have two different flavours: the first two are
> > readily usable encode/decode APIs, while the last two point to factory
> > functions which can be used to create new objects.
> >
> > * Tuples are much easier to create and query at C level than Python
> > objects having a certain interface.
> >
> > * The tuples can easily be cached and this is what the codec registry
> > currently does to enhance performance. Object lookups are slower than
> > tuple entry lookups (ok, not so much an argument, because the conversion
> > itself is likely to cause much more overhead).
> >
> > * There is quite a lot of code in the dist which already uses the tuple
> > value (all codecs, the codec registry, sample apps, etc.).
> >
> > * Who's going to write the code and produce the patches ?
>
> I did argue for this originally at least twice but got ignored by
> everyone.

Could be that we were too busy with other things, e.g. the source code
encoding debate ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From mal@lemburg.com Mon Jun 5 18:40:59 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 05 Jun 2000 19:40:59 +0200
Subject: [I18n-sig] Codecs
References: <393A6D89.B9DC952F@prescod.net> <393A704C17D.DF54BRIAN@smtp.garage.co.jp> <393BA71C.349A7338@lemburg.com> <393BC567.F5FA18EF@prescod.net>
Message-ID: <393BE62B.182B6AF1@lemburg.com>

[codecs.lookup() returning a tuple]
> > * Tuples are much easier to create and query at C level than Python
> > objects having a certain interface.
>
> I don't see that as very important!

For me it is: I maintain this stuff :-) Adding full object support would
mean that I'd have to write a new C type which supports the object
interface -- I'm not particularly interested in doing so...

> > encoder = codecs.encoder('utf-8')
> > # ditto for .decoder, .streamwriter, .streamreader
>
> That might be an acceptable compromise on the syntactic issue....but....
>
> It doesn't seem much more work to just make a version of "lookup" that
> wraps tuples in objects. If we took this half-step then we could decide
> to move to "full objects" in the future and break a lot less code.

I have no problem with a new lookup API which returns objects, I just
wouldn't want to have the codec registry use these wrapper objects as
basis for doing its work.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
From mal@lemburg.com Fri Jun 9 12:09:19 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 09 Jun 2000 13:09:19 +0200
Subject: [I18n-sig] New Unicode default encoding scheme
Message-ID: <3940D05E.9E266396@lemburg.com>

Hi everybody,

I just wanted to inform you that the Unicode default encoding handling
has changed from the strict UTF-8 setting to a much more flexible
solution which is based on the default locale settings (provided via the
LANG environment variable).

The new default setting is ASCII as per Guido's request.

Here's the important section of the Misc/unicode.txt file. For more
details I refer you to that file in the current CVS tree.

"""
Unicode Default Encoding:
-------------------------

The Unicode implementation has to make some assumption about the
encoding of 8-bit strings passed to it for coercion and about the
encoding to use as default for conversion of Unicode to strings when no
specific encoding is given. This encoding is called <default encoding>
throughout this text.

If not otherwise defined or set, the <default encoding> is set to
'ascii'. For this, the implementation maintains a global <default
encoding> which can be set in the site.py Python startup script.
Subsequent changes are not possible. The <default encoding> can be set
and queried using the two sys module APIs:

sys.setdefaultencoding(encoding) --> Sets the <default encoding> used by
the Unicode implementation. encoding has to be an encoding which is
supported by the Python installation, otherwise, a LookupError is
raised. Note: This API is only available in site.py !

sys.getdefaultencoding() --> Returns the current <default encoding>.

To enhance usability of Unicode coercion, the <default encoding> is set
in the default site.py startup module according to the encoding defined
by the locale active when the site.py module gets executed. The locale
module is used to extract the encoding from the locale default settings
defined in the LANG environment variable (and possibly others -- see
locale.py). If the encoding cannot be determined, is unknown or
unsupported, site.py defaults to setting the <default encoding> to
'ascii'. This encoding is also the startup default of Python (and in
effect before site.py is executed).
"""

Example:

cnri/Python+Unicode> setenv LANG de_DE:utf8
cnri/Python+Unicode> ./python
>>> import sys
>>> sys.getdefaultencoding()
'utf'
>>> print u"äöü"
äöü
>>>
cnri/Python+Unicode> setenv LANG de_DE:latin1
cnri/Python+Unicode> ./python
>>> import sys
>>> sys.getdefaultencoding()
'latin1'
>>> print u"äöü"
äöü
>>>

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
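A sketch of the site.py hook this describes -- assuming
locale.getdefaultlocale() can derive an encoding from LANG; the fallback
logic and exception list are illustrative:

    import sys, locale

    encoding = 'ascii'  # Python's startup default
    try:
        # getdefaultlocale() inspects LANG (and friends) and returns
        # e.g. ('de_DE', 'latin1'); the encoding part may be None.
        loc, enc = locale.getdefaultlocale()
        if enc:
            encoding = enc
    except (ImportError, ValueError):
        pass

    if encoding != 'ascii':
        # Only available while site.py runs; see the text above.
        sys.setdefaultencoding(encoding)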