From paul@prescod.net Fri Jun 2 04:20:48 2000
From: paul@prescod.net (Paul Prescod)
Date: Thu, 01 Jun 2000 22:20:48 -0500
Subject: [I18n-sig] Literal strings
Message-ID: <39372810.F9BFE796@prescod.net>

I am thinking about string literals. Not narrow strings in general, just
string literals in particular. I'm not sure where we left the issue of a
statement about the "encoding" of string literals. Here's my input.

I have a lot of code like this:

    if tagName=="foo":
        ...

I would like it to magically work with Unicode. Guido's proposal allows
it to magically work with Unicode-encoded ASCII, but not with the full
range of Unicode characters. I'm not entirely happy that my code will
crash and burn the first time someone pops in a cedilla.

What would be the consequences of a module-level pragma that allows the
literal strings in my module to be interpreted as *Unicode literals*
instead of ASCII literals? I usually know that all of the literals in my
program are raw ASCII, so even if they are interpreted as Unicode, they
will be "compatible with" raw ASCII input. The only thing that they
would not be compatible with is 8-bit binary goo, which they were never
intended to be compatible with anyhow.

I just want to add something at the top of my file like:

    #pragma I18N

and have my literal strings act as Unicode.

Now I could go through my code and change all of the literals to Unicode
literals by hand, but

a) that's really ugly, syntactically

b) I feel like I'll end up switching them all back when we just make
literal strings "wide" by default

c) I feel like I'm being penalized for making my program
internationalized

d) I have a lot of code, as we all do.

-- 
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
Simplicity does not precede complexity, but follows it.
 - http://www.cs.yale.edu/~perlis-alan/quotes.html

From paul@prescod.net Fri Jun 2 04:53:41 2000
From: paul@prescod.net (Paul Prescod)
Date: Thu, 01 Jun 2000 22:53:41 -0500
Subject: [I18n-sig] Re: [Python-Dev] ascii.py?
References: <200006012236.SAA03578@snark.thyrsus.com>
Message-ID: <39372FC5.DE1CE8EA@prescod.net>

"Eric S. Raymond" wrote:
> There has been a vast and echoing silence about the ascii.py module I
> posted here at Fred Drake's request. Is it really such a bad idea?

Without looking closely, or even being particularly knowledgeable (how's
that for a disclaimer!) my instinctive reaction was: "does the ASCII
subset of Unicode need its own module just before we add Unicode to the
language?"

It may be that there are some semantics of ASCII that are not captured
in the Unicode spec and thus are not generalizable. I'm pretty confident
that these ones ARE generalizable:

isalnum
isalpha
isascii
islower
isupper
isspace
isxdigit

How do Unicode users get this information from the famous Unicode
database and why not merge the Unicode and ASCII versions in 1.6?

-- 
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
Simplicity does not precede complexity, but follows it.
 - http://www.cs.yale.edu/~perlis-alan/quotes.html
Raymond" wrote: > > > > There has been a vast and echoing silence about the ascii.py module I > > posted here at Fred Drake's request. Is it really such a bad idea? > > Without looking closely, or even being particularly knowledgable (how's > that for a disclaimer!) my instinctive reaction was: "does the ASCII > subset of Unicode need its own module just before we add Unicode to the > language?" > > It may be that there are some semantics of ASCII that are not captured > in the Unicode spec. and thus are not generalizable. ascii.ctrl is one such. > I'm pretty > confident that these ones ARE generalizable: > > isalnum > isalpha > isascii > islower > isupper > isspace > isxdigit > > How do Unicode users get this information from the famous Unicode > database and why not merge the Unicode and ASCII versions in 1.6? Answer: ascii.py is not designed for text processing. I wrote it to package some functions useful for classifying *ASCII* data, especially in the context of roguelike programs that interpret keystrokes coming in through a curses interface. (Where this all touches ground is CML2, my replacement configuration system for the Linux kernel.) -- Eric S. Raymond ..every Man has a Property in his own Person. This no Body has any Right to but himself. The Labour of his Body, and the Work of his Hands, we may say, are properly his. .... The great and chief end therefore, of Mens uniting into Commonwealths, and putting themselves under Government, is the Preservation of their Property. -- John Locke, "A Treatise Concerning Civil Government" From mal@lemburg.com Fri Jun 2 09:02:35 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 02 Jun 2000 10:02:35 +0200 Subject: [I18n-sig] Re: [Python-Dev] ascii.py? References: <200006012236.SAA03578@snark.thyrsus.com> <39372FC5.DE1CE8EA@prescod.net> Message-ID: <39376A1B.10E45C7B@lemburg.com> Paul Prescod wrote: > > "Eric S. Raymond" wrote: > > > > There has been a vast and echoing silence about the ascii.py module I > > posted here at Fred Drake's request. Is it really such a bad idea? > > Without looking closely, or even being particularly knowledgable (how's > that for a disclaimer!) my instinctive reaction was: "does the ASCII > subset of Unicode need its own module just before we add Unicode to the > language?" > > It may be that there are some semantics of ASCII that are not captured > in the Unicode spec. and thus are not generalizable. I'm pretty > confident that these ones ARE generalizable: > > isalnum > isalpha > isascii > islower > isupper > isspace > isxdigit > > How do Unicode users get this information from the famous Unicode > database and why not merge the Unicode and ASCII versions in 1.6? Note that many of the above are already implemented as string|Unicode methods. The Unicode database is accessible via the unicodedata module. The specs for the used APIs and constants can be found in the Unicode database description file on www.unicode.org. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Fri Jun 2 10:32:29 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 02 Jun 2000 11:32:29 +0200 Subject: [I18n-sig] Literal strings References: <39372810.F9BFE796@prescod.net> Message-ID: <39377F2D.B6FBBF71@lemburg.com> Paul Prescod wrote: > > I am thinking about string literals. Not narrow strings in general, just > string literals in particular. 
From mal@lemburg.com Fri Jun 2 10:32:29 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 02 Jun 2000 11:32:29 +0200
Subject: [I18n-sig] Literal strings
References: <39372810.F9BFE796@prescod.net>
Message-ID: <39377F2D.B6FBBF71@lemburg.com>

Paul Prescod wrote:
> I am thinking about string literals. Not narrow strings in general, just
> string literals in particular. I'm not sure where we left the issue of a
> statement about the "encoding" of string literals. Here's my input.
>
> I have a lot of code like this:
>
>     if tagName=="foo":
>         ...
>
> I would like it to magically work with Unicode. Guido's proposal allows
> it to magically work with Unicode-encoded ASCII, but not with the full
> range of Unicode characters. I'm not entirely happy that my code will
> crash and burn the first time someone pops in a cedilla.
>
> What would be the consequences of a module-level pragma that allows the
> literal strings in my module to be interpreted as *Unicode literals*
> instead of ASCII literals? I usually know that all of the literals in my
> program are raw ASCII, so even if they are interpreted as Unicode, they
> will be "compatible with" raw ASCII input. The only thing that they
> would not be compatible with is 8-bit binary goo, which they were never
> intended to be compatible with anyhow.
>
> I just want to add something at the top of my file like:
>
>     #pragma I18N
>
> and have my literal strings act as Unicode.
>
> Now I could go through my code and change all of the literals to Unicode
> literals by hand, but
>
> a) that's really ugly, syntactically
>
> b) I feel like I'll end up switching them all back when we just make
> literal strings "wide" by default
>
> c) I feel like I'm being penalized for making my program
> internationalized
>
> d) I have a lot of code, as we all do.

You can use the experimental command line flag -U to have the Python
compiler do this for you. The downside is that it does this for *all*
modules and this currently causes much of the standard lib to fail
(that's why it's experimental -- a future goal should be making the
standard lib work with and without -U).

The safest way to do this certainly is by fixing all instances to use
u"" instead of "" (not that hard, really). Even though this may look
strange at first, reading the code will immediately bring your attention
to the fact that you are dealing with Unicode here -- a #pragma at the
top won't get that much attention and a casual user might wonder where
the u"" strings in variable dumps originate from.

Note that there are plans to add a #pragma to allow specifying a Python
script encoding. Things haven't been sorted out, though. One way to do
this is by turning all "" string literals into u"" assuming the encoding
given in the #pragma e.g. Latin-1 or MacRoman -- this would be along the
lines of what you have in mind. The problem with this is that some
string literals might have to map to 8-bit strings, so for these you'd
need to write e.g. s"" or something similar.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
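For reference, a minimal sketch of the coercion behaviour the thread is
worried about -- assuming the Python 2.0-era rule that mixed
8-bit/Unicode comparisons decode the 8-bit string using the default
'ascii' encoding:

    # Comparing a plain literal against Unicode works while the data
    # stays within 7-bit ASCII ...
    tagName = u"foo"
    print tagName == "foo"      # 1 -- the 8-bit string coerces silently

    # ... but a byte outside ASCII makes the implicit coercion fail:
    tagName = u"fa\u00e7on"     # 'facon' with a cedilla
    tagName == "fa\xe7on"       # raises UnicodeError under the
                                # default 'ascii' coercion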
From pf@artcom-gmbh.de Fri Jun 2 11:39:09 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Fri, 2 Jun 2000 12:39:09 +0200 (MEST)
Subject: [I18n-sig] Literal strings
In-Reply-To: <39372810.F9BFE796@prescod.net> from Paul Prescod at "Jun 1, 2000 10:20:48 pm"
Message-ID:

Hi Paul,

Paul Prescod <paul@prescod.net>:
> I am thinking about string literals. Not narrow strings in general, just
> string literals in particular. I'm not sure where we left the issue of a
> statement about the "encoding" of string literals. Here's my input.
>
> I have a lot of code like this:
>
>     if tagName=="foo":
>         ...
>
> I would like it to magically work with Unicode. Guido's proposal allows
> it to magically work with Unicode-encoded ASCII, but not with the full
> range of Unicode characters. I'm not entirely happy that my code will
> crash and burn the first time someone pops in a cedilla.

A cedilla (ç) is a normal 8-bit character in ISO-Latin-1, so this may be
a bad example. We use such literals a lot and it didn't break anything.
Even with Guido's proposal it will only break things if you coerce such
a literal into Unicode without an explicit conversion.

Since my native language is German and since my English leaves a lot to
be desired (take my rants to python-dev as examples), we decided long
ago to use German as our "master language" in our company for our I18N
software. This works pretty well in Python 1.5.2. Here is an example of
what this looks like:

    tkMessageBox.askquestion(_("Löschen bestätigen"),
        _("Soll %s gelöscht werden?") % object_name)

'_()' in this context is a shortcut name pointing to the
'fintl.gettext()' function. This function possibly returns the literal
translated into English, French or Spanish depending on the language
environment. An additional tool (xgettext, now pygettext by Barry W.) is
used to extract all those literals and to deliver them to professional
translators, who translate these message strings into English, French ...

Additionally we adopted the style of using single quotes for all
literals that are normally invisible to a user of the software. Example:

    if hasattr(target, 'disable'):
        target.disable()

> What would be the consequences of a module-level pragma that allows the
> literal strings in my module to be interpreted as *Unicode literals*
> instead of ASCII literals? I usually know that all of the literals in my
> program are raw ASCII, so even if they are interpreted as Unicode, they
> will be "compatible with" raw ASCII input. The only thing that they
> would not be compatible with is 8-bit binary goo, which they were never
> intended to be compatible with anyhow.

Hmmmm.... I don't understand what you mean by your last sentence. Maybe
my ignorance comes from the situation that I can view, edit and print
any files containing ISO-Latin1 characters WYSIWYG without thinking
about it, and still don't know what kind of text editor and
keyboard/display equipment is required to work with those Unicode
characters with ord(ch) >= 256 in WYSIWYG. [I'm using Linux/X11/vim if
this matters]

> I just want to add something at the top of my file like:
>
>     #pragma I18N
>
> and have my literal strings act as Unicode.

There already was a long discussion about interpreter pragmas on
python-dev. I still prefer David Scherer's brilliant idea to (ab)use the
'global' statement at module level, if we ever introduce pragmas into
the 1.x series of Python. Please review the discussion (April 2000) in
the python-dev archives.

> Now I could go through my code and change all of the literals to Unicode
> literals by hand, but
>
> a) that's really ugly, syntactically

As always this is simply a matter of taste. And after a while you get
used to it.

> b) I feel like I'll end up switching them all back when we just make
> literal strings "wide" by default

I don't believe that this will happen in the 1.x series. This would
break just too many things and the memory penalty is just too harsh for
small systems.

> c) I feel like I'm being penalized for making my program
> internationalized

As long as your i18n effort doesn't hit Asian languages (for example
Chinese, Japanese) you can get away with narrow strings.
Unicode only comes into play if you have to deal with several different
languages at the same time. Even a Japanese translation is possible with
8-bit Python 1.5.2, as long as you don't need to display for example
umlauts and Japanese characters at the same time, and as long as the
Japanese translator uses the same character set as the production
platform. On Feb 9th, 2000, Andy Robinson wrote a very good explanation
of what character sets are used in Japan. Review this in the i18n
archive if interested. Brian Takashi Hooper was also a very helpful guy
concerning Japanese.

> d) I have a lot of code, as we all do.

If code can be modified automatically (and what you proposed can be done
with an only slightly more elaborate operation than a simple 's/"/u"/g'
replacement) this is IMO no argument.

Regards, Peter

From mal@lemburg.com Fri Jun 2 12:26:07 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 02 Jun 2000 13:26:07 +0200
Subject: [I18n-sig] Literal strings
References:
Message-ID: <393799CF.270D9BE4@lemburg.com>

Peter Funk wrote:
> Since my native language is German and since my English leaves a lot to
> be desired (take my rants to python-dev as examples), we decided long
> ago to use German as our "master language" in our company for our I18N
> software. This works pretty well in Python 1.5.2. Here is an example of
> what this looks like:
>
>     tkMessageBox.askquestion(_("Löschen bestätigen"),
>         _("Soll %s gelöscht werden?") % object_name)
>
> '_()' in this context is a shortcut name pointing to the
> 'fintl.gettext()' function. This function possibly returns the literal
> translated into English, French or Spanish depending on the language
> environment. An additional tool (xgettext, now pygettext by Barry W.) is
> used to extract all those literals and to deliver them to professional
> translators, who translate these message strings into English, French ...
>
> Additionally we adopted the style of using single quotes for all
> literals that are normally invisible to a user of the software. Example:
>
>     if hasattr(target, 'disable'):
>         target.disable()

Nice idea :-)

I'm currently using my own scheme for solving the NLS problem, but it
currently only works on a per-process basis. What I am looking for now
is a way to be able to set the language on a per-user (of a single
server process) basis.

Is the gettext approach useful for this too, i.e. does it allow fast
switching of the target language ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From pf@artcom-gmbh.de Fri Jun 2 13:48:10 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Fri, 2 Jun 2000 14:48:10 +0200 (MEST)
Subject: Translated messages and 'gettext' API (was Re: [I18n-sig] Literal strings)
In-Reply-To: <393799CF.270D9BE4@lemburg.com> from "M.-A. Lemburg" at "Jun 2, 2000 1:26: 7 pm"
Message-ID:

Hi,

[M.-A. Lemburg]:
> I'm currently using my own scheme for solving the NLS problem, but it
> currently only works on a per-process basis. What I am looking for now
> is a way to be able to set the language on a per-user (of a single
> server process) basis.
>
> Is the gettext approach useful for this too, i.e. does it allow fast
> switching of the target language ?

Not as is. Currently my module 'fintl.py' is simply a small wrapper
around MvL's 'intl' interface to the GNU gettext C library, if this is
available, and otherwise an emulator, in pure Python, which does the
same as the GNU gettext library does.
My goal was to 1. avoid GPL infection and 2. use the same API on
non-Unix platforms like WinXX and MacOS. But mailman seems to have a
similar problem: Juan Carlos Rey Anaya has taken the module 'gettext.py'
by James Henstridge and modified it to support dynamic loading of
message catalogs.

Based on a suggestion made by François Pinard in his mail to python-list
from 15 Jan 2000 20:15:08 I thought it would be a nice idea to replace
the current singleton pattern for locale and catalog setting with a
'Translator' class, from which you may create several instances. This is
trivial to implement, if you don't have to pay too much attention to
memory consumption and don't insist on being API compatible with GNU
gettext.

Of course this will introduce some additional complexity: either you
have to carry the "right" Translator instance around to all places where
messages are used, in order to access the right 'gettext' method, or you
have to expose some global default state, for example through the
following two functions:

    def switch_language(new_language):
        global _current_translator
        if new_language != _current_translator.language:
            if not _translators.has_key(new_language):
                _translators[new_language] = Translator(new_language)
            _current_translator = _translators[new_language]
            ...

    def query_language():
        return _current_translator.language

I'm not sure what is needed.

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)

From tanzer@swing.co.at Fri Jun 2 15:12:36 2000
From: tanzer@swing.co.at (Christian Tanzer)
Date: Fri, 02 Jun 2000 16:12:36 +0200
Subject: [I18n-sig] Literal strings
In-Reply-To: Your message of "Fri, 02 Jun 2000 12:39:09 +0200."
Message-ID:

pf@artcom-gmbh.de (Peter Funk) wrote:

> If code can be modified automatically (and what you proposed can be done
> with an only slightly more elaborate operation than a simple 's/"/u"/g'
> replacement) this is IMO no argument.

Unfortunately, it's not that simple:

--------------------------------------------------------------------------------
Python 1.5.2 (#5, Jan  4 2000, 11:37:02)  [GCC 2.7.2.1] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> #some_string
... 
>>> import re
>>> some_string='''
... Just to test an over-simplified regex: "first string".
... """Followed by another string
... spanning several lines.
... """
... '''
>>> print some_string

Just to test an over-simplified regex: "first string".
"""Followed by another string
spanning several lines.
"""

>>> print re.sub ('"','u"',some_string )

Just to test an over-simplified regex: u"first stringu".
u"u"u"Followed by another string
spanning several lines.
u"u"u"
--------------------------------------------------------------------------------

Do you really have `an only slightly more elaborate operation'? If so,
please post it.

Regards,
Christian

-- 
Christian Tanzer                         tanzer@swing.co.at
Glasauergasse 32                         Tel: +43 1 876 62 36
A-1130 Vienna, Austria                   Fax: +43 1 877 66 92
From pf@artcom-gmbh.de Fri Jun 2 16:30:49 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Fri, 2 Jun 2000 17:30:49 +0200 (MEST)
Subject: Replacing string literals with u"..." (was Re: [I18n-sig] Literal strings)
In-Reply-To: from Christian Tanzer at "Jun 2, 2000 4:12:36 pm"
Message-ID:

Hi,

[me:]
> > If code can be modified automatically (and what you proposed can be done
> > with an only slightly more elaborate operation than a simple 's/"/u"/g'
> > replacement) this is IMO no argument.

[Christian Tanzer]:
> Unfortunately, it's not that simple:
[...example of complicated string not repeated...]
> Do you really have `an only slightly more elaborate operation'? If so,
> please post it.

No, sorry. Indeed regular expressions seem not to be the right tool to
do this. But since the module 'tokenize' from the standard library is
able to identify all those forms of Python string literals, it should be
possible and not too hard to write a script which will identify all
string tokens using 'tokenize' and replace them with u-prefixed
versions.

Regards, Peter
-- 
Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260
office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen)
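A sketch of the script Peter describes -- assuming the old
callback-style tokenize API of that era; the function name and the
right-to-left splicing strategy are illustrative, not an existing tool:

    import tokenize, token

    def add_u_prefix(lines):
        # Collect the (row, col) start of every unprefixed string
        # literal; tokenize handles all quoting styles correctly.
        hits = []
        def eat(ttype, tok, start, end, line, hits=hits):
            if ttype == token.STRING and tok[:1] in ('"', "'"):
                hits.append(start)
        pos = [0]
        def readline(pos=pos, lines=lines):
            if pos[0] >= len(lines):
                return ''
            pos[0] = pos[0] + 1
            return lines[pos[0] - 1]
        tokenize.tokenize(readline, eat)
        # Splice a 'u' in front of each hit, right-to-left so that
        # earlier columns on the same row stay valid (rows are 1-based).
        hits.reverse()
        for row, col in hits:
            lines[row - 1] = lines[row - 1][:col] + 'u' + lines[row - 1][col:]
        return lines

    # Usage, e.g.:
    #   src = open('mymodule.py').readlines()
    #   open('mymodule.py.new', 'w').writelines(add_u_prefix(src))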
From brian@garage.co.jp Fri Jun 2 16:41:36 2000
From: brian@garage.co.jp (Brian Takashi Hooper)
Date: Sat, 03 Jun 2000 00:41:36 +0900
Subject: Translated messages and 'gettext' API (was Re: [I18n-sig] Literal strings)
In-Reply-To:
References: <393799CF.270D9BE4@lemburg.com>
Message-ID: <3937D5B0254.6274BRIAN@smtp.garage.co.jp>

Hi there,

[snip]
> [M.-A. Lemburg]:
> > I'm currently using my own scheme for solving the NLS problem, but it
> > currently only works on a per-process basis. What I am looking for now
> > is a way to be able to set the language on a per-user (of a single
> > server process) basis.
> >
> > Is the gettext approach useful for this too, i.e. does it allow fast
> > switching of the target language ?
>
> Not as is.

I also ran into this same problem and made a slightly expanded Python
implementation of gettext (based on Peter's fintl.py!) that adds a few
calls to allow the language to explicitly be set for each call, which
makes it a little more appropriate for applications where each thread,
or perhaps even each call, might have a different language preference.

I've also experimentally used interpositioning with a hacked version of
gettext, compiled as a .so, to enable a C version of the same stuff
(basically, just allowing an explicit language argument to dcgettext,
which if supplied is used instead of getting the language from the
environment).

Does this seem useful to anyone? If so, I'll put the code up somewheres
(actually, even if not, what the heck.)

-Brian

From mal@lemburg.com Fri Jun 2 20:05:18 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 02 Jun 2000 21:05:18 +0200
Subject: Translated messages and 'gettext' API (was Re: [I18n-sig] Literal strings)
References: <393799CF.270D9BE4@lemburg.com> <3937D5B0254.6274BRIAN@smtp.garage.co.jp>
Message-ID: <3938056E.6B707525@lemburg.com>

Brian Takashi Hooper wrote:
> Hi there,
>
> [snip]
> > [M.-A. Lemburg]:
> > > I'm currently using my own scheme for solving the NLS problem, but it
> > > currently only works on a per-process basis. What I am looking for now
> > > is a way to be able to set the language on a per-user (of a single
> > > server process) basis.
> > >
> > > Is the gettext approach useful for this too, i.e. does it allow fast
> > > switching of the target language ?
> >
> > Not as is.
>
> I also ran into this same problem and made a slightly expanded Python
> implementation of gettext (based on Peter's fintl.py!) that adds a few
> calls to allow the language to explicitly be set for each call, which
> makes it a little more appropriate for applications where each thread,
> or perhaps even each call, might have a different language preference.
>
> I've also experimentally used interpositioning with a hacked version of
> gettext, compiled as a .so, to enable a C version of the same stuff
> (basically, just allowing an explicit language argument to dcgettext,
> which if supplied is used instead of getting the language from the
> environment).
>
> Does this seem useful to anyone? If so, I'll put the code up somewheres
> (actually, even if not, what the heck.)

If I understand this right, Peter's version does not need the GPLed
gettext lib, right ? What are the license terms for the Python gettext
version and your modified one ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From paul@prescod.net Sat Jun 3 20:23:22 2000
From: paul@prescod.net (Paul Prescod)
Date: Sat, 03 Jun 2000 14:23:22 -0500
Subject: [I18n-sig] Literal strings
References: <39372810.F9BFE796@prescod.net> <39377F2D.B6FBBF71@lemburg.com>
Message-ID: <39395B2A.BE05860@prescod.net>

"M.-A. Lemburg" wrote:
> ....
> The safest way to do this certainly is by fixing all
> instances to use u"" instead of "" (not that hard, really).
> Even though this may look strange at first, reading the code
> will immediately bring your attention to the fact that you
> are dealing with Unicode here -- a #pragma at the top won't
> get that much attention and a casual user might wonder
> where the u"" strings in variable dumps originate from.

I guess that's our philosophical difference. I don't want to go around
thinking about the fact that I am using Unicode. I want to test it once
and then have it "just work."
> One way to do this is by turning
> all "" string literals into u"" assuming the encoding
> given in the #pragma e.g. Latin-1 or MacRoman -- this would
> be along the lines of what you have in mind.

Yes, this would probably be acceptable.

> The problem
> with this is that some string literals might have to map
> to 8-bit strings, so for these you'd need to write e.g.
> s"" or something similar.

Right, or call a conversion function.

-- 
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
Simplicity does not precede complexity, but follows it.
 - http://www.cs.yale.edu/~perlis-alan/quotes.html

From paul@prescod.net Sat Jun 3 20:24:34 2000
From: paul@prescod.net (Paul Prescod)
Date: Sat, 03 Jun 2000 14:24:34 -0500
Subject: [I18n-sig] Literal strings
References:
Message-ID: <39395B72.EED8E1B1@prescod.net>

Peter Funk wrote:
> > I would like it to magically work with Unicode. Guido's proposal allows
> > it to magically work with Unicode-encoded ASCII, but not with the full
> > range of Unicode characters. I'm not entirely happy that my code will
> > crash and burn the first time someone pops in a cedilla.
>
> A cedilla (ç) is a normal 8-bit character in ISO-Latin-1, so this may be
> a bad example.

Guido's proposal only auto-coerces 7-bit data.

> We use such literals a lot and it didn't break anything.
> Even with Guido's proposal it will only break things if you coerce such
> a literal into Unicode without an explicit conversion.

My code example showed an implicit coercion.

> There already was a long discussion about interpreter pragmas on
> python-dev. I still prefer David Scherer's brilliant idea to (ab)use the
> 'global' statement at module level, if we ever introduce pragmas into
> the 1.x series of Python. Please review the discussion (April 2000) in
> the python-dev archives.

I wasn't so concerned about the syntax so I didn't bother to look that up.

> > Now I could go through my code and change all of the literals to Unicode
> > literals by hand, but
> >
> > a) that's really ugly, syntactically
>
> As always this is simply a matter of taste. And after a while you get
> used to it.

They say that about Perl too. :) I don't believe them.

> > b) I feel like I'll end up switching them all back when we just make
> > literal strings "wide" by default
>
> I don't believe that this will happen in the 1.x series. This would
> break just too many things and the memory penalty is just too harsh for
> small systems.

We will see about the former. The latter is just not true because a
Unicode object could be internally implemented as an 8-bit string as
long as it implements the same external interface. We have often
discussed these "tagged Unicode objects" and have just not implemented
them yet.

> > c) I feel like I'm being penalized for making my program
> > internationalized
>
> As long as your i18n effort doesn't hit Asian languages (for example
> Chinese, Japanese) you can get away with narrow strings.

I work with XML so I don't know what language the input is in.

> Unicode only comes into play if you have to deal with several different
> languages at the same time.

Or if you are dealing with XML, or TKinter, or WebDAV or communicating
with Java or ...

> > d) I have a lot of code, as we all do.
>
> If code can be modified automatically (and what you proposed can be done
> with an only slightly more elaborate operation than a simple 's/"/u"/g'
> replacement) this is IMO no argument.

Actually, I haven't had any experience with source to source Python
transforms myself.
Wouldn't it mess up other things like comments and tabbing unless you
went to a great deal of work?

-- 
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
Simplicity does not precede complexity, but follows it.
 - http://www.cs.yale.edu/~perlis-alan/quotes.html

From brian@garage.co.jp Sun Jun 4 03:03:07 2000
From: brian@garage.co.jp (Brian Takashi Hooper)
Date: Sun, 04 Jun 2000 11:03:07 +0900
Subject: Translated messages and 'gettext' API (was Re: [I18n-sig] Literal strings)
In-Reply-To: <3938056E.6B707525@lemburg.com>
References: <3937D5B0254.6274BRIAN@smtp.garage.co.jp> <3938056E.6B707525@lemburg.com>
Message-ID: <3939B8DB4.6275BRIAN@smtp.garage.co.jp>

On Fri, 02 Jun 2000 21:05:18 +0200 "M.-A. Lemburg" wrote:

> Brian Takashi Hooper wrote:
> > Hi there,
> >
> > [snip]
> > > [M.-A. Lemburg]:
> > > > I'm currently using my own scheme for solving the NLS problem, but it
> > > > currently only works on a per-process basis. What I am looking for now
> > > > is a way to be able to set the language on a per-user (of a single
> > > > server process) basis.
> > > >
> > > > Is the gettext approach useful for this too, i.e. does it allow fast
> > > > switching of the target language ?
> > >
> > > Not as is.
> >
> > I also ran into this same problem and made a slightly expanded Python
> > implementation of gettext (based on Peter's fintl.py!) that adds a few
> > calls to allow the language to explicitly be set for each call, which
> > makes it a little more appropriate for applications where each thread,
> > or perhaps even each call, might have a different language preference.
> >
> > I've also experimentally used interpositioning with a hacked version of
> > gettext, compiled as a .so, to enable a C version of the same stuff
> > (basically, just allowing an explicit language argument to dcgettext,
> > which if supplied is used instead of getting the language from the
> > environment).
> >
> > Does this seem useful to anyone? If so, I'll put the code up somewheres
> > (actually, even if not, what the heck.)
>
> If I understand this right, Peter's version does not need the GPLed
> gettext lib, right ? What are the license terms for the Python gettext
> version and your modified one ?

Peter's fintl.py, and my modified version of fintl.py, are by themselves
freestanding modules: they do not require libintl or the gettext
library; they are Python reimplementations (of just the message
retrieval API). Peter's is free for any use and my module inherits that
license (is also free).

The .so I made which modifies the C gettext library is, obviously,
GPL'ed -- however, it doesn't seem like it would be too hard to, again,
make a free implementation which just understands GNU (and, if possible,
Solaris-style and other platforms if there are) .mo files and implements
only gettext, dgettext, etc.

--Brian

From paul@prescod.net Sun Jun 4 15:54:01 2000
From: paul@prescod.net (Paul Prescod)
Date: Sun, 04 Jun 2000 09:54:01 -0500
Subject: [I18n-sig] Codecs
Message-ID: <393A6D89.B9DC952F@prescod.net>

Should codecs be returned to the user as objects instead of tuples?
Today we have:

    (UTF8_encode, UTF8_decode,
     UTF8_streamreader, UTF8_streamwriter) = codecs.lookup('UTF-8')

    output = UTF8_streamwriter( open( '/tmp/output', 'wb') )

I think this would be a little simpler:

    output = codecs.lookup('UTF-8').stream_writer( open( '/tmp/output', 'wb') )

The object solution is more extensible, requires fewer "bogus"
assignments and does not require the user to remember the order of the
return values.
-- 
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
Simplicity does not precede complexity, but follows it.
 - http://www.cs.yale.edu/~perlis-alan/quotes.html

From brian@garage.co.jp Sun Jun 4 16:05:48 2000
From: brian@garage.co.jp (Brian Takashi Hooper)
Date: Mon, 05 Jun 2000 00:05:48 +0900
Subject: [I18n-sig] Codecs
In-Reply-To: <393A6D89.B9DC952F@prescod.net>
References: <393A6D89.B9DC952F@prescod.net>
Message-ID: <393A704C17D.DF54BRIAN@smtp.garage.co.jp>

This issue came up before on this list; I think Andy Robinson suggested
it before in the midst of a lot of other Unicode musings. One thing I
remember Andy mentioned was that a codec object could then offer methods
in addition to those required by the codec API, for example a method to
fix broken legacy-encoding input strings, etc.

Personally, I would be happier to get an object back from
codecs.lookup() -- one vote in favor, if it matters.

Are there any good reasons to prefer getting a tuple back from codecs.lookup()?

--Brian

On Sun, 04 Jun 2000 09:54:01 -0500 Paul Prescod wrote:

> Should codecs be returned to the user as objects instead of tuples?
> Today we have:
>
>     (UTF8_encode, UTF8_decode,
>      UTF8_streamreader, UTF8_streamwriter) = codecs.lookup('UTF-8')
>
>     output = UTF8_streamwriter( open( '/tmp/output', 'wb') )
>
> I think this would be a little simpler:
>
>     output = codecs.lookup('UTF-8').stream_writer( open( '/tmp/output', 'wb') )
>
> The object solution is more extensible, requires fewer "bogus"
> assignments and does not require the user to remember the order of the
> return values.
>
> -- 
> Paul Prescod - ISOGEN Consulting Engineer speaking for himself
> Simplicity does not precede complexity, but follows it.
>  - http://www.cs.yale.edu/~perlis-alan/quotes.html

From andy@reportlab.com Sun Jun 4 23:25:04 2000
From: andy@reportlab.com (Andy Robinson)
Date: Sun, 4 Jun 2000 23:25:04 +0100
Subject: [I18n-sig] Codecs
In-Reply-To: <393A6D89.B9DC952F@prescod.net>
Message-ID:

> Should codecs be returned to the user as objects instead of tuples?
> Today we have:
>
>     (UTF8_encode, UTF8_decode,
>      UTF8_streamreader, UTF8_streamwriter) = codecs.lookup('UTF-8')
>
>     output = UTF8_streamwriter( open( '/tmp/output', 'wb') )
>
> I think this would be a little simpler:
>
>     output = codecs.lookup('UTF-8').stream_writer( open( '/tmp/output', 'wb') )
>
> The object solution is more extensible, requires fewer "bogus"
> assignments and does not require the user to remember the order of the
> return values.

I suggested this a while back, for a different reason. Right now you get
four things back from lookup() relating to the given encoding. But in
many cases there may be other encoding-specific routines of great use,
and returning an object would give us a place to hang them;
codec.repair(...) and codec.validate(...), for example. There are
accepted and useful bits of code around to repair Shift-JIS or EUC data
in which one or two bytes are corrupt. We would also have a place to
hang language-specific routines.

So I would be very, very happy to see codecs.lookup return a 'codec
object' with the four attributes encode, decode, streamreader() and
streamwriter() rather than a tuple.

- Andy Robinson
From mal@lemburg.com Mon Jun 5 13:43:52 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 05 Jun 2000 14:43:52 +0200
Subject: [I18n-sig] Literal strings
References: <39372810.F9BFE796@prescod.net> <39377F2D.B6FBBF71@lemburg.com> <39395B2A.BE05860@prescod.net>
Message-ID: <393BA088.8C306FA8@lemburg.com>

Paul Prescod wrote:
> "M.-A. Lemburg" wrote:
> > ....
> > The safest way to do this certainly is by fixing all
> > instances to use u"" instead of "" (not that hard, really).
> > Even though this may look strange at first, reading the code
> > will immediately bring your attention to the fact that you
> > are dealing with Unicode here -- a #pragma at the top won't
> > get that much attention and a casual user might wonder
> > where the u"" strings in variable dumps originate from.
>
> I guess that's our philosophical difference. I don't want to go around
> thinking about the fact that I am using Unicode. I want to test it once
> and then have it "just work."

That won't always work... Unicode and strings are two different things
-- the first is explicitly there for text data while the second can hold
arbitrary data with no extra meta information attached.

...if it does work, then you're lucky ;-)

> > One way to do this is by turning
> > all "" string literals into u"" assuming the encoding
> > given in the #pragma e.g. Latin-1 or MacRoman -- this would
> > be along the lines of what you have in mind.
>
> Yes, this would probably be acceptable.
>
> > The problem
> > with this is that some string literals might have to map
> > to 8-bit strings, so for these you'd need to write e.g.
> > s"" or something similar.
>
> Right, or call a conversion function.

...but then you have the same problem as before: string literal
modifiers (the small 'u' or 's' in front of the literal) scattered
around in the source code.

Hmm, we need some more ideas in this area I guess...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From mal@lemburg.com Mon Jun 5 14:11:56 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 05 Jun 2000 15:11:56 +0200
Subject: [I18n-sig] Codecs
References: <393A6D89.B9DC952F@prescod.net> <393A704C17D.DF54BRIAN@smtp.garage.co.jp>
Message-ID: <393BA71C.349A7338@lemburg.com>

Brian Takashi Hooper wrote:
> This issue came up before on this list; I think Andy Robinson suggested
> it before in the midst of a lot of other Unicode musings. One thing I
> remember Andy mentioned was that a codec object could then offer methods
> in addition to those required by the codec API, for example a method to
> fix broken legacy-encoding input strings, etc.
>
> Personally, I would be happier to get an object back from
> codecs.lookup(), one vote in favor if it matters.
>
> Are there any good reasons to prefer getting a tuple back from codecs.lookup()?

Here are some:

* The tuple entries have two different flavours: the first two are
readily usable encode/decode APIs, while the last two point to factory
functions which can be used to create new objects.

* Tuples are much easier to create and query at C level than Python
objects having a certain interface.

* The tuples can easily be cached and this is what the codec registry
currently does to enhance performance. Object lookups are slower than
tuple entry lookups (ok, not so much an argument, because the conversion
itself is likely to cause much more overhead).
* There is quite a lot of code in the dist which already uses the tuple
value (all codecs, the codec registry, sample apps, etc.).

* Who's going to write the code and produce the patches ?

Note that you can easily add your own wrappers of codecs.lookup() which
then give you an object instead of the tuple.

The extensibility argument is a problem with the current solution, but
is there really such a great need for extra codec APIs ? (Please
remember that all codec writers would have to implement these new APIs
-- the more you put in there the more difficult and less attractive it
gets...)

> --Brian
>
> On Sun, 04 Jun 2000 09:54:01 -0500 Paul Prescod wrote:
>
> > Should codecs be returned to the user as objects instead of tuples?
> > Today we have:
> >
> >     (UTF8_encode, UTF8_decode,
> >      UTF8_streamreader, UTF8_streamwriter) = codecs.lookup('UTF-8')
> >
> >     output = UTF8_streamwriter( open( '/tmp/output', 'wb') )
> >
> > I think this would be a little simpler:
> >
> >     output = codecs.lookup('UTF-8').stream_writer( open( '/tmp/output', 'wb') )
> >
> > The object solution is more extensible, requires fewer "bogus"
> > assignments and does not require the user to remember the order of the
> > return values.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From mal@lemburg.com Mon Jun 5 14:53:09 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 05 Jun 2000 15:53:09 +0200
Subject: [I18n-sig] Re: Translated messages and 'gettext'
References: <3937D5B0254.6274BRIAN@smtp.garage.co.jp> <3938056E.6B707525@lemburg.com> <3939B8DB4.6275BRIAN@smtp.garage.co.jp>
Message-ID: <393BB0C5.C4323929@lemburg.com>

[gettext and changing languages on the fly]
> Peter's fintl.py, and my modified version of fintl.py, are by themselves
> freestanding modules: they do not require libintl or the gettext
> library; they are Python reimplementations (of just the message
> retrieval API). Peter's is free for any use and my module inherits that
> license (is also free).
>
> The .so I made which modifies the C gettext library is, obviously,
> GPL'ed -- however, it doesn't seem like it would be too hard to, again,
> make a free implementation which just understands GNU (and, if possible,
> Solaris-style and other platforms if there are) .mo files and implements
> only gettext, dgettext, etc.

Hmm, wouldn't it make sense to come up with one standard gettext.py
module which implements all the needed functionality in Python and can
use the wrapped GNU libintl.a optionally if available ?

Wish list:

The module should ideally support all major gettext and similar
l10n-formats and allow changing languages.

Peter's translation object approach seems to fit this best: it could use
mixin classes for the different l10n formats (gettext .mo files, locale
message files, resource files, etc.) and provide the needed caching and
lookup engine as base class.

It would probably be easiest to have one language per instance and
perhaps a translation object factory which implements caching objects
for the different languages currently in use.

...just some thoughts (got no time for this :-()

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
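A minimal sketch of that factory-plus-cache idea -- the Translator stub
below stands in for Peter's fintl-style class and is an assumption, not
existing code:

    class Translator:
        # One language per instance; the catalog would be filled from a
        # .mo file in a real implementation.
        def __init__(self, language):
            self.language = language
            self.catalog = {}
        def gettext(self, message):
            # Fall back to the untranslated message if there is no entry.
            return self.catalog.get(message, message)

    _translators = {}

    def get_translator(language):
        # Factory with caching: at most one Translator per language,
        # so switching languages per user/request stays cheap.
        if not _translators.has_key(language):
            _translators[language] = Translator(language)
        return _translators[language]

    # Per-request use in a multi-user server process:
    #   _ = get_translator(user_language).gettext
    #   label = _("Delete")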
From mal@lemburg.com Mon Jun 5 14:37:37 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 05 Jun 2000 15:37:37 +0200
Subject: [I18n-sig] Codecs
References:
Message-ID: <393BAD21.38C23FF9@lemburg.com>

Andy Robinson wrote:
> > Should codecs be returned to the user as objects instead of tuples?
> > Today we have:
> >
> >     (UTF8_encode, UTF8_decode,
> >      UTF8_streamreader, UTF8_streamwriter) = codecs.lookup('UTF-8')
> >
> >     output = UTF8_streamwriter( open( '/tmp/output', 'wb') )
> >
> > I think this would be a little simpler:
> >
> >     output = codecs.lookup('UTF-8').stream_writer( open( '/tmp/output', 'wb') )
> >
> > The object solution is more extensible, requires fewer "bogus"
> > assignments and does not require the user to remember the order of the
> > return values.
>
> I suggested this a while back, for a different reason. Right now you get
> four things back from lookup() relating to the given encoding. But in
> many cases there may be other encoding-specific routines of great use,
> and returning an object would give us a place to hang them;
> codec.repair(...) and codec.validate(...), for example. There are
> accepted and useful bits of code around to repair Shift-JIS or EUC data
> in which one or two bytes are corrupt. We would also have a place to
> hang language-specific routines.
>
> So I would be very, very happy to see codecs.lookup return a 'codec
> object' with the four attributes encode, decode, streamreader() and
> streamwriter() rather than a tuple.

(Please also see my other post on the subject...)

The tuple design was chosen for speed and because of its simplicity...
please remember that much of the codec registry stuff is written in C
and should be easily accessible and manageable from there.

Note that things like "validate" and "repair" can be handled by
providing new error handling codes and then checking the
encoding/decoding calls for exceptions.

New functionality can easily be added to the stream read/writer objects
which are returned by the factory functions given in the tuple -- these
also allow keeping state and can work on string-like objects via
StringIO.

Perhaps all we need is a simpler interface for codecs.lookup() ? ...
Something like:

    encoder = codecs.encoder('utf-8')
    # ditto for .decoder, .streamwriter, .streamreader

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
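A sketch of that "validate falls out of strict error handling" point;
the function below is illustrative, not a proposed codec API:

    def validate(data, encoding):
        # Strict decoding already detects corrupt input: if the bytes
        # do not form legal data for the given encoding, the codec
        # raises an error instead of returning a Unicode object.
        try:
            unicode(data, encoding)
            return 1
        except UnicodeError:
            return 0

    # e.g. validate(japanese_bytes, 'utf-8') -> 0 for truncated sequences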
From pf@artcom-gmbh.de Mon Jun 5 15:52:44 2000
From: pf@artcom-gmbh.de (Peter Funk)
Date: Mon, 5 Jun 2000 16:52:44 +0200 (MEST)
Subject: [I18n-sig] Re: Translated messages and 'gettext'
In-Reply-To: <393BB0C5.C4323929@lemburg.com> from "M.-A. Lemburg" at "Jun 5, 2000 3:53: 9 pm"
Message-ID:

Hi,

M.-A. Lemburg:
> Hmm, wouldn't it make sense to come up with one standard gettext.py
> module which implements all the needed functionality in Python and can
> use the wrapped GNU libintl.a optionally if available ?

Yes: this is what I have in mind.

> Wish list:
>
> The module should ideally support all major gettext and similar
> l10n-formats and allow changing languages.

At the moment I see no heavy need for binary formats other than the GNU
gettext .mo file format. However it may be useful for people who want to
embed Python into a larger project. So I will try to design an easy way
to plug in readers for other formats.

> Peter's translation object approach seems to fit this best: it could use
> mixin classes for the different l10n formats (gettext .mo files, locale
> message files, resource files, etc.) and provide the needed caching and
> lookup engine as base class.
> It would probably be easiest to have one language per instance and
> perhaps a translation object factory which implements caching objects
> for the different languages currently in use.
>
> ...just some thoughts (got no time for this :-()

I will save your suggestions here and I will *try* to realize them in
time for inclusion into Python 1.6 final.

Regards, Peter

From paul@prescod.net Mon Jun 5 16:21:11 2000
From: paul@prescod.net (Paul Prescod)
Date: Mon, 05 Jun 2000 10:21:11 -0500
Subject: [I18n-sig] Codecs
References: <393A6D89.B9DC952F@prescod.net> <393A704C17D.DF54BRIAN@smtp.garage.co.jp> <393BA71C.349A7338@lemburg.com>
Message-ID: <393BC567.F5FA18EF@prescod.net>

"M.-A. Lemburg" wrote:
> ...
> > Are there any good reasons to prefer getting a tuple back from codecs.lookup()?
>
> Here are some:
>
> * The tuple entries have two different flavours: the first two are
> readily usable encode/decode APIs, while the last two point to factory
> functions which can be used to create new objects.

Right, and with an object syntax you can only deal with the properties
you are interested in, not with all four, all of the time.

> * Tuples are much easier to create and query at C level than Python
> objects having a certain interface.

I don't see that as very important!

> * The tuples can easily be cached and this is what the codec registry
> currently does to enhance performance. Object lookups are slower than
> tuple entry lookups (ok, not so much an argument, because the conversion
> itself is likely to cause much more overhead).

I agree that this is not much of an argument. :)

> * There is quite a lot of code in the dist which already uses the tuple
> value (all codecs, the codec registry, sample apps, etc.).
>
> * Who's going to write the code and produce the patches ?

These two are important arguments but we need to decide what we want
before we start deciding whether it is doable.

> The extensibility argument is a problem with the current solution, but
> is there really such a great need for extra codec APIs ?

I don't know yet. If we knew now, we'd add them now. :)

> (Please remember that all codec writers would have to implement these
> new APIs -- the more you put in there the more difficult and less
> attractive it gets...)

I think that Andy was thinking that codecs might be a useful place to
"hang" arbitrary encoding-related methods -- whether or not they are
standardized. Python is dynamically typed so we don't need to conform to
a restrictive interface definition.

Anyhow, more than the extensibility, returning structured objects is
just more Pythonic. I hate having to remember the position of tuple
return values.

> encoder = codecs.encoder('utf-8')
> # ditto for .decoder, .streamwriter, .streamreader

That might be an acceptable compromise on the syntactic issue....but....

It doesn't seem much more work to just make a version of "lookup" that
wraps tuples in objects. If we took this half-step then we could decide
to move to "full objects" in the future and break a lot less code.

-- 
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
Simplicity does not precede complexity, but follows it.
 - http://www.cs.yale.edu/~perlis-alan/quotes.html
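That half-step can be sketched in a few lines -- a hypothetical wrapper
over the tuple interface, not a shipped API; only codecs.lookup() itself
is assumed:

    import codecs

    class CodecObject:
        # Wraps the 4-tuple from codecs.lookup() so callers use
        # attribute names instead of remembering tuple positions.
        def __init__(self, encoding):
            (self.encode, self.decode,
             self.streamreader, self.streamwriter) = codecs.lookup(encoding)

    # The earlier example then reads:
    #   utf8 = CodecObject('utf-8')
    #   output = utf8.streamwriter(open('/tmp/output', 'wb'))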
From andy@reportlab.com Mon Jun 5 16:49:38 2000
From: andy@reportlab.com (Andy Robinson)
Date: Mon, 5 Jun 2000 16:49:38 +0100
Subject: [I18n-sig] Codecs
In-Reply-To: <393BA71C.349A7338@lemburg.com>
Message-ID:

Replying to MAL slightly out of order:

> Note that you can easily add your own wrappers of codecs.lookup() which
> then give you an object instead of the tuple.
>
> The extensibility argument is a problem with the current solution, but
> is there really such a great need for extra codec APIs ? (Please
> remember that all codec writers would have to implement these new APIs
> -- the more you put in there the more difficult and less attractive it
> gets...)

I'm proposing a place to put non-standard extensions. The whole point is
that these are things which are useful for multi-byte codecs and
non-European languages, but will certainly not exist for all codecs.
These could be exposed as functions within the relevant codec module,
but it seems clean if the codecs module provides the lookup
functionality, and the particular codec can provide new 'services'
itself.

> Here are some:
>
> * The tuple entries have two different flavours: the first two are
> readily usable encode/decode APIs, while the last two point to factory
> functions which can be used to create new objects.
>
> * Tuples are much easier to create and query at C level than Python
> objects having a certain interface.
>
> * The tuples can easily be cached and this is what the codec registry
> currently does to enhance performance. Object lookups are slower than
> tuple entry lookups (ok, not so much an argument, because the conversion
> itself is likely to cause much more overhead).
>
> * There is quite a lot of code in the dist which already uses the tuple
> value (all codecs, the codec registry, sample apps, etc.).
>
> * Who's going to write the code and produce the patches ?

I did argue for this originally at least twice but got ignored by
everyone. Now there is some support, I'll make another bid. If the only
issue is the work involved, then we should first decide if it is the
right thing, then see if we can find the resources to write the patch.

Anyone else got opinions?

- Andy Robinson

From mal@lemburg.com Mon Jun 5 18:38:38 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 05 Jun 2000 19:38:38 +0200
Subject: [I18n-sig] Codecs
References:
Message-ID: <393BE59E.D8D85336@lemburg.com>

Andy Robinson wrote:
> Replying to MAL slightly out of order:
>
> > Note that you can easily add your own wrappers of codecs.lookup() which
> > then give you an object instead of the tuple.
> >
> > The extensibility argument is a problem with the current solution, but
> > is there really such a great need for extra codec APIs ? (Please
> > remember that all codec writers would have to implement these new APIs
> > -- the more you put in there the more difficult and less attractive it
> > gets...)
>
> I'm proposing a place to put non-standard extensions. The whole point is
> that these are things which are useful for multi-byte codecs and
> non-European languages, but will certainly not exist for all codecs.
> These could be exposed as functions within the relevant codec module,
> but it seems clean if the codecs module provides the lookup
> functionality, and the particular codec can provide new 'services'
> itself.

That's already possible via the stream writer/reader object. The two
extra functions encode/decode are really only there to enhance
performance of the builtin encoding machinery (which only needs
stateless converters).

You can easily add new methods to the stream writer and reader objects.
They also allow you to keep state -- which a simple entry in a codec
registry object would not.

Perhaps I'm missing something ?

> > Here are some:
> >
> > * The tuple entries have two different flavours: the first two are
> > readily usable encode/decode APIs, while the last two point to factory
> > functions which can be used to create new objects.
> >
> > * Tuples are much easier to create and query at C level than Python
> > objects having a certain interface.
> >
> > * The tuples can easily be cached and this is what the codec registry
> > currently does to enhance performance. Object lookups are slower than
> > tuple entry lookups (ok, not so much an argument, because the conversion
> > itself is likely to cause much more overhead).
> >
> > * There is quite a lot of code in the dist which already uses the tuple
> > value (all codecs, the codec registry, sample apps, etc.).
> >
> > * Who's going to write the code and produce the patches ?
>
> I did argue for this originally at least twice but got ignored by
> everyone.

Could be that we were too busy with other things, e.g. the source code
encoding debate ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From mal@lemburg.com Mon Jun 5 18:40:59 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 05 Jun 2000 19:40:59 +0200
Subject: [I18n-sig] Codecs
References: <393A6D89.B9DC952F@prescod.net> <393A704C17D.DF54BRIAN@smtp.garage.co.jp> <393BA71C.349A7338@lemburg.com> <393BC567.F5FA18EF@prescod.net>
Message-ID: <393BE62B.182B6AF1@lemburg.com>

[codecs.lookup() returning a tuple]
> > * Tuples are much easier to create and query at C level than Python
> > objects having a certain interface.
>
> I don't see that as very important!

For me it is: I maintain this stuff :-) Adding full object support would
mean that I'd have to write a new C type which supports the object
interface -- I'm not particularly interested in doing so...

> > encoder = codecs.encoder('utf-8')
> > # ditto for .decoder, .streamwriter, .streamreader
>
> That might be an acceptable compromise on the syntactic issue....but....
>
> It doesn't seem much more work to just make a version of "lookup" that
> wraps tuples in objects. If we took this half-step then we could decide
> to move to "full objects" in the future and break a lot less code.

I have no problem with a new lookup API which returns objects, I just
wouldn't want to have the codec registry use these wrapper objects as
basis for doing its work.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
From mal@lemburg.com Fri Jun 9 12:09:19 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 09 Jun 2000 13:09:19 +0200
Subject: [I18n-sig] New Unicode default encoding scheme
Message-ID: <3940D05E.9E266396@lemburg.com>

Hi everybody,

I just wanted to inform you that the Unicode default encoding handling
has changed from the strict UTF-8 setting to a much more flexible
solution which is based on the default locale settings (provided via the
LANG environment variable).

The new default setting is ASCII as per Guido's request.

Here's the important section of the Misc/unicode.txt file. For more
details I refer you to that file in the current CVS tree.

"""
Unicode Default Encoding:
-------------------------

The Unicode implementation has to make some assumption about the
encoding of 8-bit strings passed to it for coercion and about the
encoding to use as default for conversion of Unicode to strings when no
specific encoding is given. This encoding is called <default encoding>
throughout this text.

If not otherwise defined or set, the <default encoding> is set to
'ascii'. For this, the implementation maintains a global <default
encoding> which can be set in the site.py Python startup script.
Subsequent changes are not possible. The <default encoding> can be set
and queried using the two sys module APIs:

sys.setdefaultencoding(encoding) --> Sets the <default encoding> used by
the Unicode implementation. encoding has to be an encoding which is
supported by the Python installation, otherwise, a LookupError is
raised. Note: This API is only available in site.py !

sys.getdefaultencoding() --> Returns the current <default encoding>.

To enhance usability of Unicode coercion, the <default encoding> is set
in the default site.py startup module according to the encoding defined
by the locale active when the site.py module gets executed. The locale
module is used to extract the encoding from the locale default settings
defined in the LANG environment variable (and possibly others -- see
locale.py). If the encoding cannot be determined, is unknown or
unsupported, site.py defaults to setting the <default encoding> to
'ascii'. This encoding is also the startup default of Python (and in
effect before site.py is executed).
"""

Example:

cnri/Python+Unicode> setenv LANG de_DE:utf8
cnri/Python+Unicode> ./python
>>> import sys
>>> sys.getdefaultencoding()
'utf'
>>> print u"äöü"
äöü
>>>
cnri/Python+Unicode> setenv LANG de_DE:latin1
cnri/Python+Unicode> ./python
>>> import sys
>>> sys.getdefaultencoding()
'latin1'
>>> print u"äöü"
äöü
>>>

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
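A sketch of the site.py hook this describes -- assuming
locale.getdefaultlocale() can derive an encoding from LANG; the fallback
logic and exception list are illustrative:

    import sys, locale

    encoding = 'ascii'  # Python's startup default
    try:
        # getdefaultlocale() inspects LANG (and friends) and returns
        # e.g. ('de_DE', 'latin1'); the encoding part may be None.
        loc, enc = locale.getdefaultlocale()
        if enc:
            encoding = enc
    except (ImportError, ValueError):
        pass

    if encoding != 'ascii':
        # Only available while site.py runs; see the text above.
        sys.setdefaultencoding(encoding)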