From martin@loewis.home.cs.tu-berlin.de Fri Sep 1 08:17:34 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Fri, 1 Sep 2000 09:17:34 +0200
Subject: [I18n-sig] Translating doc strings
Message-ID: <200009010717.JAA02431@loewis.home.cs.tu-berlin.de>

Now that Python 2 supports the gettext API and methodology, I'd like to start discussion on translating messages in the Python core and library proper. I see two different kinds of messages in the Python source: printed messages, which are produced in the course of running the interpreter (including, say, informative parameters to exceptions), and doc strings (which are not normally printed during program execution, but are instead retrieved by a developer).

I have produced patch 101320, which is available from

http://sourceforge.net/patch/?func=detailpatch&patch_id=101320&group_id=5470

The patch consists of a message catalog for Python doc strings, the beginnings of a German translation thereof, a compiled version of the German catalog, and makefile machinery to install the catalogs.

When discussing this patch with BeOpen, Barry and Guido raised concerns about the size of the catalog; Barry proposed to split it into pieces. Splitting the patch into pieces has its own problems: how to split, and should the pieces become their own textual domains?

How would users retrieve translations of the doc strings in the first place? I have proposed patch 101313

http://sourceforge.net/patch/?func=detailpatch&patch_id=101313&group_id=5470

which introduces a doc() function, so that users could write

>>> doc("".split)
S.split([sep [,maxsplit]]) -> Liste von Strings

Gib eine Liste der Worte im String S zurück, mit sep als Trennstring.
Wenn maxsplit angegeben ist, werden höchstens maxsplit Worte
abgetrennt. Wenn sep nicht angegeben ist, gelten beliebige
Whitespace-Strings als Trenner.

This interface has a number of advantages:

- you don't have to type print in front to get line breaks displayed properly
- you don't have to type _ four times
- it will transparently retrieve the translation if available

For this to work, all doc strings must be in a single textual domain. The implementation of the doc function will retrieve the __doc__ attribute of the argument and look for a translation.

With that approach, the next question is: what is the name of the textual domain, and how are translations managed? My proposal was "pylib"; Barry's "docstring". As for the management of translations, I'd like to ask the Free Translation Project for help. As soon as we've settled the technical issues, I'd like to submit a catalog for translation.

Comments?

Martin
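A minimal sketch of how such a doc() helper could work, assuming the single "pylib" domain proposed above; this illustrates the idea with the standard gettext module and is not the code of patch 101313 itself:

    import gettext

    try:
        _catalog = gettext.translation('pylib')   # domain name from the proposal
    except IOError:
        _catalog = gettext.NullTranslations()     # no catalog installed: show originals

    def doc(object):
        # Fetch the original docstring and print its translation,
        # falling back to the original text when none is available.
        docstring = getattr(object, '__doc__', None)
        if docstring:
            print _catalog.gettext(docstring)

With this in place, doc("".split) would print the translated docstring when a "pylib" catalog is installed, and the English original otherwise.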
From guido@beopen.com Fri Sep 1 16:54:46 2000
From: guido@beopen.com (Guido van Rossum)
Date: Fri, 01 Sep 2000 10:54:46 -0500
Subject: [I18n-sig] Translating doc strings
In-Reply-To: Your message of "Fri, 01 Sep 2000 09:17:34 +0200." <200009010717.JAA02431@loewis.home.cs.tu-berlin.de>
References: <200009010717.JAA02431@loewis.home.cs.tu-berlin.de>
Message-ID: <200009011554.KAA09534@cj20424-a.reston1.va.home.com>

> How would users retrieve translations of the doc strings in the first
> place? I have proposed patch 101313
>
> http://sourceforge.net/patch/?func=detailpatch&patch_id=101313&group_id=5470
>
> which introduces a doc() function, so that users could write
>
> >>> doc("".split)
> S.split([sep [,maxsplit]]) -> Liste von Strings
>
> Gib eine Liste der Worte im String S zurück, mit sep als Trennstring.
> Wenn maxsplit angegeben ist, werden höchstens maxsplit Worte
> abgetrennt. Wenn sep nicht angegeben ist, gelten beliebige
> Whitespace-Strings als Trenner.

I like the interface fine. (Some might prefer to call it help().)

> This interface has a number of advantages:
> - you don't have to type print in front to get line breaks displayed
>   properly
> - you don't have to type _ four times
> - it will transparently retrieve the translation if available

In an IDE, doc() could be replaced by something that pops up the docs in a separate window.

> For this to work, all doc strings must be in a single textual
> domain. The implementation of the doc function will retrieve the
> __doc__ attribute of the argument and look for a translation.

Hmm... This lumps together *all* documentation for *all* modules and packages. What about documentation for 3rd party packages? How will your doc() deal with unrelated objects that somehow have the same (probably brief) docstring but for which the translation (depending on context) should be different?

For functions, classes, methods and instances, the module name is easily accessible, e.g.:

>>> import rfc822
>>> m = rfc822.Message(open("/dev/null"))
>>> m.__class__.__name__
'Message'
>>> m.__class__.__module__
'rfc822'
>>>

(For submodules of packages, __module__ gives the full package name.)

--Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)

From martin@loewis.home.cs.tu-berlin.de Fri Sep 1 19:58:16 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Fri, 1 Sep 2000 20:58:16 +0200
Subject: [I18n-sig] Translating doc strings
In-Reply-To: <200009011554.KAA09534@cj20424-a.reston1.va.home.com> (message from Guido van Rossum on Fri, 01 Sep 2000 10:54:46 -0500)
References: <200009010717.JAA02431@loewis.home.cs.tu-berlin.de> <200009011554.KAA09534@cj20424-a.reston1.va.home.com>
Message-ID: <200009011858.UAA00812@loewis.home.cs.tu-berlin.de>

> Hmm... This lumps together *all* documentation for *all* modules and
> packages.

Yes, it would. In itself, I don't see it as a problem. In the lumped-together form, only translators see it. This will guarantee consistency of terminology (e.g. is it "Strings" or "Zeichenketten"; what is "Slicing"?).

> What about documentation for 3rd party packages?

That is indeed a problem.

> For functions, classes, methods and instances, the module name is
> easily accessible, e.g.:
>
> >>> import rfc822
> >>> m = rfc822.Message(open("/dev/null"))
> >>> m.__class__.__name__
> 'Message'
> >>> m.__class__.__module__
> 'rfc822'
> >>>

I see two problems with using the package name. Exactly how do you obtain it for functions? f.func_globals['__name__']? And for builtin functions? As for __module__: I know, it was my idea, after all :-)

The other problem is that this would lead to an inflation of hundreds of .mo files. I'd prefer to have one per product (in the Zope sense). One heuristic would be to use the catalog that _ is bound to, i.e.

    def module_of(symbol):
        ...  # as above

    def catalog_of(symbol):
        return sys.modules[module_of(symbol)]._.im_self

There could be an official protocol as well, of course, but a global catalog together with that convention might do.

Regards,
Martin

From pinard@iro.umontreal.ca Sat Sep 2 14:34:51 2000
From: pinard@iro.umontreal.ca (François Pinard)
Date: 02 Sep 2000 09:34:51 -0400
Subject: [I18n-sig] UTF-8 decoder in CVS still buggy
In-Reply-To: M.-A. Lemburg mal@lemburg.com's message of "Mon, 24 Jul 2000 10:26:25 +0200"
Message-ID:

[mal@lemburg.com]

> Please keep us informed of any quirks you may experience during this
> conversion. We can use some real life reports for the new Unicode
> support in Python to polish up the implementation and design.

Hi, people. I just recently subscribed to i18n-sig, and started to read the archives. Let me hope you will tolerate that I jump into some conversations without having matured all the background.

On the above topic, I did not check what Python exactly does, but I wanted to share that my `recode' program is not perfect in that area. In particular, there is a requirement for UTF-8 to be valid that the sequence be minimal, which `recode' currently does not check on input. Roughly said, a UTF-8 sequence is not valid if it could have been expressed in fewer bytes.

I've nothing against Python beating me at it! :-)

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard
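To make the minimality rule concrete, here is a rough sketch of the check a strict decoder has to perform (illustrative only; this is neither recode's nor Python's actual code):

    def is_overlong(code, nbytes):
        # Smallest code point that genuinely needs nbytes bytes in UTF-8;
        # anything smaller, encoded with nbytes bytes, is an invalid
        # "overlong" sequence.
        minimum = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}[nbytes]
        return nbytes > 1 and code < minimum

    # The two-byte sequence C0 80 decodes to code point 0, but 0 fits in
    # a single byte, so is_overlong(0, 2) is true and the sequence must
    # be rejected.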
From pinard@iro.umontreal.ca Sat Sep 2 14:49:14 2000
From: pinard@iro.umontreal.ca (François Pinard)
Date: 02 Sep 2000 09:49:14 -0400
Subject: [I18n-sig] Python Translation
In-Reply-To: Dinesh Nadarajah dindin2k@yahoo.com's message of "Mon, 10 Jul 2000 20:12:58 -0700 (PDT)"
Message-ID:

[dindin2k@yahoo.com]

> Is there any work/target towards translating Python to other
> languages, i.e. some sort of structure like the *.po files in KDE such
> that native languages can be substituted for the standard keywords?
> Are there any plans to port Python to other (human) languages?

I would not think there is. Some while ago, I wrote to Guido about i18n issues, and to my surprise, he replied quite strongly against the above suggestion, which I did not even make in my letter. So, I presumed the issue was rather hot for him, for him to read it where it was not written :-). Guido's main point is that it goes against source portability.

Yet, even if I do not remember having discussed this with Guido, I think it would be a good idea. Some shops develop in-house code never meant to be exported, and being able to use diacritics within identifiers, and even translated keywords, would locally help a lot, and not hurt anybody outside. For one of my contracts, I'm working in such a shop.

I had a very comfortable experience with such things when I was younger, which lasted for many years, using a French adaptation of a Pascal compiler. See `http://www.iro.umontreal.ca/~pinard/accents/bonjour.tar.gz' for some archived code from this period (better to like French and CDC machines! :-).

My point is that source portability might be a concern for some, but not for everybody, and I wish Python were open enough not to impose source portability where it has no meaning. If Python can be nationally comfortable, just let it be, and let users choose where their priorities are.

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard

From mal@lemburg.com Sat Sep 2 15:03:46 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Sat, 02 Sep 2000 16:03:46 +0200
Subject: [I18n-sig] UTF-8 decoder in CVS still buggy
References:
Message-ID: <39B108C2.F22A0660@lemburg.com>

François Pinard wrote:
>
> [mal@lemburg.com]
>
> > Please keep us informed of any quirks you may experience during this
> > conversion. We can use some real life reports for the new Unicode
> > support in Python to polish up the implementation and design.
>
> Hi, people. I just recently subscribed to i18n-sig, and started to
> read the archives. Let me hope you will tolerate that I jump into some
> conversations without having matured all the background.
>
> On the above topic, I did not check what Python exactly does, but I wanted to
> share that my `recode' program is not perfect in that area. In particular,
> there is a requirement for UTF-8 to be valid that the sequence be minimal,
> which `recode' currently does not check on input. Roughly said, a UTF-8
> sequence is not valid if it could have been expressed in fewer bytes.
>
> I've nothing against Python beating me at it! :-)

Could you give some examples? I'm not sure I understand what you mean by "could have been expressed with fewer bytes" -- perhaps a multi-byte encoding where the top-most bytes are 0?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From guido@beopen.com Sat Sep 2 16:46:35 2000
From: guido@beopen.com (Guido van Rossum)
Date: Sat, 02 Sep 2000 10:46:35 -0500
Subject: [I18n-sig] Python Translation
In-Reply-To: Your message of "02 Sep 2000 09:49:14 -0400."
References:
Message-ID: <200009021546.KAA02082@cj20424-a.reston1.va.home.com>

> [dindin2k@yahoo.com]
>
> > Is there any work/target towards translating Python to other
> > languages, i.e. some sort of structure like the *.po files in KDE such
> > that native languages can be substituted for the standard keywords?
> > Are there any plans to port Python to other (human) languages?

[pinard@iro.umontreal.ca]
> I would not think there is. Some while ago, I wrote to Guido about i18n
> issues, and to my surprise, he replied quite strongly against the above
> suggestion, which I did not even make in my letter. So, I presumed the
> issue was rather hot for him, for him to read it where it was not written :-).
> Guido's main point is that it goes against source portability.
>
> Yet, even if I do not remember having discussed this with Guido, I think
> it would be a good idea. Some shops develop in-house code never meant
> to be exported, and being able to use diacritics within identifiers, and
> even translated keywords, would locally help a lot, and not hurt anybody
> outside. For one of my contracts, I'm working in such a shop.
>
> I had a very comfortable experience with such things when I was younger,
> which lasted for many years, using a French adaptation of a Pascal compiler.
> See `http://www.iro.umontreal.ca/~pinard/accents/bonjour.tar.gz' for some
> archived code from this period (better to like French and CDC machines! :-).
>
> My point is that source portability might be a concern for some, but not for
> everybody, and I wish Python were open enough not to impose source portability
> where it has no meaning. If Python can be nationally comfortable, just
> let it be, and let users choose where their priorities are.

Let me restate my position. It's not a priority for me, and I believe that most in the Python community probably don't see it as a priority for themselves either. There is so much else to do that I don't see myself putting effort into it. But if it is a priority for you, I won't stop you! It would probably best be implemented as a custom translator. We're thinking about making the Python chain of command (input loop -> parser -> compiler -> optimizer -> bytecode interpreter -> runtime) more pluggable in future (post-2.0) versions, and an internationalization pass would easily plug in there.

--Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)

From "Fredrik Lundh"
Message-ID: <02bf01c014fb$2e66a6c0$766940d5@hagrid>

François Pinard wrote:
> Hi, people. I just recently subscribed to i18n-sig, and started to
> read the archives. Let me hope you will tolerate that I jump into some
> conversations without having matured all the background.
>
> On the above topic, I did not check what Python exactly does, but I wanted to
> share that my `recode' program is not perfect in that area. In particular,
> there is a requirement for UTF-8 to be valid that the sequence be minimal,
> which `recode' currently does not check on input. Roughly said, a UTF-8
> sequence is not valid if it could have been expressed in fewer bytes.

for security reasons, the UTF-8 codec gives you an "illegal encoding" error in this case.

mal wrote:
> Could you give some examples? I'm not sure I understand what you
> mean by "could have been expressed with fewer bytes" -- perhaps
> a multi-byte encoding where the top-most bytes are 0?

quoting RFC 2279:

Implementors of UTF-8 need to consider the security aspects of how they handle illegal UTF-8 sequences. It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-8 parser by sending it an octet sequence that is not permitted by the UTF-8 syntax.

A particularly subtle form of this attack could be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters. For example, a parser might prohibit the NUL character when encoded as the single-octet sequence 00, but allow the illegal two-octet sequence C0 80 and interpret it as a NUL character. Another example might be a parser which prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the illegal octet sequence 2F C0 AE 2E 2F.

From mal@lemburg.com Sat Sep 2 18:05:08 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Sat, 02 Sep 2000 19:05:08 +0200
Subject: [I18n-sig] UTF-8 decoder in CVS still buggy
References: <02bf01c014fb$2e66a6c0$766940d5@hagrid>
Message-ID: <39B13344.1DCCB05A@lemburg.com>

Fredrik Lundh wrote:
>
> François Pinard wrote:
> > On the above topic, I did not check what Python exactly does, but I wanted to
> > share that my `recode' program is not perfect in that area. In particular,
> > there is a requirement for UTF-8 to be valid that the sequence be minimal,
> > which `recode' currently does not check on input. Roughly said, a UTF-8
> > sequence is not valid if it could have been expressed in fewer bytes.
>
> for security reasons, the UTF-8 codec gives you an "illegal encoding"
> error in this case.
>
> mal wrote:
> > Could you give some examples? I'm not sure I understand what you
> > mean by "could have been expressed with fewer bytes" -- perhaps
> > a multi-byte encoding where the top-most bytes are 0?
>
> quoting RFC 2279:
>
> Implementors of UTF-8 need to consider the security aspects of how
> they handle illegal UTF-8 sequences. It is conceivable that in some
> circumstances an attacker would be able to exploit an incautious
> UTF-8 parser by sending it an octet sequence that is not permitted by
> the UTF-8 syntax.
>
> A particularly subtle form of this attack could be carried out
> against a parser which performs security-critical validity checks
> against the UTF-8 encoded form of its input, but interprets certain
> illegal octet sequences as characters. For example, a parser might
> prohibit the NUL character when encoded as the single-octet sequence
> 00, but allow the illegal two-octet sequence C0 80 and interpret it
> as a NUL character. Another example might be a parser which
> prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
> illegal octet sequence 2F C0 AE 2E 2F.

Hmm...

>>> unicode('\xC0\x80','utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: illegal encoding
>>> unicode('\x2F\x2E\x2E\x2F','utf-8')
u'/../'
>>> unicode('\x2F\xC0\xAE\x2E\x2F','utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: illegal encoding
>>>

... so what's buggy about the codec?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From "Fredrik Lundh"
References: <02bf01c014fb$2e66a6c0$766940d5@hagrid> <39B13344.1DCCB05A@lemburg.com>
Message-ID: <02ef01c01502$57479200$766940d5@hagrid>

mal wrote:
> >>> unicode('\xC0\x80','utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeError: UTF-8 decoding error: illegal encoding
> >>> unicode('\x2F\x2E\x2E\x2F','utf-8')
> u'/../'
> >>> unicode('\x2F\xC0\xAE\x2E\x2F','utf-8')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeError: UTF-8 decoding error: illegal encoding
> >>>
>
> ... so what's buggy about the codec?

nothing -- François posted under a misleading subject, without checking the code first.

(and I never write buggy code anyway ;-)

From pinard@iro.umontreal.ca Sat Sep 2 21:13:25 2000
From: pinard@iro.umontreal.ca (François Pinard)
Date: 02 Sep 2000 16:13:25 -0400
Subject: [I18n-sig] UTF-8 decoder in CVS still buggy
In-Reply-To: "Fredrik Lundh"'s message of "Sat, 2 Sep 2000 19:22:05 +0200"
References: <02bf01c014fb$2e66a6c0$766940d5@hagrid> <39B13344.1DCCB05A@lemburg.com> <02ef01c01502$57479200$766940d5@hagrid>
Message-ID:

[Fredrik Lundh]

> nothing -- François posted under a misleading subject,
> without checking the code first.

I wrote that I did not check the code, so I'm safe there. But it is also true that I did not check, nor change, the subject; I merely replied to a message.

> (and I never write buggy code anyway ;-)

Far from me the idea to suggest otherwise! :-)

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard

From tdickenson@geminidataloggers.com Mon Sep 4 09:17:04 2000
From: tdickenson@geminidataloggers.com (Toby Dickenson)
Date: Mon, 04 Sep 2000 09:17:04 +0100
Subject: [I18n-sig] Terminology gap
In-Reply-To: <39AE5F20.E68BC43F@lemburg.com>
References: <39AE5F20.E68BC43F@lemburg.com>
Message-ID:

On Thu, 31 Aug 2000 15:35:28 +0200, "M.-A. Lemburg" wrote:

>Toby Dickenson wrote:
>>
>> I've recently been updating my documentation to account for Unicode
>> issues, and have been troubled by the lack of a good name to describe
>> an object that can be *either* a "plain string" or a "unicode string".
>
>I usually use "8-bit string" and "Unicode object".
>
>> My best attempt so far is to call it a "string-like object", but that
>> feels too long for something so common.
>>
>> I would like to use the simple "string", but a quick poll of my local
>> developers suggests that this does not convey the unicode option.
>>
>> Does anyone have any suggestions?
>
>I think the accepted term is "string", since someday Python will
>have a string base class. Unicode objects and 8-bit strings will
>then be subclasses of this string class.

I think the more specific use of "string" will be a hard habit to break....

>>> type('')
<type 'string'>

Toby Dickenson
tdickenson@geminidataloggers.com

From andy@reportlab.com Mon Sep 4 10:00:11 2000
From: andy@reportlab.com (Andy Robinson)
Date: Mon, 4 Sep 2000 10:00:11 +0100
Subject: [I18n-sig] Terminology gap
In-Reply-To: <39AE5F20.E68BC43F@lemburg.com>
Message-ID:

> > My best attempt so far is to call it a "string-like object", but that
> > feels too long for something so common.
> >
> > I would like to use the simple "string", but a quick poll of my local
> > developers suggests that this does not convey the unicode option.
> >
> > Does anyone have any suggestions?
>
> I think the accepted term is "string", since someday Python will
> have a string base class. Unicode objects and 8-bit strings will
> then be subclasses of this string class.

I agree with MAL. "string" should refer to an interface; people doing i18n stuff could then write their own ones in future if needed. I cannot get at CVS this week, but I think we actually checked a UserString class into the standard library in order to clearly define the interface for string-like objects.

- Andy Robinson.
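Until such a base class exists, code that must accept either kind of string has to test for both concrete types. A tiny illustrative helper (hypothetical, not part of UserString or any library):

    import types

    def is_string_like(obj):
        # Accept both 8-bit strings and Unicode objects.
        return type(obj) in (types.StringType, types.UnicodeType)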
From andy@reportlab.com Mon Sep 4 10:00:16 2000
From: andy@reportlab.com (Andy Robinson)
Date: Mon, 4 Sep 2000 10:00:16 +0100
Subject: [I18n-sig] Python Translation
In-Reply-To: <200009021546.KAA02082@cj20424-a.reston1.va.home.com>
Message-ID:

> But if it is a priority for you, I won't stop you! It would probably
> best be implemented as a custom translator. We're thinking about
> making the Python chain of command (input loop -> parser -> compiler
> -> optimizer -> bytecode interpreter -> runtime) more pluggable in
> future (post-2.0) versions, and an internationalization pass would
> easily plug in there.

For inspiration on what can be done with pluggable parsers, check out Damian Conway's Lingua::Romana::Perligata. He built an alternate syntax and parser for Perl in Latin, getting a lot of help from the Monash classics department on the correct case endings to substitute for $, @ and all that stuff. Don't ask me why. (Sorry, I don't have a URL and am off line at the moment.)

BTW, I sat next to him at an author signing at which someone was volunteering to do the Klingon port and make Perl the official scripting language of the Klingon empire. It seems like there is More Than One Way to Say "die" in Klingon. We'd better watch out.

- Andy Robinson

From loewis@informatik.hu-berlin.de Mon Sep 4 14:11:41 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 15:11:41 +0200 (MET DST)
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: (message from François Pinard on 02 Sep 2000 11:59:14 -0400)
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de>
Message-ID: <200009041311.PAA27712@pandora.informatik.hu-berlin.de>

[Martin v. Löwis]
> > The textual domain of a module will relate to what _ binds to. Doc
> > strings won't be wrapped into _(); as a result, you can't use the
> > binding of _.

[François Pinard]
> "_(__doc__)" should work if the docstring shares the textual domain of
> the rest of the module, which looks like the correct thing to do in
> my eyes.

I don't see how this could work for doc strings of classes, methods and functions. Do you propose to write

    def foo():
        _("This does the foo thing.")
        pass

That won't work; the parser won't recognize it as a doc string.

Regards,
Martin

From loewis@informatik.hu-berlin.de Mon Sep 4 14:14:34 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 15:14:34 +0200 (MET DST)
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: (message from François Pinard on 02 Sep 2000 12:05:12 -0400)
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net>
Message-ID: <200009041314.PAA27902@pandora.informatik.hu-berlin.de>

[Barry A. Warsaw]
> So maybe for /docstrings/ there should be one domain, and then each module
> can have its own domain for its own additional translatable strings?

[François Pinard]
> I do not understand the advantage of doing this. Of course, if we do
> not need the translation of docstrings, these should not be collected
> for translation. But if they get collected, there is no reason to have
> a separate domain for them. It is just natural that they be part of the
> domain for the collection of modules they are part of.

How would you access the doc strings? Today, I do

>>> import httplib
>>> print httplib.HTTP.__doc__
This class manages a connection to an HTTP server.

Now, how do I get to the translation of this message?

Regards,
Martin

From loewis@informatik.hu-berlin.de Mon Sep 4 14:29:25 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 15:29:25 +0200 (MET DST)
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: (message from François Pinard on 02 Sep 2000 11:46:23 -0400)
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de>
Message-ID: <200009041329.PAA28928@pandora.informatik.hu-berlin.de>

[François Pinard]
> People might fear that the POT file is too time consuming to load all
> at once. If this is the case, then the problem lies in the implementation
> of the `gettext' interface. I repeated all along that it should be lazily
> evaluated, exactly to avoid that an insufficient implementation becomes
> an excuse to split a textual domain into many smaller ones.

I have started translating the Python doc strings into German, and have covered about 30% so far. Using the Python 2 gettext.py, I did not experience any noticeable delay in loading the mo file on my 300MHz machine. While I agree that lazy loading may become necessary, I think it is ok to implement the feature when the problem actually arises. I'm pretty certain you can implement lazy access without changing the existing API.

> People might fear that the PO file would take too much memory. On
> modern systems, there is no problem `mmap'ing a file, as virtual
> address space is more than enough to hold even big translation
> files. The Python difficulty, here, is that it is (nicely) portable
> to some less capable systems, where `mmap' has no equivalent.

The Python 2 mmap works on Unix and Win32. It probably is the best solution if available.

> In my opinion, the solution might then be for these systems to load
> the MO hash tables only, and then retrieve messages from disk.

If you load the hash tables, does this give enough information so that you can use two seek(2) calls only, on average? If so, it would probably be good if there was a) documentation for the hash table format, and/or b) an implementation of it in Python.
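For reference, the fixed-size header of a GNU .mo file already locates the string tables and the hash table, so reading it is cheap. A sketch of parsing it, based on the layout documented by GNU gettext (illustrative, not gettext.py's code):

    import struct

    def mo_header(filename):
        # First 28 bytes: magic, revision, number of strings, offset of
        # the originals table, offset of the translations table, hash
        # table size, hash table offset.
        data = open(filename, 'rb').read(28)
        if struct.unpack('<I', data[:4])[0] == 0x950412deL:
            fmt = '<7I'        # little-endian catalog
        elif struct.unpack('>I', data[:4])[0] == 0x950412deL:
            fmt = '>7I'        # big-endian catalog
        else:
            raise ValueError, 'not a GNU .mo file'
        magic, revision, nstrings, orig_off, trans_off, \
            hash_size, hash_off = struct.unpack(fmt, data)
        return nstrings, orig_off, trans_off, hash_size, hash_off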
> The last fear might be that the POT file might be too big for
> translators to handle.

That indeed is my concern. The largest catalog so far was Lynx (AFAICT), with 1100 messages. I guess gcc might also be pretty large.

> One of the goals of the Translation Project has been to promote a
> clean separation of responsibilities between software maintainers
> and national translators, as software maintainers spontaneously have
> a wide variety of (often contradictory) opinions about how (and even
> when!) translators should work :-). It is a difficult aspect of the
> overall thing, in fact.

I think for the Python docstring catalog, we can give some guidance - perhaps by shipping not all at once, but waiting for translators to complete the most interesting things first (like docstrings for the builtin core functions).

I'm certain it will take some time to get translations back, so if we want to have something in the next release (after 2.0), we should start today.

Regards,
Martin

From loewis@informatik.hu-berlin.de Mon Sep 4 14:42:57 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 15:42:57 +0200 (MET DST)
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: (message from François Pinard on 03 Sep 2000 16:04:41 -0400)
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net>
Message-ID: <200009041342.PAA29915@pandora.informatik.hu-berlin.de>

> I do not see, nor understand, why we should have special API provisions
> for Unicode. I thought a great effort had been put into the Unicode support
> design so that it would be as transparent as possible. Isn't making Unicode
> explicit going against this spirit?

In Python 2, Unicode strings are a separate type from byte strings. The catalog objects will have two methods, one for retrieving a byte string, as it appears in the mo file, and one for retrieving a Unicode string. It is then the application developer's choice whether his application can deal with Unicode messages on output or not. The core issue is that catalogs only map byte strings to byte strings.

> Should not "_(...)" return either a simple string or a Unicode string,
> depending solely on the goal language? Would not all the rest just fall
> out naturally from this choice? What is the problem that I do not
> see?

You can't be certain that the encoding of the catalog msgstrs is the same as the one of the user. For example, the catalog may use KOI-8, whereas the user's terminals are all in UTF-8. So you have to know the catalog's encoding. This, in turn, is only available if the catalog follows the convention of containing a valid Content-Type field in the translation of the empty string. Or, the Python installation may not have the converter from the .mo file's encoding to Unicode. Also, how would the goal language determine whether Unicode is a better representation for messages than some MBCS?

> Also, what means "GNUTranslations" above? What is especially "GNU" in
> the act of translating? Should not we just avoid any "GNU"
> references?

The format of the catalog files is defined by GNU gettext.

Regards,
Martin
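One way a Unicode-returning lookup could use the Content-Type convention just described (a sketch; the class and method names here are hypothetical, not necessarily what gettext.py will adopt):

    import gettext

    class UnicodeTranslations(gettext.GNUTranslations):

        def ugettext(self, message):
            # The catalog's charset is declared in the translation of the
            # empty string, e.g. "Content-Type: text/plain; charset=koi8-r".
            charset = 'ascii'
            for line in self.gettext('').split('\n'):
                line = line.lower()
                if line.startswith('content-type:') and line.find('charset=') >= 0:
                    charset = line.split('charset=')[1].strip()
            return unicode(self.gettext(message), charset)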
From loewis@informatik.hu-berlin.de Mon Sep 4 14:56:56 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 15:56:56 +0200 (MET DST)
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: (message from François Pinard on 03 Sep 2000 16:19:13 -0400)
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de>
Message-ID: <200009041356.PAA01550@pandora.informatik.hu-berlin.de>

> [Martin von Loewis]
>
> > Also, after discussion, I think we concluded that supporting alternative
> > locale categories is useless; the code should always assume LC_MESSAGES.

[François Pinard]
> The charset selection could also be part of the LANG specification (after
> a period), or implied by the LC_CTYPE value (which itself might be derived
> from LC_ALL). To make things a bit worse, many packages allow LANGUAGE
> to override LANG.

That was not the issue here. The question was whether dcgettext should be supported, which allows specifying a category other than LC_MESSAGES when looking for catalogs.

> LANGUAGE is an extension of LANG allowing fallback languages,
> something that people asked for when `gettext' was designed
> and which looked reasonable to us (yet Richard objected that we were
> losing time over this).

Yes, gettext.py supports this convention.

> I also wanted to stress another point. Regionalised translation files
> automatically fall back on non-regionalised files when available, on a
> message-per-message basis. For example, a typical `de_AT' (Austrian
> German) translation file contains only a few re-translations; the bulk
> of them is still kept within `de'.

The current gettext supports trying these in order. However, looking at the implementation, it seems both conventions are implemented incorrectly: the fall-backs are used when opening the catalog. When the catalog is there, but lookup finds that a message is not translated, it won't try the fall-backs. Instead, it will just return the English message.

In the case of LANGUAGE, I think this is acceptable: if you set it to de:sv, you may get German, Swedish, or English translations. However, in real life, you either get German or Swedish, since catalogs are likely full translations, or not present at all.

As for de_AT falling back to de on a per-message basis - gettext.py doesn't do that. As for 'a typical' de_AT file: I have a total of 2 de_AT files on my installation, whereas I have 211 de translations. So it seems that the typical de_AT translation is empty, in which case it would indeed fall back to de.

Regards,
Martin
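A per-message fallback of the kind described could be layered on top of the module, along these lines (a sketch that assumes gettext.translation() accepts a languages list; ChainedCatalog is an invented name):

    import gettext

    class ChainedCatalog:
        # ChainedCatalog('python', ['de_AT', 'de']) tries each catalog in
        # turn, so a regional file falls back per message, not per file.
        def __init__(self, domain, languages):
            self._catalogs = []
            for lang in languages:
                try:
                    self._catalogs.append(
                        gettext.translation(domain, languages=[lang]))
                except IOError:
                    pass                   # no such catalog; skip it

        def gettext(self, message):
            for catalog in self._catalogs:
                translated = catalog.gettext(message)
                if translated != message:  # crude "was it translated?" test
                    return translated
            return message                 # untranslated: return the original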
From loewis@informatik.hu-berlin.de Mon Sep 4 15:00:10 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 16:00:10 +0200 (MET DST)
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: (message from François Pinard on 03 Sep 2000 16:33:50 -0400)
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <14758.53732.459528.857102@anthem.concentric.net>
Message-ID: <200009041400.QAA02532@pandora.informatik.hu-berlin.de>

[François Pinard]
> Near the time of the beginnings of the Translation Project, the
> mentality was that a PO file could be used to translate from any
> original language to any goal language - the original language
> being, of course, the language used by the programmer. With only a
> few exceptions, I can say that almost all examples I saw or handled
> use English as the original language. But the spirit was open to
> the fact that people could program in their own national language, and
> _then_ have translation files towards English.
>
> Currently, this openness is getting reversed. Not only is the
> original language mandated to be English in the spirit of many,
> there are now pressures for the charset in use to be a small subset of
> ASCII, with some strange code already committed for parameterising
> ASCII to Unicode conversions (I've strong and probably biased
> opinions in that debate, so better not let me try to summarise it
> here :-). A sure thing is that it looks all wrong to me, as just
> giving in to highly pedantic complexity.
>
> So, not only would I like Python to do it better, but I would
> welcome it if Python allowed the original language to be based on
> either ASCII or Unicode, as transparently as possible, of
> course.

Isn't that limited by the structure of mo files? You'd somehow have to know what encoding to use when looking into the catalog - the content type only talks about the encoding of the translations.

Regards,
Martin

From pinard@iro.umontreal.ca Mon Sep 4 15:06:40 2000
From: pinard@iro.umontreal.ca (François Pinard)
Date: 04 Sep 2000 10:06:40 -0400
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 15:11:41 +0200 (MET DST)"
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <200009041311.PAA27712@pandora.informatik.hu-berlin.de>
Message-ID:

[Martin von Loewis]

> [François Pinard]
> > "_(__doc__)" should work if the docstring shares the textual domain of
> > the rest of the module, which looks like the correct thing to do in
> > my eyes.

> I don't see how this could work for doc strings of classes, methods
> and functions. Do you propose to write
>
>     def foo():
>         _("This does the foo thing.")
>         pass
>
> That won't work; the parser won't recognize it as a doc string.

Of course. The idea is to write:

    def foo():
        "This does the foo thing."
        pass

and at some later place:

    print _(foo.__doc__)

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard

From loewis@informatik.hu-berlin.de Mon Sep 4 15:16:29 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 16:16:29 +0200 (MET DST)
Subject: [I18n-sig] Re: Marking translatable strings
In-Reply-To: (message from François Pinard on 04 Sep 2000 08:45:37 -0400)
References: <200008280639.IAA24958@pandora.informatik.hu-berlin.de> <200008281626.SAA04073@pandora.informatik.hu-berlin.de> <14763.14225.909612.157094@anthem.concentric.net> <39B35742.24C18621@lemburg.com>
Message-ID: <200009041416.QAA04065@pandora.informatik.hu-berlin.de>

> I much prefer this as well, and `i' as a string modifier would be welcome.
> However, this requires a change to the Python interpreter. If we can
> obtain that this change be done, then that's wonderful. However, if such
> a change is out of the question for some reason, quote mangling is our best
> next choice for delayed strings. Be sure that if i"..." gets adopted in
> Python as a kind of "ignored" modifier, I'll modify the PO utils so it is the
> preferred form, and deprecate quote mangling soon after 2.0 is out.
>
> Another advantage of i"..." is that it could be used to segregate and
> mark doc-strings needing translation at run-time from those not really
> needing it. It's better than extracting either all of them, or none.

Is there any precedent of a large Python application that uses (or could use) that kind of lazy translation of strings?

Regards,
Martin

From pinard@iro.umontreal.ca Mon Sep 4 15:25:03 2000
From: pinard@iro.umontreal.ca (François Pinard)
Date: 04 Sep 2000 10:25:03 -0400
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 15:14:34 +0200 (MET DST)"
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de>
Message-ID:

> [Barry A. Warsaw]
> > So maybe for /docstrings/ there should be one domain, and then each module
> > can have its own domain for its own additional translatable strings?

> [François Pinard]
> > I do not understand the advantage of doing this. Of course, if we do
> > not need the translation of docstrings, these should not be collected
> > for translation. But if they get collected, there is no reason to have
> > a separate domain for them. It is just natural that they be part of the
> > domain for the collection of modules they are part of.

[Martin von Loewis]
> How would you access the doc strings? Today, I do
>
> >>> import httplib
> >>> print httplib.HTTP.__doc__
> This class manages a connection to an HTTP server.
>
> Now, how do I get to the translation of this message?

I do not imagine all the details, but I think the spirit of the thing is that at "import httplib" time, some function (or class instantiator) was called at the top level of the httplib module, to produce a translating function, which the httplib module soon assigned to the `_' variable, or to something else if the programmer did not like `_'. The httplib module transmitted its translation domain to the mechanism generating the translating function.

If it were systematic that `_' was assigned to, we could try to retrieve the function stored in the `_' global variable of `httplib', and then use it to translate any docstring from httplib. However, it would be nicer if the constraint of using `_' for the translating function did not exist, and if the choice was rather completely left at the discretion of the programmer. If we use `_' systematically in the documentation examples we produce, it is likely to become the popular choice, but let's avoid mandating it.

If we are not forcing `_', the doc() or help() function able to retrieve the translated docstring would have to be a bit more clever. I'm not familiar enough with the Python system variables to know exactly how to do this, but I have the feeling that it would not be hard to organise without having to make the API any less simple than it already is. The mechanism producing the translating function and the help() function (let me confess I have a preference for `help' over `doc' :-) could be designed so they collaborate, if a straightforward implementation of help() appears difficult.

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard
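One possible reading of this convention, as a sketch (the helper names are invented; it assumes a participating module binds its translating function to `_'):

    import sys

    def module_of(object):
        # Functions carry their defining module in func_globals;
        # classes and methods record it in __module__.
        if hasattr(object, 'func_globals'):
            return object.func_globals['__name__']
        if hasattr(object, 'im_class'):
            object = object.im_class
        return getattr(object, '__module__', None)

    def help(object):
        docstring = getattr(object, '__doc__', None)
        if not docstring:
            return
        module = sys.modules.get(module_of(object))
        translate = getattr(module, '_', None)
        if callable(translate):
            docstring = translate(docstring)
        print docstring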
From loewis@informatik.hu-berlin.de Mon Sep 4 15:45:20 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 16:45:20 +0200 (MET DST)
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: (message from François Pinard on 04 Sep 2000 10:25:03 -0400)
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de>
Message-ID: <200009041445.QAA06095@pandora.informatik.hu-berlin.de>

> I do not imagine all the details, but I think the spirit of the
> thing is that at "import httplib" time, some function (or class
> instantiator) was called at the top level of the httplib module, to
> produce a translating function, which the httplib module soon
> assigned to the `_' variable, or to something else if the programmer
> did not like `_'. The httplib module transmitted its translation
> domain to the mechanism generating the translating function.

Ok, so if _ is bound, all is well. That brings us back to square one: should we split the Python library into different textual domains? If yes, then how? *If* we decide to split that, it would be very easy to extract doc strings of different modules into different catalogs.

Even in that case, I guess there would be some code left that did not have its own textual domain. So there would still be the need for some kind of "fallback" domain for the docstrings. The proposed operation of the help function would then be that:

- if the module of the object (function, class, etc.) can be established, and has _ bound, then translate the doc string in the catalog associated with _;
- else, try to translate the doc string in the domain for Python doc strings ("pydoc"?).

However, you also brought up the point that the doc strings should use the same catalog as any other strings of the Python core, and that this should be a single domain (e.g. "python"). In that case, lookup would fall back to the "python" domain, and it would not matter whether _ was bound in any of the modules of the standard Python library.

Regards,
Martin

From pinard@iro.umontreal.ca Mon Sep 4 17:32:09 2000
From: pinard@iro.umontreal.ca (François Pinard)
Date: 04 Sep 2000 12:32:09 -0400
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 16:45:20 +0200 (MET DST)"
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de>
Message-ID:

[Martin von Loewis]

> > I do not imagine all the details, but I think the spirit of the
> > thing is that at "import httplib" time, some function (or class
> > instantiator) was called at the top level of the httplib module, to
> > produce a translating function, which the httplib module soon
> > assigned to the `_' variable, or to something else if the programmer
> > did not like `_'. The httplib module transmitted its translation
> > domain to the mechanism generating the translating function.

> Ok, so if _ is bound, all is well.

Not necessarily. `_' could be bound to a lot of things, not necessarily a translating function.

> That brings us back to square one: should we split the Python library
> into different textual domains?

I miss the logic of sliding over the snake, down to square one. I perceive the issues as rather orthogonal. How are they connected?

> If yes, then how? *If* we decide to split that, it would be very easy
> to extract doc strings of different modules into different catalogs.

Everything should be easy. It is just not "convenient" to handle a multiplicity of domains without very serious reasons to do so. Best is to use one textual (or translation) domain per distribution of a system or package.

> Even in that case, I guess there would be some code left that did not
> have its own textual domain. So there would still be the need for some
> kind of "fallback" domain for the doc strings.

Why should we use separate domains for doc strings?

> The proposed operation of the help function would then be that:
> - if the module of the object (function, class, etc.) can be
>   established, and has _ bound, then translate the doc string
>   in the catalog associated with _;

My feeling is that we should not rely on `_'. The variable used to hold the translating function should be left at the discretion of the user.

> - else, try to translate the doc string in the domain for Python
>   doc strings ("pydoc"?).

Why not just use the textual domain of a module to translate the doc strings it contains? It may well happen that, if the module comes with the Python distribution, it will have "python" for its textual domain. But it might come from anywhere, and we cannot predict the textual domain of a randomly imported module.

However, all modules holding translated strings should also get, right on initial import, a translating function out of their textual domain, and the mechanics producing that translating function might save a correspondence between the module and the textual domain for that module (unless we find something more straightforward). It should be possible to communicate with the mechanics to get a copy of the translating function for that module, and use that function to translate doc strings held within that module.

> However, you also brought up the point that the doc strings should use
> the same catalog as any other strings of the Python core,

I just checked the `To:' of your message to make sure, and indeed, you are writing to me :-). No, I'm pretty sure I never said that, or else, if I did, I surely was extremely tired! :-) Simplicity asks that doc strings share the textual domain of all other strings for the same module. Is there a need to do otherwise?

Keep happy!

P.S. - I slightly begin to fear that we will not have a full, clear consensus by the 4th of September... :-)

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard
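François's "save a correspondence" idea could look something like the following (a hypothetical sketch; install_domain and the registry are invented names, and a module would remain free to bind the returned function to any name it likes):

    import gettext

    _domain_registry = {}    # module name -> catalog, recorded at import time

    def install_domain(module_name, domain):
        # Called once at the top of a module; returns its translating
        # function and remembers which catalog the module uses.
        try:
            catalog = gettext.translation(domain)
        except IOError:
            catalog = gettext.NullTranslations()
        _domain_registry[module_name] = catalog
        return catalog.gettext

    def translate_docstring(module_name, docstring):
        # Used by help() to translate a docstring found in that module.
        catalog = _domain_registry.get(module_name)
        if catalog:
            return catalog.gettext(docstring)
        return docstring

A module would then start with something like _ = install_domain(__name__, 'httplib'), under whatever name it prefers.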
From pinard@iro.umontreal.ca Mon Sep 4 17:59:27 2000
From: pinard@iro.umontreal.ca (François Pinard)
Date: 04 Sep 2000 12:59:27 -0400
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 15:29:25 +0200 (MET DST)"
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <200009041329.PAA28928@pandora.informatik.hu-berlin.de>
Message-ID:

[Martin von Loewis]
> I have started translating the Python doc strings into German, and
> have covered about 30% so far. Using the Python 2 gettext.py, I did not
> experience any noticeable delay in loading the mo file on my 300MHz
> machine. While I agree that lazy loading may become necessary, I think
> it is ok to implement the feature when the problem actually arises.
> I'm pretty certain you can implement lazy access without changing the
> existing API.

Excellent. You just have to remember this if you ever read someone asking that we split textual domains into "smaller" or "more manageable" parts. We should then correct implementations and tools as needed, rather than give in to splitting, or multiplying, textual domains.

> The Python 2 mmap works on Unix and Win32. It probably is the best
> solution if available.

Wow! Good news. Our luck would be for it to work on Macintosh as well...

> > In my opinion, the solution might then be for these systems to load
> > the MO hash tables only, and then retrieve messages from disk.

> If you load the hash tables, does this give enough information so that
> you can use two seek(2) calls only, on average? If so, it would
> probably be good if there was a) documentation for the hash table
> format, and/or b) an implementation of it in Python.

We could use the compendium of all existing PO files in the Translation Project to establish statistics (I'm not rushing into doing this today! :-). My guess is that we could hold full hash tables in memory through a quick swallow, and that double hashing would later guarantee a single seek on average. The precise hash algorithm is only documented in the sources. Using it from GNU `gettext' would raise questions about how the GPL applies, but using the copy bought by the Danish UUG should be OK. Best might be to postpone this for now, according to the first quoted paragraph of this message.

> > The last fear might be that the POT file might be too big for
> > translators to handle.

> That indeed is my concern. The largest catalog so far was Lynx
> (AFAICT), with 1100 messages. I guess gcc might also be pretty large.

It should be the concern of translators, national teams, or the Translation Project, but surely not the concern of programmers acting on behalf of translators. At the start of the Translation Project, it was a recurrent difficulty that each and every programmer felt the need to decide how translators should work. Better to keep responsibilities well separated: everybody sleeps better, and is happier in the long run.

> I think for the Python docstring catalog, we can give some guidance -
> perhaps by shipping not all at once, but waiting for translators to
> complete the most interesting things first (like docstrings for
> the builtin core functions).

No, no, I don't think so. As programmers, we should just not interfere. Believe me, people do not need so much of our precious "guidance".

> I'm certain it will take some time to get translations back, so if
> we want to have something in the next release (after 2.0), we should
> start today.

This is another thing. You have to lose hope, _right now_, of ever keeping all translations synchronous with releases. Some teams, and only a few of them, react in a fast way, but most teams are slow. You will live in endless irritation, and might end up pretty disgusted, if you start trying to push and pull on teams. You have to calm your own soul, and become quiet. Consider, as a programmer, that your job is to internationalise your scripts (and maybe to comply, once in a while, when you receive reports about too much English grammar being burned into your run-time construction of strings), and then to accept translations from translators, almost blindly, without judging whether they are worth being distributed or not.

Linguistic matters are something to be discussed and resolved between the national team of translators for a language and the users of that language. You have to detach yourself, as a programmer, from all such concerns: they are not yours. The quality of your package is orthogonal to, and independent from, the quality of translations; this should be absolutely clear for everybody.

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard

From pinard@iro.umontreal.ca Mon Sep 4 18:08:08 2000
From: pinard@iro.umontreal.ca (François Pinard)
Date: 04 Sep 2000 13:08:08 -0400
Subject: [I18n-sig] Re: Marking translatable strings
In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 16:16:29 +0200 (MET DST)"
References: <200008280639.IAA24958@pandora.informatik.hu-berlin.de> <200008281626.SAA04073@pandora.informatik.hu-berlin.de> <14763.14225.909612.157094@anthem.concentric.net> <39B35742.24C18621@lemburg.com> <200009041416.QAA04065@pandora.informatik.hu-berlin.de>
Message-ID:

[Martin von Loewis]
> Is there any precedent of a large Python application that uses (or
> could use) that kind of lazy translation of strings?

A friend and I marked all of an older Mailman, and delayed translations are needed here and there, as for most other big programs. It is typical of many applications, anyway. I would not think that Python is very special in this particular aspect, compared to other languages. In my experience, delayed translations are not often needed on average, and yet are inescapable here and there, once in a while. In the case of Python, of course, all doc strings are inherently delayed, but maybe they are not necessarily always meant to be translated in every application. (I guess this may be debated. :-)

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard

From pinard@iro.umontreal.ca Mon Sep 4 18:19:04 2000
From: pinard@iro.umontreal.ca (François Pinard)
Date: 04 Sep 2000 13:19:04 -0400
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 15:56:56 +0200 (MET DST)"
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <200009041356.PAA01550@pandora.informatik.hu-berlin.de>
Message-ID:

[Martin von Loewis]
> However, looking at the implementation, it seems both conventions are
> implemented incorrectly: the fall-backs are used when opening the catalog.

This can be seen as an important optimisation, indeed.

> When the catalog is there, but lookup finds that a message is
> not translated, it won't try the fall-backs.

It most probably should, to respect the spirit of fall-backs.

> In the case of LANGUAGE, I think this is acceptable: if you set it to
> de:sv, you may get German, Swedish, or English translations. However,
> in real life, you either get German or Swedish, since catalogs are likely
> full translations, or not present at all.

The truth of experience is that for many teams, translations will lag behind releases, and you will often not have full translation files; a few holes will exist.
It is then more important that fall-backs are taken on a per message basis. > As for de_AT falling back to de on a per-message basis - gettext.py > doesn't do that. As for 'a typical' de_AT file: I have a total of 2 > de_AT files on my installation, whereas I have 211 de translations. > So it seems that the typical de_AT translation is empty, in which case > it would indeed fall back to de. Indeed :-). When `de_AT' does not even exist, no need to consider it. -- François Pinard http://www.iro.umontreal.ca/~pinard From pinard@iro.umontreal.ca Mon Sep 4 18:29:32 2000 From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=) Date: 04 Sep 2000 13:29:32 -0400 Subject: [I18n-sig] Re: gettext in the standard library In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 16:00:10 +0200 (MET DST)" References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <14758.53732.459528.857102@anthem.concentric.net> <200009041400.QAA02532@pandora.informatik.hu-berlin.de> Message-ID: > [François Pinard] > > So, not only would I like that Python does it better, but I would > > welcome if Python was allowing the original language to be based on > > either ASCII or Unicode, the most transparently as possible, of > > course. [Martin von Loewis] > Isn't that limited by the structure of mo files? You'd somehow have to > know what encoding to use when looking into the catalog - the content > type only talks about the encoding of the translations. It is surely a bit sad that the PO file header (the translation of the empty string) has no current provision to describe `msgstr' language and encoding. Yet, in practice, as long as the POT file is automatically derived from the sources, each `msgstr' is identical to how it appears in the sources, and consequently, it uses in the POT file the same encoding that in the source. So, it is likely that retrieving the `msgstr' at run-time will work. Problems would arise if the source strings were recoded, between string extraction by POT tools, and string usage for translation at run-time. Python will likely "internalise" or convert Unicode strings from UTF-8, and this is a change of representation. Maybe we could do similar changes in the POT extractors, so the match occurs. This might become difficult if the Python sources are coded in other things than UTF-8. But whatever means will exist for Python to do the conversion, POT extractors might have to be modified to use the same means. Matches shall occur. -- François Pinard http://www.iro.umontreal.ca/~pinard From loewis@informatik.hu-berlin.de Mon Sep 4 18:32:31 2000 From: loewis@informatik.hu-berlin.de (Martin von Loewis) Date: Mon, 4 Sep 2000 19:32:31 +0200 (MET DST) Subject: [I18n-sig] Re: Patch 101320: doc strings In-Reply-To: (message from =?ISO-8859-1?Q?Fran=E7ois?= Pinard on 04 Sep 2000 12:32:09 -0400) References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> Message-ID: <200009041732.TAA14636@pandora.informatik.hu-berlin.de> > > That brings us back to square one: Should we split the Python library > > into different textual domains? >=20 > I miss the logic of sliding over the snake, down to square one. 
I perceive
> the issues as rather orthogonal. How are they connected?

[me, paraphrasing]
Me: I propose a single domain for docstrings, 'pylib'.
Barry: This is too large, split it up.
Me: Then how do you access the individual strings?
Barry: Use the module name.
Me: This will give too many domains.
Barry: There might be some point in having a /docstring/ domain [for the python library]
You: Why should we have one? All docstrings should be in the same domain as the module.
Me: Then how do you access individual strings?
You: I don't know, but maybe you can use the binding of _.
Me: I propose a single domain for docstrings. Anybody proposing a different organisation?

> > Even in that case, I guess there would be some code left that did not
> > have its own textual domain. So there would still be the need for some
> > kind of "fallback" domain for the doc strings.
>
> Why should we use separate domains for doc strings?

I did not propose a *separate* domain for doc strings. I proposed that there is one well-known domain in which the doc strings of the core python library can be found. I don't care too much at this time whether it contains anything else - there are no other translatable strings in the Python sources at this point in time.

> My feeling is that we should not rely on `_'. The variable used to hold
> the translating function should be left at the discretion of the
> user.

Well, what else do you propose?

> > - else, try to translate the doc string in the domain for Python
> >   doc strings ("pydoc"?).
>
> Why not just use the textual domain of a module, to translate the doc
> strings it contains?

How do I find out the textual domain of a module? How do I find out the module of a builtin function?

> However, all modules holding translated strings should also get,
> right on initial import, a translating function out of their textual
> domain, and the mechanics producing that translating function might
> save a correspondence between the module and the textual domain for
> that module (unless we find something more straightforward).

So you propose that there be some kind of protocol to be observed by a module that wants to make "its" textual domain known. What is that protocol? I also propose a protocol: A module can announce its textual domain by binding _. It may choose not to bind _, or it may choose not to bind it to a catalog method. In either case, it does not follow the protocol, so anybody using that protocol may get some kind of failure.

> > However, you also brought the point that the doc strings should use
> > the same catalog as any other strings of the Python core,
>
> I just checked the `To:' of your message to make sure, and indeed, you
> are writing to me :-). No, I'm pretty sure I never said that, or else,
> if I did, I surely was extremely tired! :-) Simplicity asks that doc
> strings share the textual domain of all other strings for the same module.
> Is there a need to do otherwise?

Maybe my logic is somewhat flawed:
- Did you agree that doc strings of a module should use the same domain as all other strings of the module?
- Did you propose that a single package, distributed as a whole, should have a single textual domain?
- Do you agree that the Python core+libs is a single package?

From that, I'd conclude that you are in favour of having a single domain for the Python core+libs, which contains both doc strings and other translatable strings of Python core+libs.

> P.S.
- I slightly begin to fear that we will not have a full, clear
> consensus by the 4th of September... :-)

I've given up on having message catalogs in the Python 2.0 distribution. Since there is no point in having the catalog without any translations, this is not so urgent. What *is* urgent is to give the catalog to the translators.

Regards, Martin

From loewis@informatik.hu-berlin.de Mon Sep 4 18:44:42 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 19:44:42 +0200 (MET DST)
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: (message from =?ISO-8859-1?Q?Fran=E7ois?= Pinard on 04 Sep 2000 12:59:27 -0400)
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <200009041329.PAA28928@pandora.informatik.hu-berlin.de>
Message-ID: <200009041744.TAA15131@pandora.informatik.hu-berlin.de>

> > I'm certain it will take some time to get translations back, so if
> > we want to have something in the next release (after 2.0), we should
> > start today.
>
> This is another thing. You have to lose hope, _right now_, of ever keeping
> all translations synchronous with releases.

I never had this hope - this is the first thing the gettext manual told me a few years ago. However, would you then conclude to the contrary: Teams never finish, so we don't need to start?

> You will live with endless irritation, and might end up pretty disgusted,
> if you start trying to push and pull on teams.

I certainly won't push teams. At the moment, I'm pushing Python maintainers to grant me the freedom to release an already existing catalogue. As a translator, I'm always frustrated when my translations aren't used in released software (*). The reason for that is that quite a lot of translations tend to get fuzzy in a short time.

(*) In the German catalog of GNU grep, which I maintain, a number of option descriptions appear in English in grep 2.3, even though they had mostly-correct translations. It does not help at all that the manual says I should not worry - I did. In grep 2.4.2, everything is fine - mainly as a result of better coordination.

Regards, Martin

From pinard@iro.umontreal.ca Mon Sep 4 18:49:33 2000
From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=)
Date: 04 Sep 2000 13:49:33 -0400
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 15:42:57 +0200 (MET DST)"
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200009041342.PAA29915@pandora.informatik.hu-berlin.de>
Message-ID:

[Martin von Loewis]
> > I do not see, nor understand, why we should have special API provisions
> > for Unicode. I thought a great effort had been put into Unicode support
> > design so it would be as transparent as possible. Isn't making Unicode
> > explicit going against this spirit?

> In Python 2, unicode strings are a separate type from byte strings.
> The catalog objects will have two methods, one for retrieving a byte
> string, as it appears in the mo file, and one for retrieving a unicode
> string. It is then the application developer's choice whether his
> application can deal with Unicode messages on output or not.

You are merely re-stating that there is a special API for Unicode, here. I got this already! :-). My question is about why it is necessary.
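To make the two-method design concrete, here is a minimal sketch; the class and attribute names are illustrative only, not the actual gettext.py interface, and only the gettext/ugettext split itself is taken from Martin's description:

    class Catalog:
        def __init__(self, messages, charset):
            # messages maps msgid byte strings to msgstr byte strings,
            # exactly as they sit in the mo file; charset comes from the
            # Content-Type line of the PO header entry.
            self.messages = messages
            self.charset = charset

        def gettext(self, message):
            # Byte-string interface: return the msgstr as stored.
            return self.messages.get(message, message)

        def ugettext(self, message):
            # Unicode interface: decode the stored msgstr first.
            return unicode(self.gettext(message), self.charset)

An application whose output channels are Unicode-safe would bind _ to ugettext; one that writes raw bytes would bind it to gettext. Either way the return type is the same on every call.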
> You can't be certain that the encoding of the catalog msgstrs is the
> same as the one of the user. For example, the catalog may use KOI-8,
> whereas the user's terminals are all in UTF-8. So you have to know the
> catalog's encoding.

Yes, it is described in the PO file header (the translation of the empty string). The idea is to convert KOI-8 (or whatever) while retrieving the translation. Most of the time, the conversion will be to Unicode. In some very rare cases, like for the Netherlands, ASCII is sufficient. This all can be done automatically; I do not see why we need two APIs.

> the Python installation may not have the converter from the .mo file's
> encoding to Unicode.

I thought Python 2.0 was to come with a comprehensive set of conversion routines for doing such things. If we ever find that one is missing, we might try to add it, shouldn't we?

> Also, how would the goal language determine whether Unicode is a better
> representation for messages than some MBCS?

Oh, no doubt that this may lead to hot debates. I thought that Python was trying to give special treatment to Unicode. You might remember, I do not know, that I tried to warn people that Unicode is not the end of everything. I guess you are saying the same thing, here. :-) For translation purposes, I thought Python was to produce either ASCII or UTF-8 rather automatically on output. It is likely to produce a mix, as the original strings are written in ASCII most of the time, and do not all get translated. If something else is needed on output, I thought the intent was to override UTF-8 as an output encoding, yet still use Unicode internally, instead of any MBCS, taking advantage of all the magic Python 2.0 will have in that respect. Otherwise, you have to make your Python script aware of those encodings a lot more, and internationalisation becomes much more intrusive in your sources, while we wanted it to be as lightweight as possible.

> > Also, what does "GNUTranslations" above mean? What is especially "GNU"
> > in the act of translating? Should not we just avoid any "GNU"
> > references?

> The format of the catalog files is defined by GNU gettext.

Let's avoid "GNU" in the terminology, if we avoid the GPL. They usually go together! :-) And besides, I think we should not overly insist, in the documentation or in the API, on the fact that a particular `gettext' is used underneath.

-- François Pinard http://www.iro.umontreal.ca/~pinard

From loewis@informatik.hu-berlin.de Mon Sep 4 18:52:29 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 19:52:29 +0200 (MET DST)
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: (message from =?ISO-8859-1?Q?Fran=E7ois?= Pinard on 04 Sep 2000 13:19:04 -0400)
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <200009041356.PAA01550@pandora.informatik.hu-berlin.de>
Message-ID: <200009041752.TAA15488@pandora.informatik.hu-berlin.de>

> The truth of experience is that for many teams, translations will lag
> behind releases, and you will often not have full translation files; a
> few holes will exist. It is then more important that fall-backs are taken
> on a per-message basis.

I agree in principle. From a practical point of view: Do you know any user that actually has a LANGUAGE setting listing more than one language?
Even in the sv:de example, there is still a chance that neither the Swedish nor the German catalog has a translation, so the user would get three languages on her screen. I don't know anybody who'd prefer that over just falling back to English.

Regards, Martin

From loewis@informatik.hu-berlin.de Mon Sep 4 19:01:48 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 20:01:48 +0200 (MET DST)
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: (message from =?ISO-8859-1?Q?Fran=E7ois?= Pinard on 04 Sep 2000 13:29:32 -0400)
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <14758.53732.459528.857102@anthem.concentric.net> <200009041400.QAA02532@pandora.informatik.hu-berlin.de>
Message-ID: <200009041801.UAA15838@pandora.informatik.hu-berlin.de>

> Problems would arise if the source strings were recoded, between string
> extraction by POT tools, and string usage for translation at run-time.
> Python will likely "internalise" or convert Unicode strings from UTF-8,
> and this is a change of representation.

Currently, to put Unicode strings into source code, you'll have to use \u escapes in your source (e.g. print u"\u263A"). I'm not aware of any editor that transparently displays these beasts. So if you want to have non-English msgid strings using the Unicode standard (rather than Unicode objects), your best bet is probably to encode the Python source as UTF-8. As a result, you'll use byte strings as parameters to _, which is supported well by the API. [As a side note: I would have preferred if u"" strings had UTF-8 inside them. As it is, I doubt anybody will use them for things other than WHITE SMILING FACE]. With byte strings, Python won't do any internalisation, so at run time, you'll always have the same byte string that you got at extraction time.

Regards, Martin

From loewis@informatik.hu-berlin.de Mon Sep 4 19:13:57 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 20:13:57 +0200 (MET DST)
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: (message from =?ISO-8859-1?Q?Fran=E7ois?= Pinard on 04 Sep 2000 13:49:33 -0400)
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200009041342.PAA29915@pandora.informatik.hu-berlin.de>
Message-ID: <200009041813.UAA16309@pandora.informatik.hu-berlin.de>

> > In Python 2, unicode strings are a separate type from byte strings.
> > The catalog objects will have two methods, one for retrieving a byte
> > string, as it appears in the mo file, and one for retrieving a unicode
> > string. It is then the application developer's choice whether his
> > application can deal with Unicode messages on output or not.
>
> You are merely re-stating that there is a special API for Unicode, here.
> I got this already! :-). My question is about why it is necessary.

Which part do you deem unnecessary? The part returning a byte string, or the part returning a Unicode string?

> Yes, it is described in the PO file header (the translation of the empty
> string). The idea is to convert KOI-8 (or whatever) while retrieving
> the translation. Most of the time, the conversion will be to Unicode.
> In some very rare cases, like for the Netherlands, ASCII is sufficient.
> This all can be done automatically; I do not see why we need two APIs.
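What "done automatically" could mean in code, as a sketch: pull the charset out of the catalog's header entry (the translation of the empty string) and decode each msgstr with it on the way out. The function names are illustrative only, and the header parsing is deliberately crude:

    import re

    def catalog_charset(catalog):
        # The PO/MO header travels as the "translation" of the empty msgid;
        # its Content-Type line names the charset of all msgstrs.
        header = catalog.get('', '')
        match = re.search('charset=([-\w.]+)', header)
        if match:
            return match.group(1)
        return 'ascii'

    def translate(catalog, message):
        # Always hand back a Unicode string, whatever the file encoding.
        return unicode(catalog.get(message, message),
                       catalog_charset(catalog))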
So you are proposing that an application cannot tell in advance what the return type of _ will be? In some application, writing

    header = '\x01\x01'
    body = _('warning')
    message = header + body

Will this work or not? Answer: It depends. In the Netherlands, it will work; elsewhere, it won't.

> I thought Python 2.0 was to come with a comprehensive set of conversion
> routines for doing such things. If we ever find that one is missing,
> we might try to add it, shouldn't we?

I think it was decided not to include the JIS something tables in the Python 2 distribution, because they are too large to include.

> > Also, how would the goal language determine whether Unicode is a better
> > representation for messages than some MBCS?
>
> Oh, no doubt that this may lead to hot debates.

I did not really ask for an opinion, I asked for an algorithm:

    def mbcs_p(parameters):
        your code here

> For translation purposes, I thought Python was to produce either ASCII
> or UTF-8 rather automatically on output. It is likely to produce a mix,
> as the original strings are written in ASCII most of the time, and do
> not all get translated.

In Python 2.0, developers should be aware at all times whether they operate on Unicode strings or on byte strings. Python will try to do the right thing if there is a clear right thing, and try to raise exceptions whenever it is not so clear what the right thing would be. Having an API that sometimes returns Unicode strings and sometimes byte strings (depending on environment variables (!)) would be just terrible.

> If something else is needed on output, I thought the intent was to
> override UTF-8 as an output encoding, yet still use Unicode
> internally, instead of any MBCS, taking advantage of all the magic
> Python 2.0 will have in that respect.

Maybe it's a terminology issue: I consider UTF-8 an MBCS (multi-byte character set); UTF-8 strings are byte strings, not Unicode strings.

> Otherwise, you have to make your Python script aware of those encodings
> a lot more, and internationalisation becomes much more intrusive in your
> sources, while we wanted it to be as lightweight as possible.

I simply want to give users a choice. If they choose "let's try Unicode", they have the choice. If they find it all works, well. Otherwise, they can go for byte strings, with a different set of limitations.

Regards, Martin

From mal@lemburg.com Mon Sep 4 19:39:03 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 04 Sep 2000 20:39:03 +0200
Subject: [I18n-sig] Re: gettext in the standard library
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <14758.53732.459528.857102@anthem.concentric.net> <200009041400.QAA02532@pandora.informatik.hu-berlin.de> <200009041801.UAA15838@pandora.informatik.hu-berlin.de>
Message-ID: <39B3EC47.D54483D2@lemburg.com>

Martin von Loewis wrote:
>
> > Problems would arise if the source strings were recoded, between string
> > extraction by POT tools, and string usage for translation at run-time.
> > Python will likely "internalise" or convert Unicode strings from UTF-8,
> > and this is a change of representation.
>
> Currently, to put Unicode strings into source code, you'll have to use
> \u escapes in your source (e.g. print u"\u263A"). I'm not aware of any
> editor that transparently displays these beasts.
You could wrap the decoding processing into the _ function:

    def _(s):
        return unicode(s, "utf-8")

This would allow you not only to use translatable strings, but also any unicode string encoding you like, e.g. utf-8 or latin-1.

Once the "declare" statement is in place you should also be able to write:

    declare encoding = "utf-8"
    ... u"utf-8 encoded string" ...

in Python source code.

-- Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From loewis@informatik.hu-berlin.de Mon Sep 4 19:48:58 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 20:48:58 +0200 (MET DST)
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: <39B3EC47.D54483D2@lemburg.com> (mal@lemburg.com)
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <14758.53732.459528.857102@anthem.concentric.net> <200009041400.QAA02532@pandora.informatik.hu-berlin.de> <200009041801.UAA15838@pandora.informatik.hu-berlin.de> <39B3EC47.D54483D2@lemburg.com>
Message-ID: <200009041848.UAA17574@pandora.informatik.hu-berlin.de>

> You could wrap the decoding processing into the _ function:
>
>     def _(s):
>         return unicode(s, "utf-8")
>
> This would allow you not only to use translatable strings,
> but also any unicode string encoding you like, e.g. utf-8
> or latin-1.

Maybe I'm missing something here. How does the catalog come into play in this definition of _?

Regards, Martin

From mal@lemburg.com Mon Sep 4 20:39:07 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 04 Sep 2000 21:39:07 +0200
Subject: [I18n-sig] Re: gettext in the standard library
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <14758.53732.459528.857102@anthem.concentric.net> <200009041400.QAA02532@pandora.informatik.hu-berlin.de> <200009041801.UAA15838@pandora.informatik.hu-berlin.de> <39B3EC47.D54483D2@lemburg.com> <200009041848.UAA17574@pandora.informatik.hu-berlin.de>
Message-ID: <39B3FA5B.A8A4BF19@lemburg.com>

Martin von Loewis wrote:
>
> > You could wrap the decoding processing into the _ function:
> >
> >     def _(s):
> >         return unicode(s, "utf-8")
> >
> > This would allow you not only to use translatable strings,
> > but also any unicode string encoding you like, e.g. utf-8
> > or latin-1.
>
> Maybe I'm missing something here. How does the catalog come into play
> in this definition of _?

That was just an example of how you could add the decoding functionality to the _ function. You would of course also add a gettext.gettext call somewhere in there which translates the string first (possibly recoding it to some other encoding for the table lookup first).
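Spelled out, the combined wrapper being described could look like the following sketch; it assumes the catalog's msgids and msgstrs are both stored as UTF-8, which makes the recoding step mentioned above unnecessary:

    import gettext

    def _(s):
        # Translate the byte string first, then decode the result.
        return unicode(gettext.gettext(s), "utf-8")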
-- Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From pinard@iro.umontreal.ca Mon Sep 4 21:28:29 2000
From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=)
Date: 04 Sep 2000 16:28:29 -0400
Subject: [I18n-sig] Re: Patch 101320: doc strings
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de>
Message-ID:

[Martin von Loewis]
> Me: I propose a single domain for docstrings, 'pylib'.
> Barry: This is too large, split it up.
> Me: Then how do you access the individual strings?
> Barry: Use the module name.
> Me: This will give too many domains.
> Barry: There might be some point in having a /docstring/ domain [for
> the python library]

Oh, oh! So, I see that Barry is the bad guy, after all? :-)

> I did not propose a *separate* domain for doc strings. I proposed that
> there is one well-known domain in which the doc strings of the core
> python library can be found. I don't care too much at this time whether
> it contains anything else - there are no other translatable strings in
> the Python sources at this point in time.

OK, then, let's call it "python", and if other strings are needed from the Python distribution, let's also use that "python" domain for them as well. Very fine with me! :-)

> > My feeling is that we should not rely on `_'. The variable used to hold
> > the translating function should be left at the discretion of the user.
>
> Well, what else do you propose?

Nothing special. I suggest that we systematically use `_' in the documentation and examples, but that we also avoid forcing the issue in any way from within Python. Let's use a function in `locale', say, to get a Translations instance (say) given the textual domain. The language to use could be obtained from environment variables, as well as the search path for the MO file, unless such things get overridden by keyword arguments.

> > Why not just use the textual domain of a module, to translate the doc
> > strings it contains?
>
> How do I find out the textual domain of a module? How do I find out
> the module of a builtin function?

You ask the `locale' module to return you a Translations instance for that module, maybe through another keyword argument stating the name of the module for which you need a translator. This would only work if that module previously registered its domain name, by asking for the creation of a Translations instance for itself (and without specifying the keyword argument naming a module, of course).

> So you propose that there be some kind of protocol to be observed by a
> module that wants to make "its" textual domain known. What is that
> protocol?

Maybe, I do not know, something like:

    _ = locale.translator(TEXTUAL_DOMAIN)

after the overall doc string for the module, for each module? For all modules being part of the Python distribution, it would be:

    _ = locale.translator("python")

Of course, if a module does not need to translate any string explicitly, a mere:

    locale.translator("python")

would be sufficient, in which case the `_' variable gets undisturbed, of course.
The `locale.translator' function would call `locale.Translator()' if it finds that none exist yet for the textual domain "python" and the specified language sequence (which could be specified by keyword, but defaulting to LANGUAGE in the environment, or else LANG, or none).

> I also propose a protocol: A module can announce its textual domain by
> binding _. It may choose not to bind _, or it may choose not to bind it
> to a catalog method. In either case, it does not follow the protocol,
> so anybody using that protocol may get some kind of failure.

I understand, but I think we may avoid imposing `_'. Even if we expect it to be popular, it is best not to rely on it, if we can avoid doing so.

> - Did you agree that doc strings of a module should use the same
>   domain as all other strings of the module?

Sounds good to me.

> - Did you propose that a single package, distributed as a whole,
>   should have a single textual domain?

As far as possible, yes. It seems to be the right thing to do for most things so far. This is not an absolute, of course, but we should not start with the idea that splitting is necessary. If we later discover some exceptional property or condition that makes a sound and solid justification for it, it would be worth exploring, but as far as I know (and given I've not read all my mail yet :-), none has shown up yet. If we have many tens of thousands of doc strings, it might change the balance, I do not know.

> - Do you agree that the Python core+libs is a single package?

I'm much tempted to agree, yes.

> From that, I'd conclude that you are in favour of having a single
> domain for the Python core+libs, which contains both doc strings
> and other translatable strings of Python core+libs.

Yes, of course. But we cannot blindly assume that the textual domain for any module is "python", as Python relies on the run-time importation of many scripts from various sources. A good deal of modules will have "python" to start with, but modules could be added or overridden: the textual domain of a module should be registered by that module, and retrieved whenever appropriate.

> I've given up on having message catalogs in the Python 2.0
> distribution.

Do not lose hope yet. Who knows what will happen! :-) CNRI never confessed its true reasons, but now, we can tell it. If they made all that legalese noise and stuff, that was only a convoluted way to buy us more time for completing internationalisation specifications. :-)

> What *is* urgent is to give the catalog to the translators.

This, I deeply understand! The big work is mainly done by translators, and PO files are re-usable even when the API changes or fluctuates, or is postponed. So, the translation effort is usually best invested. But it gets frustrating for translators, at times. I remember the long years it took before `make' translations could start to work, for example. `bison' was not immediate either. And even now, `diffutils' and `bash' are not settled, while translations for those have existed for years. I thought that the `tar'/`cpio' saga was over, but it seems it has to be restarted, for reasons some of you might know :-). If we can get Python itself to be internationalised within a year, say, it would be good to publish its POT file now. But Python may also be seen as a package among others. Python could offer internationalisation methods for Python scripts, without being immediately internationalised itself.
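None of this machinery exists yet; purely as a sketch of the bookkeeping being proposed, with the module made explicit rather than guessed from the caller, and with load_catalog standing in for whatever would actually open the MO file:

    _domains = {}     # module name -> textual domain
    _catalogs = {}    # textual domain -> catalog object

    def translator(domain, module=None):
        # A module registers its domain as a side effect of asking for its
        # own translating function; help() can later look the domain up by
        # module name when translating doc strings.
        if module is not None:
            _domains[module] = domain
        if not _catalogs.has_key(domain):
            _catalogs[domain] = load_catalog(domain)   # hypothetical loader
        return _catalogs[domain].gettext

    def domain_of(module_name):
        # What a help() function would consult, defaulting to the domain
        # of the core distribution.
        return _domains.get(module_name, "python")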
-- François Pinard http://www.iro.umontreal.ca/~pinard

From pinard@iro.umontreal.ca Mon Sep 4 21:37:11 2000
From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=)
Date: 04 Sep 2000 16:37:11 -0400
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 19:52:29 +0200 (MET DST)"
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <200009041356.PAA01550@pandora.informatik.hu-berlin.de> <200009041752.TAA15488@pandora.informatik.hu-berlin.de>
Message-ID:

[Martin von Loewis]
> Even in the sv:de example, there is still a chance that neither the
> Swedish nor the German catalog has a translation, so the user would get
> three languages on her screen. I don't know anybody who'd prefer that
> over just falling back to English.

It is precisely because it was asked for that we did it. The idea did not come from us, but from users. I only know English and French, so this would not be useful to me. I guess most Americans know only one language, so their needs are even simpler than mine! But I gather that in Europe, many people have an extended culture, making me jealous (:-), and it is not uncommon for them to be comfortable with many languages. So, in a word, this specification for fall-backs is a service for the most cultured of our users. Let's admire them, and consider that they deserve it? :-)

-- François Pinard http://www.iro.umontreal.ca/~pinard

From loewis@informatik.hu-berlin.de Mon Sep 4 22:00:13 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 23:00:13 +0200 (MET DST)
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: <39B3FA5B.A8A4BF19@lemburg.com> (mal@lemburg.com)
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <14758.53732.459528.857102@anthem.concentric.net> <200009041400.QAA02532@pandora.informatik.hu-berlin.de> <200009041801.UAA15838@pandora.informatik.hu-berlin.de> <39B3EC47.D54483D2@lemburg.com> <200009041848.UAA17574@pandora.informatik.hu-berlin.de> <39B3FA5B.A8A4BF19@lemburg.com>
Message-ID: <200009042100.XAA22248@pandora.informatik.hu-berlin.de>

> > >     def _(s):
> > >         return unicode(s, "utf-8")

> That was just an example of how you could add the decoding
> functionality to the _ function.
>
> You would of course also add a gettext.gettext call
> somewhere in there which translates the string first
> (possibly recoding it to some other encoding for the
> table lookup first).

So it would be

    def _(s):
        return gettext.gettext(unicode(s, "utf-8"))

then??? There is no reason to do such a thing. First, you take a good UTF-8 string, transform it into a Unicode object; then gettext must encode the Unicode object into some byte string (possibly using UTF-8), as the msgids are stored as bytes on the disk (i.e. using some encoding). If you put UTF-8 in your source as msgid, you can *directly* invoke gettext, without needing to create a temporary Unicode object first. Even if there is some pragma utf-8 some day, it would still be more straightforward to write _("") than _(u"") as gettext would need some clue what byte encoding it needs to use, whereas the byte encoding is obvious in the first case.
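Written out under the assumption that both the Python source and the catalog msgids are UTF-8 byte strings (the literal below is just an example), the two call paths being contrasted compare like this:

    import gettext

    # Roundabout: decode the source string to a Unicode object, only to
    # have it re-encoded into bytes again for the catalog lookup.
    msg = gettext.gettext(unicode("caf\xc3\xa9", "utf-8").encode("utf-8"))

    # Direct: the UTF-8 byte string in the source already *is* the msgid
    # as it sits on disk, so it can be looked up as-is.
    msg = gettext.gettext("caf\xc3\xa9")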
Regards, Martin

From loewis@informatik.hu-berlin.de Mon Sep 4 22:23:34 2000
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Mon, 4 Sep 2000 23:23:34 +0200 (MET DST)
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: (message from =?ISO-8859-1?Q?Fran=E7ois?= Pinard on 04 Sep 2000 16:28:29 -0400)
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de>
Message-ID: <200009042123.XAA23066@pandora.informatik.hu-berlin.de>

> > > My feeling is that we should not rely on `_'. The variable used to hold
> > > the translating function should be left at the discretion of the user.
> >
> > Well, what else do you propose?
>
> Nothing special. I suggest that we systematically use `_' in the
> documentation and examples, but that we also avoid forcing the issue in
> any way from within Python. Let's use a function in `locale', say, to
> get a Translations instance (say) given the textual domain. The language
> to use could be obtained from environment variables, as well as the search
> path for the MO file, unless such things get overridden by keyword
> arguments.

That is indeed how the gettext.py API works: given a textual domain, you get a Translations instance, considering environment variables. The question is still how *doc* strings get translated, from outside the module. I.e. the help() function needs to determine what textual domain it is supposed to use when accessing the doc string of some object. The presence of some function expecting a textual domain does no good, as help() needs to find out what the textual domain is first.

> You ask the `locale' module to return you a Translations instance for
> that module, maybe through another keyword argument stating the name of
> the module for which you need a translator. This would only work if that
> module previously registered its domain name, by asking for the creation
> of a Translations instance for itself (and without specifying the keyword
> argument naming a module, of course).

I see. I doubt that *not* specifying the module name is acceptable, though - that locale function would need to know who its caller is. I feel doing that is too hacky to be accepted for the standard library.

> > So you propose that there be some kind of protocol to be observed by a
> > module that wants to make "its" textual domain known. What is that
> > protocol?
>
> Maybe, I do not know, something like:
>
>     _ = locale.translator(TEXTUAL_DOMAIN)
>
> after the overall doc string for the module, for each module?

I believe this would rather become

    _ = locale.translator(TEXTUAL_DOMAIN, module = __name__)

for the reason mentioned above. But yes, that might work. It would invalidate (or, rather, not support) prior art for binding _, though. Traditionally, Python programs (in GNOME specifically) do

    _ = gettext.gettext

With Barry's API, you do

    gettext.install(TEXTUAL_DOMAIN)

which puts _ into __builtins__, so individual modules won't even bind _ themselves.

> CNRI never confessed its true reasons, but now, we can tell it. If they
> made all that legalese noise and stuff, that was only a convoluted way
> to buy us more time for completing internationalisation specifications.
:-) :-)

> But it gets frustrating for translators, at times. I remember the long
> years it took before `make' translations could start to work

That's why I want to get some assurance that translations will indeed be used when done. I'd like to get some agreement on procedures among all interested people here, and I'd like to get some go-ahead from BeOpen that they'll consider including it when it's done. In any case, I'll push Python distributors and packagers (RedHat, Debian, ActiveState, ...) to include available catalogs even before they get into an official distribution. As they are plain data files and don't harm functionality, it's just a matter of file size to use them or to leave them. I also hope that the help module takes off, so that there is some convenient way to access the doc string translations.

> If we can get Python itself to be internationalised within a year, say,
> it would be good to publish its POT file now. But Python may also be seen
> as a package among others. Python could offer internationalisation methods
> for Python scripts, without being immediately internationalised itself.

That is certain. I believe Python 2 will be well-equipped already, having a gettext module, and xgettext and msgfmt utilities in 100% pure Python. For a full i18n process, only a msgmerge utility and a po-mode editor (in Tk?) would be missing. Of course, on many systems, GNU equivalents of these tools will be available now.

Regards, Martin

From pinard@iro.umontreal.ca Mon Sep 4 22:26:42 2000
From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=)
Date: 04 Sep 2000 17:26:42 -0400
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 20:13:57 +0200 (MET DST)"
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200009041342.PAA29915@pandora.informatik.hu-berlin.de> <200009041813.UAA16309@pandora.informatik.hu-berlin.de>
Message-ID:

[Martin von Loewis]
> > > In Python 2, unicode strings are a separate type from byte strings.
> > > The catalog objects will have two methods, one for retrieving a byte
> > > string, as it appears in the mo file, and one for retrieving a unicode
> > > string. It is then the application developer's choice whether his
> > > application can deal with Unicode messages on output or not.
> >
> > You are merely re-stating that there is a special API for Unicode, here.
> > I got this already! :-). My question is about why it is necessary.
>
> Which part do you deem unnecessary? The part returning a byte string,
> or the part returning a Unicode string?

Any part in which one has to make a distinction between both types of strings. Let's have the translator function returning a string. It is not important to know which kind of string. Python takes care of what needs care, anyway. It should be fairly transparent to the programmer, and our API should be just as transparent. Shouldn't it?

> So you are proposing that an application cannot tell in advance what
> the return type of _ will be? In some application, writing
>     header = '\x01\x01'
>     body = _('warning')
>     message = header + body

Perfect. No problem. Python will do something proper, whatever the type of string which `body' receives...

> I think it was decided not to include the JIS something tables in the
> Python 2 distribution, because they are too large to include.

Then, working with JIS translations would require that Japanese users fetch the JIS tables from other sources.
A script written for JIS will need such tables, wherever they come from. It would be nicer if Python was offering them, but... Hmph! :-)

> In Python 2.0, developers should be aware at all times whether they
> operate on Unicode strings or on byte strings. Python will try to do the
> right thing if there is a clear right thing, and try to raise exceptions
> whenever it is not so clear what the right thing would be.

I thought that every effort was made (at least for 1.6a1 and 1.6a2) so that developers should just _not_ be aware of the type of strings. Is 2.0 different? Or did I wholly miss the issue? It would make me sad... If I missed the issue, you may dismiss many things among what I wrote, as we are then not reasoning on the same grounds. If elegance has already been lost from the start, surely, there is no need for me to persist in trying to preserve it, and I'm a mere kibitzer :-(. Tell me before I make a fool of myself... Oh! It is too late already? :-)

> > If something else is needed on output, I thought the intent was to
> > override UTF-8 as an output encoding, yet still use Unicode internally,
> > instead of any MBCS, taking advantage of all the magic Python 2.0 will
> > have in that respect.
>
> Maybe it's a terminology issue: I consider UTF-8 an MBCS (multi-byte
> character set); UTF-8 strings are byte strings, not Unicode strings.

I thought that, by using some 8-bit API instead of some Unicode API for translation matters, you were intending to handle MBCS directly, all over, instead of relying on Unicode strings.

> > Otherwise, you have to make your Python script aware of those encodings
> > a lot more, and internationalisation becomes much more intrusive in your
> > sources, while we wanted it to be as lightweight as possible.
>
> I simply want to give users a choice. If they choose "let's try
> Unicode", they have the choice. If they find it all works, well.
> Otherwise, they can go for byte strings, with a different set of
> limitations.

Shouldn't we just have confidence that Python works? I would rather see programmers just using strings and then, playing interactively, or looking at their output, have a slight and momentary astonishment, saying: "Hey, things apparently turned Unicode at some point", be satisfied by the results anyway, and not bother much more about the issue. If we put unusual exceptions aside (like "English" translation, or the Netherlands), users' experience could be that things just happen to work in ASCII when no translation is requested, and just happen to use Unicode otherwise.

> > > Also, how would the goal language determine whether Unicode is a better
> > > representation for messages than some MBCS?
>
> I did not really ask for an opinion, I asked for an algorithm:
>     def mbcs_p(parameters):
>         your code here

If we get Unicode out of the translating routine, there should not be much more needed, except maybe a final encoding of the output stream. This, I feel we did not discuss enough yet (how to connect the translation function to the output stream encoding, as transparently as possible). But once again, maybe I missed so much of the whole point about Unicode and Python, that none of my remarks hold.
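For the "final encoding of the output stream" piece, one possibility, offered only as a sketch and not something the thread has settled, is to wrap the output stream once with a codecs StreamWriter, after which Unicode translations print without further ado:

    import sys, codecs

    encoding = "latin-1"   # however the application decides this

    # codecs.lookup returns (encoder, decoder, stream_reader, stream_writer);
    # the stream_writer class encodes Unicode on its way to the file.
    stream_writer = codecs.lookup(encoding)[3]
    sys.stdout = stream_writer(sys.stdout)

    print u'f\xf6n'        # now encodes to the chosen charset on output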
-- François Pinard http://www.iro.umontreal.ca/~pinard

From pinard@iro.umontreal.ca Mon Sep 4 22:32:54 2000
From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=)
Date: 04 Sep 2000 17:32:54 -0400
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 20:48:58 +0200 (MET DST)"
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <14758.53732.459528.857102@anthem.concentric.net> <200009041400.QAA02532@pandora.informatik.hu-berlin.de> <200009041801.UAA15838@pandora.informatik.hu-berlin.de> <39B3EC47.D54483D2@lemburg.com> <200009041848.UAA17574@pandora.informatik.hu-berlin.de>
Message-ID:

[Martin von Loewis]
> > You could wrap the decoding processing into the _ function:
> >     def _(s):
> >         return unicode(s, "utf-8")
> > This would allow you not only to use translatable strings,
> > but also any unicode string encoding you like, e.g. utf-8
> > or latin-1.
>
> Maybe I'm missing something here. How does the catalog come into play
> in this definition of _?

The conversion to Unicode strings would be done from within the translating function. This one might be a bound method of a class instance knowing a few things besides the textual domain. In particular, the instance would know the encoding to use from the PO file header, and so, the translating function should be able to do the proper conversion, transparently.

-- François Pinard http://www.iro.umontreal.ca/~pinard

From pinard@iro.umontreal.ca Mon Sep 4 22:37:35 2000
From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=)
Date: 04 Sep 2000 17:37:35 -0400
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: "M.-A. Lemburg"'s message of "Mon, 04 Sep 2000 20:39:03 +0200"
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <14758.53732.459528.857102@anthem.concentric.net> <200009041400.QAA02532@pandora.informatik.hu-berlin.de> <200009041801.UAA15838@pandora.informatik.hu-berlin.de> <39B3EC47.D54483D2@lemburg.com>
Message-ID:

[M.-A. Lemburg]
> Once the "declare" statement is in place you should also be able to write:
>     declare encoding = "utf-8"
>     ... u"utf-8 encoded string" ...
> in Python source code.

I'm not aware of that "declare" statement (or declaration?), but it sounds like it addresses a need. But if it exists as stated above, I predict for myself that I'll often forget the `u' prefix. :-).

Is that what you meant when saying that the programmer will have to be aware all the time whether they are using Unicode strings? It looks like it.

The POT extractors will have to be modified to know such conventions.

-- François Pinard http://www.iro.umontreal.ca/~pinard

From mal@lemburg.com Mon Sep 4 22:56:50 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 04 Sep 2000 23:56:50 +0200
Subject: [I18n-sig] Re: gettext in the standard library
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <14758.53732.459528.857102@anthem.concentric.net> <200009041400.QAA02532@pandora.informatik.hu-berlin.de> <200009041801.UAA15838@pandora.informatik.hu-berlin.de> <39B3EC47.D54483D2@lemburg.com>
Message-ID: <39B41AA2.A8454326@lemburg.com>

François Pinard wrote:
>
> [M.-A.
Lemburg]
>
> > Once the "declare" statement is in place you should also be able to write:
> >     declare encoding = "utf-8"
> >     ... u"utf-8 encoded string" ...
> > in Python source code.
>
> I'm not aware of that "declare" statement (or declaration?), but it sounds
> like it addresses a need. But if it exists as stated above, I predict for
> myself that I'll often forget the `u' prefix. :-).
>
> Is that what you meant when saying that the programmer will have to be
> aware all the time whether they are using Unicode strings? It looks like it.
>
> The POT extractors will have to be modified to know such conventions.

The "declare" statement will be a PEP for 2.1. Until then you'll have to stick to the _ function trick I posted to Martin. Note that you will still have to use the "u" string modifier to have the compiler trigger the conversion. There will probably also be a similar recoder for 8-bit string literals, but this will only work provided that the default encoding is set to something a little more capable than ASCII, e.g. utf-8 or latin-1.

-- Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From pinard@iro.umontreal.ca Mon Sep 4 23:32:59 2000
From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=)
Date: 04 Sep 2000 18:32:59 -0400
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: Martin von Loewis's message of "Mon, 4 Sep 2000 23:23:34 +0200 (MET DST)"
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <200009042123.XAA23066@pandora.informatik.hu-berlin.de>
Message-ID:

> [François Pinard]
> > You ask the `locale' module to return you a Translations instance for
> > that module, maybe through another keyword argument stating the name of
> > the module for which you need a translator. This would only work if that
> > module previously registered its domain name, by asking for the creation
> > of a Translations instance for itself (and without specifying the keyword
> > argument naming a module, of course).

[Martin von Loewis]
> I doubt that *not* specifying the module name is acceptable,
> though - that locale function would need to know who its caller is.
> I feel doing that is too hacky to be accepted for the standard library.
> [...] With Barry's API, you do
>     gettext.install(TEXTUAL_DOMAIN)
> which puts _ into __builtins__, so individual modules won't even bind
> _ themselves.

Hackery for hackery, I would prefer to see the function that creates the translating function seek out the calling module itself, as this would be really useful. As for the `gettext.install' function, it looks awkward. This would be the only case I know of, in the Python library, where a library function hacks a variable in the local name space. I do not doubt that it is clever, but cleverness alone does not make it attractive enough to look acceptable. I would suggest that we go without it. There is no need to have two ways of doing the same thing, with `gettext.install' being the questionable one.

> That is certain. I believe Python 2 will be well-equipped already,
> having a gettext module,

Yet, `gettext' is not an ideal name.
We should avoid using it, and avoid sticking too closely to the `gettext' API.

> and xgettext and msgfmt utilities in 100% pure Python.

Barry wrote `pygettext.py', but I'm not aware of any `msgfmt' program. The double hashing algorithm would have to be known for it to exist, and it would then not be a legalistic problem for the MO file reader.

> For a full i18n process, only a msgmerge utility and a po-mode editor
> (in Tk?) would be missing.

I'm starving to find some time for looking at Pango, but from the little I read about it, it looks especially promising as a basis for a rewritten PO mode.

-- François Pinard http://www.iro.umontreal.ca/~pinard

From martin@loewis.home.cs.tu-berlin.de Mon Sep 4 23:31:25 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Tue, 5 Sep 2000 00:31:25 +0200
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: (message from =?ISO-8859-1?Q?Fran=E7ois?= Pinard on 04 Sep 2000 17:26:42 -0400)
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200009041342.PAA29915@pandora.informatik.hu-berlin.de> <200009041813.UAA16309@pandora.informatik.hu-berlin.de>
Message-ID: <200009042231.AAA00904@loewis.home.cs.tu-berlin.de>

> Any part in which one has to make a distinction between both types of
> strings. Let's have the translator function returning a string.

In the specific implementation that is in Python 2.0, which kind of string should it return? It has to make a choice; just saying "I don't care" is a bad basis for an algorithm.

> It is not important to know which kind of string. Python takes care
> of what needs care, anyway.

No, it doesn't. It will in some cases, but won't in others.

> It should be fairly transparent to the programmer, and our API
> should be just as transparent. Shouldn't it?

It should, but I feel it isn't.

> >     header = '\x01\x01'
> >     body = _('warning')
> >     message = header + body
>
> Perfect. No problem. Python will do something proper, whatever the type
> of string which `body' receives...

>>> header = '\xFF\x01'
>>> body = u'warning'
>>> message = header + body
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII decoding error: ordinal not in range(128)

Is that proper? Is it what the user expected? If not, how should the user modify her code so it does what she wanted?

> I thought that every effort was made (at least for 1.6a1 and 1.6a2) so
> that developers should just _not_ be aware of the type of strings. Is 2.0
> different?

No, 2.0 is just the same as 1.6 in that area. I suggest you play around with the Unicode type somewhat before recommending that API functions should blindly return it...

> If I missed the issue, you may dismiss many things among what I wrote,
> as we are then not reasoning on the same grounds.

I don't know whether there is an issue. There are a number of cases where mixing byte strings and Unicode strings will cause runtime errors; it is not (and IMO shouldn't be) totally transparent.

> Shouldn't we just have confidence that Python works?

Well, I think I know how it works, and I believe that developers need to be fully aware of Unicode vs byte strings. They can still employ elegance where available, but I promise that handing out randomly either byte or Unicode strings will result in complaints.

> If we get Unicode out of the translating routine, there should not be much
> more needed, except maybe a final encoding of the output stream.
This,
> I feel we did not discuss enough yet (how to connect the translation
> function to the output stream encoding, as transparently as possible).

Indeed, this is the crucial issue. Unfortunately, we don't know how users would emit the messages. I know that passing them to Tkinter works well for Unicode strings, and I know passing byte strings to stdout works well. Other combinations don't work as well:

mira% echo $LANG
de_DE.ISO-8859-1
mira% python
Python 2.0b1 (#31, Aug 31 2000, 23:36:28) [GCC 2.95.2 19991024 (release)] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
Copyright 1995-2000 Corporation for National Research Initiatives (CNRI)
>>> unicode('fön','latin-1')
u'f\366n'
>>> print unicode('fön','latin-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)

So I'd rather not return a Unicode string representing an error message from gettext: the user expecting an error message may be surprised by the totally unrelated UnicodeError.

Regards, Martin

From martin@loewis.home.cs.tu-berlin.de Mon Sep 4 23:48:40 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Tue, 5 Sep 2000 00:48:40 +0200
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: (message from =?ISO-8859-1?Q?Fran=E7ois?= Pinard on 04 Sep 2000 18:32:59 -0400)
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <200009042123.XAA23066@pandora.informatik.hu-berlin.de>
Message-ID: <200009042248.AAA01054@loewis.home.cs.tu-berlin.de>

> Barry wrote `pygettext.py', but I'm not aware of any `msgfmt' program.

I'm aware of one, as I wrote it :-) See Tools/i18n/msgfmt.py in the Python CVS, or any upcoming 2.0b1 snapshot.

> The double hashing algorithm would have to be known for it to exist,
> and it would then not be a legalistic problem for the MO file reader.

This implementation of msgfmt does not generate the hash table, which, according to the GNU gettext manual, is a conforming implementation.

Regards, Martin

From pinard@iro.umontreal.ca Tue Sep 5 00:59:32 2000
From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=)
Date: 04 Sep 2000 19:59:32 -0400
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: "Martin v. Loewis"'s message of "Tue, 5 Sep 2000 00:48:40 +0200"
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <200009042123.XAA23066@pandora.informatik.hu-berlin.de> <200009042248.AAA01054@loewis.home.cs.tu-berlin.de>
Message-ID:

[Martin v. Loewis]
> > Barry wrote `pygettext.py', but I'm not aware of any `msgfmt' program.
>
> I'm aware of one, as I wrote it :-) See Tools/i18n/msgfmt.py in the
> Python CVS, or any upcoming 2.0b1 snapshot.

Thanks.

> > The double hashing algorithm would have to be known for it to exist,
> > and it would then not be a legalistic problem for the MO file reader.
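For orientation, the MO layout under discussion is small enough to sketch a reader for; like msgfmt.py, this sketch ignores the hash table entirely, and it assumes a well-formed file (error handling omitted):

    import struct

    def read_mo(filename):
        data = open(filename, "rb").read()
        # Header: magic, revision, number of strings N, offset of the
        # msgid table, offset of the msgstr table (hash fields skipped).
        if struct.unpack("<I", data[:4])[0] == 0x950412deL:
            order = "<"            # little-endian file
        else:
            order = ">"            # big-endian file (magic reads reversed)
        n, idoff, stroff = struct.unpack(order + "3I", data[8:20])
        catalog = {}
        for i in range(n):
            # Each table entry is a (length, offset) pair of 4-byte ints.
            ilen, ioff = struct.unpack(order + "2I",
                                       data[idoff + 8*i : idoff + 8*i + 8])
            slen, soff = struct.unpack(order + "2I",
                                       data[stroff + 8*i : stroff + 8*i + 8])
            catalog[data[ioff : ioff + ilen]] = data[soff : soff + slen]
        return catalog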
> This implementation of msgfmt does not generate the hash table, which,
> according to the GNU gettext manual, is a conforming implementation.

I wrote most of that manual, and I do not remember that :-). But it was quite a while ago, and we discussed _so_ many things at the time...

-- François Pinard http://www.iro.umontreal.ca/~pinard

From pinard@iro.umontreal.ca Tue Sep 5 01:44:12 2000
From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=)
Date: 04 Sep 2000 20:44:12 -0400
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: "Martin v. Loewis"'s message of "Tue, 5 Sep 2000 00:31:25 +0200"
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200009041342.PAA29915@pandora.informatik.hu-berlin.de> <200009041813.UAA16309@pandora.informatik.hu-berlin.de> <200009042231.AAA00904@loewis.home.cs.tu-berlin.de>
Message-ID:

[Martin v. Loewis]
> > Python takes care of what needs care, anyway.
>
> No, it doesn't. It will in some cases, but won't in others.
>
> > It should be fairly transparent to the programmer, and our API
> > should be just as transparent. Shouldn't it?
>
> It should, but I feel it isn't.

OK. My good prejudice for Unicode support in Python was a bit exaggerated, then.

> >>> header = '\xFF\x01'
> >>> body = u'warning'
> >>> message = header + body
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeError: ASCII decoding error: ordinal not in range(128)
> Is that proper?

Sounds proper to me.

> Is it what the user expected? If not, how should the user modify her
> code so it does what she wanted?

I do not know what the user wanted, so I cannot say how to modify the code. If she wants to play with bits and bytes, rather than strings, she would have to make explicit the conversions she wants. Python cannot guess them.

> I suggest you play around with the Unicode type somewhat before
> recommending that API functions should blindly return it...

Oh, I should surely read and try a lot more before saying anything. I was invited into this discussion only recently. Having today as a deadline did not give me enough time to be as careful as I usually like to be. So, I merely tried contributing my best given the circumstances, with my limited experience and knowledge. I think it was better that I risk a few suggestions and opinions, than stay silent and regret having said nothing. I hope I have been a bit useful, somewhat, despite all the noise I made :-).

> So I'd rather not return a Unicode string representing an error message
> from gettext: the user expecting an error message may be surprised by
> the totally unrelated UnicodeError.

I would have hoped that one could merely replace STRING by _(STRING), and get a working program. If I read you correctly, you say that it has more chance to work _if_ we avoid the Unicode string route, and mimic what we dumbly do in C.

Instead of:

    _ = locale.translator(DOMAIN)

could we have:

    _, _u = locale.translator(DOMAIN)

and use _(TEXT) or _u(TEXT) for the flat byte string out of the PO file, or the string converted to a Unicode string from the PO `msgstr' encoding? Or maybe:

    _, _e = locale.translator(DOMAIN)

with the above _u(TEXT) being rather written unicode(_(TEXT), _e) ? Or maybe even:

    _, _e, _u = locale.translator(DOMAIN)

But I'm not sure I like any of these things. Maybe nicer would be that `_` is the class instance itself, with a __call__ method for implementing _(TEXT).
One could then use _.charset or such to get then `msgstr' encoding, and the convenience: _.unicode(TEXT) would be equivalent to: unicode(_(TEXT), _.charset) Better ideas? I am still under the shock! :-) -- François Pinard http://www.iro.umontreal.ca/~pinard From pinard@iro.umontreal.ca Tue Sep 5 02:16:45 2000 From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=) Date: 04 Sep 2000 21:16:45 -0400 Subject: [I18n-sig] Re: Translating doc strings In-Reply-To: Martin v. Loewis martin@loewis.home.cs.tu-berlin.de's message of "Fri, 1 Sep 2000 09:17:34 +0200" Message-ID: [martin@loewis.home.cs.tu-berlin.de] > With that approach, the next question is: What is the name of the textual > domain, and how are translation managed? My proposal was "pylib"; Barry's > "docstring". Why not merely "python"? > As for management of translations, I'd like to ask the Free Translation > Project for help. As soon as we've settled the technical issues, I'd > like to submit a catalog for translation. You will be quite welcome, and have an accomplice within! :-) When you will feel that the time is proper, just write to me again. You may browse `http://www.iro.umontreal.ca/contrib/po/HTML/maintainers.html' if you want to know the questions we usually need answered, and I may open the translation domain as soon as the textual domain name is decided. -- François Pinard http://www.iro.umontreal.ca/~pinard From martin@loewis.home.cs.tu-berlin.de Tue Sep 5 07:44:44 2000 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 5 Sep 2000 08:44:44 +0200 Subject: [I18n-sig] Re: gettext in the standard library In-Reply-To: (message from =?ISO-8859-1?Q?Fran=E7ois?= Pinard on 04 Sep 2000 20:44:12 -0400) References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200009041342.PAA29915@pandora.informatik.hu-berlin.de> <200009041813.UAA16309@pandora.informatik.hu-berlin.de> <200009042231.AAA00904@loewis.home.cs.tu-berlin.de> Message-ID: <200009050644.IAA00730@loewis.home.cs.tu-berlin.de> > Instead of: > > _ = locale.translator(DOMAIN) > > could we have: > > _, _u = locale.translator(DOMAIN) Currently, you would write _ = locale.translator(DOMAIN).gettext for the first one, and cat = locale.translator(DOMAIN) _, _u = cat.gettext, cat.ugettext for the second one. However, I doubt many users would need both methods. They either trust that their output channels are unicode-safe or they don't. I'd even emagine cases where they do def _(msg): return cat.ugettext(msg).encode("utf-8") so they get UTF-8 even if the catalog uses some different encoding; that may be useful when they write to log files. Of course, in that case, they should really write logfile = codecs.open("logfilename","w",encoding="utf-8") to get a unicode-safe output channel. > Or maybe: > > _, _e = locale.translator(DOMAIN) That would be _e = cat.charset() > But I'm not sure I like any of these things. Maybe nicer would be > that `_` is the class instance itself, with a __call__ method for > implementing _(TEXT). One could then use _.charset or such to get > then `msgstr' encoding Maybe it's not nicer. __call__ is typically used to hide the fact that something is an instance object, so users can treat it as if it was a function. Now, if you say that users need to be aware that it is indeed an instance (since it exposes additional methods), they also need to understand how __call__ works for these instances. 
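[Putting the two halves together -- the catalog object with bound
methods from Martin's reply, and François's _.charset / _.unicode
convenience -- something like this could be built on the gettext module
as it stands. A sketch only: the Translator name is made up, and a real
version would take the charset from the catalog's PO header instead of
a constructor default.]

    import gettext

    class Translator:
        def __init__(self, domain, charset='iso-8859-1'):
            try:
                self._catalog = gettext.translation(domain)
            except IOError:
                # no catalog installed for this domain
                self._catalog = gettext.NullTranslations()
            self.charset = charset

        def __call__(self, message):
            # _(TEXT): the flat byte string out of the .mo file
            return self._catalog.gettext(message)

        def unicode(self, message):
            # _.unicode(TEXT) == unicode(_(TEXT), _.charset)
            return unicode(self(message), self.charset)

    _ = Translator('python')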
From martin@loewis.home.cs.tu-berlin.de  Tue Sep  5 07:51:16 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Tue, 5 Sep 2000 08:51:16 +0200
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: (message from =?ISO-8859-1?Q?Fran=E7ois?= Pinard on 04 Sep 2000 19:59:32 -0400)
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <200009042123.XAA23066@pandora.informatik.hu-berlin.de> <200009042248.AAA01054@loewis.home.cs.tu-berlin.de>
Message-ID: <200009050651.IAA00782@loewis.home.cs.tu-berlin.de>

> I wrote most of that manual, and I do not remember that :-). But it was
> quite a while ago, and we discussed _so_ many things at the time...

So I guess it's your text I'm quoting here:

    The size S of the hash table can be zero. In this case, the hash
    table itself is not contained in the MO file. Some people might
    prefer this because a precomputed hashing table takes disk space,
    and does not win *that* much speed.

I actually preferred this 'cause it's easier to implement :-)

It's interesting to notice that the description of the mo file format
needs 126 lines of English text, and that the implementation of a
generator needs only 194 lines of text (of which only 122 contain
actual Python code).

Regards,
Martin

From tdickenson@geminidataloggers.com  Tue Sep  5 12:19:42 2000
From: tdickenson@geminidataloggers.com (Toby Dickenson)
Date: Tue, 05 Sep 2000 12:19:42 +0100
Subject: [I18n-sig] ustr
In-Reply-To: <200007071244.HAA03694@cj20424-a.reston1.va.home.com>
References: <3965BBE5.D67DD838@lemburg.com> <200007071244.HAA03694@cj20424-a.reston1.va.home.com>
Message-ID: 

On Fri, 07 Jul 2000 07:44:03 -0500, Guido van Rossum wrote:

We debated a ustr function in July. Does anyone have this in hand? I
can prepare a patch if necessary.

>> Toby Dickenson wrote:
>>
>> > I'm just nearing the end of getting Zope to play well with unicode
>> > data. Most of the changes involved replacing a call to str, in
>> > situations where either a unicode or narrow string would be
>> > acceptable.
>>
>> > My best alternative is:
>>
>> >     def convert_to_something_stringlike(x):
>> >         if type(x) == type(u''):
>> >             return x
>> >         else:
>> >             return str(x)
>>
>> > This seems like a fundamental operation - would it be worth having
>> > something similar in the standard library?
>
> Marc-Andre Lemburg replied:
>
>> You mean: for Unicode return Unicode and for everything else
>> return strings ?
>>
>> It doesn't fit well with the builtins str() and unicode(). I'd
>> say, make this a userland helper.
>
> I think this would be helpful to have in the std library. Note that
> in JPython, you'd already use str() for this, and in Python 3000 this
> may also be the case. At some point in the design discussion for the
> current Unicode support we also thought that we wanted str() to do
> this (i.e. allow 8-bit and Unicode string returns), until we realized
> that there were too many places that would be very unhappy if str()
> returned a Unicode string!
>
> The problem is similar to a situation you have with numbers: sometimes
> you want a coercion that converts everything to float except it should
> leave complex numbers complex. In other words it coerces up to float
> but it never coerces down to float. Luckily you can write that as
> "x+0.0", which converts int and long to float with the same value
> while leaving complex alone.
>
> For strings there is no compact notation like "+0.0" if you want to
> convert to string or Unicode -- adding "" might work in Perl, but not
> in Python.
>
> I propose ustr(x) with the semantics given by Toby. Class support (an
> __ustr__ method, with fallbacks on __str__ and __unicode__) would also
> be handy.

Toby Dickenson
tdickenson@geminidataloggers.com

From bwarsaw@beopen.com  Tue Sep  5 20:00:27 2000
From: bwarsaw@beopen.com (Barry A. Warsaw)
Date: Tue, 5 Sep 2000 15:00:27 -0400 (EDT)
Subject: [I18n-sig] Terminology gap
References: <39AE5F20.E68BC43F@lemburg.com>
Message-ID: <14773.17099.703161.266580@anthem.concentric.net>

>>>>> "AR" == Andy Robinson writes:

  AR> I agree with MAL. "string" should refer to an interface; people
  AR> doing i18n stuff could then write their own ones in future if
  AR> needed. I cannot get at CVS this week, but I think we actually
  AR> checked in a UserString class into the standard library in order
  AR> to clearly define the interface for string-like objects.

The answer to that is yes, UserString.py is in the standard
distribution. It actually defines a UserString class, with the basic
interface, and a MutableString class which will mutate in place but
can't be used as a dictionary key.

-Barry

From bwarsaw@beopen.com  Wed Sep  6 04:36:37 2000
From: bwarsaw@beopen.com (Barry A. Warsaw)
Date: Tue, 5 Sep 2000 23:36:37 -0400 (EDT)
Subject: [I18n-sig] Re: Patch 101320: doc strings
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de>
Message-ID: <14773.48069.943223.602061@anthem.concentric.net>

Just to follow up on some ideas:

  FP> If it was systematic that `_' was assigned to, we could try to
  FP> retrieve the function stored in the `_' global variable of
  FP> `httplib', and then use it to translate any docstring from
  FP> httplib. However, it would be nicer if the constraint of using
  FP> `_' for the translating function did not exist, and if it was
  FP> rather completely left at the discretion of the programmer. If
  FP> we use `_' systematically in documentation examples we produce,
  FP> it is likely to become the popular choice, but let's avoid
  FP> mandating it.

Here's another suggestion. I'm not sure I like it but here goes
anyway.

Say we had an import hook that isn't installed by default (for Python
environments that don't care at all about i18n). If this import hook
is installed, though, it interposes a little extra functionality
whenever a module is imported for the first time.

What this hook does is import the module, then look to see if the
module has a '__domain__' attribute set. If it does, then the importer
uses that textual domain for that module's translations, locating the
.mo file using the "standard lookup algorithm". If __domain__ is not
set, then if the module's name can be determined, the import hook
tries to use that textual domain. If that can't be found, it falls
back on the textual domain "python".
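[Something like the following wrapper around the built-in import
machinery could implement that lookup chain. A sketch only: the
attribute poking and the identity fallback are assumptions, and a real
hook would have to deal with packages and with modules that define
their own _().]

    import __builtin__
    import gettext

    _real_import = __builtin__.__import__

    def _i18n_import(name, globals=None, locals=None, fromlist=None):
        module = _real_import(name, globals, locals, fromlist)
        # Try an explicit __domain__ first, then the module's own name,
        # then fall back on the catch-all "python" domain.
        domains = [getattr(module, '__domain__', None),
                   module.__name__, 'python']
        for domain in filter(None, domains):
            try:
                module._ = gettext.translation(domain).gettext
                break
            except IOError:
                pass            # no catalog for this domain; keep looking
        else:
            module._ = lambda message: message   # no catalog at all
        return module

    __builtin__.__import__ = _i18n_import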
So we can generate a big .po file containing the entirety of the core
libraries, but we can override individual modules as needed. This would
also work with 3rd party libraries, since the same import hook would
run when they are imported.

Caveats:

- Using the module's name as the textual domain may create conflicts.
  E.g. mypackage.foo.datetime and yourpackage.bar.datetime. One
  possible resolution is to first try the fully qualified name with
  period->underscore substitutions. If that isn't found, fall back to
  the rightmost module name.

- This is a lot of disk statting to do all these searches. And because
  the import hook will be written in Python, it means that i18n'd
  applications will all import much more slowly.

- I think it's still tricky to get modules to play nice, especially if
  you want to handle the situation where a Python user doesn't know
  about or care about i18n. How would you define a module's _()
  function to work in both cases? Would the import hook poke a new _()
  function into the module namespace, or perhaps delete one it finds
  there, assuming the one in builtins will still be there?

Maybe it's a dumb idea anyway.
-Barry

From bwarsaw@beopen.com  Wed Sep  6 04:43:25 2000
From: bwarsaw@beopen.com (Barry A. Warsaw)
Date: Tue, 5 Sep 2000 23:43:25 -0400 (EDT)
Subject: [I18n-sig] Re: Patch 101320: doc strings
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de>
Message-ID: <14773.48477.738019.157702@anthem.concentric.net>

>>>>> "MvL" == Martin von Loewis writes:

  MvL> Maybe my logic is somewhat flawed: - Did you agree that doc
  MvL> strings of a module should use the same domain as all other
  MvL> strings of the module?

Yes.

  MvL> - Did you propose that a single package, distributed as a
  MvL> whole, should have a single textual domain?

Yes.

  MvL> - Do you agree that the Python core+libs is a single package?

Not sure. I think you and Francois do, so I'll defer. One issue is for
3rd party modules, and for modules that migrate into the core. At the
very least, 3rd party modules will /not/ be in the "python" domain,
but if they are migrated into the core, that may change.

If I distribute a module independently, say using distutils, then I'm
going to want to mark the translatable strings, and possibly
distribute a .po file for my module. In that sense my single module is
a single package.

  MvL> I've given up on having message catalogs in the Python 2.0
  MvL> distribution. Since there is no point in having the catalog
  MvL> without any translations, this is not so urgent. What *is*
  MvL> urgent is to give the catalog to the translators.

I sent out a message about a file system layout for including the
files in the nondist tree of the CVS repository. Did you read that
message Martin? What did you think? Guido's amenable to that solution
for now.

-Barry

From bwarsaw@beopen.com  Wed Sep  6 04:47:19 2000
From: bwarsaw@beopen.com (Barry A. Warsaw)
Date: Tue, 5 Sep 2000 23:47:19 -0400 (EDT)
Subject: [I18n-sig] Re: Patch 101320: doc strings
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <200009042123.XAA23066@pandora.informatik.hu-berlin.de>
Message-ID: <14773.48711.148352.627828@anthem.concentric.net>

>>>>> "FP" == François Pinard writes:

  FP> As for the `gettext.install' function, it looks awkward. This
  FP> would be the only case I know, in the Python library, where a
  FP> library function hacks a variable in the local name space.

It doesn't. gettext.install() hacks the __builtin__ module's
namespace, which is the last namespace searched after locals and
globals. So if a module defines _(), that definition will override the
one put in __builtin__ by gettext.install().

-Barry

From bwarsaw@beopen.com  Wed Sep  6 04:53:23 2000
From: bwarsaw@beopen.com (Barry A. Warsaw)
Date: Tue, 5 Sep 2000 23:53:23 -0400 (EDT)
Subject: [I18n-sig] Re: Patch 101320: doc strings
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <200009041311.PAA27712@pandora.informatik.hu-berlin.de>
Message-ID: <14773.49075.60305.10297@anthem.concentric.net>

>>>>> "MvL" == Martin von Loewis writes:

  MvL> I don't see how this could work for doc strings of classes,
  MvL> methods and functions. Do you propose to write
  MvL> def foo():
  | _("This does the foo thing.")
  | pass
  MvL> That won't work; the parser won't recognize it as a doc string.

Martin's right. Fortunately docstrings are rarely used by the program
itself (they are mostly used by outside tools, like help()/doc() or
IDE's or interactive interpreters).

One place a docstring /is/ used by the program and needs to be
translated is for script help messages. In most of the executable
scripts I write, the file's docstring is the usage text, and I include
a function that prints the global __doc__. If that first string in the
file is wrapped in _('') it won't be a docstring. If it isn't wrapped,
it won't be translated. Two solutions: either the extractor needs to
be smarter (and xpot currently is, but pygettext isn't), or you can
hack around it like so:

    #! /usr/bin/env python
    __doc__ = _("blech, my module doc string")

A second place I've used class docstrings inside a program is to write
the error messages for exception classes as the class's docstring.
This can be done in other ways, but also either solution above would
work.

-Barry

From bwarsaw@beopen.com  Wed Sep  6 04:57:48 2000
From: bwarsaw@beopen.com (Barry A. Warsaw)
Date: Tue, 5 Sep 2000 23:57:48 -0400 (EDT)
Subject: [I18n-sig] Re: Patch 101320: doc strings
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <200009041329.PAA28928@pandora.informatik.hu-berlin.de>
Message-ID: <14773.49340.983150.382453@anthem.concentric.net>

>>>>> "MvL" == Martin von Loewis writes:

  MvL> If you load the hash tables, does this give enough information
  MvL> so that you can use two seek(2) calls only; on average? If so,
  MvL> it would probably be good if there was a) documentation for the
  MvL> hash table format, and/or b) an implementation of it in Python.

Documentation, please!
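[The header, at least, is easy to pick apart in a few lines. This
sketch follows the field layout described in the GNU gettext manual --
it is not the msgfmt.py or gettext.py code. Note that the hash table
size, field six, may legitimately be zero, as quoted earlier in the
thread.]

    import struct

    def mo_header(filename):
        # First 28 bytes: magic, file format revision, number of
        # strings, offset of the original-string table, offset of the
        # translation table, size S of the hash table (may be zero),
        # and offset of the hash table.
        data = open(filename, 'rb').read(28)
        magic = struct.unpack('<I', data[:4])[0]
        if magic == 0x950412deL:
            fmt = '<7I'            # little-endian MO file
        elif magic == 0xde120495L:
            fmt = '>7I'            # big-endian MO file
        else:
            raise ValueError('not a GNU .mo file')
        return struct.unpack(fmt, data)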
If so, it would be probably good if there was a) MvL> documentation for the hash table format, and/or b) an MvL> implementation of it in Python. Documentation, please! MvL> I'm certain it will take some time to get translations back, MvL> so if we want to have something in the next release (after MvL> 2.0), we should start today. I'd still like to investigate using distutils as the standard way to distribute the .mo files. -Barry From bwarsaw@beopen.com Wed Sep 6 05:03:03 2000 From: bwarsaw@beopen.com (Barry A. Warsaw) Date: Wed, 6 Sep 2000 00:03:03 -0400 (EDT) Subject: [I18n-sig] Re: Translating doc strings References: Message-ID: <14773.49655.533135.622916@anthem.concentric.net> >>>>> "FP" == writes: >> With that approach, the next question is: What is the name of >> the textual domain, and how are translation managed? My >> proposal was "pylib"; Barry's "docstring". FP> Why not merely "python"? I like it. If we are to go with a single translation file for the entire library (and I think I've now agreed with you on that :), then "python" is better as the textual domain. -Barry From pinard@iro.umontreal.ca Wed Sep 6 05:19:40 2000 From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=) Date: 06 Sep 2000 00:19:40 -0400 Subject: [I18n-sig] Re: Patch 101320: doc strings In-Reply-To: bwarsaw@beopen.com's message of "Tue, 5 Sep 2000 23:47:19 -0400 (EDT)" References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <200009042123.XAA23066@pandora.informatik.hu-berlin.de> <14773.48711.148352.627828@anthem.concentric.net> Message-ID: [Barry A. Warsaw] > >>>>> "FP" == ISO writes: > FP> As for `gettext.install' function, it looks awkward. This > FP> would be the only case I know, in the Python library, where a > FP> library function hacks a variable in the local name space. > It doesn't. gettext.install() hacks the __builtin__ module's > namespace, which is the last namespace search after locals and > globals. So if a module defines _(), that definition will override > the one put in __builtin__ by gettext.install(). Bizarrier and bizarrier! :-) What is the purpose of installing a definition of _() just meant to be overriden? It should not make sense for any module to use _() without defining it, as this is the way to associate that module to a textual domain. Each module ought make this association separately. -- François Pinard http://www.iro.umontreal.ca/~pinard From bwarsaw@beopen.com Wed Sep 6 05:34:31 2000 From: bwarsaw@beopen.com (Barry A. 
Warsaw) Date: Wed, 6 Sep 2000 00:34:31 -0400 (EDT) Subject: [I18n-sig] Re: Patch 101320: doc strings References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <200009042123.XAA23066@pandora.informatik.hu-berlin.de> <14773.48711.148352.627828@anthem.concentric.net> Message-ID: <14773.51543.773354.942958@anthem.concentric.net> >>>>> "I" == ISO writes: FP> What is the purpose of installing a definition of _() just FP> meant to be overriden? It should not make sense for any FP> module to use _() without defining it, as this is the way to FP> associate that module to a textual domain. Each module ought FP> make this association separately. Agreed, for modules. The documentation even recommends that modules never install(). gettext.install() is for application that have their own global text domains. You don't want to have to define _() in every file in the application. -Barry From just@letterror.com Wed Sep 6 07:57:37 2000 From: just@letterror.com (Just van Rossum) Date: Wed, 6 Sep 2000 07:57:37 +0100 Subject: [I18n-sig] Re: Patch 101320: doc strings In-Reply-To: <14773.51543.773354.942958@anthem.concentric.net> References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <200009042123.XAA23066@pandora.informatik.hu-berlin.de> <14773.48711.148352.627828@anthem.concentric.net> Message-ID: At 12:34 AM -0400 06-09-2000, Barry A. Warsaw wrote: >>>>>> "I" == ISO writes: > > FP> What is the purpose of installing a definition of _() just > FP> meant to be overriden? It should not make sense for any > FP> module to use _() without defining it, as this is the way to > FP> associate that module to a textual domain. Each module ought > FP> make this association separately. > >Agreed, for modules. The documentation even recommends that modules >never install(). > >gettext.install() is for application that have their own global text >domains. You don't want to have to define _() in every file in the >application. If such an application also uses exec in combination with compile(src, "", "single") (ie. wants to offer an interactive Python window), it's pretty much screwed, as this also uses __builtins__._... Just From martin@loewis.home.cs.tu-berlin.de Wed Sep 6 07:29:58 2000 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
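[Just's point is easy to demonstrate: code compiled in "single" mode
stores each expression result in __builtin__._, on top of whatever
install() put there. A contrived sketch -- the direct assignment below
stands in for gettext.install(), so it runs without any catalog
installed.]

    import __builtin__
    import gettext

    # roughly what gettext.install() does, minus the catalog lookup:
    __builtin__._ = gettext.NullTranslations().gettext
    print callable(__builtin__._)     # 1: _ is the translation method

    # what an interactive window does with each typed expression
    # (this also echoes the value, 42, to stdout):
    exec compile('40 + 2', '<input>', 'single')

    print callable(__builtin__._)     # 0: _ has been rebound to 42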
From martin@loewis.home.cs.tu-berlin.de  Wed Sep  6 07:29:58 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Wed, 6 Sep 2000 08:29:58 +0200
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: <14773.48477.738019.157702@anthem.concentric.net> (bwarsaw@beopen.com)
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <14773.48477.738019.157702@anthem.concentric.net>
Message-ID: <200009060629.IAA00884@loewis.home.cs.tu-berlin.de>

> Not sure. I think you and Francois do, so I'll defer. One issue is
> for 3rd party modules, and for modules that migrate into the core. At
> the very least, 3rd party modules will /not/ be in the "python"
> domain, but if they are migrated into the core, that may change.

Indeed, having a single textual domain for all extensions would not be
feasible; they certainly will have their own domain.

> I sent out a message about a file system layout for including the
> files in the nondist tree of the CVS repository. Did you read that
> message Martin?

Just to repeat the proposal here, it was

    nondist/i18n/
        po/
            docstrings.pot
            docstrings-de.po
        de/LC_MESSAGES/
            docstrings.mo

> What did you think?

I'd do (replace-regexp "docstrings" "python") now, but apart from
that: sounds good to me. I'll extract the strings from the official
2.0b1, then try to create this structure.

Regards,
Martin

From kajiyama@grad.sccs.chukyo-u.ac.jp  Wed Sep  6 12:09:41 2000
From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA)
Date: Wed, 6 Sep 2000 20:09:41 +0900
Subject: [I18n-sig] JapaneseCodecs-1.0 released
Message-ID: <200009061109.UAA02662@dhcp198.grad.sccs.chukyo-u.ac.jp>

Hi,

I released JapaneseCodecs-1.0, the latest version of my Unicode codecs
for Japanese character encodings (EUC-JP and Shift_JIS). It is
available at the following location:

    http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/python/

The whole code is refined so as to follow the proposal version 1.6,
and some possible bugs are also fixed. In addition, the codecs are
packaged using Distutils so that installation should be quite easy
(special thanks to the Distutils developers).

The character mapping tables have remained unchanged; they do not
include vendor-specific characters. Performance issues have also been
left open. These need addressing in future work.

Regards,

-- 
KAJIYAMA, Tamito

From bwarsaw@beopen.com  Wed Sep  6 14:10:35 2000
From: bwarsaw@beopen.com (Barry A. Warsaw)
Date: Wed, 6 Sep 2000 09:10:35 -0400 (EDT)
Subject: [I18n-sig] Re: Patch 101320: doc strings
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <14773.48477.738019.157702@anthem.concentric.net> <200009060629.IAA00884@loewis.home.cs.tu-berlin.de>
Message-ID: <14774.16971.243296.903895@anthem.concentric.net>

>>>>> "MvL" == Martin v Loewis writes:

  MvL> Just to repeat the proposal here, it was
  | nondist/i18n/
  |     po/
  |         docstrings.pot
  |         docstrings-de.po
  |     de/LC_MESSAGES/
  |         docstrings.mo

  >> What did you think?

  MvL> I'd do (replace-regexp "docstrings" "python") now,

Yes.

  MvL> but apart from that: sounds good to me. I'll extract the
  MvL> strings from the official 2.0b1, then try to create this
  MvL> structure.

I've added the directory structure to nondist, so please do an update.
You now have checkin privs so feel free to add the .pot, .po, and .mo
files when you have them ready. Also, could you write up a short
README file for nondist/i18n? I don't have time right now.

Thanks,
-Barry

From pinard@iro.umontreal.ca  Wed Sep  6 16:05:00 2000
From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=)
Date: 06 Sep 2000 11:05:00 -0400
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: bwarsaw@beopen.com's message of "Wed, 6 Sep 2000 00:34:31 -0400 (EDT)"
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <200009042123.XAA23066@pandora.informatik.hu-berlin.de> <14773.48711.148352.627828@anthem.concentric.net> <14773.51543.773354.942958@anthem.concentric.net>
Message-ID: 

[Barry A. Warsaw]

> gettext.install() is for applications that have their own global text
> domains. You don't want to have to define _() in every file in the
> application.

I would. It is a simple habit to define _() after the docstring, for
modules needing it, and it might also be a safer habit when you move
modules around, something which is more natural in Python than in other
languages.

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard

From pinard@iro.umontreal.ca  Wed Sep  6 16:07:31 2000
From: pinard@iro.umontreal.ca (=?ISO-8859-1?Q?Fran=E7ois_Pinard?=)
Date: 06 Sep 2000 11:07:31 -0400
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: "Martin v. Loewis"'s message of "Wed, 6 Sep 2000 08:29:58 +0200"
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <14773.48477.738019.157702@anthem.concentric.net> <200009060629.IAA00884@loewis.home.cs.tu-berlin.de>
Message-ID: 

[Martin v. Loewis]

> having a single textual domain for all extensions would not be
> feasible; they certainly will have their own domain.

You mean, for the Python distribution? What do you mean by `not
feasible' and `certainly'? I do not understand the need for splitting.
What is it?

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard

From martin@loewis.home.cs.tu-berlin.de  Wed Sep  6 19:37:46 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Wed, 6 Sep 2000 20:37:46 +0200
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: <14774.16971.243296.903895@anthem.concentric.net> (bwarsaw@beopen.com)
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <14773.48477.738019.157702@anthem.concentric.net> <200009060629.IAA00884@loewis.home.cs.tu-berlin.de> <14774.16971.243296.903895@anthem.concentric.net>
Message-ID: <200009061837.UAA00749@loewis.home.cs.tu-berlin.de>

> I've added the directory structure to nondist, so please do an
> update. You now have checkin privs so feel free to add the .pot,
> .po, and .mo files when you have them ready. Also, could you write up
> a short README file for nondist/i18n? I don't have time right now.

Sure will.

Regards,
Martin

From martin@loewis.home.cs.tu-berlin.de  Wed Sep  6 19:42:27 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Wed, 6 Sep 2000 20:42:27 +0200
Subject: [I18n-sig] Re: Patch 101320: doc strings
In-Reply-To: (message from =?ISO-8859-1?Q?Fran=E7ois?= Pinard on 06 Sep 2000 11:07:31 -0400)
References: <200008302026.WAA13599@pandora.informatik.hu-berlin.de> <14765.62346.449402.907537@anthem.concentric.net> <200008311550.RAA29820@pandora.informatik.hu-berlin.de> <14766.48956.463218.310154@anthem.concentric.net> <200009041314.PAA27902@pandora.informatik.hu-berlin.de> <200009041445.QAA06095@pandora.informatik.hu-berlin.de> <200009041732.TAA14636@pandora.informatik.hu-berlin.de> <14773.48477.738019.157702@anthem.concentric.net> <200009060629.IAA00884@loewis.home.cs.tu-berlin.de>
Message-ID: <200009061842.UAA00796@loewis.home.cs.tu-berlin.de>

> > having a single textual domain for all extensions would not be
> > feasible; they certainly will have their own domain.
>
> You mean, for the Python distribution?

No, not for the Python distribution. For extensions to Python: pyqt,
gnome-python, NumPy.

> What do you mean by `not feasible'

It is not feasible that everybody writing a Python library submits her
doc strings to the Python maintainers for inclusion into the python
textual domain.

> and `certainly'?

If anybody writing a Python library can't use the python domain, he'll
certainly create his own one.

> I do not understand the need for splitting. What is it?

It is the same reason why there isn't a single coordinated domain for
all free software.

Regards,
Martin

From kajiyama@grad.sccs.chukyo-u.ac.jp  Thu Sep  7 14:13:51 2000
From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA)
Date: Thu, 7 Sep 2000 22:13:51 +0900
Subject: [I18n-sig] sys.(set|get)_string_encoding in 1.6
Message-ID: <200009071313.WAA05858@dhcp198.grad.sccs.chukyo-u.ac.jp>

Hi,

I found that sys.(get|set)defaultencoding() defined in the Unicode
proposal version 1.6 were implemented under the different names
sys.(get|set)_string_encoding() in the 1.6 final release. Is this an
intended change? If so, why is this incompatibility introduced?

Thanks,

-- 
KAJIYAMA, Tamito

From mal@lemburg.com  Thu Sep  7 17:30:11 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 07 Sep 2000 18:30:11 +0200
Subject: [I18n-sig] sys.(set|get)_string_encoding in 1.6
References: <200009071313.WAA05858@dhcp198.grad.sccs.chukyo-u.ac.jp>
Message-ID: <39B7C293.70FD7E8A@lemburg.com>

Tamito KAJIYAMA wrote:
>
> Hi,
>
> I found that sys.(get|set)defaultencoding() defined in the
> Unicode proposal version 1.6 were implemented under the different
> names sys.(get|set)_string_encoding() in the 1.6 final release.
> Is this an intended change? If so, why is this incompatibility
> introduced?

These APIs were first introduced as an experiment in the CVS tree
under the names you find in the 1.6 release. They were meant to
provide an easy way to experiment with different default encodings.

After some discussions on python-dev the outcome was to keep the APIs
for use by site.py to set a locale dependent default encoding.

This idea was then retracted some weeks later and replaced with the
now standard ASCII default encoding which you find in both 1.6 and
2.0.

So to answer your question: the sys APIs in 1.6 are to be considered
undocumented features and should *not* be used.

I haven't followed the 1.6 release too closely and didn't even realize
that these APIs made it into the release version... things were moving
much too fast at the time and I was busy with 2.0. Sorry :-/

Python 2.0 will have the sys APIs which are documented in the
Misc/unicode.txt file:

    getdefaultencoding() -> string

        Return the current default string encoding used by the
        Unicode implementation.

    setdefaultencoding(encoding)

        Set the current default string encoding used by the Unicode
        implementation. Only available in site.py.

Also see the disabled code in site.py for details on how to reenable
the locale dependent default encodings.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:       http://www.lemburg.com/
Python Pages:   http://www.lemburg.com/python/

From kajiyama@grad.sccs.chukyo-u.ac.jp  Thu Sep  7 17:59:18 2000
From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA)
Date: Fri, 8 Sep 2000 01:59:18 +0900
Subject: [I18n-sig] sys.(set|get)_string_encoding in 1.6
In-Reply-To: <39B7C293.70FD7E8A@lemburg.com> (mal@lemburg.com)
References: <39B7C293.70FD7E8A@lemburg.com>
Message-ID: <200009071659.BAA06179@dhcp198.grad.sccs.chukyo-u.ac.jp>

"M.-A. Lemburg" writes:
|
| Tamito KAJIYAMA wrote:
| >
| > I found that sys.(get|set)defaultencoding() defined in the
| > Unicode proposal version 1.6 were implemented under the different
| > names sys.(get|set)_string_encoding() in the 1.6 final release.
| > Is this an intended change? If so, why is this incompatibility
| > introduced?
|
| These APIs were first introduced as an experiment in the CVS tree
| under the names you find in the 1.6 release. They were meant to
| provide an easy way to experiment with different default encodings.
|
| After some discussions on python-dev the outcome was to keep the
| APIs for use by site.py to set a locale dependent default encoding.
|
| This idea was then retracted some weeks later and replaced with the
| now standard ASCII default encoding which you find in both 1.6 and
| 2.0.

I see.

| So to answer your question: the sys APIs in 1.6 are to be considered
| undocumented features and should *not* be used.

Then, is there no way to set/get the default encoding in 1.6?

-- 
KAJIYAMA, Tamito

From keichwa@gmx.net  Fri Sep  8 11:08:31 2000
From: keichwa@gmx.net (Karl Eichwalder)
Date: 08 Sep 2000 12:08:31 +0200
Subject: [I18n-sig] Re: gettext in the standard library
In-Reply-To: =?iso-8859-1?q?Fran=E7ois?= Pinard's message of "04 Sep 2000 16:37:11 -0400"
References: <14749.42747.411862.940207@anthem.concentric.net> <14757.24220.225628.464982@anthem.concentric.net> <200008241935.VAA05311@pandora.informatik.hu-berlin.de> <200009041356.PAA01550@pandora.informatik.hu-berlin.de> <200009041752.TAA15488@pandora.informatik.hu-berlin.de>
Message-ID: 

> [Martin von Loewis]
> > I don't know anybody who'd prefer that
> > over just falling back to English.

Yes, there are quite a few (I'm told by native speakers -- personally,
I'm not familiar with these languages):

    br:fr_FR        Breton - French (France)
    gl:es_ES:pt_PT  Galician - Spanish (Spain) - Portuguese (Portugal)
    XX:ru           where XX stands for eastern European languages - Russian

François Pinard writes:

> But I got that in Europe, many people have an extended culture, making
> me jealous (:-), and it is not uncommon for them to be comfortable
> with many languages.

You simply have to if you wish to travel a bit ;) (unfortunately, my
active languages are rather limited).

-- 
work : ke@suse.de              | ------    ,__o
     : http://www.suse.de/~ke/ | ------  _-\_<,
home : keichwa@gmx.net         | ------ (*)/'(*)

From mal@lemburg.com  Fri Sep  8 12:56:08 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 08 Sep 2000 13:56:08 +0200
Subject: [I18n-sig] sys.(set|get)_string_encoding in 1.6
References: <39B7C293.70FD7E8A@lemburg.com> <200009071659.BAA06179@dhcp198.grad.sccs.chukyo-u.ac.jp>
Message-ID: <39B8D3D8.EF9C7738@lemburg.com>

Tamito KAJIYAMA wrote:
>
> "M.-A. Lemburg" writes:
> |
> | Tamito KAJIYAMA wrote:
> | >
> | > I found that sys.(get|set)defaultencoding() defined in the
> | > Unicode proposal version 1.6 were implemented under the different
> | > names sys.(get|set)_string_encoding() in the 1.6 final release.
> | > Is this an intended change? If so, why is this incompatibility
> | > introduced?
> |
> | These APIs were first introduced as an experiment in the CVS tree
> | under the names you find in the 1.6 release. They were meant to
> | provide an easy way to experiment with different default encodings.
> |
> | After some discussions on python-dev the outcome was to keep the
> | APIs for use by site.py to set a locale dependent default encoding.
> |
> | This idea was then retracted some weeks later and replaced with the
> | now standard ASCII default encoding which you find in both 1.6 and
> | 2.0.
>
> I see.
>
> | So to answer your question: the sys APIs in 1.6 are to be considered
> | undocumented features and should *not* be used.
>
> Then, is there no way to set/get the default encoding in 1.6?

No, there's no official way to do this. You could of course use the
undocumented APIs, but you should be careful not to create any Unicode
objects *before* setting the default in e.g. site.py. The same applies
to 2.0.

The reason is that Unicode objects cache their default encoded string
version but don't store the encoding this string uses. This could lead
to the cached version using a different encoding than the current
default encoding.

In any case I'd suggest not relying on the default encoding, but
instead using explicit calls to .encode() and unicode() to apply the
proper conversions -- this is always safe, uses less magic and is also
more portable across Python installations.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:       http://www.lemburg.com/
Python Pages:   http://www.lemburg.com/python/
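[Concretely, a two-line illustration of that advice; the encoding name
is arbitrary:]

    s = u'f\xf6n'
    data = s.encode('iso-8859-1')             # unicode -> bytes, named explicitly
    assert unicode(data, 'iso-8859-1') == s   # bytes -> unicode, same encoding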
From martin@loewis.home.cs.tu-berlin.de  Mon Sep 11 23:48:15 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Tue, 12 Sep 2000 00:48:15 +0200
Subject: [I18n-sig] Re: [4suite] Output encodings again
In-Reply-To: <39BC9737.C302C19C@fourthought.com> (message from Uche Ogbuji on Mon, 11 Sep 2000 02:26:31 -0600)
References: <871yzla17z.fsf@psyche.evansnet> <39BC9737.C302C19C@fourthought.com>
Message-ID: <200009112248.AAA00801@loewis.home.cs.tu-berlin.de>

[for i18n readers: the issue is to convert u"\u00A9\u01A9" to latin-1,
so that it comes out as "\251&#x1A9;"]

> Currently, on output to XML (and HTML), we first convert the UTF-8 that
> the DOM uses into Martin von Lowis's wchar type.

It may be time to slowly retire this type. It is still needed for 1.5
installations, but the 1.6/2.0 type has a comparable feature set yet an
interface that is here to stay; plus it offers quite some additional
features. Still, I believe it shares this problem with my type.

> So I'm rather at a loss as to how to efficiently escape such characters
> for XML output. I know I want to render them as &#???;, but every
> method I see for doing so is rather wasteful.

In principle, the approach should be to introduce new encodings. That
is, you get latin-1-xml, latin-2-xml, koi-8r-xml, utf-8-xml, and so on.
These encodings are the same as the original ones, except that they
have different error handling. This approach is possible both with my
type and with the 2.0 type - however, implementing these encodings is
quite some effort.

I'm sure you've thought of the approach to catch the exception, then
retry with a smaller string. That may not be too bad - it requires a
binary search to work efficiently. E.g.

    def latin1_xml(str):
        try:
            return str.encode("latin-1")
        except UnicodeError:
            if len(str) == 1:
                return "&#x%x;" % ord(str)
            m = len(str) / 2
            return latin1_xml(str[:m]) + latin1_xml(str[m:])

It could be implemented more efficiently if the UnicodeError told at
what offset exactly the problem occurred, or at least what character
was causing the problem, e.g.

    def latin1_xml(str):
        try:
            return str.encode("latin-1")
        except UnicodeError, e:
            m = str.find(e.bad_char)
            r = "&#x%x;" % ord(e.bad_char)
            return latin1_xml(str[:m]) + r + latin1_xml(str[m+1:])

I think such advanced error reporting could be useful; it is
questionable whether it could go into 2.0 if implemented. In any case,
it would probably be reasonable not to require a bad_char attribute in
every UnicodeError instance - perhaps UnicodeError must be further
subclassed:

    def latin1_xml(str):
        try:
            return str.encode("latin-1")
        except ConversionError, e:
            m = e.offset
            r = "&#x%x;" % ord(e.bad_char)
            return latin1_xml(str[:m]) + r + latin1_xml(str[m+1:])
        except UnicodeError:
            if len(str) == 1:
                return "&#x%x;" % ord(str)
            m = len(str) / 2
            return latin1_xml(str[:m]) + latin1_xml(str[m:])

Regards,
Martin

From mal@lemburg.com  Tue Sep 12 13:30:36 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 12 Sep 2000 14:30:36 +0200
Subject: [I18n-sig] Re: [XML-SIG] Re: [4suite] Output encodings again
References: <871yzla17z.fsf@psyche.evansnet> <39BC9737.C302C19C@fourthought.com> <200009112248.AAA00801@loewis.home.cs.tu-berlin.de>
Message-ID: <39BE21EB.842A066F@lemburg.com>

"Martin v. Loewis" wrote:
>
> [for i18n readers: the issue is to convert u"\u00A9\u01A9" to latin-1,
> so that it comes out as "\251&#x1A9;"]
>
> > Currently, on output to XML (and HTML), we first convert the UTF-8 that
> > the DOM uses into Martin von Lowis's wchar type.
>
> It may be time to slowly retire this type. It is still needed for
> 1.5 installations, but the 1.6/2.0 type has a comparable feature set
> yet an interface that is here to stay; plus it offers quite some
> additional features.
>
> Still, I believe it shares this problem with my type.
>
> > So I'm rather at a loss as to how to efficiently escape such characters
> > for XML output. I know I want to render them as &#???;, but every
> > method I see for doing so is rather wasteful.
>
> In principle, the approach should be to introduce new encodings. That
> is, you get latin-1-xml, latin-2-xml, koi-8r-xml, utf-8-xml, and so on.
>
> These encodings are the same as the original ones, except that they
> have different error handling. This approach is possible both with my
> type and with the 2.0 type - however, implementing these encodings is
> quite some effort.

It's not really all that hard to write codecs for Python 2.0.

You'll have to do two things:

1. write the codec by subclassing the base classes in codecs.py

2. write a search function which returns the needed constructors and
   functions.

You will then have to register the search function using the APIs in
codecs.py. After having done that, the codec will be accessible via
the usual 2.0 methods, e.g. .encode() and unicode().

Documentation is available in codecs.py itself, the various codecs in
the encodings/ package directory and Misc/unicode.txt. For a good
pure-Python implementation built using these techniques, have a look
at the Japanese codecs which were recently announced on the i18n-sig
list.

-- 
Marc-Andre Lemburg
________________________________________________________________________
Business:       http://www.lemburg.com/
Python Pages:   http://www.lemburg.com/python/

From martin@loewis.home.cs.tu-berlin.de  Wed Sep 13 12:11:08 2000
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Wed, 13 Sep 2000 13:11:08 +0200
Subject: [I18n-sig] Re: [XML-SIG] Re: [4suite] Output encodings again
In-Reply-To: <39BE21EB.842A066F@lemburg.com> (mal@lemburg.com)
References: <871yzla17z.fsf@psyche.evansnet> <39BC9737.C302C19C@fourthought.com> <200009112248.AAA00801@loewis.home.cs.tu-berlin.de> <39BE21EB.842A066F@lemburg.com>
Message-ID: <200009131111.NAA00929@loewis.home.cs.tu-berlin.de>

> It's not really all that hard to write codecs for Python 2.0.
>
> You'll have to do two things:
> 1. write the codec by subclassing the base classes in codecs.py
> 2. write a search function which returns the needed constructors
>    and functions.

So how would I write a codec that converts all characters to Latin-1,
and converts those out of latin-1 to &#xxx; (instead of the
replacement character)? I'd need knowledge about what characters are
in Latin-1, and I'd need to do the conversion on a
character-by-character basis, right? And I can't possibly use any of
the _codecs helper functions?

This is certainly feasible if I want it for a single character set,
but not if I want to do it wholesale for the entire set of character
sets supported by Python 2.0.

Regards,
Martin

From mal@lemburg.com  Wed Sep 13 18:57:07 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 13 Sep 2000 19:57:07 +0200
Subject: [I18n-sig] Re: [XML-SIG] Re: [4suite] Output encodings again
References: <871yzla17z.fsf@psyche.evansnet> <39BC9737.C302C19C@fourthought.com> <200009112248.AAA00801@loewis.home.cs.tu-berlin.de> <39BE21EB.842A066F@lemburg.com> <200009131111.NAA00929@loewis.home.cs.tu-berlin.de>
Message-ID: <39BFBFF3.C7BDD1F4@lemburg.com>

"Martin v. Loewis" wrote:
>
> > It's not really all that hard to write codecs for Python 2.0.
> >
> > You'll have to do two things:
> > 1. write the codec by subclassing the base classes in codecs.py
> > 2. write a search function which returns the needed constructors
> >    and functions.
>
> So how would I write a codec that converts all characters to Latin-1,
> and converts those out of latin-1 to &#xxx; (instead of the
> replacement character)? I'd need knowledge about what characters are
> in Latin-1, and I'd need to do the conversion on a
> character-by-character basis, right?

Right.

> And I can't possibly use any of the _codecs helper functions?

You could play some tricks with the character mapping codec which is
used by all code page codecs. You will achieve better performance with
a native codec written in C though.

> This is certainly feasible if I want it for a single character set,
> but not if I want to do it wholesale for the entire set of character
> sets supported by Python 2.0.

This is probably not possible since there's no way to have the codecs
use e.g. a callback function to handle error situations. But the
situation is not all that bad: most codecs rely on the character
mapping codec and you could simply implement a new version of it which
does the XML escaping instead of raising errors.

-- 
Marc-Andre Lemburg
________________________________________________________________________
Business:       http://www.lemburg.com/
Python Pages:   http://www.lemburg.com/python/
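[A sketch along the lines MAL suggests: a "latin-1-xml" codec whose
encoder falls back on XML character references instead of raising
UnicodeError. The codec name follows Martin's naming scheme, but the
class and the character-by-character loop are illustrative assumptions;
a serious version would hook the charmap machinery in C.]

    import codecs

    class Latin1XMLCodec(codecs.Codec):
        def encode(self, input, errors='strict'):
            chunks = []
            for ch in input:
                if ord(ch) < 256:
                    chunks.append(chr(ord(ch)))       # fits in Latin-1
                else:
                    chunks.append('&#x%x;' % ord(ch)) # escape the rest
            return ''.join(chunks), len(input)
        def decode(self, input, errors='strict'):
            return unicode(input, 'latin-1'), len(input)

    class _Reader(Latin1XMLCodec, codecs.StreamReader):
        pass

    class _Writer(Latin1XMLCodec, codecs.StreamWriter):
        pass

    def _search(name):
        # the search function returns the needed constructors/functions
        if name == 'latin-1-xml':
            codec = Latin1XMLCodec()
            return codec.encode, codec.decode, _Reader, _Writer
        return None

    codecs.register(_search)

    print u'\xa9\u01a9'.encode('latin-1-xml')   # -> '\xa9&#x1a9;'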
From jpsc@users.sourceforge.net  Wed Sep 27 23:18:35 2000
From: jpsc@users.sourceforge.net (JP S-C)
Date: Wed, 27 Sep 2000 15:18:35 -0700 (PDT)
Subject: [I18n-sig] Python for the Visually Impaired
Message-ID: <20000927221835.26496.qmail@web2201.mail.yahoo.com>

Dear edu-sig and i18n-sig mailing lists,

The subject of this message is somewhere in between education and
internationalization, so I am writing to you both. I run a project
named Ocularis and am interested in collaborating with developers from
both SIGs, or with the SIGs themselves.

Ocularis, in brief, is a distribution of the Linux operating system
that aims to allow the visually impaired to communicate, work, and
express themselves through computers, as well as to install and
customize their system, independent of sighted assistance. The
development of Ocularis is already underway; all software is created
by volunteers and released under the GNU Public License. More detailed
information about Ocularis is included below.

The ocularis-desktop package (currently in version 0.0.1) focuses on
providing console-based applications that serve common functions. This
package is written completely in Python, a language which I believe
has a lot of potential for creating applications for the visually
impaired. In addition, I think that Python is also an ideal language
on many fronts, especially when it comes to programming, debugging,
and maintaining code non-visually. Other than the ocularis-desktop
package, there are also several developers working on other
subprojects of Ocularis that aim to provide better access to X,
including GTK-based applications.

I would love to discuss or hear ideas from anyone about Python's many
uses for and with the visually impaired. Thank you.

--JP Schnapper-Casteras
jpsc@users.sourceforge.net

Details about Ocularis:

The computing environment and suite of applications that are the goal
of Ocularis will be free software (see "www.gnu.org" for a definition
of free software) and will be based on Linux. The basic applications
that Ocularis will possess are a word processor, calendar, calculator,
basic accounting or finance application, file manager, Internet
browser, and e-mail client. All of these programs will run smoothly on
computers consisting of commonly available hardware costing less than
$500 that can be bought at almost any local computer store. In
comparison to current adaptive technology, this is both a drastic
price drop and an increase in the availability of the required
hardware.

Ocularis was started in response to research on current adaptive
technology, which culminated in the editorial "The Potential of Open
Source for the Visually Impaired" (available at the Ocularis web site,
"http://ocularis.sourceforge.net/").

For more information, please visit the Ocularis web site,
"http://ocularis.sourceforge.net/", or contact me directly.