[I18n-sig] Re: gettext in the standard library

Martin v. Loewis martin@loewis.home.cs.tu-berlin.de
Tue, 5 Sep 2000 00:31:25 +0200


> Any part in which one has to make a distinction between both types of
> strings.  Let's have the translator function returning a string.  

In the specific implementation that is in Python 2.0, which kind of
string should it return? It has to make a choice; just saying "I don't
care" is a bad basis for an algorithm.

> It is not important to know which kind of string.  Python takes care
> of what needs care, anyway.

No, it doesn't. It will in some cases, but won't in others.

> It should be fairly transparent to the programmer, and our API
> should be just as transparent.  Shouldn't it?

It should, but I feel it isn't.

> > header = '\x01\x01'
> > body   = _('warning')
> > message = header + body
> 
> Perfect.  No problem.  Python will do something proper, whatever the type
> of string which `body' receives...

>>> header = '\xFF\x01'
>>> body   = u'warning'
>>> message = header + body
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII decoding error: ordinal not in range(128)

Is that proper? Is it what the user expected? If not, how should the
user modify her code so it does what she wanted?
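
One possible answer is to make the conversion explicit: the programmer
states how the header bytes map to characters before concatenating.
The sketch below uses present-day Python 3 syntax (an assumption; the
session above is Python 2.0), where byte strings and text are separate
types. `join_message` is a hypothetical helper, and Latin-1 is chosen
only for illustration.

```python
def join_message(header: bytes, body: str, encoding: str = "latin-1") -> str:
    """Combine a byte-string header with a text body by decoding explicitly.

    Hypothetical helper: the caller, not the interpreter, picks the encoding.
    """
    return header.decode(encoding) + body


header = b"\xff\x01"   # byte string, as in the example above
body = "warning"       # text, as a translation function might return

# Mixing the two types directly is refused instead of guessed at.
try:
    header + body
except TypeError:
    mixed_ok = False
else:
    mixed_ok = True

# With an explicit decoding step, the concatenation is well defined.
message = join_message(header, body)
```

Either way, the programmer has to know which type each operand has,
which is the point of the example above.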

> I thought that every effort was made (at least for 1.6a1 and 1.6a2) for
> developers should just _not_ be aware of the type of strings.  Is 2.0
> different?  

No, 2.0 is just the same as 1.6 in that area. I suggest you play
around with the Unicode type somewhat before recommending that API
functions should blindly return it...

> If I missed the issue, you may dismiss many things among what I wrote,
> as we are then not reasoning on the same grounds.

I don't know whether there is an issue. There are a number of cases
where mixing byte strings and Unicode strings will cause runtime
errors; it is not (and IMO shouldn't be) totally transparent.

> Shouldn't we just have confidence that Python works?

Well, I think I know how it works, and I believe that developers need
to be fully aware of Unicode vs. byte strings. They can still employ
elegance where available, but I promise that handing out either byte
or Unicode strings at random will result in complaints.

> If we get Unicode out of the translating routine, there should not be much
> more needed, except maybe a final encoding of the output stream.  This,
> I feel we did not discuss enough yet (how to connect the translation
> function to the output stream encoding, as transparently as possible).

Indeed, this is the crucial issue. Unfortunately, we don't know how
the user would output the messages. I know that passing Unicode
strings to Tkinter works well, and I know that passing byte strings
to stdout works well. Other combinations don't work as well:

mira% echo $LANG
de_DE.ISO-8859-1
mira% python    
Python 2.0b1 (#31, Aug 31 2000, 23:36:28)  [GCC 2.95.2 19991024 (release)] on linux2
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
Copyright 1995-2000 Corporation for National Research Initiatives (CNRI)
>>> unicode('fön','latin-1')
u'f\366n'
>>> print unicode('fön','latin-1')

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)

So I'd rather not return a Unicode string representing an error
message from gettext: a user expecting an error message may be
surprised by the totally unrelated UnicodeError.
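
To illustrate the encoding step that fails in the traceback above,
here is a minimal sketch in present-day Python 3 syntax (an
assumption; the session above is Python 2.0). `encode_for_output` is a
hypothetical helper, not part of gettext, and `errors="replace"` is
only one possible policy for characters the target encoding lacks.

```python
def encode_for_output(message: str, encoding: str, errors: str = "replace") -> bytes:
    """Encode a text message for a byte-oriented output stream.

    Hypothetical helper: the caller names the stream encoding explicitly
    instead of relying on an implicit ASCII default.
    """
    return message.encode(encoding, errors)


text = "f\xf6n"  # the same string as unicode('fön', 'latin-1') above

# A strict ASCII encoding fails, much like the print statement above.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Naming the locale's encoding gives the expected bytes.
latin1_bytes = encode_for_output(text, "iso-8859-1")

# Falling back to ASCII with replacement loses the umlaut but does not raise.
ascii_bytes = encode_for_output(text, "ascii")
```

Whether replacing unencodable characters is acceptable depends on the
application; the sketch only shows that someone has to make that choice.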

Regards,
Martin