[I18n-sig] Re: gettext in the standard library

Martin von Loewis loewis@informatik.hu-berlin.de
Mon, 4 Sep 2000 20:13:57 +0200 (MET DST)


> > In Python 2, unicode strings are a separate type from byte strings.
> > The catalog objects will have two methods, one for retrieving a byte
> > string, as it appears in the mo file, and one for retrieving a unicode
> > string.  It is then the application developer's choice whether his
> > application can deal with Unicode messages on output or not.
> 
> You are merely re-stating that there is a special API for Unicode, here.
> I got this already! :-).  My question is about why it is necessary.

Which part do you deem unnecessary? The part returning a byte string,
or the part returning a Unicode string?
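For concreteness, the two-method catalog described above could be sketched roughly as follows; the class and method names here are invented for illustration, not a final API, and the sketch uses modern bytes/str spelling:

```python
class Catalog:
    """Illustrative sketch of a message catalog with two retrieval methods."""

    def __init__(self, messages, charset):
        # messages maps msgid -> translated byte string, as stored in the .mo file
        self._messages = messages
        self._charset = charset

    def gettext(self, msgid):
        """Return the translation as a byte string, exactly as in the .mo file."""
        return self._messages.get(msgid, msgid.encode("ascii"))

    def ugettext(self, msgid):
        """Return the translation decoded to a Unicode string."""
        raw = self._messages.get(msgid)
        if raw is None:
            return msgid
        return raw.decode(self._charset)


cat = Catalog({"warning": "Warnung".encode("latin-1")}, "latin-1")
assert cat.gettext("warning") == b"Warnung"    # byte string, as stored
assert cat.ugettext("warning") == "Warnung"    # decoded Unicode string
```

The application then picks one method consistently, so the return type of every lookup is known in advance.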

> Yes, it is described in the PO file header (the translation of the empty
> string).  The idea is to convert KOI-8 (or whatever) while retrieving
> the translation.  Most of the time, the conversion will be to Unicode.
> In some very rare cases, like for the Netherlands, ASCII is sufficient.
> This all can be done automatically, I do not see why we need two
> APIs.
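The charset lookup being described here could be sketched like this; the header text below is a made-up example, and a real .mo reader would parse the metadata more carefully:

```python
import re

# The metadata a catalog stores as the "translation" of the empty msgid.
# This particular header text is invented for illustration.
header = (
    "Project-Id-Version: demo 1.0\n"
    "Content-Type: text/plain; charset=KOI8-R\n"
    "Content-Transfer-Encoding: 8bit\n"
)

# Pull the charset out of the Content-Type line, defaulting to ASCII.
match = re.search(r"charset=([\w-]+)", header)
charset = match.group(1) if match else "ascii"
assert charset == "KOI8-R"
```

Every translated byte string retrieved from that catalog would then be decoded with this charset before being handed to the application.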

So you are proposing that an application cannot tell in advance what
the return type of _ will be? In some applications, writing

header = '\x01\x01'
body   = _('warning')
message = header + body

Will this work or not? Answer: it depends. In the Netherlands it will
work; elsewhere, it won't.
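In today's Python 3 spelling, the same hazard is explicit: a translation that needs non-ASCII characters comes back as a Unicode string, and concatenating it with a byte string fails outright. The fake lookup functions below are invented for illustration:

```python
header = b"\x01\x01"

def fake_gettext_nl(msgid):
    # Dutch translations fit in ASCII, so they stay byte strings
    return b"waarschuwing"

def fake_gettext_ru(msgid):
    # a Russian translation needs non-ASCII characters, so it comes
    # back as a Unicode string
    return "\u043f\u0440\u0435\u0434\u0443\u043f\u0440\u0435\u0436\u0434\u0435\u043d\u0438\u0435"

message = header + fake_gettext_nl("warning")  # works: bytes + bytes

try:
    message = header + fake_gettext_ru("warning")
    mixed_ok = True
except TypeError:
    mixed_ok = False  # bytes + str is rejected outright
```

So whether the program works depends on which locale the user happens to run it in.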

> I thought Python 2.0 was to come with a comprehensive set of conversion
> routines for doing such things.  If we ever find that one is missing,
> we might try to add it, shouldn't we?

I think it was decided not to include the JIS something tables in the
Python 2 distribution, because they are too large.

> > Also, how would goal language determine whether Unicode is a better
> > representation for messages than some MBCS?
> 
> Oh, no doubt that this may lead to hot debates.

I did not really ask for an opinion, I asked for an algorithm:

def mbcs_p(parameters):
  your code here

> For translation purposes, I thought Python was to produce either ASCII
> or UTF-8 rather automatically on output.  It is likely to produce a mix,
> as the original strings are written in ASCII most of the time, and do
> not all get translated.

In Python 2.0, developers should be aware at all times whether they
operate on Unicode strings or on byte strings. Python will try to do
the right thing if there is a clear right thing, and try to raise
exceptions whenever it is not so clear what the right thing would be.

Having an API that sometimes returns Unicode strings and sometimes
byte strings (depending on environment variables (!)) would be just
terrible.

> If something else is needed on output, I thought the intent was to
> override UTF-8 as an output encoding, yet still use Unicode
> internally, instead of any MBCS, taking advantage of all the magic
> Python 2.0 will have in that respect.

Maybe it's a terminology issue: I consider UTF-8 as a MBCS (multi-byte
character set); UTF-8 strings are byte strings, not Unicode strings.
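In those terms, the distinction is easy to demonstrate (a small sketch in today's Python 3 spelling):

```python
s = "h\u00e9llo"          # a Unicode string of five characters
b = s.encode("utf-8")     # its UTF-8 form: a byte string, not Unicode

assert isinstance(b, bytes) and not isinstance(b, str)
assert len(s) == 5 and len(b) == 6   # 'é' occupies two bytes in UTF-8
assert b.decode("utf-8") == s        # decoding recovers the Unicode string
```

A UTF-8 string is thus just one particular multi-byte encoding of Unicode text, on the same footing as KOI-8 or a JIS encoding.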

> Otherwise, you have to make your Python script a lot more aware of
> those encodings; internationalisation becomes much more intrusive in
> your sources, while we wanted it to be as lightweight as possible.

I simply want to give users a choice. If they decide to try Unicode,
they have that choice. If they find it all works, good. Otherwise, they
can go for byte strings, with a different set of limitations.

Regards,
Martin