[I18n-sig] Literal strings

Fri, 2 Jun 2000 12:39:09 +0200 (MEST)

Hi Paul,

Paul Prescod :
> I am thinking about string literals. Not narrow strings in general, just
> string literals in particular. I'm not sure where we left the issue of a
> statement about the "encoding" of string literals. Here's my input.
> 
> I have a lot of code like this:
> 
> if tagName=="foo":
>       ...
> 
> I would like it to magically work with Unicode. Guido's proposal allows
> it to magically work with Unicode-encoded ASCII, but not with the full
> range of Unicode characters. I'm not entirely happy that my code will
> crash and burn the first time someone pops in a cedilla.

A cedilla (ç) is an ordinary 8-bit character in ISO-Latin-1, so this may
be a bad example.  We use such literals a lot and it hasn't broken anything.
Even with Guido's proposal, things will only break if you coerce
such a literal into Unicode without an explicit conversion.
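
To illustrate the difference between an explicit and an assumed
conversion, here is a minimal sketch.  It assumes a modern Python 3
interpreter, where a bytes object stands in for the 8-bit string of
Python 1.5.2; the variable names are only illustrative.

```python
# An 8-bit string containing a Latin-1 cedilla (byte 0xE7).
narrow = b"gar\xe7on"

# An explicit conversion that names the encoding succeeds.
wide = narrow.decode("latin-1")

# A conversion that assumes plain ASCII fails on the non-ASCII byte.
try:
    narrow.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```

The literal itself is harmless as long as it stays a narrow string; only
the conversion that guesses ASCII raises an error.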

Since my native language is German and since my English leaves a lot
to be desired (take my rants to python-dev as examples), we decided
long ago to use German as the "master language" for our company's
I18N software.  This works pretty well in Python 1.5.2.  Here is an
example of what this looks like:

        tkMessageBox.askquestion(_("Löschen bestätigen"),
                                 _("Soll %s gelöscht werden?") % object_name)

'_()' in this context is a shortcut name pointing to the
'fintl.gettext()' function.  Depending on the language environment,
this function may return the literal translated into English, French
or Spanish.  An additional tool (xgettext, now pygettext by Barry W.) is
used to extract all those literals and deliver them to professional
translators, who translate these message strings into English, French ...
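
The lookup idea behind '_()' can be sketched roughly as follows.  This
is not the actual 'fintl.gettext()' implementation; the catalog
dictionary, its English entries and the 'make_gettext' helper are
hypothetical names invented for this illustration.

```python
# Hypothetical catalog: German master strings mapped to English text.
CATALOG_EN = {
    "Löschen bestätigen": "Confirm deletion",
    "Soll %s gelöscht werden?": "Delete %s?",
}


def make_gettext(catalog):
    """Return a _() function that looks up messages in the catalog."""
    def _(message):
        # Fall back to the untranslated master string if no entry exists.
        return catalog.get(message, message)
    return _


_ = make_gettext(CATALOG_EN)
title = _("Löschen bestätigen")
question = _("Soll %s gelöscht werden?") % "report.txt"
```

With an empty catalog the same calls simply return the German master
strings, which is why the master language can be anything the
developers are comfortable with.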

Additionally, we adopted the style of using single quotes for all literals
that are normally invisible to a user of the software.  Example:

        if hasattr(target, 'disable'):
            target.disable()

> What would be the consequences of a module-level pragma that allows the
> literal strings in my module to be interpreted as *Unicode literals*
> instead of ASCII literals. I usually know that all of the literals in my
> program are raw ASCII, so even if they are interpreted as Unicode, they
> will be "compatible with" raw ASCII input. The only thing that they
> would not be compatible with is 8-bit binary goo, which they were never
> intended to be compatible with anyhow.

Hmmmm.... I don't understand what you meant by your last sentence.
Maybe my ignorance comes from the fact that I can view, edit and print
any file containing ISO-Latin-1 characters in WYSIWYG without thinking
about it, and I still don't know what kind of text editor and
keyboard/display equipment is required to work in WYSIWYG with those
Unicode characters with ord(ch) >= 256.  [I'm using Linux/X11/vim if
this matters]

> I just want to add something at the top of my file like:
> 
> #pragma IL8N
> 
> and have my literal strings act as Unicode.

There was already a long discussion about interpreter pragmas
on python-dev.  I still prefer David Scherer's brilliant idea to
(ab)use the 'global' statement at module level, if we ever introduce
pragmas into the 1.x series of Python.  Please review the discussion
(April 2000) in the python-dev archives.

> Now I could go through my code and change all of the literals to Unicode
> literals by hand, but 
> 
>  a) that's really ugly, syntactically

As always this is simply a matter of taste.  And after a while you get 
used to it.

>  b) I feel like I'll end up switching them all back when we just make
> literal strings "wide" by default

I don't believe that this will happen in the 1.x series.  It would break
just too many things, and the memory penalty is just too harsh for small
systems.

>  c) I feel like I'm being penalized for making my program
> internationalized

As long as your i18n effort doesn't involve Asian languages (for example
Chinese or Japanese), you can get away with narrow strings.  Unicode only
comes into play if you have to deal with several different languages at
the same time.

Even a Japanese translation is possible with 8-bit Python 1.5.2, as
long as you don't need to display, for example, umlauts and Japanese
characters at the same time, and as long as the Japanese translator
uses the same character set as the production platform.  On Feb 9th,
2000 Andy Robinson wrote a very good explanation of which character sets
are used in Japan.  Review this in the i18n archive if interested.

Brian Takashi Hooper was also a very helpful guy concerning Japanese.

>  d) I have a lot of code, as we all do.

If code can be modified automatically (and what you proposed can
be done with an operation only slightly more elaborate than a simple
's/"/u"/g' replacement), this is IMO no argument.
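
One way such an automatic rewrite might look is sketched below.  It
assumes a modern Python 3 with the standard 'tokenize' module (not the
1.5.2 tool set), and the function name 'add_unicode_prefix' is made up
for this example.  Using the tokenizer instead of a plain regular
expression keeps quotes inside comments and already-prefixed literals
untouched.

```python
import io
import tokenize


def add_unicode_prefix(source):
    """Return source with 'u' put in front of each unprefixed string literal."""
    lines = source.splitlines(keepends=True)
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    # Remember where every string literal without a prefix starts.
    starts = [tok.start for tok in tokens
              if tok.type == tokenize.STRING and tok.string[0] in "'\""]
    # Patch from the end so earlier column offsets stay valid.
    for row, col in reversed(starts):
        line = lines[row - 1]
        lines[row - 1] = line[:col] + "u" + line[col:]
    return "".join(lines)


converted = add_unicode_prefix('if tagName == "foo":\n    pass\n')
```

Note that this sketch also prefixes docstrings and literals that are
meant to hold binary data, so a real conversion would still need a
review pass.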

Regards, Peter