Marking translatable strings

Tim Peters tim_one at email.msn.com
Fri Sep 17 01:42:59 EDT 1999


[François Pinard, on internationalization of strings in program text]

I know less than nothing about this -- always figured that if American
English was good enough for God, it's good enough for everyone else <wink>.

So I'll just touch on some Python issues:

> ...
> [Python] has eight types of strings: ', ", ''', """, r', r", r''' and
> r"""; and I thought that maybe we could just discipline ourselves to
> give more meaning to all these differences ...

Indeed, the number of ways to spell strings in Python is quite unPythonic!
The language generally avoids offering gratuitous choices.  Too late now,
though -- all ways of spelling strings are already used in all conceivable
contexts, and no rational discipline can be imposed a posteriori.

For the purpose of finding Python strings, note that under the covers it was
deliberately designed so that there's really just one basic
string-recognizing state machine.  ' vs ", and ''' vs """, make no
difference at all to lexing, and r- vs not-r makes no difference either.
The only difference in lexing is that the 4 single-quoted varieties (', r',
", r") refuse to span lines except via backslash+newline, while the 4
triple-quoted varieties don't treat newline specially.
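A quick demonstration of the equivalences (any Python):

```python
# Quote style and the r-prefix change how the literal is written,
# not what string object you get:
a = 'spam'
b = "spam"
c = '''spam'''
d = """spam"""
assert a == b == c == d == 'spam'

# r- only turns off backslash-escape processing:
assert r'a\nb' == 'a\\nb'
assert len(r'a\nb') == 4

# Single-quoted forms need backslash+newline to span lines;
# triple-quoted forms take embedded newlines literally:
two_lines = """line one
line two"""
assert two_lines == 'line one\nline two'
```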

> ...
> Surely, since doc strings use """ exclusively, ...

Really sorry, but you can't rely on that.  I've seen all 8 flavors of
strings used as docstrings -- what makes a string a docstring is not how
it's spelled, but where it appears (if the first executable statement in a
module, class or def is a string, that's a docstring; and nothing else is).
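For instance (position is everything):

```python
# What makes a docstring is position, not spelling: the first
# statement of a module, class or def, if it's a string literal.
def f():
    'single quotes work fine as a docstring'
    return 1

def g():
    s = "not a docstring -- it's bound to a name, not sitting first"
    return s

assert f.__doc__ == 'single quotes work fine as a docstring'
assert g.__doc__ is None
```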

> ...
> The most comfortable (the less intrusive) way would be to call:
>
>         _(TEXT)
>
> to get the translation of text.

I agree with everyone else <wink> that the special meaning of "_" in
interactive mode is unlikely to create a problem for you.

> ...
> Experience shows that horizontal space is often a valuable resource while
> writing code, so we should try to be economical there for overall
> legibility, much more than typability.

I like the way you think <wink>.  Really, Python takes pride in its
legibility, so worrying about that is very welcome here.  Note one odd
thing:  since Python doesn't support assignment expressions,
horizontally-challenged 8-line "if" statements simply don't occur.  So
there's less horizontal pressure here than you may be used to.  OTOH, Python
doesn't support augmented assignments (+= & friends) yet either, and that
adds horizontal pressure (but less, I think, than the lack of assignment
expressions relieves).

> ...
> and we could use ''"TEXT" or ""'TEXT' for strings needing to be marked
> only, and translated only later.

I don't understand the distinction here, so just noting that there's nothing
in the compiled code that can distinguish an instance of

    ''"TEXT"

in the source from an instance of

    "TEXT"

or even of

    ''   "TEXT"""""""r''''''

Catenation of adjacent string literals is a compile-time "optimization", or
more accurately slavish aping of C's rules -- as if someone thought C were a
pleasant string language <wink>.
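You can verify the glue-at-compile-time behavior directly:

```python
# Adjacent string literals are concatenated by the compiler, so all
# of these produce the very same constant:
a = '' "TEXT"
b = "TEXT"
c = ''   "TE" 'XT' r''
assert a == b == c == 'TEXT'
```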

> ...
> One reason to modify the interpreter would be for speed, maybe,
> by ensuring that a given string is translated only once even in a loop,
> like for what _() currently does in C when compiled with GNU C.  It might
> depend on how exactly strings are intern'ed within Python, which I do not
> know.

You can force interning by calling the intern() builtin.  By default, the
only strings that get interned are those literals that "look like" they
*may* be attribute names.  For example, "Pinard" is interned by magic, but
"F. Pinard" is not.

In any case, automatic interning is purely an internal optimization, and the
rules can -- and do -- change from release to release.  So don't even think
about relying on their specific incarnation today.
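Explicit interning, by contrast, is part of the language's public interface
(note the builtin intern() of this era later moved to sys.intern in
Python 3):

```python
import sys

# Interning guarantees one shared object per distinct value, so
# identity tests become safe even for computed strings.
# (intern() was a builtin in 1999-era Python; it's sys.intern now.)
a = sys.intern('F. ' + 'Pinard')
b = sys.intern(''.join(['F. Pin', 'ard']))
assert a is b
```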

Within a code block (def, class or module body), but not across code blocks,
all string literals are accumulated and stored uniquely in a vector of
constants attached to the code object.  So, e.g., within a single def, 112
instances of "F. Pinard" will all resolve to a single string object.  If you
use a runtime translation function, it can exploit this via a cache, as in

def _(s, cache={}):
    # the mutable default argument persists across calls, acting as the cache
    translated = cache.get(s, None)
    if translated is None:
        translated = cache[s] = translateit(s)
    return translated
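The per-code-block sharing is easy to see (modern attribute spelling shown;
in 1999 it was func.func_code):

```python
def sample():
    a = "F. Pinard"
    b = "F. Pinard"
    return a, b

# Both literals collapse to a single entry in the constants vector...
assert sample.__code__.co_consts.count("F. Pinard") == 1

# ...so both names end up bound to the very same object:
x, y = sample()
assert x is y
```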

Caching on the address of s (id(s)) would be quicker, but while that's safe
for string literals, it can create false hits if _ is ever applied to
computed strings (two instances of which may have the same address if their
lifetimes are disjoint).  Explicit interning can worm around that, but
interned strings are effectively immortal.

So there are puzzles here.  It's also possible to write a little bytecode
optimizer to traverse the code object's constant vector and translate the
strings it finds once and for all; the puzzle there is which strings to
leave alone.
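A minimal sketch of such an optimizer, using the modern (Python 3.8+)
CodeType.replace() and a hypothetical translateit(), might look like:

```python
def translateit(s):
    # hypothetical lookup standing in for a real translation catalog
    return {'hello': 'bonjour'}.get(s, s)

def greet():
    return 'hello'

# Rewrite the function's constants tuple once, ahead of time, instead
# of translating on every call:
patched = tuple(
    translateit(c) if isinstance(c, str) else c
    for c in greet.__code__.co_consts
)
greet.__code__ = greet.__code__.replace(co_consts=patched)

assert greet() == 'bonjour'
```

Note that this sweeps up every string constant, docstrings included --
exactly the "which strings to leave alone" puzzle.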

> ...
> it would be nice if internationalization could be achieved
> portably to all flavours of Python, just doing nothing on older
> implementations, anyway, or doing it slower.  That is, that the string
> syntax should not be modified.

Python favors explicit approaches.  Another advantage is that a fully
explicit scheme should work unchanged for JPython too (a variant of Python
that compiles directly to Java bytecode).

> ...
> Problems arise when languages have purely dynamic scoping, because textual
> domains are lexically scoped by intent.  I did internationalize a rather
> big Scheme program already, but it was all within a single textual domain,
> so I did not hit any wall there.  But I would guess that Scheme is much
> simpler than Emacs LISP, in actual practice -- I'm not sure.

Scheme, unlike elisp, is lexically scoped.  So, yes, Scheme is much
friendlier than elisp in this respect.  elisp also has interesting
complications due to, e.g., buffer-local variables vs global variables of
the same name, which is a novel kind of dynamic scoping (novel because it
can't be resolved either by reference to the static source text or by
staring at the runtime call stack!).

> Python should be simple as well.  Yet, after the XEmacs
> internationalisation failure, one good lesson, that we would be foolish
> to not retain, is that we should take time to think things out, before
> starting to think too big.

"Do the simplest thing that could possibly work" is good advice in such
situations.  We'll fix it later <wink>.

> ...
> It is not clear to me how doc strings are used in Python.

It's not yet clear to anyone.  They were inspired by elisp, but because our
Benevolent Dictator didn't decree every detail of how they were to be used,
people have thrashed for years wondering what to do with them.  The saving
grace is that they're available for runtime introspection & manipulation via
__doc__ attributes, so any explicit scheme that can be applied to runtime
strings can be applied to them as well.
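Since __doc__ is an ordinary writable attribute, a runtime scheme can
rewrite it like any other string:

```python
def f():
    """Hello, world."""

# Translate (here, merely transform) the docstring in place:
f.__doc__ = f.__doc__.upper()
assert f.__doc__ == "HELLO, WORLD."
```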

If you register a module's translation domain via binding a
conventionally-named module attribute (let's pick one at random, say
"_domain"), any tool mucking with docstrings should be able to find that.
If you can get at a code body at runtime, you can get a handle to the code's
global namespace, which is an ordinary Python dict that almost always
corresponds to the module namespace in which the code body was defined.
Then you can simply look up "_domain" in that dict.  Python maintains a lot
of info for runtime introspection so you don't have to invent elaborate &
delicate compile-time schemes for retaining it yourself.
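For example, with a hypothetical module attribute _domain (modern spelling
__globals__ shown; in 1999 it was f.func_globals):

```python
_domain = 'mailtool'   # hypothetical per-module translation domain

def f():
    """A docstring awaiting translation."""

# A function carries a reference to its defining module's namespace,
# so a docstring tool can recover the domain from the function alone:
assert f.__globals__['_domain'] == 'mailtool'
```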

so-easy-it-will-probably-write-itself<wink>-ly y'rs  - tim





