Marking translatable strings

Thu Sep 16 19:13:48 EDT 1999

Bernhard Herzog <herzog at online.de> wrote:

> > 1) Marking strings
> > eight types of strings: ', ", ''', """, r', r", r''' and r"""

> Note that all these different types differ only at the lexical
> level.  The tokenizer treats them all as STRING tokens, so you could end
> up with a lot of changes to Python's internals to distinguish them at any
> higher level.

Changes to Python internals are unwanted merely for marking.  Strings are
marked mainly so that tools like `xgettext' can extract them.  Of course,
`xgettext' does not know Python currently, but this could be improved.  I had
my own version for experimenting with Emacs LISP and other languages, while
waiting for Ulrich to get back to `gettext' maintenance, if he ever takes
a little break from GNU `libc' :-).
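
As a small illustration of Bernhard's point about the lexical level, here is
a sketch using the `tokenize' module of the standard library.  The SAMPLES
list and the token_names helper are names made up for this example.  It
suggests that an extraction tool can rely on the token text alone, with no
change to the interpreter.

```python
import io
import tokenize

# The eight literal spellings listed in the thread, each wrapping the text x.
SAMPLES = ["'x'", '"x"', "'''x'''", '"""x"""',
           "r'x'", 'r"x"', "r'''x'''", 'r"""x"""']

def token_names(source):
    """Return the token type names for one line of Python source."""
    tokens = tokenize.generate_tokens(io.StringIO(source + "\n").readline)
    return [tokenize.tok_name[tok.type] for tok in tokens]

for literal in SAMPLES:
    # Every spelling yields a single STRING token; the quoting style
    # survives only in the token text, not in the token type.
    print(literal, token_names(literal)[0])
```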

> triple quotes are easier to type than the C-versions because you just
> have to hit the same key several times instead of different keys scattered
> all over the keyboard.

Surely.  The problem is not really the typing, but the legibility of sources.
Translatability marks should really be as unobtrusive as possible,
because in the long run, they are meant to clutter sources all over.
Experience shows that horizontal space is often a valuable resource while
writing code, so we should try to be economical there for overall legibility,
much more than typability.

> With your suggestion about """ you'd expect 'print """red"""' to print
> rouge if your program was properly localized for French, wouldn't you?

Yes.  I am assuming that people normally use """ for long strings, and that
it is very unlikely that people resort to 'print """red"""' in practice.
But even then, this one consumes too much horizontal space.  I think we
need something shorter.  I'm still away from home and have not scrutinised
this whole email thread yet, but so far, I guess we could use _(STRING), with
any type of STRING, as both a marker and a gettext caller, and we could use
''"TEXT" or ""'TEXT' for strings needing to be marked only, and translated
only later.  And doc strings.  This would be acceptable overall, I guess.

> A special translated string literal means that the Python interpreter
> would have to generate special bytecode to translate it every time it's
> executed.

It is not impossible that we could go without any need to modify the
interpreter, at least for prototypes of internationalized packages.
One reason to modify the interpreter would be for speed, maybe, by ensuring
that a given string is translated only once even in a loop, as
_() currently does in C when compiled with GNU C.  It might depend on how
exactly strings are interned within Python, which I do not know.  In any
case, it would be nice if internationalization could be achieved portably
across all flavours of Python, simply doing nothing on older implementations,
or doing it more slowly.  That is, the string syntax should not be modified.

> > 2) Translating strings
> I don't think the potential collision with _ in an interactive Python
> session is a problem.

Nice!  Thanks for alleviating this one :-).

> I've used Martin von Löwis' intl module (a wrapper for GNU gettext) [...]

I should at least study what's available.  My main goal is to reach a
global picture soon, because I would like to start internationalizing
all this Python code I'm currently writing.  Besides many small and some
bigger programs here, I surely intend to internationalise the Translation
Project robot suite, once I have finished converting it to Python.
The Mailman author (or one contributor? I do not know all people yet! :-)
wrote to me that he wants to get Mailman internationalised, and that
would be a major exercise worth tackling soon, to uncover
difficulties and problems.  I guess I would like to dive into this, or at
least, to participate in the effort of those who are already swimming! :-)

> If you're careful to only use double-quoted strings inside of _(),
> you can even use xgettext to build the initial po-files.

This would not be neat enough for my taste.  I'm quite ready to modify
xgettext, or alter my own extractor in the meantime, to do things properly.

> The only problem I can see with this is that you couldn't get xgettext to
> recognize doc-strings, because you can't mark them with _() or rather N_().
> Actually translating the doc-strings isn't a problem, because tools that
> access the doc-strings could just pass them through _().

Doc strings are recognizable by their location: at the start of a
module or after a def, when sextuple-quoted (hi hi! ' is single-quote,
" is double-quote, ''' is triple-quote, so we could say that """ is
sextuple-quote, or maybe just sex-quote for short :-).  If we retain the
suggestion that we have ''"TEXT" and ""'TEXT', then we could say that ''"
is quadruple-quote and ""' is quintuple-quote, thus giving a meaning to the
missing multipliers...  If I understand things correctly, `xgettext'-type
tools for Python might extract all 4-quoted and 5-quoted strings, as well
as 6-quoted strings when in doc string position, as well as strings given
to some keyword functions (like `_'), exactly as in `xgettext' currently.
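
The extraction rules just described can be sketched roughly as follows, with
the standard `tokenize' and `ast' modules.  This is only an illustration
under assumptions: extract and SAMPLE are hypothetical names, the keyword
function is taken to be _ alone, and every doc string is collected, whatever
its quoting, which is simpler than the 6-quoted rule above.

```python
import ast
import io
import tokenize

def extract(source):
    """Collect (kind, text) pairs for the strings a translator would need."""
    found = []

    # Doc strings are found by position, through the syntax tree.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Module, ast.FunctionDef, ast.ClassDef)):
            doc = ast.get_docstring(node, clean=False)
            if doc is not None:
                found.append(("doc", doc))

    # Marked strings are found lexically, through the token stream.
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    for i, tok in enumerate(tokens):
        if tok.type != tokenize.STRING or i == 0:
            continue
        prev = tokens[i - 1]
        before = tokens[i - 2] if i > 1 else None
        # ''"TEXT" or ""'TEXT': an empty literal glued to the next one.
        if prev.string in ("''", '""') and prev.end == tok.start:
            found.append(("marked", ast.literal_eval(tok.string)))
        # _(STRING): the keyword function applied directly to a literal.
        elif prev.string == "(" and before is not None and before.string == "_":
            found.append(("call", ast.literal_eval(tok.string)))
    return found

SAMPLE = '''
def greet():
    """Say hello."""
    label = ''"deferred"
    print(_("hello"))
    print("not marked")
'''

for kind, text in extract(SAMPLE):
    print(kind, repr(text))
```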

> > 3) Setting the textual domain
> With intl, you could just define _ locally (i.e. on module level) as
>     def _(text):
>         return dgettext("domain", text)

I hope it might be that simple.  In Emacs LISP, it is a real difficulty,
and notable internationalisation efforts failed because of this oversight,
if I correctly understand the stories told to me.  When Richard and I discussed
the matter a few times, I realised that it is difficult to grasp even for
people who know Emacs LISP rather well :-).  And he keeps forgetting :-(.

> > But there are problematic cases, like for when untranslated strings
> > are transmitted to other modules, for being translated there, or even
> > maybe for plain doc strings.

> But this isn't a python-specific problem, is it?  You'd have similar
> problems with C-libraries, or with passing such strings from say a
> third-party plugin to the main program.

I was not thinking of Python specifically, but I was not thinking of C either,
where the above problem does not exist in practice.  In bash, say, it is
sufficient to have a stack of textual domains, when getting in or out of
nested scripts, but this is already implemented through forking.  Sourcing
may cause some uncertainty, especially in the context of shell functions.
Problems arise when languages have purely dynamic scoping, because textual
domains are lexically scoped by intent.  I did internationalize a rather
big Scheme program already, but it was all within a single textual domain,
so I did not hit any wall there.  But I would guess that Scheme is much
simpler than Emacs LISP, in actual practice -- I'm not sure.
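
The difficulty with strings travelling between modules can be sketched as
follows.  Everything here is hypothetical: CATALOGS stands in for real
per-domain catalogs, and make_translator, plugin_ and main_ are names
invented for the example.  Each _ is bound lexically to one domain, yet the
string itself carries no trace of the domain that marked it.

```python
# Hypothetical per-domain catalogs (assumptions for the sketch).
CATALOGS = {
    "plugin": {"red": "rouge"},
    "main": {"quit": "quitter"},
}

def make_translator(domain):
    """Return a translation function bound lexically to DOMAIN."""
    catalog = CATALOGS[domain]
    def translate(text):
        return catalog.get(text, text)
    return translate

plugin_ = make_translator("plugin")   # the _ seen inside the plugin
main_ = make_translator("main")       # the _ seen inside the main program

# The plugin hands an untranslated string to the main program, which
# translates it with its own domain and silently misses the entry.
message = "red"
print(main_(message))     # looked up in the wrong domain
print(plugin_(message))   # looked up in the domain that marked it
```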

Python should be simple as well.  Yet, after the XEmacs internationalisation
failure, one good lesson, which we would be foolish not to retain, is that
we should take time to think things out before starting to think too big.

> For doc-strings you could probably get away with a global variable.

It is not clear to me how doc strings are used in Python.  Some people were
kind enough to reply to one question in a previous thread of this mailing
list, but I still have some study to do before I understand all the replies.
As this area is a bit fuzzy to me, I would fear some oversight if we were
ignoring them too soon...

Thanks a lot, Bernhard, for having taken the time to share your thoughts!

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard