From Misha.Wolf@reuters.com  Fri Aug  3 20:40:39 2001
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Fri, 03 Aug 2001 20:40:39 +0100
Subject: [I18n-sig] 19th Unicode Conference, September 2001, San Jose, CA,
 USA -- Register now!
Message-ID: <T5526932463c407b706480@reuters.com>

           Nineteenth International Unicode Conference (IUC19)
               Unicode and the Web: The Global Connection
                    http://www.unicode.org/iuc/iuc19
                         September 10-14, 2001
                           San Jose, CA, USA
                             Register now!

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

NEWS

 * Hotel guest room group rate valid to August 17.

 * Visit the Conference Web site ( http://www.unicode.org/iuc/iuc19 )
   to check the updated Conference program and register.  To help you
   choose Conference sessions, we've included abstracts of talks and
   speakers' biographies.

CONFERENCE SPONSORS

   Agfa Monotype Corporation
   Basis Technology Corporation
   Lionbridge Technologies
   Microsoft Corporation
   Netscape Communications
   Oracle Corporation
   PeopleSoft, Inc.
   Reuters Ltd.
   Sun Microsystems, Inc.
   Trados Corporation
   Trigeminal Software, Inc.
   World Wide Web Consortium (W3C)
   Wrox Press

CONFERENCE VENUE

   DoubleTree Hotel San Jose
   2050 Gateway Place
   San Jose, CA 95110
   USA

   Tel: +1 408 453 4000
   Fax: +1 408 437 2898

GLOBAL COMPUTING SHOWCASE

   Visit the Showcase to find out more about products supporting the
   Unicode Standard, and products and services that can help you
   globalize/localize your software, documentation and Internet content.
   For details, visit the Conference Web site:
     http://www.unicode.org/iuc/iuc19

   Exhibitors to date include:
   * Basis Technology Corporation
   * Everlasting Systems Ltd.
   * Multilingual Computing, Inc.
   * Oracle Corporation
   * Rasmussen Software, Inc.
   * Sun Microsystems, Inc.
   * Symbio Group
   * Trados

CONFERENCE MANAGEMENT

   Global Meeting Services Inc.
   4360 Benhurst Avenue
   San Diego, CA 92122, USA

   Tel: +1 858 638 0206 (voice)
        +1 858 638 0504 (fax)

   Email: info@global-conference.com
      or: conference@unicode.org

                             *  *  *  *  *

Unicode(r) and the Unicode logo are registered trademarks of Unicode,
Inc.  Used with permission.


-----------------------------------------------------------------
        Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of  the  individual
sender,  except  where  the sender specifically states them to be
the views of Reuters Ltd.


From pinard@iro.umontreal.ca  Tue Aug  7 21:07:39 2001
From: pinard@iro.umontreal.ca (François Pinard)
Date: 07 Aug 2001 16:07:39 -0400
Subject: [I18n-sig] Re: pygettext.py extraction of docstrings
In-Reply-To: <15117.38438.361043.255768@anthem.wooz.org>
References: <14840.35473.307059.990479@anthem.concentric.net>
 <200010272228.AAA01066@loewis.home.cs.tu-berlin.de>
 <15113.29005.357449.812516@anthem.wooz.org>
 <shy9rmwm54.fsf@tux.gnu.franken.de>
 <15117.38438.361043.255768@anthem.wooz.org>
Message-ID: <oqzo9bfxbo.fsf@lin2.sram.qc.ca>

[Barry A. Warsaw]

> Then again, it doesn't say that #. comments are reserved.  It basically
> just says that #-whitespace comments are reserved for the translators.

You might consider that they are all reserved.

> I'm happy to switch it, but I'd really like to have a reference I can
> point to to short-circuit any further discussion.  Even a mailing list
> archive url would be fine.

There is no formal, fully dependable reference.  I might have written the
bits that exist in the `gettext' manual, and these things were programmed
only after they were thoroughly discussed with me.  But nowadays, even I am
not a good reference.  A few people contributed `gettext' code, pushing and
pulling a bit hard for their own ideas, and not always understanding the
overall plans.  Their code made it into `gettext' releases nevertheless.
So now, I'm not sure I understand much anymore where things are going.

If I remember correctly, `#.' are for textual comments written by the program
maintainer, meant to be read by translators, and derived automatically at
POT creation time.  They usually come from specially formatted comments
in the C sources.  `#-whitespace' are for textual comments also meant to
be read by various translators, but written by translators themselves.

`#,' are for programmatic flags.  The idea was to use these parsimoniously,
keeping track of possible flag definitions and consequences.  I do not know
how far these are recognized and validated by `msgfmt'.  Best would be to
coordinate with the current `gettext' maintainer before creating new ones.
Unless he declares they are now for free use?
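
As an illustration only, the three comment kinds described above could be
told apart by their prefix.  The helper name classify_po_comment and the
sample lines are hypothetical; nothing here is part of `gettext' or
`pygettext', and `msgfmt' may well apply different rules:

```python
# Sketch only: classify PO-file comment lines by the prefixes described above.
# classify_po_comment is a hypothetical helper, not a gettext API.

def classify_po_comment(line):
    """Return a label for one line of a PO file, or None if it is not a comment."""
    if not line.startswith('#'):
        return None                 # msgid/msgstr lines and the like
    if line.startswith('#.'):
        return 'extracted'          # written by the maintainer, copied at POT creation
    if line.startswith('#,'):
        return 'flag'               # programmatic flags
    if len(line) == 1 or line[1].isspace():
        return 'translator'         # '#' followed by whitespace: written by translators
    return 'other'                  # any other prefix is left unclassified here


sample = [
    '#  Checked against the style guide.',
    '#. Shown in the usage message.',
    '#, docstring',
    'msgid "Usage: %s [options]"',
]
labels = [classify_po_comment(line) for line in sample]
```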

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard


From pinard@iro.umontreal.ca  Tue Aug  7 21:38:05 2001
From: pinard@iro.umontreal.ca (François Pinard)
Date: 07 Aug 2001 16:38:05 -0400
Subject: [I18n-sig] Re: pygettext dilemma
In-Reply-To: <15200.64763.772001.53387@anthem.wooz.org>
References: <15200.64763.772001.53387@anthem.wooz.org>
Message-ID: <oqvgjzfvwy.fsf@lin2.sram.qc.ca>

[Barry A. Warsaw]

> In Mailman, I've got a bunch of normal .py modules and a bunch of
> command line scripts.  The modules have their translatable strings
> nicely marked with _() and only those strings should be extracted.

Hello, Barry.  Long time no talk! :-)

`_(STRING)' serves two purposes.  First, it marks STRING for extraction and
later insertion into a generated POT file.  Second, it is an alias for the
`gettext' function or the like, which translates STRING at run time, provided
that a translation file supplies a translation.

Experience taught us that this is not always adequate.  We sometimes need
to delay a translation.  That is, we might use `_(VARIABLE)', with VARIABLE
being first assigned some translatable string elsewhere in the program.
Since VARIABLE is not a string, it does not get extracted into a POT file.
But those strings which could get assigned to VARIABLE are not extracted
either, because they are not marked.  You understand that if they were marked
with `_(STRING)', they would get translated prematurely.

All this to say that there is a need for marking strings in such a way that
they will be extracted into POT files, but otherwise untouched by Python.
That is, the way to mark a string should be a Python no-operation, and ideally,
should not alter the Python language.
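
A small sketch of the problem, under stated assumptions: CATALOGS,
current_language and this toy `_' are stand-ins invented for the example,
not the real `gettext' machinery.

```python
# Sketch only: dictionaries stand in for real message catalogs.
CATALOGS = {'fr': {'Monday': 'lundi'}, 'de': {'Monday': 'Montag'}}
current_language = 'fr'

def _(message):
    """Toy translator: look the message up in the currently selected catalog."""
    return CATALOGS[current_language].get(message, message)

EARLY = _('Monday')   # marked, but translated at once, while the language is 'fr'
LATE = 'Monday'       # left untranslated, but invisible to an extractor looking for _()

current_language = 'de'   # the language is only really known later

early_result = EARLY      # frozen in the wrong language
late_result = _(LATE)     # _(VARIABLE): translated correctly at run time
```

Here early_result stays 'lundi' after the switch to German, while
late_result becomes 'Montag'; the second form is the one that still needs
some way of being marked for extraction.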

The only simple Python no-operation I know is the unary prefix `+', and my
intuition tells me that it might have been dangerous to use it for marking
delayed translation strings.  Using prefixes like i"STRING" or t"STRING"
(for "i"nternationalisable or "t"ranslatable) would require a modification
to Python.

So, I came up with the simple idea to play a bit with the fact that Python folds
a succession of constant strings into a single one at compilation time.
The idea is to prefix a translatable string, when it is used outside the
usual `_(STRING)' idiom, by an empty string of the other kind, like this:

   Example         Type         For extractor

   'TEXT'          1-quoted     not marked
   "TEXT"          2-quoted     not marked
   '''TEXT'''      3-quoted     not marked
   ''"TEXT"        4-quoted     marked
   ""'TEXT'        5-quoted     marked
   """TEXT"""      6-quoted     not marked
   ""'''TEXT'''    7-quoted     marked
   ''"""TEXT"""    8-quoted     marked

Of course, the idea of using the empty string "of the other kind" is to
avoid ambiguity: prefixing '' to 'TEXT' would produce '''TEXT', which just
cannot work.  The same holds for 7-quoted and 8-quoted strings: an empty
string of the same kind written directly before the opening quotes, as in
'''''TEXT''', is read as one triple-quoted string whose content starts with
two quote characters.  So "of the other kind" applies consistently to all
the marked forms.

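The folding of adjacent literals that this convention relies on can be
checked directly.  A minimal sketch; the variable names are only for
illustration:

```python
# Adjacent string literals are joined into one constant by the compiler,
# so an empty literal in front of a string does not change its value.
plain = 'TEXT'
marked_4 = ''"TEXT"
marked_5 = ""'TEXT'
marked_7 = ""'''TEXT'''
marked_8 = ''"""TEXT"""

all_equal = all(s == plain for s in (marked_4, marked_5, marked_7, marked_8))
```
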
> The scripts however should have both _() and docstrings extracted,
> since the module docstrings include usage text.

In fact, I think that even within a single module, some docstrings should
be considered translatable, while some other docstrings should not be.
Making the choice for a whole module at a time is too coarse.
This goes almost without saying.  One should not feel compelled to avoid
docstrings for internal or service functions within a module, merely to
avoid having them spuriously extracted, and later, uselessly translated.

> Does anybody have any suggestions or better ideas?

I would be tempted to suggest that we merely use delayed string marking,
using the convention above (like in 4-quoted, 5-quoted, 7-quoted or 8-quoted)
for docstrings meant to be translated.  Such strings would be extracted
no matter what, in docstring position or not.

An option to `pygettext' might exist to extract all docstrings, whether
marked as delayed strings or not, but I would guess this is an interim
solution which would not be satisfying in the long term.  Best is to
mark translatable strings precisely, either using immediate `_(STRING)'
or delayed translation.

One problem is that Python does not seem to automatically concatenate a
sequence of strings as a single one, when in docstring position.  We might
consider this as a Python bug: repairing that bug would not really change
the language, and would allow delayed marking of translation strings.

Let me present the set of suggestions, in this message, as having a minimal
impact on Python, yet being pretty flexible in what it would allow us to do.

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard


From Misha.Wolf@reuters.com  Fri Aug 10 22:58:15 2001
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Fri, 10 Aug 2001 22:58:15 +0100
Subject: [I18n-sig] Call for Papers - 20th Unicode Conference - Jan/Feb 2002 - Washington
 DC
Message-ID: <T554b1d94f5c407b706494@reuters.com>

           Twentieth International Unicode Conference (IUC20)
               Unicode and the Web: The Global Connection
                    http://www.unicode.org/iuc/iuc20
                     January 28 - February 1, 2002
                          Washington, DC, USA

      > > > > > > >  C A L L   F O R   P A P E R S  < < < < < < <

                  Submissions due: September 21, 2001
                  Notification date: October 12, 2001
                Completed papers due: November 2, 2001
            (in electronic form and camera-ready paper form)

                               * * * * *

The Unicode Standard has become the foundation for all modern text
processing.  It is used on large machines, tiny portable devices, and
for distributed processing across the Internet.  The standard brings
cost-reducing efficiency to international applications and enables the
exchange of text in an ever increasing list of natural languages.

New technologies and innovative Internet applications, as well as the
evolving Unicode Standard, bring new challenges along with their new
capabilities.  This technical conference will explore the opportunities
created by the latest advances and how to leverage them, as well as
potential pitfalls to be aware of, and problem areas that need further
research.

We invite you to submit papers which either define the software of
tomorrow, demonstrate best practice with today's software, or articulate
problems that must be solved before further advances can occur.  Papers
should discuss subjects in the context of Unicode, internationalization
or localization. You can view the programs of previous conferences at:
http://www.unicode.org/unicode/conference/about-conf.html

Conference attendees are generally involved in either the development,
deployment or use of Unicode software or content, or the globalization
of software and the Internet.  They include managers, software
engineers, systems analysts, font designers, graphic designers, content
developers, technical writers, and product marketing personnel.

THEME & TOPICS

Computing with Unicode is the overall theme of the Conference.
Presentations should be geared towards a technical audience.  Topics of
interest include, but are not limited to, the following (within the
context of Unicode, internationalization or localization):

- UTFs: Not enough or too many?
- Security concerns e.g. Avoiding the spoofing of UTF-8 data
- Impact of new encoding standards
- Implementing Unicode: Practical and political hurdles
- Portable devices
- Implementing new features of recent versions of Unicode
- Algorithms (e.g. normalization, collation, bidirectional)
- Programming languages and libraries (Java, Perl, et al)
- The World Wide Web (WWW)
- Search engines
- Library and archival concerns
- Operating systems
- Databases
- Large scale networks
- Government applications
- Evaluations (case studies, usability studies)
- Natural language processing
- Migrating legacy applications
- Cross platform issues
- Printing and imaging
- Optimizing performance of systems and applications
- Testing applications
- XML and Web protocols
- Business models for software development (e.g. Open source)

SESSIONS

The Conference Program will provide a wide range of sessions including:
- Keynote presentations
- Workshops/Tutorials
- Technical presentations
- Panel sessions

All sessions except the Workshops/Tutorials will be of 40 minute
duration.  In some cases, two consecutive 40 minute program slots may be
devoted to a single session.

The Workshops/Tutorials will each last approximately three hours.  They
should be designed to stimulate discussion and participation, using
slides and demonstrations.

PUBLICITY

If your paper is accepted, your details will be included in the
Conference brochure and Web pages and the paper itself will appear on a
Conference CD, with an optional printed book of Conference Proceedings.

CONFERENCE LANGUAGE

The Conference language is English.  All submissions, papers and
presentations should be provided in English.

SUBMISSIONS

Submissions MUST contain:

1. An abstract of 150-250 words, consisting of a statement of purpose,
   paper description, and your conclusions or final summary.

2. A brief biography.

3. The details listed below:

   SESSION TITLE:             _________________________________________

                              _________________________________________

   TITLE (eg Dr/Mr/Mrs/Ms):   _________________________________________

   NAME:                      _________________________________________

   JOB TITLE:                 _________________________________________

   ORGANIZATION/AFFILIATION:  _________________________________________

   ORGANIZATION'S WWW URL:    _________________________________________

   OWN WWW URL:               _________________________________________

   ADDRESS FOR PAPER MAIL:    _________________________________________

                              _________________________________________

                              _________________________________________

   TELEPHONE:                 _________________________________________

   FAX:                       _________________________________________

   E-MAIL ADDRESS:            _________________________________________

   TYPE OF SESSION:           [ ] Keynote presentation

                              [ ] Workshop/Tutorial

                              [ ] Technical presentation

                              [ ] Panel

   PANELISTS (if Panel):      _________________________________________

                              _________________________________________

                              _________________________________________

                              _________________________________________

                              _________________________________________

                              _________________________________________

                              _________________________________________

                              _________________________________________

   TARGET AUDIENCE (you may select more than one category):

                              [ ] Content Developers

                              [ ] Font Designers

                              [ ] Graphic Designers

                              [ ] Managers

                              [ ] Marketers

                              [ ] Software Engineers

                              [ ] Systems Analysts

                              [ ] Technical Writers

                              [ ] Others (please specify):

                              _________________________________________

                              _________________________________________

   LEVEL OF SESSION (you may select more than one category):

                              [ ] Beginner

                              [ ] Intermediate

                              [ ] Advanced

Submissions should be sent by e-mail to either of the following
addresses:

   papers@unicode.org

   info@global-conference.com

They should use ASCII, non-compressed text and the following subject
line:

   Proposal for IUC 20

If desired, a copy of the submission may also be sent by post to:

   Twentieth International Unicode Conference
   c/o Global Meeting Services, Inc.
   4360 Benhurst Avenue
   San Diego, CA  92122  USA
   Tel: +1 858 638 0206
   Fax: +1 858 638 0504

CONFERENCE PROCEEDINGS

All Conference papers will be published on CD.  Printed proceedings will
be offered as an option.

EXHIBIT OPPORTUNITIES

The Conference will have an Exhibition area for corporations or
individuals who wish to display and promote their products, technology
and/or services.

Every effort will be made to provide maximum exposure and advertising.

Exhibit space is limited.  For further information or to reserve a
place, please contact Global Meeting Services at the above location.

CONFERENCE VENUE

   Omni Shoreham Hotel
   2500 Calvert Street, NW
   Washington, DC  20008
   USA

   Tel: +1 202 234 0700
   Fax: +1 202 265 7972

THE UNICODE CONSORTIUM

The Unicode Consortium was founded as a non-profit organization in 1991.
It is dedicated to the development, maintenance and promotion of The
Unicode Standard, a worldwide character encoding.  The Unicode Standard
encodes the characters of the world's principal scripts and languages,
and is code-for-code identical to the international standard ISO/IEC
10646.  In addition to cooperating with ISO on the future development of
ISO/IEC 10646, the Consortium is responsible for providing character
properties and algorithms for use in implementations.  Today the
membership base of the Unicode Consortium includes major computer
corporations, software producers, database vendors, research
institutions, international agencies and various user groups.

For further information on the Unicode Standard, visit the Unicode Web
site at http://www.unicode.org or e-mail <info@unicode.org>

                           *  *  *  *  *

Unicode(r) and the Unicode logo are registered trademarks of Unicode,
Inc.  Used with permission.




From martin@loewis.home.cs.tu-berlin.de  Sun Aug 12 09:57:39 2001
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Sun, 12 Aug 2001 10:57:39 +0200
Subject: [I18n-sig] Re: pygettext dilemma
In-Reply-To: <oqvgjzfvwy.fsf@lin2.sram.qc.ca> (pinard@IRO.UMontreal.CA)
References: <15200.64763.772001.53387@anthem.wooz.org> <oqvgjzfvwy.fsf@lin2.sram.qc.ca>
Message-ID: <200108120857.f7C8vdi02038@mira.informatik.hu-berlin.de>

> One problem is that Python does not seem to automatically concatenate a
> sequence of strings as a single one, when in docstring position.

What version did you use to try this? It works fine for me:

Python 2.0 (#1, May 16 2001, 00:02:45)
[GCC 2.95.3 20010315 (SuSE)] on linux2
Type "copyright", "credits" or "license" for more information.
>>> def foo():
...   ""'Hallo'
...
>>> foo.__doc__
'Hallo'

Regards,
Martin


From pinard@iro.umontreal.ca  Mon Aug 13 01:41:34 2001
From: pinard@iro.umontreal.ca (François Pinard)
Date: 12 Aug 2001 20:41:34 -0400
Subject: [I18n-sig] Re: pygettext dilemma
In-Reply-To: <200108120857.f7C8vdi02038@mira.informatik.hu-berlin.de>
References: <15200.64763.772001.53387@anthem.wooz.org>
 <oqvgjzfvwy.fsf@lin2.sram.qc.ca>
 <200108120857.f7C8vdi02038@mira.informatik.hu-berlin.de>
Message-ID: <oqofpkiyf5.fsf@lin2.sram.qc.ca>

[Martin v. Loewis]

> > One problem is that Python does not seem to automatically concatenate
> > a sequence of strings as a single one, when in docstring position.

> What version did you use to try this?  It works fine for me:

Oops!  You are right.  Sorry, my tests were wrong:

    Python 2.1 (#1, Jul  3 2001, 21:59:44) 
    [GCC 2.95.2 19991024 (release)] on linux2
    Type "copyright", "credits" or "license" for more information.
    >>> def bonjour():
    ...    'chez '
    ...    'vous!'
    ...    pass
    ... 
    >>> bonjour.__doc__
    'chez '
    >>> def bonjour():
    ...    'chez ' 'vous!'
    ...    pass
    ... 
    >>> bonjour.__doc__
    'chez vous!'
    >>>

So, there is no problem, and:

    ''""" LONG DOC STRING """

could be marked as translatable exactly like this.  This would allow
separating docstrings meant to be translated from the others,
one string at a time, rather than one module at a time.
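
A minimal sketch of how an extractor might recognise this marking, assuming
it works on the token stream from Python's standard tokenize module (written
for current Python 3).  find_marked_strings and SAMPLE are hypothetical
names, this is not how pygettext.py is implemented, and the sketch does not
check that the empty prefix is "of the other kind":

```python
import ast
import io
import tokenize

SAMPLE = '''\
def usage():
    ''"""Print a usage message."""

def helper():
    """Internal helper."""

label = ""'Traditional Chinese'
plain = 'not marked'
'''

def find_marked_strings(source):
    """Yield the value of each string literal that directly follows an empty literal."""
    empty_literals = ("''", '""')
    previous = None
    for token in tokenize.generate_tokens(io.StringIO(source).readline):
        if token.type == tokenize.STRING:
            if previous is not None and previous.string in empty_literals:
                yield ast.literal_eval(token.string)
            previous = token
        elif token.type not in (tokenize.NL, tokenize.COMMENT):
            previous = None     # anything else breaks the run of adjacent literals

found = list(find_marked_strings(SAMPLE))
```

Only the docstring of usage() and the value assigned to label are reported;
the unmarked docstring of helper() is skipped.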

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard


From barry@wooz.org  Mon Aug 13 04:17:37 2001
From: barry@wooz.org (Barry A. Warsaw)
Date: Sun, 12 Aug 2001 23:17:37 -0400
Subject: [I18n-sig] Re: pygettext.py extraction of docstrings
References: <14840.35473.307059.990479@anthem.concentric.net>
 <200010272228.AAA01066@loewis.home.cs.tu-berlin.de>
 <15113.29005.357449.812516@anthem.wooz.org>
 <shy9rmwm54.fsf@tux.gnu.franken.de>
 <15117.38438.361043.255768@anthem.wooz.org>
 <oqzo9bfxbo.fsf@lin2.sram.qc.ca>
Message-ID: <15223.18129.372008.719610@anthem.wooz.org>

Hi Francois!  I'm Cc'ing Bruno on this message because I think he's
the current gettext maintainer.  Sorry if I'm mistaken...

>>>>> "FP" == François Pinard <pinard@iro.umontreal.ca> writes:

    >> Then again, it doesn't say that #. comments are reserved.  It
    >> basically just says that #-whitespace comments are reserved for
    >> the translators.

    FP> You might consider that they are all reserved.

    >> I'm happy to switch it, but I'd really like to have a reference
    >> I can point to to short-circuit any further discussion.  Even a
    >> mailing list archive url would be fine.

    FP> If I remember well, `#.' are for textual comments written by
    FP> the program maintainer, meant to be read by translators, and
    FP> derived automatically at POT creation time.  They usually come
    FP> from specially formatted comments in the C sources.
    FP> `#-whitespace' are for textual comments also meant to be read
    FP> by various translators, but written by translators themselves.

This makes sense.  It would be good to make this a bit clearer in the
"Format of PO Files" section of the GNU gettext manual.

    FP> `#,' are for programmatic flags.  The idea was to use these
    FP> parsimoniously, keeping track of possible flag definitions and
    FP> consequences.  I do not know how far these are recognized and
    FP> validated by `msgfmt'.  Best would be to coordinate with the
    FP> current `gettext' maintainer before creating new ones.  Unless
    FP> he declares they are now for free use?

A while back I was convinced to switch the `docstring' flag to #, for
pygettext.  Perhaps Bruno can add some information on pygettext.py in
the GNU gettext manual?  I think the following would be of interest:

- Mention the existence of pygettext.py for extracting translatable
  strings in Python.

- Point to Python's gettext module documentation for more details on
  i18n'ing Python programs.  This should be a fairly stable url:

  http://www.python.org/doc/current/lib/module-gettext.html

- Document `docstring' as a legal #,-style flag.  It probably only has
  meaning in Python, but may be useful in other scripting languages.
  Think of it as roughly equivalent to Emacs-Lisp docstrings (in fact,
  they were the inspiration for Python docstrings back in '94 at the
  1st Python workshop!)

- Make sure that the other GNU gettext tools recognize the docstring
  flag, in whatever way is meaningful (I'm not sure what would be
  useful or not... ;).

Thanks.  BTW, for my purposes, pygettext.py's -X/--no-docstrings
switch does the job perfectly, if a bit inelegantly.

-Barry


From barry@zope.com  Mon Aug 13 04:42:57 2001
From: barry@zope.com (Barry A. Warsaw)
Date: Sun, 12 Aug 2001 23:42:57 -0400
Subject: [I18n-sig] Re: pygettext dilemma
References: <15200.64763.772001.53387@anthem.wooz.org>
 <oqvgjzfvwy.fsf@lin2.sram.qc.ca>
Message-ID: <15223.19649.811672.585574@anthem.wooz.org>

>>>>> "FP" == François Pinard <pinard@iro.umontreal.ca> writes:

    >> In Mailman, I've got a bunch of normal .py modules and a bunch
    >> of command line scripts.  The modules have their translatable
    >> strings nicely marked with _() and only those strings should be
    >> extracted.

    FP> Hello, Barry.  Long time no talk! :-)

Indeed!  BTW, I18N Mailman is coming along very nicely now.  I hope
the 2.1 release will happen within the next few months.

    FP> `_(STRING)' is two-fold.  First, it marks STRING for
    FP> extraction and later insertion in some generated POT file.
    FP> Second, it is a nickname for the `gettext' function or alike,
    FP> that will translate STRING at run time given that a
    FP> translation file provides a translation.

    FP> Experience taught us that this is not always adequate.  We
    FP> sometimes need to delay a translation.  That is, we might use
    FP> `_(VARIABLE)', with VARIABLE being first assigned some
    FP> translatable string elsewhere in the program.  Since VARIABLE
    FP> is not a string, it does not get extracted into a POT file.
    FP> But those strings which could get assigned to VARIABLE are not
    FP> extracted either, because they are not marked.  You understand
    FP> that if they were marked with `_(STRING)', they would get
    FP> translated prematurely.

    FP> All this to say that there is a need for marking strings in
    FP> such a way that they will be extracted into POT files, but
    FP> otherwise untouched by Python.  That is, the way to mark a
    FP> string should be a Python no-operation, and ideally, should
    FP> not alter the Python language.

All the above is true, and I have encountered these situations in
Mailman 2.1.  Python, however, provides a very nice solution, quite in
keeping with the Pythonic "explicit-is-better-than-implicit" mantra.

What I do in this situation is to temporarily bind _() to a no-op
function so that the string is marked for extraction, but not
translated in place.  E.g.

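    import gettext

    # Temporary no-op: marks strings for extraction without translating them.
    def _(s):
        return s

    foo = _('extract this string but do not translate it yet')

    # From here on, _() translates at run time.
    _ = gettext.gettext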

This works perfectly because Python doesn't suffer from the same
deficiencies as C (i.e. the C pre-processor :).
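
To make the effect visible without a real message catalog, here is a sketch
of the same pattern in which a dictionary lookup stands in for
gettext.gettext.  FRENCH, translate and BUTTON_LABEL are hypothetical names
used only for this illustration:

```python
# Sketch only: FRENCH and translate() stand in for a real catalog and gettext.gettext.
FRENCH = {'Subscribe': "S'abonner"}

def translate(message):
    return FRENCH.get(message, message)

def _(s):
    # Temporary no-op: the call marks the string for extraction, nothing more.
    return s

BUTTON_LABEL = _('Subscribe')   # seen by the extractor, stored untranslated

_ = translate                   # rebind _ to the real translator

stored = BUTTON_LABEL           # still the original English text
shown = _(BUTTON_LABEL)         # translated only now, at the point of use
```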

    FP> The only simple Python no-operation I know is the unary prefix
    FP> `+', and my intuition tells me that it might have been
    FP> dangerous to use it for marking delayed translation strings.
    FP> Using prefixes like i"STRING" or t"STRING" (for
    FP> "i"nternationalisable or "t"ranslatable) would require a
    FP> modification to Python.

Right.  A string-prefix character has another disadvantage: it sets a
bad precedent for an explosion of prefix combinations (i.e. we'd
now need rt'' strings, tr'' strings, utr'' strings, tru'' strings,
etc. etc.).  So we agree that prefixes are out. :)

    FP> So, I came up with the simple idea to play a bit with the fact
    FP> that Python folds a succession of constant strings into a
    FP> single one at compilation time.  The idea is to prefix a
    FP> translatable string, when it is used outside the usual
    FP> `_(STRING)' idiom, by an empty string of the other kind, like
    FP> this:

    FP>    Example         Type         For extractor

    |    'TEXT'          1-quoted     not marked
    |    "TEXT"          2-quoted     not marked
    |    '''TEXT'''      3-quoted     not marked
    |    ''"TEXT"        4-quoted     marked
    |    ""'TEXT'        5-quoted     marked
    |    """TEXT"""      6-quoted     not marked
    |    ""'''TEXT'''    7-quoted     marked
    |    ''"""TEXT"""    8-quoted     marked

This has been brought up before, and I know that some people really
like this approach.  I don't though, because 1) it is too magical; 2)
the rules are arbitrary and hard to remember; 3) explicit is better
than implicit.

When a newbie looks at a bit of Python code that looks like

    _('Traditional Chinese')

and wonders what this does, he should immediately look for the
definition of the _() function.  Using his well-honed Python skills
he'll look for some def or import that brings this function name into
scope, and this should naturally lead to the purpose of the idiom.
E.g. he'll see "from gettext import gettext as _" or some such.

Seeing something like an unadorned ""'Traditional Chinese' really
gives no clue as to the purpose of this strange markup, so it would
have to either be something the reader of the code Just Got, or it
would have to be described in a comment, and that's simply
unfeasible.  I also claim that the rules are fairly arbitrary and will
be hard to explain and remember.  It's not something that's learned
once and then ingrained.

    FP> In fact, I think that even within a single module, some
    FP> docstrings should be considered translatable, while some other
    FP> docstrings should not be.

True.

    FP> Making the choice for a whole module at a time is too
    FP> coarse.  This goes almost without saying.

I personally don't feel like it's that big a problem.  So far, in my
experience the only docstrings that really need to be extracted are
module docstrings in command line scripts.  I've found it not to be
that big a deal to also extract class or function docstrings in those
files, since it doesn't add that much of a burden to the translator.
But my personal preference has been to limit the docstrings in such
files to just the module docstring, and use comments instead of
docstrings for functions or classes.  Or, you can sometimes do
something ugly like use explicit

    __doc__ = _('Here is a module docstring')

Not pretty, but also not common I think, so it doesn't concern me
much.  I could conceive of a convention where a leading comment before
a docstring could inhibit extraction of the following docstring, such
as:

    class Foo:
        # notranslate
        '''Here is a docstring that should not be extracted or translated.'''

One of two approaches could happen: either pygettext.py could ignore
the following docstring and not stick it in the PO file (but I forget
if tokenize gets to see comments or not), or pygettext.py could add a
#. notranslate comment to the entry telling translators to skip this
entry.
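
On the tokenize question: in current Python, the standard tokenize module
does report comments, as COMMENT tokens.  A rough sketch of the first
approach under that assumption; docstrings_to_skip, SAMPLE and the exact
marker handling are hypothetical, and the sketch does not verify that the
string really is in docstring position:

```python
import io
import tokenize

SAMPLE = '''\
class Foo:
    # notranslate
    """Here is a docstring that should not be extracted or translated."""

class Bar:
    """Here is a docstring that should be extracted."""
'''

def docstrings_to_skip(source, marker='notranslate'):
    """Return the string tokens that directly follow a '# notranslate' comment."""
    skipped = []
    pending = False
    layout = (tokenize.NL, tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT)
    for token in tokenize.generate_tokens(io.StringIO(source).readline):
        if token.type == tokenize.COMMENT:
            # '# notranslate' -> 'notranslate'
            pending = token.string.lstrip('#').strip() == marker
        elif token.type == tokenize.STRING:
            if pending:
                skipped.append(token.string)
            pending = False
        elif token.type not in layout:
            pending = False     # some other code came between comment and string
    return skipped

skipped = docstrings_to_skip(SAMPLE)
```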

    FP> Let me present the set of suggestions, in this message, as
    FP> having a minimal impact on Python, yet being pretty flexible
    FP> in what it would allow us to do.

I appreciate the suggestions Francois!  I think what we've got gives
us the best approach for Python programs.

Cheers,
-Barry


From martin@loewis.home.cs.tu-berlin.de  Mon Aug 13 06:30:51 2001
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Mon, 13 Aug 2001 07:30:51 +0200
Subject: [I18n-sig] Re: pygettext dilemma
In-Reply-To: <15223.19649.811672.585574@anthem.wooz.org> (barry@zope.com)
References: <15200.64763.772001.53387@anthem.wooz.org>
 <oqvgjzfvwy.fsf@lin2.sram.qc.ca> <15223.19649.811672.585574@anthem.wooz.org>
Message-ID: <200108130530.f7D5Upm00873@mira.informatik.hu-berlin.de>

> I personally don't feel like it's that big a problem.  So far, in my
> experience the only docstrings that really need to be extracted are
> module docstrings in command line scripts.

I disagree somewhat, but I also have a different application in
mind. I do want to get translations for the doc strings of the
standard library; in fact, that is what the python domain in the
translation project has at the moment. The application here is
that the help() function should present the translation of the
doc string if available.

Regards,
Martin


From martin@loewis.home.cs.tu-berlin.de  Mon Aug 13 06:35:17 2001
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Mon, 13 Aug 2001 07:35:17 +0200
Subject: [I18n-sig] Re: pygettext.py extraction of docstrings
In-Reply-To: <15223.18129.372008.719610@anthem.wooz.org> (barry@wooz.org)
References: <14840.35473.307059.990479@anthem.concentric.net>
 <200010272228.AAA01066@loewis.home.cs.tu-berlin.de>
 <15113.29005.357449.812516@anthem.wooz.org>
 <shy9rmwm54.fsf@tux.gnu.franken.de>
 <15117.38438.361043.255768@anthem.wooz.org>
 <oqzo9bfxbo.fsf@lin2.sram.qc.ca> <15223.18129.372008.719610@anthem.wooz.org>
Message-ID: <200108130535.f7D5ZHb00876@mira.informatik.hu-berlin.de>

> - Mention the existence of pygettext.py for extracting translatable
>   strings in Python.

To my knowledge, Bruno just put a section in the gettext manual
explaining gettext usage with various (programming) languages. The
Python entry there does mention pygettext.py.

> - Point to Python's gettext module documentation for more details on
>   i18n'ing Python programs.  This should be a fairly stable url:
> 
>   http://www.python.org/doc/current/lib/module-gettext.html

I don't think it has this link, yet. But then, URL-style links are
infrequent in texinfo documentation. Instead, (python)gettext might be
a better link.

> - Make sure that the other GNU gettext tools recognize the docstring
>   flag, in whatever way is meaningful (I'm not sure what would be
>   useful or not... ;).

At a minimum, msgmerge should preserve them.

Regards,
Martin


From barry@zope.com  Mon Aug 13 06:58:36 2001
From: barry@zope.com (Barry A. Warsaw)
Date: Mon, 13 Aug 2001 01:58:36 -0400
Subject: [I18n-sig] Re: pygettext dilemma
References: <15200.64763.772001.53387@anthem.wooz.org>
 <oqvgjzfvwy.fsf@lin2.sram.qc.ca>
 <15223.19649.811672.585574@anthem.wooz.org>
 <200108130530.f7D5Upm00873@mira.informatik.hu-berlin.de>
Message-ID: <15223.27788.252177.636376@anthem.wooz.org>

>>>>> "MvL" == Martin v Loewis <martin@loewis.home.cs.tu-berlin.de> writes:

    >> I personally don't feel like it's that big a problem.  So far,
    >> in my experience the only docstrings that really need to be
    >> extracted are module docstrings in command line scripts.

    MvL> I disagree somewhat, but I also have a different application
    MvL> in mind. I do want to get translations for the doc strings of
    MvL> the standard library; in fact, that is what the python domain
    MvL> in the translation project has at the moment. The application
    MvL> here is that the help() function should present the
    MvL> translation of the doc string if available.

That's a good point (and would be neat!) but in that case, wouldn't
you want all the docstrings to be extracted?  I.e. you wouldn't want
to just extract some docstrings in a module, but not all?

-Barry


From barry@wooz.org  Mon Aug 13 07:01:17 2001
From: barry@wooz.org (Barry A. Warsaw)
Date: Mon, 13 Aug 2001 02:01:17 -0400
Subject: [I18n-sig] Re: pygettext.py extraction of docstrings
References: <14840.35473.307059.990479@anthem.concentric.net>
 <200010272228.AAA01066@loewis.home.cs.tu-berlin.de>
 <15113.29005.357449.812516@anthem.wooz.org>
 <shy9rmwm54.fsf@tux.gnu.franken.de>
 <15117.38438.361043.255768@anthem.wooz.org>
 <oqzo9bfxbo.fsf@lin2.sram.qc.ca>
 <15223.18129.372008.719610@anthem.wooz.org>
 <200108130535.f7D5ZHb00876@mira.informatik.hu-berlin.de>
Message-ID: <15223.27949.881781.720311@anthem.wooz.org>

>>>>> "MvL" == Martin v Loewis <martin@loewis.home.cs.tu-berlin.de> writes:

    >> - Mention the existence of pygettext.py for extracting
    >> translatable strings in Python.

    MvL> To my knowledge, Bruno just put a section in the gettext
    MvL> manual explaining gettext usage with various (programming)
    MvL> languages. The Python entry there does mention pygettext.py.

Ah cool, I was only looking at the online documentation at gnu.org,
which claims it's the 30-Apr-1998 edition (a bit out-dated, eh? :).

    >> - Point to Python's gettext module documentation for more
    >> details on i18n'ing Python programs.  This should be a fairly
    >> stable url:
    >> http://www.python.org/doc/current/lib/module-gettext.html

    MvL> I don't think it has this link, yet. But then, URL-style
    MvL> links are infrequent in texinfo documentation. Instead,
    MvL> (python)gettext might be a better link.

    >> - Make sure that the other GNU gettext tools recognize the
    >> docstring flag, in whatever way is meaningful (I'm not sure
    >> what would be useful or not... ;).

    MvL> At a minimum, msgmerge should preserve them.

Good, thanks.
-Barry


From martin@loewis.home.cs.tu-berlin.de  Mon Aug 13 08:05:30 2001
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Mon, 13 Aug 2001 09:05:30 +0200
Subject: [I18n-sig] Re: pygettext dilemma
In-Reply-To: <15223.27788.252177.636376@anthem.wooz.org> (barry@zope.com)
References: <15200.64763.772001.53387@anthem.wooz.org>
 <oqvgjzfvwy.fsf@lin2.sram.qc.ca>
 <15223.19649.811672.585574@anthem.wooz.org>
 <200108130530.f7D5Upm00873@mira.informatik.hu-berlin.de> <15223.27788.252177.636376@anthem.wooz.org>
Message-ID: <200108130705.f7D75UF01419@mira.informatik.hu-berlin.de>

> That's a good point (and would be neat!) but in that case, wouldn't
> you want all the docstrings to be extracted?  I.e. you wouldn't want
> to just extract some docstrings in a module, but not all?

Certainly, yes. In fact, I hacked François' xpot to find foo__doc__[]
strings in C sources also, since those doc strings are probably the
ones that people are most frequently confronted with.

Regards,
Martin


From martin@loewis.home.cs.tu-berlin.de  Mon Aug 13 08:02:48 2001
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Mon, 13 Aug 2001 09:02:48 +0200
Subject: [I18n-sig] Re: pygettext.py extraction of docstrings
In-Reply-To: <15223.27949.881781.720311@anthem.wooz.org> (barry@wooz.org)
References: <14840.35473.307059.990479@anthem.concentric.net>
 <200010272228.AAA01066@loewis.home.cs.tu-berlin.de>
 <15113.29005.357449.812516@anthem.wooz.org>
 <shy9rmwm54.fsf@tux.gnu.franken.de>
 <15117.38438.361043.255768@anthem.wooz.org>
 <oqzo9bfxbo.fsf@lin2.sram.qc.ca>
 <15223.18129.372008.719610@anthem.wooz.org>
 <200108130535.f7D5ZHb00876@mira.informatik.hu-berlin.de> <15223.27949.881781.720311@anthem.wooz.org>
Message-ID: <200108130702.f7D72mk01417@mira.informatik.hu-berlin.de>

> Ah cool, I was only looking at the online documentation at gnu.org,
> which claims it's the 30-Apr-1998 edition (a bit out-dated, eh? :).

Yes, maintenance of gnu.org always leaves a lot to be desired. That
aside, the changes I was talking about have not been released in
gettext, yet; we probably should work on updating gnu.org once the new
gettext manual is released.

Regards,
Martin


From pinard@iro.umontreal.ca  Mon Aug 13 13:10:56 2001
From: pinard@iro.umontreal.ca (François Pinard)
Date: 13 Aug 2001 08:10:56 -0400
Subject: [I18n-sig] Re: pygettext dilemma
In-Reply-To: <15223.19649.811672.585574@anthem.wooz.org>
References: <15200.64763.772001.53387@anthem.wooz.org>
 <oqvgjzfvwy.fsf@lin2.sram.qc.ca>
 <15223.19649.811672.585574@anthem.wooz.org>
Message-ID: <oqitfs5fe7.fsf@lin2.sram.qc.ca>

[Barry A. Warsaw]

> Indeed!  BTW, I18N Mailman is coming along very nicely now.  I hope
> the 2.1 release will happen within the next few months.

I have a few friends who are impatiently waiting for this release! :-)

> What I do in this situation is to temporarily bind _() to a no-op
> function so that the string is marked for extraction, but not
> translated in place.  E.g.

>     import gettext

>     def _(s):
>         return s

>     foo = _('extract this string but do not translate it yet')

>     _ = gettext.gettext

No hurt intended, of course; you should know me better :-).  Let me stress, in
a friendly way, that constructs like the above are ugly.  We should set up examples,
that people could follow, in which we rely on a single, common, widespread,
unvarying interpretation of _(TEXT), without having to look around each
time to see what it means, or set and reset its meaning.  The above is a
kludge that does not fit well with what I think is good Python style.

> This works perfectly because Python doesn't suffer from the same
> deficiencies as C (i.e. the C pre-processor :).

I quite understand that "it works", and yet it suffers much, both in
legibility and in simplicity.

>     |    ''"""TEXT"""    8-quoted     marked

> This has been brought up before, and I know that some people really
> like this approach.  I don't though, because 1) it is too magical; 2)
> the rules are arbitrary and hard to remember; 3) explicit is better
> than implicit.

As long as `pygettext.py' (or `xgettext' or `xpot') is involved, there is
some unavoidable magic somewhere.  Even _(TEXT) does not give much clue
to a newcomer about the mandatory extraction process.

About the idiom of prefixing a string with two quotes of the other kind,
I find it quite easy to explain and remember.
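
For readers unfamiliar with the idiom: Python joins adjacent string
literals at compile time, so an empty literal placed in front of a
string leaves its value unchanged while staying visible to a tokenizer.
The sketch below only illustrates how an extractor might spot such a
prefix.  The function name marked_strings is hypothetical, and nothing
here is claimed to be how pygettext.py or xpot actually work.

```python
import ast
import io
import tokenize

def marked_strings(source):
    """Return the values of string literals that directly follow an empty literal."""
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    found = []
    for prev, cur in zip(tokens, tokens[1:]):
        # An empty literal immediately followed by another literal is the marker.
        if (prev.type == tokenize.STRING and cur.type == tokenize.STRING
                and prev.string in ('""', "''")):
            found.append(ast.literal_eval(cur.string))
    return found

# Two assignments: the first uses the prefix idiom, the second does not.
sample = "\n".join([
    "name = \"\"'Traditional Chinese'",
    "plain = 'not marked'",
    "",
])
print(marked_strings(sample))
```

At runtime the marked value is just the ordinary string, so the prefix
marks a string for extraction but does not translate it.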

> Seeing something like an unadorned ""'Traditional Chinese' really
> gives no clue as to the purpose of this strange markup,

In my opinion, it is equally opaque to use _(TEXT) after having temporarily
redefined _() as the identity function.  It only acquires meaning to a
user after s/he learns about the extraction process, you just cannot make
it evident.  The explanation is unavoidable, anyway.  Redefining _() is a
formidable stunt.  Concatenating an empty string is much simpler and cleaner.

> Or, you can sometimes do something ugly like use explicit

>     __doc__ = _('Here is a module docstring')

> Not pretty, but also not common I think, so it doesn't concern me much.

Let's avoid being ugly, as far as we can.  Keep in mind that you are
opening a way, here, and setting up examples and methods that will stick,
and have an impact.  (One never knows.  When I started to use `_' instead
of explicit `gettext' calls, most people were reluctant, and told me
that it would break with so many C compilers that I should give up now;
Richard Stallman just refused to see GNU standards suggesting it; but I
used it nevertheless and for many packages, to the point it stuck somewhat;
nowadays, many languages spontaneously use conventions similar to it.)

My point is that you should look forward and a little beyond the immediate
needs.  Even if it does not concern you much, let's try to do well.

> I appreciate the suggestions Francois!  I think what we've got gives us
> the best approach for Python programs.

I would not want to crusade inordinately over this, and I'm not really trying
to punch _my_ own suggestions through.  Really not!  On the other hand,
I would like to convince you that temporarily overriding _(), or assigning
the __doc__ attribute directly, just _cannot_ be "the best approach".
We should do better than that.  My suggestion does better already, but
I see we do not agree on this, a bit sadly...  I surely do not mind if
someone comes up with something even better than what we both suggest, and do
hope it happens!  But we should at least come up with something as good.

                                        Keep happy!

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard


From keichwa@gmx.net  Mon Aug 13 07:40:16 2001
From: keichwa@gmx.net (Karl Eichwalder)
Date: Mon, 13 Aug 2001 08:40:16 +0200
Subject: [I18n-sig] Re: pygettext.py extraction of docstrings
In-Reply-To: <200108130535.f7D5ZHb00876@mira.informatik.hu-berlin.de>
 ("Martin v. Loewis"'s message of "Mon, 13 Aug 2001 07:35:17 +0200")
References: <14840.35473.307059.990479@anthem.concentric.net>
 <200010272228.AAA01066@loewis.home.cs.tu-berlin.de>
 <15113.29005.357449.812516@anthem.wooz.org>
 <shy9rmwm54.fsf@tux.gnu.franken.de>
 <15117.38438.361043.255768@anthem.wooz.org>
 <oqzo9bfxbo.fsf@lin2.sram.qc.ca>
 <15223.18129.372008.719610@anthem.wooz.org>
 <200108130535.f7D5ZHb00876@mira.informatik.hu-berlin.de>
Message-ID: <shhevcpinj.fsf@tux.gnu.franken.de>

"Martin v. Loewis" <martin@loewis.home.cs.tu-berlin.de> writes:

> I don't think it has this link, yet. But then, URL-style links are
> infrequent in texinfo documentation.

Yes, they are infrequent, but with the advent of Texinfo 4.x those
references are perfectly okay; search for 'uref', please.

> Instead, (python)gettext might be a better link.

Just provide both links.

-- 
ke@suse.de (work) / keichwa@gmx.net (home):              |
http://www.suse.de/~ke/                                  |      ,__o
Free Translation Project:                                |    _-\_<,
http://www.iro.umontreal.ca/contrib/po/HTML/             |   (*)/'(*)


From haible@ilog.fr  Tue Aug 14 18:18:27 2001
From: haible@ilog.fr (Bruno Haible)
Date: Tue, 14 Aug 2001 19:18:27 +0200 (CEST)
Subject: [I18n-sig] Re: pygettext.py extraction of docstrings
In-Reply-To: <15223.18129.372008.719610@anthem.wooz.org>
References: <14840.35473.307059.990479@anthem.concentric.net>
 <200010272228.AAA01066@loewis.home.cs.tu-berlin.de>
 <15113.29005.357449.812516@anthem.wooz.org>
 <shy9rmwm54.fsf@tux.gnu.franken.de>
 <15117.38438.361043.255768@anthem.wooz.org>
 <oqzo9bfxbo.fsf@lin2.sram.qc.ca>
 <15223.18129.372008.719610@anthem.wooz.org>
Message-ID: <15225.23907.685771.255536@honolulu.ilog.fr>

Barry A. Warsaw writes:
> A while back I was convinced to switch the `docstring' flag to #, for
> pygettext. ... It probably only has
> meaning in Python, but may be useful in other scripting languages.
> Think of it roughly equivalent to Emacs-Lisp docstrings (in fact,
> they were the inspiration for Python docstrings back in '94 at the
> 1st Python workshop!)

Well, Common Lisp has had docstrings long before Emacs-Lisp and
Python. Their purpose is to have documentation available for the
programmer, in a running session, regardless where each class or
function came from.

Now, why do you want to translate them?

As gettext maintainer, I'm used to thinking in the categories of
programmer - translator - user. Translated docstrings are not for the
users, because users are not programmers in general. And the
programmers (of .py programs), who must have looked at the various
Python manuals, certainly read English.

So, as I see it,

  - translated docstrings have a much smaller audience than
    usual translated messages,

  - translated docstring users could also use the untranslated
    English docstrings,

  - docstrings are harder to translate, because the translator
    needs to have programmer's know-how.

Therefore I think that docstring translation is a separate process
from usual translations, and should use different .po files.

As a consequence for gettext, I could live with an xgettext option
--docstrings which extracts *only* the docstrings of a set of source
files.

> Perhaps Bruno can add some information on pygettext.py in
> the GNU gettext manual?

The GNU gettext tools are currently being modified to handle various
programming languages. A new flag 'python-format' is being
introduced, with appropriate format string checking in 'msgfmt'.
xgettext will also have a Python backend, making pygettext obsolete
(except for docstring extraction, for the time being).

Bruno


From martin@loewis.home.cs.tu-berlin.de  Tue Aug 14 20:17:22 2001
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Tue, 14 Aug 2001 21:17:22 +0200
Subject: [I18n-sig] Re: pygettext.py extraction of docstrings
In-Reply-To: <15225.23907.685771.255536@honolulu.ilog.fr> (message from Bruno
 Haible on Tue, 14 Aug 2001 19:18:27 +0200 (CEST))
References: <14840.35473.307059.990479@anthem.concentric.net>
 <200010272228.AAA01066@loewis.home.cs.tu-berlin.de>
 <15113.29005.357449.812516@anthem.wooz.org>
 <shy9rmwm54.fsf@tux.gnu.franken.de>
 <15117.38438.361043.255768@anthem.wooz.org>
 <oqzo9bfxbo.fsf@lin2.sram.qc.ca>
 <15223.18129.372008.719610@anthem.wooz.org> <15225.23907.685771.255536@honolulu.ilog.fr>
Message-ID: <200108141917.f7EJHMp02235@mira.informatik.hu-berlin.de>

> As gettext maintainer, I'm used to thinking in the categories of
> programmer - translator - user. Translated docstrings are not for
> the users, because users are not programmers in general. And the
> programmers (of .py programs), who must have looked at the various
> Python manuals, certainly read English.

This is a wrong assumption; people writing programs in Python do not
necessarily read English fluently (let alone speak it). I assume
the same is true for any other "scripting" language. E.g. for Ruby,
much of the language documentation is in Japanese, since most of the
Ruby users prefer to read Japanese documentation. Likewise, the French
translation of the Python documentation was started precisely because
users don't read English that well.

Even among my colleagues, I find that they often mis-interpret English
documentation, and get the fine points only when pointed to them, and
after looking up certain keywords in a dictionary. They would not have
the same problems if the documentation was available in German.

So in your categories, these people are certainly users - of Python,
in the specific case.

>   - translated docstrings have a much smaller audience than
>     usual translated messages,

In addition to the above, I think you are missing an important detail
of Python's introspectiveness: Many Python applications present
docstrings to the user, instead of using them for documentation, by
means of accessing some object's __doc__ attribute at
runtime. E.g. you might have a drop-down menu, each item invoking a
different function. Then somebody might choose to key the online help
into the docstring. It is somewhat hackish, but common.
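
A minimal sketch of that pattern, assuming a hypothetical 'myapp'
message domain and a 'locale' directory (neither comes from this
thread): each menu action is a function, and its docstring is looked up
in the catalog before being displayed as help text.

```python
import gettext
import inspect

# fallback=True returns a NullTranslations object when no catalog is
# found, so the sketch still runs without any .mo file installed.
translation = gettext.translation('myapp', localedir='locale', fallback=True)

def open_file():
    """Open an existing file."""

def save_file():
    """Save the current file."""

MENU = [open_file, save_file]

def menu_help(action):
    """Return the (possibly translated) docstring of a menu action."""
    doc = inspect.getdoc(action)
    if not doc:
        # Never look up '': in a real catalog that msgid maps to the header.
        return ''
    return translation.gettext(doc)

for action in MENU:
    print(action.__name__, '->', menu_help(action))
```

This only works if the extraction tool put the docstrings into the
catalog in the first place, and if the msgid matches the docstring text
as retrieved at runtime (inspect.getdoc strips indentation, which
matters for multi-line docstrings).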

>   - docstrings are harder to translate, because the translator
>     needs to have programmer's know-how.

For the original purpose of docstrings, yes, certainly.

> As a consequence for gettext, I could live with an xgettext option
> --docstrings which extracts *only* the docstrings of a set of source
> files.

Again, for the application I have in mind (providing online help in
the programming process), that is acceptable. I think for Barry's
application, it is not.

> The GNU gettext tools are currently being modified to handle various
> programming languages. A new flag 'python-format' is being
> introduced, with appropriate format string checking in 'msgfmt'.
> xgettext will also have a Python backend, making pygettext obsolete
> (except for docstring extraction, for the time being).

It turns out that there is a "batteries included" issue here. I know a
few cases where people have been using pygettext just because it was
already on their (Windows) system, whereas GNU gettext was not that
readily available (you'd need a C compiler to build it). So while most
Unix people will switch to GNU gettext for performance reasons
(pygettext is slow), I doubt that pygettext will go away anytime soon.

Regards,
Martin


From Misha.Wolf@reuters.com  Tue Aug 14 21:35:44 2001
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Tue, 14 Aug 2001 21:35:44 +0100
Subject: [I18n-sig] 19th Unicode Conference, Sep 2001, San Jose, CA -- Register now!
Message-ID: <T555f6ba927c407b706488@reuters.com>

           Nineteenth International Unicode Conference (IUC19)
               Unicode and the Web: The Global Connection
                    http://www.unicode.org/iuc/iuc19
                         September 10-14, 2001
                           San Jose, CA, USA
                         >>  Register now!  <<

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

NEWS

>> Hotel guest room group rate extended to August 31.

>> Early Bird registration rate extended to August 31.

>> Visit the Conference Web site ( http://www.unicode.org/iuc/iuc19 )
   to check the updated Conference program and register.  To help you
   choose Conference sessions, we've included abstracts of talks and
   speakers' biographies.

CONFERENCE SPONSORS

   Agfa Monotype Corporation
   Basis Technology Corporation
   Lionbridge Technologies
   Microsoft Corporation
   Netscape Communications
   Oracle Corporation
   PeopleSoft, Inc.
   Reuters Ltd.
   Sun Microsystems, Inc.
   Trados Corporation
   Trigeminal Software, Inc.
   World Wide Web Consortium (W3C)
   Wrox Press

CONFERENCE VENUE

   DoubleTree Hotel San Jose
   2050 Gateway Place
   San Jose, CA 95110
   USA

   Tel: +1 408 453 4000
   Fax: +1 408 437 2898

GLOBAL COMPUTING SHOWCASE

   Visit the Showcase to find out more about products supporting the
   Unicode Standard, and products and services that can help you
   globalize/localize your software, documentation and Internet content.
   For details, visit the Conference Web site:
     http://www.unicode.org/iuc/iuc19

   Exhibitors to date include:
   * Agfa Monotype Corporation
   * Basis Technology Corporation
   * Everlasting Systems Ltd.
   * Multilingual Computing, Inc.
   * Oracle Corporation
   * Rasmussen Software, Inc.
   * Sun Microsystems, Inc.
   * Segue Software
   * Sybase, Inc.
   * Symbio Group
   * Trados Corporation

CONFERENCE MANAGEMENT

   Global Meeting Services Inc.
   4360 Benhurst Avenue
   San Diego, CA 92122, USA

   Tel: +1 858 638 0206 (voice)
        +1 858 638 0504 (fax)

   Email: info@global-conference.com
      or: conference@unicode.org

                             *  *  *  *  *

Unicode(r) and the Unicode logo are registered trademarks of Unicode,
Inc.  Used with permission.


From haible@ilog.fr  Tue Aug 14 21:51:50 2001
From: haible@ilog.fr (Bruno Haible)
Date: Tue, 14 Aug 2001 22:51:50 +0200 (CEST)
Subject: [I18n-sig] Re: pygettext.py extraction of docstrings
In-Reply-To: <200108141917.f7EJHMp02235@mira.informatik.hu-berlin.de>
References: <14840.35473.307059.990479@anthem.concentric.net>
 <200010272228.AAA01066@loewis.home.cs.tu-berlin.de>
 <15113.29005.357449.812516@anthem.wooz.org>
 <shy9rmwm54.fsf@tux.gnu.franken.de>
 <15117.38438.361043.255768@anthem.wooz.org>
 <oqzo9bfxbo.fsf@lin2.sram.qc.ca>
 <15223.18129.372008.719610@anthem.wooz.org>
 <15225.23907.685771.255536@honolulu.ilog.fr>
 <200108141917.f7EJHMp02235@mira.informatik.hu-berlin.de>
Message-ID: <15225.36710.997827.802409@honolulu.ilog.fr>

Martin v. Loewis writes:
> people writing programs in Python not
> necessarily read fluently English ... Likewise, the French
> translation of the Python documentation was started precisely because
> users don't read English that well.

OK, let me formulate it less strictly: Programmer's documentation is
usually translated to much fewer languages (Japanese, French and very
few others), because the amount of text to translate is quite large
and the translator must have programmer's know-how.

> Many Python applications present
> docstrings to the user, instead of using them for documentation, by
> means of accessing some object's __doc__ attribute at
> runtime. E.g. you might have a drop-down menu, each item invoking a
> different function. Then somebody might choose to key the online help
> into the docstring. It is somewhat hackish, but common.

It is pure Lisp introspection tradition :-) But nevertheless, it
presents a problem: How can the translator know which docstrings are
important to translate for the end user, and which are not?

The danger is that a translator for Finnish, Turkish or Romanian,
without deep programming knowledge, will spend a lot of his time
translating programmer's documentation, which won't help the end users
of his country. There are not many translators for these languages; we
shouldn't abuse them.

IMO, those __doc__ strings that are used at runtime should be
explicitly marked as translatable by the programmer, to avoid excess
work by the translator. The way you mark them doesn't really matter; it
can be a tag in a comment, or something else that triggers xgettext
extraction.
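
As an illustration of the "tag in a comment" variant, here is a rough
sketch of how an extractor could select only the tagged docstrings.  It
assumes a hypothetical marker comment "# translate" on the line just
before the docstring; the marker text and the function name are
inventions for this example, not an existing pygettext.py or xgettext
feature.  Python's tokenize module does report comments, which is what
makes the approach possible.

```python
import ast
import io
import tokenize

MARKER = '# translate'   # hypothetical tag, chosen for this sketch only

def marked_docstrings(source):
    """Return docstrings whose preceding line is a MARKER comment."""
    # tokenize reports comments, which the ast module discards.
    marker_lines = {
        tok.start[0]
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.COMMENT and tok.string.strip() == MARKER
    }
    owners = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, owners) or not node.body:
            continue
        first = node.body[0]
        is_docstring = (isinstance(first, ast.Expr)
                        and isinstance(first.value, ast.Constant)
                        and isinstance(first.value.value, str))
        # On Python 3.8+, lineno is the line where the string starts.
        if is_docstring and first.lineno - 1 in marker_lines:
            found.append(first.value.value)
    return found

example = '''\
def shown_in_menu():
    # translate
    """Open the selected file."""

def internal_helper():
    """Walk the parse tree and collect nodes."""
'''
print(marked_docstrings(example))
```

The same machinery would serve an opt-out tag such as "# notranslate"
by inverting the final test.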

> It turns out that there is a "batteries included" issue here. I know a
> few cases where people have been using pygettext just because it was
> already on their (Windows) system, whereas GNU gettext was not that
> readily available (you'd need a C compiler to build it).

You can point these people to the http://gnuwin32.sourceforge.net/
site which has gettext binaries for Win32 ready for download.

Bruno


From barry@zope.com  Wed Aug 15 04:53:11 2001
From: barry@zope.com (Barry A. Warsaw)
Date: Tue, 14 Aug 2001 23:53:11 -0400
Subject: [I18n-sig] Re: pygettext dilemma
References: <15200.64763.772001.53387@anthem.wooz.org>
 <oqvgjzfvwy.fsf@lin2.sram.qc.ca>
 <15223.19649.811672.585574@anthem.wooz.org>
 <oqitfs5fe7.fsf@lin2.sram.qc.ca>
Message-ID: <15225.61991.262726.993033@anthem.wooz.org>

>>>>> "FP" == François Pinard <pinard@iro.umontreal.ca> writes:

    >> Indeed!  BTW, I18N Mailman is coming along very nicely now.  I
    >> hope the 2.1 release will happen within the next few months.

    FP> I have a few friends who are impatiently waiting for this
    FP> release! :-)

Soon, soon!

    >> What I do in this situation is to temporarily bind _() to a
    >> no-op function so that the string is marked for extraction, but
    >> not translated in place.  E.g.

    >> import gettext

    >> def _(s): return s

    >> foo = _('extract this string but do not translate it yet')

    >> _ = gettext.gettext

    FP> No hurt intended, of course; you should know me better :-).
    FP> Let me stress, in a friendly way, that constructs like the
    FP> above are ugly.
    FP> We should set up examples, that people could follow, in which
    FP> we rely on a single, common, widespread, unvarying
    FP> interpretation of _(TEXT), without having to look around each
    FP> time to see what it means, or set and reset its meaning.  The
    FP> above is a kludge that does not fit well with what I think is
    FP> good Python style.

No hurt taken, but I'll respectfully disagree. :) I think it's fine
Python style for deferring translation, and not confusing at all
because it is almost always localized around the site of the deferral.
But contrary to Tim's license plate, there /is/ more than one way to
do it. :)

pygettext.py supports a -k/--keyword flag, similar to xgettext, which
expands the list of function names marking translatable strings.
IIRC, gettext suggests binding N_() to gettext_noop() and then
extracting any string wrapped in N_().  So, if you prefer, you can
rewrite my example above to be:

    from gettext import gettext as _

    def N_(s): return s

    foo = N_('extract this string but do not translate it yet')

and then run pygettext.py with --keyword=N_

Hmm, maybe we should add "N_" as one of the default keywords?

That points out a general philosophy I have that pygettext.py should
mimic xgettext as much as makes sense for the difference between C and
Python.  In this case _() works great for most at-site translation
markings, but for the very few that must be deferred, either the
rebind hack or the N_() marking should suffice.
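
To make the deferred case concrete, here is a small sketch of the N_()
style.  The 'myapp' domain, the 'locale' directory, and the table and
function names are assumptions made up for the example.  Strings are
marked when the module is loaded but translated only when displayed.

```python
import gettext

# fallback=True keeps the sketch runnable when no catalog is installed.
_ = gettext.translation('myapp', localedir='locale', fallback=True).gettext

def N_(message):
    """Mark message for extraction; return it untranslated."""
    return message

# Built at import time, possibly before the user's language is known.
STATUS_LABELS = {
    'ok': N_('Everything is fine'),
    'fail': N_('Something went wrong'),
}

def describe(status):
    # Translate at display time, using the original string as the msgid.
    return _(STATUS_LABELS[status])

print(describe('ok'))
```

Running pygettext.py with --keyword=N_ over such a module should pick
up both labels, while the single _() call in describe() contributes
nothing because its argument is not a literal.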

    >> This works perfectly because Python doesn't suffer from the
    >> same deficiencies as C (i.e. the C pre-processor :).

    FP> I quite understand that "it works", but yet, it much suffers,
    FP> both on the side of legibility and simplicity.

Again, I must respectfully disagree!

    >> | ''"""TEXT""" 8-quoted marked

    >> This has been brought up before, and I know that some people
    >> really like this approach.  I don't though, because 1) it is
    >> too magical; 2) the rules are arbitrary and hard to remember;
    >> 3) explicit is better than implicit.

    FP> As long as `pygettext.py' (or `xgettext' or `xpot') is involved,
    FP> there is some unavoidable magic somewhere.  Even _(TEXT) does
    FP> not give much clue to a newcomer about the mandatory
    FP> extraction process.

This is true.  But it's still clearer that there is /some/ reason for
marking the string with _() because you can quickly trace your way
back to gettext.gettext() and then it's obvious <wink> the connection
to the runtime translation process if not the extraction process.

Which leads me to another question: are you saying that ''"""Text"""
should be used for both the runtime translating and the extraction
marking?  If so, I don't see how that could work.  Even if you could
make it work, I still much prefer to have a Real Python Function do the
runtime translation.  An example of why is what I really do in
Mailman...

Say I have the following string that needs to be translated:

    _('No such list %s found on host %s') % (listname, hostname)

Now we all know that this won't do as a source string because some
languages may change the order of the variables, so we
really need to write the string like so:

    _('No such list %(listname)s found on host %(hostname)s') % {
        'listname': listname,
        'hostname': hostname
        }

I've found this style to be quite pervasive, but also extremely (and
unnecessarily) repetitive.  Notice that I've typed "listname" and
"hostname" a total of six times.  Wouldn't it be wonderful if I only
needed to type them once:

    _('No such list %(listname)s found on host %(hostname)s')

?  Yes, it's great because -- to me -- I'm trading a modicum of
specialness for a huge raft of simplicity and legibility.  It really
does make the code easier to read, I claim (although it would be
interesting to know what others who have hacked on the Mailman 2.1
code think).  How do I make this work?

The trick is that the function _() isn't gettext.gettext() but a
wrapper around that library function that's unique to Mailman.  In
fact, you won't see many "import gettext"'s in the Mailman code, but
you will see lots of "from Mailman.i18n import _".

My _() actually uses sys._getframe() -- where available -- to get the
locals and globals one stack frame up from the _() frame, and then
automatically interpolates that dictionary into the translatable
string.  Is that magic?  Yes, a bit, but it's magic that is easily
revealed by finding the import, and viewing the Mailman.i18n module.
And once learned, I claim that it's immediately ingrained and needn't
be learned again.

But you might disagree, and use the more verbose approach for your
app.  No problem there!  Having a function call that can be
specialized in the Pythonic way serves both purposes well.

    FP> About the idiom of prefixing a string with two quotes of the
    FP> other kind, I find it quite easy to explain and remember.

I had to really think about the rule, as opposed to the example, in
your original message.  I think your rule goes: prepend the string you
want to extract with an empty string quoted with the alternative
quoting characters from the string you want to extract.  Or something
like that. :)

But there is another problem: for some fonts in some IDE's it can be
challenging to discern ' from " or even ` and having something like
""'''...''' makes it even more difficult to visually pick out.

    >> Seeing something like an unadorned ""'Traditional Chinese'
    >> really gives no clue as to the purpose of this strange markup,

    FP> In my opinion, this is equally opaque to use _(TEXT) after
    FP> having temporarily redefined _() as the identity function.  It
    FP> only acquires meaning to a user after s/he learns about the
    FP> extraction process, you just cannot make it evident.  The
    FP> explanation is unavoidable, anyway.  Redefining _() is a
    FP> formidable stunt.  Concatenating an empty string is much
    FP> simpler and cleaner.

Let me see if I can sum up my objection: you have to use a function
call anyway to do the actual runtime translation.  Since at-site
translations will be the overwhelming majority of examples, so will
_() markings.  For all those cases, you won't need
empty-string-concatenation anyway.  For the handful of cases where you
need to defer translation, I prefer using a technique as similar to
the common way as possible, instead of introducing an entirely
different convention.

But I wouldn't cry foul if you encouraged N_() markings for deferred
translations.
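
A rough sketch of what N_() marking for deferred translation could look
like.  The catalog dictionary and the describe() function are
hypothetical, and N_ would have to be passed to the extractor as an
extra keyword (for example with -k) for the marked strings to be
picked up:

```python
def N_(s):
    # Marker only: returns the string unchanged so an extractor can
    # find it, while translation is deferred until later.
    return s


# Hypothetical one-entry catalog standing in for a real .mo file.
CATALOG = {'Traditional Chinese': 'chinois traditionnel'}


def _(s):
    # Runtime translation: fall back to the original when no entry exists.
    return CATALOG.get(s, s)


# Marked at definition time, still untranslated here.
LANGUAGES = [N_('Traditional Chinese'), N_('Japanese')]


def describe(names):
    # Translated at the point of use, with a variable rather than a literal.
    return [_(name) for name in names]
```

With this arrangement _() keeps its usual meaning everywhere and never
needs to be rebound.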

    >> Or, you can sometimes do something ugly like use explicit

    >> __doc__ = _('Here is a module docstring')

    >> Not pretty, but also not common I think, so it doesn't concern
    >> me much.

    FP> Let's avoid being ugly, as far as we can.  Keep in mind that
    FP> you are opening a way, here, and setting up examples and
    FP> methods that will stick, and have incidence.  (One never
    FP> knows.  When I started to use `_' instead of explicit
    FP> `gettext' calls, most people were reluctant, and told me that
    FP> it was to break with so many C compilers that I should give up
    FP> now; Richard Stallman just refused to see GNU standards
    FP> suggesting it; but I used it nevertheless and for many
    FP> packages, to the point it stuck somewhat; nowadays, many
    FP> languages spontaneously use conventions similar to it.)

And I think it's a wonderful convention!  I'm glad you came up with
it, and I happily adopted it for Python.  It's beautiful. :)

I won't disagree that the __doc__ hack is ugly.  The more I think
about it, I think a magic comment in front of the docstring is the way
to go.  I'm not yet sure whether something like

    # noextract
    '''This is a docstring that need not be translated.'''

or

    # extract
    '''This is a docstring that should be translated.'''

is better, or whether there's some other better comment keyword to
use.  This would be worth experimenting with a bit.

    FP> My point is that you should look forward and a little beyond
    FP> the immediate needs.  Even if it does not concern you much, let's
    FP> try to do well.

Agreed!

    >> I appreciate the suggestions Francois!  I think what we've got
    >> gives us the best approach for Python programs.

    FP> I would not want to crusade inordinately over this, and I'm
    FP> not really trying to punch _my_ own suggestions through.
    FP> Really not!  On the other hand, I would like to convince you
    FP> that temporarily overriding _(), or assigning the __doc__
    FP> attribute directly, just _cannot_ be "the best approach".

Let's not conflate what we're talking about.  One situation is
deferred translation, the other is docstring extraction marking.  For
the former, I'm completely happy with rebinding _(), although I
wouldn't squawk if you pushed for N_() <wink>.  For the latter, I
agree that explicit __doc__ binding is gross and we should avoid
it.  Here, I think the special comment is the way to go, but I'm not
sure about the details.  Please let's keep these two issues separate!

    FP> We should do better than that.  My suggestion does better
    FP> already, but I see we do not agree on this, a bit sadly...  I
    FP> surely do not mind if someone comes with something even better
    FP> than what we both suggest, and do hope it happens!  But we
    FP> should at least come with something as good.

A good, lively debate.  Thanks!

Cheers,
-Barry


From barry@wooz.org  Wed Aug 15 05:10:24 2001
From: barry@wooz.org (Barry A. Warsaw)
Date: Wed, 15 Aug 2001 00:10:24 -0400
Subject: [I18n-sig] Re: pygettext.py extraction of docstrings
References: <14840.35473.307059.990479@anthem.concentric.net>
 <200010272228.AAA01066@loewis.home.cs.tu-berlin.de>
 <15113.29005.357449.812516@anthem.wooz.org>
 <shy9rmwm54.fsf@tux.gnu.franken.de>
 <15117.38438.361043.255768@anthem.wooz.org>
 <oqzo9bfxbo.fsf@lin2.sram.qc.ca>
 <15223.18129.372008.719610@anthem.wooz.org>
 <15225.23907.685771.255536@honolulu.ilog.fr>
Message-ID: <15225.63024.291621.844755@anthem.wooz.org>

>>>>> "BH" == Bruno Haible <haible@ilog.fr> writes:

    BH> Well, Common Lisp has had docstrings long before Emacs-Lisp
    BH> and Python. Their purpose is to have documentation available
    BH> for the programmer, in a running session, regardless where
    BH> each class or function came from.

    BH> Now, why do you want to translate them?

In my Python experience, there is one common situation where you want
to translate docstrings.  Note that in Python, unlike (I believe) in
*Lisp, docstrings can be attached to objects other than functions.
It's common to have both module and class docstrings.  Again, IME
class docstrings serve similar audiences to function/method
docstrings, i.e. the programmer.  It is a common idiom in Python to
use module docstrings as usage text for command line scripts, and
those are definitely intended for the end user, and must be
translated.

I've had occasion to use class docstrings as strings for the user too,
although I won't claim that's wonderful Python style.

But as Martin points out, a case can be made for translating even
class, function, and method docstrings.  Think of the situation where
manuals are automatically extracted from source code, a la Javadoc.  I
believe you'd want those strings to be extracted into the catalog.

    BH> As a consequence for gettext, I could live with an xgettext
    BH> option --docstrings which extracts *only* the docstrings of a
    BH> set of source files.

I made the semantics for pygettext.py's --docstrings/-D option to
extract /also/ the docstrings because the older version of msgmerge I
am using can't merge a docstring-only catalog with a normal-string
catalog in a reasonable way (I tried).  And as stated above, module
docstrings can serve exactly the same audience as other translatable
strings, i.e. the end user, so they should be in the same catalog.

But I was also forced to add a very inelegant -X/--exclude-file switch
which suppresses docstring extraction for the listed files.  While that
served my purpose, it's a gross hack, and not just because it doesn't
provide the necessary granularity.

More productive I think would be for us to agree on a convention for
extracting docstrings that doesn't require both -D and -X.  Here are
two strawmen:

1) pygettext.py and xgettext never extract unmarked docstrings unless
   the -D/--docstrings option is given.  If -D is given then all
   unmarked docstrings are extracted along with all other normally
   marked text, unless the unmarked docstring is immediately preceded
   by a comment with the word "notranslate" as the first word in the
   comment.  All other words in the comment are ignored.

2) pygettext.py and xgettext never extract unmarked docstrings unless
   they are immediately preceded by a comment with the word
   "translate" as the first word in the comment.  All other words in
   the comment are ignored.

Feel free to knock these down. :)
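
As an illustration of strawman 2, here is a small sketch that collects
only the docstrings whose preceding line is a comment starting with the
word "translate".  It is not pygettext.py code; the function name, the
keyword parameter, and the sample source are assumptions made for the
example:

```python
import ast

# Node types that can carry a docstring.
DOC_NODES = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)


def marked_docstrings(source, keyword='translate'):
    """Return docstrings immediately preceded by a '# keyword' comment."""
    lines = source.splitlines()
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, DOC_NODES) or not node.body:
            continue
        first = node.body[0]
        # A docstring is a bare string expression in first position.
        if not (isinstance(first, ast.Expr)
                and isinstance(first.value, ast.Constant)
                and isinstance(first.value.value, str)):
            continue
        if first.lineno < 2:
            continue  # nothing can precede a string on line 1
        previous = lines[first.lineno - 2].strip()
        if previous.startswith('#'):
            words = previous.lstrip('#').split()
            # Only the first word counts; the rest of the comment is ignored.
            if words and words[0] == keyword:
                found.append(first.value.value)
    return found


SAMPLE = '''\
# translate
"""Usage: frob [options]"""

class Frob:
    """Programmer-facing notes."""

    def run(self):
        # translate (shown to the end user)
        """Running the frobnicator."""
'''
```

Strawman 1 would be the mirror image: collect every docstring and skip
those whose preceding comment starts with "notranslate".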

    >> Perhaps Bruno can add some information on pygettext.py in the
    >> GNU gettext manual?

    BH> The GNU gettext tools are currently being modified to handle
    BH> various programming languages. A new flag 'python-format' is
    BH> being introduced, with appropriate format string checking in
    BH> 'msgfmt'.

I'm not sure exactly what this means.  Can you give a bit more detail?
    
    BH> xgettext will also have a Python backend, making pygettext
    BH> obsolete (except for docstring extraction, for the time
    BH> being).

That'd be great.  It'll be even cooler if we can agree on a convention
for docstring extraction!

BTW, here's the current set of switches for pygettext.py.  Do you see
any glaring incompatibilities with your latest xgettext?

Cheers,
-Barry

-------------------- snip snip --------------------
Usage: pygettext [options] inputfile ...

Options:

    -a
    --extract-all
        Extract all strings.

    -d name
    --default-domain=name
        Rename the default output file from messages.pot to name.pot.

    -E
    --escape
        Replace non-ASCII characters with octal escape sequences.

    -D
    --docstrings
        Extract module, class, method, and function docstrings.  These do not
        need to be wrapped in _() markers, and in fact cannot be for Python to
        consider them docstrings. (See also the -X option).

    -h
    --help
        Print this help message and exit.

    -k word
    --keyword=word
        Keywords to look for in addition to the default set, which are:
        %(DEFAULTKEYWORDS)s

        You can have multiple -k flags on the command line.

    -K
    --no-default-keywords
        Disable the default set of keywords (see above).  Any keywords
        explicitly added with the -k/--keyword option are still recognized.

    --no-location
        Do not write filename/lineno location comments.

    -n
    --add-location
        Write filename/lineno location comments indicating where each
        extracted string is found in the source.  These lines appear before
        each msgid.  The style of comments is controlled by the -S/--style
        option.  This is the default.

    -o filename
    --output=filename
        Rename the default output file from messages.pot to filename.  If
        filename is `-' then the output is sent to standard out.

    -p dir
    --output-dir=dir
        Output files will be placed in directory dir.

    -S stylename
    --style stylename
        Specify which style to use for location comments.  Two styles are
        supported:

        Solaris  # File: filename, line: line-number
        GNU      #: filename:line

        The style name is case insensitive.  GNU style is the default.

    -v
    --verbose
        Print the names of the files being processed.

    -V
    --version
        Print the version of pygettext and exit.

    -w columns
    --width=columns
        Set width of output to columns.

    -x filename
    --exclude-file=filename
        Specify a file that contains a list of strings that are not to be
        extracted from the input files.  Each string to be excluded must
        appear on a line by itself in the file.

    -X filename
    --no-docstrings=filename
        Specify a file that contains a list of files (one per line) that
        should not have their docstrings extracted.  This is only useful in
        conjunction with the -D option above.

If `inputfile' is -, standard input is read.


From barry@wooz.org  Wed Aug 15 05:15:14 2001
From: barry@wooz.org (Barry A. Warsaw)
Date: Wed, 15 Aug 2001 00:15:14 -0400
Subject: [I18n-sig] Re: pygettext.py extraction of docstrings
References: <14840.35473.307059.990479@anthem.concentric.net>
 <200010272228.AAA01066@loewis.home.cs.tu-berlin.de>
 <15113.29005.357449.812516@anthem.wooz.org>
 <shy9rmwm54.fsf@tux.gnu.franken.de>
 <15117.38438.361043.255768@anthem.wooz.org>
 <oqzo9bfxbo.fsf@lin2.sram.qc.ca>
 <15223.18129.372008.719610@anthem.wooz.org>
 <15225.23907.685771.255536@honolulu.ilog.fr>
 <200108141917.f7EJHMp02235@mira.informatik.hu-berlin.de>
Message-ID: <15225.63314.692441.773936@anthem.wooz.org>

>>>>> "MvL" == Martin v Loewis <martin@loewis.home.cs.tu-berlin.de> writes:

    >> As a consequence for gettext, I could live with an xgettext
    >> option --docstrings which extracts *only* the docstrings of a
    >> set of source files.

    MvL> Again, for the application I have in mind (providing online
    MvL> help in the programming process), that is acceptable. I think
    MvL> for Barry's application, it is not.

Correct, as described in my other message.

    >> The GNU gettext tools are currently being modified to handle
    >> various programming languages. A new flag 'python-format' is
    >> being introduced, with appropriate format string checking in
    >> 'msgfmt'.  xgettext will also have a Python backend, making
    >> pygettext obsolete (except for docstring extraction, for the
    >> time being).

    MvL> It turns out that there is a "batteries included" issue
    MvL> here. I know a few cases where people have been using
    MvL> pygettext just because it was already on their (Windows)
    MvL> system, whereas GNU gettext was not that readily available
    MvL> (you'd need a C compiler to build it). So while most Unix
    MvL> people will switch to GNU gettext for performance reasons
    MvL> (pygettext is slow), I doubt that pygettext will go away
    MvL> anytime soon.

I agree.  That's a good reason why Python also comes with its own
msgfmt.py script (Side note: thank you thank you thank you for
documenting .mo and .po file formats!  What a pain it was to reverse
engineer the undocumented Solaris formats. :).

At the very least, we should make sure that xgettext will sufficiently
fulfill the needs of Python programmers.  It's easier for us to
prototype the Python idiosyncrasies in pygettext, and then describe
our experiences so xgettext can support them.  Also, the more we can
agree on common conventions now, the better in the long run.

Cheers,
-Barry


From barry@wooz.org  Wed Aug 15 05:19:49 2001
From: barry@wooz.org (Barry A. Warsaw)
Date: Wed, 15 Aug 2001 00:19:49 -0400
Subject: [I18n-sig] Re: pygettext.py extraction of docstrings
References: <14840.35473.307059.990479@anthem.concentric.net>
 <200010272228.AAA01066@loewis.home.cs.tu-berlin.de>
 <15113.29005.357449.812516@anthem.wooz.org>
 <shy9rmwm54.fsf@tux.gnu.franken.de>
 <15117.38438.361043.255768@anthem.wooz.org>
 <oqzo9bfxbo.fsf@lin2.sram.qc.ca>
 <15223.18129.372008.719610@anthem.wooz.org>
 <15225.23907.685771.255536@honolulu.ilog.fr>
 <200108141917.f7EJHMp02235@mira.informatik.hu-berlin.de>
 <15225.36710.997827.802409@honolulu.ilog.fr>
Message-ID: <15225.63589.862675.901114@anthem.wooz.org>

>>>>> "BH" == Bruno Haible <haible@ilog.fr> writes:

    BH> The danger is that a translator for Finnish, Turkish or
    BH> Romanian, without deep programming knowledge, will spend a lot
    BH> of his time translating programmer's documentation, which
    BH> won't help the end users of his country. There are not many
    BH> translators for these languages; we shouldn't abuse them.

    BH> IMO, those __doc__ strings that are used at runtime should be
    BH> explicitly marked as translatable by the programmer, to avoid
    BH> excess work by the translator. The way you mark them doesn't
    BH> really matter; it can be a tag in a comment, or something else
    BH> that triggers xgettext extraction.

I think you're on the right track, but I think that Martin's and my
applications show that we probably need to cover these two situations:

1) No docstrings are extracted unless they are preceded by a magic
   "extract" comment.

2) All docstrings are extracted unless they are preceded by a magic
   "noextract" comment.

    BH> You can point these people to the
    BH> http://gnuwin32.sourceforge.net/ site which has gettext
    BH> binaries for Win32 ready for download.

Cool, good to know, thanks.
-Barry


From martin@loewis.home.cs.tu-berlin.de  Wed Aug 15 06:54:16 2001
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Wed, 15 Aug 2001 07:54:16 +0200
Subject: [I18n-sig] Re: pygettext dilemma
In-Reply-To: <15225.61991.262726.993033@anthem.wooz.org> (barry@zope.com)
References: <15200.64763.772001.53387@anthem.wooz.org>
 <oqvgjzfvwy.fsf@lin2.sram.qc.ca>
 <15223.19649.811672.585574@anthem.wooz.org>
 <oqitfs5fe7.fsf@lin2.sram.qc.ca> <15225.61991.262726.993033@anthem.wooz.org>
Message-ID: <200108150554.f7F5sGv01383@mira.informatik.hu-berlin.de>

> Now we all know that this won't do as a source string because there
> may be some languages that change the order of the variables, so we
> really need to write the string like so:
> 
>     _('No such list %(listname)s found on host %(hostname)s') % {
>         'listname': listname,
>         'hostname': hostname
>         }
> 
> I've found this style to be quite pervasive, but also extremely (and
> unnecessarily) repetitive.  Notice that I've typed "listname" and
> "hostname" a total of six times.  Wouldn't it be wonderful if I only
> needed to type them once:

What's wrong with

 _('No such list %(listname)s found on host %(hostname)s') % locals()

No magic required; of course, this assumes that the variables are
either all globals or all locals - I wish vars() would give me a
dictionary of all variables (perhaps even including the builtins).
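
The limitation can be seen in a short sketch.  The variable and function
names below are made up for the example, and the _() call is left out
so that the snippet stands alone:

```python
hostname = 'example.com'   # module-level (global) name


def with_locals(listname):
    # locals() only sees 'listname', so the global 'hostname' is missing.
    try:
        return 'No such list %(listname)s found on host %(hostname)s' % locals()
    except KeyError as exc:
        return 'missing name: %s' % exc.args[0]


def with_both(listname):
    # Merge globals and locals by hand; locals win on a name clash.
    namespace = dict(globals())
    namespace.update(locals())
    return 'No such list %(listname)s found on host %(hostname)s' % namespace
```

The second function is roughly what a vars() covering all variables
would provide in a single call.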

Regards,
Martin


From barry@zope.com  Wed Aug 15 07:31:41 2001
From: barry@zope.com (Barry A. Warsaw)
Date: Wed, 15 Aug 2001 02:31:41 -0400
Subject: [I18n-sig] Re: pygettext dilemma
References: <15200.64763.772001.53387@anthem.wooz.org>
 <oqvgjzfvwy.fsf@lin2.sram.qc.ca>
 <15223.19649.811672.585574@anthem.wooz.org>
 <oqitfs5fe7.fsf@lin2.sram.qc.ca>
 <15225.61991.262726.993033@anthem.wooz.org>
 <200108150554.f7F5sGv01383@mira.informatik.hu-berlin.de>
Message-ID: <15226.5965.202148.510121@anthem.wooz.org>

>>>>> "MvL" == Martin v Loewis <martin@loewis.home.cs.tu-berlin.de> writes:

    MvL> What's wrong with

    MvL>  _('No such list %(listname)s found on host %(hostname)s') %
    MvL> locals()

    MvL> No magic required; of course, this assumes that the variables
    MvL> are either all globals or all locals - I wish vars() would
    MvL> give me a dictionary of all variables (perhaps even including
    MvL> the builtins).

Bingo!  That's what I wish vars() would do too, and I want those
semantics, so I went with the _getframe() hack.

Plus, I got tired of writing trailing "% locals()" all over the place,
especially when they cluttered the code even more with long lines,
requiring continuation via extraneous paren grouping or backslashing.
Blah.

-Barry


From keichwa@gmx.net  Wed Aug 15 08:57:51 2001
From: keichwa@gmx.net (Karl Eichwalder)
Date: Wed, 15 Aug 2001 09:57:51 +0200
Subject: [I18n-sig] Re: pygettext.py extraction of docstrings
In-Reply-To: <15225.63024.291621.844755@anthem.wooz.org> (barry@wooz.org's
 message of "Wed, 15 Aug 2001 00:10:24 -0400")
References: <14840.35473.307059.990479@anthem.concentric.net>
 <200010272228.AAA01066@loewis.home.cs.tu-berlin.de>
 <15113.29005.357449.812516@anthem.wooz.org>
 <shy9rmwm54.fsf@tux.gnu.franken.de>
 <15117.38438.361043.255768@anthem.wooz.org>
 <oqzo9bfxbo.fsf@lin2.sram.qc.ca>
 <15223.18129.372008.719610@anthem.wooz.org>
 <15225.23907.685771.255536@honolulu.ilog.fr>
 <15225.63024.291621.844755@anthem.wooz.org>
Message-ID: <shn151wy9s.fsf@tux.gnu.franken.de>

barry@wooz.org (Barry A. Warsaw) writes:

> I made the semantics for pygettext.py's --docstrings/-D option to
> extract /also/ the docstrings because the older version of msgmerge I
> am using can't merge a docstring-only catalog with a normal-string
> catalog in a reasonable way (I tried).

FYI: You must not use msgmerge for this job; msgcomm is the right tool
;)  When gettext 0.11 is released you can go for msgcat.

> And as stated above, module docstrings can serve exactly the same
> audience as other translatable strings, i.e. the end user, so they
> should be in the same catalog.

It depends.  It depends on the size (for example).  gnumeric, a GNOME
spreadsheet application written in C, features "docstrings" associated
with macro functions, highly mathematical stuff, approx. 300-400
messages.  I'm not able to translate these messages and as a translator
I would like to have these messages go into a separate file...

Happily, these days I can use msggrep to extract these messages
(fr-function.po) and msgcomm to "subtract" the extracted strings
(fr-function.po) from the original .po file (fr.po); result:
fr-without-function.po.  Here are the commands:

    msggrep --output fr-function.po --width 0 \
            --msgid --regex '@FUNCTION=' fr.po
    msgcomm --output fr-without-function.po --width 0 \
            --less-than 2 fr.po fr-function.po

msggrep is able to work on filename markers (#: filename), too.

Sorry for my digression.

-- 
ke@suse.de (work) / keichwa@gmx.net (home):              |
http://www.suse.de/~ke/                                  |      ,__o
Free Translation Project:                                |    _-\_<,
http://www.iro.umontreal.ca/contrib/po/HTML/             |   (*)/'(*)


From pinard@iro.umontreal.ca  Wed Aug 15 16:01:35 2001
From: pinard@iro.umontreal.ca (François Pinard)
Date: 15 Aug 2001 11:01:35 -0400
Subject: [I18n-sig] Re: pygettext dilemma
In-Reply-To: <15225.61991.262726.993033@anthem.wooz.org>
References: <15200.64763.772001.53387@anthem.wooz.org>
 <oqvgjzfvwy.fsf@lin2.sram.qc.ca>
 <15223.19649.811672.585574@anthem.wooz.org>
 <oqitfs5fe7.fsf@lin2.sram.qc.ca>
 <15225.61991.262726.993033@anthem.wooz.org>
Message-ID: <oqr8uds6y8.fsf@lin2.sram.qc.ca>

[Barry A. Warsaw]

> No hurt taken [...]

Thanks!  It would have much saddened me otherwise.

> IIRC, gettext suggests binding N_() to gettext_noop()

I was not overly happy with N_(), even if it comes from me as well, but
given that gettext_noop() calls were less frequent than gettext() calls in
the experience we had accumulated at the time, it was bearable.  But I never
really liked it.

> Hmm, maybe we should add "N_" as one of the default keywords?

This would be kind of natural from someone educated with C gettext, and
looks much better to me than redefining _().  Much much better! :-)

Of course, I considered it, a while ago already, but rejected it for my
own use for two reasons.  I'm not sure how solidly those reasons would
stand today.  Here they are nevertheless.

The first is that N_() is completely pre-processed out in C, while N_()
would stay executable in Python.  To go as far as calling a function,
as a side-effect of marking, looked like a high price to me.
Conceptual price, of course; I'm not thinking about sparing the CPU, here.

The second is a mere consequence of the first.  Python would not let us
use N_() for docstrings.  And I consider that Python is very right here,
in telling me I'm wrong, because N_() is much more than a marker, while
it should be nothing more than a marker, and have no other significance.

> That points out a general philosophy I have that pygettext.py should
> mimic xgettext as much as makes sense for the difference between C and
> Python.

I understand what you mean here, and I mostly agree.  However, I would like
to warn you against going too far in trying to follow `gettext'.  It would
be difficult for me to go into details now, but overall, I feel that `gettext'
is a bit short-sighted.  At the origin, this was really on purpose, as the
initial goal was to put out something simple, and allow many years (I knew
it had to be more than a few) so the idea of internationalization spreads
and gets a wider acceptance in the field of free software.  I think the idea
has grown solid enough to not die by now, but if we want to be objective,
there is still much, much to accomplish even with the initial design.

However, I would guess that it would not take many more years before we
get ready for another leap, and I fear `gettext' might not be fully ready
for it, as it is getting somewhat encumbered by opinions, more than vision.
I was hoping that Python might be the vehicle for that step.  And for this
to occur, Python needs to be able to keep some distance and autonomy.

> [...] are you saying that ''"""Text""" should be used for both the
> runtime translating and the extraction marking?

No.  Only for marking, when nothing more than marking is meant.

> Wouldn't it be wonderful if I only needed to type them once:
>     _('No such list %(listname)s found on host %(hostname)s')

Yes, it would be wonderful.  Also notice that we could eventually go a lot
further than merely exchanging the order of the variables.  Many languages
use morphological flexing of surrounding words according to various
properties coming from inserts themselves, or even more important changes.

> The trick is that the function _() isn't gettext.gettext() but a
> wrapper around that library function that's unique to Mailman.

As much as possible, think Python, not only Mailman. :-) Yet, I quite
understand one has to start somewhere.

> But there is another problem: for some fonts in some IDE's it can be
> challenging to discern ' from " or even ` and having something like
> ""'''...''' makes it even more difficult to visually pick out.

Please, do not let random fonts or editor designs decide your vision.
Things start to go wrong when each actor is trying to take all the
problems of the world on his/her shoulders.  I could speak at length and
with conviction about a few bad moves in the area of fonts, in these days,
especially with Unicode around.  Just let's not dive into that, and rather
hope that reason (or horse sense) will finally prevail.  The most productive
attitude is that everyone identifies his/her share, and does well with it.

> For the handful of cases where you need to defer translation, I prefer
> using a technique as similar to the common way as possible, instead of
> introducing an entirely different convention.  But I wouldn't cry foul
> if you encouraged N_() markings for deferred translations.

No doubt to me, N_() is vastly superior to locally redefining _().
And _even_ if vastly superior, it is still not that good. :-)

> The more I think about it, I think a magic comment in front of the
> docstring is the way to go.  I'm not yet sure whether something like

>     # noextract
>     '''This is a docstring that need not be translated.'''

> or

>     # extract
>     '''This is a docstring that should be translated.'''

> is better, or whether there's some other better comment keyword to
> use.  This would be worth experimenting with a bit.

It seems like a good idea.  This is surely legible and neat, and probably
better than the other things we saw so far, from both of us! :-)

I have a slight fear that it might become tedious if we have long
sequences of translation-delayed strings, as it will likely happen in some
applications.  (I've linguistic applications in mind: as my associate is
a linguist and we often work together, I saw such things a few times.)

>     FP> I would not want to crusade inordinately over this

> A good, lively debate.  Thanks!

Oh, you know, I do not want it too lively.  As much as I like friendly
exchanges of ideas, at least because they convey friendship, I hate debates
and conflicts.  I dare to think I'm a peaceful man...

For the Translation Project, Martin von Löwis and Karl Eichwalder took the
torch and accepted to face the music.  I'm extremely grateful to them, yet a
bit sorry, thinking about some stubbornness out there which will undoubtedly
hit them.  For one, I'm too old and tired for fighting it anymore. :-)

> Let's not conflate what we're talking about.

Oops!  I do not know the word "conflate", and do not find it in my
English-French dictionary.  I'm a priori ready to "not conflate" if it
pleases you :-).  But then, what should I do, or avoid doing? :-)

                                        Keep happy!

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard


From haible@ilog.fr  Wed Aug 15 17:39:58 2001
From: haible@ilog.fr (Bruno Haible)
Date: Wed, 15 Aug 2001 18:39:58 +0200 (CEST)
Subject: [I18n-sig] Re: pygettext.py extraction of docstrings
In-Reply-To: <15225.63024.291621.844755@anthem.wooz.org>
References: <14840.35473.307059.990479@anthem.concentric.net>
 <200010272228.AAA01066@loewis.home.cs.tu-berlin.de>
 <15113.29005.357449.812516@anthem.wooz.org>
 <shy9rmwm54.fsf@tux.gnu.franken.de>
 <15117.38438.361043.255768@anthem.wooz.org>
 <oqzo9bfxbo.fsf@lin2.sram.qc.ca>
 <15223.18129.372008.719610@anthem.wooz.org>
 <15225.23907.685771.255536@honolulu.ilog.fr>
 <15225.63024.291621.844755@anthem.wooz.org>
Message-ID: <15226.42462.142365.711064@honolulu.ilog.fr>

Barry A. Warsaw writes:
>     BH> The GNU gettext tools are currently being modified to handle
>     BH> various programming languages. A new flag 'python-format' is
>     BH> being introduced, with appropriate format string checking in
>     BH> 'msgfmt'.
> 
> I'm not sure exactly what this means.  Can you give a bit more detail?

When a Python program contains a string like "%(name)s, %(firstname)s"
xgettext will mark it as "#, python-format", in order to tell the
translator that the string is a format string. If the translator then
gives an incorrect translation, say "%(fistname)s %(name)s" or
"%(firstname) %(name)s", then "msgfmt --check" will give an
appropriate error message.
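
A simplified sketch of that kind of check, for illustration only.  It
is not how msgfmt implements it: it only handles named %(name)X
placeholders and only flags the two mistakes from the example above.
The function names are made up:

```python
import re

# %(name) followed by an optional one-letter conversion such as 's' or 'd'.
NAMED = re.compile(r'%\(([^)]*)\)([a-zA-Z]?)')


def placeholders(s):
    """Map each placeholder name to its conversion ('' if it is missing)."""
    return {m.group(1): m.group(2) for m in NAMED.finditer(s)}


def check(msgid, msgstr):
    """Return a list of problems found in the translated format string."""
    want, got = placeholders(msgid), placeholders(msgstr)
    problems = []
    for name in sorted(got):
        if name not in want:
            # The translation refers to a name the program never supplies.
            problems.append('unknown name %r' % name)
        elif got[name] != want[name]:
            # Same name, but the conversion was dropped or changed.
            problems.append('bad conversion for %r' % name)
    return problems
```

Swapping the order of the placeholders, as in '%(firstname)s %(name)s',
yields no problems, which is the point of using named placeholders.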

> BTW, here's the current set of switches for pygettext.py.  Do you see
> any glaring incompatibilities with your latest xgettext?

pygettext doesn't extract comments of the form
"translator: the c, is a c-cedilla" (xgettext option --add-comments) or
"xgettext: no-python-format" (lets the programmer override the format
string guessing).

Other than that: xgettext doesn't have --docstrings and
--no-docstrings yet :-)

The -K option doesn't exist in xgettext, you have to use --keyword
instead.

Also, xgettext doesn't have -S/--style. The Solaris style is available
only with --strict.

> But as Martin points out, a case can be made for translating even
> class, function, and method docstrings.  Think of the situation where
> manuals are automatically extracted from source code, a la Javadoc.  I
> believe you'd want those strings to be extracted into the catalog.

I believe those strings belong in a different catalog. If you then
want them in the same catalog, you can use "msgcat" to combine both
catalogs.

The reasons for a different catalog: 1) Normal strings and docstrings
may need to be handled by different translators. 2) They may need
different extraction options. Your addition of --no-docstrings
indicates that docstrings may come from a different set of
files. Instead of forcing all options into a single xgettext command
line, what I propose is that you call xgettext twice, once for the
normal strings and once for the docstrings, with independent command
line options, and on independent (but potentially overlapping) sets of
files. This gives you the maximum flexibility.

> Here are two strawmen:
> 
> 1) pygettext.py and xgettext never extract unmarked docstrings unless
>    the -D/--docstrings option is given.  If -D is given then all
>    unmarked docstrings are extracted along with all other normally
>    marked text, unless the unmarked docstring is immediately preceded
>    by a comment with the word "notranslate" as the first word in the
>    comment.  All other words in the comment are ignored.
> 
> 2) pygettext.py and xgettext never extract unmarked docstrings unless
>    they are immediately preceded by a comment with the word
>    "translate" as the first word in the comment.  All other words in
>    the comment are ignored.

Here is my strawman:

  pygettext.py and xgettext never extract unmarked docstrings by
  default. If option -D/--docstrings is given, it extracts docstrings
  only. A separate option like --keywords can be used to select or
  inhibit the docstrings.

> I think that Martin's and my applications show that we probably need
> to cover these two situations:
>
> 1) No docstrings are extracted unless they are preceded by a magic
>    "extract" comment.
>
> 2) All docstrings are extracted unless they are preceded by a magic
>    "noextract" comment.

I agree.

> Note that in Python, unlike (I believe) in *Lisp, docstrings can be
> attached to objects other than functions.  It's common to have both
> module and class docstrings.

Lisp has grown up since then. Nowadays you can attach docstrings not
only to functions and macros, but also to classes, methods and packages.
The macros defclass, defmethod and defpackage support this.

Bruno


From martin@loewis.home.cs.tu-berlin.de  Wed Aug 15 19:23:55 2001
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Wed, 15 Aug 2001 20:23:55 +0200
Subject: [I18n-sig] Re: pygettext.py extraction of docstrings
In-Reply-To: <15226.42462.142365.711064@honolulu.ilog.fr> (message from Bruno
 Haible on Wed, 15 Aug 2001 18:39:58 +0200 (CEST))
References: <14840.35473.307059.990479@anthem.concentric.net>
 <200010272228.AAA01066@loewis.home.cs.tu-berlin.de>
 <15113.29005.357449.812516@anthem.wooz.org>
 <shy9rmwm54.fsf@tux.gnu.franken.de>
 <15117.38438.361043.255768@anthem.wooz.org>
 <oqzo9bfxbo.fsf@lin2.sram.qc.ca>
 <15223.18129.372008.719610@anthem.wooz.org>
 <15225.23907.685771.255536@honolulu.ilog.fr>
 <15225.63024.291621.844755@anthem.wooz.org> <15226.42462.142365.711064@honolulu.ilog.fr>
Message-ID: <200108151823.f7FINt503408@mira.informatik.hu-berlin.de>

> When a Python program contains a string like "%(name)s, %(firstname)s"
> xgettext will mark it as "#, python-format", in order to tell the
> translator that the string is a format string. If the translator then
> gives an incorrect translation, say "%(fistname)s %(name)s" or
> "%(firstname) %(name)s", then "msgfmt --check" will give an
> appropriate error message.

That's pretty cool.
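
For illustration, a rough approximation of that kind of check in
Python. This is not msgfmt's algorithm, only a sketch that looks at
named %(...)s style specifiers; the function name and the wording of
the messages are invented here.

```python
import re

# Any "%(name)" opening, complete or not.
NAMED = re.compile(r"%\(([^)]*)\)")
# A complete named specifier: flags, width, precision, conversion type.
COMPLETE = re.compile(r"%\(([^)]*)\)[#0 +-]*\d*(?:\.\d+)?[hlL]?[diouxXeEfFgGcrsa]")


def check_translation(msgid, msgstr):
    """Return a list of problems found in msgstr, empty if none."""
    problems = []
    expected = set(COMPLETE.findall(msgid))
    used = set(COMPLETE.findall(msgstr))
    # Specifiers such as "%(firstname) " that lack a conversion type.
    for match in NAMED.finditer(msgstr):
        if not COMPLETE.match(msgstr, match.start()):
            problems.append("incomplete format specifier for %r" % match.group(1))
    # Names the translation uses but the original never supplies.
    for name in sorted(used - expected):
        problems.append("unknown name %r" % name)
    return problems


MSGID = "%(name)s, %(firstname)s"
typo = check_translation(MSGID, "%(fistname)s %(name)s")
missing_type = check_translation(MSGID, "%(firstname) %(name)s")
good = check_translation(MSGID, "%(firstname)s %(name)s")
```

The two faulty translations quoted above are both reported, while the
reordered but otherwise correct one passes.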

Regards,
Martin


From colinsyu@hotmail.com  Fri Aug 24 17:54:00 2001
From: colinsyu@hotmail.com (Colin Yu)
Date: Fri, 24 Aug 2001 09:54:00 -0700
Subject: [I18n-sig] (no subject)
Message-ID: <F1418ARf3sllNY30kkZ00014349@hotmail.com>

Hi,

Is there a way to take a unicode string like u"ãƒ¬ãƒ¼ã‚¶ãƒ¼ ãƒ—ãƒªãƒ³ã‚¿" 
(which are Japanese characters in UTF-8 format) and convert it to unicode 
(\uXXXX format) escape codes in python?  Your help would be greatly 
appreciated. Thank you.



From tree@basistech.com  Fri Aug 24 18:03:51 2001
From: tree@basistech.com (Tom Emerson)
Date: Fri, 24 Aug 2001 13:03:51 -0400
Subject: [I18n-sig] (no subject)
In-Reply-To: <F1418ARf3sllNY30kkZ00014349@hotmail.com>
References: <F1418ARf3sllNY30kkZ00014349@hotmail.com>
Message-ID: <15238.35063.269485.168879@magrathea.basistech.com>

Colin Yu writes:
> Is there a way to take a unicode string like u"ãƒ¬ãƒ¼ã‚¶ãƒ¼ ãƒ—ãƒªãƒ³ã‚¿"
> (which are Japanese characters in UTF-8 format) and convert it to unicode
> (\uXXXX format) escape codes in python?  Your help would be greatly
> appreciated. Thank you.

Use "repr".

>>> foo = u"\u4e00"
>>> repr(foo)
"u'\\u4E00'"

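In Python 3 the same idea can be sketched as below. This is an
illustration, not part of the original answer. The helper names are
invented. The second helper rests on an assumption about the question:
that the string holds UTF-8 bytes which were decoded with the wrong
codec (cp1252 is a guess), so they have to be re-encoded and decoded
again before escaping.

```python
def to_u_escapes(text):
    """Write every non-ASCII character as \\uXXXX (or \\UXXXXXXXX)."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            parts.append(ch)                # plain ASCII stays as it is
        elif cp <= 0xFFFF:
            parts.append("\\u%04x" % cp)    # BMP character
        else:
            parts.append("\\U%08x" % cp)    # beyond the BMP
    return "".join(parts)


def undo_utf8_mojibake(text, wrong_codec="cp1252"):
    """Recover text whose UTF-8 bytes were decoded with wrong_codec.

    The codec is an assumption; raises UnicodeError if it does not fit.
    """
    return text.encode(wrong_codec).decode("utf-8")


escaped = to_u_escapes("\u30ec\u30fc\u30b6\u30fc")
repaired = undo_utf8_mojibake("\u00e3\u0192\u00ac")
builtin = "\u30ec".encode("unicode_escape").decode("ascii")
```

The built-in unicode_escape codec gives the same result for this
input, but it writes characters in the range U+0080 to U+00FF as \xXX
rather than \u00XX, which is why the helper spells the loop out.
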
-- 
Tom Emerson                                          Basis Technology Corp.
Sr. Sinostringologist                              http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


From Misha.Wolf@reuters.com  Sat Aug 25 02:02:33 2001
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Sat, 25 Aug 2001 02:02:33 +0100
Subject: [I18n-sig] 19th Unicode Conference, Sep 2001, San Jose, CA -- Two weeks to go!
Message-ID: <T5593dfa144c407b707484@reuters.com>

>>>>>>>>>>>>>>>>>>>>>>>>  Just 2 weeks to go!  <<<<<<<<<<<<<<<<<<<<<<<<

           Nineteenth International Unicode Conference (IUC19)
               Unicode and the Web: The Global Connection
                    http://www.unicode.org/iuc/iuc19
                         September 10-14, 2001
                           San Jose, CA, USA

>>>>>>>>>>>>>>>>>>>>>>>>>>>  Register now!  <<<<<<<<<<<<<<<<<<<<<<<<<<<

NEWS

   * Hotel guest room group rate extended to August 31.

   * Early Bird registration rate extended to August 31.

   * Visit the Conference Web site ( http://www.unicode.org/iuc/iuc19 )
     to check the updated Conference program and register.  To help you
     choose Conference sessions, we've included abstracts of talks and
     speakers' biographies.

CONFERENCE SPONSORS

   * Agfa Monotype Corporation
   * Basis Technology Corporation
   * Lionbridge Technologies
   * Microsoft Corporation
   * Netscape Communications
   * Oracle Corporation
   * PeopleSoft, Inc.
   * Reuters Ltd.
   * Sun Microsystems, Inc.
   * Trados Corporation
   * Trigeminal Software, Inc.
   * World Wide Web Consortium (W3C)
   * Wrox Press

CONFERENCE VENUE

     DoubleTree Hotel San Jose
     2050 Gateway Place
     San Jose, CA 95110
     USA

     Tel: +1 408 453 4000
     Fax: +1 408 437 2898

GLOBAL COMPUTING SHOWCASE

   * Visit the Showcase to find out more about products supporting the
     Unicode Standard, and products and services that can help you
     globalize/localize your software, documentation and Internet
     content.  For details, visit the Conference Web site
     ( http://www.unicode.org/iuc/iuc19 )

   Exhibitors to date include:

   * Agfa Monotype Corporation
   * Basis Technology Corporation
   * Everlasting Systems Ltd.
   * Localization Institute
   * Multilingual Computing, Inc.
   * Oracle Corporation
   * Rasmussen Software, Inc.
   * ReachIn, Inc.
   * Sun Microsystems, Inc.
   * Segue Software
   * Sybase, Inc.
   * Symbio Group
   * Trados Corporation

CONFERENCE MANAGEMENT

   Global Meeting Services Inc.
   4360 Benhurst Avenue
   San Diego, CA 92122, USA

   Tel: +1 858 638 0206 (voice)
        +1 858 638 0504 (fax)

   Email: info@global-conference.com
      or: conference@unicode.org

                             *  *  *  *  *

Unicode(r) and the Unicode logo are registered trademarks of Unicode,
Inc.  Used with permission.

