From Misha.Wolf@reuters.com Fri Aug 3 20:40:39 2001 From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com) Date: Fri, 03 Aug 2001 20:40:39 +0100 Subject: [I18n-sig] 19th Unicode Conference, September 2001, San Jose, CA, USA -- Register now! Message-ID: Nineteenth International Unicode Conference (IUC19) Unicode and the Web: The Global Connection http://www.unicode.org/iuc/iuc19 September 10-14, 2001 San Jose, CA, USA Register now! * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * NEWS * Hotel guest room group rate valid to August 17. * Visit the Conference Web site ( http://www.unicode.org/iuc/iuc19 ) to check the updated Conference program and register. To help you choose Conference sessions, we've included abstracts of talks and speakers' biographies. CONFERENCE SPONSORS Agfa Monotype Corporation Basis Technology Corporation Lionbridge Technologies Microsoft Corporation Netscape Communications Oracle Corporation PeopleSoft, Inc. Reuters Ltd. Sun Microsystems, Inc. Trados Corporation Trigeminal Software, Inc. World Wide Web Consortium (W3C) Wrox Press CONFERENCE VENUE DoubleTree Hotel San Jose 2050 Gateway Place San Jose, CA 95110 USA Tel: +1 408 453 4000 Fax: +1 408 437 2898 GLOBAL COMPUTING SHOWCASE Visit the Showcase to find out more about products supporting the Unicode Standard, and products and services that can help you globalize/localize your software, documentation and Internet content. For details, visit the Conference Web site: http://www.unicode.org/iuc/iuc19 Exhibitors to date include: * Basis Technology Corporation * Everlasting Systems Ltd. * Multilingual Computing, Inc. * Oracle Corporation * Rasmussen Software, Inc. * Sun Microsystems, Inc. * Symbio Group * Trados CONFERENCE MANAGEMENT Global Meeting Services Inc. 
4360 Benhurst Avenue San Diego, CA 92122, USA Tel: +1 858 638 0206 (voice) +1 858 638 0504 (fax) Email: info@global-conference.com or: conference@unicode.org * * * * * Unicode(r) and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission. ----------------------------------------------------------------- Visit our Internet site at http://www.reuters.com Any views expressed in this message are those of the individual sender, except where the sender specifically states them to be the views of Reuters Ltd. From pinard@iro.umontreal.ca Tue Aug 7 21:07:39 2001 From: pinard@iro.umontreal.ca (=?iso-8859-1?q?Fran=E7ois?= Pinard) Date: 07 Aug 2001 16:07:39 -0400 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings In-Reply-To: <15117.38438.361043.255768@anthem.wooz.org> References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> Message-ID: [Barry A. Warsaw] > Then again, it doesn't say that #. comments are reserved. It basically > just says that #-whitespace comments are reserved for the translators. You might consider that they are all reserved. > I'm happy to switch it, but I'd really like to have a reference I can > point to to short-circuit any further discussion. Even a mailing list > archive url would be fine. There is no formal, fully dependable reference. I might have written the bits that exist in the `gettext' manual, and these things were programmed only after they were thoroughly discussed with me. But nowadays, even me is not a good reference. A few people contributed `gettext' code, pushing and pulling a bit hard for their own ideas, and not always understanding the overall plans. Their code made it into `gettext' releases nevertheless. So now, I'm not sure I understand much anymore where things are going. If I remember well, `#.' 
are for textual comments written by the program maintainer, meant to be read by translators, and derived automatically at POT creation time. They usually come from specially formatted comments in the C sources. `#-whitespace' are for textual comments also meant to be read by various translators, but written by translators themselves. `#,' are for programmatic flags. The idea was to use these parsimoniously, keeping track of possible flag definitions and consequences. I do not know how far these are recognized and validated by `msgfmt'. Best would be to coordinate with the current `gettext' maintainer before creating new ones. Unless he declares they are now for free use? -- François Pinard http://www.iro.umontreal.ca/~pinard From pinard@iro.umontreal.ca Tue Aug 7 21:38:05 2001 From: pinard@iro.umontreal.ca (=?iso-8859-1?q?Fran=E7ois?= Pinard) Date: 07 Aug 2001 16:38:05 -0400 Subject: [I18n-sig] Re: pygettext dilemma In-Reply-To: <15200.64763.772001.53387@anthem.wooz.org> References: <15200.64763.772001.53387@anthem.wooz.org> Message-ID: [Barry A. Warsaw] > In Mailman, I've got a bunch of normal .py modules and a bunch of > command line scripts. The modules have their translatable strings > nicely marked with _() and only those strings should be extracted. Hello, Barry. Long time no talk! :-) `_(STRING)' is two-fold. First, it marks STRING for extraction and later insertion in some generated POT file. Second, it is a nickname for the `gettext' function or alike, that will translate STRING at run time given that a translation file provides a translation. Experience taught us that this is not always adequate. We sometimes need to delay a translation. That is, we might use `_(VARIABLE)', with VARIABLE being first assigned some translatable string elsewhere in the program. Since VARIABLE is not a string, it does not get extracted into a POT file. But those strings which could get assigned to VARIABLE are not extracted either, because they are not marked. 
You understand that if they were marked with `_(STRING)', they would get translated prematurely.

All this to say that there is a need for marking strings in such a way that they will be extracted into POT files, but otherwise untouched by Python. That is, the way to mark a string should be a Python no-operation, and ideally, should not alter the Python language.

The only simple Python no-operation I know is the unary prefix `+', and my intuition tells me that it might have been dangerous to use it for marking delayed translation strings. Using prefixes like i"STRING" or t"STRING" (for "i"nternationalisable or "t"ranslatable) would require a modification to Python.

So, I came up with the simple idea to play a bit with the fact that Python folds a succession of constant strings into a single one at compilation time. The idea is to prefix a translatable string, when it is used outside the usual `_(STRING)' idiom, by an empty string of the other kind, like this:

    Example          Type        For extractor

    'TEXT'           1-quoted    not marked
    "TEXT"           2-quoted    not marked
    '''TEXT'''       3-quoted    not marked
    ''"TEXT"         4-quoted    marked
    ""'TEXT'         5-quoted    marked
    """TEXT"""       6-quoted    not marked
    ""'''TEXT'''     7-quoted    marked
    ''"""TEXT"""     8-quoted    marked

Of course, the idea of using the empty string "of the other kind" is to avoid ambiguity: prefixing '' to 'TEXT' would produce '''TEXT', which just cannot work. I agree that for 7-quoted and 8-quoted strings, it is not really required to use the empty string of the other kind; using an empty string of the same kind would work without problem. I suggest we keep "of the other kind" for 7-quoted and 8-quoted, to be more consistent.

> The scripts however should have both _() and docstrings extracted,
> since the module docstrings include usage text.

In fact, I think that even within a single module, some docstrings should be considered translatable, while some other docstrings should not be. Considering the choice has to be per whole module at a time, is too gross.
This goes almost without saying. One should not feel compelled to avoid docstrings for internal or service functions within a module, merely to avoid having them spuriously extracted, and later, uselessly translated.

> Does anybody have any suggestions or better ideas?

I would be tempted to suggest that we merely use delayed string marking, using the convention above (as in 4-quoted, 5-quoted, 7-quoted or 8-quoted) for docstrings meant to be translated. Such strings would be extracted no matter what, in docstring position or not. An option to `pygettext' might exist to extract all docstrings, whether marked as delayed strings or not, but I would guess this is an interim solution which would not be satisfying in the long term. Best is to mark translatable strings precisely, either using immediate `_(STRING)' or delayed translation.

One problem is that Python does not seem to automatically concatenate a sequence of strings as a single one, when in docstring position. We might consider this as a Python bug: repairing that bug would not really change the language, and would allow delayed marking of translation strings.

Let me present the set of suggestions, in this message, as having a minimal impact on Python, yet being pretty flexible in what it would allow us to do.
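
The empty-string marking convention proposed in this message could be recognised by an extractor roughly as follows. This is only a minimal sketch in present-day Python 3 (not the Python 2.x of this thread), and it is not how pygettext.py works: the function name find_marked_strings and the sample source are hypothetical, introduced for illustration. It assumes the marker and the marked literal sit on the same logical line, and for simplicity it treats any empty '' or "" literal directly followed by another string literal as a mark, whatever the quote kind of the second literal.

```python
import ast
import io
import tokenize

# The two empty literals that serve as markers in the proposed convention.
EMPTY_MARKERS = {"''", '""'}


def find_marked_strings(source):
    """Return the values of string literals directly preceded by '' or ""."""
    marked = []
    previous = None  # last STRING token seen on the current logical line
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING:
            if previous is not None and previous.string in EMPTY_MARKERS:
                # Evaluate the literal to obtain the text an extractor
                # would write into the POT file.
                marked.append(ast.literal_eval(tok.string))
            previous = tok
        elif tok.type not in (tokenize.NL, tokenize.COMMENT):
            # Any other token breaks the adjacency of two literals.
            previous = None
    return marked


# Hypothetical sample covering marked and unmarked forms from the table.
sample = "\n".join([
    "a = 'plain'",
    'b = \'\'"marked one"',
    'c = ""\'marked two\'',
    'd = """not marked"""',
    'e = \'\'"""marked three"""',
]) + "\n"

found = find_marked_strings(sample)
print(found)
```

Because Python folds adjacent literals at compile time, the marker costs nothing at run time: the expression ''"TEXT" evaluates to exactly the same string as "TEXT".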
-- François Pinard http://www.iro.umontreal.ca/~pinard From Misha.Wolf@reuters.com Fri Aug 10 22:58:15 2001 From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com) Date: Fri, 10 Aug 2001 22:58:15 +0100 Subject: [I18n-sig] Call for Papers - 20th Unicode Conference - Jan/Feb 2001 - Washington DC Message-ID: Twentieth International Unicode Conference (IUC20) Unicode and the Web: The Global Connection http://www.unicode.org/iuc/iuc20 January 28 - February 1, 2002 Washington, DC, USA > > > > > > > C A L L F O R P A P E R S < < < < < < < Submissions due: September 21, 2001 Notification date: October 12, 2001 Completed papers due : November 2, 2001 (in electronic form and camera-ready paper form) * * * * * The Unicode Standard has become the foundation for all modern text processing. It is used on large machines, tiny portable devices, and for distributed processing across the Internet. The standard brings cost-reducing efficiency to international applications and enables the exchange of text in an ever increasing list of natural languages. New technologies and innovative Internet applications, as well as the evolving Unicode Standard, bring new challenges along with their new capabilities. This technical conference will explore the opportunities created by the latest advances and how to leverage them, as well as potential pitfalls to be aware of, and problem areas that need further research. We invite you to submit papers which either define the software of tomorrow, demonstrate best practice with today's software, or articulate problems that must be solved before further advances can occur. Papers should discuss subjects in the context of Unicode, internationalization or localization. You can view the programs of previous conferences at: http://www.unicode.org/unicode/conference/about-conf.html Conference attendees are generally involved in either the development, deployment or use of Unicode software or content, or the globalization of software and the Internet. 
They include managers, software engineers, systems analysts, font designers, graphic designers, content developers, technical writers, and product marketing personnel. THEME & TOPICS Computing with Unicode is the overall theme of the Conference. Presentations should be geared towards a technical audience. Topics of interest include, but are not limited to, the following (within the context of Unicode, internationalization or localization): - UTFs: Not enough or too many? - Security concerns e.g. Avoiding the spoofing of UTF-8 data - Impact of new encoding standards - Implementing Unicode: Practical and political hurdles - Portable devices - Implementing new features of recent versions of Unicode - Algorithms (e.g. normalization, collation, bidirectional) - Programming languages and libraries (Java, Perl, et al) - The World Wide Web (WWW) - Search engines - Library and archival concerns - Operating systems - Databases - Large scale networks - Government applications - Evaluations (case studies, usability studies) - Natural language processing - Migrating legacy applications - Cross platform issues - Printing and imaging - Optimizing performance of systems and applications - Testing applications - XML and Web protocols - Business models for software development (e.g. Open source) SESSIONS The Conference Program will provide a wide range of sessions including: - Keynote presentations - Workshops/Tutorials - Technical presentations - Panel sessions All sessions except the Workshops/Tutorials will be of 40 minute duration. In some cases, two consecutive 40 minute program slots may be devoted to a single session. The Workshops/Tutorials will each last approximately three hours. They should be designed to stimulate discussion and participation, using slides and demonstrations. 
PUBLICITY If your paper is accepted, your details will be included in the Conference brochure and Web pages and the paper itself will appear on a Conference CD, with an optional printed book of Conference Proceedings. CONFERENCE LANGUAGE The Conference language is English. All submissions, papers and presentations should be provided in English. SUBMISSIONS Submissions MUST contain: 1. An abstract of 150-250 words, consisting of statement of purpose, paper description, and your conclusions or final summary. 2. A brief biography. 3. The details listed below: SESSION TITLE: _________________________________________ _________________________________________ TITLE (eg Dr/Mr/Mrs/Ms): _________________________________________ NAME: _________________________________________ JOB TITLE: _________________________________________ ORGANIZATION/AFFILIATION: _________________________________________ ORGANIZATION'S WWW URL: _________________________________________ OWN WWW URL: _________________________________________ ADDRESS FOR PAPER MAIL: _________________________________________ _________________________________________ _________________________________________ TELEPHONE: _________________________________________ FAX: _________________________________________ E-MAIL ADDRESS: _________________________________________ TYPE OF SESSION: [ ] Keynote presentation [ ] Workshop/Tutorial [ ] Technical presentation [ ] Panel PANELISTS (if Panel): _________________________________________ _________________________________________ _________________________________________ _________________________________________ _________________________________________ _________________________________________ _________________________________________ _________________________________________ TARGET AUDIENCE (you may select more than one category): [ ] Content Developers [ ] Font Designers [ ] Graphic Designers [ ] Managers [ ] Marketers [ ] Software Engineers [ ] Systems Analysts [ ] Technical Writers 
[ ] Others (please specify): _________________________________________ _________________________________________ LEVEL OF SESSION (you may select more than one category): [ ] Beginner [ ] Intermediate [ ] Advanced Submissions should be sent by e-mail to either of the following addresses: papers@unicode.org info@global-conference.com They should use ASCII, non-compressed text and the following subject line: Proposal for IUC 20 If desired, a copy of the submission may also be sent by post to: Twentieth International Unicode Conference c/o Global Meeting Services, Inc. 4360 Benhurst Avenue San Diego, CA 92122 USA Tel: +1 858 638 0206 Fax: +1 858 638 0504 CONFERENCE PROCEEDINGS All Conference papers will be published on CD. Printed proceedings will be offered as an option. EXHIBIT OPPORTUNITIES The Conference will have an Exhibition area for corporations or individuals who wish to display and promote their products, technology and/or services. Every effort will be made to provide maximum exposure and advertising. Exhibit space is limited. For further information or to reserve a place, please contact Global Meeting Services at the above location. CONFERENCE VENUE Omni Shoreham Hotel 2500 Calvert Street, NW Washington, DC 20008 USA Tel: +1 202 234 0700 Fax: +1 202 265 7972 THE UNICODE CONSORTIUM The Unicode Consortium was founded as a non-profit organization in 1991. It is dedicated to the development, maintenance and promotion of The Unicode Standard, a worldwide character encoding. The Unicode Standard encodes the characters of the world's principal scripts and languages, and is code-for-code identical to the international standard ISO/IEC 10646. In addition to cooperating with ISO on the future development of ISO/IEC 10646, the Consortium is responsible for providing character properties and algorithms for use in implementations. 
Today the membership base of the Unicode Consortium includes major computer corporations, software producers, database vendors, research institutions, international agencies and various user groups.

For further information on the Unicode Standard, visit the Unicode Web site at http://www.unicode.org or e-mail

From martin@loewis.home.cs.tu-berlin.de Sun Aug 12 09:57:39 2001
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Sun, 12 Aug 2001 10:57:39 +0200
Subject: [I18n-sig] Re: pygettext dilemma
In-Reply-To: (pinard@IRO.UMontreal.CA)
References: <15200.64763.772001.53387@anthem.wooz.org>
Message-ID: <200108120857.f7C8vdi02038@mira.informatik.hu-berlin.de>

> One problem is that Python does not seem to automatically concatenate a
> sequence of strings as a single one, when in docstring position.

What version did you use to try this? It works fine for me:

Python 2.0 (#1, May 16 2001, 00:02:45)
[GCC 2.95.3 20010315 (SuSE)] on linux2
Type "copyright", "credits" or "license" for more information.
>>> def foo():
...     ""'Hallo'
...
>>> foo.__doc__
'Hallo'

Regards,
Martin

From pinard@iro.umontreal.ca Mon Aug 13 01:41:34 2001
From: pinard@iro.umontreal.ca (=?iso-8859-1?q?Fran=E7ois?= Pinard)
Date: 12 Aug 2001 20:41:34 -0400
Subject: [I18n-sig] Re: pygettext dilemma
In-Reply-To: <200108120857.f7C8vdi02038@mira.informatik.hu-berlin.de>
References: <15200.64763.772001.53387@anthem.wooz.org> <200108120857.f7C8vdi02038@mira.informatik.hu-berlin.de>
Message-ID:

[Martin v. 
Loewis]

> > One problem is that Python does not seem to automatically concatenate
> > a sequence of strings as a single one, when in docstring position.

> What version did you use to try this? It works fine for me:

Oops! You are right. Sorry, I got my tests wrong:

Python 2.1 (#1, Jul 3 2001, 21:59:44)
[GCC 2.95.2 19991024 (release)] on linux2
Type "copyright", "credits" or "license" for more information.
>>> def bonjour():
...     'chez '
...     'vous!'
...     pass
...
>>> bonjour.__doc__
'chez '
>>> def bonjour():
...     'chez ' 'vous!'
...     pass
...
>>> bonjour.__doc__
'chez vous!'
>>>

So, there is no problem, and:

    ''"""
    LONG DOC STRING
    """

could be marked as translatable exactly like this. This would allow telling docstrings meant to be translated apart from the others, one string at a time, rather than one module at a time.

-- 
François Pinard   http://www.iro.umontreal.ca/~pinard

From barry@wooz.org Mon Aug 13 04:17:37 2001
From: barry@wooz.org (Barry A. Warsaw)
Date: Sun, 12 Aug 2001 23:17:37 -0400
Subject: [I18n-sig] Re: pygettext.py extraction of docstrings
References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org>
Message-ID: <15223.18129.372008.719610@anthem.wooz.org>

Hi Francois!

I'm Cc'ing Bruno on this message because I think he's the current gettext maintainer. Sorry if I'm mistaken...

>>>>> "FP" == François Pinard writes:

    >> Then again, it doesn't say that #. comments are reserved. It
    >> basically just says that #-whitespace comments are reserved for
    >> the translators.

    FP> You might consider that they are all reserved.

    >> I'm happy to switch it, but I'd really like to have a reference
    >> I can point to to short-circuit any further discussion. Even a
    >> mailing list archive url would be fine.

    FP> If I remember well, `#.' 
are for textual comments written by
    FP> the program maintainer, meant to be read by translators, and
    FP> derived automatically at POT creation time. They usually come
    FP> from specially formatted comments in the C sources.

    FP> `#-whitespace' are for textual comments also meant to be read
    FP> by various translators, but written by translators themselves.

This makes sense. It would be good to make this a bit clearer in the "Format of PO Files" section of the GNU gettext manual.

    FP> `#,' are for programmatic flags. The idea was to use these
    FP> parsimoniously, keeping track of possible flag definitions and
    FP> consequences. I do not know how far these are recognized and
    FP> validated by `msgfmt'. Best would be to coordinate with the
    FP> current `gettext' maintainer before creating new ones. Unless
    FP> he declares they are now for free use?

A while back I was convinced to switch the `docstring' flag to #, for pygettext. Perhaps Bruno can add some information on pygettext.py in the GNU gettext manual? I think the following would be of interest:

- Mention the existence of pygettext.py for extracting translatable strings in Python.

- Point to Python's gettext module documentation for more details on i18n'ing Python programs. This should be a fairly stable url:

  http://www.python.org/doc/current/lib/module-gettext.html

- Document `docstring' as a legal #,-style flag. It probably only has meaning in Python, but may be useful in other scripting languages. Think of it as roughly equivalent to Emacs-Lisp docstrings (in fact, they were the inspiration for Python docstrings back in '94 at the 1st Python workshop!)

- Make sure that the other GNU gettext tools recognize the docstring flag, in whatever way is meaningful (I'm not sure what would be useful or not... ;).

Thanks. BTW, for my purposes, pygettext.py's -X/--no-docstrings switch does the job perfectly, if a bit inelegantly.

-Barry

From barry@zope.com Mon Aug 13 04:42:57 2001
From: barry@zope.com (Barry A. 
Warsaw)
Date: Sun, 12 Aug 2001 23:42:57 -0400
Subject: [I18n-sig] Re: pygettext dilemma
References: <15200.64763.772001.53387@anthem.wooz.org>
Message-ID: <15223.19649.811672.585574@anthem.wooz.org>

>>>>> "FP" == François Pinard writes:

    >> In Mailman, I've got a bunch of normal .py modules and a bunch
    >> of command line scripts. The modules have their translatable
    >> strings nicely marked with _() and only those strings should be
    >> extracted.

    FP> Hello, Barry. Long time no talk! :-)

Indeed! BTW, I18N Mailman is coming along very nicely now. I hope the 2.1 release will happen within the next few months.

    FP> `_(STRING)' is two-fold. First, it marks STRING for
    FP> extraction and later insertion in some generated POT file.
    FP> Second, it is a nickname for the `gettext' function or alike,
    FP> that will translate STRING at run time given that a
    FP> translation file provides a translation.

    FP> Experience taught us that this is not always adequate. We
    FP> sometimes need to delay a translation. That is, we might use
    FP> `_(VARIABLE)', with VARIABLE being first assigned some
    FP> translatable string elsewhere in the program. Since VARIABLE
    FP> is not a string, it does not get extracted into a POT file.
    FP> But those strings which could get assigned to VARIABLE are not
    FP> extracted either, because they are not marked. You understand
    FP> that if they were marked with `_(STRING)', they would get
    FP> translated prematurely.

    FP> All this to say that there is a need for marking strings in
    FP> such a way that they will be extracted into POT files, but
    FP> otherwise untouched by Python. That is, the way to mark
    FP> string should be a Python no-operation, and ideally, should
    FP> not alter the Python language.

All the above is true, and I have encountered these situations in Mailman 2.1. Python, however, provides a very nice solution, quite in keeping with the Pythonic "explicit-is-better-than-implicit" mantra.
What I do in this situation is to temporarily bind _() to a no-op function so that the string is marked for extraction, but not translated in place. E.g.

    import gettext

    def _(s):
        return s

    foo = _('extract this string but do not translate it yet')

    _ = gettext.gettext

This works perfectly because Python doesn't suffer from the same deficiencies as C (i.e. the C pre-processor :).

    FP> The only simple Python no-operation I know is the unary prefix
    FP> `+', and my intuition tells me that it might have been
    FP> dangerous to use it for marking delayed translation strings.
    FP> Using prefixes like i"STRING" or t"STRING" (for
    FP> "i"nternationalisable or "t"ranslatable) would require a
    FP> modification to Python.

Right. A string-prefix character has another disadvantage; it sets a bad precedent for an explosion of combinations of prefixes (i.e. we'd now need rt'' strings, tr'' strings, utr'' strings, tru'' strings, etc. etc.). So we agree that prefixes are out. :)

    FP> So, I came up with the simple idea to play a bit with the fact
    FP> that Python folds a succession of constant strings into a
    FP> single one at compilation time. The idea is to prefix a
    FP> translatable string, when it is used outside the usual
    FP> `_(STRING)' idiom, by an empty string of the other kind, like
    FP> this:

    FP>     Example          Type        For extractor

    |       'TEXT'           1-quoted    not marked
    |       "TEXT"           2-quoted    not marked
    |       '''TEXT'''       3-quoted    not marked
    |       ''"TEXT"         4-quoted    marked
    |       ""'TEXT'         5-quoted    marked
    |       """TEXT"""       6-quoted    not marked
    |       ""'''TEXT'''     7-quoted    marked
    |       ''"""TEXT"""     8-quoted    marked

This has been brought up before, and I know that some people really like this approach. I don't though, because 1) it is too magical; 2) the rules are arbitrary and hard to remember; 3) explicit is better than implicit.

When a newbie looks at a bit of Python code that looks like

    _('Traditional Chinese')

and wonders what this does, he should immediately look for the definition of the _() function.
Using his well-honed Python skills he'll look for some def or import that brings this function name into scope, and this should naturally lead to the purpose of the idiom. E.g. they'll see "from gettext import gettext as _" or some such.

Seeing something like an unadorned

    ""'Traditional Chinese'

really gives no clue as to the purpose of this strange markup, so it would have to either be something the reader of the code Just Got, or it would have to be described in a comment, and that's simply unfeasible.

I also claim that the rules are fairly arbitrary and will be hard to explain and remember. It's not something that's learned once and then ingrained.

    FP> In fact, I think that even within a single module, some
    FP> docstrings should be considered translatable, while some other
    FP> docstrings should not be.

True.

    FP> Considering the choice has to be per whole module at a time,
    FP> is too gross. This goes almost without saying.

I personally don't feel like it's that big a problem. So far, in my experience the only docstrings that really need to be extracted are module docstrings in command line scripts. I've found it not to be that big a deal to also extract class or function docstrings in those files, since it doesn't add that much of a burden to the translator.

But my personal preference has been to limit the docstrings in such files to just the module docstring, and use comments instead of docstrings for functions or classes.

Or, you can sometimes do something ugly like use explicit

    __doc__ = _('Here is a module docstring')

Not pretty, but also not common I think, so it doesn't concern me much.
I could conceive of a convention where a leading comment before a docstring could inhibit extraction of the following docstring, such as:

    class Foo:
        # notranslate
        '''Here is a docstring that should not be extracted or translated.'''

One of two approaches could happen: either pygettext.py could ignore the following docstring and not stick it in the PO file (but I forget if tokenize gets to see comments or not), or pygettext.py could add a #. notranslate comment to the entry telling translators to skip this entry.

    FP> Let me present the set of suggestions, in this message, as
    FP> having a minimal impact on Python, yet being pretty flexible
    FP> in what it would allow us to do.

I appreciate the suggestions Francois! I think what we've got gives us the best approach for Python programs.

Cheers,
-Barry

From martin@loewis.home.cs.tu-berlin.de Mon Aug 13 06:30:51 2001
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Mon, 13 Aug 2001 07:30:51 +0200
Subject: [I18n-sig] Re: pygettext dilemma
In-Reply-To: <15223.19649.811672.585574@anthem.wooz.org> (barry@zope.com)
References: <15200.64763.772001.53387@anthem.wooz.org> <15223.19649.811672.585574@anthem.wooz.org>
Message-ID: <200108130530.f7D5Upm00873@mira.informatik.hu-berlin.de>

> I personally don't feel like it's that big a problem. So far, in my
> experience the only docstrings that really need to be extracted are
> module docstrings in command line scripts.

I disagree somewhat, but I also have a different application in mind. I do want to get translations for the doc strings of the standard library; in fact, that is what the python domain in the translation project has at the moment. The application here is that the help() function should present the translation of the doc string if available.

Regards,
Martin

From martin@loewis.home.cs.tu-berlin.de Mon Aug 13 06:35:17 2001
From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis)
Date: Mon, 13 Aug 2001 07:35:17 +0200
Subject: [I18n-sig] Re: pygettext.py extraction of docstrings
In-Reply-To: <15223.18129.372008.719610@anthem.wooz.org> (barry@wooz.org)
References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15223.18129.372008.719610@anthem.wooz.org>
Message-ID: <200108130535.f7D5ZHb00876@mira.informatik.hu-berlin.de>

> - Mention the existence of pygettext.py for extracting translatable
>   strings in Python.

To my knowledge, Bruno just put a section in the gettext manual explaining gettext usage with various (programming) languages. The Python entry there does mention pygettext.py.

> - Point to Python's gettext module documentation for more details on
>   i18n'ing Python programs. This should be a fairly stable url:
>
>   http://www.python.org/doc/current/lib/module-gettext.html

I don't think it has this link, yet. But then, URL-style links are infrequent in texinfo documentation. Instead, (python)gettext might be a better link.

> - Make sure that the other GNU gettext tools recognize the docstring
>   flag, in whatever way is meaningful (I'm not sure what would be
>   useful or not... ;).

At a minimum, msgmerge should preserve them.

Regards,
Martin

From barry@zope.com Mon Aug 13 06:58:36 2001
From: barry@zope.com (Barry A. Warsaw)
Date: Mon, 13 Aug 2001 01:58:36 -0400
Subject: [I18n-sig] Re: pygettext dilemma
References: <15200.64763.772001.53387@anthem.wooz.org> <15223.19649.811672.585574@anthem.wooz.org> <200108130530.f7D5Upm00873@mira.informatik.hu-berlin.de>
Message-ID: <15223.27788.252177.636376@anthem.wooz.org>

>>>>> "MvL" == Martin v Loewis writes:

    >> I personally don't feel like it's that big a problem. So far,
    >> in my experience the only docstrings that really need to be
    >> extracted are module docstrings in command line scripts.

    MvL> I disagree somewhat, but I also have a different application
    MvL> in mind. I do want to get translations for the doc strings of
    MvL> the standard library; in fact, that is what the python domain
    MvL> in the translation project has at the moment. The application
    MvL> here is that the help() function should present the
    MvL> translation of the doc string if available.

That's a good point (and would be neat!) but in that case, wouldn't you want all the docstrings to be extracted? I.e. you wouldn't want to just extract some docstrings in a module, but not all?

-Barry

From barry@wooz.org Mon Aug 13 07:01:17 2001
From: barry@wooz.org (Barry A. Warsaw)
Date: Mon, 13 Aug 2001 02:01:17 -0400
Subject: [I18n-sig] Re: pygettext.py extraction of docstrings
References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15223.18129.372008.719610@anthem.wooz.org> <200108130535.f7D5ZHb00876@mira.informatik.hu-berlin.de>
Message-ID: <15223.27949.881781.720311@anthem.wooz.org>

>>>>> "MvL" == Martin v Loewis writes:

    >> - Mention the existence of pygettext.py for extracting
    >> translatable strings in Python.

    MvL> To my knowledge, Bruno just put a section in the gettext
    MvL> manual explaining gettext usage with various (programming)
    MvL> languages. The Python entry there does mention pygettext.py.

Ah cool, I was only looking at the online documentation at gnu.org, which claims it's the 30-Apr-1998 edition (a bit out-dated, eh? :).

    >> - Point to Python's gettext module documentation for more
    >> details on i18n'ing Python programs. This should be a fairly
    >> stable url:
    >> http://www.python.org/doc/current/lib/module-gettext.html

    MvL> I don't think it has this link, yet. But then, URL-style
    MvL> links are infrequent in texinfo documentation. Instead,
    MvL> (python)gettext might be a better link.

>> - Make sure that the other GNU gettext tools recognize the >> docstring flag, in whatever way is meaningful (I'm not sure >> what would be useful or not... ;). MvL> At a minimum, msgmerge should preserve them. Good, thanks. -Barry From martin@loewis.home.cs.tu-berlin.de Mon Aug 13 08:05:30 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Mon, 13 Aug 2001 09:05:30 +0200 Subject: [I18n-sig] Re: pygettext dilemma In-Reply-To: <15223.27788.252177.636376@anthem.wooz.org> (barry@zope.com) References: <15200.64763.772001.53387@anthem.wooz.org> <15223.19649.811672.585574@anthem.wooz.org> <200108130530.f7D5Upm00873@mira.informatik.hu-berlin.de> <15223.27788.252177.636376@anthem.wooz.org> Message-ID: <200108130705.f7D75UF01419@mira.informatik.hu-berlin.de> > That's a good point (and would be neat!) but in that case, wouldn't > you want all the docstrings to be extracted? I.e. you wouldn't want > to just extract some docstrings in a module, but not all? Certainly, yes. In fact, I hacked Fran=E7ois' xpot to find foo__doc__[] strings in C sources also, since those doc strings are probably the ones that people are most frequently confronted with. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Mon Aug 13 08:02:48 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Mon, 13 Aug 2001 09:02:48 +0200 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings In-Reply-To: <15223.27949.881781.720311@anthem.wooz.org> (barry@wooz.org) References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15223.18129.372008.719610@anthem.wooz.org> <200108130535.f7D5ZHb00876@mira.informatik.hu-berlin.de> <15223.27949.881781.720311@anthem.wooz.org> Message-ID: <200108130702.f7D72mk01417@mira.informatik.hu-berlin.de> > Ah cool, I was only looking at the online documentation at gnu.org, > which claims it's the 30-Apr-1998 edition (a bit out-dated, eh? :). Yes, maintainance of gnu.org always leaves a lot to be desired. That aside, the changes I was talking about have not been released in gettext, yet; we probably should work on updating gnu.org once the new gettext manual is released. Regards, Martin From pinard@iro.umontreal.ca Mon Aug 13 13:10:56 2001 From: pinard@iro.umontreal.ca (=?iso-8859-1?q?Fran=E7ois?= Pinard) Date: 13 Aug 2001 08:10:56 -0400 Subject: [I18n-sig] Re: pygettext dilemma In-Reply-To: <15223.19649.811672.585574@anthem.wooz.org> References: <15200.64763.772001.53387@anthem.wooz.org> <15223.19649.811672.585574@anthem.wooz.org> Message-ID: [Barry A. Warsaw] > Indeed! BTW, I18N Mailman is coming along very nicely now. I hope > the 2.1 release will happen within the next few months. I have a few friends who are impatiently waiting for this release! :-) > What I do in this situation is to temporarily bind _() to a no-op > function so that the string is marked for extraction, but not > translated in place. E.g. > import gettext > def _(s): > return s > foo = _('extract this string but do not translate it yet') > _ = gettext.gettext No hurt intended of course, you should know be better :-). Let me friendly stress that constructs like above are ugly. 
We should set up examples, that people could follow, in which we rely on a single, common, widespread, unvarying interpretation of _(TEXT), without having to look around each time to see what it means, or set and reset its meaning. The above is a kludge that does not fit well with what I think is good Python style. > This works perfectly because Python doesn't suffer from the same > deficiencies as C (i.e. the C pre-processor :). I quite understand that "it works", but yet, it much suffers, both on the side of legibility and simplicity. > | ''"""TEXT""" 8-quoted marked > This has been brought up before, and I know that some people really > like this approach. I don't though, because 1) it is too magical; 2) > the rules are arbitrary and hard to remember; 3) explicit is better > than implicit. As long `pygettext.py' (or `xgettext' or `xpot') is involved, there is some unavoidable magic somewhere. Even _(TEXT) does not give much clue to a newcomer about the mandatory extraction process. About the idiom of prefixing a string with two quotes of the other kind, I find it quite easy to explain and remember. > Seeing something like an unadorned ""'Traditional Chinese' really > gives no clue as to the purpose of this strange markup, In my opinion, this is equally opaque to use _(TEXT) after having temporarily redefined _() as the identify function. It only acquire meaning to a user after s/he learns about the extraction process, you just cannot make it evident. The explanation is unavoidable, anyway. Redefining _() is a formidable stunt. Concatenating an empty string is much simpler and cleaner. > Or, you can sometimes do something ugly like use explicit > __doc__ = _('Here is a module docstring') > Not pretty, but also not common I think, so it doesn't concern me much. Let's avoid being ugly, as far as we can. Keep in mind that you are opening a way, here, and setting up examples and methods that will stick, and have incidence. (One never knows. 
When I started to use `_' instead of explicit `gettext' calls, most people were reluctant, and told me that it was to break with so many C compilers that I should give up now; Richard Stallman just refused to see GNU standards suggesting it; but I used it nevertheless and for many packages, to the point it stuck somewhat; nowadays, many languages spontaneously use conventions similar to it.) My point is that you should look forward and a little beyond the immediate needs. Even if does not concern you much, let's try to do well. > I appreciate the suggestions Francois! I think what we've got gives us > the best approach for Python programs. I would not want to crusade inordinately over this, and I'm not really trying to punch _my_ own suggestions through. Really not! On the other hand, I would like to convince you that temporarily overriding _(), or assigning the __doc__ attribute directly, just _cannot_ be "the best approach". We should do better than that. My suggestion does better already, but I see we do not agree on this, a bit sadly... I surely do not mind if someone comes with something even better that what we both suggest, and do hope it happens! But we should at least come with something as good. Keep happy! -- François Pinard http://www.iro.umontreal.ca/~pinard From keichwa@gmx.net Mon Aug 13 07:40:16 2001 From: keichwa@gmx.net (Karl Eichwalder) Date: Mon, 13 Aug 2001 08:40:16 +0200 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings In-Reply-To: <200108130535.f7D5ZHb00876@mira.informatik.hu-berlin.de> ("Martin v. Loewis"'s message of "Mon, 13 Aug 2001 07:35:17 +0200") References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15223.18129.372008.719610@anthem.wooz.org> <200108130535.f7D5ZHb00876@mira.informatik.hu-berlin.de> Message-ID: "Martin v. Loewis" writes: > I don't think it has this link, yet. 
But then, URL-style links are > infrequent in texinfo documentation. Yes, they are infrequent, but with the advent of Texinfo 4.x those references are perfectly okay; search for 'uref', please. > Instead, (python)gettext might be a better link. Just provide both links. -- ke@suse.de (work) / keichwa@gmx.net (home): | http://www.suse.de/~ke/ | ,__o Free Translation Project: | _-\_<, http://www.iro.umontreal.ca/contrib/po/HTML/ | (*)/'(*) From haible@ilog.fr Tue Aug 14 18:18:27 2001 From: haible@ilog.fr (Bruno Haible) Date: Tue, 14 Aug 2001 19:18:27 +0200 (CEST) Subject: [I18n-sig] Re: pygettext.py extraction of docstrings In-Reply-To: <15223.18129.372008.719610@anthem.wooz.org> References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15223.18129.372008.719610@anthem.wooz.org> Message-ID: <15225.23907.685771.255536@honolulu.ilog.fr> Barry A. Warsaw writes: > A while back I was convinced to switch the `docstring' flag to #, for > pygettext. ... It probably only has > meaning in Python, but may be useful in other scripting languages. > Think of it roughly equivalent to Emacs-Lisp docstrings (in fact, > they were the inspiration for Python docstrings back in '94 at the > 1st Python workshop!) Well, Common Lisp has had docstrings long before Emacs-Lisp and Python. Their purpose is to have documentation available for the programmer, in a running session, regardless where each class or function came from. Now, why do you want to translate them? As gettext maintainer, I'm used to think in the categories of programmer - translator - user. Translated docstrings are not for the users, because users are not programmers in general. And the programmers (of .py programs), who must have looked at the various Python manuals, certainly reads English. 
So, as I see it,

  - translated docstrings have a much smaller audience than
    usual translated messages,
  - translated docstring users could also use the untranslated
    English docstrings,
  - docstrings are harder to translate, because the translator
    needs to have programmer's know-how.

Therefore I think that docstring translation is a separate process
from usual translations, and should use different .po files.

As a consequence for gettext, I could live with an xgettext option
--docstrings which extracts *only* the docstrings of a set of source
files.

> Perhaps Bruno can add some information on pygettext.py in
> the GNU gettext manual?

The GNU gettext tools are currently being modified to handle various
programming languages. A new flag 'python-format' is being
introduced, with appropriate format string checking in 'msgfmt'.
xgettext will also have a Python backend, making pygettext obsolete
(except for docstring extraction, for the time being).

Bruno

From martin@loewis.home.cs.tu-berlin.de Tue Aug 14 20:17:22 2001
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Tue, 14 Aug 2001 21:17:22 +0200
Subject: [I18n-sig] Re: pygettext.py extraction of docstrings
In-Reply-To: <15225.23907.685771.255536@honolulu.ilog.fr> (message from Bruno Haible on Tue, 14 Aug 2001 19:18:27 +0200 (CEST))
References: <14840.35473.307059.990479@anthem.concentric.net>
	<200010272228.AAA01066@loewis.home.cs.tu-berlin.de>
	<15113.29005.357449.812516@anthem.wooz.org>
	<15117.38438.361043.255768@anthem.wooz.org>
	<15223.18129.372008.719610@anthem.wooz.org>
	<15225.23907.685771.255536@honolulu.ilog.fr>
Message-ID: <200108141917.f7EJHMp02235@mira.informatik.hu-berlin.de>

> As gettext maintainer, I'm used to think in the categories of
> programmer - translator - user. Translated docstrings are not for
> the users, because users are not programmers in general. And the
> programmers (of .py programs), who must have looked at the various
> Python manuals, certainly reads English.
This is a wrong assumption; people writing programs in Python not necessarily read fluently English (let alone speaking it). I assume the same is true for any other "scripting" language. E.g. for Ruby, much of the language documentation is in Japanese, since most of the Ruby users prefer to read Japanese documentation. Likewise, the French translation of the Python documentation was started precisely because users don't read English that well. Even among my colleagues, I find that they often mis-interpret English documentation, and get the fine points only when pointed to them, and after looking up certain keywords in a dictionary. They would not have the same problems if the documentation was available in German. So in your categories, these people are certainly users - of Python, in the specific case. > - translated docstrings have a much smaller audience than > usual translated messages, In addition to the above, I think you are missing an important detail of Python's introspectiveness: Many Python applications present docstrings to the user, instead of using them for documentation, by means of accessing some object's __doc__ attribute at runtime. E.g. you might have a drop-down menu, each item invoking a different function. Then somebody might chose to key the online help into the docstring. It is somewhat hackish, but common. > - docstrings are harder to translate, because the translator > needs to have programmer's know-how. For the original purpose of docstrings, yes, certainly. > As a consequence for gettext, I could live with an xgettext option > --docstrings which extracts *only* the docstrings of a set of source > files. Again, for the application I have in mind (providing online help in the progamming process), that is acceptable. I think for Barry's application, it is not. > The GNU gettext tools are currently being modified to handle various > programming languages. 
A new flag 'python-format' is being > introduced, with appropriate format string checking in 'msgfmt'. > xgettext will also have a Python backend, making pygettext obsolete > (except for docstring extraction, for the time being). It turns out that there is a "batteries included" issue here. I know a few cases where people have been using pygettext just because it was already on their (Windows) system, whereas GNU gettext was not that readily available (you'd need a C compiler to build it). So while most Unix people will switch to GNU gettext for performance reasons (pygettext is slow), I doubt that pygettext will go away anytime soon. Regards, Martin From Misha.Wolf@reuters.com Tue Aug 14 21:35:44 2001 From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com) Date: Tue, 14 Aug 2001 21:35:44 +0100 Subject: [I18n-sig] 19th Unicode Conference, Sep 2001, San Jose, CA -- Register now! Message-ID: Nineteenth International Unicode Conference (IUC19) Unicode and the Web: The Global Connection http://www.unicode.org/iuc/iuc19 September 10-14, 2001 San Jose, CA, USA >> Register now! << * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * NEWS >> Hotel guest room group rate extended to August 31. >> Early Bird registration rate extended to August 31. >> Visit the Conference Web site ( http://www.unicode.org/iuc/iuc19 ) to check the updated Conference program and register. To help you choose Conference sessions, we've included abstracts of talks and speakers' biographies. CONFERENCE SPONSORS Agfa Monotype Corporation Basis Technology Corporation Lionbridge Technologies Microsoft Corporation Netscape Communications Oracle Corporation PeopleSoft, Inc. Reuters Ltd. Sun Microsystems, Inc. Trados Corporation Trigeminal Software, Inc. 
World Wide Web Consortium (W3C) Wrox Press CONFERENCE VENUE DoubleTree Hotel San Jose 2050 Gateway Place San Jose, CA 95110 USA Tel: +1 408 453 4000 Fax: +1 408 437 2898 GLOBAL COMPUTING SHOWCASE Visit the Showcase to find out more about products supporting the Unicode Standard, and products and services that can help you globalize/localize your software, documentation and Internet content. For details, visit the Conference Web site: http://www.unicode.org/iuc/iuc19 Exhibitors to date include: * Agfa Monotype Corporation * Basis Technology Corporation * Everlasting Systems Ltd. * Multilingual Computing, Inc. * Oracle Corporation * Rasmussen Software, Inc. * Sun Microsystems, Inc. * Segue Software * Sybase, Inc. * Symbio Group * Trados Corporation CONFERENCE MANAGEMENT Global Meeting Services Inc. 4360 Benhurst Avenue San Diego, CA 92122, USA Tel: +1 858 638 0206 (voice) +1 858 638 0504 (fax) Email: info@global-conference.com or: conference@unicode.org * * * * * Unicode(r) and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission. ----------------------------------------------------------------- Visit our Internet site at http://www.reuters.com Any views expressed in this message are those of the individual sender, except where the sender specifically states them to be the views of Reuters Ltd. 
From haible@ilog.fr Tue Aug 14 21:51:50 2001 From: haible@ilog.fr (Bruno Haible) Date: Tue, 14 Aug 2001 22:51:50 +0200 (CEST) Subject: [I18n-sig] Re: pygettext.py extraction of docstrings In-Reply-To: <200108141917.f7EJHMp02235@mira.informatik.hu-berlin.de> References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15223.18129.372008.719610@anthem.wooz.org> <15225.23907.685771.255536@honolulu.ilog.fr> <200108141917.f7EJHMp02235@mira.informatik.hu-berlin.de> Message-ID: <15225.36710.997827.802409@honolulu.ilog.fr> Martin v. Loewis writes: > people writing programs in Python not > necessarily read fluently English ... Likewise, the French > translation of the Python documentation was started precisely because > users don't read English that well. OK, let me formulate it less strictly: Programmer's documentation is usually translated to much fewer languages (Japanese, French and very few others), because the amount of text to translate is quite large and the translator must have programmer's know-how. > Many Python applications present > docstrings to the user, instead of using them for documentation, by > means of accessing some object's __doc__ attribute at > runtime. E.g. you might have a drop-down menu, each item invoking a > different function. Then somebody might chose to key the online help > into the docstring. It is somewhat hackish, but common. It is pure Lisp introspection tradition :-) But nevertheless, it presents a problem: How can the translator know which docstrings are important to translate for the end user, and which are not? The danger is that a translator for Finnish, Turkish or Romanian, without deep programming knowledge, will spend a lot of his time translating programmer's documentation, which won't help the end users of his country. 
There are not many translators for these languages; we shouldn't
abuse them.

IMO, those __doc__ strings that are used at runtime should be
explicitly marked as translatable by the programmer, to avoid excess
work by the translator. The way you mark them doesn't really matter;
it can be a tag in a comment, or something else that triggers
xgettext extraction.

> It turns out that there is a "batteries included" issue here. I know a
> few cases where people have been using pygettext just because it was
> already on their (Windows) system, whereas GNU gettext was not that
> readily available (you'd need a C compiler to build it).

You can point these people to the http://gnuwin32.sourceforge.net/
site which has gettext binaries for Win32 ready for download.

Bruno

From barry@zope.com Wed Aug 15 04:53:11 2001
From: barry@zope.com (Barry A. Warsaw)
Date: Tue, 14 Aug 2001 23:53:11 -0400
Subject: [I18n-sig] Re: pygettext dilemma
References: <15200.64763.772001.53387@anthem.wooz.org>
	<15223.19649.811672.585574@anthem.wooz.org>
Message-ID: <15225.61991.262726.993033@anthem.wooz.org>

>>>>> "FP" == François Pinard writes:

    >> Indeed! BTW, I18N Mailman is coming along very nicely now. I
    >> hope the 2.1 release will happen within the next few months.

    FP> I have a few friends who are impatiently waiting for this
    FP> release! :-)

Soon, soon!

    >> What I do in this situation is to temporarily bind _() to a
    >> no-op function so that the string is marked for extraction, but
    >> not translated in place. E.g.

    >> import gettext
    >> def _(s): return s
    >> foo = _('extract this string but do not translate it yet')
    >> _ = gettext.gettext

    FP> No hurt intended of course, you should know be better :-).
    FP> Let me friendly stress that constructs like above are ugly.
    FP> We should set up examples, that people could follow, in which
    FP> we rely on a single, common, widespread, unvarying
    FP> interpretation of _(TEXT), without having to look around each
    FP> time to see what it means, or set and reset its meaning. The
    FP> above is a kludge that does not fit well with what I think is
    FP> good Python style.

No hurt taken, but I'll respectfully disagree. :) I think it's fine
Python style for deferring translation, and not confusing at all
because it is almost always localized around the site of the
deferral.

But contrary to Tim's license plate, there /is/ more than one way to
do it. :) pygettext.py supports a -k/--keyword flag, similar to
xgettext, which expands the list of function names marking
translatable strings. IIRC, gettext suggests binding N_() to
gettext_noop() and then extracting any string wrapped in N_(). So, if
you prefer, you can rewrite my example above to be:

    from gettext import gettext as _
    def N_(s): return s
    foo = N_('extract this string but do not translate it yet')

and then run pygettext.py with --keyword=N_

Hmm, maybe we should add "N_" as one of the default keywords?

That points out a general philosophy I have that pygettext.py should
mimic xgettext as much as makes sense for the difference between C
and Python. In this case _() works great for most at-site translation
markings, but for the very few that must be deferred, either the
rebind hack or the N_() marking should suffice.

    >> This works perfectly because Python doesn't suffer from the
    >> same deficiencies as C (i.e. the C pre-processor :).

    FP> I quite understand that "it works", but yet, it much suffers,
    FP> both on the side of legibility and simplicity.

Again, I must respectfully disagree!

    >> | ''"""TEXT""" 8-quoted marked

    >> This has been brought up before, and I know that some people
    >> really like this approach.
    >> I don't though, because 1) it is
    >> too magical; 2) the rules are arbitrary and hard to remember;
    >> 3) explicit is better than implicit.

    FP> As long `pygettext.py' (or `xgettext' or `xpot') is involved,
    FP> there is some unavoidable magic somewhere. Even _(TEXT) does
    FP> not give much clue to a newcomer about the mandatory
    FP> extraction process.

This is true. But it's still clearer that there is /some/ reason for
marking the string with _() because you can quickly trace your way
back to gettext.gettext(), and then the connection to the runtime
translation process is obvious, if not the connection to the
extraction process.

Which leads me to another question: are you saying that
''"""Text""" should be used for both the runtime translating and the
extraction marking? If so, I don't see how that could work. Even if
you could make it work, I still much prefer to have a Real Python
Function do the runtime translation. An example of why is what I
really do in Mailman...

Say I have the following string that needs to be translated:

    _('No such list %s found on host %s') % (listname, hostname)

Now we all know that this won't do as a source string because some
languages may change the order of the variables, so we really need to
write the string like so:

    _('No such list %(listname)s found on host %(hostname)s') % {
        'listname': listname,
        'hostname': hostname
        }

I've found this style to be quite pervasive, but also extremely (and
unnecessarily) repetitive. Notice that I've typed "listname" and
"hostname" a total of six times. Wouldn't it be wonderful if I only
needed to type them once:

    _('No such list %(listname)s found on host %(hostname)s')

?

Yes, it's great because -- to me -- I'm trading a modicum of
specialness for a huge raft of simplicity and legibility. It really
does make the code easier to read, I claim (although it would be
interesting to know what others who have hacked on the Mailman 2.1
code think).

How do I make this work?
The trick is that the function _() isn't gettext.gettext() but a
wrapper around that library function that's unique to Mailman. In
fact, you won't see many "import gettext"'s in the Mailman code, but
you will see lots of "from Mailman.i18n import _".

My _() actually uses sys._getframe() -- where available -- to get the
locals and globals one stack frame up from the _() frame, and then
automatically interpolates that dictionary into the translatable
string.

Is that magic? Yes, a bit, but it's magic that is easily revealed by
finding the import, and viewing the Mailman.i18n module. And once
learned, I claim that it's immediately ingrained and needn't be
learned again.

But you might disagree, and use the more verbose approach for your
app. No problem there! Having a function call that can be specialized
in the Pythonic way serves both purposes well.

    FP> About the idiom of prefixing a string with two quotes of the
    FP> other kind, I find it quite easy to explain and remember.

I had to really think about the rule, as opposed to the example, in
your original message. I think your rule goes: prepend the string you
want to extract with an empty string quoted with the alternative
quoting characters from the string you want to extract. Or something
like that. :)

But there is another problem: for some fonts in some IDE's it can be
challenging to discern ' from " or even ` and having something like
""'''...''' makes it even more difficult to visually pick out.

    >> Seeing something like an unadorned ""'Traditional Chinese'
    >> really gives no clue as to the purpose of this strange markup,

    FP> In my opinion, this is equally opaque to use _(TEXT) after
    FP> having temporarily redefined _() as the identify function. It
    FP> only acquire meaning to a user after s/he learns about the
    FP> extraction process, you just cannot make it evident. The
    FP> explanation is unavoidable, anyway. Redefining _() is a
    FP> formidable stunt. Concatenating an empty string is much
    FP> simpler and cleaner.
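
The frame-inspecting _() wrapper described earlier in this message
can be sketched roughly as follows. This is only an illustration
under stated assumptions, not the actual Mailman.i18n code: it
assumes sys._getframe() is available, it calls plain
gettext.gettext() with no catalog installed (so the msgid comes back
unchanged), and report() is a made-up caller.

```python
import gettext
import sys


def _(s):
    """Translate s, then interpolate the caller's variables into it."""
    # One stack frame up from _() is whoever called _().
    frame = sys._getframe(1)
    # Build a single namespace; the caller's locals shadow its globals.
    namespace = dict(frame.f_globals)
    namespace.update(frame.f_locals)
    # Translate first, so the translator sees the %(name)s placeholders.
    return gettext.gettext(s) % namespace


def report(listname, hostname):
    # Hypothetical caller: the names are typed only once, in the string.
    return _('No such list %(listname)s found on host %(hostname)s')


print(report('test', 'example.com'))
```

One consequence of this design is that every marked string goes
through the % operator, so a literal percent sign in a message would
have to be written as %%.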
Let me see if I can sum up my objection: you have to use a function
call anyway to do the actual runtime translation. Since at-site
translations will be the overwhelming majority of examples, so will
_() markings. For all those cases, you won't need
empty-string-concatenation anyway. For the handful of cases where you
need to defer translation, I prefer using a technique as similar to
the common way as possible, instead of introducing an entirely
different convention.

But I wouldn't cry foul if you encouraged N_() markings for deferred
translations.

    >> Or, you can sometimes do something ugly like use explicit
    >> __doc__ = _('Here is a module docstring')
    >> Not pretty, but also not common I think, so it doesn't concern
    >> me much.

    FP> Let's avoid being ugly, as far as we can. Keep in mind that
    FP> you are opening a way, here, and setting up examples and
    FP> methods that will stick, and have incidence. (One never
    FP> knows. When I started to use `_' instead of explicit
    FP> `gettext' calls, most people were reluctant, and told me that
    FP> it was to break with so many C compilers that I should give up
    FP> now; Richard Stallman just refused to see GNU standards
    FP> suggesting it; but I used it nevertheless and for many
    FP> packages, to the point it stuck somewhat; nowadays, many
    FP> languages spontaneously use conventions similar to it.)

And I think it's a wonderful convention! I'm glad you came up with
it, and I happily adopted it for Python. It's beautiful. :)

I won't disagree that the __doc__ hack is ugly. The more I think
about it, the more I think a magic comment in front of the docstring
is the way to go. I'm not yet sure whether something like

    # noextract
    '''This is a docstring that need not be translated.'''

or

    # extract
    '''This is a docstring that should be translated.'''

is better, or whether there's some other better comment keyword to
use. This would be worth experimenting with a bit.
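
As a way of experimenting with the "# extract" variant of that
magic-comment idea, here is a minimal sketch built on the standard
ast module. It is an assumption-laden illustration, not something
pygettext.py or xgettext implements: marked_docstrings() and SAMPLE
are hypothetical names, the marker must be the first word of a
comment on the line directly above the docstring, and Python 3.8 or
later is assumed so that ast reports the line where the string
starts.

```python
import ast


def marked_docstrings(source, marker="extract"):
    """Return docstrings whose preceding line is a `# <marker>` comment."""
    lines = source.splitlines()
    found = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.Module, ast.ClassDef,
                                 ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        first = node.body[0] if node.body else None
        if not (isinstance(first, ast.Expr)
                and isinstance(first.value, ast.Constant)
                and isinstance(first.value.value, str)):
            continue  # this module/class/function has no docstring
        if first.lineno < 2:
            continue  # docstring is on line 1, so no comment can precede it
        above = lines[first.lineno - 2].strip()
        # Only the first word of the comment matters; the rest is ignored.
        words = above[1:].split() if above.startswith("#") else []
        if words and words[0] == marker:
            found.append(ast.get_docstring(node))
    return found


SAMPLE = '''\
# extract
"""Usage: prog [options]"""

def helper():
    """Internal helper, left unmarked."""

class Greeter:
    # extract
    """Shown to the end user."""
'''

print(marked_docstrings(SAMPLE))
```

Passing marker="noextract" and inverting the final test would give
the opt-out variant instead; which default is friendlier to
translators is exactly the open question.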
    FP> My point is that you should look forward and a little beyond
    FP> the immediate needs. Even if does not concern you much, let's
    FP> try to do well.

Agreed!

    >> I appreciate the suggestions Francois! I think what we've got
    >> gives us the best approach for Python programs.

    FP> I would not want to crusade inordinately over this, and I'm
    FP> not really trying to punch _my_ own suggestions through.
    FP> Really not! On the other hand, I would like to convince you
    FP> that temporarily overriding _(), or assigning the __doc__
    FP> attribute directly, just _cannot_ be "the best approach".

Let's not conflate what we're talking about. One situation is
deferred translation, the other is docstring extraction marking. For
the former, I'm completely happy with rebinding _(), although I
wouldn't squawk if you pushed for N_(). For the latter, I agree that
explicit __doc__ binding is gross and we should avoid it. Here, I
think the special comment is the way to go, but I'm not sure about
the details.

Please let's keep these two issues separate!

    FP> We should do better than that. My suggestion does better
    FP> already, but I see we do not agree on this, a bit sadly... I
    FP> surely do not mind if someone comes with something even better
    FP> that what we both suggest, and do hope it happens! But we
    FP> should at least come with something as good.

A good, lively debate. Thanks!

Cheers,
-Barry

From barry@wooz.org Wed Aug 15 05:10:24 2001
From: barry@wooz.org (Barry A.
Warsaw) Date: Wed, 15 Aug 2001 00:10:24 -0400 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15223.18129.372008.719610@anthem.wooz.org> <15225.23907.685771.255536@honolulu.ilog.fr> Message-ID: <15225.63024.291621.844755@anthem.wooz.org> >>>>> "BH" == Bruno Haible writes: BH> Well, Common Lisp has had docstrings long before Emacs-Lisp BH> and Python. Their purpose is to have documentation available BH> for the programmer, in a running session, regardless where BH> each class or function came from. BH> Now, why do you want to translate them? In my Python experience, there is one common situation where you want to translate docstrings. Note that in Python, unlike I believe as in *Lisp, docstrings can be attached to objects other than functions. It's common to have both module and class docstrings. Again, IME class docstrings serve similar audiences to function/method docstrings, i.e. the programmer. It is a common idiom in Python to use module docstrings as usage text for command line scripts, and those are definitely intended for the end user, and must be translated. I've had occasion to use class docstrings as strings for the user too, although I won't claim that's wonderful Python style. But as Martin points out, a case can be made for translating even class, function, and method docstrings. Think of the situation where manuals are automatically extracted from source code, a la Javadoc. I believe you'd want those strings to be extracted into the catalog. BH> As a consequence for gettext, I could live with an xgettext BH> option --docstrings which extracts *only* the docstrings of a BH> set of source files. 
I made the semantics for pygettext.py's --docstrings/-D option to extract /also/ the docstrings because the older version of msgmerge I am using can't merge a docstring-only catalog with a normal-string catalog in a reasonable way (I tried). And as stated above, module docstrings can serve exactly the same audience as other translatable strings, i.e. the end user, so they should be in the same catalog. But I was also forced to add a very inelegant -X/--exclude-file switch which suppressed docstring extract for the listed files. While that served my purpose, it's a gross hack, and not just because it doesn't provide the necessary granularity. More productive I think would be for us to agree on a convention for extracting docstrings that doesn't require both -D and -X. Here are two strawmen: 1) pygettext.py and xgettext never extract unmarked docstrings unless the -D/--docstrings option is given. If -D is given then all unmarked docstrings are extracted along with all other normally marked text, unless the unmarked docstring is immediately preceded by a comment with the word "notranslate" as the first word in the comment. All other words in the comment are ignored. 2) pygettext.py and xgettext never extract unmarked docstrings unless they are immediately preceded by a comment with the word "translate" as the first word in the comment. All other words in the comment are ignored. Feel free to knock these down. :) >> Perhaps Bruno can add some information on pygettext.py in the >> GNU gettext manual? BH> The GNU gettext tools are currently being modified to handle BH> various programming languages. A new flag 'python-format' is BH> being introduced, with appropriate format string checking in BH> 'msgfmt'. I'm not sure exactly what this means. Can you give a bit more detail? BH> xgettext will also have a Python backend, making pygettext BH> obsolete (except for docstring extraction, for the time BH> being). That'd be great. 
It'll be even cooler if we can agree on a convention for docstring extraction! BTW, here's the current set of switches for pygettext.py. Do you see any glaring incompatibilities with your latest xgettext? Cheers, -Barry

-------------------- snip snip --------------------
Usage: pygettext [options] inputfile ...

Options:

    -a
    --extract-all
        Extract all strings.

    -d name
    --default-domain=name
        Rename the default output file from messages.pot to name.pot.

    -E
    --escape
        Replace non-ASCII characters with octal escape sequences.

    -D
    --docstrings
        Extract module, class, method, and function docstrings. These do
        not need to be wrapped in _() markers, and in fact cannot be for
        Python to consider them docstrings. (See also the -X option).

    -h
    --help
        Print this help message and exit.

    -k word
    --keyword=word
        Keywords to look for in addition to the default set, which are:
        %(DEFAULTKEYWORDS)s

        You can have multiple -k flags on the command line.

    -K
    --no-default-keywords
        Disable the default set of keywords (see above). Any keywords
        explicitly added with the -k/--keyword option are still recognized.

    --no-location
        Do not write filename/lineno location comments.

    -n
    --add-location
        Write filename/lineno location comments indicating where each
        extracted string is found in the source. These lines appear before
        each msgid. The style of comments is controlled by the -S/--style
        option. This is the default.

    -o filename
    --output=filename
        Rename the default output file from messages.pot to filename. If
        filename is `-' then the output is sent to standard out.

    -p dir
    --output-dir=dir
        Output files will be placed in directory dir.

    -S stylename
    --style stylename
        Specify which style to use for location comments. Two styles are
        supported:

        Solaris  # File: filename, line: line-number
        GNU      #: filename:line

        The style name is case insensitive. GNU style is the default.

    -v
    --verbose
        Print the names of the files being processed.

    -V
    --version
        Print the version of pygettext and exit.

    -w columns
    --width=columns
        Set width of output to columns.

    -x filename
    --exclude-file=filename
        Specify a file that contains a list of strings that are not to be
        extracted from the input files. Each string to be excluded must
        appear on a line by itself in the file.

    -X filename
    --no-docstrings=filename
        Specify a file that contains a list of files (one per line) that
        should not have their docstrings extracted. This is only useful in
        conjunction with the -D option above.

If `inputfile' is -, standard input is read.

From barry@wooz.org Wed Aug 15 05:15:14 2001 From: barry@wooz.org (Barry A. Warsaw) Date: Wed, 15 Aug 2001 00:15:14 -0400 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15223.18129.372008.719610@anthem.wooz.org> <15225.23907.685771.255536@honolulu.ilog.fr> <200108141917.f7EJHMp02235@mira.informatik.hu-berlin.de> Message-ID: <15225.63314.692441.773936@anthem.wooz.org> >>>>> "MvL" == Martin v Loewis writes: >> As a consequence for gettext, I could live with an xgettext >> option --docstrings which extracts *only* the docstrings of a >> set of source files. MvL> Again, for the application I have in mind (providing online MvL> help in the programming process), that is acceptable. I think MvL> for Barry's application, it is not. Correct, as described in my other message. >> The GNU gettext tools are currently being modified to handle >> various programming languages. A new flag 'python-format' is >> being introduced, with appropriate format string checking in >> 'msgfmt'. xgettext will also have a Python backend, making >> pygettext obsolete (except for docstring extraction, for the >> time being). MvL> It turns out that there is a "batteries included" issue MvL> here.
I know a few cases where people have been using MvL> pygettext just because it was already on their (Windows) MvL> system, whereas GNU gettext was not that readily available MvL> (you'd need a C compiler to build it). So while most Unix MvL> people will switch to GNU gettext for performance reasons MvL> (pygettext is slow), I doubt that pygettext will go away MvL> anytime soon. I agree. That's a good reason why Python also comes with its own msgfmt.py script (Side note: thank you thank you thank you for documenting .mo and .po file formats! What a pain it was to reverse engineer the undocumented Solaris formats. :). At the very least, we should make sure that xgettext will sufficiently fulfill the needs of Python programmers. It's easier for us to prototype the Python idiosyncrasies in pygettext, and then describe our experiences so xgettext can support them. Also, the more we can agree on common conventions now, the better in the long run. Cheers, -Barry From barry@wooz.org Wed Aug 15 05:19:49 2001 From: barry@wooz.org (Barry A. Warsaw) Date: Wed, 15 Aug 2001 00:19:49 -0400 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15223.18129.372008.719610@anthem.wooz.org> <15225.23907.685771.255536@honolulu.ilog.fr> <200108141917.f7EJHMp02235@mira.informatik.hu-berlin.de> <15225.36710.997827.802409@honolulu.ilog.fr> Message-ID: <15225.63589.862675.901114@anthem.wooz.org> >>>>> "BH" == Bruno Haible writes: BH> The danger is that a translator for Finnish, Turkish or BH> Romanian, without deep programming knowledge, will spend a lot BH> of his time translating programmer's documentation, which BH> won't help the end users of his country. There are not many BH> translators for these languages; we shouldn't abuse them. 
BH> IMO, those __doc__ strings that are used at runtime should be BH> explicitly marked as translatable by the programmer, to avoid BH> excess work by the translator. The way you mark them doesn't BH> really matter; it can be a tag in a comment, or something else BH> that triggers xgettext extraction. I think you're on the right track, but I think that Martin's and my applications show that we probably need to cover these two situations: 1) No docstrings are extracted unless they are preceded by a magic "extract" comment. 2) All docstrings are extracted unless they are preceded by a magic "noextract" comment. BH> You can point these people to the BH> http://gnuwin32.sourceforge.net/ site which has gettext BH> binaries for Win32 ready for download. Cool, good to know, thanks. -Barry From martin@loewis.home.cs.tu-berlin.de Wed Aug 15 06:54:16 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 15 Aug 2001 07:54:16 +0200 Subject: [I18n-sig] Re: pygettext dilemma In-Reply-To: <15225.61991.262726.993033@anthem.wooz.org> (barry@zope.com) References: <15200.64763.772001.53387@anthem.wooz.org> <15223.19649.811672.585574@anthem.wooz.org> <15225.61991.262726.993033@anthem.wooz.org> Message-ID: <200108150554.f7F5sGv01383@mira.informatik.hu-berlin.de> > Now we all know that this won't do as a source string because there > may be some languages may change the order of the variables, so we > really need to write the string like so: > > _('No such list %(listname)s found on host %(hostname)s') % { > 'listname': listname, > 'hostname': hostname > } > > I've found this style to be quite pervasive, but also extremely (and > unnecessarily) repetitive. Notice that I've typed "listname" and > "hostname" a total of six times. 
Wouldn't it be wonderful if I only > needed to type them once: What's wrong with _('No such list %(listname)s found on host %(hostname)s') % locals() No magic required; of course, this assumes that the variables are either all globals or all locals - I wish vars() would give me a dictionary of all variables (perhaps even including the builtins). Regards, Martin From barry@zope.com Wed Aug 15 07:31:41 2001 From: barry@zope.com (Barry A. Warsaw) Date: Wed, 15 Aug 2001 02:31:41 -0400 Subject: [I18n-sig] Re: pygettext dilemma References: <15200.64763.772001.53387@anthem.wooz.org> <15223.19649.811672.585574@anthem.wooz.org> <15225.61991.262726.993033@anthem.wooz.org> <200108150554.f7F5sGv01383@mira.informatik.hu-berlin.de> Message-ID: <15226.5965.202148.510121@anthem.wooz.org> >>>>> "MvL" == Martin v Loewis writes: MvL> What's wrong with MvL> _('No such list %(listname)s found on host %(hostname)s') % MvL> locals() MvL> No magic required; of course, this assumes that the variables MvL> are either all globals or all locals - I wish vars() would MvL> give me a dictionary of all variables (perhaps even including MvL> the builtins). Bingo! That's what I wish vars() would do too, and I want those semantics, so I went with the _getframe() hack. Plus, I got tired of writing trailing "% locals()" all over the place, especially when they cluttered the code even more with long lines, requiring continuation via extraneous paren grouping or backslashing. Blah.
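
The _getframe() approach described here can be sketched roughly as follows. This is only an illustration, not Mailman's actual wrapper: the `translate` parameter is a hypothetical stand-in for a real catalog lookup such as gettext.gettext(), and the sketch assumes CPython, where sys._getframe() is available.

```python
import sys


def _(template, translate=lambda s: s):
    """Translate `template`, then interpolate the caller's variables."""
    # Frame 1 is the frame of whoever called _().
    caller = sys._getframe(1)
    # Globals first, then locals, so that locals win on a name clash.
    namespace = dict(caller.f_globals)
    namespace.update(caller.f_locals)
    return translate(template) % namespace


def report(listname, hostname):
    # No trailing "% locals()" and no hand-written dictionary of names.
    return _('No such list %(listname)s found on host %(hostname)s')


print(report('mailman-users', 'python.org'))
```

Because the lookup merges the caller's globals and locals, it gives the "all variables" view that plain locals() does not; builtins are deliberately left out of this sketch.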
-Barry From keichwa@gmx.net Wed Aug 15 08:57:51 2001 From: keichwa@gmx.net (Karl Eichwalder) Date: Wed, 15 Aug 2001 09:57:51 +0200 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings In-Reply-To: <15225.63024.291621.844755@anthem.wooz.org> (barry@wooz.org's message of "Wed, 15 Aug 2001 00:10:24 -0400") References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15223.18129.372008.719610@anthem.wooz.org> <15225.23907.685771.255536@honolulu.ilog.fr> <15225.63024.291621.844755@anthem.wooz.org> Message-ID: barry@wooz.org (Barry A. Warsaw) writes: > I made the semantics for pygettext.py's --docstrings/-D option to > extract /also/ the docstrings because the older version of msgmerge I > am using can't merge a docstring-only catalog with a normal-string > catalog in a reasonable way (I tried). FYI: You must not use msgmerge for this job; msgcomm is the right tool ;) When gettext 0.11 is released you can go for msgcat. > And as stated above, module docstrings can serve exactly the same > audience as other translatable strings, i.e. the end user, so they > should be in the same catalog. It depends. It depends on the size (for example). gnumeric, a GNOME spreadsheet application written in C, features "docstrings" associated with macro functions, highly mathematical stuff, approx. 300-400 messages. I'm not able to translate these messages and as a translator I like to have these messages go into a separate file... Happily, these days I can use msggrep to extract these messages (fr-function.po) and msgcomm to "subtract" the extracted strings (fr-function.po) from the original .po file (fr.po); result: fr-without-function.po.
Here are the commands:

    msggrep --output fr-function.po --width 0 \
        --msgid --regex '@FUNCTION=' fr.po
    msgcomm --output fr-without-function.po --width 0 \
        --less-than 2 fr.po fr-function.po

msggrep is able to work on filename markers (#: filename), too. Sorry for my digression. -- ke@suse.de (work) / keichwa@gmx.net (home): | http://www.suse.de/~ke/ | ,__o Free Translation Project: | _-\_<, http://www.iro.umontreal.ca/contrib/po/HTML/ | (*)/'(*)

From pinard@iro.umontreal.ca Wed Aug 15 16:01:35 2001 From: pinard@iro.umontreal.ca (François Pinard) Date: 15 Aug 2001 11:01:35 -0400 Subject: [I18n-sig] Re: pygettext dilemma In-Reply-To: <15225.61991.262726.993033@anthem.wooz.org> References: <15200.64763.772001.53387@anthem.wooz.org> <15223.19649.811672.585574@anthem.wooz.org> <15225.61991.262726.993033@anthem.wooz.org> Message-ID: [Barry A. Warsaw] > No hurt taken [...] Thanks! It would have much saddened me otherwise. > IIRC, gettext suggests binding N_() to gettext_noop() I was not overly happy with N_(), even if it comes from me as well, but given that gettext_noop() calls were less frequent than gettext() calls in the experience we had accumulated at the time, it was bearable. But I never really liked it. > Hmm, maybe we should add "N_" as one of the default keywords? This would be kind of natural from someone educated with C gettext, and looks much better to me than redefining _(). Much much better! :-) Of course, I considered it, a while ago already, but rejected it for my own use for two reasons. I'm not sure how solidly those reasons would stand today. Here they are nevertheless. The first is that N_() is completely pre-processed out in C, while N_() would stay executable in Python. To go as far as calling a function, as a side-effect of marking, looked like a high price to me. Conceptual price, of course; I'm not thinking about sparing the CPU, here. The second is a mere consequence of the first. Python would not let us use N_() for docstrings.
And I consider that Python is very right here, in telling me I'm wrong, because N_() is much more than a marker, while it should be nothing more than a marker, and have no other significance. > That points out a general philosophy I have that pygettext.py should > mimic xgettext as much as makes sense for the difference between C and > Python. I understand what you mean here, and I mostly agree. However, I would like to warn you against going too far in trying to follow `gettext'. It would be difficult for me to go in details now, but overall, I feel that `gettext' is a bit short-sighted. At the origin, this was really on purpose, as the initial goal was to put out something simple, and allow many years (I knew it has to be more than a few) so the idea of internationalization spreads and gets a wider acceptance in the field of free software. I think the idea has grown solid enough to not die by now, but if we want to be objective, there is still much, much to accomplish even with the initial design. However, I would guess that it would not take many more years before we get ready for another leap, and I fear `gettext' might not be fully ready for it, as it is getting somewhat encumbered by opinions, more than vision. I was hoping that Python might be the vehicle for that step. And for this to occur, Python needs being able to keep some distance and autonomy. > [...] are you saying that ''"""Text""" should be used for both the > runtime translating and the extraction marking? No. Only for marking, when nothing more than marking is meant. > Wouldn't it be wonderful if I only needed to type them once: > _('No such list %(listname)s found on host %(hostname)s') Yes, it would be wonderful. Also notice that we could eventually go a lot further than merely exchanging the order of the variables. Many languages use morphological flexing of surrounding words according to various properties coming from inserts themselves, or even more important changes. 
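
To make the point about word order concrete, here is a small sketch. The one-entry CATALOG and its reordered "translation" are invented purely for illustration (hypothetical data, not taken from any real .po file), and translate() is a hypothetical stand-in for a gettext lookup. The only thing it shows is that named %(...)s placeholders let a translation move the inserts around without any change to the calling code.

```python
# Hypothetical one-entry catalog; the "translation" simply swaps the
# order in which the two inserts appear.
CATALOG = {
    'No such list %(listname)s found on host %(hostname)s':
        'Host %(hostname)s has no list named %(listname)s',
}


def translate(msgid):
    """Stand-in for a gettext lookup: fall back to the msgid itself."""
    return CATALOG.get(msgid, msgid)


def no_such_list(listname, hostname):
    msgid = 'No such list %(listname)s found on host %(hostname)s'
    # The same dictionary works whatever order the translation uses.
    return translate(msgid) % {'listname': listname, 'hostname': hostname}


print(no_such_list('mailman-users', 'python.org'))
```

With positional "%s" placeholders the same reordering would silently swap the two values; going further, to inflection that depends on the inserted words, needs more than this mechanism offers.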
> The trick is that the function _() isn't gettext.gettext() but a > wrapper around that library function that's unique to Mailman. As much as possible, think Python, not only Mailman. :-) Yet, I quite understand one has to start somewhere. > But there is another problem: for some fonts in some IDE's it can be > challenging to discern ' from " or even ` and having something like > ""'''...''' makes it even more difficult to visually pick out. Please, do not merely let random fonts or editors design decide of your vision. Things start to go wrong when each actor is trying to take all the problems of the world on his/her shoulders. I could speak with length and conviction about a few bad moves in the area of fonts, in these days, especially with Unicode around. Just let's not dive into that, and rather hope that reason (or horse sense) will finally prevail. The most productive attitude is that everyone identifies his/her share, and do well with it. > For the handful of cases where you need to defer translation, I prefer > using a technique as similar to the common way as possible, instead of > introducing an entirely different convention. But I wouldn't cry foul > if you encouraged N_() markings for deferred translations. No doubt to me, N_() is vastly superior to locally redefining _(). And _even_ if vastly superior, it is still not that good. :-) > The more I think about it, I think a magic comment in front of the > docstring is the way to go. I'm not yet sure whether something like > # noextract > '''This is a docstring that need not be translated.''' > or > # extract > '''This is a docstring that should be translated.''' > is better, or whether there's some other better comment keyword to > use. This would be worth experimenting with a bit. It seems like a good idea. This is surely legible and neat, and probably better than the other things we saw so far, from both of us! 
:-) I have a slight fear that it might become tedious if we have long sequences of translation-delayed strings, as it will likely happen in some applications. (I've linguistic applications in mind: as my associate is a linguist and we often work together, I saw such things a few times.) > FP> I would not want to crusade inordinately over this > A good, lively debate. Thanks! Oh, you know, I do not want it too lively. As much as I like friendly exchanges of ideas, as least because they convey friendship, I hate debates and conflicts. I dare to think I'm a peaceful man... For the Translation Project, Martin von Löwis and Karl Eichwalder took the torch and accepted to face the music. I'm extremely grateful to them, yet a bit sorry, thinking about some stubbornness out there which will undoubtedly hit them. For one, I'm too old and tired for fighting it anymore. :-) > Let's not conflate what we're talking about. Oops! I do not know the word "conflate", and do not find it in my English-French dictionary. I'm a priori ready to "not conflate" if it pleases you :-). But then, what should I do, or avoid to do? :-) Keep happy! -- François Pinard http://www.iro.umontreal.ca/~pinard From haible@ilog.fr Wed Aug 15 17:39:58 2001 From: haible@ilog.fr (Bruno Haible) Date: Wed, 15 Aug 2001 18:39:58 +0200 (CEST) Subject: [I18n-sig] Re: pygettext.py extraction of docstrings In-Reply-To: <15225.63024.291621.844755@anthem.wooz.org> References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15223.18129.372008.719610@anthem.wooz.org> <15225.23907.685771.255536@honolulu.ilog.fr> <15225.63024.291621.844755@anthem.wooz.org> Message-ID: <15226.42462.142365.711064@honolulu.ilog.fr> Barry A. Warsaw writes: > BH> The GNU gettext tools are currently being modified to handle > BH> various programming languages. 
A new flag 'python-format' is > BH> being introduced, with appropriate format string checking in > BH> 'msgfmt'. > > I'm not sure exactly what this means. Can you give a bit more detail? When a Python program contains a string like "%(name)s, %(firstname)s" xgettext will mark it as "#, python-format", in order to tell the translator that the string is a format string. If the translator then gives an incorrect translation, say "%(fistname)s %(name)s" or "%(firstname) %(name)s", then "msgfmt --check" will give an appropriate error message. > BTW, here's the current set of switches for pygettext.py. Do you see > any glaring incompatibilities with your latest xgettext? pygettext doesn't extract comments of the form "translator: the c, is a c-cedilla" (xgettext option --add-comments) or "xgettext: no-python-format" (lets the programmer override the format string guessing). Other than that: xgettext doesn't have --docstrings and --no-docstrings yet :-) The -K option doesn't exist in xgettext; you have to use --keyword instead. Also, xgettext doesn't have -S/--style. The Solaris style is available only with --strict. > But as Martin points out, a case can be made for translating even > class, function, and method docstrings. Think of the situation where > manuals are automatically extracted from source code, a la Javadoc. I > believe you'd want those strings to be extracted into the catalog. I believe those strings belong in a different catalog. If you then want them in the same catalog, you can use "msgcat" to combine both catalogs. The reasons for a different catalog: 1) Normal strings and docstrings may need to be handled by different translators. 2) They may need different extraction options. Your addition of --no-docstrings indicates that docstrings may come from a different set of files.
Instead of forcing all options into a single xgettext command line, what I propose is that you call xgettext twice, once for the normal strings and once for the docstrings, with independent command line options, and on independent (but potentially overlapping) sets of files. This gives you the maximum flexibility. > Here are two strawmen: > > 1) pygettext.py and xgettext never extract unmarked docstrings unless > the -D/--docstrings option is given. If -D is given then all > unmarked docstrings are extracted along with all other normally > marked text, unless the unmarked docstring is immediately preceded > by a comment with the word "notranslate" as the first word in the > comment. All other words in the comment are ignored. > > 2) pygettext.py and xgettext never extract unmarked docstrings unless > they are immediately preceded by a comment with the word > "translate" as the first word in the comment. All other words in > the comment are ignored. Here is my strawman: pygettext.py and xgettext never extract unmarked docstrings by default. If option -D/--docstrings is given, it extracts docstrings only. A separate option like --keywords can be used to select or inhibit the docstrings. > I think that Martin's and my applications show that we probably need > to cover these two situations: > > 1) No docstrings are extracted unless they are preceded by a magic > "extract" comment. > > 2) All docstrings are extracted unless they are preceded by a magic > "noextract" comment. I agree. > Note that in Python, unlike I believe as in *Lisp, docstrings can be > attached to objects other than functions. It's common to have both > module and class docstrings. Lisp has grown up since then. Nowadays you can attach docstrings not only to functions and macros, but also to classes, methods and packages. The macros defclass, defmethod and defpackage support this. Bruno From martin@loewis.home.cs.tu-berlin.de Wed Aug 15 19:23:55 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Wed, 15 Aug 2001 20:23:55 +0200 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings In-Reply-To: <15226.42462.142365.711064@honolulu.ilog.fr> (message from Bruno Haible on Wed, 15 Aug 2001 18:39:58 +0200 (CEST)) References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15223.18129.372008.719610@anthem.wooz.org> <15225.23907.685771.255536@honolulu.ilog.fr> <15225.63024.291621.844755@anthem.wooz.org> <15226.42462.142365.711064@honolulu.ilog.fr> Message-ID: <200108151823.f7FINt503408@mira.informatik.hu-berlin.de> > When a Python program contains a string like "%(name)s, %(firstname)s" > xgettext will mark it as "#, python-format", in order to tell the > translator that the string is a format string. If the translator then > gives an incorrect translation, say "%(fistname)s %(name)s" or > "%(firstname) %(name)s", then "msgfmt --check" will give an > appropriate error message. That's pretty cool. Regards, Martin From colinsyu@hotmail.com Fri Aug 24 17:54:00 2001 From: colinsyu@hotmail.com (Colin Yu) Date: Fri, 24 Aug 2001 09:54:00 -0700 Subject: [I18n-sig] (no subject) Message-ID: Hi, Is there a way to take a unicode string like u"レーザー プリンタ" (which are Japanese characters in UTF-8 format) and convert it to unicode (\uXXXX format) escape codes in python? Your help would be greatly appreciated. Thank you. 
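
One way to produce the \uXXXX form asked about here is sketched below. It assumes a present-day Python 3 interpreter rather than the Python 2 of this thread, and the helper name to_escapes is made up for the example. It relies on the standard 'unicode_escape' codec, which leaves printable ASCII alone, writes characters in the U+0080 to U+00FF range as \xXX, and writes characters beyond U+FFFF as \UXXXXXXXX.

```python
def to_escapes(text):
    """Return `text` with non-ASCII characters written as backslash escapes."""
    return text.encode('unicode_escape').decode('ascii')


escaped = to_escapes('レーザー プリンタ')
print(escaped)

# The escaped form can be turned back into the original string.
restored = escaped.encode('ascii').decode('unicode_escape')
print(restored == 'レーザー プリンタ')
```

The built-in ascii() gives a similar result wrapped in quotes, which is the closest Python 3 counterpart to calling repr() on a Python 2 unicode string.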
From tree@basistech.com Fri Aug 24 18:03:51 2001 From: tree@basistech.com (Tom Emerson) Date: Fri, 24 Aug 2001 13:03:51 -0400 Subject: [I18n-sig] (no subject) In-Reply-To: References: Message-ID: <15238.35063.269485.168879@magrathea.basistech.com>

Colin Yu writes:
> Is there a way to take a unicode string like u"レーザー プリンタ"
> (which are Japanese characters in UTF-8 format) and convert it to unicode
> (\uXXXX format) escape codes in python? Your help would be greatly
> appreciated. Thank you.

Use "repr".

    >>> foo = u"\u4e00"
    >>> repr(foo)
    "u'\\u4E00'"

--
Tom Emerson  Basis Technology Corp.
Sr. Sinostringologist  http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"

From Misha.Wolf@reuters.com Sat Aug 25 02:02:33 2001 From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com) Date: Sat, 25 Aug 2001 02:02:33 +0100 Subject: [I18n-sig] 19th Unicode Conference, Sep 2001, San Jose, CA -- Two weeks to go! Message-ID: >>>>>>>>>>>>>>>>>>>>>>>> Just 2 weeks to go! <<<<<<<<<<<<<<<<<<<<<<<< Nineteenth International Unicode Conference (IUC19) Unicode and the Web: The Global Connection http://www.unicode.org/iuc/iuc19 September 10-14, 2001 San Jose, CA, USA >>>>>>>>>>>>>>>>>>>>>>>>>>> Register now! <<<<<<<<<<<<<<<<<<<<<<<<<<< NEWS * Hotel guest room group rate extended to August 31. * Early Bird registration rate extended to August 31. * Visit the Conference Web site ( http://www.unicode.org/iuc/iuc19 ) to check the updated Conference program and register. To help you choose Conference sessions, we've included abstracts of talks and speakers' biographies.
CONFERENCE SPONSORS * Agfa Monotype Corporation * Basis Technology Corporation * Lionbridge Technologies * Microsoft Corporation * Netscape Communications * Oracle Corporation * PeopleSoft, Inc. * Reuters Ltd. * Sun Microsystems, Inc. * Trados Corporation * Trigeminal Software, Inc. * World Wide Web Consortium (W3C) * Wrox Press CONFERENCE VENUE DoubleTree Hotel San Jose 2050 Gateway Place San Jose, CA 95110 USA Tel: +1 408 453 4000 Fax: +1 408 437 2898 GLOBAL COMPUTING SHOWCASE * Visit the Showcase to find out more about products supporting the Unicode Standard, and products and services that can help you globalize/localize your software, documentation and Internet content. For details, visit the Conference Web site ( http://www.unicode.org/iuc/iuc19 ) Exhibitors to date include: * Agfa Monotype Corporation * Basis Technology Corporation * Everlasting Systems Ltd. * Localization Institute * Multilingual Computing, Inc. * Oracle Corporation * Rasmussen Software, Inc. * ReachIn, Inc. * Sun Microsystems, Inc. * Segue Software * Sybase, Inc. * Symbio Group * Trados Corporation CONFERENCE MANAGEMENT Global Meeting Services Inc. 4360 Benhurst Avenue San Diego, CA 92122, USA Tel: +1 858 638 0206 (voice) +1 858 638 0504 (fax) Email: info@global-conference.com or: conference@unicode.org * * * * * Unicode(r) and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.