From stephan.richter@tufts.edu Tue Apr 1 02:48:23 2003 From: stephan.richter@tufts.edu (Stephan Richter) Date: Mon, 31 Mar 2003 21:48:23 -0500 Subject: [I18n-sig] gettext tutorial In-Reply-To: <3E83057F.9020004@zope.com> References: <3E8138AA.2020307@canada.com> <3E83057F.9020004@zope.com> Message-ID: <200303312148.23441.stephan.richter@tufts.edu> On Thursday 27 March 2003 09:06, Jim Fulton wrote: > FWIW, I think Stephan Richter has already written such an app for > Zope. > > Jim > > Anmar Oueja wrote: > > Hello All: > > > > I am working on a web application (written in python of course) that > > will display a po file and allow people to translate these po files > > using this web app. See the Zope 3 gettext PO files import/export filters at http://cvs.zope.org/Zope3/src/zope/app/services/translation/filters.py?rev=1.2&content-type=text/vnd.viewcvs-markup Once you have a dictionary, you can do whatever you want. Note of course that the local translation service does much more already; it can also synchronize message catalogs... Regards, Stephan -- Stephan Richter CBU Physics & Chemistry (B.S.) / Tufts Physics (Ph.D. student) Web2k - Web Software Design, Development and Training From duerst@w3.org Thu Apr 10 20:44:52 2003 From: duerst@w3.org (Martin Duerst) Date: Thu, 10 Apr 2003 15:44:52 -0400 Subject: [I18n-sig] IUC24 Call for Papers - September 2003 - Atlanta, Georgia Message-ID: <4.2.0.58.J.20030410153814.0607d8e0@localhost> Hi folks, Yes, it is time to get thinking about the September Unicode conference in Atlanta. Please see the information below and check out the website for conference themes and suggested topics for papers. Submissions are due May 2. Thank you, Martin. >>>>>>>>>>>>>>>>>>>>>>>>>> Call for Papers! 
<<<<<<<<<<<<<<<<<<<<<<<<< Twenty-fourth Internationalization and Unicode Conference (IUC24) Unicode, Internationalization, the Web: Powering Global Business See Call for Papers at: http://www.unicode.org/iuc/iuc24/call.html September 3-5, 2003 Atlanta, Georgia >>>>>>>>>>>>>>>>>>>> Send in your submission now! <<<<<<<<<<<<<<<<<<< Submissions due: May 2, 2003 Notification date: May 23, 2003 Completed papers due: June 13, 2003 (in electronic form and camera-ready paper form) >>>>>>>>>>>>>>>>>>>>>>>> Just 4 weeks to go! <<<<<<<<<<<<<<<<<<<<<<<< WHAT'S NEW Each conference's theme is different, allowing key subject areas to be explored in depth. This conference will explore global business needs and solutions and the impact of new technologies. Go to the conference web site for a graphical version of this message: http://www.unicode.org/iuc/iuc24/call.html INVITATION TO SUBMIT PAPERS The Internationalization & Unicode Conference is the premier technical conference worldwide for both software and Web internationalization. The conference features tutorials, lectures, and panel discussions that provide coverage of standards, best practices, and recent advances in the globalization of software and the Internet. The conference continues to provide a forum for identifying and discussing new issues in this field. New technologies, innovative Internet applications, and the evolving Unicode Standard bring new challenges along with their new capabilities. This technical conference will explore the opportunities created by the latest advances, how to leverage them, and the potential pitfalls. Their impact on business and the problem areas that need further research will also be identified. Best practices for designing applications that can accommodate any language will be demonstrated. Attendees benefit from the wide range of basic to advanced topics and the opportunities for dialog and idea exchange with experts and peers. 
We invite you to submit papers that relate to Unicode or any aspect of software and Web Internationalization, with special emphasis on the themes discussed below. You can view the programs of previous conferences at: http://www.unicode.org/unicode/conference/about-conf.html CONFERENCE ATTENDEES Conference attendees are generally involved in either the development and deployment of Unicode software, or the globalization of software and the Internet. They include managers, software engineers, testers, systems analysts, program managers, font designers, graphic designers, content developers, web designers, web administrators, site coordinators, technical writers, and product marketing personnel. THEME: INTERNATIONAL COMPUTING SOLUTIONS FOR GLOBAL BUSINESS "International Computing Solutions for Global Business" is the overall theme of the Conference. In today's tight economy, companies are looking for productivity improvements and increased international sales. One of many challenges is to maximize use of existing resources while accomplishing these achievements. Another is to incorporate new standards and technologies to gain competitive features. In support of the theme and these challenges, papers on GLOBAL BUSINESS and NEW TECHNOLOGIES are requested. More details on the theme and other topics of interest can be found at our web site: http://www.unicode.org/iuc/iuc24/call.html We invite you to submit papers which define tomorrow's computing, demonstrate best practices in computing today, or articulate problems that must be solved before further advances can occur. Presentations should be geared towards a technical audience. EXHIBIT OPPORTUNITIES The Conference SHOWCASE area is for corporations and individuals who wish to display and promote their products, technology and/or services. Every effort will be made to provide maximum exposure, advertising and traffic. Exhibit space is limited. 
For further information or to reserve a place, please contact Global Meeting Services at info@global-conference.com. CONFERENCE VENUE DoubleTree Hotel Atlanta Buckhead 3342 Peachtree Road Atlanta, GA 30326 Tel: +1-404-231-1234 Fax: +1-404-231-3112 THE UNICODE CONSORTIUM The Unicode Consortium is a non-profit organization dedicated to the development, maintenance and promotion of The Unicode Standard, a worldwide character encoding. The Unicode Standard encodes the characters of the world's principal scripts and languages, and is code-for-code identical to the international standard ISO/IEC 10646. The Consortium also defines character properties and algorithms for use in implementations. The membership base of the Unicode Consortium includes major computer corporations, software producers, database vendors, research institutions, international agencies and various user groups. For further information on the Unicode Standard, visit the Unicode Web site at http://www.unicode.org or e-mail * * * * * Unicode(r) and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission. From barry@python.org Fri Apr 11 18:51:56 2003 From: barry@python.org (Barry Warsaw) Date: 11 Apr 2003 13:51:56 -0400 Subject: [I18n-sig] Changes to gettext.py for Python 2.3 Message-ID: <1050083516.11172.40.camel@barry> Hi I18n-ers, I plan on checking in the following changes to the gettext.py module for Python 2.3, based on feedback from the Zope and Mailman i18n work. Here's a summary of the changes, hopefully there aren't too many controversies . I'll update the tests and the docs at the same time. - Expose NullTranslations and GNUTranslations to __all__ - Set the default charset to iso-8859-1. It used to be None, which would cause problems with .ugettext() if the file had no charset parameter. Arguably, the po/mo file would be broken, but I still think iso-8859-1 is a reasonable default. - Add a "coerce" default argument to GNUTranslations's constructor. 
The reason for this is that in Zope, we want all msgids and msgstrs to be Unicode. For the latter, we could use .ugettext() but there isn't currently a mechanism for Unicode-ifying msgids. The plan then is that the charset parameter specifies the encoding for both the msgids and msgstrs, and both are decoded to Unicode when read. For example, we might encode po files with utf-8. I think the GNU gettext tools don't care. Since this could potentially break code [*] that wants to use the encoded interface .gettext(), the constructor flag is added, defaulting to False. Most code I suspect will want to set this to True and use .ugettext(). - A few other minor changes from the Zope project, including asserting that a zero-length msgid must have a Project-ID-Version header for it to be counted as the metadata record. -Barry [*] I've come to the opinion that using anything other than Unicode msgids and msgstrs just won't work well for Python, and thus you really should be using the .ugettext() method everywhere. It's also insane to mix .gettext() and .ugettext(). In Zope, all human readable messages will be Unicode strings internally, so we definitely want Unicode msgids. From martin@v.loewis.de Fri Apr 11 20:54:50 2003 From: martin@v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=) Date: Fri, 11 Apr 2003 21:54:50 +0200 Subject: [I18n-sig] Changes to gettext.py for Python 2.3 In-Reply-To: <1050083516.11172.40.camel@barry> References: <1050083516.11172.40.camel@barry> Message-ID: <3E971D8A.5020006@v.loewis.de> Barry Warsaw wrote: > - Set the default charset to iso-8859-1. It used to be None, which > would cause problems with .ugettext() if the file had no charset > parameter. Arguably, the po/mo file would be broken, but I still think > iso-8859-1 is a reasonable default. I'm -1 here. Why do you think it is a reasonable default? Errors should never pass silently. Unless explicitly silenced. 
While iso-8859-1 might be a reasonable default in other application
domains, in the context of non-English text (which it typically is),
assuming Latin-1 is bound to create mojibake.

If your application can accept creating mojibake, I suggest a method
setdefaultencoding on the catalog, which has no effect if an encoding
was found in the catalog.

> - Add a "coerce" default argument to GNUTranslations's constructor. The
> reason for this is that in Zope, we want all msgids and msgstrs to be
> Unicode. For the latter, we could use .ugettext() but there isn't
> currently a mechanism for Unicode-ifying msgids.

Could you please explain in what context this is needed? msgids are
ASCII, and you can pass a Unicode string to ugettext just fine.

> The plan then is that the charset parameter specifies the encoding for
> both the msgids and msgstrs, and both are decoded to Unicode when read.
> For example, we might encode po files with utf-8. I think the GNU
> gettext tools don't care.

They complain loudly if they find bytes > 127 in the msgid.

> Since this could potentially break code [*] that wants to use the
> encoded interface .gettext(), the constructor flag is added, defaulting
> to False. Most code I suspect will want to set this to True and use
> .ugettext().

To avoid breakage, you could define ugettext as

     def ugettext(self, message):
         if isinstance(message, unicode):
             tmsg = self._catalog.get(message.encode(self._charset))
             if tmsg is None:
                 return message
         else:
             tmsg = self._catalog.get(message, message)
         return unicode(tmsg, self._charset)

> - A few other minor changes from the Zope project, including asserting
> that a zero-length msgid must have a Project-ID-Version header for it to
> be counted as the metadata record.

That test was there, and was removed at the request of Bruno Haible, the
GNU gettext maintainer, who points out that Project-ID-Version is not
mandatory for the metadata (see Patch #700839).
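The ugettext fallback suggested above can be sketched in present-day
Python roughly as follows. This is only an illustration under stated
assumptions: the catalog is taken to be a plain dict of encoded byte
strings, and the class name ByteCatalog and its attributes are
hypothetical stand-ins, not the gettext module's API.

```python
class ByteCatalog:
    """Hypothetical stand-in for a catalog holding encoded msgids/msgstrs."""

    def __init__(self, catalog, charset):
        self._catalog = catalog   # dict mapping bytes -> bytes
        self._charset = charset   # encoding declared by the catalog

    def ugettext(self, message):
        # Accept either text or already-encoded bytes as the lookup key.
        if isinstance(message, str):
            key = message.encode(self._charset)
        else:
            key = message
        tmsg = self._catalog.get(key)
        if tmsg is None:
            # No translation found: return the msgid itself, as text.
            if isinstance(message, str):
                return message
            return message.decode(self._charset)
        return tmsg.decode(self._charset)


# Illustrative data only; the entries are made up for this sketch.
catalog = ByteCatalog({b"Close": "Schlie\u00dfen".encode("utf-8")}, "utf-8")
print(catalog.ugettext("Close"))     # translated entry
print(catalog.ugettext("missing"))   # falls back to the msgid
```

The point of the design is that the stored catalog stays encoded, so
callers of a byte-oriented lookup are unaffected, while Unicode callers
pay for one encode per lookup.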
Regards, Martin From barry@python.org Fri Apr 11 21:26:59 2003 From: barry@python.org (Barry Warsaw) Date: 11 Apr 2003 16:26:59 -0400 Subject: [I18n-sig] Changes to gettext.py for Python 2.3 In-Reply-To: <3E971D8A.5020006@v.loewis.de> References: <1050083516.11172.40.camel@barry> <3E971D8A.5020006@v.loewis.de> Message-ID: <1050092819.11172.89.camel@barry> On Fri, 2003-04-11 at 15:54, "Martin v. Löwis" wrote: > Barry Warsaw wrote: > > > - Set the default charset to iso-8859-1. It used to be None, which > > would cause problems with .ugettext() if the file had no charset > > parameter. Arguably, the po/mo file would be broken, but I still think > > iso-8859-1 is a reasonable default. > > I'm -1 here. Why do you think it is a reasonable default? > > Errors should never pass silently. > Unless explicitly silenced. > > While iso-8859-1 might be a reasonable default in other application > domains, in the context of non-English text (which it typically is), > assuming Latin-1 is bound to create mojibake. Okay, never mind, I'll back this one out. The problem was caused by my other patch to unicode-ify on read (see below) without first having a charset. I have a different fix for this. > > - Add a "coerce" default argument to GNUTranslations's constructor. The > > reason for this is that in Zope, we want all msgids and msgstrs to be > > Unicode. For the latter, we could use .ugettext() but there isn't > > currently a mechanism for Unicode-ifying msgids. > > Could you please in what context this is needed? msgids are ASCII, and > you can pass a Unicode string to ugettext just fine. In Zope, all strings are Unicode and the catalog may include messages that are extracted from places other than Python source code, e.g. XML-based files. Message ids can contain non-ASCII characters if they are written by a non-English coder. 
I think in that case, we'd want to do something like encode the strings
possibly with utf-8 for the .po/.mo files, but we want them decoded in
time to look the Unicode strings up in the catalog.

Similarly, what happens if a non-English coder writes an i18n'd Python
module with native strings, possibly using a Python 2.3 coding cookie?
We'd want their message ids to be extracted into the .mo/.po files,
right?

> > The plan then is that the charset parameter specifies the encoding for
> > both the msgids and msgstrs, and both are decoded to Unicode when read.
> > For example, we might encode po files with utf-8. I think the GNU
> > gettext tools don't care.
>
> They complain loudly if they find bytes > 127 in the msgid.

Really? Ok, I'm still confused because I tried the following example:

I wrote a .po file (charset=utf-8) with the following record:

#: nofile:0
msgid "ab\xc3\x9e"
msgstr "\xc2\xa4yz"

I used standard msgfmt to turn that into a .mo file. Then created a
GNUTranslations(fp, coerce=True) and called

>>> t.ugettext(u'ab\xde')
u'\xa4yz'

This is what I should expect, right? ;)

> > - A few other minor changes from the Zope project, including asserting
> > that a zero-length msgid must have a Project-ID-Version header for it to
> > be counted as the metadata record.
>
> That test was there, and removed on request of Bruno Haible, the GNU
> gettext maintainer, as he points out that Project-ID-Version is not
> mandatory for the metadata (see Patch #700839).

Ah, I read the diff backwards in this case. I'll back this one out too.

-Barry

From barry@python.org Fri Apr 11 21:37:56 2003
From: barry@python.org (Barry Warsaw)
Date: 11 Apr 2003 16:37:56 -0400
Subject: [I18n-sig] Changes to gettext.py for Python 2.3
In-Reply-To: <3E971D8A.5020006@v.loewis.de>
References: <1050083516.11172.40.camel@barry>
	<3E971D8A.5020006@v.loewis.de>
Message-ID: <1050093475.11200.96.camel@barry>

On Fri, 2003-04-11 at 15:54, "Martin v.
Löwis" wrote:

> To avoid breakage, you could define ugettext as
>
>      def ugettext(self, message):
>          if isinstance(message, unicode):
>              tmsg = self._catalog.get(message.encode(self._charset))
>              if tmsg is None:
>                  return message
>          else:
>              tmsg = self._catalog.get(message, message)
>          return unicode(tmsg, self._charset)

I suppose we could cache the conversion to make the next lookup more
efficient. Alternatively, if we always convert internally to Unicode we
could encode on .gettext(). Then we could just pick One Way and do away
with the coerce flag.

-Barry

From martin@v.loewis.de Sat Apr 12 11:34:05 2003
From: martin@v.loewis.de (Martin v. Löwis)
Date: 12 Apr 2003 12:34:05 +0200
Subject: [I18n-sig] Changes to gettext.py for Python 2.3
In-Reply-To: <1050093475.11200.96.camel@barry>
References: <1050083516.11172.40.camel@barry>
	<3E971D8A.5020006@v.loewis.de>
	<1050093475.11200.96.camel@barry>
Message-ID:

Barry Warsaw writes:

> I suppose we could cache the conversion to make the next lookup more
> efficient. Alternatively, if we always convert internally to Unicode we
> could encode on .gettext(). Then we could just pick One Way and do away
> with the coerce flag.

If you are concerned about efficiency, I guess there is no way to
avoid converting the file to Unicode on loading. I would then
encourage a change where this flag is available, but has an effect
only on performance, not on the behaviour.

Alternatively, you could subclass GNUTranslations.

Regards,
Martin

From martin@v.loewis.de Sat Apr 12 12:43:28 2003
From: martin@v.loewis.de (Martin v. Löwis)
Date: 12 Apr 2003 13:43:28 +0200
Subject: [I18n-sig] Changes to gettext.py for Python 2.3
In-Reply-To: <1050092819.11172.89.camel@barry>
References: <1050083516.11172.40.camel@barry>
	<3E971D8A.5020006@v.loewis.de>
	<1050092819.11172.89.camel@barry>
Message-ID:

Barry Warsaw writes:

> I used standard msgfmt to turn that into a .mo file.
Then created a
> GNUTranslations(fp, coerce=True) and called
>
> >>> t.ugettext(u'ab\xde')
> u'\xa4yz'
>
> This is what I should expect, right? ;)

More or less, yes. Now, what happens if you put "real" non-ASCII
(i.e. bytes above 127) into the message id, like so:

msgid "abö"
msgstr "\xc2\xa4yz"

msgfmt will still accept that, but msgunfmt will complain:

msgunfmt: warning: The following msgid contains non-ASCII characters.
                   This will cause problems to translators who use a
                   character encoding different from yours. Consider
                   using a pure ASCII msgid instead.

If you think about this, this is really bad: If you mean to apply the
charset= to both msgid and msgstr, then translators using a different
charset from yours are in big trouble. They are faced with three
problems:

1. They don't know what the charset of the msgids is. The PO files do
   have a charset declaration, the POT files typically don't.
2. They need to convert the msgids from the POT encoding to their
   native encoding. There are no tools available to support that readily;
   tools like iconv might correctly convert the msgids, but won't update
   the charset= in the POT file (if the charset was filled out).
3. By converting the msgids, they are also changing them. That means
   the msgids are not really suitable as keys anymore.

Regards,
Martin

From bh@intevation.de Wed Apr 16 17:27:26 2003
From: bh@intevation.de (Bernhard Herzog)
Date: 16 Apr 2003 18:27:26 +0200
Subject: [I18n-sig] pygettext and msgfmt support for distutils
Message-ID: <6qisteh0gh.fsf@salmakis.intevation.de>

Is someone working on integrating the gettext utilities with distutils?

Some background:

We've just added some simple gettext support to our geographic data
viewer Thuban[1] but the setup we currently use only works on Unix-like
systems because it's just a makefile that can be used to call xgettext
(0.11 which supports python :)) and msgmerge and msgfmt as needed.
I've already adapted our setup.py file to include any formatted .mo files and any po files in the source distribution and to install the mo files together with other data files which works well so far. What I'd still like to have is a way to at least get the functionality of xgettext (or pygettext) and msgfmt into the distutils setup.py in such a way that it works on windows as well as on Unix. Is someone working on this kind of thing? If no, what would be needed to get it into the standard distutils? AFAICT, for a start it might be good to move pygettext.py and msgfmt.py into the standard library so that distutils can easily call them to do the actual work. Bernhard [1] http://thuban.intevation.org/ -- Intevation GmbH http://intevation.de/ Sketch http://sketch.sourceforge.net/ MapIt! http://www.mapit.de/ From barry@python.org Wed Apr 16 17:52:06 2003 From: barry@python.org (Barry Warsaw) Date: 16 Apr 2003 12:52:06 -0400 Subject: [Python-Dev] Re: [I18n-sig] Changes to gettext.py for Python 2.3 In-Reply-To: References: <1050083516.11172.40.camel@barry> <3E971D8A.5020006@v.loewis.de> <1050092819.11172.89.camel@barry> Message-ID: <1050511925.9818.78.camel@barry> On Sat, 2003-04-12 at 07:43, Martin v. Löwis wrote: > More or less, yes. Now, what happens if you pot "real" non-ASCII > (i.e. bytes above 127) into the message id, like so: But I don't think you'd ever want to do that. In fact, I think in general you're probably talking about ascii msgids or utf-8 encoded Unicode msgids. I'm not sure what else would make sense. > msgfmt will still accept that, but msgunfmt will complain: Didn't even know about msgunfmt. :) > msgunfmt: warning: The following msgid contains non-ASCII characters. > This will cause problems to translators who use a > character encoding different from yours. Consider > using a pure ASCII msgid instead. 
> > If you think about this, this is really bad: If you mean to apply the > charset= to both msgid and msgstr, then translators using a different > charset from yours are in big trouble. Right, but see above. E.g. if your string literals are all Spanish and you want a Turkish translation, then utf-8 is the only common encoding you could possibly use in a .po file, right? > They are faced with three problems: > 1. They don't know what the charset of the msgids is. The PO files do > have a charset declaration, the POT files typically don't. Yep, although it would be easy for the extractor to add a charset=utf-8 to the pot file. > 2. They need to convert the msgids from the POT encoding to their > native encoding. There are no tools available to support that readily; > tools like iconv might correctly convert the msgids, but won't update > the charset= in the POT file (if the charset was filled out). > 3. By converting the msgids, they are also changing them. That means > the msgids are not really suitable as keys anymore. Is this still a problem for when charset=utf-8? -Barry From barry@python.org Wed Apr 16 17:53:53 2003 From: barry@python.org (Barry Warsaw) Date: 16 Apr 2003 12:53:53 -0400 Subject: [Python-Dev] Re: [I18n-sig] Changes to gettext.py for Python 2.3 In-Reply-To: References: <1050083516.11172.40.camel@barry> <3E971D8A.5020006@v.loewis.de> <1050093475.11200.96.camel@barry> Message-ID: <1050512032.9818.81.camel@barry> On Sat, 2003-04-12 at 06:34, Martin v. Löwis wrote: > Barry Warsaw writes: > > > I suppose we could cache the conversion to make the next lookup more > > efficient. Alternatively, if we always convert internally to Unicode we > > could encode on .gettext(). Then we could just pick One Way and do away > > with the coerce flag. > > If you are concerned about efficiency, I guess there is no way to > avoid converting the file to Unicode on loading. 
I would then > encourage a change where this flag is available, but has an effect > only on performance, not on the behaviour. > > Alternatively, you could subclass GNUTranslation. It would take some refactoring, unless you implemented a second pass over the catalog. I'd rather not do either, so I'm happy to include this right in GNUTranslations. -Barry From barry@python.org Wed Apr 16 19:10:40 2003 From: barry@python.org (Barry Warsaw) Date: 16 Apr 2003 14:10:40 -0400 Subject: [I18n-sig] pygettext and msgfmt support for distutils In-Reply-To: <6qisteh0gh.fsf@salmakis.intevation.de> References: <6qisteh0gh.fsf@salmakis.intevation.de> Message-ID: <1050516640.9818.150.camel@barry> On Wed, 2003-04-16 at 12:27, Bernhard Herzog wrote: > Is someone working on integrating the gettext utilities with distutils? > > Some background: > > We've just added some simple gettext support to our geogaphic data > viewer Thuban[1] but the setup we currently use only works on Unix-like > systems because it's just a makefile that can be used to call xgettext > (0.11 which supports python :)) and msgmerge and msgfmt as needed. I haven't had time to look at the latest xgettext, but do you know if it supports all the extra features that pygettext supports? Of primary importance to me is the -D/--docstrings and -X/--no-docstrings options. > I've already adapted our setup.py file to include any formatted .mo > files and any po files in the source distribution and to install the mo > files together with other data files which works well so far. > > What I'd still like to have is a way to at least get the functionality > of xgettext (or pygettext) and msgfmt into the distutils setup.py in > such a way that it works on windows as well as on Unix. msgfmt I can see, but I'm not so sure about {x,py}gettext. IME, I don't want to do message extraction at either build time or tar-it-up time. I usually want to do extraction at defined boundaries in the project's development. 
So that seems to me a separate process. I'm interested in getting your
ideas here.

Hooking msgfmt up in some way would definitely be useful. That way you
wouldn't need to include .mo files in your distro (nor in cvs).

> Is someone working on this kind of thing?
>
> If no, what would be needed to get it into the standard distutils?

Let's start with a patch! :)

> AFAICT, for a start it might be good to move pygettext.py and msgfmt.py
> into the standard library so that distutils can easily call them to do
> the actual work.

Hmm, possibly. They may need to be rewritten or refactored to make them
more appropriate as library modules. I can see an i18n package being
added to Python's stdlib someday which might contain the raw materials,
with the Tools/i18n scripts being mostly just __main__ and getargs
wrappers.

-Barry

From martin@v.loewis.de Wed Apr 16 20:20:34 2003
From: martin@v.loewis.de (Martin v. Löwis)
Date: 16 Apr 2003 21:20:34 +0200
Subject: [Python-Dev] Re: [I18n-sig] Changes to gettext.py for Python 2.3
In-Reply-To: <1050511925.9818.78.camel@barry>
References: <1050083516.11172.40.camel@barry>
	<3E971D8A.5020006@v.loewis.de>
	<1050092819.11172.89.camel@barry>
	<1050511925.9818.78.camel@barry>
Message-ID:

Barry Warsaw writes:

> Right, but see above. E.g. if your string literals are all Spanish and
> you want a Turkish translation, then utf-8 is the only common encoding
> you could possibly use in a .po file, right?

That's why your string literals should never be all Spanish. If you
have Spanish string literals and use escape codes in the msgid,
reading the Spanish msgid becomes difficult, anyway.

> > 3. By converting the msgids, they are also changing them. That means
> > the msgids are not really suitable as keys anymore.
>
> Is this still a problem for when charset=utf-8?

If the msgids are UTF-8, with non-ASCII characters C-escaped,
translators will *still* put non-UTF-8 encodings into the catalogs.
This will then be a problem: The catalog encoding won't be UTF-8, and you can't process the msgids. Regards, Martin From martin@v.loewis.de Wed Apr 16 20:24:43 2003 From: martin@v.loewis.de (Martin v. =?iso-8859-15?q?L=F6wis?=) Date: 16 Apr 2003 21:24:43 +0200 Subject: [I18n-sig] pygettext and msgfmt support for distutils In-Reply-To: <6qisteh0gh.fsf@salmakis.intevation.de> References: <6qisteh0gh.fsf@salmakis.intevation.de> Message-ID: Bernhard Herzog writes: > What I'd still like to have is a way to at least get the functionality > of xgettext (or pygettext) and msgfmt into the distutils setup.py in > such a way that it works on windows as well as on Unix. > > Is someone working on this kind of thing? I'm with Barry here: You shouldn't have xgettext as part of the build or install commands. Providing a different command would be fine. For msgfmt, having that as a build step would be useful. However, more important, to me, seems to define a mechanism to smoothly install .mo files, in a location where gettext would find them. Regards, Martin From barry@python.org Wed Apr 16 20:36:08 2003 From: barry@python.org (Barry Warsaw) Date: 16 Apr 2003 15:36:08 -0400 Subject: [Python-Dev] Re: [I18n-sig] Changes to gettext.py for Python 2.3 In-Reply-To: References: <1050083516.11172.40.camel@barry> <3E971D8A.5020006@v.loewis.de> <1050092819.11172.89.camel@barry> <1050511925.9818.78.camel@barry> Message-ID: <1050521768.14112.15.camel@barry> On Wed, 2003-04-16 at 15:20, Martin v. Löwis wrote: > Barry Warsaw writes: > > > Right, but see above. E.g. if your string literals are all Spanish and > > you want a Turkish translation, then utf-8 is the only common encoding > > you could possibly use in a .po file, right? > > That's why your string literals should never be all Spanish. If you > have Spanish string literals and use escape codes in the msgid, > reading the Spanish msgid becomes difficult, anyway. 
So why isn't the English/US-ASCII bias for msgids considered a liability
for gettext? Do non-English programmers not want to use native literals
in their source code?

If we adhere to this limitation instead of extending gettext then it
seems like Zope will be forced to use something else, and that seems
like a waste. Its msgids come from sources other than program source
code and such sources may indeed be written in non-English. It seems
like gettext is so close and all the machinery is almost there, that
this small enhancement should be harmless and helpful.

BTW, I believe that if all your msgids /are/ us-ascii, you should be
able to ignore this change and have it work backwards compatibly.
Also, this change ought to visibly only affect .ugettext() which isn't
part of the traditional gettext API anyway.

> > > 3. By converting the msgids, they are also changing them. That means
> > > the msgids are not really suitable as keys anymore.
> >
> > Is this still a problem for when charset=utf-8?
>
> If the msgids are UTF-8, with non-ASCII characters C-escaped,
> translators will *still* put non-UTF-8 encodings into the catalogs.
> This will then be a problem: The catalog encoding won't be UTF-8,
> and you can't process the msgids.

Isn't this just another validation step to run on the .po files? There
are already several ways translators can (and do!) make mistakes, so we
already have to validate the files anyway.

-Barry

From barry@python.org Wed Apr 16 20:59:44 2003
From: barry@python.org (Barry Warsaw)
Date: 16 Apr 2003 15:59:44 -0400
Subject: [I18n-sig] pygettext and msgfmt support for distutils
In-Reply-To:
References: <6qisteh0gh.fsf@salmakis.intevation.de>
Message-ID: <1050523183.14115.41.camel@barry>

On Wed, 2003-04-16 at 15:24, Martin v. Löwis wrote:

> However, more important, to me, seems to define a mechanism to
> smoothly install .mo files, in a location where gettext would find
> them.

Excellent point!
setup.py has some provisions for installing data files, so maybe that
can be piggybacked? I don't have time to look into this right now, but
it would be nice to do something like

- specify the domain in setup
- have setup drop the files in <--install-data>/xx/LC_MESSAGES

-Barry

From martin@v.loewis.de Wed Apr 16 23:07:15 2003
From: martin@v.loewis.de ("Martin v. Löwis")
Date: Thu, 17 Apr 2003 00:07:15 +0200
Subject: [Python-Dev] Re: [I18n-sig] Changes to gettext.py for Python 2.3
In-Reply-To: <1050521768.14112.15.camel@barry>
References: <1050083516.11172.40.camel@barry>
	<3E971D8A.5020006@v.loewis.de>
	<1050092819.11172.89.camel@barry>
	<1050511925.9818.78.camel@barry>
	<1050521768.14112.15.camel@barry>
Message-ID: <3E9DD413.8030002@v.loewis.de>

Barry Warsaw wrote:

> So why isn't the English/US-ASCII bias for msgids considered a liability
> for gettext? Do non-English programmers not want to use native literals
> in their source code?

Using English for msgids is about the only way to get translation.
Finding a Turkish speaker who can translate from Spanish is
*significantly* more difficult than starting from English; if you were
starting from, say, Chinese, going to Hebrew might just be impossible.

So any programmer who seriously wants to have his software translated
will put English texts into the source code. Non-English literals are
only used if l10n is not an issue.

> If we adhere to this limitation instead of extending gettext then it
> seems like Zope will be forced to use something else, and that seems
> like a waste.

It's not a limitation of gettext, but a usage guideline: gettext can
map arbitrary byte strings to arbitrary other byte strings.

> BTW, I believe that if all your msgids /are/ us-ascii, you should be
> able to ignore this change and have it work backwards compatibly.

"This" change being addition of the "coerce" argument? If you think you
will need it, we can leave it in.
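For readers following the "coerce" discussion, here is a minimal sketch
of what decoding a catalog to Unicode once, at load time, could look
like. It assumes the raw catalog is a plain dict of encoded byte
strings; coerce_catalog is a hypothetical helper written for this
illustration and is not part of gettext.py. The sample record is the
one from Barry's msgfmt test earlier in the thread.

```python
def coerce_catalog(raw_catalog, charset):
    """Decode every msgid and msgstr once, using the catalog's charset.

    raw_catalog: dict mapping bytes -> bytes (assumed layout).
    Returns a new dict mapping str -> str.
    """
    return {
        msgid.decode(charset): msgstr.decode(charset)
        for msgid, msgstr in raw_catalog.items()
    }


# UTF-8 encoded record: msgid "ab" + U+00DE, msgstr U+00A4 + "yz".
raw = {b"ab\xc3\x9e": b"\xc2\xa4yz"}
catalog = coerce_catalog(raw, "utf-8")

# Lookups now use text keys directly; fall back to the msgid if absent.
print(catalog.get("ab\u00de", "ab\u00de"))
```

The trade-off against encoding on each lookup is the one raised in the
thread: decoding up front costs a pass over the catalog at load time,
and byte-string callers would then need an encode step on the way out.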
>>If the msgids are UTF-8, with non-ASCII characters C-escaped, >>translators will *still* put non-UTF-8 encodings into the catalogs. >>This will then be a problem: The catalog encoding won't be UTF-8, >>and you can't process the msgids. > > > Isn't this just another validation step to run on the .po files? There > are already several ways translators can (and do!) make mistakes, so we > already have to validate the files anyway. I'm not sure how exactly a validation step would be executed. Would that step simply verify that the encoding of a catalog is UTF-8? That validation step would fail for catalogs that legally use other charsets. Regards, Martin From bh@intevation.de Thu Apr 17 11:23:52 2003 From: bh@intevation.de (Bernhard Herzog) Date: 17 Apr 2003 12:23:52 +0200 Subject: [I18n-sig] pygettext and msgfmt support for distutils In-Reply-To: References: <6qisteh0gh.fsf@salmakis.intevation.de> Message-ID: <6qel41e81z.fsf@salmakis.intevation.de> martin@v.loewis.de (Martin v. Löwis) writes: > I'm with Barry here: You shouldn't have xgettext as part of the build > or install commands. That wasn't quite my intention. For the xgettext step I was thinking more of a separate command that is not automatically called when doing a build. It seems to me that maybe such a command should be run as part of the sdist command to make sure that a .pot file shipped with the sources is up to date. Another reason I thought it might be good to have it as part of distutils is that it could make use of information the distutils have to, say, extract the translatable strings from all python source files in the distribution. > Providing a different command would be fine. For msgfmt, having that > as a build step would be useful. > > However, more important, to me, seems to define a mechanism to > smoothly install .mo files, in a location where gettext would find > them. 
That wasn't much of a problem in our case, but thuban is more an application than a library so we don't install under site-packages and so have more control over where to look for mo files. I simply put the mo files into a directory right next to another directory containing data files (icons). I use this scheme in Sketch too (although without distutils so far) and so far nobody has complained about this :). Bernhard -- Intevation GmbH http://intevation.de/ Sketch http://sketch.sourceforge.net/ MapIt! http://www.mapit.de/ From bh@intevation.de Sat Apr 19 19:20:44 2003 From: bh@intevation.de (Bernhard Herzog) Date: 19 Apr 2003 20:20:44 +0200 Subject: [I18n-sig] pygettext and msgfmt support for distutils References: <6qisteh0gh.fsf@salmakis.intevation.de> <1050516640.9818.150.camel@barry> Message-ID: <6q65paic1v.fsf@salmakis.intevation.de> Barry Warsaw writes: > I haven't had time to look at the latest xgettext, but do you know if it > supports all the extra features that pygettext supports? AFAICT python support means mostly that it understands the python syntax enough to recognize all string literals correctly. > Of primary > importance to me is the -D/--docstrings and -X/--no-docstrings options. There doesn't seem to be support for this. At least there's nothing in the docs about this. Bernhard -- Intevation GmbH http://intevation.de/ Sketch http://sketch.sourceforge.net/ MapIt! http://www.mapit.de/ From tex@I18nGuy.com Sun Apr 20 00:01:37 2003 From: tex@I18nGuy.com (Tex Texin) Date: Sat, 19 Apr 2003 19:01:37 -0400 Subject: [I18n-sig] IUC24 Call for Papers INTERNATIONAL COMPUTING SOLUTIONS FOR GLOBAL BUSINESS Message-ID: <3EA1D551.80C8B684@I18nGuy.com> Join us in Atlanta this September! >>>>>>>>>>>>>>>>>>>>>>>>>> Call for Papers! 
<<<<<<<<<<<<<<<<<<<<<<<<< Twenty-fourth Internationalization and Unicode Conference (IUC24) Unicode, Internationalization, the Web: Powering Global Business See Call for Papers at: http://www.unicode.org/iuc/iuc24/call.html September 3-5, 2003 Atlanta, Georgia >>>>>>>>>>>>>>>>>>>> Send in your submission now! <<<<<<<<<<<<<<<<<<< Submissions due: May 2, 2003 Notification date: May 23, 2003 Completed papers due: June 13, 2003 (in electronic form and camera-ready paper form) >>>>>>>>>>>>>>>>>>>>>>>> Just 2 weeks to go! <<<<<<<<<<<<<<<<<<<<<<<< WHAT'S NEW Each conference's theme is different, allowing key subject areas to be explored in depth. This conference will explore global business needs and solutions and the impact of new technologies. Go to the conference web site for a graphical version of this message, and submit your proposal via our new web-based form! http://www.unicode.org/iuc/iuc24/call.html THEME: INTERNATIONAL COMPUTING SOLUTIONS FOR GLOBAL BUSINESS "International Computing Solutions for Global Business" is the overall theme of the Conference. In today's tight economy, companies are looking for productivity improvements and increased international sales. One of many challenges is to maximize use of existing resources while accomplishing these achievements. Another is to incorporate new standards and technologies to gain competitive features. In support of the theme and these challenges, papers on GLOBAL BUSINESS and NEW TECHNOLOGIES are requested. More details on the theme and other topics of interest can be found at our web site: http://www.unicode.org/iuc/iuc24/call.html INVITATION TO SUBMIT PAPERS We invite you to submit papers which define tomorrow's computing, demonstrate best practices in computing today, or articulate problems that must be solved before further advances can occur. Presentations should be geared towards a technical audience. 
The Internationalization & Unicode Conference is the premier technical conference worldwide for both software and Web internationalization. The conference features tutorials, lectures, and panel discussions that provide coverage of standards, best practices, and recent advances in the globalization of software and the Internet. The conference continues to provide a forum for identifying and discussing new issues in this field. New technologies, innovative Internet applications, and the evolving Unicode Standard bring new challenges along with their new capabilities. This technical conference will explore the opportunities created by the latest advances, how to leverage them, and the potential pitfalls. Their impact on business and the problem areas that need further research will also be identified. Best practices for designing applications that can accommodate any language will be demonstrated. Attendees benefit from the wide range of basic to advanced topics and the opportunities for dialog and idea exchange with experts and peers. We invite you to submit papers that relate to Unicode or any aspect of software and Web Internationalization, with special emphasis on the themes discussed below. You can view the programs of previous conferences at: http://www.unicode.org/unicode/conference/about-conf.html CONFERENCE ATTENDEES Conference attendees are generally involved in either the development and deployment of Unicode software, or the globalization of software and the Internet. They include managers, software engineers, testers, systems analysts, program managers, font designers, graphic designers, content developers, web designers, web administrators, site coordinators, technical writers, and product marketing personnel. EXHIBIT OPPORTUNITIES The Conference SHOWCASE area is for corporations and individuals who wish to display and promote their products, technology and/or services. Every effort will be made to provide maximum exposure, advertising and traffic. 
Exhibit space is limited. For further information or to reserve a place, please contact Global Meeting Services at info@global-conference.com. CONFERENCE VENUE DoubleTree Hotel Atlanta Buckhead 3342 Peachtree Road Atlanta, GA 30326 Tel: +1-404-231-1234 Fax: +1-404-231-3112 THE UNICODE CONSORTIUM The Unicode Consortium is a non-profit organization dedicated to the development, maintenance and promotion of The Unicode Standard, a worldwide character encoding. The Unicode Standard encodes the characters of the world's principal scripts and languages, and is code-for-code identical to the international standard ISO/IEC 10646. The Consortium also defines character properties and algorithms for use in implementations. The membership base of the Unicode Consortium includes major computer corporations, software producers, database vendors, research institutions, international agencies and various user groups. For further information on the Unicode Standard, visit the Unicode Web site at http://www.unicode.org or e-mail * * * * * Unicode(r) and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission. From perky@fallin.lv Mon Apr 21 00:01:03 2003 From: perky@fallin.lv (Hye-Shik Chang) Date: Mon, 21 Apr 2003 08:01:03 +0900 Subject: [I18n-sig] ANN: iconvcodec 1.0 is released Message-ID: <20030420230103.GA20594@fallin.lv> Hi, i18n guys! I just released iconvcodec 1.0. The iconvcodec is an universal unicode codec module for Python using POSIX iconv(3). It supports various libiconv implementations including GNU libiconv, GNU libc, FreeBSD iconv, Solaris iconv and etc. 
It also supports the following features:

* PEP 293 Error Callbacks (for Python 2.3 only)
* Reentrant-safe encoder and decoder
* Adaptive multiple unicode encodings: UCS, swapped UCS, UTF-8
* Stateful/context-aware StreamReader and StreamWriter

You can download the source and binary packages for FreeBSD, RedHat and/or Windows from SourceForge:

http://sourceforge.net/project/showfiles.php?group_id=46747

Thank you!

Regards,
Hye-Shik =)

From barry@python.org Tue Apr 22 20:19:47 2003
From: barry@python.org (Barry Warsaw)
Date: 22 Apr 2003 15:19:47 -0400
Subject: [I18n-sig] pygettext and msgfmt support for distutils
In-Reply-To: <6qel41e81z.fsf@salmakis.intevation.de>
References: <6qisteh0gh.fsf@salmakis.intevation.de> <6qel41e81z.fsf@salmakis.intevation.de>
Message-ID: <1051039187.32583.37.camel@barry>

On Thu, 2003-04-17 at 06:23, Bernhard Herzog wrote:
> martin@v.loewis.de (Martin v. Löwis) writes:
>
> > I'm with Barry here: You shouldn't have xgettext as part of the build
> > or install commands.
>
> That wasn't quite my intention. For the xgettext step I was thinking
> more of a separate command that is not automatically called when doing a
> build. It seems to me that maybe such a command should be run as part of
> the sdist command to make sure that a .pot file shipped with the sources
> is up to date.

I tend to think about updating the .pot file on a much different schedule than creating source distributions. Actually, sdist-time would be too late since I usually like to give my translators a little heads-up before a release.

> Another reason I thought it might be good to have it as part of
> distutils is that it could make use of information the distutils have
> to, say, extract the translatable strings from all python source files
> in the distribution.

In my experience, this isn't too much of a problem. It's usually pretty easy to write a find script to calculate the files for extraction.
The hard part (for me) is figuring out which files you also want to extract docstrings for, and such a distinction isn't built into distutils (although possibly could be -- they're usually command line scripts).

-Barry

From barry@python.org Tue Apr 22 20:22:41 2003
From: barry@python.org (Barry Warsaw)
Date: 22 Apr 2003 15:22:41 -0400
Subject: [I18n-sig] pygettext and msgfmt support for distutils
In-Reply-To: <6q65paic1v.fsf@salmakis.intevation.de>
References: <6qisteh0gh.fsf@salmakis.intevation.de> <1050516640.9818.150.camel@barry> <6q65paic1v.fsf@salmakis.intevation.de>
Message-ID: <1051039360.32583.41.camel@barry>

On Sat, 2003-04-19 at 14:20, Bernhard Herzog wrote:
> Barry Warsaw writes:
>
> > I haven't had time to look at the latest xgettext, but do you know if it
> > supports all the extra features that pygettext supports?
>
> AFAICT python support means mostly that it understands the python syntax
> enough to recognize all string literals correctly.

That's a good start, for sure.

> > Of primary
> > importance to me is the -D/--docstrings and -X/--no-docstrings options.
>
> There doesn't seem to be support for this. At least there's nothing in
> the docs about this.

Ok, IBWNI.

BTW, the reason I want this is mostly because I put usage information in module docstrings for command line scripts. I really don't want to use something like:

    __doc__ = _("""mailmanctl -- start and stop the qrunner daemons
    ...
    """)

I use this as

    def usage(...):
        print _(__doc__)

And I just want to be able to say, okay, extract the docstring for bin/mailmanctl, but not for certain other files.
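
The docstring pattern described above can be sketched roughly as follows. This is only an illustration in Python 3 syntax (the thread itself predates it and uses the Python 2 print statement); the script name demo-ctl and the helper usage_text() are made up for the example, and NullTranslations stands in for whatever catalog a real script would load.

```python
"""demo-ctl -- start and stop a hypothetical daemon

Usage: demo-ctl [start|stop]
"""
import gettext

# NullTranslations returns every message unchanged; a real script would
# load the catalog for its own domain instead (assumption for the sketch).
_ = gettext.NullTranslations().gettext


def usage_text(doc=None):
    """Translate the usage text when it is needed, not when it is defined."""
    if doc is None:
        # The module docstring stays a plain literal, so an extractor run
        # with docstring extraction enabled can pick it up as a msgid.
        doc = __doc__ or ""
    return _(doc)


if __name__ == "__main__":
    print(usage_text())
```

The point of the design is that the docstring is never wrapped in _() at definition time, so it remains an ordinary docstring, and translation only happens when the usage message is printed.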
-Barry From barry@python.org Tue Apr 22 20:53:25 2003 From: barry@python.org (Barry Warsaw) Date: 22 Apr 2003 15:53:25 -0400 Subject: [Python-Dev] Re: [I18n-sig] Changes to gettext.py for Python 2.3 In-Reply-To: <3E9DD413.8030002@v.loewis.de> References: <1050083516.11172.40.camel@barry> <3E971D8A.5020006@v.loewis.de> <1050092819.11172.89.camel@barry> <1050511925.9818.78.camel@barry> <1050521768.14112.15.camel@barry> <3E9DD413.8030002@v.loewis.de> Message-ID: <1051041205.32490.51.camel@barry> On Wed, 2003-04-16 at 18:07, "Martin v. Löwis" wrote: > > So why isn't the English/US-ASCII bias for msgids considered a liability > > for gettext? Do non-English programmers not want to use native literals > > in their source code? > > Using English for msgids is about the only way to get translation. > Finding a Turkish speaker who can translate from Spanish is > *significantly* more difficult than starting from English; if you were > starting from, say, Chinese, and going to Hebrew might just be impossible. > > So any programmer who seriously wants to have his software translated > will put English texts into the source code. Non-English literals are > only used if l10n is not an issue. That's probably true. I'm just not sure Zope wants to make that a requirement. > > BTW, I believe that if all your msgids /are/ us-ascii, you should be > > able to ignore this change and have it works backwards compatibly. > > "This" change being addition of the "coerce" argument? If you think > you will need it, we can leave it in. Actually, thinking about this more, we probably don't even need the coerce flag. If all your msgids are us-ascii, you don't care whether they've been coerced to Unicode or not because they'll still compare equal. So I propose to remove the coerce flag, but still Unicode-ify both msgids and msgstrs. Then .ugettext() will just return the Unicode msgstr in the catalog, while .gettext() will encode it to an 8-bit string based on the charset. 
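
As a rough sketch of what that proposal amounts to, assuming Python 3 terms where str plays the role of the Unicode string and bytes the 8-bit string: the class SketchTranslations below is hypothetical and is not the real gettext.GNUTranslations; it only mirrors the described behaviour of the two lookup methods.

```python
class SketchTranslations:
    """Hypothetical stand-in for a message catalog (not the gettext API)."""

    def __init__(self, raw_catalog, charset):
        self._charset = charset
        # Unicode-ify both msgids and msgstrs once, when the catalog loads.
        self._catalog = {
            msgid.decode(charset): msgstr.decode(charset)
            for msgid, msgstr in raw_catalog.items()
        }

    def ugettext(self, message):
        # Return the Unicode msgstr, or the message itself if untranslated.
        return self._catalog.get(message, message)

    def gettext(self, message):
        # Same lookup, then encode to an 8-bit string using the charset.
        return self.ugettext(message).encode(self._charset)


catalog = SketchTranslations({b"Hello": b"Gr\xfc\xdf Gott"}, "iso-8859-1")
unicode_result = catalog.ugettext("Hello")
byte_result = catalog.gettext("Hello")
```

Here unicode_result is a str and byte_result is the same text encoded back to ISO-8859-1. One difference from the Python 2 situation in the thread: in Python 3 an ASCII bytes object does not compare equal to a str, so the sketch takes str msgids in both methods.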
Personally, I think most i18n Python apps are going to want to use .ugettext() anyway, so for the average program this will just work as expected.

I have the tests passing for this change. Any objections?

> >>If the msgids are UTF-8, with non-ASCII characters C-escaped,
> >>translators will *still* put non-UTF-8 encodings into the catalogs.
> >>This will then be a problem: The catalog encoding won't be UTF-8,
> >>and you can't process the msgids.
> >
> > Isn't this just another validation step to run on the .po files? There
> > are already several ways translators can (and do!) make mistakes, so we
> > already have to validate the files anyway.
>
> I'm not sure how exactly a validation step would be executed. Would that
> step simply verify that the encoding of a catalog is UTF-8? That
> validation step would fail for catalogs that legally use other charsets.

The validation step would make sure that all the msgids and msgstrs could be decoded using the encoding claimed in the headers. If msgids are us-ascii then (just about) any other encoding for msgstrs should work just fine. If there are non-ascii characters in both msgids and msgstrs, then some common encoding would have to be used (what other than utf-8?). It's a choice left up to the application and its translators.

-Barry

From martin@v.loewis.de Tue Apr 22 23:15:08 2003
From: martin@v.loewis.de (Martin v. Löwis)
Date: 23 Apr 2003 00:15:08 +0200
Subject: [Python-Dev] Re: [I18n-sig] Changes to gettext.py for Python 2.3
In-Reply-To: <1051041205.32490.51.camel@barry>
References: <1050083516.11172.40.camel@barry> <3E971D8A.5020006@v.loewis.de> <1050092819.11172.89.camel@barry> <1050511925.9818.78.camel@barry> <1050521768.14112.15.camel@barry> <3E9DD413.8030002@v.loewis.de> <1051041205.32490.51.camel@barry>
Message-ID:

Barry Warsaw writes:

> So I propose to remove the coerce flag, but still Unicode-ify both
> msgids and msgstrs.
Then .ugettext() will just return the Unicode > msgstr in the catalog, while .gettext() will encode it to an 8-bit > string based on the charset. Personally, I think most i18n Python apps > are going to want to use .ugettext() anyway, so for the average program > this will just work as expected. > > I have the tests passing for this change. Any objections? For safety, I'd recommend that you use byte string msgids if conversion to Unicode fails. Otherwise, I'm fine with automatically coercing everything to Unicode. I do know about catalogs that use Latin-1 in msgids (to represent accented characters in the names of authors). That should not cause failures. Regards, Martin From barry@python.org Thu Apr 24 15:58:36 2003 From: barry@python.org (Barry Warsaw) Date: 24 Apr 2003 10:58:36 -0400 Subject: [Python-Dev] Re: [I18n-sig] Changes to gettext.py for Python 2.3 In-Reply-To: References: <1050083516.11172.40.camel@barry> <3E971D8A.5020006@v.loewis.de> <1050092819.11172.89.camel@barry> <1050511925.9818.78.camel@barry> <1050521768.14112.15.camel@barry> <3E9DD413.8030002@v.loewis.de> <1051041205.32490.51.camel@barry> Message-ID: <1051196316.22909.13.camel@barry> On Tue, 2003-04-22 at 18:15, Martin v. Löwis wrote: > For safety, I'd recommend that you use byte string msgids if > conversion to Unicode fails. Otherwise, I'm fine with automatically > coercing everything to Unicode. For now, I'll add a comment to the code at the point of conversion since I'm not sure whether it's better to throw an exception or attempt to carry on with 8-bit strings. I'll update the docs too. > I do know about catalogs that use Latin-1 in msgids (to represent > accented characters in the names of authors). That should not cause > failures. Cool, thanks for the feedback Martin! -Barry
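
The two ideas discussed in this thread, checking that every entry decodes with the charset the catalog claims and falling back to byte strings when it does not, might look something like the sketch below. It is written for Python 3 (bytes and str), the function names decode_catalog and undecodable_entries are invented for illustration, and it does not show what gettext.py actually does at its point of conversion.

```python
def decode_catalog(raw_entries, charset):
    """Coerce msgids and msgstrs to Unicode, keeping bytes on failure."""

    def coerce(data):
        try:
            return data.decode(charset)
        except UnicodeDecodeError:
            # Fallback: leave the byte string untouched.
            return data

    return {coerce(msgid): coerce(msgstr)
            for msgid, msgstr in raw_entries.items()}


def undecodable_entries(raw_entries, charset):
    """List the msgids whose id or translation fails to decode."""
    bad = []
    for msgid, msgstr in raw_entries.items():
        for data in (msgid, msgstr):
            try:
                data.decode(charset)
            except UnicodeDecodeError:
                bad.append(msgid)
                break  # report each entry only once
    return bad


# A Latin-1 author name inside a catalog that claims to be UTF-8.
raw = {b"Hello": b"Hallo", b"Author: L\xf6wis": b"Autor: L\xf6wis"}
as_utf8 = decode_catalog(raw, "utf-8")
problems = undecodable_entries(raw, "utf-8")
```

With the sample data, the ASCII entry is coerced to Unicode while the Latin-1 entry survives as bytes instead of raising, and problems names that entry so a translator can be asked to fix either the text or the declared charset.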