From barry@zope.com Tue Oct 2 15:06:58 2001
From: barry@zope.com (Barry A. Warsaw)
Date: Tue, 2 Oct 2001 10:06:58 -0400
Subject: [I18n-sig] Re: [Mailman-i18n] another question
References: <20011001222032.I1342@abulafia.casa>
	<15288.58471.331864.135744@anthem.wooz.org>
	<20011002002823.A30920@transas.co.uk>
	<15289.1014.502023.950703@anthem.wooz.org>
	<20011002092314.A32656@transas.co.uk>
Message-ID: <15289.51714.465519.847800@anthem.wooz.org>

[I'm cc'ing Python's i18n-sig... -baw]

>>>>> "MS" == Mikhail Sobolev writes:

    MS> Are you going to do something similar for Russian, then? :))
    MS> I believe, here we have at least three possible cases... :))

    BAW> Can you suggest what needs to be done?

    MS> Well, if I understand it right, as of the current [in Debian]
    MS> version of gettext (0.10.40), there is a special construct in
    MS> .po files which allows different appropriate translations of
    MS> plural forms.  My understanding (as I never used it) is that
    MS> you need to use ngettext instead of gettext in cases where
    MS> the result speaks about amounts.  So briefly (I almost quote
    MS> the info page for gettext, the section `Plural Forms'):

    MS> Instead of

    | printf (gettext ("We've got %d bird(s)"), n);

    MS> and even

    | if (n == 1)
    |     printf (gettext ("We've got 1 bird"));
    | else
    |     printf (gettext ("We've got %d birds"), n);

    MS> one would need to use

    | printf (ngettext ("We've got %d bird", "We've got %d birds", n), n);

    MS> where the first string is used as the msgid, and the second
    MS> string is used for the English language in case n != 1.

    MS> You'd need to add a special line to the header entry (example
    MS> for Russian):

    | Plural-Forms: nplurals=3; \
    |     plural=n%10==1 && n%100!=11 ? 0 : \
    |            n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2;

    MS> and every entry that corresponds to such a text:

    | msgid "the singular form"
    | msgid_plural "the plural form"
    | msgstr[0] "translated string for case 0"
    | ...
    | msgstr[n] "translated string for case n"

None of the Python i18n tools supports ngettext or plural forms.  I
definitely don't have the time right now to enhance either the gettext
module or pygettext.py to grok this, which seems pretty complicated to
me.  Unless someone volunteers to add such support to the Python i18n
tools, I suggest we find a way to hack around it in the Mailman source
code.

Cheers,
-Barry

From loewis@informatik.hu-berlin.de Tue Oct 2 16:39:06 2001
From: loewis@informatik.hu-berlin.de (Martin von Loewis)
Date: Tue, 2 Oct 2001 17:39:06 +0200 (MEST)
Subject: [I18n-sig] Re: [Mailman-i18n] another question
In-Reply-To: <20011002163225.A2842@transas.co.uk> (message from Mikhail
	Sobolev on Tue, 2 Oct 2001 16:32:25 +0100)
References: <20011001222032.I1342@abulafia.casa>
	<15288.58471.331864.135744@anthem.wooz.org>
	<20011002002823.A30920@transas.co.uk>
	<15289.1014.502023.950703@anthem.wooz.org>
	<20011002092314.A32656@transas.co.uk>
	<15289.51714.465519.847800@anthem.wooz.org>
	<20011002163225.A2842@transas.co.uk>
Message-ID: <200110021539.RAA14902@paros.informatik.hu-berlin.de>

> Hmm...  I had a look at the code in bin/pygettext.py; it does look a
> bit complicated to me...

Notice that pygettext isn't that much of a problem (you could use
xgettext); it is the .mo file format, i.e. msgfmt.py and gettext.py.
For gettext, it is in particular the evaluation of the plural
expression that might be tricky, especially since the set of possible
expressions and their syntax isn't all that well defined.
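The Plural-Forms expression quoted above is ordinary integer
arithmetic, so it can be mirrored directly in Python. A minimal
sketch of the Russian rule (the function name and the transliterated
word forms are illustrative only; no Python i18n tool of the time did
this):

```python
def russian_plural(n):
    # Transcription of the quoted Plural-Forms expression:
    #   plural = n%10==1 && n%100!=11 ? 0 :
    #            n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2
    if n % 10 == 1 and n % 100 != 11:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2

# Three forms, as nplurals=3 requires (hypothetical transliterations).
forms = ["ptitsa", "ptitsy", "ptits"]
for n in (1, 2, 5, 11, 21, 22, 25):
    print(n, forms[russian_plural(n)])
```

For the record, later versions of Python's gettext module did grow
ngettext support (gettext.ngettext(singular, plural, n)), which picks
among the msgstr[i] entries by evaluating exactly this kind of
expression from the catalog's Plural-Forms header.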
Regards,
Martin

From Misha.Wolf@reuters.com Sat Oct 20 01:17:30 2001
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Sat, 20 Oct 2001 01:17:30 +0100
Subject: [I18n-sig] 20th Unicode Conference, Jan 2002, Washington DC, USA
Message-ID:

   Twentieth International Unicode Conference (IUC20)
   Unicode and the Web: The Global Connection
   http://www.unicode.org/iuc/iuc20
   January 28-31, 2002
   Washington, DC, USA

The Unicode Standard has become the foundation for all modern text
processing.  It is used on large machines, tiny portable devices, and
for distributed processing across the Internet.  The standard brings
cost-reducing efficiency to international applications and enables the
exchange of text in an ever increasing list of natural languages.

New technologies and innovative Internet applications, as well as the
evolving Unicode Standard, bring new challenges along with their new
capabilities.  This technical conference will explore the
opportunities created by the latest advances and how to leverage them,
as well as potential pitfalls to be aware of, and problem areas that
need further research.

Conference attendees will include managers, software engineers,
systems analysts, font designers, graphic designers, content
developers, technical writers, and product marketing personnel,
involved in the development, deployment or use of Unicode software or
content, and the globalization of software and the Internet.

CONFERENCE DATES

January 28-31, 2002

CONFERENCE WEB SITE, PROGRAM and REGISTRATION

The Conference Program and Registration form will be available soon at
the Conference Web site: http://www.unicode.org/iuc/iuc20

CONFERENCE SPONSORS

   Agfa Monotype Corporation
   Basis Technology Corporation
   Microsoft Corporation
   Oracle Corporation
   Reuters Ltd.
   Sun Microsystems, Inc.
   World Wide Web Consortium (W3C)

GLOBAL COMPUTING SHOWCASE

Visit the Showcase to find out more about products supporting the
Unicode Standard, and products and services that can help you
globalize/localize your software, documentation and Internet content.
For details, visit the Conference Web site.

CONFERENCE VENUE

   Omni Shoreham Hotel
   2500 Calvert Street, NW
   Washington, DC 20008 USA
   Tel: +1 202 234 0700
   Fax: +1 202 265 7972

CONFERENCE MANAGEMENT

   Global Meeting Services Inc.
   4030 Porte Le Paz #90
   San Diego, CA 92122, USA
   Tel: +1 858 638 0206 (voice)
        +1 858 638 0504 (fax)
   Email: info@global-conference.com
      or: conference@unicode.org

THE UNICODE CONSORTIUM

The Unicode Consortium was founded as a non-profit organization in
1991.  It is dedicated to the development, maintenance and promotion
of the Unicode Standard, a worldwide character encoding.  The Unicode
Standard encodes the characters of the world's principal scripts and
languages, and is code-for-code identical to the international
standard ISO/IEC 10646.  In addition to cooperating with ISO on the
future development of ISO/IEC 10646, the Consortium is responsible for
providing character properties and algorithms for use in
implementations.  Today the membership base of the Unicode Consortium
includes major computer corporations, software producers, database
vendors, research institutions, international agencies and various
user groups.

For further information on the Unicode Standard, visit the Unicode Web
site at http://www.unicode.org or e-mail

* * * * *

Unicode(r) and the Unicode logo are registered trademarks of Unicode,
Inc.  Used with permission.

-----------------------------------------------------------------
Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of the individual
sender, except where the sender specifically states them to be the
views of Reuters Ltd.
From matt.gushee@fourthought.com Fri Oct 26 15:10:55 2001
From: matt.gushee@fourthought.com (Matt Gushee)
Date: Fri, 26 Oct 2001 14:10:55 +0000
Subject: [I18n-sig] Intro + Encoding names issue
Message-ID: <15321.28399.266224.9609@drachma.fourthought.com>

Greetings, i14ers and l6ers--

I'm a developer at Fourthought, Inc., and I've been charged with
ensuring that our XML software will play nice with all languages in
the known universe.  Okay, maybe that's an exaggeration.  But at least
we would like to support all languages/encodings that are commonly
used on the Internet.  So you'll probably be hearing from me
periodically.  I will probably be the one asking dumb questions --
while I have some linguistics b.g., am bilingual+ (fluent in Japanese,
a bit of Chinese & Spanish), and know a bit about character encodings,
this will be my first serious attempt at software i18n.  So have
mercy ...

On to the namespace issue.  Let me preface this by saying: if this has
already been thoroughly discussed, please feel free to point me to the
relevant thread.  I did dig through several months' worth of the list
archives, and didn't find any discussion of this problem.

The other day I was putting together some Japanese-language test cases
for 4Suite server, and I found out that, although they work fine when
the source and output are in UTF-8, Shift-JIS and EUC-JP don't work
because the Japanese codecs need to be referenced with a 'japanese'
prefix:

    'japanese.euc_jp'
    'japanese.shift_jis'

I would like to be able to reference these encodings by their
conventional names, e.g. just plain 'euc-jp'; I was able to make it
work* by tweaking encodings/aliases.py, but that isn't really
satisfactory as a permanent solution.

* Not completely true ... there is some code in 4Suite/PyXML that
  wrongly returns 'not-well-formed' errors on EUC-JP and Shift_JIS
  documents ... but that's a separate issue from accessing the codecs.

Anyway, the alias hack allowed me to do simple I/O operations using
the standard encoding names.  I understand that this modularization
makes sense from a code-maintenance standpoint, but the need for a
language-specific prefix is a real stumbling block for developing
applications intended to handle arbitrary encodings.  Sure, I could
tweak the 4Suite code to alias the japanese encoding names in an
appropriate fashion, but then what happens when 'koi8-r' becomes
'russian.koi8-r'? ... and so on.

I would suggest that codecs development place a high priority on these
principles:

 * Developers should be able to use the codecs API without
   anticipating every encoding that might be used.

 * End users should be able to install and use internationalized
   Python programs without knowing how the codecs work.

So the ideal would be a solution that allowed codecs developers to
maintain separate packages, but have their component modules "plugged
in" to the encodings namespace on installation.

Apparently the above is impossible or at least very difficult with
Distutils.  Maybe a workable compromise would be to have some sort of
codecs installation utility that would let end users, in one simple
step, insert a set of codecs into the main encodings namespace.

So, what are your thoughts on this?  Again, if I am rehashing previous
threads, I'll be happy to review them if you can let me know where
they are.

--
Matt Gushee                    Consultant
matt.gushee@fourthought.com    +1 303 583 9900 x108
Fourthought, Inc.              http://Fourthought.com
4735 East Walnut St, Boulder, CO 80301-2537, USA
XML strategy, XML tools (http://4Suite.org), knowledge management

From mal@lemburg.com Fri Oct 26 22:06:54 2001
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 26 Oct 2001 23:06:54 +0200
Subject: [I18n-sig] Intro + Encoding names issue
References: <15321.28399.266224.9609@drachma.fourthought.com>
Message-ID: <3BD9D06E.55AE8E73@lemburg.com>

Matt Gushee wrote:
>
> So the ideal would be a solution that allowed codecs developers to
> maintain separate packages, but have their component modules "plugged
> in" to the encodings namespace on installation.
>
> Apparently the above is impossible or at least very difficult with
> Distutils.  Maybe a workable compromise would be to have some sort of
> codecs installation utility that would let end users, in one simple
> step, insert a set of codecs into the main encodings namespace.

Just put an application specific codec search function into
Ft.__init__ and have this search function do the aliasing you wish
to have in place.

The encodings.__init__ search function will do the trick for packaged
codecs.  If you want to tweak the "top-level" codec namespace, you'll
really have to provide your own search function.

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:    http://www.egenix.com/
Python Software:         http://www.lemburg.com/python/

From matt.gushee@fourthought.com Fri Oct 26 16:49:15 2001
From: matt.gushee@fourthought.com (Matt Gushee)
Date: Fri, 26 Oct 2001 15:49:15 +0000
Subject: [I18n-sig] Intro + Encoding names issue
In-Reply-To: <3BD9D06E.55AE8E73@lemburg.com>
References: <15321.28399.266224.9609@drachma.fourthought.com>
	<3BD9D06E.55AE8E73@lemburg.com>
Message-ID: <15321.34299.546665.180117@drachma.fourthought.com>

M.-A. Lemburg writes:
> Matt Gushee wrote:
> >
> > So the ideal would be a solution that allowed codecs developers to
> > maintain separate packages, but have their component modules "plugged
> > in" to the encodings namespace on installation.
> >
> > Apparently the above is impossible or at least very difficult with
> > Distutils.
> > Maybe a workable compromise would be to have some sort of
> > codecs installation utility that would let end users, in one simple
> > step, insert a set of codecs into the main encodings namespace.
>
> Just put an application specific codec search function into
> Ft.__init__ and have this search function do the aliasing you wish
> to have in place.

I appreciate the suggestion, and I will do this if I have to.  But I
was hoping for some discussion of:

 1) whether it is appropriate to put the burden of creating aliases
    on either application developers or end users (and you probably
    gathered that I think it isn't); and

 2) assuming we would like people to be able to use standard
    encoding names without creating their own aliases, is there a
    way to accomplish this goal and still allow language-specific
    codecs sets to be maintained as separate packages?

--
Matt Gushee                    Consultant
matt.gushee@fourthought.com    +1 303 583 9900 x108
Fourthought, Inc.              http://Fourthought.com
4735 East Walnut St, Boulder, CO 80301-2537, USA
XML strategy, XML tools (http://4Suite.org), knowledge management

From martin@v.loewis.de Sat Oct 27 09:40:31 2001
From: martin@v.loewis.de (Martin v. Loewis)
Date: Sat, 27 Oct 2001 10:40:31 +0200
Subject: [I18n-sig] Intro + Encoding names issue
In-Reply-To: <15321.34299.546665.180117@drachma.fourthought.com>
	(message from Matt Gushee on Fri, 26 Oct 2001 15:49:15 +0000)
References: <15321.28399.266224.9609@drachma.fourthought.com>
	<3BD9D06E.55AE8E73@lemburg.com>
	<15321.34299.546665.180117@drachma.fourthought.com>
Message-ID: <200110270840.f9R8eV404057@mira.informatik.hu-berlin.de>

> 1) whether it is appropriate to put the burden of creating aliases
>    on either application developers or end users (and you probably
>    gathered that I think it isn't); and

No, it isn't.
> 2) assuming we would like people to be able to use standard
>    encoding names without creating their own aliases, is there a
>    way to accomplish this goal and still allow language-specific
>    codecs sets to be maintained as separate packages?

Starting with Python 2.1, there is an easy solution.  To discuss this,
I assume you know what a codec search function is and how to register
one (see codecs.register if you don't).

Now, suppose you have a package "japanese", containing a number of
codecs.  Inside japanese/__init__.py, register a search function for
these codecs.  So anybody importing "japanese" will get a codec
"euc-jp".  Install the "japanese" directory into site-packages.

This works for all Python versions, but still requires applications to
"import japanese".  That is where a 2.1 feature comes into play: in
site-packages, create a file "japanese.pth".  In that file, add a
single line

    import japanese

Then, every time Python starts, the japanese codecs will be
automatically registered.

HTH,
Martin

From mal@lemburg.com Sat Oct 27 16:16:02 2001
From: mal@lemburg.com (M.-A. Lemburg)
Date: Sat, 27 Oct 2001 17:16:02 +0200
Subject: [I18n-sig] Intro + Encoding names issue
References: <15321.28399.266224.9609@drachma.fourthought.com>
	<3BD9D06E.55AE8E73@lemburg.com>
	<15321.34299.546665.180117@drachma.fourthought.com>
Message-ID: <3BDACFB2.DC368820@lemburg.com>

Matt Gushee wrote:
>
> M.-A. Lemburg writes:
> > Matt Gushee wrote:
> > >
> > > So the ideal would be a solution that allowed codecs developers to
> > > maintain separate packages, but have their component modules "plugged
> > > in" to the encodings namespace on installation.
> > >
> > > Apparently the above is impossible or at least very difficult with
> > > Distutils.  Maybe a workable compromise would be to have some sort of
> > > codecs installation utility that would let end users, in one simple
> > > step, insert a set of codecs into the main encodings namespace.
> >
> > Just put an application specific codec search function into
> > Ft.__init__ and have this search function do the aliasing you wish
> > to have in place.
>
> I appreciate the suggestion, and I will do this if I have to.  But I
> was hoping for some discussion of:
>
> 1) whether it is appropriate to put the burden of creating aliases
>    on either application developers or end users (and you probably
>    gathered that I think it isn't); and
>
> 2) assuming we would like people to be able to use standard
>    encoding names without creating their own aliases, is there a
>    way to accomplish this goal and still allow language-specific
>    codecs sets to be maintained as separate packages?

I don't really see how, except by proposing to add the packaged codec
names to encodings/aliases.py.  This should be acceptable for those
codec packages which are well-accepted and maintained, e.g. we could
add aliases for Tamito's Japanese package under the standard encoding
names for the supported codecs.  The imports would still fail in case
the user forgot to install that package, though (which is good, IMHO).

What I don't like is adding some kind of magic which goes on behind
the scenes.  If things then go wrong, finding the cause of the problem
would be much harder.

If everybody agrees with pointing Python at Tamito's package for the
encodings he supports in his package, then I think we should simply
add these aliases to the core encodings package.  He did a great job
on those codecs and the licenses fit in as well.

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:    http://www.egenix.com/
Python Software:         http://www.lemburg.com/python/

From martin@v.loewis.de Sun Oct 28 07:33:41 2001
From: martin@v.loewis.de (Martin v. Loewis)
Date: Sun, 28 Oct 2001 08:33:41 +0100
Subject: [I18n-sig] Intro + Encoding names issue
In-Reply-To: <3BDACFB2.DC368820@lemburg.com> (mal@lemburg.com)
References: <15321.28399.266224.9609@drachma.fourthought.com>
	<3BD9D06E.55AE8E73@lemburg.com>
	<15321.34299.546665.180117@drachma.fourthought.com>
	<3BDACFB2.DC368820@lemburg.com>
Message-ID: <200110280733.f9S7XfB01766@mira.informatik.hu-berlin.de>

> If everybody agrees with pointing Python at Tamito's package for the
> encodings he supports in his package, then I think we should simply
> add these aliases to the core encodings package.  He did a great
> job on those codecs and the licenses fit in as well.

I don't like this approach.  There must be a solution that allows
installation of additional codecs without modifying the core, and
there is one.

Furthermore, there are alternative codecs for Japanese, e.g. the iconv
codec also supports euc-jp if the underlying iconv implementation
supports it; most do.  If euc-jp were a built-in alias for
japanese.euc-jp, and the japanese package were not installed, the
iconv codec would have no chance of providing it.

Finally, I think this would be a misuse of the aliases mechanism.
Aliases should convert to the canonical form of the character set
name.  I'd argue that the canonical form is the name that IANA has
assigned to this character set, or at least the alias that IANA
designates as the preferred MIME name.  For euc-jp, the name is
Extended_UNIX_Code_Packed_Format_for_Japanese; the preferred MIME name
is EUC-JP.  Installing an alias that converts euc-jp to
japanese.euc-jp is definitely wrong.

Regards,
Martin

From mal@lemburg.com Mon Oct 29 08:43:59 2001
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 29 Oct 2001 09:43:59 +0100
Subject: [I18n-sig] Intro + Encoding names issue
References: <15321.28399.266224.9609@drachma.fourthought.com>
	<3BD9D06E.55AE8E73@lemburg.com>
	<15321.34299.546665.180117@drachma.fourthought.com>
	<3BDACFB2.DC368820@lemburg.com>
	<200110280733.f9S7XfB01766@mira.informatik.hu-berlin.de>
Message-ID: <3BDD16CF.8A1BE7D1@lemburg.com>

"Martin v. Loewis" wrote:
>
> > If everybody agrees with pointing Python at Tamito's package for the
> > encodings he supports in his package, then I think we should simply
> > add these aliases to the core encodings package.  He did a great
> > job on those codecs and the licenses fit in as well.
>
> I don't like this approach.  There must be a solution that allows
> installation of additional codecs without modifying the core, and
> there is one.

I know... just wanted to make it easier on the user, but you have a
point: there's more than just one codec out there for most encodings.

> Furthermore, there are alternative codecs for Japanese, e.g. the iconv
> codec also supports euc-jp if the underlying iconv implementation
> supports it; most do.  If euc-jp were a built-in alias for
> japanese.euc-jp, and the japanese package were not installed, the
> iconv codec would have no chance of providing it.

True.  You also touch a wound spot here: there's currently no way to
override codecs which are available in the encodings package, since
the encodings search function is always installed before all other
search functions.  Perhaps we ought to let applications register their
search function before the encodings one too ?!

> Finally, I think this would be a misuse of the aliases
> mechanism.  Aliases should convert to the canonical form of the
> character set name.  I'd argue that the canonical form is the name that
> IANA has assigned to this character set, or at least the alias that
> IANA designates as the preferred MIME name.
> For euc-jp, the name is
> Extended_UNIX_Code_Packed_Format_for_Japanese; the preferred MIME name
> is EUC-JP.  Installing an alias that converts euc-jp to japanese.euc-jp
> is definitely wrong.

Ok.  Forget about the idea ;-)

It's probably better to include the alias support in the application
and then make the installation of the right codec packages a
requirement during installation of the application.

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:    http://www.egenix.com/
Python Software:         http://www.lemburg.com/python/

From martin@v.loewis.de Mon Oct 29 17:38:13 2001
From: martin@v.loewis.de (Martin v. Loewis)
Date: Mon, 29 Oct 2001 18:38:13 +0100
Subject: [I18n-sig] Intro + Encoding names issue
In-Reply-To: <3BDD16CF.8A1BE7D1@lemburg.com> (mal@lemburg.com)
References: <15321.28399.266224.9609@drachma.fourthought.com>
	<3BD9D06E.55AE8E73@lemburg.com>
	<15321.34299.546665.180117@drachma.fourthought.com>
	<3BDACFB2.DC368820@lemburg.com>
	<200110280733.f9S7XfB01766@mira.informatik.hu-berlin.de>
	<3BDD16CF.8A1BE7D1@lemburg.com>
Message-ID: <200110291738.f9THcDB01320@mira.informatik.hu-berlin.de>

> You also touch a wound spot here: there's currently no way to
> override codecs which are available in the encodings package, since
> the encodings search function is always installed before all other
> search functions.  Perhaps we ought to let applications register
> their search function before the encodings one too ?!

I'm not sure whether this is necessary.  That would assume that the
application's encodings are "better" in some sense.  While I'm pretty
certain that the iconv codecs are faster than any pure Python codec, I
wouldn't claim that they are better in all respects.  Any application
that wants a specific codec should use the API of that codec.
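The registration-order point discussed here is easy to see directly
in a modern CPython: a search function registered by an application is
appended after the built-in encodings search function, so it is never
even consulted for names the encodings package already resolves. A
small illustration (function name is made up):

```python
import codecs

calls = []

def shadow_ascii(name):
    # Attempt to shadow the built-in 'ascii' codec with latin-1.
    calls.append(name)
    if name == "ascii":
        return codecs.lookup("latin-1")
    return None  # not ours: let other search functions try

codecs.register(shadow_ascii)

info = codecs.lookup("ascii")
print(info.name)   # still the built-in ascii codec
# shadow_ascii was never consulted for 'ascii': the encodings search
# function (registered at startup, ahead of ours) resolved it first.
```

This is exactly the behavior MAL calls a "wound spot": later
registrations cannot override the stdlib's codecs by name.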
> It's probably better to include the alias support in the application
> and then make the installation of the right codec packages a
> requirement during installation of the application.

Actually, there is no need for either the core or the application to
add any aliases.  The .pth approach works fine, AFAICT, and is
completely transparent.

Regards,
Martin
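For reference, the recipe Martin sketches in this thread (a codec
search function registered from a package's __init__.py, activated at
startup by a one-line .pth file) can be written out concretely. The
'demo_euc_jp' alias below is made up; to keep the sketch
self-contained and runnable it maps onto the euc_jp codec that modern
Python ships in the stdlib, whereas in 2001 the codec would have come
from the separate 'japanese' package:

```python
import codecs

# What japanese/__init__.py would contain, per Martin's recipe:
# a codec search function registered as a side effect of import.
def _search(name):
    # codecs hands search functions a lowercased name with spaces
    # turned into underscores; normalize hyphens ourselves.
    if name.replace("-", "_") == "demo_euc_jp":   # hypothetical alias
        return codecs.lookup("euc_jp")
    return None  # unknown to us: let other search functions try

codecs.register(_search)

# Once registered, the alias resolves like any built-in encoding name:
text = "日本語"
assert text.encode("demo-euc-jp") == text.encode("euc-jp")

# The final step of the recipe is a japanese.pth file in site-packages
# containing the single line below, so that the registration happens
# automatically at every interpreter startup (Python 2.1+ executes
# 'import' lines found in .pth files):
#
#     import japanese
```

Note the design trade-off debated above: the search function ships
with the codec package and needs no edits to the core, at the price of
the registration being invisible unless you know the .pth file exists.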