From martin@v.loewis.de Tue Sep 3 07:08:38 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 03 Sep 2002 08:08:38 +0200
Subject: [I18n-sig] Plural forms
In-Reply-To: <3D6352B2.60402@noos.fr>
References: <3D6352B2.60402@noos.fr>
Message-ID:

Juan David Ibáñez Palomar writes:

> The Gettext tools support plural forms, but I think that the Python
> gettext module doesn't, is this right?

That's correct.

> I'd like to contribute in this area, is this possible?

Sure! Contributions are welcome.

> Which would be the procedure? Is a PEP required? Should I do a patch
> to the module and send it somewhere? Does it need a new branch in CVS?

Just submitting a patch should be fine. Notice that there are quite a lot of functions to add, for the various levels of indirection.

> Is there a chance for this feature to be included in Python 2.3? ...

Certainly!
Python 2.3 is still months away, and this sounds like a feature that can be added with little debate.

Regards,
Martin

From barry@python.org Tue Sep 3 13:32:52 2002
From: barry@python.org (Barry A. Warsaw)
Date: Tue, 3 Sep 2002 08:32:52 -0400
Subject: [I18n-sig] Plural forms
References: <3D6352B2.60402@noos.fr>
Message-ID: <15732.44020.748601.834403@anthem.wooz.org>

>>>>> "MvL" == Martin v Loewis writes:

    >> Is there a chance for this feature to be included in Python
    >> 2.3? ...

    MvL> Certainly! Python 2.3 is still months away, and this sounds
    MvL> like a feature that can be added with little debate.

I just briefly (re-)skimmed the gettext manual section on plural forms. It looks like you'll also have to add support for this in pygettext.py, in the .mo/.po header parsing routines, and (maybe?) in msgfmt.py.

-Barry

From martin@v.loewis.de Tue Sep 3 21:00:08 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 03 Sep 2002 22:00:08 +0200
Subject: [I18n-sig] Plural forms
In-Reply-To: <15732.44020.748601.834403@anthem.wooz.org>
References: <3D6352B2.60402@noos.fr> <15732.44020.748601.834403@anthem.wooz.org>
Message-ID:

barry@python.org (Barry A. Warsaw) writes:

> I just briefly (re-)skimmed the gettext manual section on plural
> forms. It looks like you'll also have to add support for this in
> pygettext.py, in the .mo/.po header parsing routines, and (maybe?) in
> msgfmt.py.

Implementing the expression syntax from the po header might be the most tricky part - especially as that syntax is not properly specified (to my knowledge), so one would need to reverse-engineer it from the gettext implementation. I'd suggest that anybody attempting such a thing should also provide gettext maintainers with a detailed documentation of that feature.
Regards,
Martin

From kajiyama@grad.sccs.chukyo-u.ac.jp Wed Sep 4 22:25:08 2002
From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA)
Date: Thu, 5 Sep 2002 06:25:08 +0900
Subject: [I18n-sig] JapaneseCodecs 1.4.8 released
Message-ID: <200209042125.g84LP8317562@grad.sccs.chukyo-u.ac.jp>

Hi,

I've released JapaneseCodecs 1.4.8. The source tarball is available at the following locations:

  http://www.asahi-net.or.jp/~rd6t-kjym/python/
  http://www.python.jp/Zope/download/JapaneseCodecs

Fixed are bugs in the EUC-JP, Shift_JIS and MS932 codecs that failed to encode U+00A5 and U+203E, which originate from the ISO-2022-JP codec and its variants.

I moved my home page recently, so the primary distribution site and the author's e-mail address have also changed.

* * *

By the way, I have a plan to change mappings between Unicode and traditional Japanese encodings such as EUC-JP and Shift_JIS in the next release of JapaneseCodecs. The main reasons for the change are (1) to improve the interoperability between the japanese.ms932 codec and other codecs, and (2) to eliminate non-reversibilities that exist in mappings between Unicode and traditional Japanese encodings.

The following 7 Japanese characters will have their corresponding code points in Unicode changed. (Although only Shift_JIS code points are shown below, EUC-JP and ISO-2022-JP codecs will also be changed.)

1. Shift_JIS 0x81ca
   japanese.sjis  -> U+00ac (NOT SIGN)
   japanese.ms932 -> U+ffe2 (FULLWIDTH NOT SIGN)

2. Shift_JIS 0x815f
   japanese.sjis  -> U+005c (REVERSE SOLIDUS)
   japanese.ms932 -> U+ff3c (FULLWIDTH REVERSE SOLIDUS)

3. Shift_JIS 0x8161
   japanese.sjis  -> U+2016 (DOUBLE VERTICAL LINE)
   japanese.ms932 -> U+2225 (PARALLEL TO)

4. Shift_JIS 0x8160
   japanese.sjis  -> U+301c (WAVE DASH)
   japanese.ms932 -> U+ff5e (FULLWIDTH TILDE)

5. Shift_JIS 0x817c
   japanese.sjis  -> U+2212 (MINUS SIGN)
   japanese.ms932 -> U+ff0d (FULLWIDTH HYPHEN-MINUS)

6. Shift_JIS 0x8191
   japanese.sjis  -> U+00a2 (CENT SIGN)
   japanese.ms932 -> U+ffe0 (FULLWIDTH CENT SIGN)

7. Shift_JIS 0x8192
   japanese.sjis  -> U+00a3 (POUND SIGN)
   japanese.ms932 -> U+ffe1 (FULLWIDTH POUND SIGN)

Due to the differences in the mappings shown above, decoding a byte string using japanese.ms932 and then encoding the resulting Unicode string using japanese.sjis may, for example, raise a UnicodeError saying "no corresponding character in Shift_JIS".

Also, there are non-reversibilities in the codecs for traditional Japanese encodings. For example, the code point 0x815f in Shift_JIS is mapped to U+005c (REVERSE SOLIDUS) when decoded using japanese.sjis. The code point U+005c in Unicode in turn is mapped to 0x005c in Shift_JIS when encoded by the same codec. This non-reversible behavior of the mapping between Shift_JIS and Unicode would be "correct" from the Unicode Consortium's viewpoint, but in practice reversible mappings are desirable. The same non-reversibility exists in other codecs for traditional Japanese encodings.

Therefore, I'd like to change the mapping between Unicode and the traditional Japanese encodings so that all codecs use the same 7 code points in Unicode as japanese.ms932. In my plan, for example, 0x815f in Shift_JIS, 0xa1c0 in EUC-JP, and 0x2140 in ISO-2022-JP (JIS X 0208:1990) will all be mapped to U+ff3c (FULLWIDTH REVERSE SOLIDUS) instead of U+005c (REVERSE SOLIDUS). The corresponding code points in Unicode for the other 6 characters will be changed similarly. This change means in effect that Microsoft's mappings will be adopted instead of the Unicode Consortium's ones.

I think the reversibility of mappings is important. However, this change is not backward-compatible, so it may affect existing systems and data. I expect both pros and cons. I really appreciate any kind of feedback.
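
The non-reversibility described above can be made concrete with a small sketch. The tables below are hypothetical toy dictionaries holding only the entries quoted in this thread, not the real JapaneseCodecs tables, and `non_reversible` is an illustrative helper name rather than part of any package:

```python
# Toy tables: only the entries quoted in the thread, not real codec data.
# A decode map goes Shift_JIS code -> Unicode code point; an encode map
# goes the other way.
CURRENT_DECODE = {0x5C: 0x005C, 0x815F: 0x005C}
CURRENT_ENCODE = {0x005C: 0x5C}

# The proposed change: 0x815f decodes to U+ff3c and encodes back to 0x815f.
PROPOSED_DECODE = {0x5C: 0x005C, 0x815F: 0xFF3C}
PROPOSED_ENCODE = {0x005C: 0x5C, 0xFF3C: 0x815F}


def non_reversible(decode_map, encode_map):
    """Return the codes that do not survive decoding followed by encoding."""
    return sorted(
        code
        for code, ucs in decode_map.items()
        if encode_map.get(ucs) != code
    )


if __name__ == "__main__":
    print([hex(c) for c in non_reversible(CURRENT_DECODE, CURRENT_ENCODE)])
    print([hex(c) for c in non_reversible(PROPOSED_DECODE, PROPOSED_ENCODE)])
```

With these toy tables only 0x815f fails the round trip under the current mapping, and nothing fails under the proposed one. A check of this kind run over full tables would be one way to compare candidate mappings.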
I'd like, at the moment, not to support the current mappings in the next release of JapaneseCodecs, since the maintenance cost would be high otherwise, and also someone who needs the current mappings can make use of older JapaneseCodecs.

Any comments and suggestions are welcome.

Thank you,

--
KAJIYAMA, Tamito

From martin@v.loewis.de Wed Sep 4 23:32:12 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 05 Sep 2002 00:32:12 +0200
Subject: [I18n-sig] JapaneseCodecs 1.4.8 released
In-Reply-To: <200209042125.g84LP8317562@grad.sccs.chukyo-u.ac.jp>
References: <200209042125.g84LP8317562@grad.sccs.chukyo-u.ac.jp>
Message-ID:

Tamito KAJIYAMA writes:

> The following 7 Japanese characters will have their corresponding
> code points in Unicode changed.
> (Although only Shift_JIS code points are shown below, EUC-JP and
> ISO-2022-JP codecs will also be changed.)
>
> 1. Shift_JIS 0x81ca
>    japanese.sjis  -> U+00ac (NOT SIGN)
>    japanese.ms932 -> U+ffe2 (FULLWIDTH NOT SIGN)

Can you please elaborate on the rationale for picking the Microsoft mapping over the Consortium's mapping?

It appears that, if only a single form is available in SJIS, Microsoft picks the FULLWIDTH form, whereas the Consortium picks the default form.

Methinks that the consortium does the right thing, here: It *should* be a matter of fonts or presentation how a NOT SIGN is displayed. If SJIS gives users a choice to pick either the default form or the full-width form, it is clear that the Unicode mapping should support that choice. If SJIS users have no choice (as in 0x81ca), the SJIS character should IMO be considered the default version - despite the fact that SJIS-based fonts would usually display it in a double-wide fashion.

I also fail to see the need to align those encodings, at all. Why is it necessary that SJIS -> Unicode -> MS932 works for all SJIS texts?
You might consider supporting "transliteration", either by default, or by means of a sjis//translit (and ms932//translit) encoding: If people use this encoding, you still have the Shift_JIS->Unicode mapping as above, but you would *also* map U+ffe2 to 0x81ca in sjis, and U+00ac to 0x81ca in Shift_JIS. That may solve the problems people have with the status-quo, while preserving backwards compatibility (and also compatibility with, say, Linux glibc codecs - which use the Consortium's database).

HTH,
Martin

From j-david@noos.fr Thu Sep 5 18:04:53 2002
From: j-david@noos.fr (Juan David Ibáñez Palomar)
Date: Thu, 05 Sep 2002 19:04:53 +0200
Subject: [I18n-sig] Plural forms
References: <3D6352B2.60402@noos.fr> <15732.44020.748601.834403@anthem.wooz.org>
Message-ID: <3D778EB5.3010409@noos.fr>

Martin v. Loewis wrote:

>barry@python.org (Barry A. Warsaw) writes:
>
>>I just briefly (re-)skimmed the gettext manual section on plural
>>forms. It looks like you'll also have to add support for this in
>>pygettext.py, in the .mo/.po header parsing routines, and (maybe?) in
>>msgfmt.py.
>
>Implementing the expression syntax from the po header might be the
>most tricky part - especially as that syntax is not properly specified
>(to my knowledge), so one would need to reverse-engineer it from the
>gettext implementation. I'd suggest that anybody attempting such a
>thing should also provide gettext maintainers with a detailed
>documentation of that feature.
>
>Regards,
>Martin

I was thinking only about changing the API and everything that is needed for it to work, like parsing of MO files. But leave pygettext for later.

A question, since now xgettext supports Python (didn't try it, but that's what the documentation says), why not deprecate pygettext and stop its development?

Regards,

--
J. David Ibáñez, http://www.j-david.net
Software Engineer / Ingénieur Logiciel / Ingeniero de Software

From barry@python.org Thu Sep 5 18:09:40 2002
From: barry@python.org (Barry A. Warsaw)
Date: Thu, 5 Sep 2002 13:09:40 -0400
Subject: [I18n-sig] Plural forms
References: <3D6352B2.60402@noos.fr> <15732.44020.748601.834403@anthem.wooz.org> <3D778EB5.3010409@noos.fr>
Message-ID: <15735.36820.756013.638264@anthem.wooz.org>

>>>>> "JDIP" == Juan David Ibáñez Palomar writes:

    JDIP> A question, since now xgettext supports Python (didn't try
    JDIP> it, but that's what the documentation says), why not
    JDIP> deprecate pygettext and stop its development?

If it works, I'm all for it. No need to keep a reinvented wheel from going flat, IMO. I haven't had time to try the new xgettext with Python support though, so I don't know if it supports some of the more useful options, like the ability to selectively extract docstrings (which aren't marked).

-Barry

From martin@v.loewis.de Thu Sep 5 18:35:21 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 05 Sep 2002 19:35:21 +0200
Subject: [I18n-sig] Plural forms
In-Reply-To: <3D778EB5.3010409@noos.fr>
References: <3D6352B2.60402@noos.fr> <15732.44020.748601.834403@anthem.wooz.org> <3D778EB5.3010409@noos.fr>
Message-ID:

Juan David Ibáñez Palomar writes:

> A question, since now xgettext supports Python (didn't try it,
> but that's what the documentation says), why not deprecate
> pygettext and stop its development?

For one thing, it is good that batteries are included - you may not have GNU gettext, and you may not be able to install it as you don't have a C compiler, either - in particular, if you are using a MS Windows system.

For another thing, I don't think that xgettext supports the extraction of docstrings.

Regards,
Martin

From barry@python.org Thu Sep 5 18:43:56 2002
From: barry@python.org (Barry A. Warsaw)
Date: Thu, 5 Sep 2002 13:43:56 -0400
Subject: [I18n-sig] Plural forms
References: <3D6352B2.60402@noos.fr> <15732.44020.748601.834403@anthem.wooz.org> <3D778EB5.3010409@noos.fr>
Message-ID: <15735.38876.988423.22412@anthem.wooz.org>

>>>>> "MvL" == Martin v Loewis writes:

    >> A question, since now xgettext supports Python (didn't try
    >> it, but that's what the documentation says), why not
    >> deprecate pygettext and stop its development?

    MvL> For one thing, it is good that batteries are included - you
    MvL> may not have GNU gettext, and you may not be able to install
    MvL> it as you don't have a C compiler, either - in particular, if
    MvL> you are using a MS Windows system.

Sure, but I'm less concerned about that because not many people actually need to do extractions or catalog building. Those people can probably manage to install gettext.

    MvL> For another thing, I don't think that xgettext supports the
    MvL> extraction of docstrings.

This one's more important. /Selective/ extraction of docstrings is critical to me.

-Barry

From kajiyama@grad.sccs.chukyo-u.ac.jp Thu Sep 5 10:46:24 2002
From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA)
Date: Thu, 5 Sep 2002 18:46:24 +0900
Subject: [I18n-sig] JapaneseCodecs 1.4.8 released
In-Reply-To: <200209050925.g859Pv318388@grad.sccs.chukyo-u.ac.jp> (message from Tamito KAJIYAMA on Thu, 5 Sep 2002 18:25:57 +0900)
References: <200209050925.g859Pv318388@grad.sccs.chukyo-u.ac.jp>
Message-ID: <200209050946.g859kOc18409@grad.sccs.chukyo-u.ac.jp>

Tamito KAJIYAMA writes:
|
| The only reason for choosing the Microsoft mapping is that
| it seems better. The Consortium's mapping has a problem that
| both 0x5c and 0x815f in Shift_JIS are mapped to U+005c, which
| is in turn mapped to 0x5c in Shift_JIS. In other words, the
| Consortium's mapping is one-to-many. On the other hand, the
| Microsoft's mapping is one-to-one. There is no conversion
| problem like the one in the Consortium's mapping.
| That's why I think the Microsoft's mapping is better.

One addition: the mapping used in Java is also one-to-one so that it may be another candidate. I'm not sure at the moment which mapping should be picked. Any suggestions are welcome.

Thanks,

--
KAJIYAMA, Tamito

From kajiyama@grad.sccs.chukyo-u.ac.jp Thu Sep 5 10:25:57 2002
From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA)
Date: Thu, 5 Sep 2002 18:25:57 +0900
Subject: [I18n-sig] JapaneseCodecs 1.4.8 released
In-Reply-To: (martin@v.loewis.de)
References:
Message-ID: <200209050925.g859Pv318388@grad.sccs.chukyo-u.ac.jp>

martin@v.loewis.de (Martin v. Loewis) writes:
|
| > The following 7 Japanese characters will have their corresponding
| > code points in Unicode changed.
| > (Although only Shift_JIS code points are shown below, EUC-JP and
| > ISO-2022-JP codecs will also be changed.)
| >
| > 1. Shift_JIS 0x81ca
| >    japanese.sjis  -> U+00ac (NOT SIGN)
| >    japanese.ms932 -> U+ffe2 (FULLWIDTH NOT SIGN)
|
| Can you please elaborate on the rationale for picking the Microsoft
| mapping over the Consortium's mapping?

The only reason for choosing the Microsoft mapping is that it seems better. The Consortium's mapping has a problem that both 0x5c and 0x815f in Shift_JIS are mapped to U+005c, which is in turn mapped to 0x5c in Shift_JIS. In other words, the Consortium's mapping is one-to-many. On the other hand, the Microsoft's mapping is one-to-one. There is no conversion problem like the one in the Consortium's mapping. That's why I think the Microsoft's mapping is better.

To tell the truth, I don't care whether a Unicode character that corresponds to a character in Shift_JIS is a full-width form or not. What I want to solve by choosing the Microsoft's mapping is only the problem just mentioned above.

| I also fail to see the need to align those encodings, at all. Why is
| it necessary that SJIS -> Unicode -> MS932 works for all SJIS texts?
The interoperability of the MS932 codec and other codecs is a plus. I don't think it is necessary. However, it seems not preferable to me that a small package like JapaneseCodecs has an interoperability problem due to differences among vendor-specific mappings.

| You might consider supporting "transliteration", either by default, or
| by means of a sjis//translit (and ms932//translit) encoding: If people
| use this encoding, you still have the Shift_JIS->Unicode mapping as
| above, but you would *also* map U+ffe2 to 0x81ca in sjis, and U+00ac
| to 0x81ca in Shift_JIS. That may solve the problems people have with
| the status-quo, while preserving backwards compatibility (and also
| compatibility with, say, Linux glibc codecs - which use the
| Consortium's database).

Sorry, I'm not sure I've got the picture of what transliteration support would do. Transliteration support is meant to solve interoperability problems due to differences among vendor-specific mappings, right? I believe it's worth tackling the interoperability problems, but I've not intended to do so in the next release of JapaneseCodecs.

Thanks,

--
KAJIYAMA, Tamito

From martin@v.loewis.de Fri Sep 6 00:35:33 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 06 Sep 2002 01:35:33 +0200
Subject: [I18n-sig] JapaneseCodecs 1.4.8 released
In-Reply-To: <200209050925.g859Pv318388@grad.sccs.chukyo-u.ac.jp>
References: <200209050925.g859Pv318388@grad.sccs.chukyo-u.ac.jp>
Message-ID:

Tamito KAJIYAMA writes:

> The only reason for choosing the Microsoft mapping is that
> it seems better. The Consortium's mapping has a problem that
> both 0x5c and 0x815f in Shift_JIS are mapped to U+005c, which
> is in turn mapped to 0x5c in Shift_JIS. In other words, the
> Consortium's mapping is one-to-many.

I can agree on the mapping of 0x815f; it maps to U+FF3C on glibc. I'm confused about 0x5c; glibc maps it to U+00A5 (YEN SIGN).

Also, where did you get the mapping from the Consortium?
I can't find a current table, but

http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT

maps 0x5C to U+00A5, and 0x815F to U+005C. So this roundtrips just fine.

> On the other hand, the Microsoft's mapping is one-to-one. There is
> no conversion problem like the one in the Consortium's mapping.
> That's why I think the Microsoft's mapping is better.

There are many ways to achieve this; starting with the question whether 0x5c is a reverse solidus, or a yen sign. It seems clear that 0x815f is a reverse solidus - the question is whether it is full width or not. This is all unrelated to the other issues that you brought up.

> To tell the truth, I don't care whether a Unicode character that
> corresponds to a character in Shift_JIS is a full-width form or
> not. What I want to solve by choosing the Microsoft's mapping
> is only the problem just mentioned above.

Ok, then my suggestion would be to make minimal changes to your current mapping; the candidates to look at seem to be

- Consortium (but where does it have the current Shift JIS mapping?),
- MS
- Linux glibc
- ICU
- Java

Of those, I would pick the one that round-trips, and is closest to your current mapping. It appears that ICU does not have a SJIS mapping of its own, only the Linux and the Java one.

It appears that Java (according to ICU)

- maps 0x5c to U+005C,
- maps 0x815f to U+FF3C,
- fallback-maps U+00A5 to 0x5c

BTW, what does Microsoft map U+00A5 to?

> The interoperability of the MS932 codec and other codecs is a plus.
> I don't think it is necessary. However, it seems not preferable to
> me that a small package like JapaneseCodecs has an interoperability
> problem due to differences among vendor-specific mappings.

I agree that you should copy mapping data from other sources, instead of inventing your own. I also agree that it is desirable if the mapping round-trips (also there might be a good reason to have one-way mappings, e.g.
for the yen sign - if you decide that 0x5c is a backslash).

I just don't see the point of having shift-jis be a synonym for cp932. It appears that cp932 is slightly different from shift-jis (even though there are multiple interpretations of both shift-jis and cp932 circulating).

It appears that ICU has an exhaustive collection of mappings, which, I hope, are all correct (e.g. that when they claim they have the glibc shift_jis, that this really is what glibc does).

> Sorry, I'm not sure I've got the picture of what transliteration
> support would do. Transliteration support is meant to solve
> interoperability problems due to differences among vendor-
> specific mappings, right?

No. In general, transliteration adds one-way mappings, to allow mapping a larger subset of Unicode to the target mapping. For example, "ö" is not supported in ASCII, but a common transliteration (for German) is to write "oe". So, u"\u00f6".encode("ascii") raises a UnicodeError, where u"\u00f6".encode("ascii//translit-german") might return "oe" (this is not implemented in Python).

Therefore, a transliteration mapping never roundtrips - but it is still useful as it attempts to map as much of Unicode to the target encoding as reasonable. In your specific case, you could use transliteration to map both the default form and the full-width form from Unicode to the same JIS - but only one of the forms will round-trip.

I agree that round-trip support is valuable, and should be the default. I do think there is also a need for a "best effort" mapping.

Regards,
Martin

From martin@v.loewis.de Fri Sep 6 00:36:44 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 06 Sep 2002 01:36:44 +0200
Subject: [I18n-sig] JapaneseCodecs 1.4.8 released
In-Reply-To: <200209050946.g859kOc18409@grad.sccs.chukyo-u.ac.jp>
References: <200209050925.g859Pv318388@grad.sccs.chukyo-u.ac.jp> <200209050946.g859kOc18409@grad.sccs.chukyo-u.ac.jp>
Message-ID:

Tamito KAJIYAMA writes:

> One addition: the mapping used in Java is also one-to-one so
> that it may be another candidate.

That is not true (according to the ICU data). Java maps U+00A5 to 0x5c, which it maps back to U+005C.

Regards,
Martin

From tree@basistech.com Fri Sep 6 01:30:46 2002
From: tree@basistech.com (Tom Emerson)
Date: Thu, 5 Sep 2002 20:30:46 -0400
Subject: [I18n-sig] JapaneseCodecs 1.4.8 released
In-Reply-To:
References: <200209050925.g859Pv318388@grad.sccs.chukyo-u.ac.jp>
Message-ID: <15735.63286.438251.60319@magrathea.basistech.com>

Martin v. Loewis writes:
> I can agree on the mapping of 0x815f; it maps to U+FF3C on glibc. I'm
> confused about 0x5c; glibc maps it to U+00A5 (YEN SIGN).

This is a complex topic:

In JIS-Roman and pure ShiftJIS, 0x5C encodes the Yen sign, so transcoding from pure ShiftJIS to Unicode means that 0x5C maps to U+00A5.

On Windows, 0x5C serves a double life as both the pathname separator *and* as the Yen sign in their version of ShiftJIS, CP932. This means that the price of Murakami Haruki's 'Noruwei no Mori' (part 1) on Amazon.co.jp right now is \467 (i.e., 0x5C 0x34 0x36 0x37). It also means that 'C:\foo\bar' displays with Yen signs instead of back slashes.

Hence the mapping from CP932 to Unicode is ambiguous: do you map 0x5C to U+005C or U+00A5? It depends on context: the transcoder doesn't know.

You also need to know whether the file came from a "pure" ShiftJIS system, such as earlier versions of Mac OS, or a CP932 system, since the interpretation of 0x5C may or may not be ambiguous.

The "usual" recommendation is to map 0x5C to U+00A5 when dealing with pure ShiftJIS and to U+005C when dealing with CP932.
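
A minimal sketch of the recommendation above, assuming the caller already knows which flavour produced the data. The helper name `decode_0x5c` and the flavour labels are hypothetical; this is not how any codec mentioned in this thread is implemented:

```python
# Hypothetical helper: pick the character that the single byte 0x5C
# stands for, depending on the declared source encoding.
YEN_SIGN = "\u00a5"
REVERSE_SOLIDUS = "\\"  # U+005C

_MAPPING_FOR_0X5C = {
    "shift_jis": YEN_SIGN,     # "pure" ShiftJIS / JIS-Roman
    "cp932": REVERSE_SOLIDUS,  # Microsoft's variant
}


def decode_0x5c(flavour):
    """Return the character for byte 0x5C in the given flavour."""
    try:
        return _MAPPING_FOR_0X5C[flavour]
    except KeyError:
        raise ValueError("unknown flavour: %r" % (flavour,))


if __name__ == "__main__":
    for name in ("shift_jis", "cp932"):
        print(name, "U+%04X" % ord(decode_0x5c(name)))
```

The sketch deliberately handles only the single byte 0x5C. A real transcoder cannot simply substitute every 0x5C byte, because the same value can also occur as the second byte of a two-byte character.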
There is a similar problem with 0x7E where it maps to different things in ShiftJIS and CP932.

The same problem also occurs in the Microsoft Korean code page, where 0x5C is either a path separator (mapping to U+005C) or the Won sign (mapping to U+20A9).

> Also, where did you get the mapping from the Consortium? I can't find
> a current table, but
>
> http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT

You answer your own question, sort of. The Consortium no longer maintains the East Asian mapping tables (with the exception of JIS X 0213, GB 18030, and HKSCS, where mappings are supplied by the Japanese, Chinese, and Hong Kong SAR governments, respectively). This has been a point of contention between me and the UTC, but they don't want to and I don't have time.

--
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"

From kajiyama@grad.sccs.chukyo-u.ac.jp Fri Sep 6 02:38:05 2002
From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA)
Date: Fri, 6 Sep 2002 10:38:05 +0900
Subject: [I18n-sig] JapaneseCodecs 1.4.8 released
In-Reply-To: (martin@v.loewis.de)
References:
Message-ID: <200209060138.g861c5f19959@grad.sccs.chukyo-u.ac.jp>

martin@v.loewis.de (Martin v. Loewis) writes:
|
| > The only reason for choosing the Microsoft mapping is that
| > it seems better. The Consortium's mapping has a problem that
| > both 0x5c and 0x815f in Shift_JIS are mapped to U+005c, which
| > is in turn mapped to 0x5c in Shift_JIS. In other words, the
| > Consortium's mapping is one-to-many.
|
| I can agree on the mapping of 0x815f; it maps to U+FF3C on glibc. I'm
| confused about 0x5c; glibc maps it to U+00A5 (YEN SIGN).
|
| Also, where did you get the mapping from the Consortium? I can't find
| a current table, but
|
| http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT
|
| maps 0x5C to U+00A5, and 0x815F to U+005C.
| So this roundtrips just fine.

I've finally understood what was wrong: the mapping in JapaneseCodecs has a number of bugs! The Unicode Consortium's mapping is totally okay, but it had not been implemented in JapaneseCodecs in the right way (I intended to do so, though).

I got the Consortium's mapping from the URL shown above. However, I happened to carelessly modify the original mapping as follows:

the Unicode Consortium's original mapping:
  0x5c   -> U+00A5 -> 0x5c
  0x7e   -> U+203e -> 0x7e
  0x815f -> U+005c -> 0x815f

the current (buggy) mapping in JapaneseCodecs:
  0x5c   -> U+005c -> 0x5c
  0x7e   -> U+007e -> 0x7e
  0x815f -> U+005c -> 0x815f

In other words, I had introduced the non-reversibility problem myself! I'd like to hit my head against the wall thousands of times...

It seems that there are two solutions: one is to implement the Consortium's mapping intact, and the other is to fix the current buggy mapping so that 0x815f maps to U+ff3c (the latter means that Java's mapping is adopted, I believe).

| > Sorry, I'm not sure I've got the picture of what transliteration
| > support would do. Transliteration support is meant to solve
| > interoperability problems due to differences among vendor-
| > specific mappings, right?
|
| No. In general, transliteration adds one-way mappings, to allow
| mapping a larger subset of Unicode to the target mapping. For example,
| "ö" is not supported in ASCII, but a common transliteration (for
| German) is to write "oe". So, u"\u00f6".encode("ascii") raises a
| UnicodeError, where u"\u00f6".encode("ascii//translit-german") might
| return "oe" (this is not implemented in Python).
|
| Therefore, a transliteration mapping never roundtrips - but it is
| still useful as it attempts to map as much of Unicode to the target
| encoding as reasonable. In your specific case, you could use
| transliteration to map both the default form and the full-width form
| from Unicode to the same JIS - but only one of the forms will
| round-trip.
|
| I agree that round-trip support is valuable, and should be the
| default. I do think there is also a need for a "best effort" mapping.

I see. Transliteration, in the context of JapaneseCodecs, can be used to provide fallback mappings, right? I agree that such a "best effort" mapping is useful and surely needed in a variety of applications.

Thanks a lot!

--
KAJIYAMA, Tamito

From kajiyama@grad.sccs.chukyo-u.ac.jp Fri Sep 6 03:05:17 2002
From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA)
Date: Fri, 6 Sep 2002 11:05:17 +0900
Subject: [I18n-sig] JapaneseCodecs 1.4.8 released
In-Reply-To: (martin@v.loewis.de)
References:
Message-ID: <200209060205.g8625HC19999@grad.sccs.chukyo-u.ac.jp>

martin@v.loewis.de (Martin v. Loewis) writes:
|
| > One addition: the mapping used in Java is also one-to-one so
| > that it may be another candidate.
|
| That is not true (according to the ICU data). Java maps U+00A5 to
| 0x5c, which it maps back to U+005C.

A test program showed that Java's mapping works as follows:

  0x815f -> U+ff3c -> 0x815f
  0x5c   -> U+005c -> 0x5c
            U+00a5 -> 0x5c

It is not true that Java's mapping is one-to-one. But both 0x815f and 0x5c show a round-trip, which is what I want to have. The mapping of U+00a5 to 0x5c seems to be a fallback.

The test program and its execution result are shown below. I've used Sun's J2SE 1.3 on Linux.
$ cat UnicodeTest1.java
class UnicodeTest1 {
    public static void main(String args[]) {
        try {
            byte[] b = { -127, 95, 92 }; /* 0x815f, 0x5c */
            // Decode the Shift_JIS bytes, then append U+00A5 (YEN SIGN).
            String s = new String(b, "Shift_JIS") + "\u00a5";
            System.out.print("Unicode: ");
            dump(s.getBytes("UnicodeBig"));
            System.out.print("Shift_JIS:");
            dump(s.getBytes("Shift_JIS"));
        } catch (java.io.UnsupportedEncodingException e) {
            e.printStackTrace();
        }
    }

    // Print each byte as two hex digits, each preceded by a space.
    public static void dump(byte[] b) {
        for (int i = 0; i < b.length ; i++) {
            String h = "0" + Integer.toHexString(b[i]);
            System.out.print(" " + h.substring(h.length()-2, h.length()));
        }
        System.out.println();
    }
}
$ javac UnicodeTest1.java
$ java UnicodeTest1
Unicode:  fe ff ff 3c 00 5c 00 a5
Shift_JIS: 81 5f 5c 5c
$

Regards,

--
KAJIYAMA, Tamito

From kajiyama@grad.sccs.chukyo-u.ac.jp Fri Sep 6 04:38:30 2002
From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA)
Date: Fri, 6 Sep 2002 12:38:30 +0900
Subject: [I18n-sig] JapaneseCodecs 1.4.8 released
In-Reply-To: <200209060138.g861c5f19959@grad.sccs.chukyo-u.ac.jp> (message from Tamito KAJIYAMA on Fri, 6 Sep 2002 10:38:05 +0900)
References: <200209060305.g8635O520049@grad.sccs.chukyo-u.ac.jp>
Message-ID: <200209060338.g863cUQ20072@grad.sccs.chukyo-u.ac.jp>

Tamito KAJIYAMA writes:
|
| the current (buggy) mapping in JapaneseCodecs:
|   0x5c   -> U+005c -> 0x5c
|   0x7e   -> U+007e -> 0x7e
|   0x815f -> U+005c -> 0x815f
|
| In other words, I had introduced the non-reversibility problem
| myself!

Oops. I made a mistake. It should read:

the current (buggy) mapping in JapaneseCodecs:
  0x5c   -> U+005c -> 0x5c
  0x7e   -> U+007e -> 0x7e
  0x815f -> U+005c -> 0x5c

Thanks,

--
KAJIYAMA, Tamito

From martin@v.loewis.de Fri Sep 6 07:50:44 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 06 Sep 2002 08:50:44 +0200
Subject: [I18n-sig] JapaneseCodecs 1.4.8 released
In-Reply-To: <15735.63286.438251.60319@magrathea.basistech.com>
References: <200209050925.g859Pv318388@grad.sccs.chukyo-u.ac.jp> <15735.63286.438251.60319@magrathea.basistech.com>
Message-ID:

Tom Emerson writes:

> The "usual" recommendation is to map 0x5C to U+00A5 when dealing with
> pure ShiftJIS and to U+005C when dealing with CP932.
>
> There is a similar problem with 0x7E where it maps to different things
> in ShiftJIS and CP932.

That indicates that JapaneseCodecs should *not* treat shift-jis and cp932 as synonyms, right?

> You answer your own question, sort of. The Consortium no longer
> maintains the East Asian mapping tables (with the exception of JIS X
> 0213, GB 18030, and HKSCS, where mappings are supplied by the
> Japanese, Chinese, and Hong Kong SAR governments, respectively). This
> has been a point of contention between me and the UTC, but they don't
> want to and I don't have time.

Yes, but they claim that the UniHan database is a replacement. That appears to be the case for a lot of code points, but that database fails to document mappings for non-Hanzi, right?

Regards,
Martin

From martin@v.loewis.de Fri Sep 6 07:54:35 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 06 Sep 2002 08:54:35 +0200
Subject: [I18n-sig] JapaneseCodecs 1.4.8 released
In-Reply-To: <200209060138.g861c5f19959@grad.sccs.chukyo-u.ac.jp>
References: <200209060138.g861c5f19959@grad.sccs.chukyo-u.ac.jp>
Message-ID:

Tamito KAJIYAMA writes:

> I see. Transliteration, in the context of JapaneseCodecs, can
> be used to provide fallback mappings, right?

Correct.
Regards,
Martin

From walter@livinglogic.de Fri Sep 6 11:24:53 2002
From: walter@livinglogic.de (Walter Dörwald)
Date: Fri, 06 Sep 2002 12:24:53 +0200
Subject: [I18n-sig] JapaneseCodecs 1.4.8 released
References: <200209050925.g859Pv318388@grad.sccs.chukyo-u.ac.jp>
Message-ID: <3D788275.5030905@livinglogic.de>

Martin v. Loewis wrote:

> Tamito KAJIYAMA writes:
> [...]
>
>>Sorry, I'm not sure I've got the picture of what transliteration
>>support would do. Transliteration support is meant to solve
>>interoperability problems due to differences among vendor-
>>specific mappings, right?
>
>
> No. In general, transliteration adds one-way mappings, to allow
> mapping a larger subset of Unicode to the target mapping. For example,
> "ö" is not supported in ASCII, but a common transliteration (for
> German) is to write "oe". So, u"\u00f6".encode("ascii") raises a
> UnicodeError, where u"\u00f6".encode("ascii//translit-german") might
> return "oe" (this is not implemented in Python).

But it's simple to implement as a PEP 293 error handling callback:

    # -*- coding: iso-8859-1 -*-
    import codecs

    translit_german_map = {
        ord(u"ö"): u"oe",
        ord(u"ä"): u"ae",
        ord(u"ü"): u"ue",
        ord(u"ß"): u"ss"
    }

    def translit_german(exc):
        if isinstance(exc, UnicodeEncodeError):
            return (exc.object[exc.start:exc.end].
                    translate(translit_german_map), exc.end)
        else:
            raise TypeError("Don't know how to handle %r" % exc)

    codecs.register_error("translit-german", translit_german)

    u"-ä-ö-ü-ß-".encode("ascii", "translit-german")

Could transliteration for the JapaneseCodecs be handled in a similar way?

Bye,
Walter Dörwald

From tree@basistech.com Fri Sep 6 16:44:38 2002
From: tree@basistech.com (Tom Emerson)
Date: Fri, 6 Sep 2002 11:44:38 -0400
Subject: [I18n-sig] JapaneseCodecs 1.4.8 released
In-Reply-To:
References: <200209050925.g859Pv318388@grad.sccs.chukyo-u.ac.jp> <15735.63286.438251.60319@magrathea.basistech.com>
Message-ID: <15736.52582.439408.912920@magrathea.basistech.com>

Martin v.
Loewis writes:

> Tom Emerson writes:
> > The "usual" recommendation is to map 0x5C to U+00A5 when dealing with
> > pure ShiftJIS and to U+005C when dealing with CP932.
> >
> > There is a similar problem with 0x7E where it maps to different things
> > in ShiftJIS and CP932.
>
> That indicates that JapaneseCodecs should *not* treat shift-jis and
> cp932 as synonyms, right?

Yes, that is my feeling. However, there are complications: most Japanese web pages that I've seen which claim to be in Shift JIS are in fact CP932, which is why the two are seen to be synonymous. The problem is even worse in Chinese, where pages claiming to be encoded in GB2312 (which isn't even an encoding, it's a character set, but I digress ;) are actually in CP936, which has a significantly larger character repertoire (i.e., all of Unicode 2.1's unified ideographic block) than GB2312.

> > You answer your own question, sort of. The Consortium no longer
> > maintains the East Asian mapping tables (with the exception of JIS X
> > 0213, GB 18030, and HKSCS, where mappings are supplied by the
> > Japanese, Chinese, and Hong Kong SAR governments, respectively). This
> > has been a point of contention between me and the UTC, but they don't
> > want to and I don't have time.
>
> Yes, but they claim that the UniHan database is a replacement. That
> appears to be the case for a lot of code points, but that database
> fails to document mappings for non-Hanzi, right?

That's a cop-out, and they know it. UniHAN cannot serve as a mapping table source, and shouldn't be trusted.

--
Tom Emerson                                  Basis Technology Corp.
Software Architect                         http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"

From martin@v.loewis.de Fri Sep 6 16:44:25 2002
From: martin@v.loewis.de (Martin v.
Loewis)
Date: 06 Sep 2002 17:44:25 +0200
Subject: [I18n-sig] JapaneseCodecs 1.4.8 released
In-Reply-To: <3D788275.5030905@livinglogic.de>
References: <200209050925.g859Pv318388@grad.sccs.chukyo-u.ac.jp> <3D788275.5030905@livinglogic.de>
Message-ID:

Walter Dörwald writes:

> Could transliteration for the JapaneseCodecs be handled
> in a similar way?

Certainly; this is about Unicode code points that have an obvious, but not round-tripping mapping to Shift-JIS. So if this can wait for Python 2.3, using error callbacks would be an option.

Regards,
Martin

From j-david@noos.fr Fri Sep 6 17:13:02 2002
From: j-david@noos.fr (Juan David Ibáñez Palomar)
Date: Fri, 06 Sep 2002 18:13:02 +0200
Subject: [I18n-sig] Plural forms
References: <3D6352B2.60402@noos.fr> <15732.44020.748601.834403@anthem.wooz.org> <3D778EB5.3010409@noos.fr> <15735.38876.988423.22412@anthem.wooz.org>
Message-ID: <3D78D40E.1030807@noos.fr>

Barry A. Warsaw wrote:

>>>>>>"MvL" == Martin v Loewis writes:
>>>>>>
>>>>>>
>
>    >> A question, since now xgettext supports Python (didn't try
>    >> it, but that's what the documentation says), why not
>    >> deprecate pygettext and stop its development?
>
>    MvL> For one thing, it is good that batteries are included - you
>    MvL> may not have GNU gettext, and you may not be able to install
>    MvL> it as you don't have a C compiler, either - in particular, if
>    MvL> you are using a MS Windows system.
>
>Sure, but I'm less concerned about that because not many people
>actually need to do extractions or catalog building. Those people can
>probably manage to install gettext.
>
>    MvL> For another thing, I don't think that xgettext supports the
>    MvL> extraction of docstrings.
>
>This one's more important. /Selective/ extraction of docstrings is
>critical to me.
>
>
>

Probably not, but I'd prefer to add selective extraction of docstrings to xgettext than plural forms support to pygettext.
Ok, I'll add the task "add plural forms support to the gettext Python module" to my task queue.

See you,

--
J. David Ibáñez, http://www.j-david.net
Software Engineer / Ingénieur Logiciel / Ingeniero de Software

From barry@python.org Fri Sep 6 17:25:43 2002
From: barry@python.org (Barry A. Warsaw)
Date: Fri, 6 Sep 2002 12:25:43 -0400
Subject: [I18n-sig] Plural forms
References: <3D6352B2.60402@noos.fr> <15732.44020.748601.834403@anthem.wooz.org> <3D778EB5.3010409@noos.fr> <15735.38876.988423.22412@anthem.wooz.org> <3D78D40E.1030807@noos.fr>
Message-ID: <15736.55047.959154.274887@anthem.wooz.org>

>>>>> "JDIP" == Juan David Ibáñez Palomar writes:

    JDIP> Probably not, but I'd prefer to add selective extraction
    JDIP> of docstrings to xgettext than plural forms support to
    JDIP> pygettext.

Why? Isn't hacking Python more fun than hacking C? :)

-Barry

From j-david@noos.fr Fri Sep 6 17:40:39 2002
From: j-david@noos.fr (Juan David Ibáñez Palomar)
Date: Fri, 06 Sep 2002 18:40:39 +0200
Subject: [I18n-sig] Plural forms
References: <3D6352B2.60402@noos.fr> <15732.44020.748601.834403@anthem.wooz.org> <3D778EB5.3010409@noos.fr> <15735.38876.988423.22412@anthem.wooz.org> <3D78D40E.1030807@noos.fr> <15736.55047.959154.274887@anthem.wooz.org>
Message-ID: <3D78DA87.40009@noos.fr>

Barry A. Warsaw wrote:

>>>>>>"JDIP" == Juan David Ibáñez Palomar writes:
>>>>>>
>>>>>>
>
>    JDIP> Probably not, but I'd prefer to add selective extraction
>    JDIP> of docstrings to xgettext than plural forms support to
>    JDIP> pygettext.
>
>Why? Isn't hacking Python more fun than hacking C? :)
>
>
>

Since I finished my studies at the university I've (almost) only used Python (well, I've done quite a lot of JavaScript, but I'm trying to forget it). So I'm just looking for an excuse to work with something else and see whether I still have my skills. Probably I'll regret it.

--
J. David Ibáñez, http://www.j-david.net
Software Engineer / Ingénieur Logiciel / Ingeniero de Software
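
For readers following the plural forms thread: the task comes down to evaluating the catalog's plural rule for a count n and using the result as an index into a list of translated forms. Below is a minimal sketch of that lookup, not the gettext module's implementation. It assumes hand-written Python rules in place of a parser for the C-style expression in the .po header, and the names english_rule, french_rule, polish_rule, PLURAL_RULES, CATALOG and ngettext_sketch are all hypothetical.

```python
# Hypothetical plural rules, written by hand for illustration only.
# GNU gettext reads an equivalent C-style expression from the
# Plural-Forms header of the catalog; parsing that is not shown here.

def english_rule(n):
    # Two forms: index 0 for exactly one item, index 1 otherwise.
    return 0 if n == 1 else 1

def french_rule(n):
    # Two forms: zero and one both take the singular form.
    return 1 if n > 1 else 0

def polish_rule(n):
    # Three forms: one / a few (2-4, but not 12-14) / many.
    if n == 1:
        return 0
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return 1
    return 2

PLURAL_RULES = {"en": english_rule, "fr": french_rule, "pl": polish_rule}

# Hypothetical in-memory catalog: (language, singular msgid) -> forms.
CATALOG = {
    ("fr", "%d file"): ["%d fichier", "%d fichiers"],
    ("pl", "%d file"): ["%d plik", "%d pliki", "%d plików"],
}

def ngettext_sketch(lang, singular, plural, n):
    """Pick the translated form for count n, falling back to the msgids."""
    forms = CATALOG.get((lang, singular))
    if forms is None:
        # No translation available: behave like untranslated English.
        return singular if n == 1 else plural
    return forms[PLURAL_RULES[lang](n)]

if __name__ == "__main__":
    for count in (1, 2, 5, 12, 22):
        print(ngettext_sketch("pl", "%d file", "%d files", count) % count)
```

The hard part raised earlier in the thread, turning the header expression into something callable, is exactly what this sketch leaves out by hard-coding the rules.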