From tex@I18nGuy.com Wed Jun 4 08:28:49 2003 From: tex@I18nGuy.com (Tex Texin) Date: Wed, 04 Jun 2003 03:28:49 -0400 Subject: [I18n-sig] 24th Unicode Conference (IUC24) - September 3-5, 2003 - Atlanta, GA Message-ID: <3EDD9FB1.53FC9B7E@I18nGuy.com> Unicode 4.0 Tutorial, many new presentations, and lovely Atlanta! ************************************************************************ Twenty-fourth Internationalization and Unicode Conference (IUC24) Unicode, Internationalization, the Web: Powering Global Business http://www.unicode.org/iuc/iuc24 September 3-5, 2003 Atlanta, GA ************************************************************************ Mark your diary! >> 12 weeks to go >> Mark your diary! >> 12 weeks to go ************************************************************************ Are you falling behind? Version 4.0 of the Unicode Standard is here! Software and Web applications can now support more languages with greater efficiency and lower cost. Do you need to find out how? Do you need to be more competitive around the globe? Is your software upward-compatible with version 4.0? Does your staff need internationalization training? Learn about software and Web internationalization and the new Unicode Standard, including its latest features and requirements. This is the only event endorsed by the Unicode Consortium. The conference will be held September 3-5, 2003 in Atlanta, Georgia and is completely updated. KEYNOTES: Keynote speakers for IUC24 are well-known authors in the Internationalization and Localization industries: Donald De Palma, President, Common Sense Advisory, Inc., and author of "Business Without Borders: A Strategic Guide to Global Marketing", and Richard Gillam, author of "Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard" and a former columnist for "C++ Report". TUTORIALS: This redeveloped and enhanced Unicode 4.0 Tutorial is taught by Dr. Asmus Freytag, one of the major contributors to the standard, and extensively experienced in implementing real-world Unicode applications. Structured into 3 independent modules, you can attend just the overview, or only the most advanced material. Tutorials in Web Internationalization, non-Latin scripts, and more, are offered in parallel and taught by recognized industry experts. CONFERENCE TRACKS: Gain the competitive edge! Conference sessions provide the most up-to-date technical information on standards, best practices, and recent advances in the globalization of software and the Internet. Panel discussions and the friendly atmosphere allow you to exchange ideas and ask questions of key players in the internationalization industry. WHO SHOULD ATTEND?: If you have a limited training budget, this is the one Internationalization conference you need. Send staff that are involved in either Unicode-enabling software, or internationalization of software and the Internet, including: managers, software engineers, systems analysts, font designers, graphic designers, content developers, Web designers, Web administrators, technical writers, and product marketing personnel. CONFERENCE WEB SITE, PROGRAM and REGISTRATION The Conference Program and Registration form are available at the Conference Web site: http://www.unicode.org/iuc/iuc24 CONFERENCE SPONSORS Agfa Monotype Corporation Basis Technology Corporation ClientSide News L.L.C. 
Oracle Corporation World Wide Web Consortium (W3C) XenCraft GLOBAL COMPUTING SHOWCASE Visit the Showcase to find out more about products supporting the Unicode Standard, and products and services that can help you globalize/localize your software, documentation and Internet content. Sign up for the Exhibitors' track as part of the Conference. For more information, please see: http://www.unicode.org/iuc/iuc24/showcase.html CONFERENCE VENUE The Conference will take place at: DoubleTree Hotel Atlanta Buckhead 3342 Peachtree Road Atlanta, GA 30326 Tel: +1-404-231-1234 Fax: +1-404-231-3112 CONFERENCE MANAGEMENT Global Meeting Services Inc. 8949 Lombard Place, #416 San Diego, CA 92122, USA Tel: +1 858 638 0206 (voice) +1 858 638 0504 (fax) Email: info@global-conference.com or: conference@unicode.org THE UNICODE CONSORTIUM The Unicode Consortium was founded as a non-profit organization in 1991. It is dedicated to the development, maintenance and promotion of The Unicode Standard, a worldwide character encoding. The Unicode Standard encodes the characters of the world's principal scripts and languages, and is code-for-code identical to the international standard ISO/IEC 10646. In addition to cooperating with ISO on the future development of ISO/IEC 10646, the Consortium is responsible for providing character properties and algorithms for use in implementations. Today the membership base of the Unicode Consortium includes major computer corporations, software producers, database vendors, research institutions, international agencies and various user groups. For further information on the Unicode Standard, visit the Unicode Web site at http://www.unicode.org or e-mail * * * * * Unicode(r) and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission. From confirm-s2-xpeXGMRqcjv1vcmC4kkh0SC8zxI-i18n-sig=python.org@yahoogroups.com Thu Jun 5 11:11:03 2003 From: confirm-s2-xpeXGMRqcjv1vcmC4kkh0SC8zxI-i18n-sig=python.org@yahoogroups.com (Yahoo! Groups) Date: 5 Jun 2003 10:11:03 -0000 Subject: [I18n-sig] Please confirm your request to join locales Message-ID: <1054807863.74.14780.w23@yahoogroups.com> Hello i18n-sig@python.org, We have received your request to join the locales group hosted by Yahoo! Groups, a free, easy-to-use community service. This request will expire in 21 days. TO BECOME A MEMBER OF THE GROUP: 1) Go to the Yahoo! Groups site by clicking on this link: http://groups.yahoo.com/i?i=xpeXGMRqcjv1vcmC4kkh0SC8zxI&e=i18n-sig%40python%2Eorg (If clicking doesn't work, "Cut" and "Paste" the line above into your Web browser's address bar.) -OR- 2) REPLY to this email by clicking "Reply" and then "Send" in your email program If you did not request, or do not want, a membership in the locales group, please accept our apologies and ignore this message. Regards, Yahoo! Groups Customer Care Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/ From perky@fallin.lv Fri Jun 6 10:53:32 2003 From: perky@fallin.lv (Hye-Shik Chang) Date: Fri, 6 Jun 2003 18:53:32 +0900 Subject: [I18n-sig] CJKCodecs 0.9 is released Message-ID: <20030606095332.GA90359@fallin.lv> The CJKCodecs 0.9 is released and available for download at: http://sourceforge.net/project/showfiles.php?group_id=46747 The CJKCodecs is a unified unicode codec set for Chinese, Japanese and Korean encodings. It supports full features of unicode codec specification and PEP293 error callbacks on Python 2.3. 
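For readers who have not used PEP 293 error callbacks before: they are the standard codecs-module machinery, so they work with these codecs exactly as with any built-in codec. A minimal sketch, assuming a 'gb2312' codec is registered in the running interpreter (for example by installing CJKCodecs); the handler name 'question' is invented purely for illustration:

    import codecs

    def ascii_question_mark(exc):
        # PEP 293 handler: substitute '?' for anything the target
        # encoding cannot represent, then resume after the bad span.
        if isinstance(exc, UnicodeEncodeError):
            return (u'?' * (exc.end - exc.start), exc.end)
        raise exc

    codecs.register_error('question', ascii_question_mark)

    text = u'\u4e2d\u6587 and \u20ac'          # two CJK ideographs plus a euro sign
    # text.encode('gb2312')                    # 'strict' would raise UnicodeEncodeError
    print(text.encode('gb2312', 'replace'))    # built-in fallback handler
    print(text.encode('gb2312', 'question'))   # the custom callback above

The same errors argument applies on the decode side as well, which is what makes the callbacks usable from the stream reader and writer classes.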
Currently supported encodings and planned updates: Authority 0.9 1.0 1.1 1.2 ============================================================================== China (PRC) gb2312 iso-2022-cn gbk(cp936) iso-2022-cn-ext gb18030 hz Hong Kong hkscs Japan shift-jis iso-2022-jp-2 euc-jisx0213 iso-2022-int-1 euc-jp shift-jisx0213 mac_japanese cp932 iso-2022-jp-3 iso-2022-jp iso-2022-jp-1 Korea (ROK) euc-kr (ksx1001:2002) mac_korean cp949(uhc) unijohab johab iso-2022-kr Korea (DPRK) euc-kp Taiwan big5 iso-2022-cn cp950 iso-2022-cn-ext euc-tw Unicode.org utf-8 utf-7 utf-16 It includes UTF codecs for use in our unit tests. (The standard utf-8 StreamReader behaves strangely under some conditions.) Thank you! Regards, Hye-Shik =) From JasonR.Mastaler Tue Jun 10 22:25:53 2003 From: JasonR.Mastaler (JasonR.Mastaler) Date: Tue, 10 Jun 2003 15:25:53 -0600 Subject: [I18n-sig] Re: CJKCodecs 0.9 is released References: <20030606095332.GA90359@fallin.lv> Message-ID: Hye-Shik Chang writes: > The CJKCodecs is a unified unicode codec set for Chinese, Japanese > and Korean encodings. Is this package intended to replace the ChineseCodecs[1], KoreanCodecs[2], and JapaneseCodecs[3] packages, which are currently available separately? I see KoreanCodecs is marked "obsolete", but I see no similar mention on the pages for ChineseCodecs and JapaneseCodecs. Footnotes: [1] http://sourceforge.net/projects/python-codecs [2] http://sourceforge.net/projects/koco/ [3] http://www.asahi-net.or.jp/~rd6t-kjym/python/ From martin@v.loewis.de Tue Jun 10 22:42:28 2003 From: martin@v.loewis.de (Martin v. =?iso-8859-15?q?L=F6wis?=) Date: 10 Jun 2003 23:42:28 +0200 Subject: [I18n-sig] Re: CJKCodecs 0.9 is released In-Reply-To: References: <20030606095332.GA90359@fallin.lv> Message-ID: "Jason R. Mastaler" writes: > I see KoreanCodecs is marked "obsolete", but I see no similar mention > on the pages for ChineseCodecs and JapaneseCodecs. These are different package authors, so they likely have different opinions on the status of each package. Regards, Martin From JasonR.Mastaler Tue Jun 10 23:49:33 2003 From: JasonR.Mastaler (JasonR.Mastaler) Date: Tue, 10 Jun 2003 16:49:33 -0600 Subject: [I18n-sig] Re: CJKCodecs 0.9 is released References: <20030606095332.GA90359@fallin.lv> Message-ID: martin@v.loewis.de (Martin v. Löwis) writes: > These are different package authors, so they likely have different > opinions on the status of each package. Sure, but presumably Hye-Shik will have some ideas on this topic as I know he has been involved with all three codecs in some capacity. It also doesn't make much sense to distribute multiple implementations of the same codecs, so presumably CJKCodecs will replace the three standalone distributions. From perky@fallin.lv Wed Jun 11 03:18:36 2003 From: perky@fallin.lv (Hye-Shik Chang) Date: Wed, 11 Jun 2003 11:18:36 +0900 Subject: [I18n-sig] Re: CJKCodecs 0.9 is released In-Reply-To: References: <20030606095332.GA90359@fallin.lv> Message-ID: <20030611021836.GA87284@fallin.lv> On Tue, Jun 10, 2003 at 03:25:53PM -0600, Jason R. Mastaler wrote: > Hye-Shik Chang writes: > > > The CJKCodecs is a unified unicode codec set for Chinese, Japanese > > and Korean encodings. > > Is this package intended to replace the ChineseCodecs[1], > KoreanCodecs[2], and JapaneseCodecs[3] packages, which are currently > available separately? > > I see KoreanCodecs is marked "obsolete", but I see no similar mention > on the pages for ChineseCodecs and JapaneseCodecs. Yup. KoreanCodecs will be retired after CJKCodecs 1.0 is released.
And, I don't have permissions to replace the others because I am not an author of them. Comparisons for CJKCodecs 1.0 vs {C,J,K}Codecs:

                  JapaneseCodecs  ChineseCodecs  KoreanCodecs   CJKCodecs
 PEP293           no              no             no             yes
 StreamReader     yes             no             partly(1)      yes
 StreamWriter     no              no             no             yes
 License          BSD             GPL            LGPL           BSD
 Last Update      Oct 2002        Nov 2000       Jul 2002       in development
                  (1.4.9)         (1.2.0)        (2.0.5)        (0.9)
 Source Size      304KB           528KB          224KB          464KB
 Binary Size      816KB           616KB          680KB          328KB (FreeBSD/ia32)
 Encodings(C)                     big5                          big5
                                  gb2312                        gb2312
                                                                gbk
                                                                gb18030
                                                                cp950
                                                                hz
 Encodings(J)     euc-jp                                        euc-jp
                  cp932                                         cp932
                  iso-2022-jp                                   iso-2022-jp
                  iso-2022-jp-1                                 iso-2022-jp-1
                                                                iso-2022-jp-2
                                                                iso-2022-jp-3
                                                                euc-jisx0213
                                                                shift-jisx0213
 Encodings(K)                                    euc-kr         euc-kr
                                                 cp949          cp949
                                                 johab          johab
                                                                unijohab(2)
                                                 qwerty2bul     mac_korean
 Implementation   Pure / C        Pure / C       Pure / C       C only

 (1) KoreanCodecs supports 'sane' StreamReader for euc-kr, cp949 and johab only.
 (2) unijohab, qwerty2bul and mac_korean are quite minor encodings and ignorable.

I don't think CJKCodecs can replace Chinese and JapaneseCodecs immediately. But CJKCodecs will remain useful for its ability to support inter-CJK encodings like ISO-2022-JP-2 and ISO-2022-INT-1. Thank you for your interest! :) Regards, Hye-Shik =) From martin@v.loewis.de Wed Jun 11 06:25:02 2003 From: martin@v.loewis.de (Martin v. =?iso-8859-15?q?L=F6wis?=) Date: 11 Jun 2003 07:25:02 +0200 Subject: [I18n-sig] Re: CJKCodecs 0.9 is released In-Reply-To: <20030611021836.GA87284@fallin.lv> References: <20030606095332.GA90359@fallin.lv> <20030611021836.GA87284@fallin.lv> Message-ID: Hye-Shik Chang writes: > I don't think CJKCodecs can replace Chinese and JapaneseCodecs immediately. > But CJKCodecs will remain useful for its ability to support > inter-CJK encodings like ISO-2022-JP-2 and ISO-2022-INT-1. This is an interesting summary. Can you produce another comparison, showing the differences in output of these codecs? Particularly interesting might be cp932, euc-jp, iso-2022-jp, big5, and gb2312. For these, please find out a) which characters are encoded in one codec that are not encoded in the other (i.e. Unicode code point -> encoding) b) which characters are decoded in one codec that are not decoded in the other (i.e.
encoding -> Unicode code point) > c) which characters are encoded differently > d) which characters are decoded differently > Legend: CJK - CJKCodecs 0.9 Chinese - ChineseCodecs 1.2.0 Japanese - JapaneseCodecs 1.4.9 Korean - KoreanCodecs 2.0.5 GNU - GNU libiconv 1.8 + iconvcodecs 1.0 1. DECODERS 1) CJKCodecs' gb2312 versus ChineseCodecs' euc-gb2312-cn exactly identical, but ChineseCodecs raises not UnicodeError but IndexError for incompleted multibyte sequences. 2) CJKCodecs' big5 versus ChineseCodecs' big5-tw CJK Chinese GNU a15a - fffd - a1c3 - fffd - a1c5 - fffd - a1fe - fffd - a240 - fffd - a2cc - fffd - a2ce - fffd - and, chinesetw.big5 codec has same problem with chinesecn.gb2312 3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp CJK Japanese GNU 01c0 005c ff3c ff3c f5a1 e000 - e000 -+ User-Defined Area f5a2 e001 - e001 | .... | fefd e3aa - e3aa | fefe e3ab - e3ab -+ ffa1 e3ac - - -+ CJKCodecs' bug ;) ffa2 e3ad - - | .... | fffd e408 - - | fffe e409 - - -+ 4) CJKCodecs' shift-jis versus JapaneseCodecs' shift-jis CJK Japanese GNU 005c 00a5 005c 00a5 007e 203e 007e 203e 007f - 007f 007f 815f 005c ff3c ff3c 817f - 00d7 - 837f - 30df - .... 9e7f - 684e - 9f7f - 6bef - a040 - 6f3e - a041 - 6f13 - .... a0fb - 74d4 - a0fc - 73f1 - e07f - 70dd - e17f - 75ff - e27f - 7ab0 - .... e97f - 9a43 - ea7f - 9eef - f040 e000 - e000 -+ User-Defined Area f041 e001 - e001 | .... | f9fb e756 - e756 | f9fc e757 - e757 -+ 5) CJKCodecs' cp932 versus JapaneseCodecs' ms932 CJK Japanese GNU 00a1 - ff61 ff61 -+ CJKCodecs' bug ;) 00a2 - ff62 ff62 | .... | 00de - ff9e ff9e | 00df - ff9f ff9f -+ 8160 ff5e ff5e 301c 8161 2225 2225 2016 817c ff0d ff0d 2212 817f - 00d7 - 8191 ffe0 ffe0 00a2 8192 ffe1 ffe1 00a3 81ca ffe2 ffe2 00ac 837f - 30df - 847f - 043d - .... a0fb - 74d4 - a0fc - 73f1 - e07f - 70dd - e17f - 75ff - .... e97f - 9a43 - ea7f - 9eef - 6) CJKCodecs' euc-kr versus KoreanCodecs' euc-kr exactly identical 7) CJKCodecs' cp949 versus KoreanCodecs' cp949 exactly identical 2. ENCODERS 1) CJKCodecs' gb2312 versus ChineseCodecs' euc-gb2312-cn exactly identical 2) CJKCodecs' big5 versus ChineseCodecs' big5-tw CJK Chinese GNU fffd a2ce a2ce - 3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp CJK Japanese GNU 00a5 - 5c 5c 203e - 7e 7e e000 f5a1 - f5a1 -+ User-Defined Area e001 f5a2 - f5a2 | .... | e3aa fefd - fefd | e3ab fefe - fefe | e3ac 8ff5a1 - 8ff5a1 | e3ad 8ff5a2 - 8ff5a2 | .... | e756 8ffefd - 8ffefd | e757 8ffefe - 8ffefe -+ ff3c - a1c0 a1c0 ff5e - - 8fa2b7 4) CJKCodecs' shift-jis versus JapaneseCodecs' shift-jis CJK Japanese GNU 005c 815f 5c - 007e - 7e - 007f - 7f 7f e000 f040 - f040 -+ User-Defined Area e001 f041 - f041 | .... | e756 f9fb - f9fb | e757 f9fc - f9fc -+ ff3c - 815f 815f 5) CJKCodecs' cp932 versus JapaneseCodecs' ms932 CJK Japanese GNU 0080 - 80 - 00a1 - 21 - -+ latin-1 -> ascii 00a5 - 5c - | fallbacks. .... | 00fe - 74 - | 00ff - 79 - -+ 2116 8782 8782 fa59 2121 8784 8784 fa5a .... 2168 875c 875c fa52 2169 875d 875d fa53 2170 eeef fa40 fa40 2171 eef0 fa41 fa41 .... 2178 eef7 fa48 fa48 2179 eef8 fa49 fa49 2225 8161 8161 - 3094 - 8394 - 3231 878a 878a fa58 4e28 ed4c fa68 fa68 4ee1 ed4d fa69 fa69 .... 9e19 eeeb fc4a fc4a 9ed1 eeec fc4b fc4b f8f0 - a0 - f8f1 - fd - f8f2 - fe - f8f3 - ff - f929 edc4 fae0 fae0 f9dc eecd fbe9 fbe9 .... 
ff02 eefc fa57 fa57 ff07 eefb fa56 fa56 ff0d 817c 817c - ff5e 8160 8160 - ffe0 8191 8191 - ffe1 8192 8192 - ffe2 81ca 81ca fa54 ffe4 eefa fa55 fa55 6) CJKCodecs' euc-kr versus KoreanCodecs' euc-kr exactly identical 7) CJKCodecs' cp949 versus KoreanCodecs' cp949 exactly identical Okay, the comparison says that we need some discussions on the mappings. I'll fix CJKCodecs' bugs as soon as possible and welcome any opionions about mapping inconsistencies. :) Regards, Hye-Shik =) From martin@v.loewis.de Wed Jun 11 21:34:15 2003 From: martin@v.loewis.de (Martin v. =?iso-8859-15?q?L=F6wis?=) Date: 11 Jun 2003 22:34:15 +0200 Subject: [I18n-sig] Re: CJKCodecs 0.9 is released In-Reply-To: <20030611081301.GA92933@fallin.lv> References: <20030606095332.GA90359@fallin.lv> <20030611021836.GA87284@fallin.lv> <20030611081301.GA92933@fallin.lv> Message-ID: Hye-Shik Chang writes: > Legend: > CJK - CJKCodecs 0.9 > Chinese - ChineseCodecs 1.2.0 > Japanese - JapaneseCodecs 1.4.9 > Korean - KoreanCodecs 2.0.5 > GNU - GNU libiconv 1.8 + iconvcodecs 1.0 Very interesting, again. I have some problems interpreting the data, though. > 2) CJKCodecs' big5 versus ChineseCodecs' big5-tw > > CJK Chinese GNU > a15a - fffd - > a1c3 - fffd - > a1c5 - fffd - > a1fe - fffd - > a240 - fffd - > a2cc - fffd - > a2ce - fffd - What does that mean? CJK and iconv gives UnicodeError, whereas ChineseCodecs puts in the replacement character? Seems like a bug in ChineseCodecs to me, doesn't it? The replacement character should only be generated if errors='replace', no? > 3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp > > CJK Japanese GNU > 01c0 005c ff3c ff3c That appears to be a bug in CJK, right? This is the question whether /xa1/xc0 is REVERSE SOLIDUS or FULLWIDTH REVERSE SOLIDUS. Now, it appears that euc-jp also supports /x5c, mapped to REVERSE SOLIDUS, and that /xa1/xc0 should be interpreted as FULLWIDTH REVERSE SOLIDUS, no? In case of doubt, I think ICU should be consulted for reference, as well, and following some kind of majority. In any case, I think the questionable mappings need to be documented. > f5a1 e000 - e000 -+ User-Defined Area > f5a2 e001 - e001 | > .... | > fefd e3aa - e3aa | > fefe e3ab - e3ab -+ What are these? I cannot find them in glibc. > 4) CJKCodecs' shift-jis versus JapaneseCodecs' shift-jis > > CJK Japanese GNU > 005c 00a5 005c 00a5 Here, I would trust GNU iconv; 5C really is YEN SIGN. > 007e 203e 007e 203e Likewise for OVERLINE - is there really no TILDE in shift-jis? > 007f - 007f 007f Why that? > 815f 005c ff3c ff3c Again: Why that? Shouldn't /x81/x5f be FULLWIDTH REVERSE SOLIDUS? > 817f - 00d7 - What character is that? Why does JapaneseCodecs map it to MULTIPLICATION SIGN? glibc seems to map that to /x81/x7e; is that a typo in JapaneseCodecs? > 837f - 30df - Are you sure glibc does not support that? Seems to be KATAKANA LETTER MI. > 9e7f - 684e - Why does JapaneseCodecs do that? glibc maps 9e7e to 684e. > f040 e000 - e000 -+ User-Defined Area > f041 e001 - e001 | > .... | > f9fb e756 - e756 | > f9fc e757 - e757 -+ Again: What are these characters? > 5) CJKCodecs' cp932 versus JapaneseCodecs' ms932 > > CJK Japanese GNU > 8160 ff5e ff5e 301c > 8161 2225 2225 2016 Are these verified against MS CP932, e.g. from Windows XP? > 817f - 00d7 - Likewise: For CP932, it seems essential to do whatever Microsoft does, in any Windows version. > 2) CJKCodecs' big5 versus ChineseCodecs' big5-tw > > CJK Chinese GNU > fffd a2ce a2ce - BIG-5 has the notion of a replacement character???? 
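As an aside, tables like the ones traded in this thread boil down to diffing two codecs code point by code point. A rough sketch of how such a report could be generated for the encode direction (cases a, b and c above), assuming the two implementations are registered under different names in the same interpreter; the codec names passed at the bottom are placeholders, and in practice the packages may have to be compared in separate runs:

    import codecs

    def encode_or_none(ch, encoding):
        # Return the byte sequence a codec produces for one character,
        # or None if the codec refuses to encode it.
        try:
            return ch.encode(encoding)
        except UnicodeEncodeError:
            return None

    def compare_encoders(enc_a, enc_b, last=0xFFFF):
        # Cases a)/b): code points only one codec encodes.
        # Case c): code points the two codecs encode differently.
        only_a, only_b, differ = [], [], []
        for cp in range(0x20, last + 1):
            if 0xD800 <= cp <= 0xDFFF:        # skip surrogate code points
                continue
            ch = unichr(cp) if str is bytes else chr(cp)   # Python 2 / 3
            a, b = encode_or_none(ch, enc_a), encode_or_none(ch, enc_b)
            if a is not None and b is None:
                only_a.append(cp)
            elif b is not None and a is None:
                only_b.append(cp)
            elif a is not None and a != b:
                differ.append((cp, a, b))
        return only_a, only_b, differ

    # Placeholder names -- substitute whatever names the two packages register.
    only_a, only_b, differ = compare_encoders('euc-jp', 'japanese.euc-jp')
    print('encoded by the first only : %d' % len(only_a))
    print('encoded by the second only: %d' % len(only_b))
    print('encoded differently       : %d' % len(differ))

The decode direction (cases b and d) is the same loop run over candidate byte sequences instead of code points.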
> 3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp > > CJK Japanese GNU > 00a5 - 5c 5c That seems wrong, too. > 203e - 7e 7e Likewise. > Okay, the comparison says that we need some discussions on the mappings. > I'll fix CJKCodecs' bugs as soon as possible and welcome any opionions > about mapping inconsistencies. :) I'm too lazy now to review the other encodings. I'd encourage you to consult ICU for established procedures, and to document the cases where you pick one of the possible alternatives. I do hope that this set of codecs becomes part of standard Python one day, at which point we really need to document what exactly they do. Regards, Martin From JasonR.Mastaler Thu Jun 12 04:13:36 2003 From: JasonR.Mastaler (JasonR.Mastaler) Date: Wed, 11 Jun 2003 21:13:36 -0600 Subject: [I18n-sig] Re: CJKCodecs 0.9 is released References: <20030606095332.GA90359@fallin.lv> <20030611021836.GA87284@fallin.lv> Message-ID: Hye-Shik Chang writes: > Yup. KoreanCodecs will be retired after CJKCodecs 1.0 is released. > And, I don't have permissions to replace the others because I am not > an author of them. Have you contacted the authors of {C,J}Codecs? Perhaps they would be willing to contribute to your package in order to retire theirs? > I don't think CJKCodecs can replace Chinese and JapaneseCodecs > immediately. Why is this exactly, because of bugs in CJKCodecs? Your comparison chart showed that CJKCodecs supports at least all the codecs that {C,J,K}Codecs do combined. From perky@fallin.lv Thu Jun 12 05:46:34 2003 From: perky@fallin.lv (Hye-Shik Chang) Date: Thu, 12 Jun 2003 13:46:34 +0900 Subject: [I18n-sig] iconvcodec 1.1 is released Message-ID: <20030612044634.GA10477@fallin.lv> Hi, i18n-goodies! I just released iconvcodec 1.1 and available for download at: http://sourceforge.net/project/showfiles.php?group_id=46747 Changes from 1.0 are the following: - Enabled ISO-10646-2 extended planes by using Surrogate-Pair on ucs2-python - Now, users can add 'iconvcodec.' prefix before encoding names to exclude another lookup functions. (eg: iconvcodec.utf-8) - Fixed a syntax error around #if block [Changwoo Ryu] - Added a workaround to compile it with MinGW32 [Young-Sik Won] And, win32 binary distribution's iconv library is upgraded to GNU libiconv 1.9.1. Therefore, win32 binary is released under LGPL. (GNU libiconv 1.9.1 supports JIS X 0213 encodings, yay!) Thank you for listening! Regards, Hye-Shik =) From kajiyama@grad.sccs.chukyo-u.ac.jp Thu Jun 12 06:28:26 2003 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Thu, 12 Jun 2003 14:28:26 +0900 Subject: [I18n-sig] Re: CJKCodecs 0.9 is released In-Reply-To: (jason@mastaler.com) References: Message-ID: <200306120528.h5C5SQm09184@grad.sccs.chukyo-u.ac.jp> "Jason R. Mastaler" writes: | | Hye-Shik Chang writes: | | > Yup. KoreanCodecs will be retired after CJKCodecs 1.0 is released. | > And, I don't have permissions to replace the others because I am not | > an author of them. | | Have you contacted the authors of {C,J}Codecs? Perhaps they would be | willing to contribute to your package in order to retire theirs? I've been on this list, watching what's going on. However, I'm busy and don't have enough time to commit to the development of both JapaneseCodecs and CJKCodecs. Excuse me for inconvenience. 
Regards, -- KAJIYAMA, Tamito From alex@lisa.org Wed Jun 18 13:37:23 2003 From: alex@lisa.org (Alex Lam) Date: Wed, 18 Jun 2003 14:37:23 +0200 Subject: [I18n-sig] Software Testing and Internationalization - Free book by LISA/Lemoine International Message-ID: Dear colleague, LISA, in collaboration with Lemoine International has made "Software Testing and Internationalization" by Galileo Computing freely available for download. This 330 page book will transform how you view testing methodologies and procedures. It introduces the reader to essential concepts and approaches used by practitioners in the software testing arena, while also taking into account the realities of low budgets and real schedule deadlines. It is in this context that the specific needs of small, agile project teams are covered in detail. Topics covered: * New approaches to quality * Risk analysis and evaluation * Risk-based testing * Exploratory testing * Testing and tuning * Testing by using * Use cases, requirements, and test cases * Debugging * Myths and realities of Automated Testing * Windows scripting * Test frameworks * Testing-based application development * Tools for developers and testers * Agile test management * International planning and architecture * International development issues * Internationalization testing To download a copy, please visit http://www.lisa.org/interact/2003/swtestregister.html Founded in 1990 as a non-profit association, LISA is the premier organization for the GILT (Globalization, Internationalization, Localization, and Translation) business communities. Over 400 leading IT manufacturers and solutions providers, along with industry professionals and an increasing number of vertical market corporations with an international business focus, have helped establish LISA best practice guidelines and language-technology standards for enterprise globalization. From perky@i18n.org Thu Jun 19 21:40:31 2003 From: perky@i18n.org (Hye-Shik Chang) Date: Fri, 20 Jun 2003 05:40:31 +0900 Subject: [I18n-sig] Re: CJKCodecs 0.9 is released In-Reply-To: References: <20030606095332.GA90359@fallin.lv> <20030611021836.GA87284@fallin.lv> <20030611081301.GA92933@fallin.lv> Message-ID: <20030619204031.GA62833@i18n.org> On Wed, Jun 11, 2003 at 10:34:15PM +0200, Martin v. L?wis wrote: > Hye-Shik Chang writes: [snip] > > 2) CJKCodecs' big5 versus ChineseCodecs' big5-tw > > > > CJK Chinese GNU > > a15a - fffd - > > a1c3 - fffd - > > a1c5 - fffd - > > a1fe - fffd - > > a240 - fffd - > > a2cc - fffd - > > a2ce - fffd - > > What does that mean? CJK and iconv gives UnicodeError, whereas > ChineseCodecs puts in the replacement character? Seems like a bug in > ChineseCodecs to me, doesn't it? The replacement character should only > be generated if errors='replace', no? According Unicode.org's mapping: ] A number of characters are not currently mapped because ] of conflicts with other mappings. They are as follows: ] ] BIG5 Description Comments ] ] 0xA15A SPACING UNDERSCORE duplicates A1C4 ] 0xA1C3 SPACING HEAVY OVERSCORE not in Unicode ] 0xA1C5 SPACING HEAVY UNDERSCORE not in Unicode ] 0xA1FE LT DIAG UP RIGHT TO LOW LEFT duplicates A2AC ] 0xA240 LT DIAG UP LEFT TO LOW RIGHT duplicates A2AD ] 0xA2CC HANGZHOU NUMERAL TEN conflicts with A451 mapping ] 0xA2CE HANGZHOU NUMERAL THIRTY conflicts with A4CA mapping ] ] We currently map all of these characters to U+FFFD REPLACEMENT CHARACTER. ] It is also possible to map these characters to their duplicates, or to ] the user zone. 
So, I changed mapping for them to as cp950 does instead of U+FFFD or user-defined area. I think that's affordable. BIG5 Unicode Description 0xA15A 0x2574 SPACING UNDERSCORE 0xA1C3 0xFFE3 SPACING HEAVY OVERSCORE 0xA1C5 0x02CD SPACING HEAVY UNDERSCORE 0xA1FE 0xFF0F LT DIAG UP RIGHT TO LOW LEFT 0xA240 0xFF3C LT DIAG UP LEFT TO LOW RIGHT 0xA2CC 0x5341 HANGZHOU NUMERAL TEN 0xA2CE 0x5345 HANGZHOU NUMERAL THIRTY > > > 3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp > > > > CJK Japanese GNU > > 01c0 005c ff3c ff3c > > That appears to be a bug in CJK, right? This is the question whether > /xa1/xc0 is REVERSE SOLIDUS or FULLWIDTH REVERSE SOLIDUS. Now, it > appears that euc-jp also supports /x5c, mapped to REVERSE SOLIDUS, > and that /xa1/xc0 should be interpreted as FULLWIDTH REVERSE SOLIDUS, > no? Right. That makes sense. > > In case of doubt, I think ICU should be consulted for reference, as > well, and following some kind of majority. In any case, I think the > questionable mappings need to be documented. > > > f5a1 e000 - e000 -+ User-Defined Area > > f5a2 e001 - e001 | > > .... | > > fefd e3aa - e3aa | > > fefe e3ab - e3ab -+ > > What are these? I cannot find them in glibc. Quoting Ken Lunde's CJKV Information Processing p.206 table 4-66: ] Table 4-66: Shift-JIS to Unicode and EUC-JP for User-Defined Region ] ] Shift-JIS Unicode EUC-JP ] F040-F0FC E000-E0BB F5A1-F5FE, F6A1-F6FE ] F140-F1FC E0BC-E177 F7A1-F7FE, F8A1-F8FE ] F240-F2FC E178-E233 F9A1-F9FE, FAA1-FAFE ] --snip-- ] F940-F9FC E69C-E757 8FFDA1-8FFDFE, 8FFEA1-8FFEFE > > > 4) CJKCodecs' shift-jis versus JapaneseCodecs' shift-jis > > > > CJK Japanese GNU > > 005c 00a5 005c 00a5 > > Here, I would trust GNU iconv; 5C really is YEN SIGN. > > > 007e 203e 007e 203e > > Likewise for OVERLINE - is there really no TILDE in shift-jis? :) > > > 007f - 007f 007f > > Why that? That's a bug of CJK. fixed. > > > 815f 005c ff3c ff3c > > Again: Why that? Shouldn't /x81/x5f be FULLWIDTH REVERSE SOLIDUS? Then, shift-jis will lose a *reserse solidus*. And, even Unicode.org's mapping did: ] sjis jisx0208 unicode ] 0x815F 0x2140 0x005C # REVERSE SOLIDUS > > > 817f - 00d7 - > > What character is that? Why does JapaneseCodecs map it to > MULTIPLICATION SIGN? glibc seems to map that to /x81/x7e; > is that a typo in JapaneseCodecs? > > > 837f - 30df - > > Are you sure glibc does not support that? Seems to be > KATAKANA LETTER MI. > > > > 9e7f - 684e - > > Why does JapaneseCodecs do that? glibc maps 9e7e to 684e. They are not in shift-jis's byte range. I guess that JapaneseCodecs' SJIS->EUC macro has a bug around them. > > > 5) CJKCodecs' cp932 versus JapaneseCodecs' ms932 > > > > CJK Japanese GNU > > 8160 ff5e ff5e 301c > > 8161 2225 2225 2016 > > Are these verified against MS CP932, e.g. from Windows XP? > > > 817f - 00d7 - > > Likewise: For CP932, it seems essential to do whatever Microsoft does, > in any Windows version. Okay. Here it is! :) CJK Japanese GNU WindowsXP 0080 - - - 0080 00a0 - - - f8f0 00fd - - - f8f1 00fe - - - f8f2 00ff - - - f8f3 8160 ff5e ff5e 301c ff5e 8161 2225 2225 2016 2225 817c ff0d ff0d 2212 ff0d 817f - 00d7 - - 8191 ffe0 ffe0 00a2 ffe0 8192 ffe1 ffe1 00a3 ffe1 81ca ffe2 ffe2 00ac ffe2 837f - 30df - - 847f - 043d - - .... e97f - 9a43 - - ea7f - 9eef - - I'll add 0x80, 0xa0, 0xfd, 0xfe, 0xff to CJKCodecs's cp932 to conform Windows's real mapping. > > > 2) CJKCodecs' big5 versus ChineseCodecs' big5-tw > > > > CJK Chinese GNU > > fffd a2ce a2ce - > > BIG-5 has the notion of a replacement character???? mentioned above. 
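The seven problematic Big5 code points and the cp950-style targets chosen above are easy to spot-check against whatever big5 codec happens to be installed. Whether a given implementation follows these choices depends on the package and version, so the sketch below only reports what it finds; it assumes a codec registered under the name 'big5':

    # Big5 byte pairs and the cp950-style mappings adopted above.
    CP950_STYLE = [
        (b'\xa1\x5a', u'\u2574'),   # SPACING UNDERSCORE
        (b'\xa1\xc3', u'\uffe3'),   # SPACING HEAVY OVERSCORE
        (b'\xa1\xc5', u'\u02cd'),   # SPACING HEAVY UNDERSCORE
        (b'\xa1\xfe', u'\uff0f'),   # LT DIAG UP RIGHT TO LOW LEFT
        (b'\xa2\x40', u'\uff3c'),   # LT DIAG UP LEFT TO LOW RIGHT
        (b'\xa2\xcc', u'\u5341'),   # HANGZHOU NUMERAL TEN
        (b'\xa2\xce', u'\u5345'),   # HANGZHOU NUMERAL THIRTY
    ]

    for raw, expected in CP950_STYLE:
        try:
            got = raw.decode('big5')
        except UnicodeDecodeError:
            got = None
        print('%r -> %r (cp950-style: %r) %s'
              % (raw, got, expected, 'ok' if got == expected else 'differs'))

Rows that print 'differs' are exactly the cases where data converted with one package will not round-trip through another.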
> > > 3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp > > > > CJK Japanese GNU > > 00a5 - 5c 5c > > That seems wrong, too. > > > 203e - 7e 7e > > Likewise. > > > Okay, the comparison says that we need some discussions on the mappings. > > I'll fix CJKCodecs' bugs as soon as possible and welcome any opionions > > about mapping inconsistencies. :) > > I'm too lazy now to review the other encodings. I'd encourage you to > consult ICU for established procedures, and to document the cases > where you pick one of the possible alternatives. I do hope that this > set of codecs becomes part of standard Python one day, at which point > we really need to document what exactly they do. Thank you for the comments. Your suggestions were very helpful to make CJKCodecs saner. > > Regards, > Martin > > Regards, Hye-Shik =) From kajiyama@grad.sccs.chukyo-u.ac.jp Thu Jun 19 21:49:40 2003 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Fri, 20 Jun 2003 05:49:40 +0900 Subject: [I18n-sig] Re: CJKCodecs 0.9 is released In-Reply-To: <20030619204031.GA62833@i18n.org> (message from Hye-Shik Chang on Fri, 20 Jun 2003 05:40:31 +0900) References: <20030619204031.GA62833@i18n.org> Message-ID: <200306192049.h5JKnef08655@grad.sccs.chukyo-u.ac.jp> Hye-Shik Chang writes: | | > > 817f - 00d7 - | > | > What character is that? Why does JapaneseCodecs map it to | > MULTIPLICATION SIGN? glibc seems to map that to /x81/x7e; | > is that a typo in JapaneseCodecs? | > | > > 837f - 30df - | > | > Are you sure glibc does not support that? Seems to be | > KATAKANA LETTER MI. | > | > | > > 9e7f - 684e - | > | > Why does JapaneseCodecs do that? glibc maps 9e7e to 684e. | | They are not in shift-jis's byte range. I guess that JapaneseCodecs' | SJIS->EUC macro has a bug around them. Exactly. I'll fix it in the next release of JapaneseCodecs. Thanks, -- KAJIYAMA, Tamito From martin@v.loewis.de Sat Jun 21 17:53:05 2003 From: martin@v.loewis.de (Martin v. =?iso-8859-15?q?L=F6wis?=) Date: 21 Jun 2003 18:53:05 +0200 Subject: [I18n-sig] Re: CJKCodecs 0.9 is released In-Reply-To: <20030619204031.GA62833@i18n.org> References: <20030606095332.GA90359@fallin.lv> <20030611021836.GA87284@fallin.lv> <20030611081301.GA92933@fallin.lv> <20030619204031.GA62833@i18n.org> Message-ID: Hye-Shik Chang writes: > So, I changed mapping for them to as cp950 does instead of U+FFFD or > user-defined area. I think that's affordable. Indeed. I'd encourage you to list all "critical" cases in the documentation of your package. This is all tricky stuff, and opinions vary widely. So users should be able to find out up-front what they get - they are much more angry if they find out by surprise. > Quoting Ken Lunde's CJKV Information Processing p.206 table 4-66: > ] Table 4-66: Shift-JIS to Unicode and EUC-JP for User-Defined Region > ] > ] Shift-JIS Unicode EUC-JP > ] F040-F0FC E000-E0BB F5A1-F5FE, F6A1-F6FE > ] F140-F1FC E0BC-E177 F7A1-F7FE, F8A1-F8FE > ] F240-F2FC E178-E233 F9A1-F9FE, FAA1-FAFE > ] --snip-- > ] F940-F9FC E69C-E757 8FFDA1-8FFDFE, 8FFEA1-8FFEFE Is this really necessary? Using PUA characters is evil, IMO, and should be avoided unless explicitly requested by the application. If those characters are not supported in Unicode, they can't be really important, no? Or, are you sure that they are still unsupported in Unicode 4.0? > Okay. Here it is! 
:)
>
> CJK Japanese GNU WindowsXP
> 0080 - - - 0080
> 00a0 - - - f8f0
> 00fd - - - f8f1
> 00fe - - - f8f2
> 00ff - - - f8f3
>
> I'll add 0x80, 0xa0, 0xfd, 0xfe, 0xff to CJKCodecs's cp932 to conform > Windows's real mapping.
This is, in fact, a place where the mapping-to-PUA might be acceptable. CP932 is Microsoft's "private" encoding, anyway, so they set the rules :-( Regards, Martin From tree@basistech.com Sat Jun 21 18:21:16 2003 From: tree@basistech.com (Tom Emerson) Date: Sat, 21 Jun 2003 13:21:16 -0400 Subject: [I18n-sig] Re: CJKCodecs 0.9 is released In-Reply-To: References: <20030606095332.GA90359@fallin.lv> <20030611021836.GA87284@fallin.lv> <20030611081301.GA92933@fallin.lv> <20030619204031.GA62833@i18n.org> Message-ID: <16116.37900.202469.431556@magrathea.basistech.com> Martin v. Löwis writes: [...]
> > Quoting Ken Lunde's CJKV Information Processing p.206 table 4-66:
> > ] Table 4-66: Shift-JIS to Unicode and EUC-JP for User-Defined Region
> > ]
> > ] Shift-JIS Unicode EUC-JP
> > ] F040-F0FC E000-E0BB F5A1-F5FE, F6A1-F6FE
> > ] F140-F1FC E0BC-E177 F7A1-F7FE, F8A1-F8FE
> > ] F240-F2FC E178-E233 F9A1-F9FE, FAA1-FAFE
> > ] --snip--
> > ] F940-F9FC E69C-E757 8FFDA1-8FFDFE, 8FFEA1-8FFEFE
>
> Is this really necessary? Using PUA characters is evil, IMO, and > should be avoided unless explicitly requested by the application. If those characters are not supported in Unicode, they can't be really > important, no?
Yes, it is really necessary. If you want to round trip these encodings then you need to map the UDRs of the various legacy encodings into the PUA and back again. If you don't then you can and will lose data. For Japanese encodings there are numerous corporate extensions to Shift JIS, as well as the various emoticons and other dingbats introduced for use with iMode and other phones. It is a much bigger issue for the Chinese encodings: extensions to Big Five (CP950, ETen, GCCS, HKSCS, etc.) are done in the UDR and VDR parts of the encoding space. Unfortunately you rarely if ever see such documents identified as ETen or CP950 or HKSCS: just as Big Five. Since you cannot easily detect which of these variants are in use you need to round trip the UDRs/VDRs through the PUA. > Or, are you sure that they are still unsupported in Unicode 4.0? In the case of HKSCS all but 3 characters are defined in Planes 0 and 2. However, as I mentioned above, if you do not know that your file claiming Big Five is really HKSCS then you can't map the UDR/VDR sections appropriately. Oh, and Microsoft defines CP950 as different things depending on whether the file is from Taiwan or Hong Kong. The latest issue faced with transcoding between legacy Asian encodings (especially JIS X 0213) and Unicode is the interpretation of compatibility characters and how strictly you want to enforce the rules laid out by TUC. -tree -- Tom Emerson Basis Technology Corp. Software Architect http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From martin@v.loewis.de Sat Jun 21 20:26:53 2003 From: martin@v.loewis.de (Martin v.
=?iso-8859-15?q?L=F6wis?=) Date: 21 Jun 2003 21:26:53 +0200 Subject: [I18n-sig] Re: CJKCodecs 0.9 is released In-Reply-To: <16116.37900.202469.431556@magrathea.basistech.com> References: <20030606095332.GA90359@fallin.lv> <20030611021836.GA87284@fallin.lv> <20030611081301.GA92933@fallin.lv> <20030619204031.GA62833@i18n.org> <16116.37900.202469.431556@magrathea.basistech.com> Message-ID: Tom Emerson writes: > For Japanese encodings there are numerous corporate extensions to > Shift JIS, as well as the various emoticons and other dingbats > introduced for use with iMode and other phones. That only tells me that mapping to the PUA is most likely incorrect, though: Are these corporate extensions well-specified? Are they non-overlapping? If yes, I think a "proper" mapping should be found. For example, many emoticons and dingbats are supported in Unicode 4.0, and should be used instead of the PUA. If no, I feel that these characters just shouldn't round-trip. There would be no loss of data. Instead, users would get a UnicodeError, indicating that some characters just can't be converted to Unicode. Now, there might be certain applications where this is not acceptable. For many of these applications, it is the runtime error that is not acceptable, not a possible loss of data in rare cases. For these cases, the 'replace' processing of Python codecs seems appropriate. For a small number of applications, round-tripping is important enough even if it means to use the PUA. It is important that authors of these applications understand that they can *only* convert back the results to the original encoding, and not to some other encoding - e.g. it is incorrect to encode the Unicode strings as UTF-8, for use in HTML. Authors of these applications would need to specify that they understand all that, e.g. by using a different codec name (e.g. a '+pua' suffix) > It is a much bigger issue for the Chinese encodings: extensions to Big > Five (CP950, ETen, GCCS, HKSCS, etc.) are done in the UDR and VDR > parts of the encoding space. Unfortunately you rarely if ever see such > documents identified as ETen or CP950 or HKSCS: just as Big > Five. Since you cannot easily detect which of these variants are in use > you need to round trip the UDRs/VDRs through the PUA. Again, assuming it is round-tripping that you are after. Many Python Unicode applications don't do round-tripping. Instead, they convert the input to some other encoding (put it into a database, output UTF-8, output XML character references). This is a perfect recipe for moji-bake. > In the case of HKSCS all but 3 characters are defined in Planes 0 and > 2. However, as I mentioned above, if you do not know that your file > claiming Big Five is really HKSCS then you can't map the UDR/VDR > sections appropriately. Can you give an example where using the HKSCS codec for decoding would be incorrect? > Oh, and Microsoft defines CP950 as different things depending on > whether the file is from Taiwan or Hong Kong. That sounds like one needs two versions of cp950... In any case, for MS code pages, I think a Python codec should do exactly what MS does. If that involves PUA, oh well, at least the moji-bake will be consistent with what Microsoft produces, so MSIE might even render it correctly..
Regards, Martin From tree@basistech.com Sat Jun 21 20:46:43 2003 From: tree@basistech.com (Tom Emerson) Date: Sat, 21 Jun 2003 15:46:43 -0400 Subject: [I18n-sig] Re: CJKCodecs 0.9 is released In-Reply-To: References: <20030606095332.GA90359@fallin.lv> <20030611021836.GA87284@fallin.lv> <20030611081301.GA92933@fallin.lv> <20030619204031.GA62833@i18n.org> <16116.37900.202469.431556@magrathea.basistech.com> Message-ID: <16116.46627.374924.972713@magrathea.basistech.com> Martin v. Löwis writes: > Tom Emerson writes: > > For Japanese encodings there are numerous corporate extensions to > > Shift JIS, as well as the various emoticons and other dingbats > > introduced for use with iMode and other phones. > > That only tells me that mapping to the PUA is most likely incorrect, > though: > > Are these corporate extensions well-specified? Are they > non-overlapping? Well specified? Sure, there are specifications. Non-overlapping? Of course not: each corporate extension starts at the same point in the user-defined regions of the legacy encoding. > If yes, I think a "proper" mapping should be found. For example, many > emoticons and dingbats are supported in Unicode 4.0, and should be > used instead of the PUA. Absolutely you should, but these characters have to be proposed to the Unicode Consortium, and in the case of ideographs, to the IRG of ISO 10646. > If no, I feel that these characters just shouldn't round-trip. There > would be no loss of data. Instead, users would get a UnicodeError, > indicating that some characters just can't be converted to Unicode. This is a ridiculously pedantic approach that will end up pissing people off: the PUA in Unicode is designed for this purpose, so it should be used. > Now, there might be certain applications where this is not > acceptable. For many of these applications, it is the runtime error > that is not acceptable, not a possible loss of data in rare cases. For > these cases, the 'replace' processing of Python codecs seems > appropriate. Data loss is a problem. Customers get very upset when their data gets munged for no good reason. > For a small number of applications, round-tripping is important enough > even if it means to use the PUA. It is important that authors of these > applications understand that they can *only* convert back the results > to the original encoding, and not to some other encoding - e.g. it is > incorrect to encode the Unicode strings as UTF-8, for use in HTML. Where does it say you cannot encode PUA characters in UTF-8? If you have a custom font that handles these code points, then you are going to be upset that you can't display them because the author of the codec decided that PUA characters are an abomination that should be stricken from the earth. > Authors of these applications would need to specify that they > understand all that, e.g. by using a different codec name (e.g. a > '+pua' suffix) So then you get a pile of ShiftJIS encodings, those that round trip, those that don't. > Again, assuming it is round-tripping that you are after. Many Python > Unicode applications don't do round-tripping. Instead, they convert > the input to some other encoding (put it into a database, output > UTF-8, output XML character references). This is a perfect recipe for > moji-bake. I disagree that this is a recipe for moji-bake. If I'm stuffing values into a database PUA may be the only thing we can do. I do not want my ShiftJIS extension characters being replaced with U+FFFD.
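The round trip Tom is arguing for can be prototyped at the PEP 293 error-handler level without touching any codec's mapping tables. The sketch below is not how any of the packages discussed here behave: it simply parks each rejected byte at an arbitrary private-use offset on decode and restores it on encode, so it demonstrates the mechanism rather than the UDR-to-PUA tables from Ken Lunde's book. The handler names and the PUA_BASE offset are invented, and it relies on the error handler being allowed to return raw bytes on the encode side, which not every interpreter version permits.

    import codecs

    PUA_BASE = 0xF700   # arbitrary private-use offset for stray bytes (illustration only)

    def decode_to_pua(exc):
        # Park one undecodable byte at PUA_BASE + byte value and resume.
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        byte = exc.object[exc.start]          # an int when exc.object is a bytes object
        return chr(PUA_BASE + byte), exc.start + 1

    def encode_from_pua(exc):
        # Turn parked private-use code points back into the original bytes.
        if not isinstance(exc, UnicodeEncodeError):
            raise exc
        raw = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            cp = ord(ch)
            if not PUA_BASE <= cp <= PUA_BASE + 0xFF:
                raise exc                     # genuinely unencodable; give up
            raw.append(cp - PUA_BASE)
        return bytes(raw), exc.end

    codecs.register_error('pua-roundtrip', decode_to_pua)    # handler names invented
    codecs.register_error('pua-restore', encode_from_pua)

    raw = b'\x82\xa0\xf0\x40'   # Shift JIS hiragana 'a' plus a user-defined byte pair
    text = raw.decode('shift_jis', 'pua-roundtrip')
    assert text.encode('shift_jis', 'pua-restore') == raw

As Martin points out above, anything decoded this way is only safe to hand back to the same encoding; pushing it out as UTF-8 or into a shared database reintroduces exactly the ambiguity under discussion.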
> > In the case of HKSCS all but 3 characters are defined in Planes 0 and > > 2. However, as I mentioned above, if you do not know that your file > > claiming Big Five is really HKSCS then you can't map the UDR/VDR > > sections appropriately. > > Can you give an example where using the HKSCS codec for decoding would > be incorrect? I can dig up the three characters that are not encoded in Unicode: I don't have the latest HKSCS at home. But again, if you do not know you are looking at HKSCS, you lose. > > Oh, and Microsoft defines CP950 as different things depending on > > whether the file is from Taiwan or Hong Kong. > > That sounds like one needs two versions of cp950... Sure, if you know which version you are dealing with, which you may not. > In any case, for MS code pages, I think a Python codec should do > exactly what MS does. If that involves PUA, oh well, at least the > moji-bake will be consistent with what Microsoft produces, so MSIE > might even render it correctly.. Yes, well, it can be a full-time job to keep up to date with Microsoft's ever-changing mapping tables. Peace, tree -- Tom Emerson Basis Technology Corp. Software Architect http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From martin@v.loewis.de Sat Jun 21 22:16:22 2003 From: martin@v.loewis.de (Martin v. =?iso-8859-15?q?L=F6wis?=) Date: 21 Jun 2003 23:16:22 +0200 Subject: [I18n-sig] Re: CJKCodecs 0.9 is released In-Reply-To: <16116.46627.374924.972713@magrathea.basistech.com> References: <20030606095332.GA90359@fallin.lv> <20030611021836.GA87284@fallin.lv> <20030611081301.GA92933@fallin.lv> <20030619204031.GA62833@i18n.org> <16116.37900.202469.431556@magrathea.basistech.com> <16116.46627.374924.972713@magrathea.basistech.com> Message-ID: Tom Emerson writes: > This is a ridiculously pedantic approach that will end up pissing > people off: the PUA in Unicode is designed for this purpose, so it > should be used. It is fine if users are aware that this happens. If they are not, they will be pissed off when they find out. > Where does it say you cannot encode PUA characters in UTF-8? If > you have a custom font that handles these code points, then you are > going to be upset that you can't display them because the author of > the codec decided that PUA characters are an abomination that should > be stricken from the earth. And if you don't have such a font, you will see some replacement characters. A lot of things need to be in place for this to work correctly. Developers need to make sure things all are in place, and need to ask the libraries to work to how they put them. > I disagree that this is a recipe for moji-bake. If I'm stuffing values > into a database PUA may be the only thing we can do. I do not want my > ShiftJIS extension characters being replaced with U+FFFD. Now, if your font was meant for a different proprietary extension that happens to use the same private characters, you get incorrect display. Right? Likewise, if some other application reads out the data, and interprets the private characters in a different way. Private characters should never leave the scope of "the application", and some effort should be done to make sure they don't leak out of "the application". > > Can you give an example where using the HKSCS codec for decoding would > > be incorrect? > > I can dig up the three characters that are not encoded in Unicode: I > don't have the latest HKSCS at home. But again, if you do not know you > are looking at HKSCS, you lose.
This is not what I meant. What I'm asking is this: Are there HKSCS characters that have encodings which are identical to encodings in other common Big-5 extensions? IOW, what bad things would happen if you would assume all Big-5 is HKSCS? Or: how would the use of PUAs improve the situation in that case? > > That sounds like one needs two versions of cp950... > > Sure, if you know which version you are dealing with, which you may not. That is always the case: If I don't know the encoding of some document, there is always the risk of misinterpretation. I can use heuristics to guess the encoding in some cases, and in some cases, the heuristics work reasonably well - in other cases, they fail miserably. There is nothing one can do, except to have users always declare their encodings properly, to use only data formats which include charset declarations, to use only charset names that are unambiguous, preferably even over time, etc. If people don't follow these rules, some things will go wrong. Then, people will learn to correct their errors. Regards, Martin From Matt Gushee Sat Jun 21 23:51:07 2003 From: Matt Gushee (Matt Gushee) Date: Sat, 21 Jun 2003 16:51:07 -0600 Subject: [I18n-sig] Re: CJKCodecs 0.9 is released In-Reply-To: References: <20030611021836.GA87284@fallin.lv> <20030611081301.GA92933@fallin.lv> <20030619204031.GA62833@i18n.org> <16116.37900.202469.431556@magrathea.basistech.com> <16116.46627.374924.972713@magrathea.basistech.com> Message-ID: <20030621225107.GE12229@swordfish> Some may consider this off-topic, but I don't believe the right course of action here can be decided on purely technical grounds. So here goes: On Sat, Jun 21, 2003 at 11:16:22PM +0200, Martin v. Löwis wrote: > > > This is a ridiculously pedantic approach that will end up pissing > > people off: the PUA in Unicode is designed for this purpose, so it > > should be used. > > It is fine if users are aware that this happens. If they are not, they > will be pissed off when they find out. Could be, if by "users" you mean developers that use the library. I doubt that more than a minuscule fraction of end users has even heard of Unicode. They just want working software and readable documents. And I think that has a lot to do with the success of Shift-JIS, even though it is the epitome of bad design: at the time it was developed, half-width katakana were in widespread use, and Shift-JIS made it easy to accommodate that need. > > Where does it say you cannot encode PUA characters in UTF-8? If > > you have a custom font that handles these code points, then you are > > going to be upset that you can't display them because the author of > > the codec decided that PUA characters are an abomination that should > > be stricken from the earth. > > And if you don't have such a font, you will see some replacement > characters. Well, I don't have an intimate knowledge of how CJKV character sets are used on a daily basis, but I do have a broad knowledge of how society works in at least Japan and mainland China (been to both, studied the history in school, lived in Japan for seven years), and I would guess that the availability of fonts in any given scenario is somewhat analogous to the availability of XML DTDs: organizations (or individuals) tend to have the same technology (fonts, software, etc.) as other organizations that they are likely to exchange documents with. That's not unique to Asia, of course, but I have the impression it is more true there than in the West.
> Private characters should never leave the scope of "the application", > and some effort should be done to make sure they don't leak out of > "the application". If by "application," you mean a particular software program or a closely coordinated set of programs, I very much doubt that goal is achievable in the foreseeable future. Maybe if you took a somewhat broader view and said something like "system," encompassing both software and a set of business practices, it would be realistic. > There is nothing one can do, except to have users always declare their > encodings properly, to use only data formats which include charset > declarations, to use only charset names that are unambiguous, > preferably even over time, etc. If people don't follow these rules, > some things will go wrong. Then, people will learn to correct their > errors. No, rigid enforcement of standards is not the only choice. The alternative is to determine what non-standard practices (or de-facto standard practices) are most common, and attempt to accommodate those. I honestly don't know which is better, but philosophically I favor usability over correctness (of course, the two aren't necessarily at odds in the long term, but often seem to conflict in the short term). Adherence to standards is a good thing, but you also have to deal with the social context where your product is being used. Consider the case of, say, the typical harried IT manager in a Tokyo insurance firm. He needs to plan the development of a new Web application; the project requirements call for a very high-level dynamic language. Well, that gives him several choices, doesn't it? And let's suppose that Python requires his team to "always declare their encodings properly, to use only charset names that are unambiguous ..." and so on. And suppose one of the alternatives (I don't know, perhaps Ruby?) "just works" for his use cases. Well, then, why should he use Python? I'm not suggesting that the goal of standards-compliance be discarded for the sake of popularity, now or ever. But sometimes you need to be a little less forceful: give users something that works for them today, while gently steering them toward the "right" path. Python is good technology, and good technology should be widely used. And if correctness comes at the expense of usability, you're just going to drive people away. -- Matt Gushee When a nation follows the Way, Englewood, Colorado, USA Horses bear manure through mgushee@havenrock.com its fields; http://www.havenrock.com/ When a nation ignores the Way, Horses bear soldiers through its streets. --Lao Tzu (Peter Merel, trans.) From tex@I18nGuy.com Thu Jun 26 09:08:16 2003 From: tex@I18nGuy.com (Tex Texin) Date: Thu, 26 Jun 2003 04:08:16 -0400 Subject: [I18n-sig] 24th Unicode Conference - Atlanta, GA - September 3-5, 2003 Message-ID: <3EFAA9F0.F7850846@I18nGuy.com> ************************************************************************ Twenty-fourth Internationalization and Unicode Conference (IUC24) Unicode, Internationalization, the Web: Powering Global Business http://www.unicode.org/iuc/iuc24 September 3-5, 2003 Atlanta, GA ************************************************************************ NEWS > Visit the Conference Web site ( http://www.unicode.org/iuc/iuc24 ) to check the updated Conference program and register. To help you choose Conference sessions, we've included abstracts of talks and speakers' biographies. > Hotel guest room group rate valid to August 12. > Early bird registration rates valid to August 12. 
> To find out about, and register for the TILP Breakfast Meeting and Roundtable, organized by The Institute of Localisation Professionals, and taking place at the same venue on September 4, 7:00 a.m.-9:00 a.m., See: http://www.tilponline.org/events/diary.shtml or http://www.unicode.org/iuc/iuc24 ************************************************************************ Are you falling behind? Version 4.0 of the Unicode Standard is here! Software and Web applications can now support more languages with greater efficiency and lower cost. Do you need to find out how? Do you need to be more competitive around the globe? Is your software upward-compatible with version 4.0? Does your staff need internationalization training? Learn about software and Web internationalization and the new Unicode Standard, including its latest features and requirements. This is the only event endorsed by the Unicode Consortium. The conference will be held September 3-5, 2003 in Atlanta, Georgia and is completely updated. KEYNOTES: Keynote speakers for IUC24 are well-known authors in the Internationalization and Localization industries: Donald De Palma, President, Common Sense Advisory, Inc., and author of "Business Without Borders: A Strategic Guide to Global Marketing", and Richard Gillam, author of "Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard" and a former columnist for "C++ Report". TUTORIALS: This redeveloped and enhanced Unicode 4.0 Tutorial is taught by Dr. Asmus Freytag, one of the major contributors to the standard, and extensively experienced in implementing real-world Unicode applications. Structured into 3 independent modules, you can attend just the overview, or only the most advanced material. Tutorials in Web Internationalization, non-Latin scripts, and more, are offered in parallel and taught by recognized industry experts. CONFERENCE TRACKS: Gain the competitive edge! Conference sessions provide the most up-to-date technical information on standards, best practices, and recent advances in the globalization of software and the Internet. Panel discussions and the friendly atmosphere allow you to exchange ideas and ask questions of key players in the internationalization industry. WHO SHOULD ATTEND?: If you have a limited training budget, this is the one Internationalization conference you need. Send staff that are involved in either Unicode-enabling software, or internationalization of software and the Internet, including: managers, software engineers, systems analysts, font designers, graphic designers, content developers, Web designers, Web administrators, technical writers, and product marketing personnel. CONFERENCE WEB SITE, PROGRAM and REGISTRATION The Conference Program and Registration form are available at the Conference Web site: http://www.unicode.org/iuc/iuc24 CONFERENCE SPONSORS Agfa Monotype Corporation Basis Technology Corporation ClientSide News L.L.C. Oracle Corporation World Wide Web Consortium (W3C) XenCraft GLOBAL COMPUTING SHOWCASE Visit the Showcase to find out more about products supporting the Unicode Standard, and products and services that can help you globalize/localize your software, documentation and Internet content. Sign up for the Exhibitors' track as part of the Conference. 
For more information, please see: http://www.unicode.org/iuc/iuc24/showcase.html CONFERENCE VENUE The Conference will take place at: DoubleTree Hotel Atlanta Buckhead 3342 Peachtree Road Atlanta, GA 30326 Tel: +1-404-231-1234 Fax: +1-404-231-3112 CONFERENCE MANAGEMENT Global Meeting Services Inc. 8949 Lombard Place, #416 San Diego, CA 92122, USA Tel: +1 858 638 0206 (voice) +1 858 638 0504 (fax) Email: info@global-conference.com or: conference@unicode.org THE UNICODE CONSORTIUM The Unicode Consortium was founded as a non-profit organization in 1991. It is dedicated to the development, maintenance and promotion of The Unicode Standard, a worldwide character encoding. The Unicode Standard encodes the characters of the world's principal scripts and languages, and is code-for-code identical to the international standard ISO/IEC 10646. In addition to cooperating with ISO on the future development of ISO/IEC 10646, the Consortium is responsible for providing character properties and algorithms for use in implementations. Today the membership base of the Unicode Consortium includes major computer corporations, software producers, database vendors, research institutions, international agencies and various user groups. For further information on the Unicode Standard, visit the Unicode Web site at http://www.unicode.org or e-mail * * * * * Unicode(r) and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.