From perky@i18n.org Sat Jul 12 17:06:32 2003
From: perky@i18n.org (Hye-Shik Chang)
Date: Sun, 13 Jul 2003 01:06:32 +0900
Subject: [I18n-sig] CJKCodecs 1.0b1 is released
Message-ID: <20030712160632.GA17734@i18n.org>

CJKCodecs 1.0b1 is released and available for download at:

  http://sourceforge.net/project/showfiles.php?group_id=46747

CJKCodecs is a unified Unicode codec set for Chinese, Japanese and
Korean encodings. It supports all features of the Unicode codec
specification and PEP 293 error callbacks on Python 2.3.

CJKCodecs currently supports these encodings:

  big5            cp932           cp949           cp950
  euc-jisx0213    euc-jp          euc-kr          gb18030
  gb2312          gbk             hz              iso-2022-jp
  iso-2022-jp-1   iso-2022-jp-2   iso-2022-jp-3   iso-2022-kr
  johab           shift-jis       shift-jisx0213  utf-16
  utf-16be        utf-16le        utf-7           utf-8

Changes in 1.0b1 from 0.9:

*) SHIFT-JISX0213, EUC-JISX0213, ISO-2022-JP-2 and ISO-2022-JP-3
   codecs are added.
*) UTF-7, UTF-16, UTF-16BE and UTF-16LE codecs are added.
*) Changed a few characters of the big5 codepoint mapping to cp950's
   rather than 0xfffd. (documented in NOTES.big5)
*) Fixed a bug where the JIS X 0201 routine didn't encode and decode
   0x7f.
*) Tweaked some mappings for cp932 and cp950 to be more consistent
   with MS Windows.
   - CP932: Added single byte "UNDEFINED" characters 0x80, 0xa0, 0xfd,
     0xfe, 0xff (documented in NOTES.cp932)
   - CP950: Changed encode mappings to the more popular ones for
     duplicated unicode points: 5341 -> A451, 5345 -> A4CA
*) A unittest for the big5 mapping is added.
*) Fixed a bug where the cp932 codec couldn't decode half-width
   katakana.
*) Added a workaround for PyObject_GenericGetAttr to enable compiling
   with mingw32. [Young-Sik Won]
*) Enabled the gb18030 and utf-8 codecs to encode and decode
   iso-10646-2 characters using surrogate pairs.
*) Fixed the gb18030 codec's syntax error that broke compilation on
   Python built with the --with-unicode=ucs4 option. [Son, Kyung-uk]
*) StreamWriter is now able to buffer incomplete sequences. (this
   feature is used for surrogate pairs and for mappings from a unicode
   character with a following modifier)
*) The EUC-JP codec's mapping for 0xA1C0 is changed from U+005C to
   U+FF3C because EUC-JP 0x5C is also a REVERSE SOLIDUS and 0xA1C0 is
   FULLWIDTH REVERSE SOLIDUS in Japanese environments.
*) Fixed the hz codec's bug where the encoding mode was not
   initialized to ASCII.

Thank you very much!

Regards,
Hye-Shik =)

From martin@v.loewis.de Sat Jul 12 18:32:21 2003
From: martin@v.loewis.de (=?ISO-8859-1?Q?=22Martin_v=2E_L=F6wis=22?=)
Date: Sat, 12 Jul 2003 19:32:21 +0200
Subject: [I18n-sig] CJKCodecs 1.0b1 is released
In-Reply-To: <20030712160632.GA17734@i18n.org>
References: <20030712160632.GA17734@i18n.org>
Message-ID: <3F104625.9070705@v.loewis.de>

Hye-Shik Chang wrote:

> *) UTF-7, UTF-16, UTF-16BE and UTF-16LE codecs are added.

What is the rationale for this change? Python already distributes
codecs for these.

Regards,
Martin

From perky@i18n.org Sat Jul 12 19:55:05 2003
From: perky@i18n.org (Hye-Shik Chang)
Date: Sun, 13 Jul 2003 03:55:05 +0900
Subject: [I18n-sig] CJKCodecs 1.0b1 is released
In-Reply-To: <3F104625.9070705@v.loewis.de>
References: <20030712160632.GA17734@i18n.org> <3F104625.9070705@v.loewis.de>
Message-ID: <20030712185505.GA19015@i18n.org>

On Sat, Jul 12, 2003 at 07:32:21PM +0200, "Martin v. Löwis" wrote:
> Hye-Shik Chang wrote:
>
> > *) UTF-7, UTF-16, UTF-16BE and UTF-16LE codecs are added.
>
> What is the rationale for this change? Python already distributes
> codecs for these.
>

Python's utf-7 codec is slightly broken for StreamReaders, and it was
not easy for me to fix. Some simple tests:

(doesn't handle surrogate pairs on ucs2 build)
>>> u'\U00012345'.encode('utf-7')
'+2AjfRQ-'
>>> '+2AjfRQ-'.decode('utf-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-7 decoding error: code pairs are not supported
>>> '+2AjfRQ-'.decode('cjkcodecs.utf-7')
u'\U00012345'

(broken encoding for unichar > 0xffff)
>>> u'\U00012345'.encode('utf-7')
'+I0U-'
>>> u'\U00012345'.encode('cjkcodecs.utf-7')
'+2AjfRQ-'
>>> '+2AjfRQ-'.decode('utf-7')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-6: code pairs are not supported
>>> '+2AjfRQ-'.decode('cjkcodecs.utf-7')
u'\U00012345'

(problem for long utf-7 sequence)
>>> s=StringIO.StringIO((u'\uac00' * 20).encode('utf-7'))
>>> rs = codecs.getreader('utf-7')(s)
>>> rs.read(10)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.3/codecs.py", line 262, in read
    object, decodedbytes = decode(data, self.errors)
UnicodeDecodeError: 'utf7' codec can't decode bytes in position 0-19: unterminated shift sequence
>>> s=StringIO.StringIO((u'\uac00' * 20).encode('utf-7'))
>>> rs = codecs.getreader('cjkcodecs.utf-7')(s)
>>> rs.read(10)
u'\uac00\uac00\uac00'

And I created utf-8 and utf-16 codecs for cjkcodecs just for fun.
I shipped them because they are somewhat faster than Python's equivalents.

(StreamReader benchmarks with a typical 10 KB Chinese text)
(all values are in iterations/sec)

            Python  CJKCodecs
read(16)        14        187
read(256)      221       1645
read(512)      468       1990
readline       361        921
readlines      785       1193

They are not very big and don't replace Python's codecs by default.
(distributed commented out in cjkcodecs/aliases.py)
So I think they are not so useless considering their size.

> Regards,
> Martin
>
>

Regards,
Hye-Shik =)

From mal@lemburg.com Sat Jul 12 20:14:11 2003
From: mal@lemburg.com (M.-A. Lemburg)
Date: Sat, 12 Jul 2003 21:14:11 +0200
Subject: [I18n-sig] CJKCodecs 1.0b1 is released
In-Reply-To: <20030712185505.GA19015@i18n.org>
References: <20030712160632.GA17734@i18n.org> <3F104625.9070705@v.loewis.de> <20030712185505.GA19015@i18n.org>
Message-ID: <3F105E03.4060808@lemburg.com>

Hye-Shik Chang wrote:
> And I created utf-8 and utf-16 codecs for cjkcodecs just for fun.
> I shipped them because they are somewhat faster than Python's equivalents.

That's interesting. How did you achieve the speedups? The
Python codecs for these are already rather well optimized.

> (StreamReader benchmarks with a typical 10 KB Chinese text)
> (all values are in iterations/sec)
>
>             Python  CJKCodecs
> read(16)        14        187
> read(256)      221       1645
> read(512)      468       1990
> readline       361        921
> readlines      785       1193
>
> They are not very big and don't replace Python's codecs by default.
> (distributed commented out in cjkcodecs/aliases.py)
> So I think they are not so useless considering their size.

Ah, I think I know what's causing this: you are measuring
Python function calls (.read() and .readlines() for UTF-8/16
are Python functions implemented in codecs.py) against
C type methods.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Software directly from the Source  (#1, Jul 12 2003)
>>> Python/Zope Products & Consulting ...         http://www.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/
________________________________________________________________________
2003-07-01: Released mxODBC.Zope.DA for FreeBSD             1.0.6 beta 1

From perky@i18n.org Sat Jul 12 20:33:35 2003
From: perky@i18n.org (Hye-Shik Chang)
Date: Sun, 13 Jul 2003 04:33:35 +0900
Subject: [I18n-sig] CJKCodecs 1.0b1 is released
In-Reply-To: <3F105E03.4060808@lemburg.com>
References: <20030712160632.GA17734@i18n.org> <3F104625.9070705@v.loewis.de> <20030712185505.GA19015@i18n.org> <3F105E03.4060808@lemburg.com>
Message-ID: <20030712193335.GA20529@i18n.org>

On Sat, Jul 12, 2003 at 09:14:11PM +0200, M.-A. Lemburg wrote:
> Hye-Shik Chang wrote:
> >And I created utf-8 and utf-16 codecs for cjkcodecs just for fun.
> >I shipped them because they are somewhat faster than Python's equivalents.
>
> That's interesting. How did you achieve the speedups? The
> Python codecs for these are already rather well optimized.
>

Ahh. Sorry for the incorrect statement. After some more tests, I found
that Python's codecs are a lot faster than CJKCodecs's for the
.encode() and .decode() functions (2x ~ 4x). CJKCodecs's codecs were
faster than Python's for StreamReader/Writers only (by a similar
ratio).

> >(StreamReader benchmarks with a typical 10 KB Chinese text)
> >(all values are in iterations/sec)
> >
> >            Python  CJKCodecs
> >read(16)        14        187
> >read(256)      221       1645
> >read(512)      468       1990
> >readline       361        921
> >readlines      785       1193
> >
> >They are not very big and don't replace Python's codecs by default.
> >(distributed commented out in cjkcodecs/aliases.py)
> >So I think they are not so useless considering their size.
>
> Ah, I think I know what's causing this: you are measuring
> Python function calls (.read() and .readlines() for UTF-8/16
> are Python functions implemented in codecs.py) against
> C type methods.

Agreed. I'm considering removing utf-{8,16} from the 1.0 release and
leaving utf-7 only. :)

Regards,
Hye-Shik =)

From tex@I18nGuy.com Sat Jul 19 07:41:03 2003
From: tex@I18nGuy.com (Tex Texin)
Date: Sat, 19 Jul 2003 02:41:03 -0400
Subject: [I18n-sig] 24th I18N & Unicode Conference - September 3-5, 2003 - Atlanta,Georgia, USA
Message-ID: <3F18E7FF.38146E27@I18nGuy.com>

Don't fall behind! Sign up now and get the early bird rates!
************************************************************************
Twenty-fourth Internationalization and Unicode Conference (IUC24)
Unicode, Internationalization, the Web: Powering Global Business
http://www.unicode.org/iuc/iuc24
September 3-5, 2003
Atlanta, Georgia, USA
************************************************************************

NEWS

> Visit the Conference Web site ( http://www.unicode.org/iuc/iuc24 )
  to check the updated Conference program and register. To help you
  choose Conference sessions, we've included abstracts of talks and
  speakers' biographies.

> Hotel guest room group rate valid to August 12.

> Early bird registration rates valid to August 12.

> To find out about, and register for the TILP Breakfast Meeting and
  Roundtable, organized by The Institute of Localisation Professionals,
  and taking place at the same venue on September 4, 7:00 a.m.-9:00 a.m.,
  See: http://www.tilponline.org/events/diary.shtml
  or http://www.unicode.org/iuc/iuc24

************************************************************************

Are you falling behind? Version 4.0 of the Unicode Standard is here!
Software and Web applications can now support more languages with
greater efficiency and lower cost. Do you need to find out how?
Do you need to be more competitive around the globe?
Is your software upward-compatible with version 4.0?
Does your staff need internationalization training?

Learn about software and Web internationalization and the new Unicode
Standard, including its latest features and requirements. This is the
only event endorsed by the Unicode Consortium. The conference will be
held September 3-5, 2003 in Atlanta, Georgia and is completely updated.

KEYNOTES: Keynote speakers for IUC24 are well-known authors in the
Internationalization and Localization industries: Donald De Palma,
President, Common Sense Advisory, Inc., and author of "Business Without
Borders: A Strategic Guide to Global Marketing", and Richard Gillam,
author of "Unicode Demystified: A Practical Programmer's Guide to the
Encoding Standard" and a former columnist for "C++ Report".

TUTORIALS: The redeveloped and enhanced Unicode 4.0 Tutorial is taught
by Dr. Asmus Freytag, one of the major contributors to the standard,
and extensively experienced in implementing real-world Unicode
applications. Structured into 3 independent modules, you can attend
just the overview, or only the most advanced material. Tutorials in Web
Internationalization, non-Latin scripts, and more, are offered in
parallel and taught by recognized industry experts.

CONFERENCE TRACKS: Gain the competitive edge! Conference sessions
provide the most up-to-date technical information on standards, best
practices, and recent advances in the globalization of software and the
Internet. Panel discussions and the friendly atmosphere allow you to
exchange ideas and ask questions of key players in the
internationalization industry.

WHO SHOULD ATTEND?: If you have a limited training budget, this is the
one Internationalization conference you need. Send staff that are
involved in either Unicode-enabling software, or internationalization
of software and the Internet, including: managers, software engineers,
systems analysts, font designers, graphic designers, content
developers, Web designers, Web administrators, technical writers, and
product marketing personnel.

CONFERENCE WEB SITE, PROGRAM and REGISTRATION

The Conference Program and Registration form are available at the
Conference Web site: http://www.unicode.org/iuc/iuc24

CONFERENCE SPONSORS

Agfa Monotype Corporation
Basis Technology Corporation
ClientSide News L.L.C.
Oracle Corporation
World Wide Web Consortium (W3C)
XenCraft

GLOBAL COMPUTING SHOWCASE

Visit the Showcase to find out more about products supporting the
Unicode Standard, and products and services that can help you
globalize/localize your software, documentation and Internet content.
Sign up for the Exhibitors' track as part of the Conference.
For more information, please see:
http://www.unicode.org/iuc/iuc24/showcase.html

CONFERENCE VENUE

The Conference will take place at:

DoubleTree Hotel Atlanta Buckhead
3342 Peachtree Road
Atlanta, GA 30326
Tel: +1-404-231-1234
Fax: +1-404-231-3112

CONFERENCE MANAGEMENT

Global Meeting Services Inc.
8949 Lombard Place, #416
San Diego, CA 92122, USA
Tel: +1 858 638 0206 (voice)
     +1 858 638 0504 (fax)
Email: info@global-conference.com
   or: conference@unicode.org

THE UNICODE CONSORTIUM

The Unicode Consortium was founded as a non-profit organization in 1991.
It is dedicated to the development, maintenance and promotion of The
Unicode Standard, a worldwide character encoding. The Unicode Standard
encodes the characters of the world's principal scripts and languages,
and is code-for-code identical to the international standard ISO/IEC
10646. In addition to cooperating with ISO on the future development of
ISO/IEC 10646, the Consortium is responsible for providing character
properties and algorithms for use in implementations. Today the
membership base of the Unicode Consortium includes major computer
corporations, software producers, database vendors, research
institutions, international agencies and various user groups.

For further information on the Unicode Standard, visit the Unicode Web
site at http://www.unicode.org or e-mail

* * * * *

Unicode(r) and the Unicode logo are registered trademarks of Unicode,
Inc. Used with permission.

From perky@i18n.org Sat Jul 19 13:48:57 2003
From: perky@i18n.org (Hye-Shik Chang)
Date: Sat, 19 Jul 2003 21:48:57 +0900
Subject: [I18n-sig] CJKCodecs 1.0 is released
Message-ID: <20030719124857.GA48691@i18n.org>

CJKCodecs 1.0 has been released for General Availability and is
available for download at:

  http://sourceforge.net/project/showfiles.php?group_id=46747

CJKCodecs is a unified Unicode codec set for Chinese, Japanese and
Korean encodings. It supports all features of the Unicode codec
specification and PEP 293 error callbacks on Python 2.3.

CJKCodecs currently supports these encodings:

  big5            cp932           cp949           cp950
  euc-jisx0213    euc-jp          euc-kr          gb18030
  gb2312          gbk             hz              iso-2022-jp
  iso-2022-jp-1   iso-2022-jp-2   iso-2022-jp-3   iso-2022-kr
  johab           shift-jis       shift-jisx0213  utf-7
  utf-8

Changes in 1.0 from 1.0b1:

*) UTF-16 codecs are removed from the distribution.
*) Fixed the UTF-7 codec's bug where it failed to decode surrogate
   pairs on ucs4-python.

Changes in 1.0b1 from 0.9:

*) SHIFT-JISX0213, EUC-JISX0213, ISO-2022-JP-2 and ISO-2022-JP-3
   codecs are added.
*) UTF-7, UTF-16, UTF-16BE and UTF-16LE codecs are added.
*) Changed a few characters of the big5 codepoint mapping to cp950's
   rather than 0xfffd. (documented in NOTES.big5)
*) Fixed a bug where the JIS X 0201 routine didn't encode and decode
   0x7f.
*) Tweaked some mappings for cp932 and cp950 to be more consistent
   with MS Windows.
   - CP932: Added single byte "UNDEFINED" characters 0x80, 0xa0, 0xfd,
     0xfe, 0xff (documented in NOTES.cp932)
   - CP950: Changed encode mappings to the more popular ones for
     duplicated unicode points: 5341 -> A451, 5345 -> A4CA
*) A unittest for the big5 mapping is added.
*) Fixed a bug where the cp932 codec couldn't decode half-width
   katakana.
*) Added a workaround for PyObject_GenericGetAttr to enable compiling
   with mingw32. [Young-Sik Won]
*) Enabled the gb18030 and utf-8 codecs to encode and decode
   iso-10646-2 characters using surrogate pairs.
*) Fixed the gb18030 codec's syntax error that broke compilation on
   Python built with the --with-unicode=ucs4 option. [Son, Kyung-uk]
*) StreamWriter is now able to buffer incomplete sequences. (this
   feature is used for surrogate pairs and for mappings from a unicode
   character with a following modifier)
*) The EUC-JP codec's mapping for 0xA1C0 is changed from U+005C to
   U+FF3C because EUC-JP 0x5C is also a REVERSE SOLIDUS and 0xA1C0 is
   FULLWIDTH REVERSE SOLIDUS in Japanese environments.
*) Fixed the hz codec's bug where the encoding mode was not
   initialized to ASCII.

Thank you very much!

Regards,
Hye-Shik =)
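
The surrogate-pair fixes listed above can be checked by hand. The
following is only a sketch, not CJKCodecs code: it assumes a modern
Python 3 interpreter rather than Python 2.3, and the helper names
to_surrogates and utf7_shift are hypothetical. It splits U+12345 into a
UTF-16 surrogate pair and wraps the pair in a UTF-7 shift sequence
('+', unpadded base64 of the big-endian 16-bit units, '-').

```python
import base64
import struct


def to_surrogates(cp):
    """Split a code point above U+FFFF into a UTF-16 (high, low) pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def utf7_shift(units):
    """Wrap 16-bit code units in a UTF-7 shift sequence (hypothetical helper)."""
    # Pack the units as big-endian 16-bit integers, as UTF-7 requires.
    raw = struct.pack(">%dH" % len(units), *units)
    # UTF-7 uses base64 without the trailing '=' padding.
    body = base64.b64encode(raw).decode("ascii").rstrip("=")
    return "+" + body + "-"


high, low = to_surrogates(0x12345)

# Encoding the full surrogate pair.
correct = utf7_shift([high, low])

# Encoding only the low 16 bits of the code point (U+2345).
truncated = utf7_shift([0x12345 & 0xFFFF])

print(hex(high), hex(low), correct, truncated)
# prints: 0xd808 0xdf45 +2AjfRQ- +I0U-
```

The first result matches the '+2AjfRQ-' that cjkcodecs.utf-7 produced
in the thread. The second matches the '+I0U-' reported for the stock
codec, which suggests (as an inference, not something stated in the
thread) that the code point was being cut down to 16 bits before
encoding.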