From tex@I18nGuy.com  Wed Jun  4 08:28:49 2003
From: tex@I18nGuy.com (Tex Texin)
Date: Wed, 04 Jun 2003 03:28:49 -0400
Subject: [I18n-sig] 24th Unicode Conference (IUC24) - September 3-5, 2003 -  Atlanta, GA
Message-ID: <3EDD9FB1.53FC9B7E@I18nGuy.com>

Unicode 4.0 Tutorial, many new presentations, and lovely Atlanta!

************************************************************************
    Twenty-fourth Internationalization and Unicode Conference (IUC24)
     Unicode, Internationalization, the Web: Powering Global Business
                    http://www.unicode.org/iuc/iuc24
                            September 3-5, 2003
                                Atlanta, GA
************************************************************************
Mark your diary! >> 12 weeks to go >> Mark your diary! >> 12 weeks to go
************************************************************************

Are you falling behind?  Version 4.0 of the Unicode Standard is here!
Software and Web applications can now support more languages with greater
efficiency and lower cost.  Do you need to find out how? Do you need to
be more competitive around the globe?  Is your software upward-compatible
with version 4.0?  Does your staff need internationalization training?

Learn about software and Web internationalization and the new Unicode
Standard, including its latest features and requirements.  This is the
only event endorsed by the Unicode Consortium.  The conference will be
held September 3-5, 2003 in Atlanta, Georgia and is completely updated.

KEYNOTES: Keynote speakers for IUC24 are well-known authors in the
Internationalization and Localization industries:

Donald De Palma, President, Common Sense Advisory, Inc., and author of
"Business Without Borders: A Strategic Guide to Global Marketing", and
Richard Gillam, author of "Unicode Demystified: A Practical Programmer's
Guide to the Encoding Standard" and a former columnist for "C++ Report".

TUTORIALS:  This redeveloped and enhanced Unicode 4.0 Tutorial is taught
by Dr. Asmus Freytag, one of the major contributors to the standard, and
extensively experienced in implementing real-world Unicode applications.
Structured into 3 independent modules, you can attend just the overview,
or only the most advanced material.  Tutorials in Web Internationalization,

non-Latin scripts, and more, are offered in parallel and taught by
recognized industry experts.

CONFERENCE TRACKS:  Gain the competitive edge! Conference sessions provide
the most up-to-date technical information on standards, best practices, and

recent advances in the globalization of software and the Internet.  Panel
discussions and the friendly atmosphere allow you to exchange ideas and ask

questions of key players in the internationalization industry.

WHO SHOULD ATTEND?:  If you have a limited training budget, this is the one

Internationalization conference you need.  Send staff that are involved in
either Unicode-enabling software, or internationalization of software and
the
Internet, including: managers, software engineers, systems analysts, font
designers, graphic designers, content developers, Web designers, Web
administrators, technical writers, and product marketing personnel.

CONFERENCE WEB SITE, PROGRAM and REGISTRATION

   The Conference Program and Registration form are available at the
   Conference Web site:
      http://www.unicode.org/iuc/iuc24

CONFERENCE SPONSORS

   Agfa Monotype Corporation
   Basis Technology Corporation
   ClientSide News L.L.C.
   Oracle Corporation
   World Wide Web Consortium (W3C)
   XenCraft

GLOBAL COMPUTING SHOWCASE

   Visit the Showcase to find out more about products supporting the
   Unicode Standard, and products and services that can help you
   globalize/localize your software, documentation and Internet content.

   Sign up for the Exhibitors' track as part of the Conference.
   For more information, please see:
   http://www.unicode.org/iuc/iuc24/showcase.html

CONFERENCE VENUE

The Conference will take place at:

          DoubleTree Hotel Atlanta Buckhead
          3342 Peachtree Road
          Atlanta, GA 30326

          Tel: +1-404-231-1234
          Fax: +1-404-231-3112

CONFERENCE MANAGEMENT

   Global Meeting Services Inc.
   8949 Lombard Place, #416
   San Diego, CA 92122, USA

   Tel: +1 858 638 0206 (voice)
        +1 858 638 0504 (fax)

   Email: info@global-conference.com
      or: conference@unicode.org

THE UNICODE CONSORTIUM

The Unicode Consortium was founded as a non-profit organization in 1991.
It is dedicated to the development, maintenance and promotion of The
Unicode Standard, a worldwide character encoding. The Unicode Standard
encodes the characters of the world's principal scripts and languages,
and is code-for-code identical to the international standard ISO/IEC
10646. In addition to cooperating with ISO on the future development of
ISO/IEC 10646, the Consortium is responsible for providing character
properties and algorithms for use in implementations. Today the
membership base of the Unicode Consortium includes major computer
corporations, software producers, database vendors, research
institutions, international agencies and various user groups.

For further information on the Unicode Standard, visit the Unicode Web
site at http://www.unicode.org or e-mail <info@unicode.org>

                           *  *  *  *  *

Unicode(r) and the Unicode logo are registered trademarks of Unicode,
Inc. Used with permission.


From confirm-s2-xpeXGMRqcjv1vcmC4kkh0SC8zxI-i18n-sig=python.org@yahoogroups.com  Thu Jun  5 11:11:03 2003
From: confirm-s2-xpeXGMRqcjv1vcmC4kkh0SC8zxI-i18n-sig=python.org@yahoogroups.com (Yahoo! Groups)
Date: 5 Jun 2003 10:11:03 -0000
Subject: [I18n-sig] Please confirm your request to join locales
Message-ID: <1054807863.74.14780.w23@yahoogroups.com>

Hello i18n-sig@python.org,

We have received your request to join the locales 
group hosted by Yahoo! Groups, a free, easy-to-use community service.

This request will expire in 21 days.

TO BECOME A MEMBER OF THE GROUP: 

1) Go to the Yahoo! Groups site by clicking on this link:

   http://groups.yahoo.com/i?i=xpeXGMRqcjv1vcmC4kkh0SC8zxI&e=i18n-sig%40python%2Eorg 

  (If clicking doesn't work, "Cut" and "Paste" the line above into your 
   Web browser's address bar.)

-OR-

2) REPLY to this email by clicking "Reply" and then "Send"
   in your email program

If you did not request, or do not want, a membership in the
locales group, please accept our apologies
and ignore this message.

Regards,

Yahoo! Groups Customer Care

Your use of Yahoo! Groups is subject to http://docs.yahoo.com/info/terms/ 

 
From perky@fallin.lv  Fri Jun  6 10:53:32 2003
From: perky@fallin.lv (Hye-Shik Chang)
Date: Fri, 6 Jun 2003 18:53:32 +0900
Subject: [I18n-sig] CJKCodecs 0.9 is released
Message-ID: <20030606095332.GA90359@fallin.lv>

The CJKCodecs 0.9 is released and available for download at:

 http://sourceforge.net/project/showfiles.php?group_id=46747


The CJKCodecs is a unified unicode codec set for Chinese, Japanese
and Korean encodings. It supports full features of unicode codec
specification and PEP293 error callbacks on Python 2.3.

Currently supported encodings and planned updates:

Authority       0.9             1.0             1.1             1.2
==============================================================================
China (PRC)     gb2312                          iso-2022-cn
                gbk(cp936)                      iso-2022-cn-ext
                gb18030
                hz

Hong Kong                                                       hkscs

Japan           shift-jis       iso-2022-jp-2   euc-jisx0213    iso-2022-int-1
                euc-jp                          shift-jisx0213  mac_japanese
                cp932                           iso-2022-jp-3
                iso-2022-jp
                iso-2022-jp-1

Korea (ROK)     euc-kr                          (ksx1001:2002)  mac_korean
                cp949(uhc)                                      unijohab
                johab
                iso-2022-kr

Korea (DPRK)                                                    euc-kp

Taiwan          big5                            iso-2022-cn
                cp950                           iso-2022-cn-ext
                                                euc-tw

Unicode.org     utf-8           utf-7
                                utf-16


It includes utf codecs to use it in our unit tests. (the standard utf-8
StreamReader behaves strangely on some conditions)

Thank you!


Regards,
    Hye-Shik =)


From JasonR.Mastaler  Tue Jun 10 22:25:53 2003
From: JasonR.Mastaler (JasonR.Mastaler)
Date: Tue, 10 Jun 2003 15:25:53 -0600
Subject: [I18n-sig] Re: CJKCodecs 0.9 is released
References: <20030606095332.GA90359@fallin.lv>
Message-ID: <m2y909zkta.fsf@deacon-blues.mid.mastaler.com>

Hye-Shik Chang <perky@fallin.lv> writes:

> The CJKCodecs is a unified unicode codec set for Chinese, Japanese
> and Korean encodings.

Is this packages intended to replace the ChineseCodecs[1],
KoreanCodecs[2], and JapaneseCodecs[3] packages, which are currently
available separately?

I see KoreanCodecs is marked "obsolete", but I see not similar mention
on the pages for ChineseCodecs and JapaneseCodecs.

Footnotes: 
[1]  http://sourceforge.net/projects/python-codecs
[2]  http://sourceforge.net/projects/koco/
[3]  http://www.asahi-net.or.jp/~rd6t-kjym/python/


From martin@v.loewis.de  Tue Jun 10 22:42:28 2003
From: martin@v.loewis.de (Martin v. =?iso-8859-15?q?L=F6wis?=)
Date: 10 Jun 2003 23:42:28 +0200
Subject: [I18n-sig] Re: CJKCodecs 0.9 is released
In-Reply-To: <m2y909zkta.fsf@deacon-blues.mid.mastaler.com>
References: <20030606095332.GA90359@fallin.lv>
 <m2y909zkta.fsf@deacon-blues.mid.mastaler.com>
Message-ID: <m3isrdhanv.fsf@mira.informatik.hu-berlin.de>

"Jason R. Mastaler" <jason@mastaler.com> writes:

> I see KoreanCodecs is marked "obsolete", but I see not similar mention
> on the pages for ChineseCodecs and JapaneseCodecs.

These are different package authors, so they likely have different
opinions on the status of each package.

Regards,
Martin


From JasonR.Mastaler  Tue Jun 10 23:49:33 2003
From: JasonR.Mastaler (JasonR.Mastaler)
Date: Tue, 10 Jun 2003 16:49:33 -0600
Subject: [I18n-sig] Re: CJKCodecs 0.9 is released
References: <20030606095332.GA90359@fallin.lv> <m2y909zkta.fsf@deacon-blues.mid.mastaler.com>
 <m3isrdhanv.fsf@mira.informatik.hu-berlin.de>
Message-ID: <m2of15zgxu.fsf@deacon-blues.mid.mastaler.com>

martin@v.loewis.de (Martin v. Löwis) writes:

> These are different package authors, so they likely have different
> opinions on the status of each package.

Sure, but presumably Hye-Shik will have some ideas on this topic as I
know he has been involved with all three codecs in some capacity.

It also doesn't make much sense to distribute multiple implementations
of the same codecs, so presumably CJKCodecs will replace the three
standalone distributions.


From perky@fallin.lv  Wed Jun 11 03:18:36 2003
From: perky@fallin.lv (Hye-Shik Chang)
Date: Wed, 11 Jun 2003 11:18:36 +0900
Subject: [I18n-sig] Re: CJKCodecs 0.9 is released
In-Reply-To: <m2y909zkta.fsf@deacon-blues.mid.mastaler.com>
References: <20030606095332.GA90359@fallin.lv> <m2y909zkta.fsf@deacon-blues.mid.mastaler.com>
Message-ID: <20030611021836.GA87284@fallin.lv>

On Tue, Jun 10, 2003 at 03:25:53PM -0600, Jason R. Mastaler wrote:
> Hye-Shik Chang <perky@fallin.lv> writes:
> 
> > The CJKCodecs is a unified unicode codec set for Chinese, Japanese
> > and Korean encodings.
> 
> Is this packages intended to replace the ChineseCodecs[1],
> KoreanCodecs[2], and JapaneseCodecs[3] packages, which are currently
> available separately?
> 
> I see KoreanCodecs is marked "obsolete", but I see not similar mention
> on the pages for ChineseCodecs and JapaneseCodecs.

Yup. KoreanCodecs will be retired after CJKCodecs 1.0 is released.
And, I don't have permissions to replace the others because I am not
an author of them.

Comparisons for CJKCodecs 1.0 vs {C,J,K}Codecs:

            JapaneseCodecs  ChineseCodecs   KoreanCodecs    CJKCodecs

PEP293          no              no              no              yes

StreamReader    yes             no            partly(1)         yes

StreamWriter    no              no              no              yes

License         BSD             GPL            LGPL             BSD

Last Update   Oct 2002       Nov 2000        Jul 2002     in development
              (1.4.9)        (1.2.0)         (2.0.5)          (0.9)

Source Size    304KB           528KB          224KB           464KB

Binary Size    816KB           616KB          680KB           328KB
(FreeBSD/ia32)

Encodings(C)                   big5                           big5
                               gb2312                         gb2312
                                                              gbk
                                                              gb18030
                                                              cp950
                                                              hz

Encodings(J)   euc-jp                                         euc-jp
               cp932                                          cp932
               iso-2022-jp                                    iso-2022-jp
               iso-2022-jp-1                                  iso-2022-jp-1
                                                              iso-2022-jp-2
                                                              iso-2022-jp-3
                                                              euc-jisx0213
                                                              shift-jisx0213

Encodings(K)                                   euc-kr         euc-kr
                                               cp949          cp949
                                               johab          johab
                                               unijohab(2)
                                               qwerty2bul
                                               mac_korean

Implementation  Pure / C      Pure / C         Pure / C       C only

(1) KoreanCodecs supports 'sane' StreamReader for euc-kr, cp949 and johab
    only.
(2) unijohab, qwerty2bul and mac_korean are quite minor encodings and
    ignorable.


I don't think CJKCodecs can replace Chinese and JapaneseCodecs immediately.
But, CJKCodecs will be remain useful in respect of abililty to support
inter-cjk encodings like ISO-2022-JP-2 and ISO-2022-INT-1.

Thank you for your interests! :)

Regards,
    Hye-Shik =)


From martin@v.loewis.de  Wed Jun 11 06:25:02 2003
From: martin@v.loewis.de (Martin v. =?iso-8859-15?q?L=F6wis?=)
Date: 11 Jun 2003 07:25:02 +0200
Subject: [I18n-sig] Re: CJKCodecs 0.9 is released
In-Reply-To: <20030611021836.GA87284@fallin.lv>
References: <20030606095332.GA90359@fallin.lv>
 <m2y909zkta.fsf@deacon-blues.mid.mastaler.com>
 <20030611021836.GA87284@fallin.lv>
Message-ID: <m31xy19oep.fsf@mira.informatik.hu-berlin.de>

Hye-Shik Chang <perky@fallin.lv> writes:

> I don't think CJKCodecs can replace Chinese and JapaneseCodecs immediately.
> But, CJKCodecs will be remain useful in respect of abililty to support
> inter-cjk encodings like ISO-2022-JP-2 and ISO-2022-INT-1.

This is an interesting summary. Can you produce another comparison,
showing the differences in output of these codecs? Particular
interesting might be cp932, euc-jp, iso-2022-jp, big5, and gb2312.
For these, please find out
a) which characters are encoded in one codec that are not encoded
   in the other (i.e. Unicode code point -> encoding)
b) which characters are decoded in one codec that are not decoded
   in the other (i.e. encoding -> Unicode code point)
c) which characters are encoded differently
d) which characters are decoded differently

Regards,
Martin


From perky@fallin.lv  Wed Jun 11 09:13:01 2003
From: perky@fallin.lv (Hye-Shik Chang)
Date: Wed, 11 Jun 2003 17:13:01 +0900
Subject: [I18n-sig] Re: CJKCodecs 0.9 is released
In-Reply-To: <m31xy19oep.fsf@mira.informatik.hu-berlin.de>
References: <20030606095332.GA90359@fallin.lv> <m2y909zkta.fsf@deacon-blues.mid.mastaler.com> <20030611021836.GA87284@fallin.lv> <m31xy19oep.fsf@mira.informatik.hu-berlin.de>
Message-ID: <20030611081301.GA92933@fallin.lv>

On Wed, Jun 11, 2003 at 07:25:02AM +0200, Martin v. L?wis wrote:
> Hye-Shik Chang <perky@fallin.lv> writes:
> 
> > I don't think CJKCodecs can replace Chinese and JapaneseCodecs immediately.
> > But, CJKCodecs will be remain useful in respect of abililty to support
> > inter-cjk encodings like ISO-2022-JP-2 and ISO-2022-INT-1.
> 
> This is an interesting summary. Can you produce another comparison,
> showing the differences in output of these codecs? Particular
> interesting might be cp932, euc-jp, iso-2022-jp, big5, and gb2312.
> For these, please find out
> a) which characters are encoded in one codec that are not encoded
>    in the other (i.e. Unicode code point -> encoding)
> b) which characters are decoded in one codec that are not decoded
>    in the other (i.e. encoding -> Unicode code point)
> c) which characters are encoded differently
> d) which characters are decoded differently
> 

Legend:
    CJK - CJKCodecs 0.9
    Chinese - ChineseCodecs 1.2.0
    Japanese - JapaneseCodecs 1.4.9
    Korean - KoreanCodecs 2.0.5
    GNU - GNU libiconv 1.8 + iconvcodecs 1.0

1. DECODERS

   1) CJKCodecs' gb2312 versus ChineseCodecs' euc-gb2312-cn

      exactly identical, but ChineseCodecs raises not UnicodeError but
      IndexError for incompleted multibyte sequences.

   2) CJKCodecs' big5 versus ChineseCodecs' big5-tw

                CJK             Chinese         GNU
        a15a    -               fffd            -
        a1c3    -               fffd            -
        a1c5    -               fffd            -
        a1fe    -               fffd            -
        a240    -               fffd            -
        a2cc    -               fffd            -
        a2ce    -               fffd            -

      and, chinesetw.big5 codec has same problem with chinesecn.gb2312

   3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp

                CJK             Japanese        GNU
        01c0    005c            ff3c            ff3c
        f5a1    e000            -               e000    -+ User-Defined Area
        f5a2    e001            -               e001     |
            ....                                         |
        fefd    e3aa            -               e3aa     |
        fefe    e3ab            -               e3ab    -+
        ffa1    e3ac            -               -       -+ CJKCodecs' bug ;)
        ffa2    e3ad            -               -        |
            ....                                         |
        fffd    e408            -               -        |
        fffe    e409            -               -       -+


   4) CJKCodecs' shift-jis versus JapaneseCodecs' shift-jis

                CJK             Japanese        GNU
        005c    00a5            005c            00a5
        007e    203e            007e            203e
        007f    -               007f            007f
        815f    005c            ff3c            ff3c
        817f    -               00d7            -
        837f    -               30df            -
            ....
        9e7f    -               684e            -
        9f7f    -               6bef            -
        a040    -               6f3e            -
        a041    -               6f13            -
            ....
        a0fb    -               74d4            -
        a0fc    -               73f1            -
        e07f    -               70dd            -
        e17f    -               75ff            -
        e27f    -               7ab0            -
            ....
        e97f    -               9a43            -
        ea7f    -               9eef            -
        f040    e000            -               e000    -+ User-Defined Area
        f041    e001            -               e001     |
            ....                                         |
        f9fb    e756            -               e756     |
        f9fc    e757            -               e757    -+

    5) CJKCodecs' cp932 versus JapaneseCodecs' ms932

                CJK             Japanese        GNU
        00a1    -               ff61            ff61    -+ CJKCodecs' bug ;)
        00a2    -               ff62            ff62     |
            ....                                         |
        00de    -               ff9e            ff9e     |
        00df    -               ff9f            ff9f    -+
        8160    ff5e            ff5e            301c
        8161    2225            2225            2016
        817c    ff0d            ff0d            2212
        817f    -               00d7            -
        8191    ffe0            ffe0            00a2
        8192    ffe1            ffe1            00a3
        81ca    ffe2            ffe2            00ac
        837f    -               30df            -
        847f    -               043d            -
            ....
        a0fb    -               74d4            -
        a0fc    -               73f1            -
        e07f    -               70dd            -
        e17f    -               75ff            -
            ....
        e97f    -               9a43            -
        ea7f    -               9eef            -

    6) CJKCodecs' euc-kr versus KoreanCodecs' euc-kr

        exactly identical

    7) CJKCodecs' cp949 versus KoreanCodecs' cp949

        exactly identical


2. ENCODERS

    1) CJKCodecs' gb2312 versus ChineseCodecs' euc-gb2312-cn

       exactly identical

    2) CJKCodecs' big5 versus ChineseCodecs' big5-tw

                CJK             Chinese         GNU
        fffd    a2ce            a2ce            -

    3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp

                CJK             Japanese        GNU
        00a5    -               5c              5c
        203e    -               7e              7e
        e000    f5a1            -               f5a1    -+ User-Defined Area
        e001    f5a2            -               f5a2     |
            ....                                         |
        e3aa    fefd            -               fefd     |
        e3ab    fefe            -               fefe     |
        e3ac    8ff5a1          -               8ff5a1   |
        e3ad    8ff5a2          -               8ff5a2   |
            ....                                         |
        e756    8ffefd          -               8ffefd   |
        e757    8ffefe          -               8ffefe  -+
        ff3c    -               a1c0            a1c0
        ff5e    -               -               8fa2b7

    4) CJKCodecs' shift-jis versus JapaneseCodecs' shift-jis

                CJK             Japanese        GNU
        005c    815f            5c              -
        007e    -               7e              -
        007f    -               7f              7f
        e000    f040            -               f040    -+ User-Defined Area
        e001    f041            -               f041     |
            ....                                         |
        e756    f9fb            -               f9fb     |
        e757    f9fc            -               f9fc    -+
        ff3c    -               815f            815f

    5) CJKCodecs' cp932 versus JapaneseCodecs' ms932

                CJK             Japanese        GNU
        0080    -               80              -
        00a1    -               21              -       -+ latin-1 -> ascii
        00a5    -               5c              -        | fallbacks.
            ....                                         |
        00fe    -               74              -        |
        00ff    -               79              -       -+
        2116    8782            8782            fa59
        2121    8784            8784            fa5a
            ....
        2168    875c            875c            fa52
        2169    875d            875d            fa53
        2170    eeef            fa40            fa40
        2171    eef0            fa41            fa41
            ....
        2178    eef7            fa48            fa48
        2179    eef8            fa49            fa49
        2225    8161            8161            -
        3094    -               8394            -
        3231    878a            878a            fa58
        4e28    ed4c            fa68            fa68
        4ee1    ed4d            fa69            fa69
            ....
        9e19    eeeb            fc4a            fc4a
        9ed1    eeec            fc4b            fc4b
        f8f0    -               a0              -
        f8f1    -               fd              -
        f8f2    -               fe              -
        f8f3    -               ff              -
        f929    edc4            fae0            fae0
        f9dc    eecd            fbe9            fbe9
            ....
        ff02    eefc            fa57            fa57
        ff07    eefb            fa56            fa56
        ff0d    817c            817c            -
        ff5e    8160            8160            -
        ffe0    8191            8191            -
        ffe1    8192            8192            -
        ffe2    81ca            81ca            fa54
        ffe4    eefa            fa55            fa55

    6) CJKCodecs' euc-kr versus KoreanCodecs' euc-kr

        exactly identical

    7) CJKCodecs' cp949 versus KoreanCodecs' cp949

        exactly identical


Okay, the comparison says that we need some discussions on the mappings.
I'll fix CJKCodecs' bugs as soon as possible and welcome any opionions
about mapping inconsistencies. :)


Regards,
    Hye-Shik =)


From martin@v.loewis.de  Wed Jun 11 21:34:15 2003
From: martin@v.loewis.de (Martin v. =?iso-8859-15?q?L=F6wis?=)
Date: 11 Jun 2003 22:34:15 +0200
Subject: [I18n-sig] Re: CJKCodecs 0.9 is released
In-Reply-To: <20030611081301.GA92933@fallin.lv>
References: <20030606095332.GA90359@fallin.lv>
 <m2y909zkta.fsf@deacon-blues.mid.mastaler.com>
 <20030611021836.GA87284@fallin.lv>
 <m31xy19oep.fsf@mira.informatik.hu-berlin.de>
 <20030611081301.GA92933@fallin.lv>
Message-ID: <m3of141hh4.fsf@mira.informatik.hu-berlin.de>

Hye-Shik Chang <perky@fallin.lv> writes:

> Legend:
>     CJK - CJKCodecs 0.9
>     Chinese - ChineseCodecs 1.2.0
>     Japanese - JapaneseCodecs 1.4.9
>     Korean - KoreanCodecs 2.0.5
>     GNU - GNU libiconv 1.8 + iconvcodecs 1.0

Very interesting, again. I have some problems interpreting the data,
though.

>    2) CJKCodecs' big5 versus ChineseCodecs' big5-tw
> 
>                 CJK             Chinese         GNU
>         a15a    -               fffd            -
>         a1c3    -               fffd            -
>         a1c5    -               fffd            -
>         a1fe    -               fffd            -
>         a240    -               fffd            -
>         a2cc    -               fffd            -
>         a2ce    -               fffd            -

What does that mean? CJK and iconv gives UnicodeError, whereas
ChineseCodecs puts in the replacement character? Seems like a bug in
ChineseCodecs to me, doesn't it? The replacement character should only
be generated if errors='replace', no?

>    3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp
> 
>                 CJK             Japanese        GNU
>         01c0    005c            ff3c            ff3c

That appears to be a bug in CJK, right? This is the question whether
/xa1/xc0 is REVERSE SOLIDUS or FULLWIDTH REVERSE SOLIDUS.  Now, it
appears that euc-jp also supports /x5c, mapped to REVERSE SOLIDUS,
and that /xa1/xc0 should be interpreted as FULLWIDTH REVERSE SOLIDUS,
no?

In case of doubt, I think ICU should be consulted for reference, as
well, and following some kind of majority. In any case, I think the
questionable mappings need to be documented.

>         f5a1    e000            -               e000    -+ User-Defined Area
>         f5a2    e001            -               e001     |
>             ....                                         |
>         fefd    e3aa            -               e3aa     |
>         fefe    e3ab            -               e3ab    -+

What are these? I cannot find them in glibc.

>    4) CJKCodecs' shift-jis versus JapaneseCodecs' shift-jis
> 
>                 CJK             Japanese        GNU
>         005c    00a5            005c            00a5

Here, I would trust GNU iconv; 5C really is YEN SIGN.

>         007e    203e            007e            203e

Likewise for OVERLINE - is there really no TILDE in shift-jis?

>         007f    -               007f            007f

Why that?

>         815f    005c            ff3c            ff3c

Again: Why that? Shouldn't /x81/x5f be FULLWIDTH REVERSE SOLIDUS?

>         817f    -               00d7            -

What character is that? Why does JapaneseCodecs map it to
MULTIPLICATION SIGN? glibc seems to map that to /x81/x7e;
is that a typo in JapaneseCodecs?

>         837f    -               30df            -

Are you sure glibc does not support that? Seems to be
KATAKANA LETTER MI.


>         9e7f    -               684e            -

Why does JapaneseCodecs do that? glibc maps 9e7e to 684e.

>         f040    e000            -               e000    -+ User-Defined Area
>         f041    e001            -               e001     |
>             ....                                         |
>         f9fb    e756            -               e756     |
>         f9fc    e757            -               e757    -+

Again: What are these characters?

>     5) CJKCodecs' cp932 versus JapaneseCodecs' ms932
> 
>                 CJK             Japanese        GNU
>         8160    ff5e            ff5e            301c
>         8161    2225            2225            2016

Are these verified against MS CP932, e.g. from Windows XP?

>         817f    -               00d7            -

Likewise: For CP932, it seems essential to do whatever Microsoft does,
in any Windows version.

>     2) CJKCodecs' big5 versus ChineseCodecs' big5-tw
> 
>                 CJK             Chinese         GNU
>         fffd    a2ce            a2ce            -

BIG-5 has the notion of a replacement character????

>     3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp
> 
>                 CJK             Japanese        GNU
>         00a5    -               5c              5c

That seems wrong, too.

>         203e    -               7e              7e

Likewise.

> Okay, the comparison says that we need some discussions on the mappings.
> I'll fix CJKCodecs' bugs as soon as possible and welcome any opionions
> about mapping inconsistencies. :)

I'm too lazy now to review the other encodings. I'd encourage you to
consult ICU for established procedures, and to document the cases
where you pick one of the possible alternatives. I do hope that this
set of codecs becomes part of standard Python one day, at which point
we really need to document what exactly they do.

Regards,
Martin


From JasonR.Mastaler  Thu Jun 12 04:13:36 2003
From: JasonR.Mastaler (JasonR.Mastaler)
Date: Wed, 11 Jun 2003 21:13:36 -0600
Subject: [I18n-sig] Re: CJKCodecs 0.9 is released
References: <20030606095332.GA90359@fallin.lv> <m2y909zkta.fsf@deacon-blues.mid.mastaler.com>
 <20030611021836.GA87284@fallin.lv>
Message-ID: <m23cigt2cf.fsf@deacon-blues.mid.mastaler.com>

Hye-Shik Chang <perky@fallin.lv> writes:

> Yup. KoreanCodecs will be retired after CJKCodecs 1.0 is released.
> And, I don't have permissions to replace the others because I am not
> an author of them.

Have you contacted the authors of {C,J}Codecs?  Perhaps they would be
willing to contribute to your package in order to retire theirs?

> I don't think CJKCodecs can replace Chinese and JapaneseCodecs
> immediately.

Why is this exactly, because of bugs in CJKCodecs?  Your comparison
chart showed that CJKCodecs supports at least all the codecs that
{C,J,K}Codecs do combined.


From perky@fallin.lv  Thu Jun 12 05:46:34 2003
From: perky@fallin.lv (Hye-Shik Chang)
Date: Thu, 12 Jun 2003 13:46:34 +0900
Subject: [I18n-sig] iconvcodec 1.1 is released
Message-ID: <20030612044634.GA10477@fallin.lv>

Hi, i18n-goodies!

 I just released iconvcodec 1.1 and available for download at:

  http://sourceforge.net/project/showfiles.php?group_id=46747


 Changes from 1.0 are the following:

 - Enabled ISO-10646-2 extended planes by using Surrogate-Pair
   on ucs2-python
 - Now, users can add 'iconvcodec.' prefix before encoding names
   to exclude another lookup functions. (eg: iconvcodec.utf-8)
 - Fixed a syntax error around #if block [Changwoo Ryu]
 - Added a workaround to compile it with MinGW32 [Young-Sik Won]

 And, win32 binary distribution's iconv library is upgraded to
 GNU libiconv 1.9.1. Therefore, win32 binary is released under LGPL.
 (GNU libiconv 1.9.1 supports JIS X 0213 encodings, yay!)

 Thank you for listening!

Regards,
    Hye-Shik =)


From kajiyama@grad.sccs.chukyo-u.ac.jp  Thu Jun 12 06:28:26 2003
From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA)
Date: Thu, 12 Jun 2003 14:28:26 +0900
Subject: [I18n-sig] Re: CJKCodecs 0.9 is released
In-Reply-To: <m23cigt2cf.fsf@deacon-blues.mid.mastaler.com>
 (jason@mastaler.com)
References: <m23cigt2cf.fsf@deacon-blues.mid.mastaler.com>
Message-ID: <200306120528.h5C5SQm09184@grad.sccs.chukyo-u.ac.jp>

"Jason R. Mastaler" <jason@mastaler.com> writes:
|
| Hye-Shik Chang <perky@fallin.lv> writes:
| 
| > Yup. KoreanCodecs will be retired after CJKCodecs 1.0 is released.
| > And, I don't have permissions to replace the others because I am not
| > an author of them.
| 
| Have you contacted the authors of {C,J}Codecs?  Perhaps they would be
| willing to contribute to your package in order to retire theirs?

I've been on this list, watching what's going on.  However, I'm
busy and don't have enough time to commit to the development of
both JapaneseCodecs and CJKCodecs.  Excuse me for inconvenience.

Regards,

-- 
KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>


From alex@lisa.org  Wed Jun 18 13:37:23 2003
From: alex@lisa.org (Alex Lam)
Date: Wed, 18 Jun 2003 14:37:23 +0200
Subject: [I18n-sig] Software Testing and Internationalization - Free book by LISA/Lemoine International
Message-ID: <F3C87F64145B734895A6736F2805ABB35C9848@hermes.usa-lisa.org>

Dear colleague,

LISA, in collaboration with Lemoine International has made "Software
Testing and Internationalization"  by Galileo Computing freely available
for download.

This 330 page book will transform how you view testing methodologies and
procedures. It introduces the reader to essential concepts and
approaches used by practitioners in the software testing arena, while
also taking into account the realities of low budgets and real schedule
deadlines. It is in this context that the specific needs of small, agile
project teams are covered in detail.

Topics covered:

    * New approaches to quality
    * Risk analysis and evaluation
    * Risk-based testing
    * Exploratory testing
    * Testing and tuning
    * Testing by using
    * Use cases, requirements, and test cases
    * Debugging
    * Myths and realities of Automated Testing
    * Windows scripting
    * Test frameworks
    * Testing-based application development
    * Tools for developers and testers
    * Agile test management
    * International planning and architecture
    * International development issues
    * Internationalization testing

To download a copy, please visit
http://www.lisa.org/interact/2003/swtestregister.html


Founded in 1990 as a non-profit association, LISA is the premier
organization for the GILT (Globalization, Internationalization,
Localization, and Translation) business communities. Over 400 leading IT
manufacturers and solutions providers, along with industry professionals
and an increasing number of vertical market corporations with an
international business focus, have helped establish LISA best practice
guidelines and language-technology standards for enterprise
globalization.


From perky@i18n.org  Thu Jun 19 21:40:31 2003
From: perky@i18n.org (Hye-Shik Chang)
Date: Fri, 20 Jun 2003 05:40:31 +0900
Subject: [I18n-sig] Re: CJKCodecs 0.9 is released
In-Reply-To: <m3of141hh4.fsf@mira.informatik.hu-berlin.de>
References: <20030606095332.GA90359@fallin.lv> <m2y909zkta.fsf@deacon-blues.mid.mastaler.com> <20030611021836.GA87284@fallin.lv> <m31xy19oep.fsf@mira.informatik.hu-berlin.de> <20030611081301.GA92933@fallin.lv> <m3of141hh4.fsf@mira.informatik.hu-berlin.de>
Message-ID: <20030619204031.GA62833@i18n.org>

On Wed, Jun 11, 2003 at 10:34:15PM +0200, Martin v. L?wis wrote:
> Hye-Shik Chang <perky@fallin.lv> writes:
[snip]
> >    2) CJKCodecs' big5 versus ChineseCodecs' big5-tw
> > 
> >                 CJK             Chinese         GNU
> >         a15a    -               fffd            -
> >         a1c3    -               fffd            -
> >         a1c5    -               fffd            -
> >         a1fe    -               fffd            -
> >         a240    -               fffd            -
> >         a2cc    -               fffd            -
> >         a2ce    -               fffd            -
> 
> What does that mean? CJK and iconv gives UnicodeError, whereas
> ChineseCodecs puts in the replacement character? Seems like a bug in
> ChineseCodecs to me, doesn't it? The replacement character should only
> be generated if errors='replace', no?

According Unicode.org's mapping:
] A number of characters are not currently mapped because
]         of conflicts with other mappings.  They are as follows:
] 
] BIG5        Description                    Comments
] 
] 0xA15A      SPACING UNDERSCORE             duplicates A1C4
] 0xA1C3      SPACING HEAVY OVERSCORE        not in Unicode
] 0xA1C5      SPACING HEAVY UNDERSCORE       not in Unicode
] 0xA1FE      LT DIAG UP RIGHT TO LOW LEFT   duplicates A2AC
] 0xA240      LT DIAG UP LEFT TO LOW RIGHT   duplicates A2AD
] 0xA2CC      HANGZHOU NUMERAL TEN           conflicts with A451 mapping
] 0xA2CE      HANGZHOU NUMERAL THIRTY        conflicts with A4CA mapping
] 
] We currently map all of these characters to U+FFFD REPLACEMENT CHARACTER.
]         It is also possible to map these characters to their duplicates, or to
]         the user zone.

So, I changed mapping for them to as cp950 does instead of U+FFFD or
user-defined area. I think that's affordable.

BIG5        Unicode     Description

0xA15A      0x2574      SPACING UNDERSCORE
0xA1C3      0xFFE3      SPACING HEAVY OVERSCORE
0xA1C5      0x02CD      SPACING HEAVY UNDERSCORE
0xA1FE      0xFF0F      LT DIAG UP RIGHT TO LOW LEFT
0xA240      0xFF3C      LT DIAG UP LEFT TO LOW RIGHT
0xA2CC      0x5341      HANGZHOU NUMERAL TEN
0xA2CE      0x5345      HANGZHOU NUMERAL THIRTY

> 
> >    3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp
> > 
> >                 CJK             Japanese        GNU
> >         01c0    005c            ff3c            ff3c
> 
> That appears to be a bug in CJK, right? This is the question whether
> /xa1/xc0 is REVERSE SOLIDUS or FULLWIDTH REVERSE SOLIDUS.  Now, it
> appears that euc-jp also supports /x5c, mapped to REVERSE SOLIDUS,
> and that /xa1/xc0 should be interpreted as FULLWIDTH REVERSE SOLIDUS,
> no?

Right. That makes sense.

> 
> In case of doubt, I think ICU should be consulted for reference, as
> well, and following some kind of majority. In any case, I think the
> questionable mappings need to be documented.
> 
> >         f5a1    e000            -               e000    -+ User-Defined Area
> >         f5a2    e001            -               e001     |
> >             ....                                         |
> >         fefd    e3aa            -               e3aa     |
> >         fefe    e3ab            -               e3ab    -+
> 
> What are these? I cannot find them in glibc.

Quoting Ken Lunde's CJKV Information Processing p.206 table 4-66:
] Table 4-66: Shift-JIS to Unicode and EUC-JP for User-Defined Region
]
] Shift-JIS     Unicode     EUC-JP
] F040-F0FC     E000-E0BB   F5A1-F5FE, F6A1-F6FE
] F140-F1FC     E0BC-E177   F7A1-F7FE, F8A1-F8FE
] F240-F2FC     E178-E233   F9A1-F9FE, FAA1-FAFE
] --snip--
] F940-F9FC     E69C-E757   8FFDA1-8FFDFE, 8FFEA1-8FFEFE

> 
> >    4) CJKCodecs' shift-jis versus JapaneseCodecs' shift-jis
> > 
> >                 CJK             Japanese        GNU
> >         005c    00a5            005c            00a5
> 
> Here, I would trust GNU iconv; 5C really is YEN SIGN.
> 
> >         007e    203e            007e            203e
> 
> Likewise for OVERLINE - is there really no TILDE in shift-jis?

:)

> 
> >         007f    -               007f            007f
> 
> Why that?

That's a bug of CJK. fixed.

> 
> >         815f    005c            ff3c            ff3c
> 
> Again: Why that? Shouldn't /x81/x5f be FULLWIDTH REVERSE SOLIDUS?

Then, shift-jis will lose a *reserse solidus*. And, even Unicode.org's
mapping did:
] sjis   jisx0208 unicode
] 0x815F  0x2140  0x005C  # REVERSE SOLIDUS


> 
> >         817f    -               00d7            -
> 
> What character is that? Why does JapaneseCodecs map it to
> MULTIPLICATION SIGN? glibc seems to map that to /x81/x7e;
> is that a typo in JapaneseCodecs?
> 
> >         837f    -               30df            -
> 
> Are you sure glibc does not support that? Seems to be
> KATAKANA LETTER MI.
> 
> 
> >         9e7f    -               684e            -
> 
> Why does JapaneseCodecs do that? glibc maps 9e7e to 684e.

They are not in shift-jis's byte range. I guess that JapaneseCodecs'
SJIS->EUC macro has a bug around them.

> 
> >     5) CJKCodecs' cp932 versus JapaneseCodecs' ms932
> > 
> >                 CJK             Japanese        GNU
> >         8160    ff5e            ff5e            301c
> >         8161    2225            2225            2016
> 
> Are these verified against MS CP932, e.g. from Windows XP?
> 
> >         817f    -               00d7            -
> 
> Likewise: For CP932, it seems essential to do whatever Microsoft does,
> in any Windows version.

Okay. Here it is! :)

        CJK    Japanese GNU     WindowsXP
0080    -       -       -       0080
00a0    -       -       -       f8f0
00fd    -       -       -       f8f1
00fe    -       -       -       f8f2
00ff    -       -       -       f8f3
8160    ff5e    ff5e    301c    ff5e
8161    2225    2225    2016    2225
817c    ff0d    ff0d    2212    ff0d
817f    -       00d7    -       -
8191    ffe0    ffe0    00a2    ffe0
8192    ffe1    ffe1    00a3    ffe1
81ca    ffe2    ffe2    00ac    ffe2
837f    -       30df    -       -
847f    -       043d    -       -
    ....
e97f    -       9a43    -       -
ea7f    -       9eef    -       -

I'll add 0x80, 0xa0, 0xfd, 0xfe, 0xff to CJKCodecs's cp932 to conform
Windows's real mapping.

> 
> >     2) CJKCodecs' big5 versus ChineseCodecs' big5-tw
> > 
> >                 CJK             Chinese         GNU
> >         fffd    a2ce            a2ce            -
> 
> BIG-5 has the notion of a replacement character????

mentioned above.

> 
> >     3) CJKCodecs' euc-jp versus JapaneseCodecs' euc-jp
> > 
> >                 CJK             Japanese        GNU
> >         00a5    -               5c              5c
> 
> That seems wrong, too.
> 
> >         203e    -               7e              7e
> 
> Likewise.
> 
> > Okay, the comparison says that we need some discussions on the mappings.
> > I'll fix CJKCodecs' bugs as soon as possible and welcome any opionions
> > about mapping inconsistencies. :)
> 
> I'm too lazy now to review the other encodings. I'd encourage you to
> consult ICU for established procedures, and to document the cases
> where you pick one of the possible alternatives. I do hope that this
> set of codecs becomes part of standard Python one day, at which point
> we really need to document what exactly they do.

Thank you for the comments. Your suggestions were very helpful to
make CJKCodecs saner.

> 
> Regards,
> Martin
> 
> 


Regards,
    Hye-Shik =)


From kajiyama@grad.sccs.chukyo-u.ac.jp  Thu Jun 19 21:49:40 2003
From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA)
Date: Fri, 20 Jun 2003 05:49:40 +0900
Subject: [I18n-sig] Re: CJKCodecs 0.9 is released
In-Reply-To: <20030619204031.GA62833@i18n.org> (message from Hye-Shik Chang on
 Fri, 20 Jun 2003 05:40:31 +0900)
References: <20030619204031.GA62833@i18n.org>
Message-ID: <200306192049.h5JKnef08655@grad.sccs.chukyo-u.ac.jp>

Hye-Shik Chang <perky@i18n.org> writes:
| 
| > >         817f    -               00d7            -
| > 
| > What character is that? Why does JapaneseCodecs map it to
| > MULTIPLICATION SIGN? glibc seems to map that to /x81/x7e;
| > is that a typo in JapaneseCodecs?
| > 
| > >         837f    -               30df            -
| > 
| > Are you sure glibc does not support that? Seems to be
| > KATAKANA LETTER MI.
| > 
| > 
| > >         9e7f    -               684e            -
| > 
| > Why does JapaneseCodecs do that? glibc maps 9e7e to 684e.
| 
| They are not in shift-jis's byte range. I guess that JapaneseCodecs'
| SJIS->EUC macro has a bug around them.

Exactly.  I'll fix it in the next release of JapaneseCodecs.

Thanks,

-- 
KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>


From martin@v.loewis.de  Sat Jun 21 17:53:05 2003
From: martin@v.loewis.de (Martin v. =?iso-8859-15?q?L=F6wis?=)
Date: 21 Jun 2003 18:53:05 +0200
Subject: [I18n-sig] Re: CJKCodecs 0.9 is released
In-Reply-To: <20030619204031.GA62833@i18n.org>
References: <20030606095332.GA90359@fallin.lv>
 <m2y909zkta.fsf@deacon-blues.mid.mastaler.com>
 <20030611021836.GA87284@fallin.lv>
 <m31xy19oep.fsf@mira.informatik.hu-berlin.de>
 <20030611081301.GA92933@fallin.lv>
 <m3of141hh4.fsf@mira.informatik.hu-berlin.de>
 <20030619204031.GA62833@i18n.org>
Message-ID: <m3adcbl6dq.fsf@mira.informatik.hu-berlin.de>

Hye-Shik Chang <perky@i18n.org> writes:

> So, I changed mapping for them to as cp950 does instead of U+FFFD or
> user-defined area. I think that's affordable.

Indeed. I'd encourage you to list all "critical" cases in the
documentation of your package. This is all tricky stuff, and opinions
vary widely. So users should be able to find out up-front what they
get - they are much more angry if they find out by surprise.

> Quoting Ken Lunde's CJKV Information Processing p.206 table 4-66:
> ] Table 4-66: Shift-JIS to Unicode and EUC-JP for User-Defined Region
> ]
> ] Shift-JIS     Unicode     EUC-JP
> ] F040-F0FC     E000-E0BB   F5A1-F5FE, F6A1-F6FE
> ] F140-F1FC     E0BC-E177   F7A1-F7FE, F8A1-F8FE
> ] F240-F2FC     E178-E233   F9A1-F9FE, FAA1-FAFE
> ] --snip--
> ] F940-F9FC     E69C-E757   8FFDA1-8FFDFE, 8FFEA1-8FFEFE

Is this really necessary? Using PUA characters is evil, IMO, and
should be avoided unless explicitly requested by the application.  If
those characters are not supported in Unicode, they can't be really
important, no?

Or, are you sure that they are still unsupported in Unicode 4.0?

> Okay. Here it is! :)
> 
>         CJK    Japanese GNU     WindowsXP
> 0080    -       -       -       0080
> 00a0    -       -       -       f8f0
> 00fd    -       -       -       f8f1
> 00fe    -       -       -       f8f2
> 00ff    -       -       -       f8f3
> 
> I'll add 0x80, 0xa0, 0xfd, 0xfe, 0xff to CJKCodecs's cp932 to conform
> Windows's real mapping.

This is, in fact, a place where the mapping-to-PUA might be
acceptable. CP932 is Microsoft's "private" encoding, anyway, so they
set the rules :-(

Regards,
Martin


From tree@basistech.com  Sat Jun 21 18:21:16 2003
From: tree@basistech.com (Tom Emerson)
Date: Sat, 21 Jun 2003 13:21:16 -0400
Subject: [I18n-sig] Re: CJKCodecs 0.9 is released
In-Reply-To: <m3adcbl6dq.fsf@mira.informatik.hu-berlin.de>
References: <20030606095332.GA90359@fallin.lv>
 <m2y909zkta.fsf@deacon-blues.mid.mastaler.com>
 <20030611021836.GA87284@fallin.lv>
 <m31xy19oep.fsf@mira.informatik.hu-berlin.de>
 <20030611081301.GA92933@fallin.lv>
 <m3of141hh4.fsf@mira.informatik.hu-berlin.de>
 <20030619204031.GA62833@i18n.org>
 <m3adcbl6dq.fsf@mira.informatik.hu-berlin.de>
Message-ID: <16116.37900.202469.431556@magrathea.basistech.com>

Martin v. L=F6wis writes:
[...]
> > Quoting Ken Lunde's CJKV Information Processing p.206 table 4-66:
> > ] Table 4-66: Shift-JIS to Unicode and EUC-JP for User-Defined Regi=
on
> > ]
> > ] Shift-JIS     Unicode     EUC-JP
> > ] F040-F0FC     E000-E0BB   F5A1-F5FE, F6A1-F6FE
> > ] F140-F1FC     E0BC-E177   F7A1-F7FE, F8A1-F8FE
> > ] F240-F2FC     E178-E233   F9A1-F9FE, FAA1-FAFE
> > ] --snip--
> > ] F940-F9FC     E69C-E757   8FFDA1-8FFDFE, 8FFEA1-8FFEFE
>=20
> Is this really necessary=3F Using PUA characters is evil, IMO, and
> should be avoided unless explicitly requested by the application.  If=

> those characters are not supported in Unicode, they can't be really
> important, no=3F

Yes, it is really necessary.

If you want to round trip these encodings then you need to map the
UDR's of the various legacy encodings into the PUA and back again. If
you don't then you can and will loose data.

For Japanese encodings there are numerous corporate extensions to
Shift JIS, as well as the various emoticons and other dingbats
introduced for use with iMode and other phones.=20

It is a much bigger issue for the Chinese encodings: extensions to Big
Five (CP950, ETen, GCCS, HKSCS, etc.) are done in the UDR and VDR
parts of the encoding space. Unfortunately you rarely if ever see such
documents identified as ETen or CP950 or HKSCS: just as Big
Five. Since you cannot easily detect which of thse variants are in use
you need to round trip the UDRs/VDRs through the PUA.

> Or, are you sure that they are still unsupported in Unicode 4.0=3F

In the case of HKSCS all but 3 characters are defined in Planes 0 and
2. However, as I mentioned above, if you do not know that your file
claiming Big Five is really HKSCS then you can't map the UDR/VDR
sections appropriately.

Oh, and Microsoft defines CP950 as different things depending on
whether the file is from Taiwan or Hong Kong.

The latest issue faced with transcoding between legacy Asian encodings
(especially JIS X 0213) and Unicode is the interpretation of
compatibility characters and how strictly you want to enforce the
rules laid out by TUC.

    -tree

--=20
Tom Emerson                                          Basis Technology C=
orp.
Software Architect                                 http://www.basistech=
.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever=
"


From martin@v.loewis.de  Sat Jun 21 20:26:53 2003
From: martin@v.loewis.de (Martin v. =?iso-8859-15?q?L=F6wis?=)
Date: 21 Jun 2003 21:26:53 +0200
Subject: [I18n-sig] Re: CJKCodecs 0.9 is released
In-Reply-To: <16116.37900.202469.431556@magrathea.basistech.com>
References: <20030606095332.GA90359@fallin.lv>
 <m2y909zkta.fsf@deacon-blues.mid.mastaler.com>
 <20030611021836.GA87284@fallin.lv>
 <m31xy19oep.fsf@mira.informatik.hu-berlin.de>
 <20030611081301.GA92933@fallin.lv>
 <m3of141hh4.fsf@mira.informatik.hu-berlin.de>
 <20030619204031.GA62833@i18n.org>
 <m3adcbl6dq.fsf@mira.informatik.hu-berlin.de>
 <16116.37900.202469.431556@magrathea.basistech.com>
Message-ID: <m3el1njkoy.fsf@mira.informatik.hu-berlin.de>

Tom Emerson <tree@basistech.com> writes:

> For Japanese encodings there are numerous corporate extensions to
> Shift JIS, as well as the various emoticons and other dingbats
> introduced for use with iMode and other phones. 

That only tells me that mapping to the PUA is most likely incorrect,
though:

Are these corporate extensions well-specified? Are they
non-overlapping?

If yes, I think a "proper" mapping should be found. For example, many
emoticons and dingbats are supported in Unicode 4.0, and should be
used instead of the PUA.

If no, I feel that these characters just shouldn't round-trip. There
would bo no loss of data. Instead, users would get a UnicodeError,
indicating that some characters just can't be converted to Unicode.

Now, there might be certain applications where this is not
acceptable. For many of these applications, it is the runtime error
that is not acceptable, not a possible loss of data in rare cases. For
these cases, the 'replace' processing of Python codecs seems
appropriate.

For a small number of applications, round-tripping is important enough
even if it means to use the PUA. It is important that authors of these
applications understand that they can *only* convert back the results
to the original encoding, and not to some other encoding - e.g. it is
incorrect to encode the Unicode strings as UTF-8, for use in HTML.

Authors of these applications would need to specify that they
understand all that, e.g. by using a different codec name (e.g. a
'+pua' suffix)

> It is a much bigger issue for the Chinese encodings: extensions to Big
> Five (CP950, ETen, GCCS, HKSCS, etc.) are done in the UDR and VDR
> parts of the encoding space. Unfortunately you rarely if ever see such
> documents identified as ETen or CP950 or HKSCS: just as Big
> Five. Since you cannot easily detect which of thse variants are in use
> you need to round trip the UDRs/VDRs through the PUA.

Again, assuming it is round-tripping that you are after. Many Python
Unicode applications don't do round-tripping. Instead, they convert
the input to some other encoding (put it into a database, output
UTF-8, output XML character references). This is a perfect recipe for
moji-bake.

> In the case of HKSCS all but 3 characters are defined in Planes 0 and
> 2. However, as I mentioned above, if you do not know that your file
> claiming Big Five is really HKSCS then you can't map the UDR/VDR
> sections appropriately.

Can you give an example where using the HKSCS codec for decoding would
be incorrect?

> Oh, and Microsoft defines CP950 as different things depending on
> whether the file is from Taiwan or Hong Kong.

That sounds like one needs two versions of cp950...

In any case, for MS code pages, I think a Python codec should do
exactly what MS does. If that involves PUA, oh well, atleast the
moji-bake will be consistent with what Microsoft produces, so MSIE
might even render it correctly..

Regards,
Martin


From tree@basistech.com  Sat Jun 21 20:46:43 2003
From: tree@basistech.com (Tom Emerson)
Date: Sat, 21 Jun 2003 15:46:43 -0400
Subject: [I18n-sig] Re: CJKCodecs 0.9 is released
In-Reply-To: <m3el1njkoy.fsf@mira.informatik.hu-berlin.de>
References: <20030606095332.GA90359@fallin.lv>
 <m2y909zkta.fsf@deacon-blues.mid.mastaler.com>
 <20030611021836.GA87284@fallin.lv>
 <m31xy19oep.fsf@mira.informatik.hu-berlin.de>
 <20030611081301.GA92933@fallin.lv>
 <m3of141hh4.fsf@mira.informatik.hu-berlin.de>
 <20030619204031.GA62833@i18n.org>
 <m3adcbl6dq.fsf@mira.informatik.hu-berlin.de>
 <16116.37900.202469.431556@magrathea.basistech.com>
 <m3el1njkoy.fsf@mira.informatik.hu-berlin.de>
Message-ID: <16116.46627.374924.972713@magrathea.basistech.com>

Martin v. L=F6wis writes:
> Tom Emerson <tree@basistech.com> writes:
> > For Japanese encodings there are numerous corporate extensions to
> > Shift JIS, as well as the various emoticons and other dingbats
> > introduced for use with iMode and other phones.=20
>=20
> That only tells me that mapping to the PUA is most likely incorrect,
> though:
>=20
> Are these corporate extensions well-specified=3F Are they
> non-overlapping=3F

Well specified=3F Sure, there are specifications.

Non-overlapping=3F Of course not: each corporate extension starts at th=
e
same point in the user-defined regions of the legacy encoding.

> If yes, I think a "proper" mapping should be found. For example, many=

> emoticons and dingbats are supported in Unicode 4.0, and should be
> used instead of the PUA.

Absolutely you should, but these characters have to be proposed to the
Unicode Consortium, and in the case ideographs, to the IRG of ISO
10646.

> If no, I feel that these characters just shouldn't round-trip. There
> would bo no loss of data. Instead, users would get a UnicodeError,
> indicating that some characters just can't be converted to Unicode.

This is a rediculously pedantic approach that will end up pissing
people off: the PUA in Unicode is designed for this purpose, so it
should be used.

> Now, there might be certain applications where this is not
> acceptable. For many of these applications, it is the runtime error
> that is not acceptable, not a possible loss of data in rare cases. Fo=
r
> these cases, the 'replace' processing of Python codecs seems
> appropriate.

Data loss is a problem. Customers get very upset when their data gets
munged for no good reason.

> For a small number of applications, round-tripping is important enoug=
h
> even if it means to use the PUA. It is important that authors of thes=
e
> applications understand that they can *only* convert back the results=

> to the original encoding, and not to some other encoding - e.g. it is=

> incorrect to encode the Unicode strings as UTF-8, for use in HTML.

Where does it say you cannot cannot encode PUA characters in UTF-8=3F I=
f
you have a custom font that handles these code points, then you are
going to be upset that you can't display them because the author of
the codec decided that PUA characters are an abomination that should
be striken from the earth.

> Authors of these applications would need to specify that they
> understand all that, e.g. by using a different codec name (e.g. a
> '+pua' suffix)

So then you get a pile of ShiftJIS encodings, those that round trip,
those that don't.

> Again, assuming it is round-tripping that you are after. Many Python
> Unicode applications don't do round-tripping. Instead, they convert
> the input to some other encoding (put it into a database, output
> UTF-8, output XML character references). This is a perfect recipe for=

> moji-bake.

I disagree that this is a recipe for moji-bake. If I'm stuffing values
into a database PUA may be the only thing we can do. I do not want my
ShiftJIS extension characters being replaced with U+FFFD.

> > In the case of HKSCS all but 3 characters are defined in Planes 0 a=
nd
> > 2. However, as I mentioned above, if you do not know that your file=

> > claiming Big Five is really HKSCS then you can't map the UDR/VDR
> > sections appropriately.
>=20
> Can you give an example where using the HKSCS codec for decoding woul=
d
> be incorrect=3F

I can dig up the three characters that are not encoded in Unicode: I
don't have the latest HKSCS at home. But again, if you do not know you
are looking at HKSCS, you loose.

> > Oh, and Microsoft defines CP950 as different things depending on
> > whether the file is from Taiwan or Hong Kong.
>=20
> That sounds like one needs two versions of cp950...

Sure, if you know which version you are dealing which you may not.

> In any case, for MS code pages, I think a Python codec should do
> exactly what MS does. If that involves PUA, oh well, atleast the
> moji-bake will be consistent with what Microsoft produces, so MSIE
> might even render it correctly..

Yes, well, it can be a fulltime job to keep up to date with
Microsoft's ever-changing mapping tables.

Peace,

   tree

--=20
Tom Emerson                                          Basis Technology C=
orp.
Software Architect                                 http://www.basistech=
.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever=
"


From martin@v.loewis.de  Sat Jun 21 22:16:22 2003
From: martin@v.loewis.de (Martin v. =?iso-8859-15?q?L=F6wis?=)
Date: 21 Jun 2003 23:16:22 +0200
Subject: [I18n-sig] Re: CJKCodecs 0.9 is released
In-Reply-To: <16116.46627.374924.972713@magrathea.basistech.com>
References: <20030606095332.GA90359@fallin.lv>
 <m2y909zkta.fsf@deacon-blues.mid.mastaler.com>
 <20030611021836.GA87284@fallin.lv>
 <m31xy19oep.fsf@mira.informatik.hu-berlin.de>
 <20030611081301.GA92933@fallin.lv>
 <m3of141hh4.fsf@mira.informatik.hu-berlin.de>
 <20030619204031.GA62833@i18n.org>
 <m3adcbl6dq.fsf@mira.informatik.hu-berlin.de>
 <16116.37900.202469.431556@magrathea.basistech.com>
 <m3el1njkoy.fsf@mira.informatik.hu-berlin.de>
 <16116.46627.374924.972713@magrathea.basistech.com>
Message-ID: <m3isqzi121.fsf@mira.informatik.hu-berlin.de>

Tom Emerson <tree@basistech.com> writes:

> This is a rediculously pedantic approach that will end up pissing
> people off: the PUA in Unicode is designed for this purpose, so it
> should be used.

It is fine if users are aware that this happens. If they are not, they
will be pissed off when they find out.

> Where does it say you cannot cannot encode PUA characters in UTF-8? If
> you have a custom font that handles these code points, then you are
> going to be upset that you can't display them because the author of
> the codec decided that PUA characters are an abomination that should
> be striken from the earth.

And if you don't have such a font, you will see some replacement
characters.

A lot of things need to be in place for this to work
correctly. Developers need to make sure things all are in place, and
need to ask the libraries to work to how they put them.

> I disagree that this is a recipe for moji-bake. If I'm stuffing values
> into a database PUA may be the only thing we can do. I do not want my
> ShiftJIS extension characters being replaced with U+FFFD.

Now, if your font was meant for a different proprietary extension that
happens to use the same private characters, you get incorrect
display. Right? Likewise, if some other application reads out the
data, and interprets the private characters in a different way.

Private characters should never leave the scope of "the application",
and some effort should be done to make sure they don't leak out of
"the application".

> > Can you give an example where using the HKSCS codec for decoding would
> > be incorrect?
> 
> I can dig up the three characters that are not encoded in Unicode: I
> don't have the latest HKSCS at home. But again, if you do not know you
> are looking at HKSCS, you loose.

This is not what I meant. What I'm asking is this: Are there HKSCS
character that have encodings which are identical to encodings in
other common Big-5 extensions?

IOW, what bad things would happen if you would assume all Big-5 is
HSKCS? Or: how would the use of PUAs improve the situation in that
case?

> > That sounds like one needs two versions of cp950...
> 
> Sure, if you know which version you are dealing which you may not.

That is always the case: If I don't know the encoding of some
document, there is always the risk of misinterpretation. I can use
heuristics to guess the encoding in some cases, and in some cases, the
heuristics work reasonable well - in other cases, they fail miserably.

There is nothing one can do, except to have users always declare their
encodings properly, to use only data formats which include charset
declarations, to use only charset names that are unambiguous,
preferably even over time, etc. If people don't follow these rules,
some things will go wrong. Then, people will learn to correct their
errors.

Regards,
Martin


From Matt Gushee <mgushee@havenrock.com>  Sat Jun 21 23:51:07 2003
From: Matt Gushee <mgushee@havenrock.com> (Matt Gushee)
Date: Sat, 21 Jun 2003 16:51:07 -0600
Subject: [I18n-sig] Re: CJKCodecs 0.9 is released
In-Reply-To: <m3isqzi121.fsf@mira.informatik.hu-berlin.de>
References: <20030611021836.GA87284@fallin.lv> <m31xy19oep.fsf@mira.informatik.hu-berlin.de> <20030611081301.GA92933@fallin.lv> <m3of141hh4.fsf@mira.informatik.hu-berlin.de> <20030619204031.GA62833@i18n.org> <m3adcbl6dq.fsf@mira.informatik.hu-berlin.de> <16116.37900.202469.431556@magrathea.basistech.com> <m3el1njkoy.fsf@mira.informatik.hu-berlin.de> <16116.46627.374924.972713@magrathea.basistech.com> <m3isqzi121.fsf@mira.informatik.hu-berlin.de>
Message-ID: <20030621225107.GE12229@swordfish>

Some may consider this off-topic, but I don't believe the right course
of action here can be decided on purely technical grounds. So here goes:

On Sat, Jun 21, 2003 at 11:16:22PM +0200, Martin v. Löwis wrote:
> 
> > This is a rediculously pedantic approach that will end up pissing
> > people off: the PUA in Unicode is designed for this purpose, so it
> > should be used.
> 
> It is fine if users are aware that this happens. If they are not, they
> will be pissed off when they find out.

Could be, if by "users" you mean developers that use the library. I
doubt that more than a minuscule fraction of end users has even heard of
Unicode. They just want working software and readable documents. And I
think has a lot to do with the success of Shift-JIS, even though it is
the epitome of bad design: at the time it was developed, half-width
katakana were in widespread use, and it Shift-JIS made it easy to
accommodate that need.

> > Where does it say you cannot cannot encode PUA characters in UTF-8? If
> > you have a custom font that handles these code points, then you are
> > going to be upset that you can't display them because the author of
> > the codec decided that PUA characters are an abomination that should
> > be striken from the earth.
> 
> And if you don't have such a font, you will see some replacement
> characters.

Well, I don't have an intimate knowledge of how CJKV character sets are
used on a daily basis, but I do have a broad knowledge of how society
works in at least Japan and mainland China (been to both, studied the
history in school, lived in Japan for seven years), and I would guess
that the availability of fonts in any given scenario is somewhat
analogous to the availability of XML DTDs: organizations (or
individuals) tend to have the same technology (fonts, software, etc.)
as other organizations that they are likely to exchange documents with.
That's not unique to Asia, of course, but I have the impression it is
more true there than in the West.

> Private characters should never leave the scope of "the application",
> and some effort should be done to make sure they don't leak out of
> "the application".

If by "application," you mean a particular software program or a closely
coordinated set of programs, I very much doubt that goal is achievable
in the foreseeable future. Maybe if you took a somewhat broader view and
said something like "system," encompassing both software and a set of
business practices, it would be realistic.

> There is nothing one can do, except to have users always declare their
> encodings properly, to use only data formats which include charset
> declarations, to use only charset names that are unambiguous,
> preferably even over time, etc. If people don't follow these rules,
> some things will go wrong. Then, people will learn to correct their
> errors.

No, rigid enforcement of standards is not the only choice. The
alternative is to determine what non-standard practices (or de-facto
standard practices) are most common, and attempt to accommodate those. I
honestly don't know which is better, but philosophically I favor
usability over correctness (of course, the two aren't necessarily at
odds in the long term, but often seem to conflict in the short term).

Adherence to standards is a good thing, but you also have to deal with
the social context where your product is being used. Consider the case
of, say, the typical harried IT manager in a Tokyo insurance firm. He
needs to plan the development of a new Web application; the project
requirements call for a very high-level dynamic language. Well, that
gives him several choices, doesn't it? And let's suppose that Python
requires his team to "always declare their encodings properly, to use
only charset names that are unambiguous ..." and so on. And suppose
one of the alternatives (I don't know, perhaps Ruby?) "just works" for
his use cases. Well, then, why should he use Python?

I'm not suggesting that the goal of standards-compliance be discarded
for the sake of popularity, now or ever. But sometimes you need to be a
little less forceful: give users something that works for them today,
while gently steering them toward the "right" path.

Python is good technology, and good technology should be widely used.
And if correctness comes at the expense of usability, you're just going
to drive people away.

-- 
Matt Gushee                 When a nation follows the Way,
Englewood, Colorado, USA    Horses bear manure through
mgushee@havenrock.com           its fields;
http://www.havenrock.com/   When a nation ignores the Way,
                            Horses bear soldiers through
                                its streets.
                                
                            --Lao Tzu (Peter Merel, trans.)


From tex@I18nGuy.com  Thu Jun 26 09:08:16 2003
From: tex@I18nGuy.com (Tex Texin)
Date: Thu, 26 Jun 2003 04:08:16 -0400
Subject: [I18n-sig] 24th Unicode Conference - Atlanta, GA - September 3-5, 2003
Message-ID: <3EFAA9F0.F7850846@I18nGuy.com>

************************************************************************
    Twenty-fourth Internationalization and Unicode Conference (IUC24)
     Unicode, Internationalization, the Web: Powering Global Business
                    
                     http://www.unicode.org/iuc/iuc24
                            September 3-5, 2003
                                Atlanta, GA
************************************************************************
NEWS
 
 > Visit the Conference Web site ( http://www.unicode.org/iuc/iuc24 )
   to check the updated Conference program and register.  To help you
   choose Conference sessions, we've included abstracts of talks and
   speakers' biographies.

 > Hotel guest room group rate valid to August 12.

 > Early bird registration rates valid to August 12.

 > To find out about, and register for the TILP Breakfast Meeting and
   Roundtable, organized by The Institute of Localisation Professionals,
   and taking place at the same venue on September 4, 7:00 a.m.-9:00 a.m.,
   See: http://www.tilponline.org/events/diary.shtml 
   or
   http://www.unicode.org/iuc/iuc24
************************************************************************

 Are you falling behind?  Version 4.0 of the Unicode Standard is here!
 Software and Web applications can now support more languages with
 greater efficiency and lower cost.  Do you need to find out how? Do
 you need to be more competitive around the globe?  Is your software
 upward-compatible with version 4.0?  Does your staff need
 internationalization training?

 Learn about software and Web internationalization and the new Unicode
 Standard, including its latest features and requirements.  This is
 the only event endorsed by the Unicode Consortium.  The conference
 will be held September 3-5, 2003 in Atlanta, Georgia and is
 completely updated.

 KEYNOTES: Keynote speakers for IUC24 are well-known authors in the
 Internationalization and Localization industries:

 Donald De Palma, President, Common Sense Advisory, Inc., and author
 of "Business Without Borders: A Strategic Guide to Global Marketing",
 and Richard Gillam, author of "Unicode Demystified: A Practical
 Programmer's Guide to the Encoding Standard" and a former columnist
 for "C++ Report".

 TUTORIALS:  This redeveloped and enhanced Unicode 4.0 Tutorial is
 taught by Dr. Asmus Freytag, one of the major contributors to the
 standard, and extensively experienced in implementing real-world
 Unicode applications.  Structured into 3 independent modules, you
 can attend just the overview, or only the most advanced material.
 Tutorials in Web Internationalization, non-Latin scripts, and more,
 are offered in parallel and taught by recognized industry experts.

 CONFERENCE TRACKS:  Gain the competitive edge! Conference sessions
 provide the most up-to-date technical information on standards, best
 practices, and recent advances in the globalization of software and
 the Internet.  Panel discussions and the friendly atmosphere allow
 you to exchange ideas and ask questions of key players in the 
 internationalization industry.

 WHO SHOULD ATTEND?:  If you have a limited training budget, this is
 the one Internationalization conference you need.  Send staff that
 are involved in either Unicode-enabling software, or internationalization
 of software and the Internet, including: managers, software engineers,
 systems analysts, font designers, graphic designers, content developers,
 Web designers, Web administrators, technical writers, and product
 marketing personnel.

CONFERENCE WEB SITE, PROGRAM and REGISTRATION

   The Conference Program and Registration form are available at the
   Conference Web site:
      http://www.unicode.org/iuc/iuc24

CONFERENCE SPONSORS

   Agfa Monotype Corporation
   Basis Technology Corporation
   ClientSide News L.L.C.
   Oracle Corporation
   World Wide Web Consortium (W3C)
   XenCraft

GLOBAL COMPUTING SHOWCASE

   Visit the Showcase to find out more about products supporting the
   Unicode Standard, and products and services that can help you
   globalize/localize your software, documentation and Internet content.

   Sign up for the Exhibitors' track as part of the Conference.
   For more information, please see:
   http://www.unicode.org/iuc/iuc24/showcase.html

CONFERENCE VENUE

The Conference will take place at:

          DoubleTree Hotel Atlanta Buckhead
          3342 Peachtree Road
          Atlanta, GA 30326

          Tel: +1-404-231-1234
          Fax: +1-404-231-3112

CONFERENCE MANAGEMENT

   Global Meeting Services Inc.
   8949 Lombard Place, #416
   San Diego, CA 92122, USA

   Tel: +1 858 638 0206 (voice)
        +1 858 638 0504 (fax)

   Email: info@global-conference.com
      or: conference@unicode.org

THE UNICODE CONSORTIUM

 The Unicode Consortium was founded as a non-profit organization in 1991.
 It is dedicated to the development, maintenance and promotion of The
 Unicode Standard, a worldwide character encoding. The Unicode Standard
 encodes the characters of the world's principal scripts and languages,
 and is code-for-code identical to the international standard ISO/IEC
 10646. In addition to cooperating with ISO on the future development of
 ISO/IEC 10646, the Consortium is responsible for providing character
 properties and algorithms for use in implementations. Today the
 membership base of the Unicode Consortium includes major computer
 corporations, software producers, database vendors, research
 institutions, international agencies and various user groups.

 For further information on the Unicode Standard, visit the Unicode Web
 site at http://www.unicode.org or e-mail <info@unicode.org>

                           *  *  *  *  *

Unicode(r) and the Unicode logo are registered trademarks of Unicode,
Inc. Used with permission.