From Misha.Wolf@reuters.com  Fri Mar  1 17:52:15 2002
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Fri, 01 Mar 2002 17:52:15 +0000
Subject: [I18n-sig] 21st Unicode Conference, May 2002, Dublin, Ireland
Message-ID: <T595f936cedc407b707808@reuters.com>

>>>>>>>>>>>>>>>>>> First European IUC in two years! <<<<<<<<<<<<<<<<<<<

         Twenty-first International Unicode Conference (IUC21)
        Unicode, Localization and the Web: The Global Connection
                    http://www.unicode.org/iuc/iuc21
                             14-17 May 2002
                             Dublin, Ireland

>>>>>>>>>>>>>>>>>>>>>>>>> Just 10 weeks to go! <<<<<<<<<<<<<<<<<<<<<<<<

NEWS

 * Hotel guest room group rate valid to 1 May.

 * Early bird registration rate valid to 1 May.

 * Visit the Conference Web site ( http://www.unicode.org/iuc/iuc21 )
   to check the Conference program and register.  To help you choose
   Conference sessions, we've included abstracts of talks and speakers'
   biographies.

 * The Workshop on Standards in Localisation, organised by the
   Localisation Research Centre (LRC), is taking place in the same venue
   on May 13 -- See: http://lrc.csis.ul.ie

CONFERENCE SPONSORS

   Agfa Monotype Corporation
   Basis Technology Corporation
   Localisation Research Centre
   Microsoft Corporation
   Reuters Ltd
   Sun Microsystems, Inc.
   World Wide Web Consortium (W3C)

GLOBAL COMPUTING SHOWCASE

   Visit the Showcase to find out more about products supporting the
   Unicode Standard, and products and services that can help you
   globalize/localize your software, documentation and Internet content.
   For details, visit the Conference Web site.

CONFERENCE VENUE

The Conference will take place at:

   The Burlington Hotel
   Upper Leeson Street
   Dublin 4, Ireland

   Tel: (+353 1) 660 5222
   Fax: (+353 1) 660 8496

CONFERENCE MANAGEMENT

   Global Meeting Services Inc.
   8949 Lombard Place, #416
   San Diego, CA 92122, USA

   Tel: +1 858 638 0206 (voice)
        +1 858 638 0504 (fax)

   Email: info@global-conference.com
      or: conference@unicode.org

                           *  *  *  *  *

Unicode(r) and the Unicode logo are registered trademarks of Unicode,
Inc.  Used with permission.


------------------------------------------------------------- ---
        Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of  the  individual
sender,  except  where  the sender specifically states them to be
the views of Reuters Ltd.


From kajiyama@grad.sccs.chukyo-u.ac.jp  Mon Mar  4 11:33:08 2002
From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA)
Date: Mon, 4 Mar 2002 20:33:08 +0900
Subject: [I18n-sig] JapaneseCodecs 1.4.4 released
Message-ID: <200203041133.UAA15626@dhcp225.grad.sccs.chukyo-u.ac.jp>

Hi all,

I've released JapaneseCodecs 1.4.4.  The new feature is the
addition of a codec for MS932 (Microsoft code page 932, i.e.
a version of Shift_JIS).  A source tarball is avairable at:

  http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/python/

The MS932 codec was written by Atsuo ISHIMOTO.  I really
appreciate the contribution.  Thanks a lot!!

Regards,

-- 
KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>


From kajiyama@grad.sccs.chukyo-u.ac.jp  Wed Mar  6 12:05:28 2002
From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA)
Date: Wed, 6 Mar 2002 21:05:28 +0900
Subject: [I18n-sig] PEP 263 and Japanese native encodings
Message-ID: <200203061205.VAA18805@dhcp225.grad.sccs.chukyo-u.ac.jp>

Hi,

I read the PEP 263: Defining Python Source Code Encodings
(revision 1.9).  Here some comments after a discussion on the
PEP in a Japanese Python mailing list.

First of all, as a Japanese Python programmer, I would like to
use three Japanese native encodings EUC-JP, Shift_JIS and
ISO-2022-JP as a file encoding of Python source files.  I think
these encodings are considered "ASCII compatible" in the sense
you mention in the following paragraph in the "Concepts" section:

  Only ASCII compatible encodings are allowed as source code
  encoding to assure that Python language elements other than
  literals and comments remain readable by ASCII processing tools
  and to avoid problems with wide characters encodings such as
  UTF-16.

However, a participant of the discussion in the Japanese Python
mailing list says, among the three Japanese encodings, Shift_JIS
and ISO-2022-JP are *not* ASCII compatible.  He defines ASCII
compatibility as follows:

  An ASCII compatible encoding (character set) is a superset of
  the ASCII encoding (character set) in which octets from 0x00
  to 0x7f are only used to represent ASCII characters and not
  used in a series of bytes that represent a multibyte character
  (such as Kanji and Hiragana).

This definition is too restrictive IMHO, but anyway the term
"ASCII compatible" is somewhat obscure and needs clarification
since there are at least two interpretations.  For the sake of
the PEP's readers, it's also useful to provide a (partial) list
of encodings that can be used as a file encoding.

In summary, the questions to be raised are:

o What does the term "ASCII compatible" mean?
o Are three Japanese native encodings EUC-JP, Shift_JIS and
  ISO-2022-JP "ASCII compatible"?

Anyway, thank you for the great proposal.  It will enhance the
utility of the language for non-Latin Python programmers once
implemented in the language core.  I really hope that.

Regards,

-- 
KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>


From mal@lemburg.com  Wed Mar  6 12:49:58 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 06 Mar 2002 13:49:58 +0100
Subject: [I18n-sig] PEP 263 and Japanese native encodings
References: <200203061205.VAA18805@dhcp225.grad.sccs.chukyo-u.ac.jp>
Message-ID: <3C861076.7202C114@lemburg.com>

Tamito KAJIYAMA wrote:
> 
> I read the PEP 263: Defining Python Source Code Encodings
> (revision 1.9).  Here some comments after a discussion on the
> PEP in a Japanese Python mailing list.
> 
> First of all, as a Japanese Python programmer, I would like to
> use three Japanese native encodings EUC-JP, Shift_JIS and
> ISO-2022-JP as a file encoding of Python source files.  I think
> these encodings are considered "ASCII compatible" in the sense
> you mention in the following paragraph in the "Concepts" section:
> 
>   Only ASCII compatible encodings are allowed as source code
>   encoding to assure that Python language elements other than
>   literals and comments remain readable by ASCII processing tools
>   and to avoid problems with wide characters encodings such as
>   UTF-16.
> 
> However, a participant of the discussion in the Japanese Python
> mailing list says, among the three Japanese encodings, Shift_JIS
> and ISO-2022-JP are *not* ASCII compatible.  He defines ASCII
> compatibility as follows:
> 
>   An ASCII compatible encoding (character set) is a superset of
>   the ASCII encoding (character set) in which octets from 0x00
>   to 0x7f are only used to represent ASCII characters and not
>   used in a series of bytes that represent a multibyte character
>   (such as Kanji and Hiragana).
> 
> This definition is too restrictive IMHO, but anyway the term
> "ASCII compatible" is somewhat obscure and needs clarification
> since there are at least two interpretations. 

As far as the Python tokenizer/compiler is concerned, it
will only have to be able to read the first two lines
and then decode the information found there as described in
the PEP.

That said, ASCII compatible encoding in the PEP description
means that you can represent the standard printable characters 
including the line end characters of the ASCII encoding using 
ASCII ordinals.

I only wanted to avoid having to support two or more byte 
encodings such as UTF-16 since these make the magic
comment recognition much more difficult.

> For the sake of
> the PEP's readers, it's also useful to provide a (partial) list
> of encodings that can be used as a file encoding.
> 
> In summary, the questions to be raised are:
> 
> o What does the term "ASCII compatible" mean?
> o Are three Japanese native encodings EUC-JP, Shift_JIS and
>   ISO-2022-JP "ASCII compatible"?

Yes, provided they have no problem representing the first two 
lines of a source files as e.g.:

#!/usr/bin/python -uOO
# -*- coding: iso-2022-jp -*-
 
> Anyway, thank you for the great proposal.  It will enhance the
> utility of the language for non-Latin Python programmers once
> implemented in the language core.  I really hope that.

Thanks.

Since I will be busy the next two months, Martin has volunteered
to head on with the implementation. I hope that we can have
phase 1 implemented in Python 2.3.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/


From martin@v.loewis.de  Wed Mar  6 18:03:07 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 06 Mar 2002 19:03:07 +0100
Subject: [I18n-sig] PEP 263 and Japanese native encodings
In-Reply-To: <200203061205.VAA18805@dhcp225.grad.sccs.chukyo-u.ac.jp>
References: <200203061205.VAA18805@dhcp225.grad.sccs.chukyo-u.ac.jp>
Message-ID: <m3u1rtpo9w.fsf@mira.informatik.hu-berlin.de>

Tamito KAJIYAMA <kajiyama@grad.sccs.chukyo-u.ac.jp> writes:

> I think
> these encodings are considered "ASCII compatible" in the sense
> you mention in the following paragraph in the "Concepts" section:
> 
>   Only ASCII compatible encodings are allowed as source code
>   encoding to assure that Python language elements other than
>   literals and comments remain readable by ASCII processing tools
>   and to avoid problems with wide characters encodings such as
>   UTF-16.

My original definition of "ASCII compatible" would have been

  "An encoding X is ASCII compatible iff a text that consists only of
   ASCII characters is byte-for-byte identical when encoded with X,
   compared to the same text encoded in ASCII"

Under this definition, iso-2022-jp would be ASCII compatible, but it
still is not acceptable under the implementation that I have in mind
for the patch.

>   An ASCII compatible encoding (character set) is a superset of
>   the ASCII encoding (character set) in which octets from 0x00
>   to 0x7f are only used to represent ASCII characters and not
>   used in a series of bytes that represent a multibyte character
>   (such as Kanji and Hiragana).

Indeed, this is the definition which the reference implementation of
the PEP currently relies on.

> This definition is too restrictive IMHO, but anyway the term
> "ASCII compatible" is somewhat obscure and needs clarification
> since there are at least two interpretations.  

It would be possible to somewhat losen this definition, defining
"ASCII string" compatible

  An ASCII string compatible encoding (character set) is a superset of
  the ASCII encoding (character set) in which octets from set AS are
  only used to represent ASCII characters and not used in a series of
  bytes that represent a multibyte character (such as Kanji and
  Hiragana). The set AS is defined as

  AS = [\r\n\\'"] (newline, linefeed, backslash, single/double quote)

The rationale here is that, under the PEP, non-ASCII text may only
appear in comments and strings. The lexer needs the ASCII-compatible
property to determine the end-of-line and end-of-string markers,
atleast in the phase-1 implementation.

> o Are three Japanese native encodings EUC-JP, Shift_JIS and
>   ISO-2022-JP "ASCII compatible"?

EUC-JP certainly is; ISO-2022-JP probably isn't. I cannot see the
problem with Shift_JIS; I thought is uses only non-ASCII bytes for the
double-byte characters (and that this is precisely what the "shift" in
Shift_JIS refers to); see

http://www.io.com/~kazushi/encoding/sjis.html

If you are referring to the common interpretation that Shift_JIS uses
JIS X 0201-1976 for the first 128 bytes, I think we can take a relaxed
position here:

1. The only differences between JIS X 0201 and ISO 646 IRV (aka ASCII)
   are \x24 (CURRENCY SIGN vs. DOLLAR SIGN) and \x5C (YEN SIGN vs.
   REVERSE SOLIDUS).
2. \x24 is not in AS.
3. Backslash could cause a problem, if people insist on putting the Yen
   sign into a string literal. Even though this isn't strictly supported
   under PEP 263, people would get away with that most of the time.
4. I understand that Microsoft's interpretation of Shift_JIS actually
   is that \x24 *does* represent REVERSE SOLIDUS, and that only the
   fonts display something else.

Regards,
Martin


From martin@v.loewis.de  Wed Mar  6 18:20:48 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 06 Mar 2002 19:20:48 +0100
Subject: [I18n-sig] ICU in python-codecs CVS
Message-ID: <m3k7spo8vz.fsf@mira.informatik.hu-berlin.de>

I'd like to start working on ICU codecs support, in the python-codecs
CVS. For that, I'd like to import Fredrik Juhlin's picu into the CVS,
and start from there. Any objection against creating a picu module in
the CVS? For that to work, Fredrik needs to get write access to the
CVS also. Could somebody please arrange that?

Thanks,
Martin


From tree@basistech.com  Wed Mar  6 19:53:07 2002
From: tree@basistech.com (Tom Emerson)
Date: Wed, 6 Mar 2002 14:53:07 -0500
Subject: [I18n-sig] ICU in python-codecs CVS
In-Reply-To: <m3k7spo8vz.fsf@mira.informatik.hu-berlin.de>
References: <m3k7spo8vz.fsf@mira.informatik.hu-berlin.de>
Message-ID: <15494.29603.884336.395622@magrathea.basistech.com>

Martin v. Loewis writes:
> I'd like to start working on ICU codecs support, in the python-codecs
> CVS. For that, I'd like to import Fredrik Juhlin's picu into the CVS,
> and start from there. Any objection against creating a picu module in
> the CVS? For that to work, Fredrik needs to get write access to the
> CVS also. Could somebody please arrange that?

I have no objects: if you give me his SF account I'll add him to the
list of maintainers.

-- 
Tom Emerson                                          Basis Technology Corp.
Sr. Computational Linguist                         http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


From tree@basistech.com  Wed Mar  6 20:11:00 2002
From: tree@basistech.com (Tom Emerson)
Date: Wed, 6 Mar 2002 15:11:00 -0500
Subject: [I18n-sig] ICU in python-codecs CVS
In-Reply-To: <15494.29603.884336.395622@magrathea.basistech.com>
References: <m3k7spo8vz.fsf@mira.informatik.hu-berlin.de>
 <15494.29603.884336.395622@magrathea.basistech.com>
Message-ID: <15494.30676.799975.659849@magrathea.basistech.com>

Tom Emerson writes:
> I have no objects: [...]

Er, s/objects/objections/

:-)

-- 
Tom Emerson                                          Basis Technology Corp.
Sr. Computational Linguist                         http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


From goodger@users.sourceforge.net  Thu Mar  7 02:02:56 2002
From: goodger@users.sourceforge.net (David Goodger)
Date: Wed, 06 Mar 2002 21:02:56 -0500
Subject: [I18n-sig] raw-unicode-escape encoding
Message-ID: <B8AC347F.1F9B7%goodger@users.sourceforge.net>

If this isn't the correct venue, please let me know. (The right people
seem to be hanging around.)

I've come across something strange while adding some Unicode
characters to the output generated by the Docutils projects (see my
signature for URLs). I want to get 7-bit ASCII output for the test
suite, but I want to keep newlines, so I'm using the
'raw-unicode-escape' codec. I assumed that this codec would convert
any character whose ord(char) > 127 to "\\uXXXX". This does not seem
to be the case for ord(char) between 128 and 255 inclusive.

Here's my default encoding::

    >>> import sys
    >>> sys.getdefaultencoding()
    'ascii'

Here's a Unicode string that works::

    >>> u =3D u'\u2020\u2021'
    >>> s =3D u.encode('raw-unicode-escape')
    >>> s
    '\\u2020\\u2021'
    >>> print s
    \u2020\u2021

That's what I want. When I run the string (not Unicode) through the
codec again, there's no change (which is good)::

    >>> s.encode('raw-unicode-escape')
    '\\u2020\\u2021'

Here's a Unicode string that doesn't work::

    >>> u =3D u'\u00A7\u00B6'
    >>> s =3D u.encode('raw-unicode-escape')
    >>> s
    '\xa7\xb6'
    >>> print s
    =A7=B6

(The last line contained the &sect; and &para; characters, probably
corrupted.)

Note that although the characters are ordinal > 127, they don't get
converted into '\\uXXXX' escapes. It seems that the
'raw-unicode-escape' codec is assuming latin-1 for output. But my
default encoding is 'ascii'; doesn't that mean 7-bit ASCII? How can I
get 7-bit ascii on \u0080 through \u00FF?

The 'unicode-escape' codec produces '\\xa7\\xb6', but it also converts
newlines to '\\n', which I don't want.

Running the string (now an 8-bit string, not 7-bit ASCII) through the
codec again crashes::

    >>> s.encode('raw-unicode-escape')
    Traceback (most recent call last):
      File "<pyshell#13>", line 1, in ?
        s.encode('raw-unicode-escape')
    UnicodeError: ASCII decoding error: ordinal not in range(128)

Is this because ``s`` is being coerced into a Unicode string, and it
fails because the default encoding is 'ascii' but ``s`` contains 8-bit
characters? Do I even have my terminology straight? ;-)

Is this a bug? I'll open a bug report if it is. Any workarounds?

I get these results with Python 2.2, on US versions of both Win2K and
MacOS 8.6. On Win2K I tried this from IDLE and from a Python session
within GNU Emacs 20.7.1, and on MacOS the test was done using the
PythonInterpreter app.; identical results all around.

--=20
David Goodger    goodger@users.sourceforge.net    Open-source projects:
 - Python Docstring Processing System: http://docstring.sourceforge.net
 - reStructuredText: http://structuredtext.sourceforge.net
 - The Go Tools Project: http://gotools.sourceforge.net


From kajiyama@grad.sccs.chukyo-u.ac.jp  Thu Mar  7 05:32:52 2002
From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA)
Date: Thu, 7 Mar 2002 14:32:52 +0900
Subject: [I18n-sig] PEP 263 and Japanese native encodings
In-Reply-To: <m3u1rtpo9w.fsf@mira.informatik.hu-berlin.de>
 (martin@v.loewis.de)
References: <m3u1rtpo9w.fsf@mira.informatik.hu-berlin.de>
Message-ID: <200203070532.OAA19231@dhcp225.grad.sccs.chukyo-u.ac.jp>

martin@v.loewis.de (Martin v. Loewis) writes:
| 
|   An ASCII string compatible encoding (character set) is a superset of
|   the ASCII encoding (character set) in which octets from set AS are
|   only used to represent ASCII characters and not used in a series of
|   bytes that represent a multibyte character (such as Kanji and
|   Hiragana). The set AS is defined as
| 
|   AS = [\r\n\\'"] (newline, linefeed, backslash, single/double quote)
| 
| The rationale here is that, under the PEP, non-ASCII text may only
| appear in comments and strings. The lexer needs the ASCII-compatible
| property to determine the end-of-line and end-of-string markers,
| atleast in the phase-1 implementation.
|
| > o Are three Japanese native encodings EUC-JP, Shift_JIS and
| >   ISO-2022-JP "ASCII compatible"?
| 
| EUC-JP certainly is;

Absolutely.

| ISO-2022-JP probably isn't.

Right, ISO-2022-JP is not ASCII compatible in the sense of your
definition.  It uses " and ' to represent both ASCII and JIS X
0208-1983 (Kanji, Hiragana, and so on).  For example, an
ISO-2022-JP representation of u"\u3042" (the first character of
Hiragana) contains a double quote mark:

  >>> u"\u3042".encode("japanese.iso-2022-jp")
  '\033$B$"\033(B'

(FYI: the first escape sequence \033$B is the mark that says the
following bytes represent a series of JIS X 0208-1983 characters.
The second \033(B has a similar meaning for ASCII.)

| I cannot see the problem with Shift_JIS;

Shift_JIS is not ASCII compatible in a similar way.  It uses
backslash as a second byte.  Here is another example:

  >>> u"\u8868".encode("japanese.sjis")
  '\225\\'

This is a well-known and highly annoying problem of Python in
Japanese Windows environment in which Shift_JIS is the system's
default encoding.  There is a patch for Python specifically
fixing this problem.

So, a definition of ASCII compatible encodings is very important
since it may or may not accept Shift_JIS and ISO-2022-JP.  I
believe other Asian native encodings are in a similar situation
with the two Japanese encodings.

I don't want the PEP to exclude the two widely used Japanese
encodings, especially Shift_JIS.  I think the only acceptable
requirement for an ASCII compatible encoding is the property
that it can represent the first two lines of comments only by
ASCII characters.  Other requirements will not make the two
Japanese encodings ASCII comatible.

Regards,

-- 
KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>


From martin@v.loewis.de  Thu Mar  7 07:38:50 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 07 Mar 2002 08:38:50 +0100
Subject: [I18n-sig] raw-unicode-escape encoding
In-Reply-To: <B8AC347F.1F9B7%goodger@users.sourceforge.net>
References: <B8AC347F.1F9B7%goodger@users.sourceforge.net>
Message-ID: <m3r8mwaktx.fsf@mira.informatik.hu-berlin.de>

David Goodger <goodger@users.sourceforge.net> writes:

> Note that although the characters are ordinal > 127, they don't get
> converted into '\\uXXXX' escapes. It seems that the
> 'raw-unicode-escape' codec is assuming latin-1 for output. 

Correct. raw-unicode-escape brings the Unicode string into a form
suitable for usage in Python source code. In Python source code,
bytes in range(128,256) are treated as Latin-1, regardless of your
system encoding.

> But my default encoding is 'ascii'; doesn't that mean 7-bit ASCII?

Your system encoding is (currently) irrelevant how non-ASCII bytes are
interpreted in Python source code; this will change under PEP 263. So
I think the raw-unicode-escape codec should be changed to use hex
escapes for this range.

> Running the string (now an 8-bit string, not 7-bit ASCII) through the
> codec again crashes::
> 
>     >>> s.encode('raw-unicode-escape')
>     Traceback (most recent call last):
>       File "<pyshell#13>", line 1, in ?
>         s.encode('raw-unicode-escape')
>     UnicodeError: ASCII decoding error: ordinal not in range(128)

That's a pilot error: use .decode to decode from some byte string into
a Unicode object. Better yet, use the unicode() builtin.

> Is this because ``s`` is being coerced into a Unicode string, and it
> fails because the default encoding is 'ascii' but ``s`` contains 8-bit
> characters? Do I even have my terminology straight? ;-)

Not in this case, no.

> Is this a bug? I'll open a bug report if it is. Any workarounds?

It is not really a bug. Does it cause problems for you?

Regards,
Martin


From martin@v.loewis.de  Thu Mar  7 07:54:17 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 07 Mar 2002 08:54:17 +0100
Subject: [I18n-sig] PEP 263 and Japanese native encodings
In-Reply-To: <200203070532.OAA19231@dhcp225.grad.sccs.chukyo-u.ac.jp>
References: <m3u1rtpo9w.fsf@mira.informatik.hu-berlin.de>
 <200203070532.OAA19231@dhcp225.grad.sccs.chukyo-u.ac.jp>
Message-ID: <m3n0xkak46.fsf@mira.informatik.hu-berlin.de>

Tamito KAJIYAMA <kajiyama@grad.sccs.chukyo-u.ac.jp> writes:

> Shift_JIS is not ASCII compatible in a similar way.  It uses
> backslash as a second byte.  Here is another example:
> 
>   >>> u"\u8868".encode("japanese.sjis")
>   '\225\\'

I see. I missed the part that the second byte can be in the range
0x40-0xFC. If I understand the problem correctly, the quotation
characters (", ') can *not* appear as the second byte, right?
Also, there is a total of 60 characters that end in byte \x5C;
and those will only cause a problem if immediately followed by
a quoting character.

Do you think those 60 characters would cause a problem in real life?
Or is that a problem that only exists on paper?

> This is a well-known and highly annoying problem of Python in
> Japanese Windows environment in which Shift_JIS is the system's
> default encoding.  There is a patch for Python specifically
> fixing this problem.

A patch specifically designed for Shift_JIS probably is not acceptable
to Python. A patch solving the general problem (in some way) may be.

> So, a definition of ASCII compatible encodings is very important
> since it may or may not accept Shift_JIS and ISO-2022-JP.  I
> believe other Asian native encodings are in a similar situation
> with the two Japanese encodings.

All the EUC encodings (EUC-KR, EUC-ZH) should be ASCII
compatible. BIG5 has the same problem as Shift_JIS. Dunno about
GB2312.

> I don't want the PEP to exclude the two widely used Japanese
> encodings, especially Shift_JIS.

Then you need to propose an implementation strategy, and that strategy
should *not* be "special-case Shift_JIS", and it also should not be
"use the C library's multibyte functions".

In phase 2 of the PEP, both Shift_JIS and ISO-2022-JP will be
acceptable source encodings - but we are in search of an
implementation strategy for that as well. So anybody working on this
would be encouraged to implement Phase 2 of the PEP. Until then, I
suggest to live with the limitation that 60 characters cannot appear
as the last character in a string.

Regards,
Martin


From kajiyama@grad.sccs.chukyo-u.ac.jp  Thu Mar  7 10:15:22 2002
From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA)
Date: Thu, 7 Mar 2002 19:15:22 +0900
Subject: [I18n-sig] PEP 263 and Japanese native encodings
In-Reply-To: <m3n0xkak46.fsf@mira.informatik.hu-berlin.de>
 (martin@v.loewis.de)
References: <m3n0xkak46.fsf@mira.informatik.hu-berlin.de>
Message-ID: <200203071015.TAA19842@nat-dhcp253.grad.sccs.chukyo-u.ac.jp>

martin@v.loewis.de (Martin v. Loewis) writes:
| 
| > Shift_JIS is not ASCII compatible in a similar way.  It uses
| > backslash as a second byte.  Here is another example:
| > 
| >   >>> u"\u8868".encode("japanese.sjis")
| >   '\225\\'
| 
| I see. I missed the part that the second byte can be in the range
| 0x40-0xFC. If I understand the problem correctly, the quotation
| characters (", ') can *not* appear as the second byte, right?

Right.

| Also, there is a total of 60 characters that end in byte \x5C;

Not right.  In JIS X 0208-1983 (6877 characters) there are 37
characters that end in byte \x5C.

| and those will only cause a problem if immediately followed by
| a quoting character.

You've described only the condition of a syntax error; backslash
as a second byte causes run-time problems even when it is
followed by some characters.  Let's consider the following
example.  The byte sequence shown below represents the content
of a string literal in a Shift_JIS encoded source file.  Its
Unicode representation is u"\u88681\u53C2\u7167" ("See Table 1"
in Japanese).

  95 5C 31 8E 51 8F C6

Now, the second byte is backslash and thus the third byte ("1")
gets backslash-escaped ("\1").  So, Python gives the string
literal the following wrong value:

  95 01 8E 51 8F C6

| Do you think those 60 characters would cause a problem in real life?

Yes, absolutely.

| Or is that a problem that only exists on paper?

No.  Suppose that you could not put common English words like
"table", "reserve", "ten" and "paste" in string literals; such
a restriction would not be acceptable at all, right? :-)

| > This is a well-known and highly annoying problem of Python in
| > Japanese Windows environment in which Shift_JIS is the system's
| > default encoding.  There is a patch for Python specifically
| > fixing this problem.
| 
| A patch specifically designed for Shift_JIS probably is not acceptable
| to Python. A patch solving the general problem (in some way) may be.

Yes, I think so too.  The patch I metioned is a localization
patch, not intended to be merged into the Python core.

| > I don't want the PEP to exclude the two widely used Japanese
| > encodings, especially Shift_JIS.
| 
| Then you need to propose an implementation strategy, and that strategy
| should *not* be "special-case Shift_JIS", and it also should not be
| "use the C library's multibyte functions".

I've thought that Marc-Andre's intent for ASCII compatibility
(i.e., ASCII compatible encodings should be able to represent
the first two lines of comments only by ASCII characters) is
good enough.  It appears that his requirement has no problem
with regard to the implementation stategy described in the PEP
(revision 1.9) *and* Japanese encodings.  IMHO, the ASCII
compatibility simply should not impose other requirements.

Regards,

-- 
KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>


From mal@lemburg.com  Thu Mar  7 10:52:36 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 07 Mar 2002 11:52:36 +0100
Subject: [I18n-sig] raw-unicode-escape encoding
References: <B8AC347F.1F9B7%goodger@users.sourceforge.net>
Message-ID: <3C874674.AD5F62C0@lemburg.com>

David Goodger wrote:
>=20
> If this isn't the correct venue, please let me know. (The right people
> seem to be hanging around.)
>=20
> I've come across something strange while adding some Unicode
> characters to the output generated by the Docutils projects (see my
> signature for URLs). I want to get 7-bit ASCII output for the test
> suite, but I want to keep newlines, so I'm using the
> 'raw-unicode-escape' codec. I assumed that this codec would convert
> any character whose ord(char) > 127 to "\\uXXXX". This does not seem
> to be the case for ord(char) between 128 and 255 inclusive.
>=20
> Here's my default encoding::
>=20
>     >>> import sys
>     >>> sys.getdefaultencoding()
>     'ascii'
>=20
> Here's a Unicode string that works::
>=20
>     >>> u =3D u'\u2020\u2021'
>     >>> s =3D u.encode('raw-unicode-escape')
>     >>> s
>     '\\u2020\\u2021'
>     >>> print s
>     \u2020\u2021
>=20
> That's what I want. When I run the string (not Unicode) through the
> codec again, there's no change (which is good)::
>=20
>     >>> s.encode('raw-unicode-escape')
>     '\\u2020\\u2021'
>=20
> Here's a Unicode string that doesn't work::
>=20
>     >>> u =3D u'\u00A7\u00B6'
>     >>> s =3D u.encode('raw-unicode-escape')
>     >>> s
>     '\xa7\xb6'
>     >>> print s
>     =A7=B6
>=20
> (The last line contained the &sect; and &para; characters, probably
> corrupted.)
>=20
> Note that although the characters are ordinal > 127, they don't get
> converted into '\\uXXXX' escapes. It seems that the
> 'raw-unicode-escape' codec is assuming latin-1 for output. But my
> default encoding is 'ascii'; doesn't that mean 7-bit ASCII? How can I
> get 7-bit ascii on \u0080 through \u00FF?

The unicode-escape codecs (raw and normal) both extend the
Latin-1 encoding with a few escaped characters. The difference
between the two is mainly in the way they decode escapes; the
raw codec only unescapes a small supset of escapes which the
normal codec can handle.

Both codecs are mainly intended to encode/decode Unicode literals
in Python source code, so their functionality may differ a bit
from what you have in mind.
=20
> The 'unicode-escape' codec produces '\\xa7\\xb6', but it also converts
> newlines to '\\n', which I don't want.
>=20
> Running the string (now an 8-bit string, not 7-bit ASCII) through the
> codec again crashes::
>=20
>     >>> s.encode('raw-unicode-escape')
>     Traceback (most recent call last):
>       File "<pyshell#13>", line 1, in ?
>         s.encode('raw-unicode-escape')
>     UnicodeError: ASCII decoding error: ordinal not in range(128)
>=20
> Is this because ``s`` is being coerced into a Unicode string, and it
> fails because the default encoding is 'ascii' but ``s`` contains 8-bit
> characters? Do I even have my terminology straight? ;-)
>=20
> Is this a bug? I'll open a bug report if it is. Any workarounds?

You should first get a feeling for what kind of mapping
you expect, i.e. which characters should be escaped or not.

> I get these results with Python 2.2, on US versions of both Win2K and
> MacOS 8.6. On Win2K I tried this from IDLE and from a Python session
> within GNU Emacs 20.7.1, and on MacOS the test was done using the
> PythonInterpreter app.; identical results all around.

That's intended :-)

--=20
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/


From mal@lemburg.com  Thu Mar  7 11:01:25 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 07 Mar 2002 12:01:25 +0100
Subject: [I18n-sig] PEP 263 and Japanese native encodings
References: <m3u1rtpo9w.fsf@mira.informatik.hu-berlin.de> <200203070532.OAA19231@dhcp225.grad.sccs.chukyo-u.ac.jp>
Message-ID: <3C874885.319595D4@lemburg.com>

Tamito KAJIYAMA wrote:
> 
> I don't want the PEP to exclude the two widely used Japanese
> encodings, especially Shift_JIS.  I think the only acceptable
> requirement for an ASCII compatible encoding is the property
> that it can represent the first two lines of comments only by
> ASCII characters.  Other requirements will not make the two
> Japanese encodings ASCII comatible.

+1, I'll add a note to the PEP about this. 

The whole ASCII business is really only about the first two lines 
and that's it. In phase 2, the complete file will be decoded into 
Unicode, so the problems you now see with backslashes as final 
character in string literals (caused by Shift_JIS) will go away.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/


From mal@lemburg.com  Thu Mar  7 11:22:13 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 07 Mar 2002 12:22:13 +0100
Subject: [I18n-sig] PEP 263 and Japanese native encodings
References: <m3n0xkak46.fsf@mira.informatik.hu-berlin.de> <200203071015.TAA19842@nat-dhcp253.grad.sccs.chukyo-u.ac.jp>
Message-ID: <3C874D65.A58DA645@lemburg.com>

Tamito KAJIYAMA wrote:
> 
> I've thought that Marc-Andre's intent for ASCII compatibility
> (i.e., ASCII compatible encodings should be able to represent
> the first two lines of comments only by ASCII characters) is
> good enough.  It appears that his requirement has no problem
> with regard to the implementation stategy described in the PEP
> (revision 1.9) *and* Japanese encodings.  IMHO, the ASCII
> compatibility simply should not impose other requirements.

I've updated the PEP to clarify this. Basically it should be
possible to do:

file = open('script.py')
line1 = file.readline()
line2 = file.readline()

# check line1 and line2 for the RE from the PEP

# push the two lines back onto the file stream or handle this
# situation using a line buffer.

Nothing complicated, really.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/


From tim.one@comcast.net  Thu Mar  7 17:25:30 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Mar 2002 12:25:30 -0500
Subject: [I18n-sig] PEP 263 and Japanese native encodings
In-Reply-To: <3C874D65.A58DA645@lemburg.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCIECKOCAA.tim.one@comcast.net>

[M.-A. Lemburg]
> I've updated the PEP to clarify this. Basically it should be
> possible to do:
>
> file = open('script.py')
> line1 = file.readline()
> line2 = file.readline()
>
> # check line1 and line2 for the RE from the PEP
>
> # push the two lines back onto the file stream or handle this
> # situation using a line buffer.
>
> Nothing complicated, really.

A complication is that so long as Python uses C stdio to read files, there's
no guarantee that "funny bytes" can be gotten from files opened in text
mode.  The inability to read chr(26) from a text-mode file on Windows is an
infamous example of that:

>>> f = open('oops', 'wb')
>>> f.write('x' * 100 + chr(26) + 'x' * 100)
>>> f.close()
>>> f = open('oops')
>>> len(f.read())  # chr(26) acts like EOF on Windows in text mode
100
>>>

OTOH, if you open in binary mode instead, you have to wrestle with the
platform's line-end conventions.

the-devil-is-in-the-details-ly y'rs  - tim


From mal@lemburg.com  Thu Mar  7 18:09:58 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Thu, 07 Mar 2002 19:09:58 +0100
Subject: [I18n-sig] PEP 263 and Japanese native encodings
References: <LNBBLJKPBEHFEDALKOLCIECKOCAA.tim.one@comcast.net>
Message-ID: <3C87ACF6.6BDE7924@lemburg.com>

Tim Peters wrote:
> 
> [M.-A. Lemburg]
> > I've updated the PEP to clarify this. Basically it should be
> > possible to do:
> >
> > file = open('script.py')
> > line1 = file.readline()
> > line2 = file.readline()
> >
> > # check line1 and line2 for the RE from the PEP
> >
> > # push the two lines back onto the file stream or handle this
> > # situation using a line buffer.
> >
> > Nothing complicated, really.
> 
> A complication is that so long as Python uses C stdio to read files, there's
> no guarantee that "funny bytes" can be gotten from files opened in text
> mode.  The inability to read chr(26) from a text-mode file on Windows is an
> infamous example of that:
> 
> >>> f = open('oops', 'wb')
> >>> f.write('x' * 100 + chr(26) + 'x' * 100)
> >>> f.close()
> >>> f = open('oops')
> >>> len(f.read())  # chr(26) acts like EOF on Windows in text mode
> 100
> >>>

Pass that string to a teletex machine and you'll get the same
result... Hmm, this should tell us something ;-)
 
> OTOH, if you open in binary mode instead, you have to wrestle with the
> platform's line-end conventions.

Martin's patch leaves these "minor" issues to the tokenizer 
and that's good :-) 

I only wanted to give a very simple
example of what the original idea was when I added "ASCII
compatible encoding" to the PEP -- basically to simplify
the coding parsing part.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/


From martin@v.loewis.de  Thu Mar  7 19:42:26 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 07 Mar 2002 20:42:26 +0100
Subject: [I18n-sig] PEP 263 and Japanese native encodings
In-Reply-To: <200203071015.TAA19842@nat-dhcp253.grad.sccs.chukyo-u.ac.jp>
References: <m3n0xkak46.fsf@mira.informatik.hu-berlin.de>
 <200203071015.TAA19842@nat-dhcp253.grad.sccs.chukyo-u.ac.jp>
Message-ID: <m3r8mwdv19.fsf@mira.informatik.hu-berlin.de>

Tamito KAJIYAMA <kajiyama@grad.sccs.chukyo-u.ac.jp> writes:

> You've described only the condition of a syntax error; backslash
> as a second byte causes run-time problems even when it is
> followed by some characters.  

I see. In phase 1 of the PEP, this problem will only occur for byte
strings. For Unicode literals, those problems will not happen: Python
will decode the string before escape characters are considered, so the
problem can won't occur in Unicode strings.

For byte strings, it won't bring any changes. Your best bet is to
declare them as raw. In Phase 2, the encoding will be applied to all
strings.

So people that want Japanese strings should use Unicode literals.

> | Or is that a problem that only exists on paper?
> 
> No.  Suppose that you could not put common English words like
> "table", "reserve", "ten" and "paste" in string literals; such
> a restriction would not be acceptable at all, right? :-)

If the restriction was that you cannot have such a word as the last
word of a string (but need some spacing character after it), I think
the restriction might be acceptable - although admittedly arbitrary.

Also, notice that the restriction is only for byte strings.

> I've thought that Marc-Andre's intent for ASCII compatibility
> (i.e., ASCII compatible encodings should be able to represent
> the first two lines of comments only by ASCII characters) is
> good enough.  It appears that his requirement has no problem
> with regard to the implementation stategy described in the PEP
> (revision 1.9) *and* Japanese encodings.  IMHO, the ASCII
> compatibility simply should not impose other requirements.

That sounds nice on paper (or rather, in your email message); it
simply does not work in practice. For it to work, the lexer needs to
operate on Unicode characters instead of bytes. Such a change is quite
complex, and cannot be carried out until phase 2 of the PEP.

Anybody interested is encouraged to discuss implementation strategies
on this list. I know that I probably can't find the time to implement
that part before Python 2.3. Also, I'd think that getting the Japanese
codecs and other CJK codecs into Python would be a prerequisite for
implementing phase 2.

Regards,
Martin


From martin@v.loewis.de  Thu Mar  7 19:48:11 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 07 Mar 2002 20:48:11 +0100
Subject: [I18n-sig] PEP 263 and Japanese native encodings
In-Reply-To: <3C87ACF6.6BDE7924@lemburg.com>
References: <LNBBLJKPBEHFEDALKOLCIECKOCAA.tim.one@comcast.net>
 <3C87ACF6.6BDE7924@lemburg.com>
Message-ID: <m3n0xkduro.fsf@mira.informatik.hu-berlin.de>

"M.-A. Lemburg" <mal@lemburg.com> writes:

> Martin's patch leaves these "minor" issues to the tokenizer 
> and that's good :-) 
> 
> I only wanted to give a very simple
> example of what the original idea was when I added "ASCII
> compatible encoding" to the PEP -- basically to simplify
> the coding parsing part.

In my implementation, the "ASCII superset" restriction is stronger,
though: the tokenizer needs to find the end of a string without
decoding it. That is not possible for some of the encodings that pass
your "ASCII superset" test.

Regards,
Martin


From perky@fallin.lv  Thu Mar  7 22:59:37 2002
From: perky@fallin.lv (Hye-Shik Chang)
Date: Fri, 8 Mar 2002 07:59:37 +0900
Subject: [I18n-sig] KoreanCodecs 2.0 released
Message-ID: <20020308075937.A24873@fallin.lv>

Hello!

I've released KoreanCodecs 2.0.
It is reimplemented based on JapaneseCodecs 1.4.

Supported Charsets:
 euc-kr   (aliases: ksc5601, ksx1001)
 cp949    (aliases: uhc, ms949)
 iso-2022-kr
 johab
 unijohab
 qwerty2bul   (aliases: 2bul)

Additional Utility:
 korean.hangul : Korean character analyzer


Some of those charsets doesn't have StreamWriter/Reader yet.
And, it has only pure python implementation now.
(Sorry (: I'll add C impl. soon.)

http://sourceforge.net/projects/koco


If you use FreeBSD, just do

 # cd /usr/ports/korean/pycodec
 # make install clean


Ciao.

--
Hye-Shik Chang <perky@fallin.lv>
Yonsei University, Seoul


From goodger@users.sourceforge.net  Fri Mar  8 02:27:09 2002
From: goodger@users.sourceforge.net (David Goodger)
Date: Thu, 07 Mar 2002 21:27:09 -0500
Subject: [I18n-sig] raw-unicode-escape encoding
In-Reply-To: <m3r8mwaktx.fsf@mira.informatik.hu-berlin.de>
Message-ID: <B8AD8BAC.1FADE%goodger@users.sourceforge.net>

[David Goodger]
> > Note that although the characters are ordinal > 127, they don't
> > get converted into '\\uXXXX' escapes. It seems that the
> > 'raw-unicode-escape' codec is assuming latin-1 for output.

[Martin v. Loewis]
> Correct. raw-unicode-escape brings the Unicode string into a form
> suitable for usage in Python source code. In Python source code,
> bytes in range(128,256) are treated as Latin-1, regardless of your
> system encoding.

That seems contrary to the Python Reference Manual, chapter 2,
`Lexical analysis`__:

    Future compatibility note: It may be tempting to assume that the
    character set for 8-bit characters is ISO Latin-1 ...
    ... it is unwise to assume either Latin-1 or UTF-8, even though
    the current implementation appears to favor Latin-1. This applies
    both to the source character set and the run-time character set.

    __ http://www.python.org/doc/current/ref/lexical.html

"a form suitable for usage in Python source code": that's exactly what
I want. Cross-platform compatibility requires 7-bit ASCII source code.
The raw-unicode-escape codec produces 8-bit Latin-1, which doesn't
survive the trip to MacOS.

> > But my default encoding is 'ascii'; doesn't that mean 7-bit ASCII?

> I think the raw-unicode-escape codec should be changed to use hex
> escapes for this range.

+1. But '\xa7' or '\u00a7' escapes? Using the former (which the
unicode-escape codec currently does) assumes Latin-1 as the native
encoding. Hex escapes ('\x##') know nothing about the encoding; they
just produce raw bytes. Shouldn't unicode escapes always be of the
'\u####' variety?

For that matter, shouldn't the internal representation distinguish? ::

    >>> u'\u2020\u00a7'
    u'\u2020\xa7'

If I'm not mistaken, '\xa7' is *not* the same as '\u00a7'.

> > Is this a bug? I'll open a bug report if it is. Any workarounds?
>=20
> It is not really a bug. Does it cause problems for you?

Yes. In the Docutils test suite, most of the tests are data-driven
from (input, expected output) pairs. Here's an example::

    # input:
    ["""\
    [#autolabel]_
   =20
    .. [#autolabel] text
    """,
    # expected output (indented pseudo-xml for readability):
    """\
    <document>
        <paragraph>
            <footnote_reference auto=3D"1" refname=3D"autolabel">
                1
        <footnote auto=3D"1" id=3D"id1" name=3D"autolabel">
            <label>
                1
            <paragraph>
                text
    """],

The test takes the input, runs it through the system, and compares it
to the expected output. If there is any difference, the actual &
expected output are run through difflib.Differ().compare() and printed
out.

This works fine for 7-bit strings. If the expected output contains any
unicode, I have to escape it. Fine. There's no problem for ord(char)
>=3D 256, but it breaks for ord(char) >=3D 127. Look at the label of
footnote 4::

    ["""\
    A sequence of symbol footnotes:
   =20
    .. [*] Auto-symbol footnote 1.
    .. [*] Auto-symbol footnote 2.
    .. [*] Auto-symbol footnote 3.
    .. [*] Auto-symbol footnote 4.
    """,
    """\
    <document>
        <paragraph>
            A sequence of symbol footnotes:
        <footnote auto=3D"*" id=3D"id1">
            <label>
                *
            <paragraph>
                Auto-symbol footnote 1.
        <footnote auto=3D"*" id=3D"id2">
            <label>
                \\u2020
            <paragraph>
                Auto-symbol footnote 2.
        <footnote auto=3D"*" id=3D"id3">
            <label>
                \\u2021
            <paragraph>
                Auto-symbol footnote 3.
        <footnote auto=3D"*" id=3D"id4">
            <label>
                =DF
            <paragraph>
                Auto-symbol footnote 4.
    """],

The \xa7 breaks going cross-platform. It doesn't produce a &sect; on
my Mac.

[Marc-Andre Lemburg]
> The unicode-escape codecs (raw and normal) both extend the
> Latin-1 encoding with a few escaped characters.

Why Latin-1 and not 7-bit ASCII? Is that documented anywhere?

> You should first get a feeling for what kind of mapping
> you expect, i.e. which characters should be escaped or not.

I would like to use a codec which escapes all char for ord(char)
from 128 up, but leaves all 7-bit ASCII alone. Is there any such beast?

--=20
David Goodger    goodger@users.sourceforge.net    Open-source projects:
 - Python Docstring Processing System: http://docstring.sourceforge.net
 - reStructuredText: http://structuredtext.sourceforge.net
 - The Go Tools Project: http://gotools.sourceforge.net


From martin@v.loewis.de  Fri Mar  8 09:03:34 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 08 Mar 2002 10:03:34 +0100
Subject: [I18n-sig] raw-unicode-escape encoding
In-Reply-To: <B8AD8BAC.1FADE%goodger@users.sourceforge.net>
References: <B8AD8BAC.1FADE%goodger@users.sourceforge.net>
Message-ID: <m366474ejd.fsf@mira.informatik.hu-berlin.de>

David Goodger <goodger@users.sourceforge.net> writes:

> [Martin v. Loewis]
> > Correct. raw-unicode-escape brings the Unicode string into a form
> > suitable for usage in Python source code. In Python source code,
> > bytes in range(128,256) are treated as Latin-1, regardless of your
> > system encoding.
> 
> That seems contrary to the Python Reference Manual, chapter 2,
> `Lexical analysis`__:
> 
>     Future compatibility note: It may be tempting to assume that the
>     character set for 8-bit characters is ISO Latin-1 ...
>     ... it is unwise to assume either Latin-1 or UTF-8, even though
>     the current implementation appears to favor Latin-1. This applies
>     both to the source character set and the run-time character set.

What contradiction do you see? The documentation says it is unwise to
assume anything non-ASCII, and that is certainly the case: it is not
wise to assume that.

I said that bytes above 128 are treated as Latin-1 in the current
implementation, and that is also a fact. Even though this is a fact,
is is not wise to make use of this fact - for example, PEP 263 will
change that; Python 2.3 likely will not assume that bytes above 128
are Latin-1, but will give a warning instead.


> "a form suitable for usage in Python source code": that's exactly what
> I want. Cross-platform compatibility requires 7-bit ASCII source code.
> The raw-unicode-escape codec produces 8-bit Latin-1, which doesn't
> survive the trip to MacOS.

If you put it into Python source, it sure does survive the trip to
MacOS - assuming you manage not to convert the Python source file when
putting it on a Mac disk.

You Mac text editor will not display it in the same way as Python
interprets it, but that can't change the way how the current Python
implementation interprets it.

> +1. But '\xa7' or '\u00a7' escapes? Using the former (which the
> unicode-escape codec currently does) assumes Latin-1 as the native
> encoding. Hex escapes ('\x##') know nothing about the encoding; they
> just produce raw bytes. Shouldn't unicode escapes always be of the
> '\u####' variety?

No. That makes absolutely no difference. \xXY, in a Unicode literal,
means "Unicode character with the numeric value 16*X + Y". \uVWXY mean
"Unicode character with the numeric value 4096*V+256*W+16*X+Y" (I
leave defining \UKLMNOPQR as an exercise :-). So if V and W are both
0, then \u00XY is precisely the same as \xXY. No assumption of Latin-1
here, anywhere.

> For that matter, shouldn't the internal representation distinguish? ::
> 
>     >>> u'\u2020\u00a7'
>     u'\u2020\xa7'

No, for the same reason.

> If I'm not mistaken, '\xa7' is *not* the same as '\u00a7'.

You are mistaken.

> Yes. In the Docutils test suite, most of the tests are data-driven
> from (input, expected output) pairs. Here's an example::
[...]
> This works fine for 7-bit strings. If the expected output contains any
> unicode, I have to escape it. 

It appears then that the raw-unicode-escape codec is not suited for
your application; you will probably need to write your own. This has
the advantage that you'll know exactly what it is doing.

This is a general principle: The Python "pretty printer" algorithms"
have a specific semantic, suitable for a specific application
(normally: give some kind of output to the user of the interactive
prompt). People try to use those algorithms for different things all
the time, and complain if they don't do what they expect. In general,
the bug is not in Python, but in the application: to use the
algorithm, accept what it does in border cases. 

> [Marc-Andre Lemburg]
> > The unicode-escape codecs (raw and normal) both extend the
> > Latin-1 encoding with a few escaped characters.
> 
> Why Latin-1 and not 7-bit ASCII? Is that documented anywhere?

The unicode-escape codec is not documented at all. How do you know it
exists?

> I would like to use a codec which escapes all char for ord(char)
> from 128 up, but leaves all 7-bit ASCII alone. Is there any such
> beast?

"utf-7" fits that description, except that it will escape
'+'. Something that produces ASCII and does not escape anything does
not exist - you normally have to escape the escape character.

Regards,
Martin


From mal@lemburg.com  Fri Mar  8 09:47:01 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 08 Mar 2002 10:47:01 +0100
Subject: [I18n-sig] PEP 263 and Japanese native encodings
References: <LNBBLJKPBEHFEDALKOLCIECKOCAA.tim.one@comcast.net>
 <3C87ACF6.6BDE7924@lemburg.com> <m3n0xkduro.fsf@mira.informatik.hu-berlin.de>
Message-ID: <3C888895.E4720F3F@lemburg.com>

"Martin v. Loewis" wrote:
> 
> "M.-A. Lemburg" <mal@lemburg.com> writes:
> 
> > Martin's patch leaves these "minor" issues to the tokenizer
> > and that's good :-)
> >
> > I only wanted to give a very simple
> > example of what the original idea was when I added "ASCII
> > compatible encoding" to the PEP -- basically to simplify
> > the coding parsing part.
> 
> In my implementation, the "ASCII superset" restriction is stronger,
> though: the tokenizer needs to find the end of a string without
> decoding it. That is not possible for some of the encodings that pass
> your "ASCII superset" test.

I hope I have clarified the intent of the definition that was
in the PEP -- I've removed it a few days ago and replaced it
with a hopefully better description of what was meant. Your
implementation is just fine in this respect (at least from looking
at it ;-).

Are there any other common encodings which could have trouble 
with the implementation besides two or more byte encodings ?

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/


From mal@lemburg.com  Fri Mar  8 11:13:25 2002
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 08 Mar 2002 12:13:25 +0100
Subject: [I18n-sig] KoreanCodecs 2.0 released
References: <20020308075937.A24873@fallin.lv>
Message-ID: <3C889CD5.88395386@lemburg.com>

Hye-Shik Chang wrote:
> 
> Hello!
> 
> I've released KoreanCodecs 2.0.
> It is reimplemented based on JapaneseCodecs 1.4.
> 
> Supported Charsets:
>  euc-kr   (aliases: ksc5601, ksx1001)
>  cp949    (aliases: uhc, ms949)
>  iso-2022-kr
>  johab
>  unijohab
>  qwerty2bul   (aliases: 2bul)
> 
> Additional Utility:
>  korean.hangul : Korean character analyzer

Great !
 
> Some of those charsets doesn't have StreamWriter/Reader yet.
> And, it has only pure python implementation now.
> (Sorry (: I'll add C impl. soon.)
> 
> http://sourceforge.net/projects/koco
> 
> If you use FreeBSD, just do
> 
>  # cd /usr/ports/korean/pycodec
>  # make install clean

I'd add a distutils setup.py style install script to the
package -- it makes things easier for you and your users.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                   http://www.egenix.com/files/python/


From goodger@users.sourceforge.net  Fri Mar  8 13:00:02 2002
From: goodger@users.sourceforge.net (David Goodger)
Date: Fri, 08 Mar 2002 08:00:02 -0500
Subject: [I18n-sig] raw-unicode-escape encoding
In-Reply-To: <m366474ejd.fsf@mira.informatik.hu-berlin.de>
Message-ID: <B8AE2001.1FAF7%goodger@users.sourceforge.net>

[David Goodger]
>> "a form suitable for usage in Python source code": that's exactly what
>> I want. Cross-platform compatibility requires 7-bit ASCII source code.
>> The raw-unicode-escape codec produces 8-bit Latin-1, which doesn't
>> survive the trip to MacOS.

[Martin v. Loewis]
> If you put it into Python source, it sure does survive the trip to
> MacOS - assuming you manage not to convert the Python source file when
> putting it on a Mac disk.

Unfortunately, that's an assumption that's easily broken by all kinds of
well-meaning tools (browsers, unarchivers, etc.). Which is why I want a
7-bit representation.

>> If I'm not mistaken, '\xa7' is *not* the same as '\u00a7'
> 
> You are mistaken.

Would I still be mistaken if I put it like this: '\xa7' is not the same as
u'\u00a7'? (Note the '' and u''.) That's what the raw-unicode-escape coded
is doing.

> It appears then that the raw-unicode-escape codec is not suited for
> your application; you will probably need to write your own. This has
> the advantage that you'll know exactly what it is doing.

Fair enough. Thanks for clarifying. I'll try writing a codec; at the least,
it will improve my understanding of how all this fits together.

>> Why Latin-1 and not 7-bit ASCII? Is that documented anywhere?
> 
> The unicode-escape codec is not documented at all. How do you know it
> exists?

By using the source. :-)

>> I would like to use a codec which escapes all char for ord(char)
>> from 128 up, but leaves all 7-bit ASCII alone. Is there any such
>> beast?
> 
> "utf-7" fits that description, except that it will escape '+'.
> Something that produces ASCII and does not escape anything does
> not exist - you normally have to escape the escape character.

That would be acceptable; I'll give it a try.

Thank you, Martin and Marc-Andre, for your thoughtful replies.

-- 
David Goodger    goodger@users.sourceforge.net    Open-source projects:
 - Python Docstring Processing System: http://docstring.sourceforge.net
 - reStructuredText: http://structuredtext.sourceforge.net
 - The Go Tools Project: http://gotools.sourceforge.net


From perky@fallin.lv  Fri Mar  8 14:15:12 2002
From: perky@fallin.lv (Hye-Shik Chang)
Date: Fri, 8 Mar 2002 23:15:12 +0900
Subject: [I18n-sig] KoreanCodecs 2.0 released
In-Reply-To: <3C889CD5.88395386@lemburg.com>; from mal@lemburg.com on Fri, Mar 08, 2002 at 12:13:25PM +0100
References: <20020308075937.A24873@fallin.lv> <3C889CD5.88395386@lemburg.com>
Message-ID: <20020308231512.A34052@fallin.lv>

On Fri, Mar 08, 2002 at 12:13:25PM +0100, M.-A. Lemburg wrote:
> 
> I'd add a distutils setup.py style install script to the
> package -- it makes things easier for you and your users.

Actually, I wrote distutils setup.py already.
See: http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/koco/KoreanCodecs/setup.py?rev=1.4&content-type=text/vnd.viewcvs-markup

Thank you for favor. :-)

--
Hye-Shik Chang <perky@fallin.lv>
Yonsei University, Seoul


From martin@v.loewis.de  Fri Mar  8 18:01:21 2002
From: martin@v.loewis.de (Martin v. Loewis)
Date: 08 Mar 2002 19:01:21 +0100
Subject: [I18n-sig] raw-unicode-escape encoding
In-Reply-To: <B8AE2001.1FAF7%goodger@users.sourceforge.net>
References: <B8AE2001.1FAF7%goodger@users.sourceforge.net>
Message-ID: <m3k7snos5q.fsf@mira.informatik.hu-berlin.de>

David Goodger <goodger@users.sourceforge.net> writes:

> >> If I'm not mistaken, '\xa7' is *not* the same as '\u00a7'
> > 
> > You are mistaken.
> 
> Would I still be mistaken if I put it like this: '\xa7' is not the same as
> u'\u00a7'? (Note the '' and u''.) That's what the raw-unicode-escape coded
> is doing.

It depends on the meaning of "is the same". If you compare the text,
\xa7 is different from \xA7, which is different from \u00A7. In the
context of raw-unicode-escape, they are the same - both are used to
identify Unicode characters, and they identify the same character
(whether with a lower or uppercase letter, and whether in the \x, \u,
or \U notation).

Regards,
Martin


From haible@ilog.fr  Wed Mar 13 13:54:50 2002
From: haible@ilog.fr (Bruno Haible)
Date: Wed, 13 Mar 2002 14:54:50 +0100 (CET)
Subject: [I18n-sig] gettext-0.11.1 is released
Message-ID: <15503.23082.866066.584630@honolulu.ilog.fr>

It is at ftp.gnu.org (soon also its mirrors) in
gnu/gettext/gettext-0.11.1.tar.gz

New in 0.11.1:

* xgettext now also supports Python, Tcl, Awk and Glade.

* msgfmt can create (and msgunfmt can dump) Tcl message catalogs.

* msggrep has a new option -C that allows to search for strings in translator
  comments.

* Bug fixes in the gettext.m4 autoconf macros.

New in 0.11:

* New programs:
    msgattrib - attribute matching and manipulation on message catalog,
    msgcat - combines several message catalogs,
    msgconv - character set conversion for message catalog,
    msgen - create English message catalog,
    msgexec - process translations of message catalog,
    msgfilter - edit translations of message catalog,
    msggrep - pattern matching on message catalog,
    msginit - initialize a message catalog,
    msguniq - unify duplicate translations in message catalog.

* msgfmt can create (and msgunfmt can dump) Java ResourceBundles.

* xgettext now also supports Lisp, Emacs Lisp, librep, Java, ObjectPascal,
  YCP.

* The tools now know about format strings in languages other than C.
  They recognize new message flags named lisp-format, elisp-format,
  librep-format, smalltalk-format, java-format, python-format, ycp-format.
  When such a flag is present, the msgfmt program verifies the consistency
  of the translated and the untranslated format string.

* The msgfmt command line options have changed.  Option -c now also checks
  the header entry, a check which was previously activated through -v.
  Option -C corresponds to the compatibility checks previously activated
  through -v -v.  Option -v now only increases verbosity and doesn't
  influence whether msgfmt succeeds or fails.  A new option
  --check-accelerators is useful for GUI menu item translations.

* msgcomm now writes its results to standard output by default. The options
  -d/--default-domain and -p/--output-dir have been removed.

* Manual pages for all the programs have been added.

* PO mode changes:
  - New key bindings for 'po-previous-fuzzy-entry',
    'po-previous-obsolete-entry', 'po-previous-translated-entry',
    'po-previous-untranslated', 'po-undo', 'po-other-window', and
    'po-select-auxiliary'.
  - Support for merging two message catalogs, based on msgcat and ediff.

* A fuzzy attribute of the header entry of a message catalog is now ignored
  by the tools, i.e. it is used even if marked fuzzy.

* gettextize has a new option --intl which determines whether a copy of the
  intl directory is included in the package.

* The Makefile variable @INTLLIBS@ is deprecated. It is replaced with
  @LIBINTL@ (in projects without libtool) or @LTLIBINTL@ (in projects with
  libtool).

* New packaging hints for binary package distributors. See file PACKAGING.

* New documentation sections:
  - Manipulating
  - po/LINGUAS
  - po/Makevars
  - lib/gettext.h
  - autoconf macros
  - Other Programming Languages


Happy internationalization! Bonne francisation! Frohes Eindeutschen!

                      Bruno


From perky@fallin.lv  Sat Mar 16 03:28:54 2002
From: perky@fallin.lv (Hye-Shik Chang)
Date: Sat, 16 Mar 2002 12:28:54 +0900
Subject: [I18n-sig] KoreanCodecs 2.0.2 is released
Message-ID: <20020316122854.A96548@fallin.lv>

Hello!

I've just released KoreanCodecs 2.0.2.

ChangeLog from 2.0:
 - changed to Python License
 - added C codec implementations for EUC-KR, CP949
   (faster than iconvcodec! *wink*)

Since EUC-KR and CP949 can cover 99% of korean users,
other codecs will remain only pure-python.


Project Homepage: http://sourceforge.net/projects/koco
Download URL:     http://sourceforge.net/project/showfiles.php?group_id=46747

Thanks!


ps. Can I put an advertisement for this codecs package
    in ${PYDOCS}/lib/module-codecs.html as "See Also"?

--
Hye-Shik Chang <perky@fallin.lv>
Yonsei University, Seoul


From Misha.Wolf@reuters.com  Fri Mar 22 18:25:41 2002
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Fri, 22 Mar 2002 18:25:41 +0000
Subject: [I18n-sig] 21st Unicode Conference, May 2002, Dublin,
 Ireland -- Seven weeks to go!
Message-ID: <T59cbb37a70c407b707800@reuters.com>

>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Register now! <<<<<<<<<<<<<<<<<<<<<<<<<<<

         Twenty-first International Unicode Conference (IUC21)
        Unicode, Localization and the Web: The Global Connection
                    http://www.unicode.org/iuc/iuc21
                             14-17 May 2002
                             Dublin, Ireland

>>>>>>>>>>>>>>>>>>>>>>>>>> Just 7 weeks to go! <<<<<<<<<<<<<<<<<<<<<<<<

NEWS

 * Visit the Conference Web site ( http://www.unicode.org/iuc/iuc21 )
   to check the Conference program and register.  To help you choose
   Conference sessions, we've included abstracts of talks and speakers'
   biographies.

 * Choose between the 13 tutorials, 51 talks, 2 panels and 2 keynotes,
   presented by over 60 speakers.

 * Find out about the Workshop on Standards in Localisation, organised
   by the Localisation Research Centre (LRC), and taking place in the
   same venue on May 13 -- See: http://lrc.csis.ul.ie

CONFERENCE SPONSORS

   Agfa Monotype Corporation
   Basis Technology Corporation
   Localisation Research Centre
   Microsoft Corporation
   Reuters Ltd
   Sun Microsystems, Inc.
   World Wide Web Consortium (W3C)

GLOBAL COMPUTING SHOWCASE

   Visit the Showcase to find out more about products supporting the
   Unicode Standard, and products and services that can help you
   globalize/localize your software, documentation and Internet content.
   For details, visit the Conference Web site.

CONFERENCE VENUE

The Conference will take place at:

   The Burlington Hotel
   Upper Leeson Street
   Dublin 4, Ireland

   Tel: (+353 1) 660 5222
   Fax: (+353 1) 660 8496

CONFERENCE MANAGEMENT

   Global Meeting Services Inc.
   8949 Lombard Place, #416
   San Diego, CA 92122, USA

   Tel: +1 858 638 0206 (voice)
        +1 858 638 0504 (fax)

   Email: info@global-conference.com
      or: conference@unicode.org

                           *  *  *  *  *

Unicode(r) and the Unicode logo are registered trademarks of Unicode,
Inc.  Used with permission.


------------------------------------------------------------- ---
        Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of  the  individual
sender,  except  where  the sender specifically states them to be
the views of Reuters Ltd.


From Misha.Wolf@reuters.com  Thu Mar 28 19:15:08 2002
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Thu, 28 Mar 2002 19:15:08 +0000
Subject: [I18n-sig] Call for Papers - 22nd Unicode Conference - September 2002 - San Jose,
 CA
Message-ID: <T59eac7d4fec407b7068a8@reuters.com>

         Twenty-second International Unicode Conference (IUC22)
             Unicode and the Web: Evolution or Revolution?
                    http://www.unicode.org/iuc/iuc22
                          September 9-13, 2002
                          San Jose, California
***********************************************************************
Call for Papers >>> Just 6 weeks to go >>> Send in your submission now!
***********************************************************************
                     Submissions due: May 10, 2002
                    Notification date: May 31, 2002
                  Completed papers due : June 21, 2002
            (in electronic form and camera-ready paper form)

The software industry continues its rapid growth and change.  In this
year alone, Unicode 3.2 was released and several new proposals for the
Internet and the World Wide Web were promoted to standards.  Web
Services is the latest buzz.  Are the vendors of software that support
these technologies keeping up?  How can you be sure that you are
deploying software components that work well together today and in the
future?  This Conference is where you go to find out.  Experts will
describe the latest changes to the Unicode standard and the other
standards used for e-business today.  You will also learn about the best
practices for utilizing, integrating and deploying these technologies
based on real-world examples and experience.  Demonstrations are often
provided.

We invite you to submit papers which either define the software of
tomorrow, demonstrate best practice with today's software, or articulate
problems that must be solved before further advances can occur.  Papers
should discuss subjects in the context of Unicode, internationalization
or localization.  You can view the programs of previous Conferences at:
http://www.unicode.org/unicode/conference/about-conf.html

Conference attendees are generally involved in either the development,
deployment or use of Unicode software or content, or the globalization
of software and the Internet.  They include managers, software engineers,
systems analysts, font designers, graphic designers, content developers,
technical writers, and product marketing personnel.

THEME & TOPICS

Computing with Unicode is the overall theme of the Conference.
Presentations should be geared towards a technical audience.  Topics of
interest include, but are not limited to, the following (within the
context of Unicode, internationalization or localization):

- Web Services
- XML and related specifications
- The World Wide Web (WWW)
- Portable devices
- UTFs: Not enough or too many?
- Security concerns e.g. Avoiding the spoofing of UTF-8 data
- Impact of new encoding standards
- Implementing Unicode: Practical and political hurdles
- Implementing new features of recent versions of Unicode
- Algorithms (e.g. normalization, collation, bidirectional)
- Programming languages and libraries (Java, Perl, et al)
- Search engines
- Library and archival concerns
- Operating systems
- Databases
- Large scale networks
- Government applications
- Evaluations (case studies, usability studies)
- Natural language processing
- Migrating legacy applications
- Cross platform issues
- Printing and imaging
- Optimizing performance of systems and applications
- Testing applications
- Business models for software development (e.g. Open source)

SESSIONS

The Conference Program will provide a wide range of sessions including:
- Keynote presentations
- Workshops/Tutorials
- Technical presentations
- Panel sessions

All sessions except the Workshops/Tutorials will be of 40 minute
duration.  In some cases, two consecutive 40 minute program slots may be
devoted to a single session.

The Workshops/Tutorials will each last approximately three hours.  They
should be designed to stimulate discussion and participation, using
slides and demonstrations.

PUBLICITY

If your paper is accepted, your details will be included in the
Conference brochure and Web pages and the paper itself will appear on a
Conference CD, with an optional printed book of Conference Proceedings.

CONFERENCE LANGUAGE

The Conference language is English.  All submissions, papers and
presentations should be provided in English.

SUBMISSIONS

Submissions MUST contain:

1. An abstract of 150-250 words, consisting of statement of purpose,
   paper description, and your conclusions or final summary.

2. A brief biography.

3. The details listed below:

   SESSION TITLE:             _________________________________________

                              _________________________________________

   TITLE (eg Dr/Mr/Mrs/Ms):   _________________________________________

   NAME:                      _________________________________________

   JOB TITLE:                 _________________________________________

   ORGANIZATION/AFFILIATION:  _________________________________________

   ORGANIZATION'S WWW URL:    _________________________________________

   OWN WWW URL:               _________________________________________

   ADDRESS FOR PAPER MAIL:    _________________________________________

                              _________________________________________

                              _________________________________________

   TELEPHONE:                 _________________________________________

   FAX:                       _________________________________________

   E-MAIL ADDRESS:            _________________________________________

   TYPE OF SESSION:           [ ] Keynote presentation

                              [ ] Workshop/Tutorial

                              [ ] Technical presentation

                              [ ] Panel

   PANELISTS (if Panel):      _________________________________________

                              _________________________________________

                              _________________________________________

                              _________________________________________

                              _________________________________________

                              _________________________________________

                              _________________________________________

                              _________________________________________

   TARGET AUDIENCE (you may select more than one category):

                              [ ] Content Developers

                              [ ] Font Designers

                              [ ] Graphic Designers

                              [ ] Managers

                              [ ] Marketers

                              [ ] Software Engineers

                              [ ] Systems Analysts

                              [ ] Technical Writers

                              [ ] Others (please specify):

                              _________________________________________

                              _________________________________________

   LEVEL OF SESSION (you may select more than one category):

                              [ ] Beginner

                              [ ] Intermediate

                              [ ] Advanced

Submissions should be sent by e-mail to either of the following
addresses:

   papers@unicode.org

   info@global-conference.com

They should use ASCII, non-compressed text and the following subject
line:

   Proposal for IUC 22

If desired, a copy of the submission may also be sent by post to:

   22nd International Unicode Conference
   c/o Global Meeting Services, Inc.
   8949 Lombard Place #416
   San Diego, CA  92122  USA
   Tel: +1 858 638 0206
   Fax: +1 858 638 0504

CONFERENCE PROCEEDINGS

All Conference papers will be published on CD.  Printed proceedings will
be offered as an option.

EXHIBIT OPPORTUNITIES

The Conference will have an Exhibition area for corporations or
individuals who wish to display and promote their products, technology
and/or services.

Every effort will be made to provide maximum exposure and advertising.

Exhibit space is limited.  For further information or to reserve a
place, please contact Global Meeting Services at the above location.

CONFERENCE VENUE

   DoubleTree Hotel San Jose
   2050 Gateway Place
   San Jose, CA 95110
   USA

   Telephone number:  +1-408-453-4000
   Facsimile number:  +1-408-437-2898

THE UNICODE CONSORTIUM

The Unicode Consortium was founded as a non-profit organization in 1991.
It is dedicated to the development, maintenance and promotion of The
Unicode Standard, a worldwide character encoding.  The Unicode Standard
encodes the characters of the world's principal scripts and languages,
and is code-for-code identical to the international standard ISO/IEC
10646.  In addition to cooperating with ISO on the future development of
ISO/IEC 10646, the Consortium is responsible for providing character
properties and algorithms for use in implementations.  Today the
membership base of the Unicode Consortium includes major computer
corporations, software producers, database vendors, research
institutions, international agencies and various user groups.

For further information on the Unicode Standard, visit the Unicode Web
site at http://www.unicode.org or e-mail <info@unicode.org>

                           *  *  *  *  *

Unicode(r) and the Unicode logo are registered trademarks of Unicode,
Inc.  Used with permission.


-------------------------------------------------------------- --
        Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of  the  individual
sender,  except  where  the sender specifically states them to be
the views of Reuters Ltd.