From mal@lemburg.com  Sat Apr  1 18:43:05 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Sat, 01 Apr 2000 20:43:05 +0200
Subject: [I18n-sig] Test Suite for the Unicode codecs
References: <38E52399.19220D0@lemburg.com> <38e5d0b2.10984324@post.demon.co.uk>
Message-ID: <38E64339.CC0B0FB@lemburg.com>

Andy Robinson wrote:
> 
> On Sat, 01 Apr 2000 00:15:53 +0200, you wrote:
> 
> >I would like to add some more testing to the mapping codecs
> >in the Python encodings package. Right now I can only test
> >for round-trips of lower character ordinal ranges and even
> >those tests fail for a couple of encodings.
> >
> >Does anyone have access to some reference test suite for
> >these mappings ? The mapping codec is probably not the
> >cause for these errors. Perhaps the maps themselves
> >aren't of high enough quality or maybe some mappings
> >just cannot provide round-trip safety...
> >
> I can't give specifics off the top of my head, but mappings not giving
> round trips is quite common, especially with corporate character sets.
> We always handled this by framing questions differently and saying
> 'what is the subset of a map that gives a full round-trip, and which
> bits of my data fall outside it', and trying to get some printed code
> chart to show the results; then you can quickly see if the results
> make sense.  If you have that knowledge, you could then build
> assertions into a python-only test suite.

That would be great of course... but how do we get native
script readers for all those code pages ?
 
> For testing, I think the best approach is to compare output to another
> well-known mapping utility.  The most convenient I know of is
> uniconv.exe from http://www.basistech.com/ - not Open Source and
> Windows-only, but it is a straightforward goal for us to write a
> uniconv.py that perfectly mimics its behaviour.

Ok, I've just downloaded it (it's a bit hidden as Demo of
their C++ Unicode class lib) and will give it a try next week.
 
> I'm in the middle of a 'work crisis' at the moment, and I know I'm not
> really pulling my weight.  Does anyone have a few hours to help out
> with testing?  If so I could outline the kind of test program that
> would help us quickly validate the existing mappings, and help with
> any new ones.
> 
> Marc-Andre, do you have any preferences for where a test suite and
> bunch of add-on tools live?  Do you want something which fits into the
> standard distribution, or can we handle it outside?

Hmm, tests for the builtin codecs should live in Lib/test
with the output in Lib/test/output. Tools etc. are probably
best placed somewhere into the Tools/ directory (e.g. the
gencodec.py script lives in Tools/scripts). Perhaps we need
a separate Tools/unicode if there are going to many different
scripts...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From mal@lemburg.com  Mon Apr  3 09:31:45 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 03 Apr 2000 10:31:45 +0200
Subject: [I18n-sig] Test Suite for the Unicode codecs
References: <38E52399.19220D0@lemburg.com> <38e5d0b2.10984324@post.demon.co.uk> <38E64339.CC0B0FB@lemburg.com> <007201bf9cb4$74b4b730$01ac2ac0@boulder>
Message-ID: <38E856F1.FBF66AAE@lemburg.com>

[CC:ing to i18n-sig -- hope this is ok]

Andy Robinson wrote:
> 
> >
> > That would be great of course... but how do we get native
> > script readers for all those code pages ?
> I suspect we won't.  Unicode fonts with all 45k glyphs are not exactly
> common; there is one, but it was full of holes last time I checked.  There
> are two approaches to viewing the CJK ones:
> 1. Use IE5 or Netscape.  IE5 comes with lots of font packs for most
> languages, especially the Asian ones.  One makes up preformatted text files
> designed to mirror the vendor or standards' organisation's code chart, puts
> it through a round trip, and tells the browser to display it - possibly side
> by side with the original.  If you feel clever, you can use tables and
> highlight things which fail the round trip.  Of course, this depends on the
> fonts you have installed, and these vary
> 
> 2. (Some months off)  Use Acrobat 4.0 and the Language Packs from Adobe.
> These are the first really platform-independent vewing technology; I have
> wrapped up the Japanese one in ReportLab and used it very successfully at
> Fidelity Investments last year to prove round trips from AS400, but have to
> rewrite that code as it was done in-house for them.  I write a loop to print
> about fifteen pages of charts which are laid out exactly like the relevant
> Appendix in "CJKV Information Processing", run it through some
> transformations, then sit staring at all 6879 glyphs for a couple of hours.
> Sometimes, while bored, I did plots to show how code points mapped from one
> encoding to another; we had to reverse engineer an AS400 encoding.   Adobe's
> CID fonts include their own mapping tables and conversion at the PostScript
> level; If I ask for the font "Mincho-UTF8",  I get it encoded that way and
> can feed it UTF8 strings; if I as for the font "Mincho-SJIS" I get a
> Shift-JIS encoded font.

This looks like an awful lot of work. Isn't there some better
way to get this done ? (There might be a problem due to different
composition of characters, but I think we could handle it by implementing
the normalization algorithmn for Unicode.)
 
> This is actually my main interest in the Unicode stuff; to build a global
> reporting engine, we have to handle data in any encoding and feed it to the
> font engine in an encoding the font can handle.
> 
> The great thing about PDF code charts is that they are immutable and not
> dependent on your PC setup.
> 
> >
> > > For testing, I think the best approach is to compare output to another
> > > well-known mapping utility.  The most convenient I know of is
> > > uniconv.exe from http://www.basistech.com/ - not Open Source and
> > > Windows-only, but it is a straightforward goal for us to write a
> > > uniconv.py that perfectly mimics its behaviour.
> >
> > Ok, I've just downloaded it (it's a bit hidden as Demo of
> > their C++ Unicode class lib) and will give it a try next week.
> >
> > > Marc-Andre, do you have any preferences for where a test suite and
> > > bunch of add-on tools live?  Do you want something which fits into the
> > > standard distribution, or can we handle it outside?
> >
> > Hmm, tests for the builtin codecs should live in Lib/test
> > with the output in Lib/test/output. Tools etc. are probably
> > best placed somewhere into the Tools/ directory (e.g. the
> > gencodec.py script lives in Tools/scripts). Perhaps we need
> > a separate Tools/unicode if there are going to many different
> > scripts...
> I must admit, I was thinking of an actual web server test framework which
> kept a database of sample text files, did round trip tests on demand, and
> could hand out HTML and PDF files to anyone who asked - probably a bit much
> for the standard Python library.  One needs knowledge of each individual
> code page and some quite devious test files to test out double-byte codecs.
> For single-byte, we need a reliable way to see all the code points before we
> dare rely on full round trip tests and assertions.   I think we need some
> separate project on starship, sourceforge or wherever to mess around with
> this stuff, and then you can decide what is worth including in the main
> distribution.

Ok. For now I'll leave the current cp codecs in place and
simply wait for people reporting bugs in the mapping tables...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From andy@reportlab.com  Wed Apr  5 17:24:56 2000
From: andy@reportlab.com (Andy Robinson)
Date: Wed, 5 Apr 2000 17:24:56 +0100
Subject: [I18n-sig] Unicode Tutorial (Slow) Progress
Message-ID: <PGECLPOBGNBNKHNAGIJHCEOFCAAA.andy@reportlab.com>

I'm part way through a tutorial at long last.  My own work is pretty poor so
far, but it DOES include Marc-Andre's 'console session' demos at the bottom
which show the current usage.

http://www.reportlab.com/i18n/python_unicode_tutorial.html

If anyone can suggest topics I should cover (apart from the obvious one of
using every new features at least once) or simple relevant examples, I'll
try to work them in over the coming weeks.

- Andy Robinson


From guido@python.org  Wed Apr  5 18:55:22 2000
From: guido@python.org (Guido van Rossum)
Date: Wed, 05 Apr 2000 13:55:22 -0400
Subject: [I18n-sig] Unicode Tutorial (Slow) Progress
In-Reply-To: Your message of "Wed, 05 Apr 2000 17:24:56 BST."
 <PGECLPOBGNBNKHNAGIJHCEOFCAAA.andy@reportlab.com>
References: <PGECLPOBGNBNKHNAGIJHCEOFCAAA.andy@reportlab.com>
Message-ID: <200004051755.NAA16668@eric.cnri.reston.va.us>

> I'm part way through a tutorial at long last.  My own work is pretty poor so
> far, but it DOES include Marc-Andre's 'console session' demos at the bottom
> which show the current usage.
> 
> http://www.reportlab.com/i18n/python_unicode_tutorial.html

Thanks!  Added to the i18n-sig home page *and* to the Python 1.6 page.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From mal@lemburg.com  Wed Apr  5 19:41:30 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 05 Apr 2000 20:41:30 +0200
Subject: [I18n-sig] Unicode Tutorial (Slow) Progress
References: <PGECLPOBGNBNKHNAGIJHCEOFCAAA.andy@reportlab.com>
Message-ID: <38EB88DA.9D391700@lemburg.com>

Andy Robinson wrote:
> 
> I'm part way through a tutorial at long last.  My own work is pretty poor so
> far, but it DOES include Marc-Andre's 'console session' demos at the bottom
> which show the current usage.
> 
> http://www.reportlab.com/i18n/python_unicode_tutorial.html
> 
> If anyone can suggest topics I should cover (apart from the obvious one of
> using every new features at least once) or simple relevant examples, I'll
> try to work them in over the coming weeks.

Looks great... a bit much exposure, maybe ;-)

Note that the stackable stream example will need a small bit
of updating (the return is wrong -- the API was changed since I
programmed the example):

import codecs,sys

# Convert Unicode -> UTF-8
(e,d,sr,sw) = codecs.lookup('utf-8')
unicode_to_utf8 = sw(sys.stdout)

# Convert Latin-1 -> Unicode during .write
(e,d,sr,sw) = codecs.lookup('latin-1')
class StreamRewriter(codecs.StreamWriter):

    encode = e
    decode = d

    def write(self,object):

        """ Writes the object's contents encoded to self.stream
            and returns the number of bytes written.
        """
        data,consumed = self.decode(object,self.errors)
        self.stream.write(data)
    
latin1_to_utf8 = StreamRewriter(unicode_to_utf8)

# Now install
sys.stdout = latin1_to_utf8

# All subsequent prints will output Latin-1 strings using UTF-8
# characters...
print 'Hello World !'
print 'Héllò Wörld !'
print 'ÄÖÜäöüß'


-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From mal@lemburg.com  Wed Apr  5 19:58:44 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 05 Apr 2000 20:58:44 +0200
Subject: [I18n-sig] Unicode Tutorial (Slow) Progress
References: <PGECLPOBGNBNKHNAGIJHCEOFCAAA.andy@reportlab.com> <38EB88DA.9D391700@lemburg.com>
Message-ID: <38EB8CE4.8CB8914@lemburg.com>

I just noted a bug that appears on your page:

>>> a.encode('ascii', 'ignore')  # turn to zero and continue 
'Andr\000'

This should really give 'Andr' -- 'ignore' will simply ignore
illegal input characters.

I will submit a patch for this with the next Unicode patch set.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From guido@python.org  Mon Apr 10 15:01:58 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 10 Apr 2000 10:01:58 -0400
Subject: [I18n-sig] "takeuchi": a unicode string on IDLE shell
Message-ID: <200004101401.KAA00238@eric.cnri.reston.va.us>

Can anyone answer this?  I can reproduce the output side of this, and
I believe he's right about the input side.  Where should Python
migrate with respect to Unicode input?  I think that what Takeuchi is
getting is actually better than in Pythonwin or command line (where he
gets Shift-JIS)...

--Guido van Rossum (home page: http://www.python.org/~guido/)

------- Forwarded Message

Date:    Mon, 10 Apr 2000 22:49:45 +0900
From:    "takeuchi" <takeuchi.shohei@lab.ntt.co.jp>
To:      <guido@python.org>
Subject: a unicode string on IDLE shell 

Dear Guido,

I plaied your latest CPython(Python1.6a1) on Win98 Japanese version,
and found a strange IDLE shell behavior.

I'm not sure this is a bug or feacher, so I report my story anyway.

When typing  a Japanese string on IDLE shell with IME ,
Tk8.3 seems to convert it to a UTF-8 representation.
Unfortunatly Python does not know this,
it is dealt with an ordinary string.

>>> s = raw_input(">>>")
Type Japanese characters with IME
for example  $B$"(B
(This is the first  character of Japanese alphabet, Hiragana)
>>> s
 '\343\201\202'   # UTF-8 encoded
>>> print s
$B$"(B                     # A proper griph is appear on the screen

Print statement on IDLE shell works fine with a UTF-8 encoded
string,however,slice operation or len() does not work.
 # I know this is a right result

So I have to convert this string with unicode().

>>> u = unicode(s)
>>> u
u'\u3042'
>>> print u
$B$"(B                     # A proper griph is appear on the screen

Do you think this convertion is unconfortable ?

I think this behavior is inconsistant with command line Python
and PythonWin.

If I want  the same result on command line Python shell or PythonWin shell,
I have to code as follows;
>>> s = raw_input(">>>")
Type Japanese characters with IME
for example  $B$"(B
>>>s
'\202\240'  # Shift-JIS encoded
>>> print s
$B$"(B                     # A proper griph is appear on the screen
>>> u = unicode(s,"mbcs")  # if I use unicode(s) then UnicodeError is raised
!
>>>print u.encode("mbcs")  # if I use print u then wrong griph is appear
$B$"(B                     # A proper griph is appear on the screen

This  difference is confusing  !!
I do not have the best solution for this annoyance, I hope at least IDLE
shell and PythonWin
shell would have  the same behavior .

Thank you for reading.

Best Regards,

       takeuchi

------- End of Forwarded Message


From guido@python.org  Mon Apr 10 15:20:34 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 10 Apr 2000 10:20:34 -0400
Subject: [I18n-sig] Re: a unicode string on IDLE shell
In-Reply-To: Your message of "Mon, 10 Apr 2000 22:49:45 +0900."
 <002601bfa2f3$a1f67720$8f133c81@pflab.ecl.ntt.co.jp>
References: <002601bfa2f3$a1f67720$8f133c81@pflab.ecl.ntt.co.jp>
Message-ID: <200004101420.KAA00291@eric.cnri.reston.va.us>

> Dear Guido,
> 
> I plaied your latest CPython(Python1.6a1) on Win98 Japanese version,
> and found a strange IDLE shell behavior.
> 
> I'm not sure this is a bug or feacher, so I report my story anyway.
> 
> When typing  a Japanese string on IDLE shell with IME ,
> Tk8.3 seems to convert it to a UTF-8 representation.
> Unfortunatly Python does not know this,
> it is dealt with an ordinary string.
> 
> >>> s = raw_input(">>>")
> Type Japanese characters with IME
> for example  $B$"(B
> (This is the first  character of Japanese alphabet, Hiragana)
> >>> s
>  '\343\201\202'   # UTF-8 encoded
> >>> print s
> $B$"(B                     # A proper griph is appear on the screen
> 
> Print statement on IDLE shell works fine with a UTF-8 encoded
> string,however,slice operation or len() does not work.
>  # I know this is a right result
> 
> So I have to convert this string with unicode().
> 
> >>> u = unicode(s)
> >>> u
> u'\u3042'
> >>> print u
> $B$"(B                     # A proper griph is appear on the screen
> 
> Do you think this convertion is unconfortable ?
> 
> I think this behavior is inconsistant with command line Python
> and PythonWin.
> 
> If I want  the same result on command line Python shell or PythonWin shell,
> I have to code as follows;
> >>> s = raw_input(">>>")
> Type Japanese characters with IME
> for example  $B$"(B
> >>>s
> '\202\240'  # Shift-JIS encoded
> >>> print s
> $B$"(B                     # A proper griph is appear on the screen
> >>> u = unicode(s,"mbcs")  # if I use unicode(s) then UnicodeError is raised
> !
> >>>print u.encode("mbcs")  # if I use print u then wrong griph is appear
> $B$"(B                     # A proper griph is appear on the screen
> 
> This  difference is confusing  !!
> I do not have the best solution for this annoyance, I hope at least IDLE
> shell and PythonWin
> shell would have  the same behavior .
> 
> Thank you for reading.
> 
> Best Regards,
> 
>        takeuchi

Dear Takeuchi,

This is a feature.  Tcl/Tk uses UTF-8 to encode Unicode characters
throughout.  This perfectly matches the Python 1.6 default use of
UTF-8 when 8-bit strings are converted to Unicode.  If you want to
manipulate Unicode strings, you have to use unicode() to convert them
to Unicode string objects.

I may change IDLE so that if you enter Unicode, it will automatically
return a Unicode string.  This may break other code though.

Regarding incompatibilities with Pythonwin and command line Python:
note that there you get a different input encoding, but len() and
slicing are also broken until you convert to Unicode using the correct
encoding!  The input encoding is simply different.  I believe this
will always be an issue (but there should be a way to determine what
the input encoding should be!).

If you have more questions about this, please subscribe to the
i18n-sig mailing list (http://www.python.org/sigs/i18n-sig/) -- this
is where issues like this are discussed.  I'm cc'ing this there.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From dae_alt3@juno.com  Mon Apr 10 20:24:22 2000
From: dae_alt3@juno.com (Doug Edmunds)
Date: Mon, 10 Apr 2000 12:24:22 -0700
Subject: [I18n-sig] P1.6a1- Win98 - unicode issues
Message-ID: <20000410.122423.-454791.0.dae_alt3@juno.com>

py-ver: 1.6a
os: Win98

I am able to copy/paste Cyrillic unicode 
from Internet Explorer 5
into IDLE, without losing fonts.  Text appears 
identical to original. I can wrap the text with 
a print statement "<PUT_CYRILLIC_HERE>" 
and it will print the string.

However writing into IDLE is a problem:
if I switch to the Cyrillic keyboard layout
in IDLE, the fonts change to something, but it is not
Cyrillic (perhaps upper ascii??).

In contrast, WIn98 Wordpad (which will read/
write unicode) associates the keyboard to the
'script' of the font.  Selecting Russian keyboard
automatically switches from Courier New (Western)
to Courier New (Cyrillic).

Can this operability be extended to IDLE?
Without keyboard access 
If not, is there a way to change which font set
appears when the Russian (or other foreign) keyboard is 
selected?


Ideally I would write everything in unicode 
just as written, using WordPad (or Outlook Express, Juno, etc.)
mixing the languages thusly ( simple 1 line script)

RussianText.py
print '???????? ?????'

but IDLE won't read unicode scripts.


d.edmunds
10 April 2000

example texts from internet - 
1 original encoding was Win1251, 
IE5 browser converts to unicode, 
prints in IDLE and Wordpad
(original Win1251 lost)

???????? ????? ??????? ? ??????????? ???????? ??????????????. ??? ?????
??? ?????? ??????? ?? ????? ????? ?????? ?? ????????????? ??????? ?
??????.  

example text original encoded as KOI8-r, which IE5 browser
again turned into unicode. (original KOI8-r lost)

?????? ???? ???????? ???????? ??? ????? ? ?????? ??? ????? ????????
????????? ?????. 


-- Kindly ignore the remainder (Juno ad) which follows --


________________________________________________________________
YOU'RE PAYING TOO MUCH FOR THE INTERNET!
Juno now offers FREE Internet Access!
Try it today - there's no risk!  For your FREE software, visit:
http://dl.www.juno.com/get/tagj.


From dae_alt3@juno.com  Mon Apr 10 20:28:46 2000
From: dae_alt3@juno.com (Doug Edmunds)
Date: Mon, 10 Apr 2000 12:28:46 -0700
Subject: [I18n-sig] P1.6a1- Win98 - unicode issues
Message-ID: <20000410.122846.-454791.1.dae_alt3@juno.com>

Apparently my efforts to send unicode via Juno failed.
d.edmunds
> 10 April 2000
> 
> example texts from internet - 
> 1 original encoding was Win1251, 
> IE5 browser converts to unicode, 
> prints in IDLE and Wordpad
> (original Win1251 lost)
> 
> ???????? ????? ??????? ? ??????????? ???????? ??????????????. ??? 
> ?????
> ??? ?????? ??????? ?? ????? ????? ?????? ?? ????????????? ??????? ?
> ??????.  
> 
> example text original encoded as KOI8-r, which IE5 browser
> again turned into unicode. (original KOI8-r lost)
> 
> ?????? ???? ???????? ???????? ??? ????? ? ?????? ??? ????? ????????
> ????????? ?????. 
> 

________________________________________________________________
YOU'RE PAYING TOO MUCH FOR THE INTERNET!
Juno now offers FREE Internet Access!
Try it today - there's no risk!  For your FREE software, visit:
http://dl.www.juno.com/get/tagj.


From guido@python.org  Mon Apr 10 20:34:34 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 10 Apr 2000 15:34:34 -0400
Subject: [I18n-sig] P1.6a1- Win98 - unicode issues
In-Reply-To: Your message of "Mon, 10 Apr 2000 12:24:22 PDT."
 <20000410.122423.-454791.0.dae_alt3@juno.com>
References: <20000410.122423.-454791.0.dae_alt3@juno.com>
Message-ID: <200004101934.PAA03031@eric.cnri.reston.va.us>

> I am able to copy/paste Cyrillic unicode 
> from Internet Explorer 5
> into IDLE, without losing fonts.  Text appears 
> identical to original. I can wrap the text with 
> a print statement "<PUT_CYRILLIC_HERE>" 
> and it will print the string.
> 
> However writing into IDLE is a problem:
> if I switch to the Cyrillic keyboard layout
> in IDLE, the fonts change to something, but it is not
> Cyrillic (perhaps upper ascii??).
> 
> In contrast, WIn98 Wordpad (which will read/
> write unicode) associates the keyboard to the
> 'script' of the font.  Selecting Russian keyboard
> automatically switches from Courier New (Western)
> to Courier New (Cyrillic).
> 
> Can this operability be extended to IDLE?
> Without keyboard access 
> If not, is there a way to change which font set
> appears when the Russian (or other foreign) keyboard is 
> selected?
> 
> 
> Ideally I would write everything in unicode 
> just as written, using WordPad (or Outlook Express, Juno, etc.)
> mixing the languages thusly ( simple 1 line script)
> 
> RussianText.py
> print '???????? ?????'
> 
> but IDLE won't read unicode scripts.

Doug,

Can you see if Tcl/Tk version 8.2 or 8.3 (downloadable from
dev.scriptics.com) does what you want?  IDLE is implemented using
Tcl/Tk.  In Python 1.6a1, I'm using Tcl/Tk 8.3.0, but in 1.6a2 I will
go back to Tck/Tk 8.2.3, which appears more stable.

Tcl/Tk's "wish" application supports Unicode.  If it supports your
Cyrillic input method, the problem is with Python's interface to
Tcl/Tk.  If on the other hand the problem is the same with Tcl/Tk,
there's nothing I can do -- you'll have to ask the comp.lang.tcl
newsgroup for help!

--Guido van Rossum (home page: http://www.python.org/~guido/)


From andy@reportlab.com  Mon Apr 10 20:46:25 2000
From: andy@reportlab.com (Andy Robinson)
Date: Mon, 10 Apr 2000 20:46:25 +0100
Subject: [I18n-sig] "takeuchi": a unicode string on IDLE shell
References: <200004101401.KAA00238@eric.cnri.reston.va.us>
Message-ID: <008a01bfa325$79b92f00$01ac2ac0@boulder>

----- Original Message -----
From: Guido van Rossum <guido@python.org>
To: <python-dev@python.org>
Cc: <i18n-sig@python.org>
Sent: 10 April 2000 15:01
Subject: [I18n-sig] "takeuchi": a unicode string on IDLE shell


> Can anyone answer this?  I can reproduce the output side of this, and
> I believe he's right about the input side.  Where should Python
> migrate with respect to Unicode input?  I think that what Takeuchi is
> getting is actually better than in Pythonwin or command line (where he
> gets Shift-JIS)...
>
> --Guido van Rossum (home page: http://www.python.org/~guido/)
I think what he wants, as you hinted, is to be able to specify a 'system
wide' default encoding of Shift-JIS rather than UTF8.

UTF-8 has a certain purity in that it equally annoys every nation, and is
nobody's default encoding.  What a non-ASCII user needs is a site-wide way
of setting the default encoding used for standard input and output.  I think
this could be done with something (config file?  registry key) which site.py
looks at, and wraps stream encoders around stdin, stdout and stderr.

To illustrate why it matters, I often used to parse data files and do
queries on a Japanese name and address database; I could print my lists and
tuples in interactive mode and check they worked, or initialise functions
with correct data, since the OS uses Shift-JIS as its native encoding and I
was manipulating Shift-JIS strings.  I've lost that ability now due to the
Unicode stuff and would need to do
   >>> for thing in mylist:
   >>> ....print mylist.encode('shift_jis')
to see the contents of a database row, rather than just
   >>> mylist

BTW, Pythonwin stopped working in this regard when Scintilla came along; it
prints a byte at a time now, although kanji input is fine, as is kanji
pasted into a source file, as long as you specify a Japanese font.  However,
this is fixable - I just need to find a spare box to run Japanese windows on
and find out where the printing goes wrong.

Andy Robinson
ReportLab


From andy@reportlab.com  Mon Apr 10 20:49:16 2000
From: andy@reportlab.com (Andy Robinson)
Date: Mon, 10 Apr 2000 20:49:16 +0100
Subject: [I18n-sig] Fw: Codecs for Japanese character encodings
Message-ID: <009701bfa325$db0af8b0$01ac2ac0@boulder>

(I forwarded this to the SIG on Friday, but it failed to appear - hope you
don't all get it twice).
Tamito Kajiyama has written pure Python codecs for the two main Japanese
encodings!  Many thanks!

They include the 6879 characers in the JIS0208 character set in literal
Python dictionaries; so it should be trivial to write modified ones which
support vendor-specific extensions with a few extra characters, as long as
the extras are in Unicode.

I'm now rewriting something I did last year in-house for a customer - a
script to generate HTML tables and text files which exactly match the layout
of the code charts for JIS0208 in "CJKV Information Processing".  I ran
these through both codecs and viewed the results in IE5, and as far as I can
see the results are perfect.  I will post up my scripts when they look a bit
prettier :-)

It would be nice to put this code somewhere 'out there' so people can work
on it - not just codecs, but test suites.  How do people feel about starting
a project on www.sourceforge.net under CVS?

Since lots of us want to work on fast Asian codecs, another things we need
is a 'benchmark suite' - maybe a megabyte of Japanese text (mixing
everything - ASII, Kanji, half-width katakana?).  We can then use these pure
Python codecs as a baseline.

- Andy Robinson

----- Original Message -----
From: Tamito KAJIYAMA <kajiyama@grad.sccs.chukyo-u.ac.jp>
To: <andy@reportlab.com>
Sent: 07 April 2000 18:13
Subject: Re: Codecs for Japanese character encodings


> andy@reportlab.com (Andy Robinson) writes:
> |
> | >Based on the Python Unicode support proposal, I wrote codecs for
> | >two Japanese character encodings EUC-JP and Shift_JIS.  The codecs
> | >are available at the following location:
> | >
> |
>http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/tmp/japanese-codecs.tar.gz
> |
> | Many thanks for this!  I have copied it to the Internationalisation
> | Special Interest Group, where we discuss this stuff, and taken the
> | liberty of copying your message.
>
> Good news.  Thanks for the coordination.
>
> | We need to start coordinating a separate codecs library for
> | Asian languages, and I'd like to use this as a starting point
> | if OK with you.
>
> That's absolutely okay.  I'm grad if my codecs contribute to the
> the i18n SIG.  I joined the i18n-sig@python.org just after I got
> your message.  Please carry on the further discussion about the
> Japanese codecs (if any) in the list.
>
> Best regards,
>
> --
> KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>
>


From andy@reportlab.com  Mon Apr 10 20:49:27 2000
From: andy@reportlab.com (Andy Robinson)
Date: Mon, 10 Apr 2000 20:49:27 +0100
Subject: [I18n-sig] Codec API questions
Message-ID: <009b01bfa325$e51836b0$01ac2ac0@boulder>

I'm beginning to wonder about some issues with the unicode implementation.
Bear in mind we have seven weeks left - if anyone else has issues or
opinions, we should raise them now.

1. Set Default Encoding at site level
----------------------------------------------------
The default encoding is defined as UTF8, which will at least annoy all
nations equally :-).

It looks like you can hack this any way you want by creating your own
wrappers around stdin/stdout/stderr.  However, I wonder if Python should
make this customizable on a site basis - for example, site.py checks for
some option somewhere to say "I want to see Latin-1" or Shift-JIS or
whatever.  I often used to write scripts to parse files of names and
addresses, and use an interactive prompt to inspect the lists and tuples
directly; the convenience of typing 'print mydata' and see it properly is
nice.  What do people think?

(Or is this feature there already and I've missed it?)


2. lookup returns Codec object rather than tuple?
---------------------------------------------------------------------
I shuld have thought of this when we were in the draft stage months back,
but couldn't really get my mind around it until I had something concrete to
play with.

Right now, codecs.lookup() returns a tuple of
    (encode_func,
    decode_func,
    stream_encoder_factory,
    stream_decoder_factory)

But there is no easy way to lookup the codec object itself - indeed, no
requirement that there be one.  I'd like to see lookup always return a Codec
object
every time, which is guaranteed to have four methods as above, but might
have more.  (Note that a Codec object would have the ability to create
StreamEncoders and StreamDecoders, but would not be one by itself).

A fifth method which is potentially very useful is validate(); a sixth might
be repair().  And for each language, there could be specific ones such as
expanding half-width to full-width katakana.

Furthermore, if we can get hold of the Codec objects, we can start to reason
about codecs - for example, ask whether encodings are compatible with each
other.

3. direct conversion lookups and short-circuiting Unicode
----------------------------------------------------------------------------
This is an extension rather than a change.  I know what I want to do, but
have only the vaguest ideas how to implement it.

As noted here before, you can get from shift-JIS to EUC and vice versa
without going through Unicode.  Because these algorithmic conversions work
on the full 94x94 'kuten space' and not just the 6879 code points in the
standard, they tend to work for any vendor-specific extensions and for
user-defined characters.  Most other Asian native encodings have used a
similar scheme.

I'd like to see an 'extended API' to go from one native character set to
another.  As before, this comes in two flavours, string and stream:
    convert(string, from_enc, to_enc)   returns a string.
We also need ways to get hold of StreamReader and StreamWriter versions.
Now one can trivially build these using Unicode in the middle

codecs.lookup('from_enc', 'to_enc') would return a codec object able to
convert from one encoding to another.  By default, this would weld together
two Unicode codecs.  But if someone writes a codec to do the job directly,
there should be a way to register that.


From guido@python.org  Mon Apr 10 21:02:22 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 10 Apr 2000 16:02:22 -0400
Subject: [I18n-sig] Fw: Codecs for Japanese character encodings
In-Reply-To: Your message of "Mon, 10 Apr 2000 20:49:16 BST."
 <009701bfa325$db0af8b0$01ac2ac0@boulder>
References: <009701bfa325$db0af8b0$01ac2ac0@boulder>
Message-ID: <200004102002.QAA03212@eric.cnri.reston.va.us>

> It would be nice to put this code somewhere 'out there' so people can work
> on it - not just codecs, but test suites.  How do people feel about starting
> a project on www.sourceforge.net under CVS?

Excellent idea -- go for it!

Make sure to list it in the Vaults of Parnassus too!

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Mon Apr 10 21:45:49 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 10 Apr 2000 16:45:49 -0400
Subject: [I18n-sig] Codec API questions
In-Reply-To: Your message of "Mon, 10 Apr 2000 20:49:27 BST."
 <009b01bfa325$e51836b0$01ac2ac0@boulder>
References: <009b01bfa325$e51836b0$01ac2ac0@boulder>
Message-ID: <200004102045.QAA03303@eric.cnri.reston.va.us>

> 1. Set Default Encoding at site level
> ----------------------------------------------------
> The default encoding is defined as UTF8, which will at least annoy all
> nations equally :-).
> 
> It looks like you can hack this any way you want by creating your own
> wrappers around stdin/stdout/stderr.  However, I wonder if Python should
> make this customizable on a site basis - for example, site.py checks for
> some option somewhere to say "I want to see Latin-1" or Shift-JIS or
> whatever.  I often used to write scripts to parse files of names and
> addresses, and use an interactive prompt to inspect the lists and tuples
> directly; the convenience of typing 'print mydata' and see it properly is
> nice.  What do people think?
> 
> (Or is this feature there already and I've missed it?)

Rather than doing this per site I'd suggest doing this per user.

Surely each user (on a multi-user site) should be allowed to choose
their own apps and settings (cf. locale).

After trying to figure out how to do this, I am confused.  I can do
this:

from codecs import EncodedFile
f = EncodedFile(sys.stdout, "utf-8", "latin-1")

And then I can write Unicode strings to file f, and they are written
to sys.stdout as Latin-1.  I can also write 8-bit strings to file f,
and they are assumed to be UTF-8 and are converted properly to
Latin-1.

However, if I specify anythying except UTF-8 as the input encoding to
EncodedFile, I can't write Unicode objects to it and have something
useful happen!  It seems the Unicode is always converted to UTF-8
first, and then interpreted according to the input encode.

I think that a useful feature to have is a file-like object that
behaves as follows: if you write an 8-bit string to it, it applies a
given input encoding to turn it into Unicode; then it applies a given
output encoding to convert that to (usually multibyte) output
characters.  If you write a Unicode string to it, it skips the input
encoding (since it's already Unicode) and then applies the (same)
given output encoding.

Then I could write a program that mixes 8-bit strings and Unicode in
its output, which encodes all its 8-bit strings in (say) Latin-1.
This program must obviously be very careful when it mixes Unicode and
8-bit strings internally (always calling unicode(s, "latin-1")) to
avoid getting the default (UTF-8) encoding.  But I think this is
something you are asking for -- right?


> 2. lookup returns Codec object rather than tuple?
> ---------------------------------------------------------------------
> I shuld have thought of this when we were in the draft stage months back,
> but couldn't really get my mind around it until I had something concrete to
> play with.
> 
> Right now, codecs.lookup() returns a tuple of
>     (encode_func,
>     decode_func,
>     stream_encoder_factory,
>     stream_decoder_factory)
> 
> But there is no easy way to lookup the codec object itself - indeed, no
> requirement that there be one.  I'd like to see lookup always return a Codec
> object
> every time, which is guaranteed to have four methods as above, but might
> have more.  (Note that a Codec object would have the ability to create
> StreamEncoders and StreamDecoders, but would not be one by itself).
> 
> A fifth method which is potentially very useful is validate(); a sixth might
> be repair().  And for each language, there could be specific ones such as
> expanding half-width to full-width katakana.
> 
> Furthermore, if we can get hold of the Codec objects, we can start to reason
> about codecs - for example, ask whether encodings are compatible with each
> other.

I have no opinion on this; I've forgotten the issues.


> 3. direct conversion lookups and short-circuiting Unicode
> ----------------------------------------------------------------------------
> This is an extension rather than a change.  I know what I want to do, but
> have only the vaguest ideas how to implement it.
> 
> As noted here before, you can get from shift-JIS to EUC and vice versa
> without going through Unicode.  Because these algorithmic conversions work
> on the full 94x94 'kuten space' and not just the 6879 code points in the
> standard, they tend to work for any vendor-specific extensions and for
> user-defined characters.  Most other Asian native encodings have used a
> similar scheme.
> 
> I'd like to see an 'extended API' to go from one native character set to
> another.  As before, this comes in two flavours, string and stream:
>     convert(string, from_enc, to_enc)   returns a string.
> We also need ways to get hold of StreamReader and StreamWriter versions.
> Now one can trivially build these using Unicode in the middle
> 
> codecs.lookup('from_enc', 'to_enc') would return a codec object able to
> convert from one encoding to another.  By default, this would weld together
> two Unicode codecs.  But if someone writes a codec to do the job directly,
> there should be a way to register that.

This could be a separate module, right?  I propose that you write a
separate module (extended_codecs?) that supports such an extended
lookup function.  What functionality would you need from the core?

--Guido van Rossum (home page: http://www.python.org/~guido/)


From brian_takashi@hotmail.com  Mon Apr 10 22:09:58 2000
From: brian_takashi@hotmail.com (Brian Hooper)
Date: Mon, 10 Apr 2000 21:09:58 GMT
Subject: [I18n-sig] Codec API questions
Message-ID: <20000410210958.4338.qmail@hotmail.com>

Hi Andy,

I've been busy recently working with the Unicode API myself and
am thinking some of the same things... (BTW, for a current project
I am working with Basistech's Rosette libraries, and have actually
plugged them into a Python codec, so any Q's about how/what Basistech
does I might be able to help with).

>
>I'm beginning to wonder about some issues with the unicode implementation.
>Bear in mind we have seven weeks left - if anyone else has issues or
>opinions, we should raise them now.
>
>1. Set Default Encoding at site level
>----------------------------------------------------
>The default encoding is defined as UTF8, which will at least annoy all
>nations equally :-).
>
>It looks like you can hack this any way you want by creating your own
>wrappers around stdin/stdout/stderr.  However, I wonder if Python should
>make this customizable on a site basis - for example, site.py checks for
>some option somewhere to say "I want to see Latin-1" or Shift-JIS or
>whatever.  I often used to write scripts to parse files of names and
>addresses, and use an interactive prompt to inspect the lists and tuples
>directly; the convenience of typing 'print mydata' and see it properly is
>nice.  What do people think?
Is there any reason that this should be set on a per site basis - I 
definitely agree that it should be possible to change the interpreter 
encoding, but wouldn't it be nicer if it could instead be changed on a 
per-interpreter basis?  Either via environment variables or maybe 
command-line flags?  Would it be too much of a performance hit to look up 
the default on any conversion which doesn't explicitly specify the encoding 
- this would give the most flexibility of all... (it doesn't seem to me that 
this would be too slow, but I don't have very deep knowledge about this).

>
>(Or is this feature there already and I've missed it?)
No, UTF-8 is the hardcoded default.

>
>
>2. lookup returns Codec object rather than tuple?
>---------------------------------------------------------------------

[snip]

I really like this idea too, and the optional addition of validate()
and repair() are good ideas too.

>
>3. direct conversion lookups and short-circuiting Unicode
>---------------------------------------------------------------------

[snip]

This also seems like a good idea to me, and something that would
be really good for Japanese support.

As for registering, rather than changing how that's done what about changing 
search functions so that they should be required to take a
second argument, which is by default Unicode (UTF-16) but could also
be some other encoding.  The search function would always be called
by the lookup procedure with a to and from encoding, and the search
function could deal with the arguments by returning a direct converter
or a 'welded' converter codec as appropriate.

--Brian
______________________________________________________
Get Your Private, Free Email at http://www.hotmail.com


From mal@lemburg.com  Mon Apr 10 23:34:31 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 11 Apr 2000 00:34:31 +0200
Subject: [I18n-sig] Codec API questions
References: <009b01bfa325$e51836b0$01ac2ac0@boulder>
Message-ID: <38F256F7.2D8E1990@lemburg.com>

Andy Robinson wrote:
> 
> 1. Set Default Encoding at site level
> ----------------------------------------------------
> The default encoding is defined as UTF8, which will at least annoy all
> nations equally :-).
> 
> It looks like you can hack this any way you want by creating your own
> wrappers around stdin/stdout/stderr.  However, I wonder if Python should
> make this customizable on a site basis - for example, site.py checks for
> some option somewhere to say "I want to see Latin-1" or Shift-JIS or
> whatever.  I often used to write scripts to parse files of names and
> addresses, and use an interactive prompt to inspect the lists and tuples
> directly; the convenience of typing 'print mydata' and see it properly is
> nice.  What do people think?
> 
> (Or is this feature there already and I've missed it?)

The design leaves this to user-land. I'd suggest using stdin/stdout
wrappers as needed, possibly only enabled in interactive sessions.
 
> 2. lookup returns Codec object rather than tuple?
> ---------------------------------------------------------------------
> I shuld have thought of this when we were in the draft stage months back,
> but couldn't really get my mind around it until I had something concrete to
> play with.
> 
> Right now, codecs.lookup() returns a tuple of
>     (encode_func,
>     decode_func,
>     stream_encoder_factory,
>     stream_decoder_factory)
> 
> But there is no easy way to lookup the codec object itself - indeed, no
> requirement that there be one.  I'd like to see lookup always return a Codec
> object
> every time, which is guaranteed to have four methods as above, but might
> have more.  (Note that a Codec object would have the ability to create
> StreamEncoders and StreamDecoders, but would not be one by itself).
> 
> A fifth method which is potentially very useful is validate(); a sixth might
> be repair().  And for each language, there could be specific ones such as
> expanding half-width to full-width katakana.
> 
> Furthermore, if we can get hold of the Codec objects, we can start to reason
> about codecs - for example, ask whether encodings are compatible with each
> other.

Why do you want to query an object ? The factory functions
will provide you with an object you can use as codec
when called with the proper arguments... note that there 
can't be just one object alive since these objects can
carry state.

BTW, the Codec API is designed to work for all kinds of
codecs. If you have a need for special new methods there's
no problem adding them to your Codec subclass -- the standard
codec mechanism won't rely on them, but you can still provide
and use them.
 
> 3. direct conversion lookups and short-circuiting Unicode
> ----------------------------------------------------------------------------
> This is an extension rather than a change.  I know what I want to do, but
> have only the vaguest ideas how to implement it.
> 
> As noted here before, you can get from shift-JIS to EUC and vice versa
> without going through Unicode.  Because these algorithmic conversions work
> on the full 94x94 'kuten space' and not just the 6879 code points in the
> standard, they tend to work for any vendor-specific extensions and for
> user-defined characters.  Most other Asian native encodings have used a
> similar scheme.
> 
> I'd like to see an 'extended API' to go from one native character set to
> another.  As before, this comes in two flavours, string and stream:
>     convert(string, from_enc, to_enc)   returns a string.
> We also need ways to get hold of StreamReader and StreamWriter versions.
> Now one can trivially build these using Unicode in the middle
> 
> codecs.lookup('from_enc', 'to_enc') would return a codec object able to
> convert from one encoding to another.  By default, this would weld together
> two Unicode codecs.  But if someone writes a codec to do the job directly,
> there should be a way to register that.

Looks like we need a set of recode codec classes here.
There is already one in codecs.py: StreamRecoder. We'd
probably need similar subclasses for the basic Codec
class though.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From mal@lemburg.com  Mon Apr 10 22:43:43 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 10 Apr 2000 23:43:43 +0200
Subject: [I18n-sig] Fw: Codecs for Japanese character encodings
References: <009701bfa325$db0af8b0$01ac2ac0@boulder>
Message-ID: <38F24B0F.71F376BA@lemburg.com>

Andy Robinson wrote:
> 
> Tamito Kajiyama has written pure Python codecs for the two main Japanese
> encodings!  Many thanks!

Great !

--
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From takeuchi.shohei@lab.ntt.co.jp  Tue Apr 11 05:21:05 2000
From: takeuchi.shohei@lab.ntt.co.jp (takeuchi)
Date: Tue, 11 Apr 2000 13:21:05 +0900
Subject: [I18n-sig] py1.6a2p1 IDLE annoyance
Message-ID: <003701bfa36d$5b03c640$8f133c81@pflab.ecl.ntt.co.jp>

This is a multi-part message in MIME format.

------=_NextPart_000_0033_01BFA3B8.CAAFEBE0
Content-Type: text/plain;
	charset="iso-2022-jp"
Content-Transfer-Encoding: 7bit

Hi folks,

Thank you Guido for updating IDLE shell string input features according to
my private mail.

I've just got your py1.6a2p1 from the site
and tried. I hate to say like this but
IDLE shell is worse than before.

Here is another my story.

IDLE shell now has a consistant behavior to command line shell
except appearance  on Win98 !
While Input string is saved as a Proper native encoded string (Shift JIS ),
echo backed string looks broken on the screen.
So user can not see the griph at all !

On Py1.6a2p1 IDLE Shell with Win98 Japanese Edition
# When Type Japanese alphabet "$B$"(B"
>>> s = raw_input("Echo backed broken griph")
>>>  s
'\202\240'  # SHift JIS encoding for the input
>>> print s
$B""!!(B# A broken griph comes up
>>> u = unicode( s,"mbcs")
>>> u
u'\u3042'  # UTF-8 encoding for s
>>>print u
$B$"!!!!!!(B # A proper griph comes up

Tk8.3 seems to handle only UTF-8 strings,
so I think IDLE have to go with them.

I hope IDLE allows to  customize shell
encoding so that unicode object is created
automatically from key input.

Any Idea?

Best Regards,

Takeuchi


------=_NextPart_000_0033_01BFA3B8.CAAFEBE0
Content-Type: text/html;
	charset="iso-2022-jp"
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=3DContent-Type content=3D"text/html; =
charset=3Diso-2022-jp">
<META content=3D"MSHTML 5.50.4030.2400" name=3DGENERATOR>
<STYLE></STYLE>
</HEAD>
<BODY bgColor=3D#ffffff>
<DIV><FONT face=3D"Courier New" size=3D2>Hi folks, </FONT></DIV>
<DIV><FONT face=3D"Courier New" size=3D2></FONT>&nbsp;</DIV>
<DIV><FONT face=3D"Courier New" size=3D2>Thank you Guido for updating =
IDLE shell=20
string input features according to my private mail.</FONT></DIV>
<DIV><FONT face=3D"Courier New" size=3D2></FONT>&nbsp;</DIV>
<DIV><FONT face=3D"Courier New" size=3D2>I've just got your py1.6a2p1 =
from the=20
site</FONT></DIV>
<DIV><FONT face=3D"Courier New" size=3D2>and tried.&nbsp;I hate to say =
like this but=20
</FONT></DIV>
<DIV><FONT face=3D"Courier New" size=3D2>IDLE&nbsp;shell is worse than=20
before.</FONT></DIV>
<DIV><FONT face=3DArial size=3D2></FONT>&nbsp;</DIV>
<DIV><FONT face=3D"Courier New" size=3D2>Here is another my =
story.</FONT></DIV>
<DIV><FONT face=3DArial size=3D2></FONT>&nbsp;</DIV>
<DIV><FONT face=3DArial size=3D2>IDLE&nbsp;shell now&nbsp;has a =
consistant behavior=20
to command line shell</FONT></DIV>
<DIV><FONT face=3DArial size=3D2>except appearance&nbsp; on Win98 =
!</FONT></DIV>
<DIV><FONT face=3DArial size=3D2>While Input string is saved as&nbsp;a =
Proper native=20
encoded string (Shift JIS ), echo backed&nbsp;string looks broken on the =

screen.</FONT></DIV>
<DIV><FONT face=3DArial size=3D2>So&nbsp;user&nbsp;can =
not&nbsp;see&nbsp;the=20
griph&nbsp;at all !&nbsp;&nbsp;</FONT></DIV>
<DIV><FONT face=3DCourier size=3D2></FONT>&nbsp;</DIV>
<DIV><FONT face=3DCourier size=3D2>On Py1.6a2p1 IDLE Shell with Win98=20
Japanese&nbsp;Edition&nbsp;</FONT></DIV>
<DIV><FONT face=3DCourier size=3D2># When Type Japanese alphabet =
"=1B$B$"=1B(B"=20
&nbsp;</FONT></DIV>
<DIV><FONT face=3DCourier size=3D2>&gt;&gt;&gt; s =3D raw_input("Echo =
backed broken=20
griph") </FONT></DIV>
<DIV><FONT face=3DArial size=3D2>&gt;&gt;&gt;&nbsp; s</FONT></DIV>
<DIV><FONT face=3DArial size=3D2>'\202\240'&nbsp; # SHift JIS encoding =
for the=20
input&nbsp;</FONT></DIV>
<DIV><FONT face=3DArial size=3D2>&gt;&gt;&gt; print s</FONT></DIV>
<DIV><FONT face=3DCourier size=3D2>=1B$B""!!=1B(B#&nbsp;A broken griph =
comes up</FONT></DIV>
<DIV><FONT face=3DArial size=3D2>&gt;&gt;&gt; u =3D =
unicode(</FONT>&nbsp;<FONT=20
face=3DArial size=3D2>s,"mbcs")</FONT></DIV>
<DIV><FONT face=3DArial size=3D2>&gt;&gt;&gt; u </FONT></DIV>
<DIV><FONT face=3DArial size=3D2>u'\u3042'&nbsp; # UTF-8 encoding for =
s</FONT></DIV>
<DIV><FONT face=3DArial size=3D2>&gt;&gt;&gt;print u</FONT></DIV>
<DIV><FONT face=3DArial size=3D2>=1B$B$"!!!!!!=1B(B # A proper griph =
comes up </FONT></DIV>
<DIV><FONT face=3DArial size=3D2></FONT>&nbsp;</DIV>
<DIV><FONT face=3D"Courier New" size=3D2>Tk8.3 seems to handle only =
UTF-8=20
strings,</FONT></DIV>
<DIV><FONT face=3D"Courier New" size=3D2>so I think IDLE have&nbsp;to go =
with=20
them.</FONT></DIV>
<DIV><FONT face=3DArial size=3D2></FONT>&nbsp;</DIV>
<DIV><FONT face=3D"Courier New" size=3D2>I hope&nbsp;IDLE allows to=20
&nbsp;customize&nbsp;shell </FONT></DIV>
<DIV><FONT face=3D"Courier New" size=3D2>encoding so that&nbsp;unicode=20
object&nbsp;is created</FONT></DIV>
<DIV><FONT face=3D"Courier New" size=3D2>automatically from key =
input.</FONT></DIV>
<DIV><FONT face=3DArial size=3D2></FONT>&nbsp;</DIV>
<DIV><FONT face=3D"Courier New" size=3D2>Any&nbsp;Idea?</FONT></DIV>
<DIV><FONT face=3DArial size=3D2></FONT>&nbsp;</DIV>
<DIV><FONT face=3D"Courier New" size=3D2>Best Regards,</FONT></DIV>
<DIV><FONT face=3DArial size=3D2></FONT>&nbsp;</DIV>
<DIV><FONT face=3D"Courier New" size=3D2>Takeuchi</FONT>&nbsp;</DIV>
<DIV><FONT face=3D"Courier New" =
size=3D2></FONT>&nbsp;</DIV></BODY></HTML>

------=_NextPart_000_0033_01BFA3B8.CAAFEBE0--


From takeuchi.shohei@lab.ntt.co.jp  Tue Apr 11 06:24:41 2000
From: takeuchi.shohei@lab.ntt.co.jp (takeuchi)
Date: Tue, 11 Apr 2000 14:24:41 +0900
Subject: [I18n-sig] repost: py16a2p1 IDLE annoyance
Message-ID: <007301bfa376$3d2b9cc0$8f133c81@pflab.ecl.ntt.co.jp>

Ooops , The post with Japanese characters is not suitable here.

Ok, I  will try again.

-----
Hi folks,

Thank you Guido for updating IDLE shell string input features according to
my private mail.

I've just got your py1.6a2p1 from the site
and tried. I hate to say like this but
IDLE shell is worse than before.

Here is another my story.

IDLE shell now has a consistant behavior to command line shell
except appearance  on Win98 !
While Input string is saved as a Proper native encoded string (Shift JIS ),
echo backed string looks broken on the screen.
So user can not see the griph at all !

On Py1.6a2p1 IDLE Shell with Win98 Japanese Edition
# When Type Japanese alphabet  A
   (please take this as  a Japanese character)
>>> s = raw_input("Echo backed broken griph")
>>>  s
'\202\240'  # SHift JIS encoding for the input
>>> print s
Echo backed broken griph
>>> u = unicode( s,"mbcs")
>>> u
u'\u3042'  # UTF-8 encoding for s
>>>print u
A proper griph comes up

Tk8.3 seems to handle only UTF-8 strings,
so I think IDLE have to go with them.

I hope IDLE allows to  customize shell
encoding so that unicode object is created
automatically from key input.

Any Idea?

Best Regards,

Takeuchi


From dae_alt3@juno.com  Tue Apr 11 09:10:11 2000
From: dae_alt3@juno.com (Doug Edmunds)
Date: Tue, 11 Apr 2000 01:10:11 -0700
Subject: [I18n-sig] Reading UTF-16 Scripts
Message-ID: <20000411.011011.-421941.3.dae_alt3@juno.com>

python ver: 1.6a
os: Win98

Are there any plans to allow
python to be able to read scripts
written entirely in UTF-16 format
(such as those written by
Win98's Wordpad program and saved
as unicode text?)

Since each of these files begin
with 'FFEE' it would seem to be
not too difficult for python
to recognize that format and convert
the non-string context to 8bit, i.e.,
p r i n t -> print.

The advantage is that mixed language
scripts (i.e English/Russian) can
be written and saved unambiguously, 
not dependent upon selection 
of a particular 'font script' such as
cp1251 or KOI8-r for Russian.  

The motivation for getting away from
these scripts (encodings, whatever)
is to be able to write multiple languages 
in a single string.  

This kind of scripting could be avoided:
a = unicode ('Ïðàâäà - ãàçåòà', 'cp1251')
print a.encode('cp1251')

and replaced with a simpler:
print "In Russian, newspaper is ____; in Polish it is ______"

Notes: 
1. Cyrillic fonts do not appear in IDLE (US English is base).
2. In PythonWin, even with a Cyrillic 'script' selected,
   such as Courier New (Cyrillic), output appears in English
   -- the 'script' aspect is being ignored.

-- doug edmunds
11 April 2000

//juno ad follows//ignore it please//


________________________________________________________________
YOU'RE PAYING TOO MUCH FOR THE INTERNET!
Juno now offers FREE Internet Access!
Try it today - there's no risk!  For your FREE software, visit:
http://dl.www.juno.com/get/tagj.


From mark.mcmahon@eur.autodesk.com  Tue Apr 11 09:24:21 2000
From: mark.mcmahon@eur.autodesk.com (mark.mcmahon@eur.autodesk.com)
Date: Tue, 11 Apr 2000 10:24:21 +0200
Subject: [I18n-sig] Changing case
Message-ID: <C2AA61E5CAA7D211832F0008C7A4138E01B48CCC@eurmsgneu01.eur.autodesk.com>

Hi,

I can't seem to figure this out..

>>> s = unicode('\204\202', 'latin-1')
>>> s
u'\204\202'
>>> s.upper()
u'\204\202')

Is this something that unicode should be able to do? Am I using the wrong
encoding?

Or would I have to have a particular codec to have a mapping between lower
and uppercase characters.

Sorry if this is basic and obvious - but as I said I can't seem to figure it
out

Windows NT4, (US - French regional settings), Python 1.6a1, both command
line and Idle.

Mark


From andy@reportlab.com  Tue Apr 11 10:04:19 2000
From: andy@reportlab.com (Andy Robinson)
Date: Tue, 11 Apr 2000 10:04:19 +0100
Subject: [I18n-sig] P1.6a1- Win98 - unicode issues
In-Reply-To: <20000410.122846.-454791.1.dae_alt3@juno.com>
Message-ID: <PGECLPOBGNBNKHNAGIJHMEALCBAA.andy@reportlab.com>

> Apparently my efforts to send unicode via Juno failed.
> d.edmunds
> > ???????? ????? ??????? ? ??????????? ???????? ??????????????. ???
> > ?????
> > ??? ?????? ??????? ?? ????? ????? ?????? ?? ????????????? ??????? ?
> > ??????.
> >
I think this is a Windows feature at the moment.  Office 2000 apps, IE5 and
Outlook Express allow input and display in any language if you have the
right OS add-ons loaded.  But at the moment when you past to the clipboard
or save to a file, they get turned to question marks, presumably to avod
upsetting older apps that are not so Unicode aware.  I am told Win2000 is
better - need to try it.

It is this kind of thing that makes i18n really hard - even a simple
cut/paste can modify your data, and it is hard to know in which piece of
software things are going wrong.

For Asian languages, there is a great little freeware word processor /
lookup tool called "JWP" which lets you explicitly control the cut/paste and
save/load encodings used.

- Andy Robinson


From mal@lemburg.com  Tue Apr 11 13:14:00 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 11 Apr 2000 14:14:00 +0200
Subject: [I18n-sig] Reading UTF-16 Scripts
References: <20000411.011011.-421941.3.dae_alt3@juno.com>
Message-ID: <38F31708.AA1088B3@lemburg.com>

Doug Edmunds wrote:
> 
> python ver: 1.6a
> os: Win98
> 
> Are there any plans to allow
> python to be able to read scripts
> written entirely in UTF-16 format
> (such as those written by
> Win98's Wordpad program and saved
> as unicode text?)
> 
> Since each of these files begin
> with 'FFEE' it would seem to be
> not too difficult for python
> to recognize that format and convert
> the non-string context to 8bit, i.e.,
> p r i n t -> print.

As I understand, Python scripts are supposed to
be ASCII (or maybe UTF-8).

Your proposal would only work if *all* strings were
Unicode in Python. There currently are two types:
one for 8-bit strings and the 16-bit Unicode one.

> The advantage is that mixed language
> scripts (i.e English/Russian) can
> be written and saved unambiguously,
> not dependent upon selection
> of a particular 'font script' such as
> cp1251 or KOI8-r for Russian.
> 
> The motivation for getting away from
> these scripts (encodings, whatever)
> is to be able to write multiple languages
> in a single string.
> 
> This kind of scripting could be avoided:
> a = unicode ('Ïðàâäà - ãàçåòà', 'cp1251')
> print a.encode('cp1251')
> 
> and replaced with a simpler:
> print "In Russian, newspaper is ____; in Polish it is ______"
> 
> Notes:
> 1. Cyrillic fonts do not appear in IDLE (US English is base).
> 2. In PythonWin, even with a Cyrillic 'script' selected,
>    such as Courier New (Cyrillic), output appears in English
>    -- the 'script' aspect is being ignored.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From mal@lemburg.com  Tue Apr 11 12:40:57 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 11 Apr 2000 13:40:57 +0200
Subject: [I18n-sig] Changing case
References: <C2AA61E5CAA7D211832F0008C7A4138E01B48CCC@eurmsgneu01.eur.autodesk.com>
Message-ID: <38F30F49.405196E@lemburg.com>

mark.mcmahon@eur.autodesk.com wrote:
> 
> Hi,
> 
> I can't seem to figure this out..
> 
> >>> s = unicode('\204\202', 'latin-1')
> >>> s
> u'\204\202'
> >>> s.upper()
> u'\204\202')
> 
> Is this something that unicode should be able to do? Am I using the wrong
> encoding?
> 
> Or would I have to have a particular codec to have a mapping between lower
> and uppercase characters.
> 
> Sorry if this is basic and obvious - but as I said I can't seem to figure it
> out

Those two characters don't have a lower/upper case mapping:

0080;<control>;Cc;0;BN;;;;;N;;;;;
0081;<control>;Cc;0;BN;;;;;N;;;;;
0082;<control>;Cc;0;BN;;;;;N;BREAK PERMITTED HERE;;;;
0083;<control>;Cc;0;BN;;;;;N;NO BREAK HERE;;;;
0084;<control>;Cc;0;BN;;;;;N;INDEX;;;;

.lower() and .upper() only modify chars which do have such a
mapping -- all others are left untouched.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From mark.mcmahon@eur.autodesk.com  Tue Apr 11 13:13:09 2000
From: mark.mcmahon@eur.autodesk.com (mark.mcmahon@eur.autodesk.com)
Date: Tue, 11 Apr 2000 14:13:09 +0200
Subject: [I18n-sig] Changing case
Message-ID: <C2AA61E5CAA7D211832F0008C7A4138E01B48CCF@eurmsgneu01.eur.autodesk.com>

Hi Marc,

I definately do not understand. \204 is lower_e_egu (spelling?) and =
\204 is
lower_a_umlaut. Upper case of these should be \216 and \220 =
respectively.

(Probably will not display properly on all machines)
--------------
>>> s =3D u"=E9=E4"
>>> s
u'\202\204'
>>> t =3D u"=C4=C9"
>>> t
u'\216\220'
-------------
Mark

<SNIP>
Marc ->
Those two characters don't have a lower/upper case mapping:

<SNIP>
.lower() and .upper() only modify chars which do have such a
mapping -- all others are left untouched.

--=20
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From guido@python.org  Tue Apr 11 14:24:48 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 11 Apr 2000 09:24:48 -0400
Subject: [I18n-sig] Changing case
In-Reply-To: Your message of "Tue, 11 Apr 2000 14:13:09 +0200."
 <C2AA61E5CAA7D211832F0008C7A4138E01B48CCF@eurmsgneu01.eur.autodesk.com>
References: <C2AA61E5CAA7D211832F0008C7A4138E01B48CCF@eurmsgneu01.eur.autodesk.com>
Message-ID: <200004111324.JAA07943@eric.cnri.reston.va.us>

> I definately do not understand. \204 is lower_e_egu (spelling?) and \204 is
> lower_a_umlaut. Upper case of these should be \216 and \220 respectively.
> 
> (Probably will not display properly on all machines)
> --------------
> >>> s = u"éä"
> >>> s
> u'\202\204'
> >>> t = u"ÄÉ"
> >>> t
> u'\216\220'
> -------------
> Mark

Aha, *I* understand.  You must be on Windows.  Windows has its own
character encoding, where e-egu is \202 and a-umlaut is \204.  However
Python doesn't know what character set you are using, and when you
typed e-egu, all it knew is that you entered \202.  If you type this
in a u"..." string, all codes are interpreted as if they are Latin-1,
which happens to be the lower 256 bytes of Unicode.  The Latin-1
character \202 (which is NOT e-egu but a control character) has no
upper case equivalent.

How do you get what you want?

Instead of typing u"éä", you should be able to type
unicode("éä", "mbcs").

HOWEVER, I can't get this to work either!  I get
unicode('\202\204','mbcs') -> u"\u201A\u201E" and the latter string
doesn't have an upper case equivalent either!  I had expected that
these would have translated to the Latin-1.  Maybe I'm using the wring
MBCS code page???

> <SNIP>
> Marc ->
> Those two characters don't have a lower/upper case mapping:
> 
> <SNIP>
> .lower() and .upper() only modify chars which do have such a
> mapping -- all others are left untouched.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From mal@lemburg.com  Tue Apr 11 13:49:42 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 11 Apr 2000 14:49:42 +0200
Subject: [I18n-sig] Changing case
References: <C2AA61E5CAA7D211832F0008C7A4138E01B48CCF@eurmsgneu01.eur.autodesk.com>
Message-ID: <38F31F66.8C147314@lemburg.com>

mark.mcmahon@eur.autodesk.com wrote:
> 
> Hi Marc,
> 
> I definately do not understand. \204 is lower_e_egu (spelling?) and \204 is
> lower_a_umlaut. Upper case of these should be \216 and \220 respectively.

Not in Latin-1... you are probably using a different code page
in your editor.

>>> u'éä'
u'\351\344'
>>> u'éä'.upper()
u'\311\304'
>>> print u'éä'.encode('latin-1')
éä
>>> print u'éä'.upper().encode('latin-1')
ÉÄ

Strange enough, I get these outputs on my Linux machine:

>>> print 'éä'.upper()
éä

Looks like the C lib doesn't know about upper case mappings
for these Latin-1 characters.
 
> (Probably will not display properly on all machines)
> --------------
> >>> s = u"éä"
> >>> s
> u'\202\204'
> >>> t = u"ÄÉ"
> >>> t
> u'\216\220'
> -------------
> Mark
> 
> <SNIP>
> Marc ->
> Those two characters don't have a lower/upper case mapping:
> 
> <SNIP>
> .lower() and .upper() only modify chars which do have such a
> mapping -- all others are left untouched.
> 
> --
> Marc-Andre Lemburg
> ______________________________________________________________________
> Business:                                      http://www.lemburg.com/
> Python Pages:                           http://www.lemburg.com/python/
> 
> _______________________________________________
> I18n-sig mailing list
> I18n-sig@python.org
> http://www.python.org/mailman/listinfo/i18n-sig

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From guido@python.org  Tue Apr 11 15:55:04 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 11 Apr 2000 10:55:04 -0400
Subject: [I18n-sig] Changing case
In-Reply-To: Your message of "Tue, 11 Apr 2000 14:49:42 +0200."
 <38F31F66.8C147314@lemburg.com>
References: <C2AA61E5CAA7D211832F0008C7A4138E01B48CCF@eurmsgneu01.eur.autodesk.com>
 <38F31F66.8C147314@lemburg.com>
Message-ID: <200004111455.KAA08048@eric.cnri.reston.va.us>

The story continues...

I tried the following in Python 1.6a2p1 on Windows NT 4.0 in three
interpreters: IDLE, command line, and Pythonwin (win32all-130 using
Python 1.6a2p1).  (Since I live in the US, I don't have any way to
input non-ASCII characters; so I use escape sequences for input.)

>>> s = '\351\344' # This is e-egu a-umlaut in Latin-1
>>> u = unicode(s, "latin-1") # This simply yields u"\351\344"
>>> print s
(see table below)
>>> print u
(see table below)
>>> 

I got the following results:

		print s			print u
		-------			-------

IDLE:		e-egu a-umlaut		e-egu a-umlaut

command line:	THETA SIGMA		three graphics + n~

Pythonwin:	e-egu a-umlaut		A~ (C) A~ o-with-cross


I tried the same thing on Solaris in IDLE and the command line; IDLE
on Solaris did exactly the same thing as it did on Windows, and the
command line on Solaris did exactly the same thing as Pythonwin (!)
did on Windows.

I tried the same thing with IDLE from Python 1.6a1 and also got the
same results -- from this I conclude that Tcl/Tk 8.2 and 8.3 behave
the same way in this respect.

My theory why IDLE has the highest success rate: Tcl/Tk 8.2 uses UTF-8
internally, but falls back to Latin-1 when you use non-ASCII
characters that are clearly not UTF-8.  Thus, "print u" displays the
correct value because Tkinter converts Unicode to UTF-8, and "print s"
displays the correct value because Tcl/Tk recognizes that it's not
UTF-8 and thus interprets it as Latin-1.

The command line (running in a DOS box) uses a default code page which
bears no relation to Latin-1; the THETA and SIGMA happen to have
codes \351 and \344.  The gibberish printed for u is simply what its
UTF-8 encoding ('\303\251\303\244') looks like when interpreted in the
same code page.

Finally, Pythonwin: Scintilla (its text widget) seems to know about
Latin-1 only.  The four characters it prints for u are the Latin-1
characters for \303, \251, \303 and \244.  This is also true for the
command line on Solaris (using xterm with the default Latin-1
encoding).

Note that IDLE doesn't always print Latin-1 characters correctly!  I
was just lucky.  For example, the string "\303,\251,\303\251" prints
as A~, comma, (C), comma, e-egu.  In other words, \303 and \251 by
themselves are interpreted as Latin-1, while taken together they are
interpreted as UTF-8.

What would be nice?  For stdout, to be able to say *independently*
what encoding 8-bit strings are to be assumed when printed, and what
encoding should be used for the output stream.  And for this to work
in all three IDEs: IDLE, command line and Pythonwin.

In IDLE, the output stream should be fixed to UTF-8, but a user
working with Latin-1 strings could set the defaults 8-bit string
encoding for output to be Latin-1.  Then, print '\351\344' would be
encoded as UTF-8: '\303\251\303\244', which prints as e-egu a-umlaut;
on the other hand, print '\303\251\303\244' would be interpreted as 4
Latin-1 characters, and print as A~ (C) A~ o-with-cross.

In the command line, on Windows the output encoding should be set to
the default MBCS code page, but the default encoding for 8-bit strings
could be set to something user-specified, e.g. Latin-1.

A similar thing should happen for input (and the input and output
should normally be switched together, so that a user entering
e.g. shift-JIS would also get shift-JIS on putput).


This is quite independent of the source encoding when reading from a
file.  I have some issues with the current approach (which seems to be
"use whatever bytes you read" and thus defaults to Latin-1 if you use
non-ASCII characters inUnicode string literals; otherwise it's
whatever the user wants it to be.  Note in particular that a user who
edits her source code in shift-JIS can currently *not* use shift-JIS
in Unicode literals -- she must use something like
unicode(".....","shift-jis") to get a Unicode string containing the
correct Japanese characters encoded in Unicode.

Of course, when entering source code interactively, this should be
tied to the encoding for stdin.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From mal@lemburg.com  Tue Apr 11 16:38:32 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Tue, 11 Apr 2000 17:38:32 +0200
Subject: [I18n-sig] Changing case
References: <C2AA61E5CAA7D211832F0008C7A4138E01B48CCF@eurmsgneu01.eur.autodesk.com>
 <38F31F66.8C147314@lemburg.com> <200004111455.KAA08048@eric.cnri.reston.va.us>
Message-ID: <38F346F8.63A5C793@lemburg.com>

Guido van Rossum wrote:
> 
> This is quite independent of the source encoding when reading from a
> file.  I have some issues with the current approach (which seems to be
> "use whatever bytes you read" and thus defaults to Latin-1 if you use
> non-ASCII characters inUnicode string literals; otherwise it's
> whatever the user wants it to be.

What direction should we be heading: interpret the source
files under some encoding assumption deduced from the
platform, a command line switch or a #pragma, or simply fix
one encoding (e.g. Latin-1) ?

The current divergence between u"...chars..." and "...chars..."
really only stems from the fact that "...chars..." doesn't
have to know about the used encoding, while u"...chars..." does
to be able to convert the data to Unicode.

> Note in particular that a user who
> edits her source code in shift-JIS can currently *not* use shift-JIS
> in Unicode literals -- she must use something like
> unicode(".....","shift-jis") to get a Unicode string containing the
> correct Japanese characters encoded in Unicode.

See above -- without any further knowledge about the encoding
used to write the source file, there is no other way than to
simply fix one encoding (which happens to be Latin-1 due to
the way the first 256 Unicode ordinals are defined).

Note that even if the parser would know the encoding, you'd
still have a problem processing the strings at run-time:
8-bit strings do not carry any encoding information.
The only ways to fix this would be to define a global 8-bit
string encoding or add an encoding attribute to strings.

One possible way would be to define that all 8-bit strings
get converted to UTF-8 when parsed (by the compiler, eval(), etc.).
This would assure that all strings used at run-time would
in fact be UTF-8 and conversions to and from Unicode would
be possible without information loss.

The downside of this approach is that indexing and slicing do
not work well with UTF-8: a single input character can be
encoded by as much as 6 bytes (for 32-bit Unicode) ! I also
assume that many applications rely on the fact that
len("äö") == 2 and not 4.

Perhaps we should just loosen the used encoding for u"...chars..."
using #pragmas and/or cmd line switches. Then people around the
world would at least have a simple way to write programs which
still work everywhere, but can be written using any of the
encodings known to Python. 8-bit "...chars..." would then
be interpreted as before: user defined data using a user
defined encoding (the string->Unicode conversion would still
need to make the UTF-8 assumption, though).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From guido@python.org  Tue Apr 11 17:56:21 2000
From: guido@python.org (Guido van Rossum)
Date: Tue, 11 Apr 2000 12:56:21 -0400
Subject: [I18n-sig] Changing case
In-Reply-To: Your message of "Tue, 11 Apr 2000 17:38:32 +0200."
 <38F346F8.63A5C793@lemburg.com>
References: <C2AA61E5CAA7D211832F0008C7A4138E01B48CCF@eurmsgneu01.eur.autodesk.com> <38F31F66.8C147314@lemburg.com> <200004111455.KAA08048@eric.cnri.reston.va.us>
 <38F346F8.63A5C793@lemburg.com>
Message-ID: <200004111656.MAA08887@eric.cnri.reston.va.us>

> What direction should we be heading: interpret the source
> files under some encoding assumption deduced from the
> platform, a command line switch or a #pragma, or simply fix
> one encoding (e.g. Latin-1) ?

I think we'll have to allow user-specified encodings -- including
UTF-8 and eventually UTF-16.  How these are communicated to the parser
is a separate design issue; we could start with a command line switch
(assuming the standard library is ASCII only) and later migrate to a
per-file pragma.  There should also be a default encoding; I would
propose UTF-8, as this is already the default encoding used at
run-time.  (And because it annoys everyone roughly equally. :-)

Once we know the source encoding, it's obvious what to do with Unicode
literals: translate from the input encoding.

I want to propose a very simple rule for 8-bit literals: these use the
source encoding -- in other words, they aren't changed from what is
read from the file.  This is most likely to yield what the user wants.
Especially if the user doesn't use Unicode explicitly (neither
literals nor via conversions) the user sees their native character set
when editing the source file, and probably uses the same encoding for
output files, so if the user simply prints strings, the right thing
should happen automatically.

If the user *does* use Unicode conversions, the user has to specify
their encoding explicitly (unless it's UTF-8).  This seems only fair
-- the runtime can't know whether an 8-bit string being converted to
Unicode started its life as an 8-bit literal or whether it was read
from a file with an encoding that may only be known to the user.

> The current divergence between u"...chars..." and "...chars..."
> really only stems from the fact that "...chars..." doesn't
> have to know about the used encoding, while u"...chars..." does
> to be able to convert the data to Unicode.

Right.  Hence my deduction that currently the source encoding is
Latin-1.

> Note that even if the parser would know the encoding, you'd
> still have a problem processing the strings at run-time:
> 8-bit strings do not carry any encoding information.
> The only ways to fix this would be to define a global 8-bit
> string encoding or add an encoding attribute to strings.

The former we decided against -- the latter can be done by the user
(sublcassing UserString).

> One possible way would be to define that all 8-bit strings
> get converted to UTF-8 when parsed (by the compiler, eval(), etc.).
> This would assure that all strings used at run-time would
> in fact be UTF-8 and conversions to and from Unicode would
> be possible without information loss.

No -- this does NOT guarantee that all 8-bit strings are UTF-8.  It
doesn't cover strings explicitly encoded using octal escapes, and
(much more importantly) it doesn't cover strings read from files or
sockets or constructed in other ways.

(We can know that all strings we get out of Tkinter are UTF-8 encoded
though!  Provided we're using Tcl/Tk 8.1 or higher.)

> The downside of this approach is that indexing and slicing do
> not work well with UTF-8: a single input character can be
> encoded by as much as 6 bytes (for 32-bit Unicode) ! I also
> assume that many applications rely on the fact that
> len("äö") == 2 and not 4.

Agreed.  If we tried to make everything UTF-8, we should never have
started down the path of a separate Unicode string datatype.

I say: 8-bit strings have no fixed encoding -- they are 8-bit bytes
and their interpretation is determined by the program.  The default of
UTF-8 when converting to a Unicode string is just because we need a
default.

> Perhaps we should just loosen the used encoding for u"...chars..."
> using #pragmas and/or cmd line switches. Then people around the
> world would at least have a simple way to write programs which
> still work everywhere, but can be written using any of the
> encodings known to Python. 8-bit "...chars..." would then
> be interpreted as before: user defined data using a user
> defined encoding (the string->Unicode conversion would still
> need to make the UTF-8 assumption, though).

This sounds like my proposal.  Let's do it.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From chris@ccbs.ntu.edu.tw  Wed Apr 12 08:36:05 2000
From: chris@ccbs.ntu.edu.tw (Christian Wittern)
Date: Wed, 12 Apr 2000 15:36:05 +0800
Subject: [I18n-sig] P1.6a1- Win98 - unicode issues
In-Reply-To: <PGECLPOBGNBNKHNAGIJHMEALCBAA.andy@reportlab.com>
Message-ID: <NDBBKJMNKBFFOBEFDJGEEEHJDAAA.chris@ccbs.ntu.edu.tw>

> > Apparently my efforts to send unicode via Juno failed.
> > d.edmunds
> > > ???????? ????? ??????? ? ??????????? ???????? ??????????????. ???
> > > ?????
> > > ??? ?????? ??????? ?? ????? ????? ?????? ?? ????????????? ??????? ?
> > > ??????.
> > >
> I think this is a Windows feature at the moment.  Office 2000
> apps, IE5 and
> Outlook Express allow input and display in any language if you have the
> right OS add-ons loaded.  But at the moment when you past to the clipboard
> or save to a file, they get turned to question marks, presumably to avod
> upsetting older apps that are not so Unicode aware.  I am told Win2000 is
> better - need to try it.
>

As far as I know, the WIndows clipboard offers the text in different
formats, the ??? is just the text-only fallback in cases the application
does not know the magic to read the Unicode portion of the clipboard. I
don't know either, but I know it is possible...

All the best, Christian Wittern


From mal@lemburg.com  Wed Apr 12 08:59:25 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 12 Apr 2000 09:59:25 +0200
Subject: [I18n-sig] Changing case
References: <C2AA61E5CAA7D211832F0008C7A4138E01B48CCF@eurmsgneu01.eur.autodesk.com> <38F31F66.8C147314@lemburg.com> <200004111455.KAA08048@eric.cnri.reston.va.us>
 <38F346F8.63A5C793@lemburg.com> <200004111656.MAA08887@eric.cnri.reston.va.us>
Message-ID: <38F42CDD.49AE89B3@lemburg.com>

Guido van Rossum wrote:
> 
> > Perhaps we should just loosen the used encoding for u"...chars..."
> > using #pragmas and/or cmd line switches. Then people around the
> > world would at least have a simple way to write programs which
> > still work everywhere, but can be written using any of the
> > encodings known to Python. 8-bit "...chars..." would then
> > be interpreted as before: user defined data using a user
> > defined encoding (the string->Unicode conversion would still
> > need to make the UTF-8 assumption, though).
> 
> This sounds like my proposal.  Let's do it.

Thinking about this some more: while adding a flag to designate
the u"" encoding would be easy, should the encoded string also
be able to contain \uXXXX and the like sequences ? If yes, we'd
need a two level approach:

1. decode the input encoding to Unicode
2. decode the embedded \uXXXX et al. escape sequences (now within
   Unicode)

We'd need a new codec for 2 and this codec would have to be able
to translate Unicode to Unicode -- nothing difficult, but a
new technique since all others currently do 8-bit <-> Unicode.

"Draft proposal"ing here:

Let's start the experiment with a command line switch
until #pragma handling has been properly defined. #pragmas
should then be used for scripts read from files to ensure
that they work elsewhere in the world.

What command line switch should we use... -e as in 
"encoding" ?

We'd also need an environment variable ro make things easier,
say PYTHONENCODING...

The value should be available within Python as e.g. sys.encoding.

The given encoding would only be used by the compiler (the part
that translates u"..." strings into objects). Usage in scripts
in then up to user-land routines (via sys.encoding).

To make all this work without too many hassles we'd need
(at least the most commonly used) CJKV codecs in the core
distribution. How big would these be ? Would someone contribute
them... Tamito ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From andy@reportlab.com  Wed Apr 12 09:28:49 2000
From: andy@reportlab.com (Andy Robinson)
Date: Wed, 12 Apr 2000 09:28:49 +0100
Subject: [I18n-sig] Changing case
In-Reply-To: <200004111656.MAA08887@eric.cnri.reston.va.us>
Message-ID: <PGECLPOBGNBNKHNAGIJHMEBNCBAA.andy@reportlab.com>

> I say: 8-bit strings have no fixed encoding -- they are 8-bit bytes
> and their interpretation is determined by the program.  The default of
> UTF-8 when converting to a Unicode string is just because we need a
> default.

This makes perfect sense to me and I agree 100%.  Guido, thanks for 
summing up the issues so clearly.

- Andy Robinson
 

From mal@lemburg.com  Wed Apr 12 10:30:40 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Wed, 12 Apr 2000 11:30:40 +0200
Subject: [I18n-sig] Changing case
References: <PGECLPOBGNBNKHNAGIJHEEBPCBAA.andy@reportlab.com>
Message-ID: <38F44240.CB4A9291@lemburg.com>

[CCing to i18n too]

Andy Robinson wrote:
> 
> > To make all this work without too many hassles we'd need
> > (at least the most commonly used) CJKV codecs in the core
> > distribution. How big would these be ? Would someone contribute
> > them... Tamito ?
> >
> He may be at home by now, but he indicated to me that he was
> happy for them to be used in any way.  The nice things about
> his codecs are
> (a) one could extract the mapping tables for other codecs
>     from data at www.unicode org and use a very similar
>     approach.
> (b) the mappings may be 168k, but they at least zip nicely.
>     I'm guessing at 5-6 such codecs in the distribution
>     initially.
> (c) the algorithmic bit can be accelerated later in C or our
>     vaporware state machine, and nobody needs to change
>     any interfaces.
> (d) if we slightly parameterise his codecs so that one could
>     substitute a different mapping table if needed, then
>     all the corporate variations just need to create a
>     new dictionary with the deltas - Microsoft Code Page
>     932 would not be another 168k, but just a few k and
>     could build its mapping on the fly.

Sounds ok to me.
 
> However, I suspect putting it in the core for June 1st may
> be too aggressive; if the compiler is going to use them on
> every source file for a Japanese user, we really want to
> move from byte-level loops in Python to something much faster.

Speed is not an issue now: what we need is a good concept
and some proof-of-concept code to go with it.

BTW, all this will go into 1.7 AFAIK... 1.6 will have to do
with what's there now. I may get a patch done for the -e
command line switch -- but only as experimental feature
in 1.6.

Unfortunately, Guido's out at the moment, so he can't
comment on this...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From kajiyama@grad.sccs.chukyo-u.ac.jp  Wed Apr 12 18:30:50 2000
From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA)
Date: Thu, 13 Apr 2000 02:30:50 +0900
Subject: [I18n-sig] Changing case
In-Reply-To: <38F44240.CB4A9291@lemburg.com> (mal@lemburg.com)
References: <38F46F0524E.B96AOTSUK@boomt.bt.kznet>
Message-ID: <200004121730.CAA03719@dhcp236.grad.sccs.chukyo-u.ac.jp>

* M.-A. Lemburg:
|
| > > To make all this work without too many hassles we'd need
| > > (at least the most commonly used) CJKV codecs in the core
| > > distribution. How big would these be ? Would someone contribute
| > > them... Tamito ?

* Andy Robinson:
|
| > He may be at home by now, but he indicated to me that he was
| > happy for them to be used in any way.  The nice things about
| > his codecs are
| > (a) one could extract the mapping tables for other codecs
| >     from data at www.unicode org and use a very similar
| >     approach.

In fact, I generated the mappings in my Japanese codecs using
simple Python scripts based on the mapping table provided by
Unicode Inc.:

ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0208.TXT

The version I used is 0.9 (8 March 1994).  The perfectness of
the mappings are totally due to the authors of the original
mapping table, not me ;)

| > (b) the mappings may be 168k, but they at least zip nicely.
| >     I'm guessing at 5-6 such codecs in the distribution
| >     initially.

Thanks for the considerations on size.  I personally consider
the size issue is less important than the speed issue, though.

| > (c) the algorithmic bit can be accelerated later in C or our
| >     vaporware state machine, and nobody needs to change
| >     any interfaces.
| > (d) if we slightly parameterise his codecs so that one could
| >     substitute a different mapping table if needed, then
| >     all the corporate variations just need to create a
| >     new dictionary with the deltas - Microsoft Code Page
| >     932 would not be another 168k, but just a few k and
| >     could build its mapping on the fly.

Good ideas.

| > However, I suspect putting it in the core for June 1st may
| > be too aggressive; if the compiler is going to use them on
| > every source file for a Japanese user, we really want to
| > move from byte-level loops in Python to something much faster.
| 
| Speed is not an issue now: what we need is a good concept
| and some proof-of-concept code to go with it.

I think my pure Python implementation of Japanese codecs is a
kind of "proof of concept" at most.  I run a simple benchmark
test on my codecs; it took about 7 minutes to convert a 7MB
Japanese text file from EUC-JP to EUC-JP via UTF-8.  It seems
that my codecs are too slow to use for most applications.  I
believe the char-by-char iteration on strings in EUC-JP and
Shift_JIS needs to be implemented in C.

Best regards,

-- 
KAJIYAMA, Tamito <kajiyama@grad.sccs.chukyo-u.ac.jp>


From guido@python.org  Thu Apr 27 16:01:48 2000
From: guido@python.org (Guido van Rossum)
Date: Thu, 27 Apr 2000 11:01:48 -0400
Subject: [I18n-sig] Unicode debate
In-Reply-To: Your message of "Thu, 27 Apr 2000 06:42:43 BST."
 <l03102800b52d80db1290@[193.78.237.154]>
References: <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]>
Message-ID: <200004271501.LAA13535@eric.cnri.reston.va.us>

I'd like to reset this discussion.  I don't think we need to involve
c.l.py yet -- I haven't seen anyone with Asian language experience
chime in there, and that's where this matters most.  I am directing
this to the Python i18n-sig mailing list, because that's where the
debate belongs, and there interested parties can join the discussion
without having to be vetted as "fit for python-dev" first.

I apologize for having been less than responsive in the matter;
unfortunately there's lots of other stuff on my mind right now that
has recently had a tendency to distract me with higher priority
crises.

I've heard a few people claim that strings should always be considered
to contain "characters" and that there should be one character per
string element.  I've also heard a clamoring that there should only be
one string type.  You folks have never used Asian encodings.  In
countries like Japan, China and Korea, encodings are a fact of life,
and the most popular encodings are ASCII supersets that use a variable
number of bytes per character, just like UTF-8.  Each country or
language uses different encodings, even though their characters look
mostly the same to western eyes.  UTF-8 and Unicode is having a hard
time getting adopted in these countries because most software that
people use deals only with the local encodings.  (Sounds familiar?)

These encodings are much less "pure" than UTF-8, because they only
encode the local characters (and ASCII), and because of various
problems with slicing: if you look "in the middle" of an encoded
string or file, you may not know how to interpret the bytes you see.
There are overlaps (in most of these encodings anyway) between the
codes used for single-byte and double-byte encodings, and you may have
to look back one or more characters to know what to make of the
particular byte you see.  To get an idea of the nightmares that
non-UTF-8 multibyte encodings give C/C++ programmers, see the
Multibyte Character Set (MBCS) Survival Guide
(http://msdn.microsoft.com/library/backgrnd/html/msdn_mbcssg.htm).
See also the home page of the i18n-sig for more background information
on encoding (and other i18n) issues
(http://www.python.org/sigs/i18n-sig/).

UTF-8 attempts to solve some of these problems: the multi-byte
encodings are chosen such that you can tell by the high bits of each
byte whether it is (1) a single-byte (ASCII) character (top bit off),
(2) the start of a multi-byte character (at least two top bits on; how
many indicates the total number of bytes comprising the character), or
(3) a continuation byte in a multi-byte character (top bit on, next
bit off).

Many of the problems with non-UTF-8 multibyte encodings are the same
as for UTF-8 though: #bytes != #characters, a byte may not be a valid
character, regular expression patterns using "." may give the wrong
results, and so on.

The truth of the matter is: the encoding of string objects is in the
mind of the programmer.  When I read a GIF file into a string object,
the encoding is "binary goop".  When I read a line of Japanese text
from a file, the encoding may be JIS, shift-JIS, or ENC -- this has to
be an assumption built-in to my program, or perhaps information
supplied separately (there's no easy way to guess based on the actual
data).  When I type a string literal using Latin-1 characters, the
encoding is Latin-1.  When I use octal escapes in a string literal,
e.g. '\303\247', the encoding could be UTF-8 (this is a cedilla).
When I type a 7-bit string literal, the encoding is ASCII.

The moral of all this?  8-bit strings are not going away.  They are
not encoded in UTF-8 henceforth.  Like before, and like 8-bit text
files, they are encoded in whatever encoding you want.  All you get is
an extra mechanism to convert them to Unicode, and the Unicode
conversion defaults to UTF-8 because it is the only conversion that is
reversible.  And, as Tim Peters quoted Andy Robinson (paraphrasing
Tim's paraphrase), UTF-8 annoys everyone equally.

Where does the current approach require work?

- We need a way to indicate the encoding of Python source code.
(Probably a "magic comment".)

- We need a way to indicate the encoding of input and output data
files, and we need shortcuts to set the encoding of stdin, stdout and
stderr (and maybe all files opened without an explicit encoding).
Marc-Andre showed some sample code, but I believe it is still
cumbersome.  (I have to play with it more to see how it could be
improved.)

- We need to discuss whether there should be a way to change the
default conversion between Unicode and 8-bit strings (currently
hardcoded to UTF-8), in order to make life easier for people who want
to continue to use their favorite 8-bit encoding (e.g. Latin-1, or
shift-JIS) but who also want to make use of the new Unicode datatype.

We're still in alpha, so we can still fix things.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From gresham@mediavisual.com  Thu Apr 27 17:41:04 2000
From: gresham@mediavisual.com (Paul Gresham)
Date: Fri, 28 Apr 2000 00:41:04 +0800
Subject: [I18n-sig] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]>             <l03102800b52d80db1290@[193.78.237.154]>  <200004271501.LAA13535@eric.cnri.reston.va.us>
Message-ID: <010f01bfb067$64e43260$9a2b440a@miv01>

Hi, I'm not sure how much value I can add, as I know little about the
charsets etc. and a bit more about Python. As a user of these, and running a
consultancy firm in Hong Kong, I can at least pass on some points and
perhaps help you with testing later on. My first touch on international PCs
was fixing a Japanese 8086 back in 1989, it didn't even have colour ! Hong
Kong is quite an experience as there are two formats in common use, plus
occasionally another gets thrown in. In HK they use the Traditional Chinese,
whereas the mainland uses Simplified, as Guido says, there are a number of
different types of these. Occasionally we see the Taiwanese charsets used.

It seems to me that having each individual string variable encoded might
just be too atomic, perhaps creating a cumbersome overhead in the system.
For most applications I can settle for the entire app to be using a single
charset, however from experience there are exceptions. We are normally
working with prior knowledge of the charset being used, rather than having
to deal with any charset which may come along (at an application level), and
therefore generally work in a context, just as a European programmer would
be working in say English or German.

As you know, storage/retrieval is not a problem, but manipulation and
comparison is. A nice way to handle this would be like operator overloading
such that string operations would be perfomed in the context of the current
charset, I could then change context as needed, removing the need for
metadata surrounding the actual data. This should speed things up as each
overloaded library could be optimised given the different quirks, and new
ones could be added easily. My code could be easily re-used on different
charsets by simply changing context externally to the code, rather than
passing in lots of stuff and expecting Python to deal with it.

Also I'd like very much to compile/load in only the International charsets
that I need. I wouldn't want to see Java type bloat occurring to Python, and
adding internationalisation for everything, is huge.

I think what I am suggesting is a different approach which obviously places
more onus on the programmer rather than Python. Perhaps this is not
acceptable, I don't know as I've never developed a programming language.

I hope this is a helpful point of view to get you thinking further,
otherwise ... please ignore me and I'll keep quiet : )

Regards
Paul

----- Original Message -----
From: "Guido van Rossum" <guido@python.org>
To: <python-dev@python.org>; <i18n-sig@python.org>
Cc: "Just van Rossum" <just@letterror.com>
Sent: Thursday, April 27, 2000 11:01 PM
Subject: [I18n-sig] Unicode debate


> I'd like to reset this discussion.  I don't think we need to involve
> c.l.py yet -- I haven't seen anyone with Asian language experience
> chime in there, and that's where this matters most.  I am directing
> this to the Python i18n-sig mailing list, because that's where the
> debate belongs, and there interested parties can join the discussion
> without having to be vetted as "fit for python-dev" first.
>
> I apologize for having been less than responsive in the matter;
> unfortunately there's lots of other stuff on my mind right now that
> has recently had a tendency to distract me with higher priority
> crises.
>
> I've heard a few people claim that strings should always be considered
> to contain "characters" and that there should be one character per
> string element.  I've also heard a clamoring that there should only be
> one string type.  You folks have never used Asian encodings.  In
> countries like Japan, China and Korea, encodings are a fact of life,
> and the most popular encodings are ASCII supersets that use a variable
> number of bytes per character, just like UTF-8.  Each country or
> language uses different encodings, even though their characters look
> mostly the same to western eyes.  UTF-8 and Unicode is having a hard
> time getting adopted in these countries because most software that
> people use deals only with the local encodings.  (Sounds familiar?)
>
> These encodings are much less "pure" than UTF-8, because they only
> encode the local characters (and ASCII), and because of various
> problems with slicing: if you look "in the middle" of an encoded
> string or file, you may not know how to interpret the bytes you see.
> There are overlaps (in most of these encodings anyway) between the
> codes used for single-byte and double-byte encodings, and you may have
> to look back one or more characters to know what to make of the
> particular byte you see.  To get an idea of the nightmares that
> non-UTF-8 multibyte encodings give C/C++ programmers, see the
> Multibyte Character Set (MBCS) Survival Guide
> (http://msdn.microsoft.com/library/backgrnd/html/msdn_mbcssg.htm).
> See also the home page of the i18n-sig for more background information
> on encoding (and other i18n) issues
> (http://www.python.org/sigs/i18n-sig/).
>
> UTF-8 attempts to solve some of these problems: the multi-byte
> encodings are chosen such that you can tell by the high bits of each
> byte whether it is (1) a single-byte (ASCII) character (top bit off),
> (2) the start of a multi-byte character (at least two top bits on; how
> many indicates the total number of bytes comprising the character), or
> (3) a continuation byte in a multi-byte character (top bit on, next
> bit off).
>
> Many of the problems with non-UTF-8 multibyte encodings are the same
> as for UTF-8 though: #bytes != #characters, a byte may not be a valid
> character, regular expression patterns using "." may give the wrong
> results, and so on.
>
> The truth of the matter is: the encoding of string objects is in the
> mind of the programmer.  When I read a GIF file into a string object,
> the encoding is "binary goop".  When I read a line of Japanese text
> from a file, the encoding may be JIS, shift-JIS, or ENC -- this has to
> be an assumption built-in to my program, or perhaps information
> supplied separately (there's no easy way to guess based on the actual
> data).  When I type a string literal using Latin-1 characters, the
> encoding is Latin-1.  When I use octal escapes in a string literal,
> e.g. '\303\247', the encoding could be UTF-8 (this is a cedilla).
> When I type a 7-bit string literal, the encoding is ASCII.
>
> The moral of all this?  8-bit strings are not going away.  They are
> not encoded in UTF-8 henceforth.  Like before, and like 8-bit text
> files, they are encoded in whatever encoding you want.  All you get is
> an extra mechanism to convert them to Unicode, and the Unicode
> conversion defaults to UTF-8 because it is the only conversion that is
> reversible.  And, as Tim Peters quoted Andy Robinson (paraphrasing
> Tim's paraphrase), UTF-8 annoys everyone equally.
>
> Where does the current approach require work?
>
> - We need a way to indicate the encoding of Python source code.
> (Probably a "magic comment".)
>
> - We need a way to indicate the encoding of input and output data
> files, and we need shortcuts to set the encoding of stdin, stdout and
> stderr (and maybe all files opened without an explicit encoding).
> Marc-Andre showed some sample code, but I believe it is still
> cumbersome.  (I have to play with it more to see how it could be
> improved.)
>
> - We need to discuss whether there should be a way to change the
> default conversion between Unicode and 8-bit strings (currently
> hardcoded to UTF-8), in order to make life easier for people who want
> to continue to use their favorite 8-bit encoding (e.g. Latin-1, or
> shift-JIS) but who also want to make use of the new Unicode datatype.
>
> We're still in alpha, so we can still fix things.
>
> --Guido van Rossum (home page: http://www.python.org/~guido/)
>
>
>
>


From billtut@microsoft.com  Fri Apr 28 00:50:53 2000
From: billtut@microsoft.com (Bill Tutt)
Date: Thu, 27 Apr 2000 16:50:53 -0700
Subject: [I18n-sig] Re: Unicode debate
Message-ID: <4D0A23B3F74DD111ACCD00805F31D8101D8BD020@RED-MSG-50>

> Christopher Petrilli petrilli@amber.org <mailto:petrilli%40amber.org>

>> Guido van Rossum [guido@python.org <mailto:guido@python.org>] wrote:
>> I've heard a few people claim that strings should always be considered
>> to contain "characters" and that there should be one character per
>> string element.  I've also heard a clamoring that there should only be
>> one string type.  You folks have never used Asian encodings.  In
>> countries like Japan, China and Korea, encodings are a fact of life,
>> and the most popular encodings are ASCII supersets that use a variable
>> number of bytes per character, just like UTF-8.  Each country or
>> language uses different encodings, even though their characters look
>> mostly the same to western eyes.  UTF-8 and Unicode is having a hard
>> time getting adopted in these countries because most software that
>> people use deals only with the local encodings.  (Sounds familiar?)

> Actually a bigger concern that we hear from our customers in Japan is
> that Unicode has *serious* problems in asian languages.  Theey took
> the "unification" of Chinese and Japanese, rather than both, and
> therefore can not represent los of phrases quite right.  I can have
> someone write up a better dscription, but I was told by several
> Japanese people that they wouldn't use Unicode come hell or high
> water, basically.

Yeah, not all of the east asian ideographs are availble in Unicode atm. :(
Currently there are two pending extensions to the unified CJK ideographs.
Extension A is slated as part of the BMP. 0x0000 - 0xAAFF in Plane 2 is
currently slated for use by Extension B.  
BMP Roadmap: http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n2213.pdf
Plane 2 Roadmap: http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n2215.pdf

On top of which is there is this serious problem of end user defined
characters in a number of these MBCS encodings. 

Win32 OSs handles mapping these characters into Unicode in the following
way:
In the Win32 registry at:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage\EUDCCodeRan
ge

There exists several REG_SZ registry values. The names of the values are
MBCS code pages.
The values are source ranges in the codepage's code space.
e.g.:
932: F040-F9FC
936: AAA1-AFFE,F8A1-FEFE,A140-A7A0
949: C9A1-C9FE,FEA1-FEFE
950: FA40-FEFE,8E40-A0FE,8140-8DFE,C6A1-C8FE
etc....

These ranges get mapped into Unicode code space starting at U+E000 (the
beginning of the BMP private use area).

> Basically it's JJIS, Shift-JIS or nothing for most Japanese
> companies.  This was my experience working with Konica a few years ago 
> as well.

Don't forget the new JIS X 0213. :)

Bill


From tree@basistech.com  Fri Apr 28 01:01:17 2000
From: tree@basistech.com (Tom Emerson)
Date: Thu, 27 Apr 2000 20:01:17 -0400 (EDT)
Subject: [I18n-sig] Re: Unicode debate
In-Reply-To: <4D0A23B3F74DD111ACCD00805F31D8101D8BD020@RED-MSG-50>
References: <4D0A23B3F74DD111ACCD00805F31D8101D8BD020@RED-MSG-50>
Message-ID: <14600.54477.689349.86328@cymru.basistech.com>

Bill Tutt writes:
 > > Actually a bigger concern that we hear from our customers in Japan is
 > > that Unicode has *serious* problems in asian languages.  Theey took
 > > the "unification" of Chinese and Japanese, rather than both, and
 > > therefore can not represent los of phrases quite right.  I can have
 > > someone write up a better dscription, but I was told by several
 > > Japanese people that they wouldn't use Unicode come hell or high
 > > water, basically.

Then tell them to use JIS X 0221 instead of Unicode! Since it is a
Japanese National Standard they'll be pacified into using it, even
though it is nothing more than the Japanese translation of ISO/IEC
10646-1.1993.

This is becoming a bit of an urban legend: while it is true that
during the initial Han unification period for Unicode 1.0 there was
pushback from the Japanese who thought that characters were being left
out. This issue is one of glyph variants between Japanese kanji,
Simplified and Traditional Chinese hanzi, and Korean hanja: the same
character can take different forms in each of these locales.

Remember that one of the criterion for the Unified ideographs was that
mapping between legacy encodings and Unicode can be accomplished. If a
character can be found in an existing national standard (in the case
of Japan), then chances are that code point is found in the Unicode
block.

 > Yeah, not all of the east asian ideographs are availble in Unicode atm. :(

But most, if not all, of the commonly used characters *are* available
in Unicode 3.0. It is rare, especially for Japanese, to find words
that cannot be encoded in Unicode.

 > Currently there are two pending extensions to the unified CJK ideographs.
 > Extension A is slated as part of the BMP. 0x0000 - 0xAAFF in Plane 2 is

Extension A is part of Unicode 3.0 and will be in the BMP when ISO/IEC
10646.2000 is released.

 > On top of which is there is this serious problem of end user defined
 > characters in a number of these MBCS encodings. 

Especially true when dealing with the Hong Kong Supplementary
Character Set (HKSCS). However, the HKSAR provides mapping tables for
between Big Five and HKSCS and ISO/IEC 10646.1993 and .2000 (two 10646
tables are required since some of the code points in the HKSCS are
included in IEB-A --- the rest should appear in IEB-B). The problem is
when you want to transcode between Chinese encodings: you cannot go
from HKSCS to GB2312 or GBK --- the mappings simply do not exist.

 > Don't forget the new JIS X 0213. :)

Has it been published?

    -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


From billtut@microsoft.com  Fri Apr 28 01:17:01 2000
From: billtut@microsoft.com (Bill Tutt)
Date: Thu, 27 Apr 2000 17:17:01 -0700
Subject: [I18n-sig] Re: Unicode debate
Message-ID: <4D0A23B3F74DD111ACCD00805F31D8101D8BD021@RED-MSG-50>


> From: Tom Emerson [mailto:tree@cymru.basistech.com]
> 
> 
>  > Don't forget the new JIS X 0213. :)
> 
> Has it been published?
> 

Apparently so.

http://jcs.aa.tufs.ac.jp/jcs/index-e.htm notes:

The new Japanese Industrial Standard for a coded character set, JIS X0213
(an enhancement to the current X0208), has been established on January the
21th, 2000. 

The standard has been published on February the 29th, 2000. The standard
(written in Japanese) is priced 11,000(Japanese Yen, 541pages), and is
distributed by Japanese Standards Association 

Bill


From paul@prescod.net  Fri Apr 28 03:20:22 2000
From: paul@prescod.net (Paul Prescod)
Date: Thu, 27 Apr 2000 21:20:22 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us>
Message-ID: <3908F566.8E5747C@prescod.net>

Guido van Rossum wrote:
> 
> ...
>
> I've heard a few people claim that strings should always be considered
> to contain "characters" and that there should be one character per
> string element.  I've also heard a clamoring that there should only be
> one string type.  You folks have never used Asian encodings.  In
> countries like Japan, China and Korea, encodings are a fact of life,
> and the most popular encodings are ASCII supersets that use a variable
> number of bytes per character, just like UTF-8.  Each country or
> language uses different encodings, even though their characters look
> mostly the same to western eyes.  UTF-8 and Unicode is having a hard
> time getting adopted in these countries because most software that
> people use deals only with the local encodings.  (Sounds familiar?)

I think that maybe an important point is getting lost here. I could be
wrong, but it seems that all of this emphasis on encodings is misplaced.

The physical and logical makeup of character strings are entirely
separate issues. Unicode is a character set. It works in the logical
domain.

Dozens of different physical encodings can be used for Unicode
characters. There are XML users who work with XML (and thus Unicode)
every day and never see UTF-8, UTF-16 or any other Unicode-consortium
"sponsored" encoding. If you invent an encoding tomorrow, it can still
be XML-compatible. There are many encodings older than Unicode that are
XML (and Unicode) compatible.

I have not heard complaints about the XML way of looking at the world
and in fact it was explicitly endorsed by many of the world's leading
experts on internationalization. I haven't followed the Java situation
as closely but I have also not heard screams about its support for il8n. 

> The truth of the matter is: the encoding of string objects is in the
> mind of the programmer.  When I read a GIF file into a string object,
> the encoding is "binary goop".  

IMHO, it's a mistake of history that you would even think it makes sense
to read a GIF file into a "string" object and we should be trying to
erase that mistake, as quickly as possible (which is admittedly not very
quickly) not building more and more infrastructure around it. How can we
make the transition to a "binary goops are not strings" world easiest?

> The moral of all this?  8-bit strings are not going away.  

If that is a statement of your long term vision, then I think that it is
very unfortunate. Treating string literals as if they were isomorphic
with byte arrays was probably the right thing in 1991 but it won't be in
2005.

It doesn't meet the definition of string used in the Unicode spec., nor
in XML, nor in Java, nor at the W3C nor in most other up and coming
specifications.

From the W3C site:

""While ISO-2022-JP is not sufficient for every ISO10646 document, it is
the case that ISO10646 is a sufficient document character set for any
entity encoded with ISO-2022-JP.""

http://www.w3.org/MarkUp/html-spec/charset-harmful.html

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html


From just@letterror.com  Fri Apr 28 09:33:16 2000
From: just@letterror.com (Just van Rossum)
Date: Fri, 28 Apr 2000 09:33:16 +0100
Subject: [I18n-sig] Re: Unicode debate
In-Reply-To: <200004271501.LAA13535@eric.cnri.reston.va.us>
References: Your message of "Thu, 27 Apr 2000 06:42:43 BST."
 <l03102800b52d80db1290@[193.78.237.154]>
 <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]>
Message-ID: <l03102802b52ef9cb79aa@[193.78.237.170]>

At 11:01 AM -0400 27-04-2000, Guido van Rossum wrote:
>Where does the current approach require work?
>
>- We need a way to indicate the encoding of Python source code.
>(Probably a "magic comment".)

How will other parts of a program know which encoding was used for
non-unicode string literals?

It seems to me that an encoding attribute for 8-bit strings solves this
nicely. The attribute should only be set automatically if the encoding of
the source file was specified or when the string has been encoded from a
unicode string. The attribute should *only* be used when converting to
unicode. (Hm, it could even be used when calling unicode() without the
encoding argument.) It should *not* be used when comparing (or adding,
etc.) 8-bit strings to each other, since they still may contain binary
goop, even in a source file with a specified encoding!

>- We need a way to indicate the encoding of input and output data
>files, and we need shortcuts to set the encoding of stdin, stdout and
>stderr (and maybe all files opened without an explicit encoding).

Can you open a file *with* an explicit encoding?

Just


From mal@lemburg.com  Fri Apr 28 10:39:37 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 28 Apr 2000 11:39:37 +0200
Subject: [I18n-sig] Re: [Python-Dev] Re: [XML-SIG] Python 1.6a2 Unicode experiences?
References: <200004270208.WAA01413@newcnri.cnri.reston.va.us> <001c01bfb033$96bf66d0$01ac2ac0@boulder> <3908F5B8.9F8D8A9A@prescod.net> <20000428001229.A4790@trump.amber.org>
Message-ID: <39095C59.A5916EEB@lemburg.com>

[Note: These discussion should all move to 18n-sig... CCing there]

Christopher Petrilli wrote:
> 
> Paul Prescod [paul@prescod.net] wrote:
> > > Even working with exotic languages, there is always a native
> > > 8-bit encoding.
> >
> > Unicode has many encodings: Shift-JIS, Big-5, EBCDIC ... You can use
> > 8-bit encodings of Unicode if you want.
> 
> Um, if you go:
> 
>     JIS -> Unicode -> JIS
> 
> you don't get the same thing out that you put in (at least this is
> what I've been told by a lot of Japanese developers), and therefore
> it's not terribly popular because of the nature of the Japanese (and
> Chinese) langauge.
> 
> My experience with Unicode is that a lot of Western people think it's
> the answer to every problem asked, while most asian language people
> disagree vehemently.  This says the problem isn't solved yet, even if
> people wish to deny it.

Isn't this a problem of the translation rather than Unicode
itself (Andy mentioned several times that you can use the private
BMP areas to implement 1-1 round-trips) ?

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From mal@lemburg.com  Fri Apr 28 11:28:48 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 28 Apr 2000 12:28:48 +0200
Subject: [I18n-sig] Re: Unicode debate
References: Your message of "Thu, 27 Apr 2000 06:42:43 BST."
 <l03102800b52d80db1290@[193.78.237.154]>
 <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]> <l03102802b52ef9cb79aa@[193.78.237.170]>
Message-ID: <390967DF.5424E6DF@lemburg.com>

Just van Rossum wrote:
> 
> At 11:01 AM -0400 27-04-2000, Guido van Rossum wrote:
> >Where does the current approach require work?
> >
> >- We need a way to indicate the encoding of Python source code.
> >(Probably a "magic comment".)
> 
> How will other parts of a program know which encoding was used for
> non-unicode string literals?
>
> It seems to me that an encoding attribute for 8-bit strings solves this
> nicely. The attribute should only be set automatically if the encoding of
> the source file was specified or when the string has been encoded from a
> unicode string. The attribute should *only* be used when converting to
> unicode. (Hm, it could even be used when calling unicode() without the
> encoding argument.) It should *not* be used when comparing (or adding,
> etc.) 8-bit strings to each other, since they still may contain binary
> goop, even in a source file with a specified encoding!

This would indeed solve some issues... it would cost sizeof(short)
per string object though (the integer would map into a table
of encoding names).

I'm not sure what to do with the attribute when strings with
differing encodings meet. UTF-8 + ASCII will still be UTF-8,
but e.g. UTF-8 + Latin will not result in meaningful data. Two
ideas for coercing strings with different encodings:

 1. the encoding of the resulting string is set to 'undefined'

 2. coerce both strings to Unicode and then apply the action

Also, how would one create a string having a specific encoding ?
str(object, encname) would match unicode(object, encname)...

> >- We need a way to indicate the encoding of input and output data
> >files, and we need shortcuts to set the encoding of stdin, stdout and
> >stderr (and maybe all files opened without an explicit encoding).
> 
> Can you open a file *with* an explicit encoding?

You can specify the encoding by means of using codecs.open()
instead of open(), but the interface will currently only
accept (.write) and return (.read) Unicode objects.

We'll probably have to make these a little more comfortable,
e.g. by accepting strings and Unicode objects. The needed
machinery is there -- we'd only need to define a suitable
interface on top of the classic file interface.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From tree@basistech.com  Fri Apr 28 11:44:00 2000
From: tree@basistech.com (Tom Emerson)
Date: Fri, 28 Apr 2000 06:44:00 -0400 (EDT)
Subject: [I18n-sig] Re: Unicode debate
In-Reply-To: <l03102802b52ef9cb79aa@[193.78.237.170]>
References: <l03102800b52d80db1290@[193.78.237.154]>
 <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102802b52ef9cb79aa@[193.78.237.170]>
Message-ID: <14601.27504.337569.201251@cymru.basistech.com>

Just van Rossum writes:
 > How will other parts of a program know which encoding was used for
 > non-unicode string literals?

This is the exact reason that Unicode should be used for all string
literals: from a language design perspective I don't understand the
rationale for providing "traditional" and "unicode" string.

 > It seems to me that an encoding attribute for 8-bit strings solves this
 > nicely. The attribute should only be set automatically if the encoding of
 > the source file was specified or when the string has been encoded from a
 > unicode string. The attribute should *only* be used when converting to
 > unicode. (Hm, it could even be used when calling unicode() without the
 > encoding argument.) It should *not* be used when comparing (or adding,
 > etc.) 8-bit strings to each other, since they still may contain binary
 > goop, even in a source file with a specified encoding!

In Dylan there is an explicit split between 'characters' (which are
always Unicode) and 'bytes'.

What are the compelling reasons to not use UTF-8 as the (source)
document encoding? In the past the usual response is, "the tools are't
there for authoring UTF-8 documents". This argument becomes more
specious as more OS's move towards Unicode. I firmly believe this can
be done without Java's bloat.

One off-the-cuff solution is this:

All character strings are Unicode (utf-8 encoding). Language terminals
and operators are restricted to US-ASCII, which are identical to
UTF8. The contents of comments are not interpreted in any way.

 > >- We need a way to indicate the encoding of input and output data
 > >files, and we need shortcuts to set the encoding of stdin, stdout and
 > >stderr (and maybe all files opened without an explicit encoding).
 > 
 > Can you open a file *with* an explicit encoding?

If you cannot, you lose. You absolutely must be able to specify the
encoding of a file when opening it, so that the runtime can transcode
into the native encoding as you read it. This should be otherwise
transparent the user.

            -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


From just@letterror.com  Fri Apr 28 12:58:28 2000
From: just@letterror.com (Just van Rossum)
Date: Fri, 28 Apr 2000 12:58:28 +0100
Subject: [I18n-sig] Re: Unicode debate
In-Reply-To: <390967DF.5424E6DF@lemburg.com>
References: Your message of "Thu, 27 Apr 2000 06:42:43 BST."
 <l03102800b52d80db1290@[193.78.237.154]>
 <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]>
 <l03102802b52ef9cb79aa@[193.78.237.170]>
Message-ID: <l03102809b52f28c18787@[193.78.237.170]>

At 12:28 PM +0200 28-04-2000, M.-A. Lemburg wrote:
[ encoding attr for 8 bit strings ]
>This would indeed solve some issues... it would cost sizeof(short)
>per string object though (the integer would map into a table
>of encoding names).
>
>I'm not sure what to do with the attribute when strings with
>differing encodings meet. UTF-8 + ASCII will still be UTF-8,
>but e.g. UTF-8 + Latin will not result in meaningful data. Two
>ideas for coercing strings with different encodings:
>
> 1. the encoding of the resulting string is set to 'undefined'
>
> 2. coerce both strings to Unicode and then apply the action

1, because 2 can lead to surprises when two strings containing binary goop
are added and only one was a literal in a source file with an explicit
encoding.

(Would "undefined" be the same as "default"? It would still be nice to be
able to set the global default encoding.)

>Also, how would one create a string having a specific encoding ?
>str(object, encname) would match unicode(object, encname)...

Dunno. Is such a high level interface needed? I'm not proposing to make
8-bit strings almost as powerful as unicode strings: unicode strings are
just fine for those kinds of operations... Hm, I just realized that the
encoding attr can't be mutable (doh!), so maybe your suggestion isn't so
bad at all.

Off-topic, what's the idea behind this behavior?:
>>> unicode(u"abc")
u'\000a\000b\000c'

>> Can you open a file *with* an explicit encoding?
>
>You can specify the encoding by means of using codecs.open()
>instead of open(), but the interface will currently only
>accept (.write) and return (.read) Unicode objects.

Thanks, I wasn't aware of that. Can't the builtin open() function get an
additional encoding argument?

Just


From tree@basistech.com  Fri Apr 28 11:56:50 2000
From: tree@basistech.com (Tom Emerson)
Date: Fri, 28 Apr 2000 06:56:50 -0400 (EDT)
Subject: [I18n-sig] Re: [Python-Dev] Re: [XML-SIG] Python 1.6a2 Unicode experiences?
In-Reply-To: <39095C59.A5916EEB@lemburg.com>
References: <200004270208.WAA01413@newcnri.cnri.reston.va.us>
 <001c01bfb033$96bf66d0$01ac2ac0@boulder>
 <3908F5B8.9F8D8A9A@prescod.net>
 <20000428001229.A4790@trump.amber.org>
 <39095C59.A5916EEB@lemburg.com>
Message-ID: <14601.28274.667733.660938@cymru.basistech.com>

M.-A. Lemburg writes:
 > > > Unicode has many encodings: Shift-JIS, Big-5, EBCDIC ... You can use
 > > > 8-bit encodings of Unicode if you want.

This is meaningless: legacy encodings of national character
sets such Shift-JIS, Big Five, GB2312, or TIS620 are not "encodings"
of Unicode.

TIS620 is a single-byte, 8-bit encoding: each character is
represented by a single byte. The Japanese and Chinese encodings are
multibyte, 8-bit, encodings. ISO-2022 is a multi-byte, 7-bit encoding
for multiple character sets.

Unicode has several possible encodings: UTF-8, UCS-2, UCS-4,
UTF-16... You can view all of these as 8-bit encodings, if you
like. Some are multibyte (such as UTF-8, where each character in
Unicode is represented in 1 to 3 bytes) while others are fixed length,
two or four bytes per character.

 > > Um, if you go:
 > > 
 > >     JIS -> Unicode -> JIS
 > > 
 > > you don't get the same thing out that you put in (at least this is
 > > what I've been told by a lot of Japanese developers), and therefore
 > > it's not terribly popular because of the nature of the Japanese (and
 > > Chinese) langauge.

This is simply not true any more. The ability to round trip between
Unicode and legacy encodings is dependent on the software: being able
to use code points in the PUA for this is acceptable and commonly
done.

The big advantage is in using Unicode as a pivot when transcoding
between different CJK encodings. It is very difficult to map between,
say, Shift JIS and GB2312, directly. However, Unicode provides a good
go-between.

It isn't a panacea: transcoding between legacy encodings like GB2312
and Big Five is still difficult: Unicode or not.

 > > My experience with Unicode is that a lot of Western people think it's
 > > the answer to every problem asked, while most asian language people
 > > disagree vehemently.  This says the problem isn't solved yet, even if
 > > people wish to deny it.

This is a shame: it is an indication that they don't understand the
technology. Unicode is a tool: nothing more.

            -tree 

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


From fredrik@pythonware.com  Fri Apr 28 13:15:06 2000
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Fri, 28 Apr 2000 14:15:06 +0200
Subject: [I18n-sig] Re: [Python-Dev] Re: [XML-SIG] Python 1.6a2 Unicode experiences?
References: <200004270208.WAA01413@newcnri.cnri.reston.va.us> <001c01bfb033$96bf66d0$01ac2ac0@boulder> <3908F5B8.9F8D8A9A@prescod.net> <20000428001229.A4790@trump.amber.org> <39095C59.A5916EEB@lemburg.com>
Message-ID: <00d101bfb10b$68585800$0500a8c0@secret.pythonware.com>

Christopher Petrilli wrote:
>=20
> Paul Prescod [paul@prescod.net] wrote:
> > > Even working with exotic languages, there is always a native
> > > 8-bit encoding.
> >
> > Unicode has many encodings: Shift-JIS, Big-5, EBCDIC ... You can use
> > 8-bit encodings of Unicode if you want.
>=20
> Um, if you go:
>=20
>     JIS -> Unicode -> JIS
>=20
> you don't get the same thing out that you put in (at least this is
> what I've been told by a lot of Japanese developers), and therefore
> it's not terribly popular because of the nature of the Japanese (and
> Chinese) langauge.
>=20
> My experience with Unicode is that a lot of Western people think it's
> the answer to every problem asked, while most asian language people
> disagree vehemently.  This says the problem isn't solved yet, even if
> people wish to deny it.

this is partly true, partly caused by a confusion over what unicode
really is.  there are at least two issues involved here:

* the unicode character repertoire is not complete

unicode contains all characters from the basic JIS X character
sets (please correct me if I'm wrong), but it doesn't include all
characters in common use in Japan.

as far as I've understood, this is mostly personal names and trade
names.  however, different vendors tend to use different sets,
with different encodings, and there has been no consensus on
which to add, and how.

so in other words, if you're "transcoding" from one encoding to
another (when converting data, or printing or displaying on a
device assuming a different encoding), unicode isn't good enough.

as MAL pointed out, you can work around this by using custom
codecs, mapping the vendor specific characters that you happen
to use to private regions in the unicode code space.  but afaik,
there is no standard way to do that at this time.

(this probably applies to other "CJK languages" too.  if anyone
could verify that, I'd be grateful).

* unicode is about characters, not languages

if you have a unicode string, you still don't know how to display
it.  the string tells you what characters to use, not what language
the text is written in.

and while using one standard "glyph" per unicode character works
pretty well for latin characters (no, it's not perfect, but it's not
much of a problem in real life), it doesn't work for asian languages.
you need extra language/locale information to pick the right glyph
for any given unicode character.

and the crux is that before unicode, this wasn't really a problem
-- if you knew the encoding, you knew what language to use.  when
using unicode, you need to put that information somewhere else
(in an XML attribute, for example).

* corrections and additions are welcome, of course.

</F>


From mal@lemburg.com  Fri Apr 28 13:13:56 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 28 Apr 2000 14:13:56 +0200
Subject: [I18n-sig] Re: Unicode debate
References: <l03102800b52d80db1290@[193.78.237.154]>
 <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102802b52ef9cb79aa@[193.78.237.170]> <14601.27504.337569.201251@cymru.basistech.com>
Message-ID: <39098084.C9600963@lemburg.com>

Tom Emerson wrote:
> 
> Just van Rossum writes:
>  > How will other parts of a program know which encoding was used for
>  > non-unicode string literals?
> 
> This is the exact reason that Unicode should be used for all string
> literals: from a language design perspective I don't understand the
> rationale for providing "traditional" and "unicode" string.
> 
>  > It seems to me that an encoding attribute for 8-bit strings solves this
>  > nicely. The attribute should only be set automatically if the encoding of
>  > the source file was specified or when the string has been encoded from a
>  > unicode string. The attribute should *only* be used when converting to
>  > unicode. (Hm, it could even be used when calling unicode() without the
>  > encoding argument.) It should *not* be used when comparing (or adding,
>  > etc.) 8-bit strings to each other, since they still may contain binary
>  > goop, even in a source file with a specified encoding!
> 
> In Dylan there is an explicit split between 'characters' (which are
> always Unicode) and 'bytes'.
> 
> What are the compelling reasons to not use UTF-8 as the (source)
> document encoding? In the past the usual response is, "the tools are't
> there for authoring UTF-8 documents". This argument becomes more
> specious as more OS's move towards Unicode. I firmly believe this can
> be done without Java's bloat.
> 
> One off-the-cuff solution is this:
> 
> All character strings are Unicode (utf-8 encoding). Language terminals
> and operators are restricted to US-ASCII, which are identical to
> UTF8. The contents of comments are not interpreted in any way.

That would be an option... albeit one that would probably render
many of the existing programs useless (I do believe that many
people have encoded their local charset into their programs,
either by entering locale dependent strings directly in the source
code or by making some assumption about their encoding).
 
>  > >- We need a way to indicate the encoding of input and output data
>  > >files, and we need shortcuts to set the encoding of stdin, stdout and
>  > >stderr (and maybe all files opened without an explicit encoding).
>  >
>  > Can you open a file *with* an explicit encoding?
> 
> If you cannot, you lose. You absolutely must be able to specify the
> encoding of a file when opening it, so that the runtime can transcode
> into the native encoding as you read it. This should be otherwise
> transparent the user.

You can: codecs.open(). The interface needs some further
refinement though.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From mal@lemburg.com  Fri Apr 28 13:09:36 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 28 Apr 2000 14:09:36 +0200
Subject: [I18n-sig] Re: Unicode debate
References: Your message of "Thu, 27 Apr 2000 06:42:43 BST."
 <l03102800b52d80db1290@[193.78.237.154]>
 <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]>
 <l03102802b52ef9cb79aa@[193.78.237.170]> <l03102809b52f28c18787@[193.78.237.170]>
Message-ID: <39097F80.6A0E9FBD@lemburg.com>

[Diving off into the Great Unkown... perhaps we'll end up with
 a useful proposal ;-)]

Just van Rossum wrote:
> 
> At 12:28 PM +0200 28-04-2000, M.-A. Lemburg wrote:
> [ encoding attr for 8 bit strings ]
> >This would indeed solve some issues... it would cost sizeof(short)
> >per string object though (the integer would map into a table
> >of encoding names).
> >
> >I'm not sure what to do with the attribute when strings with
> >differing encodings meet. UTF-8 + ASCII will still be UTF-8,
> >but e.g. UTF-8 + Latin will not result in meaningful data. Two
> >ideas for coercing strings with different encodings:
> >
> > 1. the encoding of the resulting string is set to 'undefined'
> >
> > 2. coerce both strings to Unicode and then apply the action
> 
> 1, because 2 can lead to surprises when two strings containing binary goop
> are added and only one was a literal in a source file with an explicit
> encoding.
> 
> (Would "undefined" be the same as "default"? It would still be nice to be
> able to set the global default encoding.)

I should have been more precise:

2. provided both strings have encodings which can be converted
   to Unicode, coerce them to Unicode and then apply the action;
   otherwise proceed as in 1., i.e. the result has an undefined
   encoding.

If 2. does try to convert to Unicode, conversion errors should
be raised (just like they are now for Unicode coercion errors).

Some more tricky business:

How should str('bla', 'enc1') and str('bla', 'enc2') compare ?
What about the hash values of the two ?

> >Also, how would one create a string having a specific encoding ?
> >str(object, encname) would match unicode(object, encname)...
> 
> Dunno. Is such a high level interface needed? I'm not proposing to make
> 8-bit strings almost as powerful as unicode strings: unicode strings are
> just fine for those kinds of operations... Hm, I just realized that the
> encoding attr can't be mutable (doh!), so maybe your suggestion isn't so
> bad at all.

That's why I was proposing str(obj, encname)... because the
encoding can't be changed after creation. Default encoding
would be 'undefined' for strings created dynamically using
just "..." and the source code encoding in case the strings
were defined in a Python source file (the compiler would set
the encoding).

Hmm, we'd still loose big in case someone puts a raw data
string into a Python source file without changing the encoding
to e.g. 'binary'.

We'd then have to write:

s = "...bla..." # source code encoding
data = str("...data...","binary") # binary data

Although binary data should really use:

data = buffer("...data...")

Side note: "...bla..." + buffer("...data...") currently returns
"...bla......data..." -- not very useful: I would have expected
a new buffer object instead. With string encoding attribute
this could be remedied to produce a string having 'binary'
encoding (at least).

Some more issues:

How should str(obj,encname) extract the information from the
object: via getcharbuf or getreadbuf ? Should it take the
encoding of the obj into account (in case it is a string object) ?

What should str(unicode, encname) return (the same as
unicode.encode(encname)) ?

What would file.read() return (a string with 'undefined'
encoding ?) ? An extra parameter to open() could be added
to have it return strings with a predefined encoding.

> Off-topic, what's the idea behind this behavior?:
> >>> unicode(u"abc")
> u'\000a\000b\000c'

Hmm, I get:

>>> unicode(u"abc")
u'abc'

This was fixed upon Guido's request some weeks ago.
 
> >> Can you open a file *with* an explicit encoding?
> >
> >You can specify the encoding by means of using codecs.open()
> >instead of open(), but the interface will currently only
> >accept (.write) and return (.read) Unicode objects.
> 
> Thanks, I wasn't aware of that. Can't the builtin open() function get an
> additional encoding argument?

That would be probably be an option after some rounds of refinement
of the interface.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From guido@python.org  Fri Apr 28 14:24:29 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 28 Apr 2000 09:24:29 -0400
Subject: [I18n-sig] Re: [Python-Dev] Re: [XML-SIG] Python 1.6a2 Unicode experiences?
In-Reply-To: Your message of "Fri, 28 Apr 2000 11:39:37 +0200."
 <39095C59.A5916EEB@lemburg.com>
References: <200004270208.WAA01413@newcnri.cnri.reston.va.us> <001c01bfb033$96bf66d0$01ac2ac0@boulder> <3908F5B8.9F8D8A9A@prescod.net> <20000428001229.A4790@trump.amber.org>
 <39095C59.A5916EEB@lemburg.com>
Message-ID: <200004281324.JAA15642@eric.cnri.reston.va.us>

> [Note: These discussion should all move to 18n-sig... CCing there]
> 
> Christopher Petrilli wrote:
> > you don't get the same thing out that you put in (at least this is
> > what I've been told by a lot of Japanese developers), and therefore
> > it's not terribly popular because of the nature of the Japanese (and
> > Chinese) langauge.
> > 
> > My experience with Unicode is that a lot of Western people think it's
> > the answer to every problem asked, while most asian language people
> > disagree vehemently.  This says the problem isn't solved yet, even if
> > people wish to deny it.

[Marc-Andre Lenburg]
> Isn't this a problem of the translation rather than Unicode
> itself (Andy mentioned several times that you can use the private
> BMP areas to implement 1-1 round-trips) ?

Maybe, but apparently such high-quality translations are rare (note
that Andy said "can").

Anyway, a word of caution here.  Years ago I attended a number of IETF
meetings on internationalization, in a time when Unicode wasn't as
accepted as it is now.  The one thing I took away from those meetings
was that this is a *highly* emotional and controversial issue.

As the Python community, I feel we have no need to discuss "why
Unicode."  Therein lies madness, controversy, and no progress.  We
know there's a clear demand for Unicode, and we've committed to
support it.  The question now at hand is "how Unicode."  Let's please
focus on that, e.g. in the other thread ("Unicode debate") in i18n-sig
and python-dev.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Fri Apr 28 15:10:27 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 28 Apr 2000 10:10:27 -0400
Subject: [I18n-sig] Re: [Python-Dev] Re: Unicode debate
In-Reply-To: Your message of "Fri, 28 Apr 2000 09:33:16 BST."
 <l03102802b52ef9cb79aa@[193.78.237.170]>
References: Your message of "Thu, 27 Apr 2000 06:42:43 BST." <l03102800b52d80db1290@[193.78.237.154]> <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]>
 <l03102802b52ef9cb79aa@[193.78.237.170]>
Message-ID: <200004281410.KAA16104@eric.cnri.reston.va.us>

[GvR]
> >- We need a way to indicate the encoding of Python source code.
> >(Probably a "magic comment".)

[JvR]
> How will other parts of a program know which encoding was used for
> non-unicode string literals?
> 
> It seems to me that an encoding attribute for 8-bit strings solves this
> nicely. The attribute should only be set automatically if the encoding of
> the source file was specified or when the string has been encoded from a
> unicode string. The attribute should *only* be used when converting to
> unicode. (Hm, it could even be used when calling unicode() without the
> encoding argument.) It should *not* be used when comparing (or adding,
> etc.) 8-bit strings to each other, since they still may contain binary
> goop, even in a source file with a specified encoding!

Marc-Andre took this idea a bit further, but I think it's not
practical given the current implementation: there are too many places
where the C code would have to be changed in order to propagate the
string encoding information, and there are too many sources of strings
with unknown encodings to make it very useful.  Plus, it would slow
down 8-bit string ops.

I have a better idea: rather than carrying around 8-bit strings with
an encoding, use Unicode literals in your source code.  If the source
encoding is known, these will be converted using the appropriate
codec.

If you object to having to write u"..." all the time, we could say
that "..." is a Unicode literal if it contains any characters with the
top bit on (of course the source file encoding would be used just like
for u"...").

But I think this should be enabled by a separate pragma -- people who
want to write Unicode-unaware code manipulating 8-bit strings in their
favorite encoding (e.g. shift-JIS or Latin-1) should not silently get
Unicode strings.

(I thought about an option to make *all strings* (not just literals)
Unicode, but the current implementation would require too much
hacking.  This is what JPython does, and maybe it should be what
Python 3000 does; I don't see it as a realistic option for the 1.x
series.)

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Fri Apr 28 15:32:28 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 28 Apr 2000 10:32:28 -0400
Subject: [I18n-sig] Re: Unicode debate
In-Reply-To: Your message of "Fri, 28 Apr 2000 06:44:00 EDT."
 <14601.27504.337569.201251@cymru.basistech.com>
References: <l03102800b52d80db1290@[193.78.237.154]> <l03102805b52ca7830b18@[193.78.237.154]> <l03102802b52ef9cb79aa@[193.78.237.170]>
 <14601.27504.337569.201251@cymru.basistech.com>
Message-ID: <200004281432.KAA16418@eric.cnri.reston.va.us>

> This is the exact reason that Unicode should be used for all string
> literals: from a language design perspective I don't understand the
> rationale for providing "traditional" and "unicode" string.

In Python 3000, you would have a point.  In current Python, there
simply are too many programs and extensions written in other languages
that manipulating 8-bit strings to ignore their existence.  We're
trying to add Unicode support to Python 1.6 without breaking code that
used to run under Python 1.5.x; practicalities just make it impossible
to go with Unicode for everything.

I think that if Python didn't have so many extension modules (many
maintained by 3rd party modules) it would be a lot easier to switch to
Unicode for all strings (I think JavaScript has done this).

In Python 3000, we'll have to seriously consider having separate
character string and byte array objects, along the lines of Java's
model.  Note that I say "seriously consider."  We'll first have to see
how well the current solution works *in practice*.  There's time
before we fix Py3k in stone. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)


From guido@python.org  Fri Apr 28 15:50:05 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 28 Apr 2000 10:50:05 -0400
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: Your message of "Thu, 27 Apr 2000 21:20:22 CDT."
 <3908F566.8E5747C@prescod.net>
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us>
 <3908F566.8E5747C@prescod.net>
Message-ID: <200004281450.KAA16493@eric.cnri.reston.va.us>

[Paul Prescod]
> I think that maybe an important point is getting lost here. I could be
> wrong, but it seems that all of this emphasis on encodings is misplaced.

In practical applications that manipulate text, encodings creep up all
the time.  I remember a talk or message by Andy Robinson about the
messiness of producing printed reports in Japanese for a large
investment firm.  Most off the issues that took his time had to do
with encodings, if I recall correctly.  (Andy, do you remember what
I'm talking about?  Do you have a URL?)

> > The truth of the matter is: the encoding of string objects is in the
> > mind of the programmer.  When I read a GIF file into a string object,
> > the encoding is "binary goop".  
> 
> IMHO, it's a mistake of history that you would even think it makes sense
> to read a GIF file into a "string" object and we should be trying to
> erase that mistake, as quickly as possible (which is admittedly not very
> quickly) not building more and more infrastructure around it. How can we
> make the transition to a "binary goops are not strings" world easiest?

I'm afraid that's a bigger issue than we can solve for Python 1.6.
We're committed to by and large backwards compatibility while
supporting Unicode -- the backwards compatibility with tons of
extension module (many 3rd party) requires that we deal with 8-bit
strings in basically the same way as we did before.

> > The moral of all this?  8-bit strings are not going away.  
> 
> If that is a statement of your long term vision, then I think that it is
> very unfortunate. Treating string literals as if they were isomorphic
> with byte arrays was probably the right thing in 1991 but it won't be in
> 2005.

I think you're a tad too optimistic about the evolution speed of
software (Windows 2000 *still* has to support DOS programs), but I see
your point.  As I stated in another message, in Python 3000 we'll have
to consider a more Java-esque solution: *character* strings are
Unicode, and for bytes we have (mutable!) byte arras.  Certainly 8-bit
bytes as the smallest storage unit aren't going away.

> It doesn't meet the definition of string used in the Unicode spec., nor
> in XML, nor in Java, nor at the W3C nor in most other up and coming
> specifications.

OK, so that's a good indication of where you're coming from.  Maybe
you should spend a little more time in the trenches and a little less
in standards bodies.  Standards are good, but sometimes disconnected
from reality (remember ISO networking? :-).

> From the W3C site:
> 
> ""While ISO-2022-JP is not sufficient for every ISO10646 document, it is
> the case that ISO10646 is a sufficient document character set for any
> entity encoded with ISO-2022-JP.""

And this is exactly why encodings will remain important: entities
encoded in ISO-2022-JP have no compelling reason to be recoded
permanently into ISO10646, and there are lots of forces that make it
convenient to keep it encoded in ISO-2022-JP (like existing tools).

> http://www.w3.org/MarkUp/html-spec/charset-harmful.html

I know that document well.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From andy@reportlab.com  Fri Apr 28 17:12:39 2000
From: andy@reportlab.com (Andy Robinson)
Date: Fri, 28 Apr 2000 17:12:39 +0100
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <200004281450.KAA16493@eric.cnri.reston.va.us>
Message-ID: <PGECLPOBGNBNKHNAGIJHGEINCBAA.andy@reportlab.com>

Guido> In practical applications that manipulate text, encodings creep up
all
Guido> the time.  I remember a talk or message by Andy Robinson about the
Guido> messiness of producing printed reports in Japanese for a large
Guido> investment firm.  Most off the issues that took his time had to do
Guido> with encodings, if I recall correctly.  (Andy, do you remember what
Guido> I'm talking about?  Do you have a URL?)
Guido>

I attach the 'Case Study' I posted to the python-dev list
when I first joined.  If anyone else can tell their own
stories, however long or short, I feel it would be a
useful addition to the present discussion.


- Andy

>To: python-dev@python.org
>Subject: [Python-Dev] Internationalisation Case Study
>From: Andy Robinson <captainrobbo@yahoo.com>
>Date: Tue, 9 Nov 1999 05:57:46 -0800 (PST)
>
>Guido has asked me to get involved in this discussion,
>as I've been working practically full-time on i18n for
>the last year and a half and have done quite a bit
>with Python in this regard.  I thought the most
>helpful thing would be to describe the real-world
>business problems I have been tackling so people can
>understand what one might want from an encoding
>toolkit.  In this (long) post I have included:
>1. who I am and what I want to do
>2. useful sources of info
>3. a real world i18n project
>4. what I'd like to see in an encoding toolkit
>
>
>Grab a coffee - this is a long one.
>
>1. Who I am
>--------------
>Firstly, credentials.  I'm a Python programmer by
>night, and when I can involve it in my work which
>happens perhaps 20% of the time.  More relevantly, I
>did a postgrad course in Japanese Studies and lived in
>Japan for about two years; in 1990 when I returned, I
>was speaking fairly fluently and could read a
>newspaper with regular reference tio a dictionary.
>Since then my Japanese has atrophied badly, but it is
>good enough for IT purposes.  For the last year and a
>half I have been internationalizing a lot of systems -
>more on this below.
>
>My main personal interest is that I am hoping to
>launch a company using Python for reporting, data
>cleaning and transformation.  An encoding library is
>sorely needed for this.
>
>2. Sources of Knowledge
>------------------------------
>We should really go for world class advice on this.
>Some people who could really contribute to this
>discussion are:
>- Ken Lunde, author of "CJKV Information Processing"
>and head of Asian Type Development at Adobe.
>- Jeffrey Friedl, author of "Mastering Regular
>Expressions", and a long time Japan resident and
>expert on things Japanese
>- Maybe some of the Ruby community?
>
>I'll list up books URLs etc. for anyone who needs them
>on request.
>
>3. A Real World Project
>----------------------------
>18 months ago I was offered a contract with one of the
>world's largest investment management companies (which
>I will nickname HugeCo) , who (after many years having
>analysts out there) were launching a business in Japan
>to attract savers; due to recent legal changes,
>Japanese people can now freely buy into mutual funds
>run by foreign firms.  Given the 2% they historically
>get on their savings, and the 12% that US equities
>have returned for most of this century, this is a
>business with huge potential.  I've been there for a
>while now,
>rotating through many different IT projects.
>
>HugeCo runs its non-US business out of the UK.  The
>core deal-processing business runs on IBM AS400s.
>These are kind of a cross between a relational
>database and a file system, and speak their own
>encoding called EBCDIC.    Five years ago the AS400
>had limited
>connectivity to everything else, so they also started
>deploying Sybase databases on Unix to support some
>functions.  This means 'mirroring' data between the
>two systems on a regular basis.  IBM has always
>included encoding information on the AS400 and it
>converts from EBCDIC to ASCII on request with most of
>the transfer tools (FTP, database queries etc.)
>
>To make things work for Japan, everyone realised that
>a double-byte representation would be needed.
>Japanese has about 7000 characters in most IT-related
>character sets, and there are a lot of ways to store
>it.  Here's a potted language lesson.  (Apologies to
>people who really know this field -- I am not going to
>be fully pedantic or this would take forever).
>
>Japanese includes two phonetic alphabets (each with
>about 80-90 characters), the thousands of Kanji, and
>English characters, often all in the same sentence.
>The first attempt to display something was to
>make a single -byte character set which included
>ASCII, and a simplified (and very ugly) katakana
>alphabet in the upper half of the code page.  So you
>could spell out the sounds of Japanese words using
>'half width katakana'.
>
>The basic 'character set' is Japan Industrial Standard
>0208 ("JIS"). This was defined in 1978, the first
>official Asian character set to be defined by a
>government.   This can be thought of as a printed
>chart
>showing the characters - it does not define their
>storage on a computer.   It defined a logical 94 x 94
>grid, and each character has an index in this grid.
>
>The "JIS" encoding was a way of mixing ASCII and
>Japanese in text files and emails.  Each Japanese
>character had a double-byte value. It had 'escape
>sequences' to say 'You are now entering ASCII
>territory' or the opposite.   In 1978 Microsoft
>quickly came up with Shift-JIS, a smarter encoding.
>This basically said "Look at the next byte.  If below
>127, it is ASCII; if between A and B, it is a
>half-width
>katakana; if between B and C, it is the first half of
>a double-byte character and the next one is the second
>half".  Extended Unix Code (EUC) does similar tricks.
>Both have the property that there are no control
>characters, and ASCII is still ASCII.  There are a few
>other encodings too.
>
>Unfortunately for me and HugeCo, IBM had their own
>standard before the Japanese government did, and it
>differs; it is most commonly called DBCS (Double-Byte
>Character Set).  This involves shift-in and shift-out
>sequences (0x16 and 0x17, cannot remember which way
>round), so you can mix single and double bytes in a
>field.  And we used AS400s for our core processing.
>
>So, back to the problem.  We had a FoxPro system using
>ShiftJIS on the desks in Japan which we wanted to
>replace in stages, and an AS400 database to replace it
>with.  The first stage was to hook them up so names
>and addresses could be uploaded to the AS400, and data
>files consisting of daily report input could be
>downloaded to the PCs.  The AS400 supposedly had a
>library which did the conversions, but no one at IBM
>knew how it worked.  The people who did all the
>evaluations had basically proved that 'Hello World' in
>Japanese could be stored on an AS400, but never looked
>at the conversion issues until mid-project. Not only
>did we need a conversion filter, we had the problem
>that the character sets were of different sizes.  So
>it was possible - indeed, likely - that some of our
>ten thousand customers' names and addresses would
>contain characters only on one system or the other,
>and fail to
>survive a round trip.  (This is the absolute key issue
>for me - will a given set of data survive a round trip
>through various encoding conversions?)
>
>We figured out how to get the AS400 do to the
>conversions during a file transfer in one direction,
>and I wrote some Python scripts to make up files with
>each official character in JIS on a line; these went
>up with conversion, came back binary, and I was able
>to build a mapping table and 'reverse engineer' the
>IBM encoding.  It was straightforward in theory, "fun"
>in practice.  I then wrote a python library which knew
>about the AS400 and Shift-JIS encodings, and could
>translate a string between them.  It could also detect
>corruption and warn us when it occurred.  (This is
>another key issue - you will often get badly encoded
>data, half a kanji or a couple of random bytes, and
>need to be clear on your strategy for handling it in
>any library).  It was slow, but it got us our gateway
>in both directions, and it warned us of bad input. 360
>characters in the DBCS encoding actually appear twice,
>so perfect round trips are impossible, but practically
>you can survive with some validation of input at both
>ends.  The final story was that our names and
>addresses were mostly safe, but a few obscure symbols
>weren't.
>
>A big issue was that field lengths varied.  An address
>field 40 characters long on a PC might grow to 42 or
>44 on an AS400 because of the shift characters, so the
>software would truncate the address during import, and
>cut a kanji in half.  This resulted in a string that
>was illegal DBCS, and errors in the database.  To
>guard against this, you need really picky input
>validation.  You not only ask 'is this string valid
>Shift-JIS', you check it will fit on the other system
>too.
>
>The next stage was to bring in our Sybase databases.
>Sybase make a Unicode database, which works like the
>usual one except that all your SQL code suddenly
>becomes case sensitive - more (unrelated) fun when
>you have 2000 tables.  Internally it stores data in
>UTF8, which is a 'rearrangement' of Unicode which is
>much safer to store in conventional systems.
>Basically, a UTF8 character is between one and three
>bytes, there are no nulls or control characters, and
>the ASCII characters are still the same ASCII
>characters.  UTF8<->Unicode involves some bit
>twiddling but is one-to-one and entirely algorithmic.
>
>We had a product to 'mirror' data between AS400 and
>Sybase, which promptly broke when we fed it Japanese.
>The company bought a library called Unilib to do
>conversions, and started rewriting the data mirror
>software.  This library (like many) uses Unicode as a
>central point in all conversions, and offers most of
>the world's encodings.  We wanted to test it, and used
>the Python routines to put together a regression
>test.  As expected, it was mostly right but had some
>differences, which we were at least able to document.
>
>We also needed to rig up a daily feed from the legacy
>FoxPro database into Sybase while it was being
>replaced (about six months).  We took the same
>library, built a DLL wrapper around it, and I
>interfaced to this with DynWin , so we were able to do
>the low-level string conversion in compiled code and
>the high-level
>control in Python. A FoxPro batch job wrote out
>delimited text in shift-JIS; Python read this in, ran
>it through the DLL to convert it to UTF8, wrote that
>out as UTF8 delimited files, ftp'ed them to an
>in directory on the Unix box ready for daily import.
>At this point we had a lot of fun with field widths -
>Shift-JIS is much more compact than UTF8 when you have
>a lot of kanji (e.g. address fields).
>
>Another issue was half-width katakana.  These were the
>earliest attempt to get some form of Japanese out of a
>computer, and are single-byte characters above 128 in
>Shift-JIS - but are not part of the JIS0208 standard.
>
>They look ugly and are discouraged; but when you ar
>enterinh a long address in a field of a database, and
>it won't quite fit, the temptation is to go from
>two-bytes-per -character to one (just hit F7 in
>windows) to save space.  Unilib rejected these (as
>would Java), but has optional modes to preserve them
>or 'expand them out' to their full-width equivalents.
>
>
>The final technical step was our reports package.
>This is a 4GL using a really horrible 1980s Basic-like
>language which reads in fixed-width data files and
>writes out Postscript; you write programs saying 'go
>to x,y' and 'print customer_name', and can build up
>anything you want out of that.  It's a monster to
>develop in, but when done it really works -
>million page jobs no problem.  We had bought into this
>on the promise that it supported Japanese; actually, I
>think they had got the equivalent of 'Hello World' out
>of it, since we had a lot of problems later.
>
>The first stage was that the AS400 would send down
>fixed width data files in EBCDIC and DBCS.  We ran
>these through a C++ conversion utility, again using
>Unilib.  We had to filter out and warn about corrupt
>fields, which the conversion utility would reject.
>Surviving records then went into the reports program.
>
>It then turned out that the reports program only
>supported some of the Japanese alphabets.
>Specifically, it had a built in font switching system
>whereby when it encountered ASCII text, it would flip
>to the most recent single byte text, and when it found
>a byte above 127, it would flip to a double byte font.
> This is because many Chinese fonts do (or did)
>not include English characters, or included really
>ugly ones.  This was wrong for Japanese, and made the
>half-width katakana unprintable.  I found out that I
>could control fonts if I printed one character at a
>time with a special escape sequence, so wrote my own
>bit-scanning code (tough in a language without ord()
>or bitwise operations) to examine a string, classify
>every byte, and control the fonts the way I wanted.
>So a special subroutine is used for every name or
>address field.  This is apparently not unusual in GUI
>development (especially web browsers) - you rarely
>find a complete Unicode font, so you have to switch
>fonts on the fly as you print a string.
>
>After all of this, we had a working system and knew
>quite a bit about encodings.  Then the curve ball
>arrived:  User Defined Characters!
>
>It is not true to say that there are exactly 6879
>characters in Japanese, and more than counting the
>number of languages on the Indian sub-continent or the
>types of cheese in France.  There are historical
>variations and they evolve.  Some people's names got
>missed out, and others like to write a kanji in an
>unusual way.   Others arrived from China where they
>have more complex variants of the same characters.
>Despite the Japanese government's best attempts, these
>people have dug their heels in and want to keep their
>names the way they like them.  My first reaction was
>'Just Say No' - I basically said that it one of these
>customers (14 out of a database of 8000) could show me
>a tax form or phone bill with the correct UDC on it,
>we would implement it but not otherwise (the usual
>workaround is to spell their name phonetically in
>katakana).  But our marketing people put their foot
>down.
>
>A key factor is that Microsoft has 'extended the
>standard' a few times.  First of all, Microsoft and
>IBM include an extra 360 characters in their code page
>which are not in the JIS0208 standard.   This is well
>understood and most encoding toolkits know what 'Code
>Page 932' is Shift-JIS plus a few extra characters.
>Secondly, Shift-JIS has a User-Defined region of a
>couple of thousand characters.  They have lately been
>taking Chinese variants of Japanese characters (which
>are readable but a bit old-fashioned - I can imagine
>pipe-smoking professors using these forms as an
>affectation) and adding them into their standard
>Windows fonts; so users are getting used to these
>being available.  These are not in a standard.
>Thirdly, they include something called the 'Gaiji
>Editor' in Japanese Win95, which lets you add new
>characters to the fonts on your PC within the
>user-defined region.  The first step was to review all
>the PCs in the Tokyo office, and get one centralized
>extension font file on a server.  This was also fun as
>people had assigned different code points to
>characters on differene machines, so what looked
>correct on your word processor was a black square on
>mine.   Effectively, each company has its own custom
>encoding a bit bigger than the standard.
>
>Clearly, none of these extensions would convert
>automatically to the other platforms.
>
>Once we actually had an agreed list of code points, we
>scanned the database by eye and made sure that the
>relevant people were using them.  We decided that
>space for 128 User-Defined Characters would  be
>allowed.  We thought we would need a wrapper around
>Unilib to intercept these values and do a special
>conversion; but to our amazement it worked!  Somebody
>had already figured out a mapping for at least 1000
>characters for all the Japanes encodings, and they did
>the round trips from Shift-JIS to Unicode to DBCS and
>back.  So the conversion problem needed less code than
>we thought.  This mapping is not defined in a standard
>AFAIK (certainly not for DBCS anyway).
>
>We did, however, need some really impressive
>validation.  When you input a name or address on any
>of the platforms, the system should say
>(a) is it valid for my encoding?
>(b) will it fit in the available field space in the
>other platforms?
>(c) if it contains user-defined characters, are they
>the ones we know about, or is this a new guy who will
>require updates to our fonts etc.?
>
>Finally, we got back to the display problems.  Our
>chosen range had a particular first byte. We built a
>miniature font with the characters we needed starting
>in the lower half of the code page.  I then
>generalized by name-printing routine to say 'if the
>first character is XX, throw it away, and print the
>subsequent character in our custom font'.  This worked
>beautifully - not only could we print everything, we
>were using type 1 embedded fonts for the user defined
>characters, so we could distill it and also capture it
>for our internal document imaging systems.
>
>So, that is roughly what is involved in building a
>Japanese client reporting system that spans several
>platforms.
>
>I then moved over to the web team to work on our
>online trading system for Japan, where I am now -
>people will be able to open accounts and invest on the
>web.  The first stage was to prove it all worked.
>With HTML, Java and the Web, I had high hopes, which
>have mostly been fulfilled - we set an option in the
>database connection to say 'this is a UTF8 database',
>and Java converts it to Unicode when reading the
>results, and we set another option saying 'the output
>stream should be Shift-JIS' when we spew out the HTML.
> There is one limitations:  Java sticks to the JIS0208
>standard, so the 360 extra IBM/Microsoft Kanji and our
>user defined characters won't work on the web.  You
>cannot control the fonts on someone else's web
>browser; management accepted this because we gave them
>no alternative.  Certain customers will need to be
>warned, or asked to suggest a standard version of a
>charactere if they want to see their name on the web.
>I really hope the web actually brings character usage
>in line with the standard in due course, as it will
>save a fortune.
>
>Our system is multi-language - when a customer logs
>in, we want to say 'You are a Japanese customer of our
>Tokyo Operation, so you see page X in language Y'.
>The language strings all all kept in UTF8 in XML
>files, so the same file can hold many languages.  This
>and the database are the real-world reasons why you
>want to store stuff in UTF8.  There are very few tools
>to let you view UTF8, but luckily there is a free Word
>Processor that lets you type Japanese and save it in
>any encoding; so we can cut and paste between
>Shift-JIS and UTF8 as needed.
>
>And that's it.  No climactic endings and a lot of real
>world mess, just like life in IT.  But hopefully this
>gives you a feel for some of the practical stuff
>internationalisation projects have to deal with.  See
>my other mail for actual suggestions
>
>- Andy Robinson
>
>=====
>Andy Robinson
>Robinson Analytics Ltd.
>------------------
>My opinions are the official policy of Robinson Analytics Ltd.
>They just vary from day to day.


From just@letterror.com  Fri Apr 28 18:38:14 2000
From: just@letterror.com (Just van Rossum)
Date: Fri, 28 Apr 2000 18:38:14 +0100
Subject: [I18n-sig] Re: Unicode debate
In-Reply-To: <39097F80.6A0E9FBD@lemburg.com>
References: Your message of "Thu, 27 Apr 2000 06:42:43 BST."
 <l03102800b52d80db1290@[193.78.237.154]>
 <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]>
 <l03102802b52ef9cb79aa@[193.78.237.170]>
 <l03102809b52f28c18787@[193.78.237.170]>
Message-ID: <l0310280bb52f7c352ba6@[193.78.237.170]>

At 2:09 PM +0200 28-04-2000, M.-A. Lemburg wrote:
>> 1, because 2 can lead to surprises when two strings containing binary goop
>> are added and only one was a literal in a source file with an explicit
>> encoding.
>
[...]
>I should have been more precise:
>
>2. provided both strings have encodings which can be converted
>   to Unicode, coerce them to Unicode and then apply the action;
>   otherwise proceed as in 1., i.e. the result has an undefined
>   encoding.
>
>If 2. does try to convert to Unicode, conversion errors should
>be raised (just like they are now for Unicode coercion errors).

But that doesn't solve the binary goop problem: two binary gooplets may
have different "encodings", which happen to be valid (ie. not raise an
exception). Conversion to unicode is no way what you want.

>Some more tricky business:
>
>How should str('bla', 'enc1') and str('bla', 'enc2') compare ?
>What about the hash values of the two ?

I proposed to *only* use the encoding attr when dealing with 8-bit
string/unicode string combo's. Just ignore it completely when there's no
unicode string in sight.

Just


From just@letterror.com  Fri Apr 28 18:51:03 2000
From: just@letterror.com (Just van Rossum)
Date: Fri, 28 Apr 2000 18:51:03 +0100
Subject: [I18n-sig] Re: [Python-Dev] Re: Unicode debate
In-Reply-To: <200004281410.KAA16104@eric.cnri.reston.va.us>
References: Your message of "Fri, 28 Apr 2000 09:33:16 BST."
 <l03102802b52ef9cb79aa@[193.78.237.170]> Your message of "Thu, 27 Apr
 2000 06:42:43 BST." <l03102800b52d80db1290@[193.78.237.154]>
 <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]>
 <l03102802b52ef9cb79aa@[193.78.237.170]>
Message-ID: <l0310280cb52f7d3968d7@[193.78.237.170]>

[GvR, on string.encoding ]
>Marc-Andre took this idea a bit further, but I think it's not
>practical given the current implementation: there are too many places
>where the C code would have to be changed in order to propagate the
>string encoding information,

I may miss something, but the encoding attr just travels with the string
object, no? Like I said in my reply to MAL, I think it's undesirable to do
*anything* with the encoding attr if not in combination with a unicode
string.

>and there are too many sources of strings
>with unknown encodings to make it very useful.

That's why the default encoding must be settable as well, as Fredrik suggested.

>Plus, it would slow down 8-bit string ops.

Not if you ignore it most of the time, and just pass it along when
concatenating.

>I have a better idea: rather than carrying around 8-bit strings with
>an encoding, use Unicode literals in your source code.

Explain that to newbies... I guess is that they will want simple 8 bit
strings in their native encoding. Dunno.

>If the source
>encoding is known, these will be converted using the appropriate
>codec.
>
>If you object to having to write u"..." all the time, we could say
>that "..." is a Unicode literal if it contains any characters with the
>top bit on (of course the source file encoding would be used just like
>for u"...").

Only if "\377" would still yield an 8-bit string, for binary goop...

Just


From guido@python.org  Fri Apr 28 19:31:19 2000
From: guido@python.org (Guido van Rossum)
Date: Fri, 28 Apr 2000 14:31:19 -0400
Subject: [I18n-sig] Re: [Python-Dev] Re: Unicode debate
In-Reply-To: Your message of "Fri, 28 Apr 2000 18:51:03 BST."
 <l0310280cb52f7d3968d7@[193.78.237.170]>
References: Your message of "Fri, 28 Apr 2000 09:33:16 BST." <l03102802b52ef9cb79aa@[193.78.237.170]> Your message of "Thu, 27 Apr 2000 06:42:43 BST." <l03102800b52d80db1290@[193.78.237.154]> <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <l03102802b52ef9cb79aa@[193.78.237.170]>
 <l0310280cb52f7d3968d7@[193.78.237.170]>
Message-ID: <200004281831.OAA17406@eric.cnri.reston.va.us>

> [GvR, on string.encoding ]
> >Marc-Andre took this idea a bit further, but I think it's not
> >practical given the current implementation: there are too many places
> >where the C code would have to be changed in order to propagate the
> >string encoding information,

[JvR]
> I may miss something, but the encoding attr just travels with the string
> object, no? Like I said in my reply to MAL, I think it's undesirable to do
> *anything* with the encoding attr if not in combination with a unicode
> string.

But just propagating affects every string op -- s+s, s*n, s[i], s[:],
s.strip(), s.split(), s.lower(), ...

> >and there are too many sources of strings
> >with unknown encodings to make it very useful.
> 
> That's why the default encoding must be settable as well, as Fredrik
> suggested.

I'm open for debate about this.  There's just something about a
changeable global default encoding that worries me -- like any global
property, it requires conventions and defensive programming to make
things work in larger programs.  For example, a module that deals with
Latin-1 strings can't just set the default encoding to Latin-1: it
might be imported by a program that needs it to be UTF-8.  This model
is currently used by the locale in C, where all locale properties are
global, and it doesn't work well.  For example, Python needs to go
through a lot of hoops so that Python numeric literals use "." for the
decimal indicator even if the user's locale specifies "," -- we can't
change Python to swap the meaning of "." and "," in all contexts.

So I think that a changeable default encoding is of limited value.
That's different from being able to set the *source file* encoding --
this only affects Unicode string literals.

> >Plus, it would slow down 8-bit string ops.
> 
> Not if you ignore it most of the time, and just pass it along when
> concatenating.

And slicing, and indexing, and...

> >I have a better idea: rather than carrying around 8-bit strings with
> >an encoding, use Unicode literals in your source code.
> 
> Explain that to newbies... I guess is that they will want simple 8 bit
> strings in their native encoding. Dunno.

If they are hap-py with their native 8-bit encoding, there's no need
for them to ever use Unicode objects in their program, so they should
be fine.  8-bit strings aren't ever interpreted or encoded except when
mixed with Unicode objects.

> >If the source
> >encoding is known, these will be converted using the appropriate
> >codec.
> >
> >If you object to having to write u"..." all the time, we could say
> >that "..." is a Unicode literal if it contains any characters with the
> >top bit on (of course the source file encoding would be used just like
> >for u"...").
> 
> Only if "\377" would still yield an 8-bit string, for binary goop...

Correct.

--Guido van Rossum (home page: http://www.python.org/~guido/)


From mal@lemburg.com  Fri Apr 28 19:52:04 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 28 Apr 2000 20:52:04 +0200
Subject: [I18n-sig] Re: Unicode debate
References: Your message of "Thu, 27 Apr 2000 06:42:43 BST."
 <l03102800b52d80db1290@[193.78.237.154]>
 <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]>
 <l03102802b52ef9cb79aa@[193.78.237.170]>
 <l03102809b52f28c18787@[193.78.237.170]> <l0310280bb52f7c352ba6@[193.78.237.170]>
Message-ID: <3909DDD4.D32296CE@lemburg.com>

Just van Rossum wrote:
> 
> At 2:09 PM +0200 28-04-2000, M.-A. Lemburg wrote:
> >> 1, because 2 can lead to surprises when two strings containing binary goop
> >> are added and only one was a literal in a source file with an explicit
> >> encoding.
> >
> [...]
> >I should have been more precise:
> >
> >2. provided both strings have encodings which can be converted
> >   to Unicode, coerce them to Unicode and then apply the action;
> >   otherwise proceed as in 1., i.e. the result has an undefined
> >   encoding.
> >
> >If 2. does try to convert to Unicode, conversion errors should
> >be raised (just like they are now for Unicode coercion errors).
> 
> But that doesn't solve the binary goop problem: two binary gooplets may
> have different "encodings", which happen to be valid (ie. not raise an
> exception). Conversion to unicode is no way what you want.

See the first line ;-) ... "provided both strings have encodings
which can be converted to Unicode" ... binary encodings would
not fall under these.

str('...data1...','binary') + str('...data2...','UTF-8')
would yield str('...data1......data2...','undefined')

Plus, we'd need to add a third case:

3. Of course, actions on strings of the same encoding should
   result in strings of the same encodings, e.g.
   str('...data1...','enc1') + str('...data2...','enc1')
   should yield str('...data1......data2...','enc1')

> >Some more tricky business:
> >
> >How should str('bla', 'enc1') and str('bla', 'enc2') compare ?
> >What about the hash values of the two ?
> 
> I proposed to *only* use the encoding attr when dealing with 8-bit
> string/unicode string combo's. Just ignore it completely when there's no
> unicode string in sight.

You can't ignore it completely because that would quickly
render it useless: point 3. is very important to assure that
strings with known encoding propogate their encoding as they
get processed. Otherwise you'd soon only deal with undefined
encoding strings and the whole strategy would be pointless.

Hmm, I think this road doesn't lead anywhere (but it was fun
anyway ;). As I've written a few times before: if you intend
to go Unicode, make all your strings Unicode.

Perhaps there should be an experimental command line flag which
turns "..." in source code into u"..." to be able to test this
setup ?! 

If someone is interested, I have a patch which adds
a -U flag. The Python compiler will then interpret all '...'
strings as u'...' strings. Hmm, that switch should probably
be called something like -Py3k ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From just@letterror.com  Fri Apr 28 21:04:46 2000
From: just@letterror.com (Just van Rossum)
Date: Fri, 28 Apr 2000 21:04:46 +0100
Subject: [I18n-sig] Re: Unicode debate
In-Reply-To: <3909DDD4.D32296CE@lemburg.com>
References: Your message of "Thu, 27 Apr 2000 06:42:43 BST."
 <l03102800b52d80db1290@[193.78.237.154]>
 <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]>
 <l03102802b52ef9cb79aa@[193.78.237.170]>
 <l03102809b52f28c18787@[193.78.237.170]>
 <l0310280bb52f7c352ba6@[193.78.237.170]>
Message-ID: <l03102813b52f9e1624fa@[193.78.237.170]>

At 8:52 PM +0200 28-04-2000, M.-A. Lemburg wrote:
>See the first line ;-) ... "provided both strings have encodings
>which can be converted to Unicode" ... binary encodings would
>not fall under these.

Won't a string literal in a source file with an explicit encoding get
*that* encoding, whether the string contains binary goop or not?!

Just


From mal@lemburg.com  Fri Apr 28 20:51:26 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Fri, 28 Apr 2000 21:51:26 +0200
Subject: [I18n-sig] Re: Unicode debate
References: Your message of "Thu, 27 Apr 2000 06:42:43 BST."
 <l03102800b52d80db1290@[193.78.237.154]>
 <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]>
 <l03102802b52ef9cb79aa@[193.78.237.170]>
 <l03102809b52f28c18787@[193.78.237.170]>
 <l0310280bb52f7c352ba6@[193.78.237.170]> <l03102813b52f9e1624fa@[193.78.237.170]>
Message-ID: <3909EBBE.FB64589D@lemburg.com>

Just van Rossum wrote:
> 
> At 8:52 PM +0200 28-04-2000, M.-A. Lemburg wrote:
> >See the first line ;-) ... "provided both strings have encodings
> >which can be converted to Unicode" ... binary encodings would
> >not fall under these.
> 
> Won't a string literal in a source file with an explicit encoding get
> *that* encoding, whether the string contains binary goop or not?!

Right. Binary data in such a string literal would have to
use str('...data...','binary') to get the correct encoding
attached to it.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From Moshe Zadka <mzadka@geocities.com>  Sat Apr 29 03:08:48 2000
From: Moshe Zadka <mzadka@geocities.com> (Moshe Zadka)
Date: Sat, 29 Apr 2000 05:08:48 +0300 (IDT)
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
In-Reply-To: <200004281450.KAA16493@eric.cnri.reston.va.us>
Message-ID: <Pine.GSO.4.10.10004290505200.22564-100000@sundial>

I agree with most of what you say, but...

On Fri, 28 Apr 2000, Guido van Rossum wrote:

> As I stated in another message, in Python 3000 we'll have
> to consider a more Java-esque solution: *character* strings are
> Unicode, and for bytes we have (mutable!) byte arras.

I would prefer a different distinction:

       mutable immutable

chars  string string_buffer

bytes  bytes bytes_buffer

Why not allow me the freedom to index a dictionary with goop?
(Here's a sample application: UNIX "file" command)

--
Moshe Zadka <mzadka@geocities.com>. 
http://www.oreilly.com/news/prescod_0300.html
http://www.linux.org.il -- we put the penguin in .com


From kentsin@poboxes.com  Sat Apr 29 04:07:12 2000
From: kentsin@poboxes.com (Sin Hang Kin)
Date: Sat, 29 Apr 2000 11:07:12 +0800
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]>  	            <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net>
Message-ID: <003f01bfb188$0ee7bcc0$770da8c0@bbs>

I am not quite follow on the discussion. But I am interested in Unicode-ify
python:

Python should be able to be an native language of any language. For given
all  nations a fair ground for computer programming. The recently
english-oriented python syntax should be easily ported to other languages
and python programs written in all languages can be converted to another one
automatically. i.e., a french speaking children can use french command words
to write python code, and this python code can convert to Englihs, Chinese,
...

Backward compatibility is a must. The current implementation of unicode
string might break some code. The ability to convert from/to unicode is not
enough. For example, it might for a search engine to collect many text from
different encoding, and I have seen that mixed encoding in a single text. I
did it once with in a Chinese application, I received a collective text file
which someone who collect them from mainland China with GB encoding and
locally with Big-5 encoding. The one who collect them do not read them
carefully, and he got a mighty environment (richwin) which automatically
recognize the encoding and adapt to it. So he just paste all these text
together. With such an mixed text, no conversion to/from unicode handling is
able to handle. Think if you run a mailing list, one like this, with people
quoting each other's message and write in their native encoding, you will
get a funny text collection with different encoding. This also can happen to
the digest of such an mailing list: you may try now writing in all encoding
:)

So, I perfer to have people choosing their encoding. Setting a flag inside a
program will switch the internal handling of utf-8, 8-bit code. With time
pass, we may drop that, but now, we can not abandom the 8-bit code.

Rgs,

Kent Sin


From kentsin@poboxes.com  Sat Apr 29 04:07:06 2000
From: kentsin@poboxes.com (Sin Hang Kin)
Date: Sat, 29 Apr 2000 11:07:06 +0800
Subject: [I18n-sig] Re: Unicode debate
References: Your message of "Thu, 27 Apr 2000 06:42:43 BST."            <l03102800b52d80db1290@[193.78.237.154]><l03102805b52ca7830b18@[193.78.237.154]>             <l03102800b52d80db1290@[193.78.237.154]> <l03102802b52ef9cb79aa@[193.78.237.170]>
Message-ID: <003c01bfb188$0ba08100$770da8c0@bbs>

For python source, we would enforce all to write in utf-8. Provided  that
they would freely choose their own natively encoding if they wish, but to
convert them to unicode if they publish them.

Rgs,

Kent Sin
----- Original Message -----
From: "Just van Rossum" <just@letterror.com>
To: "Guido van Rossum" <guido@python.org>; <python-dev@python.org>;
<i18n-sig@python.org>
Sent: Friday, April 28, 2000 4:33 PM
Subject: [I18n-sig] Re: Unicode debate


> At 11:01 AM -0400 27-04-2000, Guido van Rossum wrote:
> >Where does the current approach require work?
> >
> >- We need a way to indicate the encoding of Python source code.
> >(Probably a "magic comment".)
>
> How will other parts of a program know which encoding was used for
> non-unicode string literals?
>
> It seems to me that an encoding attribute for 8-bit strings solves this
> nicely. The attribute should only be set automatically if the encoding of
> the source file was specified or when the string has been encoded from a
> unicode string. The attribute should *only* be used when converting to
> unicode. (Hm, it could even be used when calling unicode() without the
> encoding argument.) It should *not* be used when comparing (or adding,
> etc.) 8-bit strings to each other, since they still may contain binary
> goop, even in a source file with a specified encoding!
>
> >- We need a way to indicate the encoding of input and output data
> >files, and we need shortcuts to set the encoding of stdin, stdout and
> >stderr (and maybe all files opened without an explicit encoding).
>
> Can you open a file *with* an explicit encoding?
>
> Just
>
>
>
> _______________________________________________
> I18n-sig mailing list
> I18n-sig@python.org
> http://www.python.org/mailman/listinfo/i18n-sig
>


From just@letterror.com  Sat Apr 29 08:03:14 2000
From: just@letterror.com (Just van Rossum)
Date: Sat, 29 Apr 2000 08:03:14 +0100
Subject: [I18n-sig] Re: Unicode debate
In-Reply-To: <3909EBBE.FB64589D@lemburg.com>
References: Your message of "Thu, 27 Apr 2000 06:42:43 BST."
 <l03102800b52d80db1290@[193.78.237.154]>
 <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]>
 <l03102802b52ef9cb79aa@[193.78.237.170]>
 <l03102809b52f28c18787@[193.78.237.170]>
 <l0310280bb52f7c352ba6@[193.78.237.170]>
 <l03102813b52f9e1624fa@[193.78.237.170]>
Message-ID: <l03102801b53037e4cba2@[193.78.237.170]>

At 9:51 PM +0200 28-04-2000, M.-A. Lemburg wrote:
>Right. Binary data in such a string literal would have to
>use str('...data...','binary') to get the correct encoding
>attached to it.

And that sucks. I stick to my point that the encoding attr should *not* be
used when dealing strictly with  bit strings. Ever. At all. Its' *only*
purpose is to aid "upcasting" to unicode. (But maybe that purpose is too
weak to warrant an entirely new attribute...)

Just


From mal@lemburg.com  Sat Apr 29 14:25:47 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Sat, 29 Apr 2000 15:25:47 +0200
Subject: [I18n-sig] Re: Unicode debate
References: Your message of "Thu, 27 Apr 2000 06:42:43 BST."
 <l03102800b52d80db1290@[193.78.237.154]>
 <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]>
 <l03102802b52ef9cb79aa@[193.78.237.170]>
 <l03102809b52f28c18787@[193.78.237.170]>
 <l0310280bb52f7c352ba6@[193.78.237.170]>
 <l03102813b52f9e1624fa@[193.78.237.170]> <l03102801b53037e4cba2@[193.78.237.170]>
Message-ID: <390AE2DB.1EB8692A@lemburg.com>

Just van Rossum wrote:
> 
> At 9:51 PM +0200 28-04-2000, M.-A. Lemburg wrote:
> >Right. Binary data in such a string literal would have to
> >use str('...data...','binary') to get the correct encoding
> >attached to it.
> 
> And that sucks.

Not sure why... after all the point of adding encoding information
to strings was to add missing information: the current usage
as binary data container would then be justified provided the
strings are marked as containing binary data.

> I stick to my point that the encoding attr should *not* be
> used when dealing strictly with  bit strings. Ever. At all. Its' *only*
> purpose is to aid "upcasting" to unicode. (But maybe that purpose is too
> weak to warrant an entirely new attribute...)

I think the little experiment with adding an encoding attribute
to strings is not going to be the right solution. People will
get all confused, the implementation won't be able make much
use of it without proper forarding of the information and that
forwarding costs performance even for those programs which do
not need this at all.

Guido's suggestion is more practical: either go all the way
(meaning to write all *text* as Unicode objects) or don't
use Unicode at all.

Note that the patch I sent to the patches list enables you
to test the "go all the way" strategy in an even more radical
way: it converts all "..." strings to u"..." when the -U
command line option is given.

I think we should use the experience gained with that patch
to make the standard Python library (and the interpreter)
Unicode capable.

Here's a list of what I've found by running some of the
regression tests:

* import string fails due to the way _idtable is constructed
* getattr() doesn't like Unicode as second argument, same for
  delattr() and hasattr()
* eval() expects a string object
* there still are some string exceptions around in the regr.
  tests which cause a failure (Unicode exceptions don't work)
* struct.pack('s') doesn't like Unicode as argument
* re doesn't work: pcre_expand() needs a string object
* regex doesn't work either because string objects are hard-coded
* mmap doesn't like Unicode: "mmap assignment must be
  single-character string"
* cPickle.loads() doesn't like Unicode as data storage
* keywords must be strings (f(1, 2, 3, **{'a':4, 'b':5}) doesn't work)
* rotor doesn't work

Some of these could be fixed by putting a str() call around
the '...' constants. Others need fixes in C code. Yet others
would be better off if they used the buffer interfaces (basically
all APIs which work on raw data like cPickle or rotor).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/


From paul@prescod.net  Sat Apr 29 15:18:05 2000
From: paul@prescod.net (Paul Prescod)
Date: Sat, 29 Apr 2000 09:18:05 -0500
Subject: [I18n-sig] Re: [Python-Dev] Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us>
 <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us>
Message-ID: <390AEF1D.253B93EF@prescod.net>

Guido van Rossum wrote:
> 
> [Paul Prescod]
> > I think that maybe an important point is getting lost here. I could be
> > wrong, but it seems that all of this emphasis on encodings is misplaced.
> 
> In practical applications that manipulate text, encodings creep up all
> the time.  

I'm not saying that encodings are unimportant. I'm saying that that they
are *different* than what Fredrik was talking about. He was talking
about a coherent logical model for characters and character strings
based on the conventions of more modern languages and systems than C and
Python.

> > How can we
> > make the transition to a "binary goops are not strings" world easiest?
> 
> I'm afraid that's a bigger issue than we can solve for Python 1.6.

I understand that we can't fix the problem now. I just think that we
shouldn't go out of our ways to make it worst.

If we make byte-array strings "magically" cast themselves into
character-strings, people will expect that behavior forever.

> > It doesn't meet the definition of string used in the Unicode spec., nor
> > in XML, nor in Java, nor at the W3C nor in most other up and coming
> > specifications.
> 
> OK, so that's a good indication of where you're coming from.  Maybe
> you should spend a little more time in the trenches and a little less
> in standards bodies.  Standards are good, but sometimes disconnected
> from reality (remember ISO networking? :-).

As far as I know, XML and Java are used a fair bit in the real
world...even somewhat in Asia. In fact, there is a book titled "XML and
Java" written by three Japanese men.

> And this is exactly why encodings will remain important: entities
> encoded in ISO-2022-JP have no compelling reason to be recoded
> permanently into ISO10646, and there are lots of forces that make it
> convenient to keep it encoded in ISO-2022-JP (like existing tools).

You cannot recode an ISO-2022-JP document into ISO10646 because 10646 is
a character *set* and not an encoding. ISO-2022-JP says how you should
represent characters in terms of bits and bytes. ISO10646 defines a
mapping from integers to characters.

They are both important, but separate. I think that this automagical
re-encoding conflates them.

-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on. 
	- http://www.cs.yale.edu/~perlis-alan/quotes.html


From Fredrik Lundh" <effbot@telia.com  Sat Apr 29 15:52:30 2000
From: Fredrik Lundh" <effbot@telia.com (Fredrik Lundh)
Date: Sat, 29 Apr 2000 16:52:30 +0200
Subject: [I18n-sig] Re: Unicode debate
References: <l03102805b52ca7830b18@[193.78.237.154]> <l03102800b52d80db1290@[193.78.237.154]> <200004271501.LAA13535@eric.cnri.reston.va.us>              <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net>
Message-ID: <006e01bfb1ea$900d0820$34aab5d4@hagrid>

Paul Prescod wrote:
> > > I think that maybe an important point is getting lost here. I =
could be
> > > wrong, but it seems that all of this emphasis on encodings is =
misplaced.
> >=20
> > In practical applications that manipulate text, encodings creep up =
all
> > the time. =20
>=20
> I'm not saying that encodings are unimportant. I'm saying that that =
they
> are *different* than what Fredrik was talking about. He was talking
> about a coherent logical model for characters and character strings
> based on the conventions of more modern languages and systems than
> C and Python.

note that the existing Python language reference describes this
model very clearly:

    [Sequences] represent finite ordered sets indexed
    by natural numbers.

    The built-in function len() returns the number of
    items of a sequence.

    When the length of a sequence is n, the index set
    contains the numbers 0, 1, ..., n-1.

    Item i of sequence a is selected by a[i].

    An object of an immutable sequence type cannot
    change once it is created.

    The items of a string are characters.

    There is no separate character type; a character is
    represented by a string of one item.

    Characters represent (at least) 8-bit bytes.

    The built-in functions chr() and ord() convert between
    characters and nonnegative integers representing the
    byte values.

    Bytes with the values 0-127 usually represent the corre-
    sponding ASCII values, but the interpretation of values is
    up to the program.

    The string data type is also used to represent arrays
    of bytes, e.g., to hold data read from a file.=20

as I've pointed out before, I want this to apply to all kinds of
strings in 1.6.  imo, the cleanest way to do this is to change
the last three sentences to:

    The built-in functions chr() and ord() convert between
    characters and nonnegative integers representing the
    character codes.

    Character codes usually represent the corresponding
    unicode characters.

    The 8-bit string data type is also used to represent arrays
    of bytes, e.g., to hold data read from a file.

the encodings debate has nothing to do with this model.

...

more later.  gotta run.

</F>


From just@letterror.com  Sat Apr 29 18:40:22 2000
From: just@letterror.com (Just van Rossum)
Date: Sat, 29 Apr 2000 18:40:22 +0100
Subject: [I18n-sig] Re: Unicode debate
In-Reply-To: <390AE2DB.1EB8692A@lemburg.com>
References: Your message of "Thu, 27 Apr 2000 06:42:43 BST."
 <l03102800b52d80db1290@[193.78.237.154]>
 <l03102805b52ca7830b18@[193.78.237.154]>
 <l03102800b52d80db1290@[193.78.237.154]>
 <l03102802b52ef9cb79aa@[193.78.237.170]>
 <l03102809b52f28c18787@[193.78.237.170]>
 <l0310280bb52f7c352ba6@[193.78.237.170]>
 <l03102813b52f9e1624fa@[193.78.237.170]>
 <l03102801b53037e4cba2@[193.78.237.170]>
Message-ID: <l03102805b530c7658799@[193.78.237.127]>

At 3:25 PM +0200 29-04-2000, M.-A. Lemburg wrote:
>Just van Rossum wrote:
>>
>> At 9:51 PM +0200 28-04-2000, M.-A. Lemburg wrote:
>> >Right. Binary data in such a string literal would have to
>> >use str('...data...','binary') to get the correct encoding
>> >attached to it.
>>
>> And that sucks.
>
>Not sure why... after all the point of adding encoding information
>to strings was to add missing information: the current usage
>as binary data container would then be justified provided the
>strings are marked as containing binary data.

For one, it's just too much hassle to write str('...data...','binary')...

All my proposal was, was a very lightweight way to ensure correct
translation to unicode when needed. What you seem to suggest, is that the
encoding attribute could be used to make 8-bit strings almost as powerful
as unicode strings, by converting to unicode whenever there's an action
that involves two 8-bit strings with different encodings. While I'm sure
that would have it's uses, I think that's too ambitious, and seems to get
too much in the way of 8-bit strings doubling as byte arrays. As I've
admitted before, what I had in mind for the encoding attribute is probably
to weak a use to warrant the effort, and there are indeed too many things
that can still go wrong. So for now I'll let it go... (But it was fun
indeed ;-)

(Oh, and I still stand by my and Fredrik's point that utf-8 is a poor
default choice when coercing 8-bit strings to unicode, for the sole reason
a utf-8 string is a byte array, and not a character string.)

Just


From tree@basistech.com  Sun Apr 30 06:29:08 2000
From: tree@basistech.com (Tom Emerson)
Date: Sun, 30 Apr 2000 01:29:08 -0400 (EDT)
Subject: [I18n-sig] codec questions
Message-ID: <14603.50340.167067.470930@cymru.basistech.com>

I'm using 1.6a2 and the following doesn't run. I must be doing
something brain-dead here (I'm jet lagged right now):

--
import codecs;

foo = codecs.open('Sc-orig.utf', 'rb', 'utf-8')

line = foo.readline()
while (line != ""):
    print line
    line = foo.readline()
foo.close()
--

When I attempt to run that, in the directory containing 'Sc-orig.utf',
I get:

(0) tree% python process.py
Traceback (most recent call last):
  File "process.py", line 5, in ?
    line = foo.readline()
  File "/opt/tree/lib/python1.6/codecs.py", line 318, in readline
    return self.reader.readline(size)
NameError: self

Any ideas? I'm trying to grok the architecture so I can add
transcoding support for TIS-620 (Thai, an 8-bit encoding which should
work fine with the mapping codecs), GB2312 (multibyte, simplified
Chinese), and Big-5 (multibyte, traditional Chinese). But I can't even
get the simplest code to work, so I need someone to hit me with a
stick.

Also, are transcoding tables loaded as needed? Or all at once? What
are the plans for managing transcoding tables?

Thanks.

        -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"


From tree@basistech.com  Sun Apr 30 21:20:18 2000
From: tree@basistech.com (Tom Emerson)
Date: Sun, 30 Apr 2000 16:20:18 -0400 (EDT)
Subject: [I18n-sig] codec questions
In-Reply-To: <14603.50340.167067.470930@cymru.basistech.com>
References: <14603.50340.167067.470930@cymru.basistech.com>
Message-ID: <14604.38274.378633.509796@cymru.basistech.com>

Tom Emerson writes:
 > Any ideas? I'm trying to grok the architecture so I can add
 > transcoding support for TIS-620 (Thai, an 8-bit encoding which should
[snip]

TIS-620 is mostly the same as CP874, so for my purposes this is
done. Never mind. 8-) I looked at the encodings directory after I send my mail.

Of course I still cannot get codecs.open() to work, so it is a small victory.

      -tree

-- 
Tom Emerson                                          Basis Technology Corp.
Language Hacker                                    http://www.basistech.com
  "Beware the lollipop of mediocrity: lick it once and you suck forever"