From mal@lemburg.com Mon May 1 11:05:57 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 01 May 2000 12:05:57 +0200 Subject: [I18n-sig] codec questions References: <14603.50340.167067.470930@cymru.basistech.com> Message-ID: <390D5705.9CA3DEFB@lemburg.com> This is a multi-part message in MIME format. --------------CDA78F736CA89810F3040FD5 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Tom Emerson wrote: > > I'm using 1.6a2 and the following doesn't run. I must be doing > something brain-dead here (I'm jet lagged right now): > > -- > import codecs; > > foo = codecs.open('Sc-orig.utf', 'rb', 'utf-8') > > line = foo.readline() > while (line != ""): > print line > line = foo.readline() > foo.close() > -- > > When I attempt to run that, in the directory containing 'Sc-orig.utf', > I get: > > (0) tree% python process.py > Traceback (most recent call last): > File "process.py", line 5, in ? > line = foo.readline() > File "/opt/tree/lib/python1.6/codecs.py", line 318, in readline > return self.reader.readline(size) > NameError: self You've hit a bug... the self argument was missing from the readline() methods. I've appended a patch and will also send it to the patches list. > Any ideas? I'm trying to grok the architecture so I can add > transcoding support for TIS-620 (Thai, an 8-bit encoding which should > work fine with the mapping codecs), GB2312 (multibyte, simplified > Chinese), and Big-5 (multibyte, traditional Chinese). But I can't even > get the simplest code to work, so I need someone to hit me with a > stick. > > Also, are transcoding tables loaded as needed? Or all at once? What > are the plans for managing transcoding tables? Depends on how you implement them. The codecs included in the standard Python distribution are loaded on demand together with their translation tables. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ --------------CDA78F736CA89810F3040FD5 Content-Type: text/plain; charset=us-ascii; name="Unicode-Implementation-2000-05-01.patch" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename="Unicode-Implementation-2000-05-01.patch" diff -u -rP -x *.o -x *.pyc -x Makefile -x *~ -x *.so -x add2lib -x pgen -x buildno -x config.* -x libpython* -x python -x Setup -x Setup.local -x Setup.thread -x hassignal -x Makefile.pre -x configure -x *.bak -x *.s -x DEADJOE -x *.rej -x *.orig -x Demo -x CVS -x Doc -x *.orig -x .#* -x distutils -x PC -x PCbuild -x *.c -x *.h -x *.in -x output CVS-Python/Lib/codecs.py Python+Unicode/Lib/codecs.py --- CVS-Python/Lib/codecs.py Thu Apr 13 18:10:57 2000 +++ Python+Unicode/Lib/codecs.py Mon May 1 11:54:03 2000 @@ -324,11 +324,11 @@ return self.reader.read(size) - def readline(size=None): + def readline(self, size=None): return self.reader.readline(size) - def readlines(sizehint=None): + def readlines(self, sizehint=None): return self.reader.readlines(sizehint) Only in CVS-Python/Lib/test: test_winsound.py --------------CDA78F736CA89810F3040FD5-- From guido@python.org Mon May 1 16:10:21 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 01 May 2000 11:10:21 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Sat, 29 Apr 2000 11:07:12 +0800." 
<003f01bfb188$0ee7bcc0$770da8c0@bbs> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <003f01bfb188$0ee7bcc0$770da8c0@bbs> Message-ID: <200005011510.LAA20223@eric.cnri.reston.va.us> Sin Hang Kim writes: > I am not quite following the discussion. But I am interested in Unicode-ify > python: Thanks for participation -- we need more input from people who will actually use the non-Western Unicode planes! > Python should be able to be a native language of any language. For given > all nations a fair ground for computer programming. The recently > english-oriented python syntax should be easily ported to other languages > and python programs written in all languages can be converted to another one > automatically. i.e., french-speaking children can use french command words > to write python code, and this python code can convert to English, Chinese, > ... I think that's a Python 3000 issue... This would currently be very difficult to add to the implementation. Plus, I worry that it will prevent free exchange of code across (language) borders: if you write your program in "Chinese Python", most people in most other countries won't be able to use it. (This has been tried long ago for French Pascal, and it wasn't a big success; my guess is for this reason.) > Backward compatibility is a must. The current implementation of unicode > string might break some code. The ability to convert from/to unicode is not > enough. For example, it might for a search engine to collect many text from > different encoding, and I have seen that mixed encoding in a single text. I > did it once in a Chinese application, I received a collective text file > which someone who collect them from mainland China with GB encoding and > locally with Big-5 encoding. The one who collect them do not read them > carefully, and he got a mighty environment (richwin) which automatically > recognize the encoding and adapt to it. So he just paste all these text > together. With such a mixed text, no conversion to/from unicode handling is > able to handle. Think if you run a mailing list, one like this, with people > quoting each other's message and write in their native encoding, you will > get a funny text collection with different encoding. This also can happen to > the digest of such a mailing list: you may try now writing in all encoding > :) Of course, Unicode could also *help* -- all messages could be translated from their original encoding to Unicode, and the digest could be sent out in UTF-8. > So, I prefer to have people choosing their encoding. Setting a flag inside a > program will switch the internal handling of utf-8, 8-bit code. With time > pass, we may drop that, but now, we can not abandon the 8-bit code. Absolutely. The problem you sketch (one file with multiple encodings) can be handled by a Python program that takes control of the encodings: for example, the program could read the file a line at a time (or whatever unit is appropriate) and translate each line according to the most appropriate encoding (as determined by context). --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Mon May 1 18:49:47 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 01 May 2000 13:49:47 -0400 Subject: [I18n-sig] Re: Unicode debate In-Reply-To: Your message of "Sat, 29 Apr 2000 15:25:47 +0200." <390AE2DB.1EB8692A@lemburg.com> References: Your message of "Thu, 27 Apr 2000 06:42:43 BST."
<390AE2DB.1EB8692A@lemburg.com> Message-ID: <200005011749.NAA20802@eric.cnri.reston.va.us> > Here's a list of what I've found by running some of the > regression tests: > > * import string fails due to the way _idtable is constructed Hm, I don't see this -- string.py imports just fine. There's no _idtable in my copy of string.py?!?! > * getattr() doesn't like Unicode as second argument, same for > delattr() and hasattr() > * eval() expects a string object These should all be fixed. > * there still are some string exceptions around in the regr. > tests which cause a failure (Unicode exceptions don't work) Interesting. One more reason to drop string exceptions sometime in the future. > * struct.pack('s') doesn't like Unicode as argument Fix it. > * re doesn't work: pcre_expand() needs a string object Fix it, but with low priority (the expectation is that sre will replace pcre in 1.6a3). > * regex doesn't work either because string objects are hard-coded Don't fix (regex is obsolete, only kept around because it used to be very common). > * mmap doesn't like Unicode: "mmap assignment must be > single-character string" Yes, this has 8-bit string written all over it. It really should be using the buffer API rather than requiring strings! > * cPickle.loads() doesn't like Unicode as data storage Hm, hard to fix. Again, it really should use the buffer API, but it doesn't. > * keywords must be strings (f(1, 2, 3, **{'a':4, 'b':5}) doesn't work) How hard would this be to fix? > * rotor doesn't work Not very important. > Some of these could be fixed by putting a str() call around > the '...' constants. Others need fixes in C code. Yet others > would be better off if they used the buffer interfaces (basically > all APIs which work on raw data like cPickle or rotor). What I said. :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Mon May 1 19:02:32 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 01 May 2000 14:02:32 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Sat, 29 Apr 2000 09:18:05 CDT." <390AEF1D.253B93EF@prescod.net> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> Message-ID: <200005011802.OAA21612@eric.cnri.reston.va.us> [Guido] > > And this is exactly why encodings will remain important: entities > > encoded in ISO-2022-JP have no compelling reason to be recoded > > permanently into ISO10646, and there are lots of forces that make it > > convenient to keep it encoded in ISO-2022-JP (like existing tools). [Paul] > You cannot recode an ISO-2022-JP document into ISO10646 because 10646 is > a character *set* and not an encoding. ISO-2022-JP says how you should > represent characters in terms of bits and bytes. ISO10646 defines a > mapping from integers to characters. OK. I really meant recoding in UTF-8 -- I maintain that there are lots of forces that prevent recoding most ISO-2022-JP documents in UTF-8. > They are both important, but separate. I think that this automagical > re-encoding conflates them. Who is proposing any automagical re-encoding? Are you sure you understand what we are arguing about? *I* am not even sure what we are arguing about. I am simply saying that 8-bit strings (literals or otherwise) in Python have always been able to contain encoded strings. Earlier, you quoted some reference documentation that defines 8-bit strings as containing characters. 
That's taken out of context -- this was written in a time when there was (for most people anyway) no difference between characters and bytes, and I really meant bytes. There's plenty of use of 8-bit Python strings for non-character uses so your "proof" that 8-bit strings should contain "characters" according to your definition is invalid. --Guido van Rossum (home page: http://www.python.org/~guido/) From tree@basistech.com Mon May 1 19:05:33 2000 From: tree@basistech.com (Tom Emerson) Date: Mon, 1 May 2000 14:05:33 -0400 (EDT) Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <200005011802.OAA21612@eric.cnri.reston.va.us> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> Message-ID: <14605.51053.369016.283239@cymru.basistech.com> Guido van Rossum writes: > OK. I really meant recoding in UTF-8 -- I maintain that there are > lots of forces that prevent recoding most ISO-2022-JP documents in > UTF-8. Such as? -- Tom Emerson Basis Technology Corp. Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From guido@python.org Mon May 1 19:14:48 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 01 May 2000 14:14:48 -0400 Subject: [I18n-sig] Re: Unicode debate In-Reply-To: Your message of "Sat, 29 Apr 2000 16:52:30 +0200." <006e01bfb1ea$900d0820$34aab5d4@hagrid> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <006e01bfb1ea$900d0820$34aab5d4@hagrid> Message-ID: <200005011814.OAA21641@eric.cnri.reston.va.us> Fredrik Lundh wrote: > note that the existing Python language reference describes this > model very clearly: > > [Sequences] represent finite ordered sets indexed > by natural numbers. > > The built-in function len() returns the number of > items of a sequence. > > When the length of a sequence is n, the index set > contains the numbers 0, 1, ..., n-1. > > Item i of sequence a is selected by a[i]. > > An object of an immutable sequence type cannot > change once it is created. > > The items of a string are characters. > > There is no separate character type; a character is > represented by a string of one item. > > Characters represent (at least) 8-bit bytes. > > The built-in functions chr() and ord() convert between > characters and nonnegative integers representing the > byte values. > > Bytes with the values 0-127 usually represent the corre- > sponding ASCII values, but the interpretation of values is > up to the program. > > The string data type is also used to represent arrays > of bytes, e.g., to hold data read from a file. > > as I've pointed out before, I want this to apply to all kinds of > strings in 1.6. imo, the cleanest way to do this is to change > the last three sentences to: > > The built-in functions chr() and ord() convert between > characters and nonnegative integers representing the > character codes. > > Character codes usually represent the corresponding > unicode characters. > > The 8-bit string data type is also used to represent arrays > of bytes, e.g., to hold data read from a file. Again, you're being terse. I'm not sure what you want to do here. Do you want chr() to return a Unicode string for argument values >= 256? 
(Note that ord(u"\xffff") already returns 65535; I just notice that ord(u"\777") returns 255 instead of 511, I consider this a bug.) You have to understand that the reference documentation is sloppy with the word "character" -- when I wrote that text, "character" and "byte" were synonyms in my mind. > the encodings debate has nothing to do with this model. If this has nothing to do with the encodings debate, why is it in the same thread? Please elaborate. (But please finish the next sre snapshot first! :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From Fredrik Lundh" <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <006e01bfb1ea$900d0820$34aab5d4@hagrid> <200005011814.OAA21641@eric.cnri.reston.va.us> Message-ID: <007f01bfb39b$eb924e00$34aab5d4@hagrid> Guido van Rossum wrote: > You have to understand that the reference documentation is sloppy with > the word "character" -- when I wrote that text, "character" and "byte" > were synonyms in my mind. I see -- and trust me, the big "aha!" comes when you stop viewing them as the same thing ;-) > If this has nothing to do with the encodings debate, why is it in the > same thread? because it's still part of the "unicode debate", perhaps? > Please elaborate. (But please finish the next sre snapshot first! :-) later this week, in other words. From Fredrik Lundh" <200004271501.LAA13535@eric.cnri.reston.va.us><3908F566.8E5747C@prescod.net><200004281450.KAA16493@eric.cnri.reston.va.us><390AEF1D.253B93EF@prescod.net><200005011802.OAA21612@eric.cnri.reston.va.us> <14605.51053.369016.283239@cymru.basistech.com> Message-ID: <009f01bfb39c$a603cc00$34aab5d4@hagrid> Tom Emerson wrote: > Guido van Rossum writes: > > OK. I really meant recoding in UTF-8 -- I maintain that there are > > lots of forces that prevent recoding most ISO-2022-JP documents in > > UTF-8. > > Such as? ISO-2022-JP includes language/locale information, UTF-8 doesn't. if you just recode the character codes, you'll lose important information. From tree@basistech.com Mon May 1 19:42:40 2000 From: tree@basistech.com (Tom Emerson) Date: Mon, 1 May 2000 14:42:40 -0400 (EDT) Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <009f01bfb39c$a603cc00$34aab5d4@hagrid> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <14605.51053.369016.283239@cymru.basistech.com> <009f01bfb39c$a603cc00$34aab5d4@hagrid> Message-ID: <14605.53280.55595.335112@cymru.basistech.com> Fredrik Lundh writes: > ISO-2022-JP includes language/locale information, UTF-8 doesn't. if > you just recode the character codes, you'll lose important information. So encode them using the Plane 14 language tags. I won't start with whether language/locale should be encoded in a character encoding... 8-) -tree -- Tom Emerson Basis Technology Corp. Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From guido@python.org Mon May 1 19:52:04 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 01 May 2000 14:52:04 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Mon, 01 May 2000 14:05:33 EDT."
<14605.51053.369016.283239@cymru.basistech.com> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <14605.51053.369016.283239@cymru.basistech.com> Message-ID: <200005011852.OAA21973@eric.cnri.reston.va.us> > Guido van Rossum writes: > > OK. I really meant recoding in UTF-8 -- I maintain that there are > > lots of forces that prevent recoding most ISO-2022-JP documents in > > UTF-8. [Tom Emerson] > Such as? The standard forces that work against all change -- existing tools, user habits, compatibility, etc. --Guido van Rossum (home page: http://www.python.org/~guido/) From tree@basistech.com Mon May 1 19:46:04 2000 From: tree@basistech.com (Tom Emerson) Date: Mon, 1 May 2000 14:46:04 -0400 (EDT) Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <200005011852.OAA21973@eric.cnri.reston.va.us> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <14605.51053.369016.283239@cymru.basistech.com> <200005011852.OAA21973@eric.cnri.reston.va.us> Message-ID: <14605.53484.225980.235301@cymru.basistech.com> Guido van Rossum writes: > The standard forces that work against all change -- existing tools, > user habits, compatibility, etc. Ah... I misread your original statement, which I took to be a technical reason why one couldn't convert ISO-2022-JP to UTF-8. Of course one cannot expect everyone to switch en masse to a new encoding, pulling their existing documents with them. I'm in full agreement there. -tree -- Tom Emerson Basis Technology Corp. Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From tree@basistech.com Mon May 1 19:49:36 2000 From: tree@basistech.com (Tom Emerson) Date: Mon, 1 May 2000 14:49:36 -0400 (EDT) Subject: [I18n-sig] ANLP/NAACL '2000 conference in Seattle Message-ID: <14605.53696.677811.324217@cymru.basistech.com> [Apologies for the spam] If anyone from these lists is attending the ANLP/NAACL joint conference in Seattle this week and would like to get together for a drink or dinner, drop me a line. Face time is always a good thing. -tree -- Tom Emerson Basis Technology Corp. Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From just@letterror.com Mon May 1 21:20:17 2000 From: just@letterror.com (Just van Rossum) Date: Mon, 1 May 2000 21:20:17 +0100 Subject: [I18n-sig] Re: Unicode debate In-Reply-To: <200005011749.NAA20802@eric.cnri.reston.va.us> References: Your message of "Sat, 29 Apr 2000 15:25:47 +0200." <390AE2DB.1EB8692A@lemburg.com> Your message of "Thu, 27 Apr 2000 06:42:43 BST." <390AE2DB.1EB8692A@lemburg.com> Message-ID: MAL & GvR wrote: >> * cPickle.loads() doesn't like Unicode as data storage > >Hm, hard to fix. Again, it really should use the buffer API, but it doesn't. Why should it be fixed? Unicode as data storage??? The least we can do about the character string vs. data buffer discrepancy is discourage the use of Unicode strings as data storage, no? 
Just From Fredrik Lundh" <200004271501.LAA13535@eric.cnri.reston.va.us><3908F566.8E5747C@prescod.net><200004281450.KAA16493@eric.cnri.reston.va.us><390AEF1D.253B93EF@prescod.net><200005011802.OAA21612@eric.cnri.reston.va.us><14605.51053.369016.283239@cymru.basistech.com><009f01bfb39c$a603cc00$34aab5d4@hagrid> <14605.53280.55595.335112@cymru.basistech.com> Message-ID: <010701bfb3a6$e2f037c0$34aab5d4@hagrid> Tom Emerson wrote: > Fredrik Lundh writes: > > ISO-2022-JP includes language/locale information, UTF-8 doesn't. if > > you just recode the character codes, you'll lose important information. > > So encode them using the Plane 14 language tags. 31-bit unicode characters are not supported in 1.6. maybe in 1.7. From mal@lemburg.com Mon May 1 20:48:51 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 01 May 2000 21:48:51 +0200 Subject: [I18n-sig] Re: Unicode debate References: Your message of "Thu, 27 Apr 2000 06:42:43 BST." <390AE2DB.1EB8692A@lemburg.com> <200005011749.NAA20802@eric.cnri.reston.va.us> Message-ID: <390DDFA3.F68C3A73@lemburg.com> Guido van Rossum wrote: > > > Here's a list of what I've found by running some of the > > regression tests: > > > > * import string fails due to the way _idtable is constructed > > Hm, I don't see this -- string.py imports just fine. There's no > _idtable in my copy of string.py?!?! Ehm, I meant _idmap... I would guess that the reason your string.py imports fine is that the import still uses a cached PYC file for the import (this is why I updated the -U patch to modify the magic number for imports when the flag is set -- it ensures that when running in -U mode, only PYC files also having been compiled with -U are used and that when running without -U no such files are accepted; makes testing a little easier since it doesn't interfere with existing implementations). > > * getattr() doesn't like Unicode as second argument, same for > > delattr() and hasattr() > > * eval() expects a string object > > These should all be fixed. > > > * there still are some string exceptions around in the regr. > > tests which cause a failure (Unicode exceptions don't work) > > Interesting. One more reason to drop string exceptions sometime in > the future. > > > * struct.pack('s') doesn't like Unicode as argument > > Fix it. > > > * re doesn't work: pcre_expand() needs a string object > > Fix it, but with low priority (the expectation is that sre will replace > pcre in 1.6a3). Ok. > > * regex doesn't work either because string objects are hard-coded > > Don't fix (regex is obsolete, only kept around because it used to be > very common). > > > * mmap doesn't like Unicode: "mmap assignment must be > > single-character string" > > Yes, this has 8-bit string written all over it. It really should be > using the buffer API rather than requiring strings! > > > * cPickle.loads() doesn't like Unicode as data storage > > Hm, hard to fix. Again, it really should use the buffer API, but it doesn't. Note that this "bug" only occurs when using strings as data storage... the test code should really be using a buffer object for this (or some other sort of binary data container). > > * keywords must be strings (f(1, 2, 3, **{'a':4, 'b':5}) doesn't work) > > How hard would this be to fix? Not sure... the keyword code is spread across many files. > > * rotor doesn't work > > Not very important. > > > Some of these could be fixed by putting a str() call around > > the '...' constants. Others need fixes in C code.
Yet others > > would be better off if they used the buffer interfaces (basically > > all APIs which work on raw data like cPickle or rotor). > > What I said. :-) Should we go ahead with this for the 1.6 series or wait until 1.7 ? -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Mon May 1 21:03:21 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 01 May 2000 22:03:21 +0200 Subject: [I18n-sig] Re: Unicode debate References: Your message of "Sat, 29 Apr 2000 15:25:47 +0200." <390AE2DB.1EB8692A@lemburg.com> Your message of "Thu, 27 Apr 2000 06:42:43 BST." <390AE2DB.1EB8692A@lemburg.com> Message-ID: <390DE309.93F04F9A@lemburg.com> Just van Rossum wrote: > > MAL & GvR wrote: > >> * cPickle.loads() doesn't like Unicode as data storage > > > >Hm, hard to fix. Again, it really should use the buffer API, but it doesn't. > > Why should it be fixed? Unicode as data storage??? No. The tests I ran were using the experimental -U command line option patch which was just added to CVS. It is useful for finding all these small places where strings are hard-coded into the standard lib. Of course, you're right about warning to put binary data into Unicode strings -- we shouldn't get into the same mess twice ;-) > The least we can do > about the character string vs. data buffer discrepancy is discourage the > use of Unicode strings as data storage, no? Rather than fixing the implementation we should fix usage here: binary data should go into buffer objects, not strings, and cPickle ought to use the buffer interface then. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From guido@python.org Mon May 1 21:13:24 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 01 May 2000 16:13:24 -0400 Subject: [I18n-sig] Re: Unicode debate In-Reply-To: Your message of "Mon, 01 May 2000 21:20:17 BST." References: Your message of "Sat, 29 Apr 2000 15:25:47 +0200." <390AE2DB.1EB8692A@lemburg.com> Your message of "Thu, 27 Apr 2000 06:42:43 BST." <390AE2DB.1EB8692A@lemburg.com> Message-ID: <200005012013.QAA22147@eric.cnri.reston.va.us> > MAL & GvR wrote: > >> * cPickle.loads() doesn't like Unicode as data storage > > > >Hm, hard to fix. Again, it really should use the buffer API, but it doesn't. > > Why should it be fixed? Unicode as data storage??? The least we can do > about the character string vs. data buffer discrepancy is discourage the > use of Unicode strings as data storage, no? Good point. I was getting carried away by the idea that the -U option implements (all strings are Unicode). This is what JPython does, and there strings *are* being used as data storage -- at a 100% overhead cost. We shouldn't copy this mistake though, and there are limits to how far we can take -U. Perhaps there should be an explicit prefix to force 8-bit strings? I think that a notation for 8-bit data is still useful, and string literals with octal escapes are the most compact form I know!
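In 1.6 terms the distinction looks roughly like this (an illustrative sketch only -- the PNG signature is just an arbitrary example of non-character data, not part of any proposal here):

    data = "\211PNG\r\n\032\n"   # eight raw bytes, spelled with octal escapes
    text = u"caf\351"            # four characters; \351 is a character code here

    assert len(data) == 8        # counts bytes
    assert len(text) == 4        # counts characters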
--Guido van Rossum (home page: http://www.python.org/~guido/) From paul@prescod.net Mon May 1 21:38:29 2000 From: paul@prescod.net (Paul Prescod) Date: Mon, 01 May 2000 15:38:29 -0500 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> Message-ID: <390DEB45.D8D12337@prescod.net> Uche asked for a summary so I cc:ed the xml-sig. Guido van Rossum wrote: > > ... > > OK. I really meant recoding in UTF-8 -- I maintain that there are > lots of forces that prevent recoding most ISO-2022-JP documents in > UTF-8. Absolutely agree. > Are you sure you understand what we are arguing about? Here's what I thought we were arguing about: If you put a bunch of "funny characters" into a Python string literal, and then compare that string literal against a Unicode object, should those funny characters be treated as logical units of text (characters) or as bytes? And if bytes, should some transformation be automatically performed to have those bytes be reinterpreted as characters according to some particular encoding scheme (probably UTF-8). I claim that we should *as far as possible* treat strings as character lists and not add any new functionality that depends on them being byte list. Ideally, we could add a byte array type and start deprecating the use of strings in that manner. Yes, it will take a long time to fix this bug but that's what happens when good software lives a long time and the world changes around it. > Earlier, you quoted some reference documentation that defines 8-bit > strings as containing characters. That's taken out of context -- this > was written in a time when there was (for most people anyway) no > difference between characters and bytes, and I really meant bytes. Actually, I think that that was Fredrik. Anyhow, you wrote the documentation that way because it was the most intuitive way of thinking about strings. It remains the most intuitive way. I think that that was the point Fredrik was trying to make. We can't make "byte-list" strings go away soon but we can start moving people towards the "character-list" model. In concrete terms I would suggest that old fashioned lists be automatically coerced to Unicode by interpreting each byte as a Unicode character. Trying to go the other way could cause the moral equivalent of an OverflowError but that's not a problem. >>> a=1000000000000000000000000000000000000L >>> int(a) Traceback (innermost last): File "", line 1, in ? OverflowError: long int too long to convert And just as with ints and longs, we would expect to eventually unify strings and unicode strings (but not byte arrays). -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself It's difficult to extract sense from strings, but they're the only communication coin we can count on. - http://www.cs.yale.edu/~perlis-alan/quotes.html From guido@python.org Mon May 1 21:44:10 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 01 May 2000 16:44:10 -0400 Subject: [I18n-sig] Re: Unicode debate In-Reply-To: Your message of "Mon, 01 May 2000 21:48:51 +0200." <390DDFA3.F68C3A73@lemburg.com> References: Your message of "Thu, 27 Apr 2000 06:42:43 BST." 
<390AE2DB.1EB8692A@lemburg.com> <200005011749.NAA20802@eric.cnri.reston.va.us> <390DDFA3.F68C3A73@lemburg.com> Message-ID: <200005012044.QAA23041@eric.cnri.reston.va.us> > > > * import string fails due to the way _idtable is constructed > > > > Hm, I don't see this -- string.py imports just fine. There's no > > _idtable in my copy of string.py?!?! > > Ehm, I meant _idmap... I would guess that the reason > your string.py imports fine is that the import still uses > a cached PYC file for the import (this is why I updated the > -U patch to modify the magic number for imports when the > flag is set -- it ensures that when running in -U mode, > only PYC files also having been compiled with -U are > used and that when running without -U no such files > are accepted; makes testing a little easier since it doesn't > interfere with existing implementations). Oops, you're right. > > > * cPickle.loads() doesn't like Unicode as data storage > > > > Hm, hard to fix. Again, it really should use the buffer API, but it doesn't. > > Note that this "bug" only occurrs when using strings as > data storage... the test code should really be using > a buffer object for this (or some other sort of binary > data container). Agreed. (See my response to Just.) > > > * keywords must be strings (f(1, 2, 3, **{'a':4, 'b':5}) doesn't work) > > > > How hard would this be to fix? > > Not sure... the keyword code is spread across many files. I think it deserves to be fixed, but there's no great big hurry. > > > Some of these could be fixed by putting a str() call around > > > the '...' constants. Others need fixes in C code. Yet others > > > would be better off if they used the buffer interfaces (basically > > > all APIs which work on raw data like cPickle or rotor). > > > > What I said. :-) > > Should we go ahead with this for the 1.6 series or wait until 1.7 ? Since the -U flag is in, I'd go ahead. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Mon May 1 22:32:38 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 01 May 2000 17:32:38 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Mon, 01 May 2000 15:38:29 CDT." <390DEB45.D8D12337@prescod.net> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> Message-ID: <200005012132.RAA23319@eric.cnri.reston.va.us> > > Are you sure you understand what we are arguing about? > > Here's what I thought we were arguing about: > > If you put a bunch of "funny characters" into a Python string literal, > and then compare that string literal against a Unicode object, should > those funny characters be treated as logical units of text (characters) > or as bytes? And if bytes, should some transformation be automatically > performed to have those bytes be reinterpreted as characters according > to some particular encoding scheme (probably UTF-8). > > I claim that we should *as far as possible* treat strings as character > lists and not add any new functionality that depends on them being byte > list. Ideally, we could add a byte array type and start deprecating the > use of strings in that manner. Yes, it will take a long time to fix this > bug but that's what happens when good software lives a long time and the > world changes around it. 
> > > Earlier, you quoted some reference documentation that defines 8-bit > > strings as containing characters. That's taken out of context -- this > > was written in a time when there was (for most people anyway) no > > difference between characters and bytes, and I really meant bytes. > > Actually, I think that that was Fredrik. Yes, I came across the post again later. Sorry. > Anyhow, you wrote the documentation that way because it was the most > intuitive way of thinking about strings. It remains the most intuitive > way. I think that that was the point Fredrik was trying to make. I just wish he made the point more eloquently. The eff-bot seems to be in a crunchy mood lately... > We can't make "byte-list" strings go away soon but we can start moving > people towards the "character-list" model. In concrete terms I would > suggest that old fashioned lists be automatically coerced to Unicode by > interpreting each byte as a Unicode character. Trying to go the other > way could cause the moral equivalent of an OverflowError but that's not > a problem. > > >>> a=1000000000000000000000000000000000000L > >>> int(a) > Traceback (innermost last): > File "", line 1, in ? > OverflowError: long int too long to convert > > And just as with ints and longs, we would expect to eventually unify > strings and unicode strings (but not byte arrays). OK, you've made your claim -- like Fredrik, you want to interpret 8-bit strings as Latin-1 when converting (not just comparing!) them to Unicode. I don't think I've heard a good *argument* for this rule though. "A character is a character is a character" sounds like an axiom to me -- something you can't prove or disprove rationally. I have a bunch of good reasons (I think) for liking UTF-8: it allows you to convert between Unicode and 8-bit strings without losses, Tcl uses it (so displaying Unicode in Tkinter *just* *works*...), it is not Western-language-centric. Another reason: while you may claim that your (and /F's, and Just's) preferred solution doesn't enter into the encodings issue, I claim it does: Latin-1 is just as much an encoding as any other one. I claim that as long as we're using an encoding we might as well use the most accepted 8-bit encoding of Unicode as the default encoding. I also think that the issue is blown out of proportions: this ONLY happens when you use Unicode objects, and it ONLY matters when some other part of the program uses 8-bit string objects containing non-ASCII characters. Given the long tradition of using different encodings in 8-bit strings, at that point it is anybody's guess what encoding is used, and UTF-8 is a better guess than Latin-1. --Guido van Rossum (home page: http://www.python.org/~guido/) From Fredrik Lundh" <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> Message-ID: <017d01bfb3bc$c3734c00$34aab5d4@hagrid> Guido van Rossum wrote: > I just wish he made the point more eloquently. The eff-bot seems to > be in a crunchy mood lately... I've posted a few thousand messages on this topic, most of which seem to have been ignored. if you'd read all my messages, and seen all the replies, you'd be cranky too... > I don't think I've heard a good *argument* for this rule though. 
"A > character is a character is a character" sounds like an axiom to me -- > something you can't prove or disprove rationally. maybe, but it's a darn good axiom, and it's used by everyone else. Perl uses it, Tcl uses it, XML uses it, etc. see: http://www.python.org/pipermail/python-dev/2000-April/005218.html > I have a bunch of good reasons (I think) for liking UTF-8: it allows > you to convert between Unicode and 8-bit strings without losses, Tcl > uses it (so displaying Unicode in Tkinter *just* *works*...), it is > not Western-language-centric. the "Tcl uses it" is a red herring -- their internal implementation uses 16-bit integers, and the external interface works very hard to keep the "strings are character sequences" illusion. in other words, the length of a string is *always* the number of characters, the character at index i is *always* the i'th character in the string, etc. that's not true in Python 1.6a2. (as for Tkinter, you only have to add 2-3 lines of code to make it use 16-bit strings instead...) > Another reason: while you may claim that your (and /F's, and Just's) > preferred solution doesn't enter into the encodings issue, I claim it > does: Latin-1 is just as much an encoding as any other one. this is another red herring: my argument is that 8-bit strings should contain unicode characters, using unicode character codes. there should be only one character repertoire, and that repertoire is uni- code. for a definition of these terms, see: http://www.python.org/pipermail/python-dev/2000-April/005225.html obviously, you can only store 256 different values in a single 8-bit character (just like you can only store 4294967296 different values in a single 32-bit int). to store larger values, use unicode strings (or long integers). conversion from a small type to a large type always work, conversion from a large type to a small one may result in an OverflowError. it has nothing to do with encodings. > I claim that as long as we're using an encoding we might as well use > the most accepted 8-bit encoding of Unicode as the default encoding. yeah, and I claim that it won't fly, as long as it breaks the "strings are character sequences" rule used by all other contemporary (and competing) systems. (if you like, I can post more "fun with unicode" messages ;-) and as I've mentioned before, there are (at least) two ways to solve this: 1. teach 8-bit strings about UTF-8 (this is how it's done in Tcl and Perl). make sure len(s) returns the number of characters in the string, make sure s[i] returns the i'th character (not necessarily starting at the i'th byte, and not necessarily one byte), etc. to make this run reasonable fast, use as many implementation tricks as you can come up with (I've described three ways to implement this in an earlier post). 2. define 8-bit strings as holding an 8-bit subset of unicode: ord(s[i]) is a unicode character code, whether s is an 8-bit string or a = unicode string. for alternative 1 to work, you need to add some way to explicitly work with binary strings (like it's done in Perl and Tcl). alternative 2 doesn't need that; 8-bit strings can still be used to hold any kind of binary data, as in 1.5.2. just keep in mind you cannot use use all methods on such an object... > I also think that the issue is blown out of proportions: this ONLY > happens when you use Unicode objects, and it ONLY matters when some > other part of the program uses 8-bit string objects containing > non-ASCII characters. 
Given the long tradition of using different > encodings in 8-bit strings, at that point it is anybody's guess what > encoding is used, and UTF-8 is a better guess than Latin-1. I still think it's very unfortunate that you think that unicode strings are a special kind of strings. Perl and Tcl don't, so why should we? From paul@prescod.net Tue May 2 01:19:20 2000 From: paul@prescod.net (Paul Prescod) Date: Mon, 01 May 2000 19:19:20 -0500 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> Message-ID: <390E1F08.EA91599E@prescod.net> Sorry for the long message. Of course you need only respond to that which is interesting to you. I don't think that most of it is redundant. Guido van Rossum wrote: > > ... > > OK, you've made your claim -- like Fredrik, you want to interpret > 8-bit strings as Latin-1 when converting (not just comparing!) them to > Unicode. If the user provides an explicit conversion function (e.g. UTF-8-decode) then of course we should use that function. Under my character is a character is a character model, this "conversion" is morally equivalent to ROT-13, strupr or some other text->text translation. So you could apply UTF-8-decode even to a Unicode string as long as each character in the string has ord()<256 (so that it could be interpreted as a character representation for a byte). > I don't think I've heard a good *argument* for this rule though. "A > character is a character is a character" sounds like an axiom to me -- > something you can't prove or disprove rationally. I don't see it as an axiom, but rather as a design decision you make to keep your language simple. Along the lines of "all values are objects" and (now) all integer values are representable with a single type. Are you happy with this?

    a="\244"
    b=u"\244"
    assert len(a)==len(b)
    assert ord(a[0])==ord(b[0])   # same thing, right?
    print b==a
    # Traceback (most recent call last):
    #   File "", line 1, in ?
    # UnicodeError: UTF-8 decoding error: unexpected code byte

If I type "\244" it means I want character 244, not the first half of a UTF-8 escape sequence. "\244" is a string with one character. It has no encoding. It is not latin-1. It is not UTF-8. It is a string with one character and should compare as equal with another string with the same character. I would laugh my ass off if I was using Perl and it did something weird like this to me (as long as it didn't take a month to track down the bug!). Now it isn't so funny. > I have a bunch of good reasons (I think) for liking UTF-8: I'm not against UTF-8. It could be an internal representation for some Unicode objects. > it allows > you to convert between Unicode and 8-bit strings without losses, Here's the heart of our disagreement: ****** I don't want, in Py3K, to think about "converting between Unicode and 8-bit strings." I want strings and I want byte-arrays and I want to worry about converting between *them*. There should be only one string type, its characters should all live in the Unicode character repertoire and the character numbers should all come from Unicode. "Special" characters can be assigned to the Unicode Private Use Area. Byte arrays would be entirely separate and would be converted to Unicode strings with explicit conversion functions.
***** In the meantime I'm just trying to get other people thinking in this mode so that the transition is easier. If I see people embedding UTF-8 escape sequences in literal strings today, I'm going to hit them. I recognize that we can't design the universe right now but we could agree on this direction and use it to guide our decision-making. By the way, if we DID think of 8-bit strings as essentially "byte arrays" then let's use that terminology and imagine some future documentation: "Python's string type is equivalent to a list of bytes. For clarity, we will call this type a byte list from now on. In contexts where a Unicode character-string is desired, Python automatically converts byte lists to character strings by doing a UTF-8 decode on them." What would you think if Java had a default (I say "magical") conversion from byte arrays to character strings? The only reason we are discussing this is because Python strings have a dual personality which was useful in the past but will (IMHO, of course) become increasingly confusing in the future. We want the best of both worlds without confusing anybody and I don't think that we can have it. If you want 8-bit strings to be really byte arrays in perpetuity then let's be consistent in that view. We can compare them to Unicode as we would two completely separate types. "U" comes after "S" so unicode strings always compare greater than 8-bit strings. The use of the word "string" for both objects can be considered just a historical accident. > Tcl uses it (so displaying Unicode in Tkinter *just* *works*...), Don't follow this entirely. Shouldn't the next version of TKinter accept and return Unicode strings? It would be rather ugly for two Unicode-aware systems (Python and TK) to talk to each other in 8-bit strings. I mean I don't care what you do at the C level but at the Python level arguments should be "just strings." Consider that len() on the TKinter side would return a different value than on the Python side. What about integral indexes into buffers? I'm totally ignorant about TKinter but let me ask wouldn't Tkinter say (e.g.) that the cursor is between the 5th and 6th character when in an 8-bit string the equivalent index might be the 11th or 12th byte? > it is not Western-language-centric. If you look at encoding efficiency it is. > Another reason: while you may claim that your (and /F's, and Just's) > preferred solution doesn't enter into the encodings issue, I claim it > does: Latin-1 is just as much an encoding as any other one. The fact that my proposal has the same effect as making Latin-1 the "default encoding" is a near-term side effect of the definition of Unicode. My long term proposal is to do away with the concept of 8-bit strings (and thus, conversions from 8-bit to Unicode) altogether. One string to rule them all! Is Unicode going to be the canonical Py3K character set or will we have different objects for different character sets/encodings with different default (I say "magical") conversions between them? Such a design would not be entirely insane though it would be a PITA to implement and maintain. If we aren't ready to establish Unicode as the one true character set then we should probably make no special concessions for Unicode at all. Let a thousand string objects bloom! Even if we agreed to allow many string objects, byte==character should not be the default string object. Unicode should be the default.
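A short sketch of the byte/character index mismatch described above, using the encode() method of the new Unicode type (the sample string is invented):

    s = u"ab\u3042\u3044\u3046"   # 5 characters: "ab" plus three hiragana
    b = s.encode("utf-8")         # 11 bytes: each hiragana takes 3 bytes in UTF-8

    assert len(s) == 5            # the count the user thinks in
    assert len(b) == 11           # the count a byte-oriented layer reports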
> I also think that the issue is blown out of proportions: this ONLY > happens when you use Unicode objects, and it ONLY matters when some > other part of the program uses 8-bit string objects containing > non-ASCII characters. Won't this be totally common? Most people are going to use 8-bit literals in their program text but work with Unicode data from XML parsers, COM, WebDAV, Tkinter, etc? > Given the long tradition of using different > encodings in 8-bit strings, at that point it is anybody's guess what > encoding is used, and UTF-8 is a better guess than Latin-1. If we are guessing then we are doing something wrong. My answer to the question of "default encoding" falls out naturally from a certain way of looking at text, popularized in various other languages and increasingly "the norm" on the Web. If you accept the model (a character is a character is a character), the right behavior is obvious. "\244"==u"\244" Nobody is ever going to have trouble understanding how this works. Choose simplicity! -- Paul Prescod - ISOGEN Consulting Engineer speaking for himself It's difficult to extract sense from strings, but they're the only communication coin we can count on. - http://www.cs.yale.edu/~perlis-alan/quotes.html From guido@python.org Tue May 2 01:53:26 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 01 May 2000 20:53:26 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Mon, 01 May 2000 19:19:20 CDT." <390E1F08.EA91599E@prescod.net> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> Message-ID: <200005020053.UAA23665@eric.cnri.reston.va.us> Paul, we're both just saying the same thing over and over without convincing each other. I'll wait till someone who wasn't in this debate before chimes in. Have you tried using this? --Guido van Rossum (home page: http://www.python.org/~guido/) From Fredrik Lundh" <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> Message-ID: <002301bfb3d5$8fd57440$34aab5d4@hagrid> Paul Prescod wrote: > I would laugh my ass off if I was using Perl and it did something weird > like this to me. you don't have to -- in Perl 5.6, a character is a character... does anyone on this list follow the perl-porters list? was this as controversial over in Perl land as it appears to be over here? From guido@python.org Tue May 2 04:31:54 2000 From: guido@python.org (Guido van Rossum) Date: Mon, 01 May 2000 23:31:54 -0400 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> References: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> Message-ID: <200005020331.XAA23818@eric.cnri.reston.va.us> Tom Passin: > I'm with Paul and Fredrik on this one - at least about characters being the > atoms of a string. We **have** to be able to refer to **characters** in a > string, and without guessing.
Otherwise, how could you ever construct a > test, like theString[3]==[a particular japanese ideograph]? If we do it by > having a "string" datatype, which is really a byte list, and a > "unicodeString" datatype which is a list of abstract characters, I'd say > everyone could get used to working with them. We'd have to supply > conversion functions, of course. You seem unfamiliar with the details of the implementation we're proposing? We already have two datatypes, 8-bit string (call it byte array) and Unicode string. There are conversions between them: explicit conversions such as u.encode("utf-8") or unicode(s, "latin-1") and implicit conversions used in situations like u+s or u==s. The whole discussion is *only* about what the default conversion in the latter cases should be -- the rest of the implementation is rock solid and works well. Users can accomplish what you are proposing by simply ensuring that theString is a Unicode string. > This route might be the easiest to understand for users. We'd have to be > very clear about what file.read() would return, for example, and all those > similar read and write functions. And we'd have to work out how real 8-bit > calls (like writing to a socket?) would play with the new types. These are all well defined -- they all deal in 8-bit strings internally, and all use the default conversions when given Unicode strings. Programs that only deal in 8-bit strings don't need to change. Programs that want to deal with Unicode and sockets, for example, must know what encoding to use on the socket, and if it's not the default encoding, must use explicit conversions. > For extra clarity, we could leave string the way it is, introduce stringU > (unicode string) **and** string8 (Latin-1 or byte list, whichever seems to > be the best equivalent to the current string). Then we would deprecate > string in favor of string8. Then if tcl and perl go to unicode strings we > pass them a stringU, and if they go some other way, we pass them something > else. Come to think of it, we need some data type that will continue > to work with c and c++. Would that be string8 or would we keep string for > that purpose? What would be the difference between string and string8? > Clarity and ease of use for the user should be primary, fast implementations > next. If we didn't care about ease of use and clarity, we could all use > Scheme or c, don't lose sight of it. > > I'd suggest we could create some use cases or scenarios for this area - > needs input from those who know encodings and low level Python stuff better > than I. Then we could examine more systematically how well various > approaches would work out. Very good. Here's one usage scenario. A Japanese user is reading lines from a file encoded in ISO-2022-JP. The readline() method returns 8-bit strings in that encoding (the file object doesn't do any decoding). She realizes that she wants to do some character-level processing on the file so she decides to convert the strings to Unicode. I believe that whether the default encoding is UTF-8 or Latin-1 doesn't matter for her -- both are wrong, she needs to write explicit unicode(line, "iso-2022-jp") code anyway. I would argue that UTF-8 is "better", because interpreting ISO-2022-JP data as UTF-8 will most likely give an exception (when a \300 range byte isn't followed by a \200 range byte) -- while interpreting it as Latin-1 will silently do the wrong thing. (An explicit error is always better than silent failure.) I'd love to discuss other scenarios.
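In code, the scenario reads roughly like this (a sketch only: it assumes an "iso-2022-jp" codec is installed, which the standard distribution does not ship):

    f = open("mail.txt", "rb")
    for line in f.readlines():             # 8-bit strings, still encoded
        u = unicode(line, "iso-2022-jp")   # explicit decode -- no guessing
        # ... character-level processing on u ...
    f.close()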
--Guido van Rossum (home page: http://www.python.org/~guido/) From just@letterror.com Tue May 2 06:47:35 2000 From: just@letterror.com (Just van Rossum) Date: Tue, 2 May 2000 06:47:35 +0100 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <200005020331.XAA23818@eric.cnri.reston.va.us> References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> Message-ID: At 11:31 PM -0400 01-05-2000, Guido van Rossum wrote: >Here's one usage scenario. > >A Japanese user is reading lines from a file encoded in ISO-2022-JP. >The readline() method returns 8-bit strings in that encoding (the file >object doesn't do any decoding). She realizes that she wants to do >some character-level processing on the file so she decides to convert >the strings to Unicode. > >I believe that whether the default encoding is UTF-8 or Latin-1 >doesn't matter for her -- both are wrong, she needs to write explicit >unicode(line, "iso-2022-jp") code anyway. I would argue that UTF-8 is >"better", because interpreting ISO-2022-JP data as UTF-8 will most >likely give an exception (when a \300 range byte isn't followed by a >\200 range byte) -- while interpreting it as Latin-1 will silently do >the wrong thing. (An explicit error is always better than silent >failure.) But then it's even better to *always* raise an exception, since it's entirely possible a string contains valid utf-8 while not *being* utf-8. I really think the exception argument is moot, since there can *always* be situations that will pass silently. Encoding issues are silent by nature -- eg. there's no way any system can tell that interpreting MacRoman data as Latin-1 is wrong, maybe even fatal -- the user will just have to deal with it. You can argue what you want, but *any* multi-byte encoding stored in an 8-bit string is a buffer, not a string, for all the reasons Fredrik and Paul have thrown at you, and right they are. Choosing such an encoding as a default conversion to Unicode makes no sense at all. Recap of the main arguments:

pro UTF-8: always reversible when going from Unicode to 8-bit
con UTF-8: not a string: confusing semantics
pro Latin-1: simpler semantics
con Latin-1: non-reversible, western-centric

Given the fact that very often *both* will be wrong, I'd go for the simpler semantics. Just From tpassin@home.com Tue May 2 06:07:07 2000 From: tpassin@home.com (tpassin@home.com) Date: Tue, 2 May 2000 01:07:07 -0400 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate References: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <200005020331.XAA23818@eric.cnri.reston.va.us> Message-ID: <006101bfb3f4$454f99e0$7cac1218@reston1.va.home.com> Guido van Rossum said > What would be the difference between string and string8? Probably none, except to alert people that string8 might have different behavior than the present-day string, perhaps when interacting with unicode - probably its behavior would be specified more tightly (i.e., is it strictly a list of bytes or does it have some assumption about encoding?) or changed in some way from what we have now. Or if it turned out that a lot of programmers in other languages (perl, tcl, perhaps?) expected "string" to behave in particular ways, the use of a term like "string8" might reduce confusion. Possibly none of these apply - no need for "string8" then. > > > Clarity and ease of use for the user should be primary, fast implementations > > next.
If we didn't care about ease of use and clarity, we could all use > > Scheme or C; let's not lose sight of that. > > > > I'd suggest we could create some use cases or scenarios for this area - > > needs input from those who know encodings and low level Python stuff better > > than I. Then we could examine more systematically how well various > > approaches would work out. > > Very good. > Tom Passin

From mal@lemburg.com Tue May 2 09:36:43 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 02 May 2000 10:36:43 +0200 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> Message-ID: <390E939B.11B99B71@lemburg.com> Just a small note on the subject of a character being atomic which seems to have been forgotten by the discussing parties: Unicode itself can be understood as a multi-word character encoding, just like UTF-8. The reason is that Unicode entities can be combined to produce single display characters (e.g. u"e"+u"\u0301" will print "é" in a Unicode-aware renderer). Slicing such a combined Unicode string will have the same effect as slicing UTF-8 data. Most Latin-1 proponents seem to have single display characters in mind. While the same is true for many Unicode entities, there are quite a few cases of combining characters in Unicode 3.0 and the Unicode normalization algorithm uses these as the basis for its work. So in the end the "UTF-8 doesn't slice" argument holds for Unicode itself too, just as it also does for many Asian multi-byte variable-length character encodings, image formats, audio formats, database formats, etc. You can't really expect slicing to always "just work" without some knowledge about the data you are slicing. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

From ping@lfw.org Tue May 2 09:42:51 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Tue, 2 May 2000 01:42:51 -0700 (PDT) Subject: [I18n-sig] Unicode debate In-Reply-To: Message-ID: I'll warn you that i'm not much experienced or well-informed, but i suppose i might as well toss in my naive opinion. At 11:31 PM -0400 01-05-2000, Guido van Rossum wrote: > > I believe that whether the default encoding is UTF-8 or Latin-1 > doesn't matter here -- both are wrong, she needs to write explicit > unicode(line, "iso-2022-jp") code anyway. I would argue that UTF-8 is > "better", because [this] will most likely give an exception... On Tue, 2 May 2000, Just van Rossum wrote: > But then it's even better to *always* raise an exception, since it's > entirely possible a string contains valid utf-8 while not *being* utf-8. I believe it is time for me to make a truly radical proposal: No automatic conversions between 8-bit "strings" and Unicode strings. If you want to turn UTF-8 into a Unicode string, say so. If you want to turn Latin-1 into a Unicode string, say so. If you want to turn ISO-2022-JP into a Unicode string, say so. Adding a Unicode string and an 8-bit "string" gives an exception. I know this sounds tedious, but at least it stands the least possible chance of confusing anyone -- and given all i've seen here and in other i18n and l10n discussions, there's plenty enough confusion to go around already.
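Spelled out in code, this proposal reads as follows; the variable names are illustrative assumptions, and the last line describes the *proposed* behavior rather than what 1.6a2 actually does:

    u = unicode(data, 'utf-8')        # if you want UTF-8, say so
    u = unicode(data, 'latin-1')      # if you want Latin-1, say so
    u = unicode(data, 'iso-2022-jp')  # if you want ISO-2022-JP, say so
    s = u.encode('utf-8')             # and say so on the way back out
    # u + data  ->  would raise an exception under this proposal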
If it turns out automatic conversions *are* absolutely necessary, then i vote in favour of the simple, direct method promoted by Paul and Fredrik: just copy the numerical values of the bytes. The fact that this happens to correspond to Latin-1 is not really the point; the main reason is that it satisfies the Principle of Least Surprise. Okay. Feel free to yell at me now. -- ?!ng P. S. The scare-quotes when i talk about 8-bit "strings" expose my sense of them as byte-buffers -- since that *is* all you get when you read in some bytes from a file. If you manipulate an 8-bit "string" as a character string, you are implicitly making the assumption that the byte values correspond to the character encoding of the character repertoire you want to work with, and that's your responsibility. P. P. S. If always having to specify encodings is really too much, i'd probably be willing to consider a default-encoding state on the Unicode class, but it would have to be a stack of values, not a single value.

From: Fredrik Lundh References: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> Message-ID: <009701bfb414$d35d0ea0$34aab5d4@hagrid> M.-A. Lemburg wrote: > Just a small note on the subject of a character being atomic > which seems to have been forgotten by the discussing parties: > > Unicode itself can be understood as a multi-word character > encoding, just like UTF-8. The reason is that Unicode entities > can be combined to produce single display characters (e.g. > u"e"+u"\u0301" will print "é" in a Unicode-aware renderer). > Slicing such a combined Unicode string will have the same > effect as slicing UTF-8 data. really? does it result in a decoder error? or does it just result in a rendering error, just as if you slice off any trailing character without looking... > Most Latin-1 proponents seem to have single > display characters in mind. While the same is true for > many Unicode entities, there are quite a few cases of > combining characters in Unicode 3.0 and the Unicode > normalization algorithm uses these as the basis for its > work. do we support automatic normalization in 1.6?

From pf@artcom-gmbh.de Tue May 2 10:09:11 2000 From: pf@artcom-gmbh.de (Peter Funk) Date: Tue, 2 May 2000 11:09:11 +0200 (MEST) Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: from Ka-Ping Yee at "May 2, 2000 1:42:51 am" Message-ID: Ka-Ping Yee: > I'll warn you that i'm not much experienced or well-informed, but > i suppose i might as well toss in my naive opinion. [...] > If it turns out automatic conversions *are* absolutely necessary, > then i vote in favour of the simple, direct method promoted by Paul > and Fredrik: just copy the numerical values of the bytes. The fact > that this happens to correspond to Latin-1 is not really the point; > the main reason is that it satisfies the Principle of Least Surprise. I agree with Just, Paul, Fredrik and Ping. Regards, Peter

From mal@lemburg.com Tue May 2 10:56:21 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 02 May 2000 11:56:21 +0200 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid> Message-ID: <390EA645.89E3B22A@lemburg.com> Fredrik Lundh wrote: > > M.-A.
Lemburg wrote: > > Just a small note on the subject of a character being atomic > > which seems to have been forgotten by the discussing parties: > > > > Unicode itself can be understood as a multi-word character > > encoding, just like UTF-8. The reason is that Unicode entities > > can be combined to produce single display characters (e.g. > > u"e"+u"\u0301" will print "é" in a Unicode-aware renderer). > > Slicing such a combined Unicode string will have the same > > effect as slicing UTF-8 data. > > really? does it result in a decoder error? or does it just result > in a rendering error, just as if you slice off any trailing character > without looking... In the example, if you cut off the u"\u0301", the "e" would appear without the acute accent; cutting off the u"e" would probably result in a rendering error or, worse, put the accent over the next character to the left. UTF-8 is better in this respect: it warns you about the error by raising an exception when being converted to Unicode. > > Most Latin-1 proponents seem to have single > > display characters in mind. While the same is true for > > many Unicode entities, there are quite a few cases of > > combining characters in Unicode 3.0 and the Unicode > > normalization algorithm uses these as the basis for its > > work. > > do we support automatic normalization in 1.6? No, but it is likely to appear in 1.7... not sure about the "automatic" though. FYI: Normalization is needed to make comparing Unicode strings robust, e.g. u"é" should compare equal to u"e\u0301". -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

From Moshe Zadka Tue May 2 11:12:14 2000 From: Moshe Zadka (Moshe Zadka) Date: Tue, 2 May 2000 13:12:14 +0300 (IDT) Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <200005020053.UAA23665@eric.cnri.reston.va.us> Message-ID: On Mon, 1 May 2000, Guido van Rossum wrote: > Paul, we're both just saying the same thing over and over without > convincing each other. I'll wait till someone who wasn't in this > debate before chimes in. Well, I'm guessing you had someone specific in mind (Neil?), but I want to say something too, as the only one here (I think) using ISO-8859-8 natively. I much prefer the Fredrik-Paul position, known also as the "a character is a character" position, to UTF-8 as the default encoding. Unicode is western-centered -- the first 256 characters are Latin-1. UTF-8 is even more horribly western-centered (or I should say USA-centered) -- ASCII documents are the same. I'd much prefer Python to reflect a fundamental truth about Unicode, which at least makes sure binary goop can pass through Unicode and remain unharmed, than to reflect a nasty problem with UTF-8 (not everything is legal). If I'm using Hebrew characters in my source (which I won't for a long while), I'll use them in Unicode strings only, and make sure I use Unicode. If I'm reading Hebrew from an ISO-8859-8 file, I'll set up a conversion to Unicode on the fly anyway, since most bidi libraries work on Unicode. So having UTF-8 conversions magically happen won't help me at all, and will only cause problems when I use "sort-for-uniqueness" on a list with mixed binary goop and Unicode strings. In short, this sounds like a recipe for disaster. internationally y'rs, Z.
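MAL's combining-character point is easy to demonstrate; a hedged sketch (the comments state standard Unicode behavior, not anything specific to the 1.6 implementation):

    u = u"e" + u"\u0301"   # 'e' plus COMBINING ACUTE ACCENT
    len(u)                 # -> 2: two code points, one display character
    u[:1]                  # -> u"e": slicing silently drops the accent
    u[1:]                  # -> u"\u0301": a lone combining mark -- no exception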
-- Moshe Zadka http://www.oreilly.com/news/prescod_0300.html http://www.linux.org.il -- we put the penguin in .com

From mal@lemburg.com Tue May 2 11:46:06 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 02 May 2000 12:46:06 +0200 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: Message-ID: <390EB1EE.EA557CA9@lemburg.com> Moshe Zadka wrote: > > I'd much prefer Python to reflect a > fundamental truth about Unicode, which at least makes sure binary goop can > pass through Unicode and remain unharmed, than to reflect a nasty problem > with UTF-8 (not everything is legal). Let's not make the same mistake again: Unicode objects should *not* be used to hold binary data. Please use buffers instead. BTW, I think that this behaviour should be changed:

>>> buffer('binary') + 'data'
'binarydata'

while:

>>> 'data' + buffer('binary')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: illegal argument type for built-in operation

IMHO, buffer objects should never coerce to strings, but instead return a buffer object holding the combined contents. The same applies to slicing buffer objects:

>>> buffer('binary')[2:5]
'nar'

should preferably be buffer('nar'). -- Hmm, perhaps we need something like a data string object to get this 100% right ?!

>>> d = data("...data...")
or
>>> d = d"...data..."
>>> print type(d)
<type 'data'>
>>> 'string' + d
d"string...data..."
>>> u'string' + d
d"s\000t\000r\000i\000n\000g\000...data..."
>>> d[:5]
d"...da"

etc. Ideally, string and Unicode objects would then be subclasses of this type in Py3K. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

From just@letterror.com Tue May 2 13:34:57 2000 From: just@letterror.com (Just van Rossum) Date: Tue, 2 May 2000 13:34:57 +0100 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <390E939B.11B99B71@lemburg.com> References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> Message-ID: At 10:36 AM +0200 02-05-2000, M.-A. Lemburg wrote: >Just a small note on the subject of a character being atomic >which seems to have been forgotten by the discussing parties: > >Unicode itself can be understood as a multi-word character >encoding, just like UTF-8. The reason is that Unicode entities >can be combined to produce single display characters (e.g. >u"e"+u"\u0301" will print "é" in a Unicode-aware renderer). Erm, are you sure Unicode prescribes this behavior, for this example? I know similar behaviors are specified for certain languages/scripts, but I didn't know it did that for Latin. >Slicing such a combined Unicode string will have the same >effect as slicing UTF-8 data. Not true. As Fredrik noted: no exception will be raised. [ Speaking of exceptions, after I sent off my previous post I realized Guido's non-utf8-strings-interpreted-as-utf8-will-often-raise-an-exception argument can easily be turned around, backfiring at utf-8: Defaulting to utf-8 when going from Unicode to 8-bit and back only gives the *illusion* things "just work", since it will *silently* "work", even if utf-8 is *not* the desired 8-bit encoding -- as shown by Fredrik's excellent "fun with Unicode, part 1" example. Defaulting to Latin-1 will warn the user *much* earlier, since it'll barf when converting a Unicode string that contains any character code > 255. So there.
] >Most Latin-1 proponents seem to have single >display characters in mind. While the same is true for >many Unicode entities, there are quite a few cases of >combining characters in Unicode 3.0 and the Unicode >normalization algorithm uses these as the basis for its >work. Still, two combining characters are still two input characters for the renderer! They may result in one *glyph*, but trust me, that's an entirely different can of worms. However, if you'd be talking about Unicode surrogates, you'd definitely have a point. How do Java/Perl/Tcl deal with surrogates? Just

From guido@python.org Tue May 2 13:26:50 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 08:26:50 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Tue, 02 May 2000 12:46:06 +0200." <390EB1EE.EA557CA9@lemburg.com> References: <390EB1EE.EA557CA9@lemburg.com> Message-ID: <200005021226.IAA24203@eric.cnri.reston.va.us> [MAL] > Let's not make the same mistake again: Unicode objects should *not* > be used to hold binary data. Please use buffers instead. Easier said than done -- Python doesn't really have a buffer data type. Or do you mean the array module? It's not trivial to read a file into an array (although it's possible, there are even two ways). Fact is, most of Python's standard library and built-in objects use (8-bit) strings as buffers. I agree there's no reason to extend this to Unicode strings. > BTW, I think that this behaviour should be changed: > > >>> buffer('binary') + 'data' > 'binarydata' > > while: > > >>> 'data' + buffer('binary') > Traceback (most recent call last): > File "<stdin>", line 1, in ? > TypeError: illegal argument type for built-in operation > > IMHO, buffer objects should never coerce to strings, but instead > return a buffer object holding the combined contents. The > same applies to slicing buffer objects: > > >>> buffer('binary')[2:5] > 'nar' > > should preferably be buffer('nar'). Note that a buffer object doesn't hold data! It's only a pointer to data. I can't off-hand explain the asymmetry though. > -- > > Hmm, perhaps we need something like a data string object > to get this 100% right ?! > > >>> d = data("...data...") > or > >>> d = d"...data..." > >>> print type(d) > <type 'data'> > > >>> 'string' + d > d"string...data..." > >>> u'string' + d > d"s\000t\000r\000i\000n\000g\000...data..." > > >>> d[:5] > d"...da" > > etc. > > Ideally, string and Unicode objects would then be subclasses > of this type in Py3K. Not clear. I'd rather do the equivalent of byte arrays in Java, for which no "string literal" notations exist. --Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@python.org Tue May 2 13:30:02 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 08:30:02 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Tue, 02 May 2000 11:56:21 +0200." <390EA645.89E3B22A@lemburg.com> References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid> <390EA645.89E3B22A@lemburg.com> Message-ID: <200005021230.IAA24232@eric.cnri.reston.va.us> [MAL] > > > Unicode itself can be understood as a multi-word character > > > encoding, just like UTF-8. The reason is that Unicode entities > > > can be combined to produce single display characters (e.g. > > > u"e"+u"\u0301" will print "é" in a Unicode-aware renderer).
> > > Slicing such a combined Unicode string will have the same > > > effect as slicing UTF-8 data. [/F] > > really? does it result in a decoder error? or does it just result > > in a rendering error, just as if you slice off any trailing character > > without looking... [MAL] > In the example, if you cut off the u"\u0301", the "e" would > appear without the acute accent; cutting off the u"e" would > probably result in a rendering error or, worse, put the accent > over the next character to the left. > > UTF-8 is better in this respect: it warns you about > the error by raising an exception when being converted to > Unicode. I think /F's point was that the Unicode standard prescribes different behavior here: for UTF-8, a missing or lone continuation byte is an error; for Unicode, accents are separate characters that may be inserted and deleted in a string but whose display is undefined under certain conditions. (I just noticed that this doesn't work in Tkinter but it does work in wish. Strange.) > FYI: Normalization is needed to make comparing Unicode > strings robust, e.g. u"é" should compare equal to u"e\u0301". Aha, then we'll see u == v even though type(u) is type(v) and len(u) != len(v). /F's world will collapse. :-) --Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@python.org Tue May 2 13:30:10 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 08:30:10 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Tue, 02 May 2000 11:09:11 +0200." References: Message-ID: <200005021230.IAA24240@eric.cnri.reston.va.us> > I agree with Just, Paul, Fredrik and Ping. Sorry, this is not a democracy. :-) I'm not counting votes, I'm looking for contributions to the discussion. --Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@python.org Tue May 2 13:31:55 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 08:31:55 -0400 Subject: [I18n-sig] Re: Unicode debate In-Reply-To: Your message of "Tue, 02 May 2000 01:42:51 PDT." References: Message-ID: <200005021231.IAA24249@eric.cnri.reston.va.us> > No automatic conversions between 8-bit "strings" and Unicode strings. > > If you want to turn UTF-8 into a Unicode string, say so. > If you want to turn Latin-1 into a Unicode string, say so. > If you want to turn ISO-2022-JP into a Unicode string, say so. > Adding a Unicode string and an 8-bit "string" gives an exception. I'd accept this, with one change: mixing Unicode and 8-bit strings is okay when the 8-bit strings contain only ASCII (byte values 0 through 127). That does the right thing when the program is combining ASCII data (e.g. literals or data files) with Unicode and warns you when you are using characters for which the encoding matters. I believe that this is important because much existing code dealing with strings can in fact deal with Unicode just fine under these assumptions. (E.g. I needed only 4 changes to htmllib/sgmllib to make it deal with Unicode strings -- those changes were all getattr() and setattr() calls.) When *comparing* 8-bit and Unicode strings, the presence of non-ASCII bytes in either should make the comparison fail; when ordering is important, we can make an arbitrary choice e.g. "\377" < u"\200". Why not Latin-1? Because it gives us Western-alphabet users a false sense that our code works, where in fact it is broken as soon as you change the encoding. > P. S.
The scare-quotes when i talk about 8-bit "strings" expose my > sense of them as byte-buffers -- since that *is* all you get when you > read in some bytes from a file. If you manipulate an 8-bit "string" > as a character string, you are implicitly making the assumption that > the byte values correspond to the character encoding of the character > repertoire you want to work with, and that's your responsibility. This is how I think of them too. > P. P. S. If always having to specify encodings is really too much, > i'd probably be willing to consider a default-encoding state on the > Unicode class, but it would have to be a stack of values, not a > single value. Please elaborate? --Guido van Rossum (home page: http://www.python.org/~guido/)

From just@letterror.com Tue May 2 14:44:30 2000 From: just@letterror.com (Just van Rossum) Date: Tue, 2 May 2000 14:44:30 +0100 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <200005021230.IAA24232@eric.cnri.reston.va.us> References: Your message of "Tue, 02 May 2000 11:56:21 +0200." <390EA645.89E3B22A@lemburg.com> Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid> <390EA645.89E3B22A@lemburg.com> Message-ID: At 8:30 AM -0400 02-05-2000, Guido van Rossum wrote: >I think /F's point was that the Unicode standard prescribes different >behavior here: for UTF-8, a missing or lone continuation byte is an >error; for Unicode, accents are separate characters that may be >inserted and deleted in a string but whose display is undefined under >certain conditions. > >(I just noticed that this doesn't work in Tkinter but it does work in >wish. Strange.) > >> FYI: Normalization is needed to make comparing Unicode >> strings robust, e.g. u"é" should compare equal to u"e\u0301". > >Aha, then we'll see u == v even though type(u) is type(v) and len(u) >!= len(v). /F's world will collapse. :-) Does the Unicode spec *really* specify u should compare equal to v? This behavior would be the responsibility of a layout engine, a role which is way beyond the scope of Unicode support in Python, as it is language- and script-dependent. Just

From just@letterror.com Tue May 2 14:39:24 2000 From: just@letterror.com (Just van Rossum) Date: Tue, 2 May 2000 14:39:24 +0100 Subject: [I18n-sig] Unicode debate In-Reply-To: References: Message-ID: At 1:42 AM -0700 02-05-2000, Ka-Ping Yee wrote: >If it turns out automatic conversions *are* absolutely necessary, >then i vote in favour of the simple, direct method promoted by Paul >and Fredrik: just copy the numerical values of the bytes. The fact >that this happens to correspond to Latin-1 is not really the point; >the main reason is that it satisfies the Principle of Least Surprise. Exactly. I'm not sure if automatic conversions are absolutely necessary, but seeing 8-bit strings as Latin-1-encoded Unicode strings seems most natural to me. Heck, even 8-bit strings should have an s.encode() method that would behave *just* like u.encode(), and unicode(blah) could even *return* an 8-bit string if it turns out the string has no character codes > 255! Conceptually, this gets *very* close to the ideal of "there is only one string type", and at the same time leaves room for 8-bit strings doubling as byte arrays for backward compatibility reasons. (Unicode strings and 8-bit strings could even be the same type, which only uses wide chars when necessary!)
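Neither behavior Just sketches exists in 1.6; as a hypothetical illustration, the proposed s.encode() method amounts to this helper function (the name encode8 and the Latin-1 reading of 8-bit strings are assumptions of the proposal, not current Python):

    def encode8(s, encoding):
        # Treat the 8-bit string as Latin-1 (char i == code point i),
        # then re-encode it -- the proposed semantics for s.encode().
        return unicode(s, 'latin-1').encode(encoding)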
Just

From just@letterror.com Tue May 2 14:55:31 2000 From: just@letterror.com (Just van Rossum) Date: Tue, 2 May 2000 14:55:31 +0100 Subject: [I18n-sig] Re: Unicode debate In-Reply-To: <200005021231.IAA24249@eric.cnri.reston.va.us> References: Your message of "Tue, 02 May 2000 01:42:51 PDT." Message-ID: At 8:31 AM -0400 02-05-2000, Guido van Rossum wrote: >When *comparing* 8-bit and Unicode strings, the presence of non-ASCII >bytes in either should make the comparison fail; when ordering is >important, we can make an arbitrary choice e.g. "\377" < u"\200". Blech. Just document that 8-bit strings *are* Latin-1 unless converted explicitly, and you're done. It's really much simpler this way. For you as well as the users. >Why not Latin-1? Because it gives us Western-alphabet users a false >sense that our code works, where in fact it is broken as soon as you >change the encoding. Yeah, and? At least it'll *show* it's broken instead of *silently* doing the wrong thing with utf-8. It's like using Python ints all over the place, and suddenly a user of the application enters data that causes an integer overflow. Boom. Program needs to be fixed. What's the big deal? Just

From just@letterror.com Tue May 2 15:00:31 2000 From: just@letterror.com (Just van Rossum) Date: Tue, 2 May 2000 15:00:31 +0100 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <200005021230.IAA24240@eric.cnri.reston.va.us> References: Your message of "Tue, 02 May 2000 11:09:11 +0200." Message-ID: At 8:30 AM -0400 02-05-2000, Guido van Rossum wrote: >> I agree with Just, Paul, Fredrik and Ping. > >Sorry, this is not a democracy. :-) I'm not counting votes, I'm >looking for contributions to the discussion. Of course it's not, and of course you shouldn't be counting votes. However, the fact that more and more people chime in on the Latin-1 side (even non-western-oriented people like Ping and Moshe!) should ring a bell. Just

From: Fredrik Lundh References: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid> <390EA645.89E3B22A@lemburg.com> <200005021230.IAA24232@eric.cnri.reston.va.us> Message-ID: <00f301bfb437$227bc180$34aab5d4@hagrid> Guido van Rossum wrote: > > FYI: Normalization is needed to make comparing Unicode > > strings robust, e.g. u"é" should compare equal to u"e\u0301". > > Aha, then we'll see u == v even though type(u) is type(v) and len(u) > != len(v). /F's world will collapse. :-) you're gonna do automatic normalization? that's interesting. will this make Python the first language to define strings as a "sequence of graphemes"? or was this just the cheap shot it appeared to be?

From tdickenson@geminidataloggers.com Tue May 2 14:46:44 2000 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Tue, 02 May 2000 14:46:44 +0100 Subject: [I18n-sig] Re: Unicode debate In-Reply-To: <200005021231.IAA24249@eric.cnri.reston.va.us> References: <200005021231.IAA24249@eric.cnri.reston.va.us> Message-ID: On Tue, 02 May 2000 08:31:55 -0400, Guido van Rossum wrote: >> No automatic conversions between 8-bit "strings" and Unicode strings. >> >> If you want to turn UTF-8 into a Unicode string, say so. >> If you want to turn Latin-1 into a Unicode string, say so. >> If you want to turn ISO-2022-JP into a Unicode string, say so. >> Adding a Unicode string and an 8-bit "string" gives an exception.
> >I'd accept this, with one change: mixing Unicode and 8-bit strings is >okay when the 8-bit strings contain only ASCII (byte values 0 through >127). That does the right thing when the program is combining >ASCII data (e.g. literals or data files) with Unicode and warns you >when you are using characters for which the encoding matters. I >believe that this is important because much existing code dealing with >strings can in fact deal with Unicode just fine under these >assumptions. (E.g. I needed only 4 changes to htmllib/sgmllib to make >it deal with Unicode strings -- those changes were all getattr() and >setattr() calls.) > >When *comparing* 8-bit and Unicode strings, the presence of non-ASCII >bytes in either should make the comparison fail; when ordering is >important, we can make an arbitrary choice e.g. "\377" < u"\200". I assume 'fail' means 'non-equal', rather than 'raises an exception'? Toby Dickenson tdickenson@geminidataloggers.com

From guido@python.org Tue May 2 15:00:14 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 10:00:14 -0400 Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate In-Reply-To: Your message of "Tue, 02 May 2000 14:46:44 BST." References: <200005021231.IAA24249@eric.cnri.reston.va.us> Message-ID: <200005021400.KAA24464@eric.cnri.reston.va.us> [me] > >When *comparing* 8-bit and Unicode strings, the presence of non-ASCII > >bytes in either should make the comparison fail; when ordering is > >important, we can make an arbitrary choice e.g. "\377" < u"\200". [Toby] > I assume 'fail' means 'non-equal', rather than 'raises an exception'? Yes, sorry for the ambiguity. --Guido van Rossum (home page: http://www.python.org/~guido/)

From just@letterror.com Tue May 2 16:11:39 2000 From: just@letterror.com (Just van Rossum) Date: Tue, 2 May 2000 16:11:39 +0100 Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate In-Reply-To: <200005021400.KAA24464@eric.cnri.reston.va.us> References: Your message of "Tue, 02 May 2000 14:46:44 BST." <200005021231.IAA24249@eric.cnri.reston.va.us> Message-ID: At 10:00 AM -0400 02-05-2000, Guido van Rossum wrote: >[me] >> >When *comparing* 8-bit and Unicode strings, the presence of non-ASCII >> >bytes in either should make the comparison fail; when ordering is >> >important, we can make an arbitrary choice e.g. "\377" < u"\200". > >[Toby] >> I assume 'fail' means 'non-equal', rather than 'raises an exception'? > >Yes, sorry for the ambiguity. You're going to have a hard time explaining that "\377" != u"\377". Again, if you define that "all strings are unicode" and that 8-bit strings contain Unicode characters up to 255, you're all set. Clear semantics, few surprises, simple implementation, etc. etc. Just

From guido@python.org Tue May 2 15:15:50 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 10:15:50 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Tue, 02 May 2000 15:00:31 BST." References: Your message of "Tue, 02 May 2000 11:09:11 +0200." Message-ID: <200005021415.KAA24503@eric.cnri.reston.va.us> [me] > >Why not Latin-1? Because it gives us Western-alphabet users a false > >sense that our code works, where in fact it is broken as soon as you > >change the encoding. [Just] > Yeah, and? At least it'll *show* it's broken instead of *silently* doing > the wrong thing with utf-8. > > It's like using Python ints all over the place, and suddenly a user of the > application enters data that causes an integer overflow. Boom. Program > needs to be fixed.
What's the big deal? The big deal is that in some cultures, 8-bit strings with non-ASCII bytes are unlikely to be Latin-1. Under the Latin-1 convention, they would get garbage when mixing Unicode and regular strings. This is more like ignoring overflow on integer addition (so that on a 32-bit machine 2000000000*2 would silently yield -294967296). I am against silently allowing erroneous results like this if I can help it. [Just, in a different message] > Of course it's not, and of course you shouldn't be counting votes. However, > the fact that more and more people chime in on the Latin-1 side (even > non-western-oriented people like Ping and Moshe!) should ring a bell. Significantly, neither Ping nor Moshe cares for Latin-1 at all: they don't have a use for a default encoding. This is because they have no hope that their preferred encoding would be elected as the default encoding. Note that I think that the ASCII default encoding is essential -- ASCII is the character set used by the Python language for identifiers, and any 8-bit source encoding should always be a superset of ASCII. Essentially, Python has always made the (implicit) guarantee that programs using only the ASCII character set are portable w.r.t. character encodings -- I think this is important. Having no default encoding would be like having no automatic coercion between ints and long ints -- I tried this in very early Python versions (around 0.9.1 I believe) but Tim Peters and/or Steve Majewski quickly dissuaded me from this bad idea. --Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@python.org Tue May 2 15:21:28 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 10:21:28 -0400 Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate In-Reply-To: Your message of "Tue, 02 May 2000 16:11:39 BST." References: Your message of "Tue, 02 May 2000 14:46:44 BST." <200005021231.IAA24249@eric.cnri.reston.va.us> Message-ID: <200005021421.KAA24526@eric.cnri.reston.va.us> [Just] > You're going to have a hard time explaining that "\377" != u"\377". I agree. You are an example of how hard it is to explain: you still don't understand that for a person using CJK encodings this is in fact the truth. > Again, if you define that "all strings are unicode" and that 8-bit strings > contain Unicode characters up to 255, you're all set. Clear semantics, few > surprises, simple implementation, etc. etc. But not all 8-bit strings occurring in programs are Unicode. Ask Moshe. --Guido van Rossum (home page: http://www.python.org/~guido/)

From just@letterror.com Tue May 2 16:38:51 2000 From: just@letterror.com (Just van Rossum) Date: Tue, 2 May 2000 16:38:51 +0100 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <200005021415.KAA24503@eric.cnri.reston.va.us> References: Your message of "Tue, 02 May 2000 15:00:31 BST." Your message of "Tue, 02 May 2000 11:09:11 +0200." Message-ID: [GvR] >Why not Latin-1? Because it gives us Western-alphabet users a false >sense that our code works, where in fact it is broken as soon as you >change the encoding. [Just] > Yeah, and? At least it'll *show* it's broken instead of *silently* doing > the wrong thing with utf-8. > > It's like using Python ints all over the place, and suddenly a user of the > application enters data that causes an integer overflow. Boom. Program > needs to be fixed. What's the big deal? [GvR] >The big deal is that in some cultures, 8-bit strings with non-ASCII >bytes are unlikely to be Latin-1.
Under the Latin-1 convention, they >would get garbage when mixing Unicode and regular strings. They would also get garbage under the utf-8 convention, so again, a moot point. >This is >more like ignoring overflow on integer addition (so that on a 32-bit >machine 2000000000*2 would silently yield -294967296). I am against silently allowing erroneous results >like this if I can help it. As I've explained before, such encoding issues are silent by nature. There's *nothing* you can ever do about it. The silent errors caused by defaulting to utf-8 are far worse. >[Just, in a different message] >> Of course it's not, and of course you shouldn't be counting votes. However, >> the fact that more and more people chime in on the Latin-1 side (even >> non-western-oriented people like Ping and Moshe!) should ring a bell. > >Significantly, neither Ping nor Moshe cares for Latin-1 at all: they >don't have a use for a default encoding. This is because they have no >hope that their preferred encoding would be elected as the default >encoding. Hm, Moshe wrote: """I much prefer the Fredrik-Paul position, known also as the "a character is a character" position, to UTF-8 as the default encoding. Unicode is western-centered -- the first 256 characters are Latin-1. """ And Ping wrote: """If it turns out automatic conversions *are* absolutely necessary, then i vote in favour of the simple, direct method promoted by Paul and Fredrik: just copy the numerical values of the bytes. The fact that this happens to correspond to Latin-1 is not really the point; the main reason is that it satisfies the Principle of Least Surprise. """ I thought that was pretty clear. >Having no default encoding would be like having no automatic coercion >between ints and long ints -- I tried this in very early Python >versions (around 0.9.1 I believe) but Tim Peters and/or Steve Majewski >quickly dissuaded me from this bad idea.

1. Currently utf-8 is the default. Many of us are trying to dissuade you from this bad idea.
2. You propose to *not* provide a default encoding for characters >= 128.
3. Many of us are trying to dissuade you from this bad idea.

(Too bad none of us is called Tim or Steve, or you would've been convinced a long time ago ;-) One additional fact: 8-bit encodings exist that are not even compatible with 7-bit ASCII, making the choice to only compare if it's 7-bit ASCII look even more arbitrary. Guido, maybe you'll believe it from your loving little brother: Guidos are not always right ;-) Just

From just@letterror.com Tue May 2 16:42:24 2000 From: just@letterror.com (Just van Rossum) Date: Tue, 2 May 2000 16:42:24 +0100 Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate In-Reply-To: <200005021421.KAA24526@eric.cnri.reston.va.us> References: Your message of "Tue, 02 May 2000 16:11:39 BST." Your message of "Tue, 02 May 2000 14:46:44 BST." <200005021231.IAA24249@eric.cnri.reston.va.us> Message-ID: >[Just] >> You're going to have a hard time explaining that "\377" != u"\377". > [GvR] >I agree. You are an example of how hard it is to explain: you still >don't understand that for a person using CJK encodings this is in fact >the truth. That depends on the definition of truth: if you document that 8-bit strings are Latin-1, the above is the truth. Conceptually classifying all other 8-bit encodings as binary goop makes the semantics crystal clear.
> >But not all 8-bit strings occurring in programs are Unicode. Ask > >Moshe. I know. They can be anything, even binary goop. But that's *only* an artifact of the fact that 8-bit strings need to double as buffer objects. Just

From just@letterror.com Tue May 2 16:45:01 2000 From: just@letterror.com (Just van Rossum) Date: Tue, 2 May 2000 16:45:01 +0100 Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate In-Reply-To: References: <200005021421.KAA24526@eric.cnri.reston.va.us> Your message of "Tue, 02 May 2000 16:11:39 BST." Your message of "Tue, 02 May 2000 14:46:44 BST." <200005021231.IAA24249@eric.cnri.reston.va.us> Message-ID: I wrote: >That depends on the definition of truth: if you document that 8-bit strings >are Latin-1, the above is the truth. Oops, I meant of course that "\377" == u"\377" is then the truth... Sorry, Just

From guido@python.org Tue May 2 15:53:19 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 10:53:19 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Tue, 02 May 2000 16:38:51 BST." References: Your message of "Tue, 02 May 2000 15:00:31 BST." Your message of "Tue, 02 May 2000 11:09:11 +0200." Message-ID: <200005021453.KAA24717@eric.cnri.reston.va.us> > [GvR] > >The big deal is that in some cultures, 8-bit strings with non-ASCII > >bytes are unlikely to be Latin-1. Under the Latin-1 convention, they > >would get garbage when mixing Unicode and regular strings. [Just] > They would also get garbage under the utf-8 convention, so again, a > moot point. No, because I changed my position! I now propose to make ASCII the default conversion (i.e., characters must be in range(128) to avoid an exception). You are arguing for Latin-1, which gives them silent errors. I *was* arguing for UTF-8, which would give them likely but not guaranteed errors. I *am* now arguing for ASCII, which guarantees them errors (if they are in fact using an encoding). > >This is > >more like ignoring overflow on integer addition (so that on a 32-bit > >machine 2000000000*2 would silently yield -294967296). I am against silently allowing erroneous results > >like this if I can help it. > > As I've explained before, such encoding issues are silent by nature. > There's *nothing* you can ever do about it. The silent errors caused by > defaulting to utf-8 are far worse. Which is why I no longer argue for it. > Hm, Moshe wrote: > """I much prefer the Fredrik-Paul position, known also as the > "a character is a character" position, to UTF-8 as the default encoding. > Unicode is western-centered -- the first 256 characters are Latin-1. > """ And then proceeded to write: "If I'm reading Hebrew from an ISO-8859-8 file, I'll set up a conversion to Unicode on the fly anyway [...]". > And Ping wrote: > """If it turns out automatic conversions *are* absolutely necessary, > then i vote in favour of the simple, direct method promoted by Paul > and Fredrik: just copy the numerical values of the bytes. The fact > that this happens to correspond to Latin-1 is not really the point; > the main reason is that it satisfies the Principle of Least Surprise. > """ > > I thought that was pretty clear. But he first proposed to have no conversions at all. I am now convinced that UTF-8 is bad, and that having no default conversion at all is bad. We need at least ASCII. I claim that we need no more than ASCII. The reason is that Latin-1 is not a safe assumption; ASCII is.
(Unless it's not characters at all -- but usually binary goop contains more than a smattering of bytes in range(128, 256) so it would typically be caught right away.) > >Having no default encoding would be like having no automatic coercion > >between ints and long ints -- I tried this in very early Python > >versions (around 0.9.1 I believe) but Tim Peters and/or Steve Majewski > >quickly dissuaded me from this bad idea. > > 1. Currently utf-8 is the default. Many of us are trying to dissuade you from > this bad idea. I agree. > 2. You propose to *not* provide a default encoding for characters >= 128 Correct. > 3. Many of us are trying to dissuade you from this bad idea. So far you're the only one -- I haven't seen other responses to this idea yet. > (Too bad none of us is called Tim or Steve, or you would've been convinced > a long time ago ;-) > > One additional fact: 8-bit encodings exist that are not even compatible > with 7-bit ASCII, making the choice to only compare if it's 7-bit ASCII > look even more arbitrary. But there's a compelling argument that *requires* ASCII (see previous post), and encodings that are not a superset of ASCII are rare. > Guido, maybe you'll believe it from your loving little brother: Guidos are > not always right ;-) But they listen to reason. I've been convinced that UTF-8 is bad. I'm not convinced that Latin-1 is good, and I'm proposing what I think is a very Pythonic compromise: ASCII, on which we (nearly) all can agree. > >[Just] > >> You're going to have a hard time explaining that "\377" != u"\377". > > > [GvR] > >I agree. You are an example of how hard it is to explain: you still > >don't understand that for a person using CJK encodings this is in fact > >the truth. > > That depends on the definition of truth: if you document that 8-bit strings > are Latin-1, the above is the truth. Conceptually classifying all other 8-bit > encodings as binary goop makes the semantics crystal clear. [and later] > Oops, I meant of course that "\377" == u"\377" is then the truth... I can document that 1==2 but that doesn't make it true. Since we can have binary goop in 8-bit strings, 8-bit strings are NOT always Latin-1. At least until Python 3000. Think about it once more. Why do you really want Latin-1? --Guido van Rossum (home page: http://www.python.org/~guido/)

From just@letterror.com Tue May 2 17:22:06 2000 From: just@letterror.com (Just van Rossum) Date: Tue, 2 May 2000 17:22:06 +0100 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <200005021453.KAA24717@eric.cnri.reston.va.us> References: Your message of "Tue, 02 May 2000 16:38:51 BST." Your message of "Tue, 02 May 2000 15:00:31 BST." Your message of "Tue, 02 May 2000 11:09:11 +0200." Message-ID: At 10:53 AM -0400 02-05-2000, Guido van Rossum wrote: >> As I've explained before, such encoding issues are silent by nature. >> There's *nothing* you can ever do about it. The silent errors caused by >> defaulting to utf-8 are far worse. > >Which is why I no longer argue for it. Yay, progress! >> 1. Currently utf-8 is the default. Many of us are trying to dissuade you from >> this bad idea. > >I agree. > >> 2. You propose to *not* provide a default encoding for characters >= 128 > >Correct. > >> 3. Many of us are trying to dissuade you from this bad idea. > >So far you're the only one -- I haven't seen other responses to this >idea yet. Well, your proposal is very new, and as quite a few others have been backing the Latin-1 proposal, for now I assume they agree with me... But you're right, we'll have to wait and see what they say.
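For concreteness, a hedged sketch of what the ASCII-default compromise under discussion would mean; the comments describe the proposal in this thread, not the behavior of 1.6a2:

    u"spam" + "eggs"      # fine: the 8-bit operand is pure ASCII
    u"spam" + "caf\xe9"   # proposal: raises an exception (byte outside range(128))
    "\377" == u"\377"     # proposal: compares non-equal -- no exception raised
    "\377" < u"\200"      # ordering: an arbitrary but consistent choice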
>> That depends on the definition of truth: if you document that 8-bit strings >> are Latin-1, the above is the truth. Conceptually classifying all other 8-bit >> encodings as binary goop makes the semantics crystal clear. >[and later] >> Oops, I meant of course that "\377" == u"\377" is then the truth... > >I can document that 1==2 but that doesn't make it true. But that's not what I'm proposing! I propose that 1 == 1 and you propose that 1 != 1. See the difference? ;-) >Think about it once more. Why do you really want Latin-1? Because it's the only logical 8-bit subset of Unicode? Providing the least surprises? IMHO you're taking the western-centric argument way over the top: Python is western-centric, Unicode is western-centric. These are simple facts of life that we'll just have to accept. You seem to want to make an (in my view) unnecessary compromise *only* to appear a teensy-weensy bit more politically correct. As I've written before, choosing Latin-1 almost unites 8-bit and Unicode strings. I tend to think that's a good thing. And let's face it, Latin-1 is the ASCII of today. Just

From guido@python.org Tue May 2 16:48:03 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 11:48:03 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Tue, 02 May 2000 17:22:06 BST." References: Your message of "Tue, 02 May 2000 16:38:51 BST." Your message of "Tue, 02 May 2000 15:00:31 BST." Your message of "Tue, 02 May 2000 11:09:11 +0200." Message-ID: <200005021548.LAA25581@eric.cnri.reston.va.us> [Just] > And let's face it, Latin-1 is the ASCII of today. Maybe that's where we disagree. I don't actually have any decent facilities to enter Latin-1 characters -- neither my Unix box nor my Windows box has a Latin-1 key, and I think the Mac over there in the corner has accented characters but it doesn't use Latin-1. Displaying Latin-1 works about 75% of the time on Unix (several of the system fonts only have glyphs for ASCII) and 99% of the time on Windows, but when I save a Word file as text and copy it to Unix, it has non-Latin-1 characters for squiggly quotes and em-dashes... So for me, Latin-1 is Euro-centric more than anything. I wonder what position Latin-1 has in countries like Israel or Japan? I'm pretty sure they support ASCII, since we can exchange email :-) --Guido van Rossum (home page: http://www.python.org/~guido/)

From mal@lemburg.com Tue May 2 16:18:21 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 02 May 2000 17:18:21 +0200 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: Your message of "Tue, 02 May 2000 11:56:21 +0200." <390EA645.89E3B22A@lemburg.com> Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid> <390EA645.89E3B22A@lemburg.com> Message-ID: <390EF1BD.E6C7AF74@lemburg.com> Just van Rossum wrote: > > At 8:30 AM -0400 02-05-2000, Guido van Rossum wrote: > >I think /F's point was that the Unicode standard prescribes different > >behavior here: for UTF-8, a missing or lone continuation byte is an > >error; for Unicode, accents are separate characters that may be > >inserted and deleted in a string but whose display is undefined under > >certain conditions. > > > >(I just noticed that this doesn't work in Tkinter but it does work in > >wish. Strange.) > > > >> FYI: Normalization is needed to make comparing Unicode > >> strings robust, e.g.
u"" should compare equal to u"e\u0301". ^ | Here's a good example of what encoding errors can do: the above character was an "e" with acute accent (u""). Looks like some mailer converted this to some other code page and yet another back to Latin-1 again and this even though the message header for Content-Type clearly states that the document uses ISO-8859-1. > > > >Aha, then we'll see u == v even though type(u) is type(v) and len(u) > >!= len(v). /F's world will collapse. :-) > > Does the Unicode spec *really* specifies u should compare equal to v? The behaviour is needed in order to implement sorting Unicode. See the www.unicode.org site for more information and the tech reports describing this. Note that I haven't mentioned anything about "automatic" normalization. This should be a method on Unicode strings and could then be used in sorting compare callbacks. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Tue May 2 16:55:40 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 02 May 2000 17:55:40 +0200 Subject: [I18n-sig] Re: Unicode debate References: <200005021231.IAA24249@eric.cnri.reston.va.us> Message-ID: <390EFA7B.F6B622F0@lemburg.com> [Guido going ASCII] Do you mean going ASCII all the way (using it for all aspects where Unicode gets converted to a string and cases where strings get converted to Unicode), or just for some aspect of conversion, e.g. just for the silent conversions from strings to Unicode ? [BTW, I'm pretty sure that the Latin-1 folks won't like ASCII for the same reason they don't like UTF-8: it's simply an inconvenient way to write strings in their favorite encoding directly in Python source code. My feeling in this whole discussion is that it's more about convenience than anything else. Still, it's very amusing ;-) ] FYI, here's the conversion table of (potentially) all conversions done by the implementation: Python: ------- string + unicode: unicode(string,'utf-8') + unicode string.method(unicode): unicode(string,'utf-8').method(unicode) print unicode: print unicode.encode('utf-8'); with stdout redirection this can be changed to any other encoding str(unicode): unicode.encode('utf-8') repr(unicode): repr(unicode.encode('unicode-escape')) C (PyArg_ParserTuple): ---------------------- "s" + unicode: same as "s" + unicode.encode('utf-8') "s#" + unicode: same as "s#" + unicode.encode('unicode-internal') "t" + unicode: same as "t" + unicode.encode('utf-8') "t#" + unicode: same as "t#" + unicode.encode('utf-8') This effects all C modules and builtins. In case a C module wants to receive a certain predefined encoding, it can use the new "es" and "es#" parser markers. Ways to enter Unicode: ---------------------- u'' + string same as unicode(string,'utf-8') unicode(string,encname) any supported encoding u'...unicode-escape...' 
    unicode-escape currently accepts Latin-1 chars as single-char input; using escape sequences any Unicode char can be entered (*)
codecs.open(filename,mode,encname)
    opens an encoded file for reading and writing Unicode directly
raw_input() + stdin redirection (see one of my earlier posts for code)
    returns UTF-8 strings based on the input encoding

IO:
---
open(file,'w').write(unicode)
    same as open(file,'w').write(unicode.encode('utf-8'))
open(file,'wb').write(unicode)
    same as open(file,'wb').write(unicode.encode('unicode-internal'))
codecs.open(file,'wb',encname).write(unicode)
    same as open(file,'wb').write(unicode.encode(encname))
codecs.open(file,'rb',encname).read()
    same as unicode(open(file,'rb').read(),encname)
stdin + stdout can be redirected using StreamRecoders to handle any of the supported encodings

-- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

From mal@lemburg.com Tue May 2 16:27:39 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 02 May 2000 17:27:39 +0200 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: <390EB1EE.EA557CA9@lemburg.com> <200005021226.IAA24203@eric.cnri.reston.va.us> Message-ID: <390EF3EB.5BCE9EC3@lemburg.com> Guido van Rossum wrote: > > [MAL] > > Let's not make the same mistake again: Unicode objects should *not* > > be used to hold binary data. Please use buffers instead. > > Easier said than done -- Python doesn't really have a buffer data > type. Or do you mean the array module? It's not trivial to read a > file into an array (although it's possible, there are even two ways). > Fact is, most of Python's standard library and built-in objects use > (8-bit) strings as buffers. > > I agree there's no reason to extend this to Unicode strings. > > > BTW, I think that this behaviour should be changed: > > > > >>> buffer('binary') + 'data' > > 'binarydata' > > > > while: > > > > >>> 'data' + buffer('binary') > > Traceback (most recent call last): > > File "<stdin>", line 1, in ? > > TypeError: illegal argument type for built-in operation > > > > IMHO, buffer objects should never coerce to strings, but instead > > return a buffer object holding the combined contents. The > > same applies to slicing buffer objects: > > > > >>> buffer('binary')[2:5] > > 'nar' > > > > should preferably be buffer('nar'). > > Note that a buffer object doesn't hold data! It's only a pointer to > data. I can't off-hand explain the asymmetry though. Dang, you're right... > > -- > > > > Hmm, perhaps we need something like a data string object > > to get this 100% right ?! > > > > >>> d = data("...data...") > > or > > >>> d = d"...data..." > > >>> print type(d) > > <type 'data'> > > > > >>> 'string' + d > > d"string...data..." > > >>> u'string' + d > > d"s\000t\000r\000i\000n\000g\000...data..." > > > > >>> d[:5] > > d"...da" > > > > etc. > > > > Ideally, string and Unicode objects would then be subclasses > > of this type in Py3K. > > Not clear. I'd rather do the equivalent of byte arrays in Java, for > which no "string literal" notations exist. Anyway, one way or another I think we should make it clear to users that they should start using some other type for storing binary data. -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

From mal@lemburg.com Tue May 2 16:24:24 2000 From: mal@lemburg.com (M.-A.
Lemburg) Date: Tue, 02 May 2000 17:24:24 +0200 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> Message-ID: <390EF327.86D8C3D8@lemburg.com> Just van Rossum wrote: > > At 10:36 AM +0200 02-05-2000, M.-A. Lemburg wrote: > >Just a small note on the subject of a character being atomic > >which seems to have been forgotten by the discussing parties: > > > >Unicode itself can be understood as a multi-word character > >encoding, just like UTF-8. The reason is that Unicode entities > >can be combined to produce single display characters (e.g. > >u"e"+u"\u0301" will print "é" in a Unicode-aware renderer). > > Erm, are you sure Unicode prescribes this behavior, for this > example? I know similar behaviors are specified for certain > languages/scripts, but I didn't know it did that for Latin. The details are on the www.unicode.org web-site, buried in some of the tech reports on normalization and collation. > >Slicing such a combined Unicode string will have the same > >effect as slicing UTF-8 data. > > Not true. As Fredrik noted: no exception will be raised. Huh ? You will always get an exception when you convert a broken UTF-8 sequence to Unicode. This is by design of UTF-8 itself, which uses the top bit to identify multi-byte character encodings. Or can you give an example (perhaps you've found a bug that needs fixing) ? > [ Speaking of exceptions, > > after I sent off my previous post I realized Guido's > non-utf8-strings-interpreted-as-utf8-will-often-raise-an-exception > argument can easily be turned around, backfiring at utf-8: > > Defaulting to utf-8 when going from Unicode to 8-bit and > back only gives the *illusion* things "just work", since it > will *silently* "work", even if utf-8 is *not* the desired > 8-bit encoding -- as shown by Fredrik's excellent "fun with > Unicode, part 1" example. Defaulting to Latin-1 will > warn the user *much* earlier, since it'll barf when > converting a Unicode string that contains any character > code > 255. So there. > ] > > >Most Latin-1 proponents seem to have single > >display characters in mind. While the same is true for > >many Unicode entities, there are quite a few cases of > >combining characters in Unicode 3.0 and the Unicode > >normalization algorithm uses these as the basis for its > >work. > > Still, two combining characters are still two input characters for > the renderer! They may result in one *glyph*, but trust me, > that's an entirely different can of worms. No. Please see my other post on the subject... > However, if you'd be talking about Unicode surrogates, > you'd definitely have a point. How do Java/Perl/Tcl deal with > surrogates? Good question... anybody know the answers ? -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

From just@letterror.com Tue May 2 18:33:56 2000 From: just@letterror.com (Just van Rossum) Date: Tue, 2 May 2000 18:33:56 +0100 Subject: [I18n-sig] Re: Unicode debate In-Reply-To: <390EFA7B.F6B622F0@lemburg.com> References: <200005021231.IAA24249@eric.cnri.reston.va.us> Message-ID: At 5:55 PM +0200 02-05-2000, M.-A.
From just@letterror.com Tue May 2 18:33:56 2000 From: just@letterror.com (Just van Rossum) Date: Tue, 2 May 2000 18:33:56 +0100 Subject: [I18n-sig] Re: Unicode debate In-Reply-To: <390EFA7B.F6B622F0@lemburg.com> References: <200005021231.IAA24249@eric.cnri.reston.va.us> Message-ID:

At 5:55 PM +0200 02-05-2000, M.-A. Lemburg wrote:
>[BTW, I'm pretty sure that the Latin-1 folks won't like
>ASCII for the same reason they don't like UTF-8: it's
>simply an inconvenient way to write strings in their favorite
>encoding directly in Python source code. My feeling in this
>whole discussion is that it's more about convenience than
>anything else. Still, it's very amusing ;-) ]

For the record, I don't want Latin-1 because it's my favorite encoding. It isn't. Guido's right: I can't even *use* it directly on my platform. I want it *only* because it's the most logical 8-bit subset of Unicode -- as we have stated over and over and over and over again. What's so hard to understand about this?

Just

From paul@prescod.net Tue May 2 17:11:13 2000 From: paul@prescod.net (Paul Prescod) Date: Tue, 02 May 2000 11:11:13 -0500 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> Message-ID: <390EFE21.DAD7749B@prescod.net>

Combining characters are a whole 'nother level of complexity. Character sets are hard. I don't accept the argument that "Unicode itself has complexities so that gives us license to introduce even more complexities at the character representation level."

> FYI: Normalization is needed to make comparing Unicode
> strings robust, e.g. u"é" should compare equal to u"e\u0301".

That's a whole 'nother debate at a whole 'nother level of abstraction. I think we need to get the bytes/characters level right and then we can worry about display-equivalent characters (or leave that to the Python programmer to figure out...).

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on.
  - http://www.cs.yale.edu/~perlis-alan/quotes.html

From paul@prescod.net Tue May 2 17:13:00 2000 From: paul@prescod.net (Paul Prescod) Date: Tue, 02 May 2000 11:13:00 -0500 Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate References: Your message of "Tue, 02 May 2000 14:46:44 BST." <200005021231.IAA24249@eric.cnri.reston.va.us> <200005021421.KAA24526@eric.cnri.reston.va.us> Message-ID: <390EFE8C.4C10473C@prescod.net>

Guido van Rossum wrote:
>
> ...
>
> But not all 8-bit strings occurring in programs are Unicode. Ask
> Moshe.

Where are we going? What's our long-range vision?

Three years from now where will we be?

 1. How will we handle characters?
 2. How will we handle bytes?
 3. What will unadorned literal strings "do"?
 4. Will literal strings be the same type as byte arrays?

I don't see how we can make decisions today without a vision for the future. I think that this is the central point in our disagreement. Some of us are aiming for as much compatibility with where we think we should be going and others are aiming for as much compatibility as possible with where we came from.

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on.
  - http://www.cs.yale.edu/~perlis-alan/quotes.html

From just@letterror.com Tue May 2 18:37:09 2000 From: just@letterror.com (Just van Rossum) Date: Tue, 2 May 2000 18:37:09 +0100 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <390EF327.86D8C3D8@lemburg.com> References: Your message of "Mon, 01 May 2000 21:55:25 EDT."
<002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> Message-ID:

At 5:24 PM +0200 02-05-2000, M.-A. Lemburg wrote:
>> Still, two combining characters are still two input characters for
>> the renderer! They may result in one *glyph*, but trust me,
>> that's an entirely different can of worms.
>
>No. Please see my other post on the subject...

It would help if you'd post some actual doco.

Just

From just@letterror.com Tue May 2 18:46:25 2000 From: just@letterror.com (Just van Rossum) Date: Tue, 2 May 2000 18:46:25 +0100 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <200005021548.LAA25581@eric.cnri.reston.va.us> References: Your message of "Tue, 02 May 2000 17:22:06 BST." Your message of "Tue, 02 May 2000 16:38:51 BST." Your message of "Tue, 02 May 2000 15:00:31 BST." Your message of "Tue, 02 May 2000 11:09:11 +0200." Message-ID:

>[Just]
>> And let's face it, Latin-1 is the ASCII of today.

At 11:48 AM -0400 02-05-2000, Guido van Rossum wrote:
>Maybe that's where we disagree. I don't actually have any decent
>facilities to enter Latin-1 characters -- neither my Unix box nor my
>Windows box has a Latin-1 key, and I think the Mac over there in the
>corner has accented characters but it doesn't use Latin-1.

Ok ok, I may have exaggerated... I take it back. So that's not where we disagree ;-)

>Displaying Latin-1 works about 75% of the time on Unix (several of the
>system fonts only have glyphs for ASCII) and 99% of the time on
>Windows, but when I save a Word file as text and copy it to Unix, it
>has non-Latin-1 characters for squiggly quotes and em-dashes...
>
>So for me, Latin-1 is Euro-centric more than anything.

And for me, ASCII is US-centric more than anything...

>I wonder what position Latin-1 has in countries like Israel or Japan?
>I'm pretty sure they support ASCII, since we can exchange email :-)

Good point. Still I'd find this hopelessly ugly:

    >>> "\377" == u"\377"
    0
    >>> ord("\377") == ord(u"\377")
    1
    >>>

You're creating a sense of safety that's false. Many encoding issues remain silent problems, no matter what you do. It seems you're helping by restricting yourself to 7-bit ASCII, but you're just pushing the problem into more obscure corners.

Just

From paul@prescod.net Tue May 2 17:25:33 2000 From: paul@prescod.net (Paul Prescod) Date: Tue, 02 May 2000 11:25:33 -0500 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> <009701bfb414$d35d0ea0$34aab5d4@hagrid> <390EA645.89E3B22A@lemburg.com> <200005021230.IAA24232@eric.cnri.reston.va.us> Message-ID: <390F017C.91C7A8A0@prescod.net>

Guido van Rossum wrote:
>
> Aha, then we'll see u == v even though type(u) is type(v) and len(u)
> != len(v). /F's world will collapse. :-)

There are many levels of equality that are interesting. I don't think we would move to grapheme equivalence until "the rest of the world" (XML, Java, W3C, SQL) did. If we were going to move to grapheme equivalence (some day), the right way would be to normalize characters in the construction of the Unicode string. This is known as "Early normalization":

http://www.w3.org/TR/charmod/#NormalizationApplication

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on.
  - http://www.cs.yale.edu/~perlis-alan/quotes.html

From tree@basistech.com Tue May 2 18:14:24 2000 From: tree@basistech.com (Tom Emerson) Date: Tue, 2 May 2000 13:14:24 -0400 (EDT) Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <390EF327.86D8C3D8@lemburg.com> References: <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390EF327.86D8C3D8@lemburg.com> Message-ID: <14607.3312.660077.42872@cymru.basistech.com>

M.-A. Lemburg writes:
> The details are on the www.unicode.org web-site buried
> in some of the tech reports on normalization and
> collation.

This is described in the Unicode standard itself, and in UTR #15 and UTR #10.

Normalization is an issue with wider implications than just handling glyph variants: indeed, glyph variance is irrelevant here. The question is this: should

    U+00DC LATIN CAPITAL LETTER U WITH DIAERESIS

compare equal to

    U+0055 LATIN CAPITAL LETTER U
    U+0308 COMBINING DIAERESIS

or not? It depends on the application. Certainly in a database system I would want these to compare equal. Perhaps normalization form needs to be an option of the string comparator?

-tree

--
Tom Emerson    Basis Technology Corp.    Language Hacker
http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"

From paul@prescod.net Tue May 2 19:23:24 2000 From: paul@prescod.net (Paul Prescod) Date: Tue, 02 May 2000 13:23:24 -0500 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> Message-ID: <390F1D1C.6EAF7EAD@prescod.net>

Guido van Rossum wrote:
>
> ....
>
> Have you tried using this?

Yes. I haven't had large problems with it. As long as you know what is going on, it doesn't usually hurt anything because you can just explicitly set up the decoding you want. It's like the int division problem. You get bitten a few times and then get careful.

It's the naive user who will be surprised by these random UTF-8 decoding errors.

That's why this is NOT a convenience issue (are you listening MAL???). It's a short- and long-term simplicity issue. There are lots of languages where it is de rigueur to discover and work around inconvenient and confusing default behaviors. I just don't think that we should be ADDING such behaviors.

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on.
  - http://www.cs.yale.edu/~perlis-alan/quotes.html

From guido@python.org Tue May 2 19:56:34 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 14:56:34 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Tue, 02 May 2000 13:23:24 CDT."
<390F1D1C.6EAF7EAD@prescod.net> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> Message-ID: <200005021856.OAA26104@eric.cnri.reston.va.us>

> It's the naive user who will be surprised by these random UTF-8 decoding
> errors.
>
> That's why this is NOT a convenience issue (are you listening MAL???).
> It's a short- and long-term simplicity issue. There are lots of languages
> where it is de rigueur to discover and work around inconvenient and
> confusing default behaviors. I just don't think that we should be ADDING
> such behaviors.

So what do you think of my new proposal of using ASCII as the default "encoding"? It takes care of "a character is a character" but also (almost) guarantees an error message when mixing encoded 8-bit strings with Unicode strings without specifying an explicit conversion -- *any* 8-bit byte with the top bit set is rejected by the default conversion to Unicode.

I think this is less confusing than Latin-1: when an unsuspecting user is reading encoded text from a file into 8-bit strings and attempts to use it in a Unicode context, an error is raised instead of producing garbage Unicode characters.

It encourages the use of Unicode strings for everything beyond ASCII -- there's no way around ASCII since that's the source encoding etc., but Latin-1 is an inconvenient default in most parts of the world. ASCII is accepted everywhere as the base character set (e.g. for email and for text-based protocols like FTP and HTTP), just like English is the one natural language that we can all sue to communicate (to some extent).

--Guido van Rossum (home page: http://www.python.org/~guido/)

From dieter@handshake.de Tue May 2 19:44:41 2000 From: dieter@handshake.de (Dieter Maurer) Date: Tue, 2 May 2000 20:44:41 +0200 (CEST) Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <390E1F08.EA91599E@prescod.net> References: <390E1F08.EA91599E@prescod.net> Message-ID: <14607.7798.510723.419556@lindm.dm>

Paul Prescod writes:
> The fact that my proposal has the same effect as making Latin-1 the
> "default encoding" is a near-term side effect of the definition of
> Unicode. My long term proposal is to do away with the concept of 8-bit
> strings (and thus, conversions from 8-bit to Unicode) altogether. One
> string to rule them all!

Why must this be a long term proposal? I would find it quite attractive if:

 * the old string type became an immutable list of bytes
 * automatic conversion between byte lists and unicode strings
   was performed via user customizable conversion functions
   (a la __import__).

Dieter
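[Concretely, Guido's ASCII-default proposal would make mixing behave
as follows -- a sketch with an arbitrary Latin-1 byte, matching how
Python 2.x in fact ended up behaving:

    assert u"abc" + "def" == u"abcdef"   # pure-ASCII bytes mix silently

    try:
        u"abc" + "d\351f"                # a byte with the top bit set
    except UnicodeError:                 # default conversion refuses to guess
        print "mixing non-ASCII bytes with Unicode raises an error"
]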
From paul@prescod.net Tue May 2 20:01:32 2000 From: paul@prescod.net (Paul Prescod) Date: Tue, 02 May 2000 14:01:32 -0500 Subject: [I18n-sig] Unicode compromise? References: <200005021231.IAA24249@eric.cnri.reston.va.us> Message-ID: <390F260C.2314F97E@prescod.net>

Guido van Rossum wrote:
> > > No automatic conversions between 8-bit "strings" and Unicode strings.
> > >
> > If you want to turn UTF-8 into a Unicode string, say so.
> > If you want to turn Latin-1 into a Unicode string, say so.
> > If you want to turn ISO-2022-JP into a Unicode string, say so.
> > Adding a Unicode string and an 8-bit "string" gives an exception.
>
> I'd accept this, with one change: mixing Unicode and 8-bit strings is
> okay when the 8-bit strings contain only ASCII (byte values 0 through
> 127).

I could live with this compromise as long as we document that a future version may use the "character is a character" model. I just don't want people to start depending on a catchable exception being thrown because that would stop us from ever unifying unmarked literal strings and Unicode strings.

--

Are there any steps we could take to make a future divorce of strings and byte arrays easier? What if we added a

    binary_read()

function that returns some form of byte array. The byte array type could be just like today's string type except that its type object would be distinct, it wouldn't have as many string-ish methods and it wouldn't have any auto-conversion to Unicode at all.

People could start to transition code that reads non-ASCII data to the new function. We could put big warning labels on read() to state that it might not always be able to read data that is not in some small set of recognized encodings (probably UTF-8 and UTF-16).

Or perhaps binary_open(). Or perhaps both.

I do not suggest just using the text/binary flag on the existing open function because we cannot immediately change its behavior without breaking code.

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on.
  - http://www.cs.yale.edu/~perlis-alan/quotes.html

From jkraai@murlmail.com Tue May 2 20:46:49 2000 From: jkraai@murlmail.com (jkraai@murlmail.com) Date: Tue, 2 May 2000 14:46:49 -0500 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate Message-ID: <200005021946.OAA03609@www.polytopic.com>

The ever quotable Guido:

> English is the one natural language that we can all sue to communicate

From paul@prescod.net Tue May 2 20:23:27 2000 From: paul@prescod.net (Paul Prescod) Date: Tue, 02 May 2000 14:23:27 -0500 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> Message-ID: <390F2B2F.2953C72D@prescod.net>

Guido van Rossum wrote:
>
> ...
>
> So what do you think of my new proposal of using ASCII as the default
> "encoding"?

I can live with it. I am mildly uncomfortable with the idea that I could write a whole bunch of software that works great until some European inserts one of their name characters. Nevertheless, being hard-assed is better than being permissive because we can loosen up later.

What do we do about str( my_unicode_string )? Perhaps escape the Unicode characters with backslashed numbers?

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only
communication coin we can count on.
  - http://www.cs.yale.edu/~perlis-alan/quotes.html

From guido@python.org Tue May 2 20:58:20 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 15:58:20 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Tue, 02 May 2000 14:23:27 CDT." <390F2B2F.2953C72D@prescod.net> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> Message-ID: <200005021958.PAA26760@eric.cnri.reston.va.us>

[me]
> > So what do you think of my new proposal of using ASCII as the default
> > "encoding"?

[Paul]
> I can live with it. I am mildly uncomfortable with the idea that I could
> write a whole bunch of software that works great until some European
> inserts one of their name characters.

Better that than when some Japanese insert *their* name characters and it produces gibberish instead.

> Nevertheless, being hard-assed is
> better than being permissive because we can loosen up later.

Exactly -- just as nobody should *count* on 10**10 raising OverflowError, nobody (except maybe parts of the standard library :-) should *count* on unicode("\347") raising ValueError. I think that's fine.

> What do we do about str( my_unicode_string )? Perhaps escape the Unicode
> characters with backslashed numbers?

Hm, good question. Tcl displays unknown characters as \x or \u escapes. I think this may make more sense than raising an error.

But there must be a way to turn on Unicode-awareness on e.g. stdout and then printing a Unicode object should not use str() (as it currently does).

--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@python.org Tue May 2 21:47:30 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 16:47:30 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode compromise? In-Reply-To: Your message of "Tue, 02 May 2000 14:01:32 CDT." <390F260C.2314F97E@prescod.net> References: <200005021231.IAA24249@eric.cnri.reston.va.us> <390F260C.2314F97E@prescod.net> Message-ID: <200005022047.QAA26828@eric.cnri.reston.va.us>

> I could live with this compromise as long as we document that a future
> version may use the "character is a character" model. I just don't want
> people to start depending on a catchable exception being thrown because
> that would stop us from ever unifying unmarked literal strings and
> Unicode strings.

Agreed (as I've said before).

> --
>
> Are there any steps we could take to make a future divorce of strings
> and byte arrays easier? What if we added a
>
>     binary_read()
>
> function that returns some form of byte array. The byte array type could
> be just like today's string type except that its type object would be
> distinct, it wouldn't have as many string-ish methods and it wouldn't
> have any auto-conversion to Unicode at all.

You can do this now with the array module, although clumsily:

    >>> import array
    >>> f = open("/core", "rb")
    >>> a = array.array('B', [0]) * 1000
    >>> f.readinto(a)
    1000
    >>>

Or if you wanted to read raw Unicode (UTF-16):

    >>> a = array.array('H', [0]) * 1000
    >>> f.readinto(a)
    2000
    >>> u = unicode(a, "utf-16")
    >>>

There are some performance issues, e.g.
you have to initialize the buffer somehow and that seems a bit wasteful.

> People could start to transition code that reads non-ASCII data to the
> new function. We could put big warning labels on read() to state that it
> might not always be able to read data that is not in some small set of
> recognized encodings (probably UTF-8 and UTF-16).
>
> Or perhaps binary_open(). Or perhaps both.
>
> I do not suggest just using the text/binary flag on the existing open
> function because we cannot immediately change its behavior without
> breaking code.

A new method makes most sense -- there are definitely situations where you want to read in text mode for a while and then switch to binary mode (e.g. HTTP). I'd like to put this off until after Python 1.6 -- but it deserves attention.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From mal@lemburg.com Wed May 3 00:11:37 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 03 May 2000 01:11:37 +0200 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> Message-ID: <390F60A9.A3AA53A9@lemburg.com>

Guido van Rossum wrote:
>
> > > So what do you think of my new proposal of using ASCII as the default
> > > "encoding"?

How about using unicode-escape or raw-unicode-escape as default encoding ? (They would have to be adapted to disallow Latin-1 char input, though.)

The advantage would be that they are compatible with ASCII while still providing loss-less conversion and since they use escape characters, you can even read them using an ASCII based editor.

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
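[A quick sketch of what MAL's unicode-escape suggestion buys: the
encoded form is plain ASCII, yet any character survives the round
trip:

    u = u"ab\u1234\n"
    s = u.encode("unicode-escape")        # -> 'ab\\u1234\\n', pure ASCII
    assert unicode(s, "unicode-escape") == u
]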
From mal@lemburg.com Wed May 3 00:05:28 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 03 May 2000 01:05:28 +0200 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> <390EFE21.DAD7749B@prescod.net> Message-ID: <390F5F38.DD76CAF4@lemburg.com>

Paul Prescod wrote:
>
> Combining characters are a whole 'nother level of complexity. Character
> sets are hard. I don't accept the argument that "Unicode itself has
> complexities so that gives us license to introduce even more
> complexities at the character representation level."
>
> > FYI: Normalization is needed to make comparing Unicode
> > strings robust, e.g. u"é" should compare equal to u"e\u0301".
>
> That's a whole 'nother debate at a whole 'nother level of abstraction. I
> think we need to get the bytes/characters level right and then we can
> worry about display-equivalent characters (or leave that to the Python
> programmer to figure out...).

I just wanted to point out that the argument "slicing doesn't work with UTF-8" is moot.

I do see a point against UTF-8 auto-conversion given the example that Guido mailed me:

"""
s = 'ab\341\210\264def'   # == str(u"ab\u1234def")
s.find(u"def")

This prints 3 -- the wrong result since "def" is found at
s[5:8], not at s[3:6].
"""

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From guido@python.org Wed May 3 03:31:21 2000 From: guido@python.org (Guido van Rossum) Date: Tue, 02 May 2000 22:31:21 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Wed, 03 May 2000 01:11:37 +0200." <390F60A9.A3AA53A9@lemburg.com> References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> <390F60A9.A3AA53A9@lemburg.com> Message-ID: <200005030231.WAA02678@eric.cnri.reston.va.us>

> Guido van Rossum wrote:
> > > > So what do you think of my new proposal of using ASCII as the default
> > > > "encoding"?

[MAL]
> How about using unicode-escape or raw-unicode-escape as
> default encoding ? (They would have to be adapted to disallow
> Latin-1 char input, though.)
>
> The advantage would be that they are compatible with ASCII
> while still providing loss-less conversion and since they
> use escape characters, you can even read them using an
> ASCII based editor.

No, the backslash should mean itself when encoding from ASCII to Unicode.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From esr@thyrsus.com Wed May 3 04:22:20 2000 From: esr@thyrsus.com (Eric S. Raymond) Date: Tue, 2 May 2000 23:22:20 -0400 Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate In-Reply-To: <390EFE8C.4C10473C@prescod.net>; from paul@prescod.net on Tue, May 02, 2000 at 11:13:00AM -0500 References: <200005021231.IAA24249@eric.cnri.reston.va.us> <200005021421.KAA24526@eric.cnri.reston.va.us> <390EFE8C.4C10473C@prescod.net> Message-ID: <20000502232220.B18638@thyrsus.com>

Paul Prescod :
> Where are we going? What's our long-range vision?
>
> Three years from now where will we be?
>
> 1. How will we handle characters?
> 2. How will we handle bytes?
> 3. What will unadorned literal strings "do"?
> 4. Will literal strings be the same type as byte arrays?
>
> I don't see how we can make decisions today without a vision for the
> future. I think that this is the central point in our disagreement. Some
> of us are aiming for as much compatibility with where we think we should
> be going and others are aiming for as much compatibility as possible
> with where we came from.

And *that* is the most insightful statement I have seen in this entire foofaraw (which I have carefully been staying right the hell out of). Everybody meditate on the above, please. Then declare your objectives *at this level* so our Fearless Leader can make an informed decision *at this level*. Only then will it make sense to argue encoding theology...

--
Eric S. Raymond
"Extremism in the defense of liberty is no vice; moderation in the pursuit of justice is no virtue."
  -- Barry Goldwater (actually written by Karl Hess)

From tim_one@email.msn.com Wed May 3 06:05:59 2000 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 3 May 2000 01:05:59 -0400 Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate In-Reply-To: <200005021400.KAA24464@eric.cnri.reston.va.us> Message-ID: <000301bfb4bd$463ec280$622d153f@tim>

[Guido]
> When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
> bytes in either should make the comparison fail; when ordering is
> important, we can make an arbitrary choice e.g. "\377" < u"\200".

[Toby]
> I assume 'fail' means 'non-equal', rather than 'raises an exception'?

[Guido]
> Yes, sorry for the ambiguity.

Huh! You sure about that? If we're setting up a case where meaningful comparison is impossible, isn't an exception more appropriate? The current

    >>> 83479278 < "42"
    1
    >>>

probably traps more people than it helps.

From tim_one@email.msn.com Wed May 3 06:19:28 2000 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 3 May 2000 01:19:28 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <017d01bfb3bc$c3734c00$34aab5d4@hagrid> Message-ID: <000401bfb4bf$27ec1600$622d153f@tim>

[Fredrik Lundh]
> ...
> (if you like, I can post more "fun with unicode" messages ;-)

By all means! Exposing a gotcha to ridicule does more good than a dozen abstract arguments. But next time stoop to explaining what it is that's surprising .

From just@letterror.com Wed May 3 07:47:07 2000 From: just@letterror.com (Just van Rossum) Date: Wed, 3 May 2000 07:47:07 +0100 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <390F5F38.DD76CAF4@lemburg.com> References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> <390EFE21.DAD7749B@prescod.net> Message-ID:

[MAL vs. PP]
>> > FYI: Normalization is needed to make comparing Unicode
>> > strings robust, e.g. u"é" should compare equal to u"e\u0301".
>>
>> That's a whole 'nother debate at a whole 'nother level of abstraction. I
>> think we need to get the bytes/characters level right and then we can
>> worry about display-equivalent characters (or leave that to the Python
>> programmer to figure out...).
>
>I just wanted to point out that the argument "slicing doesn't
>work with UTF-8" is moot.

And failed...

I asked two Unicode gurus I happen to know about the normalization issue (which is indeed not relevant to the current discussion, but it's fascinating nevertheless!).

(Sorry about the possibly wrong email encoding... "è" is u"\350", "ö" is u"\366")

John Jenkins replied:

"""
Well, I'm not sure you want to hear the answer -- but it really depends on what the language is attempting to do.

By and large, Unicode takes the position that "e`" should always be treated the same as "è". This is a *semantic* equivalence -- that is, they *mean* the same thing -- and doesn't depend on the display engine to be true. Unicode also provides a default collation algorithm (http://www.unicode.org/unicode/reports/tr10/).

At the same time, the standard acknowledges that in real life, string comparison and collation are complicated, language-specific problems requiring a lot of work and interaction with the user to do right.

From the perspective of a programming language, it would best be served IMHO by implementing the contents of TR10 for string comparison and collation. That would make "e`" and "è" come out as equivalent.
""" Dave Opstad replied: """ Unicode talks about "canonical decomposition" in order to make it easier to answer questions like yours. Specifically, in the Unicode 3.0 standard, rule D24 in section 3.6 (page 44) states that: "Two character sequences are said to be canonical equivalents if their full canonical decompositions are identical. For example, the sequences and <=F6> are canonical equivalents. Canonical equivalence is a Unicode propert. It should not be confused with language-specific collation or matching, which may add additional equivalencies." So they still have language-specific differences, even if Unicode sees them as canonically equivalent. You might want to check this out: http://www.unicode.org/unicode/reports/tr15/tr15-18.html It's the latest technical report on these issues, which may help clarify things further. """ It's very deep stuff, which seems more appropriate for an extension than for builtin comparisons to me. Just From tim_one@email.msn.com Wed May 3 06:47:37 2000 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 3 May 2000 01:47:37 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Message-ID: <000501bfb4c3$16743480$622d153f@tim> [Moshe Zadka] > ... > I'd much prefer Python to reflect a fundamental truth about Unicode, > which at least makes sure binary-goop can pass through Unicode and > remain unharmed, then to reflect a nasty problem with UTF-8 (not > everything is legal). Then you don't want Unicode at all, Moshe. All the official encoding schemes for Unicode 3.0 suffer illegal byte sequences (for example, 0xffff is illegal in UTF-16 (whether BE or LE); this isn't merely a matter of Unicode not yet having assigned a character to this position, it's that the standard explicitly makes this sequence illegal and guarantees it will always be illegal! the other place this comes up is with surrogates, where what's legal depends on both parts of a character pair; and, again, the illegalities here are guaranteed illegal for all time). UCS-4 is the closest thing to binary-transparent Unicode encodings get, but even there the length of a thing is contrained to be a multiple of 4 bytes. Unicode and binary goop will never coexist peacefully. From ping@lfw.org Wed May 3 06:56:12 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Tue, 2 May 2000 22:56:12 -0700 (PDT) Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate In-Reply-To: <000301bfb4bd$463ec280$622d153f@tim> Message-ID: On Wed, 3 May 2000, Tim Peters wrote: > [Toby] > > I assume 'fail' means 'non-equal', rather than 'raises an exception'? > > [Guido] > > Yes, sorry for the ambiguity. > > Huh! You sure about that? If we're setting up a case where meaningful > comparison is impossible, isn't an exception more appropriate? The current > > >>> 83479278 < "42" > 1 > > probably traps more people than it helps. Yeah, when i said No automatic conversions between Unicode strings and 8-bit "strings". i was about to say Raise an exception on any operation attempting to combine or compare Unicode strings and 8-bit "strings". ...and then i thought, oh crap, but everything in Python is supposed to be comparable. What happens when you have some lists with arbitrary objects in them and you want to sort them for printing, or to canonicalize them so you can compare? It might be too troublesome for list.sort() to throw an exception because e.g. strings and ints were incomparable, or 8-bit "strings" and Unicode strings were incomparable... So -- what's the philosophy, Guido? 
Are we committed to "everything is comparable" (well, "all built-in types are comparable") or not?

-- ?!ng

From tim_one@email.msn.com Wed May 3 07:40:54 2000 From: tim_one@email.msn.com (Tim Peters) Date: Wed, 3 May 2000 02:40:54 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Message-ID: <000701bfb4ca$87b765c0$622d153f@tim>

[MAL]
> I just wanted to point out that the argument "slicing doesn't
> work with UTF-8" is moot.

[Just]
> And failed...

He succeeded for me. Blind slicing doesn't always "work right" no matter what encoding you use, because "work right" depends on semantics beyond the level of encoding. UTF-8 is no worse than anything else in this respect.

From just@letterror.com Wed May 3 08:50:11 2000 From: just@letterror.com (Just van Rossum) Date: Wed, 3 May 2000 08:50:11 +0100 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <000701bfb4ca$87b765c0$622d153f@tim> References: Message-ID:

[MAL]
> I just wanted to point out that the argument "slicing doesn't
> work with UTF-8" is moot.

[Just]
> And failed...

[Tim]
>He succeeded for me. Blind slicing doesn't always "work right" no matter
>what encoding you use, because "work right" depends on semantics beyond the
>level of encoding. UTF-8 is no worse than anything else in this respect.

But the discussion *was* at the level of encoding! Still it is worse, since an arbitrary utf-8 slice may result in two illegal strings -- slicing "e`" results in two perfectly legal strings, at the encoding level. Had he used surrogates as an example, he would've been right... (But even that is an encoding issue.)

Just

From Fredrik Lundh" Message-ID: <00b201bfb4d3$07a95420$34aab5d4@hagrid>

Ka-Ping Yee wrote:
> So -- what's the philosophy, Guido? Are we committed to "everything
> is comparable" (well, "all built-in types are comparable") or not?

in 1.6a2, obviously not:

    >>> aUnicodeString < an8bitString
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeError: UTF-8 decoding error: unexpected code byte

in 1.6a3, maybe.

From Fredrik Lundh" Message-ID: <00ce01bfb4d4$0a7d1820$34aab5d4@hagrid>

Tim Peters wrote:
> [Moshe Zadka]
> > ...
> > I'd much prefer Python to reflect a fundamental truth about Unicode,
> > which at least makes sure binary-goop can pass through Unicode and
> > remain unharmed, than to reflect a nasty problem with UTF-8 (not
> > everything is legal).
>
> Then you don't want Unicode at all, Moshe. All the official encoding
> schemes for Unicode 3.0 suffer illegal byte sequences (for example, 0xffff
> is illegal in UTF-16 (whether BE or LE); this isn't merely a matter of
> Unicode not yet having assigned a character to this position, it's that the
> standard explicitly makes this sequence illegal and guarantees it will
> always be illegal!

in context, I think what Moshe meant was that with a straight character code mapping, any 8-bit string can always be mapped to a unicode string and back again. given a byte array "b":

    u = unicode(b, "default")
    assert map(ord, u) == map(ord, b)

again, this is no different from casting an integer to a long integer and back again. (imagine having to do that on the bits and bytes level!). and again, the internal unicode encoding used by the unicode string type itself, or when serializing that string type, has nothing to do with that.
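[Fredrik's round-trip property can be checked today by substituting
the existing latin-1 codec for the hypothetical "default" codec; a
sketch:

    b = ''.join(map(chr, range(256)))   # all 256 byte values

    u = unicode(b, "latin-1")           # bytes -> code points 0..255
    assert map(ord, u) == map(ord, b)   # ordinals preserved exactly
    assert u.encode("latin-1") == b     # and back, byte for byte
]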
From just@letterror.com Wed May 3 10:03:16 2000 From: just@letterror.com (Just van Rossum) Date: Wed, 3 May 2000 10:03:16 +0100 Subject: [I18n-sig] Unicode comparisons & normalization Message-ID:

After quickly browsing through the unicode.org URLs I posted earlier, I reach the following (possibly wrong) conclusions:

- there is a script and language independent canonical form (but
  automatic normalization is indeed a bad idea)
- ideally, unicode comparisons should follow the rules from
  http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly
  realistic for 1.6, if at all...)
- this would indeed mean that it's possible for u == v even though
  type(u) is type(v) and len(u) != len(v). However, I don't see how this
  would collapse /F's world, as the two strings are at most semantically
  equivalent. Their physical difference is real, and still follows the
  a-string-is-a-sequence-of-characters rule (!).
- there may be additional customized language-specific sorting rules. I
  currently don't see how to implement that without some global variable.
- the sorting rules are very complicated, and should be implemented by
  calculating "sort keys". If I understood it correctly, these can take
  up to 4 bytes per character in its most compact form. Still, for it to
  be somewhat speed-efficient, they need to be cached...
- u.find() may need an alternative API, which returns a (begin, end)
  tuple, since the match may not have the same length as the search
  string... (This is tricky, since you need the begin and end indices in
  the non-canonical form...)

Just

From Fredrik Lundh" <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> Message-ID: <013c01bfb4d6$da19fb00$34aab5d4@hagrid>

Guido van Rossum wrote:
> > What do we do about str( my_unicode_string )? Perhaps escape the Unicode
> > characters with backslashed numbers?
>
> Hm, good question. Tcl displays unknown characters as \x or \u
> escapes. I think this may make more sense than raising an error.

but that's on the display side of things, right? similar to repr, in other words.

> But there must be a way to turn on Unicode-awareness on e.g. stdout
> and then printing a Unicode object should not use str() (as it
> currently does).

to throw some extra gasoline on this, how about allowing str() to return unicode strings?

(extra questions: how about renaming "unicode" to "string", and getting rid of "unichr"?)

count to ten before replying, please.
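[For completeness: the canonical-equivalence comparison Just describes
did later grow a standard-library API -- unicodedata.normalize(),
added in Python 2.3. A sketch of the idea:

    import unicodedata   # normalize() exists from Python 2.3 on

    u = u"\u00e9"        # e-acute, precomposed
    v = u"e\u0301"       # e + combining acute accent
    assert u != v        # plain comparison sees two different sequences
    assert unicodedata.normalize("NFC", u) == \
           unicodedata.normalize("NFC", v)
]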
From ping@lfw.org Wed May 3 09:30:02 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Wed, 3 May 2000 01:30:02 -0700 (PDT) Subject: [I18n-sig] Re: [Python-Dev] Unicode comparisons & normalization In-Reply-To: Message-ID:

On Wed, 3 May 2000, Just van Rossum wrote:
> After quickly browsing through the unicode.org URLs I posted earlier, I
> reach the following (possibly wrong) conclusions:
>
> - there is a script and language independent canonical form (but automatic
> normalization is indeed a bad idea)
> - ideally, unicode comparisons should follow the rules from
> http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic
> for 1.6, if at all...)

I just looked through this document. Indeed, there's a lot of work to be done if we want to compare strings this way. I thought the most striking feature was that this comparison method does *not* satisfy the common assumption

    a > b   implies   a + c > b + d        (+ is concatenation)

-- in fact, it is specifically designed to allow for cases where differences in the *later* part of a string can have greater influence than differences in an earlier part of a string. It *does* still guarantee that

    a + b > a

and of course we can still rely on the most basic rules such as

    a > b  and  b > c   implies   a > c

There are sufficiently many significant transformations described in the UTR 10 document that i'm pretty sure it is possible for two things to collate equally but not be equivalent. (Even after Unicode normalization, there is still the possibility of rearrangement in step 1.2.)

This would be another motivation for Python to carefully separate the three types of equality:

    is     identity-equal
    ==     value-equal
    <=>    magnitude-equal

We currently don't distinguish between the last two; the operator "<=>" is my proposal for how to spell "magnitude-equal", and in terms of outward behaviour you can consider (a <=> b) to be (a <= b and a >= b). I suspect we will find ourselves needing it if we do rich comparisons anyway.

(I don't know of any other useful kinds of equality, but if you've run into this before, do pipe up...)

-- ?!ng

From mal@lemburg.com Wed May 3 09:15:29 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 03 May 2000 10:15:29 +0200 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> <390EFE21.DAD7749B@prescod.net> Message-ID: <390FE021.6F15C1C8@lemburg.com>

Just van Rossum wrote:
>
> [MAL vs. PP]
> >> > FYI: Normalization is needed to make comparing Unicode
> >> > strings robust, e.g. u"é" should compare equal to u"e\u0301".
> >>
> >> That's a whole 'nother debate at a whole 'nother level of abstraction. I
> >> think we need to get the bytes/characters level right and then we can
> >> worry about display-equivalent characters (or leave that to the Python
> >> programmer to figure out...).
> >
> >I just wanted to point out that the argument "slicing doesn't
> >work with UTF-8" is moot.
>
> And failed...

Huh ? The pure fact that you can have two (or more) Unicode characters to represent a single character makes Unicode itself have the same problems as e.g. UTF-8.

> [Refs about collation and decomposition]
>
> It's very deep stuff, which seems more appropriate for an extension than
> for builtin comparisons to me.

That's what I think too; I never argued for making this builtin and automatic (don't know where people got this idea from).

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
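[MAL's point is easy to see in the interpreter; a minimal sketch --
any combining mark will do:

    u = u"e\u0301"       # 'e' + COMBINING ACUTE ACCENT: one display character
    print len(u)         # 2 -- two code points
    print repr(u[:1])    # u'e' -- slicing split the character, accent lost
]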
From Fredrik Lundh" Message-ID: <018a01bfb4de$7744cc00$34aab5d4@hagrid>

Just van Rossum wrote:
> After quickly browsing through the unicode.org URLs I posted earlier, I
> reach the following (possibly wrong) conclusions:

here's another good paper that covers this, the universe, and everything:

    Character Model for the World Wide Web
    http://www.w3.org/TR/charmod

among many other things, it argues that normalization should be done at the source, and that it should be sufficient to do binary matching to tell if two strings are identical.

...

another very interesting thing from that paper is where they identify four layers of character support:

    Layer 1: Physical representation. This is necessary for APIs that
    expose a physical representation of string data. /.../ To avoid
    problems with duplicates, it is assumed that the data is
    normalized /.../

    Layer 2: Indexing based on abstract codepoints. /.../ This is the
    highest layer of abstraction that ensures interoperability with
    very low implementation effort. To avoid problems with duplicates,
    it is assumed that the data is normalized /.../

    Layer 3: Combining sequences, user-relevant. /.../ While we think
    that an exact definition of this layer should be possible, such a
    definition does not currently exist.

    Layer 4: Depending on language and operation. This layer is least
    suited for interoperability, but is necessary for certain
    operations, e.g. sorting.

until now, this discussion has focussed on the boundary between layer 1 and 2. that as many python strings as possible should be on the second layer has always been obvious to me ("a very low implementation effort" is exactly my style ;-), and leave the rest for the app.

...while Guido and MAL have argued that we should stay on level 1 (apparently because "we've already implemented it" is less effort than "let's change a little bit")

no wonder they never understand what I'm talking about...

it's also interesting to see that MAL's using layer 3 and 4 issues as an argument to keep Python's string support at layer 1. in contrast, the W3 paper thinks that normalization is a non-issue also on the layer 1 level. go figure.

...

btw, how about adopting this paper as the "Character Model for Python"?

yes, I'm serious.

PS. here's my take on Just's normalization points:

> - there is a script and language independent canonical form (but automatic
> normalization is indeed a bad idea)
> - ideally, unicode comparisons should follow the rules from
> http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic
> for 1.6, if at all...)

note that the W3 paper recommends early normalization, and binary comparison (assuming the same internal representation of the unicode character codes, of course).

> - this would indeed mean that it's possible for u == v even though type(u)
> is type(v) and len(u) != len(v). However, I don't see how this would
> collapse /F's world, as the two strings are at most semantically
> equivalent. Their physical difference is real, and still follows the
> a-string-is-a-sequence-of-characters rule (!).

yes, but on layer 3 instead of layer 2.

> - there may be additional customized language-specific sorting rules. I
> currently don't see how to implement that without some global variable.

layer 4.
> - the sorting rules are very complicated, and should be implemented by
> calculating "sort keys". If I understood it correctly, these can take up to
> 4 bytes per character in its most compact form. Still, for it to be
> somewhat speed-efficient, they need to be cached...

layer 4.

> - u.find() may need an alternative API, which returns a (begin, end) tuple,
> since the match may not have the same length as the search string... (This
> is tricky, since you need the begin and end indices in the non-canonical
> form...)

layer 3.

From Fredrik Lundh" <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> <390F60A9.A3AA53A9@lemburg.com> Message-ID: <01ed01bfb4df$8feddb60$34aab5d4@hagrid>

M.-A. Lemburg wrote:
> Guido van Rossum wrote:
> >
> > > > So what do you think of my new proposal of using ASCII as the default
> > > > "encoding"?
>
> How about using unicode-escape or raw-unicode-escape as
> default encoding ? (They would have to be adapted to disallow
> Latin-1 char input, though.)
>
> The advantage would be that they are compatible with ASCII
> while still providing loss-less conversion and since they
> use escape characters, you can even read them using an
> ASCII based editor.

umm. if you disallow latin-1 characters, how can you call this one loss-less? looks like political correctness taken to an entirely new level...

From ping@lfw.org Wed May 3 09:50:30 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Wed, 3 May 2000 01:50:30 -0700 (PDT) Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <013c01bfb4d6$da19fb00$34aab5d4@hagrid> Message-ID:

On Wed, 3 May 2000, Fredrik Lundh wrote:
> Guido van Rossum wrote:
> > But there must be a way to turn on Unicode-awareness on e.g. stdout
> > and then printing a Unicode object should not use str() (as it
> > currently does).
>
> to throw some extra gasoline on this, how about allowing
> str() to return unicode strings?

You still need to *print* them somehow. One way or another, stdout is still just a stream with bytes on it, unless we augment file objects to understand encodings.

stdout sends bytes to something -- and that something will interpret the stream of bytes in some encoding (could be Latin-1, UTF-8, ISO-2022-JP, whatever). So either:

 1. You explicitly downconvert to bytes, and specify
    the encoding each time you do. Then write the
    bytes to stdout (or your file object).

 2. The file object is smart and can be told what
    encoding to use, and Unicode strings written to
    the file are automatically converted to bytes.

Another thread mentioned having separate read/write and binary_read/binary_write methods on files. I suggest doing it the other way, actually: since read/write operate on byte streams now, *they* are the binary operations; the new methods should be the ones that do the extra encoding/decoding work, and could be called uniread/uniwrite, uread/uwrite, textread/textwrite, etc.

> (extra questions: how about renaming "unicode" to "string",
> and getting rid of "unichr"?)
Would you expect chr(x) to return an 8-bit string when x < 128, and a Unicode string when x >= 128?

-- ?!ng

From ping@lfw.org Wed May 3 10:32:31 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Wed, 3 May 2000 02:32:31 -0700 (PDT) Subject: [I18n-sig] Re: [Python-Dev] Re: Unicode debate In-Reply-To: <200005021231.IAA24249@eric.cnri.reston.va.us> Message-ID:

On Tue, 2 May 2000, Guido van Rossum wrote:
> > P. P. S. If always having to specify encodings is really too much,
> > i'd probably be willing to consider a default-encoding state on the
> > Unicode class, but it would have to be a stack of values, not a
> > single value.
>
> Please elaborate?

On general principle, it seems bad to just have a "set" method that encourages people to set static state in a way that irretrievably loses the current state. For something like this, you want a "push" method and a "pop" method with which to bracket a series of operations, so that you can easily write code which politely leaves other code unaffected.

For example:

    >>> x = unicode("d\351but")      # assume Guido-ASCII wins
    UnicodeError: ASCII decoding error: value out of range
    >>> x = unicode("d\351but", "latin-1")
    >>> x
    u'd\351but'
    >>> print x.encode("latin-1")    # on my xterm with Latin-1 fonts
    début
    >>> x.encode("utf-8")
    'd\303\251but'

Now:

    >>> u"".pushenc("latin-1")       # need a better interface to this?
    >>> x = unicode("d\351but")      # okay now
    >>> x
    u'd\351but'

    >>> u"".pushenc("utf-8")
    >>> x = unicode("d\351but")
    UnicodeError: UTF-8 decoding error: invalid data
    >>> x = unicode("d\303\251but")
    >>> print x.encode("latin-1")
    début
    >>> str(x)
    'd\303\251but'

    >>> u"".popenc()                 # back to the Latin-1 encoding
    >>> str(x)
    'd\351but'

    . . .

    >>> u"".popenc()                 # back to the ASCII encoding

Similarly, imagine:

    >>> x = u""
    >>> file = open("foo.jis", "w")
    >>> file.pushenc("iso-2022-jp")
    >>> file.uniwrite(x)
    . . .
    >>> file.popenc()

    >>> import sys
    >>> sys.stdout.write(x)          # bad! x contains chars > 127
    UnicodeError: ASCII encoding error: value out of range
    >>> sys.stdout.pushenc("iso-2022-jp")
    >>> sys.stdout.write(x)          # on a kterm with kanji fonts
    . . .
    >>> sys.stdout.popenc()

The above examples incorporate the Guido-ASCII proposal, which makes a fair amount of sense to me now. How do they look to y'all?

This illustrates the remaining wart:

    >>> sys.stdout.pushenc("iso-2022-jp")
    >>> print x                      # still bad! str is still doing ASCII
    UnicodeError: ASCII encoding error: value out of range
    >>> u"".pushenc("iso-2022-jp")
    >>> print x                      # on a kterm with kanji fonts

Writing to files asks the file object to convert from Unicode to bytes, then write the bytes. Printing converts the Unicode to bytes first with str(), then hands the bytes to the file object to write. This wart is really a larger printing issue. If we want to solve it, files have to know what to do with objects, i.e.

    print x

doesn't mean

    sys.stdout.write(str(x) + "\n")

instead it means

    sys.stdout.printout(x)

Hmm. I think this might deserve a separate subject line.

-- ?!ng

From ping@lfw.org Wed May 3 10:41:20 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Wed, 3 May 2000 02:41:20 -0700 (PDT) Subject: [I18n-sig] Printing objects on files In-Reply-To: <200005021231.IAA24249@eric.cnri.reston.va.us> Message-ID:

The following is all stolen from E: see http://www.erights.org/.

As i mentioned in the previous message, there are reasons that we might want to enable files to know what it means to print things on them.
    print x

would mean

    sys.stdout.printout(x)

where sys.stdout is defined something like

    def __init__(self):
        self.encs = ["ASCII"]

    def pushenc(self, enc):
        self.encs.append(enc)

    def popenc(self):
        self.encs.pop()
        if not self.encs:
            self.encs = ["ASCII"]

    def printout(self, x):
        if type(x) is type(u""):
            self.write(x.encode(self.encs[-1]))
        else:
            x.__print__(self)
        self.write("\n")

and each object would have a __print__ method; for lists, e.g.:

    def __print__(self, file):
        file.write("[")
        if len(self):
            file.printout(self[0])
        for item in self[1:]:
            file.write(", ")
            file.printout(item)
        file.write("]")

for floats, e.g.:

    def __print__(self, file):
        if hasattr(file, "floatprec"):
            prec = file.floatprec
        else:
            prec = 17
        file.write("%%.%dg" % prec % self)

The passing of control between the file and the objects to be printed enables us to make Tim happy:

    >>> l = [1/2, 1/3, 1/4]          # I can dream, can't i?
    >>> print l
    [0.5, 0.33333333333333331, 0.25]
    >>> sys.stdout.floatprec = 6
    >>> print l
    [0.5, 0.333333, 0.25]

Fantasizing about other useful kinds of state beyond "encs" and "floatprec" ("listmax"? "ratprec"?) and managing this namespace is left as an exercise to the reader.

-- ?!ng

From ht@cogsci.ed.ac.uk Wed May 3 10:59:28 2000 From: ht@cogsci.ed.ac.uk (Henry S. Thompson) Date: 03 May 2000 10:59:28 +0100 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Guido van Rossum's message of "Mon, 01 May 2000 20:53:26 -0400" References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> Message-ID:

Guido van Rossum writes:

> Paul, we're both just saying the same thing over and over without
> convincing each other. I'll wait till someone who wasn't in this
> debate before chimes in.

OK, I've never contributed to this discussion, but I have a long history of shipping widely used Python/Tkinter/XML tools (see my homepage). I care _very_ much that heretofore I have been unable to support full XML because of the lack of Unicode support in Python. I've already started playing with 1.6a2 for this reason.

I notice one apparent mis-communication between the various contributors: treating narrow-strings as consisting of UNICODE code points <= 255 is not necessarily the same thing as making Latin-1 the default encoding. I don't think on Paul and Fredrik's account encodings are relevant to narrow-strings at all.

I'd rather go right away to the coherent position of byte-arrays, narrow-strings and wide-strings. Encodings are only relevant to conversion between byte-arrays and strings. Decoding a byte-array with a UTF-8 encoding into a narrow string might cause overflow/truncation, just as decoding a byte-array with a UTF-8 encoding into a wide-string might. The fact that decoding a byte-array with a Latin-1 encoding into a narrow-string is a memcopy is just a side-effect of the courtesy of the UNICODE designers wrt the code points between 128 and 255.

This is effectively the way our C-based XML toolset (which we embed in Python) works today -- we build an 8-bit version which uses char* strings, and a 16-bit version which uses unsigned short* strings, and convert from/to byte-streams in any supported encoding at the margins.
I'd like to keep byte-arrays at the margins in Python as well, for all the reasons advanced by Paul and Fredrik. I think treating existing strings as a sort of pun between narrow-strings and byte-arrays is a recipe for ongoing confusion. ht -- Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh W3C Fellow 1999--2001, part-time member of W3C Team 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/

From ping@lfw.org Wed May 3 10:51:30 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Wed, 3 May 2000 02:51:30 -0700 (PDT) Subject: [I18n-sig] Re: Printing objects on files In-Reply-To: Message-ID:

On Wed, 3 May 2000, Ka-Ping Yee wrote:
> Fantasizing about other useful kinds of state beyond "encs"
> and "floatprec" ("listmax"? "ratprec"?) and managing this
> namespace is left as an exercise to the reader.

Okay, i lied. Shortly after writing this i realized that it is probably advisable for all such bits of state to be stored in stacks, so an interface such as this might do:

    def push(self, key, value):
        if not self.state.has_key(key):
            self.state[key] = []
        self.state[key].append(value)

    def pop(self, key):
        if self.state.has_key(key):
            if len(self.state[key]):
                self.state[key].pop()

    def get(self, key):
        if self.state.has_key(key):
            stack = self.state[key]
            if stack:
                return stack[-1]
        return None

Thus:

    >>> print 1/3
    0.33333333333333331
    >>> sys.stdout.push("float.prec", 6)
    >>> print 1/3
    0.333333
    >>> sys.stdout.pop("float.prec")
    >>> print 1/3
    0.33333333333333331

And once we allow arbitrary strings as keys to the bits of state, the period is a natural separator we can use for managing the namespace. Take the special case for Unicode out of the file object:

    def printout(self, x):
        x.__print__(self)
        self.write("\n")

and have the Unicode string do the work:

    def __printon__(self, file):
        file.write(self.encode(file.get("unicode.enc")))

This behaves just right if an encoding of None means ASCII. If mucking with encodings is sufficiently common, you could imagine conveniences on file objects such as

    def __init__(self, filename, mode, encoding=None):
        ...
        if encoding:
            self.push("unicode.enc", encoding)

    def pushenc(self, encoding):
        self.push("unicode.enc", encoding)

    def popenc(self):
        self.pop("unicode.enc")

-- ?!ng

From pf@artcom-gmbh.de Wed May 3 11:11:30 2000 From: pf@artcom-gmbh.de (Peter Funk) Date: Wed, 3 May 2000 12:11:30 +0200 (MEST) Subject: [I18n-sig] default encoding as global state (was Re: Unicode debate) In-Reply-To: from Ka-Ping Yee at "May 3, 2000 2:32:31 am" Message-ID:

Ka-Ping Yee: [...]
> For example: [...]
> >>> u"".popenc() # back to the Latin-1 encoding

I think 'popenc' is a very poor name. I wondered several seconds about the trailing 'c' on 'popen' (thinking of 'os.popen') before I realized that this should really mean 'pop_encoding'. IMO this builtin stack model to deal with a global state is overkill and inconsistent with other already existing global states. Look for example at 'os.getcwd()' and 'os.chdir()'. What about 'set_encoding()' and 'get_encoding()'? You can always easily add the stack functionality, just as you can easily implement 'pushd()' and 'popd()' on top of 'os.chdir()' and 'os.getcwd()'.

[Charset X-UNKNOWN unsupported, skipping...]

Umm....
BTW: why does your email header contain the following lines?:

    Content-Type: TEXT/PLAIN; charset=X-UNKNOWN
    Content-Transfer-Encoding: 8BIT

Regards, Peter

From ping@lfw.org Wed May 3 11:17:20 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Wed, 3 May 2000 03:17:20 -0700 (PDT) Subject: [I18n-sig] Re: default encoding as global state (was Re: Unicode debate) In-Reply-To: Message-ID:

On Wed, 3 May 2000, Peter Funk wrote:
> I think 'popenc' is a very poor name.

Yes, that's well taken.

> IMO this builtin stack model to deal with a global state is overkill and
> inconsistent with other already existing global states. Look for example
> at 'os.getcwd()' and 'os.chdir()'. What about 'set_encoding()' and
> 'get_encoding()'?

It seems worthwhile to establish a safe convention if you're going to be flipping encodings often. Whether you do it often enough to justify a stack is open to debate, i suppose -- but there's just something about twiddling global state that makes me nervous. -- ?!ng

From Fredrik Lundh" Message-ID: <030a01bfb4ea$c2741e40$34aab5d4@hagrid>

Ka-Ping Yee wrote:
> > to throw some extra gasoline on this, how about allowing
> > str() to return unicode strings?
>
> You still need to *print* them somehow. One way or another,
> stdout is still just a stream with bytes on it, unless we
> augment file objects to understand encodings.
>
> stdout sends bytes to something -- and that something will
> interpret the stream of bytes in some encoding (could be
> Latin-1, UTF-8, ISO-2022-JP, whatever). So either:
>
> 1. You explicitly downconvert to bytes, and specify
>    the encoding each time you do. Then write the
>    bytes to stdout (or your file object).
>
> 2. The file object is smart and can be told what
>    encoding to use, and Unicode strings written to
>    the file are automatically converted to bytes.

which one's more convenient? (no, I won't tell you what I prefer. guido doesn't want more arguments from the old "characters are characters" proponents, so I gotta trick someone else to spell them out ;-)

> > (extra questions: how about renaming "unicode" to "string",
> > and getting rid of "unichr"?)
>
> Would you expect chr(x) to return an 8-bit string when x < 128,
> and a Unicode string when x >= 128?

that will break too much existing code, I think. but what about replacing 128 with 256?

From pf@artcom-gmbh.de Wed May 3 11:27:51 2000 From: pf@artcom-gmbh.de (Peter Funk) Date: Wed, 3 May 2000 12:27:51 +0200 (MEST) Subject: [I18n-sig] Re: Printing objects on files In-Reply-To: from Ka-Ping Yee at "May 3, 2000 2:51:30 am" Message-ID:

Hi!

Ka-Ping Yee:
> Okay, i lied. Shortly after writing this i realized that it
> is probably advisable for all such bits of state to be stored
> in stacks, so an interface such as this might do:
>
> def push(self, key, value): [...]
> def pop(self, key): [...]

I like the idea of having an encoding attribute in file objects. Maybe this can be prototyped in a 'UserFile' class similar to 'UserList', 'UserDict'? Since file objects have methods, I wondered a long time ago why there is no 'UserFile' class. But that's off-topic for the i18n-sig. But I still don't see the advantage of having a stack builtin. Especially in Python this gives us only a very minor advantage over the following pattern:

    previous_state = object.get_state()
    object.set_state(some_value)
    ...do something with object requiring it to be in state 'some_value'...
    # restore state:
    object.set_state(previous_state)

Regards, Peter

From just@letterror.com Wed May 3 12:41:27 2000 From: just@letterror.com (Just van Rossum) Date: Wed, 3 May 2000 12:41:27 +0100 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <390FE021.6F15C1C8@lemburg.com> References: Your message of "Mon, 01 May 2000 21:55:25 EDT." <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <002001bfb3d9$7e020540$7cac1218@reston1.va.home.com> <390E939B.11B99B71@lemburg.com> <390EFE21.DAD7749B@prescod.net> Message-ID:

At 10:15 AM +0200 03-05-2000, M.-A. Lemburg wrote:
>Huh ? The pure fact that you can have two (or more)
>Unicode characters to represent a single character makes
>Unicode itself have the same problems as e.g. UTF-8.

It's the different level of abstraction that makes it different. Even if "e`" is _equivalent_ to the combined character, that doesn't mean that it _is_ the combined character, on the level of abstraction we are talking about: it's still 2 characters, and those can be sliced apart without a problem. Slicing utf-8 doesn't work because it yields invalid strings, slicing "e`" does work since both halves are valid strings. The fact that "e`" is semantically equivalent to the combined character doesn't change that. Just

From guido@python.org Wed May 3 12:12:44 2000 From: guido@python.org (Guido van Rossum) Date: Wed, 03 May 2000 07:12:44 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode comparisons & normalization In-Reply-To: Your message of "Wed, 03 May 2000 01:30:02 PDT." References: Message-ID: <200005031112.HAA03138@eric.cnri.reston.va.us>

[Ping]
> This would be another motivation for Python to carefully
> separate the three types of equality:
>
>     is     identity-equal
>     ==     value-equal
>     <=>    magnitude-equal
>
> We currently don't distinguish between the last two;
> the operator "<=>" is my proposal for how to spell
> "magnitude-equal", and in terms of outward behaviour
> you can consider (a <=> b) to be (a <= b and a >= b).
> I suspect we will find ourselves needing it if we do
> rich comparisons anyway.

I don't think that this form of equality deserves its own operator. The Unicode comparison rules are sufficiently hairy that it seems better to implement them separately, either in a separate module or at least as a Unicode-object-specific method, and let the == operator do what it does best: compare the representations. --Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@python.org Wed May 3 12:14:54 2000 From: guido@python.org (Guido van Rossum) Date: Wed, 03 May 2000 07:14:54 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode comparisons & normalization In-Reply-To: Your message of "Wed, 03 May 2000 11:02:09 +0200." <018a01bfb4de$7744cc00$34aab5d4@hagrid> References: <018a01bfb4de$7744cc00$34aab5d4@hagrid> Message-ID: <200005031114.HAA03152@eric.cnri.reston.va.us>

> here's another good paper that covers this, the universe, and everything:

There's a lot of useful pointers being flung around. Could someone with more spare cycles than I currently have perhaps collect these and produce a little write-up "further reading on Unicode comparison and normalization" (or perhaps a more comprehensive title if warranted) to be added to the i18n-sig's home page?
--Guido van Rossum (home page: http://www.python.org/~guido/)

From just@letterror.com Wed May 3 13:26:50 2000 From: just@letterror.com (Just van Rossum) Date: Wed, 3 May 2000 13:26:50 +0100 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <030a01bfb4ea$c2741e40$34aab5d4@hagrid> References: Message-ID:

[Ka-Ping Yee]
> Would you expect chr(x) to return an 8-bit string when x < 128,
> and a Unicode string when x >= 128?

[Fredrik Lundh]
> that will break too much existing code, I think. but what
> about replacing 128 with 256?

Hihi... and *poof* -- we're back to Latin-1 for narrow strings ;-) Just

From guido@python.org Wed May 3 13:04:29 2000 From: guido@python.org (Guido van Rossum) Date: Wed, 03 May 2000 08:04:29 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Wed, 03 May 2000 12:31:34 +0200." <030a01bfb4ea$c2741e40$34aab5d4@hagrid> References: <030a01bfb4ea$c2741e40$34aab5d4@hagrid> Message-ID: <200005031204.IAA03252@eric.cnri.reston.va.us>

[Ping]
> > stdout sends bytes to something -- and that something will
> > interpret the stream of bytes in some encoding (could be
> > Latin-1, UTF-8, ISO-2022-JP, whatever). So either:
> >
> > 1. You explicitly downconvert to bytes, and specify
> >    the encoding each time you do. Then write the
> >    bytes to stdout (or your file object).
> >
> > 2. The file object is smart and can be told what
> >    encoding to use, and Unicode strings written to
> >    the file are automatically converted to bytes.

[Fredrik]
> which one's more convenient?

Marc-Andre's codec module contains file-like objects that support this (or could easily be made to). However the problem is that print *always* first converts the object using str(), and str() enforces that the result is an 8-bit string. I'm afraid that loosening this will break too much code. (This all really happens at the C level.) I'm also afraid that this means that str(unicode) may have to be defined to yield UTF-8. My argument goes as follows:

1. We want to be able to set things up so that print u"..." does the right thing. (What "the right thing" is, is not defined here, as long as the user sees the glyphs implied by u"...".)

2. print u is equivalent to sys.stdout.write(str(u)).

3. str() must always return an 8-bit string.

4. So the solution must involve assigning an object to sys.stdout that does the right thing given an 8-bit encoding of u.

5. So we need str(u) to produce a lossless 8-bit encoding of Unicode.

6. UTF-8 is the only sensible candidate.

Note that (apart from print) str() is never implicitly invoked -- all implicit conversions when Unicode and 8-bit strings are combined go from 8-bit to Unicode.

(There might be an alternative, but it would depend on having yet another hook (similar to Ping's sys.display) that gets invoked when printing an object (as opposed to displaying it at the interactive prompt). I'm not too keen on this because it would break code that temporarily sets sys.stdout to a file of its own choosing and then invokes print -- a common idiom to capture printed output in a string, for example, which could be embedded deep inside a module. If the main program were to install a naive print hook that always sent output to a designated place, this strategy might fail.)

> > > (extra questions: how about renaming "unicode" to "string",
> > > and getting rid of "unichr"?)
> >
> > Would you expect chr(x) to return an 8-bit string when x < 128,
> > and a Unicode string when x >= 128?
>
> that will break too much existing code, I think.
> but what
> about replacing 128 with 256?

If the 8-bit Unicode proposal were accepted, this would make sense. In my "only ASCII is implicitly convertible" proposal, this would be a mistake, because chr(128) == "\x80" != u"\x80" == unichr(128).

I agree with everyone that things would be much simpler if we had separate data types for byte arrays and 8-bit character strings. But we don't have this distinction yet, and I don't see a quick way to add it in 1.6 without seriously upsetting the release schedule. So all of my proposals are to be considered hacks to maintain as much b/w compatibility as possible while still supporting some form of Unicode. The fact that half the time 8-bit strings are really being used as byte arrays, while Python can't tell the difference, means (to me) that the default encoding is an important thing to argue about. I don't know if I want to push it out all the way to Py3k, but I just don't see a way to implement "a character is a character" in 1.6 given all the current constraints. (BTW I promise that 1.7 will be speedy once 1.6 is out of the door -- there's a lot else that was put off to 1.7.) Fredrik, I believe I haven't seen your response to my ASCII proposal. Is it just as bad as UTF-8 to you, or could you live with it? On a scale of 0-9 (0: UTF-8, 9: 8-bit Unicode), where is ASCII for you? Where's my sre snapshot? --Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@python.org Wed May 3 13:16:56 2000 From: guido@python.org (Guido van Rossum) Date: Wed, 03 May 2000 08:16:56 -0400 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "03 May 2000 10:59:28 BST." References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> Message-ID: <200005031216.IAA03274@eric.cnri.reston.va.us>

[Henry S. Thompson]
> OK, I've never contributed to this discussion, but I have a long
> history of shipping widely used Python/Tkinter/XML tools (see my
> homepage). I care _very_ much that heretofore I have been unable to
> support full XML because of the lack of Unicode support in Python.
> I've already started playing with 1.6a2 for this reason.

Thanks for chiming in!

> I notice one apparent mis-communication between the various
> contributors:
>
> Treating narrow-strings as consisting of UNICODE code points <= 255 is
> not necessarily the same thing as making Latin-1 the default encoding.
> I don't think on Paul and Fredrik's account encodings are relevant to
> narrow-strings at all.

I agree that's what they are trying to tell me.

> I'd rather go right away to the coherent position of byte-arrays,
> narrow-strings and wide-strings. Encodings are only relevant to
> conversion between byte-arrays and strings. Decoding a byte-array
> with a UTF-8 encoding into a narrow string might cause
> overflow/truncation, just as decoding a byte-array with a UTF-8
> encoding into a wide-string might. The fact that decoding a
> byte-array with a Latin-1 encoding into a narrow-string is a memcopy
> is just a side-effect of the courtesy of the UNICODE designers wrt the
> code points between 128 and 255.
> This is effectively the way our C-based XML toolset (which we embed in
> Python) works today -- we build an 8-bit version which uses char*
> strings, and a 16-bit version which uses unsigned short* strings, and
> convert from/to byte-streams in any supported encoding at the margins.
>
> I'd like to keep byte-arrays at the margins in Python as well, for all
> the reasons advanced by Paul and Fredrik.
>
> I think treating existing strings as a sort of pun between
> narrow-strings and byte-arrays is a recipe for ongoing confusion.

Very good analysis. Unfortunately this is where we're stuck, until we have a chance to redesign this kind of thing from scratch. Python 1.5.2 programs use strings for byte arrays probably as much as they use them for character strings. This is because way back in 1990, when I was designing Python, I wanted to have the smallest set of basic types, but I also wanted to be able to manipulate byte arrays somewhat. Influenced by K&R C, I chose to make strings and string I/O 8-bit clean so that you could read a binary "string" from a file, manipulate it, and write it back to a file, regardless of whether it was character or binary data. This model has never been challenged until now. I agree that the Java model (byte arrays and strings) or perhaps your proposed model (byte arrays, narrow and wide strings) looks better. But, although Python has had rudimentary support for byte arrays for a while (the array module, introduced in 1993), the majority of Python code manipulating binary data still uses string objects. My ASCII proposal is a compromise that tries to be fair to both uses for strings. Introducing byte arrays as a more fundamental type has been on the wish list for a long time -- I see no way to introduce this into Python 1.6 without totally botching the release schedule (June 1st is very close already!). I'd like to be able to move on, there are other important things still to be added to 1.6 (Vladimir's malloc patches, Neil's GC, Fredrik's completed sre...). For 1.7 (which should happen later this year) I promise I'll reopen the discussion on byte arrays. --Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@python.org Wed May 3 13:22:57 2000 From: guido@python.org (Guido van Rossum) Date: Wed, 03 May 2000 08:22:57 -0400 Subject: [Python-Dev] Re: [I18n-sig] Re: Unicode debate In-Reply-To: Your message of "Wed, 03 May 2000 01:05:59 EDT." <000301bfb4bd$463ec280$622d153f@tim> References: <000301bfb4bd$463ec280$622d153f@tim> Message-ID: <200005031222.IAA03300@eric.cnri.reston.va.us>

> [Guido]
> > When *comparing* 8-bit and Unicode strings, the presence of non-ASCII
> > bytes in either should make the comparison fail; when ordering is
> > important, we can make an arbitrary choice e.g. "\377" < u"\200".
>
> [Toby]
> > I assume 'fail' means 'non-equal', rather than 'raises an exception'?
>
> [Guido]
> > Yes, sorry for the ambiguity.

[Tim]
> Huh! You sure about that? If we're setting up a case where meaningful
> comparison is impossible, isn't an exception more appropriate? The current
>
> >>> 83479278 < "42"
> 1
> >>>
>
> probably traps more people than it helps.

Agreed, but that's the rule we all currently live by, and changing it is something for Python 3000. I'm not real strong on this though -- I was willing to live with exceptions from the UTF-8-to-Unicode conversion. If we all agree that it's better for u"\377" == "\377" to raise a precedent-setting exception than to return false, that's fine with me too.
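In code, the rule under discussion amounts to something like this sketch (mixed_eq is a hypothetical helper, not anything from the actual patch set): mixed str/unicode equality succeeds only when the 8-bit operand is pure ASCII, and the open question is whether the non-ASCII case compares non-equal or raises.

    def mixed_eq(s, u, strict=0):
        # s is an 8-bit string, u is a Unicode string
        try:
            return unicode(s, "ascii") == u    # implicit conversion is ASCII-only
        except UnicodeError:
            if strict:
                raise                          # the "precedent-setting exception"
            return 0                           # or simply compare non-equal

With strict false, u"\377" == "\377" is simply false; with strict true it raises, which is the other behaviour Guido says he could live with.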
I do want u"a" == "a" to be true though (and I believe we all already agree on that one). Note that it's not the first precedent -- you can already define classes whose instances can raise exceptions during comparisons. --Guido van Rossum (home page: http://www.python.org/~guido/) From mal@lemburg.com Wed May 3 09:56:08 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 03 May 2000 10:56:08 +0200 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> <013c01bfb4d6$da19fb00$34aab5d4@hagrid> Message-ID: <390FE9A7.DE5545DA@lemburg.com> Fredrik Lundh wrote: > > Guido van Rossum wrote: > > > What do we do about str( my_unicode_string )? Perhaps escape the Unicode > > > characters with backslashed numbers? > > > > Hm, good question. Tcl displays unknown characters as \x or \u > > escapes. I think this may make more sense than raising an error. > > but that's on the display side of things, right? similar to > repr, in other words. > > > But there must be a way to turn on Unicode-awareness on e.g. stdout > > and then printing a Unicode object should not use str() (as it > > currently does). > > to throw some extra gasoline on this, how about allowing > str() to return unicode strings? > > (extra questions: how about renaming "unicode" to "string", > and getting rid of "unichr"?) > > count to ten before replying, please. 1 2 3 4 5 6 7 8 9 10 ... ok ;-) Guido's problem with printing Unicode can easily be solved using the standard codecs.StreamRecoder class as I've done in the example I posted some days ago. Basically, what the stdout wrapper would do is take strings as input, converting them to Unicode and then writing them encoded to the original stdout. For Unicode objects the conversion can be skipped and the encoded output written directly to stdout. This can be done for any encoding supported by Python; e.g. you could do the indirection in site.py and then have Unicode printed as Latin-1 or UTF-8 or one of the many code pages supported through the mapping codec. About having str() return Unicode objects: I see str() as constructor for string objects and under that assumption str() will always have to return string objects. unicode() does the same for Unicode objects, so renaming it to something else doesn't really help all that much. BTW, __str__() has to return strings too. Perhaps we need __unicode__() and a corresponding slot function too ?! -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/ From mal@lemburg.com Wed May 3 14:06:27 2000 From: mal@lemburg.com (M.-A. 
Lemburg) Date: Wed, 03 May 2000 15:06:27 +0200 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <390F1D1C.6EAF7EAD@prescod.net> <200005021856.OAA26104@eric.cnri.reston.va.us> <390F2B2F.2953C72D@prescod.net> <200005021958.PAA26760@eric.cnri.reston.va.us> <390F60A9.A3AA53A9@lemburg.com> <01ed01bfb4df$8feddb60$34aab5d4@hagrid> Message-ID: <39102453.6923B10@lemburg.com>

Fredrik Lundh wrote:
> M.-A. Lemburg wrote:
> > Guido van Rossum wrote:
> > > > So what do you think of my new proposal of using ASCII as the default
> > > > "encoding"?
> >
> > How about using unicode-escape or raw-unicode-escape as
> > default encoding ? (They would have to be adapted to disallow
> > Latin-1 char input, though.)
> >
> > The advantage would be that they are compatible with ASCII
> > while still providing loss-less conversion and since they
> > use escape characters, you can even read them using an
> > ASCII based editor.
>
> umm. if you disallow latin-1 characters, how can you call this
> one loss-less?

[Guido didn't like this one, so it's probably moot investing any more time on this...] I meant that the unicode-escape codec should only take ASCII characters as input and disallow non-escaped Latin-1 characters. Anyway, I'm out of this discussion... I'll wait a week or so until things have been sorted out. Have fun, -- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/

From ping@lfw.org Wed May 3 14:09:59 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Wed, 3 May 2000 06:09:59 -0700 (PDT) Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <200005031204.IAA03252@eric.cnri.reston.va.us> Message-ID:

On Wed, 3 May 2000, Guido van Rossum wrote:
> (There might be an alternative, but it would depend on having yet
> another hook (similar to Ping's sys.display) that gets invoked when
> printing an object (as opposed to displaying it at the interactive
> prompt). I'm not too keen on this because it would break code that
> temporarily sets sys.stdout to a file of its own choosing and then
> invokes print -- a common idiom to capture printed output in a string,
> for example, which could be embedded deep inside a module. If the
> main program were to install a naive print hook that always sent
> output to a designated place, this strategy might fail.)

I know this is not a small change, but i'm pretty convinced the right answer here is that the print hook should call a *method* on sys.stdout, whatever sys.stdout happens to be. The details are described in the other long message i wrote ("Printing objects on files"). Here is an addendum that might actually make that proposal feasible enough (compatibility-wise) to fly in the short term:

    print x

does, conceptually:

    try:
        sys.stdout.printout(x)
    except AttributeError:
        sys.stdout.write(str(x))
        sys.stdout.write("\n")

The rest can then be added, and the change in 'print x' will work nicely for any file objects, but will not break on file-like substitutes that don't define a 'printout' method.
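A runnable sketch of that fallback dispatch (the function name emulated_print is hypothetical; it just models what the print statement would do internally):

    import sys

    def emulated_print(x, f=None):
        f = f or sys.stdout
        try:
            printout = f.printout       # "smart" files know how to print objects
        except AttributeError:
            f.write(str(x))             # plain files keep today's behaviour
            f.write("\n")
        else:
            printout(x)                 # printout() writes the newline itself

Fetching the bound method first, rather than wrapping the whole call in try/except, has the nice side effect that an AttributeError raised *inside* a printout() implementation isn't silently mistaken for "no printout method".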
Any reactions to the other benefit of this proposal -- namely, the ability to control the printing parameters of object components as they're being traversed for printing? That was actually the original motivation for doing the file.printout thing: it gives you some of the effect of "passing down str-ness" that we were discussing so heatedly a little while ago. The other thing that just might justify this much of a change is that, as you reasoned clearly in your other message, without adequate resolution to the printing problem we may have painted ourselves into a corner with regard to str(u"") conversion, and i don't like the look of that corner much. *Even* if we were to get people to agree that it's okay for str(u"") to produce UTF-8, it still seems pretty hackish to me that we're forced to choose this encoding as a way of working around the fact that we can't simply give the file the thing we want to print. -- ?!ng

From pf@artcom-gmbh.de Wed May 3 14:43:29 2000 From: pf@artcom-gmbh.de (Peter Funk) Date: Wed, 3 May 2000 15:43:29 +0200 (MEST) Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: from Just van Rossum at "May 2, 2000 3: 0:31 pm" Message-ID:

Hi!

[me]:
> >> I agree with Just, Paul, Fredrik and Ping.

> At 8:30 AM -0400 02-05-2000, Guido van Rossum wrote:
> >Sorry, this is not a democracy. :-) I'm not counting votes, I'm
> >looking for contributions to the discussion.

Just van Rossum:
> Of course it's not, and of course you shouldn't be counting votes. However,
> the fact that more and more people chime in on the Latin-1 side (even
> non-western oriented people like Ping and Moshe!) should ring a bell.

Just: Thank you for trying to defend me... ;-) But Guido was right that I didn't contribute any new argument to the discussion. In the meantime it has become really hard to come up with something really new. Nevertheless I will try:

Maybe the situation will become clearer and easier to understand if we simply rename the new Unicode string objects into "wide string objects". From this POV wide string objects are simply members of a family of string objects, in the same sense as integers, arbitrary long ints and floats are members of the family of number types. The whole encoding debate then becomes pointless, since the interpretation of the content of a wide string object doesn't have to be unicode at all. (Although there might be no other useful 16-bit wide encoding scheme available today.) This interpretation of the encoding will be left over to the application, in the same way applications interpret the meaning of 8-bit strings as they like. (usually as latin1 here, but that's not the point). So if mixing normal 8-bit strings with wide strings, the expected behaviour should be similar to what happens if mixing floats, long ints and plain integers: the value range is extended to fit the largest operand. Every other behaviour would be very surprising.

[ascii:] Please don't drop the 8-bit transparency we already achieved during the last decade: I still remember the late 80s, where mailers, news transports and other pieces of software tended to drop or truncate the eighth bit. So going back to ASCII won't do any good: It will bother people in the same way as the octal in

    >>> "Viel Glück"
    'Viel Gl\374ck'

doesn't make much sense on an otherwise 8-bit clean system.
Regards, Peter

From Moshe Zadka Wed May 3 14:55:37 2000 From: Moshe Zadka (Moshe Zadka) Date: Wed, 3 May 2000 16:55:37 +0300 (IDT) Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <000501bfb4c3$16743480$622d153f@tim> Message-ID:

On Wed, 3 May 2000, Tim Peters wrote:

[Moshe Zadka]
> ...
> I'd much prefer Python to reflect a fundamental truth about Unicode,
> which at least makes sure binary-goop can pass through Unicode and
> remain unharmed, than to reflect a nasty problem with UTF-8 (not
> everything is legal).

[Tim Peters]
> Then you don't want Unicode at all, Moshe. All the official encoding
> schemes for Unicode 3.0 suffer illegal byte sequences

Of course I don't, and of course you're right. But what I do want is for my binary goop to pass unharmed through the evil Unicode forest. Which is why I don't want it to interpret my goop as a sequence of bytes it tries to decode, but I want the numeric values of my bytes to pass through to Unicode unharmed -- that means Latin-1, because of the second design decision of the horribly western-specific unicode: the first 256 characters are the same as Latin-1. If it were up to me, I'd use Latin-3, but it wasn't, so it's not.

> (for example, 0xffff
> is illegal in UTF-16 (whether BE or LE)

Tim, one of us must have cracked a chip. 0xffff is the same in BE and LE -- isn't it. -- Moshe Zadka http://www.oreilly.com/news/prescod_0300.html http://www.linux.org.il -- we put the penguin in .com

From just@letterror.com Wed May 3 20:55:24 2000 From: just@letterror.com (Just van Rossum) Date: Wed, 3 May 2000 20:55:24 +0100 Subject: [I18n-sig] Unicode strings: an alternative Message-ID:

Today I had a relatively simple idea that unites wide strings and narrow strings in a way that is more backward compatible at the C level. It's quite possible this has already been considered and rejected for reasons that are not yet obvious to me, but I'll give it a shot anyway. The main concept is not to provide a new string type but to extend the existing string object like so:

- wide strings are stored as if they were narrow strings, simply using two bytes for each Unicode character.
- there's a flag that specifies whether the string is narrow or wide.
- the ob_size field is the _physical_ length of the data; if the string is wide, len(s) will return ob_size/2, all other string operations will have to do similar things.
- there can possibly be an encoding attribute which may specify the used encoding, if known.

Admittedly, this is tricky and involves quite a bit of effort to implement, since all string methods need to have a narrow/wide switch. To make it worse, it hardly offers anything the current solution doesn't. However, it offers one IMHO _big_ advantage: C code that just passes strings along does not need to change: wide strings can be seen as narrow strings without any loss. This allows for __str__() & str() and friends to work with unicode strings without any change. Any thoughts? Just

From tree@basistech.com Wed May 3 21:19:05 2000 From: tree@basistech.com (Tom Emerson) Date: Wed, 3 May 2000 16:19:05 -0400 (EDT) Subject: [I18n-sig] Unicode strings: an alternative In-Reply-To: References: Message-ID: <14608.35257.729641.178724@cymru.basistech.com>

Just van Rossum writes:
> The main concept is not to provide a new string type but to extend the
> existing string object like so:

This is the most logical thing to do.

> - wide strings are stored as if they were narrow strings, simply using two
> bytes for each Unicode character.

I disagree with you here...
store them as UTF-8. > - there's a flag that specifies whether the string is narrow or wide. Yup. > - the ob_size field is the _physical_ length of the data; if the string is > wide, len(s) will return ob_size/2, all other string operations will have > to do similar things. Is it possible to add a logical length field too? I presume it is too expensive to recalculate the logical (character) length of a string each time len(s) is called? Doing this is only slightly more time consuming than a normal strlen: really just O(n) + c, where 'c' is the constant time needed for table lookup (to get the number of bytes in the UTF-8 sequence given the start character) and the pointer manipulation (to add that length to your span pointer). > - there can possibly be an encoding attribute which may specify the used > encoding, if known. So is this used to handle the case where you have a legacy encoding (ShiftJIS, say) used in your existing strings, so you flag that 8-bit ("narrow" in a way) string as ShiftJIS? If wide strings are always Unicode, why do you need the encoding? > Admittedly, this is tricky and involves quite a bit of effort to implement, > since all string methods need to have narrow/wide switch. To make it worse, > it hardly offers anything the current solution doesn't. However, it offers > one IMHO _big_ advantage: C code that just passes strings along does not > need to change: wide strings can be seen as narrow strings without any > loss. This allows for __str__() & str() and friends to work with unicode > strings without any change. If you store wide strings as UCS2 then people using the C interface lose: strlen() stops working, or will return incorrect results. Indeed, any of the str*() routines in the C runtime will break. This is the advantage of using UTF-8 here --- you can still use strcpy and the like on the C side and have things work. > Any thoughts? I'm doing essentially what you suggest in my Unicode enablement of MySQL. -tree -- Tom Emerson Basis Technology Corp. Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From skip@mojam.com (Skip Montanaro) Wed May 3 21:51:49 2000 From: skip@mojam.com (Skip Montanaro) (Skip Montanaro) Date: Wed, 3 May 2000 15:51:49 -0500 (CDT) Subject: [Python-Dev] [I18n-sig] Unicode strings: an alternative In-Reply-To: <14608.35257.729641.178724@cymru.basistech.com> References: <14608.35257.729641.178724@cymru.basistech.com> Message-ID: <14608.37223.787291.236623@beluga.mojam.com> Tom> Is it possible to add a logical length field too? I presume it is Tom> too expensive to recalculate the logical (character) length of a Tom> string each time len(s) is called? Doing this is only slightly more Tom> time consuming than a normal strlen: ... Note that currently the len() method doesn't call strlen() at all. It just returns the ob_size field. Presumably, with Just's proposal len() would simply return ob_size/width. If you used a variable width encoding, Just's plan wouldn't work. (I don't know anything about string encodings - is UTF-8 variable width?) From guido@python.org Wed May 3 22:22:59 2000 From: guido@python.org (Guido van Rossum) Date: Wed, 03 May 2000 17:22:59 -0400 Subject: [I18n-sig] Unicode strings: an alternative In-Reply-To: Your message of "Wed, 03 May 2000 20:55:24 BST." 
References: Message-ID: <200005032122.RAA05150@eric.cnri.reston.va.us> > Today I had a relatively simple idea that unites wide strings and narrow > strings in a way that is more backward comatible at the C level. It's quite > possible this has already been considered and rejected for reasons that are > not yet obvious to me, but I'll give it a shot anyway. > > The main concept is not to provide a new string type but to extend the > existing string object like so: > - wide strings are stored as if they were narrow strings, simply using two > bytes for each Unicode character. > - there's a flag that specifies whether the string is narrow or wide. > - the ob_size field is the _physical_ length of the data; if the string is > wide, len(s) will return ob_size/2, all other string operations will have > to do similar things. > - there can possibly be an encoding attribute which may specify the used > encoding, if known. > > Admittedly, this is tricky and involves quite a bit of effort to implement, > since all string methods need to have narrow/wide switch. To make it worse, > it hardly offers anything the current solution doesn't. However, it offers > one IMHO _big_ advantage: C code that just passes strings along does not > need to change: wide strings can be seen as narrow strings without any > loss. This allows for __str__() & str() and friends to work with unicode > strings without any change. This seems to have some nice properties, but I think it would cause problems for existing C code that tries to *interpret* the bytes of a string: it could very well do the wrong thing for wide strings (since old C code doesn't check for the "wide" flag). I'm not sure how much C code there is that merely passes strings along... Most C code using strings makes use of the strings (e.g. open() falls in this category in my eyes). --Guido van Rossum (home page: http://www.python.org/~guido/) From tree@basistech.com Wed May 3 23:05:39 2000 From: tree@basistech.com (Tom Emerson) Date: Wed, 3 May 2000 18:05:39 -0400 (EDT) Subject: [Python-Dev] [I18n-sig] Unicode strings: an alternative In-Reply-To: <14608.37223.787291.236623@beluga.mojam.com> References: <14608.35257.729641.178724@cymru.basistech.com> <14608.37223.787291.236623@beluga.mojam.com> Message-ID: <14608.41651.781464.747522@cymru.basistech.com> Skip Montanaro writes: > Note that currently the len() method doesn't call strlen() at all. It just > returns the ob_size field. Presumably, with Just's proposal len() would > simply return ob_size/width. If you used a variable width encoding, Just's > plan wouldn't work. (I don't know anything about string encodings - is > UTF-8 variable width?) Yes, technically from 1 - 6 bytes per character, though in practice for Unicode it's 1 - 3. -tree -- Tom Emerson Basis Technology Corp. Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From kentsin@poboxes.com Thu May 4 05:04:14 2000 From: kentsin@poboxes.com (Sin Hang Kin) Date: Thu, 4 May 2000 12:04:14 +0800 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> Message-ID: <004b01bfb57d$d02cce40$770da8c0@bbs> > > I don't think I've heard a good *argument* for this rule though. 
"A > > character is a character is a character" sounds like an axiom to me -- > > something you can't prove or disprove rationally. > > I don't see it as an axiom, but rather as a design decision you make to > keep your language simple. Along the lines of "all values are objects" > and (now) all integer values are representable with a single type. Are > you happy with this? No. A character is not just a character. Got to google and make a search, the return result might be an example of mixed encoding text: Search engines index pages in their natural encoding, and present the result as is, so the search result page will contain whatever encoding mixed in. If you see JIS, ISO 8859, Hebrew, Thai, Utf-8, Big-5, GB2312, EUC, Shift-JIS you would not be very surprise. So, if you argue that a character is a character is a character, how would you handle such a mixed encoding text mess? No one can write an automatically convertion program for such text, only if you can treated it as 8-bit bytes you can make use of it. Otherwise this is a mess. Backward compatibility is a must, not an extra feature we would like. At least provide a way to handle these in python efficiently. To be able to handle text in character basis is very convient to all, especially to those do not care about i18n, for people who do i18n text processing, they can build their own logic into the code and will not be killed by suprise text. For those applications which is not well prepared, the sudden arrival of ugly unexpected encoding will certainly fatal. Look out the net, you are well connected, and your world is pollued by things from your connection. Isn't it beautiful? Rgs, Kent Sin From just@letterror.com Thu May 4 08:42:00 2000 From: just@letterror.com (Just van Rossum) Date: Thu, 4 May 2000 08:42:00 +0100 Subject: [I18n-sig] Unicode strings: an alternative In-Reply-To: <200005032122.RAA05150@eric.cnri.reston.va.us> References: Your message of "Wed, 03 May 2000 20:55:24 BST." Message-ID: (Thanks for all the comments. I'll condense my replies into one post.) [JvR] > - wide strings are stored as if they were narrow strings, simply using two > bytes for each Unicode character. [Tom Emerson wrote] >I disagree with you here... store them as UTF-8. Erm, utf-8 in a wide string? This makes no sense... [Skip Montanaro] >Presumably, with Just's proposal len() would >simply return ob_size/width. Right. And if you would allow values for width other than 1 and 2, it opens the way for UCS-4. Wouldn't that be nice? It's hardly more effort, and "only" width==1 needs to be special-cased for speed. >If you used a variable width encoding, Just's plan wouldn't work. Correct, but nor does the current unicode object. Variable width encodings are too messy to see as strings at all: they are only useful as byte arrays. [GvR] >This seems to have some nice properties, but I think it would cause >problems for existing C code that tries to *interpret* the bytes of a >string: it could very well do the wrong thing for wide strings (since >old C code doesn't check for the "wide" flag). I'm not sure how much >C code there is that merely passes strings along... Most C code using >strings makes use of the strings (e.g. open() falls in this category >in my eyes). There are probably many cases that fall into this category. But then again, these cases, especially those that potentially can deal with other encodings than ascii, are not much helped by a default encoding, as /F showed. My idea arose after yesterday's discussions. 
Some quotes, plus comments:

[GvR]
>However the problem is that print *always* first converts the object
>using str(), and str() enforces that the result is an 8-bit string.
>I'm afraid that loosening this will break too much code. (This all
>really happens at the C level.)

Guido goes on to explain that this means utf-8 is the only sensible default in this case. Good reasoning, but I think it's backwards:

- str(unicodestring) should just return unicodestring
- it is important that stdout receives the original unicode object.

[MAL]
>BTW, __str__() has to return strings too. Perhaps we
>need __unicode__() and a corresponding slot function too ?!

This also seems backwards. If it's really too hard to change Python so that __str__ can return unicode objects, my solution may help.

[Ka-Ping Yee]
>Here is an addendum that might actually make that proposal
>feasible enough (compatibility-wise) to fly in the short term:
>
>    print x
>
>does, conceptually:
>
>    try:
>        sys.stdout.printout(x)
>    except AttributeError:
>        sys.stdout.write(str(x))
>        sys.stdout.write("\n")

That stuff like this is even being *proposed* (not that it's not smart or anything...) means there's a terrible bottleneck somewhere which needs fixing. My proposal seems to do that nicely. Of course, there's no such thing as a free lunch, and I'm sure there are other corners that'll need fixing, but it appears having to write

    if (!PyString_Check(doc) && !PyUnicode_Check(doc))
        ...

in all places that may accept unicode strings is no fun either. Yes, some code will break if you throw a wide string at it, but I think that code is more easily repaired with my proposal than with the current implementation. It's a big advantage to have only one string type; it makes many problems we've been discussing easier to talk about. Just

From Fredrik Lundh" Message-ID: <002d01bfb59c$cf482280$34aab5d4@hagrid>

Ka-Ping Yee wrote:
> I know this is not a small change, but i'm pretty convinced the
> right answer here is that the print hook should call a *method*
> on sys.stdout, whatever sys.stdout happens to be. The details
> are described in the other long message i wrote ("Printing objects
> on files").
>
> Here is an addendum that might actually make that proposal
> feasible enough (compatibility-wise) to fly in the short term:
>
>     print x
>
> does, conceptually:
>
>     try:
>         sys.stdout.printout(x)
>     except AttributeError:
>         sys.stdout.write(str(x))
>         sys.stdout.write("\n")
>
> The rest can then be added, and the change in 'print x' will
> work nicely for any file objects, but will not break on file-like
> substitutes that don't define a 'printout' method.
Along the lines of "all values are = objects" > > and (now) all integer values are representable with a single type. = Are > > you happy with this? >=20 > No. A character is not just a character. >=20 > Got to google and make a search, the return result might be an example = of > mixed encoding text: >=20 > Search engines index pages in their natural encoding, and present the = result > as is, so the search result page will contain whatever encoding mixed = in. If > you see JIS, ISO 8859, Hebrew, Thai, Utf-8, Big-5, GB2312, EUC, = Shift-JIS > you would not be very surprise. So, if you argue that a character is a > character is a character, how would you handle such a mixed encoding = text > mess? by converting the encoded data, character by character, into a single known encoding, and doing the search in there? > No one can write an automatically convertion program for such text, = only if > you can treated it as 8-bit bytes you can make use of it. Otherwise = this is > a mess. do you really think the google engine repeats your search in every possible encoding? doesn't really sound like the most efficient way to implement a search engine... (if you still think that encodings has anything to do with the = "characters are characters" rule, see http://www.w3.org/TR/charmod ) From ht@cogsci.ed.ac.uk Thu May 4 09:51:39 2000 From: ht@cogsci.ed.ac.uk (Henry S. Thompson) Date: 04 May 2000 09:51:39 +0100 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Guido van Rossum's message of "Wed, 03 May 2000 08:16:56 -0400" References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> Message-ID: Guido van Rossum writes: > My ASCII proposal is a compromise that tries to be fair to both uses > for strings. Introducing byte arrays as a more fundamental type has > been on the wish list for a long time -- I see no way to introduce > this into Python 1.6 without totally botching the release schedule > (June 1st is very close already!). I'd like to be able to move on, > there are other important things still to be added to 1.6 (Vladimir's > malloc patches, Neil's GC, Fredrik's completed sre...). > > For 1.7 (which should happen later this year) I promise I'll reopen > the discussion on byte arrays. I think I hear a moderate consensus developing that the 'ASCII proposal' is a reasonable compromise given the time constraints. But let's not fail to come back to this ASAP -- it _really_ narcs me that every time I load XML into my Python-based editor I'm going to convert large amounts of wide-string data into UTF-8 just so Tk can convert it back to wide-strings in order to display it! ht -- Henry S. 
Thompson, HCRC Language Technology Group, University of Edinburgh W3C Fellow 1999--2001, part-time member of W3C Team 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/ From just@letterror.com Thu May 4 12:27:45 2000 From: just@letterror.com (Just van Rossum) Date: Thu, 4 May 2000 12:27:45 +0100 Subject: [I18n-sig] Unicode strings: an alternative In-Reply-To: References: <200005032122.RAA05150@eric.cnri.reston.va.us> Your message of "Wed, 03 May 2000 20:55:24 BST." Message-ID: I wrote: >It's a big advantage to have only one string type; it makes many problems >we've been discussing easier to talk about. I think I should've been more explicit about what I meant here. I'll try to phrase it as an addendum to my proposal -- which suddenly is no longer just a narrow/wide string unification but narrow/wide/ultrawide, to really be ready for the future... As someone else suggested in the discussion, I think it's good if we separate the encoding from the data type. Meaning that wide strings are no longer tied to Unicode. This allows for double-byte encodings other than UCS-2 as well as for safe passing-through of binary goop, but that's not the main point. The main point is that this will make the behavior of (wide) strings more understandable and consistent. The extended string type is simply a sequence of code points, allowing for 0-0xFF for narrow strings, 0-0xFFFF for wide strings, and 0-0xFFFFFFFF for ultra-wide strings. Upcasting is always safe, downcasting may raise OverflowError. Depending on the used encoding, this comes as close as possible to the sequence-of-characters model. The default character set should of course be Unicode -- and it should be obvious that this implies Latin-1 for narrow strings. (Additionally: an encoding attribute suddenly makes a whole lot of sense again.) Ok, y'all can shoot me now ;-) Just From guido@python.org Thu May 4 13:40:35 2000 From: guido@python.org (Guido van Rossum) Date: Thu, 04 May 2000 08:40:35 -0400 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "04 May 2000 09:51:39 BST." References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> Message-ID: <200005041240.IAA08277@eric.cnri.reston.va.us> > I think I hear a moderate consensus developing that the 'ASCII > proposal' is a reasonable compromise given the time constraints. But > let's not fail to come back to this ASAP -- it _really_ narcs me that > every time I load XML into my Python-based editor I'm going to convert > large amounts of wide-string data into UTF-8 just so Tk can convert it > back to wide-strings in order to display it! Thanks -- but that's really Tcl's fault, since the only way to get character data *into* Tcl (or out of it) is through the UTF-8 encoding. And is your XML really stored on disk in its 16-bit format? 
--Guido van Rossum (home page: http://www.python.org/~guido/) From fredrik@pythonware.com Thu May 4 14:21:25 2000 From: fredrik@pythonware.com (Fredrik Lundh) Date: Thu, 4 May 2000 15:21:25 +0200 Subject: [I18n-sig] Re: Unicode debate References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> <200005041240.IAA08277@eric.cnri.reston.va.us> Message-ID: <00d901bfb5cb$a6cfd490$0500a8c0@secret.pythonware.com> Guido van Rossum wrote: > Thanks -- but that's really Tcl's fault, since the only way to get > character data *into* Tcl (or out of it) is through the UTF-8 > encoding. from http://dev.scriptics.com/man/tcl8.3/TclLib/StringObj.htm Tcl_NewUnicodeObj(Tcl_UniChar* unicode, int numChars) Tcl_NewUnicodeObj and Tcl_SetUnicodeObj create a new object or modify an existing object to hold a copy of the Unicode string given by unicode and numChars. (Tcl_UniChar* is currently the same thing as Py_UNICODE*) From just@letterror.com Thu May 4 22:22:38 2000 From: just@letterror.com (Just van Rossum) Date: Thu, 4 May 2000 22:22:38 +0100 Subject: [I18n-sig] Unicode strings: an alternative Message-ID: (Boy, is it quiet here all of a sudden ;-) Sorry for the duplication of stuff, but I'd like to reiterate my points, to separate them from my implementation proposal, as that's just what it is: an implementation detail. These things are important to me: - get rid of the Unicode-ness of wide strings, in order to - make narrow and wide strings as similar as possible - implicit conversion between narrow and wide strings should happen purely on the basis of the character codes; no assumption at all should be made about the encoding, ie. what the character code _means_. - downcasting from wide to narrow may raise OverflowError if there are characters in the wide string that are > 255 - str(s) should always return s if s is a string, whether narrow or wide - file objects need to be responsible for handling wide strings - the above two points should make it possible for - if no encoding is known, Unicode is the default, whether narrow or wide The above points seem to have the following consequences: - the 'u' in \uXXXX notation no longer makes much sense, since it is not neccesary for the character to be a Unicode code point: it's just a 2-byte int. \wXXXX might be an option. - the u"" notation is no longer neccesary: if a string literal contains a character > 255 the string should automatically become a wide string. - narrow strings should also have an encode() method. - the builtin unicode() function might be redundant if: - it is possible to specify a source encoding. I'm not sure if this is best done through an extra argument for encode() or that it should be a new method, eg. transcode(). - s.encode() or s.transcode() are allowed to output a wide string, as in aNarrowString.encode("UCS-2") and s.transcode("Mac-Roman", "UCS-2"). My proposal to extend the "old" string type to be able to contain wide strings is of course largely unrelated to all this. Yet it may provide some additional C compatibility (especially now that silent conversion to utf-8 is out) as well as a workaround for the str()-having-to-return-a-narrow-string bottleneck. 
Just From skip@mojam.com (Skip Montanaro) Thu May 4 21:43:42 2000 From: skip@mojam.com (Skip Montanaro) (Skip Montanaro) Date: Thu, 4 May 2000 15:43:42 -0500 (CDT) Subject: [Python-Dev] Re: [I18n-sig] Unicode strings: an alternative In-Reply-To: References: Message-ID: <14609.57598.738381.250872@beluga.mojam.com> Just> Sorry for the duplication of stuff, but I'd like to reiterate my Just> points, to separate them from my implementation proposal, as Just> that's just what it is: an implementation detail. Just> These things are important to me: ... For the encoding-challenged like me, does it make sense to explicitly state that you can't mix character widths within a single string, or is that just so obvious that I deserve a head slap just for mentioning it? -- Skip Montanaro, skip@mojam.com, http://www.mojam.com/, http://www.musi-cal.com/ "We have become ... the stewards of life's continuity on earth. We did not ask for this role... We may not be suited to it, but here we are." - Stephen Jay Gould From Fredrik Lundh" <200004271501.LAA13535@eric.cnri.reston.va.us><3908F566.8E5747C@prescod.net><200004281450.KAA16493@eric.cnri.reston.va.us><390AEF1D.253B93EF@prescod.net><200005011802.OAA21612@eric.cnri.reston.va.us><390DEB45.D8D12337@prescod.net><200005012132.RAA23319@eric.cnri.reston.va.us><390E1F08.EA91599E@prescod.net><200005020053.UAA23665@eric.cnri.reston.va.us><200005031216.IAA03274@eric.cnri.reston.va.us> Message-ID: <007701bfb60c$1543f060$34aab5d4@hagrid> Henry S. Thompson wrote: > I think I hear a moderate consensus developing that the 'ASCII > proposal' is a reasonable compromise given the time constraints. agreed. (but even if we settle for "7-bit unicode" in 1.6, there are still a few issues left to sort out before 1.6 final. but it might be best to get back to that after we've added SRE and GC to 1.6a3. we might all need a short break...) > But let's not fail to come back to this ASAP first week in june, promise ;-) From ht@cogsci.ed.ac.uk Fri May 5 09:19:07 2000 From: ht@cogsci.ed.ac.uk (Henry S. Thompson) Date: 05 May 2000 09:19:07 +0100 Subject: [XML-SIG] Re: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Guido van Rossum's message of "Thu, 04 May 2000 08:40:35 -0400" References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> <200005041240.IAA08277@eric.cnri.reston.va.us> Message-ID: Guido van Rossum writes: > > I think I hear a moderate consensus developing that the 'ASCII > > proposal' is a reasonable compromise given the time constraints. But > > let's not fail to come back to this ASAP -- it _really_ narcs me that > > every time I load XML into my Python-based editor I'm going to convert > > large amounts of wide-string data into UTF-8 just so Tk can convert it > > back to wide-strings in order to display it! > > Thanks -- but that's really Tcl's fault, since the only way to get > character data *into* Tcl (or out of it) is through the UTF-8 > encoding. > > And is your XML really stored on disk in its 16-bit format? No, I have no idea what encoding it's in, my XML parser supports over a dozen encodings, and quite sensibly always delivers the content, as per the XML REC, as wide-strings. ht -- Henry S. 
Thompson, HCRC Language Technology Group, University of Edinburgh W3C Fellow 1999--2001, part-time member of W3C Team 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/ From ht@cogsci.ed.ac.uk Fri May 5 09:21:41 2000 From: ht@cogsci.ed.ac.uk (Henry S. Thompson) Date: 05 May 2000 09:21:41 +0100 Subject: [I18n-sig] Re: [XML-SIG] Re: Unicode debate In-Reply-To: "Fredrik Lundh"'s message of "Thu, 4 May 2000 15:21:25 +0200" References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> <200005041240.IAA08277@eric.cnri.reston.va.us> <00d901bfb5cb$a6cfd490$0500a8c0@secret.pythonware.com> Message-ID: "Fredrik Lundh" writes: > Guido van Rossum wrote: > > Thanks -- but that's really Tcl's fault, since the only way to get > > character data *into* Tcl (or out of it) is through the UTF-8 > > encoding. > > from http://dev.scriptics.com/man/tcl8.3/TclLib/StringObj.htm > > Tcl_NewUnicodeObj(Tcl_UniChar* unicode, int numChars) > > Tcl_NewUnicodeObj and Tcl_SetUnicodeObj create a new > object or modify an existing object to hold a copy of the > Unicode string given by unicode and numChars. > > (Tcl_UniChar* is currently the same thing as Py_UNICODE*) > Any way this can be exploited in Tkinter? ht -- Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh W3C Fellow 1999--2001, part-time member of W3C Team 2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440 Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk URL: http://www.ltg.ed.ac.uk/~ht/ From just@letterror.com Fri May 5 10:25:37 2000 From: just@letterror.com (Just van Rossum) Date: Fri, 5 May 2000 10:25:37 +0100 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <007701bfb60c$1543f060$34aab5d4@hagrid> References: <200004271501.LAA13535@eric.cnri.reston.va.us><3908F566.8E5747C@prescod.net><200004281450.KAA16493@eric.cnri.reston.va.us><390AEF1D.253B93EF@prescod.net><200005011802.OAA21612@eric.cnri.reston.va.us><390DEB45.D8D12337@prescod.net><200005012132.RAA23319@eric.cnri.reston.va.us><390E1F08.EA91599E@prescod.net><200005020053.UAA23665@eric.cnri.reston.va.us><200005031216.IAA03274@eric.cnri.reston.va.us> Message-ID: At 11:02 PM +0200 04-05-2000, Fredrik Lundh wrote: >Henry S. Thompson wrote: >> I think I hear a moderate consensus developing that the 'ASCII >> proposal' is a reasonable compromise given the time constraints. > >agreed. This makes no sense: implementing the 7-bit proposal takes more or less the same time as implementing 8-bit downcasting. Or is it just the bickering that's too time-consuming? ;-) I worry that if the current implementation goes into 1.6 more or less as it is now there's no way we can ever go back (before P3K). Or will Unicode support be marked "experimental" in 1.6? This is not so much about the 7-bit/8-bit proposal but about the dubious unicode() and unichr() functions and the u"" notation: - unicode() only takes strings, so is effectively a method of the string type.
- if narrow and wide strings are meant to be as similar as possible, chr(256) should just return a wide char - similarly, why is the u"" notation at all needed? The current design is more complex than needed, and still offers plenty of surprises. Making it simpler (without integrating the two string types) is not a huge effort. Seeing the wide string type as independent of Unicode takes no physical effort at all, as it's just in our heads. Fixing str() so it can return wide strings might be harder, and can wait until later. Would be too bad, though. Just From ping@lfw.org Fri May 5 10:21:20 2000 From: ping@lfw.org (Ka-Ping Yee) Date: Fri, 5 May 2000 02:21:20 -0700 (PDT) Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: <002d01bfb59c$cf482280$34aab5d4@hagrid> Message-ID: On Thu, 4 May 2000, Fredrik Lundh wrote: > > another approach is (simplified): > > try: > sys.stdout.write(x.encode(sys.stdout.encoding)) > except AttributeError: > sys.stdout.write(str(x)) Indeed, that would work to solve just this specific Unicode issue -- but there is a lot of flexibility and power to be gained from the general solution of putting a method on the stream object, as the example with the formatted list items showed. I think it is a good idea, for instance, to leave decisions about how to print Unicode up to the Unicode object, and not hardcode bits of it into print. Guido, have you digested my earlier 'printout' suggestions? -- ?!ng "Old code doesn't die -- it just smells that way." -- Bill Frantz From tdickenson@geminidataloggers.com Fri May 5 10:07:46 2000 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Fri, 05 May 2000 10:07:46 +0100 Subject: [I18n-sig] Unicode strings: an alternative In-Reply-To: References: Message-ID: On Thu, 4 May 2000 22:22:38 +0100, Just van Rossum wrote: >(Boy, is it quiet here all of a sudden ;-) > >Sorry for the duplication of stuff, but I'd like to reiterate my points, to >separate them from my implementation proposal, as that's just what it is: >an implementation detail. > >These things are important to me: >- get rid of the Unicode-ness of wide strings, in order to >- make narrow and wide strings as similar as possible >- implicit conversion between narrow and wide strings should > happen purely on the basis of the character codes; no > assumption at all should be made about the encoding, ie. > what the character code _means_. >- downcasting from wide to narrow may raise OverflowError if > there are characters in the wide string that are > 255 >- str(s) should always return s if s is a string, whether narrow > or wide >- file objects need to be responsible for handling wide strings >- the above two points should make it possible for >- if no encoding is known, Unicode is the default, whether > narrow or wide > >The above points seem to have the following consequences: >- the 'u' in \uXXXX notation no longer makes much sense, > since it is not necessary for the character to be a Unicode > code point: it's just a 2-byte int. \wXXXX might be an option. >- the u"" notation is no longer necessary: if a string literal > contains a character > 255 the string should automatically > become a wide string. >- narrow strings should also have an encode() method. >- the builtin unicode() function might be redundant if: > - it is possible to specify a source encoding. I'm not sure if > this is best done through an extra argument for encode() > or whether it should be a new method, eg. transcode().
> - s.encode() or s.transcode() are allowed to output a wide > string, as in aNarrowString.encode("UCS-2") and > s.transcode("Mac-Roman", "UCS-2"). One other pleasant consequence: - String comparisons work character by character, even if the representations of those characters have different widths. >My proposal to extend the "old" string type to be able to contain wide >strings is of course largely unrelated to all this. Yet it may provide some >additional C compatibility (especially now that silent conversion to utf-8 >is out) as well as a workaround for the >str()-having-to-return-a-narrow-string bottleneck. Toby Dickenson tdickenson@geminidataloggers.com From just@letterror.com Fri May 5 12:40:49 2000 From: just@letterror.com (Just van Rossum) Date: Fri, 5 May 2000 12:40:49 +0100 Subject: [I18n-sig] Unicode strings: an alternative In-Reply-To: References: Message-ID: At 10:07 AM +0100 05-05-2000, Toby Dickenson wrote: >One other pleasant consequence: > >- String comparisons work character by character, even if the > representations of those characters have different widths. Exactly. By saying "(wide) strings are not tied to Unicode" the question whether wide strings should or should not be sorted according to the Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's too hard anyway"... Just From tree@basistech.com Fri May 5 12:46:41 2000 From: tree@basistech.com (Tom Emerson) Date: Fri, 5 May 2000 07:46:41 -0400 (EDT) Subject: [I18n-sig] Unicode strings: an alternative In-Reply-To: References: Message-ID: <14610.46241.129977.642796@cymru.basistech.com> Just van Rossum writes: > At 10:07 AM +0100 05-05-2000, Toby Dickenson wrote: > >One other pleasant consequence: > > > >- String comparisons work character by character, even if the > > representations of those characters have different widths. > > Exactly. By saying "(wide) strings are not tied to Unicode" the question > whether wide strings should or should not be sorted according to the > Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's > too hard anyway"... Wait a second. There is nothing about Unicode that would prevent you from defining string equality as byte-level equality. This strikes me as the wrong way to deal with the complex collation issues of Unicode. It seems to me that by default wide-strings compare at the byte level (i.e., '=' is a byte-level comparison). If you want a normalized comparison, then you make an explicit function call for that. This is no different from comparing strings in a case-sensitive vs. case-insensitive manner. -tree -- Tom Emerson Basis Technology Corp. Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From just@letterror.com Fri May 5 14:17:31 2000 From: just@letterror.com (Just van Rossum) Date: Fri, 5 May 2000 14:17:31 +0100 Subject: [I18n-sig] Unicode strings: an alternative In-Reply-To: <14610.46241.129977.642796@cymru.basistech.com> References: Message-ID: [Me] > Exactly. By saying "(wide) strings are not tied to Unicode" the question > whether wide strings should or should not be sorted according to the > Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's > too hard anyway"... [Tom Emerson] >Wait a second. > >There is nothing about Unicode that would prevent you from defining >string equality as byte-level equality. Agreed. >This strikes me as the wrong way to deal with the complex collation >issues of Unicode.
All I was trying to say was that by looking at it this way, it is even more obvious that the builtin comparison should not deal with Unicode sorting & collation issues. It seems you're saying the exact same thing: >It seems to me that by default wide-strings compare at the byte level >(i.e., '=' is a byte-level comparison). If you want a normalized >comparison, then you make an explicit function call for that. Exactly. >This is no different from comparing strings in a case-sensitive >vs. case-insensitive manner. Good point. All this taken together still means to me that comparisons between wide and narrow strings should take place at the character level, which implies that coercion from narrow to wide is done at the character level, without looking at the encoding. (Which in my book in turn still implies that as long as we're talking about Unicode, narrow strings are effectively Latin-1.) Just From tree@basistech.com Fri May 5 13:34:35 2000 From: tree@basistech.com (Tom Emerson) Date: Fri, 5 May 2000 08:34:35 -0400 (EDT) Subject: [I18n-sig] Unicode strings: an alternative In-Reply-To: References: Message-ID: <14610.49115.820599.172598@cymru.basistech.com> Just van Rossum writes: > Good point. All this taken together still means to me that comparisons > between wide and narrow strings should take place at the character level, > which implies that coercion from narrow to wide is done at the character > level, without looking at the encoding. (Which in my book in turn still > implies that as long as we're talking about Unicode, narrow strings are > effectively Latin-1.) Only true if "wide" strings are encoded in UCS-2 or UCS-4. If "wide characters" are Unicode, but stored in UTF-8 encoding, then you lose. Hmmmm... how often do you expect to compare narrow vs. wide strings, using default comparison (i.e. = or !=)? What if I'm using Latin 3 and use the byte comparison? I may very well have two strings (one narrow, one wide) that compare equal, even though they're not. Not exactly what I would expect. -tree [I'm flying from Seattle to Boston today, so eventually I will disappear for a while] -- Tom Emerson Basis Technology Corp. Language Hacker http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From pf@artcom-gmbh.de Fri May 5 14:13:05 2000 From: pf@artcom-gmbh.de (Peter Funk) Date: Fri, 5 May 2000 15:13:05 +0200 (MEST) Subject: wide strings vs. Unicode point of view (was Re: [I18n-sig] Unicode st.... alternative) In-Reply-To: from Just van Rossum at "May 5, 2000 12:40:49 pm" Message-ID: Just van Rossum: > Exactly. By saying "(wide) strings are not tied to Unicode" the question > whether wide strings should or should not be sorted according to the > Unicode spec is answered by a simple "no", instead of "hmm, maybe, but it's > too hard anyway"... I personally like the idea of speaking of "wide strings" containing wide character codes instead of Unicode objects. Unfortunately there are many methods which need to interpret the content of strings according to some encoding knowledge: for example 'upper()', 'lower()', 'swapcase()', 'lstrip()' and so on need to know to which class certain characters belong. This problem was already somewhat visible in 1.5.2, since these methods were available as library functions from the string module and they did work with a global state maintained by the 'setlocale()' C-library function.
Quoting from the C library man pages: """ The details of what constitutes an uppercase or lowercase letter depend on the current locale. For example, the default "C" locale does not know about umlauts, so no conversion is done for them. In some non-English locales, there are lowercase letters with no corresponding uppercase equivalent; the German sharp s is one example. """ I guess applying 'upper' to a Chinese char will not make much sense. Now these former string module functions were moved into the Python object core. So the current Python string and Unicode object API is somewhat "western-centric". ;-) At least Marc's implementation in 'unicodectype.c' contains the hard-coded assumption that wide strings really contain Unicode characters. print u"".upper().encode("latin1") shows "" independent of the locale setting. This makes sense. The output from print u"".upper().encode() however looks ugly here on my screen... UTF-8 ... blech: Regards and have a nice weekend, Peter -- Peter Funk, Oldenburger Str.86, D-27777 Ganderkesee, Germany, Fax:+49 4222950260 office: +49 421 20419-0 (ArtCom GmbH, Grazer Str.8, D-28359 Bremen) From guido@python.org Fri May 5 15:49:52 2000 From: guido@python.org (Guido van Rossum) Date: Fri, 05 May 2000 10:49:52 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode debate In-Reply-To: Your message of "Fri, 05 May 2000 02:21:20 PDT." References: Message-ID: <200005051449.KAA14138@eric.cnri.reston.va.us> > Guido, have you digested my earlier 'printout' suggestions? Not quite, except to the point that they require more thought than to rush them into 1.6. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Fri May 5 15:54:16 2000 From: guido@python.org (Guido van Rossum) Date: Fri, 05 May 2000 10:54:16 -0400 Subject: [I18n-sig] Unicode strings: an alternative In-Reply-To: Your message of "Thu, 04 May 2000 22:22:38 BST." References: Message-ID: <200005051454.KAA14168@eric.cnri.reston.va.us> > (Boy, is it quiet here all of a sudden ;-) Maybe because (according to one report on NPR here) 80% of the world's email systems are victimized by the ILOVEYOU virus? You & I are not affected because it's Windows-specific (a Visual Basic script, I got a copy mailed to me so I could have a good look :-). Note that there are already mutations, one of which pretends to be a joke. > Sorry for the duplication of stuff, but I'd like to reiterate my points, to > separate them from my implementation proposal, as that's just what it is: > an implementation detail. > > These things are important to me: > - get rid of the Unicode-ness of wide strings, in order to > - make narrow and wide strings as similar as possible > - implicit conversion between narrow and wide strings should > happen purely on the basis of the character codes; no > assumption at all should be made about the encoding, ie. > what the character code _means_. > - downcasting from wide to narrow may raise OverflowError if > there are characters in the wide string that are > 255 > - str(s) should always return s if s is a string, whether narrow > or wide > - file objects need to be responsible for handling wide strings > - the above two points should make it possible for > - if no encoding is known, Unicode is the default, whether > narrow or wide > > The above points seem to have the following consequences: > - the 'u' in \uXXXX notation no longer makes much sense, > since it is not necessary for the character to be a Unicode > code point: it's just a 2-byte int.
\wXXXX might be an option. > - the u"" notation is no longer necessary: if a string literal > contains a character > 255 the string should automatically > become a wide string. > - narrow strings should also have an encode() method. > - the builtin unicode() function might be redundant if: > - it is possible to specify a source encoding. I'm not sure if > this is best done through an extra argument for encode() > or whether it should be a new method, eg. transcode(). > - s.encode() or s.transcode() are allowed to output a wide > string, as in aNarrowString.encode("UCS-2") and > s.transcode("Mac-Roman", "UCS-2"). > > My proposal to extend the "old" string type to be able to contain wide > strings is of course largely unrelated to all this. Yet it may provide some > additional C compatibility (especially now that silent conversion to utf-8 > is out) as well as a workaround for the > str()-having-to-return-a-narrow-string bottleneck. I'm not so sure that this is enough. You seem to propose wide strings as vehicles for 16-bit values (and maybe later 32-bit values) apart from their encoding. We already have a data type for that (the array module). The Unicode type does a lot more than storing 16-bit values: it knows lots of encodings to and from Unicode, and it knows things like which characters are upper or lower or title case and how to map between them, which characters are word characters, and so on. All this is highly Unicode-specific and is part of what people ask for when they request Unicode support. (Example: Unicode has 405 characters classified as numeric, according to the isnumeric() method.) And by the way, don't worry about the comparison. I'm not changing the default comparison (==, cmp()) for Unicode strings to be anything other than per 16-bit-quantity. However a Unicode object might in addition have a method to do normalization or whatever, as long as it's language independent and strictly defined by the Unicode standard. Language-specific operations belong in separate modules. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Fri May 5 15:59:55 2000 From: guido@python.org (Guido van Rossum) Date: Fri, 05 May 2000 10:59:55 -0400 Subject: [I18n-sig] Re: [XML-SIG] Re: Unicode debate In-Reply-To: Your message of "05 May 2000 09:21:41 BST." References: <200004271501.LAA13535@eric.cnri.reston.va.us> <3908F566.8E5747C@prescod.net> <200004281450.KAA16493@eric.cnri.reston.va.us> <390AEF1D.253B93EF@prescod.net> <200005011802.OAA21612@eric.cnri.reston.va.us> <390DEB45.D8D12337@prescod.net> <200005012132.RAA23319@eric.cnri.reston.va.us> <390E1F08.EA91599E@prescod.net> <200005020053.UAA23665@eric.cnri.reston.va.us> <200005031216.IAA03274@eric.cnri.reston.va.us> <200005041240.IAA08277@eric.cnri.reston.va.us> <00d901bfb5cb$a6cfd490$0500a8c0@secret.pythonware.com> Message-ID: <200005051459.KAA14218@eric.cnri.reston.va.us> [Moving this discussion to i18n-sig, where it belongs] > > from http://dev.scriptics.com/man/tcl8.3/TclLib/StringObj.htm > > > > Tcl_NewUnicodeObj(Tcl_UniChar* unicode, int numChars) > > > > Tcl_NewUnicodeObj and Tcl_SetUnicodeObj create a new > > object or modify an existing object to hold a copy of the > > Unicode string given by unicode and numChars. > > > > (Tcl_UniChar* is currently the same thing as Py_UNICODE*) > > Any way this can be exploited in Tkinter? Yes -- I just checked in a patch to _tkinter that uses Tcl_NewUnicodeObj() when a Unicode string is passed to Tcl (for Tcl 8.2 and later).
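[Editorial note: Guido's earlier point in this message -- that the Unicode type carries real character semantics -- is easy to check against the 1.6 API itself. The particular characters below are just illustrative picks from the Unicode database:]

    print u"\u2155".isnumeric()          # 1: VULGAR FRACTION ONE FIFTH
                                         # counts as numeric
    print u"\u0430".islower()            # 1: CYRILLIC SMALL LETTER A
    print repr(u"stra\u00dfe".upper())   # the sharp s survives unchanged:
                                         # the built-in case mapping is
                                         # 1:1 per character

And per his last paragraph, an explicit normalization method would be the home for language-independent equivalence; later Pythons grew exactly that as unicodedata.normalize().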
--Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Fri May 5 16:00:25 2000 From: guido@python.org (Guido van Rossum) Date: Fri, 05 May 2000 11:00:25 -0400 Subject: [I18n-sig] Unicode strings: an alternative In-Reply-To: Your message of "Fri, 05 May 2000 08:34:35 EDT." <14610.49115.820599.172598@cymru.basistech.com> References: <14610.49115.820599.172598@cymru.basistech.com> Message-ID: <200005051500.LAA14226@eric.cnri.reston.va.us> [Moving this discussion to i18n-sig, where it belongs] > Hmmmm... how often do you expect to compare narrow vs. wide strings, > using default comparison (i.e. = or !=)? What if I'm using Latin 3 and > use the byte comparison? I may very well have two strings (one narrow, > one wide) that compare equal, even though they're not. Not exactly > what I would expect. Thanks for this support of my ASCII proposal. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@python.org Fri May 5 16:05:36 2000 From: guido@python.org (Guido van Rossum) Date: Fri, 05 May 2000 11:05:36 -0400 Subject: [I18n-sig] Unicode debate In-Reply-To: Your message of "Fri, 05 May 2000 10:25:37 BST." References: <200004271501.LAA13535@eric.cnri.reston.va.us><3908F566.8E5747C@prescod.net><200004281450.KAA16493@eric.cnri.reston.va.us><390AEF1D.253B93EF@prescod.net><200005011802.OAA21612@eric.cnri.reston.va.us><390DEB45.D8D12337@prescod.net><200005012132.RAA23319@eric.cnri.reston.va.us><390E1F08.EA91599E@prescod.net><200005020053.UAA23665@eric.cnri.reston.va.us><200005031216.IAA03274@eric.cnri.reston.va.us> Message-ID: <200005051505.LAA14236@eric.cnri.reston.va.us> [Moving this discussion to i18n-sig, where it belongs] > At 11:02 PM +0200 04-05-2000, Fredrik Lundh wrote: > >Henry S. Thompson wrote: > >> I think I hear a moderate consensus developing that the 'ASCII > >> proposal' is a reasonable compromise given the time constraints. > > > >agreed. > > This makes no sense: implementing the 7-bit proposal takes more or less > the same time as implementing 8-bit downcasting. Or is it just the > bickering that's too time-consuming? ;-) Sort of. The 8-bit proposal has too much opposition, and other (possibly better) proposals would take too long to implement. The 7-bit proposal takes away the biggest problem with the current UTF-8 version (a character is always a byte -- a byte isn't always a character though) and doesn't back us into a corner we can't get out of later. > I worry that if the current implementation goes into 1.6 more or less as it > is now there's no way we can ever go back (before P3K). Or will Unicode > support be marked "experimental" in 1.6? This is not so much about the > 7-bit/8-bit proposal but about the dubious unicode() and unichr() functions > and the u"" notation: > > - unicode() only takes strings, so is effectively a method of the string type. Not true. It takes anything that supports the buffer interface: >>> from array import array >>> a = array('b', "hello world") >>> unicode(a) u'hello world' >>> The best way to look at it is to view unicode() as a constructor for Unicode objects. > - if narrow and wide strings are meant to be as similar as possible, > chr(256) should just return a wide char > - similarly, why is the u"" notation at all needed? Many extensions don't do the right thing with Unicode string objects, and there's not enough time to fix them all.
So my (indeed experimental and temporary -- at the worst until Py3k) solution is to require people to be explicit about when they want to use wide strings. Very similar to what Python does with 32-bit vs. long ints. Not ideal in the long run, and to be fixed in Py3k, but (in my view) unavoidable right now given that Python interfaces to so many real-world systems where the distinction is important. If in the future we become more automatic, we will support but ignore the u prefix on string literals, for backward compatibility -- just like we will support but ignore the L suffix on numeric literals once ints and longs have been unified. > The current design is more complex than needed, and still offers plenty of > surprises. Making it simpler (without integrating the two string types) is > not a huge effort. Seeing the wide string type as independent of Unicode > takes no physical effort at all, as it's just in our heads. What do you propose to make it simpler? Your last implementation proposal would require starting all over from scratch. > Fixing str() so it can return wide strings might be harder, and can wait > until later. Would be too bad, though. Agreed on both counts (harder, and too bad). --Guido van Rossum (home page: http://www.python.org/~guido/) From andy@reportlab.com Fri May 5 16:35:12 2000 From: andy@reportlab.com (Andy Robinson) Date: Fri, 5 May 2000 16:35:12 +0100 Subject: [I18n-sig] Unicode strings: an alternative References: <14610.49115.820599.172598@cymru.basistech.com> <200005051500.LAA14226@eric.cnri.reston.va.us> Message-ID: <002401bfb6a7$81fa9560$01ac2ac0@boulder> > Thanks for this support of my ASCII proposal. > > --Guido van Rossum (home page: http://www.python.org/~guido/) I missed most of the discussion due to a business trip. I'll just say that I am very happy with ASCII as the default. - Andy Robinson From guido@python.org Fri May 5 16:54:48 2000 From: guido@python.org (Guido van Rossum) Date: Fri, 05 May 2000 11:54:48 -0400 Subject: [I18n-sig] Perhaps the locale should matter? Message-ID: <200005051554.LAA14606@eric.cnri.reston.va.us> Here's a different idea that seeks a further compromise between the 7-bit and 8-bit camps. I just realized that the existence of methods like islower() etc. on Unicode really forces the encoding issue for Unicode strings -- these don't contain arbitrary sequences of 16-bit quantities, they contain Unicode characters, with some of the associated semantics. (How much is open to debate, see Fredrik's post about the four levels of Unicode conformance.) If we apply this to 8-bit strings, we see that the locale plays an important role. With the default ("C") locale, islower() etc. only take ASCII into account, everything else is not considered a letter or digit or space. However in many other locales (for the LC_CTYPE category), islower() etc. assume a specific character encoding! (This is all completely up to the C library's locale interpretation, Python doesn't add anything except an API.) I've only tested this for a few European locales; these all seem to assume Latin-1. I wonder if we could make the default conversion from 8-bit to Unicode depend on the locale? This would be a compromise between my ASCII proposal and the Latin-1 proposal. My reasoning is that the locale is an existing Python feature. Code that is broken when the locale differs from the default has been broken for a long time. We might not *like* a global setting for this kind of feature, but: "We've already got one!" [Imitates thick French accent.]
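[Editorial note: a sketch of what a locale-driven default could look like. Nothing below exists in 1.6 -- the table is a hypothetical stub, and it is exactly the locale-name-to-charset mapping whose existence is questioned next; locale.nl_langinfo(locale.CODESET) only appeared in later Pythons, and only on some platforms.]

    import locale

    _LOCALE_TO_ENCODING = {            # hypothetical and incomplete
        "de_DE": "latin-1",
        "fr_FR": "latin-1",
        "ru_RU": "koi8-r",             # or iso-8859-5; see the thread below
    }

    def guess_default_encoding():
        name = locale.setlocale(locale.LC_CTYPE)   # query, don't change
        if "." in name:
            return name.split(".")[1]  # XPG style: "de_DE.UTF-8"
        return _LOCALE_TO_ENCODING.get(name, "ascii")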
If the program explicitly set the locale, it is a clear signal that it is interested in manipulating characters in a particular locale, and we might as well honor this. Problem: I have no idea how to go from the locale setting (a two-character language abbreviation) to a specific character encoding -- but that might conceivably be a fixed table. --Guido van Rossum (home page: http://www.python.org/~guido/) From walter.doerwald@catsystems.de Fri May 5 17:05:05 2000 From: walter.doerwald@catsystems.de (Walter =?iso-8859-1?Q?D=F6rwald?=) Date: Fri, 05 May 2000 18:05:05 +0200 Subject: [I18n-sig] Unicode strings: an alternative References: <14610.49115.820599.172598@cymru.basistech.com> <200005051500.LAA14226@eric.cnri.reston.va.us> <002401bfb6a7$81fa9560$01ac2ac0@boulder> Message-ID: <3912F131.E1B6EA85@catsystems.de> Andy Robinson wrote: > > > Thanks for this support of my ASCII proposal. > > > > --Guido van Rossum (home page: http://www.python.org/~guido/) > > I missed most of the discussion due to a business trip. And I'm new to the discussion! ;) > I'll just say that I am very happy with ASCII as the default. It's better than UTF-8, but 8-bit Unicode would be better, because it's the least surprising alternative. People who use Python with "funny" languages are already used to converting their strings around, and they treat their Python strings as byte arrays anyway. With Python 1.6 they can start to switch to Python's Unicode strings without any problems. That isn't so with UTF-8. I wonder how it will work with ASCII. Will this ASCII restriction only be enforced when converting to Unicode, or will the string type itself be restricted to ASCII? IMHO the long-term goal should be to have only one string type (being Unicode) and one byte array type (being our current string type?) Bye, Walter Dörwald From guido@python.org Fri May 5 17:07:10 2000 From: guido@python.org (Guido van Rossum) Date: Fri, 05 May 2000 12:07:10 -0400 Subject: [I18n-sig] Unicode strings: an alternative In-Reply-To: Your message of "Fri, 05 May 2000 18:05:05 +0200." <3912F131.E1B6EA85@catsystems.de> References: <14610.49115.820599.172598@cymru.basistech.com> <200005051500.LAA14226@eric.cnri.reston.va.us> <002401bfb6a7$81fa9560$01ac2ac0@boulder> <3912F131.E1B6EA85@catsystems.de> Message-ID: <200005051607.MAA14657@eric.cnri.reston.va.us> > > I'll just say that I am very happy with ASCII as the default. > > It's better than UTF-8, but 8-bit Unicode would be better, because > it's the least surprising alternative. > > People who use Python with "funny" languages are already used to > converting their strings around, and they treat their Python > strings as byte arrays anyway. With Python 1.6 they can start > to switch to Python's Unicode strings without any problems. > That isn't so with UTF-8. I wonder how it will work with ASCII. > Will this ASCII restriction only be enforced when converting > to Unicode, or will the string type itself be restricted to > ASCII? No, 8-bit strings will always be 8-bit clear, of course! The ASCII restriction is only used for conversion to Unicode when no explicit encoding is given. For example, "abc" + u"xyz" is u"abcxyz", but "\350\351" + u"xyz" raises an exception. However you can write unicode("\350\351","latin-1") and it will yield u"\350\351". > IMHO the long-term goal should be to have only one string type > (being Unicode) and one byte array type (being our current string > type?) The byte array type should not support string literals at all. The Java model is right.
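[Editorial note: reconstructed as an interactive session, the rule Guido just stated looks like this; the exact exception text varies across versions, and \350\351 are the Latin-1 bytes for e-grave, e-acute.]

    >>> "abc" + u"xyz"                   # pure ASCII: implicit decoding is safe
    u'abcxyz'
    >>> "\350\351" + u"xyz"              # non-ASCII bytes: no silent guess
    Traceback (most recent call last):
      ...
    UnicodeError: ASCII decoding error: ordinal not in range(128)
    >>> unicode("\350\351", "latin-1")   # explicit encoding: fine
    u'\xe8\xe9'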
--Guido van Rossum (home page: http://www.python.org/~guido/) From walter.doerwald@catsystems.de Fri May 5 17:42:07 2000 From: walter.doerwald@catsystems.de (Walter =?iso-8859-1?Q?D=F6rwald?=) Date: Fri, 05 May 2000 18:42:07 +0200 Subject: [I18n-sig] Unicode strings: an alternative References: <14610.49115.820599.172598@cymru.basistech.com> <200005051500.LAA14226@eric.cnri.reston.va.us> <002401bfb6a7$81fa9560$01ac2ac0@boulder> <3912F131.E1B6EA85@catsystems.de> <200005051607.MAA14657@eric.cnri.reston.va.us> Message-ID: <3912F9DF.8D4DAE9D@catsystems.de> Guido van Rossum wrote: > [...] > > Will this ASCII restriction only be enforced when converting > > to Unicode, or will the string type itself be restricted to > > ASCII? > > No, 8-bit strings will always be 8-bit clear, of course! The ASCII > restriction is only used for conversion to Unicode when no explicit > encoding is given. For example, "abc" + u"xyz" is u"abcxyz", but "\350\351" > + u"xyz" raises an exception. Which has to be considered an "artificial" overflow error, an error that is raised because of the values of some object. > However you can write > unicode("\350\351","latin-1") and it will yield u"\350\351". I would like to be able to change the default encoding on a global scale. So when my terminal and keyboard support latin-1 I want to be able to specify that str() and repr() return latin-1 strings. The __str__ and __repr__ implemented by classes should return Unicode strings, which are converted to the system-global encoding by Python. > [...] Bye, Walter Dörwald From just@letterror.com Fri May 5 19:32:23 2000 From: just@letterror.com (Just van Rossum) Date: Fri, 5 May 2000 19:32:23 +0100 Subject: [I18n-sig] Unicode strings: an alternative In-Reply-To: <200005051500.LAA14226@eric.cnri.reston.va.us> References: Your message of "Fri, 05 May 2000 08:34:35 EDT." <14610.49115.820599.172598@cymru.basistech.com> <14610.49115.820599.172598@cymru.basistech.com> Message-ID: [Tom Emerson] > Hmmmm... how often do you expect to compare narrow vs. wide strings, > using default comparison (i.e. = or !=)? What if I'm using Latin 3 and > use the byte comparison? I may very well have two strings (one narrow, > one wide) that compare equal, even though they're not. Not exactly > what I would expect. True enough. The reason I don't mind this behavior is that I believe it's largely unavoidable, since in many cases the encoding is unknown (to the Python internals). E.g. I may very well have two _narrow_ strings that compare equal, even though they're not... Not exactly what you would expect, but there's nothing you can do about it. What I don't like about the 7-bit proposal is that it tries to protect me from something that should be my own responsibility. Imagine if the 7-bit proposal were used for narrow strings: >>> "\377" == "\377" Traceback (most recent call last): File "", line 1, in ? EncodingError: Not sure about that encoding, dude! ;-) I just saw Guido's latest idea (triggered by Peter Funk I suppose, who had some very good points): using the locale may indeed be a better compromise. Just From fw@deneb.cygnus.argh.org Fri May 5 17:35:18 2000 From: fw@deneb.cygnus.argh.org (Florian Weimer) Date: 05 May 2000 18:35:18 +0200 Subject: [I18n-sig] [PATCH] UTF-8 decoding: Fix handling of invalid byte sequences Message-ID: <87vh0t3qgp.fsf@deneb.cygnus.argh.org> --=-=-= Could you have a look at the following patch? It fixes a rather funny scoping problem with the continue statement, which results in more deterministic handling of invalid sequences.
In addition, the treatment of invalid characters in "replace" mode is improved: now, an incomplete or otherwise invalid UTF-8 sequence generates exactly one replacement character. As a result, the Python UTF-8 decoder now passes Markus Kuhn's UTF-8 stress test. (Shall I make a Python test out of it?) If there aren't any objections, I'll forward this patch through the official channels (if it's still necessary). --=-=-= Content-Type: text/x-patch Content-Disposition: attachment; filename=python-utf8.diff

Index: unicodeobject.c
===================================================================
RCS file: /projects/cvsroot/python/dist/src/Objects/unicodeobject.c,v
retrieving revision 2.18
diff -u -r2.18 unicodeobject.c
--- unicodeobject.c	2000/05/04 15:52:20	2.18
+++ unicodeobject.c	2000/05/05 15:57:53
@@ -534,7 +534,8 @@
 #define UTF8_ERROR(details) do {                          \
     if (utf8_decoding_error(&s, &p, errors, details))     \
         goto onError;                                     \
-    continue;                                             \
+    else                                                  \
+        goto nextCharacter;                               \
 } while (0)
 
 PyObject *PyUnicode_DecodeUTF8(const char *s,
@@ -559,7 +560,10 @@
     e = s + size;
     while (s < e) {
-        register Py_UNICODE ch = (unsigned char)*s;
+        register Py_UNICODE ch;
+
+    nextCharacter:
+        ch = (unsigned char)*s;
 
         if (ch < 0x80) {
             *p++ = ch;
@@ -583,29 +587,44 @@
             break;
 
         case 2:
-            if ((s[1] & 0xc0) != 0x80)
+            if ((s[1] & 0xc0) != 0x80) {
                 UTF8_ERROR("invalid data");
+            }
             ch = ((s[0] & 0x1f) << 6) + (s[1] & 0x3f);
-            if (ch < 0x80)
+            if (ch < 0x80) {
+                /* Skip rest of this sequence. */
+                s++;
                 UTF8_ERROR("illegal encoding");
-            else
+            } else
                 *p++ = ch;
             break;
 
        case 3:
            if ((s[1] & 0xc0) != 0x80 ||
-                (s[2] & 0xc0) != 0x80)
+                (s[2] & 0xc0) != 0x80) {
+                /* Skip character which likely belongs to this sequence. */
+                if ((s[1] & 0xc0) == 0x80) {
+                    s++;
+                }
                 UTF8_ERROR("invalid data");
+            }
             ch = ((s[0] & 0x0f) << 12) + ((s[1] & 0x3f) << 6) + (s[2] & 0x3f);
-            if (ch < 0x800 || (ch >= 0xd800 && ch < 0xe000))
+            if (ch < 0x800 || (ch >= 0xd800 && ch < 0xe000)) {
+                /* Skip rest of this sequence. */
+                s += 2;
                 UTF8_ERROR("illegal encoding");
-            else
+            } else
                 *p++ = ch;
             break;
 
         default:
             /* Other sizes are only needed for UCS-4 */
-            UTF8_ERROR("unsupported Unicode code range");
+            /* Skip over these characters. */
+            s++;
+            while (s < e && ((*s & 0xc0) == 0x80)) s++;
+            /* UTF8_ERROR will skip one character. */
+            s--;
+            UTF8_ERROR("unsupported Unicode code range");
         }
         s += n;
     }
--=-=-=--

From fw@deneb.cygnus.argh.org Fri May 5 17:13:42 2000 From: fw@deneb.cygnus.argh.org (Florian Weimer) Date: 05 May 2000 18:13:42 +0200 Subject: [I18n-sig] Unicode strings: an alternative In-Reply-To: Just van Rossum's message of "Fri, 5 May 2000 14:17:31 +0100" References: Message-ID: <8766st5615.fsf@deneb.cygnus.argh.org> Just van Rossum writes: > Good point. All this taken together still means to me that comparisons > between wide and narrow strings should take place at the character level, > which implies that coercion from narrow to wide is done at the character > level, without looking at the encoding. (Which in my book in turn still > implies that as long as we're talking about Unicode, narrow strings are > effectively Latin-1.) Sorry for jumping in, I've only recently discovered this list. :-/ At the moment, most of the computing world is not Latin-1 but Windows-12??. That's why I don't think this is a good idea at all. From fw@deneb.cygnus.argh.org Fri May 5 17:26:45 2000 From: fw@deneb.cygnus.argh.org (Florian Weimer) Date: 05 May 2000 18:26:45 +0200 Subject: [I18n-sig] Perhaps the locale should matter?
In-Reply-To: Guido van Rossum's message of "Fri, 05 May 2000 11:54:48 -0400" References: <200005051554.LAA14606@eric.cnri.reston.va.us> Message-ID: <87zoq53quy.fsf@deneb.cygnus.argh.org> Guido van Rossum writes: [Problem: implicit conversion from non-Unicode strings to Unicode, so that existing Python code doesn't break when fed with both Unicode and non-Unicode strings] > Problem: I have no idea how to go from the locale setting (a > two-character language abbreviation) to a specific character encoding > -- but that might conceivably be a fixed table. I like this idea a lot. There was some discussion on the Linux-UTF-8 list on this topic (I'm Cc:ing them). Current GNU libc development versions provide nl_langinfo() and understand locale settings like "LANG=de_DE.UTF-8". I don't know if this convention is widely used, though. If it isn't, there's probably nothing else, so you could use it anyway. ;) From Markus.Kuhn@cl.cam.ac.uk Fri May 5 20:59:02 2000 From: Markus.Kuhn@cl.cam.ac.uk (Markus Kuhn) Date: Fri, 05 May 2000 20:59:02 +0100 Subject: [I18n-sig] Perhaps the locale should matter? In-Reply-To: Your message of "05 May 2000 18:26:45 +0200." <87zoq53quy.fsf@deneb.cygnus.argh.org> Message-ID: Guido van Rossum writes: > Problem: I have no idea how to go from the locale setting (a > two-character language abbreviation) to a specific character encoding > -- but that might conceivably be a fixed table. Starting with glibc 2.2, you can ask for the encoding name with #include <langinfo.h> encoding_string = nl_langinfo(CODESET); as described on http://www.opengroup.org/onlinepubs/7908799/xsh/langinfo.h.html But are you really interested in the name of the encoding, or rather in the already Unicode-converted string? In this case, simply use the C library's wide character I/O functions getwc(), fwscanf(), etc. as described in http://www.unix-systems.org/version2/whatsnew/login_mse.html or http://www.cl.cam.ac.uk/~mgk25/volatile/ISO-C-FDIS.1999-04.pdf http://www.cl.cam.ac.uk/~mgk25/volatile/ISO-C-FDIS.1999-04.txt (section 7.24) and the locale-dependent conversion to Unicode will be done for you by the C library. Under glibc 2.2, wchar_t always contains UCS-4 values. Markus -- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: From guido@python.org Fri May 5 21:28:14 2000 From: guido@python.org (Guido van Rossum) Date: Fri, 05 May 2000 16:28:14 -0400 Subject: [I18n-sig] Perhaps the locale should matter? In-Reply-To: Your message of "Fri, 05 May 2000 20:59:02 BST." References: Message-ID: <200005052028.QAA14802@eric.cnri.reston.va.us> > Starting with glibc 2.2, you can ask for the encoding name with > > #include <langinfo.h> > > encoding_string = nl_langinfo(CODESET); > > as described on > > http://www.opengroup.org/onlinepubs/7908799/xsh/langinfo.h.html > > But are you really interested in the name of the encoding, or rather in > the already Unicode-converted string? Yes, we're really interested in the encoding name for 8-bit strings, because we already have our own decoders. Remember that Python runs on more than Linux systems and we can't incorporate GPL'ed code because our license needs to be more liberal. --Guido van Rossum (home page: http://www.python.org/~guido/) From haible@ilog.fr Fri May 5 21:52:39 2000 From: haible@ilog.fr (Bruno Haible) Date: Fri, 5 May 2000 22:52:39 +0200 (MET DST) Subject: [I18n-sig] Perhaps the locale should matter?
In-Reply-To: <87zoq53quy.fsf@deneb.cygnus.argh.org> References: <200005051554.LAA14606@eric.cnri.reston.va.us> <87zoq53quy.fsf@deneb.cygnus.argh.org> Message-ID: <200005052052.WAA22355@oberkampf.ilog.fr> Florian Weimer quotes Guido van Rossum : > > Problem: I have no idea how to go from the locale setting (a > > two-character language abbreviation) to a specific character encoding > > -- but that might conceivably be a fixed table. The recommended POSIX way is nl_langinfo(CODESET). But you have to hack around two system dependencies: 1. Some systems don't support it correctly: - FreeBSD 3.3 and SunOS 4 always return a NULL pointer. - Solaris 2.4 always returns an empty string. - Solaris 2.6 sometimes returns an empty string. - Linux libc5 and glibc 2.0.x don't have it at all. - glibc 2.1.x has it but only if you use -D_XOPEN_SOURCE. 2. Some systems return non-canonical names for encodings, e.g. Solaris returns "PCK" when it means Shift_JIS. Markus Kuhn writes: > But are you really interested in the name of the encoding, or rather in > the already Unicode-converted string? In this case, simply use the C > library's wide character I/O functions getwc(), fwscanf(), etc. This will be true for glibc 2.2, but is not portable. The wchar_t type is not guaranteed to be Unicode. On FreeBSD, indeed, it is not; it is locale-dependent. Bruno From Harald@Alvestrand.no Fri May 5 22:23:36 2000 From: Harald@Alvestrand.no (Harald Tveit Alvestrand) Date: Fri, 05 May 2000 23:23:36 +0200 Subject: [I18n-sig] Perhaps the locale should matter? In-Reply-To: <87zoq53quy.fsf@deneb.cygnus.argh.org> References: <200005051554.LAA14606@eric.cnri.reston.va.us> Message-ID: <4.3.1.2.20000505225502.033cdef8@dokka.kvatro.no> At 18:26 05.05.2000 +0200, Florian Weimer wrote: >I like this idea a lot. There was some discussion on the Linux-UTF-8 >list on this topic (I'm Cc:ing them). >Current GNU libc development versions provide nl_langinfo() and >understand locale settings like "LANG=de_DE.UTF-8". I don't know if >this convention is widely used, though. If it isn't, there's probably >nothing else, so you could use it anyway. ;) I believe the <language>_<territory>.<charset> convention is either part of POSIX or part of the ISO locales work. Keld Simonsen would know. (I found the <language>_<territory> convention in a draft for ISO 15897; that doc did not specify a .charset extension for it) Harald -- Harald Tveit Alvestrand, EDB Maxware, Norway Harald.Alvestrand@edb.maxware.no From haible@ilog.fr Fri May 5 23:20:05 2000 From: haible@ilog.fr (Bruno Haible) Date: Sat, 6 May 2000 00:20:05 +0200 (MET DST) Subject: [I18n-sig] Perhaps the locale should matter?
In-Reply-To: Bruno Haible's message of "Sat, 6 May 2000 00:20:05 +0200 (MET DST)" References: <200005051554.LAA14606@eric.cnri.reston.va.us> <87zoq53quy.fsf@deneb.cygnus.argh.org> <4.3.1.2.20000505225502.033cdef8@dokka.kvatro.no> <200005052220.AAA23666@oberkampf.ilog.fr> Message-ID: Bruno Haible writes: > > I believe the _. convention is either part of > > POSIX or part of the ISO locales work. Neither. This syntax comes from XPG. -- ---------------. drepper at gnu.org ,-. 1325 Chesapeake Terrace Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA Red Hat `--' drepper at redhat.com `------------------------ From keld@dkuug.dk Fri May 5 23:58:41 2000 From: keld@dkuug.dk (=?iso-8859-1?Q?Keld_J=F8rn_Simonsen?=) Date: Sat, 6 May 2000 00:58:41 +0200 Subject: [I18n-sig] Perhaps the locale should matter? In-Reply-To: <4.3.1.2.20000505225502.033cdef8@dokka.kvatro.no>; from Harald Tveit Alvestrand on Fri, May 05, 2000 at 11:23:36PM +0200 References: <87zoq53quy.fsf@deneb.cygnus.argh.org> <4.3.1.2.20000505225502.033cdef8@dokka.kvatro.no> Message-ID: <20000506005841.A3812@light.dkuug.dk> On Fri, May 05, 2000 at 11:23:36PM +0200, Harald Tveit Alvestrand wrote: > At 18:26 05.05.2000 +0200, Florian Weimer wrote: > >I like this idea a lot. There was some discussion on the Linux-UTF-8 > >list on this topic (I'm Cc:ing them). > >Current GNU libc development versions provide nl_langinfo() and > >understand locale settings like "LANG=de_DE.UTF-8". I don't know if > >this convention is widely used, though. If it isn't, there's probably > >nothing else, so you could use it anyway. ;) > > I believe the _. convention is either part of > POSIX or part of the ISO locales work. Keld Simonsen would > know. I believe this convention comes out of the Open Group. > (I found the _ convention in a draft for ISO 15897; that > doc did not specify a .charset extension for it) ISO 15897 uses another naming convention for charsets , namely _/ plus a number of other parameters. You can see a good approximization of what is the approved ISO standard in its first draft for revision at http://www.dkuug.dk/jtc1/sc22/wg20/ look then in the projects page or the standards page. Keld From just@letterror.com Sat May 6 09:32:56 2000 From: just@letterror.com (Just van Rossum) Date: Sat, 6 May 2000 09:32:56 +0100 Subject: [I18n-sig] Perhaps the locale should matter? In-Reply-To: <200005052220.AAA23666@oberkampf.ilog.fr> References: <4.3.1.2.20000505225502.033cdef8@dokka.kvatro.no> <200005051554.LAA14606@eric.cnri.reston.va.us> <87zoq53quy.fsf@deneb.cygnus.argh.org> <4.3.1.2.20000505225502.033cdef8@dokka.kvatro.no> Message-ID: At 12:20 AM +0200 06-05-2000, Bruno Haible wrote: >It is reasonably standardized. But it doesn't help Guido: When he is faced >with a locale named "ru" or "ru_RU", he wouldn't know whether its character >set is ISO-8859-5 or KOI8-R. Hm, is this the show stopper that it appears to be? I have no idea how the locale stuff works, nor how exactly it relates to standard C functions like islower() and toupper(), but I do know that these do the "right" thing on my platform. That is, they assume MacRoman, and work correctly with accented characters. Peter Funk's post reminded me of this -- there's probably lots of code out there that depends on it :-(. So far this has been the only argument that convinced me that the 8-bit/Latin-1 really *is* flawed (Guido should thank you, Peter! ;-). 
Now I'm not even sure that using the locale to aid narrow to wide conversion (and vv) is such a good idea -- even if it were possible. The 7-bit proposal may be the only wise choice after all. Just PS: has any progress been made to add an encoding pragma to source files? Or is this 1.7 stuff? PPSS: shouldn't u'\337'.upper() yield u'SS'? (\337 is the German "sharp s") From fw@deneb.cygnus.argh.org Sat May 6 09:03:34 2000 From: fw@deneb.cygnus.argh.org (Florian Weimer) Date: 06 May 2000 10:03:34 +0200 Subject: [I18n-sig] Perhaps the locale should matter? In-Reply-To: Just van Rossum's message of "Sat, 6 May 2000 09:32:56 +0100" References: <4.3.1.2.20000505225502.033cdef8@dokka.kvatro.no> <200005051554.LAA14606@eric.cnri.reston.va.us> <87zoq53quy.fsf@deneb.cygnus.argh.org> <4.3.1.2.20000505225502.033cdef8@dokka.kvatro.no> Message-ID: <878zxot8a1.fsf@deneb.cygnus.argh.org> Just van Rossum writes: > PPSS: shouldn't u'\337'.upper() yield u'SS'? (\337 is the German "sharp s") Yes, that's right. There are about one hundred special case mapping rules (some of them are even locale-dependent), covering ligatures (such as "ﬀ", "ﬁ", "ﬂ"), and characters for which there's no precomposed upper case form (like "ΐ", which becomes "Ϊ́"), or which are otherwise special. These case mapping rules are not normative, but it's a good idea to follow them anyway, I think. From Fredrik Lundh" Message-ID: <007801bfb837$302052c0$34aab5d4@hagrid> Guido van Rossum wrote: > I wonder if we could make the default conversion from 8-bit to Unicode > depend on the locale? This would be a compromise between my ASCII > proposal and the Latin-1 proposal. My reasoning is that the locale is > an existing Python feature. Code that is broken when the locale > differs from the default has been broken for a long time. We might > not *like* a global setting for this kind of feature, but: "We've > already got one!" [Imitates thick French accent.] well, I was going to suggest that we take that one away in 1.7... "Avoidance of locales is strongly encouraged." (from the Perl unicode docs) > If the program explicitly set the locale, it is a clear signal that it > is interested in manipulating characters in a particular locale, and > we might as well honor this. no time to elaborate, but here's what my (yet unpublished) "how to handle strings in 1.7" proposal says: -- "narrow" strings should assume unicode, and use unicode-aware replacements for the ctype operations (isspace, isdigit, etc). -- the locale should not control conversions between "narrow" and wide character strings. -- the locale should be used to install codecs on standard I/O streams and on the system API's (e.g. filenames), on Unix platforms (and compatibles). that is, the Unix locale is reduced to being a platform-specific way to tell Python what language/locale we're running under (the "dot charset" notation plus some simple heuristics is used to determine a default character set). for other platforms, use the platform-specific mechanisms for this (active code page, character set used by system font, etc). more later. From Fredrik Lundh" <200004271501.LAA13535@eric.cnri.reston.va.us><3908F566.8E5747C@prescod.net><200004281450.KAA16493@eric.cnri.reston.va.us><390AEF1D.253B93EF@prescod.net><200005011802.OAA21612@eric.cnri.reston.va.us><390DEB45.D8D12337@prescod.net><200005012132.RAA23319@eric.cnri.reston.va.us><390E1F08.EA91599E@prescod.net><200005020053.UAA23665@eric.cnri.reston.va.us><200005031216.IAA03274@eric.cnri.reston.va.us> <14610.46241.129977.642796@cymru.basistech.com> Message-ID: <010f01bfb83b$c6de4560$34aab5d4@hagrid> Tom Emerson wrote: > Just van Rossum writes: > > Good point.
All this taken together still means to me that comparisons > > between wide and narrow strings should take place at the character level, > > which implies that coercion from narrow to wide is done at the character > > level, without looking at the encoding. (Which in my book in turn still > > implies that as long as we're talking about Unicode, narrow strings are > > effectively Latin-1.) > > Only true if "wide" strings are encoded in UCS-2 or UCS-4. If "wide > characters" are Unicode, but stored in UTF-8 encoding, then you lose. why? if you're comparing byte arrays using different encodings, sure. if you're comparing characters, it'll work. I find it amazing that you're still stuck at the "visible encoding" level, despite everything that's been posted to these mailing lists over the last weeks. let's spell it out again: a "character" is NOT the same thing as a C char. > Hmmmm... how often do you expect to compare narrow vs. wide strings, > using default comparison (i.e. = or !=)? all the time -- much more often than I compare integers with long integers or floating point numbers. the idea of standardizing on strings of characters is to make narrow and wide strings interchangeable. just like you can mix standard and long integers in today's python, *despite* the fact that they're not using the same internal representation. > What if I'm using Latin 3 and use the byte comparison? if you have a byte array containing latin 3 encoded data, that's a byte array, not a string... > I may very well have two strings (one narrow, one wide) that > compare equal, even though they're not. if you decode both byte arrays to real strings and compare them, they will only compare equal if they are in fact equal... > Not exactly what I would expect. I think you're still not getting what we're talking about here. I suggest reading the W3C paper (http://www.w3.org/TR/charmod) once again: "It should be clear, however, that characters and bytes are very different entities that SHOULD NOT be confused: in general, the relationship is many-to-many." please follow their advice, and stop confusing characters and bytes. From guido@python.org Sun May 7 21:51:08 2000 From: guido@python.org (Guido van Rossum) Date: Sun, 07 May 2000 16:51:08 -0400 Subject: [I18n-sig] Perhaps the locale should matter? In-Reply-To: Your message of "Sun, 07 May 2000 17:16:13 +0200." <007801bfb837$302052c0$34aab5d4@hagrid> References: <200005051554.LAA14606@eric.cnri.reston.va.us> <007801bfb837$302052c0$34aab5d4@hagrid> Message-ID: <200005072051.QAA15791@eric.cnri.reston.va.us> [Fredrik] > no time to elaborate, but here's what my (yet unpublished) > "how to handle strings in 1.7" proposal says: [...] I'm looking forward to that proposal. (But I'm looking forward even more to the next sre snapshot!) Besides a new way of handling strings, we also need a new way of handling non-string raw data, i.e. byte arrays. For operations that read data from a file (or socket), we need ways to say whether we want it to return a string or a byte array. Lots of changes ahead... --Guido van Rossum (home page: http://www.python.org/~guido/) From mal@lemburg.com Mon May 8 09:01:20 2000 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 08 May 2000 10:01:20 +0200 Subject: [I18n-sig] [PATCH] UTF-8 decoding: Fix handling of invalid byte sequences References: <87vh0t3qgp.fsf@deneb.cygnus.argh.org> Message-ID: <39167450.DB8CA6@lemburg.com> Florian Weimer wrote: > > Could you have a look at the following patch?
From mal@lemburg.com Mon May 8 09:01:20 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 08 May 2000 10:01:20 +0200
Subject: [I18n-sig] [PATCH] UTF-8 decoding: Fix handling of invalid byte sequences
References: <87vh0t3qgp.fsf@deneb.cygnus.argh.org>
Message-ID: <39167450.DB8CA6@lemburg.com>

Florian Weimer wrote:
>
> Could you have a look at the following patch? It fixes a rather
> funny scoping problem with the continue statement, which results in
> more deterministic handling of invalid sequences. In addition, the
> treatment of invalid characters in "replace" mode is improved: now,
> an incomplete or otherwise invalid UTF-8 sequence generates exactly
> one replacement character. As a result, the Python UTF-8 decoder now
> passes Markus Kuhn's UTF-8 stress test. (Shall I make a Python test
> out of it?)
>
> If there aren't any objections, I'll forward this patch through the
> official channels (if it's still necessary).

Looks good, except that you should move the nextCharacter: label
right before the closing } of the while loop. Otherwise, the while()
condition won't be checked.

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
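The behaviour Florian's patch was after can be reproduced in any
recent CPython, with the caveat that the modern decoder follows the
Unicode "maximal subpart" recommendation rather than this exact
patch. A minimal sketch, using an illustrative truncated byte
sequence of our own choosing:

    >>> b"caf\xc3".decode("utf-8", "replace")   # truncated two-byte sequence
    'caf�'
    >>> b"caf\xc3".decode("utf-8")              # default "strict" mode raises
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 3: unexpected end of data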
From cyrus@garage.co.jp Wed May 10 15:13:30 2000
From: cyrus@garage.co.jp (Cyrus Shaoul)
Date: Wed, 10 May 2000 23:13:30 +0900
Subject: [I18n-sig] News from the perl world.
Message-ID: <39196E8A32.954ACYRUS@smtp.jp.interramp.com>

Cyrus here. (Been lurking and listening.)

I saw this page on the Perl.com site giving some info on the new
Unicode support in Perl 5.6.0:

http://www.perl.com/pub/2000/04/whatsnew.html

YMMV,

Cyrus

From ht@cogsci.ed.ac.uk Mon May 22 11:24:32 2000
From: ht@cogsci.ed.ac.uk (Henry S. Thompson)
Date: 22 May 2000 11:24:32 +0100
Subject: [I18n-sig] case insensitivity for Python 3K
Message-ID:

Without wanting to move the flamewar wrt this topic over here from
the main list, I'd observe that if there's any thought of moving
beyond ASCII for identifiers in Python 3K, then case insensitivity is
a potentially _very_ confusing and confounding move: case folding is
just not well-formed in lots of languages, including real obscure
ones such as French :-).

ht
--
Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
W3C Fellow 1999--2001, part-time member of W3C Team
2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
Fax: (44) 131 650-4587, e-mail: ht@cogsci.ed.ac.uk
URL: http://www.ltg.ed.ac.uk/~ht/

From mal@lemburg.com Mon May 22 11:56:28 2000
From: mal@lemburg.com (M.-A. Lemburg)
Date: Mon, 22 May 2000 12:56:28 +0200
Subject: [I18n-sig] case insensitivity for Python 3K
References:
Message-ID: <3929125C.907DF461@lemburg.com>

"Henry S. Thompson" wrote:
>
> Without wanting to move the flamewar wrt this topic over here from
> the main list, I'd observe that if there's any thought of moving
> beyond ASCII for identifiers in Python 3K, then case insensitivity is
> a potentially _very_ confusing and confounding move: case folding is
> just not well-formed in lots of languages, including real obscure
> ones such as French :-).

I haven't followed the discussion on c.l.p, but I'm pretty sure that
Python will not lose its case sensitivity w/r to identifiers in Py3K.

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/

From guido@python.org Mon May 22 17:37:06 2000
From: guido@python.org (Guido van Rossum)
Date: Mon, 22 May 2000 09:37:06 -0700
Subject: [I18n-sig] case insensitivity for Python 3K
In-Reply-To: Your message of "22 May 2000 11:24:32 BST."
References:
Message-ID: <200005221637.JAA07325@cj20424-a.reston1.va.home.com>

> Without wanting to move the flamewar wrt this topic over here from
> the main list, I'd observe that if there's any thought of moving
> beyond ASCII for identifiers in Python 3K, then case insensitivity is
> a potentially _very_ confusing and confounding move: case folding is
> just not well-formed in lots of languages, including real obscure
> ones such as French :-).

True, but Unicode defines rigorously how case mappings should be
done, and the Python Unicode support implements these as toupper()
and tolower(). We could just say "whatever these do is the language
definition".

So I don't see this as an extra argument against case folding (on
top of the excellent arguments that have already been given).

--Guido van Rossum (home page: http://www.python.org/~guido/)

From tim_one@email.msn.com Tue May 23 06:15:28 2000
From: tim_one@email.msn.com (Tim Peters)
Date: Tue, 23 May 2000 01:15:28 -0400
Subject: [I18n-sig] case insensitivity for Python 3K
In-Reply-To: <200005221637.JAA07325@cj20424-a.reston1.va.home.com>
Message-ID: <000001bfc475$e9408280$612d153f@tim>

[Henry S. Thompson]
> ... if there's any thought of moving beyond ASCII for
> identifiers in Python 3K, then case insensitivity is a
> potentially _very_ confusing and confounding move: case
> folding is just not well-formed in lots of languages,
> including real obscure ones such as French :-).

[Guido]
> True, but Unicode defines rigorously how case mappings should be
> done, and the Python Unicode support implements these as toupper()
> and tolower(). We could just say "whatever these do is the language
> definition".
>
> So I don't see this as an extra argument against case folding (on
> top of the excellent arguments that have already been given).

Except that Unicode's well-definedness is not the same as Henry's
well-formedness: the mappings may well surprise the heck out of
native writers, and don't forget about Unicode titlecase either
(which appears more relevant than Unicode uppercase for
VariablesWrittenInThisStyle). Kinda like floating point is rigorously
defined by IEEE-754, but that doesn't mean it's not agonizingly
surprising. You don't have a lot of choice about supporting fp in
some way, but deciding to conflate case is purely self-inflicted.

the-more-i-learn-of-the-world-the-more-i-cling-to-ascii-ly y'rs - tim
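As a closing footnote to Tim's point, the surprises he alludes to are
easy to reproduce. A minimal sketch, assuming a modern Python 3; the
Turkish and digraph examples are ours, not from the thread:

    >>> u"i".upper()    # right for English, wrong for Turkish,
    'I'                 # where the uppercase of i is the dotted U+0130
    >>> # and titlecase really is a third case: the single code point
    >>> # U+01C6 (the "dz with caron" digraph) uppercases and
    >>> # titlecases to two different characters
    >>> hex(ord(u"\u01c6".upper()))
    '0x1c4'
    >>> hex(ord(u"\u01c6".title()))
    '0x1c5'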