From mal@lemburg.com Fri Jun 1 09:10:08 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 01 Jun 2001 10:10:08 +0200 Subject: [I18n-sig] XML and UTF-16 References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> Message-ID: <3B174DE0.EFABF55E@lemburg.com> "Martin v. Loewis" wrote: > > > Yes, I think this would be a good idea. I would use something along > > the lines of: > > Please have a look at > xml.parsers.xmlproc.EntityParser.autodetect_encoding. This almost > follows the procedure in the XML recommendation, except that it does > not expect "unusual" byte orders (2134, 3412), and that it does not > detect EBCDIC. I don't have a file EntityParser in the xmlproc subdir... is that in CVS somewhere ? > > 0) Assume UTF-8. > > > > 1) Look for the UTF-16 and UTF-32 uniBOMs. If you find one, assume the > > appropriate transmission format and endian nature. Goto 4. > > > > 2) Look for the UTF-8 uniBOM, since some editors like putting that in. > > Ignore it and goto 4. > > I see this was added to the XML recommendation only in the second > edition, so it should also be added to xmlproc. > > > 3) Look for the sundry forms of '<?xml' > > with appropriate endian variants. If found, assume the detected > > encoding. Goto 4. > > Please note that ASCII is not detectable this way: If you see '<?xml' > then you don't know anything about the encoding except that you should > be able to parse the encoding= attribute successfully if present. I think that's what Tom had in mind here. Could we maybe have the function autodetect_encoding at some higher level in PyXML ?! This is a very basic API and doesn't only apply to xmlproc. I also think that it would be worthwhile adding a similar API to codecs.py which takes the magic ('<?xml') as argument and then tries to determine whether the input data is an ASCII superset, UTF-8 or UTF-16/32. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Fri Jun 1 2001 From: mal@lemburg.com (M.-A. Lemburg) Subject: [I18n-sig] XML and UTF-16 References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <3B16B5D4.730D8E30@ActiveState.com> Message-ID: <3B174E9F.4EDA2289@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > >... > > > > Perhaps we should have some smart auto-detection API somewhere > > which does this automagically ?! Something like > > > > guess_xml_encoding(data) -> encoding string > > > > It could work by looking at the first 256 bytes of the data > > string and then apply all the tricks needed to extract the > > encoding information (or default to UTF-8 if no such information > > is given). > > This might help: > > http://aspn.activestate.com/ASPN/Python/Cookbook/Recipe/52257 > > I think Lars has a version too... Could you clarify what the licensing conditions are for using code from your recipe collection ? Thanks, -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Fri Jun 1 09:17:04 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 01 Jun 2001 10:17:04 +0200 Subject: [I18n-sig] XML and UTF-16 References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B16B4DE.B0E8ADD4@ActiveState.com> Message-ID: <3B174F80.4D1E93FB@lemburg.com> Paul Prescod wrote: > > Tom Emerson wrote: > > > >... > > > > Yes. You can then pretty easily autodetect which Unicode > > transformation format is being used by looking at the first ten or > > so bytes. > > Actually, the first four bytes are sufficient to get you started. 
Then > you have to look at the encoding declaration if present. > > > If the BOM is present, that's a big clue right there. > > """Entities encoded in UTF-16 must begin with the Byte Order Mark > described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC > 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] > (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding > signature, not part of either the markup or the character data of the > XML document. XML processors must be able to use this character to > differentiate between UTF-8 and UTF-16 encoded documents.""" Where did you get that from ? Note that the Unicode specs have a different opinion on this... (a BOM mark is part of a protocol and should only be used if the encoding information is not available in some other form or implicit) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Fri Jun 1 13:59:37 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 1 Jun 2001 14:59:37 +0200 Subject: [I18n-sig] XML and UTF-16 In-Reply-To: <3B174DE0.EFABF55E@lemburg.com> (mal@lemburg.com) References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> Message-ID: <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> > > > Yes, I think this would be a good idea. I would use something along > > > the lines of: > > > > Please have a look at > > xml.parsers.xmlproc.EntityParser.autodetect_encoding. This almost > > follows the procedure in the XML recommendation, except that it does > > not expect "unusual" byte orders (2134, 3412), and that it does not > > detect EBCDIC. > > I don't have a file EntityParser in the xmlproc subdir... is > that in CVS somewhere ? Oops, missed one level of indirection: xml.parsers.xmlproc.xmlutils.EntityParser.autodetect_encoding And yes, the function is only in the CVS, not in a released version (yet). > Could we maybe have the function autodetect_encoding at > some higher level in PyXML ?! This is a very basic API and > doesn't only apply to xmlproc. We might (contributions are welcome). However, such a function would not necessarily be usable for xmlproc: xmlproc deals with reading data in small chunks, expecting that information may be broken at arbitrary boundaries. For example, would you expect that the autodetection function looks for the encoding= attribute? That may not be included in the first fragment of data. > I also think that it would be worthwhile adding a similar > API to codecs.py which takes the magic ('<?xml') > as argument and then tries to determine whether the input > data is an ASCII superset, UTF-8 or UTF-16/32. I don't think so. Doing the XML autodetection is not terribly complicated, and rarely needs to be done - you'd normally pass the byte stream to an XML parser, so you would not need to care about the encoding. As for XML and encodings, having a convenient mechanism to extend existing codecs to encode unknown characters as character entities is much more important, IMO, since that is very difficult to achieve with the existing API. 
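To illustrate how little code the autodetection needs, here is a rough sketch using the guess_xml_encoding name proposed earlier in this thread (the name is hypothetical — nothing like it exists in xmlproc or the standard library; the rare UCS-4 byte orders and EBCDIC are ignored, and the encoding= attribute of the XML declaration must still be checked afterwards):

    def guess_xml_encoding(data):
        # Provisional guess from the first four bytes only,
        # following Appendix F of the XML 1.0 recommendation.
        if data[:2] == '\xfe\xff':
            return 'utf-16-be'      # UTF-16 BOM, big endian
        if data[:2] == '\xff\xfe':
            return 'utf-16-le'      # UTF-16 BOM, little endian
        if data[:3] == '\xef\xbb\xbf':
            return 'utf-8'          # UTF-8 BOM (signature only)
        if data[:4] == '<?xm':
            return 'utf-8'          # really: any ASCII superset
        if data[:4] == '\x00<\x00?':
            return 'utf-16-be'      # '<?' without BOM, big endian
        if data[:4] == '<\x00?\x00':
            return 'utf-16-le'      # '<?' without BOM, little endian
        return 'utf-8'              # the default mandated by XML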
Regards, Martin From martin@loewis.home.cs.tu-berlin.de Fri Jun 1 14:06:11 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 1 Jun 2001 15:06:11 +0200 Subject: [I18n-sig] XML and UTF-16 In-Reply-To: <3B174F80.4D1E93FB@lemburg.com> (mal@lemburg.com) References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B16B4DE.B0E8ADD4@ActiveState.com> <3B174F80.4D1E93FB@lemburg.com> Message-ID: <200106011306.f51D6B000916@mira.informatik.hu-berlin.de> > > """Entities encoded in UTF-16 must begin with the Byte Order Mark > > described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC > > 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] > > (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding > > signature, not part of either the markup or the character data of the > > XML document. XML processors must be able to use this character to > > differentiate between UTF-8 and UTF-16 encoded documents.""" > > Where did you get that from ? That's from the XML recommendation, section 4.3.3. I really recommend that you get a copy of that document :-) > Note that the Unicode specs have a different opinion on this... (a > BOM mark is part of a protocol and should only be used if the > encoding information is not available in some other form or > implicit) Why is that different? XML says that the BOM is not part of the document, but an encoding signature. You say that it is part of a protocol - in the XML case, it is part of the encoding autodetection protocol. If the character was part of the document, any document containing it would be ill-formed, since the ZWNBSP is not allowed as the first character of an XML document (only whitespace and '<' are allowed, AFAICT). Regards, Martin From walter@livinglogic.de Fri Jun 1 14:58:09 2001 From: walter@livinglogic.de (Walter Doerwald) Date: Fri, 01 Jun 2001 15:58:09 +0200 Subject: [I18n-sig] XML and UTF-16 In-Reply-To: <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> Message-ID: <200106011558090859.0148DB03@mail.livinglogic.de> On 01.06.01 at 14:59 Martin v. Loewis wrote: > [...] > As for XML and encodings, having a convenient mechanism to extend > existing codecs to encode unknown characters as character entities is > much more important, IMO, since that is very difficult to achieve with > the existing API. I've written such functions: - escapeText(S, encoding) -> unicode Return a copy of the unicode string S, where every occurrence of '<', '>' and '&' and all unencodable characters in the specified encoding have been replaced with their XML character entities. - escapeAttr(S, encoding) -> unicode Return a copy of the unicode string S, where every occurrence of '<', '>', '&', and '\"' and all unencodable characters in the specified encoding have been replaced with their XML character entities. Although these functions are written in C, they have to call the codec twice for every single character (if encoding the string in one go fails), so they are rather slow for codecs implemented in Python (a rough pure-Python equivalent of escapeText is sketched below). Could this be used until we get codecs with customizable error handling? 
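A rough pure-Python equivalent of escapeText, for illustration only (the real functions are in C and use a slightly different two-calls-per-character strategy; this simplified version calls the codec once per character after a whole-string attempt would fail):

    def escapeText(s, encoding):
        # Replace '<', '>', '&' and every character the codec cannot
        # encode with the corresponding entity/character reference.
        result = u''
        for c in s:
            if c == u'<':
                result = result + u'&lt;'
            elif c == u'>':
                result = result + u'&gt;'
            elif c == u'&':
                result = result + u'&amp;'
            else:
                try:
                    c.encode(encoding)   # probe: is c encodable?
                    result = result + c
                except UnicodeError:
                    result = result + u'&#%d;' % ord(c)
        return result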
If yes, I could put them as a patch on python.sf.net or pyxml.sf.net or mail them to Martin. Bye, Walter Dörwald -- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de From mal@lemburg.com Fri Jun 1 14:57:11 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 01 Jun 2001 15:57:11 +0200 Subject: [I18n-sig] XML and UTF-16 References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B16B4DE.B0E8ADD4@ActiveState.com> <3B174F80.4D1E93FB@lemburg.com> <200106011306.f51D6B000916@mira.informatik.hu-berlin.de> Message-ID: <3B179F37.8AEE7D55@lemburg.com> "Martin v. Loewis" wrote: > > > > """Entities encoded in UTF-16 must begin with the Byte Order Mark > > > described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC > > > 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] > > > (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding > > > signature, not part of either the markup or the character data of the > > > XML document. XML processors must be able to use this character to > > > differentiate between UTF-8 and UTF-16 encoded documents.""" > > > > Where did you get that from ? > > That's from the XML recommendation, section 4.3.3. I really recommend > that you get a copy of that document :-) Just did... :) > > Note that the Unicode specs have a different opinion on this... (a > > BOM mark is part of a protocol and should only be used if the > > encoding information is not available in some other form or > > implicit) > > Why is that different? XML says that the BOM is not part of the > document, but an encoding signature. You say that it is part of a > protocol - in the XML case, it is part of the encoding autodetection > protocol. > > If the character was part of the document, any document containing it > would be ill-formed, since the ZWNBSP is not allowed as the first > character of an XML document (only whitespace and '<' are allowed, > AFAICT). In that sense you are right. I was under the impression that the quoted text was talking about UTF-16 documents in general (not just XML docs). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Fri Jun 1 22:10:50 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 01 Jun 2001 23:10:50 +0200 Subject: [I18n-sig] Encoding auto-detection References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> Message-ID: <3B1804DA.8C48861E@lemburg.com> "Martin v. Loewis" wrote: > > > I also think that it would be worthwhile adding a similar > > API to codecs.py which takes the magic ('<?xml') > > as argument and then tries to determine whether the input > > data is an ASCII superset, UTF-8 or UTF-16/32. > > I don't think so. Doing the XML autodetection is not terribly > complicated, and rarely needs to be done - you'd normally pass the > byte stream to an XML parser, so you would not need to care about the > encoding. I was talking about a general purpose encoding sniffer, the XML case would only be a special case. The idea is to pass a magic string to the API and then let it fiddle around with it to try to deduce the encoding. 
The magic string might also be a regular expression which then has the encoding parameter as group 1, etc. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Fri Jun 1 22:23:02 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 01 Jun 2001 23:23:02 +0200 Subject: [I18n-sig] XML and codecs References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> Message-ID: <3B1807B6.11ED32B9@lemburg.com> "Martin v. Loewis" wrote: > > As for XML and encodings, having a convenient mechanism to extend > existing codecs to encode unknown characters as character entities is > much more important, IMO, since that is very difficult to achieve with > the existing API. Until we've found a backward compatible way to fix this, how about adding a new error handling scheme which at least gives the caller enough information to do some smart processing on the input and output, e.g. errors="break": raise a UnicodeBreakError with argument (reason, error_position_in_input, work_done_so_far) The caller could then use the information returned by the codec to fix the input data and reuse the already encoded/decoded data to avoid duplicate work. This scheme is very simple, but also very effective, since it allows complex error processing to be done in the namespace where the data is being processed (rather than in a callback which wouldn't have access to this namespace). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Fri Jun 1 23:17:32 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sat, 2 Jun 2001 00:17:32 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: <3B1807B6.11ED32B9@lemburg.com> (mal@lemburg.com) References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1807B6.11ED32B9@lemburg.com> Message-ID: <200106012217.f51MHWR01771@mira.informatik.hu-berlin.de> > Until we've found a backward compatible way to fix this, how > about adding a new error handling scheme which at least gives > the caller enough information to do some smart processing on the > input and output, e.g. > > errors="break": > > raise a UnicodeBreakError with argument > (reason, error_position_in_input, work_done_so_far) That is good enough, IMO, so let's do it. I think we also need a few well-defined reasons, in particular UnicodeBreakError.CannotConvert # character not supported in target # character set UnicodeBreakError.OutOfData # input string stops in the middle # of a character The latter case deals with the nasty problem of UTF-8 input which breaks if your file.read() call happens to split a UTF-8 sequence. Of course, the well-known reasons could be subclasses, too. 
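A sketch of how a caller might use this scheme — everything here (UnicodeBreakError, the 'break' error mode, the reason constants) is the proposal under discussion, not an existing Python API:

    class UnicodeBreakError(UnicodeError):
        # Proposed well-defined reasons (could also be subclasses):
        CannotConvert = 'cannot convert'  # character not in target set
        OutOfData = 'out of data'         # input stops mid-character

    def encode_with_charrefs(text, encoding):
        # Encode text, replacing unencodable characters with XML
        # character references, reusing the work already done.
        parts = []
        while text:
            try:
                parts.append(text.encode(encoding, 'break'))
                break
            except UnicodeBreakError, (reason, position, done):
                if reason != UnicodeBreakError.CannotConvert:
                    raise
                parts.append(done)                       # keep prior work
                parts.append('&#%d;' % ord(text[position]))
                text = text[position + 1:]               # resume after it
        return ''.join(parts)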
Regards, Martin From martin@loewis.home.cs.tu-berlin.de Fri Jun 1 23:12:14 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sat, 2 Jun 2001 00:12:14 +0200 Subject: [I18n-sig] Encoding auto-detection In-Reply-To: <3B1804DA.8C48861E@lemburg.com> (mal@lemburg.com) References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1804DA.8C48861E@lemburg.com> Message-ID: <200106012212.f51MCEN01482@mira.informatik.hu-berlin.de> > I was talking about a general purpose encoding sniffer, the XML > case would only be a special case. The idea is to pass a magic > string to the API and then let it fiddle around with it to try > to deduce the encoding. The magic string might also be a regular > expression which then has the encoding parameter as group 1, etc. I see. For a general purpose encoding guesser to be useful, it would work totally differently from the XML autodetection. E.g. UTF-8 can be detected quite reliably, but you'll have to look at the entire input. In general, I think encoding auto-detection is a stupid idea; you really have to have a higher-level protocol that tells you what the encoding is. Trying Unicode-encodings-autodetection might be more successful, but I still think it is quite pointless: I predict that UTF-16 or UTF-32 will be quite rare, and that most Unicode text will be exchanged as UTF-8. In addition, unless you are writing a general-purpose text editor, there *will* be a higher-level protocol telling you the encoding. Regards, Martin From paulp@ActiveState.com Sat Jun 2 00:07:59 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 01 Jun 2001 16:07:59 -0700 Subject: [I18n-sig] Encoding auto-detection References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1804DA.8C48861E@lemburg.com> <200106012212.f51MCEN01482@mira.informatik.hu-berlin.de> Message-ID: <3B18204F.82B991F7@ActiveState.com> "Martin v. Loewis" wrote: > >... > > I see. For a general purpose encoding guesser to be useful, it would > work totally differently from the XML autodetection. Agreed. They should be treated as two different problems. >... > In general, I think encoding auto-detection is a stupid idea; you > really have to have a higher-level protocol that tells you what the > encoding is. These protocols are very unreliable. I often see data served from a website as application/octet-stream no matter what its real data type is. > ... Trying Unicode-encodings-autodetection might be more > successful, but I still think it is quite pointless: I predict that > UTF-16 or UTF-32 will be quite rare, and that most Unicode text will > be exchanged as UTF-8. On Windows, if you save a file as "Unicode", it means UTF-16. I think that UTF-16 is Microsoft's "standard" Unicode encoding. UTF-8 could be considered Unix's "standard" encoding. I don't think you should treat it as either-or. Autodetection is not as good as really knowing for sure, of course. That doesn't mean that it is *stupid*. 
It means it is the best fallback available when dealing with stupid systems like the Unix file system or misconfigured web servers. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From tree@basistech.com Sat Jun 2 00:43:41 2001 From: tree@basistech.com (Tom Emerson) Date: Fri, 1 Jun 2001 19:43:41 -0400 Subject: [I18n-sig] Encoding auto-detection In-Reply-To: <200106012212.f51MCEN01482@mira.informatik.hu-berlin.de> References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1804DA.8C48861E@lemburg.com> <200106012212.f51MCEN01482@mira.informatik.hu-berlin.de> Message-ID: <15128.10413.377254.142035@cymru.basistech.com> Martin v. Loewis writes: > In general, I think encoding auto-detection is a stupid idea; you > really have to have a higher-level protocol that tells you what the > encoding is. This is a utopian idea that completely falls apart in the real world. It is *very* common for email to be sent making use of both 8-bit and 7-bit encodings with no content-type or content-transfer-encoding. Without some form of encoding/character set detection you have no idea what the mail message is encoded with. The fact that the mail RFCs dictate something is irrelevant. Similarly you can almost never trust the character encoding specified for web pages. I have seen a lot of pages that claim to be using CP1252 or ISO-8859-1 that are actually encoded with Shift-JIS or EUC-CN or Big 5. Indeed, when I was working on the Device Mosaic browser (the descendant of NCSA Mosaic that was targeted for embedded devices) if we found a document claiming to be Latin-1 we ignored it and sniffed the encoding. It is also common to find pages in Japan, China, and Korea that don't specify a character set or encoding at all... the authors make assumptions about the people viewing the pages, which may be false. I have also seen Japanese pages that contain Shift-JIS *and* EUC-JP encoded characters in the *same* document. Higher level protocols cannot be believed. -tree > Trying Unicode-encodings-autodetection might be more > successful, but I still think it is quite pointless: I predict that > UTF-16 or UTF-32 will be quite rare, and that most Unicode text will > be exchanged as UTF-8. On Unix. This isn't necessarily true on other platforms. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From martin@loewis.home.cs.tu-berlin.de Sat Jun 2 07:59:35 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Sat, 2 Jun 2001 08:59:35 +0200 Subject: [I18n-sig] Encoding auto-detection In-Reply-To: <15128.10413.377254.142035@cymru.basistech.com> (message from Tom Emerson on Fri, 1 Jun 2001 19:43:41 -0400) References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1804DA.8C48861E@lemburg.com> <200106012212.f51MCEN01482@mira.informatik.hu-berlin.de> <15128.10413.377254.142035@cymru.basistech.com> Message-ID: <200106020659.f526xZM01136@mira.informatik.hu-berlin.de> > It is *very* common for email to be sent making use of both 8-bit and > 7-bit encodings with no content-type or content-transfer-encoding. I think this claim is difficult to support with facts. Of the messages I receive, most do have a MIME header, giving a charset in their content. > Indeed, when I was working on the Device Mosaic browser (the > descendant of NCSA Mosaic that was targeted for embedded devices) > if we found a document claiming to be Latin-1 we ignored it and > sniffed the encoding. That might be a useful thing to do, but I guess the routine you've been using was way more complex than what MAL suggested for the standard library. I doubt you can reliably detect Big 5 by looking at the first 10 or so bytes of an HTML document. In fact, I'd suggest that HTML encoding detection is yet again different from general-purpose encoding detection, since you'll have to take the declared encoding (if any) into account. > Higher level protocols cannot be believed. And neither can autodetection. Regards, Martin From mal@lemburg.com Sat Jun 2 12:24:05 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 02 Jun 2001 13:24:05 +0200 Subject: [I18n-sig] Encoding auto-detection References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1804DA.8C48861E@lemburg.com> <200106012212.f51MCEN01482@mira.informatik.hu-berlin.de> <15128.10413.377254.142035@cymru.basistech.com> Message-ID: <3B18CCD5.8EBF8546@lemburg.com> Tom Emerson wrote: > > Martin v. Loewis writes: > > In general, I think encoding auto-detection is a stupid idea; you > > really have to have a higher-level protocol that tells you what the > > encoding is. > > This is a utopian idea that completely falls apart in the real world. That's why I need such a function... first for XML and then for other files having some standard magic prepended to them. The reason for this is simple: even if a protocol defines which encoding to use, this is not necessarily respected in input data. The usual thing to do is first to try to decode the data into Unicode using the given encoding, then to analyse the data and try the guessed encoding and only then to reject the data as false input. Without the second step there would be far too many instances of data being rejected due to wrong encoding information, e.g. a common situation for XML is that XML files use Latin-1 in the body but forget to declare the encoding in the XML header. The parser will then default to UTF-8 and fail to read the data. 
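In rough code, the recovery procedure looks like this (all names here are illustrative; guess_encoding stands in for whatever sniffer is used):

    def decode_with_fallback(data, declared_encoding, guess_encoding):
        # 1. Trust the declared (protocol-level) encoding first.
        try:
            return unicode(data, declared_encoding)
        except (UnicodeError, LookupError):
            pass
        # 2. Analyse the data and try the guessed encoding instead.
        guessed = guess_encoding(data)
        if guessed and guessed != declared_encoding:
            try:
                return unicode(data, guessed)
            except UnicodeError:
                pass
        # 3. Only then reject the data as false input.
        raise UnicodeError('data does not decode with any known encoding')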
You have a similar situation for data which originated in parts of the world where more than one encoding is in common use, e.g. Russia or Asia. Input data generated by humans should always be treated with care ;-) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Sat Jun 2 12:26:14 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 02 Jun 2001 13:26:14 +0200 Subject: [I18n-sig] XML and codecs References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1807B6.11ED32B9@lemburg.com> <200106012217.f51MHWR01771@mira.informatik.hu-berlin.de> Message-ID: <3B18CD56.5BFDE77F@lemburg.com> "Martin v. Loewis" wrote: > > > Until we've found a backward compatible way to fix this, how > > about adding a new error handling scheme which at least gives > > the caller enough information to do some smart processing on the > > input and output, e.g. > > > > errors="break": > > > > raise a UnicodeBreakError with argument > > (reason, error_position_in_input, work_done_so_far) > > That is good enough, IMO, so let's do it. Ok. > I think we also need a few > well-defined reasons, in particular > > UnicodeBreakError.CannotConvert # character not supported in target > # character set > UnicodeBreakError.OutOfData # input string stops in the middle > # of a character > > The latter case deals with the nasty problem of UTF-8 input which > breaks if your file.read() call happens to split a UTF-8 sequence. > Of course, the well-known reasons could be subclasses, too. Fine. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From cyrus@garage.co.jp Sat Jun 2 13:33:35 2001 From: cyrus@garage.co.jp (Cyrus Shaoul) Date: Sat, 02 Jun 2001 08:33:35 -0400 Subject: Re[2]: [I18n-sig] Encoding auto-detection In-Reply-To: <15128.10413.377254.142035@cymru.basistech.com> References: <200106012212.f51MCEN01482@mira.informatik.hu-berlin.de> <15128.10413.377254.142035@cymru.basistech.com> Message-ID: <20010602082927.ABD5.CYRUS@garage.co.jp> I have to agree with Tom. If there is room for human error, there will be lots of errors. I have personally seen many CGI scripts that have been sent data in unexpected encodings by buggy browsers. These browsers are still in use (e.g. IE 3.0), and I bet some future browser will contain a similar bug. Just my .02, Cyrus > > This is a utopian idea that completely falls apart in the real world. > > It is *very* common for email to be sent making use of both 8-bit and > 7-bit encodings with no content-type or content-transfer-encoding. > Without some form of encoding/character set detection you have no idea > what the mail message is encoded with. The fact that the mail RFCs > dictate something is irrelevant. > > Similarly you can almost never trust the character encoding specified > for web pages. I have seen a lot of pages that claim to be using > CP1252 or ISO-8859-1 that are actually encoded with Shift-JIS or > EUC-CN or Big 5. 
Indeed, when I was working on the Device Mosaic > browser (the descendant of NCSA Mosaic that was targeted for > embedded devices) if we found a document claiming to be Latin-1 we > ignored it and sniffed the encoding. > > It is also common to find pages in Japan, China, and Korea that don't > specify a character set or encoding at all... the authors make > assumptions about the people viewing the pages, which may be false. I > have also seen Japanese pages that contain Shift-JIS *and* EUC-JP > encoded characters in the *same* document. > > Higher level protocols cannot be believed. > > -tree > From tree@basistech.com Sat Jun 2 18:10:30 2001 From: tree@basistech.com (Tom Emerson) Date: Sat, 2 Jun 2001 13:10:30 -0400 Subject: [I18n-sig] Encoding auto-detection In-Reply-To: <200106020659.f526xZM01136@mira.informatik.hu-berlin.de> References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1804DA.8C48861E@lemburg.com> <200106012212.f51MCEN01482@mira.informatik.hu-berlin.de> <15128.10413.377254.142035@cymru.basistech.com> <200106020659.f526xZM01136@mira.informatik.hu-berlin.de> Message-ID: <15129.7686.306629.523526@cymru.basistech.com> Martin v. Loewis writes: > > It is *very* common for email to be sent making use of both 8-bit and > > 7-bit encodings with no content-type or content-transfer-encoding. > > I think this claim is difficult to support with facts. Of the messages I > receive, most do have a MIME header, giving a charset in their > content. I am a computational linguist --- part of the work I've been doing over the last year is an email corpus, built from messages coming from a number of mailing lists from over thirteen countries. With over 21K messages and 60+ MB of text, my experience has been that many of these messages lack any indication of character set or encoding. I'll write a script to spin through the headers and determine how many conform to the standard RFCs, and how many actually include charset information either in the header or in a MIME body. > That might be a useful thing to do, but I guess the routine you've > been using was way more complex than what MAL suggested for the > standard library. I doubt you can reliably detect Big 5 by looking at > the first 10 or so bytes of an HTML document. You can't reliably detect much of anything by looking at the first 10 bytes of a document, unless in a very constrained domain like the character set detection that spawned this thread. So we agree. > > Higher level protocols cannot be believed. > > And neither can autodetection. That's right... I didn't mean to imply that it could. But the two together can be quite useful, and if you have enough text, autodetection can be quite accurate. The problem, of course, is that most text on the web contains a lot of English as well as other languages. -- Tom Emerson Basis Technology Corp. Sr. 
Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From walter@livinglogic.de Tue Jun 5 09:39:04 2001 From: walter@livinglogic.de (Walter Doerwald) Date: Tue, 05 Jun 2001 10:39:04 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: <3B1807B6.11ED32B9@lemburg.com> References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1807B6.11ED32B9@lemburg.com> Message-ID: <200106051039040859.000CF3EB@mail.livinglogic.de> On 01.06.01 at 23:23 M.-A. Lemburg wrote: > "Martin v. Loewis" wrote: > > > > As for XML and encodings, having a convenient mechanism to extend > > existing codecs to encode unknown characters as character entities is > > much more important, IMO, since that is very difficult to achieve with > > the existing API. > > Until we've found a backward compatible way to fix this, how > about adding a new error handling scheme which at least gives > the caller enough information to do some smart processing on the > input and output, e.g. > > errors="break": > > raise a UnicodeBreakError with argument > (reason, error_position_in_input, work_done_so_far) > > The caller could then use the information returned > by the codec to fix the input data and reuse the already > encoded/decoded data to avoid duplicate work. How would UTF-16 be handled? I guess without additional code multiple BOMs would be generated for a string that contains unencodable characters. > This scheme is very simple, but also very effective, since > it allows complex error processing to be done in the > namespace where the data is being processed (rather than > in a callback which wouldn't have access to this namespace). A callback could be a class instance with a __call__ method and so can have as much state information as it needs. Bye, Walter Dörwald -- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de From mal@lemburg.com Tue Jun 5 10:02:37 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 05 Jun 2001 11:02:37 +0200 Subject: [I18n-sig] XML and codecs References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1807B6.11ED32B9@lemburg.com> <200106051039040859.000CF3EB@mail.livinglogic.de> Message-ID: <3B1CA02D.71C4A6EB@lemburg.com> Walter Doerwald wrote: > > On 01.06.01 at 23:23 M.-A. Lemburg wrote: > > > "Martin v. Loewis" wrote: > > > > > > As for XML and encodings, having a convenient mechanism to extend > > > existing codecs to encode unknown characters as character entities is > > > much more important, IMO, since that is very difficult to achieve with > > > the existing API. > > > > Until we've found a backward compatible way to fix this, how > > about adding a new error handling scheme which at least gives > > the caller enough information to do some smart processing on the > > input and output, e.g. 
> > > > errors="break": > > > > raise a UnicodeBreakError with argument > > (reason, error_position_in_input, work_done_so_far) > > > > The caller could then use the information returned > > by the codec to fix the input data and reuse the already > > encoded/decoded data to avoid duplicate work. > > How would UTF-16 be handled? I guess without additional > code multiple BOMs would be generated for a string that > contains unencodable characters. Why ? You should know from the context which byte order is in use and can thus use the appropriate codec, UTF-16-LE or -BE. These don't generate BOMs. > > This scheme is very simple, but also very effective, since > > it allows complex error processing to be done in the > > namespace where the data is being processed (rather than > > in a callback which wouldn't have access to this namespace). > > A callback could be a class instance with a __call__ method > and so can have as much state information as it needs. Sure, but it breaks the current API completely. The above mechanism is different in that the communication in the error case is done by means of an exception. While this is not as fast as a callback it does have some advantages: * you can write the error handling code in the context using the codec * it enables you to write error handling code at higher levels in the calling stack * it fits in with the current API -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Tue Jun 5 18:26:15 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 5 Jun 2001 19:26:15 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: "walter@livinglogic.de"'s message of Tue, 05 Jun 2001 10:39:04 +0200 Message-ID: <200106051726.f55HQFY01124@mira.informatik.hu-berlin.de> > How would UTF-16 be handled? I guess without additional > code multiple BOMs would be generated for a string that > contains unencodable characters. When you generate or decode UTF-16, this is not a problem: There won't be any unencodable characters. Even if that was a problem: Just by raising the exception, there won't be multiple BOMs. So you have to provide additional code, anyway, so you better make sure this code is correct. The problem becomes real for codecs that preserve state: You'll need to maintain the state of the codec from the time the exception occurred, so that subsequent .encode calls will continue in the shift state they were in previously. So for codecs that preserve state across .encode calls, codecs.lookup will need to return a bound method as encode and decode function, not a simple function; see the iconv codec for an example. In some sense, one can argue that the UTF-16 codec also preserves state: whether it has yet emitted a BOM. Regards, Martin From mal@lemburg.com Tue Jun 5 19:01:52 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 05 Jun 2001 20:01:52 +0200 Subject: [I18n-sig] XML and codecs References: <200106051726.f55HQFY01124@mira.informatik.hu-berlin.de> Message-ID: <3B1D1E90.B3802D5E@lemburg.com> "Martin v. Loewis" wrote: > > > How would UTF-16 be handled? I guess without additional > > code multiple BOMs would be generated for a string that > > contains unencodable characters. > > When you generate or decode UTF-16, this is not a problem: There won't > be any unencodable characters. 
> > Even if that was a problem: Just by raising the exception, there won't > be multiple BOMs. So you have to provide additional code, anyway, so > you better make sure this code is correct. > > The problem becomes real for codecs that preserve state: You'll need > to maintain the state of the codec from the time the exception > occurred, so that subsequent .encode calls will continue in the shift > state they were in previously. Should be no problem since the exception will sort of freeze the current state of the codec (provided it's a StreamWriter/Reader) and let you use this state to take appropriate actions. > So for codecs that preserve state across .encode calls, codecs.lookup > will need to return a bound method as encode and decode function, not > a simple function; see the iconv codec for an example. Not sure what you mean here, but the encoder and decoder returned by codecs.lookup() must not maintain state. This property is reserved for StreamWriters and Readers (see the Unicode docs). > In some sense, one can argue that the UTF-16 codec also preserves > state: whether it has yet emitted a BOM. BTW, I haven't yet had time to check your utf16 patch but from a first glance it looks good. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Tue Jun 5 19:58:51 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 5 Jun 2001 20:58:51 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: "mal@lemburg.com"'s message of Tue, 05 Jun 2001 20:01:52 +0200 Message-ID: <200106051858.f55IwpU01510@mira.informatik.hu-berlin.de> > Should be no problem since the exception will sort of freeze > the current state of the codec (provided it's a StreamWriter/Reader) > and let you use this state to take appropriate actions. What do you mean: "provided it's a StreamReader/Writer". What if I invoke the encode method found in codec lookup, and get an exception? The exception does not carry the state. Suppose you encode into JIS X 0201. That has four shift states: CHARSETS = { "\033(B": US_ASCII, "\033(J": JISX0201_1976, "\033$@": JISX0208_1978, "\033$B": JISX0208_1983, } Depending on which of the escape codes you've emitted last, the following bytes will have different meanings. Now, suppose we encode a string that cannot be translated to JIS X 0201. The codec will raise an exception, telling us how many bytes it has encoded. Now, suppose we want to replace this character with the string "&#9898;". If we are in the US_ASCII shift state, we can immediately encode it. If we are in a different shift state, we must issue the control sequence first. When the codec does not preserve state, it cannot correctly encode the entire string, since concatenating the results of encode() invocations might be incorrect. If you don't believe me, tell me how I can use your proposed interface to encode a Unicode string into JIS X 0201 + XML escapes, using the encode/decode functions only. > Not sure what you mean here, but the encoder and decoder > returned by codecs.lookup() must not maintain state. This > property is reserved for StreamWriters and Readers (see the > Unicode docs). You mean the sentence that says # The functions/methods are expected to work in a stateless mode. What is "expected to work"? Who expects they work in stateless mode, and why? And what happens if they don't? 
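To make the failure mode concrete, here is a toy shift-state encoder — invented escape sequences and a toy byte mapping, only the shape of an ISO-2022-style codec, not a real one — exposed as a bound method so that state survives across calls:

    class ShiftStateEncoder:
        ASCII, OTHER = '\033(B', '\033$B'   # invented escape sequences

        def __init__(self):
            self.state = self.ASCII         # current shift state

        def encode(self, input, errors='strict'):
            output = ''
            for char in input:
                if ord(char) < 128:
                    target = self.ASCII
                else:
                    target = self.OTHER
                if target != self.state:
                    output = output + target  # emit shift sequence first
                    self.state = target
                output = output + chr(ord(char) & 0xff)  # toy mapping
            return output, len(input)

    # If codecs.lookup handed out a bound method,
    #     encode = ShiftStateEncoder().encode
    # then concatenating the results of successive encode() calls would
    # stay correct, because the object remembers its shift state.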
It also says # These must be functions or methods which have the same interface as # the encode()/decode() methods of Codec instances (see Codec # Interface). So surely, the result of codecs.lookup may be a method. If it is a method, it surely must be a bound method (or else, where does the self argument come from?) Since bound methods are allowed, the encode/decode functions *may* preserve state: A bound method always references state in the form of the object it is bound to. So I think the sentence in the documentation saying "expected to work" is an error. Regards, Martin From mal@lemburg.com Tue Jun 5 20:46:57 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 05 Jun 2001 21:46:57 +0200 Subject: [I18n-sig] XML and codecs References: <200106051858.f55IwpU01510@mira.informatik.hu-berlin.de> Message-ID: <3B1D3731.2B915C87@lemburg.com> "Martin v. Loewis" wrote: > > > Should be no problem since the exception will sort of freeze > > the current state of the codec (provided it's a StreamWriter/Reader) > > and let you use this state to take appropriate actions. > > What do you mean: "provided it's a StreamReader/Writer". What if I > invoke the encode method found in codec lookup, and get an exception? The encoders/decoders returned in the lookup tuple are not supposed to store state. If you want to or need to store state, then you should use the factory functions (StreamWriter and -Reader) to first create an instance which can store state and then use its .encode()/.decode() methods. > The exception does not carry the state. That's not what I meant. If you have created, say, a StreamReader object, then this object will store the state and if its .encode() method raises a UnicodeBreakError exception you can use the current state stored in the object to take some recovery action, etc. > Suppose you encode into JIS X > 0201. That has four shift states: > > CHARSETS = { > "\033(B": US_ASCII, > "\033(J": JISX0201_1976, > "\033$@": JISX0208_1978, > "\033$B": JISX0208_1983, > } > > Depending on which of the escape codes you've emitted last, the > following bytes will have different meanings. > > Now, suppose we encode a string that cannot be translated to JIS > X 0201. The codec will raise an exception, telling us how many bytes > it has encoded. Now, suppose we want to replace this character with > the string "&#9898;". If we are in the US_ASCII shift state, we can > immediately encode it. If we are in a different shift state, we must > issue the control sequence first. > > When the codec does not preserve state, it cannot correctly encode the > entire string, since concatenating the results of encode() invocations > might be incorrect. > > If you don't believe me, tell me how I can use your proposed interface > to encode a Unicode string into JIS X 0201 + XML escapes, using the > encode/decode functions only. > > > Not sure what you mean here, but the encoder and decoder > > returned by codecs.lookup() must not maintain state. This > > property is reserved for StreamWriters and Readers (see the > > Unicode docs). > > You mean the sentence that says > > # The functions/methods are expected to work in a stateless mode. > > What is "expected to work"? Who expects they work in stateless mode, > and why? And what happens if they don't? > > It also says > > # These must be functions or methods which have the same interface as > # the encode()/decode() methods of Codec instances (see Codec > # Interface). > > So surely, the result of codecs.lookup may be a method. 
If it is a > method, it surely must be a bound method (or else, where does the self > argument come from?) Since bound methods are allowed, the encode/decode > functions *may* preserve state: A bound method always references state > in the form of the object it is bound to. > > So I think the sentence in the documentation saying "expected to work" > is an error. This is per design and not a mistake. If the encoders/decoders (the first two items in the lookup tuple) stored state, then you would have serious problems when reusing them for different inputs. I'm not even talking about threading problems here. The other two entries were designed to provide stateful codec interfaces, so your JIS codec would have to use those in order to store shift states etc. or do more complex work on the data. The encoder/decoder functions should only provide very basic encoding/decoding facilities which do not require keeping state (e.g. they might have additional keyword arguments to parameterize them to work in different shift states). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Tue Jun 5 21:05:04 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 5 Jun 2001 22:05:04 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: "mal@lemburg.com"'s message of Tue, 05 Jun 2001 21:46:57 +0200 Message-ID: <200106052005.f55K54U02481@mira.informatik.hu-berlin.de> > > What do you mean: "provided it's a StreamReader/Writer". What if I > > invoke the encode method found in codec lookup, and get an exception? > > The encoders/decoders returned in the lookup tuple are not > supposed to store state. If you want to or need to store state, > then you should use the factory functions (StreamWriter and > -Reader) to first create an instance which can store state > and then use its .encode()/.decode() methods. To create one of these, I need a file object. I just want a stateful encoder, not a stream. So if I don't have a file object, how do I create an encoder? Plus, if I cannot use the functions returned from codecs.lookup in stateful encodings, what are they good for, anyways? > > So I think the sentence in the documentation saying "expected to work" > > is an error. > > This is per design and not a mistake. Ok, so it is an error in the design, not only in the documentation. > If the encoders/decoders (the first two items in the > lookup tuple) stored state, then you would have serious problems > when reusing them for different inputs. I'm not even talking about > threading problems here. What specific problems would you have? I.e. is there anything in the standard library that gets into serious problems if codecs.lookup returns a stateful object? > The other two entries were designed to provide stateful codec > interfaces, so your JIS codec would have to use those in order > to store shift states etc. or do more complex work on the data. First, as I said, I cannot use them as-is, since I need a file. Furthermore, are you saying that I can use codecs.lookup(enc)[:2] only for some encodings, not for others? That sounds like a huge design flaw. > The encoder/decoder functions should only provide very basic > encoding/decoding facilities which do not require keeping > state (e.g. they might have additional keyword arguments to > parameterize them to work in different shift states). Arghh. 
Whether the facilities are basic or not depends on the encoding. So again I consider this broken, and the best fix is to allow the callable objects returned in codecs.lookup(enc)[:2] to maintain state if they want. Users must then either look them up again if they want to reuse them for different input, or they can recycle them if they happen to know that no state is maintained. Regards, Martin From mal@lemburg.com Tue Jun 5 21:23:30 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 05 Jun 2001 22:23:30 +0200 Subject: [I18n-sig] XML and codecs References: <200106052005.f55K54U02481@mira.informatik.hu-berlin.de> Message-ID: <3B1D3FC2.988F3945@lemburg.com> "Martin v. Loewis" wrote: > > > > What do you mean: "provided it's a StreamReader/Writer". What if I > > > invoke the encode method found in codec lookup, and get an exception? > > > > The encoders/decoders returned in the lookup tuple are not > > supposed to store state. If you want to or need to store state, > > then you should use the factory functions (StreamWriter and > > -Reader) to first create an instance which can store state > > and then use its .encode()/.decode() methods. > > To create one of these, I need a file object. I just want a stateful > encoder, not a stream. So if I don't have a file object, how do I > create an encoder? Simple: use cStringIO ! > Plus, if I cannot use the functions returned from codecs.lookup in > stateful encodings, what are they good for, anyways? For simple stateless encodings. > > > So I think the sentence in the documentation saying "expected to work" > > > is an error. > > > > This is per design and not a mistake. > > Ok, so it is an error in the design, not only in the documentation. Oh please... > > If the encoders/decoders (the first two items in the > > lookup tuple) stored state, then you would have serious problems > > when reusing them for different inputs. I'm not even talking about > > threading problems here. > > What specific problems would you have? I.e. is there anything in the > standard library that gets into serious problems if codecs.lookup > returns a stateful object? Please reread what I wrote and then think this over again... by reusing a stateful encoder multiple times you would carry over state from one usage to the next, e.g. carry over the shift state from one data set to the next (which may not even use this shift state). > > The other two entries were designed to provide stateful codec > > interfaces, so your JIS codec would have to use those in order > > to store shift states etc. or do more complex work on the data. > > First, as I said, I cannot use them as-is, since I need a file. > > Furthermore, are you saying that I can use codecs.lookup(enc)[:2] only > for some encodings, not for others? That sounds like a huge design > flaw. These two APIs are exposed to simplify the interface for simple, stateless encodings. Since most encodings work just fine with these APIs they are indeed very useful. > > The encoder/decoder functions should only provide very basic > > encoding/decoding facilities which do not require keeping > > state (e.g. they might have additional keyword arguments to > > parameterize them to work in different shift states). > > Arghh. Whether the facilities are basic or not depends on the > encoding. > > So again I consider this broken, and the best fix is to allow the > callable objects returned in codecs.lookup(enc)[:2] to maintain state > if they want. 
> > Users must then either look them up again if they want to reuse them > for different input, or they can recycle them if they happen to know > that no state is maintained. Again, this decision was per design: the codec registry lookup mechanism caches the lookup tuples. With your proposal the cache would be rendered useless. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Tue Jun 5 21:50:43 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 5 Jun 2001 22:50:43 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: "mal@lemburg.com"'s message of Tue, 05 Jun 2001 22:23:30 +0200 Message-ID: <200106052050.f55Koho02886@mira.informatik.hu-berlin.de> > > To create one of these, I need a file object. I just want a stateful > > encoder, not a stream. So if I don't have a file object, how do I > > create an encoder? > > Simple: use cStringIO ! Are you serious? To encode strings, I need cStringIO ?!? > > Plus, if I cannot use the functions returned from codecs.lookup in > > stateful encodings, what are they good for, anyways? > > For simple stateless encodings. So it is not a general-purpose facility. What should a lookup function return if it cannot provide a stateless encoding function? > Please reread what I wrote and then think this over again... Why do you think I did not pay attention? > by reusing a stateful encoder multiple times you would carry over > state from one usage to the next, e.g. carry over the shift state > from one data set to the next (which may not even use this shift > state). Indeed, that's what I want. How else could continuing after an encoding error work? If I want to start with fresh data, I also need to get a fresh codec function, from codecs.lookup. > These two APIs are exposed to simplify the interface for simple, > stateless encodings. Since most encodings work just fine with > these APIs they are indeed very useful. It turns out that both UTF-16 and UTF-8 have problems with a stateless approach, so I'm questioning the usefulness of the API. Of course, having to use cStringIO isn't any better... > Again, this decision was per design: the codec registry lookup > mechanism caches the lookup tuples. With your proposal the cache > would be rendered useless. Given that encodings.search_function caches the result also, it is questionable why codecs.lookup should do that. One cache should be enough, and it should be in encodings, since all these encodings are known to be stateless. Regards, Martin From walter@livinglogic.de Wed Jun 6 15:52:36 2001 From: walter@livinglogic.de (Walter Doerwald) Date: Wed, 06 Jun 2001 16:52:36 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: <3B1CA02D.71C4A6EB@lemburg.com> References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1807B6.11ED32B9@lemburg.com> <200106051039040859.000CF3EB@mail.livinglogic.de> <3B1CA02D.71C4A6EB@lemburg.com> Message-ID: <200106061652360296.01088FC8@mail.livinglogic.de> On 05.06.01 at 11:02 M.-A. Lemburg wrote: > Walter Doerwald wrote: > > > [...] 
> > > > This scheme is very simple, but also very effective, since > > > it allows complex error processing to be done in the > > > namespace where the data is being processed (rather than > > > in a callback which wouldn't have access to this namespace). > > > > A callback could be a class instance with a __call__ method > > and so can have as much state information as it needs. > > Sure, but it breaks the current API completely. The above > mechanism is different in that the communication in the error > case is done by means of an exception. While this is not as > fast as a callback it does have some advantages: > > * you can write the error handling code in the context using > the codec > > * it enables you to write error handling code at higher levels > in the calling stack But this means that you would have to allow the encoder to keep state between calls. That's no issue with a callback, because there is only one call. > * it fits in with the current API That's right. Unfortunately there are a lot of functions that would have to be changed. Bye, Walter Dörwald -- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de From walter@livinglogic.de Wed Jun 6 16:51:10 2001 From: walter@livinglogic.de (Walter Doerwald) Date: Wed, 06 Jun 2001 17:51:10 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: <3B1E4BBA.9BA3A4D8@lemburg.com> References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1807B6.11ED32B9@lemburg.com> <200106051039040859.000CF3EB@mail.livinglogic.de> <3B1CA02D.71C4A6EB@lemburg.com> <200106061651570250.0107F741@mail.livinglogic.de> <3B1E4BBA.9BA3A4D8@lemburg.com> Message-ID: <200106061751100625.013E2FA0@mail.livinglogic.de> On 06.06.01 at 17:26 M.-A. Lemburg wrote: > Walter Doerwald wrote: > > > > On 05.06.01 at 11:02 M.-A. Lemburg wrote: > > > > > [...] > > > > > > Sure, but it breaks the current API completely. The above > > > mechanism is different in that the communication in the error > > > case is done by means of an exception. While this is not as > > > fast as a callback it does have some advantages: > > > > > > * you can write the error handling code in the context using > > > the codec > > > > > > * it enables you to write error handling code at higher levels > > > in the calling stack > > > > But this means that you would have to allow the encoder to keep > > state between calls. That's no issue with a callback, because there > > is only one call. > > Well, either the codec keeps state or your application; > here's some pseudo code to illustrate the first situation: > > def do_something(data): > > StreamWriter = codec.lookup('myencoding')[3] > output = cStringIO(data) > writer = StreamWriter(output, 'break') > while 1: > try: > writer.write(data) > except UnicodeBreakError, (reason, position, work): > # Write data converted so far > output.write(work) > # Roll back 10 chars in the input and retry > data = data[position - 10:] > else: > break > return output.getvalue() Apart from the fact that I have to use a StreamWriter (I probably would have to anyway, since only one BOM at the start of an output file is required.) this looks usable. The big question is: Is 'break' a temporary workaround that will go away as soon as we have error handling callbacks?
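(To make the "callback with state" idea concrete, here is a rough, untested sketch; the (enc, uni, pos) signature and the convention that the callback returns a replacement string are assumptions of mine, nothing like this exists in the codecs module today:)

    import codecs

    class EntityReplacer:
        # Hypothetical stateful error callback: substitutes an XML
        # character reference for each unencodable character and
        # remembers how often it was called.
        def __init__(self):
            self.count = 0
        def __call__(self, enc, uni, pos):
            self.count = self.count + 1
            return u"&#x%x;" % ord(uni[pos])

    e = EntityReplacer()
    # Assumes a patched ascii_encode that accepts a callable instead
    # of an error string -- pure speculation at this point:
    # codecs.ascii_encode(u"a\xe4b", e); e.count would then be 1.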
Do we want error handling callbacks? And finally: How fast is it? > > > * it fits in with the current API > > > > That's right. Unfortunately there are a lot of functions that > > would have to be changed. > > That's why I prefer small steps rather than replacing the > complete codec suite with new interfaces. The type of one argument changes in all the functions, i.e. there's a new set of *Ex functions, where const char *errors becomes PyObject *errors Bye, Walter D=F6rwald -- Walter D=F6rwald =B7 LivingLogic AG =B7 Bayreuth, Germany =B7 www.livinglogic.de From mal@lemburg.com Wed Jun 6 16:57:54 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 06 Jun 2001 17:57:54 +0200 Subject: [I18n-sig] XML and codecs References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1807B6.11ED32B9@lemburg.com> <200106051039040859.000CF3EB@mail.livinglogic.de> <3B1CA02D.71C4A6EB@lemburg.com> <200106061651570250.0107F741@mail.livinglogic.de> <3B1E4BBA.9BA3A4D8@lemburg.com> <200106061751100625.013E2FA0@mail.livinglogic.de> Message-ID: <3B1E5302.B9D83C94@lemburg.com> Walter Doerwald wrote: > > > > > Sure, but it breaks the current API completely. The above > > > > mechanism is different in that the communication in the error > > > > case is done by means of an exception. While this is not as > > > > fast as a callback it does have some advantages: > > > > > > > > * you can write the error handling code in the context using > > > > the codec > > > > > > > > * it enables you to write error handling code at higher levels > > > > in the calling stack > > > > > > But this means that you would have to allow the encoder to keep > > > state between calls. That's no isse with a callback, because there > > > is only one call. > > > > Well, either the codec keeps state or your application; > > here's some pseudo code to illustrate the first situation: > > > > def do_something(data): > > > > StreamWriter = codec.lookup('myencoding')[3] > > output = cStringIO(data) > > writer = StreamWriter(output, 'break') > > while 1: > > try: > > writer.write(data) > > except UnicodeBreakError, (reason, position, work): > > # Write data converted so far > > output.write(work) > > # Roll back 10 chars in the input and retry > > data = data[position - 10:] > > else: > > break > > return output.getvalue() > > Apart from the fact, that I have to use a StreamWriter > (I probably would have to anyway, since only one BOM at the > start of an output file is required.) this looks usable. > > The big question is: Is 'break' a temporary workaround > that will go away as soon as we have error handling > callbacks? No. > Do we want error handling callbacks? I think we should still keep them on the TODO list. > And finally: How fast is it? Since errors will always cause extra cycles to be used, I think the small overhead of using an exception for the notification is reasonable. Written in C, you probably won't notice much of a slowdown compared to a callback solution, since there exceptions are faster than in Python (the exception objects are created lazily in Python). > > > > * it fits in with the current API > > > > > > That's right. Unfortunately there are a lot of functions that > > > would have to be changed. 
> > > > That's why I prefer small steps rather than replacing the > > complete codec suite with new interfaces. > > The type of one argument changes in all the functions, i.e. > there's a new set of *Ex functions, where > const char *errors > becomes > PyObject *errors ... plus all the callback logic which goes with it, changes to the way errors are handled by the codecs, etc. It is doable, but certainly a lot of work. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Wed Jun 6 19:33:07 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 6 Jun 2001 20:33:07 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: "walter@livinglogic.de"'s message of Wed, 06 Jun 2001 17:51:10 +0200 Message-ID: <200106061833.f56IX7S01099@mira.informatik.hu-berlin.de> > > Well, either the codec keeps state or your application; > > here's some pseudo code to illustrate the first situation: > > > > def do_something(data): > > > > StreamWriter = codec.lookup('myencoding')[3] > > output = cStringIO(data) > > writer = StreamWriter(output, 'break') > > while 1: > > try: > > writer.write(data) > > except UnicodeBreakError, (reason, position, work): > > # Write data converted so far > > output.write(work) > > # Roll back 10 chars in the input and retry > > data = data[position - 10:] > > else: > > break > > return output.getvalue() I've missed Marc's posting of this code fragment: How can rolling back 10 characters possibly be the right thing? Couldn't this cause data to be written twice to the stream? I would expect that, when calling .write(), all correctly encoded data is written to the stream and that position points to the first character that cannot be encoded. Regards, Martin From mal@lemburg.com Wed Jun 6 20:24:28 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 06 Jun 2001 21:24:28 +0200 Subject: [I18n-sig] XML and codecs References: <200106061833.f56IX7S01099@mira.informatik.hu-berlin.de> Message-ID: <3B1E836C.D773D3B@lemburg.com> "Martin v. Loewis" wrote: > > > > Well, either the codec keeps state or your application; > > > here's some pseudo code to illustrate the first situation: > > > > > > def do_something(data): > > > > > > StreamWriter = codec.lookup('myencoding')[3] > > > output = cStringIO(data) > > > writer = StreamWriter(output, 'break') > > > while 1: > > > try: > > > writer.write(data) > > > except UnicodeBreakError, (reason, position, work): > > > # Write data converted so far > > > output.write(work) > > > # Roll back 10 chars in the input and retry > > > data = data[position - 10:] > > > else: > > > break > > > return output.getvalue() > > I've missed Marc's posting of this code fragment: How can rolling back > 10 characters possibly be the right thing? Couldn't this cause data to > be written twice to the stream? This depends on how the codec and encoding works. The above is just an example of how you could use the 'break' mechanism to apply customized action in case of an error. > I would expect that, when calling .write(), all correctly encoded data > is written to the stream and that position points to the first > character that cannot be encoded. i think it's better not to write any information to the stream unless you are absolutely sure that no error occurred. Remember that you cannot take back characters which were written to the stream. 
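To make the intended protocol a bit more concrete: a variant of my earlier sketch which skips just the offending character instead of rolling back would look like this (still pseudo code in the sense that the 'break' error mode and UnicodeBreakError are hypothetical, and 'myencoding' is a placeholder; note that it also fixes the sloppy codec.lookup and cStringIO(data) from the first version):

    import codecs, cStringIO

    def encode_with_fixups(data):
        StreamWriter = codecs.lookup('myencoding')[3]
        output = cStringIO.StringIO()
        writer = StreamWriter(output, 'break')
        while data:
            try:
                writer.write(data)
                break
            except UnicodeBreakError, (reason, position, work):
                # work holds the output produced before the failure;
                # nothing has been written to the stream yet
                output.write(work)
                # skip the offending character and retry with the rest
                data = data[position + 1:]
        return output.getvalue()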
With the above information at hand, the caller can make all decisions needed to assure the data written to the output stream is correct. The codec will place the work done so far into the third tuple argument and the position which caused the failure into the second. reason can be used to provide additional information to the caller. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Thu Jun 7 06:10:50 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 7 Jun 2001 07:10:50 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: "mal@lemburg.com"'s message of Wed, 06 Jun 2001 21:24:28 +0200 Message-ID: <200106070510.f575AoA00835@mira.informatik.hu-berlin.de> > The codec will place the work done so far into the third > tuple argument and the position which caused the failure > into the second. reason can be used to provide additional > information to the caller. How does that work with writelines()? In this case, the caller does not have the string which the position refers to. Regards, Martin From mal@lemburg.com Thu Jun 7 09:36:37 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 07 Jun 2001 10:36:37 +0200 Subject: [I18n-sig] XML and codecs References: <200106070510.f575AoA00835@mira.informatik.hu-berlin.de> Message-ID: <3B1F3D15.20795A56@lemburg.com> "Martin v. Loewis" wrote: > > > The codec will place the work done so far into the third > > tuple argument and the position which caused the failure > > into the second. reason can be used to provide additional > > information to the caller. > > How does that work with writelines()? In this case, the caller does > not have the string which the position refers to. In that case you'd either a) have to subclass the StreamWriter and provide the necessary logic in the .writelines() method (using the .write() method to do the actual work) or b) forget about .writelines() and move the for-loop directly into your application or c) use u"".join(datalines) and .write(). Not really all that difficult. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From walter@livinglogic.de Thu Jun 7 10:53:46 2001 From: walter@livinglogic.de (Walter Doerwald) Date: Thu, 07 Jun 2001 11:53:46 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: <3B1E5302.B9D83C94@lemburg.com> References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1807B6.11ED32B9@lemburg.com> <200106051039040859.000CF3EB@mail.livinglogic.de> <3B1CA02D.71C4A6EB@lemburg.com> <200106061651570250.0107F741@mail.livinglogic.de> <3B1E4BBA.9BA3A4D8@lemburg.com> <200106061751100625.013E2FA0@mail.livinglogic.de> <3B1E5302.B9D83C94@lemburg.com> Message-ID: <200106071153460500.002D9DCB@mail.livinglogic.de> On 06.06.01 at 17:57 M.-A. Lemburg wrote: > [...] > > Do we want error handling callbacks? > > I think we should still keep them on the TODO list. OK! Then I'll start playing around with it. > > And finally: How fast is it? 
> > Since errors will always cause extra cycles to be used, > I think the small overhead of using an exception for > the notification is reasonable. > > Written in C, you probably won't notice much of a slowdown > compared to a callback solution, since there, exceptions are > faster than in Python (the exception objects are created > lazily in Python). > > > > > > * it fits in with the current API > > > > > > > > That's right. Unfortunately there are a lot of functions that > > > > would have to be changed. > > > > > > That's why I prefer small steps rather than replacing the > > > complete codec suite with new interfaces. > > > > The type of one argument changes in all the functions, i.e. > > there's a new set of *Ex functions, where > > const char *errors > > becomes > > PyObject *errors > > .. plus all the callback logic which goes with it, changes > to the way errors are handled by the codecs, etc. It is doable, > but certainly a lot of work. Well, I need something to do in my free time! ;) Bye, Walter Dörwald -- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de From walter@livinglogic.de Thu Jun 7 21:27:33 2001 From: walter@livinglogic.de (Walter Doerwald) Date: Thu, 07 Jun 2001 22:27:33 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: <200106071153460500.002D9DCB@mail.livinglogic.de> References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1807B6.11ED32B9@lemburg.com> <200106051039040859.000CF3EB@mail.livinglogic.de> <3B1CA02D.71C4A6EB@lemburg.com> <200106061651570250.0107F741@mail.livinglogic.de> <3B1E4BBA.9BA3A4D8@lemburg.com> <200106061751100625.013E2FA0@mail.livinglogic.de> <3B1E5302.B9D83C94@lemburg.com> <200106071153460500.002D9DCB@mail.livinglogic.de> Message-ID: <200106072227330828.0271DE0B@mail.livinglogic.de> On 07.06.01 at 11:53 Walter Doerwald wrote: > On 06.06.01 at 17:57 M.-A. Lemburg wrote: > > > [...] > > > Do we want error handling callbacks? > > > > I think we should still keep them on the TODO list. > > OK! Then I'll start playing around with it. I started working on this, and it's progressing nicely. It's already possible to do things like:

    >>> import codecs
    >>> codecs.ascii_encode(
    ...     u"aäuüoöß",
    ...     lambda enc, uni, pos: u"&#x%x;" % ord(uni[pos]))
    ('a&#xe4;u&#xfc;o&#xf6;&#xdf;', 7)
    >>> import unicodedata
    >>> codecs.latin_1_encode(
    ...     u"a\u3042b",
    ...     lambda enc, uni, pos: u"<%s>" % unicodedata.name(uni[pos]))
    ('a<HIRAGANA LETTER A>b', 3)

String arguments are still accepted:

    >>> codecs.ascii_encode(u"aäuüoöß", "ignore")
    ('auo', 7)

Bye, Walter Dörwald -- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de From mal@lemburg.com Thu Jun 7 22:04:23 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 07 Jun 2001 23:04:23 +0200 Subject: [I18n-sig] XML and codecs References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1807B6.11ED32B9@lemburg.com> <200106051039040859.000CF3EB@mail.livinglogic.de> <3B1CA02D.71C4A6EB@lemburg.com> <200106061651570250.0107F741@mail.livinglogic.de> <3B1E4BBA.9BA3A4D8@lemburg.com> <200106061751100625.013E2FA0@mail.livinglogic.de> <3B1E5302.B9D83C94@lemburg.com> <200106071153460500.002D9DCB@mail.livinglogic.de> <200106072227330828.0271DE0B@mail.livinglogic.de> Message-ID: <3B1FEC57.AA4587ED@lemburg.com> Walter Doerwald wrote: > > On 07.06.01 at 11:53 Walter Doerwald wrote: > > > On 06.06.01 at 17:57 M.-A. Lemburg wrote: > > > > > [...] > > > > Do we want error handling callbacks? > > > > > > I think we should still keep them on the TODO list. > > > > OK! Then I'll start playing around with it. > > I started working on this, and it's progressing nicely. Cool :) > It's already possible to do things like: >
>     >>> import codecs
>     >>> codecs.ascii_encode(
>     ...     u"aäuüoöß",
>     ...     lambda enc, uni, pos: u"&#x%x;" % ord(uni[pos]))
>     ('a&#xe4;u&#xfc;o&#xf6;&#xdf;', 7)
>     >>> import unicodedata
>     >>> codecs.latin_1_encode(
>     ...     u"a\u3042b",
>     ...     lambda enc, uni, pos: u"<%s>" % unicodedata.name(uni[pos]))
>     ('a<HIRAGANA LETTER A>b', 3)
>
> String arguments are still accepted:
>
>     >>> codecs.ascii_encode(u"aäuüoöß", "ignore")
>     ('auo', 7)
>
> Bye, > Walter Dörwald > > -- > Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From kajiyama@grad.sccs.chukyo-u.ac.jp Fri Jun 8 15:59:11 2001 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Fri, 8 Jun 2001 23:59:11 +0900 Subject: [I18n-sig] JapaneseCodecs and the license Message-ID: <200106081459.XAA06340@dhcp198.grad.sccs.chukyo-u.ac.jp> Hi. I decided to change the license of my JapaneseCodecs package from GNU GPL to a BSD variant. Due to the license change, I released JapaneseCodecs 1.3. It is available at: http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/python/ There is no change in the code, so you don't need to update your copy if the license doesn't matter. By the way, I also released a new module named kanjilib. As the name implies, the kanjilib module provides Japanese encoding conversion functions for EUC-JP, Shift_JIS and ISO-2022-JP. The module does not rely on Python's Unicode facilities, so it may be convenient if you need to handle Japanese character encodings but not Unicode, or if you need Japanese encoding conversion in Python 1.5.2 or earlier. The module is also available on the page above. Thanks, -- KAJIYAMA, Tamito From mal@lemburg.com Fri Jun 8 16:14:31 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 08 Jun 2001 17:14:31 +0200 Subject: [I18n-sig] JapaneseCodecs and the license References: <200106081459.XAA06340@dhcp198.grad.sccs.chukyo-u.ac.jp> Message-ID: <3B20EBD7.F112D288@lemburg.com> Tamito KAJIYAMA wrote: > > Hi.
> > I decided to change the license of my JapaneseCodecs package > from GNU GPL to a BSD variant. Due to the license change, I > released JapaneseCodecs 1.3. It is available at: > > http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/python/ > > There is no change in the codes, so you don't need to update > your copy if the license doesn't matter. Great... this is very good news ! I just wish some of the other codec authors would follow your example. Anyway, your move will certainly improve the usability of Python in Asia. > By the way, I also released a new module named kanjilib. As the > name implies, the kanjilib module provides Japanese encoding > conversion functions for EUC-JP, Shift_JIS and ISO-2022-JP. The > module does not rely on Python's Unicode facilities, so it may > be convenient if you need to handle Japanese character encodings > but not Unicode, or if you need Japanese encoding conversion in > Python 1.5.2 or former. The module is also available on the > page above. Thank you, -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From paulp@ActiveState.com Fri Jun 8 17:30:05 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 08 Jun 2001 09:30:05 -0700 Subject: [I18n-sig] JapaneseCodecs and the license References: <200106081459.XAA06340@dhcp198.grad.sccs.chukyo-u.ac.jp> <3B20EBD7.F112D288@lemburg.com> Message-ID: <3B20FD8D.25CD576@ActiveState.com> "M.-A. Lemburg" wrote: > >... > > Great... this is very good news ! I just wish some of the > other codec authors would follow your example. Anyway, > your move will certainly improve the usability of Python in Asia. Frank Chen has agreed to do the same for Chinese codecs. I asked him if he would do so a few days ago. He sent me a zipfile with a license that is: "It is licensed under the same license as Python 2.1." I can send this zipfile on to you, MAL and you could look them over and then if they meet your approval you could check them into the codecs directory. Does that sound good? -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From mal@lemburg.com Fri Jun 8 18:05:28 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 08 Jun 2001 19:05:28 +0200 Subject: [I18n-sig] JapaneseCodecs and the license References: <200106081459.XAA06340@dhcp198.grad.sccs.chukyo-u.ac.jp> <3B20EBD7.F112D288@lemburg.com> <3B20FD8D.25CD576@ActiveState.com> Message-ID: <3B2105D8.AA40284F@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > >... > > > > Great... this is very good news ! I just wish some of the > > other codec authors would follow your example. Anyway, > > your move will certainly improve the usability of Python in Asia. > > Frank Chen has agreed to do the same for Chinese codecs. I asked him if > he would do so a few days ago. He sent me a zipfile with a license that > is: > > "It is licensed under the same license as Python 2.1." > > I can send this zipfile on to you, MAL and you could look them over and > then if they meet your approval you could check them into the codecs > directory. Does that sound good? I'll have to get BDFL approval on that first since these codec are huge. 
When we first discussed these issues it was decided to keep the codecs in a separate package which was to be maintained by packagers like ActiveState ;-) I'm not so sure anymore, though, since adding a few more 100kB to the distribution archive will certainly not hurt anybody these days and it would certainly gain some user base in Asia... which we are currently losing to [that other Japanese scripting language ;-)]. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tim.one@home.com Fri Jun 8 19:07:19 2001 From: tim.one@home.com (Tim Peters) Date: Fri, 8 Jun 2001 14:07:19 -0400 Subject: [I18n-sig] JapaneseCodecs and the license In-Reply-To: <3B20FD8D.25CD576@ActiveState.com> Message-ID: [Paul Prescod] > Frank Chen has agreed to do the same for Chinese codecs. I asked him if > he would do so a few days ago. He sent me a zipfile with a license that > is: > > "It is licensed under the same license as Python 2.1." Ah, licensing. I suggest people hold off just a little longer on this. While Python isn't released under the GPL, we've got nothing against it either, and the FSF doesn't believe the 2.1 license is GPL *compatible*. So releasing more stuff under the 2.1 license will create that many more problems for GPL'ed projects. We have agreement from the FSF that the license for 2.0.1, 2.1.1 and 2.2 (whichever gets released first -- none have yet) is GPL-compatible, so that's a friendlier target to shoot for. For anyone who has actually read all these things, the only real difference between 2.1's license and 2.0.1/2.1.1/2.2's is removing the contentious "State of Virginia" choice-of-law clause. I doubt that's a clause anyone in China would be keen to keep anyway. From jaleco@gameone.com.tw Mon Jun 18 09:17:20 2001 From: jaleco@gameone.com.tw (jaleco) Date: Mon, 18 Jun 2001 16:17:20 +0800 Subject: [I18n-sig] unicode Message-ID: <000a01c0f7cf$18733bf0$94bd4ed3@jaleco> How to convert an integer to unicode type, behavior like chr() to an integer type?
From mal@lemburg.com Mon Jun 18 09:25:11 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 18 Jun 2001 10:25:11 +0200 Subject: [I18n-sig] unicode References: <000a01c0f7cf$18733bf0$94bd4ed3@jaleco> Message-ID: <3B2DBAE7.EBD653D8@lemburg.com> > jaleco wrote: > > How to convert an integer to unicode type, > behavior like chr() to an integer type? Try unichr(). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From barry@wooz.org Tue Jun 19 20:52:35 2001 From: barry@wooz.org (Barry A. Warsaw) Date: Tue, 19 Jun 2001 15:52:35 -0400 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15118.27210.930905.339141@anthem.wooz.org> <200105252012.f4PKCg801160@mira.informatik.hu-berlin.de> Message-ID: <15151.44419.951894.490695@anthem.wooz.org> >>>>> "MvL" == Martin v Loewis writes: >> But the po-file format documentation doesn't say that >> additional flags can be defined for #, comments. It seems to >> me a simple omission in the documentation, right? Is the >> intent of #, flags that the extraction tools can define >> additional, language-specific flags? MvL> I'd say that nobody has thought of that. Bruno is probably MvL> the person to give a definitive yay or nay here, but I'd hope MvL> that tools shouldn't go into flames if they see an extra MvL> flag. At least GNU msgmerge does not show any concern. MvL> Of course, it would be better if this possibility could be MvL> codified somewhere, and if gettext.texi could serve as the MvL> repository of well-known flags - even if they don't all have MvL> a meaning to GNU gettext. Adding such documentation is MvL> probably an issue of submitting patches against gettext.texi. I'm trying to close this issue out (along with the associated SF patch). Since I haven't heard otherwise from Bruno, I'm going to change the output to produce "#, docstring" flags. -Barry From barry@digicool.com Tue Jun 19 23:59:31 2001 From: barry@digicool.com (Barry A. Warsaw) Date: Tue, 19 Jun 2001 18:59:31 -0400 Subject: [I18n-sig] Autoguessing charset for Unicode strings? Message-ID: <15151.55635.370702.813650@yyz.digicool.com> I just don't know enough about Unicode in general (I've been one of those eye-glazers Skip refers to ;), so I figured I'd ask this question here. First, some background. I'm trying to add support for RFC 2047 in mimelib. Essentially, this RFC specifies how to include non-ASCII characters in mail headers, by describing an encoding format. The format lets you wrap "funny" characters in something like: =?iso-8859-1?Q?B=E2rry W=E2rs=E2w?= So, I think I've got the first part working, which is this: when I see such an encoded header, I pull out the encoded string, quopri decode it[*], then coerce to Unicode, giving the charset part as the second argument to unicode().
Specifically, the algorithm is something like:

    parts = value.split('?')
    if parts[0].endswith('=') and parts[4].startswith('='):
        charset = parts[1]
        encoding = parts[2].lower()
        atom = parts[3]
        if encoding == 'q':
            decoded_atom = quopri.decodestring(atom)
        elif encoding == 'b':
            decoded_atom = base64.decodestring(atom)
        else:
            raise ValueError, 'bad encoding: %s' % encoding
        return unicode(decoded_atom, charset)

So far so good. Now let's say I want to go in the other direction, i.e. given a Unicode string, I want to create the RFC 2047 encoded string to add to the header, so I need to be able to go "the other way 'round". Is this possible without requiring the user to explicitly provide the charset that the Unicode string is encoded with? My understanding is that the unicode string doesn't have a notion of the charset that it was encoded with, but is it possible to guess the charset of a Unicode string reliably? Even if you can only guess 80% of the time, that'd be fine if I can throw an exception for the other 20%. Is there an existing Python solution for this? Does my question even make sense? ;) Thanks, -Barry [*] The `Q' (or `q') in between the ?'s means the string is encoded using quoted-printable. Thus the recent rash of fixes to the quopri module. The RFC says that alternatively, a `B' (or `b') is valid, meaning Base64 was used. From martin@loewis.home.cs.tu-berlin.de Wed Jun 20 00:27:28 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 20 Jun 2001 01:27:28 +0200 Subject: [I18n-sig] Autoguessing charset for Unicode strings? In-Reply-To: <15151.55635.370702.813650@yyz.digicool.com> (barry@digicool.com) References: <15151.55635.370702.813650@yyz.digicool.com> Message-ID: <200106192327.f5JNRSp01853@mira.informatik.hu-berlin.de> > So far so good. Now let's say I want to go in the other direction, > i.e. given a Unicode string, I want to create the RFC 2047 encoded > string to add to the header, so I need to be able to go "the other way > 'round". Is this possible without requiring the user to explicitly > provide the charset that the Unicode string is encoded with? Yes, doing so is trivial - the tricky part is to make it work elegantly. > My understanding is that the unicode string doesn't have a notion of > the charset that it was encoded with, but is it possible to guess the > charset of a Unicode string reliably? Even if you can only guess 80% > of the time, that'd be fine if I can throw an exception for the other > 20%. Is there an existing Python solution for this? Does my question > even make sense? ;) Your question makes perfect sense, it is one of the rather troubling problems in the world of character set conversions. Another form of the same problem is "how does Tk pick the right font to display some unicode string"? Back to your question: The easiest path is to always use UTF-8 as the outgoing character set. UTF-8 is a well-recognized MIME encoding (although I forgot the RFC number), and it is capable of encoding all Unicode strings losslessly. However, that might produce quotations even if there are no funny characters in the string, so a better procedure might be:

1. try to encode as ASCII. If that succeeds, no quotation is needed
2. if that fails, use UTF-8

Now, many email readers will still choke these days when they see UTF-8 (the Microsoft ones being positive exceptions here), but will recognize Latin-1. So, another procedure might be

1. try to encode as ASCII
2. if that fails, try iso-8859-1
3. if that fails, use UTF-8

You'll see that this becomes more and more expensive. People now may propose that this really should be application controlled, but I think they'd be misguided: the application is normally in no better position to select a "good" encoding than the library. The latter algorithm may also be considered Euro-centric. It probably is. BTW, the same procedure probably needs to be used for MIME messages of type text/plain when a charset= is specified. I.e. usage of mimify.CHARSET is really not appropriate anymore. Regards, Martin From tree@basistech.com Wed Jun 20 00:05:29 2001 From: tree@basistech.com (Tom Emerson) Date: Tue, 19 Jun 2001 19:05:29 -0400 Subject: [I18n-sig] Autoguessing charset for Unicode strings? In-Reply-To: <200106192327.f5JNRSp01853@mira.informatik.hu-berlin.de> References: <15151.55635.370702.813650@yyz.digicool.com> <200106192327.f5JNRSp01853@mira.informatik.hu-berlin.de> Message-ID: <15151.55993.967688.913171@cymru.basistech.com> Martin v. Loewis writes: > Now, many email readers will still choke these days when they see > UTF-8 (the Microsoft ones being positive exceptions here), but will > recognize Latin-1. So, another procedure might be >
> 1. try to encode as ASCII
> 2. if that fails, try iso-8859-1
> 3. if that fails, use UTF-8
>
> You'll see that this becomes more and more expensive. People now may > propose that this really should be application controlled, but I think > they'd be misguided: the application is normally in no better position > to select a "good" encoding than the library. > > The latter algorithm may also be considered Euro-centric. It probably > is. Yes, it is. ;-) Western-Euro-centric, in fact. One could hint the character set in (2) based on the domain name of the sender, e.g., if the sender is from .jp then try ISO-2022-JP instead of 8859-1. It would be possible to construct a table mapping ranges of Unicode codepoints (perhaps even character blocks) to certain legacy encodings so that the correct one can be chosen quickly. Something like this is needed when transcoding from Unicode to ISO-2022-CN. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From JMachin@Colonial.com.au Wed Jun 20 00:50:23 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Wed, 20 Jun 2001 09:50:23 +1000 Subject: [I18n-sig] Autoguessing charset for Unicode strings? Message-ID: <9F2D83017589D211BD1000805FA70CA703B139AE@ntxmel03.cmutual.com.au> maybe not so expensive, depending on (a) what's in C and what's in Python and (b) function call overhead and (c) what proportion of text needs which character set ...

    loop once through your Unicode;
    if there were any chars with ordinal > 255, then use UTF-8
    elif there were any > 127, then use iso-8859-1
    else use ASCII

-----Original Message----- From: Martin v. Loewis [mailto:martin@loewis.home.cs.tu-berlin.de] Sent: Wednesday, 20 June 2001 9:27 To: barry@digicool.com Cc: i18n-sig@python.org Subject: Re: [I18n-sig] Autoguessing charset for Unicode strings? [snip] Now, many email readers will still choke these days when they see UTF-8 (the Microsoft ones being positive exceptions here), but will recognize Latin-1. So, another procedure might be

1. try to encode as ASCII
2. if that fails, try iso-8859-1
3. if that fails, use UTF-8

You'll see that this becomes more and more expensive.
[snip] Regards, Martin From tim.one@home.com Wed Jun 20 01:32:19 2001 From: tim.one@home.com (Tim Peters) Date: Tue, 19 Jun 2001 20:32:19 -0400 Subject: [I18n-sig] Autoguessing charset for Unicode strings? In-Reply-To: <9F2D83017589D211BD1000805FA70CA703B139AE@ntxmel03.cmutual.com.au> Message-ID: [Machin, John] > maybe not so expensive, depending on (a) what's in C and what's in > Python and (b) function call overhead and (c) what proportion of text > needs which character set ... >
> loop once through your Unicode;
> if there were any chars with ordinal > 255, then use UTF-8
> elif there were any > 127, then use iso-8859-1
> else use ASCII

I don't know whether that algorithm makes sense, but it's efficient enough in Python:

    biggest = max(map(ord, some_unicode_string))
    if biggest > 255:
        yadda
    elif biggest > 127:
        yadda
    else:
        yadda

So the bulk of the work goes almost entirely at C speed. From martin@loewis.home.cs.tu-berlin.de Wed Jun 20 07:57:12 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 20 Jun 2001 08:57:12 +0200 Subject: [I18n-sig] Autoguessing charset for Unicode strings? In-Reply-To: <15151.55993.967688.913171@cymru.basistech.com> (message from Tom Emerson on Tue, 19 Jun 2001 19:05:29 -0400) References: <15151.55635.370702.813650@yyz.digicool.com> <200106192327.f5JNRSp01853@mira.informatik.hu-berlin.de> <15151.55993.967688.913171@cymru.basistech.com> Message-ID: <200106200657.f5K6vCX01071@mira.informatik.hu-berlin.de> > It would be possible to construct a table mapping ranges of Unicode > codepoints (perhaps even character blocks) to certain legacy encodings > so that the correct one can be chosen quickly. Something like this is > needed when transcoding from Unicode to ISO-2022-CN. That would be valuable as a general-purpose service in the Python library, it seems. I have no experience with such API, but I think codecs.find_encodings(ustring) could work; this would return a list of tuples, each tuple containing the name of an encoding and the number of initial characters of ustring that can be represented in this encoding. An important implementation detail, of course, is how to construct the necessary data structures in an efficient way. For the codecs that ship with Python, the tables could be precomputed. For dynamically registered codecs, the first problem is to come up with a list of all known codec names - which in itself would be a useful service...
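To make that concrete, a pure-Python sketch of the interface (find_encodings is hypothetical, and the two-entry table is a toy; real tables would be precomputed from the codec mapping data):

    _RANGES = {
        'ascii':      [(0x0000, 0x007F)],
        'iso-8859-1': [(0x0000, 0x00FF)],
    }

    def find_encodings(ustring):
        # For each known encoding, count how many initial characters
        # of ustring fall into its representable ranges.
        result = []
        for name, intervals in _RANGES.items():
            count = 0
            for ch in ustring:
                code = ord(ch)
                hit = 0
                for lo, hi in intervals:
                    if lo <= code <= hi:
                        hit = 1
                        break
                if not hit:
                    break
                count = count + 1
            result.append((name, count))
        return result

With the toy table, find_encodings(u"abc\xe9") would return something like [('ascii', 3), ('iso-8859-1', 4)].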
Regards, Martin From keichwa@gmx.net Wed Jun 20 07:35:49 2001 From: keichwa@gmx.net (Karl Eichwalder) Date: 20 Jun 2001 08:35:49 +0200 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings In-Reply-To: <15151.44419.951894.490695@anthem.wooz.org> References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15118.27210.930905.339141@anthem.wooz.org> <200105252012.f4PKCg801160@mira.informatik.hu-berlin.de> <15151.44419.951894.490695@anthem.wooz.org> Message-ID: barry@wooz.org (Barry A. Warsaw) writes: > I'm trying to close this issue out (along with the associated SF > patch). Since I haven't heard otherwise from Bruno, I'm going to > change the output to produce "#, docstring" flags. Sounds good to me. Please, make sure to put the "#, ..." expression just before the "msgid" line; thus it's easier for the translator to see (sometimes we have very long "#: " lines). -- work : ke@suse.de | ,__o : http://www.suse.de/~ke/ | _-\_<, home : keichwa@gmx.net | (*)/'(*) From tdickenson@geminidataloggers.com Wed Jun 20 10:30:41 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Wed, 20 Jun 2001 10:30:41 +0100 Subject: [I18n-sig] Autoguessing charset for Unicode strings? In-Reply-To: References: <9F2D83017589D211BD1000805FA70CA703B139AE@ntxmel03.cmutual.com.au> Message-ID: On Tue, 19 Jun 2001 20:32:19 -0400, "Tim Peters" wrote: >I don't know whether that algorithm makes sense, but it's efficient enough >in Python: >
>     biggest = max(map(ord, some_unicode_string))

or marginally more efficient still:

    biggest = ord(max(some_unicode_string))

Toby Dickenson tdickenson@geminidataloggers.com From haible@ilog.fr Wed Jun 20 11:05:07 2001 From: haible@ilog.fr (Bruno Haible) Date: Wed, 20 Jun 2001 12:05:07 +0200 (MET DST) Subject: [I18n-sig] Re: pygettext.py extraction of docstrings In-Reply-To: <15151.44419.951894.490695@anthem.wooz.org> References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15118.27210.930905.339141@anthem.wooz.org> <200105252012.f4PKCg801160@mira.informatik.hu-berlin.de> <15151.44419.951894.490695@anthem.wooz.org> Message-ID: <200106201005.MAA14588@oberkampf.ilog.fr> barry@wooz.org writes: > MvL> I'd hope > MvL> that tools shouldn't go into flames if they see an extra > MvL> flag. At least GNU msgmerge does not show any concern. The tools don't flame if there is an unknown #, flag, but the tools like msgmerge currently don't preserve the flag either. Support for other languages than C/C++ in the gettext tools is on my list for gettext 0.12. This includes calling pygettext, and it also includes support for language specific #, flags. > I'm trying to close this issue out (along with the associated SF > patch). Since I haven't heard otherwise from Bruno, I'm going to > change the output to produce "#, docstring" flags. OK. Bruno From barry@wooz.org Wed Jun 20 20:44:40 2001 From: barry@wooz.org (Barry A.
Warsaw) Date: Wed, 20 Jun 2001 15:44:40 -0400 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15118.27210.930905.339141@anthem.wooz.org> <200105252012.f4PKCg801160@mira.informatik.hu-berlin.de> <15151.44419.951894.490695@anthem.wooz.org> Message-ID: <15152.64808.571763.915115@anthem.wooz.org> >>>>> "KE" == Karl Eichwalder writes: KE> Sounds good to me. Please, make sure to put the "#, ..." KE> expression just before the "msgid" line; thus it's easier for KE> the translator to see (sometimes we've very long "#: " lines). Ah, good point. Done. >>>>> "BH" == Bruno Haible writes: BH> Support for other languages than C/C++ in the gettext tools is BH> on my list for gettext 0.12. This includes calling pygettext, BH> and it also includes support for language specific #, flags. Cool. Let me know if I can help. I'm relying on pygettext quite heavily in Mailman, so I think it's pretty solid (latest revision is pygettext.py 1.20). Martin's also written a Python version of msgfmt which is in Python's Tools/i18n directory. Cheers, -Barry From martin@loewis.home.cs.tu-berlin.de Wed Jun 20 22:24:18 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 20 Jun 2001 23:24:18 +0200 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings In-Reply-To: <15152.64808.571763.915115@anthem.wooz.org> (barry@wooz.org) References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15118.27210.930905.339141@anthem.wooz.org> <200105252012.f4PKCg801160@mira.informatik.hu-berlin.de> <15151.44419.951894.490695@anthem.wooz.org> <15152.64808.571763.915115@anthem.wooz.org> Message-ID: <200106202124.f5KLOIW02383@mira.informatik.hu-berlin.de> > Cool. Let me know if I can help. I'm relying on pygettext quite > heavily in Mailman, so I think it's pretty solid (latest revision is > pygettext.py 1.20). Personally, I think xgettext should itself recognize docstrings. The po-utils already support extracting doc strings, and I added support to extract strings with __doc__ from C modules as well. Maybe I'll look into contributing these features to GNU gettext with native code. Regards, Martin From barry@wooz.org Wed Jun 20 23:07:52 2001 From: barry@wooz.org (Barry A. Warsaw) Date: Wed, 20 Jun 2001 18:07:52 -0400 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15118.27210.930905.339141@anthem.wooz.org> <200105252012.f4PKCg801160@mira.informatik.hu-berlin.de> <15151.44419.951894.490695@anthem.wooz.org> <15152.64808.571763.915115@anthem.wooz.org> <200106202124.f5KLOIW02383@mira.informatik.hu-berlin.de> Message-ID: <15153.7864.788320.742815@anthem.wooz.org> >>>>> "MvL" == Martin v Loewis writes: >> Cool. Let me know if I can help. I'm relying on pygettext >> quite heavily in Mailman, so I think it's pretty solid (latest >> revision is pygettext.py 1.20). MvL> Personally, I think xgettext should itself recognize MvL> docstrings. 
The po-utils already support extracting doc MvL> strings, and I added support to extract strings with __doc__ MvL> from C modules as well. MvL> Maybe I'll look into contributing these features to GNU MvL> gettext with native code. Cool, just be sure to make docstring extraction optional. E.g. it makes sense for Mailman's bin/* scripts where the module docstring doubles as usage text, but it doesn't make much sense for most plain old module docstrings. OTOH, maybe we should define a convention in the docstring to indicate that it's ripe for extraction. E.g. an _ as the first character in the docstring... -Barry From martin@loewis.home.cs.tu-berlin.de Thu Jun 21 07:46:53 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 21 Jun 2001 08:46:53 +0200 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings In-Reply-To: <15153.7864.788320.742815@anthem.wooz.org> (barry@wooz.org) References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15118.27210.930905.339141@anthem.wooz.org> <200105252012.f4PKCg801160@mira.informatik.hu-berlin.de> <15151.44419.951894.490695@anthem.wooz.org> <15152.64808.571763.915115@anthem.wooz.org> <200106202124.f5KLOIW02383@mira.informatik.hu-berlin.de> <15153.7864.788320.742815@anthem.wooz.org> Message-ID: <200106210646.f5L6krj01120@mira.informatik.hu-berlin.de> > Cool, just be sure to make docstring extraction optional. E.g. it > makes sense for Mailman's bin/* scripts where the module docstring > doubles as usage text, but it doesn't make much sense for most plain > old module docstrings. Certainly. This is essentially like a new keyword (-k) to look for. > OTOH, maybe we should define a convention in the docstring to indicate > that it's ripe for extraction. E.g. an _ as the first character in > the docstring... Please, no. In any case, it is up to translators to translate them; if the doc strings look too useless, they can ignore them. Regards, Martin From paulp@ActiveState.com Sat Jun 23 02:35:18 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 22 Jun 2001 18:35:18 -0700 Subject: [I18n-sig] International Components for Unicode Message-ID: <3B33F256.3C133966@ActiveState.com> Is this of any value to us? http://oss.software.ibm.com/icu/index.html -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From martin@loewis.home.cs.tu-berlin.de Sat Jun 23 08:47:34 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sat, 23 Jun 2001 09:47:34 +0200 Subject: [I18n-sig] International Components for Unicode In-Reply-To: <3B33F256.3C133966@ActiveState.com> (message from Paul Prescod on Fri, 22 Jun 2001 18:35:18 -0700) References: <3B33F256.3C133966@ActiveState.com> Message-ID: <200106230747.f5N7lYG01049@mira.informatik.hu-berlin.de> > Is this of any value to us? > > http://oss.software.ibm.com/icu/index.html I'm not sure. It always seemed to me that ICU is an all-or-nothing solution. I.e. if you want to access its functionality, you have to use their Unicode type, their locale objects, their message catalogs and so on. Python 2.1 offers already quite a lot of this functionality; merging that with ICU would be a real challenge. You'd probably need to offer a choice: either ICU locales or C locales; either ICU message catalogs or gettext. 
For the Unicode types, you'd have to copy strings forth and back between ICU Unicode objects and Python Unicode objects. Also, offering these services to Python users is challenging. It can't really become a standard library: The ICU distribution is 6.5MB of C++ source code, so I doubt it would ever be included in core Python. Somebody could volunteer and offer wrapper code, and put that on SF. To use that API, an application author would need to get ICU, and the wrapper (preferably in versions that match). Later, all users of the application also need to install ICU, and the wrapper. These days, Linux distributions offer precompiled ICU installations, but that might add to the problems rather than reducing them: The wrapper will need to deal with multiple ICU versions. Finally, ICU solves none of the most urgent Python-and-I18N problems: None of the standard libraries will become more Unicode-aware than they are now; it still is not possible to use non-ASCII text in source code in a convenient way; printing Unicode strings to sys.stdout will continue to produce exceptions. So my guess is that nothing will happen with ICU integration, and that the question will come up every few months. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Sat Jun 23 09:26:26 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sat, 23 Jun 2001 10:26:26 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> (message from Guido van Rossum on Tue, 20 Feb 2001 14:36:35 -0500) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> Message-ID: <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> [Uche] > Sure. I admit it's hearsay, but I thought I'd read that because Java > Unicode is or was underspecified, that there was the possibility of > transposition of the high-surrogate with the low-surrogate character > between Java implementations or platforms. I've tried to find out what problem that could be. So far, I found http://developer.java.sun.com/developer/bugParade/bugs/4344266.html Here, they complain that the codecs don't properly check for surrogates that straddle invocations of convert, or get incorrect surrogate pairs. There is a bug report on SF that Python has similar problems. http://developer.java.sun.com/developer/bugParade/bugs/4328816.html summarizes problems that have been fixed with surrogates in UTF-8, again, similar problems are probably present in Python. There were also a few bug reports about surrogates working differently depending on locale (fail in zh_CN, pass in C), and type of virtual machine (fail in classic, pass in hotspot). I could not find any report on a bug where surrogates are output in incorrect order. [Guido] > On the XML sig the following exchange happened. I don't know enough > about the issues to investigate, but I'm sure that someone here can > provide insight? It seems to boil down to whether or not surrogates > may get transposed when between platforms. I very much doubt this could ever happen. Regards, Martin From mal@lemburg.com Sat Jun 23 11:38:39 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 23 Jun 2001 12:38:39 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> Message-ID: <3B3471AF.1311E872@lemburg.com> Could someone please restate the original question ?
The archives don't seem to have the original postings and the quotes Martin have in his reply don't seem to have anything todo with Python. About surrogate support in Python: the UTF-8 codec has full surrogate support for encodings and decoding, the unicode-escape codec can decode using surrogates, all others don't support surrogates. Thanks, -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ "Martin v. Loewis" wrote: > > [Uche] > > Sure. I admit it's hearsay, but I thought I'd read that because Java > > Unicode is or was underspecified, that there was the possibility of > > transposition of the high-surrogate with the low-surrogate character > > between Java implementations or platforms. > > I've tried to find out what problem that could be. So far, I found > > http://developer.java.sun.com/developer/bugParade/bugs/4344266.html > > Here, they complain that the codecs don't properly check for > surrogates that straddle invocations of convert, or get incorrect > surrogate pairs. There is a bug report on SF that Python has similar > problems. > > http://developer.java.sun.com/developer/bugParade/bugs/4328816.html > > summarizes problems that have been fixed with surrogates in UTF-8, > again, similar problems are probably present in Python. > > There were also a few bug reports about surrogates working differently > depending on locale (fail in zh_CN, pass in C), and type of virtual > machine (fail in classic, pass in hotspot). > > I could not find any report on a bug where surrogates are output in > incorrect order. > > [Guido] > > On the XML sig the following exchange happened. I don't know enough > > about the issues to investigate, but I'm sure that someone here can > > provide insight? It seems to boil down to whether or not surrogates > > may get transposed when between platforms. > > I very much doubt this could ever happen. > > Regards, > Martin > > _______________________________________________ > I18n-sig mailing list > I18n-sig@python.org > http://mail.python.org/mailman/listinfo/i18n-sig From martin@loewis.home.cs.tu-berlin.de Sat Jun 23 13:20:38 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sat, 23 Jun 2001 14:20:38 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <3B3471AF.1311E872@lemburg.com> (mal@lemburg.com) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> Message-ID: <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> > About surrogate support in Python: the UTF-8 codec has full > surrogate support for encodings and decoding I think there are a number of bugs lying around here. For example, shouldn't >>> u" \ud800 ".encode("utf-8") ' \xa0\x80 ' give an error, since this is a lone low surrogate word? Likewise, but somewhat more troubling, surrogates that straddle write invocations are not processed properly. >>> s=StringIO.StringIO() >>> _,_,r,w=codecs.lookup("utf-8") >>> f=w(s) >>> f.write(u"\ud800") >>> f.write(u"\udc00") >>> f.flush() >>> s.getvalue() '\xa0\x80\xed\xb0\x80' whereas the correct answer would have been >>> u"\ud800\udc00".encode("utf-8") '\xf0\x90\x80\x80' Regards, Martin From mal@lemburg.com Sat Jun 23 21:19:09 2001 From: mal@lemburg.com (M.-A. 
Lemburg) Date: Sat, 23 Jun 2001 22:19:09 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> Message-ID: <3B34F9BD.4FDEFC62@lemburg.com> "Martin v. Loewis" wrote: > > > About surrogate support in Python: the UTF-8 codec has full > > surrogate support for encodings and decoding > > I think there are a number of bugs lying around here. For example, > shouldn't > > >>> u" \ud800 ".encode("utf-8") > ' \xa0\x80 ' > > give an error, since this is a lone low surrogate word? Yes. > Likewise, but somewhat more troubling, surrogates that straddle write > invocations are not processed properly. > > >>> s=StringIO.StringIO() > >>> _,_,r,w=codecs.lookup("utf-8") > >>> f=w(s) > >>> f.write(u"\ud800") > >>> f.write(u"\udc00") > >>> f.flush() > >>> s.getvalue() > '\xa0\x80\xed\xb0\x80' > > whereas the correct answer would have been > > >>> u"\ud800\udc00".encode("utf-8") > '\xf0\x90\x80\x80' This is a special case of the above (since the encoder will see truncated surrogates and should raise raise an exception for these). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tree@basistech.com Sat Jun 23 21:20:54 2001 From: tree@basistech.com (Tom Emerson) Date: Sat, 23 Jun 2001 16:20:54 -0400 Subject: [I18n-sig] International Components for Unicode In-Reply-To: <3B33F256.3C133966@ActiveState.com> References: <3B33F256.3C133966@ActiveState.com> Message-ID: <15156.64038.410669.795084@cymru.basistech.com> The one thing from ICU that would be useful is the plethora of encoding tables it comes with. If we had support for their tables we would have access to several hundred (last I checked they had over 600 encodings) encodings immediately available, and they would be responsible for updating them. -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From mal@lemburg.com Sat Jun 23 22:18:34 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 23 Jun 2001 23:18:34 +0200 Subject: [I18n-sig] International Components for Unicode References: <3B33F256.3C133966@ActiveState.com> <15156.64038.410669.795084@cymru.basistech.com> Message-ID: <3B3507AA.2E5121C1@lemburg.com> Tom Emerson wrote: > > The one thing from ICU that would be useful is the plethora of > encoding tables it comes with. If we had support for their tables we > would have access to several hundred (last I checked they had over 600 > encodings) encodings immediately available, and they would be > responsible for updating them. While this would be nice to have, the size of ICU will prevent any inclusion in the Python core. However, wrapping all or parts of the lib to integrate them into the existing Python i18n support would certainly be a project worth trying. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Sat Jun 23 23:19:22 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Sun, 24 Jun 2001 00:19:22 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <3B34F9BD.4FDEFC62@lemburg.com> (mal@lemburg.com) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> Message-ID: <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> > > Likewise, but somewhat more troubling, surrogates that straddle write > > invocations are not processed properly. > > > > >>> s=StringIO.StringIO() > > >>> _,_,r,w=codecs.lookup("utf-8") > > >>> f=w(s) > > >>> f.write(u"\ud800") > > >>> f.write(u"\udc00") > > >>> f.flush() > > >>> s.getvalue() > > '\xa0\x80\xed\xb0\x80' > > > > whereas the correct answer would have been > > > > >>> u"\ud800\udc00".encode("utf-8") > > '\xf0\x90\x80\x80' > > This is a special case of the above (since the encoder will > see truncated surrogates and should raise an exception > for these). I don't think it should; it is not truncated since a later write call will provide the missing word. If you have a Unicode stream, it should be possible to read the stream contents in arbitrary chunks of words, and encode it with a stream encoder. The stream encoder should produce the same output no matter how you split the input. Under your proposed behaviour, this is not the case. Please note that http://sourceforge.net/tracker/index.php?func=detail&aid=433882&group_id=5470&atid=105470 adds a few other aspects to the problem: It appears that Unicode 3.1 specifies that certain forms of UTF-8 encoded surrogates are merely irregular, not illegal. There may be some misinterpretation of the spec in this report, but I think all this needs careful checking. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Sat Jun 23 23:26:27 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sun, 24 Jun 2001 00:26:27 +0200 Subject: [I18n-sig] International Components for Unicode In-Reply-To: <15156.64038.410669.795084@cymru.basistech.com> (message from Tom Emerson on Sat, 23 Jun 2001 16:20:54 -0400) References: <3B33F256.3C133966@ActiveState.com> <15156.64038.410669.795084@cymru.basistech.com> Message-ID: <200106232226.f5NMQR720529@mira.informatik.hu-berlin.de> > The one thing from ICU that would be useful is the plethora of > encoding tables it comes with. If we had support for their tables we > would have access to several hundred encodings immediately (last I > checked they had over 600), and they would be > responsible for updating them. That's true, but I'd prefer to integrate the encodings that come with the operating systems first. E.g. on Unix, iconv(3) will also give you many encodings. Including aliases, glibc 2.2 provides about 1100 encodings. On Windows, some Internet/ActiveX API offers a huge variety of encodings, if the administrator has chosen to install them. If you have Tcl, it provides a number of converters that are not currently included in Python. All these encodings can be made available to Python users by just installing an extension module; whereas with ICU, you'd have to install some huge library. Regards, Martin From JMachin@Colonial.com.au Sun Jun 24 01:09:34 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Sun, 24 Jun 2001 10:09:34 +1000 Subject: [I18n-sig] How does Python Unicode treat surrogates?
Message-ID: <9F2D83017589D211BD1000805FA70CA703B139D6@ntxmel03.cmutual.com.au> Hello there, I'm the 'nobody' who raised the SF bug report to which Martin refers. According to Unicode 3.0, transformations between scalars and UTF-n should provide lossless round-trip transcoding, even for invalid scalars like unpaired surrogates and values like 0xFFFE and 0xFFFF. Unicode 3.1 adds further clarification by listing out what are legal byte sequences for UTF-8; these include byte sequences that encompass those invalid scalars. There is a note in the Unicode docs that ISO/IEC 10646 ("ISO" for short) forbids this permissive treatment of invalid scalars. The implementation in Python 2.1 does this:

encoding to UTF-8:
  0xFFFF etc: Unicode-compliant
  unpaired low surrogate: Unicode-compliant
  unpaired high surrogate: *BUG*, generates invalid UTF-8 byte sequence

decoding from UTF-8:
  0xFFFF etc: Unicode-compliant
  unpaired surrogates: ISO-compliant

In a note that Martin added to my bug report, he seems to be advocating ISO compliance. My two-cents-worth on approach to differences between Unicode and ISO: Unicode is the *practical* standard. Unicode is the *available* standard -- you can buy the book; you can access the web site. Martin said in his note to my bug report that he doesn't have a copy of the ISO document(s); he's not alone! Python advertises Unicode support, not ISO/IEC 10646 support. If we make the transcoding of invalid scalars ISO-compliant, then we should document and justify this. We should do this for *all* invalid scalars, not just unpaired surrogates. Perhaps the effort that would be required to do all the explicit testing to make all the transcoders ISO-compliant would be better directed into providing a function or method that checked a Unicode string for the presence of invalid scalars. A very practical point: Fixing the invalid-byte-sequence bug involves adding two or three lines of code. Making the UTF-8 decoder Unicode-compliant involves removing half a line of code. Minimal effort and no documentation and justifications required. Hmmm, 4 cents worth by the end of the rant :-) Anyway, hope this helps, John -----Original Message----- From: Martin v. Loewis [mailto:martin@loewis.home.cs.tu-berlin.de] Sent: Sunday, 24 June 2001 8:19 To: mal@lemburg.com Cc: guido@digicool.com; i18n-sig@python.org Subject: Re: [I18n-sig] How does Python Unicode treat surrogates? > > Likewise, but somewhat more troubling, surrogates that straddle write > > invocations are not processed properly. > > > > >>> s=StringIO.StringIO() > > >>> _,_,r,w=codecs.lookup("utf-8") > > >>> f=w(s) > > >>> f.write(u"\ud800") > > >>> f.write(u"\udc00") > > >>> f.flush() > > >>> s.getvalue() > > '\xa0\x80\xed\xb0\x80' > > > > whereas the correct answer would have been > > > > >>> u"\ud800\udc00".encode("utf-8") > > '\xf0\x90\x80\x80' > > This is a special case of the above (since the encoder will > see truncated surrogates and should raise an exception > for these). I don't think it should; it is not truncated since a later write call will provide the missing word. If you have a Unicode stream, it should be possible to read the stream contents in arbitrary chunks of words, and encode it with a stream encoder. The stream encoder should produce the same output no matter how you split the input. Under your proposed behaviour, this is not the case.
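To make the split-invariance requirement concrete, here is a minimal sketch of a stream writer that buffers a trailing high surrogate until the next write() supplies its partner. The class is hypothetical -- it is not any codec that shipped with Python -- and only illustrates the behaviour Martin is asking for:

    class PairBufferingUTF8Writer:
        """Illustrative only: hold back a trailing high surrogate so
        that the encoded output does not depend on how the caller
        split the input across write() calls."""
        def __init__(self, stream):
            self.stream = stream
            self.pending = u""  # at most one high surrogate word

        def write(self, data):
            data = self.pending + data
            self.pending = u""
            if data and u"\ud800" <= data[-1] <= u"\udbff":
                # The last word is a high surrogate; its partner may
                # arrive in the next write() call, so keep it back.
                self.pending = data[-1]
                data = data[:-1]
            self.stream.write(data.encode("utf-8"))

        def flush(self):
            if self.pending:
                # Now the pair really is truncated; raising is fair.
                raise UnicodeError("truncated surrogate pair at end of stream")
            self.stream.flush()

With this buffering, write(u"\ud800") followed by write(u"\udc00") yields the same bytes as the single call write(u"\ud800\udc00").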
Please note that http://sourceforge.net/tracker/index.php?func=detail&aid=433882&group_id=5470&atid=105470 adds a few other aspects to the problem: It appears that Unicode 3.1 specifies that certain forms of UTF-8 encoded surrogates are merely irregular, not illegal. There may be some misinterpretation of the spec in this report, but I think all this needs careful checking. Regards, Martin _______________________________________________ I18n-sig mailing list I18n-sig@python.org http://mail.python.org/mailman/listinfo/i18n-sig From fw@deneb.enyo.de Sun Jun 24 10:16:22 2001 From: fw@deneb.enyo.de (Florian Weimer) Date: 24 Jun 2001 11:16:22 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <9F2D83017589D211BD1000805FA70CA703B139D6@ntxmel03.cmutual.com.au> ("Machin, John"'s message of "Sun, 24 Jun 2001 10:09:34 +1000") References: <9F2D83017589D211BD1000805FA70CA703B139D6@ntxmel03.cmutual.com.au> Message-ID: <87u216qluh.fsf@deneb.enyo.de> "Machin, John" writes: > Unicode is the *practical* standard. Unicode is the > *available* standard -- you can buy the book; you can access > the web site. Martin said in his note to my bug report that > he doesn't have a copy of the ISO document(s); he's not alone! ISO 10646 is the ISO standard with the lowest money-per-page ratio ever, I think. You can order a PDF version (shipped on CD-ROM) from the ISO website at http://www.iso.ch/ . Some standards used by Python are much, much more expensive. From mal@lemburg.com Sun Jun 24 12:28:06 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sun, 24 Jun 2001 13:28:06 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> Message-ID: <3B35CEC6.710243E7@lemburg.com> First of all, I'd like to say that we left the handling of surrogates undefined back when we initially discussed the internal format for storing Unicode. The reasoning was simple: there were no assigned char points outside the BMP (roughly the lower 16-bit range). It was decided to use 16 bits per character as the basis for dealing with Unicode in such a way that we get the disjunction of UTF-16 and UCS-2 (Unicode 2.x). This allowed us to postpone the handling of variable length problems to a later point in time. Now with Unicode 3.1, the time has come to rethink these things, since for the first time, there are assigned char points outside the BMP which could eventually be used by programmers. This means that we have to start thinking about how to treat UTF-16 surrogates (two Py_UNICODE elements per Unicode character). The basic questions are:
1. How to treat lone surrogates (the Unicode char U+10000 is represented as the two words 0xd800 0xdc00 in UTF-16)?
2. What to do when slicing of Unicode strings would break a surrogate pair?
3. How to treat input data which has lone surrogate words in strings (at the start, in the middle and at the end)?
4. How to process requests for creating output data from lone surrogate words?

BTW, Python's Unicode implementation is bound to the standard defined at www.unicode.org; moving over to ISO 10646 is not an option. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tree@basistech.com Sun Jun 24 17:15:57 2001 From: tree@basistech.com (Tom Emerson) Date: Sun, 24 Jun 2001 12:15:57 -0400 Subject: [I18n-sig] International Components for Unicode In-Reply-To: <200106232226.f5NMQR720529@mira.informatik.hu-berlin.de> References: <3B33F256.3C133966@ActiveState.com> <15156.64038.410669.795084@cymru.basistech.com> <200106232226.f5NMQR720529@mira.informatik.hu-berlin.de> Message-ID: <15158.4669.583190.272218@cymru.basistech.com> Martin v. Loewis writes: > That's true, but I'd prefer to integrate the encodings that > come with the operating systems first. E.g. on Unix, iconv(3) will > also give you many encodings. Including aliases, glibc 2.2 provides > about 1100 encodings. Of course iconv on Linux has a different set of encodings than iconv on Solaris, which has a different set than on Irix. And of course those encodings that are shared are often implemented differently. > All these encodings can be made available to Python users by just > installing an extension module; whereas with ICU, you'd have to > install some huge library. You've misunderstood. I'm not saying we pull in ICU. I'm saying that we write a set of Python modules that can read and make use of the ICU encoding datafile formats, and use those. In ICU all encoding data is kept as external data. Obviously integrating all of ICU into Python would be a fool's errand. -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From martin@loewis.home.cs.tu-berlin.de Sun Jun 24 18:03:33 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sun, 24 Jun 2001 19:03:33 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <3B35CEC6.710243E7@lemburg.com> (mal@lemburg.com) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> Message-ID: <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> > The basic questions are: > > 1. How to treat lone surrogates (the Unicode char U+10000 is > represented as the two words 0xd800 0xdc00 in UTF-16) ? > > 2. What to do when slicing of Unicode strings would break > a surrogate pair ? > > 3. How to treat input data which has lone surrogate words > in strings (at the start, in the middle and at the end) ? > > 4. How to process requests for creating output data from > lone surrogate words ? I'd like to add another question 0. Should Py_UNICODE be extended to 32 bits?
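A quick illustration of what question 0 changes, using nothing but the standard surrogate arithmetic (a sketch; the variable names are invented and no particular Python version is implied):

    # U+10000 stored as two 16-bit words (a surrogate pair) vs. one 32-bit unit:
    cp = 0x10000
    hi = 0xD800 + ((cp - 0x10000) >> 10)    # -> 0xD800
    lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)  # -> 0xDC00
    # With a 16-bit Py_UNICODE the string holds the pair (hi, lo) and its
    # length is 2; with a 32-bit Py_UNICODE it would hold cp and have length 1.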
> BTW, Python's Unicode implementation is bound to the standard > defined at www.unicode.org; moving over to ISO 10646 is not an > option. Can you elaborate? How can you rule out that option that easily? And why can't Python support the two standards simultaneously? Regards, Martin From mal@lemburg.com Sun Jun 24 19:04:28 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sun, 24 Jun 2001 20:04:28 +0200 Subject: [I18n-sig] International Components for Unicode References: <3B33F256.3C133966@ActiveState.com> <15156.64038.410669.795084@cymru.basistech.com> <200106232226.f5NMQR720529@mira.informatik.hu-berlin.de> <15158.4669.583190.272218@cymru.basistech.com> Message-ID: <3B362BAC.A08E8128@lemburg.com> Tom Emerson wrote: > > I'm not saying we pull in ICU. I'm saying that > we write a set of Python modules that can read and make use of the ICU > encoding datafile formats, and use those. In ICU all encoding data is > kept as external data. Would we need to incorporate some of ICU for this to work or could we use a Python script to convert those tables to ones usable in Python ? -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Sun Jun 24 19:16:59 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sun, 24 Jun 2001 20:16:59 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> Message-ID: <3B362E9B.4DC8DD81@lemburg.com> "Martin v. Loewis" wrote: > > > The basic questions are: > > > > 1. How to treat lone surrogates (the Unicode char U+10000 is > > represented as the two words 0xd800 0xdc00 in UTF-16) ? > > > > 2. What to do when slicing of Unicode strings would break > > a surrogate pair ? > > > > 3. How to treat input data which has lone surrogate words > > in strings (at the start, in the middle and at the end) ? > > > > 4. How to process requests for creating output data from > > lone surrogate words ? > > I'd like to add another question > > 0. Should Py_UNICODE be extended to 32 bits? This would mean 4 bytes per Unicode character and is unacceptable given the fact that most of these would be 0-bytes in practice. It would also break binary compatibility to the native Unicode wchar_t type on e.g. Windows, which is among the most Unicode-aware platforms there are today. > > BTW, Python's Unicode implementation is bound to the standard > > defined at www.unicode.org; moving over to ISO 10646 is not an > > option. > > Can you elaborate? How can you rule out that option that easily? It is not an option because we chose Unicode as our basis for i18n work and not the ISO 10646 Uniform Character Set. I'd rather have those two camps fight over the details of the Unicode standard than try to fix anything related to the differences between the two in Python by mixing them. > And why can't Python support the two standards simultaneously? Why would you want to support two standards for the same thing ?
-- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tree@basistech.com Sun Jun 24 20:32:03 2001 From: tree@basistech.com (Tom Emerson) Date: Sun, 24 Jun 2001 15:32:03 -0400 Subject: [I18n-sig] International Components for Unicode In-Reply-To: <3B362BAC.A08E8128@lemburg.com> References: <3B33F256.3C133966@ActiveState.com> <15156.64038.410669.795084@cymru.basistech.com> <200106232226.f5NMQR720529@mira.informatik.hu-berlin.de> <15158.4669.583190.272218@cymru.basistech.com> <3B362BAC.A08E8128@lemburg.com> Message-ID: <15158.16435.45725.274341@cymru.basistech.com> M.-A. Lemburg writes: > Would we need to incorporate some of ICU for this to work or could > we use a Python script to convert those tables to ones usable in > Python ? No, we wouldn't need to incorporate anything from ICU except the tables: that's my point. As long as we wrote the code to read the tables directly people could use them without conversion or anything like it. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From martin@loewis.home.cs.tu-berlin.de Sun Jun 24 19:37:02 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sun, 24 Jun 2001 20:37:02 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <3B362E9B.4DC8DD81@lemburg.com> (mal@lemburg.com) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> Message-ID: <200106241837.f5OIb2r07377@mira.informatik.hu-berlin.de> > It is not an option because we chose Unicode as our basis for > i18n work and not the ISO 10646 Uniform Character Set. Please speak for yourself only. > > And why can't Python support the two standards simultaneously? > > Why would you want to support two standards for the same thing ? Because they are almost identical. Regards, Martin From tim.one@home.com Mon Jun 25 06:37:25 2001 From: tim.one@home.com (Tim Peters) Date: Mon, 25 Jun 2001 01:37:25 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <3B35CEC6.710243E7@lemburg.com> Message-ID: [M.-A. Lemburg] > ... > 2. What to do when slicing of Unicode strings would break > a surrogate pair ? To me a string is a sequence of characters, and s[0] returns the first, s[1] the second, and so on. The internal details of how the implementation chooses to torture itself <0.7 wink> should be invisible. That is, breaking a surrogate via slicing should be impossible: s[i:j] returns j-i characters, and that's that. This implies the internal start address for the character s[i] can't be computed as base + N*i, unless-- what? --some fixed number B of bits >= 20 is used internally for each character. > ... > BTW, Python's Unicode implementation is bound to the standard > defined at www.unicode.org; moving over to ISO 10646 is not an > option. I doubt that either std says anything about how an implementation represents characters internally. And I'm certain neither mentions Py_UNICODE at all <wink>.
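If s[i:j] is really to return j-i characters rather than j-i storage words, someone has to do code point arithmetic over the 16-bit units. A rough sketch of such a helper, assuming the UTF-16-style internal layout discussed in this thread (nothing like this existed in the standard library at the time):

    def codepoint_count(u):
        # Count code points in a sequence of 16-bit units, pairing each
        # high surrogate with an immediately following low surrogate.
        count = 0
        i = 0
        while i < len(u):
            if (u"\ud800" <= u[i] <= u"\udbff" and i + 1 < len(u)
                    and u"\udc00" <= u[i + 1] <= u"\udfff"):
                i = i + 2   # surrogate pair: one character, two words
            else:
                i = i + 1   # BMP character (or a lone surrogate word)
            count = count + 1
        return count

For example, codepoint_count(u"abc") gives 3, while codepoint_count(u"\ud840\udc00") gives 1 even though len() reports 2 on a 16-bit build.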
From mal@lemburg.com Mon Jun 25 12:39:07 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 13:39:07 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: Message-ID: <3B3722DB.1FF54794@lemburg.com> Tim Peters wrote: > > [M.-A. Lemburg] > > ... > > 2. What to do when slicing of Unicode strings would break > > a surrogate pair ? > > To me a string is a sequence of characters, and s[0] returns the first, s[1] > the second, and so on. The internal details of how the implementation > chooses to torture itself <0.7 wink> should be invisible. That is, breaking > a surrogate via slicing should be impossible: s[i:j] returns j-i > characters, and that's that. It's not that simple: lone surrogates are true Unicode char points in their own right; it's just that they are pretty useless without their resp. partners in the data stream. And with this "feature" they are in good company: the Unicode combining characters (e.g. the combining acute) have the same property. Hard to say what's right and wrong here... (note that I posted the questions without an initial comment on what I think on these issues -- I simply don't know for sure just yet ;-) > This implies the internal start address for > the character s[i] can't be computed as base + N*i, unless-- what? --some > fixed number B of bits >= 20 is used internally for each character. > > > ... > > BTW, Python's Unicode implementation is bound to the standard > > defined at www.unicode.org; moving over to ISO 10646 is not an > > option. > > I doubt that either std says anything about how an implementation represents > characters internally. And I'm certain neither mentions Py_UNICODE at all > <wink>. That comment was aimed at Martin's proposal to stick with ISO 10646 for the UTF-8 codec treatment of lone surrogates. It has nothing to do with how we store Unicode internally... (sorry for the confusion). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Mon Jun 25 12:41:10 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 13:41:10 +0200 Subject: [I18n-sig] International Components for Unicode References: <3B33F256.3C133966@ActiveState.com> <15156.64038.410669.795084@cymru.basistech.com> <200106232226.f5NMQR720529@mira.informatik.hu-berlin.de> <15158.4669.583190.272218@cymru.basistech.com> <3B362BAC.A08E8128@lemburg.com> <15158.16435.45725.274341@cymru.basistech.com> Message-ID: <3B372356.A9BED3F9@lemburg.com> Tom Emerson wrote: > > M.-A. Lemburg writes: > > Would we need to incorporate some of ICU for this to work or could > > we use a Python script to convert those tables to ones usable in > > Python ? > > No, we wouldn't need to incorporate anything from ICU except the > tables: that's my point. As long as we wrote the code to read the > tables directly people could use them without conversion or anything > like it. Sounds great ! What's the license on those tables ?
-- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tree@basistech.com Mon Jun 25 12:06:00 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 07:06:00 -0400 Subject: [I18n-sig] International Components for Unicode In-Reply-To: <3B372356.A9BED3F9@lemburg.com> References: <3B33F256.3C133966@ActiveState.com> <15156.64038.410669.795084@cymru.basistech.com> <200106232226.f5NMQR720529@mira.informatik.hu-berlin.de> <15158.4669.583190.272218@cymru.basistech.com> <3B362BAC.A08E8128@lemburg.com> <15158.16435.45725.274341@cymru.basistech.com> <3B372356.A9BED3F9@lemburg.com> Message-ID: <15159.6936.745436.585017@cymru.basistech.com> M.-A. Lemburg writes: > Sounds great ! > > What's the license on those tables ? The latest ICU was released under the MIT/X license. I assume the tables are licensed similarly. -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From mal@lemburg.com Mon Jun 25 12:46:11 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 13:46:11 +0200 Subject: [I18n-sig] International Components for Unicode References: <3B33F256.3C133966@ActiveState.com> <15156.64038.410669.795084@cymru.basistech.com> <200106232226.f5NMQR720529@mira.informatik.hu-berlin.de> <15158.4669.583190.272218@cymru.basistech.com> <3B362BAC.A08E8128@lemburg.com> <15158.16435.45725.274341@cymru.basistech.com> <3B372356.A9BED3F9@lemburg.com> <15159.6936.745436.585017@cymru.basistech.com> Message-ID: <3B372483.A6E71057@lemburg.com> Tom Emerson wrote: > > M.-A. Lemburg writes: > > Sounds great ! > > > > What's the license on those tables ? > > The latest ICU was released under the MIT/X license. I assume the > tables are licensed similarly. Sounds even better :-) I think we should look into getting support for them into an extension similar to the one Tamito is working on and then place them into the python/dist/encodings directory. I just wish I had time to look into this... :-( -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Mon Jun 25 13:01:33 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 14:01:33 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106241837.f5OIb2r07377@mira.informatik.hu-berlin.de> Message-ID: <3B37281D.BE13E297@lemburg.com> "Martin v. Loewis" wrote: > > > It is not an option because we chose Unicode as our basis for > > i18n work and not the ISO 10646 Uniform Character Set. > > Please speak for yourself only. With "we" I referred to the python-dev/i18n-sig team. Since these things are all based on consensus, not necessarily all members of those teams will have or have had the same opinion.
Speaking only for myself: I would very much appreciate it if you would stop throwing these meta-comments into discussions we have on this list. > > > And why can't Python support the two standards simultaneously? > > > > Why would you want to support two standards for the same thing ? > > Because they are almost identical. True, but it's those small differences that make life harder. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From gs234@cam.ac.uk Mon Jun 25 13:03:31 2001 From: gs234@cam.ac.uk (Gaute B Strokkenes) Date: 25 Jun 2001 13:03:31 +0100 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? In-Reply-To: <3B3722DB.1FF54794@lemburg.com> ("M.-A. Lemburg"'s message of "Mon, 25 Jun 2001 13:39:07 +0200") References: <3B3722DB.1FF54794@lemburg.com> Message-ID: <4ak820g418.fsf@kern.srcf.societies.cam.ac.uk> [I'm cc:-ing the unicode list to make sure that I've gotten my terminology right, and to solicit comments] On Mon, 25 Jun 2001, mal@lemburg.com wrote: > Tim Peters wrote: >> >> [M.-A. Lemburg] >> > ... >> > 2. What to do when slicing of Unicode strings would break >> > a surrogate pair ? >> >> To me a string is a sequence of characters, and s[0] returns the >> first, s[1] the second, and so on. The internal details of how the >> implementation chooses to torture itself <0.7 wink> should be >> invisible. That is, breaking a surrogate via slicing should be >> impossible: s[i:j] returns j-i characters, and that's that. > > It's not that simple: lone surrogates are true Unicode char points > in their own right; it's just that they are pretty useless without > their resp. partners in the data stream. And with this "feature" > they are in good company: the Unicode combining characters (e.g. the > combining acute) have the same property. This is completely and totally wrong. The Unicode standard version 3.1 states (conformance requirement C12(c)): A conformant process shall not interpret illegal UTF code unit sequences as characters. The precise definition of "illegal" in this context is given elsewhere. See : 0xD800 is incomplete in Unicode. Unless followed by another 16-bit value of the right form, it is illegal. (Unicode here should read UTF-16, of course. The reason it does not is that the language of the technical report has not been updated to that of 3.1) -- Big Gaute http://www.srcf.ucam.org/~gs234/ Hello? Enema Bondage? I'm calling because I want to be happy, I guess.. From JMachin@Colonial.com.au Mon Jun 25 13:33:50 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Mon, 25 Jun 2001 22:33:50 +1000 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? Message-ID: <9F2D83017589D211BD1000805FA70CA703B139D8@ntxmel03.cmutual.com.au> MAL and Gaute, Can I please take the middle ground (and risk having both of you throw things at me?) => Lone surrogates are not 'true Unicode char points in their own right' [MAL] -- they don't represent characters. On the other hand, UTF code sequences that would decode into lone surrogates are not "illegal". Please read clause D29 in section 3.8 of the Unicode 3.0 standard. This is further clarified by Unicode 3.1 which expressly lists legal UTF-8 sequences; these encompass lone surrogates. -----Original Message----- From: Gaute B Strokkenes [mailto:gs234@cam.ac.uk] Sent: Monday, 25 June 2001 22:04 To: M.-A.
Lemburg Cc: Tim Peters; i18n-sig@python.org; unicode@unicode.org Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? [I'm cc:-ing the unicode list to make sure that I've gotten my terminology right, and to solicit comments] On Mon, 25 Jun 2001, mal@lemburg.com wrote: > Tim Peters wrote: >> >> [M.-A. Lemburg] >> > ... >> > 2. What to do when slicing of Unicode strings would break >> > a surrogate pair ? >> >> To me a string is a sequence of characters, and s[0] returns the >> first, s[1] the second, and so on. The internal details of how the >> implementation chooses to torture itself <0.7 wink> should be >> invisible. That is, breaking a surrogate via slicing should be >> impossible: s[i:j] returns j-i characters, and that's that. > > It's not that simple: lone surrogates are true Unicode char points > in their own right; it's just that they are pretty useless without > their resp. partners in the data stream. And with this "feature" > they are in good company: the Unicode combining characters (e.g. the > combining acute) have the same property. This is completely and totally wrong. The Unicode standard version 3.1 states (conformance requirement C12(c)): A conformant process shall not interpret illegal UTF code unit sequences as characters. The precise definition of "illegal" in this context is given elsewhere. See : 0xD800 is incomplete in Unicode. Unless followed by another 16-bit value of the right form, it is illegal. (Unicode here should read UTF-16, of course. The reason it does not is that the language of the technical report has not been updated to that of 3.1) -- Big Gaute http://www.srcf.ucam.org/~gs234/ Hello? Enema Bondage? I'm calling because I want to be happy, I guess.. _______________________________________________ I18n-sig mailing list I18n-sig@python.org http://mail.python.org/mailman/listinfo/i18n-sig From mal@lemburg.com Mon Jun 25 13:56:23 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 14:56:23 +0200 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? References: <9F2D83017589D211BD1000805FA70CA703B139D8@ntxmel03.cmutual.com.au> Message-ID: <3B3734F7.AEDDAAAA@lemburg.com> "Machin, John" wrote: > > MAL and Gaute, > > Can I please take the middle ground (and risk having both of you throw > things at me?) Sure :-) > => Lone surrogates are not 'true Unicode char points > in their own right' [MAL] -- they don't represent characters. I should have added "please correct me if I'm wrong", sorry. Let me put this into an example: Say you have a Unicode string which contains the following data: U+0061 U+0062 U+0063 U+DC00 U+0064 U+0065 U+0066 ("a" "b" "c" ? "d" "e" "f") Would you consider this sequence a Unicode string or not ? Please note that I am not talking about some UTF-n encoding here. The above snippet is simply to be seen as a sequence of data entries which are referenced by the Unicode database.
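Such a sequence is easy to construct, and nothing at construction time stops you; the interpreter transcript below is illustrative only (what the codecs should then do with such a string is exactly the open question):

    >>> s = u"abc\udc00def"   # the example above, with a lone low surrogate
    >>> len(s)
    7
    >>> s[3]
    u'\udc00'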
> On the other hand, UTF code sequences that would decode into lone surrogates > are not "illegal". > Please read clause D29 in section 3.8 of the Unicode 3.0 standard. This is > further clarified by Unicode 3.1 > which expressly lists legal UTF-8 sequences; these encompass lone > surrogates. > > -----Original Message----- > From: Gaute B Strokkenes [mailto:gs234@cam.ac.uk] > Sent: Monday, 25 June 2001 22:04 > To: M.-A. Lemburg > Cc: Tim Peters; i18n-sig@python.org; unicode@unicode.org > Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? > > [I'm cc:-ing the unicode list to make sure that I've gotten my > terminology right, and to solicit comments] > > On Mon, 25 Jun 2001, mal@lemburg.com wrote: > > Tim Peters wrote: > >> > >> [M.-A. Lemburg] > >> > ... > >> > 2. What to do when slicing of Unicode strings would break > >> > a surrogate pair ? > >> > >> To me a string is a sequence of characters, and s[0] returns the > >> first, s[1] the second, and so on. The internal details of how the > >> implementation chooses to torture itself <0.7 wink> should be > >> invisible. That is, breaking a surrogate via slicing should be > >> impossible: s[i:j] returns j-i characters, and that's that. > > > > It's not that simple: lone surrogates are true Unicode char points > > in their own right; it's just that they are pretty useless without > > their resp. partners in the data stream. And with this "feature" > > they are in good company: the Unicode combining characters (e.g. the > > combining acute) have the same property. > > This is completely and totally wrong. The Unicode standard version > 3.1 states (conformance requirement C12(c)): A conformant process shall > not interpret illegal UTF code unit sequences as characters. > > The precise definition of "illegal" in this context is given > elsewhere. See : > > 0xD800 is incomplete in Unicode. Unless followed by another 16-bit > value of the right form, it is illegal. > > (Unicode here should read UTF-16, of course. The reason it does not > is that the language of the technical report has not been updated to > that of 3.1) > > -- > Big Gaute http://www.srcf.ucam.org/~gs234/ > Hello? Enema Bondage? I'm calling because I want to be happy, I guess.. > > _______________________________________________ > I18n-sig mailing list > I18n-sig@python.org > http://mail.python.org/mailman/listinfo/i18n-sig -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Mon Jun 25 14:21:36 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 15:21:36 +0200 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates?
References: <3B3722DB.1FF54794@lemburg.com> <4ak820g418.fsf@kern.srcf.societies.cam.ac.uk> Message-ID: <3B373AE0.21E25716@lemburg.com> Gaute B Strokkenes wrote: > > [I'm cc:-ing the unicode list to make sure that I've gotten my > terminology right, and to solicit comments] > > On Mon, 25 Jun 2001, mal@lemburg.com wrote: > > Tim Peters wrote: > >> > >> [M.-A. Lemburg] > >> > ... > >> > 2. What to do when slicing of Unicode strings would break > >> > a surrogate pair ? > >> > >> To me a string is a sequence of characters, and s[0] returns the > >> first, s[1] the second, and so on. The internal details of how the > >> implementation chooses to torture itself <0.7 wink> should be > >> invisible. That is, breaking a surrogate via slicing should be > >> impossible: s[i:j] returns j-i characters, and that's that. > > > > It's not that simple: lone surrogates are true Unicode char points > > in their own right; it's just that they are pretty useless without > > their resp. partners in the data stream. And with this "feature" > > they are in good company: the Unicode combining characters (e.g. the > > combining acute) have the same property. > > This is completely and totally wrong. The Unicode standard version > 3.1 states (conformance requirement C12(c)): A conformant process shall > not interpret illegal UTF code unit sequences as characters. This would solve the UTF codec issue, but I was talking about Unicode itself. In Python, you can write u"abc\uD800\uDC00"[0:4] giving u"abc\uD800" without getting an exception and I am not sure whether this is correct or not. The internal machinery is a totally different issue: we currently use UTF-16 for this but have deliberately left out the surrogate support for the first implementation phase. > The precise definition of "illegal" in this context is given > elsewhere. See : > > 0xD800 is incomplete in Unicode. Unless followed by another 16-bit > value of the right form, it is illegal. > > (Unicode here should read UTF-16, of course. The reason it does not > is that the language of the technical report has not been updated to > that of 3.1) If you had left it at "Unicode" I would have felt better ;-) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From guido@digicool.com Mon Jun 25 14:42:01 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 09:42:01 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Sun, 24 Jun 2001 20:16:59 +0200." <3B362E9B.4DC8DD81@lemburg.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> Message-ID: <200106251342.f5PDg1q07291@odiug.digicool.com> > This would mean 4 bytes per Unicode character and is > unacceptable given the fact that most of these would be 0-bytes Agreed, but see below. > in practice. It would also break binary compatibility to the > native Unicode wchar_t type on e.g. Windows, which is among > the most Unicode-aware platforms there are today.
Shouldn't there be a conversion routine between wchar_t[] and Py_UNICODE[] instead of assuming they have the same format? This will come up more often, and Linux has sizeof(wchar_t) == 4 I believe. (Which suggests that others disagree on the waste of space.) > > > BTW, Python's Unicode implementation is bound to the standard > > > defined at www.unicode.org; moving over to ISO 10646 is not an > > > option. > > > > Can you elaborate? How can you rule out that option that easily? > > It is not an option because we chose Unicode as our basis for > i18n work and not the ISO 10646 Uniform Character Set. I'd rather > have those two camps fight over the details of the Unicode standard > than try to fix anything related to the differences between the two > in Python by mixing them. Agreed. But be prepared that at some point in the future the Unicode world might end up agreeing on 4 bytes too... > > And why can't Python support the two standards simultaneously? > > Why would you want to support two standards for the same thing ? Well, we support ASCII and Unicode. :-) If ISO 10646 becomes important to our users, we'll have to support it, if only by providing a codec. --Guido van Rossum (home page: http://www.python.org/~guido/) From tree@basistech.com Mon Jun 25 14:10:15 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 09:10:15 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106251342.f5PDg1q07291@odiug.digicool.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> Message-ID: <15159.14391.718891.645489@cymru.basistech.com> Guido van Rossum writes: [snip] > Agreed. But be prepared that at some point in the future the Unicode > world might end up agreeing on 4 bytes too... With the release of the Plane 2 ideographic extensions in Unicode 3.1 there are two options available: include surrogate support via UTF-16, which means dealing with multibyte (really multi"word") characters, or switching to UTF-32, allowing characters outside Plane 0 to be accessed uniformly. Note that this is a real issue: the Hong Kong Supplementary Character Set includes characters contained in Plane 2 when mapped to Unicode 3.1. > If ISO 10646 becomes important to our users, we'll have to support > it, if only by providing a codec. This is beyond ISO 10646 --- Unicode 3.1 support brings the issue to the fore. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From JMachin@Colonial.com.au Mon Jun 25 14:51:29 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Mon, 25 Jun 2001 23:51:29 +1000 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? Message-ID: <9F2D83017589D211BD1000805FA70CA703B139D9@ntxmel03.cmutual.com.au> Marc-Andre, > I should have added "please correct me if I'm wrong", sorry. I'm sorry too; I didn't intend to be rude; it's just that I normally operate under a protocol where that licence ("please correct me if I'm wrong") is the default and doesn't need to be stated explicitly in each paragraph.
> Say you have a Unicode string which contains the following data: > > U+0061 U+0062 U+0063 U+DC00 U+0064 U+0065 U+0066 > ("a" "b" "c" ? "d" "e" "f") > > Would you consider this sequence a Unicode string or not ? I think you are using "Unicode string" with two different meanings here. However, the pragmatic question is what should Python do when given such a sequence. Do we permit such a sequence to be held internally as a "Unicode string"? Is u"\udc00" legal in source code or should Python throw a syntax error? Same question for u"\uffff". We *do* need to consider UTF encodings, because Unicode *expressly* allows decoding UTF sequences that become unpaired surrogates, or other "not 100% valid" scalars such as 0xffff and 0xfffe. So, given that Python supports Unicode, not ISO 10646, we must IMO permit such sequences in our internal representation. It follows that we should stop worrying about these irregular values -- it's less programming that way. Unicode 3.1 will create enough extra programming as it is, because we now have variable-length characters again -- just what Unicode was going to save us from :-( Cheers, John -----Original Message----- From: M.-A. Lemburg [mailto:mal@lemburg.com] Sent: Monday, 25 June 2001 22:56 To: Machin, John Cc: 'Gaute B Strokkenes'; Tim Peters; i18n-sig@python.org; unicode@unicode.org Subject: Re: [I18n-sig] Re: How does Python Unicode treat surrogates? "Machin, John" wrote: > > MAL and Gaute, > > Can I please take the middle ground (and risk having both of you throw > things at me? Sure :-) > => Lone surrogates are not 'true Unicode char points > in their own right' [MAL] -- they don't represent characters. I should have added "please correct me if I'm wrong", sorry. Let me put this into an example: Say you have a Unicode string which contains the following data: U+0061 U+0062 U+0063 U+DC00 U+0064 U+0065 U+0066 ("a" "b" "c" ? "d" "e" "f") Would you consider this sequence a Unicode string or not ? Please note that I am not talking about some UTF-n encoding here. The above snippet is simply to be seen as sequence of data entries which are referenced by the Unicode database. > On the other hand, UTF code sequences that would decode into lone surrogates > are not "illegal". > Please read clause D29 in section 3.8 of the Unicode 3.0 standard. This is > further clarified by Unicode 3.1 > which expressly lists legal UTF-8 sequences; these encompass lone > surrogates. > > -----Original Message----- > From: Gaute B Strokkenes [mailto:gs234@cam.ac.uk] > Sent: Monday, 25 June 2001 22:04 > To: M.-A. Lemburg > Cc: Tim Peters; i18n-sig@python.org; unicode@unicode.org > Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? > > [I'm cc:-ing the unicode list to make sure that I've gotten my > terminology right, and to solicit comments > > On Mon, 25 Jun 2001, mal@lemburg.com wrote: > > Tim Peters wrote: > >> > >> [M.-A. Lemburg] > >> > ... > >> > 2. What to do when slicing of Unicode strings would break > >> > a surrogate pair ? > >> > >> To me a string is a sequence of characters, and s[0] returns the > >> first, s[1] the second, and so on. The internal details of how the > >> implementation chooses to torture itself <0.7 wink> should be > >> invisible. That is, breaking a surrogate via slicing should be > >> impossible: s[i:j] returns j-i characters, and that's that. > > > > It's not that simple: lone surrogates are true Unicode char points > > in their own right; it's just that they are pretty useless without > > their resp. partners in the data stream. 
And with this "feature" > > they are in good company: the Unicode combining characters (e.g. the > > combining acute) have the same property. > > This is completely and totally wrong. The Unicode standard version > 3.1 states (conformance requirement C12(c)): A conformant process shall > not interpret illegal UTF code unit sequences as characters. > > The precise definition of "illegal" in this context is given > elsewhere. See : > > 0xD800 is incomplete in Unicode. Unless followed by another 16-bit > value of the right form, it is illegal. > > (Unicode here should read UTF-16, of course. The reason it does not > is that the language of the technical report has not been updated to > that of 3.1) > > -- > Big Gaute http://www.srcf.ucam.org/~gs234/ > Hello? Enema Bondage? I'm calling because I want to be happy, I guess.. > > _______________________________________________ > I18n-sig mailing list > I18n-sig@python.org > http://mail.python.org/mailman/listinfo/i18n-sig -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From guido@digicool.com Mon Jun 25 15:22:40 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 10:22:40 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 09:10:15 EDT." <15159.14391.718891.645489@cymru.basistech.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> Message-ID: <200106251422.f5PEMel07612@odiug.digicool.com> > Guido van Rossum writes: > [snip] > > Agreed. But be prepared that at some point in the future the Unicode > > world might end up agreeing on 4 bytes too... > > With the release of the Plane 2 ideographic extensions in Unicode 3.1 > there are two options available: include surrogate support via UTF-16, > which means dealing with multibyte (really multi"word") characters, or > switching to UTF-32, allowing characters outside Plane 0 to be > accessed uniformly. > > Note that this is a real issue: the Hong Kong Supplementary Character > Set includes characters contained in Plane 2 when mapped to Unicode > 3.1. > > > If ISO 10646 becomes important to our users, we'll have to support > > it, if only by providing a codec. > > This is beyond ISO 10646 --- Unicode 3.1 support brings the issue to > the fore.
> -tree I don't think switching to a 32-bit character is the right thing to do for us (although I think it should be easier than it currently is -- changing to define Py_UNICODE as a 32-bit unsigned int should be all that it takes, which is currently not the case). I'm all for taking the lazy approach and letting applications that need surrogate support do it themselves, at the application level. --Guido van Rossum (home page: http://www.python.org/~guido/) From mark@macchiato.com Mon Jun 25 15:24:28 2001 From: mark@macchiato.com (Mark Davis) Date: Mon, 25 Jun 2001 07:24:28 -0700 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? References: <3B3722DB.1FF54794@lemburg.com> <4ak820g418.fsf@kern.srcf.societies.cam.ac.uk> Message-ID: <006501c0fd82$8b5ba9f0$0c680b41@c1340594a> You cannot interpret isolated UTF-16 surrogate code units as characters. For example, you can't interpret the sequence of D800 followed by 0061 as if it were some private use character (say, Klingon) followed by an 'a'. (For those unfamiliar with the terminology, see http://www.unicode.org/glossary, and my paper at http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/.) However, you can certainly deal with surrogate code units in storage, and it is permissible on that level to handle them. For example, most UTF-16 string interfaces use code unit indices, so that a string from position 3 of length 5 will include precisely 5 code units, not however many code points (or graphemes!) they take up. Similarly for UTF-8 strings, the low-level units are bytes. In most people's experience, it is best to leave the low level interfaces with indices in terms of code units, then supply some utility routines that tell you information about code points. The most useful are:

- given a string and an index into that string, how many code points are before it?
- given a string and a number of code points, what is the lowest index that contains them?
- given a string and an index into that string, is the index on a code point boundary?

An example for Java is at http://oss.software.ibm.com/icu4j/doc/com/ibm/text/UTF16.html. Mark ----- Original Message ----- From: "Gaute B Strokkenes" To: "M.-A. Lemburg" Cc: "Tim Peters" ; ; Sent: Monday, June 25, 2001 05:03 Subject: Re: How does Python Unicode treat surrogates? > > [I'm cc:-ing the unicode list to make sure that I've gotten my > terminology right, and to solicit comments] > > On Mon, 25 Jun 2001, mal@lemburg.com wrote: > > Tim Peters wrote: > >> > >> [M.-A. Lemburg] > >> > ... > >> > 2. What to do when slicing of Unicode strings would break > >> > a surrogate pair ? > >> > >> To me a string is a sequence of characters, and s[0] returns the > >> first, s[1] the second, and so on. The internal details of how the > >> implementation chooses to torture itself <0.7 wink> should be > >> invisible. That is, breaking a surrogate via slicing should be > >> impossible: s[i:j] returns j-i characters, and that's that. > > > > It's not that simple: lone surrogates are true Unicode char points > > in their own right; it's just that they are pretty useless without > > their resp. partners in the data stream. And with this "feature" > > they are in good company: the Unicode combining characters (e.g. the > > combining acute) have the same property. > > This is completely and totally wrong. The Unicode standard version > 3.1 states (conformance requirement C12(c)): A conformant process shall > not interpret illegal UTF code unit sequences as characters.
> > The precise definition of "illegal" in this context is given > elsewhere. See : > > 0xD800 is incomplete in Unicode. Unless followed by another 16-bit > value of the right form, it is illegal. > > (Unicode here should read UTF-16, off course. The reason it does not > is that the language of the technical report has not been updated to > that of 3.1) > > -- > Big Gaute http://www.srcf.ucam.org/~gs234/ > Hello? Enema Bondage? I'm calling because I want to be happy, I guess.. > > From tree@basistech.com Mon Jun 25 14:55:07 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 09:55:07 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106251422.f5PEMel07612@odiug.digicool.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> Message-ID: <15159.17083.978971.519453@cymru.basistech.com> Guido van Rossum writes: [...] > I'm all for taking the lazy approach and letting applications that > need surrogate support do it themselves, at the application level. Meaning what? Leaving it up to the application to be entirely responsible for handling surrogates is a mistake. As was stated earlier in the thread (apologies, I don't have the message around to make the appropriate attribution) surrogates are an implementation detail: to the user/application developer the presence of the surrogate pair needs to be transparent. As long as the Unicode support functionality groks surrogates correctly (fully implements UTF-16) then the issue becomes a small one for the end user. The scanner would need to be modified to support Unicode escapes for values up to 0x10FFFF. Internally these are represented as surrogates. Put the burden of these multibyte representations on the library implementor, not the end-user. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From guido@digicool.com Mon Jun 25 15:43:02 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 10:43:02 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 09:55:07 EDT." <15159.17083.978971.519453@cymru.basistech.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> Message-ID: <200106251443.f5PEh2p07753@odiug.digicool.com> > Guido van Rossum writes: > [...] 
> > I'm all for taking the lazy approach and letting applications that > > need surrogate support do it themselves, at the application level. > > Meaning what? Leaving it up to the application to be entirely > responsible for handling surrogates is a mistake. As was stated > earlier in the thread (apologies, I don't have the message around to > make the appropriate attribution) surrogates are an implementation > detail: to the user/application developer the presence of the > surrogate pair needs to be transparent. > > As long as the Unicode support functionality groks surrogates > correctly (fully implements UTF-16) then the issue becomes a small one > for the end user. The scanner would need to be modified to support > Unicode escapes for values up to 0x10FFFF. Internally these are > represented as surrogates. > > Put the burden of these multibyte representations on the library > implementor, not the end-user. > > -tree Depends on what you call transparent. I'm all for smart codecs between UTF-16 and UTF-8, but if you have a surrogate in a Unicode string, the application will have to know not to split it in the middle, and it must realize that len(u) is not necessarily the number of characters -- it's the number of 16-bit units in the UTF-16 encoding. Does that make sense? I know I am hindered by a lack of understanding of Unicode hairsplitting, angels-on-a-pin-dancing details; if I'm missing something, it's likely that many other people don't know the details either, so an explanation would be much appreciated! --Guido van Rossum (home page: http://www.python.org/~guido/) From tree@basistech.com Mon Jun 25 15:36:10 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 10:36:10 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106251443.f5PEh2p07753@odiug.digicool.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> Message-ID: <15159.19546.226155.383490@cymru.basistech.com> Guido van Rossum writes: > Depends on what you call transparent. I'm all for smart codecs > between UTF-16 and UTF-8, but if you have a surrogate in a Unicode > string, the application will have to know not to split it in the > middle, and it must realize that len(u) is not necessarily the number > of characters -- it's the number of 16-bit units in the UTF-16 > encoding. Surrogates were created as a way to allow characters outside Plane 0 (the BMP) to be accessed within a sixteen-bit codespace. When using UTF-16 a character consists of either two octets or four octets. A character that cannot be represented within the 16-bit code space is encoded using a surrogate pair, but it is the same character regardless. So, for example, the ideograph at U+20000 is the same character whether it is encoded as <20000> (UCS-4, UTF-32), <D840 DC00> (UTF-16), or <F0 A0 80 80> (UTF-8). It doesn't matter what transformation format you use: it's the *same* character.
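A small sketch of this point, assuming an interpreter whose \U escapes and codecs handle characters above the BMP (the thread notes the scanner did not yet); the byte values follow from the UTF definitions:

    # One abstract character, U+20000, in three transformation formats.
    c = u"\U00020000"
    print repr(c.encode("utf-8"))      # -> '\xf0\xa0\x80\x80' (F0 A0 80 80)
    print repr(c.encode("utf-16-be"))  # -> '\xd8@\xdc\x00'    (D840 DC00)
    # A UTF-32/UCS-4 representation would hold the single code unit 00020000.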
Hence, when I have a Unicode string, I'm thinking of each character as a Unicode character, not as a sequence of UTF-16 or UCS-2 two-octet words. Hence my belief that Unicode strings should not be synonymous with whatever underlying physical character representation is used. Clear as mud? :-) -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From guido@digicool.com Mon Jun 25 16:44:32 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 11:44:32 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 10:36:10 EDT." <15159.19546.226155.383490@cymru.basistech.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> Message-ID: <200106251544.f5PFiWe07979@odiug.digicool.com> > Guido van Rossum writes: > > Depends on what you call transparent. I'm all for smart codecs > > between UTF-16 and UTF-8, but if you have a surrogate in a Unicode > > string, the application will have to know not to split it in the > > middle, and it must realize that len(u) is not necessarily the number > > of characters -- it's the number of 16-bit units in the UTF-16 > > encoding. > > Surrogates were created as a way to allow characters outside Plane 0 > (the BMP) to be accessed within a sixteen-bit codespace. When using > UTF-16 a character consists of either two octets or four octets. A > character that cannot be represented within the 16-bit code space is > encoded using a surrogate pair, but it is the same character > regardless. > > So, for example, the ideograph at U+20000 is the same character > whether it is encoded as <20000> (UCS-4, UTF-32), > <D840 DC00> (UTF-16), or <F0 A0 80 80> (UTF-8). It doesn't matter what > transformation format you use: it's the *same* character. > > Hence, when I have a Unicode string, I'm thinking of each character as a > Unicode character, not as a sequence of UTF-16 or UCS-2 two-octet > words. > > Hence my belief that Unicode strings should not be synonymous with > whatever underlying physical character representation is used. > > Clear as mud? :-) > > -tree Very clear. But, just as a Python 8-bit string object containing the UTF-8 encoded character U+20000 contains 4 bytes, with s[0] being '\xF0' etc., a Python "unicode" string containing that character as a surrogate will have length 2, with u[0] being u'\uD840' and u[1] being u'\uDC00'. You can think of it as containing a single character, but the interface gives you the individual items of the UTF-16 encoding. You can believe what *should* happen all you want, but we're not going to change this soon. u[i] has to be independent of the length of u and the value of i. It may change *eventually* -- when we switch to UCS-4 for the internal representation. Until then, the API will deal in 16-bit values that may or may not be "characters".
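Concretely, on the 16-bit build being described, a sketch (again assuming the \U escape is available):

    u = u"\U00020000"    # one character, stored as a surrogate pair
    print len(u)         # -> 2: two 16-bit code units, not one character
    print repr(u[0])     # -> u'\ud840' (high surrogate)
    print repr(u[1])     # -> u'\udc00' (low surrogate)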
I'd say that ideally the choice to have a 2 or 4 byte internal representation (or no Unicode support at all, for some platforms like PalmOS!) should be a configuration choice. Right now the implementation doesn't allow that choice at all, which should be remedied -- maybe you can help by submitting patches? --Guido van Rossum (home page: http://www.python.org/~guido/) From mal@lemburg.com Mon Jun 25 16:58:49 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 17:58:49 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> Message-ID: <3B375FB9.91BA4B1E@lemburg.com> Guido van Rossum wrote: > > > This would mean 4 bytes per Unicode character and is > > unacceptable given the fact that most of these would be 0-bytes > > Agreed, but see below. > > > in practice. It would also break binary compatibility to the > > native Unicode wchar_t type on e.g. Windows which we are among > > the most Unicode-aware platforms there are today. > > Shouldn't there be a conversion routine between wchar_t[] and > Py_UNICODE[] instead of assuming they have the same format? This will > come up more often, and Linux has sizeif(wchar_t) == 4 I believe. > (Which suggests that others disagree on the waste of space.) There are conversion routines which map between Py_UNICODE and wchar_t in Python and these make use of the fact that e.g. on Windows Py_UNICODE can use wchar_t as basis which makes the conversion very fast. On Linux (which uses 4 bytes per wchar_t) the routine inserts tons of zeros to make Tux happy :-) > > > > BTW, Python's Unicode implementation is bound to the standard > > > > defined at www.unicode.org; moving over to ISO 10646 is not an > > > > option. > > > > > > Can you elaborate? How can you rule out that option that easily? > > > > It is not an option because we chose Unicode as our basis for > > i18n work and not the ISO 10646 Uniform Character Set. I'd rather > > have those two camps fight over the details of the Unicode standard > > than try to fix anything related to the differences between the two > > in Python by mixing them. > > Agreed. But be prepared that at some point in the future the Unicode > world might end up agreeing on 4 bytes too... No problem... we can change to 4 byte values too if the world agrees on 4 bytes per character. However, 2 bytes or 4 bytes is an implementation detail and not part of the Unicode standard itself. 4 bytes per character makes things at the C level much easier and this is probably why the GNU C lib team chose 4 bytes. Other programming languages like Java and platforms like Windows chose 2-byte UTF-16 as internal format. I guess it's up to the user acceptance to choose between the two. 2 bytes means more work on the implementor, 4 bytes means more $$$ for Micron et al. ;-) > > > And why can't Python support the two standards simultaneously? > > > > Why would you want to support two standards for the same thing ? > > Well, we support ASCII and Unicode. :-) > > If ISO 10646 becomes important to our users, we'll have to support > it, if only by providing a codec. 
This is different: ISO 10646 is a competing standard, not just a different encoding. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tree@basistech.com Mon Jun 25 16:25:38 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 11:25:38 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106251544.f5PFiWe07979@odiug.digicool.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> Message-ID: <15159.22514.976923.894201@cymru.basistech.com> Guido van Rossum writes: > But, just as a Python 8-bit string object containing the UTF-8 encoded > character U+20000 contains 4 bytes, with s[0] being '\xF0' etc., a > Python "unicode" string containing that character as a surrogate will > have length 2, with u[0] being u'\uD840' and u[1] being u'\uDC00'. > You can think of it as containing a single character, but the > interface gives you the individual items of the UTF-16 encoding. So what has been implemented is UCS-2, not UTF-16, and certainly not Unicode. Better to document u"" string literals as UCS-2, and not Unicode. > It may change *eventually* -- when we switch to UCS-4 for the internal > representation. Until then, the API will deal in 16-bit values that > may or may not be "characters". You don't need to switch to UCS-4 internally to implement what I'm suggesting. > I'd say that ideally the choice to have a 2 or 4 byte internal > representation (or no Unicode support at all, for some platforms like > PalmOS!) should be a configuration choice. I don't think it should be a configuration choice. That leads to incompatibilities between people's scripts. It's bad enough already with some things working with threaded versions of python and some not (e.g., Zope requires threading, but mod_python doesn't work if its turned on). BTW, Palm recently joined the Unicode Consortium, and Symbian has Unicode support. >Right now the implementation doesn't allow that choice at all, which >should be remedied -- maybe you can help by submitting patches? Touch=E9. -- = Tom Emerson Basis Technology Cor= p. Sr. Sinostringologist http://www.basistech.c= om "Beware the lollipop of mediocrity: lick it once and you suck forever" From guido@digicool.com Mon Jun 25 17:20:23 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 12:20:23 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 17:58:49 +0200." 
<3B375FB9.91BA4B1E@lemburg.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> Message-ID: <200106251620.f5PGKNP08234@odiug.digicool.com> > > Shouldn't there be a conversion routine between wchar_t[] and > > Py_UNICODE[] instead of assuming they have the same format? This will > > come up more often, and Linux has sizeif(wchar_t) == 4 I believe. > > (Which suggests that others disagree on the waste of space.) > > There are conversion routines which map between Py_UNICODE > and wchar_t in Python and these make use of the fact that > e.g. on Windows Py_UNICODE can use wchar_t as basis which makes > the conversion very fast. > > On Linux (which uses 4 bytes per wchar_t) the routine inserts > tons of zeros to make Tux happy :-) Maybe this code should be restructured so that it lengthens the characters or not depending on the size difference between Py_UNICODE and wchar_t, rather than making platform assumptions. If this is the only thing that keeps us from having a configuration OPTION to make Py_UNICODE 32-bit wide, I'd say let's fix it. > > > > > BTW, Python's Unicode implementation is bound to the standard > > > > > defined at www.unicode.org; moving over to ISO 10646 is not an > > > > > option. > > > > > > > > Can you elaborate? How can you rule out that option that easily? > > > > > > It is not an option because we chose Unicode as our basis for > > > i18n work and not the ISO 10646 Uniform Character Set. I'd rather > > > have those two camps fight over the details of the Unicode standard > > > than try to fix anything related to the differences between the two > > > in Python by mixing them. > > > > Agreed. But be prepared that at some point in the future the Unicode > > world might end up agreeing on 4 bytes too... > > No problem... we can change to 4 byte values too if the world > agrees on 4 bytes per character. However, 2 bytes or 4 bytes > is an implementation detail and not part of the Unicode standard > itself. But UTF-16 vs. UCS-4 is not an implementation detail! If we store 4 bytes per character, we should treat surrogates differently. I don't know where those would be converted -- probably in the UTF-16 to UCS-4 codec. I'd be happy to make the configuration choice between UTF-16 and UCS-4, if that's doable. > 4 bytes per character makes things at the C level much easier > and this is probably why the GNU C lib team chose 4 bytes. Other > programming languages like Java and platforms like Windows > chose 2-byte UTF-16 as internal format. I guess it's up to the > user acceptance to choose between the two. 2 bytes means more > work on the implementor, 4 bytes means more $$$ for Micron et al. ;-) My 1-year old laptop has a 10 Gb hard drive and 128 Mb RAM. Current machines are between 2-4 times that. How much of that space will be wasted on extra Unicode? For a typical user, most of it is MP3's anyway. :-) > > > > And why can't Python support the two standards simultaneously? > > > > > > Why would you want to support two standards for the same thing ? > > > > Well, we support ASCII and Unicode. 
:-) > > > > If ISO 10646 becomes important to our users, we'll have to support > > it, if only by providing a codec. > > This is different: ISO 10646 is a competing standard, not just a > different encoding. Oh. I didn't know. How does it differ from Unicode? What's the user acceptance? --Guido van Rossum (home page: http://www.python.org/~guido/) From mal@lemburg.com Mon Jun 25 17:23:10 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 18:23:10 +0200 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? References: <9F2D83017589D211BD1000805FA70CA703B139D9@ntxmel03.cmutual.com.au> Message-ID: <3B37656E.9E09DB1A@lemburg.com> "Machin, John" wrote: > > > Say you have a Unicode string which contains the following data: > > > > U+0061 U+0062 U+0063 U+DC00 U+0064 U+0065 U+0066 > > ("a" "b" "c" ? "d" "e" "f") > > > > Would you consider this sequence a Unicode string or not ? > > I think you are using "Unicode string" with two different meanings here. The question is really very simple: is the above correct Unicode or not ? > However, the pragmatic question is what should Python do when given such a > sequence. > Do we permit such a sequence to be held internally as a "Unicode string"? > Is u"\udc00" legal in source code or should Python throw a syntax error? > Same question for u"\uffff". Right... that's what I was getting at. The Unicode object in Python represents a "Unicode string"; the underlying logic is really secondary, the question here is whether construction of objects like u"\uFFFF" should be possible or not. If the standard defines these as illegal Unicode, then the constructors should make sure that construction of these objects is not possible; otherwise, it should work on them just like all other "code points". (http://www.unicode.org/glossary/) > We *do* need to consider UTF encodings, because Unicode *expressly* allows > decoding UTF sequences > that become unpaired surrogates, or other "not 100% valid" scalars such as > 0xffff and 0xfffe. The standard says this on the noncharacter code points: """ D7b Noncharacter: a code point that is permanently reserved for internal use, and that should never be interchanged. In Unicode 3.1, these consist of the values U+nFFFE and U+nFFFF (where n is from 0 to 10 hex, i.e. 0 to 16) and the values U+FDD0..U+FDEF. C5 A process shall not interpret a noncharacter code point as an abstract character. The code points may be used internally, such as for sentinel values or delimiters, but should not be exchanged publicly. C10 A process shall make no change in a valid coded character representation other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points, if that process purports not to modify the interpretation of that coded character sequence. If a noncharacter which does not have a specific internal use is unexpectedly encountered in processing, an implementation may signal an error or delete or ignore the noncharacter. If these options are not taken, the noncharacter should be treated as an unassigned code point. For example, an API that returned a character property value for a noncharacter would return the same value as the default value for an unassigned code point. """ Note that lone surrogates are not regarded as noncharacters (for some reason). > So, > given that Python supports Unicode, not ISO 10646, we must IMO permit such > sequences in our internal > representation.
It follows that we should stop worrying about these > irregular values -- it's less > programming that way. Unicode 3.1 will create enough extra programming as it > is, because we now have > variable-length characters again -- just what Unicode was going to save us > from :-( Agreed; now who's going to submit the patches ;-) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Mon Jun 25 17:46:59 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 18:46:59 +0200 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? References: <3B3722DB.1FF54794@lemburg.com> <4ak820g418.fsf@kern.srcf.societies.cam.ac.uk> <006501c0fd82$8b5ba9f0$0c680b41@c1340594a> Message-ID: <3B376B03.A2A84AE1@lemburg.com> Mark Davis wrote: > > You cannot interpret isolated UTF-16 surrogate code units as characters. For > example, you can't interpret the sequence of D800 followed by 0061 as if it > were some private use character (say, Klingon) followed by an 'a'. > > (For those unfamiliar with the terminology, see > http://www.unicode.org/glossary, and my paper at > http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/.) Thanks for the pointers and the explanations. Your paper is a very good reading indeed. My question was targetting into a slightly different direction, though. I know that UTF-16 does not allow lone surrogates, but how does Unicode itself treat these ? If I have a sequence of Unicode code points which includes an isolated surrogate code point, would this be considered a legal Unicode sequence or not ? > However, you can certainly deal with surrogate code units in storage, and it > is permissible on that level to handle them. For example, most UTF-16 string > interfaces use code unit indices, so that a string from position 3 of length > 5 will include precisely 5 code units, not however many code points (or > graphemes!) they take up. Similarly for UTF-8 strings, the low-level units > are bytes. FYI, Python currently uses UTF-16 as internal storage format and also exposes this through its indexing interfaces. In that sense isolated surrogates would be illegal. The codecs which convert such Unicode object to other encodings would raise an exception. Unicode object constructors, slicing and concatenating Unicode objects currently do not apply any checks though. > In most people's experience, it is best to leave the low level interfaces > with indices in terms of code units, then supply some utility routines that > tell you information about code points. So surrogate support or its handling is left to the applications using the interface ?! Perhaps you are right and this is the only feasable way to approach the problem... > The most useful are: > > - given a string and an index into that string, how many code points are > before it? > - given a string and a number of code points, what is the lowest index that > contains them? > - given a string and an index into that string, is the index on a code point > boundary? These are still missing in Python; we should probably add methods for them in one of the next releases, though. > An example for Java is at > http://oss.software.ibm.com/icu4j/doc/com/ibm/text/UTF16.html. > > Mark > > ----- Original Message ----- > From: "Gaute B Strokkenes" > To: "M.-A. 
Lemburg" > Cc: "Tim Peters" ; ; > > Sent: Monday, June 25, 2001 05:03 > Subject: Re: How does Python Unicode treat surrogates? > > > > > [I'm cc:-ing the unicode list to make sure that I've gotten my > > terminology right, and to solicit comments > > > > On Mon, 25 Jun 2001, mal@lemburg.com wrote: > > > Tim Peters wrote: > > >> > > >> [M.-A. Lemburg] > > >> > ... > > >> > 2. What to do when slicing of Unicode strings would break > > >> > a surrogate pair ? > > >> > > >> To me a string is a sequence of characters, and s[0] returns the > > >> first, s[1] the second, and so on. The internal details of how the > > >> implementation chooses to torture itself <0.7 wink> should be > > >> invisible. That is, breaking a surrogate via slicing should be > > >> impossible: s[i:j] returns j-i characters, and that's that. > > > > > > It's not that simple: lone surrogates are true Unicode char points > > > in their own right; it's just that they are pretty useless without > > > their resp. partners in the data stream. And with this "feature" > > > they are in good company: the Unicode combining characters (e.g. the > > > combining acute) have th same property. > > > > This is completely and totally wrong. The Unicode standard version > > 3.1 states (conformance requirement C12(c): A conformant process shall > > not interpret illegal UTF code unit sequences as characters. > > > > The precise definition of "illegal" in this context is given > > elsewhere. See : > > > > 0xD800 is incomplete in Unicode. Unless followed by another 16-bit > > value of the right form, it is illegal. > > > > (Unicode here should read UTF-16, off course. The reason it does not > > is that the language of the technical report has not been updated to > > that of 3.1) > > > > -- > > Big Gaute http://www.srcf.ucam.org/~gs234/ > > Hello? Enema Bondage? I'm calling because I want to be happy, I guess.. > > > > > > _______________________________________________ > I18n-sig mailing list > I18n-sig@python.org > http://mail.python.org/mailman/listinfo/i18n-sig -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Mon Jun 25 18:01:28 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 19:01:28 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> Message-ID: <3B376E68.505BF6E@lemburg.com> Guido van Rossum wrote: > > > > Shouldn't there be a conversion routine between wchar_t[] and > > > Py_UNICODE[] instead of assuming they have the same format? This will > > > come up more often, and Linux has sizeif(wchar_t) == 4 I believe. > > > (Which suggests that others disagree on the waste of space.) > > > > There are conversion routines which map between Py_UNICODE > > and wchar_t in Python and these make use of the fact that > > e.g. 
on Windows Py_UNICODE can use wchar_t as basis which makes > > the conversion very fast. > > > > On Linux (which uses 4 bytes per wchar_t) the routine inserts > > tons of zeros to make Tux happy :-) > > Maybe this code should be restructured so that it lengthens the > characters or not depending on the size difference between Py_UNICODE > and wchar_t, rather than making platform assumptions. This is how it currently works. > If this is the only thing that keeps us from having a configuration > OPTION to make Py_UNICODE 32-bit wide, I'd say let's fix it. This is not easy to fix and can certainly not be made an option: UTF-16 has surrogates and is a variable width encoding of Unicode while UCS-4 is a fixed width encoding. Python currently only has minimal support for surrogates, so purist would say that we support UCS-2. However, we deliberatly chose this path to be able to upgrade to UTF-16 at some later point in time and it seems that this time has now come. > > > Agreed. But be prepared that at some point in the future the Unicode > > > world might end up agreeing on 4 bytes too... > > > > No problem... we can change to 4 byte values too if the world > > agrees on 4 bytes per character. However, 2 bytes or 4 bytes > > is an implementation detail and not part of the Unicode standard > > itself. > > But UTF-16 vs. UCS-4 is not an implementation detail! True. > If we store 4 bytes per character, we should treat surrogates > differently. I don't know where those would be converted -- probably > in the UTF-16 to UCS-4 codec. > > I'd be happy to make the configuration choice between UTF-16 and > UCS-4, if that's doable. Not easily, I'm afraid. > > 4 bytes per character makes things at the C level much easier > > and this is probably why the GNU C lib team chose 4 bytes. Other > > programming languages like Java and platforms like Windows > > chose 2-byte UTF-16 as internal format. I guess it's up to the > > user acceptance to choose between the two. 2 bytes means more > > work on the implementor, 4 bytes means more $$$ for Micron et al. ;-) > > My 1-year old laptop has a 10 Gb hard drive and 128 Mb RAM. Current > machines are between 2-4 times that. How much of that space will be > wasted on extra Unicode? For a typical user, most of it is MP3's > anyway. :-) True again :-) Still, it's the main argument people have against using 4 bytes per character; here's a quote from Mark Davis, the Unicode Consortium President: http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/ """ Decisions, decisions... Ultimately, the choice of which encoding format to use will depend heavily on the programming environment. For systems that only offer 8-bit strings currently, but are multi-byte enabled, UTF-8 may be the best choice. For systems that do not care about storage requirements, UTF-32 may be best. For systems such as Windows, Java, or ICU that use UTF-16 strings already, UTF-16 is the obvious choice. Even if they have not yet upgraded to fully support surrogates, they will be before long. If the programming environment is not an issue, UTF-16 is recommended as a good compromise between elegance, performance, and storage. """ > > > > > And why can't Python support the two standards simultaneously? > > > > > > > > Why would you want to support two standards for the same thing ? > > > > > > Well, we support ASCII and Unicode. :-) > > > > > > If ISO 10646 becomes important to our users, we'll have to support > > > it, if only by providing a codec. 
> > > > This is different: ISO 10646 is a competing standard, not just a > > different encoding. > > Oh. I didn't know. How does it differ from Unicode? What's the user > acceptance? http://www.unicode.org/unicode/consortium/memblogo.html says it all. ISO 10646 documents are only available on a pay-per-page basis -- not really ideal for spreading the word... (http://wwwold.dkuug.dk/JTC1/SC2/WG2/) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mike.sykes@acm.org Mon Jun 25 18:38:09 2001 From: mike.sykes@acm.org (J M Sykes) Date: Mon, 25 Jun 2001 18:38:09 +0100 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? References: <3B3722DB.1FF54794@lemburg.com> <4ak820g418.fsf@kern.srcf.societies.cam.ac.uk> <006501c0fd82$8b5ba9f0$0c680b41@c1340594a> Message-ID: <005b01c0fd9d$e4469e60$1a2cf7c2@oakdale2> Mark Davis said: > > In most people's experience, it is best to leave the low level interfaces > with indices in terms of code units, then supply some utility routines that > tell you information about code points. ... Anyone on the list interested in the treatment of UCS aka Unicode in programming languages might like to know that a meeting of ISO/IEC JTC 1/SC 32/WG 3 recently approved a paper that specifies how SQL implementations should do it. The proposal can be found at: ftp://sqlstandards.org/SC32/WG3/Meetings/PER_2001_04_Perth_AUS/per054r1.pdf The current CD of the next SQL standard (ISO/IEC 9075), as amended by this proposal (and many others) can be found at: ftp://sqlstandards.org/SC32/WG3/Progression_Documents/CD/cd1r1-foundation-20 01-06.pdf Briefly, the SQL functions CHARACTER_LENGTH, POSITION (the SQL string indexing function), and SUBSTRING will all accept a parameter specifying the units to be used, the alternatives being OCTETS, CODE_UNITS and CHARACTERS (which to SQL means code points); the default being characters. This proposal was agreed by major SQL implementors. Which doesn't mean that it's right, nor that it can't be changed. But that's how it is at the moment. Mike. *********************************************************** J M Sykes Email: Mike.Sykes@acm.org 97 Oakdale Drive Heald Green CHEADLE Cheshire SK8 3SN UK Tel: (44) 161 437 5413 *********************************************************** From guido@digicool.com Mon Jun 25 18:42:29 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 13:42:29 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 11:25:38 EDT." 
<15159.22514.976923.894201@cymru.basistech.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> Message-ID: <200106251742.f5PHgTW08532@odiug.digicool.com> > So what has been implemented is UCS-2, not UTF-16, and certainly not > Unicode. Better to document u"" string literals as UCS-2, and not > Unicode. I'm sorry, but I don't see why it's UCS-2 any more or less than UTF-16. That's like arguing whether 8-bit strings contains ASCII or UTF-8. That's up to the application; the data type can be used for either. > > It may change *eventually* -- when we switch to UCS-4 for the internal > > representation. Until then, the API will deal in 16-bit values that > > may or may not be "characters". > > You don't need to switch to UCS-4 internally to implement what I'm > suggesting. But unless I misunderstand what it *is* that you are suggesting, the O(1) indexing property can't be retained with your suggestion, and that's out of the question. > > I'd say that ideally the choice to have a 2 or 4 byte internal > > representation (or no Unicode support at all, for some platforms like > > PalmOS!) should be a configuration choice. > > I don't think it should be a configuration choice. That leads to > incompatibilities between people's scripts. It's bad enough already > with some things working with threaded versions of python and some not > (e.g., Zope requires threading, but mod_python doesn't work if its > turned on). That turned out to be a myth, actually. mod_python works fine with threads on most platforms. Anyway, code that specifically doesn't work when a particular feature is turned *on* is rare. Code that *requires* a specific feature is common, of course, and I would think that Python's Unicode type is useful as it is for applications that don't need the newer planes. > BTW, Palm recently joined the Unicode Consortium, and Symbian has > Unicode support. > > >Right now the implementation doesn't allow that choice at all, which > >should be remedied -- maybe you can help by submitting patches? > > Touché. :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From tree@basistech.com Mon Jun 25 18:13:56 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 13:13:56 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? 
In-Reply-To: <200106251742.f5PHgTW08532@odiug.digicool.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> Message-ID: <15159.29012.266722.112773@cymru.basistech.com> Guido van Rossum writes: > I'm sorry, but I don't see why it's UCS-2 any more or less than > UTF-16. That's like arguing whether 8-bit strings contains ASCII or > UTF-8. That's up to the application; the data type can be used for > either. UCS-2 and UTF-16 and UTF-8 are encoding forms of Unicode. Unicode defines characters using an abstract integer, the code-point. As of Unicode 3.1 code points range from 0x000000 to 0x10FFFF. The so-called Unicode string type in Python is a wide-string type, where each character is treated as a 16-bit quantity. The interpretation placed on those 16-bit quantities is that of UCS-2. In that case each half of a surrogate pair is an unknown character. As soon as you impose UTF-16 semantics on the 16-bit quantities, then you need to treat surrogate pairs as a single character. If the implementation won't change, then the standard library needs to support surrogates as a wrapper: leaving it up to each application is a mistake. IMHO you cannot trust implementers to do this right. > But unless I misunderstand what it *is* that you are suggesting, the > O(1) indexing property can't be retained with your suggestion, and > that's out of the question. You understand me completely. Adding transparent UTF-16 support changes your O(1) indexing operation to O(1+c), where 'c' is the small amount of time required to check for the surrogate. Granted, this 'c' could get large, but... But I see your point: this requirement is what prompted the glibc folks to go with the 32-bit wchar_t type. > That turned out to be a myth, actually. mod_python works fine with > threads on most platforms. Not in my experience. On my FreeBSD box Python 2.0 built with threads does not get along in some cases where Apache 1.3.19. Not that it matters. -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From guido@digicool.com Mon Jun 25 19:04:13 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 14:04:13 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 19:01:28 +0200." 
<3B376E68.505BF6E@lemburg.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> Message-ID: <200106251804.f5PI4D008730@odiug.digicool.com> OK, focusing on a single item. [me] > > If this is the only thing that keeps us from having a configuration > > OPTION to make Py_UNICODE 32-bit wide, I'd say let's fix it. [MAL] > This is not easy to fix and can certainly not be made an > option: UTF-16 has surrogates and is a variable width encoding > of Unicode while UCS-4 is a fixed width encoding. But even if we supported UTF-16 with surrogates, picking strings apart using u[i] would still be able to access the separate lower and upper halves of the surrogates, right, and in the presence of surrogates len(u) would not match the number of *characters* in u. > Python currently only has minimal support for surrogates, so > purist would say that we support UCS-2. However, we deliberatly > chose this path to be able to upgrade to UTF-16 at some later > point in time and it seems that this time has now come. How hard would it be to also change the party line about what the encoding used is based on whether we use 2 or 4 bytes? We could even give three choices: UCS-2 (current situation, no surrogates), UTF-16 (16-bit items with some surrogate support) or UCS-4 (32-bit items)? > > I'd be happy to make the configuration choice between UTF-16 and > > UCS-4, if that's doable. > > Not easily, I'm afraid. Can you explain why this is not easy? > http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/ > """ > Decisions, decisions... > Ultimately, the choice of which encoding format to use will depend heavily on the programming environment. For systems that only offer > 8-bit strings currently, but are multi-byte enabled, UTF-8 may be the best choice. For systems that do not care about storage requirements, > UTF-32 may be best. For systems such as Windows, Java, or ICU that use UTF-16 strings already, UTF-16 is the obvious choice. Even if > they have not yet upgraded to fully support surrogates, they will be before long. > > If the programming environment is not an issue, UTF-16 is recommended as a good compromise between elegance, performance, and > storage. > """ I buy that as an argument for supporting UTF-16, but not for cutting off the road to supporting UCS-4 for those users who would like to opt in. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Mon Jun 25 19:16:40 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 14:16:40 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 13:13:56 EDT." 
<15159.29012.266722.112773@cymru.basistech.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> <15159.29012.266722.112773@cymru.basistech.com> Message-ID: <200106251816.f5PIGev08808@odiug.digicool.com> > Guido van Rossum writes: > > I'm sorry, but I don't see why it's UCS-2 any more or less than > > UTF-16. That's like arguing whether 8-bit strings contains ASCII or > > UTF-8. That's up to the application; the data type can be used for > > either. > > UCS-2 and UTF-16 and UTF-8 are encoding forms of Unicode. Unicode > defines characters using an abstract integer, the code-point. As of > Unicode 3.1 code points range from 0x000000 to 0x10FFFF. > > The so-called Unicode string type in Python is a wide-string type, > where each character is treated as a 16-bit quantity. The > interpretation placed on those 16-bit quantities is that of UCS-2. In > that case each half of a surrogate pair is an unknown character. So far we agree. > As soon as you impose UTF-16 semantics on the 16-bit quantities, then > you need to treat surrogate pairs as a single character. > > If the implementation won't change, then the standard library needs to > support surrogates as a wrapper: leaving it up to each application is > a mistake. IMHO you cannot trust implementers to do this right. Sure, someone can add a module that provides surrogate support using the standard Unicode datatype. > > But unless I misunderstand what it *is* that you are suggesting, the > > O(1) indexing property can't be retained with your suggestion, and > > that's out of the question. > > You understand me completely. Adding transparent UTF-16 support > changes your O(1) indexing operation to O(1+c), where 'c' is the small > amount of time required to check for the surrogate. Granted, this 'c' > could get large, but... I don't think there is such a thing as "O(1+c) for small c". To extract the n'th Unicode character you would have to loop over all the preceding characters checking for surrogates. This makes it O(n). It's a common Python idiom to read megabytes of text into a single (8-bit or 16-bit) string object, so changing O(1) to O(n) is a real problem! > But I see your point: this requirement is what prompted the glibc > folks to go with the 32-bit wchar_t type. > > > That turned out to be a myth, actually. mod_python works fine with > > threads on most platforms. > > Not in my experience. On my FreeBSD box Python 2.0 built with threads > does not get along in some cases where Apache 1.3.19. Not that it matters. FreeBSD happens to be one of those platforms. 
:-( Has to do with the fact that on *BSD you link with a different version of the C library to enable threads, and since Apache is linked with the unthreaded version, any versions of Python embedded in Apache must also be unthreaded. --Guido van Rossum (home page: http://www.python.org/~guido/) From mark@macchiato.com Mon Jun 25 19:18:52 2001 From: mark@macchiato.com (Mark Davis) Date: Mon, 25 Jun 2001 11:18:52 -0700 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? References: <3B3722DB.1FF54794@lemburg.com> <4ak820g418.fsf@kern.srcf.societies.cam.ac.uk> <006501c0fd82$8b5ba9f0$0c680b41@c1340594a> <3B376B03.A2A84AE1@lemburg.com> Message-ID: <00f101c0fda3$4a2529e0$0c680b41@c1340594a> comments below. ----- Original Message ----- From: "M.-A. Lemburg" To: "Mark Davis" Cc: "Gaute B Strokkenes" ; "Tim Peters" ; ; Sent: Monday, June 25, 2001 09:46 Subject: Re: [I18n-sig] Re: How does Python Unicode treat surrogates? [snip] > > My question was targetting into a slightly different direction, > though. I know that UTF-16 does not allow lone surrogates, but > how does Unicode itself treat these ? If I have a sequence of Unicode > code points which includes an isolated surrogate code point, > would this be considered a legal Unicode sequence or not ? It is a legal Unicode code point sequence. However, it is not a legal Unicode *character* sequence, since it contains code points that by definition cannot be used to represent characters. > > > However, you can certainly deal with surrogate code units in storage, and it > > is permissible on that level to handle them. For example, most UTF-16 string > > interfaces use code unit indices, so that a string from position 3 of length > > 5 will include precisely 5 code units, not however many code points (or > > graphemes!) they take up. Similarly for UTF-8 strings, the low-level units > > are bytes. > > FYI, Python currently uses UTF-16 as internal storage format > and also exposes this through its indexing interfaces. In that > sense isolated surrogates would be illegal. The codecs which > convert such Unicode object to other encodings would raise an > exception. > Unicode object constructors, slicing and concatenating > Unicode objects currently do not apply any checks though. That is what is typically done, since using codepoint indices on each operation is a very significant performance burden. Mark From tree@basistech.com Mon Jun 25 18:43:23 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 13:43:23 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? 
In-Reply-To: <200106251816.f5PIGev08808@odiug.digicool.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> <15159.29012.266722.112773@cymru.basistech.com> <200106251816.f5PIGev08808@odiug.digicool.com> Message-ID: <15159.30780.1143.760653@cymru.basistech.com> Guido van Rossum writes: > To extract the n'th Unicode character you would have to loop over all > the preceding characters checking for surrogates. This makes it O(n). No. If the n'th character is a valid high-surrogate (U+D800 -- U+DBFF) then look at the n+1'th character for a valid low-surrogate. If the n'th character is a valid low-surrogate and the n-1'th character is a valid high-surrogate, then skip it. > It's a common Python idiom to read megabytes of text into a single > (8-bit or 16-bit) string object, so changing O(1) to O(n) is a real > problem! Yes, I do it all the time... my primary use of Python is managing Chinese and Japanese lexicographic data where the files are upwards of 25+MB of UTF-8 encoded Unicode text. -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From mal@lemburg.com Mon Jun 25 19:35:12 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 20:35:12 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> Message-ID: <3B378460.C27CDCDD@lemburg.com> Guido van Rossum wrote: > > OK, focusing on a single item. > > [me] > > > If this is the only thing that keeps us from having a configuration > > > OPTION to make Py_UNICODE 32-bit wide, I'd say let's fix it. > > [MAL] > > This is not easy to fix and can certainly not be made an > > option: UTF-16 has surrogates and is a variable width encoding > > of Unicode while UCS-4 is a fixed width encoding. > > But even if we supported UTF-16 with surrogates, picking strings apart > using u[i] would still be able to access the separate lower and upper > halves of the surrogates, right, and in the presence of surrogates > len(u) would not match the number of *characters* in u. 
That's because len(u) has nothing to do with the number of characters in the string, it only counts the code units (Py_UNICODEs) which are used to represent characters. The same is true for normal strings, e.g. UTF-8 can use between 1-4 code units (bytes in this case) for a single code point, and in Unicode you can create characters by combining code units. As Mark Davis pointed out: """In most people's experience, it is best to leave the low level interfaces with indices in terms of code units, then supply some utility routines that tell you information about code points. The most useful are: - given a string and an index into that string, how many code points are before it? - given a string and a number of code points, what is the lowest index that contains them? - given a string and an index into that string, is the index on a code point boundary? """ Python could use some more Unicode methods to answer these questions. > > Python currently only has minimal support for surrogates, so > > purist would say that we support UCS-2. However, we deliberatly > > chose this path to be able to upgrade to UTF-16 at some later > > point in time and it seems that this time has now come. > > How hard would it be to also change the party line about what the > encoding used is based on whether we use 2 or 4 bytes? We could even > give three choices: UCS-2 (current situation, no surrogates), UTF-16 > (16-bit items with some surrogate support) or UCS-4 (32-bit items)? Ehm... what are you getting at here ? > > > I'd be happy to make the configuration choice between UTF-16 and > > > UCS-4, if that's doable. > > > > Not easily, I'm afraid. > > Can you explain why this is not easy? Because choosing whether or not to support surrogates is a fundamental choice which affects far more than just the way you access storage. Surrogates introduce variable width characters: some characters use two or more Py_UNICODE code units while (most) others only use one. Remember when we discussed which internal format to use or which default encoding to apply ? We ruled out UTF-8 because it fails badly when it comes to slicing, concatenation, indexing, etc. UTF-16 is much less painful as most code points only take up a single code unit, but it still introduces a break in concept. > > http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/ > > """ > > Decisions, decisions... > > Ultimately, the choice of which encoding format to use will depend heavily on the programming environment. For systems that only offer > > 8-bit strings currently, but are multi-byte enabled, UTF-8 may be the best choice. For systems that do not care about storage requirements, > > UTF-32 may be best. For systems such as Windows, Java, or ICU that use UTF-16 strings already, UTF-16 is the obvious choice. Even if > > they have not yet upgraded to fully support surrogates, they will be before long. > > > > If the programming environment is not an issue, UTF-16 is recommended as a good compromise between elegance, performance, and > > storage. > > """ > > I buy that as an argument for supporting UTF-16, but not for cutting > off the road to supporting UCS-4 for those users who would like to opt > in. That was not my point. I just wanted to point out how well UTF-16 is being accepted out there and that we are in good company by moving from UCS-2 to UTF-16 as current internal format. I don't want to cut off the road to UCS-4, I just want to make clear that UTF-16 is a good choice and one which will last at least some more years.
We can then always decide to move on to UCS-4 for the internal storage format. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From fredrik@pythonware.com Mon Jun 25 19:41:48 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Mon, 25 Jun 2001 20:41:48 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com><200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de><3B3471AF.1311E872@lemburg.com><200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de><3B34F9BD.4FDEFC62@lemburg.com><200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de><3B35CEC6.710243E7@lemburg.com><200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de><3B362E9B.4DC8DD81@lemburg.com><200106251342.f5PDg1q07291@odiug.digicool.com><15159.14391.718891.645489@cymru.basistech.com><200106251422.f5PEMel07612@odiug.digicool.com><15159.17083.978971.519453@cymru.basistech.com><200106251443.f5PEh2p07753@odiug.digicool.com><15159.19546.226155.383490@cymru.basistech.com><200106251544.f5PFiWe07979@odiug.digicool.com><15159.22514.976923.894201@cymru.basistech.com><200106251742.f5PHgTW08532@odiug.digicool.com><15159.29012.266722.112773@cymru.basistech.com><200106251816.f5PIGev08808@odiug.digicool.com> <15159.30780.1143.760653@cymru.basistech.com> Message-ID: <008e01c0fda6$7fe81ad0$4ffa42d5@hagrid> Tom Emerson wrote: > > To extract the n'th Unicode character you would have to loop over all > > the preceding characters checking for surrogates. This makes it O(n). > > No. If the n'th character is a valid high-surrogate (U+D800 -- U+DBFF) > then look at the n+1'th character for a valid low-surrogate. If the > n'th character is a valid low-surrogate and the n-1'th character is a > valid high-surrogate, then skip it. bzzt. try again. From guido@digicool.com Mon Jun 25 19:42:24 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 14:42:24 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 13:43:23 EDT." <15159.30780.1143.760653@cymru.basistech.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> <15159.29012.266722.112773@cymru.basistech.com> <200106251816.f5PIGev08808@odiug.digicool.com> <15159.30780.1143.760653@cymru.basistech.com> Message-ID: <200106251842.f5PIgOe09018@odiug.digicool.com> > No. If the n'th character is a valid high-surrogate (U+D800 -- U+DBFF) > then look at the n+1'th character for a valid low-surrogate. 
If the > n'th character is a valid low-surrogate and the n-1'th character is a > valid high-surrogate, then skip it. Ouch. So suppose we have a string u containing four items: a regular 16-bit char, a high surrogate, a low surrogate, and another regular 16-bit char. You're saying that u[0] should return the first character, u[1] the entire surrogate (so it would still be a 2-item string), u[2] I guess the empty string, and u[3] the final regular char. IMO that would break an important invariant of string-like objects, namely that len(s[i]) == 1. I could live with a method u.character(i) that would behave like the above rule -- but not the u[i] notation. But wouldn't it be enough to have a test u.issurrogate() that would test if the first character of u is a valid high-surrogate? (And maybe another test u.islowsurrogate() testing for a valid low-surrogate.) Then you could write it yourself easily:

def char(u, i):
    c = u[i]
    if c.issurrogate():
        c2 = u[i+1]
        assert c2.islowsurrogate()
        c = c + c2
    return c

(Don't pay attention to the method names I'm proposing -- that's for a separate subcommittee. :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From tree@basistech.com Mon Jun 25 19:12:17 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 14:12:17 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106251842.f5PIgOe09018@odiug.digicool.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> <15159.29012.266722.112773@cymru.basistech.com> <200106251816.f5PIGev08808@odiug.digicool.com> <15159.30780.1143.760653@cymru.basistech.com> <200106251842.f5PIgOe09018@odiug.digicool.com> Message-ID: <15159.32513.611214.399097@cymru.basistech.com> Guido van Rossum writes: > Ouch. So suppose we have a string u containing four items: a regular > 16-bit char, a high surrogate, a low surrogate, and another regular > 16-bit char. You're saying that u[0] should return the first > character, u[1] the entire surrogate (so it would still be a 2-item > string), u[2] I guess the empty string, and u[3] the final regular > char. [...] No, but we may as well stop going around on this, since my views are not going to happen. In my view the string 'u' is a Unicode string. I don't care what sits underneath: 16-bits or 32-bits, I don't care. As far as I'm concerned the string has three characters in it: foo = u"\u4e00\u020000a" means that foo[0] == u"\u4e00", foo[1] == u"\u020000", and foo[2] == u"a". The fact that this is represented internally in different ways shouldn't matter to the user who only cares about characters. > IMO that would break an important invariant of string-like objects, > namely that len(s[i]) == 1. Yes it would, which is why it isn't what I'm recommending.
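As an aside, an edge-checked variant of the char() helper from your message might look like the following -- purely a sketch, still written in terms of the hypothetical issurrogate()/islowsurrogate() predicates:

    def char(u, i):
        # Return the full character starting at u[i] (one or two code
        # units), or u"" if u[i] is the trailing half of a surrogate pair.
        c = u[i]
        if c.islowsurrogate():
            return u""
        if c.issurrogate():
            if i + 1 == len(u) or not u[i+1].islowsurrogate():
                raise ValueError("ill-formed surrogate pair")
            c = c + u[i+1]
        return c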
> I could live with a method u.character(i) that would behave like the > above rule -- but not the u[i] notation. Me too. 'nuff said. ;-) > But wouldn't it be enough to have a test u.issurrogate() that would > test if the first character of u is a valid high-surrogate? (And > maybe another test u.islowsurrogate() testing for a valid > low-surrogate.) Then you could write it yourself easily:
>
> def char(u, i):
>     c = u[i]
>     if c.issurrogate():
>         c2 = u[i+1]
>         assert c2.islowsurrogate()
>         c = c + c2
>     return c

Sure, as long as you check for the edge conditions. This should be in the library. -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From guido@digicool.com Mon Jun 25 20:12:31 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 15:12:31 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 20:35:12 +0200." <3B378460.C27CDCDD@lemburg.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> Message-ID: <200106251912.f5PJCVD09465@odiug.digicool.com> > That's because len(u) has nothing to do with the number of > characters in the string, it only counts the code units (Py_UNICODEs) > which are used to represent characters. The same is true for normal > strings, e.g. UTF-8 can use between 1-4 code units (bytes in this > case) for a single code point and in Unicode you can create characters > by combining code points. Total agreement. > As Mark Davis pointed out: > > """In most people's experience, it is best to leave the low level interfaces > with indices in terms of code units, then supply some utility routines that > tell you information about code points. The most useful are: > > - given a string and an index into that string, how many code points are > before it? > - given a string and a number of code points, what is the lowest index that > contains them? I understand the first and the third, but what is this one? Is it a search? > - given a string and an index into that string, is the index on a code point > boundary? > """ > > Python could use some more Unicode methods to answer these > questions. Agreed (see my other post responding to Tom Emerson). > > > Python currently only has minimal support for surrogates, so > > > purists would say that we support UCS-2. However, we deliberately > > > chose this path to be able to upgrade to UTF-16 at some later > > > point in time and it seems that this time has now come. > > > > How hard would it be to also change the party line about what the > > encoding used is based on whether we use 2 or 4 bytes? We could even > > give three choices: UCS-2 (current situation, no surrogates), UTF-16 > > (16-bit items with some surrogate support) or UCS-4 (32-bit items)? > > Ehm... what are you getting at here ?
Earlier on you said it would be hard to offer a config-time choice between UTF-16 and UCS-4. I'm still trying to figure out why. Given the additional stuff I've learned now about surrogates, it doesn't make sense to choose between UCS-2 and UTF-16; the surrogate handling can always be present. So let me rephrase the question. How hard would it be to offer the config-time choice between UCS-4 and UTF-16? If it's hard, why? (I've heard you say that it's hard before, but I still don't understand the problem.) > > > > I'd be happy to make the configuration choice between UTF-16 and > > > > UCS-4, if that's doable. > > > > > > Not easily, I'm afraid. > > > > Can you explain why this is not easy? > > Because choosing whether or not to support surrogates is a > fundamental choice which affects far more than just the way you > access storage. Surrogates introduce variable width characters: > some characters use two or more Py_UNICODE code units while (most) > others only use one. > > Remember when we discussed which internal format to use or > which default encoding to apply ? We ruled out UTF-8 because > it fails badly when it comes to slicing, concatenation, indexing, > etc. > > UTF-16 is much less painful as most code points only take > up a single code unit, but it still introduces a break in concept. Hm, it sounds like you have the same problem that I had with Tom Emerson's suggestion to support Unicode before he clarified it. If we make a clean distinction between characters and storage units, and if we stick to the rule that u[i] accesses a storage unit, what's the conceptual difficulty? There might be a separate method u.char(i) which returns the *character* starting u[i:], or "" if u[i] is a low-surrogate. That could be all we need to support surrogates. How bad is that? (These could even continue to be supported when the storage uses UCS-4; there, u.char(i) would always be u[i], until someone comes up with a 64-bit character set. ;-) > > I buy that as an argument for supporting UTF-16, but not for cutting > > off the road to supporting UCS-4 for those users who would like to opt > > in. > > That was not my point. I just wanted to point out how well UTF-16 > is being accepted out there and that we are in good company by > moving from UCS-2 to UTF-16 as the current internal format. Good! I agree. > I don't want to cut off the road to UCS-4, I just want to make > clear that UTF-16 is a good choice and one which will last at > least some more years. We can then always decide to move on > to UCS-4 for the internal storage format. Agreed again. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Mon Jun 25 20:22:58 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 15:22:58 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 14:12:17 EDT."
<15159.32513.611214.399097@cymru.basistech.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> <15159.29012.266722.112773@cymru.basistech.com> <200106251816.f5PIGev08808@odiug.digicool.com> <15159.30780.1143.760653@cymru.basistech.com> <200106251842.f5PIgOe09018@odiug.digicool.com> <15159.32513.611214.399097@cymru.basistech.com> Message-ID: <200106251922.f5PJMwm09492@odiug.digicool.com> > Guido van Rossum writes: > > Ouch. So suppose we have a string u containing four items: a regular > > 16-bit char, a high surrogate, a low surrogate, and another regular > > 16-bit char. You're saying that u[0] should return the first > > character, u[1] the entire surrogate (so it would still be a 2-item > > string), u[2] I guess the empty string, and u[3] the final regular > > char. > [...] > > No, but we may as well stop going around on this, since my views are > not going to happen. > > In my view the string 'u' is a Unicode string. I don't care what sits > underneath: 16-bits or 32-bits, I don't care. As far as I'm concerned > the string has three characters in it: > > foo = u"\u4e00\u020000a" > > means that foo[0] == u"\u4e00", foo[1] == u"\u020000", and foo[2] == > u"a". I hope you meant foo = u"\u4e00\U00020000a" and foo[1] == u'\U00020000'. (I worry that your sloppy use of variable length \u escapes above shows that your understanding of the subject matter is less than you've made me believe. Please say it ain't so!) > The fact that this is represented internally in different ways shouldn't > matter to the user who only cares about characters. You misunderstand. I am claiming that this shouldn't happen because it would make u[i] an O(n) operation. Then you brought up an argument that suggested a way of indexing that *wouldn't* make it O(n), and that's what I guessed (in my "Ouch" paragraph quoted above). But what you describe now doesn't have a constant number of storage units per character, so it has to have O(n) indexing time (unless you assume a terribly hairy data structure). I'm worried that you don't understand the O(n) notation, or that you don't understand why what you are proposing would make indexing O(n). Your suggestion of "O(1+c) for some small c" makes me *really* worried about this. In which case what you want ain't gonna happen, but not for the reason you fear (BDFL decree): it's not well thought out. > > IMO that would break an important invariant of string-like objects, > > namely that len(s[i]) == 1. > > Yes it would, which is why it isn't what I'm recommending. > > > I could live with a method u.character(i) that would behave like the > > above rule -- but not the u[i] notation. > > Me too. 'nuff said. ;-) But would u.character(i) be O(1) or O(n)?
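To pin the example down, here is how the corrected literal decomposes under a UTF-16 representation -- the unit values are just the standard surrogate arithmetic:

    # foo = u"\u4e00\U00020000a" as UTF-16 code units:
    units = [0x4E00, 0xD840, 0xDC00, 0x0061]
    # 0x4E00          -> U+4E00 (a single unit)
    # 0xD840, 0xDC00  -> the surrogate pair encoding U+20000
    # 0x0061          -> u'a'
    # Under code unit indexing len(foo) == 4, and foo[1] is only the
    # high half of a pair -- which is exactly what is at issue here.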
> > But wouldn't it be enough to have a test u.issurrogate() that would > > test if the first character of u is a valid high-surrogate? (And > > maybe another test u.islowsurrogate() testing for a valid > > low-surrogate.) Then you could write it yourself easily:
> >
> > def char(u, i):
> >     c = u[i]
> >     if c.issurrogate():
> >         c2 = u[i+1]
> >         assert c2.islowsurrogate()
> >         c = c + c2
> >     return c
>
> Sure, as long as you check for the edge conditions. This should be in > the library. Note that in your above example, char(foo, 2) would not be u'a' but would be u'\u0000', and char(foo, 3) would be u'a'. So I still think you haven't thought this out as much as you believe. --Guido van Rossum (home page: http://www.python.org/~guido/) From mark@macchiato.com Mon Jun 25 20:27:07 2001 From: mark@macchiato.com (Mark Davis) Date: Mon, 25 Jun 2001 12:27:07 -0700 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? References: <3B3722DB.1FF54794@lemburg.com> <4ak820g418.fsf@kern.srcf.societies.cam.ac.uk> <006501c0fd82$8b5ba9f0$0c680b41@c1340594a> <005b01c0fd9d$e4469e60$1a2cf7c2@oakdale2> Message-ID: <013901c0fdac$d27d1970$0c680b41@c1340594a> That is an interesting approach; one that basically amounts to some convenience functions. For example, instead of writing: myString.substring(myString.cpToIndex(3), myString.cpToIndex(5)); you could write: myString.substring(3, 5, myString.CODEPOINT); This hides some of the work, when someone is working in code points. The performance cost is still there, of course; using code point indexes requires each operation to examine every code unit up to that point, which is much more expensive. For a general programming language or string library, I'm not sure about implementing this pattern throughout. I know in the ICU library, for example, we have a significant number of functions that take offsets into strings. Having such a parameter on all of them would be clumsy, when most of the time people are simply working in code units. Mark ----- Original Message ----- From: "J M Sykes" To: "Mark Davis" ; "M.-A. Lemburg" ; "Gaute B Strokkenes" Cc: "Tim Peters" ; ; "Unicode List" Sent: Monday, June 25, 2001 10:38 Subject: Re: How does Python Unicode treat surrogates? > Mark Davis said: > > > > In most people's experience, it is best to leave the low level interfaces > > with indices in terms of code units, then supply some utility routines > that > > tell you information about code points. ... > > Anyone on the list interested in the treatment of UCS aka Unicode in > programming languages might like to know that a meeting of ISO/IEC JTC 1/SC > 32/WG 3 recently approved a paper that specifies how SQL implementations > should do it. > > The proposal can be found at: > > ftp://sqlstandards.org/SC32/WG3/Meetings/PER_2001_04_Perth_AUS/per054r1.pdf > > The current CD of the next SQL standard (ISO/IEC 9075), as amended by this > proposal (and many others) can be found at: > > ftp://sqlstandards.org/SC32/WG3/Progression_Documents/CD/cd1r1-foundation-2001-06.pdf > > Briefly, the SQL functions CHARACTER_LENGTH, POSITION (the SQL string > indexing function), and SUBSTRING will all accept a parameter specifying the > units to be used, the alternatives being OCTETS, CODE_UNITS and CHARACTERS > (which to SQL means code points); the default being characters. > > This proposal was agreed by major SQL implementors. > > Which doesn't mean that it's right, nor that it can't be changed. But that's > how it is at the moment. > > Mike.
> > *********************************************************** > > J M Sykes Email: Mike.Sykes@acm.org > 97 Oakdale Drive > Heald Green > CHEADLE > Cheshire SK8 3SN > UK Tel: (44) 161 437 5413 > > *********************************************************** > > > > > > From paulp@ActiveState.com Mon Jun 25 20:41:15 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Mon, 25 Jun 2001 12:41:15 -0700 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> Message-ID: <3B3793DB.DFF114EC@ActiveState.com> Guido van Rossum wrote: > >... > > If we make a clean distinction between characters and storage units, > and if we stick to the rule that u[i] accesses a storage unit, what's the > conceptual difficulty? > > There might be a separate method u.char(i) > which returns the *character* starting u[i:], or "" if u[i] is a > low-surrogate. Are you saying that having u[i] return the i'th character (code point) of 'u' is not going to be provided at all? > That could be all we need to support surrogates. How > bad is that? (These could even continue to be supported when the > storage uses UCS-4; there, u.char(i) would always be u[i], until > someone comes up with a 64-bit character set. ;-) So the same input will have a different behavior based on the fact that we upgraded our internal representation? :( That strikes me as an int/long issue. I'd rather we design in terms of the logical construct: "arbitrary-sized mathematical integer", "Unicode code point" rather than the implementation detail: "32-bit 2's complement integer", "UTF-16 code unit." -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From tree@basistech.com Mon Jun 25 20:01:43 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 15:01:43 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates?
In-Reply-To: <200106251922.f5PJMwm09492@odiug.digicool.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> <15159.29012.266722.112773@cymru.basistech.com> <200106251816.f5PIGev08808@odiug.digicool.com> <15159.30780.1143.760653@cymru.basistech.com> <200106251842.f5PIgOe09018@odiug.digicool.com> <15159.32513.611214.399097@cymru.basistech.com> <200106251922.f5PJMwm09492@odiug.digicool.com> Message-ID: <15159.35479.42093.828285@cymru.basistech.com> [ I'm the first to admit this hasn't been thought out... I'm writing off the cuff ] Guido van Rossum writes: > > foo = u"\u4e00\u020000a" > > > > means that foo[0] == u"\u4e00", foo[1] == u"\u020000", and foo[2] == > > u"a". > > I hope you meant foo = u"\u4e00\U00020000a" and foo[1] == u'\U00020000'. > > (I worry that your sloppy use of variable length \u escapes above > shows that your understanding of the subject matter is less than > you've made me believe. Please say it ain't so!) The maximum code-point value for a Unicode character is U+10FFFF, hence the suggested notation above (I should have noted it as such). If Python is going to implement full support for ISO 10646 then the full 32-bit representation (and 8-digit \U escape) is appropriate. If you limit the maximum size of the character escape so that the scanner catches improper character sizes you save grief for the end-user, IMHO. I must admit that I wasn't aware of the "\U00020000" notation. I still think it should limit itself to 6 digits, not 8. > > The fact that this is represented internally different ways shouldn't > > matter to the user who only cares about characters. > > You misunderstand. I am claiming that this shouldn't happen because > it would make u[i] an O(n) operation. Then you brought up an argument > that suggested a way of indexing that *wouldn't* make it O(n), and > that's what I guessed (in my "Ouch" paragraph quoted above). > > But what you describe now doesn't have a constant number of storage > units per character, so it has to have O(n) indexing time (unless you > assume a terribly hairy data structure). I understand O(n) and O(1) perfectly well. My point is that you do not have to scan the entire string when doing this indexing. You only need to look at most one storage unit on either side of the index. We're only concerned here with transparently handling surrogates when the underlying representation is UTF-16. > Note that in your above example, char(foo, 2) would not be u'a' but > would be u'\u0000', and char(foo, 3) would be u'a'. My example above presumes that indices into the string refer to characters, not storage units, and that UTF-16 is being used transparently internally.
So in my world, evaluating foo = u"\u4e00\U00020000a" would treat foo[1] as u'\U00020000' and foo[2] as u'a'. > So I still think you haven't thought this out as much as you believe. As I said, I have no belief that this is thought out. I'm merely stating what I believe the observable behavior should be. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From fredrik@pythonware.com Mon Jun 25 20:54:37 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Mon, 25 Jun 2001 21:54:37 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> Message-ID: <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> guido wrote: > > That's because len(u) has nothing to do with the number of > > characters in the string, it only counts the code units (Py_UNICODEs) > > which are used to represent characters. The same is true for normal > > strings, e.g. UTF-8 can use between 1-4 code units (bytes in this > > case) for a single code point and in Unicode you can create characters > > by combining code points. > > Total agreement. I disagree: in python's current string model, there's a difference between *encoded* byte buffers and character strings. > So let me rephrase the question. How hard would it be to offer the > config-time choice between UCS-4 and UTF-16? > If it's hard, why? the core string type (which I wrote) should support this pretty much out of the box. probably more work to fix the codecs (I didn't write them, so I cannot tell for sure), but I doubt it's that much work. SRE and the unicode databases (me again) should also work pretty much out of the box. > If we make a clean distinction between characters and storage units, > and if we stick to the rule that u[i] accesses a storage unit, what's the > conceptual difficulty? I'm sceptical -- I see very little reason to maintain that distinction. let's use either UCS-2 or UCS-4 for the internal storage, stick to the "character strings are character sequences" concept, and keep the UTF-16 surrogate issue where it belongs: in the codecs. Cheers /F From tree@basistech.com Mon Jun 25 20:17:57 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 15:17:57 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates?
In-Reply-To: <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> Message-ID: <15159.36453.486716.705433@cymru.basistech.com> Fredrik Lundh writes: > I'm sceptical -- I see very little reason to maintain that distinction. > let's use either UCS-2 or UCS-4 for the internal storage, stick to the > "character strings are character sequences" concept, and keep the > UTF-16 surrogate issue where it belongs: in the codecs. How then is u"\U00200000" represented internally if you use UCS-2 as the internal storage representation? -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From paulp@ActiveState.com Mon Jun 25 21:03:54 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Mon, 25 Jun 2001 13:03:54 -0700 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> Message-ID: <3B37992A.40CD1CF2@ActiveState.com> Fredrik Lundh wrote: > >... > > I'm sceptical -- I see very little reason to maintain that distinction. > let's use either UCS-2 or UCS-4 for the internal storage, stick to the > "character strings are character sequences" concept, and keep the > UTF-16 surrogate issue where it belongs: in the codecs. I agree. But I'd add that if different people really need different performance/simplicity trade-offs then maybe we need multiple variants of the Unicode object. But please don't cut those of us who value simplicity off from the option of strings that work entirely in terms of logical characters (code points) and not physical representation units. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From guido@digicool.com Mon Jun 25 21:08:52 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 16:08:52 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 15:01:43 EDT." 
<15159.35479.42093.828285@cymru.basistech.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> <15159.29012.266722.112773@cymru.basistech.com> <200106251816.f5PIGev08808@odiug.digicool.com> <15159.30780.1143.760653@cymru.basistech.com> <200106251842.f5PIgOe09018@odiug.digicool.com> <15159.32513.611214.399097@cymru.basistech.com> <200106251922.f5PJMwm09492@odiug.digicool.com> <15159.35479.42093.828285@cymru.basistech.com> Message-ID: <200106252008.f5PK8q109630@odiug.digicool.com> > I must admit that I wasn't aware of the "\U00020000" notation. I still > think it should limit itself to 6 digits, not 8. Too late -- It's some kind of standard already (maybe borrowed from Java?). > I understand O(n) and O(1) perfectly well. My point is that you do not > have to scan the entire string when doing this indexing. You only need > to look at most one storage unit on either side of the index. We're > only concerned here with transparently handling surrogates when the > underlying representation is UTF-16. And that's where your proposal simply doesn't work. If the storage units are all 16 bits, and you want the index to count in characters, you can't know where in a megabyte-long string to start looking for character 1,000,000: you have to iterate over the storage units from the beginning until you have counted 1,000,000 characters. If there were no surrogates, that's 1,000,000 storage units from the beginning; if all characters happened to be surrogates, it would be 2,000,000 storage units. If there are m surrogates between character 0 and character n, character n starts at storage unit offset n+m; the only way to determine m is a brute-force O(n) search. > > Note that in your above example, char(foo, 2) would not be u'a' but > > would be u'\u0000', and char(foo, 3) would be u'a'. > > My example above presumes that indices into the string refer to > characters, not storage units, and that UTF-16 is being used > transparently internally. So in my world, evaluating > > foo = u"\u4e00\U00020000a" > > would treat foo[1] as u'\U00020000' and foo[2] as u'a'. > > > So I still think you haven't thought this out as much as you believe. > > As I said, I have no belief that this is thought out. I'm merely > stating what I believe the observable behavior should be. So explain once more how the observable behavior could be O(1). --Guido van Rossum (home page: http://www.python.org/~guido/) From tree@basistech.com Mon Jun 25 20:33:35 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 15:33:35 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates?
In-Reply-To: <200106252008.f5PK8q109630@odiug.digicool.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> <15159.29012.266722.112773@cymru.basistech.com> <200106251816.f5PIGev08808@odiug.digicool.com> <15159.30780.1143.760653@cymru.basistech.com> <200106251842.f5PIgOe09018@odiug.digicool.com> <15159.32513.611214.399097@cymru.basistech.com> <200106251922.f5PJMwm09492@odiug.digicool.com> <15159.35479.42093.828285@cymru.basistech.com> <200106252008.f5PK8q109630@odiug.digicool.com> Message-ID: <15159.37391.172601.161556@cymru.basistech.com> Guido van Rossum writes: > And that's where your proposal simply doesn't work. If the storage > units are all 16 bits, and you want the index to count in characters, > you can't know where in a megabyte-long string to start looking for > character 1,000,000: you have to iterate over the storage units from > the beginning until you have counted 1,000,000 characters. If there > were no surrogates, that's 1,000,000 storage units from the beginning; > if all characters happened to be surrogates, it would be 2,000,000 > storage units. If there are m surrogates between character 0 and > character n, character n starts at storage unit offset n+m; the only > way to determine m is a brute-force O(n) search. Bing, the light goes on. Of course. "Never mind." :-) -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From fredrik@pythonware.com Mon Jun 25 21:39:14 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Mon, 25 Jun 2001 22:39:14 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> Message-ID: <012e01c0fdb6$ea4e9e70$4ffa42d5@hagrid> I wrote: > SRE and the unicode databases (me again) should also work > pretty much out of the box.
a 32-bit version SRE works as expected, at least:

>>> a = array.array("i", map(ord, "hello"))
>>> m = sre.search("l+", a)
>>> m

>>> m.group(0)
array('i', [108, 108])

the DLL size is identical, and the performance is roughly the same. Cheers /F From mal@lemburg.com Mon Jun 25 21:43:55 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 22:43:55 +0200 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? References: <200106251434.KAA20168@unicode.org> Message-ID: <3B37A28B.8445BDF7@lemburg.com> Rick McGowan wrote: > > Gaute B Strokkenes wrote... > > > [I'm cc:-ing the unicode list to make sure that I've gotten my > > terminology right, and to solicit comments > > Interesting... I just started looking at Python the other day, once I > discovered it has such nice built-in Unicode support. > > If Python is explicitly storing the stuff as UTF-16 in u"" strings, then > slicing operations certainly should be acting on units of the backing > store, just as for ASCII "character" strings. In that case, in order for > every unit to be addressable, it should allow breaking up of surrogate > pairs. (Apple's Cocoa environment strings work the same way with > "ranges".) There should be another operation, or several, that slice up > strings based on other kinds of text element boundaries. For example, a > "slice on character boundaries" that would always shift the range to > accommodate surrogate pairs -- as a separate operation. > > The low-level routines in Python, like slicing with absolute locations, > shouldn't presume to know about the encoding, only about the UNITS that are > in the "array". Exactly my opinion. Do you have references which we could look at to determine which of these boundary kinds would actually be useful in daily programming ? -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Mon Jun 25 21:52:54 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 22:52:54 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> <012e01c0fdb6$ea4e9e70$4ffa42d5@hagrid> Message-ID: <3B37A4A6.2B5D068A@lemburg.com> Fredrik Lundh wrote:
>
> I wrote:
> > SRE and the unicode databases (me again) should also work
> > pretty much out of the box.
>
> a 32-bit version SRE works as expected, at least:
>
> >>> a = array.array("i", map(ord, "hello"))
> >>> m = sre.search("l+", a)
> >>> m
>
> >>> m.group(0)
> array('i', [108, 108])
>
> the DLL size is identical, and the performance is roughly the
> same.
That's good to know, but Guido was asking about supporting both UTF-16 and UCS-4 by means of a configure switch -- supporting this kind of dual approach is what I consider hard to maintain and implement. Dealing only with UTF-16 or only with UCS-4 would be much less work, and this is what I am advertising (stick with UTF-16 for the next few years and then maybe switch over to UCS-4; note that this will cause an incompatibility due to u[i] referencing code units which then change). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tim@digicool.com Mon Jun 25 22:12:42 2001 From: tim@digicool.com (Tim Peters) Date: Mon, 25 Jun 2001 17:12:42 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106252008.f5PK8q109630@odiug.digicool.com> Message-ID: [Tom Emerson] > I must admit that I wasn't aware of the "\U00020000" notation. I still > think it should limit itself to 6 digits, not 8. [Guido] > Too late -- It's some kind of standard already (maybe borrowed > from Java?). We borrowed \U12345678 notation from the current ISO/ANSI C standard ("C99"). A space with 2**20 characters isn't going to last either -- and unlike the Unicode folks, X3J11 didn't have any reason to indulge wishful thinking on this point. From tim@digicool.com Mon Jun 25 22:22:31 2001 From: tim@digicool.com (Tim Peters) Date: Mon, 25 Jun 2001 17:22:31 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <15159.37391.172601.161556@cymru.basistech.com> Message-ID: My understanding is that UTF-16 (like UTF-8 in this respect) was deliberately designed so that given a random pointer into the middle of a contiguous vector of encodings, it's indeed O(1) to find the start of the nearest *character* going either forwards or backwards. "The right way" to solve the character (not binary blob) indexing problem is to add a search finger to the string, a pair mapping "the last" character index asked for to the address of the start of its encoding. Since string traversal generally moves ahead-- or back --just one character at a time, the point in the first paragraph assures that traversing a string with N characters, in whole, takes O(N) time overall. It's not as simple as base + offset, but requires no more than a few range compares (plus updating the finger) per indexing operation. From fredrik@pythonware.com Mon Jun 25 22:43:34 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Mon, 25 Jun 2001 23:43:34 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: Message-ID: <001d01c0fdbf$e37720f0$4ffa42d5@hagrid> Tim Peters wrote: > "The right way" to solve the character (not binary blob) indexing problem is > to add a search finger to the string, a pair mapping "the last" character > index asked for to the address of the start of its encoding. Since string > traversal generally moves ahead-- or back --just one character at a time, > the point in the first paragraph assures that traversing a string with N > characters, in whole, takes O(N) time overall. It's not as simple as base + > offset, but requires no more than a few range compares (plus updating the > finger) per indexing operation. plus the time it takes to acquire and release a thread lock for each character...
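To illustrate the search-finger idea in code, here is a toy Python sketch (purely illustrative -- the class and method names are made up, and Python's unicode object implements nothing like this today):

    class FingeredString:
        # Cache the last (character index, code unit offset) pair so
        # that mostly-sequential character access over a UTF-16 buffer
        # is amortized O(1) instead of O(n) per access.
        def __init__(self, units):
            self.units = units      # list of 16-bit code unit values
            self.finger = (0, 0)    # (character index, unit offset)
        def unit_offset(self, n):
            # Find the unit offset of character n, walking from the
            # finger rather than from the beginning of the buffer.
            ci, ui = self.finger
            while ci < n:           # walk forward one character at a time
                if 0xD800 <= self.units[ui] <= 0xDBFF:
                    ui = ui + 2     # surrogate pair: two units
                else:
                    ui = ui + 1
                ci = ci + 1
            while ci > n:           # or walk backward
                ui = ui - 1
                if 0xDC00 <= self.units[ui] <= 0xDFFF:
                    ui = ui - 1     # stepped onto a low surrogate
                ci = ci - 1
            self.finger = (n, ui)
            return ui

    # e.g. FingeredString([0x0041, 0xD800, 0xDC00, 0x0042]).unit_offset(2) == 3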
From tim@digicool.com Mon Jun 25 23:11:21 2001 From: tim@digicool.com (Tim Peters) Date: Mon, 25 Jun 2001 18:11:21 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <001d01c0fdbf$e37720f0$4ffa42d5@hagrid> Message-ID: [Fredrik Lundh] > plus the time it takes to acquire and release a thread lock > for each character... Eh? Python code runs under the protection of the global interpreter lock. There are no instances of Py_BEGIN_ALLOW_THREADS in any of the Unicode or regexp C support code now -- but you know that, so I must be missing your point. Or you're just feeling contrary. From mal@lemburg.com Mon Jun 25 21:05:36 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 22:05:36 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> Message-ID: <3B379990.581C50EE@lemburg.com> Guido van Rossum wrote: > > > That's because len(u) has nothing to do with the number of > > characters in the string, it only counts the code units (Py_UNICODEs) > > which are used to represent characters. The same is true for normal > > strings, e.g. UTF-8 can use between 1-4 code units (bytes in this > > case) for a single code point and in Unicode you can create characters > > by combining code points. > > Total agreement. > > > As Mark Davis pointed out: > > > > """In most people's experience, it is best to leave the low level interfaces > > with indices in terms of code units, then supply some utility routines that > > tell you information about code points. The most useful are: > > > > - given a string and an index into that string, how many code points are > > before it? > > - given a string and a number of code points, what is the lowest index that > > contains them? > > I understand the first and the third, but what is this one? Is it a > search? Right. The difference to .find(s) is that it would return a code point index (which can differ from the code unit index). > > - given a string and an index into that string, is the index on a code point > > boundary? > > """ > > > > Python could use some more Unicode methods to answer these > > questions. > > Agreed (see my other post responding to Tom Emerson). > > > > > Python currently only has minimal support for surrogates, so > > > > purists would say that we support UCS-2. However, we deliberately > > > > chose this path to be able to upgrade to UTF-16 at some later > > > > point in time and it seems that this time has now come. > > > > > > How hard would it be to also change the party line about what the > > > encoding used is based on whether we use 2 or 4 bytes? We could even > > > give three choices: UCS-2 (current situation, no surrogates), UTF-16 > > > (16-bit items with some surrogate support) or UCS-4 (32-bit items)? > > > > Ehm... what are you getting at here ?
> Earlier on you said it would be hard to offer a config-time choice > between UTF-16 and UCS-4. I'm still trying to figure out why. Here's an example of how this change affects semantics:

u = u"\U00010000"
# UTF-16: u[0] -> u"\uD800" (the high half of the surrogate pair)
# UCS-4:  u[0] -> u"\U00010000"

> Given > the additional stuff I've learned now about surrogates, it doesn't > make sense to choose between UCS-2 and UTF-16; the surrogate handling > can always be present. Right. > So let me rephrase the question. How hard would it be to offer the > config-time choice between UCS-4 and UTF-16? It would mean lots of #ifdefs and a change in semantics. > If it's hard, why? It's mostly hard due to the fact that indexing, sizes and memory management will be different for the two (e.g. dynamic resizing vs. one time allocation). Codecs will have to pay attention to the difference too since UCS-4 would not need surrogates while UTF-16 requires these. > (I've heard you say that it's hard before, but I still don't > understand the problem.) > > > > > > I'd be happy to make the configuration choice between UTF-16 and > > > > > UCS-4, if that's doable. > > > > > > > > Not easily, I'm afraid. > > > > > > Can you explain why this is not easy? > > > > Because choosing whether or not to support surrogates is a > > fundamental choice which affects far more than just the way you > > access storage. Surrogates introduce variable width characters: > > some characters use two or more Py_UNICODE code units while (most) > > others only use one. > > > > Remember when we discussed which internal format to use or > > which default encoding to apply ? We ruled out UTF-8 because > > it fails badly when it comes to slicing, concatenation, indexing, > > etc. > > > > UTF-16 is much less painful as most code points only take > > up a single code unit, but it still introduces a break in concept. > > Hm, it sounds like you have the same problem that I had with Tom > Emerson's suggestion to support Unicode before he clarified it. No, I do understand what you mean. The "break in concept" refers to the different ways you have to deal with variable and fixed width representations internally (as I tried to briefly explain above). > If we make a clean distinction between characters and storage units, > and if we stick to the rule that u[i] accesses a storage unit, what's the > conceptual difficulty? There might be a separate method u.char(i) > which returns the *character* starting u[i:], or "" if u[i] is a > low-surrogate. That could be all we need to support surrogates. How > bad is that? (These could even continue to be supported when the > storage uses UCS-4; there, u.char(i) would always be u[i], until > someone comes up with a 64-bit character set. ;-) Right... that should solve the "problem". > > > I buy that as an argument for supporting UTF-16, but not for cutting > > > off the road to supporting UCS-4 for those users who would like to opt > > > in. > > > > That was not my point. I just wanted to point out how well UTF-16 > > is being accepted out there and that we are in good company by > > moving from UCS-2 to UTF-16 as the current internal format. > > Good! I agree. > > > I don't want to cut off the road to UCS-4, I just want to make > > clear that UTF-16 is a good choice and one which will last at > > least some more years. We can then always decide to move on > > to UCS-4 for the internal storage format. > > Agreed again.
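For reference, the arithmetic behind that \U00010000 example is fixed by the UTF-16 design. As a sketch (the function names here are made up, not a proposed API):

    def make_surrogate_pair(cp):
        # Map a code point in U+10000..U+10FFFF to its UTF-16 pair.
        assert 0x10000 <= cp <= 0x10FFFF
        cp = cp - 0x10000
        return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

    def join_surrogate_pair(high, low):
        # Inverse: recombine a high/low surrogate pair into a code point.
        assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

    # make_surrogate_pair(0x10000) == (0xD800, 0xDC00)
    # make_surrogate_pair(0x10FFFF) == (0xDBFF, 0xDFFF)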
-- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 00:53:49 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 01:53:49 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <3B376E68.505BF6E@lemburg.com> (mal@lemburg.com) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> Message-ID: <200106252353.f5PNrnO01574@mira.informatik.hu-berlin.de> > > Oh. I didn't know. How does it differ from Unicode? What's the user > > acceptance? > > http://www.unicode.org/unicode/consortium/memblogo.html says it all. Mmh. http://www.iso.ch/iso/en/aboutiso/isomembers/MemberCountryList.MemberCountryList Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 01:07:43 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 02:07:43 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106251842.f5PIgOe09018@odiug.digicool.com> (message from Guido van Rossum on Mon, 25 Jun 2001 14:42:24 -0400) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> <15159.29012.266722.112773@cymru.basistech.com> <200106251816.f5PIGev08808@odiug.digicool.com> <15159.30780.1143.760653@cymru.basistech.com> <200106251842.f5PIgOe09018@odiug.digicool.com> Message-ID: <200106260007.f5Q07hV01625@mira.informatik.hu-berlin.de> > 16-bit char, a high surrogate, a low surrogate, and another regular > 16-bit char. You're saying that u[0] should return the first > character, u[1] the entire surrogate (so it would still be a 2-item > string), u[2] I gues the empty string, and u[3] the final regular > char. > > IMO that would break an important invariant of string-like objects, > namely that len(s[i]) == 1. No, it wouldn't. s[1] would return a string containing 2 Py_UNICODE values, but len(s[1]) would still be 1. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 00:47:18 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Tue, 26 Jun 2001 01:47:18 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106251620.f5PGKNP08234@odiug.digicool.com> (message from Guido van Rossum on Mon, 25 Jun 2001 12:20:23 -0400) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> Message-ID: <200106252347.f5PNlIT01439@mira.informatik.hu-berlin.de> > If this is the only thing that keeps us from having a configuration > OPTION to make Py_UNICODE 32-bit wide, I'd say let's fix it. I think there are numerous places which assume sizeof(Py_UNICODE)==2, including, but not limited to, sre. > But UTF-16 vs. UCS-4 is not an implementation detail! > > If we store 4 bytes per character, we should treat surrogates > differently. I don't know where those would be converted -- probably > in the UTF-16 to UCS-4 codec. Indeed, they would never appear in a 32-bit Unicode string. > > This is different: ISO 10646 is a competing standard, not just a > > different encoding. > > Oh. I didn't know. How does it differ from Unicode? What's the user > acceptance? To my knowledge, it only differs in minor points, which is only caused by different release dates (at one time, Unicode is behind, at another time, the ISO standard). End users typically view it as Unicode, whereas standards bodies and agencies typically view it as ISO 10646 (e.g. C, C++, and Posix all refer to ISO 10646, Microsoft refers to Unicode). Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 01:18:55 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 02:18:55 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <15159.36453.486716.705433@cymru.basistech.com> (message from Tom Emerson on Mon, 25 Jun 2001 15:17:57 -0400) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> <15159.36453.486716.705433@cymru.basistech.com> Message-ID: <200106260018.f5Q0ItN01657@mira.informatik.hu-berlin.de> > Fredrik Lundh writes: > > I'm sceptical -- I see very little reason to maintain that distinction. > > let's use either UCS-2 or UCS-4 for the internal storage, stick to the > > "character strings are character sequences" concept, and keep the > > UTF-16 surrogate issue where it belongs: in the codecs. > > How then is u"\U00200000" represented internally if you use UCS-2 as > the internal storage representation? 
I think the obvious answer is: It is not supported. It will give an exception when you try to convert an UTF-8 or UTF-16 string that has such a character, it will be an error if you pass a surrogate to unichr, or in a \u literal. That would simplify a lot, IMO, and only require support for a 32-bit Py_UNICODE. Of course, that would have to be done as a per-platform choice, to avoid binary-incompatible extension modules. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 01:16:08 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 02:16:08 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <15159.35479.42093.828285@cymru.basistech.com> (message from Tom Emerson on Mon, 25 Jun 2001 15:01:43 -0400) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> <15159.29012.266722.112773@cymru.basistech.com> <200106251816.f5PIGev08808@odiug.digicool.com> <15159.30780.1143.760653@cymru.basistech.com> <200106251842.f5PIgOe09018@odiug.digicool.com> <15159.32513.611214.399097@cymru.basistech.com> <200106251922.f5PJMwm09492@odiug.digicool.com> <15159.35479.42093.828285@cymru.basistech.com> Message-ID: <200106260016.f5Q0G8x01656@mira.informatik.hu-berlin.de> > The maximum code-point value for a Unicode character is U+10FFFF, > hence the suggested notation above (I should have noted it as > such). If Python is going to implement full support for ISO 10646 then > the full 32-bit representation (and 8-digit \U escape) is > appropriate. Correct me if I'm wrong, but doesn't some 10646 amendment limit the code range to 10FFFF also (i.e. to only a part of group 0)? > If you limit the maximum size of the character escape so that the > scanner catches improper character sizes you save grief for the > end-user, IMHO. I think Python should still use the \UXXXXXXXX notation, as does C and C++ - no matter that the first two XX will always be 00. > I understand O(n) and O(1) perfectly well. My point is that you do not > have to scan the entire string when doing this indexing. You only need > to look at most one storage unit on either side of the index. We're > only concerned here with transparently handling surrogates when the > underlying representation is UTF-16. Please think carefully. What if you are indexing index 20, but you have a surrogate at words 10 and 11? Then you should take word 21, instead of word 20, no? How are you going to find that out in constant time? Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 00:40:01 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 01:40:01 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? 
In-Reply-To: <200106251544.f5PFiWe07979@odiug.digicool.com> (message from Guido van Rossum on Mon, 25 Jun 2001 11:44:32 -0400) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> Message-ID: <200106252340.f5PNe1E01408@mira.informatik.hu-berlin.de> > You can believe what *should* happen all you want, but we're not going > to change this soon. u[i] has to be independent of the length of u > and the value of i. Not even if a patch is submitted that puts a bit into Unicode objects which have surrogates in them, to transparently implement indexing and length differently for them? Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 00:58:17 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 01:58:17 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106251742.f5PHgTW08532@odiug.digicool.com> (message from Guido van Rossum on Mon, 25 Jun 2001 13:42:29 -0400) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> Message-ID: <200106252358.f5PNwHg01594@mira.informatik.hu-berlin.de> > But unless I misunderstand what it *is* that you are suggesting, the > O(1) indexing property can't be retained with your suggestion, and > that's out of the question. The O(1) indexing property can be retained for strings not containing surrogates, while still counting surrogate pairs as one character. Unfortunately, this will require an additional word per unicode object, unless I'm allowed to use a byte past the terminating zero (which will only slightly reduce the memory overhead). If somebody can find a spare bit :-) Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 00:26:51 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 01:26:51 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? 
In-Reply-To: <200106251422.f5PEMel07612@odiug.digicool.com> (message from Guido van Rossum on Mon, 25 Jun 2001 10:22:40 -0400) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> Message-ID: <200106252326.f5PNQp401376@mira.informatik.hu-berlin.de> > I don't think switching to a 32-bit character is the right thing to do > for us (although I think it should be easier than it currently is -- > changing to define Py_UNICODE as a 32-bit unsigned int should be all > that it takes, which is currently not the case). > > I'm all for taking the lazy approach and letting applications that > need surrogate support do it themselves, at the application level. That, of course, means that you cast in stone the 16-bit Py_UNICODE. In a 32-bit Py_UNICODE, unichr(0xd800) would surely be illegal, wouldn't it? So an application that explicitly creates surrogates using unichr (how else would it do that?) won't be portable to a 32-bit Py_UNICODE. Would you accept patches that deal with surrogate pairs transparently throughout the implementation, in the sense of mapping them to ordinals above 0x10000? Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 00:32:25 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 01:32:25 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106251443.f5PEh2p07753@odiug.digicool.com> (message from Guido van Rossum on Mon, 25 Jun 2001 10:43:02 -0400) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> Message-ID: <200106252332.f5PNWPh01407@mira.informatik.hu-berlin.de> > Does that make sense? > > I know I am hindered by a lack of understanding of Unicode > hairsplitting, angels-on-a-pin-dancing details; if I'm missing > something, it's likely that many other people don't know the details > either, so an explanation would be much appreciated! I don't think you are missing any detail; I guess you are fully aware that you are throwing one of Unicode's biggest strengths out of the window :-) namely the possibility to index characters, not the internal representation. As for Unicode hairsplitting: I think combining characters *are* different in that respect; they are code points on their own, even though they might have a zero-width representation. Also, normalization forms can help with combining characters; they don't help with surrogates.
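The transparent mapping Martin asks about here is the fixed UTF-16 pairing arithmetic. A minimal sketch in Python -- the function names are illustrative, not taken from any actual patch:

    def split_surrogates(cp):
        # split a non-BMP code point into a (high, low) surrogate pair
        assert 0x10000 <= cp <= 0x10FFFF
        cp = cp - 0x10000
        return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

    def join_surrogates(high, low):
        # combine a surrogate pair back into a single code point
        assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

    assert split_surrogates(0x10000) == (0xD800, 0xDC00)
    assert join_surrogates(0xDBFF, 0xDFFF) == 0x10FFFF

Since the mapping is a bijection between surrogate pairs and the range 0x10000-0x10FFFF, an implementation can convert in either direction without any extra state; the hard part of such a patch is the indexing bookkeeping, not the arithmetic.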
Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 01:21:56 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 02:21:56 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <3B37992A.40CD1CF2@ActiveState.com> (message from Paul Prescod on Mon, 25 Jun 2001 13:03:54 -0700) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> <3B37992A.40CD1CF2@ActiveState.com> Message-ID: <200106260021.f5Q0Luo01684@mira.informatik.hu-berlin.de> > I agree. But I'd add that if different people really need different > performance/simplicity trade-offs then maybe we need multiple variants > of the Unicode object. The question really is: Those people that require a 16-bit Py_UNICODE, would they ever need characters outside the BMP? My guess is no, so Fredrik's proposal sounds good to me. Regards, Martin From paulp@ActiveState.com Tue Jun 26 01:43:05 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Mon, 25 Jun 2001 17:43:05 -0700 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> <3B37992A.40CD1CF2@ActiveState.com> <200106260021.f5Q0Luo01684@mira.informatik.hu-berlin.de> Message-ID: <3B37DA99.31002323@ActiveState.com> "Martin v. Loewis" wrote: > > > I agree. But I'd add that if different people really need different > > performance/simplicity trade-offs then maybe we need multiple variants > > of the Unicode object. > > The question really is: Those people that require a 16-bit Py_UNICODE, > would they ever need characters outside the BMP? Hard to tell. People usually want to have their cake and eat it too. i.e. I want the performance of 16-bit Py_UNICODE but I want to support the occasional non-BMP character that happens to show up in a document. > My guess is no, so Fredrik's proposal sounds good to me. I'm not clear on what Fredrik's proposal is. He says: "let's use either UCS-2 or UCS-4 for the internal storage". Is he saying: 1. let's choose one or the other today 2. let's make it a compile-time switch 3. make it a runtime option I could live with 1. 
for a while longer...I haven't heard of a real user complaint about our current model. The longer we put it off, the more acceptable UCS-4 is. I wouldn't be thrilled with 2., because it makes Python code harder to move between machines (depends on your build options!) 3 would be okay if it is handled intelligently. Any of these is better to me than exposing the details of UTF-16 to the Python programmer in our Unicode type! -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From rick@unicode.org Tue Jun 26 01:50:18 2001 From: rick@unicode.org (Rick McGowan) Date: Mon, 25 Jun 2001 17:50:18 -0700 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <15159.35479.42093.828285@cymru.basistech.com> (message from Tom Emerson on Mon, 25 Jun 2001 15:01:43 -0400) Message-ID: <200106252245.SAA04499@unicode.org> > Correct me if I'm wrong, but doesn't some 10646 amendment limit the > code range to 10FFFF also (i.e. to only a part of group 0)? Yes. It's recent. Rick From rick@unicode.org Tue Jun 26 01:51:59 2001 From: rick@unicode.org (Rick McGowan) Date: Mon, 25 Jun 2001 17:51:59 -0700 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <3B37992A.40CD1CF2@ActiveState.com> (message from Paul Prescod on Mon, 25 Jun 2001 13:03:54 -0700) Message-ID: <200106252246.SAA04521@unicode.org> > The question really is: Those people that require a 16-bit Py_UNICODE, > would they ever need characters outside the BMP? Yes. More and more stuff is going outside the BMP in the future. Probably will be lots of procurement requirements eventually that need Plane 2 Han characters... But everyone of course likes the space savings of UTF-16. Rick From rick@unicode.org Tue Jun 26 01:59:58 2001 From: rick@unicode.org (Rick McGowan) Date: Mon, 25 Jun 2001 17:59:58 -0700 Subject: [I18n-sig] How does Python Unicode treat surrogates? Message-ID: <200106252254.SAA04620@unicode.org> > 1. let's choose one or the other today > 2. let's make it a compile-time switch > 3. make it a runtime option I definitely think Python should make a decision at the language level. But with the OO model, you can hide a lot of details behind string objects and accessors... Runtime options on such things are bad. This is one of the things Unicode is designed as an antidote for: the "choose char set at runtime" kind of i18n model. Compile time switch is poor because you do end up with two real models in the world. Could affect interoperability a lot, and byte-code stuff might not be as easily portable. (I don't know enough about the implementation or the language to guess, by the way.) Rick From tree@basistech.com Tue Jun 26 01:57:58 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 20:57:58 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates?
In-Reply-To: <200106260018.f5Q0ItN01657@mira.informatik.hu-berlin.de> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> <15159.36453.486716.705433@cymru.basistech.com> <200106260018.f5Q0ItN01657@mira.informatik.hu-berlin.de> Message-ID: <15159.56854.539327.291739@cymru.basistech.com> Martin v. Loewis writes: > > How then is u"\U00200000" represented internally if you use UCS-2 as > > the internal storage representation? > > I think the obvious answer is: It is not supported. It will give an > exception when you try to convert an UTF-8 or UTF-16 string that has > such a character, it will be an error if you pass a surrogate to > unichr, or in a \u literal. So the characters added in Unicode 3.1 in planes 1, 2, and 14 would not be representable in Python? Seems a bit draconian to make your life easier. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From tree@basistech.com Tue Jun 26 02:01:58 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 21:01:58 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106252347.f5PNlIT01439@mira.informatik.hu-berlin.de> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <200106252347.f5PNlIT01439@mira.informatik.hu-berlin.de> Message-ID: <15159.57094.857439.860222@cymru.basistech.com> Martin v. Loewis writes: > To my knowledge, it only differs in minor points, which is only caused > by different release dates (at one time, Unicode is behind, at another > time, the ISO standard). The Unicode Technical Committee and WG2 are striving to make the two standards move in lock step as much as possible. Unfortunately the process of adding to an ISO standard is much more involved and time consuming than that required for Unicode. > End users typically view it as Unicode, whereas standards bodies and > agencies typically view it as ISO 10646 (e.g. C, C++, and Posix all > refer to ISO 10646, Microsoft refers to Unicode). The standards are code-point for code-point compatible. The primary difference is that Unicode provides property information that 10646 does not, and the UTC strives to standardize mapping tables for new encodings (e.g., GB 18030 and JIS X 0213-2000). -- Tom Emerson Basis Technology Corp. Sr. 
Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From tree@basistech.com Tue Jun 26 02:41:51 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 21:41:51 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106252315.f5PNFw601373@mira.informatik.hu-berlin.de> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106252315.f5PNFw601373@mira.informatik.hu-berlin.de> Message-ID: <15159.59487.398979.804494@cymru.basistech.com> Martin v. Loewis writes: > So nothing will happen until enough Chinese users complain. I don't > know whether you count as Chinese for these purposes :-) Perhaps not. :-) But the Chinese aren't the only ones to worry about. The Japanese also have characters being added outside the BMP, and Ruby holds sway in Japan... > P.S. The real issue IMO is display: If there are fonts supporting > these characters, people will want to write programs that make use of > the fonts. Until nobody can actually display such text, nobody will > request that indexing works reasonable. True to a point. Fonts do exist for these characters. And I end up referencing them even when I don't have fonts. Many Chinese organizations are worried more about making sure all their characters are encoded, and less on being able to display them adequately. Indeed, the HKSAR and CUHK are working on a project whereby rare characters are also encoded using the ideographic description characters. > P.P.S. Of course, if we wait until users actually use surrogates, it > is too late to change the indexing - that would likely break people's > code. All too true. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From mal@lemburg.com Mon Jun 25 20:18:13 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 21:18:13 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> <15159.29012.266722.112773@cymru.basistech.com> <200106251816.f5PIGev08808@odiug.digicool.com> <15159.30780.1143.760653@cymru.basistech.com> <200106251842.f5PIgOe09018@odiug.digicool.com> <15159.32513.611214.399097@cymru.basistech.com> Message-ID: <3B378E75.740FFB52@lemburg.com> Tom Emerson wrote: > ... 
> No, but we may as well stop going around on this, since my views are > not going to happen. > > In my view the string 'u' is a Unicode string. I don't care what sits > underneath: 16-bits or 32-bits I don't care. As far as I'm concerned > the string has three characters in it: > > foo = u"\u4e00\u020000a" > > means that foo[0] == u"\u4e00", foo[1] == u"\u020000", and foo[2] == > u"a". > > The fact that this is represented internally different ways shouldn't > matter to the user who only cares about characters. While I agree with Guido that foo[i] should return the code unit and not the code point, I think that providing a few more Unicode methods (like the ones Mark mentioned) would go a long way in providing a compromise, e.g. foo.codepoint(1) would then return u"\u020000", foo.codelen() would return 3, etc. Alternatively we could of course also provide this functionality in the form of functions in a separate module (with the recent controversies over methods vs. functions I am not sure anymore what the general guideline is for Python... string methods at least don't seem to be too popular around here anymore; OK, just rambling ;-). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 00:22:27 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 01:22:27 +0200 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? In-Reply-To: <9F2D83017589D211BD1000805FA70CA703B139D9@ntxmel03.cmutual.com.au> (JMachin@Colonial.com.au) References: <9F2D83017589D211BD1000805FA70CA703B139D9@ntxmel03.cmutual.com.au> Message-ID: <200106252322.f5PNMRi01375@mira.informatik.hu-berlin.de> > Do we permit such a sequence to be held internally as a "Unicode string"? > Is u"\udc00" legal in source code or should Python throw a syntax error? I think it shouldn't. If we disallow it, we should a) simultaneously disallow unichr(0xDC00) b) allow \U00010000, and unichr(0x10000), which would both give strings with two Py_UNICODE values inside (leaving out the question what len() of such a string would give). > We *do* need to consider UTF encodings, because Unicode *expressly* > allows decoding UTF sequences that become unpaired surrogates, or > other "not 100% valid" scalars such as 0xffff and 0xfffe. So, given > that Python supports Unicode, not ISO 10646, we must IMO permit such > sequences in our internal representation. I think the Unicode standard is in error here (or somebody is misinterpreting it). It has happened before: Unicode 2.0 strongly believed that the internal representation of a unicode character MUST be 16-bit, and found some funny wording to mark a 32-bit wchar_t as not strictly compliant, but acceptable. Unicode 3.1 has finally revised this wrong view. > It follows that we should stop worrying about these irregular values > -- it's less programming that way. Unicode 3.1 will create enough > extra programming as it is, because we now have variable-length > characters again -- just what Unicode was going to save us from :-( We wouldn't if we could widen Py_UNICODE to 32 bits... Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 00:15:58 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 01:15:58 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates?
In-Reply-To: <15159.14391.718891.645489@cymru.basistech.com> (message from Tom Emerson on Mon, 25 Jun 2001 09:10:15 -0400) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> Message-ID: <200106252315.f5PNFw601373@mira.informatik.hu-berlin.de> > With the release of the Plane 2 ideographic extensions in Unicode 3.1 > there are two options available: include surrogate support via UTF-16, > which means dealing with multibyte (really multi"word") characters, or > switching to UTF-32, allowing characters outside Plane 0 to be > accessed uniformly. > > Note that this is a real issue: the Hong Kong Supplementary Character > Set includes characters contained in Plane 2 when mapped to Unicode > 3.1. The most likely solution, of course, for the time to come, is: Ignore characters outside the BMP. IMO, Tim Peters's view is right: If the internal representation uses surrogates, indexing should ignore this, and count a surrogate pair as one character. This is not going to happen unless somebody comes up with an efficient implementation. The obvious alternative solution is to use a 32-bit Py_UNICODE, which, given Guido's comment, is also not going to happen. So nothing will happen until enough Chinese users complain. I don't know whether you count as Chinese for these purposes :-) Regards, Martin P.S. The real issue IMO is display: If there are fonts supporting these characters, people will want to write programs that make use of the fonts. As long as nobody can actually display such text, nobody will request that indexing work reasonably. P.P.S. Of course, if we wait until users actually use surrogates, it is too late to change the indexing - that would likely break people's code. From gs234@cam.ac.uk Tue Jun 26 04:06:07 2001 From: gs234@cam.ac.uk (Gaute B Strokkenes) Date: 26 Jun 2001 04:06:07 +0100 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? In-Reply-To: <9F2D83017589D211BD1000805FA70CA703B139D8@ntxmel03.cmutual.com.au> ("Machin, John"'s message of "Mon, 25 Jun 2001 22:33:50 +1000") References: <9F2D83017589D211BD1000805FA70CA703B139D8@ntxmel03.cmutual.com.au> Message-ID: <4ag0coey8w.fsf@kern.srcf.societies.cam.ac.uk> On Mon, 25 Jun 2001, JMachin@Colonial.com.au wrote: > MAL and Gaute, > > Can I please take the middle ground (and risk having both of you > throw things at me)? > > => Lone surrogates are not 'true Unicode char points > in their own right' [MAL] -- they don't represent characters. I think you're misquoting MAL; the "not" was not there in his original statement. > On the other hand, UTF code sequences that would decode into lone > surrogates are not "illegal". Please read clause D29 in section 3.8 > of the Unicode 3.0 standard. This is further clarified by Unicode > 3.1 which expressly lists legal UTF-8 sequences; these encompass > lone surrogates. This is really a different issue.
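For concreteness, the "lone surrogate" cases being argued about can be found mechanically with a well-formedness scan over the code units. A minimal sketch in Python; the function name is mine, not from any standard API:

    def find_lone_surrogates(units):
        # return the indices of unpaired surrogates in a sequence
        # of 16-bit code unit values
        lone = []
        i = 0
        while i < len(units):
            u = units[i]
            if 0xD800 <= u <= 0xDBFF:        # high surrogate...
                if i + 1 < len(units) and 0xDC00 <= units[i+1] <= 0xDFFF:
                    i = i + 2                # ...properly paired; skip both
                    continue
                lone.append(i)               # ...with no low surrogate after it
            elif 0xDC00 <= u <= 0xDFFF:      # low surrogate with no partner
                lone.append(i)
            i = i + 1
        return lone

    # a lone low surrogate at index 3:
    assert find_lone_surrogates([0x61, 0x62, 0x63, 0xDC00, 0x64]) == [3]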
The paragraph states that the various UTFs have the property that they can transform any sequence of scalar values in the range 0 - 0x10FFFF to whatever representation is mandated by the UTF and then back again in a bijective fashion--even when the sequence includes scalars that are not Unicode characters, such as 0xFFFF, 0xFFFE and the various values that are reserved to contain UTF-16 surrogates. Personally, I'm having difficulty seeing how this statement could possibly apply to UTF-16. (For instance, I don't see how it would be possible to encode a sequence of unicode scalar values corresponding to a low and a high surrogate; if you tried to map this back then you would get a single unicode scalar value outside of the BMP). Perhaps someone on the unicode list could elaborate? My personal theory is that this is a vestige of the days when "Unicode" meant "16-bit characters" and all UTFs other than UTF-16 were just hacks that one was supposed to use for compatibility reasons only. Eventually someone realised that 16 bits wasn't going to be enough after all, and so kludges like surrogates were invented. It is instructive in this regard to note how the Unicode 3.0 conformance requirements effectively state that "thou shalt use 16-bit characters"; the paragraph stating that using UCS-4 for the wchar_t type in ISO C (this is what glibc does) is not Unicode conformant is particularly amusing. This was all changed for 3.1. -- Big Gaute http://www.srcf.ucam.org/~gs234/ .. here I am in 53 B.C. and all I want is a dill pickle!! From gs234@cam.ac.uk Tue Jun 26 04:24:27 2001 From: gs234@cam.ac.uk (Gaute B Strokkenes) Date: 26 Jun 2001 04:24:27 +0100 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? In-Reply-To: <200106251620.f5PGKNP08234@odiug.digicool.com> (Guido van Rossum's message of "Mon, 25 Jun 2001 12:20:23 -0400") References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> Message-ID: <4a8zifgbys.fsf@kern.srcf.societies.cam.ac.uk> On Mon, 25 Jun 2001, guido@digicool.com wrote: >> No problem... we can change to 4 byte values too if the world >> agrees on 4 bytes per character. However, 2 bytes or 4 bytes >> is an implementation detail and not part of the Unicode standard >> itself. > > But UTF-16 vs. UCS-4 is not an implementation detail! Sure it is! A given chunk of Unicode data is semantically just a finite sequence of Unicode scalar values. The difference between UTF-16 and UCS-4 is entirely one of how you are arranging bits and bytes to store the same information. The meaning is exactly the same; so it's an implementation detail. A (somewhat far-fetched, but there you are) analogy is this: imagine that you wish to store a true-colour bitmap in memory. You could do this by, say, storing the R, G and B components of a given pixel right next to each other, in that order. Alternatively, you could keep all the R components in one chunk and all the G components in another, or you could store the pixels in a different order. All of this makes no difference to the actual bitmap itself. 
I hope you see what I mean. > If we store 4 bytes per character, we should treat surrogates > differently. I don't know where those would be converted -- > probably in the UTF-16 to UCS-4 codec. An important point here is that the sole raison d'etre of surrogates is to enable one to store the entire 21-bit Unicode character set within the confines of a 16-bit encoding. If you're not dealing with UTF-16, surrogates quite simply do not exist and the only time you have to worry about them is when and if you wish to convert to and from UTF-16. As such the statement "we should treat surrogates differently when storing four bytes per character" is rather imprecise; the whole point is that you don't treat or worry about surrogates at all; except during conversion to and from UTF-16, obviously. -- Big Gaute http://www.srcf.ucam.org/~gs234/ I have nostalgia for the late Sixties! In 1969 I left my laundry with a hippie!! During an unauthorized Tupperware party it was chopped & diced! From tim.one@home.com Tue Jun 26 04:52:24 2001 From: tim.one@home.com (Tim Peters) Date: Mon, 25 Jun 2001 23:52:24 -0400 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? In-Reply-To: <4a8zifgbys.fsf@kern.srcf.societies.cam.ac.uk> Message-ID: [Guido] > But UTF-16 vs. UCS-4 is not an implementation detail! [Gaute B Strokkenes] > Sure it is! A given chunk of Unicode data is semantically just a > finite sequence of Unicode scalar values. The difference between > UTF-16 and UCS-4 is entirely one of how you are arranging bits and > bytes to store the same information. The meaning is exactly the same; > so it's an implementation detail. I don't know what possessed Guido to make that claim, but I'm sure he'll agree after some thought (he must, because you're right ). Something else is bothering me here, though: Python isn't C, or even Java, so a slicing gimmick returning raw encoding bytes (call 'em octets if you must, but they're bytes to me ) favored by Unicode *implementors* is at the wrong level. Unicode *users* can't paste this crap together again efficiently using Python code, because high-volume low-level bit-fiddling is exactly what Python code is worst at. So the idea that u[i] (for a Unicode string u and int i) should ever return meaningless binary blobs at the *Python* level is just astonishing to me: Unicode strings in Python are an end-user feature, not a low-level crutch for Unicode library developers. From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 06:21:35 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 07:21:35 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? 
In-Reply-To: <15159.56854.539327.291739@cymru.basistech.com> (message from Tom Emerson on Mon, 25 Jun 2001 20:57:58 -0400) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> <15159.36453.486716.705433@cymru.basistech.com> <200106260018.f5Q0ItN01657@mira.informatik.hu-berlin.de> <15159.56854.539327.291739@cymru.basistech.com> Message-ID: <200106260521.f5Q5LZK00933@mira.informatik.hu-berlin.de> > Martin v. Loewis writes: > > > How then is u"\U00200000" represented internally if you use UCS-2 as > > > the internal storage representation? > > > > I think the obvious answer is: It is not supported. It will give an > > exception when you try to convert an UTF-8 or UTF-16 string that has > > such a character, it will be an error if you pass a surrogate to > > unichr, or in a \u literal. > > So the characters added in Unicode 3.1 in planes 1, 2, and 14 would > not be representable in Python? Seems a bit draconian to make your > life easier. With Fredrik's solution, you'ld have to rebuild your Python interpreter with a 32-bit Unicode type to represent the characters. With that option, we'ld delegate the decision to administrators and Python distributors. If their users demand support for the additional characters, they will need to consider wasting space. Of course, byte code files should then use UTF-16, to allow some portability of byte code across platforms. If a byte code file contains a plane 2 string literal, it could not be imported into an interpreter who uses UCS-2, just as the corresponding source code import would fail. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 06:26:03 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 07:26:03 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <15159.59487.398979.804494@cymru.basistech.com> (message from Tom Emerson on Mon, 25 Jun 2001 21:41:51 -0400) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106252315.f5PNFw601373@mira.informatik.hu-berlin.de> <15159.59487.398979.804494@cymru.basistech.com> Message-ID: <200106260526.f5Q5Q3900934@mira.informatik.hu-berlin.de> > Martin v. Loewis writes: > > So nothing will happen until enough Chinese users complain. I don't > > know whether you count as Chinese for these purposes :-) > > Perhaps not. :-) But the Chinese aren't the only ones to worry > about. 
The Japanese also have characters being added outside the BMP, > and Ruby holds sway in Japan... That's a good point. How does Ruby deal with surrogates? Java JDK 1.4? Perl? Tcl? Windows XP? Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 07:02:51 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 08:02:51 +0200 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? In-Reply-To: <3B37656E.9E09DB1A@lemburg.com> (mal@lemburg.com) References: <9F2D83017589D211BD1000805FA70CA703B139D9@ntxmel03.cmutual.com.au> <3B37656E.9E09DB1A@lemburg.com> Message-ID: <200106260602.f5Q62pg01129@mira.informatik.hu-berlin.de> > > > Say you have a Unicode string which contains the following data: > > > > > > U+0061 U+0062 U+0063 U+DC00 U+0064 U+0065 U+0066 > > > ("a" "b" "c" ? "d" "e" "f") > > > > > > Would you consider this sequence a Unicode string or not ? > > > > I think you are using "Unicode string" with two different meanings here. > > The question is really very simple: is the above correct Unicode > or not ? I think it is not. Looking at Unicode TR 17 (http://www.unicode.org/unicode/reports/tr17/), this is an illegal sequence of code units. Specifically, they give the example - 0xD800 is incomplete in Unicode Unless followed by another 16-bit value of the right form, it is illegal. Now what does it mean that this is an illegal code unit sequence? Looking at Unicode TR 27 (aka Unicode 3.1), we see, for C12 (a) When a process generates data in a Unicode Transformation Format, it shall not emit ill-formed code unit sequences. (b) When a process interprets data in a Unicode Transformation Format, it shall treat illegal code unit sequences as an error condition. (c) A conformant process shall not interpret illegal UTF code unit sequences as characters. So clearly, we shall never emit that Unicode string in a UTF. In another message, you write > FYI, Python currently uses UTF-16 as internal storage format and > also exposes this through its indexing interfaces. Since Python uses UTF-16 as an internal format, Python must not emit above Unicode string into the internal representation, either. Therefore, if Python can represent above sequence of code units, it is not conforming. Regards, Martin From fredrik@pythonware.com Tue Jun 26 07:50:07 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Tue, 26 Jun 2001 08:50:07 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com><200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de><3B3471AF.1311E872@lemburg.com><200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de><3B34F9BD.4FDEFC62@lemburg.com><200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de><3B35CEC6.710243E7@lemburg.com><200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de><3B362E9B.4DC8DD81@lemburg.com><200106251342.f5PDg1q07291@odiug.digicool.com><3B375FB9.91BA4B1E@lemburg.com><200106251620.f5PGKNP08234@odiug.digicool.com><3B376E68.505BF6E@lemburg.com><200106251804.f5PI4D008730@odiug.digicool.com><3B378460.C27CDCDD@lemburg.com><200106251912.f5PJCVD09465@odiug.digicool.com><00f201c0fdb0$ab0fe170$4ffa42d5@hagrid><15159.36453.486716.705433@cymru.basistech.com><200106260018.f5Q0ItN01657@mira.informatik.hu-berlin.de> <15159.56854.539327.291739@cymru.basistech.com> Message-ID: <009d01c0fe0e$566a7af0$4ffa42d5@hagrid> Tom Emerson wrote: > > How then is u"\U00200000" represented internally if you use UCS-2 as > > the internal storage representation? 
> > > > I think the obvious answer is: It is not supported. It will give an > > exception when you try to convert an UTF-8 or UTF-16 string that has > > such a character, it will be an error if you pass a surrogate to > > unichr, or in a \u literal. > > So the characters added in Unicode 3.1 in planes 1, 2, and 14 would > not be representable in Python? Seems a bit draconian to make your > life easier. it is not directly supported in Python 2.0, 2.1, and the current 2.2 codebase. no amount of arguing or wishful thinking will change that. From fredrik@pythonware.com Tue Jun 26 08:05:07 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Tue, 26 Jun 2001 09:05:07 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com><200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de><3B3471AF.1311E872@lemburg.com><200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de><3B34F9BD.4FDEFC62@lemburg.com><200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de><3B35CEC6.710243E7@lemburg.com><200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de><3B362E9B.4DC8DD81@lemburg.com><200106251342.f5PDg1q07291@odiug.digicool.com><3B375FB9.91BA4B1E@lemburg.com><200106251620.f5PGKNP08234@odiug.digicool.com><3B376E68.505BF6E@lemburg.com><200106251804.f5PI4D008730@odiug.digicool.com><3B378460.C27CDCDD@lemburg.com><200106251912.f5PJCVD09465@odiug.digicool.com><00f201c0fdb0$ab0fe170$4ffa42d5@hagrid><15159.36453.486716.705433@cymru.basistech.com><200106260018.f5Q0ItN01657@mira.informatik.hu-berlin.de> <15159.56854.539327.291739@cymru.basistech.com> <200106260521.f5Q5LZK00933@mira.informatik.hu-berlin.de> Message-ID: <009e01c0fe0e$56e947e0$4ffa42d5@hagrid> mvl wrote: > With Fredrik's solution, you'ld have to rebuild your Python interpreter > with a 32-bit Unicode type to represent the characters. With that > option, we'ld delegate the decision to administrators and Python > distributors. If their users demand support for the additional > characters, they will need to consider wasting space. my suggestion is to prepare the Unicode subsystem for sizeof(Py_UNICODE) >= 4 *today*, and make the switch to UCS-4 when the time is right [1]. UTF-16 is an encoding format, not a storage format, so as long as sizeof(Py_UNICODE) is 2, there will be no support for surrogates beyond what's already in there [2]. 1) imho, that time is "as soon as the unicode subsystem is ready". 2) the U escape, plus some codecs, already support it: >>> u"\U0010ffff" u'\uDBFF\uDFFF' >>> unicode("\xf4\x8f\xbf\xbf", "utf-8") u'\uDBFF\uDFFF' From guido@digicool.com Tue Jun 26 09:51:38 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 26 Jun 2001 04:51:38 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! Message-ID: <200106260851.f5Q8pcN10662@odiug.digicool.com> I'm trying to reset this discussion to come to some sort of conclusion. There's been a lot of useful input; I believe I've read and understood it all. May the new thread subject serve as a summary of my position. :-) Terminology: "character" is a Unicode code point; "unit" is a storage unit, i.e. a 16-bit or 32-bit value. A "surrogate pair" is two 16-bit storage units with special values that represent a single character. I'll use "surrogate" for a single storage unit whose value indicates that it should be part of a surrogate pair. The variable u is a Python Unicode string object of some sort. There are several possible options for representing Unicode strings: 1. The current situation. 
I'd say that this uses UCS-2 for storage; it doesn't pay any attention to surrogates. u[i] might be a lone surrogate. unichr(i) where i is a lone surrogate value returns a string containing a lone surrogate. An application could use the unicode data type to store UTF-16 units, but it would have to be aware of all the rules pertaining to surrogates. The codecs, however, are surrogate-unaware. (Am I right that even the UTF-16 codec pays *no* special attention to surrogates?) 2. The compromise proposal. This uses true UTF-16 for storage and changes the interface to always deal in characters. unichr(i) where i is a lone surrogate is illegal, and so are the corresponding \u and \U encodings. unichr(i) for 0x10000 <= i < 0x110000 will return a one-character string that happens to be represented using a surrogate pair, but there's no way in Python to find out (short of knowing the implementation). Codecs that are capable of encoding full Unicode need to be aware of surrogate pairs. 3. The ideal situation. This uses UCS-4 for storage and doesn't require any support for surrogates except in the UTF-16 codecs (and maybe in the UTF-8 codecs; it seems that encoded surrogate pairs are legal in UTF-8 streams but should be converted back to a single character). It's unclear to me whether the (illegal, according to the Unicode standard) "characters" whose numerical value looks like a lone surrogate should be entirely ruled out here, or whether a dedicated programmer could create strings containing these. We could make it hard by declaring unichr(i) with surrogate i and \u and \U escapes that encode surrogates illegal, and by adding explicit checks to codecs as appropriate, but a C extension could still create an array containing illegal characters unless we do draconian input checking. Option 1, which does not reasonably support characters >= 0x10000, has clear problems, and these will grow with time, hence the current discussion. As a solution, option 2 seems to be most popular; this must be because it appears to promise the most efficient storage solution while allowing the largest range of characters to be represented without effort for the application. I'd like to argue that option 2 is REALLY BAD, given where we are, and that we should provide an upgrade path directly from 1 to 3 instead. The main problem with option 2 is that it breaks the correspondence between storage unit indices and character indices, and given Python's reliance on indexing and slicing for string operations, we need a way to keep the indexing operation (u[i]) efficient, as in O(1). Tim suggested a reasonable way to implement 2 efficiently: add a "finger" to each unicode object that caches the last used index (mapping the character index to the storage unit index). This can be used efficiently to walk through the characters in sequence. Of course, we would also have to store the length of the string in characters (so len(u) can be computed efficiently) as well as in storage units (so the implementation can efficiently know the storage boundaries). Martin has hinted at a solution requiring even less memory per string object, but I don't know for sure what he is thinking of. All I can imagine is a single flag saying "this string contains no surrogates". But either way, I believe that this requires that every part of the Unicode implementation be changed to become aware of the difference between characters and storage units.
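Tim's "finger" idea is easiest to see in miniature. A sketch in Python -- the real thing would live in the C implementation, and the class and method names here are made up purely for illustration:

    def _is_high(u):
        # high (leading) surrogate?
        return 0xD800 <= u <= 0xDBFF

    class FingeredString:
        def __init__(self, units):
            self.units = units        # 16-bit storage unit values
            self.finger = (0, 0)      # (character index, unit index)
        def char(self, i):
            # return the storage units making up the i-th *character*
            ci, ui = self.finger
            if i < ci:                # simplest fallback: restart at the front
                ci, ui = 0, 0
            while ci < i:             # walk forward one character at a time
                if _is_high(self.units[ui]):
                    ui = ui + 2       # a surrogate pair counts as one character
                else:
                    ui = ui + 1
                ci = ci + 1
            self.finger = (ci, ui)    # remember where we stopped
            if _is_high(self.units[ui]):
                return self.units[ui], self.units[ui + 1]
            return (self.units[ui],)

    # "a", U+10000 (stored as a surrogate pair), "b":
    s = FingeredString([0x61, 0xD800, 0xDC00, 0x62])
    assert s.char(1) == (0xD800, 0xDC00)
    assert s.char(2) == (0x62,)

Sequential scans stay cheap because each access starts from the cached position; a random access pattern still degenerates to a linear walk, which is exactly the O(1)-indexing worry raised above.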
Every piece of C code that currently deals with indices into arrays of Py_UNICODE storage units will have to be changed. This would have to be one gigantic patch, just to change the basic Unicode object implementation. The assumption that storage indices and character indices are the same thing appears in almost every function. And then think of the required changes to the SRE engine. It currently assumes a strict character <--> storage unit equivalence throughout. In order to support option 2 correctly, it would have to become surrogate-aware. There are two parts to this: the internal engine needs to realize that e.g. "." and certain "[...]" sets may match a surrogate pair, and the indices returned by e.g. the span() method of match objects should be translated to character indices as expected by the applications. On the other hand, the changes needed to support option 3 are minimal. Fredrik claims that SRE already supports this (or at least it's very close); Tim has looked over the source code of the Unicode object implementation and has not found any code that would break if Py_UNICODE were changed to a 32-bit int type. (There must be some breakage, since the code as it stands won't build on machines where sizeof(short) != 2, but it's got to be a very shallow problem.) I see only one remaining argument against choosing 3 over 2: FUD about disk and primary memory space usage. (I can't believe that anyone would still worry about the extra CPU time, after Fredrik's report that SRE is about as fast with 4 byte characters as it is with 2. In any case this is secondary to the memory space issue, as it is only related to the extra cycles needed to move twice as many bytes around; the cost of most algorithms is determined mostly by the number of characters (or storage units) processed rather than by the number of bytes.) I think the disk space usage problem is dealt with easily by choosing appropriate encodings; UTF-8 and UTF-16 are both great space-savers, and I doubt many sites will store large amounts of UCS-4 directly, given that good codecs are available. The primary memory space problem will go away with time; assuming that most textual documents contain at most a few million characters, it's already not that much of a problem on modern machines. Applications that are required to deal efficiently with larger documents should support some way of streaming or chunking the data anyway. The only remaining question is how to provide an upgrade path to option 3: A. At some Python version, we switch.
Some code could be #ifdef'ed out when Py_UNICODE == wchar_t, but there would always have to be code to support these two having different sizes. The outcome of the choice must be available at run-time, because it may affect certain codecs. Maybe sys.maxunicode could be the largest character value supported, i.e. 0xffff or 0xfffff? A different way to look at it: if we had wanted to use a variable-lenth internal representation, we should have picked UTF-8 way back, like Perl did. Moving to a UTF-16-based internal representation now will give us all the problems of the Perl choice without any of the benefits. --Guido van Rossum (home page: http://www.python.org/~guido/) From walter@livinglogic.de Tue Jun 26 10:56:49 2001 From: walter@livinglogic.de (Walter =?iso-8859-1?Q?D=F6rwald?=) Date: Tue, 26 Jun 2001 11:56:49 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> Message-ID: <3B385C61.E69146D@livinglogic.de> Fredrik Lundh wrote: >=20 > guido wrote: >=20 > [...] > > If we make a clean distinction between characters and storage units, > > and if stick to the rule that u[i] accesses a storage unit, what's th= e > > conceptual difficulty? >=20 > I'm sceptical -- I see very little reason to maintain that distinction. > let's use either UCS-2 or UCS-4 for the internal storage, stick to the > "character strings are character sequences" concept, and keep the > UTF-16 surrogate issue where it belongs: in the codecs. Exactly! Using UTF-16 as the internal storage and defining new methods for accessing characters instead of code units essentially means implementing half a new string type. We'd have to duplicate every method Unicode objects=20 provide now. It would be two string type APIs combined in one type. Do we really need 2 1/2 string types? Bye, Walter D=F6rwald From mal@lemburg.com Tue Jun 26 10:54:36 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 26 Jun 2001 11:54:36 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> Message-ID: <3B385BDC.AB40A761@lemburg.com> Guido van Rossum wrote: > > I'm trying to reset this discussion to come to some sort of > conclusion. There's been a lot of useful input; I believe I've read > and understood it all. May the new thread subject serve as a summary > of my position. :-) > > Terminology: "character" is a Unicode code point; "unit" is a storage > unit, i.e. a 16-bit or 32-bit value. A "surrogate pair" is two > 16-bit storage units with special values that represent a single > character. I'll use "surrogate" for a single storage unit whose value > indicates that it should be part of a surrogate pair. The variable u > is a Python Unicode string object of some sort. 
> > There are several possible options for representing Unicode strings: > > 1. The current situation. I'd say that this uses UCS-2 for storage; > it doesn't pay any attention to surrogates. u[i] might be a lone > surrogate. unicode(i) where i is a lone surrogate value returns a > string containing a lone surrogate. An application could use the > unicode data type to store UTF-16 units, but it would have to be > aware of all the rules pertaining to surrogates. The codecs, > however, are surrogate-unaware. (Am I right that even the UTF-16 > codec pays *no* special attention to surrogates?) The UTF-16 decoder will raise an exception if it sees a surrogate. The encoder writes the internal format as-is without checking for surrogate usage. The UTF-8 codec is fully surrogate aware and will translate the input into UTF-16 surrogates if necessary. The encoder will translate UTF-16 surrogates into UTF-8 representations of the code point. > 2. The compromise proposal. This uses true UTF-16 for storage and > changes the interface to always deal in characters. unichr(i) > where i is a lone surrogate is illegal, and so are the > corresponding \u and \U encodings. unichr(i) for 0x10000 <= i < > 0x100000 will return a one-character string that happens to be > represented using a surrogate pair, but there's no way in Python to > find out (short of knowing the implementation). Codecs that are > capable of encoding full Unicode need to be aware of surrogate > pairs. > > 3. The ideal situation. This uses UCS-4 for storage and doesn't > require any support for surrogates except in the UTF-16 codecs (and > maybe in the UTF-8 codecs; it seems that encoded surrogate pairs > are legal in UTF-8 streams but should be converted back to a single > character). The support is required in all Unicode codecs (UTF-n, unicode-escape and raw-unicode-escape). > It's unclear to me whether the (illegal, according to > the Unicode standard) "characters" whose numerical value looks like > a lone surrogate should be entirely ruled out here, or whether a > dedicated programmer could create strings containing these. As Mark Davis told me, isolated surrogates are legal code points, but the resulting sequence is not a legal Unicode character sequence, since these code points (like a few others as well) are not considered characters. After all this discussion and the feedback from the Unicode mailing list, I think we should leave surrogate handling solely to the codecs and not deal with them in the internal storage. That is, it is the application's responsibility to make sure to create proper sequences of code points which can be used as character sequences. The codecs, OTOH, should be aware of what is and what is not considered a legal sequence. The default handling should be to follow the Unicode Consortium standard. If someone wants to have additional codecs which implement the ISO 10646 view of things with respect to UTF-n handling, then these can easily be supported by codec extension packages. > We > could make it hard by declaring unichr(i) with surrogate i and \u > and \U escapes that encode surrogates illegal, and by adding > explicit checks to codecs as appropriate, but a C extension could > still create an array containing illegal characters unless we do > draconian input checking. See above: it's better to leave these decisions to the applications using the Unicode implementation.
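For reference, the surrogate arithmetic that all of these codecs share is small. Here is a Python sketch of the standard UTF-16 pair <-> code point mapping (illustrative only -- the real codecs do this in C):

def combine_surrogates(hi, lo):
    # high surrogate in 0xD800..0xDBFF, low surrogate in 0xDC00..0xDFFF
    assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

def split_code_point(cp):
    # non-BMP code point in 0x10000..0x10FFFF -> (high, low) surrogates
    assert 0x10000 <= cp <= 0x10FFFF
    cp = cp - 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

E.g. combine_surrogates(0xD840, 0xDC00) gives 0x20000, matching the u"\U00020000" example later in this thread.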
> ...choose option 3... > > The only remaining question is how to provide an upgrade path to > option 3: > > A. At some Python version, we switch. Like Fredrik said: as soon as the implementation is ready. > B. Choose between 1 and 3 based on the platform. > > C. Make it a configuration-time choice. > > D. Make it a run-time choice. I'd rather not make it a choice: let's go with UCS-4 and be done with these problems once and for all! As a side effect, you could then also enjoy Unicode on Crays :-) Instead of adding an option which allows selecting between 2 or 4 bytes per code unit, I think people would rather like to see an option for disabling Unicode support completely (I know that the Pippy Team would :-). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From andy@reportlab.com Tue Jun 26 11:06:27 2001 From: andy@reportlab.com (Andy Robinson) Date: Tue, 26 Jun 2001 11:06:27 +0100 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <3B385BDC.AB40A761@lemburg.com> Message-ID: > I'd rather not make it a choice: let's go with UCS-4 and be > done with these problems once and for all! > > As a side effect, you could then also enjoy Unicode on Crays :-) I missed most of this thread, but I think there could be "marketing" benefits from proper UCS-4. I suspect a lot of other languages and libraries will be stuck with clunky workarounds and Python could be made out to be in the lead. That is, for the tiny number of people who care about these things :-) - Andy From tdickenson@geminidataloggers.com Tue Jun 26 13:49:12 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Tue, 26 Jun 2001 13:49:12 +0100 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <200106260851.f5Q8pcN10662@odiug.digicool.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> Message-ID: On Tue, 26 Jun 2001 04:51:38 -0400, Guido van Rossum wrote: >I see only one remaining argument against choosing 3 over 2: FUD about >disk and primary memory space usage. In previous discussion about unifying plain strings and unicode strings, someone (I forget who, sorry) proposed a unified string type that would store its data in arrays of either 1 or 2 byte elements (depending what was efficient for each string) but provide a unified interface independent of storage option. Could the same option be used to support an option E: individual strings use UCS-4 if they have to, but otherwise gain the space advantages of UCS-2? > >A. At some Python version, we switch. > >B. Choose between 1 and 3 based on the platform. > >C. Make it a configuration-time choice. > >D. Make it a run-time choice. Toby Dickenson tdickenson@geminidataloggers.com
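The per-string storage choice Toby describes could be as simple as this (a hedged sketch of the idea only; nothing like it exists in the implementation being discussed):

def storage_width(code_points):
    # pick the narrowest storage unit that can hold every code point
    # in this particular string; the string type would hide the choice
    for cp in code_points:
        if cp > 0xFFFF:
            return 4    # this string needs UCS-4 units
    return 2            # UCS-2 suffices for this string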
From tree@basistech.com Tue Jun 26 13:17:07 2001 From: tree@basistech.com (Tom Emerson) Date: Tue, 26 Jun 2001 08:17:07 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <009d01c0fe0e$566a7af0$4ffa42d5@hagrid> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> <15159.36453.486716.705433@cymru.basistech.com> <200106260018.f5Q0ItN01657@mira.informatik.hu-berlin.de> <15159.56854.539327.291739@cymru.basistech.com> <009d01c0fe0e$566a7af0$4ffa42d5@hagrid> Message-ID: <15160.32067.420276.464530@cymru.basistech.com> Fredrik Lundh writes: > it is not directly supported in Python 2.0, 2.1, and the > current 2.2 codebase. no amount of arguing or wishful > thinking will change that. It is supported insofar as I can write u"\U00020000" and get the UTF-16 encoded u"\ud840\udc00" back. If you limit the internal representation to UCS-2 then you constrain yourself only to Plane 0 and the surrogate pairs are undefined. Hence you would have to disallow the above notation. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From mal@lemburg.com Tue Jun 26 14:08:33 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 26 Jun 2001 15:08:33 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> Message-ID: <3B388951.7B40652C@lemburg.com> Toby Dickenson wrote: > > On Tue, 26 Jun 2001 04:51:38 -0400, Guido van Rossum > wrote: > > >I see only one remaining argument against choosing 3 over 2: FUD about > >disk and primary memory space usage. > > In previous discussion about unifying plain strings and unicode > strings, someone (I forget who, sorry) proposed a unified string > type that would store its data in arrays of either 1 or 2 byte > elements (depending what was efficient for each string) but provide a > unified interface independent of storage option. > > Could the same option be used to support an option E: individual > strings use UCS-4 if they have to, but otherwise gain the space > advantages of UCS-2? This makes the implementation more complicated: e.g. SRE would then have to be provided in three flavours: 8-bit, 16-bit and 32-bit. Same for most of the codecs. Maintenance will become a nightmare, the Python interpreter will put on weight and we will probably not gain much with respect to overall memory usage (external storage will use one of the encodings which can be chosen on a per-application basis). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tree@basistech.com Tue Jun 26 13:40:27 2001 From: tree@basistech.com (Tom Emerson) Date: Tue, 26 Jun 2001 08:40:27 -0400 Subject: [I18n-sig] Unicode surrogates: just say no!
In-Reply-To: <200106260851.f5Q8pcN10662@odiug.digicool.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> Message-ID: <15160.33467.686959.415021@cymru.basistech.com> Guido van Rossum writes: > 3. The ideal situation. This uses UCS-4 for storage and doesn't > require any support for surrogates except in the UTF-16 codecs (and > maybe in the UTF-8 codecs; it seems that encoded surrogate pairs > are legal in UTF-8 streams but should be converted back to a single > character). It's unclear to me whether the (illegal, according to > the Unicode standard) "characters" whose numerical value looks like > a lone surrogate should be entirely ruled out here, or whether a > dedicated programmer could create strings containing these. We > could make it hard by declaring unichr(i) with surrogate i and \u > and \U escapes that encode surrogates illegal, and by adding > explicit checks to codecs as appropriate, but a C extension could > still create an array containing illegal characters unless we do > draconian input checking. UTF-8 can be used to encode each half of a surrogate pair (resulting in six bytes for the character) --- a proposal for this was presented by PeopleSoft at the UTC meeting last month. UTF-8 can also encode the code-point directly in four bytes. As Marc-Andre said in his response, you can have a valid stream of Unicode characters with half a surrogate pair: that character, however, is undefined. > I see only one remaining argument against choosing 3 over 2: FUD about > disk and primary memory space usage. At the last IUC in Hong Kong some developers from SAP presented data against the use of UCS-4/UTF-32 as an internal representation. In their benchmarks they found that the overhead of cache-misses due to the increased character width was far more detrimental to runtime than having to deal with the odd surrogate pair in a UTF-16 encoded string. After the presentation several people (myself, Asmus Freytag, Toby Phipps of PeopleSoft, and Paul Laenger of Software AG) had a little chat about this issue and couldn't agree whether this was really a big problem or not. I think it bears more research. However, I agree that using UCS-4/UTF-32 as the internal string representation is the best solution. Remember too that glibc uses UCS-4 as its internal wchar_t representation. This was also discussed at the Li18nux meetings a couple of years ago. > A. At some Python version, we switch. > > B. Choose between 1 and 3 based on the platform. > > C. Make it a configuration-time choice. Defaulting to UCS-4? > We could use B to determine the default choice, e.g. we could choose > between option 1 and 3 depending on the platform's wchar_t; but it > would be bad not to have a way to override this default, so we > couldn't exploit the correspondence much. Some code could be > #ifdef'ed out when Py_UNICODE == wchar_t, but there would always have > to be code to support these two having different sizes. Seems to me this could add complexity and reliance on platform functionality that may not be consistent. Are the savings worth the complexity? > The outcome of the choice must be available at run-time, because it > may affect certain codecs. Maybe sys.maxunicode could be the largest > character value supported, i.e. 0xffff or 0xfffff? or 0x10ffff? -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever"
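As an aside on the run-time check being discussed here: application code would simply branch on the attribute (a sketch of the proposed interface, assuming the two values discussed above; this is in fact how sys.maxunicode behaves in Python 2.2 and later):

import sys

if sys.maxunicode == 0xFFFF:
    # narrow build: non-BMP characters occupy two storage units
    print "narrow build; expect surrogate pairs"
else:
    # wide build: sys.maxunicode == 0x10FFFF, one unit per code point
    print "wide build; one storage unit per character"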
From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 15:53:35 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 16:53:35 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <200106260851.f5Q8pcN10662@odiug.digicool.com> (message from Guido van Rossum on Tue, 26 Jun 2001 04:51:38 -0400) References: <200106260851.f5Q8pcN10662@odiug.digicool.com> Message-ID: <200106261453.f5QErZP01348@mira.informatik.hu-berlin.de> > Martin has hinted at a solution requiring even less memory per string > object, but I don't know for sure what he is thinking of. All I can > imagine is a single flag saying "this string contains no surrogates". That was my original idea. I later thought having a count of surrogate pairs would be better, since it allows len() to be computed in constant time. Indexing would be linear time only for strings containing surrogates, otherwise constant time also. > But either way, I believe that this requires that every part of the > Unicode implementation be changed to become aware of the difference > between characters and storage units. Every piece of C code that > currently deals with indices into arrays of Py_UNICODE storage units > will have to be changed. One could try to reduce the impact of the change, in particular when expecting your solution 3 (i.e. a 32-bit Py_UNICODE). E.g. code that currently reads

if (start < 0)
    start += self->length;
if (start < 0)
    start = 0;

would then read

if (start < 0)
    start += Py_UNICODE_LENGTH(self);
if (start < 0)
    start = 0;
start = Py_UNICODE_UNIT_OF(self, start);

where Py_UNICODE_UNIT_OF converts from character indices to unit indices, and is implemented as

#ifdef Py_UNICODE_4_BYTES
#define Py_UNICODE_UNIT_OF(str,x) x
#else
#define Py_UNICODE_UNIT_OF(str,x) (str->surrogates?Py_UnicodeUnitOf(str,x):x)
#endif

Not that I particularly like that approach; I'm just pointing out it is feasible. [on sre] > There are two parts to this: the internal > engine needs to realize that e.g. "." and certain "[...]" sets may > match a surrogate pair, and the indices returned by e.g. the span() > method of match objects should be translated to character indices as > expected by the applications. For character classes, it may be acceptable that they must only contain BMP characters; span would use the conversion macros, and . would need special casing. I agree this is terrible, but it could work. > I think the disk space usage problem is dealt with easily by choosing > appropriate encodings; UTF-8 and UTF-16 are both great space-savers, > and I doubt many sites will store large amounts of UCS-4 directly, > given that good codecs are available. For application data, the internal representation is irrelevant; it is not easy to get at the internal representation to write a string to a file (you have to use a codec). For marshal, backward compatibility becomes an issue; UTF-16 is the obvious choice. For pickle, UTF-8 or raw-unicode-escape is used, anyway. > The only remaining question is how to provide an upgrade path to > option 3: > > A. At some Python version, we switch. > > B. Choose between 1 and 3 based on the platform. > > C. Make it a configuration-time choice. > > D. Make it a run-time choice. > > I think we all agree that D is bad. I'd say that C is the best; > eventually (say, when Windows is fixed :-) the choice becomes > unnecessary. I don't think it will be hard to support C, with some > careful coding.
The biggest danger is that binary C modules are exchanged between installations, e.g. pyd DLLs or RPMs. With distutils, it is really easy to create these, so we should be careful that they break meaningfully instead of just crashing. So I suppose your "careful coding" includes Py_InitModule magic. > We could use B to determine the default choice, e.g. we could choose > between option 1 and 3 depending on the platform's wchar_t; but it > would be bad not to have a way to override this default, so we > couldn't exploit the correspondence much. Still, exploiting the platform's wchar_t might avoid copies in some cases (I'm thinking of my iconv codec in particular), so that would give a speed-up. > The outcome of the choice must be available at run-time, because it > may affect certain codecs. Maybe sys.maxunicode could be the largest > character value supported, i.e. 0xffff or 0xfffff? It's actually 0x10ffff, since UTF-16 allows for 16 additional planes, but yes, that interface sounds good. Regards, Martin From tree@basistech.com Tue Jun 26 15:39:51 2001 From: tree@basistech.com (Tom Emerson) Date: Tue, 26 Jun 2001 10:39:51 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <200106261453.f5QErZP01348@mira.informatik.hu-berlin.de> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <200106261453.f5QErZP01348@mira.informatik.hu-berlin.de> Message-ID: <15160.40631.208461.386096@cymru.basistech.com> Martin v. Loewis writes: > > Martin has hinted at a solution requiring even less memory per string > > object, but I don't know for sure what he is thinking of. All I can > > imagine is a single flag saying "this string contains no surrogates". > > That was my original idea. I later thought having a count of surrogate > pairs would be better, since it allows len() to be computed in constant > time. Indexing would be linear time only for strings containing > surrogates, otherwise constant time also. Just so I understand: the codec will set this flag/length when it transcodes to the internal representation? > [on sre] > > There are two parts to this: the internal > > engine needs to realize that e.g. "." and certain "[...]" sets may > > match a surrogate pair, and the indices returned by e.g. the span() > > method of match objects should be translated to character indices as > > expected by the applications. > > For character classes, it may be acceptable that they must only contain BMP > characters; span would use the conversion macros, and . would need > special casing. I agree this is terrible, but it could work. UTR #18 describes the impact of surrogates on regular expressions: http://www.unicode.org/unicode/reports/tr18/#Surrogates > Still, exploiting the platform's wchar_t might avoid copies in some > cases (I'm thinking of my iconv codec in particular), so that would > give a speed-up. Excellent point. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever"
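To make the bookkeeping Martin describes concrete, here is an illustrative Python model of the C-level idea (the names are invented for this sketch; units stands for the Py_UNICODE array viewed as integers, npairs for the stored surrogate-pair count):

def char_len(units, npairs):
    # len() in characters: storage units minus one per surrogate pair
    return len(units) - npairs

def unit_index(units, npairs, char_index):
    # map a character index to a storage-unit index: constant time for
    # strings without surrogates, linear otherwise
    if npairs == 0:
        return char_index
    i = 0
    for k in range(char_index):
        if 0xD800 <= units[i] <= 0xDBFF:    # high surrogate: skip pair
            i = i + 2
        else:
            i = i + 1
    return i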
From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 17:37:32 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 18:37:32 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <15160.40631.208461.386096@cymru.basistech.com> (message from Tom Emerson on Tue, 26 Jun 2001 10:39:51 -0400) References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <200106261453.f5QErZP01348@mira.informatik.hu-berlin.de> <15160.40631.208461.386096@cymru.basistech.com> Message-ID: <200106261637.f5QGbWQ01763@mira.informatik.hu-berlin.de> > > That was my original idea. I later thought having a count of surrogate > > pairs would be better, since it allows len() to be computed in constant > > time. Indexing would be linear time only for strings containing > > surrogates, otherwise constant time also. > > Just so I understand: the codec will set this flag/length when it > transcodes to the internal representation? Depends on how it is written. At the C level, it could provide a surrogate count when creating a string, or it could give -1, in which case the implementation would count the surrogates. At the Python level, there would be no interface for finding out the number of surrogates, or setting them. Instead, unichr invocations with arguments above 0xffff would set the count. Regards, Martin From guido@digicool.com Tue Jun 26 18:00:44 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 26 Jun 2001 13:00:44 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Tue, 26 Jun 2001 11:54:36 +0200." <3B385BDC.AB40A761@lemburg.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> Message-ID: <200106261700.f5QH0ih14770@odiug.digicool.com> (Mass followup.) > From: "M.-A. Lemburg" > The UTF-16 decoder will raise an exception if it sees a surrogate. > The encoder writes the internal format as-is without checking for > surrogate usage. Hm, isn't this asymmetric? I'd imagine that either behavior (exception or copy as-is) can be useful in either direction at times, so this should be an option (maybe a different codec name?). > The UTF-8 codec is fully surrogate aware and will translate > the input into UTF-16 surrogates if necessary. The encoder > will translate UTF-16 surrogates into UTF-8 representations > of the code point. Good. This (like the UTF-16 codec's behavior) will have to be made conditional on sizeof(Py_UNICODE) in my proposal. > As Mark Davis told me, isolated surrogates are legal code > points, but the resulting sequence is not a legal Unicode > character sequence, since these code points (like a few others > as well) are not considered characters. Let me use this as an excuse to start a discussion on how far we should go in ruling out illegal code points. I think that *codecs* would be wise to be picky about illegal code points (except for the special UTF-16-pass-through option). But I think that the *datatype implementation* should allow storage units to take every possible value, whether or not it's illegal according to Unicode, either in isolation or in context. It's much easier to implement that way, and I believe that the checks ought to be in other tools.
In particular, I propose:

- in all cases:
  - \udddd and \Udddddddd always behave the same as unichr(0xdddd) or unichr(0xdddddddd)

- with 16-bit (narrow) Py_UNICODE:
  - unichr(i) for 0 <= i <= 0xffff always returns a size-one string where ord(u[0]) == i
  - unichr(i) for 0x10000 <= i <= 0x10ffff (and hence corresponding \u and \U) generates a surrogate pair, where u[0] is the high surrogate value and u[1] the low surrogate value
  - unichr(i) for i >= 0x110000 (and hence corresponding \u and \U) raises an exception at Python-to-bytecode compile-time

- with 32-bit (wide) Py_UNICODE:
  - unichr(i) for 0 <= i <= 0xffffffff always returns a size-one string where ord(u[0]) == i

I expect that the surrogate generation rule will be controversial, so let me explain why I think it's the best possible rule. We're adding a difference between Python implementations here: some can only represent code points up to 0xffff directly, others can represent all 32-bit code points. This is no different (IMO) than having sys.maxint vary between platforms, or having thread support be platform dependent, or having several choices from the *dbm family of modules. We'll tell users their platform properties: sys.maxunicode is either 0xffff or 0x10ffff. Users can choose to write code that only runs with wide Unicode strings. They ought to put "assert sys.maxunicode>=0x10ffff" somewhere in their program, but that's their choice -- they can also just document it, or only run it on their own system which they configured for wide Unicode. Users can choose to write code that doesn't use Unicode characters outside the basic plane. They don't have to do anything special. Users can choose to write code that's portable between the two versions by using surrogates on the narrow platform but not on the wide platform. (This would be a good idea for backward compatibility with Python 2.0 and 2.1 anyway.) The proposed (and current!) behavior of \U makes it easy for them to do the right thing with string literals; everything else, they just have to write code that won't separate surrogate halves. Making unichr() and the \U escape behave the same regardless of platform makes more sense than the current situation, where unichr() refuses characters larger than 0xffff, but \U translates them into surrogates. I *don't* think \U should be limited to a notation to create surrogates. I also don't think it's wise to stop creating surrogates from \U when appropriate. I *don't* think it's wise to let unichr() balk at input values that happen to be lone surrogates. It is easy enough to avoid these in applications (if the application gets its input from a codec, it should be safe already), and it would prevent code that knows what it's doing from doing stuff beyond the Unicode standard du jour. That would be unpythonic. > After all this discussion and the feedback from the Unicode > mailing list, I think we should leave surrogate handling > solely to the codecs and not deal with them in the internal > storage. That is, it is the application's responsibility to > make sure to create proper sequences of code points which can > be used as character sequences. Exactly what I say above. > The codecs, OTOH, should be aware of what is and what is not > considered a legal sequence. The default handling should be to > follow the Unicode Consortium standard. If someone wants to > have additional codecs which implement the ISO 10646 view of things > with respect to UTF-n handling, then these can easily be supported > by codec extension packages. Yes.
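Spelled out in code, the proposed narrow/wide unichr() behaves roughly like this Python model (semantics only, not the C implementation; it caps the argument at 0x10ffff on both build types, as Fredrik's implementation quoted later does, whereas the proposal above would allow larger values on wide builds):

def unichr_model(i, wide):
    # return the storage units the proposed unichr(i) would produce;
    # wide stands for sizeof(Py_UNICODE) == 4
    if i < 0 or i > 0x10FFFF:
        raise ValueError("unichr() arg out of range")
    if wide or i <= 0xFFFF:
        return [i]                       # a single storage unit
    i = i - 0x10000                      # narrow build: surrogate pair
    return [0xD800 + (i >> 10), 0xDC00 + (i & 0x3FF)]

E.g. unichr_model(0x10000, 0) yields [0xD800, 0xDC00], i.e. u'\ud800\udc00', matching the interactive session below.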
> > We > > could make it hard by declaring unichr(i) with surrogate i and \u > > and \U escapes that encode surrogates illegal, and by adding > > explicit checks to codecs as appropriate, but a C extension could > > still create an array containing illegal characters unless we do > > draconian input checking. > > See above: it's better to leave these decisions to the applications > using the Unicode implementation. We agree! > > ...choose option 3... > > > > The only remaining question is how to provide an upgrade path to > > option 3: > > > > A. At some Python version, we switch. > > Like Fredrik said: as soon as the implementation is ready. But will the users be ready? > > B. Choose between 1 and 3 based on the platform. > > > > C. Make it a configuration-time choice. > > > > D. Make it a run-time choice. > > I'd rather not make it a choice: let's go with UCS-4 and be > done with these problems once and for all! I assert that it's easy enough to write code that is indifferent to sizeof(Py_UNICODE). See SRE as a proof. I expect that not all Unicode users will be ready to embrace UCS-4. I don't want to hear people say "I don't want to upgrade to Python 2.2 because it wastes 4 bytes per Unicode character, but all I ever do is bandy around basic plane characters." Given that there's currently very limited need for characters outside the basic plane, I want to be able to say that Python 2.2 is UCS-4 ready, but not that it always uses it. > As a side effect, you could then also enjoy Unicode on Crays :-) Indeed. > Instead of adding an option which allows selecting between > 2 or 4 bytes per code unit, I think people would rather like > to see an option for disabling Unicode support completely (I know that > the Pippy Team would :-). That's definitely another configuration switch that I would like to see. How hard would it be? > From: Toby Dickenson > In previous discussion about unifying plain strings and unicode > strings, someone (I forget who, sorry) proposed a unified string > type that would store its data in arrays of either 1 or 2 byte > elements (depending what was efficient for each string) but provide a > unified interface independent of storage option. > > Could the same option be used to support an option E: individual > strings use UCS-4 if they have to, but otherwise gain the space > advantages of UCS-2? I agree with MAL's rebuttal: this would just make things more complicated all over the place. > From: Tom Emerson > UTF-8 can be used to encode each half of a surrogate pair > (resulting in six bytes for the character) --- a proposal for this was > presented by PeopleSoft at the UTC meeting last month. UTF-8 can also > encode the code-point directly in four bytes. But isn't the direct encoding highly preferable? When would you ever want your UTF-8 to be encoded UTF-16? > As Marc-Andre said in his response, you can have a valid stream of Unicode > characters with half a surrogate pair: that character, however, is > undefined. I guess the UTF-8 codec would have to deal with unpaired surrogates somehow, but I would prefer it if normally it would peek ahead and encode a valid surrogate pair as the correct 4-byte sequence. > > I see only one remaining argument against choosing 3 over 2: FUD about > > disk and primary memory space usage. > > At the last IUC in Hong Kong some developers from SAP presented data > against the use of UCS-4/UTF-32 as an internal representation.
In > their benchmarks they found that the overhead of cache-misses due to > the increased character width was far more detrimental to runtime > than having to deal with the odd surrogate pair in a UTF-16 encoded > string. After the presentation several people (myself, Asmus Freytag, > Toby Phipps of PeopleSoft, and Paul Laenger of Software AG) had a > little chat about this issue and couldn't agree whether this was > really a big problem or not. I think it bears more research. Yet another reason to offer a configuration choice between 2-byte and 4-byte Py_UNICODE, until we know the answer. (I'm sure it depends on what the application does with the data too!) > However, I agree that using UCS-4/UTF-32 as the internal string > representation is the best solution. Well, I find it infinitely better than trying to use UTF-16 as the internal representation but coercing the interface into dealing with characters and character indices uniformly. > Remember too that glibc uses UCS-4 as its internal wchar_t > representation. This was also discussed at the Li18nux meetings a > couple of years ago. But I don't think there are many Linux applications that use wchar_t extensively yet. At least I haven't seen any. (Does anyone know if Mozilla's Asian character support uses wchar_t or Unicode?) > > A. At some Python version, we switch. > > > > B. Choose between 1 and 3 based on the platform. > > > > C. Make it a configuration-time choice. > > Defaulting to UCS-4? Unclear. We'll have to user-test this default and see what the performance hit really is. > > We could use B to determine the default choice, e.g. we could choose > > between option 1 and 3 depending on the platform's wchar_t; but it > > would be bad not to have a way to override this default, so we > > couldn't exploit the correspondence much. Some code could be > > #ifdef'ed out when Py_UNICODE == wchar_t, but there would always have > > to be code to support these two having different sizes. > > Seems to me this could add complexity and reliance on platform > functionality that may not be consistent. Are the savings worth the > complexity? Given that the benefits of UCS-4 are unclear at this point, I think we should be cautious and support both UCS-2 and UCS-4 on all platforms (except maybe Crays :-). > > The outcome of the choice must be available at run-time, because it > > may affect certain codecs. Maybe sys.maxunicode could be the largest > > character value supported, i.e. 0xffff or 0xfffff? > > or 0x10ffff? Yes, I forgot about the 17th plane. > From: "M.-A. Lemburg" > From: "Martin v. Loewis" [sketches implementation idea] > Not that I particularly like that approach; I'm just pointing out it is > feasible. I still find this approach very unattractive, and I doubt that it will be possible to make all aspects of the interface uniform. What would be a good reason to try this? It's by far the most work of all options. > [on sre] > For character classes, it may be acceptable that they must only contain BMP > characters; span would use the conversion macros, and . would need > special casing. I agree this is terrible, but it could work. I doubt that Fredrik would want to maintain it. > > I think the disk space usage problem is dealt with easily by choosing > > appropriate encodings; UTF-8 and UTF-16 are both great space-savers, > > and I doubt many sites will store large amounts of UCS-4 directly, > > given that good codecs are available.
> > For application data, the internal representation is irrelevant; it is > not easy to get at the internal representation to write a string to a > file (you have to use a codec). For marshal, backward compatibility > becomes an issue; UTF-16 is the obvious choice. For pickle, UTF-8 or > raw-unicode-escape is used, anyway. Huh? Marshal uses UTF-8 now. Since the UTF-8 codec is already fully surrogate-aware, shouldn't it do the right thing? E.g. on a "narrow" platform, encoding a Unicode string containing a surrogate pair generates the UTF-8 4-byte encoding of the corresponding Unicode character, and decoding that UTF-8 representation will create a surrogate pair. On a wide platform, that same UTF-8 encoding will be turned into a single character correctly (assuming the UTF-8 codec is adapted to the wide platform; I presume this code doesn't exist yet). So if either platform takes a string literal containing a \U escape for a non-basic-plane character, and marshals the resulting string, they get the same marshalled value, and they can both read it back correctly. (Try it! It works.) > The biggest danger is that binary C modules are exchanged between > installations, e.g. pyd DLLs or RPMs. With distutils, it is really > easy to create these, so we should be careful that they break > meaningfully instead of just crashing. So I suppose your "careful > coding" includes Py_InitModule magic. Good point! > Still, exploiting the platform's wchar_t might avoid copies in some > cases (I'm thinking of my iconv codec in particular), so that would > give a speed-up. Yes, but I don't want to *force* users to use UCS-4. (Yet; in a few years' time this may change.) We have this code now, so it shouldn't be too hard to keep it. PEP time? --Guido van Rossum (home page: http://www.python.org/~guido/) From tree@basistech.com Tue Jun 26 17:40:48 2001 From: tree@basistech.com (Tom Emerson) Date: Tue, 26 Jun 2001 12:40:48 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <200106261700.f5QH0ih14770@odiug.digicool.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> Message-ID: <15160.47888.58634.946673@cymru.basistech.com> Guido van Rossum writes: > > UTF-8 can be used to encode each half of a surrogate pair > > (resulting in six bytes for the character) --- a proposal for this was > > presented by PeopleSoft at the UTC meeting last month. UTF-8 can also > > encode the code-point directly in four bytes. > > But isn't the direct encoding highly preferable? When would you ever > want your UTF-8 to be encoded UTF-16? Amen. There were other reasons related to sort orders that I'm not clear on as I didn't pay much attention to non-Asian issues. > > Remember too that glibc uses UCS-4 as its internal wchar_t > > representation. This was also discussed at the Li18nux meetings a > > couple of years ago. > > But I don't think there are many Linux applications that use wchar_t > extensively yet. At least I haven't seen any. (Does anyone know if > Mozilla's Asian character support uses wchar_t or Unicode?) I don't have statistics on this, but I don't think it much matters: I doubt Linux application developers are failing to use wchar_t because it is 4 bytes. I merely point to glibc as an example where a conscious decision was made to go with a 4-byte wide character type in order to allow for easy future growth without being constrained by alternate transformation formats of Unicode.
Ulrich Drepper made the right choice, which was supported by the Li18nux group, which includes the Linux vendors as well as IBM and Basis. -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From fredrik@pythonware.com Tue Jun 26 18:28:10 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Tue, 26 Jun 2001 19:28:10 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> Message-ID: <004e01c0fe65$5fe418f0$4ffa42d5@hagrid> Guido wrote: > I assert that it's easy enough to write code that is indifferent to > sizeof(Py_UNICODE). See SRE as a proof. I just checked in a couple of patches which fix some obvious problems for sizeof(Py_UNICODE) > 2 (so sue me ;-). most everything seems to work (the UTF-16 codec is a notable exception). there's a new (experimental) define in Include/unicodeobject.h:

#undef USE_UCS4_STORAGE

if defined, Py_UNICODE is set to the same thing as Py_UCS4. Cray users may want to define it... Cheers /F From tim@digicool.com Tue Jun 26 18:32:39 2001 From: tim@digicool.com (Tim Peters) Date: Tue, 26 Jun 2001 13:32:39 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106260526.f5Q5Q3900934@mira.informatik.hu-berlin.de> Message-ID: [Tom Emerson] > Perhaps not. :-) But the Chinese aren't the only ones to worry > about. The Japanese also have characters being added outside the BMP, > and Ruby holds sway in Japan... [Martin v. Loewis] > That's a good point. How does Ruby deal with surrogates? Ruby has some support for UTF-8 now, but Matz (Ruby's dad) is much more a Mule fan: http://www.m17n.org/ He's said that Ruby will eventually treat Unicode as "just another character set" -- along with every other character-set gimmick ever invented. > Java JDK 1.4? Perl? Tcl? Windows XP? Oh, go do your own web search <wink>. From fredrik@pythonware.com Tue Jun 26 18:59:13 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Tue, 26 Jun 2001 19:59:13 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> Message-ID: <00bc01c0fe69$b6cf1620$4ffa42d5@hagrid> Guido wrote: > PEP time? yes (based on this mail + your previous mail). I can write the code if someone else writes the PEP... Cheers /F From fredrik@pythonware.com Tue Jun 26 19:27:50 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Tue, 26 Jun 2001 20:27:50 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> Message-ID: <00d801c0fe6d$b5f81a40$4ffa42d5@hagrid> guido wrote: > - with 16-bit (narrow) Py_UNICODE: > > - unichr(i) for 0 <= i <= 0xffff always returns a size-one string > where ord(u[0]) == i > > - unichr(i) for 0x10000 <= i <= 0x10ffff (and hence corresponding \u > and \U) generates a surrogate pair, where u[0] is the high > surrogate value and u[1] the low surrogate value > > - unichr(i) for i >= 0x110000 (and hence corresponding \u and \U) > raises an exception at Python-to-bytecode compile-time or in other words:

>>> unichr.__doc__
'unichr(i) -> Unicode character\n\nReturn a Unicode string of one character with ordinal i; 0 <= i < 1114112.'
>>> unichr(-1)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: unichr() arg not in range(1114111)
>>> unichr(0)
u'\x00'
>>> unichr(1)
u'\x01'
>>> unichr(256)
u'\u0100'
>>> unichr(55296)
u'\ud800'
>>> unichr(65535)
u'\uffff'
>>> unichr(65536)
u'\ud800\udc00'
>>> unichr(1114111)
u'\udbff\udfff'
>>> unichr(1114112)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: unichr() arg not in range(1114111)
>>> "\U00000000"
'\\U00000000'
>>> "\U00000100"
'\\U00000100'
>>> u"\U00000100"
u'\u0100'
>>> u"\U00000000"
u'\x00'
>>> u"\U00000000"
u'\x00'
>>> u"\U00000100"
u'\u0100'
>>> u"\U0000d800"
u'\ud800'
>>> u"\U0000ffff"
u'\uffff'
>>> u"\U00010000"
u'\ud800\udc00'
>>> u"\U0010ffff"
u'\udbff\udfff'
>>> u"\U00110000"
UnicodeError: Unicode-Escape decoding error: illegal Unicode character

(\U behaviour as in 2.1, unichr as in my development version of 2.2)

note that unichr raises a ValueError, not a UnicodeError. should this be changed? Cheers /F From guido@digicool.com Tue Jun 26 20:39:16 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 26 Jun 2001 15:39:16 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Tue, 26 Jun 2001 20:27:50 +0200." <00d801c0fe6d$b5f81a40$4ffa42d5@hagrid> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <00d801c0fe6d$b5f81a40$4ffa42d5@hagrid> Message-ID: <200106261939.f5QJdGY16026@odiug.digicool.com> > guido wrote: > > > - with 16-bit (narrow) Py_UNICODE: > > > > - unichr(i) for 0 <= i <= 0xffff always returns a size-one string > > where ord(u[0]) == i > > > > - unichr(i) for 0x10000 <= i <= 0x10ffff (and hence corresponding \u > > and \U) generates a surrogate pair, where u[0] is the high > > surrogate value and u[1] the low surrogate value > > > > - unichr(i) for i >= 0x110000 (and hence corresponding \u and \U) > > raises an exception at Python-to-bytecode compile-time > > or in other words: > > >>> unichr.__doc__ > 'unichr(i) -> Unicode character\n\nReturn a Unicode string of one character with > ordinal i; 0 <= i < 1114112.' I would write 0 <= i <= 0x10ffff, but otherwise, yes. Check it in already! > note that unichr raises a ValueError, not a UnicodeError. should this > be changed? I think not. The input value is wrong, that's a ValueError. There are lots of ValueErrors in the Unicode implementation. There are lots of UnicodeErrors too; the distinction isn't always clear. MAL? --Guido van Rossum (home page: http://www.python.org/~guido/) From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 20:43:12 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 21:43:12 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <200106261700.f5QH0ih14770@odiug.digicool.com> (message from Guido van Rossum on Tue, 26 Jun 2001 13:00:44 -0400) References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> Message-ID: <200106261943.f5QJhCh20482@mira.informatik.hu-berlin.de>
Somebody please correct me: A conforming implementation must never encode a non-BMP character with six bytes in UTF-8; security people will shoot you if you say that two alternative representations for the same string are possible. HOWEVER, I think what the spec says is that implementations shall accept non-BMP characters encoded in six-byte UTF-8. This is because buggy implementations may produce such output, and because that was previously left unspecified, so accepting such UTF-8 strings improves interoperability. > Huh? Marshal uses UTF-8 now. Oops, I should have checked. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 20:46:20 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 21:46:20 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: References: Message-ID: <200106261946.f5QJkKe20513@mira.informatik.hu-berlin.de> > > Java JDK 1.4? Perl? Tcl? Windows XP? > > Oh, go do your own web search <wink>. I could have answered the Perl and Tcl cases myself: both use UTF-8 internally, so they are never confronted with surrogates in their representation. The other two were rather polemic, since I don't really expect them to support other planes in some meaningful way - without checking, of course. Regards, Martin From paulp@ActiveState.com Tue Jun 26 21:31:08 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 26 Jun 2001 13:31:08 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> Message-ID: <3B38F10B.CCA55437@ActiveState.com> Guido van Rossum wrote: > >... > > I expect that not all Unicode users will be ready to embrace UCS-4. I > don't want to hear people say "I don't want to upgrade to Python 2.2 > because it wastes 4 bytes per Unicode character, but all I ever do is > bandy around basic plane characters." Given that there's currently > very limited need for characters outside the basic plane, I want to be > able to say that Python 2.2 is UCS-4 ready, but not that it always > uses it. I'm not dead-set against this but I want to point out that binary distributors are probably not going to bother shipping two different binaries. So the silent majority of Python users who download precompiled binaries are going to have a "flag day" where Python changes its default behaviour. Given infinite resources, I'd rather see "best of both worlds" implementations such as a flag on the Unicode object that chooses its internal representation (i.e. a speed tweak for the knowledgeable) or objects that "fall back" from ASCII to UCS-2 to UCS-4 depending on the input data. Or even a unicode32() data type that was interoperable with unicode16. (and the default could change from one to the other someday) I accept that in a world of finite resources there may be nobody interested enough to put in that effort but I'd rather see the option excluded on that basis rather than just because the code becomes more complex. The code complexity would be worth it if it prevents a minor fork in Python and varying behavior on different Pythons. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From guido@digicool.com Tue Jun 26 21:36:33 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 26 Jun 2001 16:36:33 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Tue, 26 Jun 2001 13:31:08 PDT."
<3B38F10B.CCA55437@ActiveState.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B38F10B.CCA55437@ActiveState.com> Message-ID: <200106262036.f5QKaX618195@odiug.digicool.com> > > I expect that not all Unicode users will be ready to embrace UCS-4. I > > don't want to hear people say "I don't want to upgrade to Python 2.2 > > because it wastes 4 bytes per Unicode character, but all I ever do is > > bandy around basic plane characters." Given that there's currently > > very limited need for characters outside the basic plane, I want to be > > able to say that Python 2.2 is UCS-4 ready, but not that it always > > uses it. > > I'm not dead-set against this but I want to point out that binary > distributors are probably not going to bother shipping two different > binaries. So the silent majority of Python users who download > precompiled binaries are going to have a "flag day" where Python changes > its default behaviour. Distributors know their users best -- they can decide when it's time. E.g. I expect Asian Linux distributors to take the lead here, and American distributors to follow last, with European distributors in the middle. Users with different wishes (most likely users with a desire for UCS-4 in a UCS-2 world) can always build from source. > Given infinite resources, I'd rather see "best of both worlds" > implementations such as a flag on the Unicode object that chooses its > internal representation (i.e. a speed tweak for the knowledgeable) or > objects that "fall back" from ASCII to UCS-2 to UCS-4 depending on the > input data. Or even a unicode32() data type that was interoperable with > unicode16. (and the default could change from one to the other someday) > > I accept that in a world of finite resources there may be nobody > interested enough to put in that effort but I'd rather see the option > excluded on that basis rather than just because the code becomes more > complex. The code complexity would be worth it if it prevents a minor > fork in Python and varying behavior on different Pythons. But you don't have to maintain it. I say that this particular varying behavior is just as acceptable as the varying int size. Do you want to write the PEP? --Guido van Rossum (home page: http://www.python.org/~guido/) From rick@unicode.org Tue Jun 26 21:38:48 2001 From: rick@unicode.org (Rick McGowan) Date: Tue, 26 Jun 2001 13:38:48 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <200106261700.f5QH0ih14770@odiug.digicool.com> (message from Guido van Rossum on Tue, 26 Jun 2001 13:00:44 -0400) Message-ID: <200106261831.OAA22124@unicode.org> > Somebody please correct me: A conforming implementation must never > encode a non-BMP character with six bytes in UTF-8; security people > will shoot you if you say that two alternative representations for the > same string are possible. >... > HOWEVER, I think what the spec says is that implementations shall accept > non-BMP characters encoded in six-byte UTF-8. This is The spec has been recently changed to eliminate the ambiguity precisely because of security restrictions. You are never allowed to produce "non-shortest form". The correct, conforming way to encode surrogate pairs in UTF-8 is to convert the pair to UTF-32, and then convert the UTF-32 entity to UTF-8. See: http://www.unicode.org/unicode/reports/tr27/ which is the definition of Unicode 3.1.
It says in the intro: Most notable among the corrigenda to the standard is a tightening of the definition of UTF-8, to eliminate a possible security issue with non-shortest-form UTF-8. Later, there is a section "UTF-8 Corrigendum", which starts with the text shown below. This always results in a UTF-8 sequence <= 4 bytes in length, for all valid Unicode characters 0..10FFFF. (BTW, I have also been working on an updated reference code for the various UTF transformations, but have not yet posted it due to the controversy surrounding the so-called UTF-8S proposal.) Rick ------------------------------------------------------ UTF-8 Corrigendum The current conformance clause C12 in The Unicode Standard, Version 3.0 forbids the generation of "non-shortest form" UTF-8, and forbids the interpretation of illegal sequences, but not the interpretation of "non-shortest form". Where software does interpret the non-shortest forms, security issues can arise. For example: Process A performs security checks, but does not check for non-shortest forms. Process B accepts the byte sequence from process A, and transforms it into UTF-16 while interpreting non-shortest forms. The UTF-16 text may then contain characters that should have been filtered out by process A. To address this issue, the Unicode Technical Committee has modified the definition of UTF-8 to forbid conformant implementations from interpreting non-shortest forms for BMP characters, and clarified some of the conformance clauses. From mal@lemburg.com Mon Jun 25 21:28:45 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 22:28:45 +0200 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? References: <3B3722DB.1FF54794@lemburg.com> <4ak820g418.fsf@kern.srcf.societies.cam.ac.uk> <006501c0fd82$8b5ba9f0$0c680b41@c1340594a> <005b01c0fd9d$e4469e60$1a2cf7c2@oakdale2> <013901c0fdac$d27d1970$0c680b41@c1340594a> Message-ID: <3B379EFD.40F88FC5@lemburg.com> Mark Davis wrote: > > That is an interesting approach; one that basically amounts to some > convenience functions. For example, instead of writing: > > myString.substring(myString.cpToIndex(3), myString.cpToIndex(5)); > > you could write: > > myString.substring(3, 5, myString.CODEPOINT); > > This hides some of the work, when someone is working in code points. The > performance cost is still there, of course; using code point indexes > requires each operation to examine every code unit up to that point, which > is much more expensive. Good idea! > For a general programming language or string library, I'm not sure about > implementing this pattern throughout. I know in the ICU library, for > example, we have a significant number of functions that take offsets into > strings. Having such a parameter on all of them would be clumsy, when most > of the time people are simply working in code units. In Python this would certainly be an elegant way to add the code point indexing functionality (Python supports optional arguments with default values). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/
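In Python terms the pattern could be provided as helper functions along these lines (a sketch only; cp_to_index is a hypothetical name, and the walk over the units has the O(n) cost Mark mentions):

def cp_to_index(u, cp_index):
    # translate a code point index into a storage-unit index by
    # walking the units of a narrow-build unicode string
    i = 0
    for k in range(cp_index):
        if 0xD800 <= ord(u[i]) <= 0xDBFF:   # high surrogate: skip pair
            i = i + 2
        else:
            i = i + 1
    return i

def substring_cp(u, start, end):
    # u[start:end] measured in code points instead of storage units
    return u[cp_to_index(u, start):cp_to_index(u, end)]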
From guido@digicool.com Tue Jun 26 22:08:19 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 26 Jun 2001 17:08:19 -0400 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 22:28:45 +0200." <3B379EFD.40F88FC5@lemburg.com> References: <3B3722DB.1FF54794@lemburg.com> <4ak820g418.fsf@kern.srcf.societies.cam.ac.uk> <006501c0fd82$8b5ba9f0$0c680b41@c1340594a> <005b01c0fd9d$e4469e60$1a2cf7c2@oakdale2> <013901c0fdac$d27d1970$0c680b41@c1340594a> <3B379EFD.40F88FC5@lemburg.com> Message-ID: <200106262108.f5QL8Jd18469@odiug.digicool.com> > Mark Davis wrote: > > > > That is an interesting approach; one that basically amounts to some > > convenience functions. For example, instead of writing: > > > > myString.substring(myString.cpToIndex(3), myString.cpToIndex(5)); > > > > you could write: > > > > myString.substring(3, 5, myString.CODEPOINT); > > > > This hides some of the work, when someone is working in code points. The > > performance cost is still there, of course; using code point indexes > > requires each operation to examine every code unit up to that point, which > > is much more expensive. > > Good idea! > > > For a general programming language or string library, I'm not sure about > > implementing this pattern throughout. I know in the ICU library, for > > example, we have a significant number of functions that take offsets into > > strings. Having such a parameter on all of them would be clumsy, when most > > of the time people are simply working in code units. > > In Python this would certainly be an elegant way to add the > code point indexing functionality (Python supports optional arguments > with default values). > > -- > Marc-Andre Lemburg I still think this should be an add-on module, to emphasize we're not eager to do a whole lot of surrogate support. --Guido van Rossum (home page: http://www.python.org/~guido/) From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 22:15:19 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 23:15:19 +0200 Subject: [I18n-sig] UCS-4 configuration Message-ID: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> I've now a patch on SF which does the autoconf machinery for the proposed simultaneous support for narrow and wide Py_UNICODE definitions. https://sourceforge.net/tracker/index.php?func=detail&aid=436496&group_id=5470&atid=305470 In particular,

--enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses wchar_t if it fits
--enable-unicode=ucs4 configures a wide Py_UNICODE likewise
--enable-unicode configures Py_UNICODE to wchar_t if available, and to UCS-4 if not; this is the default

The intention is that --disable-unicode, or --enable-unicode=no, removes the Unicode type altogether; this is not yet implemented (it only defines a Py_USING_UNICODE macro that can be used to wrap Unicode support). With a wide Py_UNICODE, this patch also

- supports UTF-8 and UTF-16 encodings of the complete Unicode range
- supports unichr and \U literals:

>>> u"\U00102030"
u'\U00102030'
>>> len(u"\U00102030")
1
>>> u"\U00102030".encode("utf-8")
'\xf4\x82\x80\xb0'
>>> u"\U00102030".encode("utf-16")
'\xff\xfe\xc8\xdb0\xdc'

Regards, Martin From fredrik@pythonware.com Tue Jun 26 23:04:10 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 27 Jun 2001 00:04:10 +0200 Subject: [I18n-sig] UCS-4 configuration References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> Message-ID: <005501c0fe8b$f0134d80$4ffa42d5@hagrid> Martin v. Loewis wrote: > I've now a patch on SF which does the autoconf machinery for the > proposed simultaneous support for narrow and wide Py_UNICODE > definitions. > > https://sourceforge.net/tracker/index.php?func=detail&aid=436496&group_id=5470&atid=305470 ouch. duplicate effort here.
looks like your patch doesn't support sizeof(short) > 2 (e.g. cray). except for that, it's not too different from what I was working on. go ahead and check it in. From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 23:50:24 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 00:50:24 +0200 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: <005501c0fe8b$f0134d80$4ffa42d5@hagrid> (fredrik@pythonware.com) References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> Message-ID: <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> > ouch. duplicate effort here. Sorry about this. When I noticed you had some code committed, I thought "release early, release often". > go ahead and check it in. Done. Some clean-up could be still applied, such as defining only one of USE_UCS4_STORAGE and Py_UNICODE_SIZE, but I'll leave that to your judgement (i.e. I won't attempt any further changes at the moment unless asked). > looks like your patch doesn't support sizeof(short) > 2 (e.g. cray). > except for that, it's not too different from what I was working on. Indeed it doesn't. How are you going to solve this? Generating UCS-2/UTF-16 when you have no two-byte type is not easy, unless you plan to do all byte operations yourself. Anyway, at the moment, it is a compile time error if short is not two bytes. I hope I found all places where Py_UCS2 should be used. Regards, Martin P.S. This patch makes the test suite fail in four byte mode, when trying to check the output of u'\ud800\udc02'.encode('utf-8'). IMO, all literals denoting surrogates should be replaced with \U literals in test_unicode; this is not done yet. From gs234@cam.ac.uk Wed Jun 27 00:15:26 2001 From: gs234@cam.ac.uk (Gaute B Strokkenes) Date: 27 Jun 2001 00:15:26 +0100 Subject: [I18n-sig] Re: Unicode surrogates: just say no! In-Reply-To: <200106261700.f5QH0ih14770@odiug.digicool.com> (Guido van Rossum's message of "Tue, 26 Jun 2001 13:00:44 -0400") References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> Message-ID: <4almmeq1dd.fsf@kern.srcf.societies.cam.ac.uk> On Tue, 26 Jun 2001, guido@digicool.com wrote: > Let me use this as an excuse to start a discussion on how far we > should go in ruling out illegal code points. > > I think that *codecs* would be wise to be picky about illegal code > points (except for the special UTF-16-pass-through option). > > But I think that the *datatype implementation* should allow storage > units to take every possible value, whether or not it's illegal > according to Unicode, either in isolation or in context. It's much > easier to implement that way, and I believe that the checks ought to > be in other tools. I think that it is a good idea to allow users to stick any scalar value that will fit into the internal representation into a Python Unicode string, and that unichr(some value > 0xFFFF) should return a Unicode string with len(unichr(some value > 0xFFFF)) = 2 when UCS-2 is being used. There are a few issues that need to be considered, however: 1) Sort order. Unicode strings should sort in Unicode lexicographical order. With UCS-4 this is easy; just compare the Py_UNICODE values one by one like C does with strcmp(). With UTF-16 this is more complicated when surrogates get involved. Basically, you go through the strings being compared until you find the first difference. 
If both characters at this point are in the BMP or both are high surrogates, just compare them as usual. However, if one is in the BMP and the other is a surrogate, you need to make sure that the string with the surrogate in it sorts after the one with the BMP character. Straight comparison won't work since there are characters in the BMP with numerical values greater than those of surrogates. I believe that this is the right thing to do when Py_UNICODE is UCS-2 since the added complexity is only O(1) per string comparison and is very easy to implement. This will ensure that cmp(unichr(0xFFFD), unichr(0x10ABCD)) will work consistently and correctly for both UCS-2 and UCS-4. 2) There is an incompatibility between the two approaches since unichr(high surrogate) + unichr(low surrogate) will magically be the same as unichr(the appropriate astral codepoint) when UCS-2 is used. With UCS-4 they will not; it will result in a string with two values that have no well-defined meaning. I don't think this is a show-stopper, but people will need to be made aware. > PEP time? Quite possibly... -- Big Gaute http://www.srcf.ucam.org/~gs234/ .. does your DRESSING ROOM have enough ASPARAGUS? From gs234@cam.ac.uk Wed Jun 27 00:30:00 2001 From: gs234@cam.ac.uk (Gaute B Strokkenes) Date: 27 Jun 2001 00:30:00 +0100 Subject: [I18n-sig] Re: Unicode surrogates: just say no! In-Reply-To: <4almmeq1dd.fsf@kern.srcf.societies.cam.ac.uk> (Gaute B Strokkenes's message of "27 Jun 2001 00:15:26 +0100") References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <4almmeq1dd.fsf@kern.srcf.societies.cam.ac.uk> Message-ID: <4ad77qq0p3.fsf@kern.srcf.societies.cam.ac.uk> On 27 Jun 2001, gs234@cam.ac.uk wrote: > > 1) Sort order. Unicode strings should sort in Unicode > lexicographical order. With UCS-4 this is easy; just compare the > Py_UNICODE values one by one like C does with strcmp(). With > UTF-16 this is more complicated when surrogates get involved. > Basically, you go through the strings being compared until you > find the first difference. If both characters at this point are > in the BMP or both are high surrogates, just compare them as > usual. However, if one is in the BMP and the other is a > surrogate, you need to make sure that the string with the > surrogate in it sorts after the one with the BMP character. > Straight comparison won't work since there are characters in the > BMP with numerical values greater than those of surrogates. Speaking of the devil indeed: mere seconds after I sent this, the following was posted to the unicode list: On Tue, 26 Jun 2001, mark@macchiato.com wrote: > I asked our performance czar to run a test comparing the performance > of the two ICU utf-16 strcmp routines (UTF-16 binary order and > UTF-8/32 binary order). While I want to caution that the results are > preliminary, here they are: > > "Test File u_strcmp u_strcmpCodePointOrder > --------------------------------------------------- > Asian Names 81 ns 83 ns / call > Latin Names 127 ns 124 ns > > > The test is a binary search of a sorted list of roughly 10000 names. > The Asian names are quite a bit shorter, which probably accounts for > the time difference between them and the Latin names. > > The code path through the u_strcmpCodePointOrder function has > (statistically, anyhow) exactly one added simple if relative to > u_strcmp. The timing differences are repeatable on my machine, but > are probably mostly noise from code alignment and the like..."
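To make the rule concrete, here is a rough Python sketch of a code-point-order comparison (codepoint_cmp and _adjust are invented names, not code from any patch); it uses the same single-extra-if rotation trick as the ICU routine quoted above:

    def _adjust(u):
        # Rotate UTF-16 code units so that plain integer comparison of
        # adjusted values matches code point order: surrogates (which
        # always stand for characters >= 0x10000) move above all BMP
        # code units.
        if u < 0xD800:
            return u            # low BMP: unchanged
        if u < 0xE000:
            return u + 0x2000   # surrogate: rotate to the top
        return u - 0x800        # high BMP: slide below the surrogates

    def codepoint_cmp(s, t):
        # s, t: sequences of UTF-16 code units (ints in 0..0xFFFF)
        for a, b in zip(s, t):
            if a != b:
                return cmp(_adjust(a), _adjust(b))
        return cmp(len(s), len(t))

With this adjustment, _adjust(0xFFFD) == 0xF7FD sorts below _adjust(0xD800) == 0xF800, so the string containing the surrogate pair for 0x10ABCD compares greater than u'\ufffd', just as it would under UCS-4.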
-- Big Gaute http://www.srcf.ucam.org/~gs234/ How's it going in those MODULAR LOVE UNITS?? From guido@digicool.com Wed Jun 27 00:34:16 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 26 Jun 2001 19:34:16 -0400 Subject: [I18n-sig] Re: Unicode surrogates: just say no! In-Reply-To: Your message of "27 Jun 2001 00:15:26 BST." <4almmeq1dd.fsf@kern.srcf.societies.cam.ac.uk> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <4almmeq1dd.fsf@kern.srcf.societies.cam.ac.uk> Message-ID: <200106262334.f5QNYG418603@odiug.digicool.com> > 1) Sort order. Unicode strings should sort in Unicode lexicographical > order. With UCS-4 this is easy; just compare the Py_UNICODE values > one by one like C does with strcmp(). With UTF-16 this is more > complicated when surrogates get involved. Basically, you go > through the strings being compared until you find the first > difference. If both characters at this point are in the BMP or > both are high surrogates, just compare them as usual. However, if > one is in the BMP and the other is a surrogate, you need to make > sure that the string with the surrogate in it sorts after the one > with the BMP character. Straight comparison won't work since there > are characters in the BMP with numerical values greater than those > of surrogates. > > I believe that this is the right thing to do when Py_UNICODE is > UCS-2 since the added complexity is only O(1) per string comparison > and is very easy to implement. This will ensure that > cmp(unichr(0xFFFD), unichr(0x10ABCD)) will work consistently and > correctly for both UCS-2 and UCS-4. I'm neutral on this one; on the one hand I think we should minimize the surrogate support outside the codecs, on the other hand this makes some sense. > 2) There is an incompatibility between the two approaches since > unichr(high surrogate) + unichr(low surrogate) will magically be > the same as unichr(the approriate astral codepoint) when UCS-2 is > used. With UCS-4 they will not; it will result in a string with > two values that have no well-defined meaning. > > I don't think this is a show-stopper, but people will need to be > made aware. Agreed. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Wed Jun 27 00:34:16 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 26 Jun 2001 19:34:16 -0400 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: Your message of "Wed, 27 Jun 2001 00:50:24 +0200." <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> Message-ID: <200106262334.f5QNYGb18598@odiug.digicool.com> Wow, this is so cool! Seems we don't need a PEP... Just an update to the NEWS file and some changes to the docs and test suite. > > looks like your patch doesn't support sizeof(short) > 2 (e.g. cray). > > except for that, it's not too different from what I was working on. > > Indeed it doesn't. How are you going to solve this? Generating > UCS-2/UTF-16 when you have no two-byte type is not easy, unless you > plan to do all byte operations yourself. Don't be a wimp. :-) As Tim Peters keeps pointing out, it's really not that hard to write such code, e.g. using the occasional mask operation. And a good compiler will remove the masks that don't do anything. 
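For illustration, the masking idea sketched in Python (add16 and swap16 are invented names; the same trick carries over to C integer types wider than 16 bits):

    def add16(a, b):
        # 16-bit wraparound addition done in a wider integer type:
        # compute in full width, then mask back to the low 16 bits.
        return (a + b) & 0xFFFF

    def swap16(u):
        # byte-swap a 16-bit value; without the final mask, the left
        # shift would leak bits past bit 15 on a wider type.
        return ((u >> 8) | (u << 8)) & 0xFFFF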
> Anyway, at the moment, it is a compile time error if short is not two > bytes. I hope I found all places where Py_UCS2 should be used. Me too. I hope for the Cray folks that short will be allowed to vary properly. Another loose end: define sys.maxunicode. > Regards, > Martin > > P.S. This patch makes the test suite fail in four byte mode, when > trying to check the output of u'\ud800\udc02'.encode('utf-8'). IMO, > all literals denoting surrogates should be replaced with \U > literals in test_unicode; this is not done yet. Here's another weird failure in 4-byte mode, with a manually constructed surrogate pair (using marshal, but direct use of u.encode('utf8') would show the same problem): >>> u = u'\ud800\udc00' >>> u u'\ud800\udc00' >>> len(u) 2 >>> import marshal >>> s = marshal.dumps(u) >>> s 'u\x06\x00\x00\x00\xed\xa0\x80\xed\xb0\x80' >>> marshal.loads(s) Traceback (most recent call last): File "<stdin>", line 1, in ? UnicodeError: UTF-8 decoding error: illegal encoding >>> Note how the utf8 codec has encoded the surrogate pair as two 3-byte utf8 sequences. I think it should either spit out an error or (I think this is better -- "be forgiving in what you accept") recognize the surrogate pair and spit out a 4-byte utf8 sequence. Note that in 2-byte mode, this same string literal can be marshalled and unmarshalled just fine! I think I'm going to withdraw my recommendation that in 4-byte mode \U and unichr() would accept any 32-bit value; the use of UTF8 by marshal effectively rules this out. Or should we change the marshalling format to do something that's more transparent? It feels uncomfortable that in 2-byte mode we can easily create unicode strings containing illegal sequences (e.g. lone surrogates), but these strings can't be marshalled. Marshal has no business being judgemental about the value of the data. I think we can work out most of the backward compatibility issues by switching to a new marshal tag byte (e.g. 'U'). PS. I checked in a tiny improvement to the unichr() code. --Guido van Rossum (home page: http://www.python.org/~guido/) From paulp@ActiveState.com Wed Jun 27 00:40:23 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 26 Jun 2001 16:40:23 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B38F10B.CCA55437@ActiveState.com> <200106262036.f5QKaX618195@odiug.digicool.com> Message-ID: <3B391D67.7D7D3C1D@ActiveState.com> Guido van Rossum wrote: > >... > > But you don't have to maintain it. I say that this particular varying > behavior is just as acceptable as the varying int size. Aren't we trying to get rid of the maximum int size? And even if we keep it, the rule for working with large integers is simple: calculations work on particular ranges of inputs. Period. If I understand correctly, the surrogates proposal will (for example) change this from legal to illegal: if unichr(0x10000) in somestring: ... Because sometimes unichr is a single-char string and sometimes it will actually produce a 2-byte encoding. > Do you want to write the PEP? If nobody pipes up to say that they've started it, then I'll do a first draft tonight. I presume you mean write the PEP up as you described it and not as I would like it.
So I guess I would want to cover * what is the issue * what are surrogates * how Py_UNICODE affects literals and unichr * rationale for doing surrogate generation * description of the configure switches * description of why other options were rejected -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From guido@digicool.com Wed Jun 27 00:47:05 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 26 Jun 2001 19:47:05 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Tue, 26 Jun 2001 16:40:23 PDT." <3B391D67.7D7D3C1D@ActiveState.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B38F10B.CCA55437@ActiveState.com> <200106262036.f5QKaX618195@odiug.digicool.com> <3B391D67.7D7D3C1D@ActiveState.com> Message-ID: <200106262347.f5QNl5O18720@odiug.digicool.com> > Aren't we trying to get rid of the maximum int size? And even if we keep it, > the rule for working with large integers is simple: calculations work on > particular ranges of inputs. Period. Well... 0xffffffff is negative on 32-bit systems but positive on 64-bit systems, and there are other anomalies like it. It's not ideal, but given the forces at work (some folks need UCS-4, some folks don't want to waste 2 extra bytes per character, we don't want to revise the implementation to hide the existence of surrogates in the 2-byte version) I think it's the best we can offer. > If I understand correctly, the surrogates proposal will (for example) > change this from legal to illegal: > > if unichr(0x10000) in somestring: > ... > > Because sometimes unichr is a single-char string and sometimes it will > actually produce a 2-byte encoding. Yes, good example for the PEP. :-) > > Do you want to write the PEP? > > If nobody pipes up to say that they've started it, then I'll do a first > draft tonight. I presume you mean write the PEP up as you described it > and not as I would like it. Great, Paul! I'm tired of writing PEPs myself today. > So I guess I would want to cover > > * what is the issue > * what are surrogates > * how Py_UNICODE affects literals and unichr > * rationale for doing surrogate generation > * description of the configure switches > * description of why other options were rejected Yes. You can quote liberally from the i18n list. Use PEP number 261. Thanks so much! --Guido van Rossum (home page: http://www.python.org/~guido/) From gs234@cam.ac.uk Wed Jun 27 00:52:17 2001 From: gs234@cam.ac.uk (Gaute B Strokkenes) Date: 27 Jun 2001 00:52:17 +0100 Subject: [I18n-sig] Re: Unicode surrogates: just say no! In-Reply-To: <15160.33467.686959.415021@cymru.basistech.com> (Tom Emerson's message of "Tue, 26 Jun 2001 08:40:27 -0400") References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <15160.33467.686959.415021@cymru.basistech.com> Message-ID: <4aae2upzny.fsf@kern.srcf.societies.cam.ac.uk> On Tue, 26 Jun 2001, tree@basistech.com wrote: > > UTF-8 can be used to encode each half of a surrogate pair > (resulting in six bytes for the character) --- a proposal for this > was presented by PeopleSoft at the UTC meeting last month. UTF-8 can > also encode the code-point directly in four bytes. This is wrong. It is a bug to encode a non-BMP character with six bytes by pretending that the (surrogate) values used in the UTF-16 representation are BMP characters and encoding the character as though it was a string consisting of that character.
It is also a bug to interpret such a six-byte sequence as a single character. This was clarified in Unicode 3.1. There are several good reasons for this, such as unique representation, security etc. etc. Personally, I think that the codecs should report an error in the appropriate fashion when presented with a python unicode string which contains values that are not allowed, such as lone surrogates. While it may be convenient to allow the python programmer to stick all kinds of junk into a python unicode string it is not reasonable for the python programmer to expect that this junk can be transformed into something meaningful when he wants to encode it with some UTF or the other. This has the advantage that whenever I run something through a codec the result is always a meaningful object of the appropriate type. For instance, I believe that given a python unicode string conversion to UCS-2 should always fail if the string contains surrogates (lone or otherwise) since UCS-2 is defined not to have surrogates. Conversion to UTF-16 or UTF-32 should fail whenever there is a lone surrogate, and so on. (These are sufficient but not necessary conditions for why such conversions should fail.) Of course, it may be convenient to offer alternative codecs and variations of existing ones that have a more lenient policy for use when the programmer so wishes, for instance to interact with buggy implementations. However, this should not be the default. Is the proposal you're referring to the "UTF-8s" proposal by Oracle et al.? This was brought up on the unicode list some time ago and met with massive negative response, along the lines of "oh my god, not another UTF; we have too many already" and "it is broken to sort unicode strings by looking at the words in the UTF-16 representation; you should compare in code point order instead" (this being the reason why UTF-8s was proposed: Oracle and certain other database vendors have old and buggy unicode implementations that do not sort UTF-16 strings in codepoint order and wanted UTF-8s so that a traditional C strcmp() on a UTF-8s string will give the same result as comparing the same string in UTF-16 representation word by word. Note that UTF-8 already has the corresponding property for UCS-4 / UTF-32; this was one of the design criteria of UTF-8. Essentially, Oracle & co. want their old mistakes canonised.) -- Big Gaute http://www.srcf.ucam.org/~gs234/ Did an Italian CRANE OPERATOR just experience uninhibited sensations in a MALIBU HOT TUB? From gs234@cam.ac.uk Wed Jun 27 01:22:22 2001 From: gs234@cam.ac.uk (Gaute B Strokkenes) Date: 27 Jun 2001 01:22:22 +0100 Subject: [I18n-sig] Re: UCS-4 configuration In-Reply-To: <200106262334.f5QNYGb18598@odiug.digicool.com> (Guido van Rossum's message of "Tue, 26 Jun 2001 19:34:16 -0400") References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <200106262334.f5QNYGb18598@odiug.digicool.com> Message-ID: <4a4rt2py9t.fsf@kern.srcf.societies.cam.ac.uk> On Tue, 26 Jun 2001, guido@digicool.com wrote: > Here's another weird failure in 4-byte mode, with a manually > constructed surrogate pair (using marshal, but direct use of > u.encode('utf8') would show the same problem): > >>>> u = u'\ud800\udc00' >>>> u > u'\ud800\udc00' >>>> len(u) > 2 >>>> import marshal >>>> s = marshal.dumps(u) >>>> s > 'u\x06\x00\x00\x00\xed\xa0\x80\xed\xb0\x80' >>>> marshal.loads(s) > Traceback (most recent call last): > File "<stdin>", line 1, in ?
> UnicodeError: UTF-8 decoding error: illegal encoding >>>> > > Note how the utf8 codec has encoded the surrogate pair as two 3-byte > utf8 sequences. I think it should either spit out an error or (I > think this is better -- "be forgiving in what you accept") recognize > the surrogate pair and spit out a 4-byte utf8 sequence. Note that > in 2-byte mode, this same string literal can be marshalled and > unmarshalled just fine! I think that the best compromise is to discourage programmers from creating non-BMP characters by manually splicing together surrogate values, and encourage them to use unichr(appropriate non-BMP value) instead. This is not only more readable, but avoids this kind of problem. Perhaps the Python parser ought to produce a warning when it encounters such a string constant, to help catch this sort of bug. On the other hand, disallowing unichr(some surrogate value) is probably going too far: you should either allow all non-sensical values, or none at all. > I think I'm going to withdraw my recommendation that in 4-byte mode > \U and unichr() would accept any 32-bit value; the use of UTF8 by > marshal effectively rules this out. UTF-8 is easily extended to store any 31-bit value; in fact the current ISO definition of UTF-8 is like that, though it will be changed to match the Unicode definition in the next version. There is an obvious tweak to store 32 bit values as well. Of course, using such a scheme means that UTF-8 is not used for marshalling, just some closely related encoding. But since we "own" the marshalling format, this might not be such a problem. > Or should we change the marshalling format to do something that's > more transparent? It feels uncomfortable that in 2-byte mode we can > easily create unicode strings containing illegal sequences > (e.g. lone surrogates), but these strings can't be marshalled. > Marshal has no business being judgemental about the value of the > data. Just encode the lone surrogate as though it was a proper Unicode scalar value. This is a no-no if you go by the standard and I know that I've been arguing against doing things like that in the standard UTF-8 codec, but in the context of a private file format I think that it is ok to use a private variation of UTF-8. All we have to do is make sure that it is referred to by a name different from UTF-8 ("marshall" would be fine, I suppose) and that we never expose this private goo to anything outside Python. -- Big Gaute http://www.srcf.ucam.org/~gs234/ I am having a CONCEPTION-- From tim.one@home.com Wed Jun 27 02:38:34 2001 From: tim.one@home.com (Tim Peters) Date: Tue, 26 Jun 2001 21:38:34 -0400 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> Message-ID: [/F] > looks like your patch doesn't support sizeof(short) > 2 (e.g. cray). > except for that, it's not too different from what I was working on. [Martin v. Loewis] > Indeed it doesn't. How are you going to solve this? Generating > UCS-2/UTF-16 when you have no two-byte type is not easy, unless you > plan to do all byte operations yourself. As opposed to what, having elves do them for us while we sleep ? You need at least 16 bits, but it should be no problem if you have more than that -- all it takes is a tiny bit of care, and standard C (not even C99) does not guarantee that any integral type has exactly 2 bytes (or 4, or 8). All C guarantees is minimal sizes, and they refused to make stronger guarantees than that because the real world wouldn't let them.
I have decades of experience with this, so either trust me on it or point me at code you think is a problem. The saving grace is that any sequence of 16-bit operations involving +, -, *, &, |, ^ and << yields exactly the same result if you do it with any number of bits >= 16, then take the last 16 bits at the end. /, ~ and >> *may* require a little thought. Note that MAL made a similar argument in the Cray T3E bug report, I asked him to point me at some troublesome code, and it turned out that didn't need *any* changes to work correctly when sizeof(Py_UNICODE)==4 (or 8, or 10000000000 on the next Cray). > Anyway, at the moment, it is a compile time error if short is not two > bytes. Yes, I discovered that when the Windows build fell on its face. Just ribbing you there -- 'twas a trivial fix. From mal@lemburg.com Mon Jun 25 21:31:33 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 22:31:33 +0200 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? References: <3B3722DB.1FF54794@lemburg.com> <4ak820g418.fsf@kern.srcf.societies.cam.ac.uk> <006501c0fd82$8b5ba9f0$0c680b41@c1340594a> <3B376B03.A2A84AE1@lemburg.com> <00f101c0fda3$4a2529e0$0c680b41@c1340594a> Message-ID: <3B379FA5.5E3E81DD@lemburg.com> Mark Davis wrote: > > > My question was targeting into a slightly different direction, > > though. I know that UTF-16 does not allow lone surrogates, but > > how does Unicode itself treat these ? If I have a sequence of Unicode > > code points which includes an isolated surrogate code point, > > would this be considered a legal Unicode sequence or not ? > > It is a legal Unicode code point sequence. However, it is not a legal > Unicode *character* sequence, since it contains code points that by > definition cannot be used to represent characters. So it's basically a matter of viewing a string as a sequence of characters vs. a sequence of code points. Thanks for the explanation, -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 07:08:58 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 08:08:58 +0200 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: References: Message-ID: <200106270608.f5R68wY02785@mira.informatik.hu-berlin.de> > I have decades of experience with this, so either trust me on it or > point me at code you think is a problem. I would never remotely consider questioning your authority, how could I? The specific code in question is in PyUnicode_DecodeUTF16. It gets a char*, and converts it to a Py_UCS2* (where Py_UCS2 is unsigned short). It then fetches one Py_UCS2 after another, byte-swapping if appropriate, and advances the Py_UCS2* by one. The intention is that this retrieves the bytes of the input in pairs. Is that code correct even if sizeof(unsigned short)>2? If so, I can just remove the test that it ought to be 2. If not, how should that be rewritten? Regards, Martin From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 06:54:22 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 07:54:22 +0200 Subject: [I18n-sig] Re: Unicode surrogates: just say no!
In-Reply-To: <4aae2upzny.fsf@kern.srcf.societies.cam.ac.uk> (message from Gaute B Strokkenes on 27 Jun 2001 00:52:17 +0100) References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <15160.33467.686959.415021@cymru.basistech.com> <4aae2upzny.fsf@kern.srcf.societies.cam.ac.uk> Message-ID: <200106270554.f5R5sMW02751@mira.informatik.hu-berlin.de> > This is wrong. It is a bug to encode a non-BMP character with six > bytes by pretending that the (surrogate) values used in the UTF-16 > representation are BMP characters and encoding the character as though > it was a string consisting of that character. It is also a bug to > interpret such a six-byte sequence as a single character. This was > clarified in Unicode 3.1. It seems to be unclear to many, including myself, what exactly was clarified with Unicode 3.1. Where exactly does it say that processing a six-byte two-surrogates sequence as a single character is non-conforming? What exactly does it say that the conforming behaviour should be? > Personally, I think that the codecs should report an error in the > appropriate fashion when presented with a python unicode string which > contains values that are not allowed, such as lone surrogates. Other people have read Unicode 3.1 and came to the conclusion that it mandates that implementations accept such a character... Regards, Martin From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 07:45:11 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 08:45:11 +0200 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: <200106262334.f5QNYGb18598@odiug.digicool.com> (message from Guido van Rossum on Tue, 26 Jun 2001 19:34:16 -0400) References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <200106262334.f5QNYGb18598@odiug.digicool.com> Message-ID: <200106270645.f5R6jBS06348@mira.informatik.hu-berlin.de> > Another loose end: define sys.maxunicode. Breaking my promise not to touch the code, I've added this. I was not quite sure what type you meant to see in sys.maxunicode; I took integer, since U+FFFF is a non-character. > Note how the utf8 codec has encoded the surrogate pair as two 3-byte > utf8 sequences. I think it should either spit out an error or (I > think this is better -- "be forgiving in what you accept") recognize > the surrogate pair and spit out a 4-byte utf8 sequence. Note that in > 2-byte mode, this same string literal can be marshalled and > unmarshalled just fine! That was actually the same problem as with the test case: the UTF-8 encoder would not use the surrogate code in wide mode. I've removed that restriction, so this test now also passes. > Or should we change the marshalling format to do something that's more > transparent? It feels uncomfortable that in 2-byte mode we can easily > create unicode strings containing illegal sequences (e.g. lone > surrogates), but these strings can't be marshalled. You mean, they cannot be unmarshalled? With the current code, marshalling them works fine... There was another problem with the unicode database; the code assumed that adding two Py_UNICODE values would wrap around at 65536. With that fixed and committed, the test suite passes for me. Regards, Martin From gs234@cam.ac.uk Wed Jun 27 08:52:44 2001 From: gs234@cam.ac.uk (Gaute B Strokkenes) Date: 27 Jun 2001 08:52:44 +0100 Subject: [I18n-sig] Re: Unicode surrogates: just say no! 
References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <15160.33467.686959.415021@cymru.basistech.com> <4aae2upzny.fsf@kern.srcf.societies.cam.ac.uk> <200106270554.f5R5sMW02751@mira.informatik.hu-berlin.de> Message-ID: <4ahex2xstv.fsf@kern.srcf.societies.cam.ac.uk> [I'm CC-ing the unicode list again because I'm doing some fairly sophisticated interpretation of the Unicode conformance requirements below and I'd like to have someone with more experience with this check my reasoning.] On Wed, 27 Jun 2001, martin@loewis.home.cs.tu-berlin.de wrote: >> This is wrong. It is a bug to encode a non-BMP character with six >> bytes by pretending that the (surrogate) values used in the UTF-16 >> representation are BMP characters and encoding the character as >> though it was a string consisting of that character. It is also a >> bug to interpret such a six-byte sequence as a single character. >> This was clarified in Unicode 3.1. > > It seems to be unclear to many, including myself, what exactly was > clarified with Unicode 3.1. See the section called "UTF-8 Corrigendum" in TR 27. It explains it all in detail. > Where exactly does it say that processing a six-byte two-surrogates > sequence as a single character is non-conforming? See D39(c) at . This defines such a six-byte sequence as an "irregular UTF-8 code unit sequence" and goes on to state that, as a consequence of C12, conforming processes are not allowed to generate such sequences. This really ought to be obvious anyway: UTF-8 is defined to represent a given USV with 1 to 4 bytes, so clearly 6 is not possible. Conversely, C12(a) states that a conformant process can not produce "ill-formed code unit sequences" while producing data in a UTF. The definition of this term is given in D30 as a code unit sequence that can not be produced from a sequence of unicode scalar values. This is where things get somewhat more interesting. Somewhat surprisingly, the definition of "Unicode Scalar Value" has not been changed from 3.0 to 3.1. The reason why one might expect this to have changed is that in 3.0 UTF-16 was "the" unicode format, so that USVs were defined in terms of UTF-16 code points. In 3.1 it is stated elsewhere that different UTFs are simply concrete ways to store sequences of USVs. However, the definition of USV is still either: A value in the range 0 - 0xFFFF which is not a high or low surrogate in UTF-16, or: a value in the range 0x10000 - 0x10FFFF which is obtained by taking a pair of values that form a high and low surrogate respectively in UTF-16 and applying the usual formula. Since there is no way you can form a value in the range 0xD800 - 0xDFFF in this fashion it follows that a USV can not be in this range. Therefore you are not allowed to create a 3 byte sequence that is the UTF-8 encoding of a value in this range. Therefore you are not allowed to generate pairs of such sequences either. I hope this is all clear. One very important thing to keep in mind when doing this stuff is that 3.1 is a brand new standard, less than one and a half months old. A consequence of this is that most of the material on the Unicode web site still refers to version 3.0, so you have to be very careful to check that the information you're looking at is in fact up to date. (The only updated information I could find was TR 27 and [probably] the data tables.) > What exactly does it say that the conforming behaviour > should be? Argh. Treat it as an error, probably. You go and read the standard yourself, my head is already hurting.
8-) >> Personally, I think that the codecs should report an error in the >> appropriate fashion when presented with a python unicode string >> which contains values that are not allowed, such as lone >> surrogates. > > Other people have read Unicode 3.1 and came to the conclusion that > it mandates that implementations accept such a character... Well, they're wrong. The standard is clear as ink in this regard. -- Big Gaute http://www.srcf.ucam.org/~gs234/ I can't think about that. It doesn't go with HEDGES in the shape of LITTLE LULU -- or ROBOTS making BRICKS... From mal@lemburg.com Wed Jun 27 08:52:31 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 27 Jun 2001 09:52:31 +0200 Subject: [I18n-sig] Unicode Maintenance Message-ID: <3B3990BF.5C8410A9@lemburg.com> Looking at the recent burst of checkins for the Unicode implementation completely bypassing the standard SF procedure and possible comments I might have on the different approaches, I guess I've been ruled out as maintainer and designer of the Unicode implementation. Well, I guess that's how things go. Was nice working for you guys, but no longer is... I'm tired of having to defend myself against meta-comments about the design, uncontrolled checkins and no true backup about my standing in all this from Guido. Perhaps I am misunderstanding the role of a maintainer and implementation designer, but as it is all respect for the work I've put into all this seems faded. That's the conclusion I draw from recent postings by Martin and Fredrik and their nightly "takeover". Thanks, -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tim.one@home.com Wed Jun 27 09:24:44 2001 From: tim.one@home.com (Tim Peters) Date: Wed, 27 Jun 2001 04:24:44 -0400 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: <200106270608.f5R68wY02785@mira.informatik.hu-berlin.de> Message-ID: [Martin v. Loewis] > I would never remotely consider questioning your authority, how could I? LOL! If authority were of any help in getting software to work, Guido wouldn't need any of us: he could just scowl at it, and it would all fall into place. > The specific code in question is in PyUnicode_DecodeUTF16. It gets a > char*, and converts it to a Py_UCS2* (where Py_UCS2 is unsigned short). > It then fetches one Py_UCS2 after another, byte-swapping if appropriate, > and advances the Py_UCS2* by one. The intention is that this retrieves > the bytes of the input in pairs. > > Is that code correct even if sizeof(unsigned short)>2? Oh no. Clearly, if sizeof(Py_UCS2) > 2, it will read more than 2 bytes each time. But the *obvious* way to read two bytes is to use a char* pointer! Say q and e were declared const unsigned char* instead of Py_UCS2*. Then for big-endian getting "the next" char is just ch = (q[0] << 8) | q[1]; q += 2; and swap "0" and "1" for a little-endian machine. The code would get substantially simpler. In fact, you can skip all the embedded #ifdefs and repeated (bo == 1), (bo == -1) tests by setting up invariants int lo_index, hi_index; appropriately at the start before the loop-- setting one of those to 1 and the other to 0 --and then do ch = (q[hi_index] << 8) | q[lo_index]; q += 2; unconditionally inside the loop whenever fetching another pair.
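For illustration only, the same hi/lo index trick transcribed into Python (decode_utf16_units is an invented helper, not the actual C routine):

    def decode_utf16_units(data, bigendian):
        # data: byte string of even length; returns the 16-bit code
        # units, fetching the input one byte pair at a time so that no
        # exactly-two-byte integer type is ever needed.
        if bigendian:
            hi_index, lo_index = 0, 1
        else:
            hi_index, lo_index = 1, 0
        units = []
        for i in range(0, len(data), 2):
            units.append((ord(data[i + hi_index]) << 8) |
                         ord(data[i + lo_index]))
        return units

The byte order is decided once, before the loop; the loop body itself is then identical for both endiannesses, which is the whole point of the invariant.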
Now C doesn't guarantee that a byte is 8 bits either, but that's one thing that's true even on a Cray (they actually read 64 bits under the covers and shift+mask, but it looks like "8 bits" to C code); I don't know of any modern box on which it isn't true, and it's exceedingly unlikely any new architecture won't play along. Everything else should "just work" then. BTW, the existing byte-swapping code doesn't work right either for sizeof(Py_UCS2) > 2, because in ch = (ch >> 8) | (ch << 8); there's an assumption that the left shift is end-off. Fetch a byte at a time as above and none of that fiddling is needed. Else the existing byte-swapping code needs either ch &= 0xffff; after, or ch = (ch >> 8) | ((ch & 0xff) << 8); in the body. But we'd be better off getting rid of Py_UCS2 thingies entirely in this routine (they don't *mean* "UCS2", they *mean* "exactly two bytes", and that can't always be met). From JMachin@Colonial.com.au Wed Jun 27 09:27:50 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Wed, 27 Jun 2001 18:27:50 +1000 Subject: [I18n-sig] validity of lone surrogates (was Re: Unicode surrogates: just say no!) Message-ID: <9F2D83017589D211BD1000805FA70CA703B139EF@ntxmel03.cmutual.com.au> -----Original Message----- From: Gaute B Strokkenes [mailto:gs234@cam.ac.uk] Sent: Wednesday, 27 June 2001 17:53 To: Martin v. Loewis Cc: tree@basistech.com; guido@digicool.com; i18n-sig@python.org; unicode@unicode.org Subject: [I18n-sig] Re: Unicode surrogates: just say no! [earlier correspondents] >> Personally, I think that the codecs should report an error in the >> appropriate fashion when presented with a python unicode string >> which contains values that are not allowed, such as lone >> surrogates. > > Other people have read Unicode 3.1 and came to the conclusion that > it mandates that implementations accept such a character... [big Gaute] Well, they're wrong. The standard is clear as ink in this regard. [my comment] Unfortunately ink is usually opaque :-) The problem is caused by section 3.8 in Unicode 3.0, which is not specifically amended by 3.1 as far as I can tell. The offending text occurs after clause D29. It says "... every UTF supports lossless round-trip transcoding ..." and "... a UTF mapping must also map invalid Unicode scalar values to unique code value sequences. These invalid scalar values include [0xFFFE], [0xFFFF] and unpaired surrogates." My interpretation of this is that the 2nd part I quoted says we must export the guff, and the 1st part says we must accept it back again. I don't particularly like this idea, and am not in favour of codecs silently accepting such in incoming data --- I'm just pointing out that this "lossless round-trip transcoding" concept seems to be at variance with various interpretations of what is "legal". Cheers, John From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 13:04:18 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v.
Loewis) Date: Wed, 27 Jun 2001 14:04:18 +0200 Subject: [I18n-sig] Re: Unicode surrogates: just say no! In-Reply-To: <4ahex2xstv.fsf@kern.srcf.societies.cam.ac.uk> (message from Gaute B Strokkenes on 27 Jun 2001 08:52:44 +0100) References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <15160.33467.686959.415021@cymru.basistech.com> <4aae2upzny.fsf@kern.srcf.societies.cam.ac.uk> <200106270554.f5R5sMW02751@mira.informatik.hu-berlin.de> <4ahex2xstv.fsf@kern.srcf.societies.cam.ac.uk> Message-ID: <200106271204.f5RC4Ia07546@mira.informatik.hu-berlin.de> > >> It is also a > >> bug to interpret such a six-byte sequence as a single character. > >> This was clarified in Unicode 3.1. > > > > It seems to be unclear to many, including myself, what exactly was > > clarified with Unicode 3.1. > > See the section called "UTF-8 Corrigendum" in TR 27. It explains it > all in detail. I've read this section forth and back over and over again, admittedly without having a copy of Unicode 3.0 at hand to mentally apply the changes. > > Where exactly does it say that processing a six-byte two-surrogates > > sequence as a single character is non-conforming? > > See D39(c) at . This > defines such a six-byte sequence as an "irregular UTF-8 code unit > sequence" and goes on to state that, as a consequence of C12, > conforming processes are not allowed to generate such sequences. [I guess this is D36(c)] Yes, but you've claimed that one *also* must not interpret such a sequence as a single character - this only says that you must never generate such a sequence. > Therefore you are not allowed to create a 3 byte sequence that is the > UTF-8 encoding of a value in this range. Therefore you are not allowed > to generate pairs of such sequences either. > > I hope this is all clear. That is all clear, but I still wonder why you said that the six byte sequence (which no conforming process can have produced) must not be interpreted as a single character. Specifically, C12 is amended with # Processes may transform irregular code unit sequences into the # equivalent well-formed code unit sequences. > > Other people have read Unicode 3.1 and came to the conclusion that > > it mandates that implementations accept such a character... > > Well, they're wrong. The standard is clear as ink in this regard. Not that clear to me... Please have a look at bug # 2 in http://sourceforge.net/tracker/download.php?group_id=5470&atid=105470&file_id=7439&aid=433882 The submitter claims that an implementation has to accept a single UTF-8 encoded surrogate word. Of course, it might be that accepting a single one in UTF-8 is mandated, but if you have two of them, you must reject them... Regards, Martin From gs234@cam.ac.uk Wed Jun 27 13:38:33 2001 From: gs234@cam.ac.uk (Gaute B Strokkenes) Date: 27 Jun 2001 13:38:33 +0100 Subject: [I18n-sig] Re: validity of lone surrogates (was Re: Unicode surrogates: just say no!) References: <9F2D83017589D211BD1000805FA70CA703B139EF@ntxmel03.cmutual.com.au> Message-ID: <4ak81yjdx2.fsf@kern.srcf.societies.cam.ac.uk> On Wed, 27 Jun 2001, JMachin@Colonial.com.au wrote: > > [earlier correspondents] >>> Personally, I think that the codecs should report an error in the >>> appropriate fashion when presented with a python unicode string >>> which contains values that are not allowed, such as lone >>> surrogates. >> >> Other people have read Unicode 3.1 and came to the conclusion that >> it mandates that implementations accept such a character... > > [big Gaute] > Well, they're wrong.
The standard is clear as ink in this regard. > > [my comment] > Unfortunately ink is usually opaque :-) Precisely. That's standardese for you. 8-) > The problem is caused by section 3.8 in Unicode 3.0, which is not > specifically amended by 3.1 as far as I can tell. It's not; AFAIK the list of changes at is supposed to be canonical and it's not listed. > The offending text occurs after clause D29. It says "... every UTF > supports lossless round-trip transcoding ..." and "... a UTF mapping > must also map invalid Unicode scalar values to unique code value > sequences. These invalid scalar values include [0xFFFE], [0xFFFF] > and unpaired surrogates." Sigh. This means that the Unicode standard is self-contradicting. It is nowhere defined precisely what "invalid Unicode Scalar Value" means. I can only assume that it means "an integer in the range 0 - 0x10FFFF that is not a Unicode Scalar Value". Even so, the statement is just plain wrong as far as UTF-16 is concerned. If UTF-16 is supposed to define a bijective mapping from any sequence of integers in the range 0 - 0x10FFFF to some set of sequences of integers in the range 0 - 0xFFFF (and this is definitely what this statement is saying) this becomes a contradiction: suppose that H is some high surrogate value and that L is some low surrogate value, and that U is the corresponding USV. Then the sequences H, L <-- sequence consisting of two "invalid USVs" and U <-- sequence consisting of a single (valid) USV both map to H, L <-- sequence of two UTF-16 code points under UTF-16, so that the mapping induced by UTF-16 is very definitely not bijective. I have no idea why the standard includes this apparent error, but my best guess would be that this used to be true back in the pre-3.1 days when UTF-16 (though not with that name) was Unicode proper and UTF-16 was not a UTF, but _the_ canonical Unicode encoding. Note that the statement given in D29 actually is true when applied to UTF-8 and UTF-32. However, let us put this annoying fact aside for a moment. I believe that D29 is intended to point out that the various UTFs will "just work" if you try to encode scalar values that are not proper USVs. This is not the same thing as saying that these invalid USVs or the "pseudo-characters" or whatever that arise from them have any business in a Unicode string. In fact, Unicode conformant processes are explicitly forbidden from interpreting or using U+FFFF or U+FFFE when passing Unicode data between each other. They are, however, explicitly allowed and even encouraged to use these values internally as sentinel or "fencepost" values. To put this slightly differently, a process may be storing some Unicode data internally and it may be storing U+FFFF for some reason or another in that internal data. It may be convenient for the process to use a UTF to transform this data into a more convenient form. I think that D29 is merely pointing out that this is actually feasible, in spite of the appearance of invalid USVs in the internal data. I would be indebted if any of the experts who hang out on the unicode list could sort out this confusion. > My interpretation of this is that the 2nd part I quoted says we must > export the guff, and the 1st part says we must accept it back again. > > I don't particularly like this idea, and am not in favour of codecs > silently accepting such in incoming data --- I'm just pointing out > that this "lossless round-trip transcoding" concept seems to be at > variance with various interpretations of what is "legal". Yup.
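For reference, the collision described above falls straight out of the standard UTF-16 formula; a minimal Python sketch (the function names are invented):

    def to_surrogates(usv):
        # USV in 0x10000..0x10FFFF -> (high, low) UTF-16 code units
        v = usv - 0x10000
        return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)

    def from_surrogates(high, low):
        # the inverse formula; note that the "invalid USV" sequence
        # H, L and the single valid USV U map to the same code units
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

For example, to_surrogates(0x10000) gives (0xD800, 0xDC00), which is byte for byte the same UTF-16 output as encoding the two "invalid USVs" 0xD800, 0xDC00 directly -- exactly the non-bijectivity being complained about.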
My take on this is that the various UTF codecs should follow the specs to the letter and reject anything else in default mode. There should also be a "lenient" or "forgiving" mode in which the codec does its best to interpret and repair broken, nonsensical or irregular data. Of course, if an application uses this mode then it will have to be aware of the dangers involved, including the security aspects. -- Big Gaute http://www.srcf.ucam.org/~gs234/ I'm having BEAUTIFUL THOUGHTS about the INSIPID WIVES of smug and wealthy CORPORATE LAWYERS.. From mark@macchiato.com Wed Jun 27 15:13:39 2001 From: mark@macchiato.com (Mark Davis) Date: Wed, 27 Jun 2001 07:13:39 -0700 Subject: [I18n-sig] Re: validity of lone surrogates (was Re: Unicode surrogates: just say no!) References: <9F2D83017589D211BD1000805FA70CA703B139EF@ntxmel03.cmutual.com.au> <4ak81yjdx2.fsf@kern.srcf.societies.cam.ac.uk> Message-ID: <005101c0ff13$5d851c40$0c680b41@c1340594a> You are correct in that the text is not nearly as clear as it should be, and is open to different interpretations. My view of the status in Unicode 3.1 is represented on http://www.macchiato.com/utc/utf_comparison.htm. Corresponding computations are on http://www.macchiato.com/utc/utf_computations.htm. One of the goals for Unicode 4.0 is to clear up the text describing UTFs in particular, which may change some of the edge cases (isolates and/or irregulars). Mark ----- Original Message ----- From: "Gaute B Strokkenes" To: "Machin, John" Cc: ; ; ; ; "Martin v. Loewis" Sent: Wednesday, June 27, 2001 05:38 Subject: Re: validity of lone surrogates (was Re: Unicode surrogates: just say no!)
From guido@digicool.com Wed Jun 27 15:16:47 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 10:16:47 -0400 Subject: [I18n-sig] Re: validity of lone surrogates In-Reply-To: Your message of "27 Jun 2001 13:38:33 BST."
<4ak81yjdx2.fsf@kern.srcf.societies.cam.ac.uk> References: <9F2D83017589D211BD1000805FA70CA703B139EF@ntxmel03.cmutual.com.au> <4ak81yjdx2.fsf@kern.srcf.societies.cam.ac.uk> Message-ID: <200106271416.f5REGl519361@odiug.digicool.com> [Gaute] > My take on this is that the various UTF codecs should follow the specs > to the letter and reject antything else in default mode. There should > also be a "lenient" or "forgiving" mode in which the codec does its > best to interpret and repair broken, nonsensical or irregular data. > Off course, if an application uses this mode then it will have to be > aware of the dangers involved, including the security aspects. Python's codec mechanism has a nice API gimmick: you can pass an error handling option. Currently, this can be 'strict', 'ignore', or 'replace'. I wonder if we could add a fourth mode, 'lenient', that tries its best to encode anything passed in? --Guido van Rossum (home page: http://www.python.org/~guido/) From fredrik@pythonware.com Wed Jun 27 16:09:27 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 27 Jun 2001 17:09:27 +0200 Subject: [I18n-sig] UCS-4 configuration References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> Message-ID: <00f701c0ff1b$29498da0$4ffa42d5@hagrid> martin wrote: > > go ahead and check it in. > > Done. Some clean-up could be still applied, such as defining only one > of USE_UCS4_STORAGE and Py_UNICODE_SIZE, but I'll leave that to your > judgement (i.e. I won't attempt any further changes at the moment > unless asked). after a good night's sleep, I'm not sure Py_UNICODE_SIZE should be used for feature selection (especially not SIZE == 4). I'd rather see a separate define for UCS-2/UTF-16 vs. UCS-4, which works no matter what the exact sizes are (as long as Py_UCS4 is at least 32 bits, and Py_UCS2 is at least 16 bits, of course). (how about PY_UNICODE_WIDE?) (and what's the deal with Py_ vs PY_ prefixes, btw?) From guido@digicool.com Wed Jun 27 16:20:14 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 11:20:14 -0400 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: Your message of "Wed, 27 Jun 2001 08:45:11 +0200." <200106270645.f5R6jBS06348@mira.informatik.hu-berlin.de> References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <200106262334.f5QNYGb18598@odiug.digicool.com> <200106270645.f5R6jBS06348@mira.informatik.hu-berlin.de> Message-ID: <200106271520.f5RFKE519522@odiug.digicool.com> > > Another loose end: define sys.maxunicode. > > Breaking my promise not to touch the code, I've added this. I was not > quite sure what type you meant to see in sys.maxunicode; I took > integer, since U+FFFF is a non-character. Correct. And thanks! > > Note how the utf8 codec has encoded the surrogate pair as two 3-byte > > utf8 sequences. I think it should either spit out an error or (I > > think this is better -- "be forgiving in what you accept") recognize > > the surrogate pair and spit out a 4-byte utf8 sequence. Note that in > > 2-byte mode, this same string literal can be marshalled and > > unmarshalled just fine! > > That was actually the same problem as with the test case: the UTF-8 > encoder would not use the surrogate code in wide mode. I've removed > that restriction, so this test now also passes. Thanks again! 
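For reference, the combine-then-encode step being discussed looks roughly like this in Python (utf8_of_astral is an invented name, not the codec's actual code):

    def utf8_of_astral(high, low):
        # combine a surrogate pair and emit one 4-byte UTF-8 sequence
        cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
        return (chr(0xF0 | (cp >> 18)) +
                chr(0x80 | ((cp >> 12) & 0x3F)) +
                chr(0x80 | ((cp >> 6) & 0x3F)) +
                chr(0x80 | (cp & 0x3F)))

    # e.g. utf8_of_astral(0xD800, 0xDC00) == '\xf0\x90\x80\x80', rather
    # than the two 3-byte sequences '\xed\xa0\x80\xed\xb0\x80' seen in
    # the marshal dump earlier in this thread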
> > Or should we change the marshalling format to do something that's more
> > transparent?  It feels uncomfortable that in 2-byte mode we can easily
> > create unicode strings containing illegal sequences (e.g. lone
> > surrogates), but these strings can't be marshalled.
>
> You mean, they cannot be unmarshalled? With the current code,
> marshalling them works fine...

Yes.

> There was another problem with the unicode database; the code assumed
> that adding two Py_UNICODE values would wrap around at 65536. With
> that fixed and committed, the test suite passes for me.

Wow.  And for both versions, too!

Are there any open issues left?  A list of those would help!  Some I
can think of:

- Marc-Andre's message
- disable Unicode entirely with a configuration switch
- documentation
- marshalling UCS2 strings containing lone surrogates

Anything else?

--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@digicool.com Wed Jun 27 16:25:54 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 11:25:54 -0400 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: Your message of "Wed, 27 Jun 2001 17:09:27 +0200." <00f701c0ff1b$29498da0$4ffa42d5@hagrid> References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <00f701c0ff1b$29498da0$4ffa42d5@hagrid> Message-ID: <200106271525.f5RFPsJ19534@odiug.digicool.com>

> martin wrote:
> > > go ahead and check it in.
> >
> > Done. Some clean-up could still be applied, such as defining only one
> > of USE_UCS4_STORAGE and Py_UNICODE_SIZE, but I'll leave that to your
> > judgement (i.e. I won't attempt any further changes at the moment
> > unless asked).
>
> after a good night's sleep, I'm not sure Py_UNICODE_SIZE should
> be used for feature selection (especially not SIZE == 4).
>
> I'd rather see a separate define for UCS-2/UTF-16 vs. UCS-4, which
> works no matter what the exact sizes are (as long as Py_UCS4 is at
> least 32 bits, and Py_UCS2 is at least 16 bits, of course).

Makes sense.

> (how about PY_UNICODE_WIDE?)
>
> (and what's the deal with Py_ vs PY_ prefixes, btw?)

The majority of macros use Py_, but a few use PY_.  I'd stick with Py_
unless you're defining a new one that's part of a series that already
uses PY_.

In the Unicode support, the only one using PY_ seems to be
PY_UNICODE_TYPE.  Since that's a recent addition, there's time to
rename it.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From fredrik@pythonware.com Wed Jun 27 16:45:17 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 27 Jun 2001 17:45:17 +0200 Subject: [I18n-sig] UCS-4 configuration References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <200106262334.f5QNYGb18598@odiug.digicool.com> <200106270645.f5R6jBS06348@mira.informatik.hu-berlin.de> <200106271520.f5RFKE519522@odiug.digicool.com> Message-ID: <019b01c0ff20$2b85a8b0$4ffa42d5@hagrid>

guido wrote:
> Anything else?

also after a good night's sleep: should the default on unix really be
"same as your wchar", or should we keep it as "ucs2" for the next
release?

(i.e. if you don't specify anything, you get UCS-2, like before)

From guido@digicool.com Wed Jun 27 16:50:14 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 11:50:14 -0400 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: Your message of "Wed, 27 Jun 2001 17:45:17 +0200."
<019b01c0ff20$2b85a8b0$4ffa42d5@hagrid> References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <200106262334.f5QNYGb18598@odiug.digicool.com> <200106270645.f5R6jBS06348@mira.informatik.hu-berlin.de> <200106271520.f5RFKE519522@odiug.digicool.com> <019b01c0ff20$2b85a8b0$4ffa42d5@hagrid> Message-ID: <200106271550.f5RFoEZ19613@odiug.digicool.com>

> guido wrote:
>
> > Anything else?
>
> also after a good night's sleep: should the default on unix really be
> "same as your wchar", or should we keep it as "ucs2" for the next
> release?
>
> (i.e. if you don't specify anything, you get UCS-2, like before)

Yes, that's my preference.  I had the same thought overnight.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From rick@unicode.org Wed Jun 27 16:52:28 2001 From: rick@unicode.org (Rick McGowan) Date: Wed, 27 Jun 2001 08:52:28 -0700 Subject: [I18n-sig] Re: Unicode surrogates: just say no! In-Reply-To: <4aae2upzny.fsf@kern.srcf.societies.cam.ac.uk> (message from Gaute B Strokkenes on 27 Jun 2001 00:52:17 +0100) Message-ID: <200106271344.JAA08050@unicode.org>

"Martin v. Loewis" wrote:

> It seems to be unclear to many, including myself, what exactly was
> clarified with Unicode 3.1. Where exactly does it say that processing
> a six-byte two-surrogates sequence as a single character is
> non-conforming?

It's not non-conforming, it's "irregular".  Please read the technical
report (#27) that I pointed at yesterday (on the i18n-sig@python).  It
gives detailed specifications for UTF-8.  Anything not in the table
"UTF-8 Bit Distribution" and accompanying description shown there is
non-conforming.  Rule D36 specifies:

(a) UTF-8 is the Unicode Transformation Format that serializes a
Unicode code point as a sequence of one to four bytes, as specified in
Table 3.1, UTF-8 Bit Distribution.

(b) An illegal UTF-8 code unit sequence is any byte sequence that does
not match the patterns listed in Table 3.1B, Legal UTF-8 Byte
Sequences.

(c) An irregular UTF-8 code unit sequence is a six-byte sequence where
the first three bytes correspond to a high surrogate, and the next
three bytes correspond to a low surrogate.  As a consequence of C12,
these irregular UTF-8 sequences shall not be generated by a conformant
process.

In other words, it is non-conforming to generate two 3-byte things for
a surrogate pair.  However, it remains "legal but irregular" to
interpret such a pair of 3-byte entities.

Why wasn't it just made non-conforming to interpret such things?
Because there are old implementations of UTF-8 in the world that
pre-date the definition of surrogates, and if they ever encountered
codepoints in that range, they would generate those pairs of 3-byte
sequences.  So it is legal for a process to recognize them and either
raise an exception or try to "fix" the situation.

> What exactly does it say that the conforming behaviour
> should be?

TR27 says: "Processes that require unique representation must not
interpret irregular UTF code unit sequences as characters.  They may,
for example, reject or remove those sequences."

If I were going to implement a UTF-8 interpreter for Python, I would
give it a hook to optionally return a specific error condition on
irregular sequences.

If you still find the definitions and discussion in the technical
report to be unclear, then the Unicode editorial committee would
undoubtedly like to hear about it.
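For readers who want to see what such a hook would have to look for,
here is a hedged sketch of a detector for the irregular six-byte form;
the function name and structure are mine (present-day Python is
assumed, where indexing a bytes object yields integers), not code from
any actual codec:

    def find_irregular_utf8(data):
        """Yield offsets of 6-byte encoded surrogate pairs in data."""
        for i in range(len(data) - 5):
            # ED A0..AF xx encodes U+D800..U+DBFF (high surrogates);
            # ED B0..BF xx encodes U+DC00..U+DFFF (low surrogates).
            # A stricter checker would also verify that the bytes at
            # i+2 and i+5 are valid continuation bytes (0x80..0xBF).
            if (data[i] == 0xED and 0xA0 <= data[i + 1] <= 0xAF
                    and data[i + 3] == 0xED and 0xB0 <= data[i + 4] <= 0xBF):
                yield i

    # U+D800,U+DC00 encoded as two 3-byte sequences ("legal but irregular"):
    print(list(find_irregular_utf8(b"\xed\xa0\x80\xed\xb0\x80")))   # [0]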
Rick

From guido@digicool.com Wed Jun 27 17:11:49 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 12:11:49 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode Maintenance In-Reply-To: Your message of "Wed, 27 Jun 2001 14:10:57 +0200." <3B39CD51.406C28F0@lemburg.com> References: <3B39CD51.406C28F0@lemburg.com> Message-ID: <200106271611.f5RGBn819631@odiug.digicool.com>

> Looking at the recent burst of checkins for the Unicode implementation
> completely bypassing the standard SF procedure and possible comments
> I might have on the different approaches, I guess I've been ruled out
> as maintainer and designer of the Unicode implementation.
>
> Well, I guess that's how things go. Was nice working for you guys,
> but no longer is... I'm tired of having to defend myself against
> meta-comments about the design, uncontrolled checkins and no true
> backup about my standing in all this from Guido.
>
> Perhaps I am misunderstanding the role of a maintainer and
> implementation designer, but as it is all respect for the work I've
> put into all this seems faded. That's the conclusion I draw from recent
> postings by Martin and Fredrik and their nightly "takeover".
>
> Thanks,
> --
> Marc-Andre Lemburg

[For those of us to whom Marc-Andre's complaint comes as a total
surprise: there was a thread on i18n-sig about whether we should
support Unicode surrogates, followed by a conclusion to skip surrogates
and jump directly to optional support for UCS-4, followed by some
checkins that enabled a configuration choice between UCS-2 and UCS-4,
and code to make it work.  As a side effect, surrogate support in the
UCS-2 version actually improved slightly.]

Now, now, Marc-Andre.  The only comments I recall from you on my
"surrogates: just say no" post seemed favorable, except that you
proposed to go all the way and make UCS-4 mandatory.  I explained why I
didn't want to go that far, and why I didn't believe your arguments
against giving users a choice.  I didn't hear back from you then, and I
didn't think you could have much of a problem with my position.

Our process requires the use of the SF patch manager only for
controversial changes.  Based on your feedback, I didn't think there
was anything controversial about the changes that Fredrik and Martin
have made!  (If there was, IMO it was temporarily breaking the Windows
build and the test suite -- but that's all fixed now.)

I don't understand where you get the idea that we lost respect for
your work!  In fact, the fact that it was so easy to make the changes
suggested to me that the original design was well suited to this
particular change (as opposed to the surrogate support proposals,
which all sounded like they would require a *lot* of changes).

I don't think that we have very strict roles in this community anyway.
(My role as BDFL excluded -- that's why I get to write this response.
:-)  I'd say that Fredrik owns SRE, because he has asserted that
ownership at various times: he's undone changes by others that broke
the 1.5.2 support, for example.  But the Unicode support in Python
isn't owned by one person: many folks have contributed to that,
including Fredrik, who designed and wrote the original Unicode string
object implementation.

If you have specific comments about the changes made, please be
specific.  If you feel slighted by meta-comments, please also be
specific.  I don't think I've said anything derogatory about you or
your design.

Paul Prescod offered to write a PEP on this issue.
My cynical half believes that we'll never hear from him again, but my
optimistic half hopes that he'll actually write one, so that we'll be
able to discuss the various issues for the users with the users.  I
encourage you to co-author the PEP, since you have a lot of background
knowledge about the issues.

BTW, I think that Misc/unicode.txt should be converted to a PEP, for
the historic record.  It was very much a PEP before the PEP process
was invented.  Barry, how much work would this be?  No editing needed,
just formatting, and assignment of a PEP number (the lower the
better).

--Guido van Rossum (home page: http://www.python.org/~guido/)

From barry@digicool.com Wed Jun 27 17:24:30 2001 From: barry@digicool.com (Barry A. Warsaw) Date: Wed, 27 Jun 2001 12:24:30 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode Maintenance References: <3B39CD51.406C28F0@lemburg.com> <200106271611.f5RGBn819631@odiug.digicool.com> Message-ID: <15162.2238.720508.508081@anthem.wooz.org>

>>>>> "GvR" == Guido van Rossum writes:

    GvR> BTW, I think that Misc/unicode.txt should be converted to a
    GvR> PEP, for the historic record.  It was very much a PEP before
    GvR> the PEP process was invented.  Barry, how much work would
    GvR> this be?  No editing needed, just formatting, and assignment
    GvR> of a PEP number (the lower the better).

Not much work at all, so I'll do this (and replace Misc/unicode.txt
with a pointer to the PEP).  Let's go with PEP 7, but stick it under
the "Other Informational PEPs" category.

-Barry

From tim.one@home.com Wed Jun 27 17:51:16 2001 From: tim.one@home.com (Tim Peters) Date: Wed, 27 Jun 2001 12:51:16 -0400 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: <200106271520.f5RFKE519522@odiug.digicool.com> Message-ID: 

[Guido]
> Are there any open issues left?  A list of those would help!  Some I
> can think of:
>
> - Marc-Andre's message
> - disable Unicode entirely with a configuration switch
> - documentation
> - marshalling UCS2 strings containing lone surrogates
>
> Anything else?

Other unresolved glitches raised here in the wee hours:

+ New warnings (prototype/definition mismatches).

+ Windows _winreg doesn't link.  Unclear (to me) what assumptions
  it really needs to have met; it's failing now because
  HAVE_USABLE_WCHAR_T isn't #define'd anymore, but I don't really
  know what "usable" refers to (perhaps that it's usable by _winreg).

From walter@livinglogic.de Wed Jun 27 17:56:00 2001 From: walter@livinglogic.de (Walter Dörwald) Date: Wed, 27 Jun 2001 18:56:00 +0200 Subject: [I18n-sig] Re: validity of lone surrogates References: <9F2D83017589D211BD1000805FA70CA703B139EF@ntxmel03.cmutual.com.au> <4ak81yjdx2.fsf@kern.srcf.societies.cam.ac.uk> <200106271416.f5REGl519361@odiug.digicool.com> Message-ID: <3B3A1020.7154E4B6@livinglogic.de>

Guido van Rossum wrote:
>
> [Gaute]
> > My take on this is that the various UTF codecs should follow the specs
> > to the letter and reject anything else in default mode.  There should
> > also be a "lenient" or "forgiving" mode in which the codec does its
> > best to interpret and repair broken, nonsensical or irregular data.
> > Of course, if an application uses this mode then it will have to be
> > aware of the dangers involved, including the security aspects.
>
> Python's codec mechanism has a nice API gimmick: you can pass an error
> handling option.  Currently, this can be 'strict', 'ignore', or
> 'replace'.  I wonder if we could add a fourth mode, 'lenient', that
> tries its best to encode anything passed in?
How would this work together with the proposed encode error handling
callback feature (see patch #432401)?  Does this patch have any chance
of getting into Python (when it's finished)?

Bye,
   Walter Dörwald

From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 17:27:41 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 18:27:41 +0200 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: <00f701c0ff1b$29498da0$4ffa42d5@hagrid> (fredrik@pythonware.com) References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <00f701c0ff1b$29498da0$4ffa42d5@hagrid> Message-ID: <200106271627.f5RGRf909183@mira.informatik.hu-berlin.de>

> after a good night's sleep, I'm not sure Py_UNICODE_SIZE should
> be used for feature selection (especially not SIZE == 4).
>
> I'd rather see a separate define for UCS-2/UTF-16 vs. UCS-4, which
> works no matter what the exact sizes are (as long as Py_UCS4 is at
> least 32 bits, and Py_UCS2 is at least 16 bits, of course).
>
> (how about PY_UNICODE_WIDE?)

Normalizing everything to Py_UNICODE_WIDE sounds fine to me; I won't
start writing a patch for that, though.  Feel free to get completely
rid of Py_UNICODE_SIZE in the process (and probably of
USE_UCS4_STORAGE as well).

> (and what's the deal with Py_ vs PY_ prefixes, btw?)

I took PY_ out of confusion, as mentioned in another message.

Regards,
Martin

From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 17:21:41 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 18:21:41 +0200 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: <200106271525.f5RFPsJ19534@odiug.digicool.com> (message from Guido van Rossum on Wed, 27 Jun 2001 11:25:54 -0400) References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <00f701c0ff1b$29498da0$4ffa42d5@hagrid> <200106271525.f5RFPsJ19534@odiug.digicool.com> Message-ID: <200106271621.f5RGLf709180@mira.informatik.hu-berlin.de>

> The majority of macros use Py_, but a few use PY_.  I'd stick with Py_
> unless you're defining a new one that's part of a series that already
> uses PY_.
>
> In the Unicode support, the only one using PY_ seems to be
> PY_UNICODE_TYPE.  Since that's a recent addition, there's time to
> rename it.

PY_UNICODE_TYPE is the #define that is used in the typedef for
Py_UNICODE; it should not be used elsewhere.  I couldn't figure out
how to have autoconf generate typedefs, so I generate a #define.
Originally, I wanted to use PY_UNICODE for the #define, but then
thought it to be too similar to Py_UNICODE, hence PY_UNICODE_TYPE.
Changing it to Py_UNICODE_TYPE sounds fine to me.

Regards,
Martin

From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 17:46:37 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v.
Loewis) Date: Wed, 27 Jun 2001 18:46:37 +0200 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: <200106271520.f5RFKE519522@odiug.digicool.com> (message from Guido van Rossum on Wed, 27 Jun 2001 11:20:14 -0400) References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <200106262334.f5QNYGb18598@odiug.digicool.com> <200106270645.f5R6jBS06348@mira.informatik.hu-berlin.de> <200106271520.f5RFKE519522@odiug.digicool.com> Message-ID: <200106271646.f5RGkbR09253@mira.informatik.hu-berlin.de>

> Are there any open issues left?  A list of those would help!  Some I
> can think of:
>
> - Marc-Andre's message
> - disable Unicode entirely with a configuration switch
> - documentation
> - marshalling UCS2 strings containing lone surrogates
>
> Anything else?

- bump the API version? With the current CVS, this is only necessary
  for systems with a 4-byte wchar_t.

- Find some magic to deal with exchanging extensions across
  incompatible installations.

- fix UTF-8 encoding for lone surrogates, as per SF bug report.

- Windows configuration: should unicodeobject.h provide
  autoconfiguration, or should everything be defined in PC/config.h
  (or similar manually-maintained config files).

I'll be leaving for two weeks next week, so I can tackle larger tasks
only later.

On the PYD compatibility, the easiest solution would be to create a
Py_InitModule5, which also takes a flag value; this flag value could
include other incompatible settings, such as --without-cycle-gc.  Of
course, such a change would break all existing binary modules, unless
Python continues to provide Py_InitModule4 to binary modules.  Calling
Py_InitModule4 would then imply narrow Unicode.

To hack without Py_InitModule5, putting flags into PYTHON_API_VERSION
might also work.

Regards,
Martin

From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 18:06:30 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 19:06:30 +0200 Subject: [I18n-sig] Re: Unicode surrogates: just say no! In-Reply-To: <200106271344.JAA08050@unicode.org> (message from Rick McGowan on Wed, 27 Jun 2001 08:52:28 -0700) References: <200106271344.JAA08050@unicode.org> Message-ID: <200106271706.f5RH6UT09389@mira.informatik.hu-berlin.de>

> "Martin v. Loewis" wrote:
>
> > It seems to be unclear to many, including myself, what exactly was
> > clarified with Unicode 3.1. Where exactly does it say that processing
> > a six-byte two-surrogates sequence as a single character is
> > non-conforming?
>
> It's not non-conforming, it's "irregular".

If some implementation processes something, it can be either
conforming or non-conforming in doing so, no?  The byte sequence
itself may be irregular; I'm asking how a conforming implementation
should deal with it when it sees it.

> Please read the technical report (#27) that I pointed at yesterday
> (on the i18n-sig@python).  It gives detailed specifications for
> UTF-8.  Anything not in the table "UTF-8 Bit Distribution" and
> accompanying description shown there is non-conforming.

I see conformant/non-conformant (*) only used for implementations
(and processes), not for byte sequences.  There you use illegal,
ill-formed, irregular; much of my confusion probably is because I
don't know how these terms relate, except for

- an irregular sequence (of bytes, or code units) is not illegal.

Also, I assume that negation of these concepts follows the English
language rules (i.e.
"not illegal" == "legal", "not ill-formed" == "well-formed", etc) > In other words, it is non-conforming to generate two 3-byte things for a > surrogate pair. However, it remains "legal but irregular" to interpret > such a pair of 3-byte entities. [...] > If you still find the definitions and discussion in the technical report > to be unclear, then the Unicode editorial committee would undoubtedly like > to hear about it. The issue of UTF-8 encoded surrogate pairs is clear now to me, I hope: You must not write them, but you may read them. The next question then is what to do with lone surrogate triplets; the table in TR 27 suggests they are legal, but people on this list have argued they must neither be emitted nor consumed (since what you get is not a legal USV). Thanks for your comments, Martin (*) "Conforming" is never used, sorry for the confusion From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 18:21:26 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 19:21:26 +0200 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: References: Message-ID: <200106271721.f5RHLQM09510@mira.informatik.hu-berlin.de> > + Windows _winreg doesn't link. Unclear (to me) what assumptions > it really needs to have met; it's failing now because > HAVE_USABLE_WCHAR_T isn't #define'd anymore, but I don't know > really know what "usable" refers to (perhaps that it's usable > by _winreg ). HAVE_USABLE_WCHAR_T should be defined iff sizeof(wchar_t)==sizeof(Py_UNICODE) (*). If you follow my proposal, PC/config.h should define this simultaneously with defining Py_UNICODE_TYPE to wchar_t. OTOH, the implementation of PyUnicode_DecodeMBCS and friends should probably be changed to operate for a wide Py_UNICODE also. Currently, it calls MultiByteToWideChar; this should be followed by widening each value if a wide Py_UNICODE is used. Without such a change, the "mbcs" codec won't work on Windows with a wide Py_UNICODE. Regards, Martin (*) technically, this requires also that wchar_t values are always understood as Unicode in the C library, instead of, say, EUC-JP. This is hard to test in general, but for Windows, it is known to be true. From fredrik@pythonware.com Wed Jun 27 18:41:01 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 27 Jun 2001 19:41:01 +0200 Subject: [I18n-sig] UCS-4 configuration References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <200106262334.f5QNYGb18598@odiug.digicool.com> <200106270645.f5R6jBS06348@mira.informatik.hu-berlin.de> <200106271520.f5RFKE519522@odiug.digicool.com> <200106271646.f5RGkbR09253@mira.informatik.hu-berlin.de> Message-ID: <02c901c0ff30$ed0d9fa0$4ffa42d5@hagrid> martin wrote: > - Windows configuration: should unicodeobject.h provide > autoconfiguration, or should everything be defined in PC/config.h > (or similar manually-maintained config files). > > I'll be leaving for two weeks next week, so I can tackle larger tasks > only later. before you leave, can you change the ./configure default to ucs2? (see my and gvr's earlier mails) I'll clean up the unicode defines tonight. Cheers /F From guido@digicool.com Wed Jun 27 18:44:20 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 13:44:20 -0400 Subject: [I18n-sig] Re: Unicode surrogates: just say no! In-Reply-To: Your message of "Wed, 27 Jun 2001 19:06:30 +0200." 
<200106271706.f5RH6UT09389@mira.informatik.hu-berlin.de> References: <200106271344.JAA08050@unicode.org> <200106271706.f5RH6UT09389@mira.informatik.hu-berlin.de> Message-ID: <200106271744.f5RHiKO19739@odiug.digicool.com>

> The issue of UTF-8 encoded surrogate pairs is clear now to me, I hope:
> You must not write them, but you may read them.

Agreed.  Clarifying: if you read one pair when converting to UCS-4,
you should store one character; when converting to UCS-2, you should
store a pair, of course.

> The next question then is what to do with lone surrogate triplets; the
> table in TR 27 suggests they are legal, but people on this list have
> argued they must neither be emitted nor consumed (since what you get
> is not a legal USV).

I see two positions possible: (1) it's up to the application to ensure
this, not to the codec, so the codec needn't check for this; (2) the
codec's output should be legal, and this is a good time to check for
illegalities.

Since both are reasonable positions, perhaps the error handling option
of the codec should be used to decide?  Neither of "strict", "replace"
or "ignore" really matches the semantics of (1) however; perhaps this
behavior should be called "lenient".

--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@digicool.com Wed Jun 27 18:53:10 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 13:53:10 -0400 Subject: [I18n-sig] Re: validity of lone surrogates In-Reply-To: Your message of "Wed, 27 Jun 2001 18:56:00 +0200." <3B3A1020.7154E4B6@livinglogic.de> References: <9F2D83017589D211BD1000805FA70CA703B139EF@ntxmel03.cmutual.com.au> <4ak81yjdx2.fsf@kern.srcf.societies.cam.ac.uk> <200106271416.f5REGl519361@odiug.digicool.com> <3B3A1020.7154E4B6@livinglogic.de> Message-ID: <200106271753.f5RHrAB19753@odiug.digicool.com>

> How would this work together with the proposed encode error handling
> callback feature (see patch #432401)?  Does this patch have any chance of
> getting into Python (when it's finished)?

I don't know.  The patch looks awfully big, and the motivation seems
thin, so I don't have high hopes.  I doubt that I would use it myself,
and I fear that it would be pretty slow if called frequently.

An alternative way to get what you want would be to write your own
codec.  Also, some standard codecs might be subclassable in a way that
makes it easy to get the desired functionality through subclassing
rather than through changing lots of C level APIs.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@digicool.com Wed Jun 27 19:13:10 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 14:13:10 -0400 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: Your message of "Wed, 27 Jun 2001 18:46:37 +0200." <200106271646.f5RGkbR09253@mira.informatik.hu-berlin.de> References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <200106262334.f5QNYGb18598@odiug.digicool.com> <200106270645.f5R6jBS06348@mira.informatik.hu-berlin.de> <200106271520.f5RFKE519522@odiug.digicool.com> <200106271646.f5RGkbR09253@mira.informatik.hu-berlin.de> Message-ID: <200106271813.f5RIDAU19807@odiug.digicool.com>

> To hack without Py_InitModule5, putting flags into PYTHON_API_VERSION
> might also work.

I like adding a flag better than Py_InitModule5.  If
PYTHON_API_VERSION > 1010, the low bit should be off for UCS-2 and on
for UCS-4.  So the next version should be 1012; this would become 1013
for UCS-4.
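As a toy rendering of that low-bit scheme in Python (the numbers
follow the example above; nothing here is real CPython code):

    BASE_API_VERSION = 1012            # even: narrow (UCS-2) build

    def api_version(wide_unicode):
        # a wide (UCS-4) build turns the low bit on
        return BASE_API_VERSION | bool(wide_unicode)

    def is_wide(version):
        return bool(version & 1)

    print(api_version(False), api_version(True))   # 1012 1013
    print(is_wide(1013))                           # True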
If a program doesn't use Unicode-specific APIs that take or return
Py_UNICODE arrays, it's not vulnerable to this problem.

An alternative would be to use the C preprocessor to give all affected
APIs a different name when using UCS4.  (There are also macros
affected, e.g. Py_UNICODE_COPY().  But macro users are likely to also
reference the function APIs.)

There's a bunch of functions that take or return a single Py_UNICODE
value.  These would be affected too.  That's a shame; if they had been
defined to take/return an unsigned long they would have worked just as
well.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From paulp@ActiveState.com Wed Jun 27 20:17:32 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 27 Jun 2001 12:17:32 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> Message-ID: <3B3A314C.161FE431@ActiveState.com>

I'm trying to sift through all of the decisions made in different
messages for the PEP.

Guido van Rossum wrote:
>
>...
>
> - unichr(i) for 0x10000 <= i <= 0x10ffff (and hence corresponding \u
>   and \U) generates a surrogate pair, where u[0] is the high
>   surrogate value and u[1] the low surrogate value

Does this imply that ord() should take in surrogate pairs too?

-- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook

From kenw@sybase.com Wed Jun 27 20:23:47 2001 From: kenw@sybase.com (Kenneth Whistler) Date: Wed, 27 Jun 2001 12:23:47 -0700 (PDT) Subject: [I18n-sig] Re: validity of lone surrogates (was Re: Unicode surrogates: just say no!) Message-ID: <200106271923.MAA11557@birdie.sybase.com>

Mark Davis wrote:

> You are correct in that the text is not nearly as clear as it should be,
> and is open to different interpretations. My view of the status in Unicode
> 3.1 is represented on http://www.macchiato.com/utc/utf_comparison.htm.
> Corresponding computations are on
> http://www.macchiato.com/utc/utf_computations.htm.

I concur in general with Mark's characterization of what the current
text is intended to say.  In particular, Mark is correct that there is
language just below D29 that says that "a UTF mapping *must also* map
invalid Unicode scalar values to unique code value sequences.  These
invalid scalar values include FFFE, FFFF, and unpaired surrogates."

I strongly agree with Mark that this is the correct position to take
with respect to the *noncharacters*, i.e. FFFE, FFFF (and their ilk on
the supplementary planes, as well as the newly defined FDD0..FDFF).
In this respect, ISO/IEC 10646 is inconsistent in its definition of
UTF-8, and needs to be fixed.

However, like Gaute, I think there are logical contradictions in the
current text of the Unicode Standard when it comes to dealing with the
isolated surrogate code points.

Gaute is also correct that much of the problem of textual
interpretation results from the incomplete transition in Unicode 3.0
from thinking of UTF-16 as Unicode, with UTF-8 derived from UTF-16, to
UTF-16 and UTF-8 as coequal transforms from the Unicode Scalar Value.
The UTC editorial committee struggled with that text, but also
attempted to minimize the overall impact on Chapter 3 of the standard.
In retrospect, it probably would have been better to take the hit then
and completely rewrite Chapter 3 in terms of the new model, because of
the continuing confusion that the incomplete transition has obviously
engendered among implementers.
>
> One of the goals for Unicode 4.0 is to clear up the text describing UTFs in
> particular, which may change some of the edge cases (isolates and/or
> irregulars). This work is actively underway.

I can guarantee that the Unicode 4.0 text will be *much* clearer about
all these issues.  However, the UTC editorial committee is still
struggling with exactly how to present the edge cases.

It is my *personal* opinion -- and not yet one that could be stated to
be consensus in UTC or the UTC editorial committee -- that the Unicode
Standard should adopt formal definitions similar to those of the IETF,
where isolated surrogates and/or irregular sequences are just
ill-formed, period.  And where the issues of lenient interpretation of
irregular UTF-8 generated by older implementations are shunted off
into a migration strategy section dealing with UTF converters.

--Ken Whistler

From guido@digicool.com Wed Jun 27 20:30:19 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 15:30:19 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Wed, 27 Jun 2001 12:17:32 PDT." <3B3A314C.161FE431@ActiveState.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A314C.161FE431@ActiveState.com> Message-ID: <200106271930.f5RJUJw19910@odiug.digicool.com>

> I'm trying to sift through all of the decisions made in different
> messages for the PEP.

Excellent!

> Guido van Rossum wrote:
> >
> >...
> >
> > - unichr(i) for 0x10000 <= i <= 0x10ffff (and hence corresponding \u
> >   and \U) generates a surrogate pair, where u[0] is the high
> >   surrogate value and u[1] the low surrogate value
>
> Does this imply that ord() should take in surrogate pairs too?

Oooh, hadn't thought of that, but yes, it makes sense!

Not yet implemented, but I think it should.  Makes for a nice pair
of invariants:

    unichr(ord('\Udddddddd')) == '\Udddddddd'
    ord(unichr(0xdddddddd)) == 0xdddddddd

regardless of whether we're using UCS-2 or UCS-4 storage.  Currently
this is broken for 0xdddddddd > 0xffff with UCS-2 storage.

On the other hand, unichr() and ord() should still work for lone
surrogate values as well (even though these are invalid code points).

--Guido van Rossum (home page: http://www.python.org/~guido/)

From paulp@ActiveState.com Wed Jun 27 20:40:07 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 27 Jun 2001 12:40:07 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> Message-ID: <3B3A3696.FFA7FCE@ActiveState.com>

Guido van Rossum wrote:
>
>...
>
> - unichr(i) for i >= 0x110000 (and hence corresponding \u and \U)
>   raises an exception at Python-to-bytecode compile-time

unichr(i) is an expression.  When would it be evaluated at
compile-time?

Also, I'm not sure what runtime behavior you want for these "very
large" unichr(i) values.

In general I don't understand why we're treating the > 0x110000 range
specially at all?

-- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook

From Peter_Constable@sil.org Wed Jun 27 19:58:46 2001 From: Peter_Constable@sil.org (Peter_Constable@sil.org) Date: Wed, 27 Jun 2001 11:58:46 -0700 Subject: [I18n-sig] Re: Unicode surrogates: just say no!
Message-ID: 

>If you still find the definitions and discussion in the technical report
>to be unclear, then the Unicode editorial committee would undoubtedly like
>to hear about it.

There is no question that there are still things that are unclear and
things that are anachronistic in the definitions.  I have been told
that the editorial committee *is* aware of these things and looking at
them with the intent to revise them for TUS 4.0.

- Peter

---------------------------------------------------------------------------
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: 

From paulp@ActiveState.com Wed Jun 27 20:50:24 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 27 Jun 2001 12:50:24 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A314C.161FE431@ActiveState.com> <200106271930.f5RJUJw19910@odiug.digicool.com> Message-ID: <3B3A3900.CB73F3E0@ActiveState.com>

Guido van Rossum wrote:
>
>...
>
> Oooh, hadn't thought of that, but yes, it makes sense!
>
> Not yet implemented, but I think it should.  Makes for a nice pair
> of invariants:
>
>     unichr(ord('\Udddddddd')) == '\Udddddddd'
>     ord(unichr(0xdddddddd)) == 0xdddddddd
>
> regardless of whether we're using UCS-2 or UCS-4 storage.

I'm going to presume that ord should accept surrogate pairs on both
narrow and wide interpreters.

-- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook

From guido@digicool.com Wed Jun 27 20:53:25 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 15:53:25 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Wed, 27 Jun 2001 12:40:07 PDT." <3B3A3696.FFA7FCE@ActiveState.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A3696.FFA7FCE@ActiveState.com> Message-ID: <200106271953.f5RJrPi19963@odiug.digicool.com>

> Guido van Rossum wrote:
> >
> >...
> >
> > - unichr(i) for i >= 0x110000 (and hence corresponding \u and \U)
> >   raises an exception at Python-to-bytecode compile-time
>
> unichr(i) is an expression.  When would it be evaluated at
> compile-time?

My mistake.  The corresponding \U would be a compile-time error,
unichr() of course a run-time error.

> Also, I'm not sure what runtime behavior you want for these "very
> large" unichr(i) values.
>
> In general I don't understand why we're treating the > 0x110000 range
> specially at all?

When using UCS-2 + surrogate pairs (== UTF-16), they are not
representable, and the Unicode and ISO standards have effectively
declared that this will be the supported range forever.  (For *some*
definition of forever. :-)

When using UCS-4 mode, I was in favor of allowing unichr() and \U to
specify any value in range(0x100000000L), but that's not what Martin
and Fredrik checked in.

Note that if C code somehow creates a UCS-4 string containing
something with the high bit on, ord() will currently return a negative
value on platforms where a C long is 32 bits.  Returning a Python long
int with a positive value would be more consistent, but since these
values aren't useful, I wonder if we should care.

On the other hand, do we want ord() to raise an error when the value
is not a legal Unicode code point?
(Fortunately lone surrogates are still legal code points -- AFAIK all
values in range(0x110000) are legal code points.)

Definitely a PEP question; it's not cast in stone.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@digicool.com Wed Jun 27 20:57:12 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 15:57:12 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Wed, 27 Jun 2001 12:50:24 PDT." <3B3A3900.CB73F3E0@ActiveState.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A314C.161FE431@ActiveState.com> <200106271930.f5RJUJw19910@odiug.digicool.com> <3B3A3900.CB73F3E0@ActiveState.com> Message-ID: <200106271957.f5RJvC219975@odiug.digicool.com>

> Guido van Rossum wrote:
> >
> >...
> >
> > Oooh, hadn't thought of that, but yes, it makes sense!
> >
> > Not yet implemented, but I think it should.  Makes for a nice pair
> > of invariants:
> >
> >     unichr(ord('\Udddddddd')) == '\Udddddddd'
> >     ord(unichr(0xdddddddd)) == 0xdddddddd
> >
> > regardless of whether we're using UCS-2 or UCS-4 storage.
>
> I'm going to presume that ord should accept surrogate pairs on both
> narrow and wide interpreters.

That's a separate question.  On wide interpreters, surrogate pairs
"shouldn't" exist if the app plays by the rules.  But they're easily
created of course!  What should ord(u'\uD800\uDC00') mean on a wide
interpreter?  I think it's nice if you support this.

Of course, if a length-two Unicode string is anything other than a
high surrogate followed by a low surrogate, ord() should be illegal.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From rick@unicode.org Wed Jun 27 21:04:00 2001 From: rick@unicode.org (Rick McGowan) Date: Wed, 27 Jun 2001 13:04:00 -0700 Subject: [I18n-sig] Re: Unicode surrogates: just say no! In-Reply-To: <200106271344.JAA08050@unicode.org> (message from Rick McGowan on Wed, 27 Jun 2001 08:52:28 -0700) Message-ID: <200106271756.NAA11851@unicode.org>

"Martin v. Loewis" wrote:

> The next question then is what to do with lone surrogate triplets; the
> table in TR 27 suggests they are legal, but people on this list have
> argued they must neither be emitted nor consumed (since what you get
> is not a legal USV).

Part of the confusion everyone has is because the UTFs have been
envisioned as both (A) pure mathematical transformations of integer
spaces, and (B) transformations of coded characters.  But the
explanations have been muddled a little.  Part of the re-write that's
happening now in the Unicode editorial committee is dealing with this
confusion.  In the future, I hope that it can be clarified.

> an irregular sequence (of bytes, or code units) is not illegal.
> Also, I assume that negation of these concepts follows the English
> language rules (i.e. "not illegal" == "legal", "not ill-formed" ==
> "well-formed", etc)

Well, yes, you're right.  However, in English when something is
phrased as "not foo", that wording often carries the implication of
some shadiness that occupies the boundary between foo and anti-foo.
In this sense, "not illegal" does not mean the same thing as "legal".
"Not illegal" means something more like "socially backward and frowned
upon, but not worthy of legal prosecution in the strict sense".
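For concreteness, the pair-aware ord() being debated above amounts to
something like the following sketch; the name wide_ord and the error
message are mine, and present-day Python is assumed rather than the
actual builtin:

    def wide_ord(s):
        # One ordinary character: defer to the plain ord().
        if len(s) == 1:
            return ord(s)
        # A high surrogate followed by a low surrogate: one character.
        if (len(s) == 2 and 0xD800 <= ord(s[0]) <= 0xDBFF
                and 0xDC00 <= ord(s[1]) <= 0xDFFF):
            return (0x10000 + ((ord(s[0]) - 0xD800) << 10)
                    + (ord(s[1]) - 0xDC00))
        raise TypeError("expected a character or a surrogate pair")

    print(hex(wide_ord(u"\uD800\uDC00")))   # 0x10000

Any other length-two string falls through to the TypeError, matching
the "ord() should be illegal" position above.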
Here's my take on irregular sequences / lone surrogates:

If you have a process which is claiming to take in arbitrary data and
emit identical data in the same or different UTF, then it should
probably allow unpaired surrogates to be eaten, stored, and re-emitted
without error in the UTF-8 input case.

If you have a process which is claiming to take in legal characters
and transform them into something else, then you can (A) barf on lone
surrogates or (B) try to fix the situation.

Allowing the user of the API to decide which is preferable in a given
situation is probably the right answer.  I.e., the codec for UTF-8
reading/writing should have strict and non-strict modes.  And strict
mode should be the default.

> The issue of UTF-8 encoded surrogate pairs is clear now to me, I hope:
> You must not write them, but you may read them.

Exactly.  They could exist in nature; their existence cannot be ruled
out, and hence, it may transpire that you could be presented with one.

Rick

From paulp@ActiveState.com Wed Jun 27 21:10:45 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 27 Jun 2001 13:10:45 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> Message-ID: <3B3A3DC5.CA6767FD@ActiveState.com>

Guido van Rossum wrote:
>
>..
>
> Users can choose to write code that's portable between the two
> versions by using surrogates on the narrow platform but not on the
> wide platform.  (This would be a good idea for backward compatibility
> with Python 2.0 and 2.1 anyway.)  The proposed (and current!) behavior
> of \U makes it easy for them to do the right thing with string
> literals; everything else, they just have to write code that won't
> separate surrogate halves.

What is the virtue in making the literal syntax easy and making
unichr() easy when everything else is hard?  Counting characters is
hard.  Addressing characters reliably is hard.  Slicing reliably is
hard.  Why not simplify things?  Surrogates are just characters.  If
you want to handle wide characters you need to build Python that way.

I'm trying to imagine the use-case where you care about surrogates
enough to want them to be automatically generated but not enough to
care about slicing and addressing and counting and ...and is this
use-case worth breaking the invariant that len(unichr(i))==1?

Surrogates: Just say no. :)

-- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook

From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 21:23:00 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 22:23:00 +0200 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: <02c901c0ff30$ed0d9fa0$4ffa42d5@hagrid> (fredrik@pythonware.com) References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <200106262334.f5QNYGb18598@odiug.digicool.com> <200106270645.f5R6jBS06348@mira.informatik.hu-berlin.de> <200106271520.f5RFKE519522@odiug.digicool.com> <200106271646.f5RGkbR09253@mira.informatik.hu-berlin.de> <02c901c0ff30$ed0d9fa0$4ffa42d5@hagrid> Message-ID: <200106272023.f5RKN0b13150@mira.informatik.hu-berlin.de>

> before you leave, can you change the ./configure default to ucs2?

Done.

Martin

From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 21:24:41 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v.
Loewis) Date: Wed, 27 Jun 2001 22:24:41 +0200 Subject: [I18n-sig] Re: Unicode surrogates: just say no! In-Reply-To: <200106271744.f5RHiKO19739@odiug.digicool.com> (message from Guido van Rossum on Wed, 27 Jun 2001 13:44:20 -0400) References: <200106271344.JAA08050@unicode.org> <200106271706.f5RH6UT09389@mira.informatik.hu-berlin.de> <200106271744.f5RHiKO19739@odiug.digicool.com> Message-ID: <200106272024.f5RKOfL13151@mira.informatik.hu-berlin.de> > Neither of "strict", "replace" or "ignore" really matches the > semantics of (1) however; perhaps this behavior should be called > "lenient". Sounds good to me (although "lenient" is not even in my passive vocabulary); implementing it may take time, though. Regards, Martin From guido@digicool.com Wed Jun 27 21:49:18 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 16:49:18 -0400 Subject: [I18n-sig] Re: Unicode surrogates: just say no! In-Reply-To: Your message of "Wed, 27 Jun 2001 22:24:41 +0200." <200106272024.f5RKOfL13151@mira.informatik.hu-berlin.de> References: <200106271344.JAA08050@unicode.org> <200106271706.f5RH6UT09389@mira.informatik.hu-berlin.de> <200106271744.f5RHiKO19739@odiug.digicool.com> <200106272024.f5RKOfL13151@mira.informatik.hu-berlin.de> Message-ID: <200106272049.f5RKnIj20036@odiug.digicool.com> > > Neither of "strict", "replace" or "ignore" really matches the > > semantics of (1) however; perhaps this behavior should be called > > "lenient". > > Sounds good to me (although "lenient" is not even in my passive > vocabulary); Have a better suggestion? Maybe "liberal"? (The IETF motto is most often quoted as "be liberal in what you accept and conservative in what you send." Must be a reference to US politics. ;-) > implementing it may take time, though. Not too much -- there isn't a whole lot of checking of the error values until the error occurs, so I think this could be a codec-specific extension. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Wed Jun 27 21:54:37 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 16:54:37 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Wed, 27 Jun 2001 13:10:45 PDT." <3B3A3DC5.CA6767FD@ActiveState.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A3DC5.CA6767FD@ActiveState.com> Message-ID: <200106272054.f5RKsbL20050@odiug.digicool.com> > Guido van Rossum wrote: > > > >.. > > > > Users can choose to write code that's portable between the two > > versions by using surrogates on the narrow platform but not on the > > wide platform. (This would be a good idea for backward compatibility > > with Python 2.0 and 2.1 anyway.) The proposed (and current!) behavior > > of \U makes it easy for them to do the right thing with string > > literals; everything else, they just have to write code that won't > > separate surrogate halves. > > What is the virtue in making the literal syntax easy and making unichr() > easy when everything else is hard? Counting characters is hard. > Addressing characters reliably is hard. Slicing reliably is hard. Why > not simplify things? Surrogates are just characters. If you want to > handle wide characters you need to build Python that way. 
>
> I'm trying to imagine the use-case where you care about surrogates
> enough to want them to be automatically generated but not enough to
> care about slicing and addressing and counting and ...and is this
> use-case worth breaking the invariant that len(unichr(i))==1?
>
> Surrogates: Just say no. :)

\U has supported surrogate creation since Python 2.0 was released, but
I can't find a clear answer in PEP 100 (a.k.a. Misc/unicode.txt; \U
was added after that was finalized).

The use case I've been assuming is simple enough: someone wants to
print "Hello World" in Klingon.  They have a printing routine that
takes Unicode, but only an ASCII keyboard.  They look up the Unicode
values for the Klingon characters spelling "Hello World" in Klingon on
the web.  The characters happen to be in plane 17.  Do we really want
to place the additional burden on them to (a) figure out if their
Python interpreter uses UCS-2 or UCS-4, and (b) correctly implement
the surrogate creation algorithm on the UCS-2 platform?  I don't think
we should.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From rick@unicode.org Wed Jun 27 22:09:57 2001 From: rick@unicode.org (Rick McGowan) Date: Wed, 27 Jun 2001 14:09:57 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Wed, 27 Jun 2001 13:10:45 PDT." <3B3A3DC5.CA6767FD@ActiveState.com> Message-ID: <200106271902.PAA12747@unicode.org>

> someone wants to
> print "Hello World" in Klingon.  They have a printing routine that
> takes Unicode, but only an ASCII keyboard.  They look up the Unicode
> values for the Klingon characters spelling "Hello World" in Klingon

Whew!  Luckily we cut off this avenue for them.  See:
http://www.unicode.org/unicode/alloc/Pipeline.html
and scroll to the bottom.  ;-)

Rick

From JMachin@Colonial.com.au Wed Jun 27 23:50:13 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Thu, 28 Jun 2001 08:50:13 +1000 Subject: [I18n-sig] Unicode surrogates: just say no! Message-ID: <9F2D83017589D211BD1000805FA70CA703B139F6@ntxmel03.cmutual.com.au>

The "nice pair of invariants" for unichr() and ord() seems to involve
what I call "all that variable-length mucking about" and Tim more
robustly called "crap".

IMO, there should be a very short list of places where a narrow
Unicode implementation will need to know anything at all about
surrogates.  This short list will include codecs, the
\Uxxxxxxxx notation for literals, and unichr() --- the users can
ship it into the warehouse and ship it out again, but it won't be
processed as other than 16-bit values.  Attempts to place other
items on the list should be rigorously justified.

Guido asked:
    What should ord(u'\uD800\uDC00') mean on a wide interpreter?

IMO, this should mean an exception on *both* narrow and wide
interpreters, just as ord("xy") does.  ord() should expect one
and only one *character*.

Let's just keep on saying no!

-----Original Message-----
From: Guido van Rossum [mailto:guido@digicool.com]
Sent: Thursday, 28 June 2001 5:57
To: Paul Prescod
Cc: i18n-sig@python.org
Subject: Re: [I18n-sig] Unicode surrogates: just say no!

> Guido van Rossum wrote:
> >
> >...
> >
> > Oooh, hadn't thought of that, but yes, it makes sense!
> >
> > Not yet implemented, but I think it should.  Makes for a nice pair
> > of invariants:
> >
> >     unichr(ord('\Udddddddd')) == '\Udddddddd'
> >     ord(unichr(0xdddddddd)) == 0xdddddddd
> >
> > regardless of whether we're using UCS-2 or UCS-4 storage.
>
> I'm going to presume that ord should accept surrogate pairs on both
> narrow and wide interpreters.
That's a separate question.  On wide interpreters, surrogate pairs
"shouldn't" exist if the app plays by the rules.  But they're easily
created of course!  What should ord(u'\uD800\uDC00') mean on a wide
interpreter?  I think it's nice if you support this.

Of course, if a length-two Unicode string is anything other than a
high surrogate followed by a low surrogate, ord() should be illegal.

--Guido van Rossum (home page: http://www.python.org/~guido/)

_______________________________________________
I18n-sig mailing list
I18n-sig@python.org
http://mail.python.org/mailman/listinfo/i18n-sig

From paulp@ActiveState.com Wed Jun 27 23:54:48 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 27 Jun 2001 15:54:48 -0700 Subject: [I18n-sig] Python Support for "Wide" Unicode characters Message-ID: <3B3A6438.6DA39268@ActiveState.com>

PEP: 261
Title: Python Support for "Wide" Unicode characters
Version: 1.0
Author: paulp@activestate.com (Paul Prescod)
Status: Draft
Type: Standards Track
Python-Version: 2.2
Created: 27-Jun-2001
Post-History: 27-Jun-2001

Abstract

    Python 2.1 unicode characters can have ordinals only up to 65536.
    These characters are known as Basic Multilingual Plane characters.
    There are now characters in Unicode that live on other "planes".
    The largest addressable character in Unicode has the ordinal
    2**20 + 2**16 - 1.  For readability, we will call this TOPCHAR.

Proposed Solution

    One solution would be to merely increase the maximum ordinal to a
    larger value.  Unfortunately the only straightforward
    implementation of this idea is to increase the character code unit
    to 4 bytes.  This has the effect of doubling the size of most
    Unicode strings.  In order to avoid imposing this cost on every
    user, Python 2.2 will allow 4-byte Unicode characters as a
    build-time option.

    The 4-byte option is called "wide Py_UNICODE".  The 2-byte option
    is called "narrow Py_UNICODE".

    Most things will behave identically in the wide and narrow worlds.

    * the \u and \U literal syntaxes will always generate the same
      data that the unichr function would.  They are just different
      syntaxes for the same thing.

    * unichr(i) for 0 <= i < 2**16 always returns a size-one string.

    * unichr(i) for 2**16 <= i <= TOPCHAR will always return a string
      representing the character.

    * BUT on narrow builds of Python, the string will actually be
      composed of two characters called a "surrogate pair".

    * ord() will now accept surrogate pairs and return the ordinal of
      the "wide" character.  Open question: should it accept surrogate
      pairs on wide Python builds?

    * There is an integer value in the sys module that describes the
      largest ordinal for a Unicode character on the current
      interpreter.  sys.maxunicode is 2**16-1 on narrow builds of
      Python.  On wide builds it could be either TOPCHAR or 2**32-1.
      That's an open question.

    * Note that ord() can in some cases return ordinals higher than
      sys.maxunicode because it accepts surrogate pairs on narrow
      Python builds.
    * codecs will be upgraded to support "wide characters".  On
      narrow Python builds, the codecs will generate surrogate pairs;
      on wide Python builds they will generate a single character.

    * new codecs will be written for 4-byte Unicode and older codecs
      will be updated to recognize surrogates and map them to wide
      characters on wide Pythons.

    * there are no restrictions on constructing strings that use code
      points "reserved for surrogates" improperly.  These are called
      "lone surrogates".  The codecs should disallow reading these
      but you could construct them using string literals or unichr().

Implementation

    There is a new (experimental) define in Include/unicodeobject.h:

        #undef USE_UCS4_STORAGE

    if defined, Py_UNICODE is set to the same thing as Py_UCS4.

    There are new configure options:

        --enable-unicode=ucs2 configures a narrow Py_UNICODE, and
                              uses wchar_t if it fits
        --enable-unicode=ucs4 configures a wide Py_UNICODE likewise
        --enable-unicode      configures Py_UNICODE to wchar_t if
                              available, and to UCS-4 if not; this is
                              the default

    The intention is that --disable-unicode, or --enable-unicode=no
    removes the Unicode type altogether; this is not yet implemented.

Notes

    Note that len(unichr(i))==2 for i>=0x10000 on narrow machines.
    This means (for example) that the following code is not portable:

        x = 0x10000
        if unichr(x) in somestring:
            ...

    In general, you should be careful using "in" if the character
    that is searched for could have been generated from unichr
    applied to a number greater than 0x10000 or from a string literal
    greater than 0x10000.

    This PEP does NOT imply that people using Unicode need to use a
    4-byte encoding.  It only allows them to do so.  For example,
    ASCII is still a legitimate (7-bit) Unicode encoding.

Open Questions

    "Code points" above TOPCHAR cannot be expressed in two 16-bit
    characters.  These are not assigned to Unicode characters and
    supposedly will never be.  Should we allow them to be passed as
    arguments to unichr() anyhow?  We could allow knowledgeable
    programmers to use these "unused" characters for whatever they
    want, though Unicode does not address them.

    "Lone surrogates" "should not" occur on wide platforms.  Should
    ord() still accept them?

--
Take a recipe. Leave a recipe.
Python Cookbook!  http://www.ActiveState.com/pythoncookbook

From paulp@ActiveState.com Wed Jun 27 23:58:38 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 27 Jun 2001 15:58:38 -0700 Subject: No Klingon? was Re: [I18n-sig] Unicode surrogates: just say no! References: <200106271902.PAA12747@unicode.org> Message-ID: <3B3A651E.1C3EAEE0@ActiveState.com>

Rick McGowan wrote:
>
>...
>
> Whew!  Luckily we cut off this avenue for them.  See:
> http://www.unicode.org/unicode/alloc/Pipeline.html
> and scroll to the bottom.

You should have told us that Klingon was rejected before we went to
all of this work!  Did you think we were interested in the Japanese
dentistry characters?  The Wiggly Fences?  Shavian?

--
Take a recipe. Leave a recipe.
Python Cookbook!  http://www.ActiveState.com/pythoncookbook

From guido@digicool.com Wed Jun 27 23:58:36 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 18:58:36 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Thu, 28 Jun 2001 08:50:13 +1000."
<9F2D83017589D211BD1000805FA70CA703B139F6@ntxmel03.cmutual.com.au> References: <9F2D83017589D211BD1000805FA70CA703B139F6@ntxmel03.cmutual.com.au> Message-ID: <200106272258.f5RMwaZ20144@odiug.digicool.com> > The "nice pair of invariants" for unichr() and ord() seem to involve > what I call "all that variable-length mucking about" and Tim more > robustly called "crap". > > IMO, there should be a very short list of places where a narrow > Unicode implementation will need to know anything at all about > surrogates. This short list will include codecs, the > \Uxxxxxxxx notation for literals, and unichr() --- the users can > ship it into the warehouse and ship it out again, but it won't be > processed as other than 16-bit values. Attempts to place other > items on the list should be rigorously justified. Thanks, that's about what I wanted to say! But I assume you meant to include ord() in that list, as it is unichr()'s inverse. We should have one place that implements the surrogate creation magic (unichr) and one place that implements the surrogate unpacking magic (ord). (Plus \U, which is to act like unichr(), and codecs.) > Guido asked: > What should ord(u'\uD800\uDC00') mean on a wide interpreter? > > IMO, this should mean an exception on *both* narrow and wide > interpreters, just as ord("xy") does. ord() should expect one > and only one *character* But on a narrow interpreter, that's a valid surrogate pair, so it's a single character, so ord() *should* return 0x10000 for this example. > Let's just keep on saying no! Yes! --Guido van Rossum (home page: http://www.python.org/~guido/) From JMachin@Colonial.com.au Thu Jun 28 00:14:17 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Thu, 28 Jun 2001 09:14:17 +1000 Subject: [I18n-sig] Unicode surrogates: just say no! Message-ID: <9F2D83017589D211BD1000805FA70CA703B139F7@ntxmel03.cmutual.com.au> Guido said: But on a narrow interpreter, that's a valid surrogate pair, so it's a single character, so ord() *should* return 0x10000 for this example. IMO, once you say that a "valid surrogate pair" is a "single character" in a narrow implementation, people will want to do the indexing / slicing /dicing thing as well. ord() is just the thin end of the wedge. "No" should mean "no". unichr() and ord() should be inverses *only* in respect of scalar values up to sys.maxunicode. -----Original Message----- From: Guido van Rossum [mailto:guido@digicool.com] Sent: Thursday, 28 June 2001 8:59 To: Machin, John Cc: Paul Prescod; i18n-sig@python.org Subject: Re: [I18n-sig] Unicode surrogates: just say no! > The "nice pair of invariants" for unichr() and ord() seem to involve > what I call "all that variable-length mucking about" and Tim more > robustly called "crap". > > IMO, there should be a very short list of places where a narrow > Unicode implementation will need to know anything at all about > surrogates. This short list will include codecs, the > \Uxxxxxxxx notation for literals, and unichr() --- the users can > ship it into the warehouse and ship it out again, but it won't be > processed as other than 16-bit values. Attempts to place other > items on the list should be rigorously justified. Thanks, that's about what I wanted to say! But I assume you meant to include ord() in that list, as it is unichr()'s inverse. We should have one place that implements the surrogate creation magic (unichr) and one place that implements the surrogate unpacking magic (ord). (Plus \U, which is to act like unichr(), and codecs.) 
> Guido asked: > What should ord(u'\uD800\uDC00') mean on a wide interpreter? > > IMO, this should mean an exception on *both* narrow and wide > interpreters, just as ord("xy") does. ord() should expect one > and only one *character* But on a narrow interpreter, that's a valid surrogate pair, so it's a single character, so ord() *should* return 0x10000 for this example. > Let's just keep on saying no! Yes! --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Thu Jun 28 00:19:49 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 19:19:49 -0400 Subject: [I18n-sig] Python Support for "Wide" Unicode characters In-Reply-To: Your message of "Wed, 27 Jun 2001 15:54:48 PDT." <3B3A6438.6DA39268@ActiveState.com> References: <3B3A6438.6DA39268@ActiveState.com> Message-ID: <200106272319.f5RNJnO20162@odiug.digicool.com> Nice job, Paul! I especially like the notion of narrow and wide Pythons. :-) In the style of the PEP process, there should probably be some discussion of the alternatives that were proposed, considered and rejected, in particular (1) place the burden of surrogate handling on the application, possibly with limited library support, and (2) try to mend the unicode string object so that it is always indexed in characters, even if it contains surrogates. > PEP: 261 > Title: Python Support for "Wide" Unicode characters > Version: 1.0 > Author: paulp@activestate.com (Paul Prescod) > Status: Draft > Type: Standards Track > Python-Version: 2.2 > Created: 27-Jun-2001 > Post-History: 27-Jun-2001 I think PEPs should get wider distribution than a SIG. Maybe after the first round of comments on i18n-sig is over you can post it to c.l.py(.a) and python-dev? > Abstract > > Python 2.1 unicode characters can have ordinals only up to 65535. > These characters are known as Basic Multilingual Plane characters. > There are now characters in Unicode that live on other "planes". > The largest addressable character in Unicode has the ordinal > 2**20 + 2**16 - 1. For readability, we will call this TOPCHAR. I would express this as 17 * 2**16 - 1, to emphasize the fact that there are 17 planes of 2**16 characters each. > Proposed Solution > > One solution would be to merely increase the maximum ordinal to a > larger value. Unfortunately the only straightforward implementation > of this idea is to increase the character code unit to 4 bytes. This > has the effect of doubling the size of most Unicode strings. In > order to avoid imposing this cost on every user, Python 2.2 will > allow 4-byte Unicode characters as a build-time option. > > > The 4-byte option is called "wide Py_UNICODE". The 2-byte option > is called "narrow Py_UNICODE". > > Most things will behave identically in the wide and narrow worlds. > > * the \u and \U literal syntaxes will always generate the same > data that the unichr function would. They are just different > syntaxes for the same thing.
> > * unichr(i) for 0 <= i < 2**16 always returns a size-one string. > > * unichr(i) for 2**16 <= i <= TOPCHAR will always > return a string representing the character. > > * BUT on narrow builds of Python, the string will actually be > composed of two characters called a "surrogate pair". Can't call these characters. Maybe use "characters" in quotes, maybe use code points or items. > * ord() will now accept surrogate pairs and return the ordinal of > the "wide" character. Open question: should it accept surrogate > pairs on wide Python builds? After thinking about it, I think it should. Apps that are written specifically to handle surrogates (e.g. a conversion tool to remove surrogates!) should work on wide interpreters, and ord() is the only way to get the character value from a surrogate pair (short of implementing the shifts and masks yourself, which is doable but a pain). > * There is an integer value in the sys module that describes the > largest ordinal for a Unicode character on the current > interpreter. sys.maxunicode is 2**16-1 on narrow builds of > Python. On wide builds it could be either TOPCHAR > or 2**32-1. That's an open question. Given its name I think it should be TOPCHAR, even if unichr() accepts larger values. > * Note that ord() can in some cases return ordinals > higher than sys.maxunicode because it accepts surrogate pairs > on narrow Python builds. > > * codecs will be upgraded to support "wide characters". On narrow > Python builds, the codecs will generate surrogate pairs, on > wide Python builds they will generate a single character. Maybe add a note that this is the main thing that hasn't been fully implemented yet; everything else except the extended ord() is implemented now, AFAIK. > * new codecs will be written for 4-byte Unicode and older codecs > will be updated to recognize surrogates and map them to wide ^^^^^^^^^^ Make that "surrogate pairs" > characters on wide Pythons. > > * there are no restrictions on constructing strings that use > code points "reserved for surrogates" improperly. These are > called "lone surrogates". The codecs should disallow reading > these but you could construct them using string literals or > unichr(). > > Implementation > > There is a new (experimental) define in Include/unicodeobject.h: > > #undef USE_UCS4_STORAGE > > if defined, Py_UNICODE is set to the same thing as Py_UCS4. > > USE_UCS4_STORAGE USE_UCS4_STORAGE is no more. Long live Py_UNICODE_SIZE (2 or 4). > There are new configure options: > > --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses > wchar_t if it fits > --enable-unicode=ucs4 configures a wide Py_UNICODE likewise > --enable-unicode configures Py_UNICODE to wchar_t if > available, > and to UCS-4 if not; this is the default Not any more; the default is ucs2 now. > The intention is that --disable-unicode, or --enable-unicode=no > removes the Unicode type altogether; this is not yet implemented. > > Notes > > Note that len(unichr(i))==2 for i>=0x10000 on narrow machines. > > This means (for example) that the following code is not portable: > > x = 0x10000 > if unichr(x) in somestring: > ... > > In general, you should be careful using "in" if the character > that is searched for could have been generated from unichr applied > to a number greater than 0x10000 or from a \U string literal with > a value greater than 0x10000. I suppose we *could* fix the __contains__ implementation for Unicode objects, but I'm -0 on that. > This PEP does NOT imply that people using Unicode need to use a > 4-byte encoding.
It only allows them to do so. For example, ASCII > is still a legitimate (7-bit) Unicode-encoding. > > Open Questions > > "Code points" above TOPCHAR cannot be expressed in two 16-bit > characters. These are not assigned to Unicode characters and > supposedly will never be. Should we allow them to be passed as > arguments to unichr() anyhow? We could allow knowledgeable > programmers to use these "unused" characters for whatever > they want, though Unicode does not address them. > > "Lone surrogates" "should not" occur on wide platforms. Should > ord() still accept them? Unclear what you tried to say here. You already explained that there are no restrictions on the use of lone surrogates, so ord() has no choice (It would be pretty bad if you could construct a 1-code-point string but ord() couldn't tell you what that code point was). Or did you mean "should ord() accept surrogate pairs?" That question was already asked above. Or did you mean this to be a summary of all open issues? Then there are several more. Nit: there's no copyright clause. All PEPs should have one. Again, thanks!!! --Guido van Rossum (home page: http://www.python.org/~guido/) From paulp@ActiveState.com Thu Jun 28 00:40:36 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 27 Jun 2001 16:40:36 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! References: <9F2D83017589D211BD1000805FA70CA703B139F7@ntxmel03.cmutual.com.au> Message-ID: <3B3A6EF4.A62BD417@ActiveState.com> "Machin, John" wrote: > >... > > IMO, once you say that a "valid surrogate pair" is a "single > character" in a narrow implementation, people will want to do > the indexing / slicing /dicing thing as well. ord() is just the > thin end of the wedge. I'll see your puritanism and raise: unichr(bignum) and \Ubignum are the thin edge of the wedge. :) I would still prefer to abolish the notion of surrogates from anything except codecs. Or at least abolish them now and see if anyone screams. We should do the simplest thing possible and see what happens. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From guido@digicool.com Thu Jun 28 00:38:04 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 19:38:04 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Thu, 28 Jun 2001 09:14:17 +1000." <9F2D83017589D211BD1000805FA70CA703B139F7@ntxmel03.cmutual.com.au> References: <9F2D83017589D211BD1000805FA70CA703B139F7@ntxmel03.cmutual.com.au> Message-ID: <200106272338.f5RNc5k20236@odiug.digicool.com> > Guido said: > But on a narrow interpreter, that's a valid surrogate pair, so it's a > single character, so ord() *should* return 0x10000 for this example. > > IMO, once you say that a "valid surrogate pair" is a "single > character" in a narrow implementation, people will want to do > the indexing / slicing /dicing thing as well. ord() is just the > thin end of the wedge. > > "No" should mean "no". > > unichr() and ord() should be inverses *only* > in respect of scalar values up to sys.maxunicode. Your position is weakened by inconsistency. If you really wanted to be consistent, you should argue against \U and unichr() with ordinals >= 0x10000 on narrow Pythons. :-) IMO ord() and unichr() are so closely tied that either both of them should support surrogate pairs, or none. You know my position. It's not usable as a wedge to get the indexing/slicing/dicing, because the implementation would be too complicated, and we have the wide Python as a mighty weapon.
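For reference, the creation and unpacking "magic" being discussed is only a few lines of arithmetic; a minimal sketch in plain Python (hypothetical helper names, not a proposed API):

    def make_pair(i):
        # Split a code point above 0xFFFF into a high/low surrogate pair.
        assert 0x10000 <= i <= 0x10FFFF
        i = i - 0x10000
        return unichr(0xD800 + (i >> 10)) + unichr(0xDC00 + (i & 0x3FF))

    def pair_value(hi, lo):
        # Recombine a high/low surrogate pair into a single code point.
        assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

For example, pair_value(0xD800, 0xDC00) gives 0x10000, matching the ord() behaviour argued for above.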
BTW, I quoted Paul: > > * ord() will now accept surrogate pairs and return the ordinal of > > the "wide" character. Open question: should it accept surrogate > > pairs on wide Python builds? and replied: > After thinking about it, I think it should. Apps that are written > specifically to handle surrogates (e.g. a conversion tool to remove > surrogates!) should work on wide interpreters, and ord() is the only > way to get the character value from a surrogate pair (short of > implementing the shifts and masks yourself, which is doable but a > pain). I take that back. On wide Pythons, unichr() doesn't return surrogates either.
Once the whole world uses UCS-4 (around the time Python 3000 is released :-), surrogates can be deprecated anyway. --Guido van Rossum (home page: http://www.python.org/~guido/) From JMachin@Colonial.com.au Thu Jun 28 01:05:39 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Thu, 28 Jun 2001 10:05:39 +1000 Subject: [I18n-sig] Unicode surrogates: just say no! Message-ID: <9F2D83017589D211BD1000805FA70CA703B139F8@ntxmel03.cmutual.com.au> OK. I take (most of) your point on consistency between unichr() and ord(). However there is a practical problem with ord(surrogate_pair) on a narrow Python.

    ord('\x01') -> 1
    ord('\x01\x02') -> exception
    ord(u'\u0001') -> 1
    ord(u'\u0001\u0002') -> exception
    ord(u'\ud800\udc00') -> 0x10000 # magic!

so either (a) a programmer wanting to write (say) the conversion tool that you mentioned still has to work very hard or (b) we redefine ord() so that the arg may also be a Unicode string, and it returns the ordinal of the first character (which may involve two code units) or (c) we provide some other functionality for unpacking Unicode strings into ints -----Original Message----- From: Guido van Rossum [mailto:guido@digicool.com] Sent: Thursday, 28 June 2001 9:38 To: Machin, John Cc: i18n-sig@python.org Subject: Re: [I18n-sig] Unicode surrogates: just say no! > Guido said: > But on a narrow interpreter, that's a valid surrogate pair, so it's a > single character, so ord() *should* return 0x10000 for this example. > > IMO, once you say that a "valid surrogate pair" is a "single > character" in a narrow implementation, people will want to do > the indexing / slicing /dicing thing as well. ord() is just the > thin end of the wedge. > > "No" should mean "no". > > unichr() and ord() should be inverses *only* > in respect of scalar values up to sys.maxunicode. Your position is weakened by inconsistency. If you really wanted to be consistent, you should argue against \U and unichr() with ordinals >= 0x10000 on narrow Pythons. :-) IMO ord() and unichr() are so closely tied that either both of them should support surrogate pairs, or none. You know my position. It's not usable as a wedge to get the indexing/slicing/dicing, because the implementation would be too complicated, and we have the wide Python as a mighty weapon. BTW, I quoted Paul: > > * ord() will now accept surrogate pairs and return the ordinal of > > the "wide" character. Open question: should it accept surrogate > > pairs on wide Python builds? and replied: > After thinking about it, I think it should. Apps that are written > specifically to handle surrogates (e.g. a conversion tool to remove > surrogates!) should work on wide interpreters, and ord() is the only > way to get the character value from a surrogate pair (short of > implementing the shifts and masks yourself, which is doable but a > pain). I take that back. On wide Pythons, unichr() doesn't return surrogates either. Once the whole world uses UCS-4 (around the time Python 3000 is released :-), surrogates can be deprecated anyway. --Guido van Rossum (home page: http://www.python.org/~guido/) From paulp@ActiveState.com Thu Jun 28 01:20:39 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 27 Jun 2001 17:20:39 -0700 Subject: [I18n-sig] Python Support for "Wide" Unicode characters References: <3B3A6438.6DA39268@ActiveState.com> <200106272319.f5RNJnO20162@odiug.digicool.com> Message-ID: <3B3A7857.1593F72@ActiveState.com> Guido van Rossum wrote: > >... > > In the style of the PEP process, there should probably be some > discussion of the alternatives that were proposed, considered and > rejected, in particular (1) place the burden of surrogate handling on > the application, possibly with limited library support, > and (2) try to > mend the unicode string object so that it is always indexed in > characters, even if it contains surrogates. Okay. > > I think PEPs should get wider distribution than a SIG. Maybe after > the first round of comments on i18n-sig is over you can post it to > c.l.py(.a) and python-dev? I agree. That's what I intended. I thought it would be confusing if I posted to the other areas before I had all of my facts right. > I would express this as 17 * 2**16 - 1, to emphasize the fact that > there are 17 planes of 2**16 characters each. Done. > > * BUT on narrow builds of Python, the string will actually be > > composed of two characters called a "surrogate pair". > > Can't call these characters. Maybe use "characters" in quotes, maybe > use code points or items. I think they ARE characters in the Python, not Unicode sense. So I said: * BUT on narrow builds of Python, the string will actually be composed of two characters (in the Python, not Unicode sense) called a "surrogate pair". These two Python characters are logically one Unicode character. > > * There is an integer value in the sys module that describes the > > largest ordinal for a Unicode character on the current > > interpreter. sys.maxunicode is 2**16-1 on narrow builds of > > Python. On wide builds it could be either TOPCHAR > > or 2**32-1. That's an open question. > > Given its name I think it should be TOPCHAR, even if unichr() accepts > larger values. Maybe there is a virtue in having a way to both ask for the largest *legal* Unicode character and the largest character that will fit into a Python character on the platform. I mean in theory the maximum Unicode character is constant but that doesn't mean I want to declare it in my programs explicitly.

    unicodedata.maxchar => always TOPCHAR
    sys.maxunicode => some power of 2 - 1

I'm not entirely happy that we call a thing "sys.maxunicode" and then tell people how to generate larger values. How about sys.maxcodeunit? (or we could remove the whole surrogate building stuff :) ) Do you want to rule on this or call it an open issue?
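Whatever the constant ends up being called, the way applications would use it is the same; a small sketch of branching on build width (assuming only the sys.maxunicode value proposed in the PEP):

    import sys

    if sys.maxunicode > 0xFFFF:
        build = "wide"    # every code point is a single Python character
    else:
        build = "narrow"  # code points above 0xFFFF appear as surrogate pairs

This is essentially the heuristic Guido suggests below ("if sys.maxunicode >= 2**16 then a unicode character can store 32 bits").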
> > * Note that ord() can in some cases return ordinals > > higher than sys.maxunicode because it accepts surrogate pairs > > on narrow Python builds. And if sys.maxunicode is TOPCHAR then you can also get ords greater than sys.maxunicode just by using unichr on values larger than sys.maxunicode. > > * codecs will be upgraded to support "wide characters". On narrow > > Python builds, the codecs will generate surrogate pairs, on > > wide Python builds they will generate a single character. > > Maybe add a note that this is the main thing that hasn't been fully > implemented yet; everything else except the extended ord() is > implemented now, AFAIK. Done. > > * new codecs will be written for 4-byte Unicode and older codecs > > will be updated to recognize surrogates and map them to wide > ^^^^^^^^^^ > Make that "surrogate pairs" Done. > > USE_UCS4_STORAGE > > USE_UCS4_STORAGE is no more. Long live Py_UNICODE_SIZE (2 or 4). Okay. > > There are new configure options: > > > > --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses > > wchar_t if it fits > > --enable-unicode=ucs4 configures a wide Py_UNICODE likewise > > --enable-unicode configures Py_UNICODE to wchar_t if > > available, > > and to UCS-4 if not; this is the default > > Not any more; the default is ucs2 now. So there is no way to get the heuristic of "wchar_t if available, UCS-4 if not". I'm not complaining, just checking. The list of options is just two with ucs2 the default. >... Or did you mean this to be a summary of all open > issues? Then there are several more. What are the open issues in your mind...I'm not clear on what things you've expressed an opinion on and what things you've ruled on. > Nit: there's no copyright clause. All PEPs should have one. Okay. When I hear from you I'll update it. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From rick@unicode.org Thu Jun 28 01:31:11 2001 From: rick@unicode.org (Rick McGowan) Date: Wed, 27 Jun 2001 17:31:11 -0700 Subject: [I18n-sig] Python Support for "Wide" Unicode characters Message-ID: <200106272224.SAA15484@unicode.org> I don't suppose that anyone has actually considered just using a 24-bit scalar type? What would be the downside to doing so? Rick From rick@unicode.org Thu Jun 28 01:34:52 2001 From: rick@unicode.org (Rick McGowan) Date: Wed, 27 Jun 2001 17:34:52 -0700 Subject: No Klingon? was Re: [I18n-sig] Unicode surrogates: just say no! Message-ID: <200106272228.SAA15535@unicode.org> Oh, sorry, Paul... The venerable work in Python unfortunately preceded the rather recent rejection of Klingon. We didn't think anyone was using it! Now, if you'd beamed an armed party into a meeting when I was casting about for some serious reps from the Klingon empire, we could have saved everyone the trouble of rejecting it... ;-) Rick > You should have told us that Klingon was rejected before we went to all > of this work! Did you think we were interested in the Japanese dentistry > characters? The Wiggly Fences? Shavian? From guido@digicool.com Thu Jun 28 01:43:44 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 20:43:44 -0400 Subject: [I18n-sig] Python Support for "Wide" Unicode characters In-Reply-To: Your message of "Wed, 27 Jun 2001 17:31:11 PDT." <200106272224.SAA15484@unicode.org> References: <200106272224.SAA15484@unicode.org> Message-ID: <200106280043.f5S0hi520359@odiug.digicool.com> > I don't suppose that anyone has actually considered just using a 24-bit > scalar type?
What would be the downside to doing so? > > Rick Because of alignment requirements and the absence in general of a 3-byte integral type in C, you can't extract a 24-bit integer given its address without doing something like two shifts and two OR operations. For mostly the same reasons you also can't declare arrays of 3-byte integers, so you'd have to do all your address arithmetic yourself. While none of this makes it impossible, it makes it impractical, because every place in the code that indexes or declares a Py_UNICODE array would have to be patched. The elegance of the 4-byte approach is that almost all code continues to work without changes. (Technically, it's the "smallest integral type containing at least 32 bits" approach. C guarantees there always is such a type, since long is guaranteed to be at least 32 bits. I suppose we could try to be exact and use the "smallest integral type containing at least 21 bits" approach, but it wouldn't make a difference on current practical hardware. It would have 20 years ago, when machines with 24 or 28 bits per word were common. :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Thu Jun 28 01:47:29 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 20:47:29 -0400 Subject: [I18n-sig] Python Support for "Wide" Unicode characters In-Reply-To: Your message of "Wed, 27 Jun 2001 17:20:39 PDT." <3B3A7857.1593F72@ActiveState.com> References: <3B3A6438.6DA39268@ActiveState.com> <200106272319.f5RNJnO20162@odiug.digicool.com> <3B3A7857.1593F72@ActiveState.com> Message-ID: <200106280047.f5S0lTQ20371@odiug.digicool.com> I agree with everything I deleted from the quoting below! > > Given its name I think it should be TOPCHAR, even if unichr() accepts > > larger values. > > Maybe there is a virtue in having a way to both ask for the largest > *legal* Unicode character and the largest character that will fit into a > Python character on the platform. I mean in theory the maximum Unicode > character is constant but that doesn't mean I want to declare it in my > programs explicitly. > > unicodedata.maxchar => always TOPCHAR > sys.maxunicode => some power of 2 - 1 > > I'm not entirely happy that we call a thing "sys.maxunicode" and then > tell people how to generate larger values. How about sys.maxcodeunit? > (or we could remove the whole surrogate building stuff :) ) > > Do you want to rule on this or call it an open issue? Leave it open; personally I'd be happy with the heuristic "if sys.maxunicode >= 2**16 then a unicode character can store 32 bits". > What are the open issues in your mind...I'm not clear on what things > you've expressed an opinion on and what things you've ruled on. Sorry. I meant that there were two open issues listed earlier in the PEP, and one of those was repeated here, so I wasn't sure if this was intended to be a summary and you missed one, or it was intended to be additional open issues and you had a duplicate. Either way is fine but I think you should make up your mind. :-) > When I hear from you I'll update it. Go ahead! --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Thu Jun 28 01:50:37 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 20:50:37 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Thu, 28 Jun 2001 10:05:39 +1000."
<9F2D83017589D211BD1000805FA70CA703B139F8@ntxmel03.cmutual.com.au> References: <9F2D83017589D211BD1000805FA70CA703B139F8@ntxmel03.cmutual.com.au> Message-ID: <200106280050.f5S0obA20385@odiug.digicool.com> > OK. I take (most of) your point on consistency between unichr() and ord(). > > However there is a practical problem with ord(surrogate_pair) on a > narrow Python. > > ord('\x01') -> 1 > ord('\x01\x02') -> exception > ord(u'\u0001') -> 1 > ord(u'\u0001\u0002') -> exception > ord(u'\ud800\udc00') -> 0x10000 # magic! > > so either > (a) a programmer wanting to write (say) the > conversion tool that you mentioned still has to work very hard > or (b) we redefine ord() so that the arg may also be a Unicode > string, and it returns the ordinal of the first character (which may involve > two code units) > or (c) we provide some other functionality for unpacking Unicode strings > into ints Yes, the longer I think about this the less I like it. Unfortunately, the surrogate-creating behavior of \U is present in 2.0 and 2.1, so I think we can't reasonably remove this from narrow Python 2.2, and I like the rule that unichr and \U match. But maybe that's the one that should go, and unichr() and ord() should deal with single code points only. Then sys.maxunicode should be the largest value that unichr() will accept. This could be 0xffff (narrow Python), 0x10ffff (wide Python with strict unichr()), or 0xffffffffL (wide Python with liberal unichr()). The latter is an open PEP issue. --Guido van Rossum (home page: http://www.python.org/~guido/) From fredrik@pythonware.com Thu Jun 28 01:56:59 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Thu, 28 Jun 2001 02:56:59 +0200 Subject: [I18n-sig] Python Support for "Wide" Unicode characters References: <200106272224.SAA15484@unicode.org> Message-ID: <002501c0ff6d$3cfc4de0$4ffa42d5@hagrid> Rick McGowan wrote: > I don't suppose that anyone has actually considered just using a 24-bit > scalar type? What would be the downside to doing so? nothing stops you from using 24-bit unsigned integers, if your compiler supports them. Cheers /F From JMachin@Colonial.com.au Thu Jun 28 02:23:55 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Thu, 28 Jun 2001 11:23:55 +1000 Subject: [I18n-sig] Unicode surrogates: just say no! Message-ID: <9F2D83017589D211BD1000805FA70CA703B139FA@ntxmel03.cmutual.com.au> > Unfortunately, the surrogate-creating behavior of \U > is present in 2.0 and 2.1, so I > think we can't reasonably remove this from narrow Python 2.2, and I > like the rule that unichr and \U match. But maybe that's the one that > should go, and unichr() and ord() should deal with single code points > only. My understanding is that very few people noticed that \U was creating surrogate pairs, and my guess would be that nobody would be affected in practice by stopping this behaviour. IOW, I suggest treating "\U -> surrogate pairs" just like the more esoteric parts of xrange() -- or the "Korean mess" in earlier Unicode -- just bury it and move on. IMO, the type of people wanting to fiddle with surrogate pairs in narrow Python would also be capable of whipping up a C extension to unpack a narrow Unicode string into a list of ints and do the shifting and masking necessary with surrogates. If this is not so, then the next preference would be for "someone" to write such a C extension and publicise it. I would volunteer to be that "someone" in the interests of not burdening ord() with "magic".
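For scale, the pure-Python equivalent of such an extension is short; a sketch (hypothetical function name, shown in Python rather than C for brevity):

    def to_scalars(u):
        # Unpack a narrow-build unicode string into a list of code
        # points, combining each well-formed surrogate pair on the way.
        result = []
        i, n = 0, len(u)
        while i < n:
            c = ord(u[i])            # u[i] is a single 16-bit code unit
            if 0xD800 <= c <= 0xDBFF and i + 1 < n:
                c2 = ord(u[i + 1])
                if 0xDC00 <= c2 <= 0xDFFF:
                    c = 0x10000 + ((c - 0xD800) << 10) + (c2 - 0xDC00)
                    i = i + 1
            result.append(c)         # lone surrogates pass through as-is
            i = i + 1
        return result

A C version would do the same shifting and masking directly over the Py_UNICODE buffer.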
From rick@unicode.org Thu Jun 28 02:23:21 2001 From: rick@unicode.org (Rick McGowan) Date: Wed, 27 Jun 2001 18:23:21 -0700 Subject: [I18n-sig] Python Support for "Wide" Unicode characters In-Reply-To: Your message of "Wed, 27 Jun 2001 17:31:11 PDT." <200106272224.SAA15484@unicode.org> Message-ID: <200106272316.TAA16083@unicode.org> Re 24-bit scalar type... Guido, those are good reasons & lotsa juicy downsides. Thanks. Rick From paulp@ActiveState.com Thu Jun 28 02:41:17 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 27 Jun 2001 18:41:17 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! References: <9F2D83017589D211BD1000805FA70CA703B139F8@ntxmel03.cmutual.com.au> <200106280050.f5S0obA20385@odiug.digicool.com> Message-ID: <3B3A8B3D.B838CB80@ActiveState.com> Guido van Rossum wrote: > >... > > Yes, the longer I think about this the less I like it. Unfortunately, > the surrogate-creating behavior of \U is present in 2.0 and 2.1, so I > think we can't reasonably remove this from narrow Python 2.2, and I'm having a hard time caring about backwards compatibility much here. And I can't square it with your enthusiasm for ripping the guts out of poor old xrange. We're talking about a certain kind of *literal* right? Even ASCII literals are rare in my code. Unicode literals are extremely rare. Now consider that we're talking about Unicode literals for characters so obscure that they were passed over by the first three versions of Unicode. And so new that most people don't even know that they are part of Unicode. Let's just put a deprecation warning in for \U where you've asked for a character larger than your build's code unit size. And if there is a need, someone, somewhere will write a beautiful surrogates library that handles all details of surrogate handling. > Then sys.maxunicode should be the largest value that unichr() will > accept. This could be 0xffff (narrow Python), 0x10ffff (wide Python > with strict unichr()), or 0xffffffffL (wide Python with liberal > unichr()). The latter is an open PEP issue. Okay. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From barry@digicool.com Thu Jun 28 02:47:46 2001 From: barry@digicool.com (Barry A. Warsaw) Date: Wed, 27 Jun 2001 21:47:46 -0400 Subject: [I18n-sig] Python Support for "Wide" Unicode characters References: <3B3A6438.6DA39268@ActiveState.com> <200106272319.f5RNJnO20162@odiug.digicool.com> Message-ID: <15162.36034.762209.479359@anthem.wooz.org> GvR> Nit: there's no copyright clause. All PEPs should have one. Whoops, I forgot to nag Paul about that. Feel free to add one when you revise the PEP, Paul . -Barry From tim.one@home.com Thu Jun 28 05:55:08 2001 From: tim.one@home.com (Tim Peters) Date: Thu, 28 Jun 2001 00:55:08 -0400 Subject: [I18n-sig] Unicode surrogates: just say no!
In-Reply-To: <3B3A8B3D.B838CB80@ActiveState.com> Message-ID: [Guido] > Unfortunately, the surrogate-creating behavior of \U is present in > 2.0 and 2.1, so I think we can't reasonably remove this from narrow > Python 2.2 [Paul Prescod] > I'm having a hard time caring about backwards compatibility much here. > And I can't square it with your enthusiasm for ripping the guts out of > poor old xrange. But there's a HUGE difference. The xrange() behaviors we're seeking to shed have been documented for years. But the Python 2.1 Reference Manual's section on Unicode literals reads: 2.4.3 Unicode literals XXX explain more here... in its entirety, and the word "surrogate" appears nowhere at all. Well, OK, it's mentioned twice in unicode.txt, both times in a disclaimer sense ("we don't need no stinkin' surrogates -- and neither do you"). See? I thought you would, if someone just paused to explain it . > ... > Let's just put a deprecation warning in for \U where you've asked for a > character larger than your build's code unit size. More consideration than it merits, if anyone were silly enough to ask me. From tim.one@home.com Thu Jun 28 06:10:33 2001 From: tim.one@home.com (Tim Peters) Date: Thu, 28 Jun 2001 01:10:33 -0400 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: Message-ID: [discussion about PyUnicode_DecodeUTF16] It's nice that we got to chat about portability to Platforms from Mars, but is anyone actually going to work on that function? It shouldn't be hard, I just don't want to see it fall thru the cracks. otoh-falling-between-the-surrogates-is-fine-ly y'rs - tim From paulp@ActiveState.com Thu Jun 28 06:25:12 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 27 Jun 2001 22:25:12 -0700 Subject: [I18n-sig] Support for "wide" Unicode characters Message-ID: <3B3ABFB8.84C7510B@ActiveState.com> Round 2: I can't check in right now but I'll collect another round of suggestions and then post this to other lists tomorrow. ---- PEP: 261 Title: Support for "wide" Unicode characters Version: $Revision: 1.2 $ Author: paulp@activestate.com (Paul Prescod) Status: Draft Type: Standards Track Created: 27-Jun-2001 Python-Version: 2.2 Post-History: 27-Jun-2001 Abstract Python 2.1 unicode characters can have ordinals only up to 65535. These characters are known as Basic Multilingual Plane characters. There are now characters in Unicode that live on other "planes". The largest addressable character in Unicode has the ordinal 17 * 2**16 - 1. For readability, we will call this TOPCHAR. Proposed Solution One solution would be to merely increase the maximum ordinal to a larger value. Unfortunately the only straightforward implementation of this idea is to increase the character code unit to 4 bytes. This has the effect of doubling the size of most Unicode strings. In order to avoid imposing this cost on every user, Python 2.2 will allow 4-byte Unicode characters as a build-time option. The 4-byte option is called "wide Py_UNICODE". The 2-byte option is called "narrow Py_UNICODE". Most things will behave identically in the wide and narrow worlds. * the \u and \U literal syntaxes will always generate the same data that the unichr function would. They are just different syntaxes for the same thing. * unichr(i) for 0 <= i < 2**16 always returns a size-one string. * unichr(i) for 2**16 <= i <= TOPCHAR will always return a string representing the character. * BUT on narrow builds of Python, the string will actually be composed of two characters (in the Python, not Unicode sense) called a "surrogate pair".
These two Python characters are logically one Unicode character. ISSUE: Should Python return surrogate pairs on narrow builds or should it just disallow them? ISSUE: Should the upper bound of the domain of unichr and range of ord() be TOPCHAR or 2**32-1 or even 2**31? * ord() will now accept surrogate pairs and return the ordinal of the "wide" character. ISSUE: Should Python accept surrogate pairs on wide Python builds? * There is an integer value in the sys module that describes the largest ordinal for a Unicode character on the current interpreter. sys.maxunicode is 2**16-1 on narrow builds of Python. ISSUE: Should sys.maxunicode be TOPCHAR or 2**32-1 or even 2**31 on wide builds? ISSUE: Should there be distinct constants for accessing TOPCHAR and the real upper bound for the domain of unichr? * Note that ord() can in some cases return ordinals higher than sys.maxunicode because it accepts surrogate pairs on narrow Python builds. * codecs will be upgraded to support "wide characters" (represented directly in UCS-4, as surrogate pairs in UTF-16 and as multi-byte sequences in UTF-8). On narrow Python builds, the codecs will generate surrogate pairs, on wide Python builds they will generate a single character. This is the main part of the implementation left to be done. * there are no restrictions on constructing strings that use code points "reserved for surrogates" improperly. These are called "lone surrogates". The codecs should disallow reading these but you could construct them using string literals or unichr(). unichr() is not restricted to values less than either TOPCHAR or sys.maxunicode. ISSUE: Should lone surrogates be allowed as input to ord even on wide platforms where they "should" not occur? Implementation There is a new (experimental) define: #define PY_UNICODE_SIZE 2 There are new configure options: --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses wchar_t if it fits --enable-unicode=ucs4 configures a wide Py_UNICODE likewise --enable-unicode same as "=ucs2" The intention is that --disable-unicode, or --enable-unicode=no removes the Unicode type altogether; this is not yet implemented. Notes Note that len(unichr(i))==2 for i>=2**16 on narrow machines because of the returned surrogates. This means (for example) that the following code is not portable:

    x = 2**16
    if unichr(x) in somestring:
        ...

In general, you should be careful using "in" if the character that is searched for could have been generated from unichr applied to a number greater than 2**16 or from a \U string literal with a value greater than 2**16. This PEP does NOT imply that people using Unicode need to use a 4-byte encoding. It only allows them to do so. For example, ASCII is still a legitimate (7-bit) Unicode-encoding. Rationale for Surrogate Creation Behaviour Python currently supports the construction of a surrogate pair for a large unicode literal character escape sequence. This is basically designed as a simple way to construct "wide characters" even in a narrow Python build. ISSUE: surrogates can be created this way but the user still needs to be careful about slicing, indexing, printing etc. Another option is to remove knowledge of surrogates from everything other than the codecs. Rejected Suggestions There were two primary solutions that were rejected. The first was more or less the status-quo. We could officially say that UTF-16 is the Python character encoding and require programmers to implement wide characters in their application logic.
This is a heavy burden because emulating 32-bit characters is likely to be very inefficient if it is coded entirely in Python. The other solution is to use UTF-16 (or even UTF-8) internally (for efficiency) but present an abstraction of 32-bit characters to the programmer. This would require a much more complex implementation than the accepted solution. In theory, we could move to this implementation in the future without breaking Python code. It would just emulate a wide Python build on narrow Pythons. Copyright This document has been placed in the public domain. Local Variables: mode: indented-text indent-tabs-mode: nil End: -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 21:44:05 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 22:44:05 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <3B3A314C.161FE431@ActiveState.com> (message from Paul Prescod on Wed, 27 Jun 2001 12:17:32 -0700) References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A314C.161FE431@ActiveState.com> Message-ID: <200106272044.f5RKi5g13272@mira.informatik.hu-berlin.de> > > - unichr(i) for 0x10000 <= i <= 0x10ffff (and hence corresponding \u > > and \U) generates a surrogate pair, where u[0] is the high > > surrogate value and u[1] the low surrogate value > > Does this imply that ord() should take in surrogate pairs too? Good question. IMO, it shouldn't, so ord(unichr(n)) may raise exceptions, even for values of n where unichr(n) succeeds. The basic rationale here is: if you need surrogates a lot, you should use a wide unicode implementation. In a narrow unicode implementation, a lot of surprises are likely (although each surprise should be documented, of course). In the specific case, there isn't even a single best solution: If ord of a surrogate pair would return a value, you'd lose the property that ord(s[0])==ord(s) either raises an exception or gives 1. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 21:53:11 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 22:53:11 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <200106271957.f5RJvC219975@odiug.digicool.com> (message from Guido van Rossum on Wed, 27 Jun 2001 15:57:12 -0400) References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A314C.161FE431@ActiveState.com> <200106271930.f5RJUJw19910@odiug.digicool.com> <3B3A3900.CB73F3E0@ActiveState.com> <200106271957.f5RJvC219975@odiug.digicool.com> Message-ID: <200106272053.f5RKrBA13303@mira.informatik.hu-berlin.de> > That's a separate question. On wide interpreters, surrogate pairs > "shouldn't" exist if the app plays by the rules. But they're easily > created of course! What should ord(u'\uD800\uDC00') mean on a wide > interpreter? I think it's nice if you support this. Of course, if a > length-two Unicode string is anything else than a high surrogate > followed by a low surrogate, ord() should be illegal. But then, you get unichr(ord(u'\uD800\uDC00')) <> u'\uD800\uDC00'. Is that acceptable? I'd rather prefer ord not to work on surrogate pairs. It means that code may behave differently, but that is no surprise: len(u'\U00102030') already varies depending on the width of unicode. 
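That variation is easy to demonstrate; a sketch of the same expression on the two builds (assuming the \U behaviour described in the PEP):

    >>> len(u'\U00102030')   # narrow build: stored as a surrogate pair
    2
    >>> len(u'\U00102030')   # wide build: a single character
    1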
Regards, Martin From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 22:00:18 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 23:00:18 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <200106271953.f5RJrPi19963@odiug.digicool.com> (message from Guido van Rossum on Wed, 27 Jun 2001 15:53:25 -0400) References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A3696.FFA7FCE@ActiveState.com> <200106271953.f5RJrPi19963@odiug.digicool.com> Message-ID: <200106272100.f5RL0IP13334@mira.informatik.hu-berlin.de> > When using UCS-4 mode, I was in favor of allowing unichr() and \U to > specify any value in range(0x100000000L), but that's not what Martin > and Fredrik checked in. Note that if C code somehow creates a UCS-4 > string containing something with the high bit on, ord() will currently > return a negative value on platforms where a C long is 32 bits. Couldn't it be an unenforced rule that C code also must stick to the 17 planes? There are many unenforced rules, like that you must not modify a string unless you've created it by passing a NULL char*, and not handed out a reference to anybody. Effectively, using C code might introduce undefined behaviour. On some systems, ord will return a negative value, on others, a positive one; in a future version, it may produce an error if we find too many people had problems with their C code writing large integers into unicode characters. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Jun 28 07:20:58 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 28 Jun 2001 08:20:58 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <3B3A3DC5.CA6767FD@ActiveState.com> (message from Paul Prescod on Wed, 27 Jun 2001 13:10:45 -0700) References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A3DC5.CA6767FD@ActiveState.com> Message-ID: <200106280620.f5S6Kwe01395@mira.informatik.hu-berlin.de> > What is the virtue in making the literal syntax easy and making unichr() > easy when everything else is hard? Counting characters is hard. > Addressing characters reliably is hard. Slicing reliably is hard. Why > not simplify things? Surrogates are just characters. If you want to > handle wide characters you need to build Python that way. > > I'm trying to imagine the use-case where you care about surrogates > enough to want them to be automatically generated but not enough to care > about slicing and addressing and counting and ...and is this use-case > worth breaking the invariant that len(unichr(i))==1. I'm in favour of supporting the \U notation to denote non-BMP characters even in a "narrow" installation. Whether unichr should also support them is less interesting, but it gives some consistency if it does. The rationale for supporting \U is two-fold: One, importing a module should not fail in one installation, and succeed in another (of the same Python version). Running the module may give different results, but you should be able to generate byte code. Furthermore, people using non-BMP characters in source are probably not very interested in counting the characters: They want to display them. For just displaying them, you need to represent them, and you need the fonts. String manipulation is less important. 
Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Jun 28 08:05:19 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 28 Jun 2001 09:05:19 +0200 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: References: Message-ID: <200106280705.f5S75J701656@mira.informatik.hu-berlin.de> > It's nice that we got to chat about portability to Platforms from Mars, but > is anyone actually going to work on that function? It shouldn't be hard, I > just don't want to see it fall thru the cracks. If nothing happens within three weeks, I will. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Jun 28 08:08:21 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 28 Jun 2001 09:08:21 +0200 Subject: [I18n-sig] Support for "wide" Unicode characters In-Reply-To: <3B3ABFB8.84C7510B@ActiveState.com> (message from Paul Prescod on Wed, 27 Jun 2001 22:25:12 -0700) References: <3B3ABFB8.84C7510B@ActiveState.com> Message-ID: <200106280708.f5S78Lm01658@mira.informatik.hu-berlin.de> > * ord() will now accept surrogate pairs and return the ordinal of > the "wide" character. I'm still -1 on this. > ISSUE: Should sys.maxunicode be TOPCHAR or 2**32-1 or even > 2**31 on wide builds? It should be TOPCHAR, the maximum value that unichr accepts. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Jun 28 07:57:34 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 28 Jun 2001 08:57:34 +0200 Subject: [I18n-sig] Python Support for "Wide" Unicode characters In-Reply-To: <3B3A7857.1593F72@ActiveState.com> (message from Paul Prescod on Wed, 27 Jun 2001 17:20:39 -0700) References: <3B3A6438.6DA39268@ActiveState.com> <200106272319.f5RNJnO20162@odiug.digicool.com> <3B3A7857.1593F72@ActiveState.com> Message-ID: <200106280657.f5S6vYl01625@mira.informatik.hu-berlin.de> > Maybe there is a virtue in having a way to both ask for the largest > *legal* Unicode character and the largest character that will fit into a > Python character on the platform. I mean in theory the maximum Unicode > character is constant but that doesn't mean I want to declare it in my > programs explicitly. > > unicodedata.maxchar => always TOPCHAR > sys.maxunicode => some power of 2 - 1 > > I'm not entirely happy that we call a thing "sys.maxunicode" and then > tell people how to generate larger values. How about sys.maxcodeunit . > (or we could remove the whole surrogate building stuff :) ) -1. The Unicode consortium and ISO have promised that there will never be characters above 0x10ffff. Most of the characters below TOPCHAR are "unassigned", whereas the ones above TOPCHAR are "illegal" (or not even representable in UTF-16). If we were to allow putting very large numbers into Unicode strings, we'd have to check for them in all codecs also. I'd rather disallow them from Python code, and declare using them in C as undefined behaviour. > So there is no way to get the heuristic of "wchar_t if available, UCS-4 > if not". I'm not complaining, just checking. The list of options is just > two with ucs2 the default. I'd be complaining, though, if I wasn't that pleased with this PEP overall. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Jun 28 07:36:55 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 28 Jun 2001 08:36:55 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! 
In-Reply-To: <9F2D83017589D211BD1000805FA70CA703B139F7@ntxmel03.cmutual.com.au> (JMachin@Colonial.com.au) References: <9F2D83017589D211BD1000805FA70CA703B139F7@ntxmel03.cmutual.com.au> Message-ID: <200106280636.f5S6at701531@mira.informatik.hu-berlin.de> > "No" should mean "no". > > unichr() and ord() should be inverses *only* > in respect of scalar values up to sys.maxunicode. +1. Martin From mkuhn@suse.de Thu Jun 28 09:03:59 2001 From: mkuhn@suse.de (Markus Kuhn) Date: Thu, 28 Jun 2001 10:03:59 +0200 (CEST) Subject: [I18n-sig] Determine encoding from $LANG In-Reply-To: <15160.60506.589750.287186@honolulu.ilog.fr> Message-ID: On Tue, 26 Jun 2001, Bruno Haible wrote: > > A program cannot be considered properly internationalized > until it obeys the current locale (LC_ALL || LC_CTYPE || LANG). > > The programs we are waiting for are: > [...] Add to that list many of the programming languages that use Unicode internally but that do not yet automatically set the default i/o encoding correctly based on LC_ALL || LC_CTYPE || LANG. For example TCL currently uses some primitive LANG substring matching, which basically gets only a few Japanese and Russian encodings right. The TCL function unix/tclUnixInit.c:TclpSetInitialEncodings really should call libcharset or nl_langinfo(CODESET) instead: https://sourceforge.net/tracker/?func=detail&aid=418645&group_id=10894&atid=110894 I suspect that Perl and Python are not much better and don't call nl_langinfo(CODESET) or the portable libcharset wrapper around it either to properly determine the locale-dependent external encoding. References on how to determine the character encoding from the locale in a safe and portable manner: http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate http://clisp.cons.org/~haible/packages-libcharset.html http://www.opengroup.org/onlinepubs/7908799/xsh/langinfo.h.html Markus -- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: From Markus.Kuhn@cl.cam.ac.uk Thu Jun 28 09:20:32 2001 From: Markus.Kuhn@cl.cam.ac.uk (Markus Kuhn) Date: Thu, 28 Jun 2001 09:20:32 +0100 Subject: [I18n-sig] Re: Unicode 3.1 and contradictions. In-Reply-To: Your message of "27 Jun 2001 12:30:05 BST." <4a7kxykvnm.fsf@kern.srcf.societies.cam.ac.uk> Message-ID: > It is a bug to encode a non-BMP character with six > bytes by pretending that the (surrogate) values used in the UTF-16 > representation are BMP characters and encoding the character as > though it were a string consisting of those characters. It is also a > bug to interpret such a six-byte sequence as a single character. > This was clarified in Unicode 3.1. Fully agreed. Independent of what the letter of the standard says, it is absolutely essential for numerous practical security reasons that a UTF-8 decoder accept one and only one possible UTF-8 sequence as the encoding of any Unicode character. ISO 10646 is also very clear that surrogates must not appear in a UTF-8 stream and are malformed UTF-8 sequences. Unicode 3.0 was badly flawed in that respect and that has led to numerous security problems in fielded implementations.
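To make the six-byte bug concrete at the byte level (shown only for illustration; the byte values follow directly from the UTF-8 algorithm):

    # Correct UTF-8 for U+10000: a single four-byte sequence.
    good = '\xF0\x90\x80\x80'

    # The buggy six-byte form: the UTF-16 surrogates U+D800 and U+DC00
    # each encoded as though they were ordinary BMP characters.
    bad = '\xED\xA0\x80\xED\xB0\x80'

A strict decoder accepts only the first form and must reject the second as malformed.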
As I understand it, Unicode 3.1 fixed that, but in any case, no matter what the standard says, you should definitely follow the advice given in the UTF-8 decoder robustness test file http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt and accept only one single representation for every Unicode character, otherwise you just generate nice loopholes for hackers to pass critical characters through non-decoding filters. The UTF-8 representations of U+D800..U+DFFF, U+FFFE, and U+FFFF are not allowed in a UTF-8 stream and a secure UTF-8 decoder must never output any of these characters. http://www.cl.cam.ac.uk/~mgk25/unicode.html Markus -- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: From mal@lemburg.com Thu Jun 28 10:04:07 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 28 Jun 2001 11:04:07 +0200 Subject: [I18n-sig] Re: [Python-Dev] Unicode Maintenance References: <3B39CD51.406C28F0@lemburg.com> <200106271611.f5RGBn819631@odiug.digicool.com> Message-ID: <3B3AF307.6496AFB4@lemburg.com> Guido van Rossum wrote: > > > Looking at the recent burst of checkins for the Unicode implementation > > completely bypassing the standard SF procedure and possible comments > > I might have on the different approaches, I guess I've been ruled out > > as maintainer and designer of the Unicode implementation. > > > > Well, I guess that's how things go. Was nice working for you guys, > > but no longer is... I'm tired of having to defend myself against > > meta-comments about the design, uncontrolled checkins and no true > > backup about my standing in all this from Guido. > > > > Perhaps I am misunderstanding the role of a maintainer and > > implementation designer, but as it is all respect for the work I've > > put into all this seems faded. That's the conclusion I draw from recent > > postings by Martin and Fredrik and their nightly "takeover". > > > > Thanks, > > -- > > Marc-Andre Lemburg > > [For those of us to whom Marc-Andre's complaint comes as a total > surprise: there was a thread on i18n-sig about whether we should > support Unicode surrogates, followed by a conclusion to skip > surrogates and jump directly to optional support for UCS-4, followed > by some checkins that enabled a configuration choice between UCS-2 and > UCS-4, and code to make it work. As a side effect, surrogate support > in the UCS-2 version actually improved slightly.] > > Now, now, Marc-Andre. > > The only comments I recall from you on my "surrogates: just say no" > post seemed favorable, except that you proposed to go all the way and > make UCS-4 mandatory. I explained why I didn't want to go that far, > and why I didn't believe your arguments against giving users a choice. > I didn't hear back from you then, and I didn't think you could have > much of a problem with my position. > > Our process requires the use of the SF patch manager only for > controversial changes. Based on your feedback, I didn't think there > was anything controversial about the changes that Fredrik and Martin > have made! (If there was, IMO it was temporarily breaking the Windows > build and the test suite -- but that's all fixed now.) > > I don't understand where you get the idea that we lost respect for > your work! In fact, the fact that it was so easy to make the changes > suggested to me that the original design was well suited to this > particular change (as opposed to the surrogate support proposals, > which all sounded like they would require a *lot* of changes).
> > I don't think that we have very strict roles in this community anyway. > (My role as BDFL excluded -- that's why I get to write this > response. :-) I'd say that Fredrik owns SRE, because he has asserted > that ownership at various times: he's undone changes by others that > broke the 1.5.2 support, for example. > > But the Unicode support in Python isn't owned by one person: many > folks have contributed to that, including Fredrik, who designed and > wrote the original Unicode string object implementation. > > If you have specific comments about the changes made, please be > specific. If you feel slighted by meta-comments, please also be > specific. I don't think I've said anything derogatory about you or > your design. You didn't get my point. I feel responsible for the Unicode implementation design and would like to see it become a continued success. In that sense and taking into account that I am the maintainer of all this stuff, I think it is very reasonable to ask me before making any significant changes to the implementation and also respect any comments I put forward. Currently, I have to watch the checkins list very closely to find out who changed what in the implementation and then to take actions only after the fact. Since I'm not supporting Unicode as my full-time job this is simply impossible. We have the SF manager and there is really no need to rush anything around here. If I am offline or too busy with other things for a day or two, then I want to see patches on SF and not find new versions of the implementation already checked in. This has worked just fine during the last year, so I can only explain the latest actions in this direction with an urge to bypass my comments and any discussion this might cause. Needless to say that quality control is not possible anymore. Conclusion: I am not going to continue this work if this does not change. Another problem for me is the continued hostility I feel on i18n against parts of the design and some of my decisions. I am not talking about your feedback and the feedback from many other people on the list which was excellent and to high standards. But reading the postings of the last few months you will find notices of what I am referring to here (no, I don't want to be specific). If people don't respect my comments or decision, then how can I defend the design and how can I stop endless discussions which simply don't lead anywhere ? So either I am missing something or there is a need for a clear statement from you about my status in all this. If I don't have the right to comment on proposals and patches, possibly even rejecting them, then I simply don't see any ground for keeping the implementation in a state which I can maintain. And last but not least: The fun-factor has faded which was the main motor driving me into working on Unicode in the first place. Nothing much you can do about this, though :-/ > Paul Prescod offered to write a PEP on this issue. My cynical half > believes that we'll never hear from him again, but my optimistic half > hopes that he'll actually write one, so that we'll be able to discuss > the various issues for the users with the users. I encourage you to > co-author the PEP, since you have a lot of background knowledge about > the issues. I guess your optimistic half won :-) I think Paul already did all the work, so I'll simply comment on what he wrote. > BTW, I think that Misc/unicode.txt should be converted to a PEP, for > the historic record.
It was very much a PEP before the PEP process > was invented. Barry, how much work would this be? No editing needed, > just formatting, and assignment of a PEP number (the lower the better). Thanks for converting the text to PEP format, Barry. Thanks for reading this far, -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Thu Jun 28 10:27:35 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 28 Jun 2001 11:27:35 +0200 Subject: [I18n-sig] Support for "wide" Unicode characters References: <3B3ABFB8.84C7510B@ActiveState.com> Message-ID: <3B3AF887.5181D0CF@lemburg.com> Paul Prescod wrote: > > Round 2: I can't check in right now but I'll collect another round of > suggestions and then post this to other lists tomorrow. Here you go... > ---- > PEP: 261 > Title: Support for "wide" Unicode characters > Version: $Revision: 1.2 $ > Author: paulp@activestate.com (Paul Prescod) > Status: Draft > Type: Standards Track > Created: 27-Jun-2001 > Python-Version: 2.2 > Post-History: 27-Jun-2001 > > Abstract > > Python 2.1 unicode characters can have ordinals only up to 65535. > These characters are known as Basic Multilingual Plane characters. > There are now characters in Unicode that live on other "planes". > The largest addressable character in Unicode has the ordinal > 17 * 2**16 - 1. For readability, we will call this TOPCHAR. I would add hex notations for those who are more familiar with HEX and Unicode (which uses HEX to pinpoint code points). Also, a suggestion: I think to avoid all the problems of understanding the different terms in this PEP, I'd do two things: 1. add a Glossary (copying from the Unicode glossary) 2. use the standard Unicode terms throughout the PEP (code points, code units, etc.) The reason is that otherwise you'll get confusion about what you mean by noncharacter characters ;-) > Proposed Solution > > One solution would be to merely increase the maximum ordinal to a > larger value. Unfortunately the only straightforward > implementation of this idea is to increase the character code unit > to 4 bytes. This has the effect of doubling the size of most > Unicode strings. In order to avoid imposing this cost on every > user, Python 2.2 will allow 4-byte Unicode characters as a > build-time option. > > The 4-byte option is called "wide Py_UNICODE". The 2-byte option > is called "narrow Py_UNICODE". > > Most things will behave identically in the wide and narrow worlds. > > * the \u and \U literal syntaxes will always generate the same > data that the unichr function would. They are just different > syntaxes for the same thing. > > * unichr(i) for 0 <= i < 2**16 always returns a size-one string. > > * unichr(i) for 2**16 <= i <= TOPCHAR will always return a > string representing the character. -1. If the platform does not support the character in question, then this should raise a ValueError instead of returning anything with len() > 1. Reasoning: u[i] in Python should always refer to a code point *and* code unit in the Unicode sense. If this is not possible, raise an exception. > * BUT on narrow builds of Python, the string will actually be > composed of two characters (in the Python, not Unicode sense) > called a "surrogate pair". These two Python characters are > logically one Unicode character. > > ISSUE: Should Python return surrogate pairs on narrow builds > or should it just disallow them?
> > ISSUE: Should the upper bound of the domain of unichr and > range of ord() be TOPCHAR or 2**32-1 or even 2**31? -1. See above. > * ord() will now accept surrogate pairs and return the ordinal of > the "wide" character. > > ISSUE: Should Python accept surrogate pairs on wide > Python builds? -1. Have the codecs do the business of dealing with surrogates and ord() return the code point ordinal (isolated surrogates are code points as well; they are not Unicode characters though). > * There is an integer value in the sys module that describes the > largest ordinal for a Unicode character on the current > interpreter. sys.maxunicode is 2**16-1 on narrow builds of > Python. > > ISSUE: Should sys.maxunicode be TOPCHAR or 2**32-1 or even > 2**31 on wide builds? > > ISSUE: Should there be distinct constants for accessing > TOPCHAR and the real upper bound for the domain of > unichr? Hmm, not sure. Wouldn't it be better to simply add an attribute sys.unicodewidth == 'narrow' | 'wide' ? This leaves out all the complicated issues and redirects people to this PEP. > * Note that ord() can in some cases return ordinals higher than > sys.maxunicode because it accepts surrogate pairs on narrow > Python builds. -1. > * codecs will be upgraded to support "wide characters" > (represented directly in UCS-4, as surrogate pairs in UTF-16 and > as multi-byte sequences in UTF-8). On narrow Python builds, the > codecs will generate surrogate pairs, on wide Python builds they > will generate a single character. This is the main part of the > implementation left to be done. +1. This is how surrogates should be treated: in the codecs ! > * there are no restrictions on constructing strings that use > code points "reserved for surrogates" improperly. These are > called "lone surrogates". Better call them "isolated surrogates"; that's the term Mark Davis used and he should know. > The codecs should disallow reading > these but you could construct them using string literals or > unichr(). unichr() is not restricted to values less than either > TOPCHAR or sys.maxunicode. > > ISSUE: Should lone surrogates be allowed as input to ord even > on wide platforms where they "should" not occur? Yes, see above. Isolated surrogates are true code points. > Implementation > > There is a new (experimental) define: > > #define PY_UNICODE_SIZE 2 Doesn't sizeof(Py_UNICODE) do the same ? > There are new configure options: > > --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses > wchar_t if it fits > --enable-unicode=ucs4 configures a wide Py_UNICODE likewise With "likewise" meaning: "and uses wchar_t if it fits" ! > --enable-unicode same as "=ucs2" > > The intention is that --disable-unicode, or --enable-unicode=no > removes the Unicode type altogether; this is not yet implemented. Let's add the UCS-2/UCS-4 stuff first and only then think about adding the removal #ifdefs. > Notes > > Note that len(unichr(i))==2 for i>=2**16 on narrow machines > because of the returned surrogates. -1. See above. > This means (for example) that the following code is not portable: > > x = 2**16 > if unichr(x) in somestring: > ... > > In general, you should be careful using "in" if the character that > is searched for could have been generated from unichr applied to a > number greater than 2**16 or from a string literal greater than > 2**16. > > This PEP does NOT imply that people using Unicode need to use a > 4-byte encoding. It only allows them to do so. For example, > ASCII is still a legitimate (7-bit) Unicode-encoding.
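The portability pitfall in the Notes section can be made concrete. Here is a sketch of a membership test that gives the same answer on narrow and wide builds (contains_scalar is a hypothetical helper, not part of the PEP):

    import sys

    def contains_scalar(s, cp):
        # True if unicode string s contains code point cp.
        if cp <= 0xFFFF or sys.maxunicode > 0xFFFF:
            return unichr(cp) in s
        # Narrow build, non-BMP code point: search for the
        # equivalent two-unit surrogate pair instead.
        hi = 0xD800 + ((cp - 0x10000) >> 10)
        lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)
        return (unichr(hi) + unichr(lo)) in s

Note that on a narrow build this still cannot distinguish an intended non-BMP character from the same two code units appearing as isolated surrogates; that ambiguity is precisely the narrow-build problem the PEP describes.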
> > Rationale for Surrogate Creation Behaviour > > Python currently supports the construction of a surrogate pair > for a large unicode literal character escape sequence. This is > basically designed as a simple way to construct "wide characters" > even in a narrow Python build. > > ISSUE: surrogates can be created this way but the user still > needs to be careful about slicing, indexing, printing > etc. Another option is to remove knowledge of > surrogates from everything other than the codecs. Side note: Python uses the unicode-escape codec for interpreting the Unicode literals. This means that narrow builds will also support the full range of UCS-4 -- using surrogates if needed. This introduces an incompatibility between narrow and wide builds at run-time. PYC should not be harmed by this since they store Unicode strings using UTF-8. > Rejected Suggestions > > There were two primary solutions that were rejected. The first was > more or less the status-quo. We could officially say that UTF-16 > is the Python character encoding and require programmers to > implement wide characters in their application logic. This is a > heavy burden because emulating 32-bit characters is likely to be > very inefficient if it is coded entirely in Python. > > The other solution is to use UTF-16 (or even UTF-8) internally > (for efficiency) but present an abstraction of 32-bit characters > to the programmer. This would require a much more complex > implementation than the accepted solution. In theory, we could > move to this implementation in the future without breaking Python > code. It would just emulate a wide Python build on narrow > Pythons. > > Copyright > > This document has been placed in the public domain. > > Local Variables: > mode: indented-text > indent-tabs-mode: nil > End: -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From guido@digicool.com Thu Jun 28 12:17:03 2001 From: guido@digicool.com (Guido van Rossum) Date: Thu, 28 Jun 2001 07:17:03 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Thu, 28 Jun 2001 00:55:08 EDT." References: Message-ID: <200106281117.f5SBH3Z20788@odiug.digicool.com> OK, I'm convinced that ord() should only work on single-unit strings. If we're going to deprecate creating surrogates with \U, I think unichr() should follow suit. (My Klingon use case had a need for \U but not for unichr() doing this.) But reasonable people can argue over this. [Tim] > But there's a HUGE difference. The xrange() behaviors we're seeking to shed > have been documented for years. Oh yeah? Where? The docs for XRange objects are very vague, claiming that they "behave like tuples" and have a tolist() method. Well, they can't be concatenated, so they don't behave like tuples. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Thu Jun 28 12:25:30 2001 From: guido@digicool.com (Guido van Rossum) Date: Thu, 28 Jun 2001 07:25:30 -0400 Subject: [I18n-sig] Re: Unicode 3.1 and contradictions. In-Reply-To: Your message of "Thu, 28 Jun 2001 09:20:32 BST." References: Message-ID: <200106281125.f5SBPVc20814@odiug.digicool.com> > The UTF-8 representations of U+D800..U+DFFF, U+FFFE, and U+FFFF are not > allowed in a UTF-8 stream and a secure UTF-8 decoder must never output > any of these characters. Can you explain a bit more about the security issues? 
--Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Thu Jun 28 12:33:25 2001 From: guido@digicool.com (Guido van Rossum) Date: Thu, 28 Jun 2001 07:33:25 -0400 Subject: [I18n-sig] Support for "wide" Unicode characters In-Reply-To: Your message of "Thu, 28 Jun 2001 11:27:35 +0200." <3B3AF887.5181D0CF@lemburg.com> References: <3B3ABFB8.84C7510B@ActiveState.com> <3B3AF887.5181D0CF@lemburg.com> Message-ID: <200106281133.f5SBXQ020837@odiug.digicool.com> > > There is a new (experimental) define: > > > > #define PY_UNICODE_SIZE 2 > > Doesn't sizeof(Py_UNICODE) do the same ? Not on a Cray! And not in the C standard. Ask Tim. :-) > This introduces an incompatibility between narrow and wide > builds at run-time. PYC should not be harmed by this since they > store Unicode strings using UTF-8. Does UTF-8 transfer isolated surrogates correctly? I think that's necessary, otherwise I can't marshal or unmarshal literals containing those, which means that .pyc files for .py files containing those can't be read (or maybe aren't portable between wide and narrow interpreters). Note that I'm OK with the UTF-8 encoder recognizing hi+lo surrogate pairs and encoding them as one Unicode character, since the decoder generates surrogates for non-BMP characters on a narrow platform. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Thu Jun 28 12:37:06 2001 From: guido@digicool.com (Guido van Rossum) Date: Thu, 28 Jun 2001 07:37:06 -0400 Subject: [I18n-sig] Support for "wide" Unicode characters In-Reply-To: Your message of "Wed, 27 Jun 2001 22:25:12 PDT." <3B3ABFB8.84C7510B@ActiveState.com> References: <3B3ABFB8.84C7510B@ActiveState.com> Message-ID: <200106281137.f5SBb6r20850@odiug.digicool.com> Whether \U can create surrogates should now be marked as an open issue as well, like for unichr(). No further comments but agree with what others have said; I like the idea of adding a Glossary and using the Unicode terminology correctly. --Guido van Rossum (home page: http://www.python.org/~guido/) From Markus.Kuhn@cl.cam.ac.uk Thu Jun 28 12:48:40 2001 From: Markus.Kuhn@cl.cam.ac.uk (Markus Kuhn) Date: Thu, 28 Jun 2001 12:48:40 +0100 Subject: [I18n-sig] Re: Unicode 3.1 and contradictions. In-Reply-To: Your message of "Thu, 28 Jun 2001 07:25:30 EDT." <200106281125.f5SBPVc20814@odiug.digicool.com> Message-ID: Guido van Rossum wrote on 2001-06-28 11:25 UTC: > > The UTF-8 representations of U+D800..U+DFFF, U+FFFE, and U+FFFF are not > > allowed in a UTF-8 stream and a secure UTF-8 decoder must never output > > any of these characters. > > Can you explain a bit more about the security issues? There are two ways of processing UTF-8 encoded UCS text: a) as a UTF-8 bytestream b) as a stream of decoded integer code values (32-bit wchar_t, etc.) Problems arise if security-relevant checks are done in one representation and interpretation of the data is done in the other. Imagine you have an application with the following processing steps: - read a UTF-8 string - apply a substring test to convince yourself that certain characters are not present in the string - decode UTF-8 - use the decoded string in an application where presence of the tested characters could be security critical The classical example is a Win32 web server, where a UTF-8 URL is fed in, tested by a script in UTF-8 to be free of the byte sequence '/../', and then UTF-8 decoded and fed into a UTF-16 API for file system access.
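A concrete sketch of the bypass being described, assuming a hypothetical decode_lax() that wrongly accepts overlong sequences (the analysis continues below):

    # 0xC0 0xAF is an overlong, illegal two-byte spelling of '/' (0x2F).
    url = '\xc0\xaf..\xc0\xafsecret'
    assert url.find('/../') == -1      # the byte-level filter sees nothing
    # decode_lax(url) would nevertheless yield u'/../secret', so the
    # filtered-out path escape reappears after decoding.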
Even though the presence of '/../' encoded in ASCII was filtered out, the same character sequence can still be passed past the filter by a clever attacker using alternative encodings that an unsafe UTF-8 decoder might accept, for instance an overlong sequence for any of the characters. This problem is most severe with non-ASCII representations of ASCII characters by overlong UTF-8 sequences, because ASCII characters often have lots of special functions associated, but it also occurs with other tests. For example, it should be perfectly legitimate to test a UTF-8 string to be free of non-BMP characters by simply testing that no byte >= 0xE0 is present, without the far less efficient use of a UTF-8 decoder. Other risks are people smuggling a UTF-8 encoded U+FFFE or U+FFFF into a system, which when decoded into UTF-16 might be interpreted as an instruction to swap the byte sex (anti-BOM) or as some generic escape-or-end-of-string/file character (U+FFFF). The golden rule that there must be exactly one single UTF-8 byte sequence that can result in the output of a certain Unicode character and that Unicode code positions reserved for special non-character use such as U+D800..U+DFFF, U+FFFE, and U+FFFF should never be generated by a UTF-8 decoder eliminates all these potential pitfalls. http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt Markus -- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: From fredrik@pythonware.com Thu Jun 28 12:55:25 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Thu, 28 Jun 2001 13:55:25 +0200 Subject: [I18n-sig] Support for "wide" Unicode characters References: <3B3ABFB8.84C7510B@ActiveState.com> <3B3AF887.5181D0CF@lemburg.com> <200106281133.f5SBXQ020837@odiug.digicool.com> Message-ID: <00b601c0ffc9$38ae4bb0$0900a8c0@spiff> guido wrote: > > > There is a new (experimental) define: > > > > > > #define PY_UNICODE_SIZE 2 > > > > Doesn't sizeof(Py_UNICODE) do the same ? > > Not on a Cray! And not in the C standard. Ask Tim. :-) not to mention that the preprocessor doesn't understand sizeof(type)... (note that in the current implementation, the Py_UNICODE_WIDE macro is used to enable wide storage and disable the surrogate stuff. it's currently set if PY_UNICODE_SIZE >= 4, but it might be better to do it the other way around) Cheers /F From JMachin@Colonial.com.au Thu Jun 28 13:27:45 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Thu, 28 Jun 2001 22:27:45 +1000 Subject: [I18n-sig] Support for "wide" Unicode characters Message-ID: <9F2D83017589D211BD1000805FA70CA703B13A02@ntxmel03.cmutual.com.au> Guido asked: Does UTF-8 transfer isolated surrogates correctly? No. See my bug report in SF. Briefly, a lone high surrogate has its leading UTF-8 byte omitted, causing an illegal UTF-8 sequence to be generated. Here's the URL: http://sourceforge.net/tracker/?group_id=5470&atid=105470&func=detail&aid=433882 (or search for "surrogates")
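For reference, the correct three-byte form is easy to compute. A sketch (hypothetical helper, not the stdlib codec) of the standard 3-byte UTF-8 encoding for any value in U+0800..U+FFFF, a range which includes a lone surrogate such as U+D800:

    def utf8_3byte(cp):
        # Standard 3-byte UTF-8 encoding for U+0800..U+FFFF.
        assert 0x800 <= cp <= 0xFFFF
        return chr(0xE0 | (cp >> 12)) + \
               chr(0x80 | ((cp >> 6) & 0x3F)) + \
               chr(0x80 | (cp & 0x3F))

utf8_3byte(0xD800) gives '\xed\xa0\x80' -- three bytes; the bug described above drops the leading 0xED and emits only '\xa0\x80'.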
From mal@lemburg.com Thu Jun 28 14:11:04 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 28 Jun 2001 15:11:04 +0200 Subject: [I18n-sig] Support for "wide" Unicode characters References: <3B3ABFB8.84C7510B@ActiveState.com> <3B3AF887.5181D0CF@lemburg.com> <200106281133.f5SBXQ020837@odiug.digicool.com> Message-ID: <3B3B2CE8.B1A062C4@lemburg.com> Guido van Rossum wrote: > > > > There is a new (experimental) define: > > > > > > #define PY_UNICODE_SIZE 2 > > > > Doesn't sizeof(Py_UNICODE) do the same ? > > Not on a Cray! And not in the C standard. Ask Tim. :-) Ah, OK... nice sofas these Crays, BTW ;-) > > This introduces an incompatibility between narrow and wide > > builds at run-time. PYC should not be harmed by this since they > > store Unicode strings using UTF-8. > > Does UTF-8 transfer isolated surrogates correctly? I think that's > necessary, otherwise I can't marshal or unmarshal literals containing > those, which means that .pyc files for .py files containing those > can't be read (or maybe aren't portable between wide and narrow > interpreters). It handles surrogates correctly, but rejects isolated ones on input (easy to fix though) and passes them through on output. As I said before, surrogate support is far from being complete. > Note that I'm OK with the UTF-8 encoder recognizing hi+lo surrogate > pairs and encoding them as one Unicode character, since the decoder > generates surrogates for non-BMP characters on a narrow platform. That's what it currently does. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From JMachin@Colonial.com.au Thu Jun 28 14:41:08 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Thu, 28 Jun 2001 23:41:08 +1000 Subject: [I18n-sig] Support for "wide" Unicode characters Message-ID: <9F2D83017589D211BD1000805FA70CA703B13A03@ntxmel03.cmutual.com.au> [Guido van Rossum] > store Unicode strings using UTF-8. > > Does UTF-8 transfer isolated surrogates correctly? [Marc-Andre Lemburg] It handles surrogates correctly, but rejects isolated ones on input (easy to fix though) and passes them through on output. As I said before, surrogate support is far from being complete. Marc-Andre, there is a *bug* in 2.1 encoding isolated high surrogates. I reported it and you assigned it to yourself on 23 June. Lookee here: Python 2.1 (#15, Apr 16 2001, 18:25:49) [MSC 32 bit (Intel)] on win32 Type "copyright", "credits" or "license" for more information. >>> u'\ud800'.encode('utf-8') '\xa0\x80' # should be 3 bytes, not 2 >>> While the fix is trivial, IMO an appropriate answer to Guido's question would include this particular lack of correctness. Cheers, John From mal@lemburg.com Thu Jun 28 14:49:28 2001 From: mal@lemburg.com (M.-A.
Lemburg) Date: Thu, 28 Jun 2001 15:49:28 +0200 Subject: [I18n-sig] Support for "wide" Unicode characters References: <9F2D83017589D211BD1000805FA70CA703B13A03@ntxmel03.cmutual.com.au> Message-ID: <3B3B35E8.6634D032@lemburg.com> "Machin, John" wrote: > > [Guido van Rossum] > > store Unicode strings using UTF-8. > > > > Does UTF-8 transfer isolated surrogates correctly? > > [Marc-Andre Lemburg} > It handles surrogates correctly, but rejects isolated ones on input > (easy to fix though) and passes them through on output. As I said > before, surrogate is far from being complete. > > Marc-Andre, there is a *bug* in 2.1 encoding isolated high surrogates. I > reported it > and you assigned it to yourself on 23 June. Lookee here: > > Python 2.1 (#15, Apr 16 2001, 18:25:49) [MSC 32 bit (Intel)] on win32 > Type "copyright", "credits" or "license" for more information. > >>> u'\ud800'.encode('utf-8') > '\xa0\x80' # should be 3 bytes, not 2 > >>> > > While the fix is trivial, IMO an appropriate answer to Guido's question > would include > this particular lack of correctness. Thanks for the note. I was looking at the code rather than actually trying an example -- guess the latter is faster and gives better answers ;-) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From jhi@iki.fi Thu Jun 28 14:51:01 2001 From: jhi@iki.fi (Jarkko Hietaniemi) Date: Thu, 28 Jun 2001 08:51:01 -0500 Subject: [I18n-sig] Re: Determine encoding from $LANG In-Reply-To: ; from mkuhn@suse.de on Thu, Jun 28, 2001 at 10:03:59AM +0200 References: <15160.60506.589750.287186@honolulu.ilog.fr> Message-ID: <20010628085101.B21832@chaos.wustl.edu> On Thu, Jun 28, 2001 at 10:03:59AM +0200, Markus Kuhn wrote: > On Tue, 26 Jun 2001, Bruno Haible wrote: > > > > A program cannot be considered properly internationalized > > until it obeys the current locale (LC_ALL || LC_CTYPE || LANG). > > > > The programs we are waiting for are: > > [...] > > Add to that list many of the programming languages that use Unicode > internally but that do not yet set the default i/o encoding correctly > automatically based on LC_ALL || LC_CTYPE || LANG. Until very recently the term "default I/O encoding" didn't mean anything to Perl (it was native bytes, period). Now we do have a new I/O subsystem (with which we can do things like "this I/O stream is in UTF-8") but the new I/O subsystem is not yet available in any public release of Perl, only in one developer release so far (5.7.1). > I suspect that Perl and Python are not much better and don't call > nl_langinfo(CODESET) or the portable libcharset wrapper around it either No, we don't call nl_langinfo(CODESET). We still need to figure out the correct policy and place for doing that. Sorry if "the correct policy" has been already extensively discussed and answered in this thread, this is the first message that was CCed (well, which I saw, anyway) to perl-unicode. But as a general rule, Perl doesn't do much in the way of locales unless the user explicitly asks for a locale behaviour by using setlocale(). Changing that now to be more 'automatic' would break backward compatibility. > to properly determine the locale-dependent external encoding. 
> > References on how to determine the character encoding from the locale in a > safe and portable manner: > > http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate > http://clisp.cons.org/~haible/packages-libcharset.html Alas, IIUC, LGPL is currently slightly incompatible for inclusion into Perl, for something as central a piece of code as locale handling. (Note: this is just a statement of facts as far as I understand them, I do not intend or want to start discussion about software licensing politics.) > http://www.opengroup.org/onlinepubs/7908799/xsh/langinfo.h.html But thanks for the pointers. I don't know whether I will be able to smush in the use of nl_langinfo() for the upcoming public release of Perl, Perl 5.8.0, but I will certainly give some thought to the matter. -- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen From guido@digicool.com Thu Jun 28 15:14:31 2001 From: guido@digicool.com (Guido van Rossum) Date: Thu, 28 Jun 2001 10:14:31 -0400 Subject: [I18n-sig] Support for "wide" Unicode characters In-Reply-To: Your message of "Thu, 28 Jun 2001 22:27:45 +1000." <9F2D83017589D211BD1000805FA70CA703B13A02@ntxmel03.cmutual.com.au> References: <9F2D83017589D211BD1000805FA70CA703B13A02@ntxmel03.cmutual.com.au> Message-ID: <200106281414.f5SEEVX23234@odiug.digicool.com> > Guido asked: > Does UTF-8 transfer isolated surrogates correctly? > > No. See my bug report in SF. Briefly, a lone high > surrogate has its leading UTF-8 byte omitted, > causing an illegal UTF-8 sequence to be generated. > > Here's the URL: > http://sourceforge.net/tracker/?group_id=5470&atid=105470&func=detail&aid=433882 > > (or search for "surrogates") It's a bug indeed. But my question was about the definition of UTF8, not our (fallible) implementation. What *should* be the result of u'\ud800'.encode('utf8')? '\xed\xa0\x80' or an exception? And likewise, what should be the result of unicode('\xed\xa0\x80', 'utf8')? u'\ud800' or an exception? (Likewise for low surrogates; currently, u'\udc00'.encode('utf8') returns '\xed\xb0\x80', but unicode('\xed\xb0\x80', 'utf8') raises an exception.) --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Thu Jun 28 15:51:30 2001 From: guido@digicool.com (Guido van Rossum) Date: Thu, 28 Jun 2001 10:51:30 -0400 Subject: [I18n-sig] Re: Unicode 3.1 and contradictions. In-Reply-To: Your message of "Thu, 28 Jun 2001 12:48:40 BST." References: Message-ID: <200106281451.f5SEpUv23358@odiug.digicool.com> [Markus] > > > The UTF-8 representations of U+D800..U+DFFF, U+FFFE, and U+FFFF are not > > > allowed in a UTF-8 stream and a secure UTF-8 decoder must never output > > > any of these characters. [Guido] > > Can you explain a bit more about the security issues? [Markus] > There are two ways of processing UTF-8 encoded UCS text: > > a) as a UTF-8 bytestream > b) as a stream of decoded integer code values (32-bit wchar_t, etc.) > > Problems arise if security-relevant checks are done in one > representation and interpretation of the data is done in the other.
> > Imagine you have an application with the following processing steps: > > > > - read a UTF-8 string > > - apply a substring test to convince yourself that certain characters > > are not present in the string > > - decode UTF-8 > > - use the decoded string in an application where presence of the > > tested characters could be security critical I'd say that the security implementation of such an application is broken -- the check should have been done on the final data. It seems you are trying to patch up a legacy system the wrong way. Or am I missing something? How can this be a common pattern? > The classical example is a Win32 web server, where a UTF-8 URL is fed > in, tested by a script in UTF-8 to be free of the byte sequence '/../', > and then UTF-8 decoded and fed into a UTF-16 API for file system access. > Even though the presence of '/../' encoded in ASCII was filtered out, > the same character sequence can still be passed past the filter by a > clever attacker using alternative encodings that an unsafe UTF-8 decoder > might accept, for instance an overlong sequence for any of the > characters. Here you are assuming an unsafe UTF-8 decoder. I agree that a UTF-8 decoder that accepts overlong sequences is broken. But we were talking about isolated surrogates. How can passing through *isolated* surrogates cause a security violation? It's not an overlong sequence! (Assuming the decoder does the right thing for surrogate *pairs*.) > This problem is most severe with non-ASCII representations of ASCII > characters by overlong UTF-8 sequences, because ASCII characters have > often lots of special functions associated, but it also occurs with > other tests. For example, it should be perfectly legitimate to test a > UTF-8 string to be free of non-BMP characters by simply testing that no > byte >= 0xE0 is present, without the far less efficient use of a UTF-8 > decoder. Why is testing for non-BMP characters part of a security screening? Maybe you are worried that an application will over-index some table prepared for the BMP only. But Python already protects against over-indexing with an exception. Why would you want a security screening of the UTF-8 stream when you're going to decode it eventually? If you *have* to check that no decoded character is >= 2**16, faster than a separate scan would be to fold the security screening into the UTF-8 codec. > Other risks are people smuggling a UTF-8 encoded U+FFFE or U+FFFF into a > system, which when decoded into UTF-16 might be interpreted as an > instruction to swap the byte sex (anti-BOM) or as some generic > escape-or-end-of-string/file character (U+FFFF). These aren't isolated surrogates, so they would fall under a different rule (currently they pass through Python's UTF-8 codec just fine). I have the feeling that you want the UTF-8 decoder to make up for all the sloppy coding practices that might be used in the application. > The golden rule that there must be exactly one single UTF-8 byte > sequence that can result in the output of a certain Unicode character > and that Unicode code positions reserved for special non-character use > such as U+D800..U+DFFF, U+FFFE, and U+FFFF should never be generated by > a UTF-8 decoder eliminates all these potential pitfalls. Sorry, you haven't convinced me that these tests should be applied by Python's standard UTF-8 codec. Also, your use of "such as" suggests that the collection of dangerous code points is open-ended, but I find that hard to believe (since legacy codecs won't be updated).
> http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 > http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt --Guido van Rossum (home page: http://www.python.org/~guido/) From Markus.Kuhn@cl.cam.ac.uk Thu Jun 28 16:47:59 2001 From: Markus.Kuhn@cl.cam.ac.uk (Markus Kuhn) Date: Thu, 28 Jun 2001 16:47:59 +0100 Subject: [I18n-sig] Re: Unicode 3.1 and contradictions. In-Reply-To: Your message of "Thu, 28 Jun 2001 10:51:30 EDT." <200106281451.f5SEpUv23358@odiug.digicool.com> Message-ID: Guido van Rossum wrote on 2001-06-28 14:51 UTC: > > Imagine you have an application with the following processing steps: > > > > - read a UTF-8 string > > - apply a substring test to convince yourself that certain characters > > are not present in the string > > - decode UTF-8 > > - use the decoded string in an application where presence of the > > tested characters could be security critical > > I'd say that the security implementation of such an application is > broken -- the check should have been done on the final data. It > seems you are trying to patch up a legacy system the wrong way. Or am > I missing something? How can this be a common pattern? We should not expect that any and all UTF-8 data has to be decoded before it can be processed. UTF-8 has been very carefully designed to allow much text processing (substring searching without case mapping, etc.) to be done on UTF-8 data directly. Only a few operations (display, case mapping, proper sorting) actually require a UTF-8 decoder. The name "UCS Transfer Format" is in practice misleading, because processing UTF-8 as opposed to just transferring is often the right thing to do, unless a buggy UTF-8 decoder would make that risky. > But we were talking about isolated surrogates. How can passing > through *isolated* surrogates cause a security violation? It's not an > overlong sequence! (Assuming the decoder does the right thing for > surrogate *pairs*.) OK, that is far less of a security concern. However, an isolated surrogate is usually a symptom of something else being wrong (e.g., UTF-16 strings being split at the wrong place, then UTF-8 converted, then joined again), and if not spotted will lead to incorrect UTF-8 sequences at the end. Signalling an exception might often be better than passing everything through quietly. > > This problem is most severe with non-ASCII representations of ASCII > > characters by overlong UTF-8 sequences, because ASCII characters have > > often lots of special functions associated, but it also occurs with > > other tests. For example, it should be perfectly legitimate to test a > > UTF-8 string to be free of non-BMP characters by simply testing that no > > byte >= 0xE0 is present, without the far less efficient use of a UTF-8 > > decoder. > > Why is testing for non-BMP characters part of a security screening? If a database field has a policy of not allowing non-BMP characters in a field, then that policy can be violated. How bad that is depends on the application. It was really just an example, not a specific risk. > Sorry, you haven't convinced me that these tests should be applied by > Python's standard UTF-8 codec. Also, your use of "such as" suggests > that the collection of dangerous code points is open-ended, but I find > that hard to believe (since legacy codecs won't be updated). My list of unwanted UTF-8 code points was just the one found in a note in the UTF-8 definition in ISO 10646-1:1993 (R.4): NOTE 3 - Values of x in the range 0000 D800 .. 0000 DFFF are reserved for the UTF-16 form and do not occur in UCS-4.
The values 0000 FFFE and 0000 FFFF also do not occur (see clause 8). The mappings of these code positions in UTF-8 are undefined. http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html Markus -- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: From tim@digicool.com Thu Jun 28 17:40:29 2001 From: tim@digicool.com (Tim Peters) Date: Thu, 28 Jun 2001 12:40:29 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <200106281117.f5SBH3Z20788@odiug.digicool.com> Message-ID: [Tim] > But there's a HUGE difference. The xrange() behaviors we're > seeking to shed have been documented for years. [Guido] > Oh yeah? Where? Same place as \U surrogates: in the c.l.py archives . Well, I take that back: while any number of bizarre xrange tricks have been posted over the years, I don't think I ever saw a surrogate literal example before this thread. 'twas-news-to-me-but-then-so-was-80%-of-what-xrange-did-ly y'rs - tim From paulp@ActiveState.com Thu Jun 28 19:11:59 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 28 Jun 2001 11:11:59 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A3DC5.CA6767FD@ActiveState.com> <200106280620.f5S6Kwe01395@mira.informatik.hu-berlin.de> Message-ID: <3B3B736F.316649CA@ActiveState.com> "Martin v. Loewis" wrote: > >... > > The rationale for supporting \U is two-fold: One, importing a module > should not fail in one installation, and succeed in another (of the > same Python version). Running the module may give different results, > but you should be able to generate byte code. Isn't it already the case that big Python integer literals can be legal on one platform and illegal on another? (I don't know, I just thought that was the case....) > ... Furthermore, people > using non-BMP characters in source are probably not very interested in > counting the characters: They want to display them. For just > displaying them, you need to represent them, and you need the fonts. > String manipulation is less important. What are the chances that anybody is in this situation in the near future? Can you even display these characters on Windows? Does Tk support them? And if so, on what platforms? What about the Java APIs? (once again, these are real, not rhetorical questions) Wide Python builds may be the "default" before these characters become practically usable in GUIs. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From paulp@ActiveState.com Thu Jun 28 19:13:44 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 28 Jun 2001 11:13:44 -0700 Subject: [I18n-sig] Python Support for "Wide" Unicode characters References: <3B3A6438.6DA39268@ActiveState.com> <200106272319.f5RNJnO20162@odiug.digicool.com> <3B3A7857.1593F72@ActiveState.com> <200106280657.f5S6vYl01625@mira.informatik.hu-berlin.de> Message-ID: <3B3B73D8.B9D92DE8@ActiveState.com> "Martin v. Loewis" wrote: > >... > > > So there is no way to get the heuristic of "wchar_t if available, UCS-4 > > if not". I'm not complaining, just checking. The list of options is just > > two with ucs2 the default. > > I'd be complaining, though, if I wasn't that pleased with this PEP > overall. Sorry, I don't understand the point you were making here. You may be away already so I'll take explanations from anyone who is interested. :) -- Take a recipe. Leave a recipe. Python Cookbook! 
http://www.ActiveState.com/pythoncookbook From paulp@ActiveState.com Thu Jun 28 19:28:28 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 28 Jun 2001 11:28:28 -0700 Subject: [I18n-sig] Closing some issues Message-ID: <3B3B774C.5D3F1E99@ActiveState.com> I'd like to close some issues in the PEP if there is agreement. If you feel that the following issues still deserve further discussion, just yell and I'll leave them as issues: * unichr() should never return surrogate pairs so its domain and range vary between wide and narrow Python builds. * ord() should never accept pairs so its domain and range vary between wide and narrow Python builds. * nowhere in the design will we discriminate against "lone surrogates" other than potentially the codecs. "Agreement" means everybody comes out on the same side or Guido rules. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From tree@basistech.com Thu Jun 28 19:00:47 2001 From: tree@basistech.com (Tom Emerson) Date: Thu, 28 Jun 2001 14:00:47 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <3B3B736F.316649CA@ActiveState.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A3DC5.CA6767FD@ActiveState.com> <200106280620.f5S6Kwe01395@mira.informatik.hu-berlin.de> <3B3B736F.316649CA@ActiveState.com> Message-ID: <15163.28879.54534.29084@cymru.basistech.com> Paul Prescod writes: > "Martin v. Loewis" wrote: [snip] > > ... Furthermore, people > > using non-BMP characters in source are probably not very interested in > > counting the characters: They want to display them. For just > > displaying them, you need to represent them, and you need the fonts. > > String manipulation is less important. > > What are the chances that anybody is in this situation in the near > future? Can you even display these characters on Windows? Does Tk > support them? And if so, on what platforms? What about the Java APIs? > (once again, these are real, not rhetorical questions) I can't speak for the characters in plane 1, but the characters in plane 2 have fonts available already for those who need them. Also, plane 14 contains code-points that *would* be used for both display and text processing applications. Finally I would expect that those using the ideographs in plane 2 care less about display than they do being able to encode and manipulate the data. Either the characters are used in names which must be put into databases and the like, or they are being used to encode historical documents for searching and the like. While display is important, I strongly suggest that the ability to display them does not outweigh the ability to work with strings containing them. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From paulp@ActiveState.com Thu Jun 28 19:59:29 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 28 Jun 2001 11:59:29 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A3DC5.CA6767FD@ActiveState.com> <200106280620.f5S6Kwe01395@mira.informatik.hu-berlin.de> <3B3B736F.316649CA@ActiveState.com> <15163.28879.54534.29084@cymru.basistech.com> Message-ID: <3B3B7E91.D5899327@ActiveState.com> Tom Emerson wrote: > >... 
> Used to encode > historical documents for searching and the like. While display is > important, I strongly suggest that the ability to display them does > not outweigh the ability to work with strings containing them. The ability to work with them is not at issue. The question is whether you can use them in string literals. One side of the argument says that "working with them" in narrow Python builds will be extremely difficult, so allowing them in literals and as inputs to unichr doesn't help much. The other side says that at least allowing them in literals makes them available in code in a straightforwards way. "Working with them" will still require understanding of surrogates. (in narrow Python builds!) -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From guido@digicool.com Thu Jun 28 20:37:43 2001 From: guido@digicool.com (Guido van Rossum) Date: Thu, 28 Jun 2001 15:37:43 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Thu, 28 Jun 2001 11:11:59 PDT." <3B3B736F.316649CA@ActiveState.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A3DC5.CA6767FD@ActiveState.com> <200106280620.f5S6Kwe01395@mira.informatik.hu-berlin.de> <3B3B736F.316649CA@ActiveState.com> Message-ID: <200106281937.f5SJbit27023@odiug.digicool.com> > > The rationale for supporting \U is two-fold: One, importing a module > > should not fail in one installation, and succeed in another (of the > > same Python version). Running the module may give different results, > > but you should be able to generate byte code. > > Isn't it already the case that big Python integer literals can be legal > on one platform and illegal on another? (I don't know, I just thought > that was the case....) Yes, this is why the argument for \U as surrogate-generator is not so strong. > > ... Furthermore, people > > using non-BMP characters in source are probably not very interested in > > counting the characters: They want to display them. For just > > displaying them, you need to represent them, and you need the fonts. > > String manipulation is less important. > > What are the chances that anybody is in this situation in the near > future? Can you even display these characters on Windows? Does Tk > support them? And if so, on what platforms? What about the Java APIs? > (once again, these are real, not rhetorical questions) I don't know the answers. > Wide Python builds may be the "default" before these characters become > practically usable in GUIs. :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Thu Jun 28 20:40:09 2001 From: guido@digicool.com (Guido van Rossum) Date: Thu, 28 Jun 2001 15:40:09 -0400 Subject: [I18n-sig] Closing some issues In-Reply-To: Your message of "Thu, 28 Jun 2001 11:28:28 PDT." <3B3B774C.5D3F1E99@ActiveState.com> References: <3B3B774C.5D3F1E99@ActiveState.com> Message-ID: <200106281940.f5SJeAe27046@odiug.digicool.com> > I'd like to close some issues in the PEP if there is agreement. If you > feel that the following issues still deserve further discussion, just > yell and I'll leave them as issues: > > * unichr() should never return surrogate pairs so its domain and range > vary between wide and narrow Python builds. +1 > * ord() should never accept pairs so its domain and range vary between > wide and narrow Python builds. 
+1 > * nowhere in the design will we discriminate against "lone surrogates" > other than potentially the codecs. +1 > "Agreement" means everybody comes out on the same side or Guido rules. +1 :-) I take it that \U is still open? At this point I am +1 on making that behave platform-specific too. --Guido van Rossum (home page: http://www.python.org/~guido/) From rick@unicode.org Thu Jun 28 21:40:27 2001 From: rick@unicode.org (Rick McGowan) Date: Thu, 28 Jun 2001 13:40:27 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! Message-ID: <200106281833.OAA31487@unicode.org> I have a question... Since Unicode does define upper-plane charactes -- some 40,000 of them I believe -- and more are on the way... What would be the use in going forward with any Python implementation that doesn't handle the 21-bit space? Rick From paulp@ActiveState.com Thu Jun 28 22:16:47 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 28 Jun 2001 14:16:47 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106281833.OAA31487@unicode.org> Message-ID: <3B3B9EBF.DB008390@ActiveState.com> Rick McGowan wrote: > > I have a question... > > Since Unicode does define upper-plane charactes -- some 40,000 of them I > believe -- and more are on the way... What would be the use in going > forward with any Python implementation that doesn't handle the 21-bit > space? There will be only one Python implementation and it will support all Unicode characters. As a compile time flag you can turn this support on or off based on the individual's feeling about the importance of the new characters versus the importance of conserving memory. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From JMachin@Colonial.com.au Thu Jun 28 23:08:04 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Fri, 29 Jun 2001 08:08:04 +1000 Subject: [I18n-sig] Support for "wide" Unicode characters Message-ID: <9F2D83017589D211BD1000805FA70CA703B13A04@ntxmel03.cmutual.com.au> [John Machin] > Guido asked: > Does UTF-8 transfer isolated surrogates correctly? > > No. See my bug report in SF. Briefly, a lone high > surrogate has its leading UTF-8 byte omitted, > causing an illegal UTF-8 sequence to be generated. > > Here's the URL: > http://sourceforge.net/tracker/?group_id=5470&atid=105470&func=detail&aid=43 > 3882 > > (or search for "surrogates") [Guido again] It's a bug indeed. But my question was about the definition of UTF8, not our (fallible) implementation. What *should* be the result of u'\ud800'.encode('utf8')? '\xed\xa0\x80' or an exception? And likewise, what should be the result of unicode('\xed\xa0\x80', 'utf8')? u'\ud800' or an exception? (Likewise for low surrogates; currently, u'\udc00'.encode('utf8') returns '\xed\xb0\x80', but unicode('\xed\xb0\x80', 'utf8') raise an exception.) [John Machin] OK, sorry for the misunderstanding. A UTF-8 codec can be made to transcode scalars up to at least 31 bits wide. The ISO 10646 specification allows for this. So, for marshalling and (pickling?) purposes, calling the UTF-8 codec with errors='liberal' would be the way to go. IMO, 'liberal' should still give an exception for over-long UTF-8 byte sequences -- an encoder which produces such is broken (either accidentally or deliberately) -- but should happily transcode any scalar value <= X for some X in (0x10FFFF, 0x7FFFFFFF). IMO, when errors is 'strict', upper limit should be 0xFFFF for narrow builds, and 0x10FFFF for wide builds. 
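A sketch of the strictness rule John proposes (check_scalar and the 'liberal' value are assumptions for illustration; current codecs only know 'strict', 'ignore' and 'replace'):

    import sys

    def check_scalar(cp, errors='strict'):
        if errors == 'strict':
            limit = sys.maxunicode    # 0xFFFF narrow, 0x10FFFF wide
        else:                         # proposed 'liberal'
            limit = 0x7FFFFFFF        # full 31-bit ISO 10646 space
        if not 0 <= cp <= limit:
            raise UnicodeError("scalar value out of range: 0x%X" % cp)
        return cp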
IMO, unicode(), u.encode() and the \U notation should all use 'strict' ... and perhaps the exception messages produced by the narrow build could be marketing-aligned and point the punter to the wide build. Cheers, John From martin@loewis.home.cs.tu-berlin.de Thu Jun 28 23:49:45 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 29 Jun 2001 00:49:45 +0200 Subject: [I18n-sig] Closing some issues In-Reply-To: <3B3B774C.5D3F1E99@ActiveState.com> (message from Paul Prescod on Thu, 28 Jun 2001 11:28:28 -0700) References: <3B3B774C.5D3F1E99@ActiveState.com> Message-ID: <200106282249.f5SMnj901841@mira.informatik.hu-berlin.de> > I'd like to close some issues in the PEP if there is agreement. I agree with all of those. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Jun 28 23:31:49 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 29 Jun 2001 00:31:49 +0200 Subject: [I18n-sig] Re: Unicode 3.1 and contradictions. In-Reply-To: (message from Markus Kuhn on Thu, 28 Jun 2001 16:47:59 +0100) References: Message-ID: <200106282231.f5SMVnC01808@mira.informatik.hu-berlin.de> > My list of unwanted UTF-8 code points was just the one found in a note > in the UTF-8 definition in ISO 10646-1:1993 (R.4): > > NOTE 3 - Values of x in the range 0000 D800 .. 0000 DFFF are reserved > for the UTF-16 form and do not occur in UCS-4. The values 0000 FFFE and > 0000 FFFF also do not occur (see clause 8). The mappings of these code > positions in UTF-8 are undefined. That explains a lot. Apparently, Unicode takes the stand of making the undefined well-defined, which is just in the spirit of standards: Unicode is an extension to ISO 10646, in this respect. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Jun 28 23:38:26 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 29 Jun 2001 00:38:26 +0200 Subject: [I18n-sig] Python Support for "Wide" Unicode characters In-Reply-To: <3B3B73D8.B9D92DE8@ActiveState.com> (message from Paul Prescod on Thu, 28 Jun 2001 11:13:44 -0700) References: <3B3A6438.6DA39268@ActiveState.com> <200106272319.f5RNJnO20162@odiug.digicool.com> <3B3A7857.1593F72@ActiveState.com> <200106280657.f5S6vYl01625@mira.informatik.hu-berlin.de> <3B3B73D8.B9D92DE8@ActiveState.com> Message-ID: <200106282238.f5SMcQT01809@mira.informatik.hu-berlin.de> > > > So there is no way to get the heuristic of "wchar_t if available, UCS-4 > > > if not". I'm not complaining, just checking. The list of options is just > > > two with ucs2 the default. > > > > I'd be complaining, though, if I wasn't that pleased with this PEP > > overall. > > Sorry, I don't understand the point you were making here. I still would prefer if the default was wchar_t if available, so I'd get a wide Python from distributors as default. As it stands, most distributors will ship a narrow Python 2.2, since they are unlikely to change the default settings.
Since I like the overall design of this patch very much, I'm not going to start long discussions on the detail of some default setting. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Jun 28 23:48:25 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 29 Jun 2001 00:48:25 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <3B3B736F.316649CA@ActiveState.com> (message from Paul Prescod on Thu, 28 Jun 2001 11:11:59 -0700) References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A3DC5.CA6767FD@ActiveState.com> <200106280620.f5S6Kwe01395@mira.informatik.hu-berlin.de> <3B3B736F.316649CA@ActiveState.com> Message-ID: <200106282248.f5SMmPI01840@mira.informatik.hu-berlin.de> > > The rationale for supporting \U is two-fold: One, importing a module > > should not fail in one installation, and succeed in another (of the > > same Python version). Running the module may give different results, > > but you should be able to generate byte code. > > Isn't it already the case that big Python integer literals can be legal > on one platform and illegal on another? (I don't know, I just thought > that was the case....) I guess so; I'm not even sure you can exchange byte code files across machines with different sizeof(long). OTOH, I think this is a real problem, and we should not extend this problem into other areas as well. Furthermore, if you encounter a source incompatibility between installations because of very large integers, you can switch to long integers with little effort. The same is not that easy for Unicode literals. > What are the chances that anybody is in this situation in the near > future? Can you even display these characters on Windows? Does Tk > support them? And if so, on what platforms? I'm pretty sure that Tk can display them soon after fonts become available. I believe the X11 fonts support full ISO 10646. Since Tk uses UTF-8, it is also capable of representing these characters internally. For Windows, I don't know the power of TrueType/OpenType in this respect, but I'd assume they have thought of UTF-16 already. As for the fonts themselves, I've seen PDF files for the plane 2 characters, so I guess fonts are available *somewhere*. > What about the Java APIs? I could not care less about the Unicode capabilities of Java. > Wide Python builds may be the "default" before these characters become > practically usable in GUIs. That would be a good thing, since I think infrastructures need to build from ground up (operating system, programming language, GUI libraries, applications). Given that it is much easier to support representing the characters in Python than producing a font, it seems only natural that Python can represent them first. Python won't have a lot of other facilities needed for processing them (like character properties, combining, sorting, etc), but the representation should work fairly early. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Jun 28 23:05:16 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v.
Date: Fri, 29 Jun 2001 00:05:16 +0200
Subject: [I18n-sig] Support for "wide" Unicode characters
In-Reply-To: <3B3AF887.5181D0CF@lemburg.com> (mal@lemburg.com)
References: <3B3ABFB8.84C7510B@ActiveState.com>
	<3B3AF887.5181D0CF@lemburg.com>
Message-ID: <200106282205.f5SM5GK00908@mira.informatik.hu-berlin.de>

> > Implementation
> >
> >     There is a new (experimental) define:
> >
> >         #define PY_UNICODE_SIZE 2
>
> Doesn't sizeof(Py_UNICODE) do the same ?

No, you can't use sizeof in a preprocessor #if test.

Regards,
Martin

From martin@loewis.home.cs.tu-berlin.de Thu Jun 28 23:10:46 2001
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Fri, 29 Jun 2001 00:10:46 +0200
Subject: [I18n-sig] Re: Unicode 3.1 and contradictions.
In-Reply-To: <200106281125.f5SBPVc20814@odiug.digicool.com> (message from
	Guido van Rossum on Thu, 28 Jun 2001 07:25:30 -0400)
References: <200106281125.f5SBPVc20814@odiug.digicool.com>
Message-ID: <200106282210.f5SMAk501230@mira.informatik.hu-berlin.de>

> > The UTF-8 representations of U+D800..U+DFFF, U+FFFE, and U+FFFF are not
> > allowed in a UTF-8 stream and a secure UTF-8 decoder must never output
> > any of these characters.
>
> Can you explain a bit more about the security issues?

I don't understand the comment about filters, but one aspect is the
requirement for a canonical encoding: If you encrypt two pieces of text
or code with the same key, the original pieces must be considered equal
iff the encrypted versions are equal. Non-canonical forms break this
guarantee: the pieces might be equal even if the encrypted output is
not.

Regards,
Martin

From martin@loewis.home.cs.tu-berlin.de Fri Jun 29 00:28:17 2001
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Fri, 29 Jun 2001 01:28:17 +0200
Subject: [I18n-sig] Support for "wide" Unicode characters
In-Reply-To: <9F2D83017589D211BD1000805FA70CA703B13A04@ntxmel03.cmutual.com.au>
	(JMachin@Colonial.com.au)
References: <9F2D83017589D211BD1000805FA70CA703B13A04@ntxmel03.cmutual.com.au>
Message-ID: <200106282328.f5SNSHf02025@mira.informatik.hu-berlin.de>

> IMO, unicode(), u.encode() and the \U notation should all use
> 'strict' ... and perhaps the exception messages produced by the
> narrow build could be marketing-aligned and point the punter to the
> wide build.

Both unicode() and u.encode() support an optional errors parameter, for
which Guido proposed to accept an additional value of "lenient". The
default is "strict".

Regards,
Martin

From tim.one@home.com Fri Jun 29 04:46:10 2001
From: tim.one@home.com (Tim Peters)
Date: Thu, 28 Jun 2001 23:46:10 -0400
Subject: [I18n-sig] Support for "wide" Unicode characters
In-Reply-To: <3B3B2CE8.B1A062C4@lemburg.com>
Message-ID: 

[MAL]
> Ah, OK... nice sofas these Crays, BTW ;-)

You're going to get a Cray Education before this is over even if it
kills you -- which it may. Crays (at least in my day) made for horrible
sofas! The oh-so-inviting padded leather "seats" surrounding the box
actually covered massive cooling coils. Sit on one for 10 minutes and
your butt went numb; some poor souls who tried sleeping on them
suffered serious cases of hypothermia. And these were people who didn't
believe *anything* was smaller than 64 bits. I can't imagine what it
would do to a C weenie with heretical delusions about sizeof(short) --
if it got the chance, it would probably put you in cryonic suspension
until PCs moved to 128-bit ints.
don't-screw-with-the-icy-ghost-of-seymour-cray-ly y'rs - tim

From fredrik@pythonware.com Fri Jun 29 09:54:56 2001
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Fri, 29 Jun 2001 10:54:56 +0200
Subject: [I18n-sig] Python Support for "Wide" Unicode characters
References: <3B3A6438.6DA39268@ActiveState.com>
	<200106272319.f5RNJnO20162@odiug.digicool.com>
	<3B3A7857.1593F72@ActiveState.com>
	<200106280657.f5S6vYl01625@mira.informatik.hu-berlin.de>
	<3B3B73D8.B9D92DE8@ActiveState.com>
	<200106282238.f5SMcQT01809@mira.informatik.hu-berlin.de>
Message-ID: <017301c10079$7e87db00$0900a8c0@spiff>

martin wrote:
> > Sorry, I don't understand the point you were making here.
>
> I would still prefer the default to be wchar_t if available, so that
> I'd get a wide Python from distributors by default. As it stands, most
> distributors will ship a narrow Python 2.2, since they are unlikely to
> change the default settings.

I haven't ruled out "wchar_t" as a default for 2.2, but we shouldn't
make the switch right now -- popular subsystems may not be 32-bit ready
(the xml stuff, tkinter and other gui toolkits). Just give it a little
more calendar time.

Cheers /F

From Misha.Wolf@reuters.com Fri Jun 29 20:07:10 2001
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Fri, 29 Jun 2001 20:07:10 +0100
Subject: [I18n-sig] 19th Unicode Conference, September 2001, San Jose,
	CA, USA -- Register now!
Message-ID: 

   Nineteenth International Unicode Conference (IUC19)
   Unicode and the Web: The Global Connection
   http://www.unicode.org/iuc/iuc19
   September 10-14, 2001
   San Jose, CA, USA

   Register now!

   * * * * *

The Internet and the World Wide Web continue to change the shape of
computing. The goal of network computing and understandable text access
across wide, diverse groups of people has brought great momentum to
computing environments that build Unicode into their foundation.
Whether it's Internet commerce, network access to data, or highly
portable applications, Unicode makes a solid foundation for the
network, global enterprises, and software users everywhere.

The Nineteenth International Unicode Conference (IUC19) will address
topics ranging from Unicode use in the World Wide Web and in operating
systems and databases, to the latest developments with Unicode 3.1,
Java, Open Source, XML and Web protocols. Conference attendees will
include managers, software engineers, systems analysts, and product
marketing personnel responsible for the development of software
supporting Unicode, as well as those involved in all aspects of the
globalization of software and the Internet.

CONFERENCE DATES

The Conference has been extended to 5 days:
   2 days of Tutorials / Workshops
   3 days of Conference Sessions

CONFERENCE WEB SITE, PROGRAM and REGISTRATION

The Conference Program, including abstracts and speaker biographies,
and the Registration form are now available at the Conference Web site:
http://www.unicode.org/iuc/iuc19

CONFERENCE SPONSORS

Agfa Monotype Corporation
Basis Technology Corporation
Lionbridge Technologies
Microsoft Corporation
Netscape Communications
Oracle Corporation
PeopleSoft, Inc.
Reuters Ltd.
Sun Microsystems, Inc.
Trados Corporation
Trigeminal Software, Inc.
World Wide Web Consortium (W3C)
Wrox Press

GLOBAL COMPUTING SHOWCASE

Visit the Showcase to find out more about products supporting the
Unicode Standard, and products and services that can help you
globalize/localize your software, documentation and Internet content.
For details, visit the Conference Web site:
http://www.unicode.org/iuc/iuc19

CONFERENCE VENUE

DoubleTree Hotel San Jose
2050 Gateway Place
San Jose, CA 95110 USA
Tel: +1 408 453 4000
Fax: +1 408 437 2898

CONFERENCE MANAGEMENT

Global Meeting Services Inc.
4360 Benhurst Avenue
San Diego, CA 92122, USA
Tel: +1 858 638 0206 (voice)
     +1 858 638 0504 (fax)
Email: info@global-conference.com
   or: conference@unicode.org

THE UNICODE CONSORTIUM

The Unicode Consortium was founded as a non-profit organization in
1991. It is dedicated to the development, maintenance and promotion of
The Unicode Standard, a worldwide character encoding. The Unicode
Standard encodes the characters of the world's principal scripts and
languages, and is code-for-code identical to the international standard
ISO/IEC 10646. In addition to cooperating with ISO on the future
development of ISO/IEC 10646, the Consortium is responsible for
providing character properties and algorithms for use in
implementations. Today the membership base of the Unicode Consortium
includes major computer corporations, software producers, database
vendors, research institutions, international agencies and various user
groups. For further information on the Unicode Standard, visit the
Unicode Web site at http://www.unicode.org or e-mail

   * * * * *

Unicode(r) and the Unicode logo are registered trademarks of Unicode,
Inc. Used with permission.