From mal@lemburg.com Fri Jun 1 09:10:08 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 01 Jun 2001 10:10:08 +0200 Subject: [I18n-sig] XML and UTF-16 References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> Message-ID: <3B174DE0.EFABF55E@lemburg.com> "Martin v. Loewis" wrote: > > > Yes, I think this would be a good idea. I would use something along > > the lines of: > > Please have a look at > xml.parsers.xmlproc.EntityParser.autodetect_encoding. This almost > follows the procedure in the XML recommendation, except that it does > not expect "unusual" byte orders (2134, 3412), and that it does not > detect EBCDIC. I don't have a file EntityParser in the xmlproc subdir... is that in CVS somewhere ? > > 0) Assume UTF-8. > > > > 1) Look for the UTF-16 and UTF-32 uniBOMs. If you find one, assume the > > appropriate transmission format and endian nature. Goto 4. > > > > 2) Look for the UTF-8 uniBOM, since some editors like putting that in. > > Ignore it and goto 4. > > I see this was added to the XML recommendation only in the second > edition, so it should also be added to xmlproc. > > > 3) Look for the sundry forms of '<?xml' > > with appropriate endian variants. If found, assume the detected > > encoding. Goto 4. > > Please note that ASCII is not detectable this way: If you see '<?xml' > then you don't know anything about the encoding except that you should > be able to parse the encoding= attribute successfully if present. I think that's what Tom had in mind here. Could we maybe have the function autodetect_encoding at some higher level in PyXML ?! This is a very basic API and doesn't only apply to xmlproc. I also think that it would be worthwhile adding a similar API to codecs.py which takes the magic ('<?xml') as argument and then tries to determine whether the input data is an ASCII superset, UTF-8 or UTF-16/32. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Fri Jun 1 2001 From: mal@lemburg.com (M.-A. Lemburg) Subject: [I18n-sig] XML and UTF-16 References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <3B16B5D4.730D8E30@ActiveState.com> Message-ID: <3B174E9F.4EDA2289@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > >... > > > > Perhaps we should have some smart auto-detection API somewhere > > which does this automagically ?! Something like > > > > guess_xml_encoding(data) -> encoding string > > > > It could work by looking at the first 256 bytes of the data > > string and then apply all the tricks needed to extract the > > encoding information (or default to UTF-8 if no such information > > is given). > > This might help: > > http://aspn.activestate.com/ASPN/Python/Cookbook/Recipe/52257 > > I think Lars has a version too... Could you clarify what the licensing conditions are for using code from your recipe collection ? Thanks, -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Fri Jun 1 09:17:04 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 01 Jun 2001 10:17:04 +0200 Subject: [I18n-sig] XML and UTF-16 References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B16B4DE.B0E8ADD4@ActiveState.com> Message-ID: <3B174F80.4D1E93FB@lemburg.com> Paul Prescod wrote: > > Tom Emerson wrote: > > > >... > > > > Yes. You can then pretty easily autodetect which Unicode > > transformation format is being used by looking at the first ten or > > so bytes. > > Actually, the first four bytes are sufficient to get you started. 
Then > you have to look at the encoding declaration if present. > > > If the BOM is present, that's a big clue right there. > > """Entities encoded in UTF-16 must begin with the Byte Order Mark > described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC > 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] > (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding > signature, not part of either the markup or the character data of the > XML document. XML processors must be able to use this character to > differentiate between UTF-8 and UTF-16 encoded documents.""" Where did you get that from ? Note that the Unicode specs have a different opinion on this... (a BOM mark is part of a protocol and should only be used if the encoding information is not available in some other form or implicit) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Fri Jun 1 13:59:37 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 1 Jun 2001 14:59:37 +0200 Subject: [I18n-sig] XML and UTF-16 In-Reply-To: <3B174DE0.EFABF55E@lemburg.com> (mal@lemburg.com) References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> Message-ID: <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> > > > Yes, I think this would be a good idea. I would use something along > > > the lines of: > > > > Please have a look at > > xml.parsers.xmlproc.EntityParser.autodetect_encoding. This almost > > follows the procedure in the XML recommendation, except that it does > > not expect "unusual" byte orders (2134, 3412), and that it does not > > detect EBCDIC. > > I don't have a file EntityParser in the xmlproc subdir... is > that in CVS somewhere ? Oops, missed one level of indirection: xml.parsers.xmlproc.xmlutils.EntityParser.autodetect_encoding And yes, the function is only in the CVS, not in a released version (yet). > Could we maybe have the function autodetect_encoding at > some higher level in PyXML ?! This is a very basic API and > doesn't only apply to xmlproc. We might (contributions are welcome). However, such a function would not necessarily be usable for xmlproc: xmlproc deals with reading data in small chunks, expecting that information may be broken at arbitrary boundaries. For example, would you expect that the autodetection function looks for the encoding= attribute? That may not be included in the first fragment of data. > I also think that it would be worthwhile adding a similar > API to codecs.py which takes the magic ('<?xml') > as argument and then tries to determine whether the input > data is an ASCII superset, UTF-8 or UTF-16/32. I don't think so. Doing the XML autodetection is not terribly complicated, and rarely needs to be done - you'd normally pass the byte stream to an XML parser, so you would not need to care about the encoding. As for XML and encodings, having a convenient mechanism to extend existing codecs to encode unknown characters as character entities is much more important, IMO, since that is very difficult to achieve with the existing API. 
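To illustrate how little code the autodetection needs, here is a rough sketch using the guess_xml_encoding name proposed earlier in this thread (the name is hypothetical — nothing like it exists in xmlproc or the standard library; the rare UCS-4 byte orders and EBCDIC are ignored, and the encoding= attribute of the XML declaration must still be checked afterwards):

    def guess_xml_encoding(data):
        # Provisional guess from the first four bytes only,
        # following Appendix F of the XML 1.0 recommendation.
        if data[:2] == '\xfe\xff':
            return 'utf-16-be'      # UTF-16 BOM, big endian
        if data[:2] == '\xff\xfe':
            return 'utf-16-le'      # UTF-16 BOM, little endian
        if data[:3] == '\xef\xbb\xbf':
            return 'utf-8'          # UTF-8 BOM (signature only)
        if data[:4] == '<?xm':
            return 'utf-8'          # really: any ASCII superset
        if data[:4] == '\x00<\x00?':
            return 'utf-16-be'      # '<?' without BOM, big endian
        if data[:4] == '<\x00?\x00':
            return 'utf-16-le'      # '<?' without BOM, little endian
        return 'utf-8'              # the default mandated by XML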
Regards, Martin From martin@loewis.home.cs.tu-berlin.de Fri Jun 1 14:06:11 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 1 Jun 2001 15:06:11 +0200 Subject: [I18n-sig] XML and UTF-16 In-Reply-To: <3B174F80.4D1E93FB@lemburg.com> (mal@lemburg.com) References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B16B4DE.B0E8ADD4@ActiveState.com> <3B174F80.4D1E93FB@lemburg.com> Message-ID: <200106011306.f51D6B000916@mira.informatik.hu-berlin.de> > > """Entities encoded in UTF-16 must begin with the Byte Order Mark > > described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC > > 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] > > (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding > > signature, not part of either the markup or the character data of the > > XML document. XML processors must be able to use this character to > > differentiate between UTF-8 and UTF-16 encoded documents.""" > > Where did you get that from ? That's from the XML recommendation, section 4.3.3. I really recommend that you get a copy of that document :-) > Note that the Unicode specs have a different opinion on this... (a > BOM mark is part of a protocol and should only be used if the > encoding information is not available in some other form or > implicit) Why is that different? XML says that the BOM is not part of the document, but an encoding signature. You say that it is part of a protocol - in the XML case, it is part of the encoding autodetection protocol. If the character was part of the document, any document containing it would be ill-formed, since the ZWNBSP is not allowed as the first character of an XML document (only whitespace and '<' are allowed, AFAICT). Regards, Martin From walter@livinglogic.de Fri Jun 1 14:58:09 2001 From: walter@livinglogic.de (Walter Doerwald) Date: Fri, 01 Jun 2001 15:58:09 +0200 Subject: [I18n-sig] XML and UTF-16 In-Reply-To: <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> Message-ID: <200106011558090859.0148DB03@mail.livinglogic.de> On 01.06.01 at 14:59 Martin v. Loewis wrote: > [...] > As for XML and encodings, having a convenient mechanism to extend > existing codecs to encode unknown characters as character entities is > much more important, IMO, since that is very difficult to achieve with > the existing API. I've written such functions: - escapeText(S, encoding) -> unicode Return a copy of the unicode string S, where every occurrence of '<', '>' and '&' and all unencodable characters in the specified encoding have been replaced with their XML character entities. - escapeAttr(S, encoding) -> unicode Return a copy of the unicode string S, where every occurrence of '<', '>', '&', and '\"' and all unencodable characters in the specified encoding have been replaced with their XML character entities. Although these functions are written in C, they have to call the codec twice for every single character (if encoding the string in one go fails), so they are rather slow for codecs implemented in Python (a rough pure-Python equivalent of escapeText is sketched below). Could this be used until we get codecs with customizable error handling? 
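A rough pure-Python equivalent of escapeText, for illustration only (the real functions are in C and use a slightly different two-calls-per-character strategy; this simplified version calls the codec once per character after a whole-string attempt would fail):

    def escapeText(s, encoding):
        # Replace '<', '>', '&' and every character the codec cannot
        # encode with the corresponding entity/character reference.
        result = u''
        for c in s:
            if c == u'<':
                result = result + u'&lt;'
            elif c == u'>':
                result = result + u'&gt;'
            elif c == u'&':
                result = result + u'&amp;'
            else:
                try:
                    c.encode(encoding)   # probe: is c encodable?
                    result = result + c
                except UnicodeError:
                    result = result + u'&#%d;' % ord(c)
        return result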
If yes, I could put them as a patch on python.sf.net or pyxml.sf.net or mail them to Martin. Bye, Walter Dörwald -- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de From mal@lemburg.com Fri Jun 1 14:57:11 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 01 Jun 2001 15:57:11 +0200 Subject: [I18n-sig] XML and UTF-16 References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B16B4DE.B0E8ADD4@ActiveState.com> <3B174F80.4D1E93FB@lemburg.com> <200106011306.f51D6B000916@mira.informatik.hu-berlin.de> Message-ID: <3B179F37.8AEE7D55@lemburg.com> "Martin v. Loewis" wrote: > > > > """Entities encoded in UTF-16 must begin with the Byte Order Mark > > > described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC > > > 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] > > > (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding > > > signature, not part of either the markup or the character data of the > > > XML document. XML processors must be able to use this character to > > > differentiate between UTF-8 and UTF-16 encoded documents.""" > > > > Where did you get that from ? > > That's from the XML recommendation, section 4.3.3. I really recommend > that you get a copy of that document :-) Just did... :) > > Note that the Unicode specs have a different opinion on this... (a > > BOM mark is part of a protocol and should only be used if the > > encoding information is not available in some other form or > > implicit) > > Why is that different? XML says that the BOM is not part of the > document, but an encoding signature. You say that it is part of a > protocol - in the XML case, it is part of the encoding autodetection > protocol. > > If the character was part of the document, any document containing it > would be ill-formed, since the ZWNBSP is not allowed as the first > character of an XML document (only whitespace and '<' are allowed, > AFAICT). In that sense you are right. I was under the impression that the quoted text was talking about UTF-16 documents in general (not just XML docs). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Fri Jun 1 22:10:50 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 01 Jun 2001 23:10:50 +0200 Subject: [I18n-sig] Encoding auto-detection References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> Message-ID: <3B1804DA.8C48861E@lemburg.com> "Martin v. Loewis" wrote: > > > I also think that it would be worthwhile adding a similar > > API to codecs.py which takes the magic ('<?xml') > > as argument and then tries to determine whether the input > > data is an ASCII superset, UTF-8 or UTF-16/32. > > I don't think so. Doing the XML autodetection is not terribly > complicated, and rarely needs to be done - you'd normally pass the > byte stream to an XML parser, so you would not need to care about the > encoding. I was talking about a general purpose encoding sniffer, the XML case would only be a special case. The idea is to pass a magic string to the API and then let it fiddle around with it to try to deduce the encoding. 
The magic string might also be a regular expression which then has the encoding parameter as group 1, etc. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Fri Jun 1 22:23:02 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 01 Jun 2001 23:23:02 +0200 Subject: [I18n-sig] XML and codecs References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> Message-ID: <3B1807B6.11ED32B9@lemburg.com> "Martin v. Loewis" wrote: > > As for XML and encodings, having a convenient mechanism to extend > existing codecs to encode unknown characters as character entities is > much more important, IMO, since that is very difficult to achieve with > the existing API. Until we've found a backward compatible way to fix this, how about adding a new error handling scheme which at least gives the caller enough information to do some smart processing on the input and output, e.g. errors="break": raise a UnicodeBreakError with argument (reason, error_position_in_input, work_done_so_far) The caller could then use the information returned by the codec to fix the input data and reuse the already encoded/decoded data to avoid duplicate work. This scheme is very simple, but also very effective, since it allows complex error processing to be done in the namespace where the data is being processed (rather than in a callback which wouldn't have access to this namespace). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Fri Jun 1 23:17:32 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sat, 2 Jun 2001 00:17:32 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: <3B1807B6.11ED32B9@lemburg.com> (mal@lemburg.com) References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1807B6.11ED32B9@lemburg.com> Message-ID: <200106012217.f51MHWR01771@mira.informatik.hu-berlin.de> > Until we've found a backward compatible way to fix this, how > about adding a new error handling scheme which at least gives > the caller enough information to do some smart processing on the > input and output, e.g. > > errors="break": > > raise a UnicodeBreakError with argument > (reason, error_position_in_input, work_done_so_far) That is good enough, IMO, so let's do it. I think we also need a few well-defined reasons, in particular UnicodeBreakError.CannotConvert # character not supported in target # character set UnicodeBreakError.OutOfData # input string stops in the middle # of a character The latter case deals with the nasty problem of UTF-8 input which breaks if your file.read() call happens to split a UTF-8 sequence. Of course, the well-known reasons could be subclasses, too. 
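A sketch of how a caller might use this scheme — everything here (UnicodeBreakError, the 'break' error mode, the reason constants) is the proposal under discussion, not an existing Python API:

    class UnicodeBreakError(UnicodeError):
        # Proposed well-defined reasons (could also be subclasses):
        CannotConvert = 'cannot convert'  # character not in target set
        OutOfData = 'out of data'         # input stops mid-character

    def encode_with_charrefs(text, encoding):
        # Encode text, replacing unencodable characters with XML
        # character references, reusing the work already done.
        parts = []
        while text:
            try:
                parts.append(text.encode(encoding, 'break'))
                break
            except UnicodeBreakError, (reason, position, done):
                if reason != UnicodeBreakError.CannotConvert:
                    raise
                parts.append(done)                       # keep prior work
                parts.append('&#%d;' % ord(text[position]))
                text = text[position + 1:]               # resume after it
        return ''.join(parts)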
Regards, Martin From martin@loewis.home.cs.tu-berlin.de Fri Jun 1 23:12:14 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sat, 2 Jun 2001 00:12:14 +0200 Subject: [I18n-sig] Encoding auto-detection In-Reply-To: <3B1804DA.8C48861E@lemburg.com> (mal@lemburg.com) References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1804DA.8C48861E@lemburg.com> Message-ID: <200106012212.f51MCEN01482@mira.informatik.hu-berlin.de> > I was talking about a general purpose encoding sniffer, the XML > case would only be a special case. The idea is to pass a magic > string to the API and then let it fiddle around with it to try > to deduce the encoding. The magic string might also be a regular > expression which then has the encoding parameter as group 1, etc. I see. For a general purpose encoding guesser to be useful, it would work totally differently from the XML autodetection. E.g. UTF-8 can be detected quite reliably, but you'll have to look at the entire input. In general, I think encoding auto-detection is a stupid idea; you really have to have a higher-level protocol that tells you what the encoding is. Trying Unicode-encodings-autodetection might be more successful, but I still think it is quite pointless: I predict that UTF-16 or UTF-32 will be quite rare, and that most Unicode text will be exchanged as UTF-8. In addition, unless you are writing a general-purpose text editor, there *will* be a higher-level protocol telling you the encoding. Regards, Martin From paulp@ActiveState.com Sat Jun 2 00:07:59 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 01 Jun 2001 16:07:59 -0700 Subject: [I18n-sig] Encoding auto-detection References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1804DA.8C48861E@lemburg.com> <200106012212.f51MCEN01482@mira.informatik.hu-berlin.de> Message-ID: <3B18204F.82B991F7@ActiveState.com> "Martin v. Loewis" wrote: > >... > > I see. For a general purpose encoding guesser to be useful, it would > work totally differently from the XML autodetection. Agreed. They should be treated as two different problems. >... > In general, I think encoding auto-detection is a stupid idea; you > really have to have a higher-level protocol that tells you what the > encoding is. These protocols are very unreliable. I often see data served from a website as application/octet-stream no matter what its real data type is. > ... Trying Unicode-encodings-autodetection might be more > successful, but I still think it is quite pointless: I predict that > UTF-16 or UTF-32 will be quite rare, and that most Unicode text will > be exchanged as UTF-8. On Windows, if you save a file as "Unicode", it means UTF-16. I think that UTF-16 is Microsoft's "standard" Unicode encoding. UTF-8 could be considered Unix's "standard" encoding. I don't think you should treat it as either-or. Autodetection is not as good as really knowing for sure, of course. That doesn't mean that it is *stupid*. 
It means it is the best fallback available when dealing with stupid systems like the Unix file system or misconfigured web servers. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From tree@basistech.com Sat Jun 2 00:43:41 2001 From: tree@basistech.com (Tom Emerson) Date: Fri, 1 Jun 2001 19:43:41 -0400 Subject: [I18n-sig] Encoding auto-detection In-Reply-To: <200106012212.f51MCEN01482@mira.informatik.hu-berlin.de> References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1804DA.8C48861E@lemburg.com> <200106012212.f51MCEN01482@mira.informatik.hu-berlin.de> Message-ID: <15128.10413.377254.142035@cymru.basistech.com> Martin v. Loewis writes: > In general, I think encoding auto-detection is a stupid idea; you > really have to have a higher-level protocol that tells you what the > encoding is. This is a utopian idea that completely falls apart in the real world. It is *very* common for email to be sent making use of both 8-bit and 7-bit encodings with no content-type or content-transfer-encoding. Without some form of encoding/character set detection you have no idea what the mail message is encoded with. The fact that the mail RFCs dictate something is irrelevant. Similarly you can almost never trust the character encoding specified for web pages. I have seen a lot of pages that claim to be using CP1252 or ISO-8859-1 that are actually encoded with Shift-JIS or EUC-CN or Big 5. Indeed, when I was working on the Device Mosaic browser (the descendant of NCSA Mosaic that was targeted for embedded devices) if we found a document claiming to be Latin-1 we ignored it and sniffed the encoding. It is also common to find pages in Japan, China, and Korea that don't specify a character set or encoding at all... the authors make assumptions about the people viewing the pages, which may be false. I have also seen Japanese pages that contain Shift-JIS *and* EUC-JP encoded characters in the *same* document. Higher level protocols cannot be believed. -tree > Trying Unicode-encodings-autodetection might be more > successful, but I still think it is quite pointless: I predict that > UTF-16 or UTF-32 will be quite rare, and that most Unicode text will > be exchanged as UTF-8. On Unix. This isn't necessarily true on other platforms. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From martin@loewis.home.cs.tu-berlin.de Sat Jun 2 07:59:35 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Sat, 2 Jun 2001 08:59:35 +0200 Subject: [I18n-sig] Encoding auto-detection In-Reply-To: <15128.10413.377254.142035@cymru.basistech.com> (message from Tom Emerson on Fri, 1 Jun 2001 19:43:41 -0400) References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1804DA.8C48861E@lemburg.com> <200106012212.f51MCEN01482@mira.informatik.hu-berlin.de> <15128.10413.377254.142035@cymru.basistech.com> Message-ID: <200106020659.f526xZM01136@mira.informatik.hu-berlin.de> > It is *very* common for email to be sent making use of both 8-bit and > 7-bit encodings with no content-type or content-transfer-encoding. I think this claim is difficult to support with facts. Of the messages I receive, most do have a MIME header, giving a charset in their content. > Indeed, when I was working on the Device Mosaic browser (the > descendant of NCSA Mosaic that was targeted for embedded devices) > if we found a document claiming to be Latin-1 we ignored it and > sniffed the encoding. That might be a useful thing to do, but I guess the routine you've been using was way more complex than what MAL suggested for the standard library. I doubt you can reliably detect Big 5 by looking at the first 10 or so bytes of an HTML document. In fact, I'd suggest that HTML encoding detection is yet again different from general-purpose encoding detection, since you'll have to take the declared encoding (if any) into account. > Higher level protocols cannot be believed. And neither can autodetection. Regards, Martin From mal@lemburg.com Sat Jun 2 12:24:05 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 02 Jun 2001 13:24:05 +0200 Subject: [I18n-sig] Encoding auto-detection References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1804DA.8C48861E@lemburg.com> <200106012212.f51MCEN01482@mira.informatik.hu-berlin.de> <15128.10413.377254.142035@cymru.basistech.com> Message-ID: <3B18CCD5.8EBF8546@lemburg.com> Tom Emerson wrote: > > Martin v. Loewis writes: > > In general, I think encoding auto-detection is a stupid idea; you > > really have to have a higher-level protocol that tells you what the > > encoding is. > > This is a utopian idea that completely falls apart in the real world. That's why I need such a function... first for XML and then for other files having some standard magic prepended to them. The reason for this is simple: even if a protocol defines which encoding to use, this is not necessarily respected in input data. The usual thing to do is first to try to decode the data into Unicode using the given encoding, then to analyse the data and try the guessed encoding and only then to reject the data as false input. Without the second step there would be far too many instances of data being rejected due to wrong encoding information, e.g. a common situation for XML is that XML files use Latin-1 in the body but forget to declare the encoding in the XML header. The parser will then default to UTF-8 and fail to read the data. 
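In rough code, the recovery procedure looks like this (all names here are illustrative; guess_encoding stands in for whatever sniffer is used):

    def decode_with_fallback(data, declared_encoding, guess_encoding):
        # 1. Trust the declared (protocol-level) encoding first.
        try:
            return unicode(data, declared_encoding)
        except (UnicodeError, LookupError):
            pass
        # 2. Analyse the data and try the guessed encoding instead.
        guessed = guess_encoding(data)
        if guessed and guessed != declared_encoding:
            try:
                return unicode(data, guessed)
            except UnicodeError:
                pass
        # 3. Only then reject the data as false input.
        raise UnicodeError('data does not decode with any known encoding')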
You have a similar situation for data which originated in parts of the world where more than one encoding is in common use, e.g. Russia or Asia. Input data generated by humans should always be treated with care ;-) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Sat Jun 2 12:26:14 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 02 Jun 2001 13:26:14 +0200 Subject: [I18n-sig] XML and codecs References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1807B6.11ED32B9@lemburg.com> <200106012217.f51MHWR01771@mira.informatik.hu-berlin.de> Message-ID: <3B18CD56.5BFDE77F@lemburg.com> "Martin v. Loewis" wrote: > > > Until we've found a backward compatible way to fix this, how > > about adding a new error handling scheme which at least gives > > the caller enough information to do some smart processing on the > > input and output, e.g. > > > > errors="break": > > > > raise a UnicodeBreakError with argument > > (reason, error_position_in_input, work_done_so_far) > > That is good enough, IMO, so let's do it. Ok. > I think we also need a few > well-defined reasons, in particular > > UnicodeBreakError.CannotConvert # character not supported in target > # character set > UnicodeBreakError.OutOfData # input string stops in the middle > # of a character > > The latter case deals with the nasty problem of UTF-8 input which > breaks if your file.read() call happens to split a UTF-8 sequence. > Of course, the well-known reasons could be subclasses, too. Fine. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From cyrus@garage.co.jp Sat Jun 2 13:33:35 2001 From: cyrus@garage.co.jp (Cyrus Shaoul) Date: Sat, 02 Jun 2001 08:33:35 -0400 Subject: Re[2]: [I18n-sig] Encoding auto-detection In-Reply-To: <15128.10413.377254.142035@cymru.basistech.com> References: <200106012212.f51MCEN01482@mira.informatik.hu-berlin.de> <15128.10413.377254.142035@cymru.basistech.com> Message-ID: <20010602082927.ABD5.CYRUS@garage.co.jp> I have to agree with Tom. If there is room for human error, there will be lots of errors. I have personally seen many CGI scripts that have been sent data in unexpected encodings by buggy browsers. These browsers are still in use (e.g. IE 3.0), and I bet some future browser will contain a similar bug. Just my .02, Cyrus > > This is a utopian idea that completely falls apart in the real world. > > It is *very* common for email to be sent making use of both 8-bit and > 7-bit encodings with no content-type or content-transfer-encoding. > Without some form of encoding/character set detection you have no idea > what the mail message is encoded with. The fact that the mail RFCs > dictate something is irrelevant. > > Similarly you can almost never trust the character encoding specified > for web pages. I have seen a lot of pages that claim to be using > CP1252 or ISO-8859-1 that are actually encoded with Shift-JIS or > EUC-CN or Big 5. 
Indeed, when I was working on the Device Mosaic > browser (the descendant of NCSA Mosaic that was targeted for > embedded devices) if we found a document claiming to be Latin-1 we > ignored it and sniffed the encoding. > > It is also common to find pages in Japan, China, and Korea that don't > specify a character set or encoding at all... the authors make > assumptions about the people viewing the pages, which may be false. I > have also seen Japanese pages that contain Shift-JIS *and* EUC-JP > encoded characters in the *same* document. > > Higher level protocols cannot be believed. > > -tree > From tree@basistech.com Sat Jun 2 18:10:30 2001 From: tree@basistech.com (Tom Emerson) Date: Sat, 2 Jun 2001 13:10:30 -0400 Subject: [I18n-sig] Encoding auto-detection In-Reply-To: <200106020659.f526xZM01136@mira.informatik.hu-berlin.de> References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1804DA.8C48861E@lemburg.com> <200106012212.f51MCEN01482@mira.informatik.hu-berlin.de> <15128.10413.377254.142035@cymru.basistech.com> <200106020659.f526xZM01136@mira.informatik.hu-berlin.de> Message-ID: <15129.7686.306629.523526@cymru.basistech.com> Martin v. Loewis writes: > > It is *very* common for email to be sent making use of both 8-bit and > > 7-bit encodings with no content-type or content-transfer-encoding. > > I think this claim is difficult to support with facts. Of the messages I > receive, most do have a MIME header, giving a charset in their > content. I am a computational linguist --- part of the work I've been doing over the last year is an email corpus, built from messages coming from a number of mailing lists from over thirteen countries. With over 21K messages and 60+ MB of text, my experience has been that many of these messages lack any indication of character set or encoding. I'll write a script to spin through the headers and determine how many conform to the standard RFCs, and how many actually include charset information either in the header or in a MIME body. > That might be a useful thing to do, but I guess the routine you've > been using was way more complex than what MAL suggested for the > standard library. I doubt you can reliably detect Big 5 by looking at > the first 10 or so bytes of an HTML document. You can't reliably detect much of anything by looking at the first 10 bytes of a document, unless in a very constrained domain like the character set detection that spawned this thread. So we agree. > > Higher level protocols cannot be believed. > > And neither can autodetection. That's right... I didn't mean to imply that it could. But the two together can be quite useful, and if you have enough text, autodetection can be quite accurate. The problem, of course, is that most text on the web contains a lot of English as well as other languages. -- Tom Emerson Basis Technology Corp. Sr. 
Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From walter@livinglogic.de Tue Jun 5 09:39:04 2001 From: walter@livinglogic.de (Walter Doerwald) Date: Tue, 05 Jun 2001 10:39:04 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: <3B1807B6.11ED32B9@lemburg.com> References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1807B6.11ED32B9@lemburg.com> Message-ID: <200106051039040859.000CF3EB@mail.livinglogic.de> On 01.06.01 at 23:23 M.-A. Lemburg wrote: > "Martin v. Loewis" wrote: > > > > As for XML and encodings, having a convenient mechanism to extend > > existing codecs to encode unknown characters as character entities is > > much more important, IMO, since that is very difficult to achieve with > > the existing API. > > Until we've found a backward compatible way to fix this, how > about adding a new error handling scheme which at least gives > the caller enough information to do some smart processing on the > input and output, e.g. > > errors="break": > > raise a UnicodeBreakError with argument > (reason, error_position_in_input, work_done_so_far) > > The caller could then use the information returned > by the codec to fix the input data and reuse the already > encoded/decoded data to avoid duplicate work. How would UTF-16 be handled? I guess without additional code multiple BOMs would be generated for a string that contains unencodable characters. > This scheme is very simple, but also very effective, since > it allows complex error processing to be done in the > namespace where the data is being processed (rather than > in a callback which wouldn't have access to this namespace). A callback could be a class instance with a __call__ method and so can have as much state information as it needs. Bye, Walter Dörwald -- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de From mal@lemburg.com Tue Jun 5 10:02:37 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 05 Jun 2001 11:02:37 +0200 Subject: [I18n-sig] XML and codecs References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1807B6.11ED32B9@lemburg.com> <200106051039040859.000CF3EB@mail.livinglogic.de> Message-ID: <3B1CA02D.71C4A6EB@lemburg.com> Walter Doerwald wrote: > > On 01.06.01 at 23:23 M.-A. Lemburg wrote: > > > "Martin v. Loewis" wrote: > > > > > > As for XML and encodings, having a convenient mechanism to extend > > > existing codecs to encode unknown characters as character entities is > > > much more important, IMO, since that is very difficult to achieve with > > > the existing API. > > > > Until we've found a backward compatible way to fix this, how > > about adding a new error handling scheme which at least gives > > the caller enough information to do some smart processing on the > > input and output, e.g. 
> > > > errors="break": > > > > raise a UnicodeBreakError with argument > > (reason, error_position_in_input, work_done_so_far) > > > > The caller could then use the information returned > > by the codec to fix the input data and reuse the already > > encoded/decoded data to avoid duplicate work. > > How would UTF-16 be handled? I guess without additional > code multiple BOMs would be generated for a string that > contains unencodable characters. Why ? You should know from the context which byte order is in use and can thus use the appropriate codec, UTF-16-LE or -BE. These don't generate BOMs. > > This scheme is very simple, but also very effective, since > > it allows complex error processing to be done in the > > namespace where the data is being processed (rather than > > in a callback which wouldn't have access to this namespace). > > A callback could be a class instance with a __call__ method > and so can have as much state information as it needs. Sure, but it breaks the current API completely. The above mechanism is different in that the communication in the error case is done by means of an exception. While this is not as fast as a callback it does have some advantages: * you can write the error handling code in the context using the codec * it enables you to write error handling code at higher levels in the calling stack * it fits in with the current API -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Tue Jun 5 18:26:15 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 5 Jun 2001 19:26:15 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: "walter@livinglogic.de"'s message of Tue, 05 Jun 2001 10:39:04 +0200 Message-ID: <200106051726.f55HQFY01124@mira.informatik.hu-berlin.de> > How would UTF-16 be handled? I guess without additional > code multiple BOMs would be generated for a string that > contains unencodable characters. When you generate or decode UTF-16, this is not a problem: There won't be any unencodable characters. Even if that was a problem: Just by raising the exception, there won't be multiple BOMs. So you have to provide additional code, anyway, so you better make sure this code is correct. The problem becomes real for codecs that preserve state: You'll need to maintain the state of the codec from the time the exception occurred, so that subsequent .encode calls will continue in the shift state they were in previously. So for codecs that preserve state across .encode calls, codecs.lookup will need to return a bound method as encode and decode function, not a simple function; see the iconv codec for an example. In some sense, one can argue that the UTF-16 codec also preserves state: whether it has yet emitted a BOM. Regards, Martin From mal@lemburg.com Tue Jun 5 19:01:52 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 05 Jun 2001 20:01:52 +0200 Subject: [I18n-sig] XML and codecs References: <200106051726.f55HQFY01124@mira.informatik.hu-berlin.de> Message-ID: <3B1D1E90.B3802D5E@lemburg.com> "Martin v. Loewis" wrote: > > > How would UTF-16 be handled? I guess without additional > > code multiple BOMs would be generated for a string that > > contains unencodable characters. > > When you generate or decode UTF-16, this is not a problem: There won't > be any unencodable characters. 
> > Even if that was a problem: Just by raising the exception, there won't > be multiple BOMs. So you have to provide additional code, anyway, so > you better make sure this code is correct. > > The problem becomes real for codecs that preserve state: You'll need > to maintain the state of the codec from the time the exception > occurred, so that subsequent .encode calls will continue in the shift > state they were in previously. Should be no problem since the exception will sort of freeze the current state of the codec (provided it's a StreamWriter/Reader) and let you use this state to take appropriate actions. > So for codecs that preserve state across .encode calls, codecs.lookup > will need to return a bound method as encode and decode function, not > a simple function; see the iconv codec for an example. Not sure what you mean here, but the encoder and decoder returned by codecs.lookup() must not maintain state. This property is reserved for StreamWriters and Readers (see the Unicode docs). > In some sense, one can argue that the UTF-16 codec also preserves > state: whether it has yet emitted a BOM. BTW, I haven't yet had time to check your utf16 patch but from a first glance it looks good. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Tue Jun 5 19:58:51 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 5 Jun 2001 20:58:51 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: "mal@lemburg.com"'s message of Tue, 05 Jun 2001 20:01:52 +0200 Message-ID: <200106051858.f55IwpU01510@mira.informatik.hu-berlin.de> > Should be no problem since the exception will sort of freeze > the current state of the codec (provided it's a StreamWriter/Reader) > and let you use this state to take appropriate actions. What do you mean: "provided it's a StreamReader/Writer". What if I invoke the encode method found in codec lookup, and get an exception? The exception does not carry the state. Suppose you encode into JIS X 0201. That has four shift states: CHARSETS = { "\033(B": US_ASCII, "\033(J": JISX0201_1976, "\033$@": JISX0208_1978, "\033$B": JISX0208_1983, } Depending on which of the escape codes you've emitted last, the following bytes will have different meanings. Now, suppose we encode a string that cannot be translated to JIS X 0201. The codec will raise an exception, telling us how many bytes it has encoded. Now, suppose we want to replace this character with the string "&#9898;". If we are in the US_ASCII shift state, we can immediately encode it. If we are in a different shift state, we must issue the control sequence first. When the codec does not preserve state, it cannot correctly encode the entire string, since concatenating the results of encode() invocations might be incorrect. If you don't believe me, tell me how I can use your proposed interface to encode a Unicode string into JIS X 0201 + XML escapes, using the encode/decode functions only. > Not sure what you mean here, but the encoder and decoder > returned by codecs.lookup() must not maintain state. This > property is reserved for StreamWriters and Readers (see the > Unicode docs). You mean the sentence that says # The functions/methods are expected to work in a stateless mode. What is "expected to work"? Who expects they work in stateless mode, and why? And what happens if they don't? 
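To make the failure mode concrete, here is a toy shift-state encoder — invented escape sequences and a toy byte mapping, only the shape of an ISO-2022-style codec, not a real one — exposed as a bound method so that state survives across calls:

    class ShiftStateEncoder:
        ASCII, OTHER = '\033(B', '\033$B'   # invented escape sequences

        def __init__(self):
            self.state = self.ASCII         # current shift state

        def encode(self, input, errors='strict'):
            output = ''
            for char in input:
                if ord(char) < 128:
                    target = self.ASCII
                else:
                    target = self.OTHER
                if target != self.state:
                    output = output + target  # emit shift sequence first
                    self.state = target
                output = output + chr(ord(char) & 0xff)  # toy mapping
            return output, len(input)

    # If codecs.lookup handed out a bound method,
    #     encode = ShiftStateEncoder().encode
    # then concatenating the results of successive encode() calls would
    # stay correct, because the object remembers its shift state.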
It also says # These must be functions or methods which have the same interface as # the encode()/decode() methods of Codec instances (see Codec # Interface). So surely, the result of codecs.lookup may be a method. If it is a method, it surely must be a bound method (or else, where does the self argument come from?) Since bound methods are allowed, the encode/decode functions *may* preserve state: A bound method always references state in the form of the object it is bound to. So I think the sentence in the documentation saying "expected to work" is an error. Regards, Martin From mal@lemburg.com Tue Jun 5 20:46:57 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 05 Jun 2001 21:46:57 +0200 Subject: [I18n-sig] XML and codecs References: <200106051858.f55IwpU01510@mira.informatik.hu-berlin.de> Message-ID: <3B1D3731.2B915C87@lemburg.com> "Martin v. Loewis" wrote: > > > Should be no problem since the exception will sort of freeze > > the current state of the codec (provided it's a StreamWriter/Reader) > > and let you use this state to take appropriate actions. > > What do you mean: "provided it's a StreamReader/Writer". What if I > invoke the encode method found in codec lookup, and get an exception? The encoders/decoders returned in the lookup tuple are not supposed to store state. If you want to or need to store state, then you should use the factory functions (StreamWriter and -Reader) to first create an instance which can store state and then use its .encode()/.decode() methods. > The exception does not carry the state. That's not what I meant. If you have created, say, a StreamReader object, then this object will store the state and if its .encode() method raises a UnicodeBreakError exception you can use the current state stored in the object to take some recovery action, etc. > Suppose you encode into JIS X > 0201. That has four shift states: > > CHARSETS = { > "\033(B": US_ASCII, > "\033(J": JISX0201_1976, > "\033$@": JISX0208_1978, > "\033$B": JISX0208_1983, > } > > Depending on which of the escape codes you've emitted last, the > following bytes will have different meanings. > > Now, suppose we encode a string that cannot be translated to JIS > X 0201. The codec will raise an exception, telling us how many bytes > it has encoded. Now, suppose we want to replace this character with > the string "&#9898;". If we are in the US_ASCII shift state, we can > immediately encode it. If we are in a different shift state, we must > issue the control sequence first. > > When the codec does not preserve state, it cannot correctly encode the > entire string, since concatenating the results of encode() invocations > might be incorrect. > > If you don't believe me, tell me how I can use your proposed interface > to encode a Unicode string into JIS X 0201 + XML escapes, using the > encode/decode functions only. > > > Not sure what you mean here, but the encoder and decoder > > returned by codecs.lookup() must not maintain state. This > > property is reserved for StreamWriters and Readers (see the > > Unicode docs). > > You mean the sentence that says > > # The functions/methods are expected to work in a stateless mode. > > What is "expected to work"? Who expects they work in stateless mode, > and why? And what happens if they don't? > > It also says > > # These must be functions or methods which have the same interface as > # the encode()/decode() methods of Codec instances (see Codec > # Interface). > > So surely, the result of codecs.lookup may be a method. 
If it is a > method, it surely must be a bound method (or else, where does the self > argument come from?) Since bound methods are allowed, the encode/decode > functions *may* preserve state: A bound method always references state > in the form of the object it is bound to. > > So I think the sentence in the documentation saying "expected to work" > is an error. This is per design and not a mistake. If the encoders/decoders (the first two items in the lookup tuple) stored state, then you would have serious problems when reusing them for different inputs. I'm not even talking about threading problems here. The other two entries were designed to provide stateful codec interfaces, so your JIS codec would have to use those in order to store shift states etc. or do more complex work on the data. The encoder/decoder functions should only provide very basic encoding/decoding facilities which do not require keeping state (e.g. they might have additional keyword arguments to parameterize them to work in different shift states). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Tue Jun 5 21:05:04 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 5 Jun 2001 22:05:04 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: "mal@lemburg.com"'s message of Tue, 05 Jun 2001 21:46:57 +0200 Message-ID: <200106052005.f55K54U02481@mira.informatik.hu-berlin.de> > > What do you mean: "provided it's a StreamReader/Writer". What if I > > invoke the encode method found in codec lookup, and get an exception? > > The encoders/decoders returned in the lookup tuple are not > supposed to store state. If you want to or need to store state, > then you should use the factory functions (StreamWriter and > -Reader) to first create an instance which can store state > and then use its .encode()/.decode() methods. To create one of these, I need a file object. I just want a stateful encoder, not a stream. So if I don't have a file object, how do I create an encoder? Plus, if I cannot use the functions returned from codecs.lookup in stateful encodings, what are they good for, anyways? > > So I think the sentence in the documentation saying "expected to work" > > is an error. > > This is per design and not a mistake. Ok, so it is an error in the design, not only in the documentation. > If the encoders/decoders (the first two items in the > lookup tuple) stored state, then you would have serious problems > when reusing them for different inputs. I'm not even talking about > threading problems here. What specific problems would you have? I.e. is there anything in the standard library that gets into serious problems if codecs.lookup returns a stateful object? > The other two entries were designed to provide stateful codec > interfaces, so your JIS codec would have to use those in order > to store shift states etc. or do more complex work on the data. First, as I said, I cannot use them as-is, since I need a file. Furthermore, are you saying that I can use codecs.lookup(enc)[:2] only for some encodings, not for others? That sounds like a huge design flaw. > The encoder/decoder functions should only provide very basic > encoding/decoding facilities which do not require keeping > state (e.g. they might have additional keyword arguments to > parameterize them to work in different shift states). Arghh. 
Whether the facilities are basic or not depends on the encoding. So again I consider this broken, and the best fix is to allow the callable objects returned in codecs.lookup(enc)[:2] to maintain state if they want. Users must then either look them up again if they want to reuse them for different input, or they can recycle them if they happen to know that no state is maintained. Regards, Martin From mal@lemburg.com Tue Jun 5 21:23:30 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 05 Jun 2001 22:23:30 +0200 Subject: [I18n-sig] XML and codecs References: <200106052005.f55K54U02481@mira.informatik.hu-berlin.de> Message-ID: <3B1D3FC2.988F3945@lemburg.com> "Martin v. Loewis" wrote: > > > > What do you mean: "provided it's a StreamReader/Writer". What if I > > > invoke the encode method found in codec lookup, and get an exception? > > > > The encoders/decoders returned in the lookup tuple are not > > supposed to store state. If you want to or need to store state, > > then you should use the factory functions (StreamWriter and > > -Reader) to first create an instance which can store state > > and then use its .encode()/.decode() methods. > > To create one of these, I need a file object. I just want a stateful > encoder, not a stream. So if I don't have a file object, how do I > create an encoder? Simple: use cStringIO ! > Plus, if I cannot use the functions returned from codecs.lookup in > stateful encodings, what are they good for, anyways? For simple stateless encodings. > > > So I think the sentence in the documentation saying "expected to work" > > > is an error. > > > > This is per design and not a mistake. > > Ok, so it is an error in the design, not only in the documentation. Oh please... > > If the encoders/decoders (the first two items in the > > lookup tuple) stored state, then you would have serious problems > > when reusing them for different inputs. I'm not even talking about > > threading problems here. > > What specific problems would you have? I.e. is there anything in the > standard library that gets into serious problems if codecs.lookup > returns a stateful object? Please reread what I wrote and then think this over again... by reusing a stateful encoder multiple times you would carry over state from one usage to the next, e.g. carry over the shift state from one data set to the next (which may not even use this shift state). > > The other two entries were designed to provide stateful codec > > interfaces, so your JIS codec would have to use those in order > > to store shift states etc. or do more complex work on the data. > > First, as I said, I cannot use them as-is, since I need a file. > > Furthermore, are you saying that I can use codecs.lookup(enc)[:2] only > for some encodings, not for others? That sounds like a huge design > flaw. These two APIs are exposed to simplify the interface for simple, stateless encodings. Since most encodings work just fine with these APIs they are indeed very useful. > > The encoder/decoder functions should only provide very basic > > encoding/decoding facilities which do not require keeping > > state (e.g. they might have additional keyword arguments to > > parameterize them to work in different shift states). > > Arghh. Whether the facilities are basic or not depends on the > encoding. > > So again I consider this broken, and the best fix is to allow the > callable objects returned in codecs.lookup(enc)[:2] to maintain state > if they want. 
> > Users must then either look them up again if they want to reuse them > for different input, or they can recycle them if they happen to know > that no state is maintained. Again, this decision was per design: the codec registry lookup mechanism caches the lookup tuples. With your proposal the cache would be rendered useless. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Tue Jun 5 21:50:43 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 5 Jun 2001 22:50:43 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: "mal@lemburg.com"'s message of Tue, 05 Jun 2001 22:23:30 +0200 Message-ID: <200106052050.f55Koho02886@mira.informatik.hu-berlin.de> > > To create one of these, I need a file object. I just want a stateful > > encoder, not a stream. So if I don't have a file object, how do I > > create an encoder? > > Simple: use cStringIO ! Are you serious? To encode strings, I need cStringIO ?!? > > Plus, if I cannot use the functions returned from codecs.lookup in > > stateful encodings, what are they good for, anyways? > > For simple stateless encodings. So it is not a general-purpose facility. What should a lookup function return if it cannot provide a stateless encoding function? > Please reread what I wrote and then think this over again... Why do you think I did not pay attention? > by reusing a stateful encoder multiple times you would carry over > state from one usage to the next, e.g. carry over the shift state > from one data set to the next (which may not even use this shift > state). Indeed, that's what I want. How else could continuing after an encoding error work? If I want to start with fresh data, I also need to get a fresh codec function, from codecs.lookup. > These two APIs are exposed to simplify the interface for simple, > stateless encodings. Since most encodings work just fine with > these APIs they are indeed very useful. It turns out that both UTF-16 and UTF-8 have problems with a stateless approach, so I'm questioning the usefulness of the API. Of course, having to use cStringIO isn't any better... > Again, this decision was per design: the codec registry lookup > mechanism caches the lookup tuples. With your proposal the cache > would be rendered useless. Given that encodings.search_function caches the result also, it is questionable why codecs.lookup should do that. One cache should be enough, and it should be in encodings, since all these encodings are known to be stateless. Regards, Martin From walter@livinglogic.de Wed Jun 6 15:52:36 2001 From: walter@livinglogic.de (Walter Doerwald) Date: Wed, 06 Jun 2001 16:52:36 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: <3B1CA02D.71C4A6EB@lemburg.com> References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1807B6.11ED32B9@lemburg.com> <200106051039040859.000CF3EB@mail.livinglogic.de> <3B1CA02D.71C4A6EB@lemburg.com> Message-ID: <200106061652360296.01088FC8@mail.livinglogic.de> On 05.06.01 at 11:02 M.-A. Lemburg wrote: > Walter Doerwald wrote: > > > [...] 
> > > > This scheme is very simple, but also very effective, since > > > it allows complex error processing to be done in the > > > namespace where the data is being processed (rather than > > > in a callback which wouldn't have access to this namespace). > > > > A callback could be a class instance with a __call__ method > > and so can have as much state information as it needs. > > Sure, but it breaks the current API completely. The above > mechanism is different in that the communication in the error > case is done by means of an exception. While this is not as > fast as a callback it does have some advantages: > > * you can write the error handling code in the context using > the codec > > * it enables you to write error handling code at higher levels > in the calling stack But this means that you would have to allow the encoder to keep state between calls. That's no issue with a callback, because there is only one call. > * it fits in with the current API That's right. Unfortunately there are a lot of functions that would have to be changed. Bye, Walter Dörwald -- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de From walter@livinglogic.de Wed Jun 6 16:51:10 2001 From: walter@livinglogic.de (Walter Doerwald) Date: Wed, 06 Jun 2001 17:51:10 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: <3B1E4BBA.9BA3A4D8@lemburg.com> References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1807B6.11ED32B9@lemburg.com> <200106051039040859.000CF3EB@mail.livinglogic.de> <3B1CA02D.71C4A6EB@lemburg.com> <200106061651570250.0107F741@mail.livinglogic.de> <3B1E4BBA.9BA3A4D8@lemburg.com> Message-ID: <200106061751100625.013E2FA0@mail.livinglogic.de> On 06.06.01 at 17:26 M.-A. Lemburg wrote: > Walter Doerwald wrote: > > > > On 05.06.01 at 11:02 M.-A. Lemburg wrote: > > > > > [...] > > > > > > Sure, but it breaks the current API completely. The above > > > mechanism is different in that the communication in the error > > > case is done by means of an exception. While this is not as > > > fast as a callback it does have some advantages: > > > > > > * you can write the error handling code in the context using > > > the codec > > > > > > * it enables you to write error handling code at higher levels > > > in the calling stack > > > > But this means that you would have to allow the encoder to keep > > state between calls. That's no issue with a callback, because there > > is only one call. > > Well, either the codec keeps state or your application; > here's some pseudo code to illustrate the first situation: > > def do_something(data): > > StreamWriter = codec.lookup('myencoding')[3] > output = cStringIO(data) > writer = StreamWriter(output, 'break') > while 1: > try: > writer.write(data) > except UnicodeBreakError, (reason, position, work): > # Write data converted so far > output.write(work) > # Roll back 10 chars in the input and retry > data = data[position - 10:] > else: > break > return output.getvalue() Apart from the fact that I have to use a StreamWriter (I probably would have to anyway, since only one BOM at the start of an output file is required.) this looks usable. The big question is: Is 'break' a temporary workaround that will go away as soon as we have error handling callbacks?
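(To make the "callback with state" idea concrete, here is a rough, untested sketch; the (enc, uni, pos) signature and the convention that the callback returns a replacement string are assumptions of mine, nothing like this exists in the codecs module today:)

    import codecs

    class EntityReplacer:
        # Hypothetical stateful error callback: substitutes an XML
        # character reference for each unencodable character and
        # remembers how often it was called.
        def __init__(self):
            self.count = 0
        def __call__(self, enc, uni, pos):
            self.count = self.count + 1
            return u"&#x%x;" % ord(uni[pos])

    e = EntityReplacer()
    # Assumes a patched ascii_encode that accepts a callable instead
    # of an error string -- pure speculation at this point:
    # codecs.ascii_encode(u"a\xe4b", e); e.count would then be 1.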
Do we want error handling callbacks? And finally: How fast is it? > > > * it fits in with the current API > > > > That's right. Unfortunately there are a lot of functions that > > would have to be changed. > > That's why I prefer small steps rather than replacing the > complete codec suite with new interfaces. The type of one argument changes in all the functions, i.e. there's a new set of *Ex functions, where const char *errors becomes PyObject *errors Bye, Walter D=F6rwald -- Walter D=F6rwald =B7 LivingLogic AG =B7 Bayreuth, Germany =B7 www.livinglogic.de From mal@lemburg.com Wed Jun 6 16:57:54 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 06 Jun 2001 17:57:54 +0200 Subject: [I18n-sig] XML and codecs References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1807B6.11ED32B9@lemburg.com> <200106051039040859.000CF3EB@mail.livinglogic.de> <3B1CA02D.71C4A6EB@lemburg.com> <200106061651570250.0107F741@mail.livinglogic.de> <3B1E4BBA.9BA3A4D8@lemburg.com> <200106061751100625.013E2FA0@mail.livinglogic.de> Message-ID: <3B1E5302.B9D83C94@lemburg.com> Walter Doerwald wrote: > > > > > Sure, but it breaks the current API completely. The above > > > > mechanism is different in that the communication in the error > > > > case is done by means of an exception. While this is not as > > > > fast as a callback it does have some advantages: > > > > > > > > * you can write the error handling code in the context using > > > > the codec > > > > > > > > * it enables you to write error handling code at higher levels > > > > in the calling stack > > > > > > But this means that you would have to allow the encoder to keep > > > state between calls. That's no isse with a callback, because there > > > is only one call. > > > > Well, either the codec keeps state or your application; > > here's some pseudo code to illustrate the first situation: > > > > def do_something(data): > > > > StreamWriter = codec.lookup('myencoding')[3] > > output = cStringIO(data) > > writer = StreamWriter(output, 'break') > > while 1: > > try: > > writer.write(data) > > except UnicodeBreakError, (reason, position, work): > > # Write data converted so far > > output.write(work) > > # Roll back 10 chars in the input and retry > > data = data[position - 10:] > > else: > > break > > return output.getvalue() > > Apart from the fact, that I have to use a StreamWriter > (I probably would have to anyway, since only one BOM at the > start of an output file is required.) this looks usable. > > The big question is: Is 'break' a temporary workaround > that will go away as soon as we have error handling > callbacks? No. > Do we want error handling callbacks? I think we should still keep them on the TODO list. > And finally: How fast is it? Since errors will always cause extra cycles to be used, I think the small overhead of using an exception for the notification is reasonable. Written in C, you probably won't notice much of a slowdown compared to a callback solution, since there exceptions are faster than in Python (the exception objects are created lazily in Python). > > > > * it fits in with the current API > > > > > > That's right. Unfortunately there are a lot of functions that > > > would have to be changed. 
> > > > That's why I prefer small steps rather than replacing the > > complete codec suite with new interfaces. > > The type of one argument changes in all the functions, i.e. > there's a new set of *Ex functions, where > const char *errors > becomes > PyObject *errors ... plus all the callback logic which goes with it, changes to the way errors are handled by the codecs, etc. It is doable, but certainly a lot of work. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Wed Jun 6 19:33:07 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 6 Jun 2001 20:33:07 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: "walter@livinglogic.de"'s message of Wed, 06 Jun 2001 17:51:10 +0200 Message-ID: <200106061833.f56IX7S01099@mira.informatik.hu-berlin.de> > > Well, either the codec keeps state or your application; > > here's some pseudo code to illustrate the first situation: > > > > def do_something(data): > > > > StreamWriter = codec.lookup('myencoding')[3] > > output = cStringIO(data) > > writer = StreamWriter(output, 'break') > > while 1: > > try: > > writer.write(data) > > except UnicodeBreakError, (reason, position, work): > > # Write data converted so far > > output.write(work) > > # Roll back 10 chars in the input and retry > > data = data[position - 10:] > > else: > > break > > return output.getvalue() I've missed Marc's posting of this code fragment: How can rolling back 10 characters possibly be the right thing? Couldn't this cause data to be written twice to the stream? I would expect that, when calling .write(), all correctly encoded data is written to the stream and that position points to the first character that cannot be encoded. Regards, Martin From mal@lemburg.com Wed Jun 6 20:24:28 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 06 Jun 2001 21:24:28 +0200 Subject: [I18n-sig] XML and codecs References: <200106061833.f56IX7S01099@mira.informatik.hu-berlin.de> Message-ID: <3B1E836C.D773D3B@lemburg.com> "Martin v. Loewis" wrote: > > > > Well, either the codec keeps state or your application; > > > here's some pseudo code to illustrate the first situation: > > > > > > def do_something(data): > > > > > > StreamWriter = codec.lookup('myencoding')[3] > > > output = cStringIO(data) > > > writer = StreamWriter(output, 'break') > > > while 1: > > > try: > > > writer.write(data) > > > except UnicodeBreakError, (reason, position, work): > > > # Write data converted so far > > > output.write(work) > > > # Roll back 10 chars in the input and retry > > > data = data[position - 10:] > > > else: > > > break > > > return output.getvalue() > > I've missed Marc's posting of this code fragment: How can rolling back > 10 characters possibly be the right thing? Couldn't this cause data to > be written twice to the stream? This depends on how the codec and encoding works. The above is just an example of how you could use the 'break' mechanism to apply customized action in case of an error. > I would expect that, when calling .write(), all correctly encoded data > is written to the stream and that position points to the first > character that cannot be encoded. i think it's better not to write any information to the stream unless you are absolutely sure that no error occurred. Remember that you cannot take back characters which were written to the stream. 
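To make the intended protocol a bit more concrete: a variant of my earlier sketch which skips just the offending character instead of rolling back would look like this (still pseudo code in the sense that the 'break' error mode and UnicodeBreakError are hypothetical, and 'myencoding' is a placeholder; note that it also fixes the sloppy codec.lookup and cStringIO(data) from the first version):

    import codecs, cStringIO

    def encode_with_fixups(data):
        StreamWriter = codecs.lookup('myencoding')[3]
        output = cStringIO.StringIO()
        writer = StreamWriter(output, 'break')
        while data:
            try:
                writer.write(data)
                break
            except UnicodeBreakError, (reason, position, work):
                # work holds the output produced before the failure;
                # nothing has been written to the stream yet
                output.write(work)
                # skip the offending character and retry with the rest
                data = data[position + 1:]
        return output.getvalue()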
With the above information at hand, the caller can make all decisions needed to assure the data written to the output stream is correct. The codec will place the work done so far into the third tuple argument and the position which caused the failure into the second. reason can be used to provide additional information to the caller. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Thu Jun 7 06:10:50 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 7 Jun 2001 07:10:50 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: "mal@lemburg.com"'s message of Wed, 06 Jun 2001 21:24:28 +0200 Message-ID: <200106070510.f575AoA00835@mira.informatik.hu-berlin.de> > The codec will place the work done so far into the third > tuple argument and the position which caused the failure > into the second. reason can be used to provide additional > information to the caller. How does that work with writelines()? In this case, the caller does not have the string which the position refers to. Regards, Martin From mal@lemburg.com Thu Jun 7 09:36:37 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 07 Jun 2001 10:36:37 +0200 Subject: [I18n-sig] XML and codecs References: <200106070510.f575AoA00835@mira.informatik.hu-berlin.de> Message-ID: <3B1F3D15.20795A56@lemburg.com> "Martin v. Loewis" wrote: > > > The codec will place the work done so far into the third > > tuple argument and the position which caused the failure > > into the second. reason can be used to provide additional > > information to the caller. > > How does that work with writelines()? In this case, the caller does > not have the string which the position refers to. In that case you'd either a) have to subclass the StreamWriter and provide the necessary logic in the .writelines() method (using the .write() method to do the actual work) or b) forget about .writelines() and move the for-loop directly into your application or c) use u"".join(datalines) and .write(). Not really all that difficult. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From walter@livinglogic.de Thu Jun 7 10:53:46 2001 From: walter@livinglogic.de (Walter Doerwald) Date: Thu, 07 Jun 2001 11:53:46 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: <3B1E5302.B9D83C94@lemburg.com> References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1807B6.11ED32B9@lemburg.com> <200106051039040859.000CF3EB@mail.livinglogic.de> <3B1CA02D.71C4A6EB@lemburg.com> <200106061651570250.0107F741@mail.livinglogic.de> <3B1E4BBA.9BA3A4D8@lemburg.com> <200106061751100625.013E2FA0@mail.livinglogic.de> <3B1E5302.B9D83C94@lemburg.com> Message-ID: <200106071153460500.002D9DCB@mail.livinglogic.de> On 06.06.01 at 17:57 M.-A. Lemburg wrote: > [...] > > Do we want error handling callbacks? > > I think we should still keep them on the TODO list. OK! Then I'll start playing around with it. > > And finally: How fast is it? 
> > Since errors will always cause extra cycles to be used, > I think the small overhead of using an exception for > the notification is reasonable. > > Written in C, you probably won't notice much of a slowdown > compared to a callback solution, since there, exceptions are > faster than in Python (the exception objects are created > lazily in Python). > > > > > > * it fits in with the current API > > > > > > > > That's right. Unfortunately there are a lot of functions that > > > > would have to be changed. > > > > > > That's why I prefer small steps rather than replacing the > > > complete codec suite with new interfaces. > > > > The type of one argument changes in all the functions, i.e. > > there's a new set of *Ex functions, where > > const char *errors > > becomes > > PyObject *errors > > .. plus all the callback logic which goes with it, changes > to the way errors are handled by the codecs, etc. It is doable, > but certainly a lot of work. Well, I need something to do in my free time! ;) Bye, Walter Dörwald -- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de From walter@livinglogic.de Thu Jun 7 21:27:33 2001 From: walter@livinglogic.de (Walter Doerwald) Date: Thu, 07 Jun 2001 22:27:33 +0200 Subject: [I18n-sig] XML and codecs In-Reply-To: <200106071153460500.002D9DCB@mail.livinglogic.de> References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1807B6.11ED32B9@lemburg.com> <200106051039040859.000CF3EB@mail.livinglogic.de> <3B1CA02D.71C4A6EB@lemburg.com> <200106061651570250.0107F741@mail.livinglogic.de> <3B1E4BBA.9BA3A4D8@lemburg.com> <200106061751100625.013E2FA0@mail.livinglogic.de> <3B1E5302.B9D83C94@lemburg.com> <200106071153460500.002D9DCB@mail.livinglogic.de> Message-ID: <200106072227330828.0271DE0B@mail.livinglogic.de> On 07.06.01 at 11:53 Walter Doerwald wrote: > On 06.06.01 at 17:57 M.-A. Lemburg wrote: > > > [...] > > > Do we want error handling callbacks? > > > > I think we should still keep them on the TODO list. > > OK! Then I'll start playing around with it. I started working on this, and it's progressing nicely. It's already possible to do things like:

    >>> import codecs
    >>> codecs.ascii_encode(
    ...     u"aäuüoöß",
    ...     lambda enc, uni, pos: u"&#x%x;" % ord(uni[pos]))
    ('a&#xe4;u&#xfc;o&#xf6;&#xdf;', 7)
    >>> import unicodedata
    >>> codecs.latin_1_encode(
    ...     u"a\u3042b",
    ...     lambda enc, uni, pos: u"<%s>" % unicodedata.name(uni[pos]))
    ('a<HIRAGANA LETTER A>b', 3)

String arguments are still accepted:

    >>> codecs.ascii_encode(u"aäuüoöß", "ignore")
    ('auo', 7)

Bye, Walter Dörwald -- Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de From mal@lemburg.com Thu Jun 7 22:04:23 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 07 Jun 2001 23:04:23 +0200 Subject: [I18n-sig] XML and codecs References: <3B167B9F.344D6992@lemburg.com> <15126.32275.110670.236066@cymru.basistech.com> <3B1681C5.71FD484D@lemburg.com> <15126.33995.327715.84261@cymru.basistech.com> <200105312046.f4VKkVY02913@mira.informatik.hu-berlin.de> <3B174DE0.EFABF55E@lemburg.com> <200106011259.f51CxbQ00877@mira.informatik.hu-berlin.de> <3B1807B6.11ED32B9@lemburg.com> <200106051039040859.000CF3EB@mail.livinglogic.de> <3B1CA02D.71C4A6EB@lemburg.com> <200106061651570250.0107F741@mail.livinglogic.de> <3B1E4BBA.9BA3A4D8@lemburg.com> <200106061751100625.013E2FA0@mail.livinglogic.de> <3B1E5302.B9D83C94@lemburg.com> <200106071153460500.002D9DCB@mail.livinglogic.de> <200106072227330828.0271DE0B@mail.livinglogic.de> Message-ID: <3B1FEC57.AA4587ED@lemburg.com> Walter Doerwald wrote: > > On 07.06.01 at 11:53 Walter Doerwald wrote: > > > On 06.06.01 at 17:57 M.-A. Lemburg wrote: > > > > > [...] > > > > Do we want error handling callbacks? > > > > > > I think we should still keep them on the TODO list. > > > > OK! Then I'll start playing around with it. > > I started working on this, and it's progressing nicely. Cool :) > It's already possible to do things like: >
>     >>> import codecs
>     >>> codecs.ascii_encode(
>     ...     u"aäuüoöß",
>     ...     lambda enc, uni, pos: u"&#x%x;" % ord(uni[pos]))
>     ('a&#xe4;u&#xfc;o&#xf6;&#xdf;', 7)
>     >>> import unicodedata
>     >>> codecs.latin_1_encode(
>     ...     u"a\u3042b",
>     ...     lambda enc, uni, pos: u"<%s>" % unicodedata.name(uni[pos]))
>     ('a<HIRAGANA LETTER A>b', 3)
>
> String arguments are still accepted:
>
>     >>> codecs.ascii_encode(u"aäuüoöß", "ignore")
>     ('auo', 7)
>
> Bye, > Walter Dörwald > > -- > Walter Dörwald · LivingLogic AG · Bayreuth, Germany · www.livinglogic.de -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From kajiyama@grad.sccs.chukyo-u.ac.jp Fri Jun 8 15:59:11 2001 From: kajiyama@grad.sccs.chukyo-u.ac.jp (Tamito KAJIYAMA) Date: Fri, 8 Jun 2001 23:59:11 +0900 Subject: [I18n-sig] JapaneseCodecs and the license Message-ID: <200106081459.XAA06340@dhcp198.grad.sccs.chukyo-u.ac.jp> Hi. I decided to change the license of my JapaneseCodecs package from GNU GPL to a BSD variant. Due to the license change, I released JapaneseCodecs 1.3. It is available at: http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/python/ There is no change in the code, so you don't need to update your copy if the license doesn't matter. By the way, I also released a new module named kanjilib. As the name implies, the kanjilib module provides Japanese encoding conversion functions for EUC-JP, Shift_JIS and ISO-2022-JP. The module does not rely on Python's Unicode facilities, so it may be convenient if you need to handle Japanese character encodings but not Unicode, or if you need Japanese encoding conversion in Python 1.5.2 or earlier. The module is also available on the page above. Thanks, -- KAJIYAMA, Tamito From mal@lemburg.com Fri Jun 8 16:14:31 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 08 Jun 2001 17:14:31 +0200 Subject: [I18n-sig] JapaneseCodecs and the license References: <200106081459.XAA06340@dhcp198.grad.sccs.chukyo-u.ac.jp> Message-ID: <3B20EBD7.F112D288@lemburg.com> Tamito KAJIYAMA wrote: > > Hi.
> > I decided to change the license of my JapaneseCodecs package > from GNU GPL to a BSD variant. Due to the license change, I > released JapaneseCodecs 1.3. It is available at: > > http://pseudo.grad.sccs.chukyo-u.ac.jp/~kajiyama/python/ > > There is no change in the codes, so you don't need to update > your copy if the license doesn't matter. Great... this is very good news ! I just wish some of the other codec authors would follow your example. Anyway, your move will certainly improve the usability of Python in Asia. > By the way, I also released a new module named kanjilib. As the > name implies, the kanjilib module provides Japanese encoding > conversion functions for EUC-JP, Shift_JIS and ISO-2022-JP. The > module does not rely on Python's Unicode facilities, so it may > be convenient if you need to handle Japanese character encodings > but not Unicode, or if you need Japanese encoding conversion in > Python 1.5.2 or former. The module is also available on the > page above. Thank you, -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From paulp@ActiveState.com Fri Jun 8 17:30:05 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 08 Jun 2001 09:30:05 -0700 Subject: [I18n-sig] JapaneseCodecs and the license References: <200106081459.XAA06340@dhcp198.grad.sccs.chukyo-u.ac.jp> <3B20EBD7.F112D288@lemburg.com> Message-ID: <3B20FD8D.25CD576@ActiveState.com> "M.-A. Lemburg" wrote: > >... > > Great... this is very good news ! I just wish some of the > other codec authors would follow your example. Anyway, > your move will certainly improve the usability of Python in Asia. Frank Chen has agreed to do the same for Chinese codecs. I asked him if he would do so a few days ago. He sent me a zipfile with a license that is: "It is licensed under the same license as Python 2.1." I can send this zipfile on to you, MAL and you could look them over and then if they meet your approval you could check them into the codecs directory. Does that sound good? -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From mal@lemburg.com Fri Jun 8 18:05:28 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Fri, 08 Jun 2001 19:05:28 +0200 Subject: [I18n-sig] JapaneseCodecs and the license References: <200106081459.XAA06340@dhcp198.grad.sccs.chukyo-u.ac.jp> <3B20EBD7.F112D288@lemburg.com> <3B20FD8D.25CD576@ActiveState.com> Message-ID: <3B2105D8.AA40284F@lemburg.com> Paul Prescod wrote: > > "M.-A. Lemburg" wrote: > > > >... > > > > Great... this is very good news ! I just wish some of the > > other codec authors would follow your example. Anyway, > > your move will certainly improve the usability of Python in Asia. > > Frank Chen has agreed to do the same for Chinese codecs. I asked him if > he would do so a few days ago. He sent me a zipfile with a license that > is: > > "It is licensed under the same license as Python 2.1." > > I can send this zipfile on to you, MAL and you could look them over and > then if they meet your approval you could check them into the codecs > directory. Does that sound good? I'll have to get BDFL approval on that first since these codec are huge. 
When we first discussed these issues it was decided to keep the codecs in a separate package which was to be maintained by packagers like ActiveState ;-) I'm not so sure anymore, though, since adding a few more 100kB to the distribution archive will certainly not hurt anybody these days and it would certainly gain some user base in Asia... which we are currently losing to [that other Japanese scripting language ;-)]. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tim.one@home.com Fri Jun 8 19:07:19 2001 From: tim.one@home.com (Tim Peters) Date: Fri, 8 Jun 2001 14:07:19 -0400 Subject: [I18n-sig] JapaneseCodecs and the license In-Reply-To: <3B20FD8D.25CD576@ActiveState.com> Message-ID: [Paul Prescod] > Frank Chen has agreed to do the same for Chinese codecs. I asked him if > he would do so a few days ago. He sent me a zipfile with a license that > is: > > "It is licensed under the same license as Python 2.1." Ah, licensing. I suggest people hold off just a little longer on this. While Python isn't released under the GPL, we've got nothing against it either, and the FSF doesn't believe the 2.1 license is GPL *compatible*. So releasing more stuff under the 2.1 license will create that many more problems for GPL'ed projects. We have agreement from the FSF that the license for 2.0.1, 2.1.1 and 2.2 (whichever gets released first -- none have yet) is GPL-compatible, so that's a friendlier target to shoot for. For anyone who has actually read all these things, the only real difference between 2.1's license and 2.0.1/2.1.1/2.2's is removing the contentious "State of Virginia" choice-of-law clause. I doubt that's a clause anyone in China would be keen to keep anyway. From jaleco@gameone.com.tw Mon Jun 18 09:17:20 2001 From: jaleco@gameone.com.tw (jaleco) Date: Mon, 18 Jun 2001 16:17:20 +0800 Subject: [I18n-sig] unicode Message-ID: <000a01c0f7cf$18733bf0$94bd4ed3@jaleco> How to convert an integer to unicode type, behavior like chr() to an integer type?
From mal@lemburg.com Mon Jun 18 09:25:11 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 18 Jun 2001 10:25:11 +0200 Subject: [I18n-sig] unicode References: <000a01c0f7cf$18733bf0$94bd4ed3@jaleco> Message-ID: <3B2DBAE7.EBD653D8@lemburg.com> > jaleco wrote: > > How to convert an integer to unicode type, > behavior like chr() to an integer type? Try unichr(). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From barry@wooz.org Tue Jun 19 20:52:35 2001 From: barry@wooz.org (Barry A. Warsaw) Date: Tue, 19 Jun 2001 15:52:35 -0400 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15118.27210.930905.339141@anthem.wooz.org> <200105252012.f4PKCg801160@mira.informatik.hu-berlin.de> Message-ID: <15151.44419.951894.490695@anthem.wooz.org> >>>>> "MvL" == Martin v Loewis writes: >> But the po-file format documentation doesn't say that >> additional flags can be defined for #, comments. It seems to >> me a simple omission in the documentation, right? Is the >> intent of #, flags that the extraction tools can define >> additional, language-specific flags? MvL> I'd say that nobody has thought of that. Bruno is probably MvL> the person to give a definitive yay or nay here, but I'd hope MvL> that tools shouldn't go into flames if they see an extra MvL> flag. At least GNU msgmerge does not show any concern. MvL> Of course, it would be better if this possibility could be MvL> codified somewhere, and if gettext.texi could serve as the MvL> repository of well-known flags - even if they don't all have MvL> a meaning to GNU gettext. Adding such documentation is MvL> probably an issue of submitting patches against gettext.texi. I'm trying to close this issue out (along with the associated SF patch). Since I haven't heard otherwise from Bruno, I'm going to change the output to produce "#, docstring" flags. -Barry From barry@digicool.com Tue Jun 19 23:59:31 2001 From: barry@digicool.com (Barry A. Warsaw) Date: Tue, 19 Jun 2001 18:59:31 -0400 Subject: [I18n-sig] Autoguessing charset for Unicode strings? Message-ID: <15151.55635.370702.813650@yyz.digicool.com> I just don't know enough about Unicode in general (I've been one of those eye-glazers Skip refers to ;), so I figured I'd ask this question here. First, some background. I'm trying to add support for RFC 2047 in mimelib. Essentially, this RFC specifies how to include non-ASCII characters in mail headers, by describing an encoding format. The format lets you wrap "funny" characters in something like: =?iso-8859-1?Q?B=E2rry W=E2rs=E2w?= So, I think I've got the first part working, which is this: when I see such an encoded header, I pull out the encoded string, quopri decode it[*], then coerce to Unicode, giving the charset part as the second argument to unicode().
Specifically, the algorithm is something like:

    parts = value.split('?')
    if parts[0].endswith('=') and parts[4].startswith('='):
        charset = parts[1]
        encoding = parts[2].lower()
        atom = parts[3]
        if encoding == 'q':
            decoded_atom = quopri.decodestring(atom)
        elif encoding == 'b':
            decoded_atom = base64.decodestring(atom)
        else:
            raise ValueError, 'bad encoding: %s' % encoding
        return unicode(decoded_atom, charset)

So far so good. Now let's say I want to go in the other direction, i.e. given a Unicode string, I want to create the RFC 2047 encoded string to add to the header, so I need to be able to go "the other way 'round". Is this possible without requiring the user to explicitly provide the charset that the Unicode string is encoded with? My understanding is that the unicode string doesn't have a notion of the charset that it was encoded with, but is it possible to guess the charset of a Unicode string reliably? Even if you can only guess 80% of the time, that'd be fine if I can throw an exception for the other 20%. Is there an existing Python solution for this? Does my question even make sense? ;) Thanks, -Barry [*] The `Q' (or `q') in between the ?'s means the string is encoded using quoted-printable. Thus the recent rash of fixes to the quopri module. The RFC says that alternatively, a `B' (or `b') is valid, meaning Base64 was used. From martin@loewis.home.cs.tu-berlin.de Wed Jun 20 00:27:28 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 20 Jun 2001 01:27:28 +0200 Subject: [I18n-sig] Autoguessing charset for Unicode strings? In-Reply-To: <15151.55635.370702.813650@yyz.digicool.com> (barry@digicool.com) References: <15151.55635.370702.813650@yyz.digicool.com> Message-ID: <200106192327.f5JNRSp01853@mira.informatik.hu-berlin.de> > So far so good. Now let's say I want to go in the other direction, > i.e. given a Unicode string, I want to create the RFC 2047 encoded > string to add to the header, so I need to be able to go "the other way > 'round". Is this possible without requiring the user to explicitly > provide the charset that the Unicode string is encoded with? Yes, doing so is trivial - the tricky part is to make it work elegantly. > My understanding is that the unicode string doesn't have a notion of > the charset that it was encoded with, but is it possible to guess the > charset of a Unicode string reliably? Even if you can only guess 80% > of the time, that'd be fine if I can throw an exception for the other > 20%. Is there an existing Python solution for this? Does my question > even make sense? ;) Your question makes perfect sense, it is one of the rather troubling problems in the world of character set conversions. Another form of the same problem is "how does Tk pick the right font to display some unicode string"? Back to your question: The easiest path is to always use UTF-8 as the outgoing character set. UTF-8 is a well-recognized MIME encoding (although I forgot the RFC number), and it is capable of encoding all Unicode strings losslessly. However, that might produce quotations even if there are no funny characters in the string, so a better procedure might be:

1. try to encode as ASCII. If that succeeds, no quotation is needed
2. if that fails, use UTF-8

Now, many email readers will still choke these days when they see UTF-8 (the Microsoft ones being positive exceptions here), but will recognize Latin-1. So, another procedure might be

1. try to encode as ASCII
2. if that fails, try iso-8859-1
3. if that fails, use UTF-8

You'll see that this becomes more and more expensive. People now may propose that this really should be application controlled, but I think they'd be misguided: the application is normally in no better position to select a "good" encoding than the library. The latter algorithm may also be considered Euro-centric. It probably is. BTW, the same procedure probably needs to be used for MIME messages of type text/plain when a charset= is specified. I.e. usage of mimify.CHARSET is really not appropriate anymore. Regards, Martin From tree@basistech.com Wed Jun 20 00:05:29 2001 From: tree@basistech.com (Tom Emerson) Date: Tue, 19 Jun 2001 19:05:29 -0400 Subject: [I18n-sig] Autoguessing charset for Unicode strings? In-Reply-To: <200106192327.f5JNRSp01853@mira.informatik.hu-berlin.de> References: <15151.55635.370702.813650@yyz.digicool.com> <200106192327.f5JNRSp01853@mira.informatik.hu-berlin.de> Message-ID: <15151.55993.967688.913171@cymru.basistech.com> Martin v. Loewis writes: > Now, many email readers will still choke these days when they see > UTF-8 (the Microsoft ones being positive exceptions here), but will > recognize Latin-1. So, another procedure might be >
> 1. try to encode as ASCII
> 2. if that fails, try iso-8859-1
> 3. if that fails, use UTF-8
>
> You'll see that this becomes more and more expensive. People now may > propose that this really should be application controlled, but I think > they'd be misguided: the application is normally in no better position > to select a "good" encoding than the library. > > The latter algorithm may also be considered Euro-centric. It probably > is. Yes, it is. ;-) Western-Euro-centric, in fact. One could hint the character set in (2) based on the domain name of the sender, e.g., if the sender is from .jp then try ISO-2022-JP instead of 8859-1. It would be possible to construct a table mapping ranges of Unicode codepoints (perhaps even character blocks) to certain legacy encodings so that the correct one can be chosen quickly. Something like this is needed when transcoding from Unicode to ISO-2022-CN. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From JMachin@Colonial.com.au Wed Jun 20 00:50:23 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Wed, 20 Jun 2001 09:50:23 +1000 Subject: [I18n-sig] Autoguessing charset for Unicode strings? Message-ID: <9F2D83017589D211BD1000805FA70CA703B139AE@ntxmel03.cmutual.com.au> maybe not so expensive, depending on (a) what's in C and what's in Python and (b) function call overhead and (c) what proportion of text needs which character set ...

    loop once through your Unicode;
    if there were any chars with ordinal > 255, then use UTF-8
    elif there were any > 127, then use iso-8859-1
    else use ASCII

-----Original Message----- From: Martin v. Loewis [mailto:martin@loewis.home.cs.tu-berlin.de] Sent: Wednesday, 20 June 2001 9:27 To: barry@digicool.com Cc: i18n-sig@python.org Subject: Re: [I18n-sig] Autoguessing charset for Unicode strings? [snip] Now, many email readers will still choke these days when they see UTF-8 (the Microsoft ones being positive exceptions here), but will recognize Latin-1. So, another procedure might be

1. try to encode as ASCII
2. if that fails, try iso-8859-1
3. if that fails, use UTF-8

You'll see that this becomes more and more expensive.
[snip] Regards, Martin From tim.one@home.com Wed Jun 20 01:32:19 2001 From: tim.one@home.com (Tim Peters) Date: Tue, 19 Jun 2001 20:32:19 -0400 Subject: [I18n-sig] Autoguessing charset for Unicode strings? In-Reply-To: <9F2D83017589D211BD1000805FA70CA703B139AE@ntxmel03.cmutual.com.au> Message-ID: [Machin, John] > maybe not so expensive, depending on (a) what's in C and what's in > Python and (b) function call overhead and (c) what proportion of text > needs which character set ... >
> loop once through your Unicode;
> if there were any chars with ordinal > 255, then use UTF-8
> elif there were any > 127, then use iso-8859-1
> else use ASCII

I don't know whether that algorithm makes sense, but it's efficient enough in Python:

    biggest = max(map(ord, some_unicode_string))
    if biggest > 255:
        yadda
    elif biggest > 127:
        yadda
    else:
        yadda

So the bulk of the work goes almost entirely at C speed. From martin@loewis.home.cs.tu-berlin.de Wed Jun 20 07:57:12 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 20 Jun 2001 08:57:12 +0200 Subject: [I18n-sig] Autoguessing charset for Unicode strings? In-Reply-To: <15151.55993.967688.913171@cymru.basistech.com> (message from Tom Emerson on Tue, 19 Jun 2001 19:05:29 -0400) References: <15151.55635.370702.813650@yyz.digicool.com> <200106192327.f5JNRSp01853@mira.informatik.hu-berlin.de> <15151.55993.967688.913171@cymru.basistech.com> Message-ID: <200106200657.f5K6vCX01071@mira.informatik.hu-berlin.de> > It would be possible to construct a table mapping ranges of Unicode > codepoints (perhaps even character blocks) to certain legacy encodings > so that the correct one can be chosen quickly. Something like this is > needed when transcoding from Unicode to ISO-2022-CN. That would be valuable as a general-purpose service in the Python library, it seems. I have no experience with such API, but I think codecs.find_encodings(ustring) could work; this would return a list of tuples, each tuple containing the name of an encoding and the number of initial characters of ustring that can be represented in this encoding. An important implementation detail, of course, is how to construct the necessary data structures in an efficient way. For the codecs that ship with Python, the tables could be precomputed. For dynamically registered codecs, the first problem is to come up with a list of all known codec names - which in itself would be a useful service...
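To make that concrete, a pure-Python sketch of the interface (find_encodings is hypothetical, and the two-entry table is a toy; real tables would be precomputed from the codec mapping data):

    _RANGES = {
        'ascii':      [(0x0000, 0x007F)],
        'iso-8859-1': [(0x0000, 0x00FF)],
    }

    def find_encodings(ustring):
        # For each known encoding, count how many initial characters
        # of ustring fall into its representable ranges.
        result = []
        for name, intervals in _RANGES.items():
            count = 0
            for ch in ustring:
                code = ord(ch)
                hit = 0
                for lo, hi in intervals:
                    if lo <= code <= hi:
                        hit = 1
                        break
                if not hit:
                    break
                count = count + 1
            result.append((name, count))
        return result

With the toy table, find_encodings(u"abc\xe9") would return something like [('ascii', 3), ('iso-8859-1', 4)].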
Regards, Martin From keichwa@gmx.net Wed Jun 20 07:35:49 2001 From: keichwa@gmx.net (Karl Eichwalder) Date: 20 Jun 2001 08:35:49 +0200 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings In-Reply-To: <15151.44419.951894.490695@anthem.wooz.org> References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15118.27210.930905.339141@anthem.wooz.org> <200105252012.f4PKCg801160@mira.informatik.hu-berlin.de> <15151.44419.951894.490695@anthem.wooz.org> Message-ID: barry@wooz.org (Barry A. Warsaw) writes: > I'm trying to close this issue out (along with the associated SF > patch). Since I haven't heard otherwise from Bruno, I'm going to > change the output to produce "#, docstring" flags. Sounds good to me. Please, make sure to put the "#, ..." expression just before the "msgid" line; thus it's easier for the translator to see (sometimes we have very long "#: " lines). -- work : ke@suse.de | ,__o : http://www.suse.de/~ke/ | _-\_<, home : keichwa@gmx.net | (*)/'(*) From tdickenson@geminidataloggers.com Wed Jun 20 10:30:41 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Wed, 20 Jun 2001 10:30:41 +0100 Subject: [I18n-sig] Autoguessing charset for Unicode strings? In-Reply-To: References: <9F2D83017589D211BD1000805FA70CA703B139AE@ntxmel03.cmutual.com.au> Message-ID: On Tue, 19 Jun 2001 20:32:19 -0400, "Tim Peters" wrote: >I don't know whether that algorithm makes sense, but it's efficient enough >in Python: >
>     biggest = max(map(ord, some_unicode_string))

or marginally more efficient still:

    biggest = ord(max(some_unicode_string))

Toby Dickenson tdickenson@geminidataloggers.com From haible@ilog.fr Wed Jun 20 11:05:07 2001 From: haible@ilog.fr (Bruno Haible) Date: Wed, 20 Jun 2001 12:05:07 +0200 (MET DST) Subject: [I18n-sig] Re: pygettext.py extraction of docstrings In-Reply-To: <15151.44419.951894.490695@anthem.wooz.org> References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15118.27210.930905.339141@anthem.wooz.org> <200105252012.f4PKCg801160@mira.informatik.hu-berlin.de> <15151.44419.951894.490695@anthem.wooz.org> Message-ID: <200106201005.MAA14588@oberkampf.ilog.fr> barry@wooz.org writes: > MvL> I'd hope > MvL> that tools shouldn't go into flames if they see an extra > MvL> flag. At least GNU msgmerge does not show any concern. The tools don't flame if there is an unknown #, flag, but the tools like msgmerge currently don't preserve the flag either. Support for other languages than C/C++ in the gettext tools is on my list for gettext 0.12. This includes calling pygettext, and it also includes support for language specific #, flags. > I'm trying to close this issue out (along with the associated SF > patch). Since I haven't heard otherwise from Bruno, I'm going to > change the output to produce "#, docstring" flags. OK. Bruno From barry@wooz.org Wed Jun 20 20:44:40 2001 From: barry@wooz.org (Barry A.
Warsaw) Date: Wed, 20 Jun 2001 15:44:40 -0400 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15118.27210.930905.339141@anthem.wooz.org> <200105252012.f4PKCg801160@mira.informatik.hu-berlin.de> <15151.44419.951894.490695@anthem.wooz.org> Message-ID: <15152.64808.571763.915115@anthem.wooz.org> >>>>> "KE" == Karl Eichwalder writes: KE> Sounds good to me. Please, make sure to put the "#, ..." KE> expression just before the "msgid" line; thus it's easier for KE> the translator to see (sometimes we've very long "#: " lines). Ah, good point. Done. >>>>> "BH" == Bruno Haible writes: BH> Support for other languages than C/C++ in the gettext tools is BH> on my list for gettext 0.12. This includes calling pygettext, BH> and it also includes support for language specific #, flags. Cool. Let me know if I can help. I'm relying on pygettext quite heavily in Mailman, so I think it's pretty solid (latest revision is pygettext.py 1.20). Martin's also written a Python version of msgfmt which is in Python's Tools/i18n directory. Cheers, -Barry From martin@loewis.home.cs.tu-berlin.de Wed Jun 20 22:24:18 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 20 Jun 2001 23:24:18 +0200 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings In-Reply-To: <15152.64808.571763.915115@anthem.wooz.org> (barry@wooz.org) References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15118.27210.930905.339141@anthem.wooz.org> <200105252012.f4PKCg801160@mira.informatik.hu-berlin.de> <15151.44419.951894.490695@anthem.wooz.org> <15152.64808.571763.915115@anthem.wooz.org> Message-ID: <200106202124.f5KLOIW02383@mira.informatik.hu-berlin.de> > Cool. Let me know if I can help. I'm relying on pygettext quite > heavily in Mailman, so I think it's pretty solid (latest revision is > pygettext.py 1.20). Personally, I think xgettext should itself recognize docstrings. The po-utils already support extracting doc strings, and I added support to extract strings with __doc__ from C modules as well. Maybe I'll look into contributing these features to GNU gettext with native code. Regards, Martin From barry@wooz.org Wed Jun 20 23:07:52 2001 From: barry@wooz.org (Barry A. Warsaw) Date: Wed, 20 Jun 2001 18:07:52 -0400 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15118.27210.930905.339141@anthem.wooz.org> <200105252012.f4PKCg801160@mira.informatik.hu-berlin.de> <15151.44419.951894.490695@anthem.wooz.org> <15152.64808.571763.915115@anthem.wooz.org> <200106202124.f5KLOIW02383@mira.informatik.hu-berlin.de> Message-ID: <15153.7864.788320.742815@anthem.wooz.org> >>>>> "MvL" == Martin v Loewis writes: >> Cool. Let me know if I can help. I'm relying on pygettext >> quite heavily in Mailman, so I think it's pretty solid (latest >> revision is pygettext.py 1.20). MvL> Personally, I think xgettext should itself recognize MvL> docstrings. 
The po-utils already support extracting doc MvL> strings, and I added support to extract strings with __doc__ MvL> from C modules as well. MvL> Maybe I'll look into contributing these features to GNU MvL> gettext with native code. Cool, just be sure to make docstring extraction optional. E.g. it makes sense for Mailman's bin/* scripts where the module docstring doubles as usage text, but it doesn't make much sense for most plain old module docstrings. OTOH, maybe we should define a convention in the docstring to indicate that it's ripe for extraction. E.g. an _ as the first character in the docstring... -Barry From martin@loewis.home.cs.tu-berlin.de Thu Jun 21 07:46:53 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 21 Jun 2001 08:46:53 +0200 Subject: [I18n-sig] Re: pygettext.py extraction of docstrings In-Reply-To: <15153.7864.788320.742815@anthem.wooz.org> (barry@wooz.org) References: <14840.35473.307059.990479@anthem.concentric.net> <200010272228.AAA01066@loewis.home.cs.tu-berlin.de> <15113.29005.357449.812516@anthem.wooz.org> <15117.38438.361043.255768@anthem.wooz.org> <15118.27210.930905.339141@anthem.wooz.org> <200105252012.f4PKCg801160@mira.informatik.hu-berlin.de> <15151.44419.951894.490695@anthem.wooz.org> <15152.64808.571763.915115@anthem.wooz.org> <200106202124.f5KLOIW02383@mira.informatik.hu-berlin.de> <15153.7864.788320.742815@anthem.wooz.org> Message-ID: <200106210646.f5L6krj01120@mira.informatik.hu-berlin.de> > Cool, just be sure to make docstring extraction optional. E.g. it > makes sense for Mailman's bin/* scripts where the module docstring > doubles as usage text, but it doesn't make much sense for most plain > old module docstrings. Certainly. This is essentially like a new keyword (-k) to look for. > OTOH, maybe we should define a convention in the docstring to indicate > that it's ripe for extraction. E.g. an _ as the first character in > the docstring... Please, no. In any case, it is up to translators to translate them; if the doc strings look too useless, they can ignore them. Regards, Martin From paulp@ActiveState.com Sat Jun 23 02:35:18 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Fri, 22 Jun 2001 18:35:18 -0700 Subject: [I18n-sig] International Components for Unicode Message-ID: <3B33F256.3C133966@ActiveState.com> Is this of any value to us? http://oss.software.ibm.com/icu/index.html -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From martin@loewis.home.cs.tu-berlin.de Sat Jun 23 08:47:34 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sat, 23 Jun 2001 09:47:34 +0200 Subject: [I18n-sig] International Components for Unicode In-Reply-To: <3B33F256.3C133966@ActiveState.com> (message from Paul Prescod on Fri, 22 Jun 2001 18:35:18 -0700) References: <3B33F256.3C133966@ActiveState.com> Message-ID: <200106230747.f5N7lYG01049@mira.informatik.hu-berlin.de> > Is this of any value to us? > > http://oss.software.ibm.com/icu/index.html I'm not sure. It always seemed to me that ICU is an all-or-nothing solution. I.e. if you want to access its functionality, you have to use their Unicode type, their locale objects, their message catalogs and so on. Python 2.1 offers already quite a lot of this functionality; merging that with ICU would be a real challenge. You'd probably need to offer a choice: either ICU locales or C locales; either ICU message catalogs or gettext. 
For the Unicode types, you'd have to copy strings forth and back between ICU Unicode objects and Python Unicode objects. Also, offering these services to Python users is challenging. It can't really become a standard library: The ICU distribution is 6.5MB of C++ source code, so I doubt it would ever be included in core Python. Somebody could volunteer and offer wrapper code, and put that on SF. To use that API, an application author would need to get ICU, and the wrapper (preferably in versions that match). Later, all users of the application also need to install ICU, and the wrapper. These days, Linux distributions offer precompiled ICU installations, but that might add to the problems rather than reducing them: The wrapper will need to deal with multiple ICU versions. Finally, ICU solves none of the most urgent Python-and-I18N problems: None of the standard libraries will become more Unicode-aware than they are now; it still is not possible to use non-ASCII text in source code in a convenient way; printing Unicode strings to sys.stdout will continue to produce exceptions. So my guess is that nothing will happen with ICU integration, and that the question will come up every few months. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Sat Jun 23 09:26:26 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sat, 23 Jun 2001 10:26:26 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> (message from Guido van Rossum on Tue, 20 Feb 2001 14:36:35 -0500) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> Message-ID: <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> [Uche] > Sure. I admit it's hearsay, but I thought I'd read that because Java > Unicode is or was underspecified, that there was the possibility of > transposition of the high-surrogate with the low-surrogate character > between Java implementations or platforms. I've tried to find out what problem that could be. So far, I found http://developer.java.sun.com/developer/bugParade/bugs/4344266.html Here, they complain that the codecs don't properly check for surrogates that straddle invocations of convert, or get incorrect surrogate pairs. There is a bug report on SF that Python has similar problems. http://developer.java.sun.com/developer/bugParade/bugs/4328816.html summarizes problems that have been fixed with surrogates in UTF-8, again, similar problems are probably present in Python. There were also a few bug reports about surrogates working differently depending on locale (fail in zh_CN, pass in C), and type of virtual machine (fail in classic, pass in hotspot). I could not find any report on a bug where surrogates are output in incorrect order. [Guido] > On the XML sig the following exchange happened. I don't know enough > about the issues to investigate, but I'm sure that someone here can > provide insight? It seems to boil down to whether or not surrogates > may get transposed when between platforms. I very much doubt this could ever happen. Regards, Martin From mal@lemburg.com Sat Jun 23 11:38:39 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 23 Jun 2001 12:38:39 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> Message-ID: <3B3471AF.1311E872@lemburg.com> Could someone please restate the original question ?
The archives don't seem to have the original postings and the quotes Martin have in his reply don't seem to have anything todo with Python. About surrogate support in Python: the UTF-8 codec has full surrogate support for encodings and decoding, the unicode-escape codec can decode using surrogates, all others don't support surrogates. Thanks, -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ "Martin v. Loewis" wrote: > > [Uche] > > Sure. I admit it's hearsay, but I thought I'd read that because Java > > Unicode is or was underspecified, that there was the possibility of > > transposition of the high-surrogate with the low-surrogate character > > between Java implementations or platforms. > > I've tried to find out what problem that could be. So far, I found > > http://developer.java.sun.com/developer/bugParade/bugs/4344266.html > > Here, they complain that the codecs don't properly check for > surrogates that straddle invocations of convert, or get incorrect > surrogate pairs. There is a bug report on SF that Python has similar > problems. > > http://developer.java.sun.com/developer/bugParade/bugs/4328816.html > > summarizes problems that have been fixed with surrogates in UTF-8, > again, similar problems are probably present in Python. > > There were also a few bug reports about surrogates working differently > depending on locale (fail in zh_CN, pass in C), and type of virtual > machine (fail in classic, pass in hotspot). > > I could not find any report on a bug where surrogates are output in > incorrect order. > > [Guido] > > On the XML sig the following exchange happened. I don't know enough > > about the issues to investigate, but I'm sure that someone here can > > provide insight? It seems to boil down to whether or not surrogates > > may get transposed when between platforms. > > I very much doubt this could ever happen. > > Regards, > Martin > > _______________________________________________ > I18n-sig mailing list > I18n-sig@python.org > http://mail.python.org/mailman/listinfo/i18n-sig From martin@loewis.home.cs.tu-berlin.de Sat Jun 23 13:20:38 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sat, 23 Jun 2001 14:20:38 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <3B3471AF.1311E872@lemburg.com> (mal@lemburg.com) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> Message-ID: <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> > About surrogate support in Python: the UTF-8 codec has full > surrogate support for encodings and decoding I think there are a number of bugs lying around here. For example, shouldn't >>> u" \ud800 ".encode("utf-8") ' \xa0\x80 ' give an error, since this is a lone low surrogate word? Likewise, but somewhat more troubling, surrogates that straddle write invocations are not processed properly. >>> s=StringIO.StringIO() >>> _,_,r,w=codecs.lookup("utf-8") >>> f=w(s) >>> f.write(u"\ud800") >>> f.write(u"\udc00") >>> f.flush() >>> s.getvalue() '\xa0\x80\xed\xb0\x80' whereas the correct answer would have been >>> u"\ud800\udc00".encode("utf-8") '\xf0\x90\x80\x80' Regards, Martin From mal@lemburg.com Sat Jun 23 21:19:09 2001 From: mal@lemburg.com (M.-A. 
Lemburg) Date: Sat, 23 Jun 2001 22:19:09 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> Message-ID: <3B34F9BD.4FDEFC62@lemburg.com> "Martin v. Loewis" wrote: > > > About surrogate support in Python: the UTF-8 codec has full > > surrogate support for encodings and decoding > > I think there are a number of bugs lying around here. For example, > shouldn't > > >>> u" \ud800 ".encode("utf-8") > ' \xa0\x80 ' > > give an error, since this is a lone low surrogate word? Yes. > Likewise, but somewhat more troubling, surrogates that straddle write > invocations are not processed properly. > > >>> s=StringIO.StringIO() > >>> _,_,r,w=codecs.lookup("utf-8") > >>> f=w(s) > >>> f.write(u"\ud800") > >>> f.write(u"\udc00") > >>> f.flush() > >>> s.getvalue() > '\xa0\x80\xed\xb0\x80' > > whereas the correct answer would have been > > >>> u"\ud800\udc00".encode("utf-8") > '\xf0\x90\x80\x80' This is a special case of the above (since the encoder will see truncated surrogates and should raise raise an exception for these). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tree@basistech.com Sat Jun 23 21:20:54 2001 From: tree@basistech.com (Tom Emerson) Date: Sat, 23 Jun 2001 16:20:54 -0400 Subject: [I18n-sig] International Components for Unicode In-Reply-To: <3B33F256.3C133966@ActiveState.com> References: <3B33F256.3C133966@ActiveState.com> Message-ID: <15156.64038.410669.795084@cymru.basistech.com> The one thing from ICU that would be useful is the plethora of encoding tables it comes with. If we had support for their tables we would have access to several hundred (last I checked they had over 600 encodings) encodings immediately available, and they would be responsible for updating them. -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From mal@lemburg.com Sat Jun 23 22:18:34 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sat, 23 Jun 2001 23:18:34 +0200 Subject: [I18n-sig] International Components for Unicode References: <3B33F256.3C133966@ActiveState.com> <15156.64038.410669.795084@cymru.basistech.com> Message-ID: <3B3507AA.2E5121C1@lemburg.com> Tom Emerson wrote: > > The one thing from ICU that would be useful is the plethora of > encoding tables it comes with. If we had support for their tables we > would have access to several hundred (last I checked they had over 600 > encodings) encodings immediately available, and they would be > responsible for updating them. While this would be nice to have, the size of ICU will prevent any inclusion in the Python core. However, wrapping all or parts of the lib to integrate them into the existing Python i18n support would certainly be a project worth trying. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Sat Jun 23 23:19:22 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Sun, 24 Jun 2001 00:19:22 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <3B34F9BD.4FDEFC62@lemburg.com> (mal@lemburg.com) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> Message-ID: <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> > > Likewise, but somewhat more troubling, surrogates that straddle write > > invocations are not processed properly. > > > > >>> s=StringIO.StringIO() > > >>> _,_,r,w=codecs.lookup("utf-8") > > >>> f=w(s) > > >>> f.write(u"\ud800") > > >>> f.write(u"\udc00") > > >>> f.flush() > > >>> s.getvalue() > > '\xa0\x80\xed\xb0\x80' > > > > whereas the correct answer would have been > > > > >>> u"\ud800\udc00".encode("utf-8") > > '\xf0\x90\x80\x80' > > This is a special case of the above (since the encoder will > see truncated surrogates and should raise an exception > for these). I don't think it should; it is not truncated since a later write call will provide the missing word. If you have a Unicode stream, it should be possible to read the stream contents in arbitrary chunks of words, and encode it with a stream encoder. The stream encoder should produce the same output no matter how you split the input. Under your proposed behaviour, this is not the case. Please note that http://sourceforge.net/tracker/index.php?func=detail&aid=433882&group_id=5470&atid=105470 adds a few other aspects to the problem: It appears that Unicode 3.1 specifies that certain forms of UTF-8 encoded surrogates are merely irregular, not illegal. There may be some misinterpretation of the spec in this report, but I think all this needs careful checking. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Sat Jun 23 23:26:27 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sun, 24 Jun 2001 00:26:27 +0200 Subject: [I18n-sig] International Components for Unicode In-Reply-To: <15156.64038.410669.795084@cymru.basistech.com> (message from Tom Emerson on Sat, 23 Jun 2001 16:20:54 -0400) References: <3B33F256.3C133966@ActiveState.com> <15156.64038.410669.795084@cymru.basistech.com> Message-ID: <200106232226.f5NMQR720529@mira.informatik.hu-berlin.de> > The one thing from ICU that would be useful is the plethora of > encoding tables it comes with. If we had support for their tables we > would have access to several hundred encodings immediately (last I > checked they had over 600), and they would be > responsible for updating them. That's true, but I'd prefer to integrate the encodings that come with the operating systems first. E.g. on Unix, iconv(3) will also give you many encodings. Including aliases, glibc 2.2 provides about 1100 encodings. On Windows, some Internet/ActiveX API offers a huge variety of encodings, if the administrator has chosen to install them. If you have Tcl, it provides a number of converters that are not currently included in Python. All these encodings can be made available to Python users by just installing an extension module; whereas with ICU, you'd have to install some huge library. Regards, Martin From JMachin@Colonial.com.au Sun Jun 24 01:09:34 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Sun, 24 Jun 2001 10:09:34 +1000 Subject: [I18n-sig] How does Python Unicode treat surrogates?
Message-ID: <9F2D83017589D211BD1000805FA70CA703B139D6@ntxmel03.cmutual.com.au> Hello there, I'm the 'nobody' who raised the SF bug report to which Martin refers. According to Unicode 3.0, transformations between scalars and UTF-n should provide lossless round-trip transcoding, even for invalid scalars like unpaired surrogates and values like 0xFFFE and 0xFFFF. Unicode 3.1 adds further clarification by listing out what are legal byte sequences for UTF-8; these include byte sequences that encompass those invalid scalars. There is a note in the Unicode docs that ISO/IEC 10646 ("ISO" for short) forbids this permissive treatment of invalid scalars. The implementation in Python 2.1 does this:

encoding to UTF-8:
  0xFFFF etc: Unicode-compliant
  unpaired low surrogate: Unicode-compliant
  unpaired high surrogate: *BUG*, generates invalid UTF-8 byte sequence

decoding from UTF-8:
  0xFFFF etc: Unicode-compliant
  unpaired surrogates: ISO-compliant

In a note that Martin added to my bug report, he seems to be advocating ISO compliance. My two-cents-worth on approach to differences between Unicode and ISO: Unicode is the *practical* standard. Unicode is the *available* standard -- you can buy the book; you can access the web site. Martin said in his note to my bug report that he doesn't have a copy of the ISO document(s); he's not alone! Python advertises Unicode support, not ISO/IEC 10646 support. If we make the transcoding of invalid scalars ISO-compliant, then we should document and justify this. We should do this for *all* invalid scalars, not just unpaired surrogates. Perhaps the effort that would be required to do all the explicit testing to make all the transcoders ISO-compliant would be better directed into providing a function or method that checked a Unicode string for the presence of invalid scalars. A very practical point: Fixing the invalid-byte-sequence bug involves adding two or three lines of code. Making the UTF-8 decoder Unicode-compliant involves removing half a line of code. Minimal effort and no documentation and justifications required. Hmmm, 4 cents worth by the end of the rant :-) Anyway, hope this helps, John -----Original Message----- From: Martin v. Loewis [mailto:martin@loewis.home.cs.tu-berlin.de] Sent: Sunday, 24 June 2001 8:19 To: mal@lemburg.com Cc: guido@digicool.com; i18n-sig@python.org Subject: Re: [I18n-sig] How does Python Unicode treat surrogates? > > Likewise, but somewhat more troubling, surrogates that straddle write > > invocations are not processed properly. > > > > >>> s=StringIO.StringIO() > > >>> _,_,r,w=codecs.lookup("utf-8") > > >>> f=w(s) > > >>> f.write(u"\ud800") > > >>> f.write(u"\udc00") > > >>> f.flush() > > >>> s.getvalue() > > '\xa0\x80\xed\xb0\x80' > > > > whereas the correct answer would have been > > > > >>> u"\ud800\udc00".encode("utf-8") > > '\xf0\x90\x80\x80' > > This is a special case of the above (since the encoder will > see truncated surrogates and should raise an exception > for these). I don't think it should; it is not truncated since a later write call will provide the missing word. If you have a Unicode stream, it should be possible to read the stream contents in arbitrary chunks of words, and encode it with a stream encoder. The stream encoder should produce the same output no matter how you split the input. Under your proposed behaviour, this is not the case.
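To make the split-invariance requirement concrete, here is a minimal sketch of a stream writer that buffers a trailing high surrogate until the next write() supplies its partner. The class is hypothetical -- it is not any codec that shipped with Python -- and only illustrates the behaviour Martin is asking for:

    class PairBufferingUTF8Writer:
        """Illustrative only: hold back a trailing high surrogate so
        that the encoded output does not depend on how the caller
        split the input across write() calls."""
        def __init__(self, stream):
            self.stream = stream
            self.pending = u""  # at most one high surrogate word

        def write(self, data):
            data = self.pending + data
            self.pending = u""
            if data and u"\ud800" <= data[-1] <= u"\udbff":
                # The last word is a high surrogate; its partner may
                # arrive in the next write() call, so keep it back.
                self.pending = data[-1]
                data = data[:-1]
            self.stream.write(data.encode("utf-8"))

        def flush(self):
            if self.pending:
                # Now the pair really is truncated; raising is fair.
                raise UnicodeError("truncated surrogate pair at end of stream")
            self.stream.flush()

With this buffering, write(u"\ud800") followed by write(u"\udc00") yields the same bytes as the single call write(u"\ud800\udc00").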
Please note that http://sourceforge.net/tracker/index.php?func=detail&aid=433882&group_id=5470&atid=105470 adds a few other aspects to the problem: It appears that Unicode 3.1 specifies that certain forms of UTF-8 encoded surrogates are merely irregular, not illegal. There may be some misinterpretation of the spec in this report, but I think all this needs careful checking. Regards, Martin _______________________________________________ I18n-sig mailing list I18n-sig@python.org http://mail.python.org/mailman/listinfo/i18n-sig From fw@deneb.enyo.de Sun Jun 24 10:16:22 2001 From: fw@deneb.enyo.de (Florian Weimer) Date: 24 Jun 2001 11:16:22 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <9F2D83017589D211BD1000805FA70CA703B139D6@ntxmel03.cmutual.com.au> ("Machin, John"'s message of "Sun, 24 Jun 2001 10:09:34 +1000") References: <9F2D83017589D211BD1000805FA70CA703B139D6@ntxmel03.cmutual.com.au> Message-ID: <87u216qluh.fsf@deneb.enyo.de> "Machin, John" writes: > Unicode is the *practical* standard. Unicode is the > *available* standard -- you can buy the book; you can access > the web site. Martin said in his note to my bug report that > he doesn't have a copy of the ISO document(s); he's not alone! ISO 10646 is the ISO standard with the lowest money-per-page ratio ever, I think. You can order a PDF version (shipped on CD-ROM) from the ISO website at http://www.iso.ch/ . Some standards used by Python are much, much more expensive. From mal@lemburg.com Sun Jun 24 12:28:06 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sun, 24 Jun 2001 13:28:06 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> Message-ID: <3B35CEC6.710243E7@lemburg.com> First of all, I'd like to say that we left the handling of surrogates undefined back when we initially discussed the internal format for storing Unicode. The reasoning was simple: there were no assigned char points outside the BMP (roughly the lower 16-bit range). It was decided to use 16 bits per character as the basis for dealing with Unicode in such a way that we get the disjunction of UTF-16 and UCS-2 (Unicode 2.x). This allowed us to postpone the handling of variable length problems to a later point in time. Now with Unicode 3.1, the time has come to rethink these things, since for the first time, there are assigned char points outside the BMP which could eventually be used by programmers. This means that we have to start thinking about how to treat UTF-16 surrogates (two Py_UNICODE elements per Unicode character). The basic questions are:
1. How to treat lone surrogates (the Unicode char U+10000 is represented as the two words 0xd800 0xdc00 in UTF-16)?
2. What to do when slicing of Unicode strings would break a surrogate pair?
3. How to treat input data which has lone surrogate words in strings (at the start, in the middle and at the end)?
4. How to process requests for creating output data from lone surrogate words?

BTW, Python's Unicode implementation is bound to the standard defined at www.unicode.org; moving over to ISO 10646 is not an option. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tree@basistech.com Sun Jun 24 17:15:57 2001 From: tree@basistech.com (Tom Emerson) Date: Sun, 24 Jun 2001 12:15:57 -0400 Subject: [I18n-sig] International Components for Unicode In-Reply-To: <200106232226.f5NMQR720529@mira.informatik.hu-berlin.de> References: <3B33F256.3C133966@ActiveState.com> <15156.64038.410669.795084@cymru.basistech.com> <200106232226.f5NMQR720529@mira.informatik.hu-berlin.de> Message-ID: <15158.4669.583190.272218@cymru.basistech.com> Martin v. Loewis writes: > That's true, but I'd prefer to integrate the encodings that > come with the operating systems first. E.g. on Unix, iconv(3) will > also give you many encodings. Including aliases, glibc 2.2 provides > about 1100 encodings. Of course iconv on Linux has a different set of encodings than iconv on Solaris, which has a different set than on Irix. And of course those encodings that are shared are often implemented differently. > All these encodings can be made available to Python users by just > installing an extension module; whereas with ICU, you'd have to > install some huge library. You've misunderstood. I'm not saying we pull in ICU. I'm saying that we write a set of Python modules that can read and make use of the ICU encoding datafile formats, and use those. In ICU all encoding data is kept as external data. Obviously integrating all of ICU into Python would be a fool's errand. -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From martin@loewis.home.cs.tu-berlin.de Sun Jun 24 18:03:33 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sun, 24 Jun 2001 19:03:33 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <3B35CEC6.710243E7@lemburg.com> (mal@lemburg.com) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> Message-ID: <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> > The basic questions are: > > 1. How to treat lone surrogates (the Unicode char U+10000 is > represented as the two words 0xd800 0xdc00 in UTF-16) ? > > 2. What to do when slicing of Unicode strings would break > a surrogate pair ? > > 3. How to treat input data which has lone surrogate words > in strings (at the start, in the middle and at the end) ? > > 4. How to process requests for creating output data from > lone surrogate words ? I'd like to add another question 0. Should Py_UNICODE be extended to 32 bits?
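A quick illustration of what question 0 changes, using nothing but the standard surrogate arithmetic (a sketch; the variable names are invented and no particular Python version is implied):

    # U+10000 stored as two 16-bit words (a surrogate pair) vs. one 32-bit unit:
    cp = 0x10000
    hi = 0xD800 + ((cp - 0x10000) >> 10)    # -> 0xD800
    lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)  # -> 0xDC00
    # With a 16-bit Py_UNICODE the string holds the pair (hi, lo) and its
    # length is 2; with a 32-bit Py_UNICODE it would hold cp and have length 1.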
> BTW, Python's Unicode implementation is bound to the standard > defined at www.unicode.org; moving over to ISO 10646 is not an > option. Can you elaborate? How can you rule out that option that easily? And why can't Python support the two standards simultaneously? Regards, Martin From mal@lemburg.com Sun Jun 24 19:04:28 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sun, 24 Jun 2001 20:04:28 +0200 Subject: [I18n-sig] International Components for Unicode References: <3B33F256.3C133966@ActiveState.com> <15156.64038.410669.795084@cymru.basistech.com> <200106232226.f5NMQR720529@mira.informatik.hu-berlin.de> <15158.4669.583190.272218@cymru.basistech.com> Message-ID: <3B362BAC.A08E8128@lemburg.com> Tom Emerson wrote: > > I'm not saying we pull in ICU. I'm saying that > we write a set of Python modules that can read and make use of the ICU > encoding datafile formats, and use those. In ICU all encoding data is > kept as external data. Would we need to incorporate some of ICU for this to work or could we use a Python script to convert those tables to ones usable in Python ? -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Sun Jun 24 19:16:59 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Sun, 24 Jun 2001 20:16:59 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> Message-ID: <3B362E9B.4DC8DD81@lemburg.com> "Martin v. Loewis" wrote: > > > The basic questions are: > > > > 1. How to treat lone surrogates (the Unicode char U+10000 is > > represented as the two words 0xd800 0xdc00 in UTF-16) ? > > > > 2. What to do when slicing of Unicode strings would break > > a surrogate pair ? > > > > 3. How to treat input data which has lone surrogate words > > in strings (at the start, in the middle and at the end) ? > > > > 4. How to process requests for creating output data from > > lone surrogate words ? > > I'd like to add another question > > 0. Should Py_UNICODE be extended to 32 bits? This would mean 4 bytes per Unicode character and is unacceptable given the fact that most of these would be 0-bytes in practice. It would also break binary compatibility to the native Unicode wchar_t type on e.g. Windows, which is among the most Unicode-aware platforms there are today. > > BTW, Python's Unicode implementation is bound to the standard > > defined at www.unicode.org; moving over to ISO 10646 is not an > > option. > > Can you elaborate? How can you rule out that option that easily? It is not an option because we chose Unicode as our basis for i18n work and not the ISO 10646 Uniform Character Set. I'd rather have those two camps fight over the details of the Unicode standard than try to fix anything related to the differences between the two in Python by mixing them. > And why can't Python support the two standards simultaneously? Why would you want to support two standards for the same thing ?
-- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tree@basistech.com Sun Jun 24 20:32:03 2001 From: tree@basistech.com (Tom Emerson) Date: Sun, 24 Jun 2001 15:32:03 -0400 Subject: [I18n-sig] International Components for Unicode In-Reply-To: <3B362BAC.A08E8128@lemburg.com> References: <3B33F256.3C133966@ActiveState.com> <15156.64038.410669.795084@cymru.basistech.com> <200106232226.f5NMQR720529@mira.informatik.hu-berlin.de> <15158.4669.583190.272218@cymru.basistech.com> <3B362BAC.A08E8128@lemburg.com> Message-ID: <15158.16435.45725.274341@cymru.basistech.com> M.-A. Lemburg writes: > Would we need to incorporate some of ICU for this to work or could > we use a Python script to convert those tables to ones usable in > Python ? No, we wouldn't need to incorporate anything from ICU except the tables: that's my point. As long as we wrote the code to read the tables directly people could use them without conversion or anything like it. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From martin@loewis.home.cs.tu-berlin.de Sun Jun 24 19:37:02 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Sun, 24 Jun 2001 20:37:02 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <3B362E9B.4DC8DD81@lemburg.com> (mal@lemburg.com) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> Message-ID: <200106241837.f5OIb2r07377@mira.informatik.hu-berlin.de> > It is not an option because we chose Unicode as our basis for > i18n work and not the ISO 10646 Uniform Character Set. Please speak for yourself only. > > And why can't Python support the two standards simultaneously? > > Why would you want to support two standards for the same thing ? Because they are almost identical. Regards, Martin From tim.one@home.com Mon Jun 25 06:37:25 2001 From: tim.one@home.com (Tim Peters) Date: Mon, 25 Jun 2001 01:37:25 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <3B35CEC6.710243E7@lemburg.com> Message-ID: [M.-A. Lemburg] > ... > 2. What to do when slicing of Unicode strings would break > a surrogate pair ? To me a string is a sequence of characters, and s[0] returns the first, s[1] the second, and so on. The internal details of how the implementation chooses to torture itself <0.7 wink> should be invisible. That is, breaking a surrogate via slicing should be impossible: s[i:j] returns j-i characters, and that's that. This implies the internal start address for the character s[i] can't be computed as base + N*i, unless-- what? --some fixed number B of bits >= 20 is used internally for each character. > ... > BTW, Python's Unicode implementation is bound to the standard > defined at www.unicode.org; moving over to ISO 10646 is not an > option. I doubt that either std says anything about how an implementation represents characters internally. And I'm certain neither mentions Py_UNICODE at all <wink>.
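If s[i:j] is really to return j-i characters rather than j-i storage words, someone has to do code point arithmetic over the 16-bit units. A rough sketch of such a helper, assuming the UTF-16-style internal layout discussed in this thread (nothing like this existed in the standard library at the time):

    def codepoint_count(u):
        # Count code points in a sequence of 16-bit units, pairing each
        # high surrogate with an immediately following low surrogate.
        count = 0
        i = 0
        while i < len(u):
            if (u"\ud800" <= u[i] <= u"\udbff" and i + 1 < len(u)
                    and u"\udc00" <= u[i + 1] <= u"\udfff"):
                i = i + 2   # surrogate pair: one character, two words
            else:
                i = i + 1   # BMP character (or a lone surrogate word)
            count = count + 1
        return count

For example, codepoint_count(u"abc") gives 3, while codepoint_count(u"\ud840\udc00") gives 1 even though len() reports 2 on a 16-bit build.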
From mal@lemburg.com Mon Jun 25 12:39:07 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 13:39:07 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: Message-ID: <3B3722DB.1FF54794@lemburg.com> Tim Peters wrote: > > [M.-A. Lemburg] > > ... > > 2. What to do when slicing of Unicode strings would break > > a surrogate pair ? > > To me a string is a sequence of characters, and s[0] returns the first, s[1] > the second, and so on. The internal details of how the implementation > chooses to torture itself <0.7 wink> should be invisible. That is, breaking > a surrogate via slicing should be impossible: s[i:j] returns j-i > characters, and that's that. It's not that simple: lone surrogates are true Unicode char points in their own right; it's just that they are pretty useless without their resp. partners in the data stream. And with this "feature" they are in good company: the Unicode combining characters (e.g. the combining acute) have the same property. Hard to say what's right and wrong here... (note that I posted the questions without an initial comment on what I think on these issues -- I simply don't know for sure just yet ;-) > This implies the internal start address for > the character s[i] can't be computed as base + N*i, unless-- what? --some > fixed number B of bits >= 20 is used internally for each character. > > > ... > > BTW, Python's Unicode implementation is bound to the standard > > defined at www.unicode.org; moving over to ISO 10646 is not an > > option. > > I doubt that either std says anything about how an implementation represents > characters internally. And I'm certain neither mentions Py_UNICODE at all > <wink>. That comment was aimed at Martin's proposal to stick with ISO 10646 for the UTF-8 codec treatment of lone surrogates. It has nothing to do with how we store Unicode internally... (sorry for the confusion). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Mon Jun 25 12:41:10 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 13:41:10 +0200 Subject: [I18n-sig] International Components for Unicode References: <3B33F256.3C133966@ActiveState.com> <15156.64038.410669.795084@cymru.basistech.com> <200106232226.f5NMQR720529@mira.informatik.hu-berlin.de> <15158.4669.583190.272218@cymru.basistech.com> <3B362BAC.A08E8128@lemburg.com> <15158.16435.45725.274341@cymru.basistech.com> Message-ID: <3B372356.A9BED3F9@lemburg.com> Tom Emerson wrote: > > M.-A. Lemburg writes: > > Would we need to incorporate some of ICU for this to work or could > > we use a Python script to convert those tables to ones usable in > > Python ? > > No, we wouldn't need to incorporate anything from ICU except the > tables: that's my point. As long as we wrote the code to read the > tables directly people could use them without conversion or anything > like it. Sounds great ! What's the license on those tables ?
-- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tree@basistech.com Mon Jun 25 12:06:00 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 07:06:00 -0400 Subject: [I18n-sig] International Components for Unicode In-Reply-To: <3B372356.A9BED3F9@lemburg.com> References: <3B33F256.3C133966@ActiveState.com> <15156.64038.410669.795084@cymru.basistech.com> <200106232226.f5NMQR720529@mira.informatik.hu-berlin.de> <15158.4669.583190.272218@cymru.basistech.com> <3B362BAC.A08E8128@lemburg.com> <15158.16435.45725.274341@cymru.basistech.com> <3B372356.A9BED3F9@lemburg.com> Message-ID: <15159.6936.745436.585017@cymru.basistech.com> M.-A. Lemburg writes: > Sounds great ! > > What's the license on those tables ? The latest ICU was released under the MIT/X license. I assume the tables are licensed similarly. -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From mal@lemburg.com Mon Jun 25 12:46:11 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 13:46:11 +0200 Subject: [I18n-sig] International Components for Unicode References: <3B33F256.3C133966@ActiveState.com> <15156.64038.410669.795084@cymru.basistech.com> <200106232226.f5NMQR720529@mira.informatik.hu-berlin.de> <15158.4669.583190.272218@cymru.basistech.com> <3B362BAC.A08E8128@lemburg.com> <15158.16435.45725.274341@cymru.basistech.com> <3B372356.A9BED3F9@lemburg.com> <15159.6936.745436.585017@cymru.basistech.com> Message-ID: <3B372483.A6E71057@lemburg.com> Tom Emerson wrote: > > M.-A. Lemburg writes: > > Sounds great ! > > > > What's the license on those tables ? > > The latest ICU was released under the MIT/X license. I assume the > tables are licensed similarly. Sounds even better :-) I think we should look into getting support for them into an extension similar to the one Tamito is working on and then place them into the python/dist/encodings directory. I just wish I had time to look into this... :-( -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Mon Jun 25 13:01:33 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 14:01:33 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106241837.f5OIb2r07377@mira.informatik.hu-berlin.de> Message-ID: <3B37281D.BE13E297@lemburg.com> "Martin v. Loewis" wrote: > > > It is not an option because we chose Unicode as our basis for > > i18n work and not the ISO 10646 Uniform Character Set. > > Please speak for yourself only. With "we" I referred to the python-dev/i18n-sig team. Since these things are all based on consensus, not necessarily all members of those teams will have or have had the same opinion.
Speaking only for myself: I would very much appreciate it if you would stop throwing these meta-comments into discussions we have on this list. > > > And why can't Python support the two standards simultaneously? > > > > Why would you want to support two standards for the same thing ? > > Because they are almost identical. True, but it's those small differences that make life harder. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From gs234@cam.ac.uk Mon Jun 25 13:03:31 2001 From: gs234@cam.ac.uk (Gaute B Strokkenes) Date: 25 Jun 2001 13:03:31 +0100 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? In-Reply-To: <3B3722DB.1FF54794@lemburg.com> ("M.-A. Lemburg"'s message of "Mon, 25 Jun 2001 13:39:07 +0200") References: <3B3722DB.1FF54794@lemburg.com> Message-ID: <4ak820g418.fsf@kern.srcf.societies.cam.ac.uk> [I'm cc:-ing the unicode list to make sure that I've gotten my terminology right, and to solicit comments] On Mon, 25 Jun 2001, mal@lemburg.com wrote: > Tim Peters wrote: >> >> [M.-A. Lemburg] >> > ... >> > 2. What to do when slicing of Unicode strings would break >> > a surrogate pair ? >> >> To me a string is a sequence of characters, and s[0] returns the >> first, s[1] the second, and so on. The internal details of how the >> implementation chooses to torture itself <0.7 wink> should be >> invisible. That is, breaking a surrogate via slicing should be >> impossible: s[i:j] returns j-i characters, and that's that. > > It's not that simple: lone surrogates are true Unicode char points > in their own right; it's just that they are pretty useless without > their resp. partners in the data stream. And with this "feature" > they are in good company: the Unicode combining characters (e.g. the > combining acute) have the same property. This is completely and totally wrong. The Unicode standard version 3.1 states (conformance requirement C12(c)): A conformant process shall not interpret illegal UTF code unit sequences as characters. The precise definition of "illegal" in this context is given elsewhere. See : 0xD800 is incomplete in Unicode. Unless followed by another 16-bit value of the right form, it is illegal. (Unicode here should read UTF-16, of course. The reason it does not is that the language of the technical report has not been updated to that of 3.1) -- Big Gaute http://www.srcf.ucam.org/~gs234/ Hello? Enema Bondage? I'm calling because I want to be happy, I guess.. From JMachin@Colonial.com.au Mon Jun 25 13:33:50 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Mon, 25 Jun 2001 22:33:50 +1000 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? Message-ID: <9F2D83017589D211BD1000805FA70CA703B139D8@ntxmel03.cmutual.com.au> MAL and Gaute, Can I please take the middle ground (and risk having both of you throw things at me?) => Lone surrogates are not 'true Unicode char points in their own right' [MAL] -- they don't represent characters. On the other hand, UTF code sequences that would decode into lone surrogates are not "illegal". Please read clause D29 in section 3.8 of the Unicode 3.0 standard. This is further clarified by Unicode 3.1 which expressly lists legal UTF-8 sequences; these encompass lone surrogates. -----Original Message----- From: Gaute B Strokkenes [mailto:gs234@cam.ac.uk] Sent: Monday, 25 June 2001 22:04 To: M.-A.
Lemburg Cc: Tim Peters; i18n-sig@python.org; unicode@unicode.org Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? [I'm cc:-ing the unicode list to make sure that I've gotten my terminology right, and to solicit comments] On Mon, 25 Jun 2001, mal@lemburg.com wrote: > Tim Peters wrote: >> >> [M.-A. Lemburg] >> > ... >> > 2. What to do when slicing of Unicode strings would break >> > a surrogate pair ? >> >> To me a string is a sequence of characters, and s[0] returns the >> first, s[1] the second, and so on. The internal details of how the >> implementation chooses to torture itself <0.7 wink> should be >> invisible. That is, breaking a surrogate via slicing should be >> impossible: s[i:j] returns j-i characters, and that's that. > > It's not that simple: lone surrogates are true Unicode char points > in their own right; it's just that they are pretty useless without > their resp. partners in the data stream. And with this "feature" > they are in good company: the Unicode combining characters (e.g. the > combining acute) have the same property. This is completely and totally wrong. The Unicode standard version 3.1 states (conformance requirement C12(c)): A conformant process shall not interpret illegal UTF code unit sequences as characters. The precise definition of "illegal" in this context is given elsewhere. See : 0xD800 is incomplete in Unicode. Unless followed by another 16-bit value of the right form, it is illegal. (Unicode here should read UTF-16, of course. The reason it does not is that the language of the technical report has not been updated to that of 3.1) -- Big Gaute http://www.srcf.ucam.org/~gs234/ Hello? Enema Bondage? I'm calling because I want to be happy, I guess.. _______________________________________________ I18n-sig mailing list I18n-sig@python.org http://mail.python.org/mailman/listinfo/i18n-sig From mal@lemburg.com Mon Jun 25 13:56:23 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 14:56:23 +0200 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? References: <9F2D83017589D211BD1000805FA70CA703B139D8@ntxmel03.cmutual.com.au> Message-ID: <3B3734F7.AEDDAAAA@lemburg.com> "Machin, John" wrote: > > MAL and Gaute, > > Can I please take the middle ground (and risk having both of you throw > things at me?) Sure :-) > => Lone surrogates are not 'true Unicode char points > in their own right' [MAL] -- they don't represent characters. I should have added "please correct me if I'm wrong", sorry. Let me put this into an example: Say you have a Unicode string which contains the following data: U+0061 U+0062 U+0063 U+DC00 U+0064 U+0065 U+0066 ("a" "b" "c" ? "d" "e" "f") Would you consider this sequence a Unicode string or not ? Please note that I am not talking about some UTF-n encoding here. The above snippet is simply to be seen as a sequence of data entries which are referenced by the Unicode database.
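Such a sequence is easy to construct, and nothing at construction time stops you; the interpreter transcript below is illustrative only (what the codecs should then do with such a string is exactly the open question):

    >>> s = u"abc\udc00def"   # the example above, with a lone low surrogate
    >>> len(s)
    7
    >>> s[3]
    u'\udc00'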
> On the other hand, UTF code sequences that would decode into lone surrogates > are not "illegal". > Please read clause D29 in section 3.8 of the Unicode 3.0 standard. This is > further clarified by Unicode 3.1 > which expressly lists legal UTF-8 sequences; these encompass lone > surrogates. > > -----Original Message----- > From: Gaute B Strokkenes [mailto:gs234@cam.ac.uk] > Sent: Monday, 25 June 2001 22:04 > To: M.-A. Lemburg > Cc: Tim Peters; i18n-sig@python.org; unicode@unicode.org > Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? > > [I'm cc:-ing the unicode list to make sure that I've gotten my > terminology right, and to solicit comments] > > On Mon, 25 Jun 2001, mal@lemburg.com wrote: > > Tim Peters wrote: > >> > >> [M.-A. Lemburg] > >> > ... > >> > 2. What to do when slicing of Unicode strings would break > >> > a surrogate pair ? > >> > >> To me a string is a sequence of characters, and s[0] returns the > >> first, s[1] the second, and so on. The internal details of how the > >> implementation chooses to torture itself <0.7 wink> should be > >> invisible. That is, breaking a surrogate via slicing should be > >> impossible: s[i:j] returns j-i characters, and that's that. > > > > It's not that simple: lone surrogates are true Unicode char points > > in their own right; it's just that they are pretty useless without > > their resp. partners in the data stream. And with this "feature" > > they are in good company: the Unicode combining characters (e.g. the > > combining acute) have the same property. > > This is completely and totally wrong. The Unicode standard version > 3.1 states (conformance requirement C12(c)): A conformant process shall > not interpret illegal UTF code unit sequences as characters. > > The precise definition of "illegal" in this context is given > elsewhere. See : > > 0xD800 is incomplete in Unicode. Unless followed by another 16-bit > value of the right form, it is illegal. > > (Unicode here should read UTF-16, of course. The reason it does not > is that the language of the technical report has not been updated to > that of 3.1) > > -- > Big Gaute http://www.srcf.ucam.org/~gs234/ > Hello? Enema Bondage? I'm calling because I want to be happy, I guess.. > > _______________________________________________ > I18n-sig mailing list > I18n-sig@python.org > http://mail.python.org/mailman/listinfo/i18n-sig -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Mon Jun 25 14:21:36 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 15:21:36 +0200 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates?
References: <3B3722DB.1FF54794@lemburg.com> <4ak820g418.fsf@kern.srcf.societies.cam.ac.uk> Message-ID: <3B373AE0.21E25716@lemburg.com> Gaute B Strokkenes wrote: > > [I'm cc:-ing the unicode list to make sure that I've gotten my > terminology right, and to solicit comments] > > On Mon, 25 Jun 2001, mal@lemburg.com wrote: > > Tim Peters wrote: > >> > >> [M.-A. Lemburg] > >> > ... > >> > 2. What to do when slicing of Unicode strings would break > >> > a surrogate pair ? > >> > >> To me a string is a sequence of characters, and s[0] returns the > >> first, s[1] the second, and so on. The internal details of how the > >> implementation chooses to torture itself <0.7 wink> should be > >> invisible. That is, breaking a surrogate via slicing should be > >> impossible: s[i:j] returns j-i characters, and that's that. > > > > It's not that simple: lone surrogates are true Unicode char points > > in their own right; it's just that they are pretty useless without > > their resp. partners in the data stream. And with this "feature" > > they are in good company: the Unicode combining characters (e.g. the > > combining acute) have the same property. > > This is completely and totally wrong. The Unicode standard version > 3.1 states (conformance requirement C12(c)): A conformant process shall > not interpret illegal UTF code unit sequences as characters. This would solve the UTF codec issue, but I was talking about Unicode itself. In Python, you can write u"abc\uD800\uDC00"[0:4] giving u"abc\uD800" without getting an exception and I am not sure whether this is correct or not. The internal machinery is a totally different issue: we currently use UTF-16 for this but have deliberately left out the surrogate support for the first implementation phase. > The precise definition of "illegal" in this context is given > elsewhere. See : > > 0xD800 is incomplete in Unicode. Unless followed by another 16-bit > value of the right form, it is illegal. > > (Unicode here should read UTF-16, of course. The reason it does not > is that the language of the technical report has not been updated to > that of 3.1) If you had left it at "Unicode" I would have felt better ;-) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From guido@digicool.com Mon Jun 25 14:42:01 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 09:42:01 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Sun, 24 Jun 2001 20:16:59 +0200." <3B362E9B.4DC8DD81@lemburg.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> Message-ID: <200106251342.f5PDg1q07291@odiug.digicool.com> > This would mean 4 bytes per Unicode character and is > unacceptable given the fact that most of these would be 0-bytes Agreed, but see below. > in practice. It would also break binary compatibility to the > native Unicode wchar_t type on e.g. Windows, which is among > the most Unicode-aware platforms there are today.
Shouldn't there be a conversion routine between wchar_t[] and Py_UNICODE[] instead of assuming they have the same format? This will come up more often, and Linux has sizeof(wchar_t) == 4 I believe. (Which suggests that others disagree on the waste of space.) > > > BTW, Python's Unicode implementation is bound to the standard > > > defined at www.unicode.org; moving over to ISO 10646 is not an > > > option. > > > > Can you elaborate? How can you rule out that option that easily? > > It is not an option because we chose Unicode as our basis for > i18n work and not the ISO 10646 Uniform Character Set. I'd rather > have those two camps fight over the details of the Unicode standard > than try to fix anything related to the differences between the two > in Python by mixing them. Agreed. But be prepared that at some point in the future the Unicode world might end up agreeing on 4 bytes too... > > And why can't Python support the two standards simultaneously? > > Why would you want to support two standards for the same thing ? Well, we support ASCII and Unicode. :-) If ISO 10646 becomes important to our users, we'll have to support it, if only by providing a codec. --Guido van Rossum (home page: http://www.python.org/~guido/) From tree@basistech.com Mon Jun 25 14:10:15 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 09:10:15 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106251342.f5PDg1q07291@odiug.digicool.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> Message-ID: <15159.14391.718891.645489@cymru.basistech.com> Guido van Rossum writes: [snip] > Agreed. But be prepared that at some point in the future the Unicode > world might end up agreeing on 4 bytes too... With the release of the Plane 2 ideographic extensions in Unicode 3.1 there are two options available: include surrogate support via UTF-16, which means dealing with multibyte (really multi"word") characters, or switching to UTF-32, allowing characters outside Plane 0 to be accessed uniformly. Note that this is a real issue: the Hong Kong Supplementary Character Set includes characters contained in Plane 2 when mapped to Unicode 3.1. > If ISO 10646 becomes important to our users, we'll have to support > it, if only by providing a codec. This is beyond ISO 10646 --- Unicode 3.1 support brings the issue to the fore. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From JMachin@Colonial.com.au Mon Jun 25 14:51:29 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Mon, 25 Jun 2001 23:51:29 +1000 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? Message-ID: <9F2D83017589D211BD1000805FA70CA703B139D9@ntxmel03.cmutual.com.au> Marc-Andre, > I should have added "please correct me if I'm wrong", sorry. I'm sorry too; I didn't intend to be rude; it's just that I normally operate under a protocol where that licence ("please correct me if I'm wrong") is the default and doesn't need to be stated explicitly in each paragraph.
> Say you have a Unicode string which contains the following data: > > U+0061 U+0062 U+0063 U+DC00 U+0064 U+0065 U+0066 > ("a" "b" "c" ? "d" "e" "f") > > Would you consider this sequence a Unicode string or not ? I think you are using "Unicode string" with two different meanings here. However, the pragmatic question is what should Python do when given such a sequence. Do we permit such a sequence to be held internally as a "Unicode string"? Is u"\udc00" legal in source code or should Python throw a syntax error? Same question for u"\uffff". We *do* need to consider UTF encodings, because Unicode *expressly* allows decoding UTF sequences that become unpaired surrogates, or other "not 100% valid" scalars such as 0xffff and 0xfffe. So, given that Python supports Unicode, not ISO 10646, we must IMO permit such sequences in our internal representation. It follows that we should stop worrying about these irregular values -- it's less programming that way. Unicode 3.1 will create enough extra programming as it is, because we now have variable-length characters again -- just what Unicode was going to save us from :-( Cheers, John -----Original Message----- From: M.-A. Lemburg [mailto:mal@lemburg.com] Sent: Monday, 25 June 2001 22:56 To: Machin, John Cc: 'Gaute B Strokkenes'; Tim Peters; i18n-sig@python.org; unicode@unicode.org Subject: Re: [I18n-sig] Re: How does Python Unicode treat surrogates? "Machin, John" wrote: > > MAL and Gaute, > > Can I please take the middle ground (and risk having both of you throw > things at me? Sure :-) > => Lone surrogates are not 'true Unicode char points > in their own right' [MAL] -- they don't represent characters. I should have added "please correct me if I'm wrong", sorry. Let me put this into an example: Say you have a Unicode string which contains the following data: U+0061 U+0062 U+0063 U+DC00 U+0064 U+0065 U+0066 ("a" "b" "c" ? "d" "e" "f") Would you consider this sequence a Unicode string or not ? Please note that I am not talking about some UTF-n encoding here. The above snippet is simply to be seen as sequence of data entries which are referenced by the Unicode database. > On the other hand, UTF code sequences that would decode into lone surrogates > are not "illegal". > Please read clause D29 in section 3.8 of the Unicode 3.0 standard. This is > further clarified by Unicode 3.1 > which expressly lists legal UTF-8 sequences; these encompass lone > surrogates. > > -----Original Message----- > From: Gaute B Strokkenes [mailto:gs234@cam.ac.uk] > Sent: Monday, 25 June 2001 22:04 > To: M.-A. Lemburg > Cc: Tim Peters; i18n-sig@python.org; unicode@unicode.org > Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? > > [I'm cc:-ing the unicode list to make sure that I've gotten my > terminology right, and to solicit comments > > On Mon, 25 Jun 2001, mal@lemburg.com wrote: > > Tim Peters wrote: > >> > >> [M.-A. Lemburg] > >> > ... > >> > 2. What to do when slicing of Unicode strings would break > >> > a surrogate pair ? > >> > >> To me a string is a sequence of characters, and s[0] returns the > >> first, s[1] the second, and so on. The internal details of how the > >> implementation chooses to torture itself <0.7 wink> should be > >> invisible. That is, breaking a surrogate via slicing should be > >> impossible: s[i:j] returns j-i characters, and that's that. > > > > It's not that simple: lone surrogates are true Unicode char points > > in their own right; it's just that they are pretty useless without > > their resp. partners in the data stream. 
And with this "feature" > > they are in good company: the Unicode combining characters (e.g. the > > combining acute) have the same property. > > This is completely and totally wrong. The Unicode standard version > 3.1 states (conformance requirement C12(c)): A conformant process shall > not interpret illegal UTF code unit sequences as characters. > > The precise definition of "illegal" in this context is given > elsewhere. See : > > 0xD800 is incomplete in Unicode. Unless followed by another 16-bit > value of the right form, it is illegal. > > (Unicode here should read UTF-16, of course. The reason it does not > is that the language of the technical report has not been updated to > that of 3.1) > > -- > Big Gaute http://www.srcf.ucam.org/~gs234/ > Hello? Enema Bondage? I'm calling because I want to be happy, I guess.. > > _______________________________________________ > I18n-sig mailing list > I18n-sig@python.org > http://mail.python.org/mailman/listinfo/i18n-sig -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From guido@digicool.com Mon Jun 25 15:22:40 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 10:22:40 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 09:10:15 EDT." <15159.14391.718891.645489@cymru.basistech.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> Message-ID: <200106251422.f5PEMel07612@odiug.digicool.com> > Guido van Rossum writes: > [snip] > > Agreed. But be prepared that at some point in the future the Unicode > > world might end up agreeing on 4 bytes too... > > With the release of the Plane 2 ideographic extensions in Unicode 3.1 > there are two options available: include surrogate support via UTF-16, > which means dealing with multibyte (really multi"word") characters, or > switching to UTF-32, allowing characters outside Plane 0 to be > accessed uniformly. > > Note that this is a real issue: the Hong Kong Supplementary Character > Set includes characters contained in Plane 2 when mapped to Unicode > 3.1. > > > If ISO 10646 becomes important to our users, we'll have to support > > it, if only by providing a codec. > > This is beyond ISO 10646 --- Unicode 3.1 support brings the issue to > the fore.
> -tree I don't think switching to a 32-bit character is the right thing to do for us (although I think it should be easier than it currently is -- changing to define Py_UNICODE as a 32-bit unsigned int should be all that it takes, which is currently not the case). I'm all for taking the lazy approach and letting applications that need surrogate support do it themselves, at the application level. --Guido van Rossum (home page: http://www.python.org/~guido/) From mark@macchiato.com Mon Jun 25 15:24:28 2001 From: mark@macchiato.com (Mark Davis) Date: Mon, 25 Jun 2001 07:24:28 -0700 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? References: <3B3722DB.1FF54794@lemburg.com> <4ak820g418.fsf@kern.srcf.societies.cam.ac.uk> Message-ID: <006501c0fd82$8b5ba9f0$0c680b41@c1340594a> You cannot interpret isolated UTF-16 surrogate code units as characters. For example, you can't interpret the sequence of D800 followed by 0061 as if it were some private use character (say, Klingon) followed by an 'a'. (For those unfamiliar with the terminology, see http://www.unicode.org/glossary, and my paper at http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/.) However, you can certainly deal with surrogate code units in storage, and it is permissible on that level to handle them. For example, most UTF-16 string interfaces use code unit indices, so that a string from position 3 of length 5 will include precisely 5 code units, not however many code points (or graphemes!) they take up. Similarly for UTF-8 strings, the low-level units are bytes. In most people's experience, it is best to leave the low level interfaces with indices in terms of code units, then supply some utility routines that tell you information about code points. The most useful are:

- given a string and an index into that string, how many code points are before it?
- given a string and a number of code points, what is the lowest index that contains them?
- given a string and an index into that string, is the index on a code point boundary?

An example for Java is at http://oss.software.ibm.com/icu4j/doc/com/ibm/text/UTF16.html. Mark ----- Original Message ----- From: "Gaute B Strokkenes" To: "M.-A. Lemburg" Cc: "Tim Peters" ; ; Sent: Monday, June 25, 2001 05:03 Subject: Re: How does Python Unicode treat surrogates? > > [I'm cc:-ing the unicode list to make sure that I've gotten my > terminology right, and to solicit comments] > > On Mon, 25 Jun 2001, mal@lemburg.com wrote: > > Tim Peters wrote: > >> > >> [M.-A. Lemburg] > >> > ... > >> > 2. What to do when slicing of Unicode strings would break > >> > a surrogate pair ? > >> > >> To me a string is a sequence of characters, and s[0] returns the > >> first, s[1] the second, and so on. The internal details of how the > >> implementation chooses to torture itself <0.7 wink> should be > >> invisible. That is, breaking a surrogate via slicing should be > >> impossible: s[i:j] returns j-i characters, and that's that. > > > > It's not that simple: lone surrogates are true Unicode char points > > in their own right; it's just that they are pretty useless without > > their resp. partners in the data stream. And with this "feature" > > they are in good company: the Unicode combining characters (e.g. the > > combining acute) have the same property. > > This is completely and totally wrong. The Unicode standard version > 3.1 states (conformance requirement C12(c)): A conformant process shall > not interpret illegal UTF code unit sequences as characters.
> > The precise definition of "illegal" in this context is given > elsewhere. See : > > 0xD800 is incomplete in Unicode. Unless followed by another 16-bit > value of the right form, it is illegal. > > (Unicode here should read UTF-16, off course. The reason it does not > is that the language of the technical report has not been updated to > that of 3.1) > > -- > Big Gaute http://www.srcf.ucam.org/~gs234/ > Hello? Enema Bondage? I'm calling because I want to be happy, I guess.. > > From tree@basistech.com Mon Jun 25 14:55:07 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 09:55:07 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106251422.f5PEMel07612@odiug.digicool.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> Message-ID: <15159.17083.978971.519453@cymru.basistech.com> Guido van Rossum writes: [...] > I'm all for taking the lazy approach and letting applications that > need surrogate support do it themselves, at the application level. Meaning what? Leaving it up to the application to be entirely responsible for handling surrogates is a mistake. As was stated earlier in the thread (apologies, I don't have the message around to make the appropriate attribution) surrogates are an implementation detail: to the user/application developer the presence of the surrogate pair needs to be transparent. As long as the Unicode support functionality groks surrogates correctly (fully implements UTF-16) then the issue becomes a small one for the end user. The scanner would need to be modified to support Unicode escapes for values up to 0x10FFFF. Internally these are represented as surrogates. Put the burden of these multibyte representations on the library implementor, not the end-user. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From guido@digicool.com Mon Jun 25 15:43:02 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 10:43:02 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 09:55:07 EDT." <15159.17083.978971.519453@cymru.basistech.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> Message-ID: <200106251443.f5PEh2p07753@odiug.digicool.com> > Guido van Rossum writes: > [...] 
> > I'm all for taking the lazy approach and letting applications that > > need surrogate support do it themselves, at the application level. > > Meaning what? Leaving it up to the application to be entirely > responsible for handling surrogates is a mistake. As was stated > earlier in the thread (apologies, I don't have the message around to > make the appropriate attribution) surrogates are an implementation > detail: to the user/application developer the presence of the > surrogate pair needs to be transparent. > > As long as the Unicode support functionality groks surrogates > correctly (fully implements UTF-16) then the issue becomes a small one > for the end user. The scanner would need to be modified to support > Unicode escapes for values up to 0x10FFFF. Internally these are > represented as surrogates. > > Put the burden of these multibyte representations on the library > implementor, not the end-user. > > -tree Depends on what you call transparent. I'm all for smart codecs between UTF-16 and UTF-8, but if you have a surrogate in a Unicode string, the application will have to know not to split it in the middle, and it must realize that len(u) is not necessarily the number of characters -- it's the number of 16-bit units in the UTF-16 encoding. Does that make sense? I know I am hindered by a lack of understanding of Unicode hairsplitting, angels-on-a-pin-dancing details; if I'm missing something, it's likely that many other people don't know the details either, so an explanation would be much appreciated! --Guido van Rossum (home page: http://www.python.org/~guido/) From tree@basistech.com Mon Jun 25 15:36:10 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 10:36:10 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106251443.f5PEh2p07753@odiug.digicool.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> Message-ID: <15159.19546.226155.383490@cymru.basistech.com> Guido van Rossum writes: > Depends on what you call transparent. I'm all for smart codecs > between UTF-16 and UTF-8, but if you have a surrogate in a Unicode > string, the application will have to know not to split it in the > middle, and it must realize that len(u) is not necessarily the number > of characters -- it's the number of 16-bit units in the UTF-16 > encoding. Surrogates were created as a way to allow characters outside Plane 0 (the BMP) to be accessed within a sixteen-bit codespace. When using UTF-16 a character consists of either two octets or four octets. A character that cannot be represented within the 16-bit code space is encoded using a surrogate pair, but it is the same character regardless. So, for example, the ideograph at U+20000 is the same character whether it is encoded as <20000> (UCS-4, UTF-32), <D840 DC00> (UTF-16), or <F0 A0 80 80> (UTF-8). It doesn't matter what transformation format you use: it's the *same* character.
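A small sketch of this point, assuming an interpreter whose \U escapes and codecs handle characters above the BMP (the thread notes the scanner did not yet); the byte values follow from the UTF definitions:

    # One abstract character, U+20000, in three transformation formats.
    c = u"\U00020000"
    print repr(c.encode("utf-8"))      # -> '\xf0\xa0\x80\x80' (F0 A0 80 80)
    print repr(c.encode("utf-16-be"))  # -> '\xd8@\xdc\x00'    (D840 DC00)
    # A UTF-32/UCS-4 representation would hold the single code unit 00020000.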
Hence, when I have a Unicode string, I'm thinking of each character as a Unicode character, not as a sequence of UTF-16 or UCS-2 two-octet words. Hence my belief that Unicode strings should not be synonymous with whatever underlying physical character representation is used. Clear as mud? :-) -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From guido@digicool.com Mon Jun 25 16:44:32 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 11:44:32 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 10:36:10 EDT." <15159.19546.226155.383490@cymru.basistech.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> Message-ID: <200106251544.f5PFiWe07979@odiug.digicool.com> > Guido van Rossum writes: > > Depends on what you call transparent. I'm all for smart codecs > > between UTF-16 and UTF-8, but if you have a surrogate in a Unicode > > string, the application will have to know not to split it in the > > middle, and it must realize that len(u) is not necessarily the number > > of characters -- it's the number of 16-bit units in the UTF-16 > > encoding. > > Surrogates were created as a way to allow characters outside Plane 0 > (the BMP) to be accessed within a sixteen-bit codespace. When using > UTF-16 a character consists of either two octets or four octets. A > character that cannot be represented within the 16-bit code space is > encoded using a surrogate pair, but it is the same character > regardless. > > So, for example, the ideograph at U+20000 is the same character > whether it is encoded as <20000> (UCS-4, UTF-32), > <D840 DC00> (UTF-16), or <F0 A0 80 80> (UTF-8). It doesn't matter what > transformation format you use: it's the *same* character. > > Hence, when I have a Unicode string, I'm thinking of each character as a > Unicode character, not as a sequence of UTF-16 or UCS-2 two-octet > words. > > Hence my belief that Unicode strings should not be synonymous with > whatever underlying physical character representation is used. > > Clear as mud? :-) > > -tree Very clear. But, just as a Python 8-bit string object containing the UTF-8 encoded character U+20000 contains 4 bytes, with s[0] being '\xF0' etc., a Python "unicode" string containing that character as a surrogate will have length 2, with u[0] being u'\uD840' and u[1] being u'\uDC00'. You can think of it as containing a single character, but the interface gives you the individual items of the UTF-16 encoding. You can believe what *should* happen all you want, but we're not going to change this soon. u[i] has to be independent of the length of u and the value of i. It may change *eventually* -- when we switch to UCS-4 for the internal representation. Until then, the API will deal in 16-bit values that may or may not be "characters".
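Concretely, on the 16-bit build being described, a sketch (again assuming the \U escape is available):

    u = u"\U00020000"    # one character, stored as a surrogate pair
    print len(u)         # -> 2: two 16-bit code units, not one character
    print repr(u[0])     # -> u'\ud840' (high surrogate)
    print repr(u[1])     # -> u'\udc00' (low surrogate)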
I'd say that ideally the choice to have a 2 or 4 byte internal representation (or no Unicode support at all, for some platforms like PalmOS!) should be a configuration choice. Right now the implementation doesn't allow that choice at all, which should be remedied -- maybe you can help by submitting patches? --Guido van Rossum (home page: http://www.python.org/~guido/) From mal@lemburg.com Mon Jun 25 16:58:49 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 17:58:49 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> Message-ID: <3B375FB9.91BA4B1E@lemburg.com> Guido van Rossum wrote: > > > This would mean 4 bytes per Unicode character and is > > unacceptable given the fact that most of these would be 0-bytes > > Agreed, but see below. > > > in practice. It would also break binary compatibility to the > > native Unicode wchar_t type on e.g. Windows which we are among > > the most Unicode-aware platforms there are today. > > Shouldn't there be a conversion routine between wchar_t[] and > Py_UNICODE[] instead of assuming they have the same format? This will > come up more often, and Linux has sizeif(wchar_t) == 4 I believe. > (Which suggests that others disagree on the waste of space.) There are conversion routines which map between Py_UNICODE and wchar_t in Python and these make use of the fact that e.g. on Windows Py_UNICODE can use wchar_t as basis which makes the conversion very fast. On Linux (which uses 4 bytes per wchar_t) the routine inserts tons of zeros to make Tux happy :-) > > > > BTW, Python's Unicode implementation is bound to the standard > > > > defined at www.unicode.org; moving over to ISO 10646 is not an > > > > option. > > > > > > Can you elaborate? How can you rule out that option that easily? > > > > It is not an option because we chose Unicode as our basis for > > i18n work and not the ISO 10646 Uniform Character Set. I'd rather > > have those two camps fight over the details of the Unicode standard > > than try to fix anything related to the differences between the two > > in Python by mixing them. > > Agreed. But be prepared that at some point in the future the Unicode > world might end up agreeing on 4 bytes too... No problem... we can change to 4 byte values too if the world agrees on 4 bytes per character. However, 2 bytes or 4 bytes is an implementation detail and not part of the Unicode standard itself. 4 bytes per character makes things at the C level much easier and this is probably why the GNU C lib team chose 4 bytes. Other programming languages like Java and platforms like Windows chose 2-byte UTF-16 as internal format. I guess it's up to the user acceptance to choose between the two. 2 bytes means more work on the implementor, 4 bytes means more $$$ for Micron et al. ;-) > > > And why can't Python support the two standards simultaneously? > > > > Why would you want to support two standards for the same thing ? > > Well, we support ASCII and Unicode. :-) > > If ISO 10646 becomes important to our users, we'll have to support > it, if only by providing a codec. 
This is different: ISO 10646 is a competing standard, not just a different encoding. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tree@basistech.com Mon Jun 25 16:25:38 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 11:25:38 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106251544.f5PFiWe07979@odiug.digicool.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> Message-ID: <15159.22514.976923.894201@cymru.basistech.com> Guido van Rossum writes: > But, just as a Python 8-bit string object containing the UTF-8 encoded > character U+20000 contains 4 bytes, with s[0] being '\xF0' etc., a > Python "unicode" string containing that character as a surrogate will > have length 2, with u[0] being u'\uD840' and u[1] being u'\uDC00'. > You can think of it as containing a single character, but the > interface gives you the individual items of the UTF-16 encoding. So what has been implemented is UCS-2, not UTF-16, and certainly not Unicode. Better to document u"" string literals as UCS-2, and not Unicode. > It may change *eventually* -- when we switch to UCS-4 for the internal > representation. Until then, the API will deal in 16-bit values that > may or may not be "characters". You don't need to switch to UCS-4 internally to implement what I'm suggesting. > I'd say that ideally the choice to have a 2 or 4 byte internal > representation (or no Unicode support at all, for some platforms like > PalmOS!) should be a configuration choice. I don't think it should be a configuration choice. That leads to incompatibilities between people's scripts. It's bad enough already with some things working with threaded versions of python and some not (e.g., Zope requires threading, but mod_python doesn't work if its turned on). BTW, Palm recently joined the Unicode Consortium, and Symbian has Unicode support. >Right now the implementation doesn't allow that choice at all, which >should be remedied -- maybe you can help by submitting patches? Touch=E9. -- = Tom Emerson Basis Technology Cor= p. Sr. Sinostringologist http://www.basistech.c= om "Beware the lollipop of mediocrity: lick it once and you suck forever" From guido@digicool.com Mon Jun 25 17:20:23 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 12:20:23 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 17:58:49 +0200." 
<3B375FB9.91BA4B1E@lemburg.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> Message-ID: <200106251620.f5PGKNP08234@odiug.digicool.com> > > Shouldn't there be a conversion routine between wchar_t[] and > > Py_UNICODE[] instead of assuming they have the same format? This will > > come up more often, and Linux has sizeif(wchar_t) == 4 I believe. > > (Which suggests that others disagree on the waste of space.) > > There are conversion routines which map between Py_UNICODE > and wchar_t in Python and these make use of the fact that > e.g. on Windows Py_UNICODE can use wchar_t as basis which makes > the conversion very fast. > > On Linux (which uses 4 bytes per wchar_t) the routine inserts > tons of zeros to make Tux happy :-) Maybe this code should be restructured so that it lengthens the characters or not depending on the size difference between Py_UNICODE and wchar_t, rather than making platform assumptions. If this is the only thing that keeps us from having a configuration OPTION to make Py_UNICODE 32-bit wide, I'd say let's fix it. > > > > > BTW, Python's Unicode implementation is bound to the standard > > > > > defined at www.unicode.org; moving over to ISO 10646 is not an > > > > > option. > > > > > > > > Can you elaborate? How can you rule out that option that easily? > > > > > > It is not an option because we chose Unicode as our basis for > > > i18n work and not the ISO 10646 Uniform Character Set. I'd rather > > > have those two camps fight over the details of the Unicode standard > > > than try to fix anything related to the differences between the two > > > in Python by mixing them. > > > > Agreed. But be prepared that at some point in the future the Unicode > > world might end up agreeing on 4 bytes too... > > No problem... we can change to 4 byte values too if the world > agrees on 4 bytes per character. However, 2 bytes or 4 bytes > is an implementation detail and not part of the Unicode standard > itself. But UTF-16 vs. UCS-4 is not an implementation detail! If we store 4 bytes per character, we should treat surrogates differently. I don't know where those would be converted -- probably in the UTF-16 to UCS-4 codec. I'd be happy to make the configuration choice between UTF-16 and UCS-4, if that's doable. > 4 bytes per character makes things at the C level much easier > and this is probably why the GNU C lib team chose 4 bytes. Other > programming languages like Java and platforms like Windows > chose 2-byte UTF-16 as internal format. I guess it's up to the > user acceptance to choose between the two. 2 bytes means more > work on the implementor, 4 bytes means more $$$ for Micron et al. ;-) My 1-year old laptop has a 10 Gb hard drive and 128 Mb RAM. Current machines are between 2-4 times that. How much of that space will be wasted on extra Unicode? For a typical user, most of it is MP3's anyway. :-) > > > > And why can't Python support the two standards simultaneously? > > > > > > Why would you want to support two standards for the same thing ? > > > > Well, we support ASCII and Unicode. 
:-) > > > > If ISO 10646 becomes important to our users, we'll have to support > > it, if only by providing a codec. > > This is different: ISO 10646 is a competing standard, not just a > different encoding. Oh. I didn't know. How does it differ from Unicode? What's the user acceptance? --Guido van Rossum (home page: http://www.python.org/~guido/) From mal@lemburg.com Mon Jun 25 17:23:10 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 18:23:10 +0200 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? References: <9F2D83017589D211BD1000805FA70CA703B139D9@ntxmel03.cmutual.com.au> Message-ID: <3B37656E.9E09DB1A@lemburg.com> "Machin, John" wrote: > > > Say you have a Unicode string which contains the following data: > > > > U+0061 U+0062 U+0063 U+DC00 U+0064 U+0065 U+0066 > > ("a" "b" "c" ? "d" "e" "f") > > > > Would you consider this sequence a Unicode string or not ? > > I think you are using "Unicode string" with two different meanings here. The question is really very simple: is the above correct Unicode or not ? > However, the pragmatic question is what should Python do when given such a > sequence. > Do we permit such a sequence to be held internally as a "Unicode string"? > Is u"\udc00" legal in source code or should Python throw a syntax error? > Same question for u"\uffff". Right... that's what I was getting at. The Unicode object in Python represents a "Unicode string"; the underlying logic is really secondary, the question here is whether construction of objects like u"\uFFFF" should be possible or not. If the standard defines these as illegal Unicode, then the constructors should make sure that construction of these objects is not possible; otherwise, it should work on them just like all other "code points". (http://www.unicode.org/glossary/) > We *do* need to consider UTF encodings, because Unicode *expressly* allows > decoding UTF sequences > that become unpaired surrogates, or other "not 100% valid" scalars such as > 0xffff and 0xfffe. The standard says this on the noncharacter code points: """ D7b Noncharacter: a code point that is permanently reserved for internal use, and that should never be interchanged. In Unicode 3.1, these consist of the values U+nFFFE and U+nFFFF (where n is from 0 to 10 hex, i.e. 0 to 16) and the values U+FDD0..U+FDEF. C5 A process shall not interpret a noncharacter code point as an abstract character. The code points may be used internally, such as for sentinel values or delimiters, but should not be exchanged publicly. C10 A process shall make no change in a valid coded character representation other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points, if that process purports not to modify the interpretation of that coded character sequence. If a noncharacter which does not have a specific internal use is unexpectedly encountered in processing, an implementation may signal an error or delete or ignore the noncharacter. If these options are not taken, the noncharacter should be treated as an unassigned code point. For example, an API that returned a character property value for a noncharacter would return the same value as the default value for an unassigned code point. """ Note that lone surrogates are not regarded as noncharacters (for some reason). > So, > given that Python supports Unicode, not ISO 10646, we must IMO permit such > sequences in our internal > representation.
It follows that we should stop worrying about these > irregular values -- it's less > programming that way. Unicode 3.1 will create enough extra programming as it > is, because we now have > variable-length characters again -- just what Unicode was going to save us > from :-( Agreed; now who's going to submit the patches ;-) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Mon Jun 25 17:46:59 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 18:46:59 +0200 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? References: <3B3722DB.1FF54794@lemburg.com> <4ak820g418.fsf@kern.srcf.societies.cam.ac.uk> <006501c0fd82$8b5ba9f0$0c680b41@c1340594a> Message-ID: <3B376B03.A2A84AE1@lemburg.com> Mark Davis wrote: > > You cannot interpret isolated UTF-16 surrogate code units as characters. For > example, you can't interpret the sequence of D800 followed by 0061 as if it > were some private use character (say, Klingon) followed by an 'a'. > > (For those unfamiliar with the terminology, see > http://www.unicode.org/glossary, and my paper at > http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/.) Thanks for the pointers and the explanations. Your paper is a very good reading indeed. My question was targetting into a slightly different direction, though. I know that UTF-16 does not allow lone surrogates, but how does Unicode itself treat these ? If I have a sequence of Unicode code points which includes an isolated surrogate code point, would this be considered a legal Unicode sequence or not ? > However, you can certainly deal with surrogate code units in storage, and it > is permissible on that level to handle them. For example, most UTF-16 string > interfaces use code unit indices, so that a string from position 3 of length > 5 will include precisely 5 code units, not however many code points (or > graphemes!) they take up. Similarly for UTF-8 strings, the low-level units > are bytes. FYI, Python currently uses UTF-16 as internal storage format and also exposes this through its indexing interfaces. In that sense isolated surrogates would be illegal. The codecs which convert such Unicode object to other encodings would raise an exception. Unicode object constructors, slicing and concatenating Unicode objects currently do not apply any checks though. > In most people's experience, it is best to leave the low level interfaces > with indices in terms of code units, then supply some utility routines that > tell you information about code points. So surrogate support or its handling is left to the applications using the interface ?! Perhaps you are right and this is the only feasable way to approach the problem... > The most useful are: > > - given a string and an index into that string, how many code points are > before it? > - given a string and a number of code points, what is the lowest index that > contains them? > - given a string and an index into that string, is the index on a code point > boundary? These are still missing in Python; we should probably add methods for them in one of the next releases, though. > An example for Java is at > http://oss.software.ibm.com/icu4j/doc/com/ibm/text/UTF16.html. > > Mark > > ----- Original Message ----- > From: "Gaute B Strokkenes" > To: "M.-A. 
Lemburg" > Cc: "Tim Peters" ; ; > > Sent: Monday, June 25, 2001 05:03 > Subject: Re: How does Python Unicode treat surrogates? > > > > > [I'm cc:-ing the unicode list to make sure that I've gotten my > > terminology right, and to solicit comments > > > > On Mon, 25 Jun 2001, mal@lemburg.com wrote: > > > Tim Peters wrote: > > >> > > >> [M.-A. Lemburg] > > >> > ... > > >> > 2. What to do when slicing of Unicode strings would break > > >> > a surrogate pair ? > > >> > > >> To me a string is a sequence of characters, and s[0] returns the > > >> first, s[1] the second, and so on. The internal details of how the > > >> implementation chooses to torture itself <0.7 wink> should be > > >> invisible. That is, breaking a surrogate via slicing should be > > >> impossible: s[i:j] returns j-i characters, and that's that. > > > > > > It's not that simple: lone surrogates are true Unicode char points > > > in their own right; it's just that they are pretty useless without > > > their resp. partners in the data stream. And with this "feature" > > > they are in good company: the Unicode combining characters (e.g. the > > > combining acute) have th same property. > > > > This is completely and totally wrong. The Unicode standard version > > 3.1 states (conformance requirement C12(c): A conformant process shall > > not interpret illegal UTF code unit sequences as characters. > > > > The precise definition of "illegal" in this context is given > > elsewhere. See : > > > > 0xD800 is incomplete in Unicode. Unless followed by another 16-bit > > value of the right form, it is illegal. > > > > (Unicode here should read UTF-16, off course. The reason it does not > > is that the language of the technical report has not been updated to > > that of 3.1) > > > > -- > > Big Gaute http://www.srcf.ucam.org/~gs234/ > > Hello? Enema Bondage? I'm calling because I want to be happy, I guess.. > > > > > > _______________________________________________ > I18n-sig mailing list > I18n-sig@python.org > http://mail.python.org/mailman/listinfo/i18n-sig -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Mon Jun 25 18:01:28 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 19:01:28 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> Message-ID: <3B376E68.505BF6E@lemburg.com> Guido van Rossum wrote: > > > > Shouldn't there be a conversion routine between wchar_t[] and > > > Py_UNICODE[] instead of assuming they have the same format? This will > > > come up more often, and Linux has sizeif(wchar_t) == 4 I believe. > > > (Which suggests that others disagree on the waste of space.) > > > > There are conversion routines which map between Py_UNICODE > > and wchar_t in Python and these make use of the fact that > > e.g. 
on Windows Py_UNICODE can use wchar_t as basis which makes > > the conversion very fast. > > > > On Linux (which uses 4 bytes per wchar_t) the routine inserts > > tons of zeros to make Tux happy :-) > > Maybe this code should be restructured so that it lengthens the > characters or not depending on the size difference between Py_UNICODE > and wchar_t, rather than making platform assumptions. This is how it currently works. > If this is the only thing that keeps us from having a configuration > OPTION to make Py_UNICODE 32-bit wide, I'd say let's fix it. This is not easy to fix and can certainly not be made an option: UTF-16 has surrogates and is a variable width encoding of Unicode while UCS-4 is a fixed width encoding. Python currently only has minimal support for surrogates, so purist would say that we support UCS-2. However, we deliberatly chose this path to be able to upgrade to UTF-16 at some later point in time and it seems that this time has now come. > > > Agreed. But be prepared that at some point in the future the Unicode > > > world might end up agreeing on 4 bytes too... > > > > No problem... we can change to 4 byte values too if the world > > agrees on 4 bytes per character. However, 2 bytes or 4 bytes > > is an implementation detail and not part of the Unicode standard > > itself. > > But UTF-16 vs. UCS-4 is not an implementation detail! True. > If we store 4 bytes per character, we should treat surrogates > differently. I don't know where those would be converted -- probably > in the UTF-16 to UCS-4 codec. > > I'd be happy to make the configuration choice between UTF-16 and > UCS-4, if that's doable. Not easily, I'm afraid. > > 4 bytes per character makes things at the C level much easier > > and this is probably why the GNU C lib team chose 4 bytes. Other > > programming languages like Java and platforms like Windows > > chose 2-byte UTF-16 as internal format. I guess it's up to the > > user acceptance to choose between the two. 2 bytes means more > > work on the implementor, 4 bytes means more $$$ for Micron et al. ;-) > > My 1-year old laptop has a 10 Gb hard drive and 128 Mb RAM. Current > machines are between 2-4 times that. How much of that space will be > wasted on extra Unicode? For a typical user, most of it is MP3's > anyway. :-) True again :-) Still, it's the main argument people have against using 4 bytes per character; here's a quote from Mark Davis, the Unicode Consortium President: http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/ """ Decisions, decisions... Ultimately, the choice of which encoding format to use will depend heavily on the programming environment. For systems that only offer 8-bit strings currently, but are multi-byte enabled, UTF-8 may be the best choice. For systems that do not care about storage requirements, UTF-32 may be best. For systems such as Windows, Java, or ICU that use UTF-16 strings already, UTF-16 is the obvious choice. Even if they have not yet upgraded to fully support surrogates, they will be before long. If the programming environment is not an issue, UTF-16 is recommended as a good compromise between elegance, performance, and storage. """ > > > > > And why can't Python support the two standards simultaneously? > > > > > > > > Why would you want to support two standards for the same thing ? > > > > > > Well, we support ASCII and Unicode. :-) > > > > > > If ISO 10646 becomes important to our users, we'll have to support > > > it, if only by providing a codec. 
> > > > This is different: ISO 10646 is a competing standard, not just a > > different encoding. > > Oh. I didn't know. How does it differ from Unicode? What's the user > acceptance? http://www.unicode.org/unicode/consortium/memblogo.html says it all. ISO 10646 documents are only available on a pay-per-page basis -- not really ideal for spreading the word... (http://wwwold.dkuug.dk/JTC1/SC2/WG2/) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mike.sykes@acm.org Mon Jun 25 18:38:09 2001 From: mike.sykes@acm.org (J M Sykes) Date: Mon, 25 Jun 2001 18:38:09 +0100 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? References: <3B3722DB.1FF54794@lemburg.com> <4ak820g418.fsf@kern.srcf.societies.cam.ac.uk> <006501c0fd82$8b5ba9f0$0c680b41@c1340594a> Message-ID: <005b01c0fd9d$e4469e60$1a2cf7c2@oakdale2> Mark Davis said: > > In most people's experience, it is best to leave the low level interfaces > with indices in terms of code units, then supply some utility routines that > tell you information about code points. ... Anyone on the list interested in the treatment of UCS aka Unicode in programming languages might like to know that a meeting of ISO/IEC JTC 1/SC 32/WG 3 recently approved a paper that specifies how SQL implementations should do it. The proposal can be found at: ftp://sqlstandards.org/SC32/WG3/Meetings/PER_2001_04_Perth_AUS/per054r1.pdf The current CD of the next SQL standard (ISO/IEC 9075), as amended by this proposal (and many others) can be found at: ftp://sqlstandards.org/SC32/WG3/Progression_Documents/CD/cd1r1-foundation-20 01-06.pdf Briefly, the SQL functions CHARACTER_LENGTH, POSITION (the SQL string indexing function), and SUBSTRING will all accept a parameter specifying the units to be used, the alternatives being OCTETS, CODE_UNITS and CHARACTERS (which to SQL means code points); the default being characters. This proposal was agreed by major SQL implementors. Which doesn't mean that it's right, nor that it can't be changed. But that's how it is at the moment. Mike. *********************************************************** J M Sykes Email: Mike.Sykes@acm.org 97 Oakdale Drive Heald Green CHEADLE Cheshire SK8 3SN UK Tel: (44) 161 437 5413 *********************************************************** From guido@digicool.com Mon Jun 25 18:42:29 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 13:42:29 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 11:25:38 EDT." 
<15159.22514.976923.894201@cymru.basistech.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> Message-ID: <200106251742.f5PHgTW08532@odiug.digicool.com> > So what has been implemented is UCS-2, not UTF-16, and certainly not > Unicode. Better to document u"" string literals as UCS-2, and not > Unicode. I'm sorry, but I don't see why it's UCS-2 any more or less than UTF-16. That's like arguing whether 8-bit strings contains ASCII or UTF-8. That's up to the application; the data type can be used for either. > > It may change *eventually* -- when we switch to UCS-4 for the internal > > representation. Until then, the API will deal in 16-bit values that > > may or may not be "characters". > > You don't need to switch to UCS-4 internally to implement what I'm > suggesting. But unless I misunderstand what it *is* that you are suggesting, the O(1) indexing property can't be retained with your suggestion, and that's out of the question. > > I'd say that ideally the choice to have a 2 or 4 byte internal > > representation (or no Unicode support at all, for some platforms like > > PalmOS!) should be a configuration choice. > > I don't think it should be a configuration choice. That leads to > incompatibilities between people's scripts. It's bad enough already > with some things working with threaded versions of python and some not > (e.g., Zope requires threading, but mod_python doesn't work if its > turned on). That turned out to be a myth, actually. mod_python works fine with threads on most platforms. Anyway, code that specifically doesn't work when a particular feature is turned *on* is rare. Code that *requires* a specific feature is common, of course, and I would think that Python's Unicode type is useful as it is for applications that don't need the newer planes. > BTW, Palm recently joined the Unicode Consortium, and Symbian has > Unicode support. > > >Right now the implementation doesn't allow that choice at all, which > >should be remedied -- maybe you can help by submitting patches? > > Touché. :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From tree@basistech.com Mon Jun 25 18:13:56 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 13:13:56 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? 
In-Reply-To: <200106251742.f5PHgTW08532@odiug.digicool.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> Message-ID: <15159.29012.266722.112773@cymru.basistech.com> Guido van Rossum writes: > I'm sorry, but I don't see why it's UCS-2 any more or less than > UTF-16. That's like arguing whether 8-bit strings contains ASCII or > UTF-8. That's up to the application; the data type can be used for > either. UCS-2 and UTF-16 and UTF-8 are encoding forms of Unicode. Unicode defines characters using an abstract integer, the code-point. As of Unicode 3.1 code points range from 0x000000 to 0x10FFFF. The so-called Unicode string type in Python is a wide-string type, where each character is treated as a 16-bit quantity. The interpretation placed on those 16-bit quantities is that of UCS-2. In that case each half of a surrogate pair is an unknown character. As soon as you impose UTF-16 semantics on the 16-bit quantities, then you need to treat surrogate pairs as a single character. If the implementation won't change, then the standard library needs to support surrogates as a wrapper: leaving it up to each application is a mistake. IMHO you cannot trust implementers to do this right. > But unless I misunderstand what it *is* that you are suggesting, the > O(1) indexing property can't be retained with your suggestion, and > that's out of the question. You understand me completely. Adding transparent UTF-16 support changes your O(1) indexing operation to O(1+c), where 'c' is the small amount of time required to check for the surrogate. Granted, this 'c' could get large, but... But I see your point: this requirement is what prompted the glibc folks to go with the 32-bit wchar_t type. > That turned out to be a myth, actually. mod_python works fine with > threads on most platforms. Not in my experience. On my FreeBSD box Python 2.0 built with threads does not get along in some cases where Apache 1.3.19. Not that it matters. -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From guido@digicool.com Mon Jun 25 19:04:13 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 14:04:13 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 19:01:28 +0200." 
<3B376E68.505BF6E@lemburg.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> Message-ID: <200106251804.f5PI4D008730@odiug.digicool.com> OK, focusing on a single item. [me] > > If this is the only thing that keeps us from having a configuration > > OPTION to make Py_UNICODE 32-bit wide, I'd say let's fix it. [MAL] > This is not easy to fix and can certainly not be made an > option: UTF-16 has surrogates and is a variable width encoding > of Unicode while UCS-4 is a fixed width encoding. But even if we supported UTF-16 with surrogates, picking strings apart using u[i] would still be able to access the separate lower and upper halves of the surrogates, right, and in the presence of surrogates len(u) would not match the number of *characters* in u. > Python currently only has minimal support for surrogates, so > purist would say that we support UCS-2. However, we deliberatly > chose this path to be able to upgrade to UTF-16 at some later > point in time and it seems that this time has now come. How hard would it be to also change the party line about what the encoding used is based on whether we use 2 or 4 bytes? We could even give three choices: UCS-2 (current situation, no surrogates), UTF-16 (16-bit items with some surrogate support) or UCS-4 (32-bit items)? > > I'd be happy to make the configuration choice between UTF-16 and > > UCS-4, if that's doable. > > Not easily, I'm afraid. Can you explain why this is not easy? > http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/ > """ > Decisions, decisions... > Ultimately, the choice of which encoding format to use will depend heavily on the programming environment. For systems that only offer > 8-bit strings currently, but are multi-byte enabled, UTF-8 may be the best choice. For systems that do not care about storage requirements, > UTF-32 may be best. For systems such as Windows, Java, or ICU that use UTF-16 strings already, UTF-16 is the obvious choice. Even if > they have not yet upgraded to fully support surrogates, they will be before long. > > If the programming environment is not an issue, UTF-16 is recommended as a good compromise between elegance, performance, and > storage. > """ I buy that as an argument for supporting UTF-16, but not for cutting off the road to supporting UCS-4 for those users who would like to opt in. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Mon Jun 25 19:16:40 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 14:16:40 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 13:13:56 EDT." 
<15159.29012.266722.112773@cymru.basistech.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> <15159.29012.266722.112773@cymru.basistech.com> Message-ID: <200106251816.f5PIGev08808@odiug.digicool.com> > Guido van Rossum writes: > > I'm sorry, but I don't see why it's UCS-2 any more or less than > > UTF-16. That's like arguing whether 8-bit strings contains ASCII or > > UTF-8. That's up to the application; the data type can be used for > > either. > > UCS-2 and UTF-16 and UTF-8 are encoding forms of Unicode. Unicode > defines characters using an abstract integer, the code-point. As of > Unicode 3.1 code points range from 0x000000 to 0x10FFFF. > > The so-called Unicode string type in Python is a wide-string type, > where each character is treated as a 16-bit quantity. The > interpretation placed on those 16-bit quantities is that of UCS-2. In > that case each half of a surrogate pair is an unknown character. So far we agree. > As soon as you impose UTF-16 semantics on the 16-bit quantities, then > you need to treat surrogate pairs as a single character. > > If the implementation won't change, then the standard library needs to > support surrogates as a wrapper: leaving it up to each application is > a mistake. IMHO you cannot trust implementers to do this right. Sure, someone can add a module that provides surrogate support using the standard Unicode datatype. > > But unless I misunderstand what it *is* that you are suggesting, the > > O(1) indexing property can't be retained with your suggestion, and > > that's out of the question. > > You understand me completely. Adding transparent UTF-16 support > changes your O(1) indexing operation to O(1+c), where 'c' is the small > amount of time required to check for the surrogate. Granted, this 'c' > could get large, but... I don't think there is such a thing as "O(1+c) for small c". To extract the n'th Unicode character you would have to loop over all the preceding characters checking for surrogates. This makes it O(n). It's a common Python idiom to read megabytes of text into a single (8-bit or 16-bit) string object, so changing O(1) to O(n) is a real problem! > But I see your point: this requirement is what prompted the glibc > folks to go with the 32-bit wchar_t type. > > > That turned out to be a myth, actually. mod_python works fine with > > threads on most platforms. > > Not in my experience. On my FreeBSD box Python 2.0 built with threads > does not get along in some cases where Apache 1.3.19. Not that it matters. FreeBSD happens to be one of those platforms. 
:-( Has to do with the fact that on *BSD you link with a different version of the C library to enable threads, and since Apache is linked with the unthreaded version, any versions of Python embedded in Apache must also be unthreaded. --Guido van Rossum (home page: http://www.python.org/~guido/) From mark@macchiato.com Mon Jun 25 19:18:52 2001 From: mark@macchiato.com (Mark Davis) Date: Mon, 25 Jun 2001 11:18:52 -0700 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? References: <3B3722DB.1FF54794@lemburg.com> <4ak820g418.fsf@kern.srcf.societies.cam.ac.uk> <006501c0fd82$8b5ba9f0$0c680b41@c1340594a> <3B376B03.A2A84AE1@lemburg.com> Message-ID: <00f101c0fda3$4a2529e0$0c680b41@c1340594a> comments below. ----- Original Message ----- From: "M.-A. Lemburg" To: "Mark Davis" Cc: "Gaute B Strokkenes" ; "Tim Peters" ; ; Sent: Monday, June 25, 2001 09:46 Subject: Re: [I18n-sig] Re: How does Python Unicode treat surrogates? [snip] > > My question was targetting into a slightly different direction, > though. I know that UTF-16 does not allow lone surrogates, but > how does Unicode itself treat these ? If I have a sequence of Unicode > code points which includes an isolated surrogate code point, > would this be considered a legal Unicode sequence or not ? It is a legal Unicode code point sequence. However, it is not a legal Unicode *character* sequence, since it contains code points that by definition cannot be used to represent characters. > > > However, you can certainly deal with surrogate code units in storage, and it > > is permissible on that level to handle them. For example, most UTF-16 string > > interfaces use code unit indices, so that a string from position 3 of length > > 5 will include precisely 5 code units, not however many code points (or > > graphemes!) they take up. Similarly for UTF-8 strings, the low-level units > > are bytes. > > FYI, Python currently uses UTF-16 as internal storage format > and also exposes this through its indexing interfaces. In that > sense isolated surrogates would be illegal. The codecs which > convert such Unicode object to other encodings would raise an > exception. > Unicode object constructors, slicing and concatenating > Unicode objects currently do not apply any checks though. That is what is typically done, since using codepoint indices on each operation is a very significant performance burden. Mark From tree@basistech.com Mon Jun 25 18:43:23 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 13:43:23 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? 
In-Reply-To: <200106251816.f5PIGev08808@odiug.digicool.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> <15159.29012.266722.112773@cymru.basistech.com> <200106251816.f5PIGev08808@odiug.digicool.com> Message-ID: <15159.30780.1143.760653@cymru.basistech.com> Guido van Rossum writes: > To extract the n'th Unicode character you would have to loop over all > the preceding characters checking for surrogates. This makes it O(n). No. If the n'th character is a valid high-surrogate (U+D800 -- U+DBFF) then look at the n+1'th character for a valid low-surrogate. If the n'th character is a valid low-surrogate and the n-1'th character is a valid high-surrogate, then skip it. > It's a common Python idiom to read megabytes of text into a single > (8-bit or 16-bit) string object, so changing O(1) to O(n) is a real > problem! Yes, I do it all the time... my primary use of Python is managing Chinese and Japanese lexicographic data where the files are upwards of 25+MB of UTF-8 encoded Unicode text. -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From mal@lemburg.com Mon Jun 25 19:35:12 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 20:35:12 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> Message-ID: <3B378460.C27CDCDD@lemburg.com> Guido van Rossum wrote: > > OK, focusing on a single item. > > [me] > > > If this is the only thing that keeps us from having a configuration > > > OPTION to make Py_UNICODE 32-bit wide, I'd say let's fix it. > > [MAL] > > This is not easy to fix and can certainly not be made an > > option: UTF-16 has surrogates and is a variable width encoding > > of Unicode while UCS-4 is a fixed width encoding. > > But even if we supported UTF-16 with surrogates, picking strings apart > using u[i] would still be able to access the separate lower and upper > halves of the surrogates, right, and in the presence of surrogates > len(u) would not match the number of *characters* in u. 
That's because len(u) has nothing to do with the number of characters in the string, it only counts the code units (Py_UNICODEs) which are used to represent characters. The same is true for normal strings, e.g. UTF-8 can use between 1-4 code units (bytes in this case) for a single code point, and in Unicode you can create characters by combining code units. As Mark Davis pointed out: """In most people's experience, it is best to leave the low level interfaces with indices in terms of code units, then supply some utility routines that tell you information about code points. The most useful are: - given a string and an index into that string, how many code points are before it? - given a string and a number of code points, what is the lowest index that contains them? - given a string and an index into that string, is the index on a code point boundary? """ Python could use some more Unicode methods to answer these questions. > > Python currently only has minimal support for surrogates, so > > purist would say that we support UCS-2. However, we deliberatly > > chose this path to be able to upgrade to UTF-16 at some later > > point in time and it seems that this time has now come. > > How hard would it be to also change the party line about what the > encoding used is based on whether we use 2 or 4 bytes? We could even > give three choices: UCS-2 (current situation, no surrogates), UTF-16 > (16-bit items with some surrogate support) or UCS-4 (32-bit items)? Ehm... what are you getting at here ? > > > I'd be happy to make the configuration choice between UTF-16 and > > > UCS-4, if that's doable. > > > > Not easily, I'm afraid. > > Can you explain why this is not easy? Because choosing whether or not to support surrogates is a fundamental choice which affects far more than just the way you access storage. Surrogates introduce variable width characters: some characters use two or more Py_UNICODE code units while (most) others only use one. Remember when we discussed which internal format to use or which default encoding to apply ? We ruled out UTF-8 because it fails badly when it comes to slicing, concatenation, indexing, etc. UTF-16 is much less painful as most code points only take up a single code unit, but it still introduces a break in concept. > > http://www-106.ibm.com/developerworks/unicode/library/utfencodingforms/ > > """ > > Decisions, decisions... > > Ultimately, the choice of which encoding format to use will depend heavily on the programming environment. For systems that only offer > > 8-bit strings currently, but are multi-byte enabled, UTF-8 may be the best choice. For systems that do not care about storage requirements, > > UTF-32 may be best. For systems such as Windows, Java, or ICU that use UTF-16 strings already, UTF-16 is the obvious choice. Even if > > they have not yet upgraded to fully support surrogates, they will be before long. > > > > If the programming environment is not an issue, UTF-16 is recommended as a good compromise between elegance, performance, and > > storage. > > """ > > I buy that as an argument for supporting UTF-16, but not for cutting > off the road to supporting UCS-4 for those users who would like to opt > in. That was not my point. I just wanted to point out how well UTF-16 is being accepted out there and that we are in good company by moving from UCS-2 to UTF-16 as current internal format. I don't want to cut off the road to UCS-4, I just want to make clear that UTF-16 is a good choice and one which will last at least some more years.
We can then always decide to move on to UCS-4 for the internal storage format. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From fredrik@pythonware.com Mon Jun 25 19:41:48 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Mon, 25 Jun 2001 20:41:48 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com><200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de><3B3471AF.1311E872@lemburg.com><200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de><3B34F9BD.4FDEFC62@lemburg.com><200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de><3B35CEC6.710243E7@lemburg.com><200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de><3B362E9B.4DC8DD81@lemburg.com><200106251342.f5PDg1q07291@odiug.digicool.com><15159.14391.718891.645489@cymru.basistech.com><200106251422.f5PEMel07612@odiug.digicool.com><15159.17083.978971.519453@cymru.basistech.com><200106251443.f5PEh2p07753@odiug.digicool.com><15159.19546.226155.383490@cymru.basistech.com><200106251544.f5PFiWe07979@odiug.digicool.com><15159.22514.976923.894201@cymru.basistech.com><200106251742.f5PHgTW08532@odiug.digicool.com><15159.29012.266722.112773@cymru.basistech.com><200106251816.f5PIGev08808@odiug.digicool.com> <15159.30780.1143.760653@cymru.basistech.com> Message-ID: <008e01c0fda6$7fe81ad0$4ffa42d5@hagrid> Tom Emerson wrote: > > To extract the n'th Unicode character you would have to loop over all > > the preceding characters checking for surrogates. This makes it O(n). > > No. If the n'th character is a valid high-surrogate (U+D800 -- U+DBFF) > then look at the n+1'th character for a valid low-surrogate. If the > n'th character is a valid low-surrogate and the n-1'th character is a > valid high-surrogate, then skip it. bzzt. try again. From guido@digicool.com Mon Jun 25 19:42:24 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 14:42:24 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 13:43:23 EDT." <15159.30780.1143.760653@cymru.basistech.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> <15159.29012.266722.112773@cymru.basistech.com> <200106251816.f5PIGev08808@odiug.digicool.com> <15159.30780.1143.760653@cymru.basistech.com> Message-ID: <200106251842.f5PIgOe09018@odiug.digicool.com> > No. If the n'th character is a valid high-surrogate (U+D800 -- U+DBFF) > then look at the n+1'th character for a valid low-surrogate. 
If the > n'th character is a valid low-surrogate and the n-1'th character is a > valid high-surrogate, then skip it. Ouch. So suppose we have a string u containing four items: a regular 16-bit char, a high surrogate, a low surrogate, and another regular 16-bit char. You're saying that u[0] should return the first character, u[1] the entire surrogate (so it would still be a 2-item string), u[2] I guess the empty string, and u[3] the final regular char. IMO that would break an important invariant of string-like objects, namely that len(s[i]) == 1. I could live with a method u.character(i) that would behave like the above rule -- but not the u[i] notation. But wouldn't it be enough to have a test u.issurrogate() that would test if the first character of u is a valid high-surrogate? (And maybe another test u.islowsurrogate() testing for a valid low-surrogate.) Then you could write it yourself easily:

def char(u, i):
    c = u[i]
    if c.issurrogate():
        c2 = u[i+1]
        assert c2.islowsurrogate()
        c = c + c2
    return c

(Don't pay attention to the method names I'm proposing -- that's for a separate subcommittee. :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From tree@basistech.com Mon Jun 25 19:12:17 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 14:12:17 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106251842.f5PIgOe09018@odiug.digicool.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> <15159.29012.266722.112773@cymru.basistech.com> <200106251816.f5PIGev08808@odiug.digicool.com> <15159.30780.1143.760653@cymru.basistech.com> <200106251842.f5PIgOe09018@odiug.digicool.com> Message-ID: <15159.32513.611214.399097@cymru.basistech.com> Guido van Rossum writes: > Ouch. So suppose we have a string u containing four items: a regular > 16-bit char, a high surrogate, a low surrogate, and another regular > 16-bit char. You're saying that u[0] should return the first > character, u[1] the entire surrogate (so it would still be a 2-item > string), u[2] I guess the empty string, and u[3] the final regular > char. [...] No, but we may as well stop going around on this, since my views are not going to happen. In my view the string 'u' is a Unicode string. I don't care what sits underneath: 16-bits or 32-bits, I don't care. As far as I'm concerned the string has three characters in it: foo = u"\u4e00\u020000a" means that foo[0] == u"\u4e00", foo[1] == u"\u020000", and foo[2] == u"a". The fact that this is represented internally in different ways shouldn't matter to the user who only cares about characters. > IMO that would break an important invariant of string-like objects, > namely that len(s[i]) == 1. Yes it would, which is why it isn't what I'm recommending.
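As an aside, an edge-checked variant of the char() helper from your message might look like the following -- purely a sketch, still written in terms of the hypothetical issurrogate()/islowsurrogate() predicates:

    def char(u, i):
        # Return the full character starting at u[i] (one or two code
        # units), or u"" if u[i] is the trailing half of a surrogate pair.
        c = u[i]
        if c.islowsurrogate():
            return u""
        if c.issurrogate():
            if i + 1 == len(u) or not u[i+1].islowsurrogate():
                raise ValueError("ill-formed surrogate pair")
            c = c + u[i+1]
        return c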
> I could live with a method u.character(i) that would behave like the > above rule -- but not the u[i] notation. Me too. 'nuff said. ;-) > But wouldn't it be enough to have a test u.issurrogate() that would > test if the first character of u is a valid high-surrogate? (And > maybe another test u.islowsurrogate() testing for a valid > low-surrogate.) Then you could write it yourself easily:
>
> def char(u, i):
>     c = u[i]
>     if c.issurrogate():
>         c2 = u[i+1]
>         assert c2.islowsurrogate()
>         c = c + c2
>     return c

Sure, as long as you check for the edge conditions. This should be in the library. -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From guido@digicool.com Mon Jun 25 20:12:31 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 15:12:31 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 20:35:12 +0200." <3B378460.C27CDCDD@lemburg.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> Message-ID: <200106251912.f5PJCVD09465@odiug.digicool.com> > That's because len(u) has nothing to do with the number of > characters in the string, it only counts the code units (Py_UNICODEs) > which are used to represent characters. The same is true for normal > strings, e.g. UTF-8 can use between 1-4 code units (bytes in this > case) for a single code point and in Unicode you can create characters > by combining code points. Total agreement. > As Mark Davis pointed out: > > """In most people's experience, it is best to leave the low level interfaces > with indices in terms of code units, then supply some utility routines that > tell you information about code points. The most useful are: > > - given a string and an index into that string, how many code points are > before it? > - given a string and a number of code points, what is the lowest index that > contains them? I understand the first and the third, but what is this one? Is it a search? > - given a string and an index into that string, is the index on a code point > boundary? > """ > > Python could use some more Unicode methods to answer these > questions. Agreed (see my other post responding to Tom Emerson). > > > Python currently only has minimal support for surrogates, so > > > purists would say that we support UCS-2. However, we deliberately > > > chose this path to be able to upgrade to UTF-16 at some later > > > point in time and it seems that this time has now come. > > > > How hard would it be to also change the party line about what the > > encoding used is based on whether we use 2 or 4 bytes? We could even > > give three choices: UCS-2 (current situation, no surrogates), UTF-16 > > (16-bit items with some surrogate support) or UCS-4 (32-bit items)? > > Ehm... what are you getting at here ?
Earlier on you said it would be hard to offer a config-time choice between UTF-16 and UCS-4. I'm still trying to figure out why. Given the additional stuff I've learned now about surrogates, it doesn't make sense to choose between UCS-2 and UTF-16; the surrogate handling can always be present. So let me rephrase the question. How hard would it be to offer the config-time choice between UCS-4 and UTF-16? If it's hard, why? (I've heard you say that it's hard before, but I still don't understand the problem.) > > > > I'd be happy to make the configuration choice between UTF-16 and > > > > UCS-4, if that's doable. > > > > > > Not easily, I'm afraid. > > > > Can you explain why this is not easy? > > Because choosing whether or not to support surrogates is a > fundamental choice which affects far more than just the way you > access storage. Surrogates introduce variable width characters: > some characters use two or more Py_UNICODE code units while (most) > others only use one. > > Remember when we discussed which internal format to use or > which default encoding to apply ? We ruled out UTF-8 because > it fails badly when it comes to slicing, concatenation, indexing, > etc. > > UTF-16 is much less painful as most code points only take > up a single code unit, but it still introduces a break in concept. Hm, it sounds like you have the same problem that I had with Tom Emerson's suggestion to support Unicode before he clarified it. If we make a clean distinction between characters and storage units, and if we stick to the rule that u[i] accesses a storage unit, what's the conceptual difficulty? There might be a separate method u.char(i) which returns the *character* starting u[i:], or "" if u[i] is a low-surrogate. That could be all we need to support surrogates. How bad is that? (These could even continue to be supported when the storage uses UCS-4; there, u.char(i) would always be u[i], until someone comes up with a 64-bit character set. ;-) > > I buy that as an argument for supporting UTF-16, but not for cutting > > off the road to supporting UCS-4 for those users who would like to opt > > in. > > That was not my point. I just wanted to point out how well UTF-16 > is being accepted out there and that we are in good company by > moving from UCS-2 to UTF-16 as the current internal format. Good! I agree. > I don't want to cut off the road to UCS-4, I just want to make > clear that UTF-16 is a good choice and one which will last at > least some more years. We can then always decide to move on > to UCS-4 for the internal storage format. Agreed again. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Mon Jun 25 20:22:58 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 15:22:58 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 14:12:17 EDT."
<15159.32513.611214.399097@cymru.basistech.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> <15159.29012.266722.112773@cymru.basistech.com> <200106251816.f5PIGev08808@odiug.digicool.com> <15159.30780.1143.760653@cymru.basistech.com> <200106251842.f5PIgOe09018@odiug.digicool.com> <15159.32513.611214.399097@cymru.basistech.com> Message-ID: <200106251922.f5PJMwm09492@odiug.digicool.com> > Guido van Rossum writes: > > Ouch. So suppose we have a string u containing four items: a regular > > 16-bit char, a high surrogate, a low surrogate, and another regular > > 16-bit char. You're saying that u[0] should return the first > > character, u[1] the entire surrogate (so it would still be a 2-item > > string), u[2] I guess the empty string, and u[3] the final regular > > char. > [...] > > No, but we may as well stop going around on this, since my views are > not going to happen. > > In my view the string 'u' is a Unicode string. I don't care what sits > underneath: 16-bits or 32-bits, I don't care. As far as I'm concerned > the string has three characters in it: > > foo = u"\u4e00\u020000a" > > means that foo[0] == u"\u4e00", foo[1] == u"\u020000", and foo[2] == > u"a". I hope you meant foo = u"\u4e00\U00020000a" and foo[1] == u'\U00020000'. (I worry that your sloppy use of variable length \u escapes above shows that your understanding of the subject matter is less than you've made me believe. Please say it ain't so!) > The fact that this is represented internally in different ways shouldn't > matter to the user who only cares about characters. You misunderstand. I am claiming that this shouldn't happen because it would make u[i] an O(n) operation. Then you brought up an argument that suggested a way of indexing that *wouldn't* make it O(n), and that's what I guessed (in my "Ouch" paragraph quoted above). But what you describe now doesn't have a constant number of storage units per character, so it has to have O(n) indexing time (unless you assume a terribly hairy data structure). I'm worried that you don't understand the O(n) notation, or that you don't understand why what you are proposing would make indexing O(n). Your suggestion of "O(1+c) for some small c" makes me *really* worried about this. In which case what you want ain't gonna happen, but not for the reason you fear (BDFL decree): it's not well thought out. > > IMO that would break an important invariant of string-like objects, > > namely that len(s[i]) == 1. > > Yes it would, which is why it isn't what I'm recommending. > > > I could live with a method u.character(i) that would behave like the > > above rule -- but not the u[i] notation. > > Me too. 'nuff said. ;-) But would u.character(i) be O(1) or O(n)?
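To pin the example down, here is how the corrected literal decomposes under a UTF-16 representation -- the unit values are just the standard surrogate arithmetic:

    # foo = u"\u4e00\U00020000a" as UTF-16 code units:
    units = [0x4E00, 0xD840, 0xDC00, 0x0061]
    # 0x4E00          -> U+4E00 (a single unit)
    # 0xD840, 0xDC00  -> the surrogate pair encoding U+20000
    # 0x0061          -> u'a'
    # Under code unit indexing len(foo) == 4, and foo[1] is only the
    # high half of a pair -- which is exactly what is at issue here.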
> > But wouldn't it be enough to have a test u.issurrogate() that would > > test if the first character of u is a valid high-surrogate? (And > > maybe another test u.islowsurrogate() testing for a valid > > low-surrogate.) Then you could write it yourself easily:
> >
> > def char(u, i):
> >     c = u[i]
> >     if c.issurrogate():
> >         c2 = u[i+1]
> >         assert c2.islowsurrogate()
> >         c = c + c2
> >     return c
>
> Sure, as long as you check for the edge conditions. This should be in > the library. Note that in your above example, char(foo, 2) would not be u'a' but would be u'\u0000', and char(foo, 3) would be u'a'. So I still think you haven't thought this out as much as you believe. --Guido van Rossum (home page: http://www.python.org/~guido/) From mark@macchiato.com Mon Jun 25 20:27:07 2001 From: mark@macchiato.com (Mark Davis) Date: Mon, 25 Jun 2001 12:27:07 -0700 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? References: <3B3722DB.1FF54794@lemburg.com> <4ak820g418.fsf@kern.srcf.societies.cam.ac.uk> <006501c0fd82$8b5ba9f0$0c680b41@c1340594a> <005b01c0fd9d$e4469e60$1a2cf7c2@oakdale2> Message-ID: <013901c0fdac$d27d1970$0c680b41@c1340594a> That is an interesting approach; one that basically amounts to some convenience functions. For example, instead of writing: myString.substring(myString.cpToIndex(3), myString.cpToIndex(5)); you could write: myString.substring(3, 5, myString.CODEPOINT); This hides some of the work, when someone is working in code points. The performance cost is still there, of course; using code point indexes requires each operation to examine every code unit up to that point, which is much more expensive. For a general programming language or string library, I'm not sure about implementing this pattern throughout. I know in the ICU library, for example, we have a significant number of functions that take offsets into strings. Having such a parameter on all of them would be clumsy, when most of the time people are simply working in code units. Mark ----- Original Message ----- From: "J M Sykes" To: "Mark Davis" ; "M.-A. Lemburg" ; "Gaute B Strokkenes" Cc: "Tim Peters" ; ; "Unicode List" Sent: Monday, June 25, 2001 10:38 Subject: Re: How does Python Unicode treat surrogates? > Mark Davis said: > > > > In most people's experience, it is best to leave the low level interfaces > > with indices in terms of code units, then supply some utility routines > that > > tell you information about code points. ... > > Anyone on the list interested in the treatment of UCS aka Unicode in > programming languages might like to know that a meeting of ISO/IEC JTC 1/SC > 32/WG 3 recently approved a paper that specifies how SQL implementations > should do it. > > The proposal can be found at: > > ftp://sqlstandards.org/SC32/WG3/Meetings/PER_2001_04_Perth_AUS/per054r1.pdf > > The current CD of the next SQL standard (ISO/IEC 9075), as amended by this > proposal (and many others) can be found at: > > ftp://sqlstandards.org/SC32/WG3/Progression_Documents/CD/cd1r1-foundation-2001-06.pdf > > Briefly, the SQL functions CHARACTER_LENGTH, POSITION (the SQL string > indexing function), and SUBSTRING will all accept a parameter specifying the > units to be used, the alternatives being OCTETS, CODE_UNITS and CHARACTERS > (which to SQL means code points); the default being characters. > > This proposal was agreed by major SQL implementors. > > Which doesn't mean that it's right, nor that it can't be changed. But that's > how it is at the moment. > > Mike.
> > *********************************************************** > > J M Sykes Email: Mike.Sykes@acm.org > 97 Oakdale Drive > Heald Green > CHEADLE > Cheshire SK8 3SN > UK Tel: (44) 161 437 5413 > > *********************************************************** > > > > > > From paulp@ActiveState.com Mon Jun 25 20:41:15 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Mon, 25 Jun 2001 12:41:15 -0700 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> Message-ID: <3B3793DB.DFF114EC@ActiveState.com> Guido van Rossum wrote: > >... > > If we make a clean distinction between characters and storage units, > and if we stick to the rule that u[i] accesses a storage unit, what's the > conceptual difficulty? > > There might be a separate method u.char(i) > which returns the *character* starting u[i:], or "" if u[i] is a > low-surrogate. Are you saying that having u[i] return the i'th character (code point) of 'u' is not going to be provided at all? > That could be all we need to support surrogates. How > bad is that? (These could even continue to be supported when the > storage uses UCS-4; there, u.char(i) would always be u[i], until > someone comes up with a 64-bit character set. ;-) So the same input will have a different behavior based on the fact that we upgraded our internal representation? :( That strikes me as an int/long issue. I'd rather we design in terms of the logical construct: "arbitrary-sized mathematical integer", "Unicode code point" rather than the implementation detail: "32-bit 2's complement integer", "UTF-16 code unit." -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From tree@basistech.com Mon Jun 25 20:01:43 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 15:01:43 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates?
In-Reply-To: <200106251922.f5PJMwm09492@odiug.digicool.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> <15159.29012.266722.112773@cymru.basistech.com> <200106251816.f5PIGev08808@odiug.digicool.com> <15159.30780.1143.760653@cymru.basistech.com> <200106251842.f5PIgOe09018@odiug.digicool.com> <15159.32513.611214.399097@cymru.basistech.com> <200106251922.f5PJMwm09492@odiug.digicool.com> Message-ID: <15159.35479.42093.828285@cymru.basistech.com> [ I'm the first to admit this hasn't been thought out... I'm writing off the cuff ] Guido van Rossum writes: > > foo = u"\u4e00\u020000a" > > > > means that foo[0] == u"\u4e00", foo[1] == u"\u020000", and foo[2] == > > u"a". > > I hope you meant foo = u"\u4e00\U00020000a" and foo[1] == u'\U00020000'. > > (I worry that your sloppy use of variable length \u escapes above > shows that your understanding of the subject matter is less than > you've made me believe. Please say it ain't so!) The maximum code-point value for a Unicode character is U+10FFFF, hence the suggested notation above (I should have noted it as such). If Python is going to implement full support for ISO 10646 then the full 32-bit representation (and 8-digit \U escape) is appropriate. If you limit the maximum size of the character escape so that the scanner catches improper character sizes you save grief for the end-user, IMHO. I must admit that I wasn't aware of the "\U00020000" notation. I still think it should limit itself to 6 digits, not 8. > > The fact that this is represented internally different ways shouldn't > > matter to the user who only cares about characters. > > You misunderstand. I am claiming that this shouldn't happen because > it would make u[i] an O(n) operation. Then you brought up an argument > that suggested a way of indexing that *wouldn't* make it O(n), and > that's what I guessed (in my "Ouch" paragraph quoted above). > > But what you describe now doesn't have a constant number of storage > units per character, so it has to have O(n) indexing time (unless you > assume a terribly hairy data structure). I understand O(n) and O(1) perfectly well. My point is that you do not have to scan the entire string when doing this indexing. You only need to look at most one storage unit on either side of the index. We're only concerned here with transparently handling surrogates when the underlying representation is UTF-16. > Note that in your above example, char(foo, 2) would not be u'a' but > would be u'\u0000', and char(foo, 3) would be u'a'. My example above presumes that indices into the string refer to characters, not storage units, and that UTF-16 is being used transparently internally.
So in my world, evaluating foo = u"\u4e00\U00020000a" would treat foo[1] as u'\U00020000' and foo[2] as u'a'. > So I still think you haven't thought this out as much as you believe. As I said, I have no belief that this is thought out. I'm merely stating what I believe the observable behavior should be. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From fredrik@pythonware.com Mon Jun 25 20:54:37 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Mon, 25 Jun 2001 21:54:37 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> Message-ID: <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> guido wrote: > > That's because len(u) has nothing to do with the number of > > characters in the string, it only counts the code units (Py_UNICODEs) > > which are used to represent characters. The same is true for normal > > strings, e.g. UTF-8 can use between 1-4 code units (bytes in this > > case) for a single code point and in Unicode you can create characters > > by combining code points. > > Total agreement. I disagree: in python's current string model, there's a difference between *encoded* byte buffers and character strings. > So let me rephrase the question. How hard would it be to offer the > config-time choice between UCS-4 and UTF-16? > If it's hard, why? the core string type (which I wrote) should support this pretty much out of the box. probably more work to fix the codecs (I didn't write them, so I cannot tell for sure), but I doubt it's that much work. SRE and the unicode databases (me again) should also work pretty much out of the box. > If we make a clean distinction between characters and storage units, > and if we stick to the rule that u[i] accesses a storage unit, what's the > conceptual difficulty? I'm sceptical -- I see very little reason to maintain that distinction. let's use either UCS-2 or UCS-4 for the internal storage, stick to the "character strings are character sequences" concept, and keep the UTF-16 surrogate issue where it belongs: in the codecs. Cheers /F From tree@basistech.com Mon Jun 25 20:17:57 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 15:17:57 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates?
In-Reply-To: <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> Message-ID: <15159.36453.486716.705433@cymru.basistech.com> Fredrik Lundh writes: > I'm sceptical -- I see very little reason to maintain that distinction. > let's use either UCS-2 or UCS-4 for the internal storage, stick to the > "character strings are character sequences" concept, and keep the > UTF-16 surrogate issue where it belongs: in the codecs. How then is u"\U00200000" represented internally if you use UCS-2 as the internal storage representation? -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From paulp@ActiveState.com Mon Jun 25 21:03:54 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Mon, 25 Jun 2001 13:03:54 -0700 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> Message-ID: <3B37992A.40CD1CF2@ActiveState.com> Fredrik Lundh wrote: > >... > > I'm sceptical -- I see very little reason to maintain that distinction. > let's use either UCS-2 or UCS-4 for the internal storage, stick to the > "character strings are character sequences" concept, and keep the > UTF-16 surrogate issue where it belongs: in the codecs. I agree. But I'd add that if different people really need different performance/simplicity trade-offs then maybe we need multiple variants of the Unicode object. But please don't cut those of us who value simplicity off from the option of strings that work entirely in terms of logical characters (code points) and not physical representation units. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From guido@digicool.com Mon Jun 25 21:08:52 2001 From: guido@digicool.com (Guido van Rossum) Date: Mon, 25 Jun 2001 16:08:52 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 15:01:43 EDT." 
<15159.35479.42093.828285@cymru.basistech.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> <15159.29012.266722.112773@cymru.basistech.com> <200106251816.f5PIGev08808@odiug.digicool.com> <15159.30780.1143.760653@cymru.basistech.com> <200106251842.f5PIgOe09018@odiug.digicool.com> <15159.32513.611214.399097@cymru.basistech.com> <200106251922.f5PJMwm09492@odiug.digicool.com> <15159.35479.42093.828285@cymru.basistech.com> Message-ID: <200106252008.f5PK8q109630@odiug.digicool.com> > I must admit that I wasn't aware of the "\U00020000" notation. I still > think it should limit itself to 6 digits, not 8. Too late -- It's some kind of standard already (maybe borrowed from Java?). > I understand O(n) and O(1) perfectly well. My point is that you do not > have to scan the entire string when doing this indexing. You only need > to look at most one storage unit on either side of the index. We're > only concerned here with transparently handling surrogates when the > underlying representation is UTF-16. And that's where your proposal simply doesn't work. If the storage units are all 16 bits, and you want the index to count in characters, you can't know where in a megabyte-long string to start looking for character 1,000,000: you have to iterate over the storage units from the beginning until you have counted 1,000,000 characters. If there were no surrogates, that's 1,000,000 storage units from the beginning; if all characters happened to be surrogates, it would be 2,000,000 storage units. If there are m surrogates between character 0 and character n, character n starts at storage unit offset n+m; the only way to determine m is a brute-force O(n) search. > > Note that in your above example, char(foo, 2) would not be u'a' but > > would be u'\u0000', and char(foo, 3) would be u'a'. > > My example above presumes that indices into the string refer to > characters, not storage units, and that UTF-16 is being used > transparently internally. So in my world, evaluating > > foo = u"\u4e00\U00020000a" > > would treat foo[1] as u'\U00020000' and foo[2] as u'a'. > > > So I still think you haven't thought this out as much as you believe. > > As I said, I have no belief that this is thought out. I'm merely > stating what I believe the observable behavior should be. So explain once more how the observable behavior could be O(1). --Guido van Rossum (home page: http://www.python.org/~guido/) From tree@basistech.com Mon Jun 25 20:33:35 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 15:33:35 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates?
In-Reply-To: <200106252008.f5PK8q109630@odiug.digicool.com> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> <15159.29012.266722.112773@cymru.basistech.com> <200106251816.f5PIGev08808@odiug.digicool.com> <15159.30780.1143.760653@cymru.basistech.com> <200106251842.f5PIgOe09018@odiug.digicool.com> <15159.32513.611214.399097@cymru.basistech.com> <200106251922.f5PJMwm09492@odiug.digicool.com> <15159.35479.42093.828285@cymru.basistech.com> <200106252008.f5PK8q109630@odiug.digicool.com> Message-ID: <15159.37391.172601.161556@cymru.basistech.com> Guido van Rossum writes: > And that's where your proposal simply doesn't work. If the storage > units are all 16 bits, and you want the index to count in characters, > you can't know where in a megabyte-long string to start looking for > character 1,000,000: you have to iterate over the storage units from > the beginning until you have counted 1,000,000 characters. If there > were no surrogates, that's 1,000,000 storage units from the beginning; > if all characters happened to be surrogates, it would be 2,000,000 > storage units. If there are m surrogates between character 0 and > character n, character n starts at storage unit offset n+m; the only > way to determine m is a brute-force O(n) search. Bing, the light goes on. Of course. "Never mind." :-) -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From fredrik@pythonware.com Mon Jun 25 21:39:14 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Mon, 25 Jun 2001 22:39:14 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> Message-ID: <012e01c0fdb6$ea4e9e70$4ffa42d5@hagrid> I wrote: > SRE and the unicode databases (me again) should also work > pretty much out of the box.
a 32-bit version SRE works as expected, at least:

>>> a = array.array("i", map(ord, "hello"))
>>> m = sre.search("l+", a)
>>> m

>>> m.group(0)
array('i', [108, 108])

the DLL size is identical, and the performance is roughly the same. Cheers /F From mal@lemburg.com Mon Jun 25 21:43:55 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 22:43:55 +0200 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? References: <200106251434.KAA20168@unicode.org> Message-ID: <3B37A28B.8445BDF7@lemburg.com> Rick McGowan wrote: > > Gaute B Strokkenes wrote... > > > [I'm cc:-ing the unicode list to make sure that I've gotten my > > terminology right, and to solicit comments > > Interesting... I just started looking at Python the other day, once I > discovered it has such nice built-in Unicode support. > > If Python is explicitly storing the stuff as UTF-16 in u"" strings, then > slicing operations certainly should be acting on units of the backing > store, just as for ASCII "character" strings. In that case, in order for > every unit to be addressable, it should allow breaking up of surrogate > pairs. (Apple's Cocoa environment strings work the same way with > "ranges".) There should be another operation, or several, that slice up > strings based on other kinds of text element boundaries. For example, a > "slice on character boundaries" that would always shift the range to > accommodate surrogate pairs -- as a separate operation. > > The low-level routines in Python, like slicing with absolute locations, > shouldn't presume to know about the encoding, only about the UNITS that are > in the "array". Exactly my opinion. Do you have references which we could look at to determine which of these boundary kinds would actually be useful in daily programming ? -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Mon Jun 25 21:52:54 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 22:52:54 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> <012e01c0fdb6$ea4e9e70$4ffa42d5@hagrid> Message-ID: <3B37A4A6.2B5D068A@lemburg.com> Fredrik Lundh wrote:
>
> I wrote:
> > SRE and the unicode databases (me again) should also work
> > pretty much out of the box.
>
> a 32-bit version SRE works as expected, at least:
>
> >>> a = array.array("i", map(ord, "hello"))
> >>> m = sre.search("l+", a)
> >>> m
>
> >>> m.group(0)
> array('i', [108, 108])
>
> the DLL size is identical, and the performance is roughly the
> same.
That's good to know, but Guido was asking about supporting both UTF-16 and UCS-4 by means of a configure switch -- supporting this kind of dual approach is what I consider hard to maintain and implement. Dealing only with UTF-16 or only with UCS-4 would be much less work, and this is what I am advertising (stick with UTF-16 for the next few years and then maybe switch over to UCS-4; note that this will cause an incompatibility due to u[i] referencing code units which then change). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tim@digicool.com Mon Jun 25 22:12:42 2001 From: tim@digicool.com (Tim Peters) Date: Mon, 25 Jun 2001 17:12:42 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106252008.f5PK8q109630@odiug.digicool.com> Message-ID: [Tom Emerson] > I must admit that I wasn't aware of the "\U00020000" notation. I still > think it should limit itself to 6 digits, not 8. [Guido] > Too late -- It's some kind of standard already (maybe borrowed > from Java?). We borrowed \U12345678 notation from the current ISO/ANSI C standard ("C99"). A space with 2**20 characters isn't going to last either -- and unlike the Unicode folks, X3J11 didn't have any reason to indulge wishful thinking on this point. From tim@digicool.com Mon Jun 25 22:22:31 2001 From: tim@digicool.com (Tim Peters) Date: Mon, 25 Jun 2001 17:22:31 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <15159.37391.172601.161556@cymru.basistech.com> Message-ID: My understanding is that UTF-16 (like UTF-8 in this respect) was deliberately designed so that given a random pointer into the middle of a contiguous vector of encodings, it's indeed O(1) to find the start of the nearest *character* going either forwards or backwards. "The right way" to solve the character (not binary blob) indexing problem is to add a search finger to the string, a pair mapping "the last" character index asked for to the address of the start of its encoding. Since string traversal generally moves ahead-- or back --just one character at a time, the point in the first paragraph assures that traversing a string with N characters, in whole, takes O(N) time overall. It's not as simple as base + offset, but requires no more than a few range compares (plus updating the finger) per indexing operation. From fredrik@pythonware.com Mon Jun 25 22:43:34 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Mon, 25 Jun 2001 23:43:34 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: Message-ID: <001d01c0fdbf$e37720f0$4ffa42d5@hagrid> Tim Peters wrote: > "The right way" to solve the character (not binary blob) indexing problem is > to add a search finger to the string, a pair mapping "the last" character > index asked for to the address of the start of its encoding. Since string > traversal generally moves ahead-- or back --just one character at a time, > the point in the first paragraph assures that traversing a string with N > characters, in whole, takes O(N) time overall. It's not as simple as base + > offset, but requires no more than a few range compares (plus updating the > finger) per indexing operation. plus the time it takes to acquire and release a thread lock for each character...
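To illustrate the search-finger idea in code, here is a toy Python sketch (purely illustrative -- the class and method names are made up, and Python's unicode object implements nothing like this today):

    class FingeredString:
        # Cache the last (character index, code unit offset) pair so
        # that mostly-sequential character access over a UTF-16 buffer
        # is amortized O(1) instead of O(n) per access.
        def __init__(self, units):
            self.units = units      # list of 16-bit code unit values
            self.finger = (0, 0)    # (character index, unit offset)
        def unit_offset(self, n):
            # Find the unit offset of character n, walking from the
            # finger rather than from the beginning of the buffer.
            ci, ui = self.finger
            while ci < n:           # walk forward one character at a time
                if 0xD800 <= self.units[ui] <= 0xDBFF:
                    ui = ui + 2     # surrogate pair: two units
                else:
                    ui = ui + 1
                ci = ci + 1
            while ci > n:           # or walk backward
                ui = ui - 1
                if 0xDC00 <= self.units[ui] <= 0xDFFF:
                    ui = ui - 1     # stepped onto a low surrogate
                ci = ci - 1
            self.finger = (n, ui)
            return ui

    # e.g. FingeredString([0x0041, 0xD800, 0xDC00, 0x0042]).unit_offset(2) == 3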
From tim@digicool.com Mon Jun 25 23:11:21 2001 From: tim@digicool.com (Tim Peters) Date: Mon, 25 Jun 2001 18:11:21 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <001d01c0fdbf$e37720f0$4ffa42d5@hagrid> Message-ID: [Fredrik Lundh] > plus the time it takes to acquire and release a thread lock > for each character... Eh? Python code runs under the protection of the global interpreter lock. There are no instances of Py_BEGIN_ALLOW_THREADS in any of the Unicode or regexp C support code now -- but you know that, so I must be missing your point. Or you're just feeling contrary. From mal@lemburg.com Mon Jun 25 21:05:36 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 22:05:36 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> Message-ID: <3B379990.581C50EE@lemburg.com> Guido van Rossum wrote: > > > That's because len(u) has nothing to do with the number of > > characters in the string, it only counts the code units (Py_UNICODEs) > > which are used to represent characters. The same is true for normal > > strings, e.g. UTF-8 can use between 1-4 code units (bytes in this > > case) for a single code point and in Unicode you can create characters > > by combining code points. > > Total agreement. > > > As Mark Davis pointed out: > > > > """In most people's experience, it is best to leave the low level interfaces > > with indices in terms of code units, then supply some utility routines that > > tell you information about code points. The most useful are: > > > > - given a string and an index into that string, how many code points are > > before it? > > - given a string and a number of code points, what is the lowest index that > > contains them? > > I understand the first and the third, but what is this one? Is it a > search? Right. The difference to .find(s) is that it would return a code point index (which can differ from the code unit index). > > - given a string and an index into that string, is the index on a code point > > boundary? > > """ > > > > Python could use some more Unicode methods to answer these > > questions. > > Agreed (see my other post responding to Tom Emerson). > > > > > Python currently only has minimal support for surrogates, so > > > > purists would say that we support UCS-2. However, we deliberately > > > > chose this path to be able to upgrade to UTF-16 at some later > > > > point in time and it seems that this time has now come. > > > > > > How hard would it be to also change the party line about what the > > > encoding used is based on whether we use 2 or 4 bytes? We could even > > > give three choices: UCS-2 (current situation, no surrogates), UTF-16 > > > (16-bit items with some surrogate support) or UCS-4 (32-bit items)? > > > > Ehm... what are you getting at here ?
> Earlier on you said it would be hard to offer a config-time choice > between UTF-16 and UCS-4. I'm still trying to figure out why. Here's an example of how this change affects semantics:

u = u"\U00010000"
# UTF-16: u[0] -> u"\uD800" (the high half of the surrogate pair)
# UCS-4:  u[0] -> u"\U00010000"

> Given > the additional stuff I've learned now about surrogates, it doesn't > make sense to choose between UCS-2 and UTF-16; the surrogate handling > can always be present. Right. > So let me rephrase the question. How hard would it be to offer the > config-time choice between UCS-4 and UTF-16? It would mean lots of #ifdefs and a change in semantics. > If it's hard, why? It's mostly hard due to the fact that indexing, sizes and memory management will be different for the two (e.g. dynamic resizing vs. one time allocation). Codecs will have to pay attention to the difference too since UCS-4 would not need surrogates while UTF-16 requires these. > (I've heard you say that it's hard before, but I still don't > understand the problem.) > > > > > > I'd be happy to make the configuration choice between UTF-16 and > > > > > UCS-4, if that's doable. > > > > > > > > Not easily, I'm afraid. > > > > > > Can you explain why this is not easy? > > > > Because choosing whether or not to support surrogates is a > > fundamental choice which affects far more than just the way you > > access storage. Surrogates introduce variable width characters: > > some characters use two or more Py_UNICODE code units while (most) > > others only use one. > > > > Remember when we discussed which internal format to use or > > which default encoding to apply ? We ruled out UTF-8 because > > it fails badly when it comes to slicing, concatenation, indexing, > > etc. > > > > UTF-16 is much less painful as most code points only take > > up a single code unit, but it still introduces a break in concept. > > Hm, it sounds like you have the same problem that I had with Tom > Emerson's suggestion to support Unicode before he clarified it. No, I do understand what you mean. The "break in concept" refers to the different ways you have to deal with variable and fixed width representations internally (as I tried to briefly explain above). > If we make a clean distinction between characters and storage units, > and if we stick to the rule that u[i] accesses a storage unit, what's the > conceptual difficulty? There might be a separate method u.char(i) > which returns the *character* starting u[i:], or "" if u[i] is a > low-surrogate. That could be all we need to support surrogates. How > bad is that? (These could even continue to be supported when the > storage uses UCS-4; there, u.char(i) would always be u[i], until > someone comes up with a 64-bit character set. ;-) Right... that should solve the "problem". > > > I buy that as an argument for supporting UTF-16, but not for cutting > > > off the road to supporting UCS-4 for those users who would like to opt > > > in. > > > > That was not my point. I just wanted to point out how well UTF-16 > > is being accepted out there and that we are in good company by > > moving from UCS-2 to UTF-16 as the current internal format. > > Good! I agree. > > > I don't want to cut off the road to UCS-4, I just want to make > > clear that UTF-16 is a good choice and one which will last at > > least some more years. We can then always decide to move on > > to UCS-4 for the internal storage format. > > Agreed again.
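For reference, the arithmetic behind that \U00010000 example is fixed by the UTF-16 design. As a sketch (the function names here are made up, not a proposed API):

    def make_surrogate_pair(cp):
        # Map a code point in U+10000..U+10FFFF to its UTF-16 pair.
        assert 0x10000 <= cp <= 0x10FFFF
        cp = cp - 0x10000
        return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

    def join_surrogate_pair(high, low):
        # Inverse: recombine a high/low surrogate pair into a code point.
        assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

    # make_surrogate_pair(0x10000) == (0xD800, 0xDC00)
    # make_surrogate_pair(0x10FFFF) == (0xDBFF, 0xDFFF)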
-- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 00:53:49 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 01:53:49 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <3B376E68.505BF6E@lemburg.com> (mal@lemburg.com) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> Message-ID: <200106252353.f5PNrnO01574@mira.informatik.hu-berlin.de> > > Oh. I didn't know. How does it differ from Unicode? What's the user > > acceptance? > > http://www.unicode.org/unicode/consortium/memblogo.html says it all. Mmh. http://www.iso.ch/iso/en/aboutiso/isomembers/MemberCountryList.MemberCountryList Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 01:07:43 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 02:07:43 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106251842.f5PIgOe09018@odiug.digicool.com> (message from Guido van Rossum on Mon, 25 Jun 2001 14:42:24 -0400) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> <15159.29012.266722.112773@cymru.basistech.com> <200106251816.f5PIGev08808@odiug.digicool.com> <15159.30780.1143.760653@cymru.basistech.com> <200106251842.f5PIgOe09018@odiug.digicool.com> Message-ID: <200106260007.f5Q07hV01625@mira.informatik.hu-berlin.de> > 16-bit char, a high surrogate, a low surrogate, and another regular > 16-bit char. You're saying that u[0] should return the first > character, u[1] the entire surrogate (so it would still be a 2-item > string), u[2] I gues the empty string, and u[3] the final regular > char. > > IMO that would break an important invariant of string-like objects, > namely that len(s[i]) == 1. No, it wouldn't. s[1] would return a string containing 2 Py_UNICODE values, but len(s[1]) would still be 1. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 00:47:18 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. 
Loewis) Date: Tue, 26 Jun 2001 01:47:18 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106251620.f5PGKNP08234@odiug.digicool.com> (message from Guido van Rossum on Mon, 25 Jun 2001 12:20:23 -0400) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> Message-ID: <200106252347.f5PNlIT01439@mira.informatik.hu-berlin.de> > If this is the only thing that keeps us from having a configuration > OPTION to make Py_UNICODE 32-bit wide, I'd say let's fix it. I think there are numerous places which assume sizeof(Py_UNICODE)==2, including, but not limited to, sre. > But UTF-16 vs. UCS-4 is not an implementation detail! > > If we store 4 bytes per character, we should treat surrogates > differently. I don't know where those would be converted -- probably > in the UTF-16 to UCS-4 codec. Indeed, they would never appear in a 32-bit Unicode string. > > This is different: ISO 10646 is a competing standard, not just a > > different encoding. > > Oh. I didn't know. How does it differ from Unicode? What's the user > acceptance? To my knowledge, it only differs in minor points, which is only caused by different release dates (at one time, Unicode is behind, at another time, the ISO standard). End users typically view it as Unicode, whereas standards bodies and agencies typically view it as ISO 10646 (e.g. C, C++, and Posix all refer to ISO 10646, Microsoft refers to Unicode). Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 01:18:55 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 02:18:55 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <15159.36453.486716.705433@cymru.basistech.com> (message from Tom Emerson on Mon, 25 Jun 2001 15:17:57 -0400) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> <15159.36453.486716.705433@cymru.basistech.com> Message-ID: <200106260018.f5Q0ItN01657@mira.informatik.hu-berlin.de> > Fredrik Lundh writes: > > I'm sceptical -- I see very little reason to maintain that distinction. > > let's use either UCS-2 or UCS-4 for the internal storage, stick to the > > "character strings are character sequences" concept, and keep the > > UTF-16 surrogate issue where it belongs: in the codecs. > > How then is u"\U00200000" represented internally if you use UCS-2 as > the internal storage representation? 
I think the obvious answer is: It is not supported. It will give an exception when you try to convert an UTF-8 or UTF-16 string that has such a character, it will be an error if you pass a surrogate to unichr, or in a \u literal. That would simplify a lot, IMO, and only require support for a 32-bit Py_UNICODE. Of course, that would have to be done as a per-platform choice, to avoid binary-incompatible extension modules. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 01:16:08 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 02:16:08 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <15159.35479.42093.828285@cymru.basistech.com> (message from Tom Emerson on Mon, 25 Jun 2001 15:01:43 -0400) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> <15159.29012.266722.112773@cymru.basistech.com> <200106251816.f5PIGev08808@odiug.digicool.com> <15159.30780.1143.760653@cymru.basistech.com> <200106251842.f5PIgOe09018@odiug.digicool.com> <15159.32513.611214.399097@cymru.basistech.com> <200106251922.f5PJMwm09492@odiug.digicool.com> <15159.35479.42093.828285@cymru.basistech.com> Message-ID: <200106260016.f5Q0G8x01656@mira.informatik.hu-berlin.de> > The maximum code-point value for a Unicode character is U+10FFFF, > hence the suggested notation above (I should have noted it as > such). If Python is going to implement full support for ISO 10646 then > the full 32-bit representation (and 8-digit \U escape) is > appropriate. Correct me if I'm wrong, but doesn't some 10646 amendment limit the code range to 10FFFF also (i.e. to only a part of group 0)? > If you limit the maximum size of the character escape so that the > scanner catches improper character sizes you save grief for the > end-user, IMHO. I think Python should still use the \UXXXXXXXX notation, as does C and C++ - no matter that the first two XX will always be 00. > I understand O(n) and O(1) perfectly well. My point is that you do not > have to scan the entire string when doing this indexing. You only need > to look at most one storage unit on either side of the index. We're > only concerned here with transparently handling surrogates when the > underlying representation is UTF-16. Please think carefully. What if you are indexing index 20, but you have a surrogate at words 10 and 11? Then you should take word 21, instead of word 20, no? How are you going to find that out in constant time? Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 00:40:01 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 01:40:01 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? 
In-Reply-To: <200106251544.f5PFiWe07979@odiug.digicool.com> (message from Guido van Rossum on Mon, 25 Jun 2001 11:44:32 -0400) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> Message-ID: <200106252340.f5PNe1E01408@mira.informatik.hu-berlin.de> > You can believe what *should* happen all you want, but we're not going > to change this soon. u[i] has to be independent of the length of u > and the value of i. Not even if a patch is submitted that puts a bit into Unicode objects which have surrogates in them, to transparently implement indexing and length differently for them? Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 00:58:17 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 01:58:17 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106251742.f5PHgTW08532@odiug.digicool.com> (message from Guido van Rossum on Mon, 25 Jun 2001 13:42:29 -0400) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> Message-ID: <200106252358.f5PNwHg01594@mira.informatik.hu-berlin.de> > But unless I misunderstand what it *is* that you are suggesting, the > O(1) indexing property can't be retained with your suggestion, and > that's out of the question. The O(1) indexing property can be retained for strings not containing surrogates, while still counting surrogate pairs as one character. Unfortunately, this will require an additional word per unicode object, unless I'm allowed to use a byte past the terminating zero (which will only slightly reduce the memory overhead). If somebody can find a spare bit :-) Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 00:26:51 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 01:26:51 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? 
In-Reply-To: <200106251422.f5PEMel07612@odiug.digicool.com> (message from Guido van Rossum on Mon, 25 Jun 2001 10:22:40 -0400) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> Message-ID: <200106252326.f5PNQp401376@mira.informatik.hu-berlin.de> > I don't think switching to a 32-bit character is the right thing to do > for us (although I think it should be easier than it currently is -- > changing to define Py_UNICODE as a 32-bit unsigned int should be all > that it takes, which is currently not the case). > > I'm all for taking the lazy approach and letting applications that > need surrogate support do it themselves, at the application level. That, of course, means that you cast in stone the 16-bit Py_UNICODE. In a 32-bit Py_UNICODE, unichr(0xd800) would surely be illegal, wouldn't it? So an application that explicitly creates surrogates using unichr (how else would it do that?) won't be portable to a 32-bit Py_UNICODE. Would you accept patches that deal with surrogate pairs transparently throughout the implementation, in the sense of mapping them to ordinals above 0x10000? Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 00:32:25 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 01:32:25 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106251443.f5PEh2p07753@odiug.digicool.com> (message from Guido van Rossum on Mon, 25 Jun 2001 10:43:02 -0400) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> Message-ID: <200106252332.f5PNWPh01407@mira.informatik.hu-berlin.de> > Does that make sense? > > I know I am hindered by a lack of understanding of Unicode > hairsplitting, angels-on-a-pin-dancing details; if I'm missing > something, it's likely that many other people don't know the details > either, so an explanation would be much appreciated! I don't think you are missing any detail; I guess you are fully aware that you are throwing one of Unicode's biggest strengths out of the window :-) namely the possibility to index characters, not the internal representation. As for Unicode hairsplitting: I think combining characters *are* different in that respect; they are code points on their own, even though they might have a zero-width representation. Also, normalization forms can help with combining characters; they don't help with surrogates.
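The transparent mapping Martin asks about here is the fixed UTF-16 pairing arithmetic. A minimal sketch in Python -- the function names are illustrative, not taken from any actual patch:

    def split_surrogates(cp):
        # split a non-BMP code point into a (high, low) surrogate pair
        assert 0x10000 <= cp <= 0x10FFFF
        cp = cp - 0x10000
        return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

    def join_surrogates(high, low):
        # combine a surrogate pair back into a single code point
        assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

    assert split_surrogates(0x10000) == (0xD800, 0xDC00)
    assert join_surrogates(0xDBFF, 0xDFFF) == 0x10FFFF

Since the mapping is a bijection between surrogate pairs and the range 0x10000-0x10FFFF, an implementation can convert in either direction without any extra state; the hard part of such a patch is the indexing bookkeeping, not the arithmetic.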
Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 01:21:56 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 02:21:56 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <3B37992A.40CD1CF2@ActiveState.com> (message from Paul Prescod on Mon, 25 Jun 2001 13:03:54 -0700) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> <3B37992A.40CD1CF2@ActiveState.com> Message-ID: <200106260021.f5Q0Luo01684@mira.informatik.hu-berlin.de> > I agree. But I'd add that if different people really need different > performance/simplicity trade-offs then maybe we need multiple variants > of the Unicode object. The question really is: Those people that require a 16-bit Py_UNICODE, would they ever need characters outside the BMP? My guess is no, so Fredrik's proposal sounds good to me. Regards, Martin From paulp@ActiveState.com Tue Jun 26 01:43:05 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Mon, 25 Jun 2001 17:43:05 -0700 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> <3B37992A.40CD1CF2@ActiveState.com> <200106260021.f5Q0Luo01684@mira.informatik.hu-berlin.de> Message-ID: <3B37DA99.31002323@ActiveState.com> "Martin v. Loewis" wrote: > > > I agree. But I'd add that if different people really need different > > performance/simplicity trade-offs then maybe we need multiple variants > > of the Unicode object. > > The question really is: Those people that require a 16-bit Py_UNICODE, > would they ever need characters outside the BMP? Hard to tell. People usually want to have their cake and eat it too. i.e. I want the performance of 16-bit Py_UNICODE but I want to support the occasional non-BMP character that happens to show up in a document. > My guess is no, so Fredrik's proposal sounds good to me. I'm not clear on what Fredrik's proposal is. He says: "let's use either UCS-2 or UCS-4 for the internal storage". Is he saying: 1. let's choose one or the other today 2. let's make it a compile-time switch 3. make it a runtime option I could live with 1. 
for a while longer...I haven't heard of a real user complaint about our current model. The longer we put it off, the more acceptable UCS-4 is. I wouldn't be thrilled with 2., because it makes Python code harder to move between machines (depends on your build options!) 3 would be okay if it is handled intelligently. Any of these is better to me than exposing the details of UTF-16 to the Python programmer in our Unicode type! -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From rick@unicode.org Tue Jun 26 01:50:18 2001 From: rick@unicode.org (Rick McGowan) Date: Mon, 25 Jun 2001 17:50:18 -0700 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <15159.35479.42093.828285@cymru.basistech.com> (message from Tom Emerson on Mon, 25 Jun 2001 15:01:43 -0400) Message-ID: <200106252245.SAA04499@unicode.org> > Correct me if I'm wrong, but doesn't some 10646 amendment limit the > code range to 10FFFF also (i.e. to only a part of group 0)? Yes. It's recent. Rick From rick@unicode.org Tue Jun 26 01:51:59 2001 From: rick@unicode.org (Rick McGowan) Date: Mon, 25 Jun 2001 17:51:59 -0700 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <3B37992A.40CD1CF2@ActiveState.com> (message from Paul Prescod on Mon, 25 Jun 2001 13:03:54 -0700) Message-ID: <200106252246.SAA04521@unicode.org> > The question really is: Those people that require a 16-bit Py_UNICODE, > would they ever need characters outside the BMP? Yes. More and more stuff is going outside the BMP in the future. Probably will be lots of procurement requirements eventually that need Plane 2 Han characters... But everyone of course likes the space savings of UTF-16. Rick From rick@unicode.org Tue Jun 26 01:59:58 2001 From: rick@unicode.org (Rick McGowan) Date: Mon, 25 Jun 2001 17:59:58 -0700 Subject: [I18n-sig] How does Python Unicode treat surrogates? Message-ID: <200106252254.SAA04620@unicode.org> > 1. let's choose one or the other today > 2. let's make it a compile-time switch > 3. make it a runtime option I definitely think Python should make a decision at the language level. But with the OO model, you can hide a lot of details behind string objects and accessors... Runtime options on such things are bad. This is one of the things Unicode is designed as an antidote for: the "choose char set at runtime" kind of i18n model. Compile time switch is poor because you do end up with two real models in the world. Could affect interoperability a lot, and byte-code stuff might not be as easily portable. (I don't know enough about the implementation or the language to guess, by the way.) Rick From tree@basistech.com Tue Jun 26 01:57:58 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 20:57:58 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates?
In-Reply-To: <200106260018.f5Q0ItN01657@mira.informatik.hu-berlin.de> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> <15159.36453.486716.705433@cymru.basistech.com> <200106260018.f5Q0ItN01657@mira.informatik.hu-berlin.de> Message-ID: <15159.56854.539327.291739@cymru.basistech.com> Martin v. Loewis writes: > > How then is u"\U00200000" represented internally if you use UCS-2 as > > the internal storage representation? > > I think the obvious answer is: It is not supported. It will give an > exception when you try to convert an UTF-8 or UTF-16 string that has > such a character, it will be an error if you pass a surrogate to > unichr, or in a \u literal. So the characters added in Unicode 3.1 in planes 1, 2, and 14 would not be representable in Python? Seems a bit draconian to make your life easier. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From tree@basistech.com Tue Jun 26 02:01:58 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 21:01:58 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106252347.f5PNlIT01439@mira.informatik.hu-berlin.de> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <200106252347.f5PNlIT01439@mira.informatik.hu-berlin.de> Message-ID: <15159.57094.857439.860222@cymru.basistech.com> Martin v. Loewis writes: > To my knowledge, it only differs in minor points, which is only caused > by different release dates (at one time, Unicode is behind, at another > time, the ISO standard). The Unicode Technical Committee and WG2 are striving to make the two standards move in lock step as much as possible. Unfortunately the process of adding to an ISO standard is much more involved and time consuming than that required for Unicode. > End users typically view it as Unicode, whereas standards bodies and > agencies typically view it as ISO 10646 (e.g. C, C++, and Posix all > refer to ISO 10646, Microsoft refers to Unicode). The standards are code-point for code-point compatible. The primary difference is that Unicode provides property information that 10646 does not, and the UTC strives to standardize mapping tables for new encodings (e.g., GB 18030 and JIS X 0213-2000). -- Tom Emerson Basis Technology Corp. Sr. 
Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From tree@basistech.com Tue Jun 26 02:41:51 2001 From: tree@basistech.com (Tom Emerson) Date: Mon, 25 Jun 2001 21:41:51 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106252315.f5PNFw601373@mira.informatik.hu-berlin.de> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106252315.f5PNFw601373@mira.informatik.hu-berlin.de> Message-ID: <15159.59487.398979.804494@cymru.basistech.com> Martin v. Loewis writes: > So nothing will happen until enough Chinese users complain. I don't > know whether you count as Chinese for these purposes :-) Perhaps not. :-) But the Chinese aren't the only ones to worry about. The Japanese also have characters being added outside the BMP, and Ruby holds sway in Japan... > P.S. The real issue IMO is display: If there are fonts supporting > these characters, people will want to write programs that make use of > the fonts. Until nobody can actually display such text, nobody will > request that indexing works reasonable. True to a point. Fonts do exist for these characters. And I end up referencing them even when I don't have fonts. Many Chinese organizations are worried more about making sure all their characters are encoded, and less on being able to display them adequately. Indeed, the HKSAR and CUHK are working on a project whereby rare characters are also encoded using the ideographic description characters. > P.P.S. Of course, if we wait until users actually use surrogates, it > is too late to change the indexing - that would likely break people's > code. All too true. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From mal@lemburg.com Mon Jun 25 20:18:13 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 21:18:13 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106251422.f5PEMel07612@odiug.digicool.com> <15159.17083.978971.519453@cymru.basistech.com> <200106251443.f5PEh2p07753@odiug.digicool.com> <15159.19546.226155.383490@cymru.basistech.com> <200106251544.f5PFiWe07979@odiug.digicool.com> <15159.22514.976923.894201@cymru.basistech.com> <200106251742.f5PHgTW08532@odiug.digicool.com> <15159.29012.266722.112773@cymru.basistech.com> <200106251816.f5PIGev08808@odiug.digicool.com> <15159.30780.1143.760653@cymru.basistech.com> <200106251842.f5PIgOe09018@odiug.digicool.com> <15159.32513.611214.399097@cymru.basistech.com> Message-ID: <3B378E75.740FFB52@lemburg.com> Tom Emerson wrote: > ... 
> No, but we may as well stop going around on this, since my views are > not going to happen. > > In my view the string 'u' is a Unicode string. I don't care what sits > underneath: 16-bits or 32-bits I don't care. As far as I'm concerned > the string has three characters in it: > > foo = u"\u4e00\u020000a" > > means that foo[0] == u"\u4e00", foo[1] == u"\u020000", and foo[2] == > u"a". > > The fact that this is represented internally different ways shouldn't > matter to the user who only cares about characters. While I agree with Guido that foo[i] should return the code unit and not the code point, I think that providing a few more Unicode methods (like the ones Mark mentioned) would go a long way in providing a compromise, e.g. foo.codepoint(1) would then return u"\u020000", foo.codelen() would return 3, etc. Alternatively we could of course also provide this functionality in the form of functions in a separate module (with the recent controversies over methods vs. functions I am not sure anymore what the general guideline is for Python... string methods at least don't seem to be too popular around here anymore; OK, just rambling ;-). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 00:22:27 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 01:22:27 +0200 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? In-Reply-To: <9F2D83017589D211BD1000805FA70CA703B139D9@ntxmel03.cmutual.com.au> (JMachin@Colonial.com.au) References: <9F2D83017589D211BD1000805FA70CA703B139D9@ntxmel03.cmutual.com.au> Message-ID: <200106252322.f5PNMRi01375@mira.informatik.hu-berlin.de> > Do we permit such a sequence to be held internally as a "Unicode string"? > Is u"\udc00" legal in source code or should Python throw a syntax error? I think it shouldn't. If we disallow it, we should a) simultaneously disallow unichr(0xDC00) b) allow \U00010000, and unichr(0x10000), which would both give strings with two Py_UNICODE values inside (leaving out the question what len() of such a string would give). > We *do* need to consider UTF encodings, because Unicode *expressly* > allows decoding UTF sequences that become unpaired surrogates, or > other "not 100% valid" scalars such as 0xffff and 0xfffe. So, given > that Python supports Unicode, not ISO 10646, we must IMO permit such > sequences in our internal representation. I think the Unicode standard is in error here (or somebody is misinterpreting it). It has happened before: Unicode 2.0 strongly believed that the internal representation of a unicode character MUST be 16-bit, and found some funny wording to mark a 32-bit wchar_t as not strictly compliant, but acceptable. Unicode 3.1 has finally revised this wrong view. > It follows that we should stop worrying about these irregular values > -- it's less programming that way. Unicode 3.1 will create enough > extra programming as it is, because we now have variable-length > characters again -- just what Unicode was going to save us from :-( We wouldn't if we could widen Py_UNICODE to 32 bits... Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 00:15:58 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 01:15:58 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates?
In-Reply-To: <15159.14391.718891.645489@cymru.basistech.com> (message from Tom Emerson on Mon, 25 Jun 2001 09:10:15 -0400) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> Message-ID: <200106252315.f5PNFw601373@mira.informatik.hu-berlin.de> > With the release of the Plane 2 ideographic extensions in Unicode 3.1 > there are two options available: include surrogate support via UTF-16, > which means dealing with multibyte (really multi"word") characters, or > switching to UTF-32, allowing characters outside Plane 0 to be > accessed uniformly. > > Note that this is a real issue: the Hong Kong Supplementary Character > Set includes characters contained in Plane 2 when mapped to Unicode > 3.1. The most likely solution, of course, for the time to come, is: Ignore characters outside the BMP. IMO, Tim Peters's view is right: If the internal representation uses surrogates, indexing should ignore this, and count a surrogate pair as one character. This is not going to happen unless somebody comes up with an efficient implementation. The obvious alternative solution is to use a 32-bit Py_UNICODE, which, given Guido's comment, is also not going to happen. So nothing will happen until enough Chinese users complain. I don't know whether you count as Chinese for these purposes :-) Regards, Martin P.S. The real issue IMO is display: If there are fonts supporting these characters, people will want to write programs that make use of the fonts. As long as nobody can actually display such text, nobody will request that indexing work reasonably. P.P.S. Of course, if we wait until users actually use surrogates, it is too late to change the indexing - that would likely break people's code. From gs234@cam.ac.uk Tue Jun 26 04:06:07 2001 From: gs234@cam.ac.uk (Gaute B Strokkenes) Date: 26 Jun 2001 04:06:07 +0100 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? In-Reply-To: <9F2D83017589D211BD1000805FA70CA703B139D8@ntxmel03.cmutual.com.au> ("Machin, John"'s message of "Mon, 25 Jun 2001 22:33:50 +1000") References: <9F2D83017589D211BD1000805FA70CA703B139D8@ntxmel03.cmutual.com.au> Message-ID: <4ag0coey8w.fsf@kern.srcf.societies.cam.ac.uk> On Mon, 25 Jun 2001, JMachin@Colonial.com.au wrote: > MAL and Gaute, > > Can I please take the middle ground (and risk having both of you > throw things at me)? > > => Lone surrogates are not 'true Unicode char points > in their own right' [MAL] -- they don't represent characters. I think you're misquoting MAL; the "not" was not there in his original statement. > On the other hand, UTF code sequences that would decode into lone > surrogates are not "illegal". Please read clause D29 in section 3.8 > of the Unicode 3.0 standard. This is further clarified by Unicode > 3.1 which expressly lists legal UTF-8 sequences; these encompass > lone surrogates. This is really a different issue.
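For concreteness, the "lone surrogate" cases being argued about can be found mechanically with a well-formedness scan over the code units. A minimal sketch in Python; the function name is mine, not from any standard API:

    def find_lone_surrogates(units):
        # return the indices of unpaired surrogates in a sequence
        # of 16-bit code unit values
        lone = []
        i = 0
        while i < len(units):
            u = units[i]
            if 0xD800 <= u <= 0xDBFF:        # high surrogate...
                if i + 1 < len(units) and 0xDC00 <= units[i+1] <= 0xDFFF:
                    i = i + 2                # ...properly paired; skip both
                    continue
                lone.append(i)               # ...with no low surrogate after it
            elif 0xDC00 <= u <= 0xDFFF:      # low surrogate with no partner
                lone.append(i)
            i = i + 1
        return lone

    # a lone low surrogate at index 3:
    assert find_lone_surrogates([0x61, 0x62, 0x63, 0xDC00, 0x64]) == [3]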
The paragraph states that the various UTFs have the property that they can transform any sequence of scalar values in the range 0 - 0x10FFFF to whatever representation is mandated by the UTF and then back again in a bijective fashion--even when the sequence includes scalars that are not Unicode characters, such as 0xFFFF, 0xFFFE and the various values that are reserved to contain UTF-16 surrogates. Personally, I'm having difficulty seeing how this statement could possibly apply to UTF-16. (For instance, I don't see how it would be possible to encode a sequence of unicode scalar values corresponding to a low and a high surrogate; if you tried to map this back then you would get a single unicode scalar value outside of the BMP). Perhaps someone on the unicode list could elaborate? My personal theory is that this is a vestige of the days when "Unicode" meant "16-bit characters" and all UTFs other than UTF-16 were just hacks that one was supposed to use for compatibility reasons only. Eventually someone realised that 16 bits wasn't going to be enough after all, and so kludges like surrogates were invented. It is instructive in this regard to note how the Unicode 3.0 conformance requirements effectively state that "thou shalt use 16-bit characters"; the paragraph stating that using UCS-4 for the wchar_t type in ISO C (this is what glibc does) is not Unicode conformant is particularly amusing. This was all changed for 3.1. -- Big Gaute http://www.srcf.ucam.org/~gs234/ .. here I am in 53 B.C. and all I want is a dill pickle!! From gs234@cam.ac.uk Tue Jun 26 04:24:27 2001 From: gs234@cam.ac.uk (Gaute B Strokkenes) Date: 26 Jun 2001 04:24:27 +0100 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? In-Reply-To: <200106251620.f5PGKNP08234@odiug.digicool.com> (Guido van Rossum's message of "Mon, 25 Jun 2001 12:20:23 -0400") References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> Message-ID: <4a8zifgbys.fsf@kern.srcf.societies.cam.ac.uk> On Mon, 25 Jun 2001, guido@digicool.com wrote: >> No problem... we can change to 4 byte values too if the world >> agrees on 4 bytes per character. However, 2 bytes or 4 bytes >> is an implementation detail and not part of the Unicode standard >> itself. > > But UTF-16 vs. UCS-4 is not an implementation detail! Sure it is! A given chunk of Unicode data is semantically just a finite sequence of Unicode scalar values. The difference between UTF-16 and UCS-4 is entirely one of how you are arranging bits and bytes to store the same information. The meaning is exactly the same; so it's an implementation detail. A (somewhat far-fetched, but there you are) analogy is this: imagine that you wish to store a true-colour bitmap in memory. You could do this by, say, storing the R, G and B components of a given pixel right next to each other, in that order. Alternatively, you could keep all the R components in one chunk and all the G components in another, or you could store the pixels in a different order. All of this makes no difference to the actual bitmap itself. 
I hope you see what I mean. > If we store 4 bytes per character, we should treat surrogates > differently. I don't know where those would be converted -- > probably in the UTF-16 to UCS-4 codec. An important point here is that the sole raison d'etre of surrogates is to enable one to store the entire 21-bit Unicode character set within the confines of a 16-bit encoding. If you're not dealing with UTF-16, surrogates quite simply do not exist and the only time you have to worry about them is when and if you wish to convert to and from UTF-16. As such the statement "we should treat surrogates differently when storing four bytes per character" is rather imprecise; the whole point is that you don't treat or worry about surrogates at all; except during conversion to and from UTF-16, obviously. -- Big Gaute http://www.srcf.ucam.org/~gs234/ I have nostalgia for the late Sixties! In 1969 I left my laundry with a hippie!! During an unauthorized Tupperware party it was chopped & diced! From tim.one@home.com Tue Jun 26 04:52:24 2001 From: tim.one@home.com (Tim Peters) Date: Mon, 25 Jun 2001 23:52:24 -0400 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? In-Reply-To: <4a8zifgbys.fsf@kern.srcf.societies.cam.ac.uk> Message-ID: [Guido] > But UTF-16 vs. UCS-4 is not an implementation detail! [Gaute B Strokkenes] > Sure it is! A given chunk of Unicode data is semantically just a > finite sequence of Unicode scalar values. The difference between > UTF-16 and UCS-4 is entirely one of how you are arranging bits and > bytes to store the same information. The meaning is exactly the same; > so it's an implementation detail. I don't know what possessed Guido to make that claim, but I'm sure he'll agree after some thought (he must, because you're right ). Something else is bothering me here, though: Python isn't C, or even Java, so a slicing gimmick returning raw encoding bytes (call 'em octets if you must, but they're bytes to me ) favored by Unicode *implementors* is at the wrong level. Unicode *users* can't paste this crap together again efficiently using Python code, because high-volume low-level bit-fiddling is exactly what Python code is worst at. So the idea that u[i] (for a Unicode string u and int i) should ever return meaningless binary blobs at the *Python* level is just astonishing to me: Unicode strings in Python are an end-user feature, not a low-level crutch for Unicode library developers. From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 06:21:35 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 07:21:35 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? 
In-Reply-To: <15159.56854.539327.291739@cymru.basistech.com> (message from Tom Emerson on Mon, 25 Jun 2001 20:57:58 -0400) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> <15159.36453.486716.705433@cymru.basistech.com> <200106260018.f5Q0ItN01657@mira.informatik.hu-berlin.de> <15159.56854.539327.291739@cymru.basistech.com> Message-ID: <200106260521.f5Q5LZK00933@mira.informatik.hu-berlin.de> > Martin v. Loewis writes: > > > How then is u"\U00200000" represented internally if you use UCS-2 as > > > the internal storage representation? > > > > I think the obvious answer is: It is not supported. It will give an > > exception when you try to convert an UTF-8 or UTF-16 string that has > > such a character, it will be an error if you pass a surrogate to > > unichr, or in a \u literal. > > So the characters added in Unicode 3.1 in planes 1, 2, and 14 would > not be representable in Python? Seems a bit draconian to make your > life easier. With Fredrik's solution, you'ld have to rebuild your Python interpreter with a 32-bit Unicode type to represent the characters. With that option, we'ld delegate the decision to administrators and Python distributors. If their users demand support for the additional characters, they will need to consider wasting space. Of course, byte code files should then use UTF-16, to allow some portability of byte code across platforms. If a byte code file contains a plane 2 string literal, it could not be imported into an interpreter who uses UCS-2, just as the corresponding source code import would fail. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 06:26:03 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 07:26:03 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <15159.59487.398979.804494@cymru.basistech.com> (message from Tom Emerson on Mon, 25 Jun 2001 21:41:51 -0400) References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <15159.14391.718891.645489@cymru.basistech.com> <200106252315.f5PNFw601373@mira.informatik.hu-berlin.de> <15159.59487.398979.804494@cymru.basistech.com> Message-ID: <200106260526.f5Q5Q3900934@mira.informatik.hu-berlin.de> > Martin v. Loewis writes: > > So nothing will happen until enough Chinese users complain. I don't > > know whether you count as Chinese for these purposes :-) > > Perhaps not. :-) But the Chinese aren't the only ones to worry > about. 
The Japanese also have characters being added outside the BMP, > and Ruby holds sway in Japan... That's a good point. How does Ruby deal with surrogates? Java JDK 1.4? Perl? Tcl? Windows XP? Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 07:02:51 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 08:02:51 +0200 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? In-Reply-To: <3B37656E.9E09DB1A@lemburg.com> (mal@lemburg.com) References: <9F2D83017589D211BD1000805FA70CA703B139D9@ntxmel03.cmutual.com.au> <3B37656E.9E09DB1A@lemburg.com> Message-ID: <200106260602.f5Q62pg01129@mira.informatik.hu-berlin.de> > > > Say you have a Unicode string which contains the following data: > > > > > > U+0061 U+0062 U+0063 U+DC00 U+0064 U+0065 U+0066 > > > ("a" "b" "c" ? "d" "e" "f") > > > > > > Would you consider this sequence a Unicode string or not ? > > > > I think you are using "Unicode string" with two different meanings here. > > The question is really very simple: is the above correct Unicode > or not ? I think it is not. Looking at Unicode TR 17 (http://www.unicode.org/unicode/reports/tr17/), this is an illegal sequence of code units. Specifically, they give the example - 0xD800 is incomplete in Unicode Unless followed by another 16-bit value of the right form, it is illegal. Now what does it mean that this is an illegal code unit sequence? Looking at Unicode TR 27 (aka Unicode 3.1), we see, for C12 (a) When a process generates data in a Unicode Transformation Format, it shall not emit ill-formed code unit sequences. (b) When a process interprets data in a Unicode Transformation Format, it shall treat illegal code unit sequences as an error condition. (c) A conformant process shall not interpret illegal UTF code unit sequences as characters. So clearly, we shall never emit that Unicode string in a UTF. In another message, you write > FYI, Python currently uses UTF-16 as internal storage format and > also exposes this through its indexing interfaces. Since Python uses UTF-16 as an internal format, Python must not emit above Unicode string into the internal representation, either. Therefore, if Python can represent above sequence of code units, it is not conforming. Regards, Martin From fredrik@pythonware.com Tue Jun 26 07:50:07 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Tue, 26 Jun 2001 08:50:07 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com><200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de><3B3471AF.1311E872@lemburg.com><200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de><3B34F9BD.4FDEFC62@lemburg.com><200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de><3B35CEC6.710243E7@lemburg.com><200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de><3B362E9B.4DC8DD81@lemburg.com><200106251342.f5PDg1q07291@odiug.digicool.com><3B375FB9.91BA4B1E@lemburg.com><200106251620.f5PGKNP08234@odiug.digicool.com><3B376E68.505BF6E@lemburg.com><200106251804.f5PI4D008730@odiug.digicool.com><3B378460.C27CDCDD@lemburg.com><200106251912.f5PJCVD09465@odiug.digicool.com><00f201c0fdb0$ab0fe170$4ffa42d5@hagrid><15159.36453.486716.705433@cymru.basistech.com><200106260018.f5Q0ItN01657@mira.informatik.hu-berlin.de> <15159.56854.539327.291739@cymru.basistech.com> Message-ID: <009d01c0fe0e$566a7af0$4ffa42d5@hagrid> Tom Emerson wrote: > > How then is u"\U00200000" represented internally if you use UCS-2 as > > the internal storage representation? 
> > > > I think the obvious answer is: It is not supported. It will give an > > exception when you try to convert an UTF-8 or UTF-16 string that has > > such a character, it will be an error if you pass a surrogate to > > unichr, or in a \u literal. > > So the characters added in Unicode 3.1 in planes 1, 2, and 14 would > not be representable in Python? Seems a bit draconian to make your > life easier. it is not directly supported in Python 2.0, 2.1, and the current 2.2 codebase. no amount of arguing or wishful thinking will change that. From fredrik@pythonware.com Tue Jun 26 08:05:07 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Tue, 26 Jun 2001 09:05:07 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com><200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de><3B3471AF.1311E872@lemburg.com><200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de><3B34F9BD.4FDEFC62@lemburg.com><200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de><3B35CEC6.710243E7@lemburg.com><200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de><3B362E9B.4DC8DD81@lemburg.com><200106251342.f5PDg1q07291@odiug.digicool.com><3B375FB9.91BA4B1E@lemburg.com><200106251620.f5PGKNP08234@odiug.digicool.com><3B376E68.505BF6E@lemburg.com><200106251804.f5PI4D008730@odiug.digicool.com><3B378460.C27CDCDD@lemburg.com><200106251912.f5PJCVD09465@odiug.digicool.com><00f201c0fdb0$ab0fe170$4ffa42d5@hagrid><15159.36453.486716.705433@cymru.basistech.com><200106260018.f5Q0ItN01657@mira.informatik.hu-berlin.de> <15159.56854.539327.291739@cymru.basistech.com> <200106260521.f5Q5LZK00933@mira.informatik.hu-berlin.de> Message-ID: <009e01c0fe0e$56e947e0$4ffa42d5@hagrid> mvl wrote: > With Fredrik's solution, you'ld have to rebuild your Python interpreter > with a 32-bit Unicode type to represent the characters. With that > option, we'ld delegate the decision to administrators and Python > distributors. If their users demand support for the additional > characters, they will need to consider wasting space. my suggestion is to prepare the Unicode subsystem for sizeof(Py_UNICODE) >= 4 *today*, and make the switch to UCS-4 when the time is right [1]. UTF-16 is an encoding format, not a storage format, so as long as sizeof(Py_UNICODE) is 2, there will be no support for surrogates beyond what's already in there [2]. 1) imho, that time is "as soon as the unicode subsystem is ready". 2) the U escape, plus some codecs, already support it: >>> u"\U0010ffff" u'\uDBFF\uDFFF' >>> unicode("\xf4\x8f\xbf\xbf", "utf-8") u'\uDBFF\uDFFF' From guido@digicool.com Tue Jun 26 09:51:38 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 26 Jun 2001 04:51:38 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! Message-ID: <200106260851.f5Q8pcN10662@odiug.digicool.com> I'm trying to reset this discussion to come to some sort of conclusion. There's been a lot of useful input; I believe I've read and understood it all. May the new thread subject serve as a summary of my position. :-) Terminology: "character" is a Unicode code point; "unit" is a storage unit, i.e. a 16-bit or 32-bit value. A "surrogate pair" is two 16-bit storage units with special values that represent a single character. I'll use "surrogate" for a single storage unit whose value indicates that it should be part of a surrogate pair. The variable u is a Python Unicode string object of some sort. There are several possible options for representing Unicode strings: 1. The current situation. 
I'd say that this uses UCS-2 for storage; it doesn't pay any attention to surrogates. u[i] might be a lone surrogate. unichr(i) where i is a lone surrogate value returns a string containing a lone surrogate. An application could use the unicode data type to store UTF-16 units, but it would have to be aware of all the rules pertaining to surrogates. The codecs, however, are surrogate-unaware. (Am I right that even the UTF-16 codec pays *no* special attention to surrogates?) 2. The compromise proposal. This uses true UTF-16 for storage and changes the interface to always deal in characters. unichr(i) where i is a lone surrogate is illegal, and so are the corresponding \u and \U encodings. unichr(i) for 0x10000 <= i < 0x110000 will return a one-character string that happens to be represented using a surrogate pair, but there's no way in Python to find out (short of knowing the implementation). Codecs that are capable of encoding full Unicode need to be aware of surrogate pairs. 3. The ideal situation. This uses UCS-4 for storage and doesn't require any support for surrogates except in the UTF-16 codecs (and maybe in the UTF-8 codecs; it seems that encoded surrogate pairs are legal in UTF-8 streams but should be converted back to a single character). It's unclear to me whether the (illegal, according to the Unicode standard) "characters" whose numerical value looks like a lone surrogate should be entirely ruled out here, or whether a dedicated programmer could create strings containing these. We could make it hard by declaring unichr(i) with surrogate i and \u and \U escapes that encode surrogates illegal, and by adding explicit checks to codecs as appropriate, but a C extension could still create an array containing illegal characters unless we do draconian input checking. Option 1, which does not reasonably support characters >= 0x10000, has clear problems, and these will grow with time, hence the current discussion. As a solution, option 2 seems to be most popular; this must be because it appears to promise the most efficient storage solution while allowing the largest range of characters to be represented without effort for the application. I'd like to argue that option 2 is REALLY BAD, given where we are, and that we should provide an upgrade path directly from 1 to 3 instead. The main problem with option 2 is that it breaks the correspondence between storage unit indices and character indices, and given Python's reliance on indexing and slicing for string operations, we need a way to keep the indexing operation (u[i]) efficient, as in O(1). Tim suggested a reasonable way to implement 2 efficiently: add a "finger" to each unicode object that caches the last used index (mapping the character index to the storage unit index). This can be used efficiently to walk through the characters in sequence. Of course, we would also have to store the length of the string in characters (so len(u) can be computed efficiently) as well as in storage units (so the implementation can efficiently know the storage boundaries). Martin has hinted at a solution requiring even less memory per string object, but I don't know for sure what he is thinking of. All I can imagine is a single flag saying "this string contains no surrogates". But either way, I believe that this requires that every part of the Unicode implementation be changed to become aware of the difference between characters and storage units.
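Tim's "finger" idea is easiest to see in miniature. A sketch in Python -- the real thing would live in the C implementation, and the class and method names here are made up purely for illustration:

    def _is_high(u):
        # high (leading) surrogate?
        return 0xD800 <= u <= 0xDBFF

    class FingeredString:
        def __init__(self, units):
            self.units = units        # 16-bit storage unit values
            self.finger = (0, 0)      # (character index, unit index)
        def char(self, i):
            # return the storage units making up the i-th *character*
            ci, ui = self.finger
            if i < ci:                # simplest fallback: restart at the front
                ci, ui = 0, 0
            while ci < i:             # walk forward one character at a time
                if _is_high(self.units[ui]):
                    ui = ui + 2       # a surrogate pair counts as one character
                else:
                    ui = ui + 1
                ci = ci + 1
            self.finger = (ci, ui)    # remember where we stopped
            if _is_high(self.units[ui]):
                return self.units[ui], self.units[ui + 1]
            return (self.units[ui],)

    # "a", U+10000 (stored as a surrogate pair), "b":
    s = FingeredString([0x61, 0xD800, 0xDC00, 0x62])
    assert s.char(1) == (0xD800, 0xDC00)
    assert s.char(2) == (0x62,)

Sequential scans stay cheap because each access starts from the cached position; a random access pattern still degenerates to a linear walk, which is exactly the O(1)-indexing worry raised above.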
Every piece of C code that currently deals with indices into arrays of Py_UNICODE storage units will have to be changed. This would have to be one gigantic patch, just to change the basic Unicode object implementation. The assumption that storage indices and character indices are the same thing appears in almost every function. And then think of the required changes to the SRE engine. It currently assumes a strict character <--> storage unit equivalence throughout. In order to support option 2 correctly, it would have to become surrogate-aware. There are two parts to this: the internal engine needs to realize that e.g. "." and certain "[...]" sets may match a surrogate pair, and the indices returned by e.g. the span() method of match objects should be translated to character indices as expected by the applications. On the other hand, the changes needed to support option 3 are minimal. Fredrik claims that SRE already supports this (or at least it's very close); Tim has looked over the source code of the Unicode object implementation and has not found any code that would break if Py_UNICODE were changed to a 32-bit int type. (There must be some breakage, since the code as it stands won't build on machines where sizeof(short) != 2, but it's got to be a very shallow problem.) I see only one remaining argument against choosing 3 over 2: FUD about disk and primary memory space usage. (I can't believe that anyone would still worry about the extra CPU time, after Fredrik's report that SRE is about as fast with 4 byte characters as it is with 2. In any case this is secondary to the memory space issue, as it is only related to the extra cycles needed to move twice as many bytes around; the cost of most algorithms is determined mostly by the number of characters (or storage units) processed rather than by the number of bytes.) I think the disk space usage problem is dealt with easily by choosing appropriate encodings; UTF-8 and UTF-16 are both great space-savers, and I doubt many sites will store large amounts of UCS-4 directly, given that good codecs are available. The primary memory space problem will go away with time; assuming that most textual documents contain at most a few million characters, it's already not that much of a problem on modern machines. Applications that are required to deal efficiently with larger documents should support some way of streaming or chunking the data anyway. The only remaining question is how to provide an upgrade path to option 3: A. At some Python version, we switch.
Some code could be #ifdef'ed out when Py_UNICODE == wchar_t, but there would always have to be code to support these two having different sizes. The outcome of the choice must be available at run-time, because it may affect certain codecs. Maybe sys.maxunicode could be the largest character value supported, i.e. 0xffff or 0xfffff? A different way to look at it: if we had wanted to use a variable-lenth internal representation, we should have picked UTF-8 way back, like Perl did. Moving to a UTF-16-based internal representation now will give us all the problems of the Perl choice without any of the benefits. --Guido van Rossum (home page: http://www.python.org/~guido/) From walter@livinglogic.de Tue Jun 26 10:56:49 2001 From: walter@livinglogic.de (Walter =?iso-8859-1?Q?D=F6rwald?=) Date: Tue, 26 Jun 2001 11:56:49 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> Message-ID: <3B385C61.E69146D@livinglogic.de> Fredrik Lundh wrote: >=20 > guido wrote: >=20 > [...] > > If we make a clean distinction between characters and storage units, > > and if stick to the rule that u[i] accesses a storage unit, what's th= e > > conceptual difficulty? >=20 > I'm sceptical -- I see very little reason to maintain that distinction. > let's use either UCS-2 or UCS-4 for the internal storage, stick to the > "character strings are character sequences" concept, and keep the > UTF-16 surrogate issue where it belongs: in the codecs. Exactly! Using UTF-16 as the internal storage and defining new methods for accessing characters instead of code units essentially means implementing half a new string type. We'd have to duplicate every method Unicode objects=20 provide now. It would be two string type APIs combined in one type. Do we really need 2 1/2 string types? Bye, Walter D=F6rwald From mal@lemburg.com Tue Jun 26 10:54:36 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 26 Jun 2001 11:54:36 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> Message-ID: <3B385BDC.AB40A761@lemburg.com> Guido van Rossum wrote: > > I'm trying to reset this discussion to come to some sort of > conclusion. There's been a lot of useful input; I believe I've read > and understood it all. May the new thread subject serve as a summary > of my position. :-) > > Terminology: "character" is a Unicode code point; "unit" is a storage > unit, i.e. a 16-bit or 32-bit value. A "surrogate pair" is two > 16-bit storage units with special values that represent a single > character. I'll use "surrogate" for a single storage unit whose value > indicates that it should be part of a surrogate pair. The variable u > is a Python Unicode string object of some sort. 
> > There are several possible options for representing Unicode strings: > > 1. The current situation. I'd say that this uses UCS-2 for storage; > it doesn't pay any attention to surrogates. u[i] might be a lone > surrogate. unicode(i) where i is a lone surrogate value returns a > string containing a lone surrogate. An application could use the > unicode data type to store UTF-16 units, but it would have to be > aware of all the rules pertaining to surrogates. The codecs, > however, are surrogate-unaware. (Am I right that even the UTF-16 > codec pays *no* special attention to surrogates?) The UTF-16 decoder will raise an exception if it sees a surrogate. The encoder writes the internal format as-is without checking for surrogate usage. The UTF-8 codec is fully surrogate aware and will translate the input into UTF-16 surrogates if necessary. The encoder will translate UTF-16 surrogates into UTF-8 representations of the code point. > 2. The compromise proposal. This uses true UTF-16 for storage and > changes the interface to always deal in characters. unichr(i) > where i is a lone surrogate is illegal, and so are the > corresponding \u and \U encodings. unichr(i) for 0x10000 <= i < > 0x100000 will return a one-character string that happens to be > represented using a surrogate pair, but there's no way in Python to > find out (short of knowing the implementation). Codecs that are > capable of encoding full Unicode need to be aware of surrogate > pairs. > > 3. The ideal situation. This uses UCS-4 for storage and doesn't > require any support for surrogates except in the UTF-16 codecs (and > maybe in the UTF-8 codecs; it seems that encoded surrogate pairs > are legal in UTF-8 streams but should be converted back to a single > character). The support is required in all Unicode codecs (UTF-n, unicode-escape and raw-unicode-escape). > It's unclear to me whether the (illegal, according to > the Unicode standard) "characters" whose numerical value looks like > a lone surrogate should be entirely ruled out here, or whether a > dedicated programmer could create strings containing these. As Mark Davis told me, isolated surrogates are legal code points, but the resulting sequence is not a legal Unicode character sequence, since these code points (like a few others as well) are not considered characters. After all this discussion and the feedback from the Unicode mailing list, I think we should leave surrogate handling solely to the codecs and not deal with them in the internal storage. That is, it is the application's responsibility to make sure to create proper sequences of code points which can be used as character sequences. The codecs, OTOH, should be aware of what is and what is not considered a legal sequence. The default handling should be to follow the Unicode Consortium standard. If someone wants to have additional codecs which implement the ISO 10646 view of things with respect to UTF-n handling, then these can easily be supported by codec extension packages. > We > could make it hard by declaring unichr(i) with surrogate i and \u > and \U escapes that encode surrogates illegal, and by adding > explicit checks to codecs as appropriate, but a C extension could > still create an array containing illegal characters unless we do > draconian input checking. See above: it's better to leave these decisions to the applications using the Unicode implementation.
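For reference, the surrogate arithmetic that all of these codecs share is small. Here is a Python sketch of the standard UTF-16 pair <-> code point mapping (illustrative only -- the real codecs do this in C):

def combine_surrogates(hi, lo):
    # high surrogate in 0xD800..0xDBFF, low surrogate in 0xDC00..0xDFFF
    assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

def split_code_point(cp):
    # non-BMP code point in 0x10000..0x10FFFF -> (high, low) surrogates
    assert 0x10000 <= cp <= 0x10FFFF
    cp = cp - 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)

E.g. combine_surrogates(0xD840, 0xDC00) gives 0x20000, matching the u"\U00020000" example later in this thread.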
> ...choose option 3... > > The only remaining question is how to provide an upgrade path to > option 3: > > A. At some Python version, we switch. Like Fredrik said: as soon as the implementation is ready. > B. Choose between 1 and 3 based on the platform. > > C. Make it a configuration-time choice. > > D. Make it a run-time choice. I'd rather not make it a choice: let's go with UCS-4 and be done with these problems once and for all! As a side effect, you could then also enjoy Unicode on Crays :-) Instead of adding an option which allows selecting between 2 or 4 bytes per code unit, I think people would rather like to see an option for disabling Unicode support completely (I know that the Pippy Team would :-). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From andy@reportlab.com Tue Jun 26 11:06:27 2001 From: andy@reportlab.com (Andy Robinson) Date: Tue, 26 Jun 2001 11:06:27 +0100 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <3B385BDC.AB40A761@lemburg.com> Message-ID: > I'd rather not make it a choice: let's go with UCS-4 and be > done with these problems once and for all! > > As a side effect, you could then also enjoy Unicode on Crays :-) I missed most of this thread, but I think there could be "marketing" benefits from proper UCS-4. I suspect a lot of other languages and libraries will be stuck with clunky workarounds and Python could be made out to be in the lead. That is, for the tiny number of people who care about these things :-) - Andy From tdickenson@geminidataloggers.com Tue Jun 26 13:49:12 2001 From: tdickenson@geminidataloggers.com (Toby Dickenson) Date: Tue, 26 Jun 2001 13:49:12 +0100 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <200106260851.f5Q8pcN10662@odiug.digicool.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> Message-ID: On Tue, 26 Jun 2001 04:51:38 -0400, Guido van Rossum wrote: >I see only one remaining argument against choosing 3 over 2: FUD about >disk and primary memory space usage. In previous discussion about unifying plain strings and unicode strings, someone (I forget who, sorry) proposed a unified string type that would store its data in arrays of either 1 or 2 byte elements (depending what was efficient for each string) but provide a unified interface independent of storage option. Could the same option be used to support an option E: individual strings use UCS-4 if they have to, but otherwise gain the space advantages of UCS-2? > >A. At some Python version, we switch. > >B. Choose between 1 and 3 based on the platform. > >C. Make it a configuration-time choice. > >D. Make it a run-time choice. Toby Dickenson tdickenson@geminidataloggers.com
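The per-string storage choice Toby describes could be as simple as this (a hedged sketch of the idea only; nothing like it exists in the implementation being discussed):

def storage_width(code_points):
    # pick the narrowest storage unit that can hold every code point
    # in this particular string; the string type would hide the choice
    for cp in code_points:
        if cp > 0xFFFF:
            return 4    # this string needs UCS-4 units
    return 2            # UCS-2 suffices for this string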
From tree@basistech.com Tue Jun 26 13:17:07 2001 From: tree@basistech.com (Tom Emerson) Date: Tue, 26 Jun 2001 08:17:07 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <009d01c0fe0e$566a7af0$4ffa42d5@hagrid> References: <200102201936.OAA30670@cj20424-a.reston1.va.home.com> <200106230826.f5N8QQH01304@mira.informatik.hu-berlin.de> <3B3471AF.1311E872@lemburg.com> <200106231220.f5NCKcS08353@mira.informatik.hu-berlin.de> <3B34F9BD.4FDEFC62@lemburg.com> <200106232219.f5NMJMu20377@mira.informatik.hu-berlin.de> <3B35CEC6.710243E7@lemburg.com> <200106241703.f5OH3XN01022@mira.informatik.hu-berlin.de> <3B362E9B.4DC8DD81@lemburg.com> <200106251342.f5PDg1q07291@odiug.digicool.com> <3B375FB9.91BA4B1E@lemburg.com> <200106251620.f5PGKNP08234@odiug.digicool.com> <3B376E68.505BF6E@lemburg.com> <200106251804.f5PI4D008730@odiug.digicool.com> <3B378460.C27CDCDD@lemburg.com> <200106251912.f5PJCVD09465@odiug.digicool.com> <00f201c0fdb0$ab0fe170$4ffa42d5@hagrid> <15159.36453.486716.705433@cymru.basistech.com> <200106260018.f5Q0ItN01657@mira.informatik.hu-berlin.de> <15159.56854.539327.291739@cymru.basistech.com> <009d01c0fe0e$566a7af0$4ffa42d5@hagrid> Message-ID: <15160.32067.420276.464530@cymru.basistech.com> Fredrik Lundh writes: > it is not directly supported in Python 2.0, 2.1, and the > current 2.2 codebase. no amount of arguing or wishful > thinking will change that. It is supported insofar as I can write u"\U00020000" and get the UTF-16 encoded u"\ud840\udc00" back. If you limit the internal representation to UCS-2 then you constrain yourself only to Plane 0 and the surrogate pairs are undefined. Hence you would have to disallow the above notation. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From mal@lemburg.com Tue Jun 26 14:08:33 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Tue, 26 Jun 2001 15:08:33 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> Message-ID: <3B388951.7B40652C@lemburg.com> Toby Dickenson wrote: > > On Tue, 26 Jun 2001 04:51:38 -0400, Guido van Rossum > wrote: > > >I see only one remaining argument against choosing 3 over 2: FUD about > >disk and primary memory space usage. > > In previous discussion about unifying plain strings and unicode > strings, someone (I forget who, sorry) proposed a unified string > type that would store its data in arrays of either 1 or 2 byte > elements (depending what was efficient for each string) but provide a > unified interface independent of storage option. > > Could the same option be used to support an option E: individual > strings use UCS-4 if they have to, but otherwise gain the space > advantages of UCS-2? This makes the implementation more complicated: e.g. SRE would then have to be provided in three flavours: 8-bit, 16-bit and 32-bit. Same for most of the codecs. Maintenance will become a nightmare, the Python interpreter will put on weight and we will probably not gain much with respect to overall memory usage (external storage will use one of the encodings which can be chosen on a per-application basis). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tree@basistech.com Tue Jun 26 13:40:27 2001 From: tree@basistech.com (Tom Emerson) Date: Tue, 26 Jun 2001 08:40:27 -0400 Subject: [I18n-sig] Unicode surrogates: just say no!
In-Reply-To: <200106260851.f5Q8pcN10662@odiug.digicool.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> Message-ID: <15160.33467.686959.415021@cymru.basistech.com> Guido van Rossum writes: > 3. The ideal situation. This uses UCS-4 for storage and doesn't > require any support for surrogates except in the UTF-16 codecs (and > maybe in the UTF-8 codecs; it seems that encoded surrogate pairs > are legal in UTF-8 streams but should be converted back to a single > character). It's unclear to me whether the (illegal, according to > the Unicode standard) "characters" whose numerical value looks like > a lone surrogate should be entirely ruled out here, or whether a > dedicated programmer could create strings containing these. We > could make it hard by declaring unichr(i) with surrogate i and \u > and \U escapes that encode surrogates illegal, and by adding > explicit checks to codecs as appropriate, but a C extension could > still create an array containing illegal characters unless we do > draconian input checking. UTF-8 can be used to encode each half of a surrogate pair (resulting in six bytes for the character) --- a proposal for this was presented by PeopleSoft at the UTC meeting last month. UTF-8 can also encode the code-point directly in four bytes. As Marc-Andre said in his response, you can have a valid stream of Unicode characters with half a surrogate pair: that character, however, is undefined. > I see only one remaining argument against choosing 3 over 2: FUD about > disk and primary memory space usage. At the last IUC in Hong Kong some developers from SAP presented data against the use of UCS-4/UTF-32 as an internal representation. In their benchmarks they found that the overhead of cache-misses due to the increased character width was far more detrimental to runtime than having to deal with the odd surrogate pair in a UTF-16 encoded string. After the presentation several people (myself, Asmus Freytag, Toby Phipps of PeopleSoft, and Paul Laenger of Software AG) had a little chat about this issue and couldn't agree whether this was really a big problem or not. I think it bears more research. However, I agree that using UCS-4/UTF-32 as the internal string representation is the best solution. Remember too that glibc uses UCS-4 as its internal wchar_t representation. This was also discussed at the Li18nux meetings a couple of years ago. > A. At some Python version, we switch. > > B. Choose between 1 and 3 based on the platform. > > C. Make it a configuration-time choice. Defaulting to UCS-4? > We could use B to determine the default choice, e.g. we could choose > between option 1 and 3 depending on the platform's wchar_t; but it > would be bad not to have a way to override this default, so we > couldn't exploit the correspondence much. Some code could be > #ifdef'ed out when Py_UNICODE == wchar_t, but there would always have > to be code to support these two having different sizes. Seems to me this could add complexity and reliance on platform functionality that may not be consistent. Are the savings worth the complexity? > The outcome of the choice must be available at run-time, because it > may affect certain codecs. Maybe sys.maxunicode could be the largest > character value supported, i.e. 0xffff or 0xfffff? or 0x10ffff? -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever"
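As an aside on the run-time check being discussed here: application code would simply branch on the attribute (a sketch of the proposed interface, assuming the two values discussed above; this is in fact how sys.maxunicode behaves in Python 2.2 and later):

import sys

if sys.maxunicode == 0xFFFF:
    # narrow build: non-BMP characters occupy two storage units
    print "narrow build; expect surrogate pairs"
else:
    # wide build: sys.maxunicode == 0x10FFFF, one unit per code point
    print "wide build; one storage unit per character"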
From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 15:53:35 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 16:53:35 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <200106260851.f5Q8pcN10662@odiug.digicool.com> (message from Guido van Rossum on Tue, 26 Jun 2001 04:51:38 -0400) References: <200106260851.f5Q8pcN10662@odiug.digicool.com> Message-ID: <200106261453.f5QErZP01348@mira.informatik.hu-berlin.de> > Martin has hinted at a solution requiring even less memory per string > object, but I don't know for sure what he is thinking of. All I can > imagine is a single flag saying "this string contains no surrogates". That was my original idea. I later thought having a count of surrogate pairs would be better, since it allows len() to be computed in constant time. Indexing would be linear time only for strings containing surrogates, otherwise constant time also. > But either way, I believe that this requires that every part of the > Unicode implementation be changed to become aware of the difference > between characters and storage units. Every piece of C code that > currently deals with indices into arrays of Py_UNICODE storage units > will have to be changed. One could try to reduce the impact of the change, in particular when expecting your solution 3 (i.e. a 32-bit Py_UNICODE). E.g. code that currently reads

if (start < 0)
    start += self->length;
if (start < 0)
    start = 0;

would then read

if (start < 0)
    start += Py_UNICODE_LENGTH(self);
if (start < 0)
    start = 0;
start = Py_UNICODE_UNIT_OF(self, start);

where Py_UNICODE_UNIT_OF converts from character indices to unit indices, and is implemented as

#ifdef Py_UNICODE_4_BYTES
#define Py_UNICODE_UNIT_OF(str,x) x
#else
#define Py_UNICODE_UNIT_OF(str,x) (str->surrogates?Py_UnicodeUnitOf(str,x):x)
#endif

Not that I particularly like that approach; I'm just pointing out it is feasible. [on sre] > There are two parts to this: the internal > engine needs to realize that e.g. "." and certain "[...]" sets may > match a surrogate pair, and the indices returned by e.g. the span() > method of match objects should be translated to character indices as > expected by the applications. For character classes, it may be acceptable that they must only contain BMP characters; span would use the conversion macros, and . would need special casing. I agree this is terrible, but it could work. > I think the disk space usage problem is dealt with easily by choosing > appropriate encodings; UTF-8 and UTF-16 are both great space-savers, > and I doubt many sites will store large amounts of UCS-4 directly, > given that good codecs are available. For application data, the internal representation is irrelevant; it is not easy to get at the internal representation to write a string to a file (you have to use a codec). For marshal, backward compatibility becomes an issue; UTF-16 is the obvious choice. For pickle, UTF-8 or raw-unicode-escape is used, anyway. > The only remaining question is how to provide an upgrade path to > option 3: > > A. At some Python version, we switch. > > B. Choose between 1 and 3 based on the platform. > > C. Make it a configuration-time choice. > > D. Make it a run-time choice. > > I think we all agree that D is bad. I'd say that C is the best; > eventually (say, when Windows is fixed :-) the choice becomes > unnecessary. I don't think it will be hard to support C, with some > careful coding.
The biggest danger is that binary C modules are exchanged between installations, e.g. pyd DLLs or RPMs. With distutils, it is really easy to create these, so we should be careful that they break meaningfully instead of just crashing. So I suppose your "careful coding" includes Py_InitModule magic. > We could use B to determine the default choice, e.g. we could choose > between option 1 and 3 depending on the platform's wchar_t; but it > would be bad not to have a way to override this default, so we > couldn't exploit the correspondence much. Still, exploiting the platform's wchar_t might avoid copies in some cases (I'm thinking of my iconv codec in particular), so that would give a speed-up. > The outcome of the choice must be available at run-time, because it > may affect certain codecs. Maybe sys.maxunicode could be the largest > character value supported, i.e. 0xffff or 0xfffff? It's actually 0x10ffff, since UTF-16 allows for 16 additional planes, but yes, that interface sounds good. Regards, Martin From tree@basistech.com Tue Jun 26 15:39:51 2001 From: tree@basistech.com (Tom Emerson) Date: Tue, 26 Jun 2001 10:39:51 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <200106261453.f5QErZP01348@mira.informatik.hu-berlin.de> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <200106261453.f5QErZP01348@mira.informatik.hu-berlin.de> Message-ID: <15160.40631.208461.386096@cymru.basistech.com> Martin v. Loewis writes: > > Martin has hinted at a solution requiring even less memory per string > > object, but I don't know for sure what he is thinking of. All I can > > imagine is a single flag saying "this string contains no surrogates". > > That was my original idea. I later thought having a count of surrogate > pairs would be better, since it allows len() to be computed in constant > time. Indexing would be linear time only for strings containing > surrogates, otherwise constant time also. Just so I understand: the codec will set this flag/length when it transcodes to the internal representation? > [on sre] > > There are two parts to this: the internal > > engine needs to realize that e.g. "." and certain "[...]" sets may > > match a surrogate pair, and the indices returned by e.g. the span() > > method of match objects should be translated to character indices as > > expected by the applications. > > For character classes, it may be acceptable that they must only contain BMP > characters; span would use the conversion macros, and . would need > special casing. I agree this is terrible, but it could work. UTR #18 describes the impact of surrogates on regular expressions: http://www.unicode.org/unicode/reports/tr18/#Surrogates > Still, exploiting the platform's wchar_t might avoid copies in some > cases (I'm thinking of my iconv codec in particular), so that would > give a speed-up. Excellent point. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever"
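To make the bookkeeping Martin describes concrete, here is an illustrative Python model of the C-level idea (the names are invented for this sketch; units stands for the Py_UNICODE array viewed as integers, npairs for the stored surrogate-pair count):

def char_len(units, npairs):
    # len() in characters: storage units minus one per surrogate pair
    return len(units) - npairs

def unit_index(units, npairs, char_index):
    # map a character index to a storage-unit index: constant time for
    # strings without surrogates, linear otherwise
    if npairs == 0:
        return char_index
    i = 0
    for k in range(char_index):
        if 0xD800 <= units[i] <= 0xDBFF:    # high surrogate: skip pair
            i = i + 2
        else:
            i = i + 1
    return i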
From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 17:37:32 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 18:37:32 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <15160.40631.208461.386096@cymru.basistech.com> (message from Tom Emerson on Tue, 26 Jun 2001 10:39:51 -0400) References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <200106261453.f5QErZP01348@mira.informatik.hu-berlin.de> <15160.40631.208461.386096@cymru.basistech.com> Message-ID: <200106261637.f5QGbWQ01763@mira.informatik.hu-berlin.de> > > That was my original idea. I later thought having a count of surrogate > > pairs would be better, since it allows len() to be computed in constant > > time. Indexing would be linear time only for strings containing > > surrogates, otherwise constant time also. > > Just so I understand: the codec will set this flag/length when it > transcodes to the internal representation? Depends on how it is written. At the C level, it could provide a surrogate count when creating a string, or it could give -1, in which case the implementation would count the surrogates. At the Python level, there would be no interface for finding out the number of surrogates, or setting them. Instead, unichr invocations with arguments above 0xffff would set the count. Regards, Martin From guido@digicool.com Tue Jun 26 18:00:44 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 26 Jun 2001 13:00:44 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Tue, 26 Jun 2001 11:54:36 +0200." <3B385BDC.AB40A761@lemburg.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> Message-ID: <200106261700.f5QH0ih14770@odiug.digicool.com> (Mass followup.) > From: "M.-A. Lemburg" > The UTF-16 decoder will raise an exception if it sees a surrogate. > The encoder writes the internal format as-is without checking for > surrogate usage. Hm, isn't this asymmetric? I'd imagine that either behavior (exception or copy as-is) can be useful in either direction at times, so this should be an option (maybe a different codec name?). > The UTF-8 codec is fully surrogate aware and will translate > the input into UTF-16 surrogates if necessary. The encoder > will translate UTF-16 surrogates into UTF-8 representations > of the code point. Good. This (like the UTF-16 codec's behavior) will have to be made conditional on sizeof(Py_UNICODE) in my proposal. > As Mark Davis told me, isolated surrogates are legal code > points, but the resulting sequence is not a legal Unicode > character sequence, since these code points (like a few others > as well) are not considered characters. Let me use this as an excuse to start a discussion on how far we should go in ruling out illegal code points. I think that *codecs* would be wise to be picky about illegal code points (except for the special UTF-16-pass-through option). But I think that the *datatype implementation* should allow storage units to take every possible value, whether or not it's illegal according to Unicode, either in isolation or in context. It's much easier to implement that way, and I believe that the checks ought to be in other tools.
In particular, I propose:

- in all cases:
  - \udddd and \Udddddddd always behave the same as unichr(0xdddd) or unichr(0xdddddddd)

- with 16-bit (narrow) Py_UNICODE:
  - unichr(i) for 0 <= i <= 0xffff always returns a size-one string where ord(u[0]) == i
  - unichr(i) for 0x10000 <= i <= 0x10ffff (and hence corresponding \u and \U) generates a surrogate pair, where u[0] is the high surrogate value and u[1] the low surrogate value
  - unichr(i) for i >= 0x110000 (and hence corresponding \u and \U) raises an exception at Python-to-bytecode compile-time

- with 32-bit (wide) Py_UNICODE:
  - unichr(i) for 0 <= i <= 0xffffffff always returns a size-one string where ord(u[0]) == i

I expect that the surrogate generation rule will be controversial, so let me explain why I think it's the best possible rule. We're adding a difference between Python implementations here: some can only represent code points up to 0xffff directly, others can represent all 32-bit code points. This is no different (IMO) than having sys.maxint vary between platforms, or having thread support be platform dependent, or having several choices from the *dbm family of modules. We'll tell users their platform properties: sys.maxunicode is either 0xffff or 0x10ffff. Users can choose to write code that only runs with wide Unicode strings. They ought to put "assert sys.maxunicode>=0x10ffff" somewhere in their program, but that's their choice -- they can also just document it, or only run it on their own system which they configured for wide Unicode. Users can choose to write code that doesn't use Unicode characters outside the basic plane. They don't have to do anything special. Users can choose to write code that's portable between the two versions by using surrogates on the narrow platform but not on the wide platform. (This would be a good idea for backward compatibility with Python 2.0 and 2.1 anyway.) The proposed (and current!) behavior of \U makes it easy for them to do the right thing with string literals; everything else, they just have to write code that won't separate surrogate halves. Making unichr() and the \U escape behave the same regardless of platform makes more sense than the current situation, where unichr() refuses characters larger than 0xffff, but \U translates them into surrogates. I *don't* think \U should be limited to a notation to create surrogates. I also don't think it's wise to stop creating surrogates from \U when appropriate. I *don't* think it's wise to let unichr() balk at input values that happen to be lone surrogates. It is easy enough to avoid these in applications (if the application gets its input from a codec, it should be safe already), and it would prevent code that knows what it's doing from doing stuff beyond the Unicode standard du jour. That would be unpythonic. > After all this discussion and the feedback from the Unicode > mailing list, I think we should leave surrogate handling > solely to the codecs and not deal with them in the internal > storage. That is, it is the application's responsibility to > make sure to create proper sequences of code points which can > be used as character sequences. Exactly what I say above. > The codecs, OTOH, should be aware of what is and what is not > considered a legal sequence. The default handling should be to > follow the Unicode Consortium standard. If someone wants to > have additional codecs which implement the ISO 10646 view of things > with respect to UTF-n handling, then these can easily be supported > by codec extension packages. Yes.
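Spelled out in code, the proposed narrow/wide unichr() behaves roughly like this Python model (semantics only, not the C implementation; it caps the argument at 0x10ffff on both build types, as Fredrik's implementation quoted later does, whereas the proposal above would allow larger values on wide builds):

def unichr_model(i, wide):
    # return the storage units the proposed unichr(i) would produce;
    # wide stands for sizeof(Py_UNICODE) == 4
    if i < 0 or i > 0x10FFFF:
        raise ValueError("unichr() arg out of range")
    if wide or i <= 0xFFFF:
        return [i]                       # a single storage unit
    i = i - 0x10000                      # narrow build: surrogate pair
    return [0xD800 + (i >> 10), 0xDC00 + (i & 0x3FF)]

E.g. unichr_model(0x10000, 0) yields [0xD800, 0xDC00], i.e. u'\ud800\udc00', matching the interactive session below.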
> > We > > could make it hard by declaring unichr(i) with surrogate i and \u > > and \U escapes that encode surrogates illegal, and by adding > > explicit checks to codecs as appropriate, but a C extension could > > still create an array containing illegal characters unless we do > > draconian input checking. > > See above: it's better to leave these decisions to the applications > using the Unicode implementation. We agree! > > ...choose option 3... > > > > The only remaining question is how to provide an upgrade path to > > option 3: > > > > A. At some Python version, we switch. > > Like Fredrik said: as soon as the implementation is ready. But will the users be ready? > > B. Choose between 1 and 3 based on the platform. > > > > C. Make it a configuration-time choice. > > > > D. Make it a run-time choice. > > I'd rather not make it a choice: let's go with UCS-4 and be > done with these problems once and for all! I assert that it's easy enough to write code that is indifferent to sizeof(Py_UNICODE). See SRE as a proof. I expect that not all Unicode users will be ready to embrace UCS-4. I don't want to hear people say "I don't want to upgrade to Python 2.2 because it wastes 4 bytes per Unicode character, but all I ever do is bandy around basic plane characters." Given that there's currently very limited need for characters outside the basic plane, I want to be able to say that Python 2.2 is UCS-4 ready, but not that it always uses it. > As a side effect, you could then also enjoy Unicode on Crays :-) Indeed. > Instead of adding an option which allows selecting between > 2 or 4 bytes per code unit, I think people would rather like > to see an option for disabling Unicode support completely (I know that > the Pippy Team would :-). That's definitely another configuration switch that I would like to see. How hard would it be? > From: Toby Dickenson > In previous discussion about unifying plain strings and unicode > strings, someone (I forget who, sorry) proposed a unified string > type that would store its data in arrays of either 1 or 2 byte > elements (depending what was efficient for each string) but provide a > unified interface independent of storage option. > > Could the same option be used to support an option E: individual > strings use UCS-4 if they have to, but otherwise gain the space > advantages of UCS-2? I agree with MAL's rebuttal: this would just make things more complicated all over the place. > From: Tom Emerson > UTF-8 can be used to encode each half of a surrogate pair > (resulting in six bytes for the character) --- a proposal for this was > presented by PeopleSoft at the UTC meeting last month. UTF-8 can also > encode the code-point directly in four bytes. But isn't the direct encoding highly preferable? When would you ever want your UTF-8 to be encoded UTF-16? > As Marc-Andre said in his response, you can have a valid stream of Unicode > characters with half a surrogate pair: that character, however, is > undefined. I guess the UTF-8 codec would have to deal with unpaired surrogates somehow, but I would prefer it if normally it would peek ahead and encode a valid surrogate pair as the correct 4-byte sequence. > > I see only one remaining argument against choosing 3 over 2: FUD about > > disk and primary memory space usage. > > At the last IUC in Hong Kong some developers from SAP presented data > against the use of UCS-4/UTF-32 as an internal representation.
In > their benchmarks they found that the overhead of cache-misses due to > the increased character width was far more detrimental to runtime > than having to deal with the odd surrogate pair in a UTF-16 encoded > string. After the presentation several people (myself, Asmus Freytag, > Toby Phipps of PeopleSoft, and Paul Laenger of Software AG) had a > little chat about this issue and couldn't agree whether this was > really a big problem or not. I think it bears more research. Yet another reason to offer a configuration choice between 2-byte and 4-byte Py_UNICODE, until we know the answer. (I'm sure it depends on what the application does with the data too!) > However, I agree that using UCS-4/UTF-32 as the internal string > representation is the best solution. Well, I find it infinitely better than trying to use UTF-16 as the internal representation but coercing the interface into dealing with characters and character indices uniformly. > Remember too that glibc uses UCS-4 as its internal wchar_t > representation. This was also discussed at the Li18nux meetings a > couple of years ago. But I don't think there are many Linux applications that use wchar_t extensively yet. At least I haven't seen any. (Does anyone know if Mozilla's Asian character support uses wchar_t or Unicode?) > > A. At some Python version, we switch. > > > > B. Choose between 1 and 3 based on the platform. > > > > C. Make it a configuration-time choice. > > Defaulting to UCS-4? Unclear. We'll have to user-test this default and see what the performance hit really is. > > We could use B to determine the default choice, e.g. we could choose > > between option 1 and 3 depending on the platform's wchar_t; but it > > would be bad not to have a way to override this default, so we > > couldn't exploit the correspondence much. Some code could be > > #ifdef'ed out when Py_UNICODE == wchar_t, but there would always have > > to be code to support these two having different sizes. > > Seems to me this could add complexity and reliance on platform > functionality that may not be consistent. Are the savings worth the > complexity? Given that the benefits of UCS-4 are unclear at this point, I think we should be cautious and support both UCS-2 and UCS-4 on all platforms (except maybe Crays :-). > > The outcome of the choice must be available at run-time, because it > > may affect certain codecs. Maybe sys.maxunicode could be the largest > > character value supported, i.e. 0xffff or 0xfffff? > > or 0x10ffff? Yes, I forgot about the 17th plane. > From: "M.-A. Lemburg" > From: "Martin v. Loewis" [sketches implementation idea] > Not that I particularly like that approach; I'm just pointing out it is > feasible. I still find this approach very unattractive, and I doubt that it will be possible to make all aspects of the interface uniform. What would be a good reason to try this? It's by far the most work of all options. > [on sre] > For character classes, it may be acceptable that they must only contain BMP > characters; span would use the conversion macros, and . would need > special casing. I agree this is terrible, but it could work. I doubt that Fredrik would want to maintain it. > > I think the disk space usage problem is dealt with easily by choosing > > appropriate encodings; UTF-8 and UTF-16 are both great space-savers, > > and I doubt many sites will store large amounts of UCS-4 directly, > > given that good codecs are available.
> > For application data, the internal representation is irrelevant; it is > not easy to get at the internal representation to write a string to a > file (you have to use a codec). For marshal, backward compatibility > becomes an issue; UTF-16 is the obvious choice. For pickle, UTF-8 or > raw-unicode-escape is used, anyway. Huh? Marshal uses UTF-8 now. Since the UTF-8 codec is already fully surrogate-aware, shouldn't it do the right thing? E.g. on a "narrow" platform, encoding a Unicode string containing a surrogate pair generates the UTF-8 4-byte encoding of the corresponding Unicode character, and decoding that UTF-8 representation will create a surrogate pair. On a wide platform, that same UTF-8 encoding will be turned into a single character correctly (assuming the UTF-8 codec is adapted to the wide platform; I presume this code doesn't exist yet). So if either platform takes a string literal containing a \U escape for a non-basic-plane character, and marshals the resulting string, they get the same marshalled value, and they can both read it back correctly. (Try it! It works.) > The biggest danger is that binary C modules are exchanged between > installations, e.g. pyd DLLs or RPMs. With distutils, it is really > easy to create these, so we should be careful that they break > meaningfully instead of just crashing. So I suppose your "careful > coding" includes Py_InitModule magic. Good point! > Still, exploiting the platform's wchar_t might avoid copies in some > cases (I'm thinking of my iconv codec in particular), so that would > give a speed-up. Yes, but I don't want to *force* users to use UCS-4. (Yet; in a few years' time this may change.) We have this code now, so it shouldn't be too hard to keep it. PEP time? --Guido van Rossum (home page: http://www.python.org/~guido/) From tree@basistech.com Tue Jun 26 17:40:48 2001 From: tree@basistech.com (Tom Emerson) Date: Tue, 26 Jun 2001 12:40:48 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <200106261700.f5QH0ih14770@odiug.digicool.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> Message-ID: <15160.47888.58634.946673@cymru.basistech.com> Guido van Rossum writes: > > UTF-8 can be used to encode each half of a surrogate pair > > (resulting in six bytes for the character) --- a proposal for this was > > presented by PeopleSoft at the UTC meeting last month. UTF-8 can also > > encode the code-point directly in four bytes. > > But isn't the direct encoding highly preferable? When would you ever > want your UTF-8 to be encoded UTF-16? Amen. There were other reasons related to sort orders that I'm not clear on as I didn't pay much attention to non-Asian issues. > > Remember too that glibc uses UCS-4 as its internal wchar_t > > representation. This was also discussed at the Li18nux meetings a > > couple of years ago. > > But I don't think there are many Linux applications that use wchar_t > extensively yet. At least I haven't seen any. (Does anyone know if > Mozilla's Asian character support uses wchar_t or Unicode?) I don't have statistics on this, but I don't think it much matters: I doubt Linux application developers are failing to use wchar_t because it is 4 bytes. I merely point to glibc as an example where a conscious decision was made to go with a 4-byte wide character type in order to allow for easy future growth without being constrained by alternate transformation formats of Unicode.
Ulrich Drepper made the right choice, which was supported by the Li18nux group, which includes the Linux vendors as well as IBM and Basis. -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From fredrik@pythonware.com Tue Jun 26 18:28:10 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Tue, 26 Jun 2001 19:28:10 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> Message-ID: <004e01c0fe65$5fe418f0$4ffa42d5@hagrid> Guido wrote: > I assert that it's easy enough to write code that is indifferent to > sizeof(Py_UNICODE). See SRE as a proof. I just checked in a couple of patches which fix some obvious problems for sizeof(Py_UNICODE) > 2 (so sue me ;-). most everything seems to work (the UTF-16 codec is a notable exception). there's a new (experimental) define in Include/unicodeobject.h:

#undef USE_UCS4_STORAGE

if defined, Py_UNICODE is set to the same thing as Py_UCS4. Cray users may want to define it... Cheers /F From tim@digicool.com Tue Jun 26 18:32:39 2001 From: tim@digicool.com (Tim Peters) Date: Tue, 26 Jun 2001 13:32:39 -0400 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: <200106260526.f5Q5Q3900934@mira.informatik.hu-berlin.de> Message-ID: [Tom Emerson] > Perhaps not. :-) But the Chinese aren't the only ones to worry > about. The Japanese also have characters being added outside the BMP, > and Ruby holds sway in Japan... [Martin v. Loewis] > That's a good point. How does Ruby deal with surrogates? Ruby has some support for UTF-8 now, but Matz (Ruby's dad) is much more a Mule fan: http://www.m17n.org/ He's said that Ruby will eventually treat Unicode as "just another character set" -- along with every other character-set gimmick ever invented. > Java JDK 1.4? Perl? Tcl? Windows XP? Oh, go do your own web search <wink>. From fredrik@pythonware.com Tue Jun 26 18:59:13 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Tue, 26 Jun 2001 19:59:13 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> Message-ID: <00bc01c0fe69$b6cf1620$4ffa42d5@hagrid> Guido wrote: > PEP time? yes (based on this mail + your previous mail). I can write the code if someone else writes the PEP... Cheers /F From fredrik@pythonware.com Tue Jun 26 19:27:50 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Tue, 26 Jun 2001 20:27:50 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> Message-ID: <00d801c0fe6d$b5f81a40$4ffa42d5@hagrid> guido wrote: > - with 16-bit (narrow) Py_UNICODE: > > - unichr(i) for 0 <= i <= 0xffff always returns a size-one string > where ord(u[0]) == i > > - unichr(i) for 0x10000 <= i <= 0x10ffff (and hence corresponding \u > and \U) generates a surrogate pair, where u[0] is the high > surrogate value and u[1] the low surrogate value > > - unichr(i) for i >= 0x110000 (and hence corresponding \u and \U) > raises an exception at Python-to-bytecode compile-time or in other words:

>>> unichr.__doc__
'unichr(i) -> Unicode character\n\nReturn a Unicode string of one character with ordinal i; 0 <= i < 1114112.'
>>> unichr(-1)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: unichr() arg not in range(1114111)
>>> unichr(0)
u'\x00'
>>> unichr(1)
u'\x01'
>>> unichr(256)
u'\u0100'
>>> unichr(55296)
u'\ud800'
>>> unichr(65535)
u'\uffff'
>>> unichr(65536)
u'\ud800\udc00'
>>> unichr(1114111)
u'\udbff\udfff'
>>> unichr(1114112)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: unichr() arg not in range(1114111)
>>> "\U00000000"
'\\U00000000'
>>> "\U00000100"
'\\U00000100'
>>> u"\U00000100"
u'\u0100'
>>> u"\U00000000"
u'\x00'
>>> u"\U00000000"
u'\x00'
>>> u"\U00000100"
u'\u0100'
>>> u"\U0000d800"
u'\ud800'
>>> u"\U0000ffff"
u'\uffff'
>>> u"\U00010000"
u'\ud800\udc00'
>>> u"\U0010ffff"
u'\udbff\udfff'
>>> u"\U00110000"
UnicodeError: Unicode-Escape decoding error: illegal Unicode character

(\U behaviour as in 2.1, unichr as in my development version of 2.2)

note that unichr raises a ValueError, not a UnicodeError. should this be changed? Cheers /F From guido@digicool.com Tue Jun 26 20:39:16 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 26 Jun 2001 15:39:16 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Tue, 26 Jun 2001 20:27:50 +0200." <00d801c0fe6d$b5f81a40$4ffa42d5@hagrid> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <00d801c0fe6d$b5f81a40$4ffa42d5@hagrid> Message-ID: <200106261939.f5QJdGY16026@odiug.digicool.com> > guido wrote: > > > - with 16-bit (narrow) Py_UNICODE: > > > > - unichr(i) for 0 <= i <= 0xffff always returns a size-one string > > where ord(u[0]) == i > > > > - unichr(i) for 0x10000 <= i <= 0x10ffff (and hence corresponding \u > > and \U) generates a surrogate pair, where u[0] is the high > > surrogate value and u[1] the low surrogate value > > > > - unichr(i) for i >= 0x110000 (and hence corresponding \u and \U) > > raises an exception at Python-to-bytecode compile-time > > or in other words: > > >>> unichr.__doc__ > 'unichr(i) -> Unicode character\n\nReturn a Unicode string of one character with > ordinal i; 0 <= i < 1114112.' I would write 0 <= i <= 0x10ffff, but otherwise, yes. Check it in already! > note that unichr raises a ValueError, not a UnicodeError. should this > be changed? I think not. The input value is wrong, that's a ValueError. There are lots of ValueErrors in the Unicode implementation. There are lots of UnicodeErrors too; the distinction isn't always clear. MAL? --Guido van Rossum (home page: http://www.python.org/~guido/) From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 20:43:12 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 21:43:12 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <200106261700.f5QH0ih14770@odiug.digicool.com> (message from Guido van Rossum on Tue, 26 Jun 2001 13:00:44 -0400) References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> Message-ID: <200106261943.f5QJhCh20482@mira.informatik.hu-berlin.de>
Somebody please correct me: A conforming implementation must never encode a non-BMP character with six bytes in UTF-8; security people will shoot you if you say that two alternative representations for the same string are possible. HOWEVER, I think what the spec says is that implementations shall accept non-BMP characters encoded in six-byte UTF-8. This is because buggy implementations may produce such output, and because that was previously left unspecified, so accepting such UTF-8 strings improves interoperability. > Huh? Marshal uses UTF-8 now. Oops, I should have checked. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 20:46:20 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 21:46:20 +0200 Subject: [I18n-sig] How does Python Unicode treat surrogates? In-Reply-To: References: Message-ID: <200106261946.f5QJkKe20513@mira.informatik.hu-berlin.de> > > Java JDK 1.4? Perl? Tcl? Windows XP? > > Oh, go do your own web search <wink>. I could have answered the Perl and Tcl cases myself: both use UTF-8 internally, so they are never confronted with surrogates in their representation. The other two were rather polemic, since I don't really expect them to support other planes in some meaningful way - without checking, of course. Regards, Martin From paulp@ActiveState.com Tue Jun 26 21:31:08 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 26 Jun 2001 13:31:08 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> Message-ID: <3B38F10B.CCA55437@ActiveState.com> Guido van Rossum wrote: > >... > > I expect that not all Unicode users will be ready to embrace UCS-4. I > don't want to hear people say "I don't want to upgrade to Python 2.2 > because it wastes 4 bytes per Unicode character, but all I ever do is > bandy around basic plane characters." Given that there's currently > very limited need for characters outside the basic plane, I want to be > able to say that Python 2.2 is UCS-4 ready, but not that it always > uses it. I'm not dead-set against this but I want to point out that binary distributors are probably not going to bother shipping two different binaries. So the silent majority of Python users who download precompiled binaries are going to have a "flag day" where Python changes its default behaviour. Given infinite resources, I'd rather see "best of both worlds" implementations such as a flag on the Unicode object that chooses its internal representation (i.e. a speed tweak for the knowledgeable) or objects that "fall back" from ASCII to UCS-2 to UCS-4 depending on the input data. Or even a unicode32() data type that was interoperable with unicode16. (and the default could change from one to the other someday) I accept that in a world of finite resources there may be nobody interested enough to put in that effort but I'd rather see the option excluded on that basis rather than just because the code becomes more complex. The code complexity would be worth it if it prevents a minor fork in Python and varying behavior on different Pythons. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From guido@digicool.com Tue Jun 26 21:36:33 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 26 Jun 2001 16:36:33 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Tue, 26 Jun 2001 13:31:08 PDT."
<3B38F10B.CCA55437@ActiveState.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B38F10B.CCA55437@ActiveState.com> Message-ID: <200106262036.f5QKaX618195@odiug.digicool.com> > > I expect that not all Unicode users will be ready to embrace UCS-4. I > > don't want to hear people say "I don't want to upgrade to Python 2.2 > > because it wastes 4 bytes per Unicode character, but all I ever do is > > bandy around basic plane characters." Given that there's currently > > very limited need for characters outside the basic plane, I want to be > > able to say that Python 2.2 is UCS-4 ready, but not that it always > > uses it. > > I'm not dead-set against this but I want to point out that binary > distributors are probably not going to bother shipping two different > binaries. So the silent majority of Python users who download > precompiled binaries are going to have a "flag day" where Python changes > its default behaviour. Distributors know their users best -- they can decide when it's time. E.g. I expect Asian Linux distributors to take the lead here, and American distributors to follow last, with European distributors in the middle. Users with different wishes (most likely users with a desire for UCS-4 in a UCS-2 world) can always build from source. > Given infinite resources, I'd rather see "best of both worlds" > implementations such as a flag on the Unicode object that chooses its > internal representation (i.e. a speed tweak for the knowledgeable) or > objects that "fall back" from ASCII to UCS-2 to UCS-4 depending on the > input data. Or even a unicode32() data type that was interoperable with > unicode16. (and the default could change from one to the other someday) > > I accept that in a world of finite resources there may be nobody > interested enough to put in that effort but I'd rather see the option > excluded on that basis rather than just because the code becomes more > complex. The code complexity would be worth it if it prevents a minor > fork in Python and varying behavior on different Pythons. But you don't have to maintain it. I say that this particular varying behavior is just as acceptable as the varying int size. Do you want to write the PEP? --Guido van Rossum (home page: http://www.python.org/~guido/) From rick@unicode.org Tue Jun 26 21:38:48 2001 From: rick@unicode.org (Rick McGowan) Date: Tue, 26 Jun 2001 13:38:48 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <200106261700.f5QH0ih14770@odiug.digicool.com> (message from Guido van Rossum on Tue, 26 Jun 2001 13:00:44 -0400) Message-ID: <200106261831.OAA22124@unicode.org> > Somebody please correct me: A conforming implementation must never > encode a non-BMP character with six bytes in UTF-8; security people > will shoot you if you say that two alternative representations for the > same string are possible. >... > HOWEVER, I think what the spec says is that implementations shall accept > non-BMP characters encoded in six-byte UTF-8. This is The spec has been recently changed to eliminate the ambiguity precisely because of security restrictions. You are never allowed to produce "non-shortest form". The correct, conforming way to encode surrogate pairs in UTF-8 is to convert the pair to UTF-32, and then convert the UTF-32 entity to UTF-8. See: http://www.unicode.org/unicode/reports/tr27/ which is the definition of Unicode 3.1.
It says in the intro: Most notable among the corrigenda to the standard is a tightening of the definition of UTF-8, to eliminate a possible security issue with non-shortest-form UTF-8. Later, there is a section "UTF-8 Corrigendum", which starts with the text shown below. This always results in a UTF-8 sequence <= 4 bytes in length, for all valid Unicode characters 0..10FFFF. (BTW, I have also been working on an updated reference code for the various UTF transformations, but have not yet posted it due to the controversy surrounding the so-called UTF-8S proposal.) Rick ------------------------------------------------------ UTF-8 Corrigendum The current conformance clause C12 in The Unicode Standard, Version 3.0 forbids the generation of "non-shortest form" UTF-8, and forbids the interpretation of illegal sequences, but not the interpretation of "non-shortest form". Where software does interpret the non-shortest forms, security issues can arise. For example: Process A performs security checks, but does not check for non-shortest forms. Process B accepts the byte sequence from process A, and transforms it into UTF-16 while interpreting non-shortest forms. The UTF-16 text may then contain characters that should have been filtered out by process A. To address this issue, the Unicode Technical Committee has modified the definition of UTF-8 to forbid conformant implementations from interpreting non-shortest forms for BMP characters, and clarified some of the conformance clauses. From mal@lemburg.com Mon Jun 25 21:28:45 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 22:28:45 +0200 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? References: <3B3722DB.1FF54794@lemburg.com> <4ak820g418.fsf@kern.srcf.societies.cam.ac.uk> <006501c0fd82$8b5ba9f0$0c680b41@c1340594a> <005b01c0fd9d$e4469e60$1a2cf7c2@oakdale2> <013901c0fdac$d27d1970$0c680b41@c1340594a> Message-ID: <3B379EFD.40F88FC5@lemburg.com> Mark Davis wrote: > > That is an interesting approach; one that basically amounts to some > convenience functions. For example, instead of writing: > > myString.substring(myString.cpToIndex(3), myString.cpToIndex(5)); > > you could write: > > myString.substring(3, 5, myString.CODEPOINT); > > This hides some of the work, when someone is working in code points. The > performance cost is still there, of course; using code point indexes > requires each operation to examine every code unit up to that point, which > is much more expensive. Good idea! > For a general programming language or string library, I'm not sure about > implementing this pattern throughout. I know in the ICU library, for > example, we have a significant number of functions that take offsets into > strings. Having such a parameter on all of them would be clumsy, when most > of the time people are simply working in code units. In Python this would certainly be an elegant way to add the code point indexing functionality (Python supports optional arguments with default values). -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/
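In Python terms the pattern could be provided as helper functions along these lines (a sketch only; cp_to_index is a hypothetical name, and the walk over the units has the O(n) cost Mark mentions):

def cp_to_index(u, cp_index):
    # translate a code point index into a storage-unit index by
    # walking the units of a narrow-build unicode string
    i = 0
    for k in range(cp_index):
        if 0xD800 <= ord(u[i]) <= 0xDBFF:   # high surrogate: skip pair
            i = i + 2
        else:
            i = i + 1
    return i

def substring_cp(u, start, end):
    # u[start:end] measured in code points instead of storage units
    return u[cp_to_index(u, start):cp_to_index(u, end)]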
From guido@digicool.com Tue Jun 26 22:08:19 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 26 Jun 2001 17:08:19 -0400 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? In-Reply-To: Your message of "Mon, 25 Jun 2001 22:28:45 +0200." <3B379EFD.40F88FC5@lemburg.com> References: <3B3722DB.1FF54794@lemburg.com> <4ak820g418.fsf@kern.srcf.societies.cam.ac.uk> <006501c0fd82$8b5ba9f0$0c680b41@c1340594a> <005b01c0fd9d$e4469e60$1a2cf7c2@oakdale2> <013901c0fdac$d27d1970$0c680b41@c1340594a> <3B379EFD.40F88FC5@lemburg.com> Message-ID: <200106262108.f5QL8Jd18469@odiug.digicool.com> > Mark Davis wrote: > > > > That is an interesting approach; one that basically amounts to some > > convenience functions. For example, instead of writing: > > > > myString.substring(myString.cpToIndex(3), myString.cpToIndex(5)); > > > > you could write: > > > > myString.substring(3, 5, myString.CODEPOINT); > > > > This hides some of the work, when someone is working in code points. The > > performance cost is still there, of course; using code point indexes > > requires each operation to examine every code unit up to that point, which > > is much more expensive. > > Good idea! > > > For a general programming language or string library, I'm not sure about > > implementing this pattern throughout. I know in the ICU library, for > > example, we have a significant number of functions that take offsets into > > strings. Having such a parameter on all of them would be clumsy, when most > > of the time people are simply working in code units. > > In Python this would certainly be an elegant way to add the > code point indexing functionality (Python supports optional arguments > with default values). > > -- > Marc-Andre Lemburg I still think this should be an add-on module, to emphasize we're not eager to do a whole lot of surrogate support. --Guido van Rossum (home page: http://www.python.org/~guido/) From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 22:15:19 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Tue, 26 Jun 2001 23:15:19 +0200 Subject: [I18n-sig] UCS-4 configuration Message-ID: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> I've now a patch on SF which does the autoconf machinery for the proposed simultaneous support for narrow and wide Py_UNICODE definitions. https://sourceforge.net/tracker/index.php?func=detail&aid=436496&group_id=5470&atid=305470 In particular,

--enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses wchar_t if it fits
--enable-unicode=ucs4 configures a wide Py_UNICODE likewise
--enable-unicode configures Py_UNICODE to wchar_t if available, and to UCS-4 if not; this is the default

The intention is that --disable-unicode, or --enable-unicode=no, removes the Unicode type altogether; this is not yet implemented (it only defines a Py_USING_UNICODE macro that can be used to wrap Unicode support). With a wide Py_UNICODE, this patch also

- supports UTF-8 and UTF-16 encodings of the complete Unicode range
- supports unichr and \U literals:

>>> u"\U00102030"
u'\U00102030'
>>> len(u"\U00102030")
1
>>> u"\U00102030".encode("utf-8")
'\xf4\x82\x80\xb0'
>>> u"\U00102030".encode("utf-16")
'\xff\xfe\xc8\xdb0\xdc'

Regards, Martin From fredrik@pythonware.com Tue Jun 26 23:04:10 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 27 Jun 2001 00:04:10 +0200 Subject: [I18n-sig] UCS-4 configuration References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> Message-ID: <005501c0fe8b$f0134d80$4ffa42d5@hagrid> Martin v. Loewis wrote: > I've now a patch on SF which does the autoconf machinery for the > proposed simultaneous support for narrow and wide Py_UNICODE > definitions. > > https://sourceforge.net/tracker/index.php?func=detail&aid=436496&group_id=5470&atid=305470 ouch. duplicate effort here.
looks like your patch doesn't support sizeof(short) > 2 (e.g. cray). except for that, it's not too different from what I was working on. go ahead and check it in. From martin@loewis.home.cs.tu-berlin.de Tue Jun 26 23:50:24 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 00:50:24 +0200 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: <005501c0fe8b$f0134d80$4ffa42d5@hagrid> (fredrik@pythonware.com) References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> Message-ID: <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> > ouch. duplicate effort here. Sorry about this. When I noticed you had some code committed, I thought "release early, release often". > go ahead and check it in. Done. Some clean-up could be still applied, such as defining only one of USE_UCS4_STORAGE and Py_UNICODE_SIZE, but I'll leave that to your judgement (i.e. I won't attempt any further changes at the moment unless asked). > looks like your patch doesn't support sizeof(short) > 2 (e.g. cray). > except for that, it's not too different from what I was working on. Indeed it doesn't. How are you going to solve this? Generating UCS-2/UTF-16 when you have no two-byte type is not easy, unless you plan to do all byte operations yourself. Anyway, at the moment, it is a compile time error if short is not two bytes. I hope I found all places where Py_UCS2 should be used. Regards, Martin P.S. This patch makes the test suite fail in four byte mode, when trying to check the output of u'\ud800\udc02'.encode('utf-8'). IMO, all literals denoting surrogates should be replaced with \U literals in test_unicode; this is not done yet. From gs234@cam.ac.uk Wed Jun 27 00:15:26 2001 From: gs234@cam.ac.uk (Gaute B Strokkenes) Date: 27 Jun 2001 00:15:26 +0100 Subject: [I18n-sig] Re: Unicode surrogates: just say no! In-Reply-To: <200106261700.f5QH0ih14770@odiug.digicool.com> (Guido van Rossum's message of "Tue, 26 Jun 2001 13:00:44 -0400") References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> Message-ID: <4almmeq1dd.fsf@kern.srcf.societies.cam.ac.uk> On Tue, 26 Jun 2001, guido@digicool.com wrote: > Let me use this as an excuse to start a discussion on how far we > should go in ruling out illegal code points. > > I think that *codecs* would be wise to be picky about illegal code > points (except for the special UTF-16-pass-through option). > > But I think that the *datatype implementation* should allow storage > units to take every possible value, whether or not it's illegal > according to Unicode, either in isolation or in context. It's much > easier to implement that way, and I believe that the checks ought to > be in other tools. I think that it is a good idea to allow users to stick any scalar value that will fit into the internal representation into a Python Unicode string, and that unichr(some value > 0xFFFF) should return a Unicode string with len(unichr(some value > 0xFFFF)) = 2 when UCS-2 is being used. There are a few issues that need to be considered, however: 1) Sort order. Unicode strings should sort in Unicode lexicographical order. With UCS-4 this is easy; just compare the Py_UNICODE values one by one like C does with strcmp(). With UTF-16 this is more complicated when surrogates get involved. Basically, you go through the strings being compared until you find the first difference. 
If both characters at this point are in the BMP or both are high surrogates, just compare them as usual. However, if one is in the BMP and the other is a surrogate, you need to make sure that the string with the surrogate in it sorts after the one with the BMP character. Straight comparison won't work since there are characters in the BMP with numerical values greater than those of surrogates. I believe that this is the right thing to do when Py_UNICODE is UCS-2 since the added complexity is only O(1) per string comparison and is very easy to implement. This will ensure that cmp(unichr(0xFFFD), unichr(0x10ABCD)) will work consistently and correctly for both UCS-2 and UCS-4. 2) There is an incompatibility between the two approaches since unichr(high surrogate) + unichr(low surrogate) will magically be the same as unichr(the appropriate astral codepoint) when UCS-2 is used. With UCS-4 they will not; it will result in a string with two values that have no well-defined meaning. I don't think this is a show-stopper, but people will need to be made aware. > PEP time? Quite possibly... -- Big Gaute http://www.srcf.ucam.org/~gs234/ .. does your DRESSING ROOM have enough ASPARAGUS? From gs234@cam.ac.uk Wed Jun 27 00:30:00 2001 From: gs234@cam.ac.uk (Gaute B Strokkenes) Date: 27 Jun 2001 00:30:00 +0100 Subject: [I18n-sig] Re: Unicode surrogates: just say no! In-Reply-To: <4almmeq1dd.fsf@kern.srcf.societies.cam.ac.uk> (Gaute B Strokkenes's message of "27 Jun 2001 00:15:26 +0100") References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <4almmeq1dd.fsf@kern.srcf.societies.cam.ac.uk> Message-ID: <4ad77qq0p3.fsf@kern.srcf.societies.cam.ac.uk> On 27 Jun 2001, gs234@cam.ac.uk wrote: > > 1) Sort order. Unicode strings should sort in Unicode > lexicographical order. With UCS-4 this is easy; just compare the > Py_UNICODE values one by one like C does with strcmp(). With > UTF-16 this is more complicated when surrogates get involved. > Basically, you go through the strings being compared until you > find the first difference. If both characters at this point are > in the BMP or both are high surrogates, just compare them as > usual. However, if one is in the BMP and the other is a > surrogate, you need to make sure that the string with the > surrogate in it sorts after the one with the BMP character. > Straight comparison won't work since there are characters in the > BMP with numerical values greater than those of surrogates. Speaking of the devil indeed: mere seconds after I sent this, the following was posted to the unicode list: On Tue, 26 Jun 2001, mark@macchiato.com wrote: > I asked our performance czar to run a test comparing the performance > of the two ICU utf-16 strcmp routines (UTF-16 binary order and > UTF-8/32 binary order). While I want to caution that the results are > preliminary, here they are: > > "Test File u_strcmp u_strcmpCodePointOrder > --------------------------------------------------- > Asian Names 81 ns 83 ns / call > Latin Names 127 ns 124 ns > > > The test is a binary search of a sorted list of roughly 10000 names. > The Asian names are quite a bit shorter, which probably accounts for > the time difference between them and the Latin names. > > The code path through the u_strcmpCodePointOrder function has > (statistically, anyhow) exactly one added simple if relative to > u_strcmp. The timing differences are repeatable on my machine, but > are probably mostly noise from code alignment and the like..."
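To make the rule concrete, here is a rough Python sketch of a code-point-order comparison (codepoint_cmp and _adjust are invented names, not code from any patch); it uses the same single-extra-if rotation trick as the ICU routine quoted above:

    def _adjust(u):
        # Rotate UTF-16 code units so that plain integer comparison of
        # adjusted values matches code point order: surrogates (which
        # always stand for characters >= 0x10000) move above all BMP
        # code units.
        if u < 0xD800:
            return u            # low BMP: unchanged
        if u < 0xE000:
            return u + 0x2000   # surrogate: rotate to the top
        return u - 0x800        # high BMP: slide below the surrogates

    def codepoint_cmp(s, t):
        # s, t: sequences of UTF-16 code units (ints in 0..0xFFFF)
        for a, b in zip(s, t):
            if a != b:
                return cmp(_adjust(a), _adjust(b))
        return cmp(len(s), len(t))

With this adjustment, _adjust(0xFFFD) == 0xF7FD sorts below _adjust(0xD800) == 0xF800, so the string containing the surrogate pair for 0x10ABCD compares greater than u'\ufffd', just as it would under UCS-4.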
-- Big Gaute http://www.srcf.ucam.org/~gs234/ How's it going in those MODULAR LOVE UNITS?? From guido@digicool.com Wed Jun 27 00:34:16 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 26 Jun 2001 19:34:16 -0400 Subject: [I18n-sig] Re: Unicode surrogates: just say no! In-Reply-To: Your message of "27 Jun 2001 00:15:26 BST." <4almmeq1dd.fsf@kern.srcf.societies.cam.ac.uk> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <4almmeq1dd.fsf@kern.srcf.societies.cam.ac.uk> Message-ID: <200106262334.f5QNYG418603@odiug.digicool.com> > 1) Sort order. Unicode strings should sort in Unicode lexicographical > order. With UCS-4 this is easy; just compare the Py_UNICODE values > one by one like C does with strcmp(). With UTF-16 this is more > complicated when surrogates get involved. Basically, you go > through the strings being compared until you find the first > difference. If both characters at this point are in the BMP or > both are high surrogates, just compare them as usual. However, if > one is in the BMP and the other is a surrogate, you need to make > sure that the string with the surrogate in it sorts after the one > with the BMP character. Straight comparison won't work since there > are characters in the BMP with numerical values greater than those > of surrogates. > > I believe that this is the right thing to do when Py_UNICODE is > UCS-2 since the added complexity is only O(1) per string comparison > and is very easy to implement. This will ensure that > cmp(unichr(0xFFFD), unichr(0x10ABCD)) will work consistently and > correctly for both UCS-2 and UCS-4. I'm neutral on this one; on the one hand I think we should minimize the surrogate support outside the codecs, on the other hand this makes some sense. > 2) There is an incompatibility between the two approaches since > unichr(high surrogate) + unichr(low surrogate) will magically be > the same as unichr(the approriate astral codepoint) when UCS-2 is > used. With UCS-4 they will not; it will result in a string with > two values that have no well-defined meaning. > > I don't think this is a show-stopper, but people will need to be > made aware. Agreed. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Wed Jun 27 00:34:16 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 26 Jun 2001 19:34:16 -0400 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: Your message of "Wed, 27 Jun 2001 00:50:24 +0200." <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> Message-ID: <200106262334.f5QNYGb18598@odiug.digicool.com> Wow, this is so cool! Seems we don't need a PEP... Just an update to the NEWS file and some changes to the docs and test suite. > > looks like your patch doesn't support sizeof(short) > 2 (e.g. cray). > > except for that, it's not too different from what I was working on. > > Indeed it doesn't. How are you going to solve this? Generating > UCS-2/UTF-16 when you have no two-byte type is not easy, unless you > plan to do all byte operations yourself. Don't be a wimp. :-) As Tim Peters keeps pointing out, it's really not that hard to write such code, e.g. using the occasional mask operation. And a good compiler will remove the masks that don't do anything. 
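For illustration, the masking idea sketched in Python (add16 and swap16 are invented names; the same trick carries over to C integer types wider than 16 bits):

    def add16(a, b):
        # 16-bit wraparound addition done in a wider integer type:
        # compute in full width, then mask back to the low 16 bits.
        return (a + b) & 0xFFFF

    def swap16(u):
        # byte-swap a 16-bit value; without the final mask, the left
        # shift would leak bits past bit 15 on a wider type.
        return ((u >> 8) | (u << 8)) & 0xFFFF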
> Anyway, at the moment, it is a compile time error if short is not two > bytes. I hope I found all places where Py_UCS2 should be used. Me too. I hope for the Cray folks that short will be allowed to vary properly. Another loose end: define sys.maxunicode. > Regards, > Martin > > P.S. This patch makes the test suite fail in four byte mode, when > trying to check the output of u'\ud800\udc02'.encode('utf-8'). IMO, > all literals denoting surrogates should be replaced with \U > literals in test_unicode; this is not done yet. Here's another weird failure in 4-byte mode, with a manually constructed surrogate pair (using marshal, but direct use of u.encode('utf8') would show the same problem): >>> u = u'\ud800\udc00' >>> u u'\ud800\udc00' >>> len(u) 2 >>> import marshal >>> s = marshal.dumps(u) >>> s 'u\x06\x00\x00\x00\xed\xa0\x80\xed\xb0\x80' >>> marshal.loads(s) Traceback (most recent call last): File "<stdin>", line 1, in ? UnicodeError: UTF-8 decoding error: illegal encoding >>> Note how the utf8 codec has encoded the surrogate pair as two 3-byte utf8 sequences. I think it should either spit out an error or (I think this is better -- "be forgiving in what you accept") recognize the surrogate pair and spit out a 4-byte utf8 sequence. Note that in 2-byte mode, this same string literal can be marshalled and unmarshalled just fine! I think I'm going to withdraw my recommendation that in 4-byte mode \U and unichr() would accept any 32-bit value; the use of UTF8 by marshal effectively rules this out. Or should we change the marshalling format to do something that's more transparent? It feels uncomfortable that in 2-byte mode we can easily create unicode strings containing illegal sequences (e.g. lone surrogates), but these strings can't be marshalled. Marshal has no business being judgemental about the value of the data. I think we can work out most of the backward compatibility issues by switching to a new marshal tag byte (e.g. 'U'). PS. I checked in a tiny improvement to the unichr() code. --Guido van Rossum (home page: http://www.python.org/~guido/) From paulp@ActiveState.com Wed Jun 27 00:40:23 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Tue, 26 Jun 2001 16:40:23 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B38F10B.CCA55437@ActiveState.com> <200106262036.f5QKaX618195@odiug.digicool.com> Message-ID: <3B391D67.7D7D3C1D@ActiveState.com> Guido van Rossum wrote: > >... > > But you don't have to maintain it. I say that this particular varying > behavior is just as acceptable as the varying int size. Aren't we trying to get rid of the maximum int size? And even if we keep it, the rule for working with large integers is simple: calculations work on particular ranges of inputs. Period. If I understand correctly, the surrogates proposal will (for example) change this from legal to illegal: if unichr(0x10000) in somestring: ... Because sometimes unichr is a single-char string and sometimes it will actually produce a 2-byte encoding. > Do you want to write the PEP? If nobody pipes up to say that they've started it, then I'll do a first draft tonight. I presume you mean write the PEP up as you described it and not as I would like it.
So I guess I would want to cover * what is the issue * what are surrogates * how Py_UNICODE affects literals and unichr * rationale for doing surrogate generation * description of the configure switches * description of why other options were rejected -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From guido@digicool.com Wed Jun 27 00:47:05 2001 From: guido@digicool.com (Guido van Rossum) Date: Tue, 26 Jun 2001 19:47:05 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Tue, 26 Jun 2001 16:40:23 PDT." <3B391D67.7D7D3C1D@ActiveState.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B38F10B.CCA55437@ActiveState.com> <200106262036.f5QKaX618195@odiug.digicool.com> <3B391D67.7D7D3C1D@ActiveState.com> Message-ID: <200106262347.f5QNl5O18720@odiug.digicool.com> > Aren't we trying to get rid of the maximum int size? And even if we keep it, > the rule for working with large integers is simple: calculations work on > particular ranges of inputs. Period. Well... 0xffffffff is negative on 32-bit systems but positive on 64-bit systems, and there are other anomalies like it. It's not ideal, but given the forces at work (some folks need UCS-4, some folks don't want to waste 2 extra bytes per character, we don't want to revise the implementation to hide the existence of surrogates in the 2-byte version) I think it's the best we can offer. > If I understand correctly, the surrogates proposal will (for example) > change this from legal to illegal: > > if unichr(0x10000) in somestring: > ... > > Because sometimes unichr is a single-char string and sometimes it will > actually produce a 2-byte encoding. Yes, good example for the PEP. :-) > > Do you want to write the PEP? > > If nobody pipes up to say that they've started it, then I'll do a first > draft tonight. I presume you mean write the PEP up as you described it > and not as I would like it. Great, Paul! I'm tired of writing PEPs myself today. > So I guess I would want to cover > > * what is the issue > * what are surrogates > * how Py_UNICODE affects literals and unichr > * rationale for doing surrogate generation > * description of the configure switches > * description of why other options were rejected Yes. You can quote liberally from the i18n list. Use PEP number 261. Thanks so much! --Guido van Rossum (home page: http://www.python.org/~guido/) From gs234@cam.ac.uk Wed Jun 27 00:52:17 2001 From: gs234@cam.ac.uk (Gaute B Strokkenes) Date: 27 Jun 2001 00:52:17 +0100 Subject: [I18n-sig] Re: Unicode surrogates: just say no! In-Reply-To: <15160.33467.686959.415021@cymru.basistech.com> (Tom Emerson's message of "Tue, 26 Jun 2001 08:40:27 -0400") References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <15160.33467.686959.415021@cymru.basistech.com> Message-ID: <4aae2upzny.fsf@kern.srcf.societies.cam.ac.uk> On Tue, 26 Jun 2001, tree@basistech.com wrote: > > UTF-8 can be used to encode each half of a surrogate pair > (resulting in six bytes for the character) --- a proposal for this > was presented by PeopleSoft at the UTC meeting last month. UTF-8 can > also encode the code-point directly in four bytes. This is wrong. It is a bug to encode a non-BMP character with six bytes by pretending that the (surrogate) values used in the UTF-16 representation are BMP characters and encoding the character as though it was a string consisting of that character.
It is also a bug to interpret such a six-byte sequence as a single character. This was clarified in Unicode 3.1. There are several good reasons for this, such as unique representation, security etc. etc. Personally, I think that the codecs should report an error in the appropriate fashion when presented with a python unicode string which contains values that are not allowed, such as lone surrogates. While it may be convenient to allow the python programmer to stick all kinds of junk into a python unicode string it is not reasonable for the python programmer to expect that this junk can be transformed into something meaningful when he wants to encode it with some UTF or the other. This has the advantage that whenever I run something through a codec the result is always a meaningful object of the appropriate type. For instance, I believe that given a python unicode string conversion to UCS-2 should always fail if the string contains surrogates (lone or otherwise) since UCS-2 is defined not to have surrogates. Conversion to UTF-16 or UTF-32 should fail whenever there is a lone surrogate, and so on. (These are sufficient but not necessary conditions for why such conversions should fail.) Of course, it may be convenient to offer alternative codecs and variations of existing ones that have a more lenient policy for use when the programmer so wishes, for instance to interact with buggy implementations. However, this should not be the default. Is the proposal you're referring to the "UTF-8s" proposal by Oracle et al.? This was brought up on the unicode list some time ago and met with massive negative response, along the lines of "oh my god, not another UTF; we have too many already" and "it is broken to sort unicode strings by looking at the words in the UTF-16 representation; you should compare in code point order instead" (this being the reason why UTF-8s was proposed: Oracle and certain other database vendors have old and buggy unicode implementations that do not sort UTF-16 strings in codepoint order and wanted UTF-8s so that a traditional C strcmp() on a UTF-8s string will give the same result as comparing the same string in UTF-16 representation word by word. Note that UTF-8 already has the corresponding property for UCS-4 / UTF-32; this was one of the design criteria of UTF-8. Essentially, Oracle & co. want their old mistakes canonised.) -- Big Gaute http://www.srcf.ucam.org/~gs234/ Did an Italian CRANE OPERATOR just experience uninhibited sensations in a MALIBU HOT TUB? From gs234@cam.ac.uk Wed Jun 27 01:22:22 2001 From: gs234@cam.ac.uk (Gaute B Strokkenes) Date: 27 Jun 2001 01:22:22 +0100 Subject: [I18n-sig] Re: UCS-4 configuration In-Reply-To: <200106262334.f5QNYGb18598@odiug.digicool.com> (Guido van Rossum's message of "Tue, 26 Jun 2001 19:34:16 -0400") References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <200106262334.f5QNYGb18598@odiug.digicool.com> Message-ID: <4a4rt2py9t.fsf@kern.srcf.societies.cam.ac.uk> On Tue, 26 Jun 2001, guido@digicool.com wrote: > Here's another weird failure in 4-byte mode, with a manually > constructed surrogate pair (using marshal, but direct use of > u.encode('utf8') would show the same problem): > >>>> u = u'\ud800\udc00' >>>> u > u'\ud800\udc00' >>>> len(u) > 2 >>>> import marshal >>>> s = marshal.dumps(u) >>>> s > 'u\x06\x00\x00\x00\xed\xa0\x80\xed\xb0\x80' >>>> marshal.loads(s) > Traceback (most recent call last): > File "<stdin>", line 1, in ?
> UnicodeError: UTF-8 decoding error: illegal encoding >>>> > > Note how the utf8 codec has encoded the surrogate pair as two 3-byte > utf8 sequences. I think it should either spit out an error or (I > think this is better -- "be forgiving in what you accept") recognize > the surrogate pair and spit out a 4-byte utf8 sequence. Note that > in 2-byte mode, this same string literal can be marshalled and > unmarshalled just fine! I think that the best compromise is to discourage programmers from creating non-BMP characters by manually splicing together surrogate values, and encourage them to use unichr(appropriate non-BMP value) instead. This is not only more readable, but avoids this kind of problem. Perhaps the Python parser ought to produce a warning when it encounters such a string constant, to help catch this sort of bug. On the other hand, disallowing unichr(some surrogate value) is probably going too far: you should either allow all non-sensical values, or none at all. > I think I'm going to withdraw my recommendation that in 4-byte mode > \U and unichr() would accept any 32-bit value; the use of UTF8 by > marshal effectively rules this out. UTF-8 is easily extended to store any 31-bit value; in fact the current ISO definition of UTF-8 is like that, though it will be changed to match the Unicode definition in the next version. There is an obvious tweak to store 32 bit values as well. Of course, using such a scheme means that UTF-8 is not used for marshalling, just some closely related encoding. But since we "own" the marshalling format, this might not be such a problem. > Or should we change the marshalling format to do something that's > more transparent? It feels uncomfortable that in 2-byte mode we can > easily create unicode strings containing illegal sequences > (e.g. lone surrogates), but these strings can't be marshalled. > Marshal has no business being judgemental about the value of the > data. Just encode the lone surrogate as though it was a proper Unicode scalar value. This is a no-no if you go by the standard and I know that I've been arguing against doing things like that in the standard UTF-8 codec, but in the context of a private file format I think that it is ok to use a private variation of UTF-8. All we have to do is make sure that it is referred to by a name different from UTF-8 ("marshall" would be fine, I suppose) and that we never expose this private goo to anything outside Python. -- Big Gaute http://www.srcf.ucam.org/~gs234/ I am having a CONCEPTION-- From tim.one@home.com Wed Jun 27 02:38:34 2001 From: tim.one@home.com (Tim Peters) Date: Tue, 26 Jun 2001 21:38:34 -0400 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> Message-ID: [/F] > looks like your patch doesn't support sizeof(short) > 2 (e.g. cray). > except for that, it's not too different from what I was working on. [Martin v. Loewis] > Indeed it doesn't. How are you going to solve this? Generating > UCS-2/UTF-16 when you have no two-byte type is not easy, unless you > plan to do all byte operations yourself. As opposed to what, having elves do them for us while we sleep ? You need at least 16 bits, but it should be no problem if you have more than that -- all it takes is a tiny bit of care, and standard C (not even C99) does not guarantee that any integral type has exactly 2 bytes (or 4, or 8). All C guarantees is minimal sizes, and they refused to make stronger guarantees than that because the real world wouldn't let them.
I have decades of experience with this, so either trust me on it or point me at code you think is a problem. The saving grace is that any sequence of 16-bit operations involving +, -, *, &, |, ^ and << yields exactly the same result if you do it with any number of bits >= 16, then take the last 16 bits at the end. /, ~ and >> *may* require a little thought. Note that MAL made a similar argument in the Cray T3E bug report, I asked him to point me at some troublesome code, and it turned out that didn't need *any* changes to work correctly when sizeof(Py_UNICODE)==4 (or 8, or 10000000000 on the next Cray). > Anyway, at the moment, it is a compile time error if short is not two > bytes. Yes, I discovered that when the Windows build fell on its face. Just ribbing you there -- 'twas a trivial fix. From mal@lemburg.com Mon Jun 25 21:31:33 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Mon, 25 Jun 2001 22:31:33 +0200 Subject: [I18n-sig] Re: How does Python Unicode treat surrogates? References: <3B3722DB.1FF54794@lemburg.com> <4ak820g418.fsf@kern.srcf.societies.cam.ac.uk> <006501c0fd82$8b5ba9f0$0c680b41@c1340594a> <3B376B03.A2A84AE1@lemburg.com> <00f101c0fda3$4a2529e0$0c680b41@c1340594a> Message-ID: <3B379FA5.5E3E81DD@lemburg.com> Mark Davis wrote: > > > My question was targeting into a slightly different direction, > > though. I know that UTF-16 does not allow lone surrogates, but > > how does Unicode itself treat these ? If I have a sequence of Unicode > > code points which includes an isolated surrogate code point, > > would this be considered a legal Unicode sequence or not ? > > It is a legal Unicode code point sequence. However, it is not a legal > Unicode *character* sequence, since it contains code points that by > definition cannot be used to represent characters. So it's basically a matter of viewing a string as a sequence of characters vs. a sequence of code points. Thanks for the explanation, -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 07:08:58 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 08:08:58 +0200 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: References: Message-ID: <200106270608.f5R68wY02785@mira.informatik.hu-berlin.de> > I have decades of experience with this, so either trust me on it or > point me at code you think is a problem. I would never remotely consider questioning your authority, how could I? The specific code in question is in PyUnicode_DecodeUTF16. It gets a char*, and converts it to a Py_UCS2* (where Py_UCS2 is unsigned short). It then fetches one Py_UCS2 after another, byte-swapping if appropriate, and advances the Py_UCS2* by one. The intention is that this retrieves the bytes of the input in pairs. Is that code correct even if sizeof(unsigned short)>2? If so, I can just remove the test that it ought to be 2. If not, how should that be rewritten? Regards, Martin From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 06:54:22 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 07:54:22 +0200 Subject: [I18n-sig] Re: Unicode surrogates: just say no!
In-Reply-To: <4aae2upzny.fsf@kern.srcf.societies.cam.ac.uk> (message from Gaute B Strokkenes on 27 Jun 2001 00:52:17 +0100) References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <15160.33467.686959.415021@cymru.basistech.com> <4aae2upzny.fsf@kern.srcf.societies.cam.ac.uk> Message-ID: <200106270554.f5R5sMW02751@mira.informatik.hu-berlin.de> > This is wrong. It is a bug to encode a non-BMP character with six > bytes by pretending that the (surrogate) values used in the UTF-16 > representation are BMP characters and encoding the character as though > it was a string consisting of that character. It is also a bug to > interpret such a six-byte sequence as a single character. This was > clarified in Unicode 3.1. It seems to be unclear to many, including myself, what exactly was clarified with Unicode 3.1. Where exactly does it say that processing a six-byte two-surrogates sequence as a single character is non-conforming? What exactly does it say that the conforming behaviour should be? > Personally, I think that the codecs should report an error in the > appropriate fashion when presented with a python unicode string which > contains values that are not allowed, such as lone surrogates. Other people have read Unicode 3.1 and came to the conclusion that it mandates that implementations accept such a character... Regards, Martin From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 07:45:11 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 08:45:11 +0200 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: <200106262334.f5QNYGb18598@odiug.digicool.com> (message from Guido van Rossum on Tue, 26 Jun 2001 19:34:16 -0400) References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <200106262334.f5QNYGb18598@odiug.digicool.com> Message-ID: <200106270645.f5R6jBS06348@mira.informatik.hu-berlin.de> > Another loose end: define sys.maxunicode. Breaking my promise not to touch the code, I've added this. I was not quite sure what type you meant to see in sys.maxunicode; I took integer, since U+FFFF is a non-character. > Note how the utf8 codec has encoded the surrogate pair as two 3-byte > utf8 sequences. I think it should either spit out an error or (I > think this is better -- "be forgiving in what you accept") recognize > the surrogate pair and spit out a 4-byte utf8 sequence. Note that in > 2-byte mode, this same string literal can be marshalled and > unmarshalled just fine! That was actually the same problem as with the test case: the UTF-8 encoder would not use the surrogate code in wide mode. I've removed that restriction, so this test now also passes. > Or should we change the marshalling format to do something that's more > transparent? It feels uncomfortable that in 2-byte mode we can easily > create unicode strings containing illegal sequences (e.g. lone > surrogates), but these strings can't be marshalled. You mean, they cannot be unmarshalled? With the current code, marshalling them works fine... There was another problem with the unicode database; the code assumed that adding two Py_UNICODE values would wrap around at 65536. With that fixed and committed, the test suite passes for me. Regards, Martin From gs234@cam.ac.uk Wed Jun 27 08:52:44 2001 From: gs234@cam.ac.uk (Gaute B Strokkenes) Date: 27 Jun 2001 08:52:44 +0100 Subject: [I18n-sig] Re: Unicode surrogates: just say no! 
References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <15160.33467.686959.415021@cymru.basistech.com> <4aae2upzny.fsf@kern.srcf.societies.cam.ac.uk> <200106270554.f5R5sMW02751@mira.informatik.hu-berlin.de> Message-ID: <4ahex2xstv.fsf@kern.srcf.societies.cam.ac.uk> [I'm CC-ing the unicode list again because I'm doing some fairly sophisticated interpretation of the Unicode conformance requirements below and I'd like to have someone with more experience with this check my reasoning.] On Wed, 27 Jun 2001, martin@loewis.home.cs.tu-berlin.de wrote: >> This is wrong. It is a bug to encode a non-BMP character with six >> bytes by pretending that the (surrogate) values used in the UTF-16 >> representation are BMP characters and encoding the character as >> though it was a string consisting of that character. It is also a >> bug to interpret such a six-byte sequence as a single character. >> This was clarified in Unicode 3.1. > > It seems to be unclear to many, including myself, what exactly was > clarified with Unicode 3.1. See the section called "UTF-8 Corrigendum" in TR 27. It explains it all in detail. > Where exactly does it say that processing a six-byte two-surrogates > sequence as a single character is non-conforming? See D39(c) at . This defines such a six-byte sequence as an "irregular UTF-8 code unit sequence" and goes on to state that, as a consequence of C12, conforming processes are not allowed to generate such sequences. This really ought to be obvious anyway: UTF-8 is defined to represent a given USV with 1 to 4 bytes, so clearly 6 is not possible. Conversely, C12(a) states that a conformant process can not produce "ill-formed code unit sequences" while producing data in a UTF. The definition of this term is given in D30 as a code unit sequence that can not be produced from a sequence of unicode scalar values. This is where things get somewhat more interesting. Somewhat surprisingly, the definition of "Unicode Scalar Value" has not been changed from 3.0 to 3.1. The reason why one might expect this to have changed is that in 3.0 UTF-16 was "the" unicode format, so that USVs were defined in terms of UTF-16 code points. In 3.1 it is stated elsewhere that different UTFs are simply concrete ways to store sequences of USVs. However, the definition of USV is still either: A value in the range 0 - 0xFFFF which is not a high or low surrogate in UTF-16, or: a value in the range 0x10000 - 0x10FFFF which is obtained by taking a pair of values that form a high and low surrogate respectively in UTF-16 and applying the usual formula. Since there is no way you can form a value in the range 0xD800 - 0xDFFF in this fashion it follows that a USV can not be in this range. Therefore you are not allowed to create a 3 byte sequence that is the UTF-8 encoding of a value in this range. Therefore you are not allowed to generate pairs of such sequences either. I hope this is all clear. One very important thing to keep in mind when doing this stuff is that 3.1 is a brand new standard, less than one and a half months old. A consequence of this is that most of the material on the Unicode web site still refers to version 3.0, so you have to be very careful to check that the information you're looking at is in fact up to date. (The only updated information I could find was TR 27 and [probably] the data tables.) > What exactly does it say that the conforming behaviour > should be? Argh. Treat it as an error, probably. You go and read the standard yourself, my head is already hurting.
8-) >> Personally, I think that the codecs should report an error in the >> appropriate fashion when presented with a python unicode string >> which contains values that are not allowed, such as lone >> surrogates. > > Other people have read Unicode 3.1 and came to the conclusion that > it mandates that implementations accept such a character... Well, they're wrong. The standard is clear as ink in this regard. -- Big Gaute http://www.srcf.ucam.org/~gs234/ I can't think about that. It doesn't go with HEDGES in the shape of LITTLE LULU -- or ROBOTS making BRICKS... From mal@lemburg.com Wed Jun 27 08:52:31 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Wed, 27 Jun 2001 09:52:31 +0200 Subject: [I18n-sig] Unicode Maintenance Message-ID: <3B3990BF.5C8410A9@lemburg.com> Looking at the recent burst of checkins for the Unicode implementation completely bypassing the standard SF procedure and possible comments I might have on the different approaches, I guess I've been ruled out as maintainer and designer of the Unicode implementation. Well, I guess that's how things go. Was nice working for you guys, but no longer is... I'm tired of having to defend myself against meta-comments about the design, uncontrolled checkins and no true backup about my standing in all this from Guido. Perhaps I am misunderstanding the role of a maintainer and implementation designer, but as it is all respect for the work I've put into all this seems faded. That's the conclusion I draw from recent postings by Martin and Fredrik and their nightly "takeover". Thanks, -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From tim.one@home.com Wed Jun 27 09:24:44 2001 From: tim.one@home.com (Tim Peters) Date: Wed, 27 Jun 2001 04:24:44 -0400 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: <200106270608.f5R68wY02785@mira.informatik.hu-berlin.de> Message-ID: [Martin v. Loewis] > I would never remotely consider questioning your authority, how could I? LOL! If authority were of any help in getting software to work, Guido wouldn't need any of us: he could just scowl at it, and it would all fall into place. > The specific code in question is in PyUnicode_DecodeUTF16. It gets a > char*, and converts it to a Py_UCS2* (where Py_UCS2 is unsigned short). > It then fetches one Py_UCS2 after another, byte-swapping if appropriate, > and advances the Py_UCS2* by one. The intention is that this retrieves > the bytes of the input in pairs. > > Is that code correct even if sizeof(unsigned short)>2? Oh no. Clearly, if sizeof(Py_UCS2) > 2, it will read more than 2 bytes each time. But the *obvious* way to read two bytes is to use a char* pointer! Say q and e were declared const unsigned char* instead of Py_UCS2*. Then for big-endian getting "the next" char is just ch = (q[0] << 8) | q[1]; q += 2; and swap "0" and "1" for a little-endian machine. The code would get substantially simpler. In fact, you can skip all the embedded #ifdefs and repeated (bo == 1), (bo == -1) tests by setting up invariants int lo_index, hi_index; appropriately at the start before the loop-- setting one of those to 1 and the other to 0 --and then do ch = (q[hi_index] << 8) | q[lo_index]; q += 2; unconditionally inside the loop whenever fetching another pair.
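For illustration only, the same hi/lo index trick transcribed into Python (decode_utf16_units is an invented helper, not the actual C routine):

    def decode_utf16_units(data, bigendian):
        # data: byte string of even length; returns the 16-bit code
        # units, fetching the input one byte pair at a time so that no
        # exactly-two-byte integer type is ever needed.
        if bigendian:
            hi_index, lo_index = 0, 1
        else:
            hi_index, lo_index = 1, 0
        units = []
        for i in range(0, len(data), 2):
            units.append((ord(data[i + hi_index]) << 8) |
                         ord(data[i + lo_index]))
        return units

The byte order is decided once, before the loop; the loop body itself is then identical for both endiannesses, which is the whole point of the invariant.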
Now C doesn't guarantee that a byte is 8 bits either, but that's one thing that's true even on a Cray (they actually read 64 bits under the covers and shift+mask, but it looks like "8 bits" to C code); I don't know of any modern box on which it isn't true, and it's exceedingly unlikely any new architecture won't play along. Everything else should "just work" then. BTW, the existing byte-swapping code doesn't work right either for sizeof(Py_UCS2) > 2, because in ch = (ch >> 8) | (ch << 8); there's an assumption that the left shift is end-off. Fetch a byte at a time as above and none of that fiddling is needed. Else the existing byte-swapping code needs either ch &= 0xffff; after, or ch = (ch >> 8) | ((ch & 0xff) << 8); in the body. But we'd be better off getting rid of Py_UCS2 thingies entirely in this routine (they don't *mean* "UCS2", they *mean* "exactly two bytes", and that can't always be met). From JMachin@Colonial.com.au Wed Jun 27 09:27:50 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Wed, 27 Jun 2001 18:27:50 +1000 Subject: [I18n-sig] validity of lone surrogates (was Re: Unicode surrogates: just say no!) Message-ID: <9F2D83017589D211BD1000805FA70CA703B139EF@ntxmel03.cmutual.com.au> -----Original Message----- From: Gaute B Strokkenes [mailto:gs234@cam.ac.uk] Sent: Wednesday, 27 June 2001 17:53 To: Martin v. Loewis Cc: tree@basistech.com; guido@digicool.com; i18n-sig@python.org; unicode@unicode.org Subject: [I18n-sig] Re: Unicode surrogates: just say no! [earlier correspondents] >> Personally, I think that the codecs should report an error in the >> appropriate fashion when presented with a python unicode string >> which contains values that are not allowed, such as lone >> surrogates. > > Other people have read Unicode 3.1 and came to the conclusion that > it mandates that implementations accept such a character... [big Gaute] Well, they're wrong. The standard is clear as ink in this regard. [my comment] Unfortunately ink is usually opaque :-) The problem is caused by section 3.8 in Unicode 3.0, which is not specifically amended by 3.1 as far as I can tell. The offending text occurs after clause D29. It says "... every UTF supports lossless round-trip transcoding ..." and "... a UTF mapping must also map invalid Unicode scalar values to unique code value sequences. These invalid scalar values include [0xFFFE], [0xFFFF] and unpaired surrogates." My interpretation of this is that the 2nd part I quoted says we must export the guff, and the 1st part says we must accept it back again. I don't particularly like this idea, and am not in favour of codecs silently accepting such in incoming data --- I'm just pointing out that this "lossless round-trip transcoding" concept seems to be at variance with various interpretations of what is "legal". Cheers, John From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 13:04:18 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v.
Loewis) Date: Wed, 27 Jun 2001 14:04:18 +0200 Subject: [I18n-sig] Re: Unicode surrogates: just say no! In-Reply-To: <4ahex2xstv.fsf@kern.srcf.societies.cam.ac.uk> (message from Gaute B Strokkenes on 27 Jun 2001 08:52:44 +0100) References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <15160.33467.686959.415021@cymru.basistech.com> <4aae2upzny.fsf@kern.srcf.societies.cam.ac.uk> <200106270554.f5R5sMW02751@mira.informatik.hu-berlin.de> <4ahex2xstv.fsf@kern.srcf.societies.cam.ac.uk> Message-ID: <200106271204.f5RC4Ia07546@mira.informatik.hu-berlin.de> > >> It is also a > >> bug to interpret such a six-byte sequence as a single character. > >> This was clarified in Unicode 3.1. > > > > It seems to be unclear to many, including myself, what exactly was > > clarified with Unicode 3.1. > > See the section called "UTF-8 Corrigendum" in TR 27. It explains it > all in detail. I've read this section forth and back over and over again, admittedly without having a copy of Unicode 3.0 at hand to mentally apply the changes. > > Where exactly does it say that processing a six-byte two-surrogates > > sequence as a single character is non-conforming? > > See D39(c) at . This > defines such a six-byte sequence as an "irregular UTF-8 code unit > sequence" and goes on to state that, as a consequence of C12, > conforming processes are not allowed to generate such sequences. [I guess this is D36(c)] Yes, but you've claimed that one *also* must not interpret such a sequence as a single character - this only says that you must never generate such a sequence. > Therefore you are not allowed to create a 3 byte sequence that is the > UTF-8 encoding of a value in this range. Therefore you are not allowed > to generate pairs of such sequences either. > > I hope this is all clear. That is all clear, but I still wonder why you said that the six byte sequence (which no conforming process can have produced) must not be interpreted as a single character. Specifically, C12 is amended with # Processes may transform irregular code unit sequences into the # equivalent well-formed code unit sequences. > > Other people have read Unicode 3.1 and came to the conclusion that > > it mandates that implementations accept such a character... > > Well, they're wrong. The standard is clear as ink in this regard. Not that clear to me... Please have a look at bug # 2 in http://sourceforge.net/tracker/download.php?group_id=5470&atid=105470&file_id=7439&aid=433882 The submitter claims that an implementation has to accept a single UTF-8 encoded surrogate word. Of course, it might be that accepting a single one in UTF-8 is mandated, but if you have two of them, you must reject them... Regards, Martin From gs234@cam.ac.uk Wed Jun 27 13:38:33 2001 From: gs234@cam.ac.uk (Gaute B Strokkenes) Date: 27 Jun 2001 13:38:33 +0100 Subject: [I18n-sig] Re: validity of lone surrogates (was Re: Unicode surrogates: just say no!) References: <9F2D83017589D211BD1000805FA70CA703B139EF@ntxmel03.cmutual.com.au> Message-ID: <4ak81yjdx2.fsf@kern.srcf.societies.cam.ac.uk> On Wed, 27 Jun 2001, JMachin@Colonial.com.au wrote: > > [earlier correspondents] >>> Personally, I think that the codecs should report an error in the >>> appropriate fashion when presented with a python unicode string >>> which contains values that are not allowed, such as lone >>> surrogates. >> >> Other people have read Unicode 3.1 and came to the conclusion that >> it mandates that implementations accept such a character... > > [big Gaute] > Well, they're wrong.
The standard is clear as ink in this regard. > > [my comment] > Unfortunately ink is usually opaque :-) Precisely. That's standardese for you. 8-) > The problem is caused by section 3.8 in Unicode 3.0, which is not > specifically amended by 3.1 as far as I can tell. It's not; AFAIK the list of changes at is supposed to be canonical and it's not listed. > The offending text occurs after clause D29. It says "... every UTF > supports lossless round-trip transcoding ..." and "... a UTF mapping > must also map invalid Unicode scalar values to unique code value > sequences. These invalid scalar values include [0xFFFE], [0xFFFF] > and unpaired surrogates." Sigh. This means that the Unicode standard is self-contradicting. It is nowhere defined precisely what "invalid Unicode Scalar Value" means. I can only assume that it means "an integer in the range 0 - 0x10FFFF that is not a Unicode Scalar Value". Even so, the statement is just plain wrong as far as UTF-16 is concerned. If UTF-16 is supposed to define a bijective mapping from any sequence of integers in the range 0 - 0x10FFFF to some set of sequences of integers in the range 0 - 0xFFFF (and this is definitely what this statement is saying) this becomes a contradiction: suppose that H is some high surrogate value and that L is some low surrogate value, and that U is the corresponding USV. Then the sequences H, L <-- sequence consisting of two "invalid USVs" and U <-- sequence consisting of a single (valid) USV both map to H, L <-- sequence of two UTF-16 code points under UTF-16, so that the mapping induced by UTF-16 is very definitely not bijective. I have no idea why the standard includes this apparent error, but my best guess would be that this used to be true back in the pre-3.1 days when UTF-16 (though not with that name) was Unicode proper and UTF-16 was not a UTF, but _the_ canonical Unicode encoding. Note that the statement given in D29 actually is true when applied to UTF-8 and UTF-32. However, let us put this annoying fact aside for a moment. I believe that D29 is intended to point out that the various UTFs will "just work" if you try to encode scalar values that are not proper USVs. This is not the same thing as saying that these invalid USVs or the "pseudo-characters" or whatever that arise from them have any business in a Unicode string. In fact, Unicode conformant processes are explicitly forbidden from interpreting or using U+FFFF or U+FFFE when passing Unicode data between each other. They are, however, explicitly allowed and even encouraged to use these values internally as sentinel or "fencepost" values. To put this slightly differently, a process may be storing some Unicode data internally and it may be storing U+FFFF for some reason or another in that internal data. It may be convenient for the process to use a UTF to transform this data into a more convenient form. I think that D29 is merely pointing out that this is actually feasible, in spite of the appearance of invalid USVs in the internal data. I would be indebted if any of the experts who hang out on the unicode list could sort out this confusion. > My interpretation of this is that the 2nd part I quoted says we must > export the guff, and the 1st part says we must accept it back again. > > I don't particularly like this idea, and am not in favour of codecs > silently accepting such in incoming data --- I'm just pointing out > that this "lossless round-trip transcoding" concept seems to be at > variance with various interpretations of what is "legal". Yup.
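For reference, the collision described above falls straight out of the standard UTF-16 formula; a minimal Python sketch (the function names are invented):

    def to_surrogates(usv):
        # USV in 0x10000..0x10FFFF -> (high, low) UTF-16 code units
        v = usv - 0x10000
        return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)

    def from_surrogates(high, low):
        # the inverse formula; note that the "invalid USV" sequence
        # H, L and the single valid USV U map to the same code units
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

For example, to_surrogates(0x10000) gives (0xD800, 0xDC00), which is byte for byte the same UTF-16 output as encoding the two "invalid USVs" 0xD800, 0xDC00 directly -- exactly the non-bijectivity being complained about.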
My take on this is that the various UTF codecs should follow the specs to the letter and reject anything else in default mode. There should also be a "lenient" or "forgiving" mode in which the codec does its best to interpret and repair broken, nonsensical or irregular data. Of course, if an application uses this mode then it will have to be aware of the dangers involved, including the security aspects. -- Big Gaute http://www.srcf.ucam.org/~gs234/ I'm having BEAUTIFUL THOUGHTS about the INSIPID WIVES of smug and wealthy CORPORATE LAWYERS.. From mark@macchiato.com Wed Jun 27 15:13:39 2001 From: mark@macchiato.com (Mark Davis) Date: Wed, 27 Jun 2001 07:13:39 -0700 Subject: [I18n-sig] Re: validity of lone surrogates (was Re: Unicode surrogates: just say no!) References: <9F2D83017589D211BD1000805FA70CA703B139EF@ntxmel03.cmutual.com.au> <4ak81yjdx2.fsf@kern.srcf.societies.cam.ac.uk> Message-ID: <005101c0ff13$5d851c40$0c680b41@c1340594a> You are correct in that the text is not nearly as clear as it should be, and is open to different interpretations. My view of the status in Unicode 3.1 is represented on http://www.macchiato.com/utc/utf_comparison.htm. Corresponding computations are on http://www.macchiato.com/utc/utf_computations.htm. One of the goals for Unicode 4.0 is to clear up the text describing UTFs in particular, which may change some of the edge cases (isolates and/or irregulars). Mark ----- Original Message ----- From: "Gaute B Strokkenes" To: "Machin, John" Cc: ; ; ; ; "Martin v. Loewis" Sent: Wednesday, June 27, 2001 05:38 Subject: Re: validity of lone surrogates (was Re: Unicode surrogates: just say no!)
From guido@digicool.com Wed Jun 27 15:16:47 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 10:16:47 -0400 Subject: [I18n-sig] Re: validity of lone surrogates In-Reply-To: Your message of "27 Jun 2001 13:38:33 BST."
<4ak81yjdx2.fsf@kern.srcf.societies.cam.ac.uk> References: <9F2D83017589D211BD1000805FA70CA703B139EF@ntxmel03.cmutual.com.au> <4ak81yjdx2.fsf@kern.srcf.societies.cam.ac.uk> Message-ID: <200106271416.f5REGl519361@odiug.digicool.com> [Gaute] > My take on this is that the various UTF codecs should follow the specs > to the letter and reject antything else in default mode. There should > also be a "lenient" or "forgiving" mode in which the codec does its > best to interpret and repair broken, nonsensical or irregular data. > Off course, if an application uses this mode then it will have to be > aware of the dangers involved, including the security aspects. Python's codec mechanism has a nice API gimmick: you can pass an error handling option. Currently, this can be 'strict', 'ignore', or 'replace'. I wonder if we could add a fourth mode, 'lenient', that tries its best to encode anything passed in? --Guido van Rossum (home page: http://www.python.org/~guido/) From fredrik@pythonware.com Wed Jun 27 16:09:27 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 27 Jun 2001 17:09:27 +0200 Subject: [I18n-sig] UCS-4 configuration References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> Message-ID: <00f701c0ff1b$29498da0$4ffa42d5@hagrid> martin wrote: > > go ahead and check it in. > > Done. Some clean-up could be still applied, such as defining only one > of USE_UCS4_STORAGE and Py_UNICODE_SIZE, but I'll leave that to your > judgement (i.e. I won't attempt any further changes at the moment > unless asked). after a good night's sleep, I'm not sure Py_UNICODE_SIZE should be used for feature selection (especially not SIZE == 4). I'd rather see a separate define for UCS-2/UTF-16 vs. UCS-4, which works no matter what the exact sizes are (as long as Py_UCS4 is at least 32 bits, and Py_UCS2 is at least 16 bits, of course). (how about PY_UNICODE_WIDE?) (and what's the deal with Py_ vs PY_ prefixes, btw?) From guido@digicool.com Wed Jun 27 16:20:14 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 11:20:14 -0400 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: Your message of "Wed, 27 Jun 2001 08:45:11 +0200." <200106270645.f5R6jBS06348@mira.informatik.hu-berlin.de> References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <200106262334.f5QNYGb18598@odiug.digicool.com> <200106270645.f5R6jBS06348@mira.informatik.hu-berlin.de> Message-ID: <200106271520.f5RFKE519522@odiug.digicool.com> > > Another loose end: define sys.maxunicode. > > Breaking my promise not to touch the code, I've added this. I was not > quite sure what type you meant to see in sys.maxunicode; I took > integer, since U+FFFF is a non-character. Correct. And thanks! > > Note how the utf8 codec has encoded the surrogate pair as two 3-byte > > utf8 sequences. I think it should either spit out an error or (I > > think this is better -- "be forgiving in what you accept") recognize > > the surrogate pair and spit out a 4-byte utf8 sequence. Note that in > > 2-byte mode, this same string literal can be marshalled and > > unmarshalled just fine! > > That was actually the same problem as with the test case: the UTF-8 > encoder would not use the surrogate code in wide mode. I've removed > that restriction, so this test now also passes. Thanks again! 
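For reference, the combine-then-encode step being discussed looks roughly like this in Python (utf8_of_astral is an invented name, not the codec's actual code):

    def utf8_of_astral(high, low):
        # combine a surrogate pair and emit one 4-byte UTF-8 sequence
        cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
        return (chr(0xF0 | (cp >> 18)) +
                chr(0x80 | ((cp >> 12) & 0x3F)) +
                chr(0x80 | ((cp >> 6) & 0x3F)) +
                chr(0x80 | (cp & 0x3F)))

    # e.g. utf8_of_astral(0xD800, 0xDC00) == '\xf0\x90\x80\x80', rather
    # than the two 3-byte sequences '\xed\xa0\x80\xed\xb0\x80' seen in
    # the marshal dump earlier in this thread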
> > Or should we change the marshalling format to do something that's more
> > transparent?  It feels uncomfortable that in 2-byte mode we can easily
> > create unicode strings containing illegal sequences (e.g. lone
> > surrogates), but these strings can't be marshalled.
>
> You mean, they cannot be unmarshalled? With the current code,
> marshalling them works fine...

Yes.

> There was another problem with the unicode database; the code assumed
> that adding two Py_UNICODE values would wrap around at 65536. With
> that fixed and committed, the test suite passes for me.

Wow.  And for both versions, too!

Are there any open issues left?  A list of those would help!  Some I
can think of:

- Marc-Andre's message
- disable Unicode entirely with a configuration switch
- documentation
- marshalling UCS2 strings containing lone surrogates

Anything else?

--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@digicool.com Wed Jun 27 16:25:54 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 11:25:54 -0400 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: Your message of "Wed, 27 Jun 2001 17:09:27 +0200." <00f701c0ff1b$29498da0$4ffa42d5@hagrid> References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <00f701c0ff1b$29498da0$4ffa42d5@hagrid> Message-ID: <200106271525.f5RFPsJ19534@odiug.digicool.com>

> martin wrote:
> > > go ahead and check it in.
> >
> > Done. Some clean-up could still be applied, such as defining only one
> > of USE_UCS4_STORAGE and Py_UNICODE_SIZE, but I'll leave that to your
> > judgement (i.e. I won't attempt any further changes at the moment
> > unless asked).
>
> after a good night's sleep, I'm not sure Py_UNICODE_SIZE should
> be used for feature selection (especially not SIZE == 4).
>
> I'd rather see a separate define for UCS-2/UTF-16 vs. UCS-4, which
> works no matter what the exact sizes are (as long as Py_UCS4 is at
> least 32 bits, and Py_UCS2 is at least 16 bits, of course).

Makes sense.

> (how about PY_UNICODE_WIDE?)
>
> (and what's the deal with Py_ vs PY_ prefixes, btw?)

The majority of macros use Py_, but a few use PY_.  I'd stick with Py_
unless you're defining a new one that's part of a series that already
uses PY_.

In the Unicode support, the only one using PY_ seems to be
PY_UNICODE_TYPE.  Since that's a recent addition, there's time to
rename it.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From fredrik@pythonware.com Wed Jun 27 16:45:17 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 27 Jun 2001 17:45:17 +0200 Subject: [I18n-sig] UCS-4 configuration References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <200106262334.f5QNYGb18598@odiug.digicool.com> <200106270645.f5R6jBS06348@mira.informatik.hu-berlin.de> <200106271520.f5RFKE519522@odiug.digicool.com> Message-ID: <019b01c0ff20$2b85a8b0$4ffa42d5@hagrid>

guido wrote:
> Anything else?

also after a good night's sleep: should the default on unix really be
"same as your wchar", or should we keep it as "ucs2" for the next
release?

(i.e. if you don't specify anything, you get UCS-2, like before)

From guido@digicool.com Wed Jun 27 16:50:14 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 11:50:14 -0400 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: Your message of "Wed, 27 Jun 2001 17:45:17 +0200."
<019b01c0ff20$2b85a8b0$4ffa42d5@hagrid> References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <200106262334.f5QNYGb18598@odiug.digicool.com> <200106270645.f5R6jBS06348@mira.informatik.hu-berlin.de> <200106271520.f5RFKE519522@odiug.digicool.com> <019b01c0ff20$2b85a8b0$4ffa42d5@hagrid> Message-ID: <200106271550.f5RFoEZ19613@odiug.digicool.com>

> guido wrote:
>
> > Anything else?
>
> also after a good night's sleep: should the default on unix really be
> "same as your wchar", or should we keep it as "ucs2" for the next
> release?
>
> (i.e. if you don't specify anything, you get UCS-2, like before)

Yes, that's my preference.  I had the same thought overnight.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From rick@unicode.org Wed Jun 27 16:52:28 2001 From: rick@unicode.org (Rick McGowan) Date: Wed, 27 Jun 2001 08:52:28 -0700 Subject: [I18n-sig] Re: Unicode surrogates: just say no! In-Reply-To: <4aae2upzny.fsf@kern.srcf.societies.cam.ac.uk> (message from Gaute B Strokkenes on 27 Jun 2001 00:52:17 +0100) Message-ID: <200106271344.JAA08050@unicode.org>

"Martin v. Loewis" wrote:

> It seems to be unclear to many, including myself, what exactly was
> clarified with Unicode 3.1. Where exactly does it say that processing
> a six-byte two-surrogates sequence as a single character is
> non-conforming?

It's not non-conforming, it's "irregular".  Please read the technical
report (#27) that I pointed at yesterday (on the i18n-sig@python).  It
gives detailed specifications for UTF-8.  Anything not in the table
"UTF-8 Bit Distribution" and accompanying description shown there is
non-conforming.  Rule D36 specifies:

(a) UTF-8 is the Unicode Transformation Format that serializes a
Unicode code point as a sequence of one to four bytes, as specified in
Table 3.1, UTF-8 Bit Distribution.

(b) An illegal UTF-8 code unit sequence is any byte sequence that does
not match the patterns listed in Table 3.1B, Legal UTF-8 Byte
Sequences.

(c) An irregular UTF-8 code unit sequence is a six-byte sequence where
the first three bytes correspond to a high surrogate, and the next
three bytes correspond to a low surrogate.  As a consequence of C12,
these irregular UTF-8 sequences shall not be generated by a conformant
process.

In other words, it is non-conforming to generate two 3-byte things for
a surrogate pair.  However, it remains "legal but irregular" to
interpret such a pair of 3-byte entities.

Why wasn't it just made non-conforming to interpret such things?
Because there are old implementations of UTF-8 in the world that
pre-date the definition of surrogates, and if they ever encountered
codepoints in that range, they would generate those pairs of 3-byte
sequences.  So it is legal for a process to recognize them and either
raise an exception or try to "fix" the situation.

> What exactly does it say that the conforming behaviour
> should be?

TR27 says: "Processes that require unique representation must not
interpret irregular UTF code unit sequences as characters.  They may,
for example, reject or remove those sequences."

If I were going to implement a UTF-8 interpreter for Python, I would
give it a hook to optionally return a specific error condition on
irregular sequences.

If you still find the definitions and discussion in the technical
report to be unclear, then the Unicode editorial committee would
undoubtedly like to hear about it.
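For readers who want to see what such a hook would have to look for,
here is a hedged sketch of a detector for the irregular six-byte form;
the function name and structure are mine (present-day Python is
assumed, where indexing a bytes object yields integers), not code from
any actual codec:

    def find_irregular_utf8(data):
        """Yield offsets of 6-byte encoded surrogate pairs in data."""
        for i in range(len(data) - 5):
            # ED A0..AF xx encodes U+D800..U+DBFF (high surrogates);
            # ED B0..BF xx encodes U+DC00..U+DFFF (low surrogates).
            # A stricter checker would also verify that the bytes at
            # i+2 and i+5 are valid continuation bytes (0x80..0xBF).
            if (data[i] == 0xED and 0xA0 <= data[i + 1] <= 0xAF
                    and data[i + 3] == 0xED and 0xB0 <= data[i + 4] <= 0xBF):
                yield i

    # U+D800,U+DC00 encoded as two 3-byte sequences ("legal but irregular"):
    print(list(find_irregular_utf8(b"\xed\xa0\x80\xed\xb0\x80")))   # [0]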
Rick

From guido@digicool.com Wed Jun 27 17:11:49 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 12:11:49 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode Maintenance In-Reply-To: Your message of "Wed, 27 Jun 2001 14:10:57 +0200." <3B39CD51.406C28F0@lemburg.com> References: <3B39CD51.406C28F0@lemburg.com> Message-ID: <200106271611.f5RGBn819631@odiug.digicool.com>

> Looking at the recent burst of checkins for the Unicode implementation
> completely bypassing the standard SF procedure and possible comments
> I might have on the different approaches, I guess I've been ruled out
> as maintainer and designer of the Unicode implementation.
>
> Well, I guess that's how things go. Was nice working for you guys,
> but no longer is... I'm tired of having to defend myself against
> meta-comments about the design, uncontrolled checkins and no true
> backup about my standing in all this from Guido.
>
> Perhaps I am misunderstanding the role of a maintainer and
> implementation designer, but as it is all respect for the work I've
> put into all this seems faded. That's the conclusion I draw from recent
> postings by Martin and Fredrik and their nightly "takeover".
>
> Thanks,
> --
> Marc-Andre Lemburg

[For those of us to whom Marc-Andre's complaint comes as a total
surprise: there was a thread on i18n-sig about whether we should
support Unicode surrogates, followed by a conclusion to skip surrogates
and jump directly to optional support for UCS-4, followed by some
checkins that enabled a configuration choice between UCS-2 and UCS-4,
and code to make it work.  As a side effect, surrogate support in the
UCS-2 version actually improved slightly.]

Now, now, Marc-Andre.  The only comments I recall from you on my
"surrogates: just say no" post seemed favorable, except that you
proposed to go all the way and make UCS-4 mandatory.  I explained why I
didn't want to go that far, and why I didn't believe your arguments
against giving users a choice.  I didn't hear back from you then, and I
didn't think you could have much of a problem with my position.

Our process requires the use of the SF patch manager only for
controversial changes.  Based on your feedback, I didn't think there
was anything controversial about the changes that Fredrik and Martin
have made!  (If there was, IMO it was temporarily breaking the Windows
build and the test suite -- but that's all fixed now.)

I don't understand where you get the idea that we lost respect for
your work!  In fact, the fact that it was so easy to make the changes
suggested to me that the original design was well suited to this
particular change (as opposed to the surrogate support proposals,
which all sounded like they would require a *lot* of changes).

I don't think that we have very strict roles in this community anyway.
(My role as BDFL excluded -- that's why I get to write this response.
:-)  I'd say that Fredrik owns SRE, because he has asserted that
ownership at various times: he's undone changes by others that broke
the 1.5.2 support, for example.  But the Unicode support in Python
isn't owned by one person: many folks have contributed to that,
including Fredrik, who designed and wrote the original Unicode string
object implementation.

If you have specific comments about the changes made, please be
specific.  If you feel slighted by meta-comments, please also be
specific.  I don't think I've said anything derogatory about you or
your design.

Paul Prescod offered to write a PEP on this issue.
My cynical half believes that we'll never hear from him again, but my
optimistic half hopes that he'll actually write one, so that we'll be
able to discuss the various issues for the users with the users.  I
encourage you to co-author the PEP, since you have a lot of background
knowledge about the issues.

BTW, I think that Misc/unicode.txt should be converted to a PEP, for
the historic record.  It was very much a PEP before the PEP process
was invented.  Barry, how much work would this be?  No editing needed,
just formatting, and assignment of a PEP number (the lower the
better).

--Guido van Rossum (home page: http://www.python.org/~guido/)

From barry@digicool.com Wed Jun 27 17:24:30 2001 From: barry@digicool.com (Barry A. Warsaw) Date: Wed, 27 Jun 2001 12:24:30 -0400 Subject: [I18n-sig] Re: [Python-Dev] Unicode Maintenance References: <3B39CD51.406C28F0@lemburg.com> <200106271611.f5RGBn819631@odiug.digicool.com> Message-ID: <15162.2238.720508.508081@anthem.wooz.org>

>>>>> "GvR" == Guido van Rossum writes:

    GvR> BTW, I think that Misc/unicode.txt should be converted to a
    GvR> PEP, for the historic record.  It was very much a PEP before
    GvR> the PEP process was invented.  Barry, how much work would
    GvR> this be?  No editing needed, just formatting, and assignment
    GvR> of a PEP number (the lower the better).

Not much work at all, so I'll do this (and replace Misc/unicode.txt
with a pointer to the PEP).  Let's go with PEP 7, but stick it under
the "Other Informational PEPs" category.

-Barry

From tim.one@home.com Wed Jun 27 17:51:16 2001 From: tim.one@home.com (Tim Peters) Date: Wed, 27 Jun 2001 12:51:16 -0400 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: <200106271520.f5RFKE519522@odiug.digicool.com> Message-ID: 

[Guido]
> Are there any open issues left?  A list of those would help!  Some I
> can think of:
>
> - Marc-Andre's message
> - disable Unicode entirely with a configuration switch
> - documentation
> - marshalling UCS2 strings containing lone surrogates
>
> Anything else?

Other unresolved glitches raised here in the wee hours:

+ New warnings (prototype/definition mismatches).

+ Windows _winreg doesn't link.  Unclear (to me) what assumptions
  it really needs to have met; it's failing now because
  HAVE_USABLE_WCHAR_T isn't #define'd anymore, but I don't really
  know what "usable" refers to (perhaps that it's usable by _winreg).

From walter@livinglogic.de Wed Jun 27 17:56:00 2001 From: walter@livinglogic.de (Walter Dörwald) Date: Wed, 27 Jun 2001 18:56:00 +0200 Subject: [I18n-sig] Re: validity of lone surrogates References: <9F2D83017589D211BD1000805FA70CA703B139EF@ntxmel03.cmutual.com.au> <4ak81yjdx2.fsf@kern.srcf.societies.cam.ac.uk> <200106271416.f5REGl519361@odiug.digicool.com> Message-ID: <3B3A1020.7154E4B6@livinglogic.de>

Guido van Rossum wrote:
>
> [Gaute]
> > My take on this is that the various UTF codecs should follow the specs
> > to the letter and reject anything else in default mode.  There should
> > also be a "lenient" or "forgiving" mode in which the codec does its
> > best to interpret and repair broken, nonsensical or irregular data.
> > Of course, if an application uses this mode then it will have to be
> > aware of the dangers involved, including the security aspects.
>
> Python's codec mechanism has a nice API gimmick: you can pass an error
> handling option.  Currently, this can be 'strict', 'ignore', or
> 'replace'.  I wonder if we could add a fourth mode, 'lenient', that
> tries its best to encode anything passed in?
How would this work together with the proposed encode error handling
callback feature (see patch #432401)?  Does this patch have any chance
of getting into Python (when it's finished)?

Bye,
   Walter Dörwald

From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 17:27:41 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 18:27:41 +0200 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: <00f701c0ff1b$29498da0$4ffa42d5@hagrid> (fredrik@pythonware.com) References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <00f701c0ff1b$29498da0$4ffa42d5@hagrid> Message-ID: <200106271627.f5RGRf909183@mira.informatik.hu-berlin.de>

> after a good night's sleep, I'm not sure Py_UNICODE_SIZE should
> be used for feature selection (especially not SIZE == 4).
>
> I'd rather see a separate define for UCS-2/UTF-16 vs. UCS-4, which
> works no matter what the exact sizes are (as long as Py_UCS4 is at
> least 32 bits, and Py_UCS2 is at least 16 bits, of course).
>
> (how about PY_UNICODE_WIDE?)

Normalizing everything to Py_UNICODE_WIDE sounds fine to me; I won't
start writing a patch for that, though.  Feel free to get completely
rid of Py_UNICODE_SIZE in the process (and probably of
USE_UCS4_STORAGE as well).

> (and what's the deal with Py_ vs PY_ prefixes, btw?)

I took PY_ out of confusion, as mentioned in another message.

Regards,
Martin

From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 17:21:41 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 18:21:41 +0200 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: <200106271525.f5RFPsJ19534@odiug.digicool.com> (message from Guido van Rossum on Wed, 27 Jun 2001 11:25:54 -0400) References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <00f701c0ff1b$29498da0$4ffa42d5@hagrid> <200106271525.f5RFPsJ19534@odiug.digicool.com> Message-ID: <200106271621.f5RGLf709180@mira.informatik.hu-berlin.de>

> The majority of macros use Py_, but a few use PY_.  I'd stick with Py_
> unless you're defining a new one that's part of a series that already
> uses PY_.
>
> In the Unicode support, the only one using PY_ seems to be
> PY_UNICODE_TYPE.  Since that's a recent addition, there's time to
> rename it.

PY_UNICODE_TYPE is the #define that is used in the typedef for
Py_UNICODE; it should not be used elsewhere.  I couldn't figure out
how to have autoconf generate typedefs, so I generate a #define.
Originally, I wanted to use PY_UNICODE for the #define, but then
thought it to be too similar to Py_UNICODE, hence PY_UNICODE_TYPE.
Changing it to Py_UNICODE_TYPE sounds fine to me.

Regards,
Martin

From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 17:46:37 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v.
Loewis) Date: Wed, 27 Jun 2001 18:46:37 +0200 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: <200106271520.f5RFKE519522@odiug.digicool.com> (message from Guido van Rossum on Wed, 27 Jun 2001 11:20:14 -0400) References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <200106262334.f5QNYGb18598@odiug.digicool.com> <200106270645.f5R6jBS06348@mira.informatik.hu-berlin.de> <200106271520.f5RFKE519522@odiug.digicool.com> Message-ID: <200106271646.f5RGkbR09253@mira.informatik.hu-berlin.de>

> Are there any open issues left?  A list of those would help!  Some I
> can think of:
>
> - Marc-Andre's message
> - disable Unicode entirely with a configuration switch
> - documentation
> - marshalling UCS2 strings containing lone surrogates
>
> Anything else?

- bump the API version? With the current CVS, this is only necessary
  for systems with a 4-byte wchar_t.

- Find some magic to deal with exchanging extensions across
  incompatible installations.

- fix UTF-8 encoding for lone surrogates, as per SF bug report.

- Windows configuration: should unicodeobject.h provide
  autoconfiguration, or should everything be defined in PC/config.h
  (or similar manually-maintained config files).

I'll be leaving for two weeks next week, so I can tackle larger tasks
only later.

On the PYD compatibility, the easiest solution would be to create a
Py_InitModule5, which also takes a flag value; this flag value could
include other incompatible settings, such as --without-cycle-gc.  Of
course, such a change would break all existing binary modules, unless
Python continues to provide Py_InitModule4 to binary modules.  Calling
Py_InitModule4 would then imply narrow Unicode.

To hack without Py_InitModule5, putting flags into PYTHON_API_VERSION
might also work.

Regards,
Martin

From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 18:06:30 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 19:06:30 +0200 Subject: [I18n-sig] Re: Unicode surrogates: just say no! In-Reply-To: <200106271344.JAA08050@unicode.org> (message from Rick McGowan on Wed, 27 Jun 2001 08:52:28 -0700) References: <200106271344.JAA08050@unicode.org> Message-ID: <200106271706.f5RH6UT09389@mira.informatik.hu-berlin.de>

> "Martin v. Loewis" wrote:
>
> > It seems to be unclear to many, including myself, what exactly was
> > clarified with Unicode 3.1. Where exactly does it say that processing
> > a six-byte two-surrogates sequence as a single character is
> > non-conforming?
>
> It's not non-conforming, it's "irregular".

If some implementation processes something, it can be either
conforming or non-conforming in doing so, no?  The byte sequence
itself may be irregular; I'm asking how a conforming implementation
should deal with it when it sees it.

> Please read the technical report (#27) that I pointed at yesterday
> (on the i18n-sig@python).  It gives detailed specifications for
> UTF-8.  Anything not in the table "UTF-8 Bit Distribution" and
> accompanying description shown there is non-conforming.

I see conformant/non-conformant (*) only used for implementations
(and processes), not for byte sequences.  There you use illegal,
ill-formed, irregular; much of my confusion probably is because I
don't know how these terms relate, except for

- an irregular sequence (of bytes, or code units) is not illegal.

Also, I assume that negation of these concepts follows the English
language rules (i.e.
"not illegal" == "legal", "not ill-formed" == "well-formed", etc) > In other words, it is non-conforming to generate two 3-byte things for a > surrogate pair. However, it remains "legal but irregular" to interpret > such a pair of 3-byte entities. [...] > If you still find the definitions and discussion in the technical report > to be unclear, then the Unicode editorial committee would undoubtedly like > to hear about it. The issue of UTF-8 encoded surrogate pairs is clear now to me, I hope: You must not write them, but you may read them. The next question then is what to do with lone surrogate triplets; the table in TR 27 suggests they are legal, but people on this list have argued they must neither be emitted nor consumed (since what you get is not a legal USV). Thanks for your comments, Martin (*) "Conforming" is never used, sorry for the confusion From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 18:21:26 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 19:21:26 +0200 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: References: Message-ID: <200106271721.f5RHLQM09510@mira.informatik.hu-berlin.de> > + Windows _winreg doesn't link. Unclear (to me) what assumptions > it really needs to have met; it's failing now because > HAVE_USABLE_WCHAR_T isn't #define'd anymore, but I don't know > really know what "usable" refers to (perhaps that it's usable > by _winreg ). HAVE_USABLE_WCHAR_T should be defined iff sizeof(wchar_t)==sizeof(Py_UNICODE) (*). If you follow my proposal, PC/config.h should define this simultaneously with defining Py_UNICODE_TYPE to wchar_t. OTOH, the implementation of PyUnicode_DecodeMBCS and friends should probably be changed to operate for a wide Py_UNICODE also. Currently, it calls MultiByteToWideChar; this should be followed by widening each value if a wide Py_UNICODE is used. Without such a change, the "mbcs" codec won't work on Windows with a wide Py_UNICODE. Regards, Martin (*) technically, this requires also that wchar_t values are always understood as Unicode in the C library, instead of, say, EUC-JP. This is hard to test in general, but for Windows, it is known to be true. From fredrik@pythonware.com Wed Jun 27 18:41:01 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Wed, 27 Jun 2001 19:41:01 +0200 Subject: [I18n-sig] UCS-4 configuration References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <200106262334.f5QNYGb18598@odiug.digicool.com> <200106270645.f5R6jBS06348@mira.informatik.hu-berlin.de> <200106271520.f5RFKE519522@odiug.digicool.com> <200106271646.f5RGkbR09253@mira.informatik.hu-berlin.de> Message-ID: <02c901c0ff30$ed0d9fa0$4ffa42d5@hagrid> martin wrote: > - Windows configuration: should unicodeobject.h provide > autoconfiguration, or should everything be defined in PC/config.h > (or similar manually-maintained config files). > > I'll be leaving for two weeks next week, so I can tackle larger tasks > only later. before you leave, can you change the ./configure default to ucs2? (see my and gvr's earlier mails) I'll clean up the unicode defines tonight. Cheers /F From guido@digicool.com Wed Jun 27 18:44:20 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 13:44:20 -0400 Subject: [I18n-sig] Re: Unicode surrogates: just say no! In-Reply-To: Your message of "Wed, 27 Jun 2001 19:06:30 +0200." 
<200106271706.f5RH6UT09389@mira.informatik.hu-berlin.de> References: <200106271344.JAA08050@unicode.org> <200106271706.f5RH6UT09389@mira.informatik.hu-berlin.de> Message-ID: <200106271744.f5RHiKO19739@odiug.digicool.com>

> The issue of UTF-8 encoded surrogate pairs is clear now to me, I hope:
> You must not write them, but you may read them.

Agreed.  Clarifying: if you read one pair when converting to UCS-4,
you should store one character; when converting to UCS-2, you should
store a pair, of course.

> The next question then is what to do with lone surrogate triplets; the
> table in TR 27 suggests they are legal, but people on this list have
> argued they must neither be emitted nor consumed (since what you get
> is not a legal USV).

I see two positions possible: (1) it's up to the application to ensure
this, not to the codec, so the codec needn't check for this; (2) the
codec's output should be legal, and this is a good time to check for
illegalities.

Since both are reasonable positions, perhaps the error handling option
of the codec should be used to decide?  Neither of "strict", "replace"
or "ignore" really matches the semantics of (1) however; perhaps this
behavior should be called "lenient".

--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@digicool.com Wed Jun 27 18:53:10 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 13:53:10 -0400 Subject: [I18n-sig] Re: validity of lone surrogates In-Reply-To: Your message of "Wed, 27 Jun 2001 18:56:00 +0200." <3B3A1020.7154E4B6@livinglogic.de> References: <9F2D83017589D211BD1000805FA70CA703B139EF@ntxmel03.cmutual.com.au> <4ak81yjdx2.fsf@kern.srcf.societies.cam.ac.uk> <200106271416.f5REGl519361@odiug.digicool.com> <3B3A1020.7154E4B6@livinglogic.de> Message-ID: <200106271753.f5RHrAB19753@odiug.digicool.com>

> How would this work together with the proposed encode error handling
> callback feature (see patch #432401)?  Does this patch have any chance of
> getting into Python (when it's finished)?

I don't know.  The patch looks awfully big, and the motivation seems
thin, so I don't have high hopes.  I doubt that I would use it myself,
and I fear that it would be pretty slow if called frequently.

An alternative way to get what you want would be to write your own
codec.  Also, some standard codecs might be subclassable in a way that
makes it easy to get the desired functionality through subclassing
rather than through changing lots of C level APIs.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@digicool.com Wed Jun 27 19:13:10 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 14:13:10 -0400 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: Your message of "Wed, 27 Jun 2001 18:46:37 +0200." <200106271646.f5RGkbR09253@mira.informatik.hu-berlin.de> References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <200106262334.f5QNYGb18598@odiug.digicool.com> <200106270645.f5R6jBS06348@mira.informatik.hu-berlin.de> <200106271520.f5RFKE519522@odiug.digicool.com> <200106271646.f5RGkbR09253@mira.informatik.hu-berlin.de> Message-ID: <200106271813.f5RIDAU19807@odiug.digicool.com>

> To hack without Py_InitModule5, putting flags into PYTHON_API_VERSION
> might also work.

I like adding a flag better than Py_InitModule5.  If
PYTHON_API_VERSION > 1010, the low bit should be off for UCS-2 and on
for UCS-4.  So the next version should be 1012; this would become 1013
for UCS-4.
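As a toy rendering of that low-bit scheme in Python (the numbers
follow the example above; nothing here is real CPython code):

    BASE_API_VERSION = 1012            # even: narrow (UCS-2) build

    def api_version(wide_unicode):
        # a wide (UCS-4) build turns the low bit on
        return BASE_API_VERSION | bool(wide_unicode)

    def is_wide(version):
        return bool(version & 1)

    print(api_version(False), api_version(True))   # 1012 1013
    print(is_wide(1013))                           # True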
If a program doesn't use Unicode-specific APIs that take or return
Py_UNICODE arrays, it's not vulnerable to this problem.

An alternative would be to use the C preprocessor to give all affected
APIs a different name when using UCS4.  (There are also macros
affected, e.g. Py_UNICODE_COPY().  But macro users are likely to also
reference the function APIs.)

There's a bunch of functions that take or return a single Py_UNICODE
value.  These would be affected too.  That's a shame; if they had been
defined to take/return an unsigned long they would have worked just as
well.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From paulp@ActiveState.com Wed Jun 27 20:17:32 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 27 Jun 2001 12:17:32 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> Message-ID: <3B3A314C.161FE431@ActiveState.com>

I'm trying to sift through all of the decisions made in different
messages for the PEP.

Guido van Rossum wrote:
>
>...
>
> - unichr(i) for 0x10000 <= i <= 0x10ffff (and hence corresponding \u
>   and \U) generates a surrogate pair, where u[0] is the high
>   surrogate value and u[1] the low surrogate value

Does this imply that ord() should take in surrogate pairs too?

-- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook

From kenw@sybase.com Wed Jun 27 20:23:47 2001 From: kenw@sybase.com (Kenneth Whistler) Date: Wed, 27 Jun 2001 12:23:47 -0700 (PDT) Subject: [I18n-sig] Re: validity of lone surrogates (was Re: Unicode surrogates: just say no!) Message-ID: <200106271923.MAA11557@birdie.sybase.com>

Mark Davis wrote:

> You are correct in that the text is not nearly as clear as it should be,
> and is open to different interpretations. My view of the status in Unicode
> 3.1 is represented on http://www.macchiato.com/utc/utf_comparison.htm.
> Corresponding computations are on
> http://www.macchiato.com/utc/utf_computations.htm.

I concur in general with Mark's characterization of what the current
text is intended to say.  In particular, Mark is correct that there is
language just below D29 that says that "a UTF mapping *must also* map
invalid Unicode scalar values to unique code value sequences.  These
invalid scalar values include FFFE, FFFF, and unpaired surrogates."

I strongly agree with Mark that this is the correct position to take
with respect to the *noncharacters*, i.e. FFFE, FFFF (and their ilk on
the supplementary planes, as well as the newly defined FDD0..FDFF).
In this respect, ISO/IEC 10646 is inconsistent in its definition of
UTF-8, and needs to be fixed.

However, like Gaute, I think there are logical contradictions in the
current text of the Unicode Standard when it comes to dealing with the
isolated surrogate code points.

Gaute is also correct that much of the problem of textual
interpretation results from the incomplete transition in Unicode 3.0
from thinking of UTF-16 as Unicode, with UTF-8 derived from UTF-16, to
UTF-16 and UTF-8 as coequal transforms from the Unicode Scalar Value.
The UTC editorial committee struggled with that text, but also
attempted to minimize the overall impact on Chapter 3 of the standard.
In retrospect, it probably would have been better to take the hit then
and completely rewrite Chapter 3 in terms of the new model, because of
the continuing confusion that the incomplete transition has obviously
engendered among implementers.
>
> One of the goals for Unicode 4.0 is to clear up the text describing UTFs in
> particular, which may change some of the edge cases (isolates and/or
> irregulars). This work is actively underway.

I can guarantee that the Unicode 4.0 text will be *much* clearer about
all these issues.  However, the UTC editorial committee is still
struggling with exactly how to present the edge cases.

It is my *personal* opinion -- and not yet one that could be stated to
be consensus in UTC or the UTC editorial committee -- that the Unicode
Standard should adopt formal definitions similar to those of the IETF,
where isolated surrogates and/or irregular sequences are just
ill-formed, period.  And where the issues of lenient interpretation of
irregular UTF-8 generated by older implementations are shunted off
into a migration strategy section dealing with UTF converters.

--Ken Whistler

From guido@digicool.com Wed Jun 27 20:30:19 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 15:30:19 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Wed, 27 Jun 2001 12:17:32 PDT." <3B3A314C.161FE431@ActiveState.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A314C.161FE431@ActiveState.com> Message-ID: <200106271930.f5RJUJw19910@odiug.digicool.com>

> I'm trying to sift through all of the decisions made in different
> messages for the PEP.

Excellent!

> Guido van Rossum wrote:
> >
> >...
> >
> > - unichr(i) for 0x10000 <= i <= 0x10ffff (and hence corresponding \u
> >   and \U) generates a surrogate pair, where u[0] is the high
> >   surrogate value and u[1] the low surrogate value
>
> Does this imply that ord() should take in surrogate pairs too?

Oooh, hadn't thought of that, but yes, it makes sense!

Not yet implemented, but I think it should.  Makes for a nice pair
of invariants:

    unichr(ord('\Udddddddd')) == '\Udddddddd'
    ord(unichr(0xdddddddd)) == 0xdddddddd

regardless of whether we're using UCS-2 or UCS-4 storage.  Currently
this is broken for 0xdddddddd > 0xffff with UCS-2 storage.

On the other hand, unichr() and ord() should still work for lone
surrogate values as well (even though these are invalid code points).

--Guido van Rossum (home page: http://www.python.org/~guido/)

From paulp@ActiveState.com Wed Jun 27 20:40:07 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 27 Jun 2001 12:40:07 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> Message-ID: <3B3A3696.FFA7FCE@ActiveState.com>

Guido van Rossum wrote:
>
>...
>
> - unichr(i) for i >= 0x110000 (and hence corresponding \u and \U)
>   raises an exception at Python-to-bytecode compile-time

unichr(i) is an expression.  When would it be evaluated at
compile-time?

Also, I'm not sure what runtime behavior you want for these "very
large" unichr(i) values.

In general I don't understand why we're treating the > 0x110000 range
specially at all?

-- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook

From Peter_Constable@sil.org Wed Jun 27 19:58:46 2001 From: Peter_Constable@sil.org (Peter_Constable@sil.org) Date: Wed, 27 Jun 2001 11:58:46 -0700 Subject: [I18n-sig] Re: Unicode surrogates: just say no!
Message-ID: 

>If you still find the definitions and discussion in the technical report
>to be unclear, then the Unicode editorial committee would undoubtedly like
>to hear about it.

There is no question that there are still things that are unclear and
things that are anachronistic in the definitions.  I have been told
that the editorial committee *is* aware of these things and looking at
them with the intent to revise them for TUS 4.0.

- Peter

---------------------------------------------------------------------------
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: 

From paulp@ActiveState.com Wed Jun 27 20:50:24 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 27 Jun 2001 12:50:24 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A314C.161FE431@ActiveState.com> <200106271930.f5RJUJw19910@odiug.digicool.com> Message-ID: <3B3A3900.CB73F3E0@ActiveState.com>

Guido van Rossum wrote:
>
>...
>
> Oooh, hadn't thought of that, but yes, it makes sense!
>
> Not yet implemented, but I think it should.  Makes for a nice pair
> of invariants:
>
>     unichr(ord('\Udddddddd')) == '\Udddddddd'
>     ord(unichr(0xdddddddd)) == 0xdddddddd
>
> regardless of whether we're using UCS-2 or UCS-4 storage.

I'm going to presume that ord should accept surrogate pairs on both
narrow and wide interpreters.

-- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook

From guido@digicool.com Wed Jun 27 20:53:25 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 15:53:25 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Wed, 27 Jun 2001 12:40:07 PDT." <3B3A3696.FFA7FCE@ActiveState.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A3696.FFA7FCE@ActiveState.com> Message-ID: <200106271953.f5RJrPi19963@odiug.digicool.com>

> Guido van Rossum wrote:
> >
> >...
> >
> > - unichr(i) for i >= 0x110000 (and hence corresponding \u and \U)
> >   raises an exception at Python-to-bytecode compile-time
>
> unichr(i) is an expression.  When would it be evaluated at
> compile-time?

My mistake.  The corresponding \U would be a compile-time error,
unichr() of course a run-time error.

> Also, I'm not sure what runtime behavior you want for these "very
> large" unichr(i) values.
>
> In general I don't understand why we're treating the > 0x110000 range
> specially at all?

When using UCS-2 + surrogate pairs (== UTF-16), they are not
representable, and the Unicode and ISO standards have effectively
declared that this will be the supported range forever.  (For *some*
definition of forever. :-)

When using UCS-4 mode, I was in favor of allowing unichr() and \U to
specify any value in range(0x100000000L), but that's not what Martin
and Fredrik checked in.

Note that if C code somehow creates a UCS-4 string containing
something with the high bit on, ord() will currently return a negative
value on platforms where a C long is 32 bits.  Returning a Python long
int with a positive value would be more consistent, but since these
values aren't useful, I wonder if we should care.

On the other hand, do we want ord() to raise an error when the value
is not a legal Unicode code point?
(Fortunately lone surrogates are still legal code points -- AFAIK all
values in range(0x110000) are legal code points.)

Definitely a PEP question; it's not cast in stone.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@digicool.com Wed Jun 27 20:57:12 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 15:57:12 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Wed, 27 Jun 2001 12:50:24 PDT." <3B3A3900.CB73F3E0@ActiveState.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A314C.161FE431@ActiveState.com> <200106271930.f5RJUJw19910@odiug.digicool.com> <3B3A3900.CB73F3E0@ActiveState.com> Message-ID: <200106271957.f5RJvC219975@odiug.digicool.com>

> Guido van Rossum wrote:
> >
> >...
> >
> > Oooh, hadn't thought of that, but yes, it makes sense!
> >
> > Not yet implemented, but I think it should.  Makes for a nice pair
> > of invariants:
> >
> >     unichr(ord('\Udddddddd')) == '\Udddddddd'
> >     ord(unichr(0xdddddddd)) == 0xdddddddd
> >
> > regardless of whether we're using UCS-2 or UCS-4 storage.
>
> I'm going to presume that ord should accept surrogate pairs on both
> narrow and wide interpreters.

That's a separate question.  On wide interpreters, surrogate pairs
"shouldn't" exist if the app plays by the rules.  But they're easily
created of course!  What should ord(u'\uD800\uDC00') mean on a wide
interpreter?  I think it's nice if you support this.

Of course, if a length-two Unicode string is anything other than a
high surrogate followed by a low surrogate, ord() should be illegal.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From rick@unicode.org Wed Jun 27 21:04:00 2001 From: rick@unicode.org (Rick McGowan) Date: Wed, 27 Jun 2001 13:04:00 -0700 Subject: [I18n-sig] Re: Unicode surrogates: just say no! In-Reply-To: <200106271344.JAA08050@unicode.org> (message from Rick McGowan on Wed, 27 Jun 2001 08:52:28 -0700) Message-ID: <200106271756.NAA11851@unicode.org>

"Martin v. Loewis" wrote:

> The next question then is what to do with lone surrogate triplets; the
> table in TR 27 suggests they are legal, but people on this list have
> argued they must neither be emitted nor consumed (since what you get
> is not a legal USV).

Part of the confusion everyone has is because the UTFs have been
envisioned as both (A) pure mathematical transformations of integer
spaces, and (B) transformations of coded characters.  But the
explanations have been muddled a little.  Part of the re-write that's
happening now in the Unicode editorial committee is dealing with this
confusion.  In the future, I hope that it can be clarified.

> an irregular sequence (of bytes, or code units) is not illegal.
> Also, I assume that negation of these concepts follows the English
> language rules (i.e. "not illegal" == "legal", "not ill-formed" ==
> "well-formed", etc)

Well, yes, you're right.  However, in English when something is
phrased as "not foo", that wording often carries the implication of
some shadiness that occupies the boundary between foo and anti-foo.
In this sense, "not illegal" does not mean the same thing as "legal".
"Not illegal" means something more like "socially backward and frowned
upon, but not worthy of legal prosecution in the strict sense".
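For concreteness, the pair-aware ord() being debated above amounts to
something like the following sketch; the name wide_ord and the error
message are mine, and present-day Python is assumed rather than the
actual builtin:

    def wide_ord(s):
        # One ordinary character: defer to the plain ord().
        if len(s) == 1:
            return ord(s)
        # A high surrogate followed by a low surrogate: one character.
        if (len(s) == 2 and 0xD800 <= ord(s[0]) <= 0xDBFF
                and 0xDC00 <= ord(s[1]) <= 0xDFFF):
            return (0x10000 + ((ord(s[0]) - 0xD800) << 10)
                    + (ord(s[1]) - 0xDC00))
        raise TypeError("expected a character or a surrogate pair")

    print(hex(wide_ord(u"\uD800\uDC00")))   # 0x10000

Any other length-two string falls through to the TypeError, matching
the "ord() should be illegal" position above.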
Here's my take on irregular sequences / lone surrogates:

If you have a process which is claiming to take in arbitrary data and
emit identical data in the same or different UTF, then it should
probably allow unpaired surrogates to be eaten, stored, and re-emitted
without error in the UTF-8 input case.

If you have a process which is claiming to take in legal characters
and transform them into something else, then you can (A) barf on lone
surrogates or (B) try to fix the situation.

Allowing the user of the API to decide which is preferable in a given
situation is probably the right answer.  I.e., the codec for UTF-8
reading/writing should have strict and non-strict modes.  And strict
mode should be the default.

> The issue of UTF-8 encoded surrogate pairs is clear now to me, I hope:
> You must not write them, but you may read them.

Exactly.  They could exist in nature; their existence cannot be ruled
out, and hence, it may transpire that you could be presented with one.

Rick

From paulp@ActiveState.com Wed Jun 27 21:10:45 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 27 Jun 2001 13:10:45 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> Message-ID: <3B3A3DC5.CA6767FD@ActiveState.com>

Guido van Rossum wrote:
>
>..
>
> Users can choose to write code that's portable between the two
> versions by using surrogates on the narrow platform but not on the
> wide platform.  (This would be a good idea for backward compatibility
> with Python 2.0 and 2.1 anyway.)  The proposed (and current!) behavior
> of \U makes it easy for them to do the right thing with string
> literals; everything else, they just have to write code that won't
> separate surrogate halves.

What is the virtue in making the literal syntax easy and making
unichr() easy when everything else is hard?  Counting characters is
hard.  Addressing characters reliably is hard.  Slicing reliably is
hard.  Why not simplify things?  Surrogates are just characters.  If
you want to handle wide characters you need to build Python that way.

I'm trying to imagine the use-case where you care about surrogates
enough to want them to be automatically generated but not enough to
care about slicing and addressing and counting and ...and is this
use-case worth breaking the invariant that len(unichr(i))==1?

Surrogates: Just say no. :)

-- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook

From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 21:23:00 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 22:23:00 +0200 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: <02c901c0ff30$ed0d9fa0$4ffa42d5@hagrid> (fredrik@pythonware.com) References: <200106262115.f5QLFJ204654@mira.informatik.hu-berlin.de> <005501c0fe8b$f0134d80$4ffa42d5@hagrid> <200106262250.f5QMoO609419@mira.informatik.hu-berlin.de> <200106262334.f5QNYGb18598@odiug.digicool.com> <200106270645.f5R6jBS06348@mira.informatik.hu-berlin.de> <200106271520.f5RFKE519522@odiug.digicool.com> <200106271646.f5RGkbR09253@mira.informatik.hu-berlin.de> <02c901c0ff30$ed0d9fa0$4ffa42d5@hagrid> Message-ID: <200106272023.f5RKN0b13150@mira.informatik.hu-berlin.de>

> before you leave, can you change the ./configure default to ucs2?

Done.

Martin

From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 21:24:41 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v.
Loewis) Date: Wed, 27 Jun 2001 22:24:41 +0200 Subject: [I18n-sig] Re: Unicode surrogates: just say no! In-Reply-To: <200106271744.f5RHiKO19739@odiug.digicool.com> (message from Guido van Rossum on Wed, 27 Jun 2001 13:44:20 -0400) References: <200106271344.JAA08050@unicode.org> <200106271706.f5RH6UT09389@mira.informatik.hu-berlin.de> <200106271744.f5RHiKO19739@odiug.digicool.com> Message-ID: <200106272024.f5RKOfL13151@mira.informatik.hu-berlin.de> > Neither of "strict", "replace" or "ignore" really matches the > semantics of (1) however; perhaps this behavior should be called > "lenient". Sounds good to me (although "lenient" is not even in my passive vocabulary); implementing it may take time, though. Regards, Martin From guido@digicool.com Wed Jun 27 21:49:18 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 16:49:18 -0400 Subject: [I18n-sig] Re: Unicode surrogates: just say no! In-Reply-To: Your message of "Wed, 27 Jun 2001 22:24:41 +0200." <200106272024.f5RKOfL13151@mira.informatik.hu-berlin.de> References: <200106271344.JAA08050@unicode.org> <200106271706.f5RH6UT09389@mira.informatik.hu-berlin.de> <200106271744.f5RHiKO19739@odiug.digicool.com> <200106272024.f5RKOfL13151@mira.informatik.hu-berlin.de> Message-ID: <200106272049.f5RKnIj20036@odiug.digicool.com> > > Neither of "strict", "replace" or "ignore" really matches the > > semantics of (1) however; perhaps this behavior should be called > > "lenient". > > Sounds good to me (although "lenient" is not even in my passive > vocabulary); Have a better suggestion? Maybe "liberal"? (The IETF motto is most often quoted as "be liberal in what you accept and conservative in what you send." Must be a reference to US politics. ;-) > implementing it may take time, though. Not too much -- there isn't a whole lot of checking of the error values until the error occurs, so I think this could be a codec-specific extension. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Wed Jun 27 21:54:37 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 16:54:37 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Wed, 27 Jun 2001 13:10:45 PDT." <3B3A3DC5.CA6767FD@ActiveState.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A3DC5.CA6767FD@ActiveState.com> Message-ID: <200106272054.f5RKsbL20050@odiug.digicool.com> > Guido van Rossum wrote: > > > >.. > > > > Users can choose to write code that's portable between the two > > versions by using surrogates on the narrow platform but not on the > > wide platform. (This would be a good idea for backward compatibility > > with Python 2.0 and 2.1 anyway.) The proposed (and current!) behavior > > of \U makes it easy for them to do the right thing with string > > literals; everything else, they just have to write code that won't > > separate surrogate halves. > > What is the virtue in making the literal syntax easy and making unichr() > easy when everything else is hard? Counting characters is hard. > Addressing characters reliably is hard. Slicing reliably is hard. Why > not simplify things? Surrogates are just characters. If you want to > handle wide characters you need to build Python that way. 
>
> I'm trying to imagine the use-case where you care about surrogates
> enough to want them to be automatically generated but not enough to
> care about slicing and addressing and counting and ...and is this
> use-case worth breaking the invariant that len(unichr(i))==1?
>
> Surrogates: Just say no. :)

\U has supported surrogate creation since Python 2.0 was released, but
I can't find a clear answer in PEP 100 (a.k.a. Misc/unicode.txt; \U
was added after that was finalized).

The use case I've been assuming is simple enough: someone wants to
print "Hello World" in Klingon.  They have a printing routine that
takes Unicode, but only an ASCII keyboard.  They look up the Unicode
values for the Klingon characters spelling "Hello World" in Klingon on
the web.  The characters happen to be in plane 17.  Do we really want
to place the additional burden on them to (a) figure out if their
Python interpreter uses UCS-2 or UCS-4, and (b) correctly implement
the surrogate creation algorithm on the UCS-2 platform?  I don't think
we should.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From rick@unicode.org Wed Jun 27 22:09:57 2001 From: rick@unicode.org (Rick McGowan) Date: Wed, 27 Jun 2001 14:09:57 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Wed, 27 Jun 2001 13:10:45 PDT." <3B3A3DC5.CA6767FD@ActiveState.com> Message-ID: <200106271902.PAA12747@unicode.org>

> someone wants to
> print "Hello World" in Klingon.  They have a printing routine that
> takes Unicode, but only an ASCII keyboard.  They look up the Unicode
> values for the Klingon characters spelling "Hello World" in Klingon

Whew!  Luckily we cut off this avenue for them.  See:
http://www.unicode.org/unicode/alloc/Pipeline.html
and scroll to the bottom.  ;-)

Rick

From JMachin@Colonial.com.au Wed Jun 27 23:50:13 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Thu, 28 Jun 2001 08:50:13 +1000 Subject: [I18n-sig] Unicode surrogates: just say no! Message-ID: <9F2D83017589D211BD1000805FA70CA703B139F6@ntxmel03.cmutual.com.au>

The "nice pair of invariants" for unichr() and ord() seems to involve
what I call "all that variable-length mucking about" and Tim more
robustly called "crap".

IMO, there should be a very short list of places where a narrow
Unicode implementation will need to know anything at all about
surrogates.  This short list will include codecs, the
\Uxxxxxxxx notation for literals, and unichr() --- the users can
ship it into the warehouse and ship it out again, but it won't be
processed as other than 16-bit values.  Attempts to place other
items on the list should be rigorously justified.

Guido asked:
    What should ord(u'\uD800\uDC00') mean on a wide interpreter?

IMO, this should mean an exception on *both* narrow and wide
interpreters, just as ord("xy") does.  ord() should expect one
and only one *character*.

Let's just keep on saying no!

-----Original Message-----
From: Guido van Rossum [mailto:guido@digicool.com]
Sent: Thursday, 28 June 2001 5:57
To: Paul Prescod
Cc: i18n-sig@python.org
Subject: Re: [I18n-sig] Unicode surrogates: just say no!

> Guido van Rossum wrote:
> >
> >...
> >
> > Oooh, hadn't thought of that, but yes, it makes sense!
> >
> > Not yet implemented, but I think it should.  Makes for a nice pair
> > of invariants:
> >
> >     unichr(ord('\Udddddddd')) == '\Udddddddd'
> >     ord(unichr(0xdddddddd)) == 0xdddddddd
> >
> > regardless of whether we're using UCS-2 or UCS-4 storage.
>
> I'm going to presume that ord should accept surrogate pairs on both
> narrow and wide interpreters.
That's a separate question.  On wide interpreters, surrogate pairs
"shouldn't" exist if the app plays by the rules.  But they're easily
created of course!  What should ord(u'\uD800\uDC00') mean on a wide
interpreter?  I think it's nice if you support this.

Of course, if a length-two Unicode string is anything other than a
high surrogate followed by a low surrogate, ord() should be illegal.

--Guido van Rossum (home page: http://www.python.org/~guido/)

_______________________________________________
I18n-sig mailing list
I18n-sig@python.org
http://mail.python.org/mailman/listinfo/i18n-sig

From paulp@ActiveState.com Wed Jun 27 23:54:48 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 27 Jun 2001 15:54:48 -0700 Subject: [I18n-sig] Python Support for "Wide" Unicode characters Message-ID: <3B3A6438.6DA39268@ActiveState.com>

PEP: 261
Title: Python Support for "Wide" Unicode characters
Version: 1.0
Author: paulp@activestate.com (Paul Prescod)
Status: Draft
Type: Standards Track
Python-Version: 2.2
Created: 27-Jun-2001
Post-History: 27-Jun-2001

Abstract

    Python 2.1 unicode characters can have ordinals only up to 65536.
    These characters are known as Basic Multilingual Plane characters.
    There are now characters in Unicode that live on other "planes".
    The largest addressable character in Unicode has the ordinal
    2**20 + 2**16 - 1.  For readability, we will call this TOPCHAR.

Proposed Solution

    One solution would be to merely increase the maximum ordinal to a
    larger value.  Unfortunately the only straightforward
    implementation of this idea is to increase the character code unit
    to 4 bytes.  This has the effect of doubling the size of most
    Unicode strings.  In order to avoid imposing this cost on every
    user, Python 2.2 will allow 4-byte Unicode characters as a
    build-time option.

    The 4-byte option is called "wide Py_UNICODE".  The 2-byte option
    is called "narrow Py_UNICODE".

    Most things will behave identically in the wide and narrow worlds.

    * the \u and \U literal syntaxes will always generate the same
      data that the unichr function would.  They are just different
      syntaxes for the same thing.

    * unichr(i) for 0 <= i < 2**16 always returns a size-one string.

    * unichr(i) for 2**16 <= i <= TOPCHAR will always return a string
      representing the character.

    * BUT on narrow builds of Python, the string will actually be
      composed of two characters called a "surrogate pair".

    * ord() will now accept surrogate pairs and return the ordinal of
      the "wide" character.  Open question: should it accept surrogate
      pairs on wide Python builds?

    * There is an integer value in the sys module that describes the
      largest ordinal for a Unicode character on the current
      interpreter.  sys.maxunicode is 2**16-1 on narrow builds of
      Python.  On wide builds it could be either TOPCHAR or 2**32-1.
      That's an open question.

    * Note that ord() can in some cases return ordinals higher than
      sys.maxunicode because it accepts surrogate pairs on narrow
      Python builds.
    * codecs will be upgraded to support "wide characters".  On
      narrow Python builds, the codecs will generate surrogate pairs;
      on wide Python builds they will generate a single character.

    * new codecs will be written for 4-byte Unicode and older codecs
      will be updated to recognize surrogates and map them to wide
      characters on wide Pythons.

    * there are no restrictions on constructing strings that use code
      points "reserved for surrogates" improperly.  These are called
      "lone surrogates".  The codecs should disallow reading these
      but you could construct them using string literals or unichr().

Implementation

    There is a new (experimental) define in Include/unicodeobject.h:

        #undef USE_UCS4_STORAGE

    if defined, Py_UNICODE is set to the same thing as Py_UCS4.

    There are new configure options:

        --enable-unicode=ucs2 configures a narrow Py_UNICODE, and
                              uses wchar_t if it fits
        --enable-unicode=ucs4 configures a wide Py_UNICODE likewise
        --enable-unicode      configures Py_UNICODE to wchar_t if
                              available, and to UCS-4 if not; this is
                              the default

    The intention is that --disable-unicode, or --enable-unicode=no
    removes the Unicode type altogether; this is not yet implemented.

Notes

    Note that len(unichr(i))==2 for i>=0x10000 on narrow machines.
    This means (for example) that the following code is not portable:

        x = 0x10000
        if unichr(x) in somestring:
            ...

    In general, you should be careful using "in" if the character
    that is searched for could have been generated from unichr
    applied to a number greater than 0x10000 or from a string literal
    greater than 0x10000.

    This PEP does NOT imply that people using Unicode need to use a
    4-byte encoding.  It only allows them to do so.  For example,
    ASCII is still a legitimate (7-bit) Unicode encoding.

Open Questions

    "Code points" above TOPCHAR cannot be expressed in two 16-bit
    characters.  These are not assigned to Unicode characters and
    supposedly will never be.  Should we allow them to be passed as
    arguments to unichr() anyhow?  We could allow knowledgeable
    programmers to use these "unused" characters for whatever they
    want, though Unicode does not address them.

    "Lone surrogates" "should not" occur on wide platforms.  Should
    ord() still accept them?

--
Take a recipe. Leave a recipe.
Python Cookbook!  http://www.ActiveState.com/pythoncookbook

From paulp@ActiveState.com Wed Jun 27 23:58:38 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 27 Jun 2001 15:58:38 -0700 Subject: No Klingon? was Re: [I18n-sig] Unicode surrogates: just say no! References: <200106271902.PAA12747@unicode.org> Message-ID: <3B3A651E.1C3EAEE0@ActiveState.com>

Rick McGowan wrote:
>
>...
>
> Whew!  Luckily we cut off this avenue for them.  See:
> http://www.unicode.org/unicode/alloc/Pipeline.html
> and scroll to the bottom.

You should have told us that Klingon was rejected before we went to
all of this work!  Did you think we were interested in the Japanese
dentistry characters?  The Wiggly Fences?  Shavian?

--
Take a recipe. Leave a recipe.
Python Cookbook!  http://www.ActiveState.com/pythoncookbook

From guido@digicool.com Wed Jun 27 23:58:36 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 18:58:36 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Thu, 28 Jun 2001 08:50:13 +1000."
<9F2D83017589D211BD1000805FA70CA703B139F6@ntxmel03.cmutual.com.au> References: <9F2D83017589D211BD1000805FA70CA703B139F6@ntxmel03.cmutual.com.au> Message-ID: <200106272258.f5RMwaZ20144@odiug.digicool.com> > The "nice pair of invariants" for unichr() and ord() seem to involve > what I call "all that variable-length mucking about" and Tim more > robustly called "crap". > > IMO, there should be a very short list of places where a narrow > Unicode implementation will need to know anything at all about > surrogates. This short list will include codecs, the > \Uxxxxxxxx notation for literals, and unichr() --- the users can > ship it into the warehouse and ship it out again, but it won't be > processed as other than 16-bit values. Attempts to place other > items on the list should be rigorously justified. Thanks, that's about what I wanted to say! But I assume you meant to include ord() in that list, as it is unichr()'s inverse. We should have one place that implements the surrogate creation magic (unichr) and one place that implements the surrogate unpacking magic (ord). (Plus \U, which is to act like unichr(), and codecs.) > Guido asked: > What should ord(u'\uD800\uDC00') mean on a wide interpreter? > > IMO, this should mean an exception on *both* narrow and wide > interpreters, just as ord("xy") does. ord() should expect one > and only one *character* But on a narrow interpreter, that's a valid surrogate pair, so it's a single character, so ord() *should* return 0x10000 for this example. > Let's just keep on saying no! Yes! --Guido van Rossum (home page: http://www.python.org/~guido/) From JMachin@Colonial.com.au Thu Jun 28 00:14:17 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Thu, 28 Jun 2001 09:14:17 +1000 Subject: [I18n-sig] Unicode surrogates: just say no! Message-ID: <9F2D83017589D211BD1000805FA70CA703B139F7@ntxmel03.cmutual.com.au> Guido said: But on a narrow interpreter, that's a valid surrogate pair, so it's a single character, so ord() *should* return 0x10000 for this example. IMO, once you say that a "valid surrogate pair" is a "single character" in a narrow implementation, people will want to do the indexing / slicing /dicing thing as well. ord() is just the thin end of the wedge. "No" should mean "no". unichr() and ord() should be inverses *only* in respect of scalar values up to sys.maxunicode. -----Original Message----- From: Guido van Rossum [mailto:guido@digicool.com] Sent: Thursday, 28 June 2001 8:59 To: Machin, John Cc: Paul Prescod; i18n-sig@python.org Subject: Re: [I18n-sig] Unicode surrogates: just say no! > The "nice pair of invariants" for unichr() and ord() seem to involve > what I call "all that variable-length mucking about" and Tim more > robustly called "crap". > > IMO, there should be a very short list of places where a narrow > Unicode implementation will need to know anything at all about > surrogates. This short list will include codecs, the > \Uxxxxxxxx notation for literals, and unichr() --- the users can > ship it into the warehouse and ship it out again, but it won't be > processed as other than 16-bit values. Attempts to place other > items on the list should be rigorously justified. Thanks, that's about what I wanted to say! But I assume you meant to include ord() in that list, as it is unichr()'s inverse. We should have one place that implements the surrogate creation magic (unichr) and one place that implements the surrogate unpacking magic (ord). (Plus \U, which is to act like unichr(), and codecs.) 
> Guido asked: > What should ord(u'\uD800\uDC00') mean on a wide interpreter? > > IMO, this should mean an exception on *both* narrow and wide > interpreters, just as ord("xy") does. ord() should expect one > and only one *character* But on a narrow interpreter, that's a valid surrogate pair, so it's a single character, so ord() *should* return 0x10000 for this example. > Let's just keep on saying no! Yes! --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Thu Jun 28 00:19:49 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 19:19:49 -0400 Subject: [I18n-sig] Python Support for "Wide" Unicode characters In-Reply-To: Your message of "Wed, 27 Jun 2001 15:54:48 PDT." <3B3A6438.6DA39268@ActiveState.com> References: <3B3A6438.6DA39268@ActiveState.com> Message-ID: <200106272319.f5RNJnO20162@odiug.digicool.com> Nice job, Paul! I especially like the notion of narrow and wide Pythons. :-) In the style of the PEP process, there should probably be some discussion of the alternatives that were proposed, considered and rejected, in particular (1) place the burden of surrogate handling on the application, possibly with limited library support, and (2) try to mend the unicode string object so that it is always indexed in characters, even if it contains surrogates. > PEP: 261 > Title: Python Support for "Wide" Unicode characters > Version: 1.0 > Author: paulp@activestate.com (Paul Prescod) > Status: Draft > Type: Standards Track > Python-Version: 2.2 > Created: 27-Jun-2001 > Post-History: 27-Jun-2001 I think PEPs should get wider distribution than a SIG. Maybe after the first round of comments on i18n-sig is over you can post it to c.l.py(.a) and python-dev? > Abstract > > Python 2.1 unicode characters can have ordinals only up to 65535. > These characters are known as Basic Multilingual Plane characters. > There are now characters in Unicode that live on other "planes". > The largest addressable character in Unicode has the ordinal > 2**20 + 2**16 - 1. For readability, we will call this TOPCHAR. I would express this as 17 * 2**16 - 1, to emphasize the fact that there are 17 planes of 2**16 characters each. > Proposed Solution > > One solution would be to merely increase the maximum ordinal to a > larger value. Unfortunately the only straightforward implementation > of this idea is to increase the character code unit to 4 bytes. This > has the effect of doubling the size of most Unicode strings. In > order to avoid imposing this cost on every user, Python 2.2 will > allow 4-byte Unicode characters as a build-time option. > > > The 4-byte option is called "wide Py_UNICODE". The 2-byte option > is called "narrow Py_UNICODE". > > Most things will behave identically in the wide and narrow worlds. > > * the \u and \U literal syntaxes will always generate the same > data that the unichr function would. They are just different > syntaxes for the same thing.
> > * unichr(i) for 0 <= i < 2**16 always returns a size-one string. > > * unichr(i) for 2**16 <= i <= TOPCHAR will always > return a string representing the character. > > * BUT on narrow builds of Python, the string will actually be > composed of two characters called a "surrogate pair". Can't call these characters. Maybe use "characters" in quotes, maybe use code points or items. > * ord() will now accept surrogate pairs and return the ordinal of > the "wide" character. Open question: should it accept surrogate > pairs on wide Python builds? After thinking about it, I think it should. Apps that are written specifically to handle surrogates (e.g. a conversion tool to remove surrogates!) should work on wide interpreters, and ord() is the only way to get the character value from a surrogate pair (short of implementing the shifts and masks yourself, which is doable but a pain). > * There is an integer value in the sys module that describes the > largest ordinal for a Unicode character on the current > interpreter. sys.maxunicode is 2**16-1 on narrow builds of > Python. On wide builds it could be either TOPCHAR > or 2**32-1. That's an open question. Given its name I think it should be TOPCHAR, even if unichr() accepts larger values. > * Note that ord() can in some cases return ordinals > higher than sys.maxunicode because it accepts surrogate pairs > on narrow Python builds. > > * codecs will be upgraded to support "wide characters". On narrow > Python builds, the codecs will generate surrogate pairs, on > wide Python builds they will generate a single character. Maybe add a note that this is the main thing that hasn't been fully implemented yet; everything else except the extended ord() is implemented now, AFAIK. > * new codecs will be written for 4-byte Unicode and older codecs > will be updated to recognize surrogates and map them to wide ^^^^^^^^^^ Make that "surrogate pairs" > characters on wide Pythons. > > * there are no restrictions on constructing strings that use > code points "reserved for surrogates" improperly. These are > called "lone surrogates". The codecs should disallow reading > these but you could construct them using string literals or > unichr(). > > Implementation > > There is a new (experimental) define in Include/unicodeobject.h: > > #undef USE_UCS4_STORAGE > > if defined, Py_UNICODE is set to the same thing as Py_UCS4. > > USE_UCS4_STORAGE USE_UCS4_STORAGE is no more. Long live Py_UNICODE_SIZE (2 or 4). > There are new configure options: > > --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses > wchar_t if it fits > --enable-unicode=ucs4 configures a wide Py_UNICODE likewise > --enable-unicode configures Py_UNICODE to wchar_t if > available, > and to UCS-4 if not; this is the default Not any more; the default is ucs2 now. > The intention is that --disable-unicode, or --enable-unicode=no > removes the Unicode type altogether; this is not yet implemented. > > Notes > > Note that len(unichr(i))==2 for i>=0x10000 on narrow machines. > > This means (for example) that the following code is not portable: > > x = 0x10000 > if unichr(x) in somestring: > ... > > In general, you should be careful using "in" if the character > that is searched for could have been generated from unichr applied > to a number greater than 0x10000 or from a \U string literal with > a value greater than 0x10000. I suppose we *could* fix the __contains__ implementation for Unicode objects, but I'm -0 on that. > This PEP does NOT imply that people using Unicode need to use a > 4-byte encoding.
It only allows them to do so. For example, ASCII > is still a legitimate (7-bit) Unicode-encoding. > > Open Questions > > "Code points" above TOPCHAR cannot be expressed in two 16-bit > characters. These are not assigned to Unicode characters and > supposedly will never be. Should we allow them to be passed as > arguments to unichr() anyhow? We could allow knowledgeable > programmers to use these "unused" characters for whatever > they want, though Unicode does not address them. > > "Lone surrogates" "should not" occur on wide platforms. Should > ord() still accept them? Unclear what you tried to say here. You already explained that there are no restrictions on the use of lone surrogates, so ord() has no choice (It would be pretty bad if you could construct a 1-code-point string but ord() couldn't tell you what that code point was). Or did you mean "should ord() accept surrogate pairs?" That question was already asked above. Or did you mean this to be a summary of all open issues? Then there are several more. Nit: there's no copyright clause. All PEPs should have one. Again, thanks!!! --Guido van Rossum (home page: http://www.python.org/~guido/) From paulp@ActiveState.com Thu Jun 28 00:40:36 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 27 Jun 2001 16:40:36 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! References: <9F2D83017589D211BD1000805FA70CA703B139F7@ntxmel03.cmutual.com.au> Message-ID: <3B3A6EF4.A62BD417@ActiveState.com> "Machin, John" wrote: > >... > > IMO, once you say that a "valid surrogate pair" is a "single > character" in a narrow implementation, people will want to do > the indexing / slicing /dicing thing as well. ord() is just the > thin end of the wedge. I'll see your puritanism and raise: unichr(bignum) and \Ubignum are the thin edge of the wedge. :) I would still prefer to abolish the notion of surrogates from anything except codecs. Or at least abolish them now and see if anyone screams. We should do the simplest thing possible and see what happens. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From guido@digicool.com Thu Jun 28 00:38:04 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 19:38:04 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Thu, 28 Jun 2001 09:14:17 +1000." <9F2D83017589D211BD1000805FA70CA703B139F7@ntxmel03.cmutual.com.au> References: <9F2D83017589D211BD1000805FA70CA703B139F7@ntxmel03.cmutual.com.au> Message-ID: <200106272338.f5RNc5k20236@odiug.digicool.com> > Guido said: > But on a narrow interpreter, that's a valid surrogate pair, so it's a > single character, so ord() *should* return 0x10000 for this example. > > IMO, once you say that a "valid surrogate pair" is a "single > character" in a narrow implementation, people will want to do > the indexing / slicing /dicing thing as well. ord() is just the > thin end of the wedge. > > "No" should mean "no". > > unichr() and ord() should be inverses *only* > in respect of scalar values up to sys.maxunicode. Your position is weakened by inconsistency. If you really wanted to be consistent, you should argue against \U and unichr() with ordinals >= 0x10000 on narrow Pythons. :-) IMO ord() and unichr() are so closely tied that either both of them should support surrogate pairs, or none. You know my position. It's not usable as a wedge to get the indexing/slicing/dicing, because the implementation would be too complicated, and we have the wide Python as a mighty weapon.
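For reference, the creation and unpacking "magic" being discussed is only a few lines of arithmetic; a minimal sketch in plain Python (hypothetical helper names, not a proposed API):

    def make_pair(i):
        # Split a code point above 0xFFFF into a high/low surrogate pair.
        assert 0x10000 <= i <= 0x10FFFF
        i = i - 0x10000
        return unichr(0xD800 + (i >> 10)) + unichr(0xDC00 + (i & 0x3FF))

    def pair_value(hi, lo):
        # Recombine a high/low surrogate pair into a single code point.
        assert 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF
        return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

For example, pair_value(0xD800, 0xDC00) gives 0x10000, matching the ord() behaviour argued for above.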
BTW, I quoted Paul: > > * ord() will now accept surrogate pairs and return the ordinal of > > the "wide" character. Open question: should it accept surrogate > > pairs on wide Python builds? and replied: > After thinking about it, I think it should. Apps that are written > specifically to handle surrogates (e.g. a conversion tool to remove > surrogates!) should work on wide interpreters, and ord() is the only > way to get the character value from a surrogate pair (short of > implementing the shifts and masks yourself, which is doable but a > pain). I take that back. On wide Pythons, unichr() doesn't return surrogates either.
Once the whole world uses UCS-4 (around the time Python 3000 is released :-), surrogates can be deprecated anyway. --Guido van Rossum (home page: http://www.python.org/~guido/) From JMachin@Colonial.com.au Thu Jun 28 01:05:39 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Thu, 28 Jun 2001 10:05:39 +1000 Subject: [I18n-sig] Unicode surrogates: just say no! Message-ID: <9F2D83017589D211BD1000805FA70CA703B139F8@ntxmel03.cmutual.com.au> OK. I take (most of) your point on consistency between unichr() and ord(). However there is a practical problem with ord(surrogate_pair) on a narrow Python.

    ord('\x01') -> 1
    ord('\x01\x02') -> exception
    ord(u'\u0001') -> 1
    ord(u'\u0001\u0002') -> exception
    ord(u'\ud800\udc00') -> 0x10000 # magic!

so either (a) a programmer wanting to write (say) the conversion tool that you mentioned still has to work very hard or (b) we redefine ord() so that the arg may also be a Unicode string, and it returns the ordinal of the first character (which may involve two code units) or (c) we provide some other functionality for unpacking Unicode strings into ints -----Original Message----- From: Guido van Rossum [mailto:guido@digicool.com] Sent: Thursday, 28 June 2001 9:38 To: Machin, John Cc: i18n-sig@python.org Subject: Re: [I18n-sig] Unicode surrogates: just say no! > Guido said: > But on a narrow interpreter, that's a valid surrogate pair, so it's a > single character, so ord() *should* return 0x10000 for this example. > > IMO, once you say that a "valid surrogate pair" is a "single > character" in a narrow implementation, people will want to do > the indexing / slicing /dicing thing as well. ord() is just the > thin end of the wedge. > > "No" should mean "no". > > unichr() and ord() should be inverses *only* > in respect of scalar values up to sys.maxunicode. Your position is weakened by inconsistency. If you really wanted to be consistent, you should argue against \U and unichr() with ordinals >= 0x10000 on narrow Pythons. :-) IMO ord() and unichr() are so closely tied that either both of them should support surrogate pairs, or none. You know my position. It's not usable as a wedge to get the indexing/slicing/dicing, because the implementation would be too complicated, and we have the wide Python as a mighty weapon. BTW, I quoted Paul: > > * ord() will now accept surrogate pairs and return the ordinal of > > the "wide" character. Open question: should it accept surrogate > > pairs on wide Python builds? and replied: > After thinking about it, I think it should. Apps that are written > specifically to handle surrogates (e.g. a conversion tool to remove > surrogates!) should work on wide interpreters, and ord() is the only > way to get the character value from a surrogate pair (short of > implementing the shifts and masks yourself, which is doable but a > pain). I take that back. On wide Pythons, unichr() doesn't return surrogates either. Once the whole world uses UCS-4 (around the time Python 3000 is released :-), surrogates can be deprecated anyway. --Guido van Rossum (home page: http://www.python.org/~guido/) From paulp@ActiveState.com Thu Jun 28 01:20:39 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 27 Jun 2001 17:20:39 -0700 Subject: [I18n-sig] Python Support for "Wide" Unicode characters References: <3B3A6438.6DA39268@ActiveState.com> <200106272319.f5RNJnO20162@odiug.digicool.com> Message-ID: <3B3A7857.1593F72@ActiveState.com> Guido van Rossum wrote: > >... > > In the style of the PEP process, there should probably be some > discussion of the alternatives that were proposed, considered and > rejected, in particular (1) place the burden of surrogate handling on > the application, possibly with limited library support, > and (2) try to > mend the unicode string object so that it is always indexed in > characters, even if it contains surrogates. Okay. > > I think PEPs should get wider distribution than a SIG. Maybe after > the first round of comments on i18n-sig is over you can post it to > c.l.py(.a) and python-dev? I agree. That's what I intended. I thought it would be confusing if I posted to the other areas before I had all of my facts right. > I would express this as 17 * 2**16 - 1, to emphasize the fact that > there are 17 planes of 2**16 characters each. Done. > > * BUT on narrow builds of Python, the string will actually be > > composed of two characters called a "surrogate pair". > > Can't call these characters. Maybe use "characters" in quotes, maybe > use code points or items. I think they ARE characters in the Python, not Unicode sense. So I said: * BUT on narrow builds of Python, the string will actually be composed of two characters (in the Python, not Unicode sense) called a "surrogate pair". These two Python characters are logically one Unicode character. > > * There is an integer value in the sys module that describes the > > largest ordinal for a Unicode character on the current > > interpreter. sys.maxunicode is 2**16-1 on narrow builds of > > Python. On wide builds it could be either TOPCHAR > > or 2**32-1. That's an open question. > > Given its name I think it should be TOPCHAR, even if unichr() accepts > larger values. Maybe there is a virtue in having a way to both ask for the largest *legal* Unicode character and the largest character that will fit into a Python character on the platform. I mean in theory the maximum Unicode character is constant but that doesn't mean I want to declare it in my programs explicitly.

    unicodedata.maxchar => always TOPCHAR
    sys.maxunicode => some power of 2 - 1

I'm not entirely happy that we call a thing "sys.maxunicode" and then tell people how to generate larger values. How about sys.maxcodeunit? (or we could remove the whole surrogate building stuff :) ) Do you want to rule on this or call it an open issue?
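Whatever the constant ends up being called, the way applications would use it is the same; a small sketch of branching on build width (assuming only the sys.maxunicode value proposed in the PEP):

    import sys

    if sys.maxunicode > 0xFFFF:
        build = "wide"    # every code point is a single Python character
    else:
        build = "narrow"  # code points above 0xFFFF appear as surrogate pairs

This is essentially the heuristic Guido suggests below ("if sys.maxunicode >= 2**16 then a unicode character can store 32 bits").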
> > * Note that ord() can in some cases return ordinals > > higher than sys.maxunicode because it accepts surrogate pairs > > on narrow Python builds. And if sys.maxunicode is TOPCHAR then you can also get ords greater than sys.maxunicode just by using unichr on values larger than sys.maxunicode. > > * codecs will be upgraded to support "wide characters". On narrow > > Python builds, the codecs will generate surrogate pairs, on > > wide Python builds they will generate a single character. > > Maybe add a note that this is the main thing that hasn't been fully > implemented yet; everything else except the extended ord() is > implemented now, AFAIK. Done. > > * new codecs will be written for 4-byte Unicode and older codecs > > will be updated to recognize surrogates and map them to wide > ^^^^^^^^^^ > Make that "surrogate pairs" Done. > > USE_UCS4_STORAGE > > USE_UCS4_STORAGE is no more. Long live Py_UNICODE_SIZE (2 or 4). Okay. > > There are new configure options: > > > > --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses > > wchar_t if it fits > > --enable-unicode=ucs4 configures a wide Py_UNICODE likewise > > --enable-unicode configures Py_UNICODE to wchar_t if > > available, > > and to UCS-4 if not; this is the default > > Not any more; the default is ucs2 now. So there is no way to get the heuristic of "wchar_t if available, UCS-4 if not". I'm not complaining, just checking. The list of options is just two with ucs2 the default. >... Or did you mean this to be a summary of all open > issues? Then there are several more. What are the open issues in your mind...I'm not clear on what things you've expressed an opinion on and what things you've ruled on. > Nit: there's no copyright clause. All PEPs should have one. Okay. When I hear from you I'll update it. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From rick@unicode.org Thu Jun 28 01:31:11 2001 From: rick@unicode.org (Rick McGowan) Date: Wed, 27 Jun 2001 17:31:11 -0700 Subject: [I18n-sig] Python Support for "Wide" Unicode characters Message-ID: <200106272224.SAA15484@unicode.org> I don't suppose that anyone has actually considered just using a 24-bit scalar type? What would be the downside to doing so? Rick From rick@unicode.org Thu Jun 28 01:34:52 2001 From: rick@unicode.org (Rick McGowan) Date: Wed, 27 Jun 2001 17:34:52 -0700 Subject: No Klingon? was Re: [I18n-sig] Unicode surrogates: just say no! Message-ID: <200106272228.SAA15535@unicode.org> Oh, sorry, Paul... The venerable work in Python unfortunately preceded the rather recent rejection of Klingon. We didn't think anyone was using it! Now, if you'd beamed an armed party into a meeting when I was casting about for some serious reps from the Klingon empire, we could have saved everyone the trouble of rejecting it... ;-) Rick > You should have told us that Klingon was rejected before we went to all > of this work! Did you think we were interested in the Japanese dentistry > characters? The Wiggly Fences? Shavian? From guido@digicool.com Thu Jun 28 01:43:44 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 20:43:44 -0400 Subject: [I18n-sig] Python Support for "Wide" Unicode characters In-Reply-To: Your message of "Wed, 27 Jun 2001 17:31:11 PDT." <200106272224.SAA15484@unicode.org> References: <200106272224.SAA15484@unicode.org> Message-ID: <200106280043.f5S0hi520359@odiug.digicool.com> > I don't suppose that anyone has actually considered just using a 24-bit > scalar type?
What would be the downside to doing so? > > Rick Because of alignment requirements and the absence in general of a 3-byte integral type in C, you can't extract a 24-bit integer given its address without doing something like two shifts and two OR operations. For mostly the same reasons you also can't declare arrays of 3-byte integers, so you'd have to do all your address arithmetic yourself. While none of this makes it impossible, it makes it impractical, because every place in the code that indexes or declares a Py_UNICODE array would have to be patched. The elegance of the 4-byte approach is that almost all code continues to work without changes. (Technically, it's the "smallest integral type containing at least 32 bits" approach. C guarantees there always is such a type, since long is guaranteed to be at least 32 bits. I suppose we could try to be exact and use the "smallest integral type containing at least 21 bits" approach, but it wouldn't make a difference on current practical hardware. It would have 20 years ago, when machines with 24 or 28 bits per word were common. :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Thu Jun 28 01:47:29 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 20:47:29 -0400 Subject: [I18n-sig] Python Support for "Wide" Unicode characters In-Reply-To: Your message of "Wed, 27 Jun 2001 17:20:39 PDT." <3B3A7857.1593F72@ActiveState.com> References: <3B3A6438.6DA39268@ActiveState.com> <200106272319.f5RNJnO20162@odiug.digicool.com> <3B3A7857.1593F72@ActiveState.com> Message-ID: <200106280047.f5S0lTQ20371@odiug.digicool.com> I agree with everything I deleted from the quoting below! > > Given its name I think it should be TOPCHAR, even if unichr() accepts > > larger values. > > Maybe there is a virtue in having a way to both ask for the largest > *legal* Unicode character and the largest character that will fit into a > Python character on the platform. I mean in theory the maximum Unicode > character is constant but that doesn't mean I want to declare it in my > programs explicitly. > > unicodedata.maxchar => always TOPCHAR > sys.maxunicode => some power of 2 - 1 > > I'm not entirely happy that we call a thing "sys.maxunicode" and then > tell people how to generate larger values. How about sys.maxcodeunit? > (or we could remove the whole surrogate building stuff :) ) > > Do you want to rule on this or call it an open issue? Leave it open; personally I'd be happy with the heuristic "if sys.maxunicode >= 2**16 then a unicode character can store 32 bits". > What are the open issues in your mind...I'm not clear on what things > you've expressed an opinion on and what things you've ruled on. Sorry. I meant that there were two open issues listed earlier in the PEP, and one of those was repeated here, so I wasn't sure if this was intended to be a summary and you missed one, or it was intended to be additional open issues and you had a duplicate. Either way is fine but I think you should make up your mind. :-) > When I hear from you I'll update it. Go ahead! --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Thu Jun 28 01:50:37 2001 From: guido@digicool.com (Guido van Rossum) Date: Wed, 27 Jun 2001 20:50:37 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Thu, 28 Jun 2001 10:05:39 +1000."
<9F2D83017589D211BD1000805FA70CA703B139F8@ntxmel03.cmutual.com.au> References: <9F2D83017589D211BD1000805FA70CA703B139F8@ntxmel03.cmutual.com.au> Message-ID: <200106280050.f5S0obA20385@odiug.digicool.com> > OK. I take (most of) your point on consistency between unichr() and ord(). > > However there is a practical problem with ord(surrogate_pair) on a > narrow Python. > > ord('\x01') -> 1 > ord('\x01\x02') -> exception > ord(u'\u0001') -> 1 > ord(u'\u0001\u0002') -> exception > ord(u'\ud800\udc00') -> 0x10000 # magic! > > so either > (a) a programmer wanting to write (say) the > conversion tool that you mentioned still has to work very hard > or (b) we redefine ord() so that the arg may also be a Unicode > string, and it returns the ordinal of the first character (which may involve > two code units) > or (c) we provide some other functionality for unpacking Unicode strings > into ints Yes, the longer I think about this the less I like it. Unfortunately, the surrogate-creating behavior of \U is present in 2.0 and 2.1, so I think we can't reasonably remove this from narrow Python 2.2, and I like the rule that unichr and \U match. But maybe that's the one that should go, and unichr() and ord() should deal with single code points only. Then sys.maxunicode should be the largest value that unichr() will accept. This could be 0xffff (narrow Python), 0x10ffff (wide Python with strict unichr()), or 0xffffffffL (wide Python with liberal unichr()). The latter is an open PEP issue. --Guido van Rossum (home page: http://www.python.org/~guido/) From fredrik@pythonware.com Thu Jun 28 01:56:59 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Thu, 28 Jun 2001 02:56:59 +0200 Subject: [I18n-sig] Python Support for "Wide" Unicode characters References: <200106272224.SAA15484@unicode.org> Message-ID: <002501c0ff6d$3cfc4de0$4ffa42d5@hagrid> Rick McGowan wrote: > I don't suppose that anyone has actually considered just using a 24-bit > scalar type? What would be the downside to doing so? nothing stops you from using 24-bit unsigned integers, if your compiler supports them. Cheers /F From JMachin@Colonial.com.au Thu Jun 28 02:23:55 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Thu, 28 Jun 2001 11:23:55 +1000 Subject: [I18n-sig] Unicode surrogates: just say no! Message-ID: <9F2D83017589D211BD1000805FA70CA703B139FA@ntxmel03.cmutual.com.au> > Unfortunately, the surrogate-creating behavior of \U > is present in 2.0 and 2.1, so I > think we can't reasonably remove this from narrow Python 2.2, and I > like the rule that unichr and \U match. But maybe that's the one that > should go, and unichr() and ord() should deal with single code points > only. My understanding is that very few people noticed that \U was creating surrogate pairs, and my guess would be that nobody would be affected in practice by stopping this behaviour. IOW, I suggest treating "\U -> surrogate pairs" just like the more esoteric parts of xrange() -- or the "Korean mess" in earlier Unicode -- just bury it and move on. IMO, the type of people wanting to fiddle with surrogate pairs in narrow Python would also be capable of whipping up a C extension to unpack a narrow Unicode string into a list of ints and do the shifting and masking necessary with surrogates. If this is not so, then the next preference would be for "someone" to write such a C extension and publicise it. I would volunteer to be that "someone" in the interests of not burdening ord() with "magic".
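For scale, the pure-Python equivalent of such an extension is short; a sketch (hypothetical function name, shown in Python rather than C for brevity):

    def to_scalars(u):
        # Unpack a narrow-build unicode string into a list of code
        # points, combining each well-formed surrogate pair on the way.
        result = []
        i, n = 0, len(u)
        while i < n:
            c = ord(u[i])            # u[i] is a single 16-bit code unit
            if 0xD800 <= c <= 0xDBFF and i + 1 < n:
                c2 = ord(u[i + 1])
                if 0xDC00 <= c2 <= 0xDFFF:
                    c = 0x10000 + ((c - 0xD800) << 10) + (c2 - 0xDC00)
                    i = i + 1
            result.append(c)         # lone surrogates pass through as-is
            i = i + 1
        return result

A C version would do the same shifting and masking directly over the Py_UNICODE buffer.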
From rick@unicode.org Thu Jun 28 02:23:21 2001 From: rick@unicode.org (Rick McGowan) Date: Wed, 27 Jun 2001 18:23:21 -0700 Subject: [I18n-sig] Python Support for "Wide" Unicode characters In-Reply-To: Your message of "Wed, 27 Jun 2001 17:31:11 PDT." <200106272224.SAA15484@unicode.org> Message-ID: <200106272316.TAA16083@unicode.org> Re 24-bit scalar type... Guido, those are good reasons & lotsa juicy downsides. Thanks. Rick From paulp@ActiveState.com Thu Jun 28 02:41:17 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 27 Jun 2001 18:41:17 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! References: <9F2D83017589D211BD1000805FA70CA703B139F8@ntxmel03.cmutual.com.au> <200106280050.f5S0obA20385@odiug.digicool.com> Message-ID: <3B3A8B3D.B838CB80@ActiveState.com> Guido van Rossum wrote: > >... > > Yes, the longer I think about this the less I like it. Unfortunately, > the surrogate-creating behavior of \U is present in 2.0 and 2.1, so I > think we can't reasonably remove this from narrow Python 2.2, and I'm having a hard time caring about backwards compatibility much here. And I can't square it with your enthusiasm for ripping the guts out of poor old xrange. We're talking about a certain kind of *literal* right? Even ASCII literals are rare in my code. Unicode literals are extremely rare. Now consider that we're talking about Unicode literals for characters so obscure that they were passed over by the first three versions of Unicode. And so new that most people don't even know that they are part of Unicode. Let's just put a deprecation warning in for \U where you've asked for a character larger than your build's code unit size. And if there is a need, someone, somewhere will write a beautiful surrogates library that handles all details of surrogate handling. > Then sys.maxunicode should be the largest value that unichr() will > accept. This could be 0xffff (narrow Python), 0x10ffff (wide Python > with strict unichr()), or 0xffffffffL (wide Python with liberal > unichr()). The latter is an open PEP issue. Okay. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From barry@digicool.com Thu Jun 28 02:47:46 2001 From: barry@digicool.com (Barry A. Warsaw) Date: Wed, 27 Jun 2001 21:47:46 -0400 Subject: [I18n-sig] Python Support for "Wide" Unicode characters References: <3B3A6438.6DA39268@ActiveState.com> <200106272319.f5RNJnO20162@odiug.digicool.com> Message-ID: <15162.36034.762209.479359@anthem.wooz.org> GvR> Nit: there's no copyright clause. All PEPs should have one. Whoops, I forgot to nag Paul about that. Feel free to add one when you revise the PEP, Paul . -Barry From tim.one@home.com Thu Jun 28 05:55:08 2001 From: tim.one@home.com (Tim Peters) Date: Thu, 28 Jun 2001 00:55:08 -0400 Subject: [I18n-sig] Unicode surrogates: just say no!
In-Reply-To: <3B3A8B3D.B838CB80@ActiveState.com> Message-ID: [Guido] > Unfortunately, the surrogate-creating behavior of \U is present in > 2.0 and 2.1, so I think we can't reasonably remove this from narrow > Python 2.2 [Paul Prescod] > I'm having a hard time caring about backwards compatibility much here. > And I can't square it with your enthusiasm for ripping the guts out of > poor old xrange. But there's a HUGE difference. The xrange() behaviors we're seeking to shed have been documented for years. But the Python 2.1 Reference Manual's section on Unicode literals reads: 2.4.3 Unicode literals XXX explain more here... in its entirety, and the word "surrogate" appears nowhere at all. Well, OK, it's mentioned twice in unicode.txt, both times in a disclaimer sense ("we don't need no stinkin' surrogates -- and neither do you"). See? I thought you would, if someone just paused to explain it . > ... > Let's just put a deprecation warning in for \U where you've asked for a > character larger than your build's code unit size. More consideration than it merits, if anyone were silly enough to ask me. From tim.one@home.com Thu Jun 28 06:10:33 2001 From: tim.one@home.com (Tim Peters) Date: Thu, 28 Jun 2001 01:10:33 -0400 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: Message-ID: [discussion about PyUnicode_DecodeUTF16] It's nice that we got to chat about portability to Platforms from Mars, but is anyone actually going to work on that function? It shouldn't be hard, I just don't want to see it fall thru the cracks. otoh-falling-between-the-surrogates-is-fine-ly y'rs - tim From paulp@ActiveState.com Thu Jun 28 06:25:12 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Wed, 27 Jun 2001 22:25:12 -0700 Subject: [I18n-sig] Support for "wide" Unicode characters Message-ID: <3B3ABFB8.84C7510B@ActiveState.com> Round 2: I can't check in right now but I'll collect another round of suggestions and then post this to other lists tomorrow. ---- PEP: 261 Title: Support for "wide" Unicode characters Version: $Revision: 1.2 $ Author: paulp@activestate.com (Paul Prescod) Status: Draft Type: Standards Track Created: 27-Jun-2001 Python-Version: 2.2 Post-History: 27-Jun-2001 Abstract Python 2.1 unicode characters can have ordinals only up to 65535. These characters are known as Basic Multilingual Plane characters. There are now characters in Unicode that live on other "planes". The largest addressable character in Unicode has the ordinal 17 * 2**16 - 1. For readability, we will call this TOPCHAR. Proposed Solution One solution would be to merely increase the maximum ordinal to a larger value. Unfortunately the only straightforward implementation of this idea is to increase the character code unit to 4 bytes. This has the effect of doubling the size of most Unicode strings. In order to avoid imposing this cost on every user, Python 2.2 will allow 4-byte Unicode characters as a build-time option. The 4-byte option is called "wide Py_UNICODE". The 2-byte option is called "narrow Py_UNICODE". Most things will behave identically in the wide and narrow worlds. * the \u and \U literal syntaxes will always generate the same data that the unichr function would. They are just different syntaxes for the same thing. * unichr(i) for 0 <= i < 2**16 always returns a size-one string. * unichr(i) for 2**16 <= i <= TOPCHAR will always return a string representing the character. * BUT on narrow builds of Python, the string will actually be composed of two characters (in the Python, not Unicode sense) called a "surrogate pair".
These two Python characters are logically one Unicode character. ISSUE: Should Python return surrogate pairs on narrow builds or should it just disallow them? ISSUE: Should the upper bound of the domain of unichr and range of ord() be TOPCHAR or 2**32-1 or even 2**31? * ord() will now accept surrogate pairs and return the ordinal of the "wide" character. ISSUE: Should Python accept surrogate pairs on wide Python builds? * There is an integer value in the sys module that describes the largest ordinal for a Unicode character on the current interpreter. sys.maxunicode is 2**16-1 on narrow builds of Python. ISSUE: Should sys.maxunicode be TOPCHAR or 2**32-1 or even 2**31 on wide builds? ISSUE: Should there be distinct constants for accessing TOPCHAR and the real upper bound for the domain of unichr? * Note that ord() can in some cases return ordinals higher than sys.maxunicode because it accepts surrogate pairs on narrow Python builds. * codecs will be upgraded to support "wide characters" (represented directly in UCS-4, as surrogate pairs in UTF-16 and as multi-byte sequences in UTF-8). On narrow Python builds, the codecs will generate surrogate pairs, on wide Python builds they will generate a single character. This is the main part of the implementation left to be done. * there are no restrictions on constructing strings that use code points "reserved for surrogates" improperly. These are called "lone surrogates". The codecs should disallow reading these but you could construct them using string literals or unichr(). unichr() is not restricted to values less than either TOPCHAR or sys.maxunicode. ISSUE: Should lone surrogates be allowed as input to ord even on wide platforms where they "should" not occur? Implementation There is a new (experimental) define: #define PY_UNICODE_SIZE 2 There are new configure options: --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses wchar_t if it fits --enable-unicode=ucs4 configures a wide Py_UNICODE likewise --enable-unicode same as "=ucs2" The intention is that --disable-unicode, or --enable-unicode=no removes the Unicode type altogether; this is not yet implemented. Notes Note that len(unichr(i))==2 for i>=2**16 on narrow machines because of the returned surrogates. This means (for example) that the following code is not portable:

    x = 2**16
    if unichr(x) in somestring:
        ...

In general, you should be careful using "in" if the character that is searched for could have been generated from unichr applied to a number greater than 2**16 or from a \U string literal with a value greater than 2**16. This PEP does NOT imply that people using Unicode need to use a 4-byte encoding. It only allows them to do so. For example, ASCII is still a legitimate (7-bit) Unicode-encoding. Rationale for Surrogate Creation Behaviour Python currently supports the construction of a surrogate pair for a large unicode literal character escape sequence. This is basically designed as a simple way to construct "wide characters" even in a narrow Python build. ISSUE: surrogates can be created this way but the user still needs to be careful about slicing, indexing, printing etc. Another option is to remove knowledge of surrogates from everything other than the codecs. Rejected Suggestions There were two primary solutions that were rejected. The first was more or less the status-quo. We could officially say that UTF-16 is the Python character encoding and require programmers to implement wide characters in their application logic.
This is a heavy burden because emulating 32-bit characters is likely to be very inefficient if it is coded entirely in Python. The other solution is to use UTF-16 (or even UTF-8) internally (for efficiency) but present an abstraction of 32-bit characters to the programmer. This would require a much more complex implementation than the accepted solution. In theory, we could move to this implementation in the future without breaking Python code. It would just emulate a wide Python build on narrow Pythons. Copyright This document has been placed in the public domain. Local Variables: mode: indented-text indent-tabs-mode: nil End: -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 21:44:05 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 22:44:05 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <3B3A314C.161FE431@ActiveState.com> (message from Paul Prescod on Wed, 27 Jun 2001 12:17:32 -0700) References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A314C.161FE431@ActiveState.com> Message-ID: <200106272044.f5RKi5g13272@mira.informatik.hu-berlin.de> > > - unichr(i) for 0x10000 <= i <= 0x10ffff (and hence corresponding \u > > and \U) generates a surrogate pair, where u[0] is the high > > surrogate value and u[1] the low surrogate value > > Does this imply that ord() should take in surrogate pairs too? Good question. IMO, it shouldn't, so ord(unichr(n)) may raise exceptions, even for values of n where unichr(n) succeeds. The basic rationale here is: if you need surrogates a lot, you should use a wide unicode implementation. In a narrow unicode implementation, a lot of surprises are likely (although each surprise should be documented, of course). In the specific case, there isn't even a single best solution: If ord of a surrogate pair would return a value, you'd lose the property that ord(s[0])==ord(s) either raises an exception or gives 1. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 21:53:11 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 22:53:11 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <200106271957.f5RJvC219975@odiug.digicool.com> (message from Guido van Rossum on Wed, 27 Jun 2001 15:57:12 -0400) References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A314C.161FE431@ActiveState.com> <200106271930.f5RJUJw19910@odiug.digicool.com> <3B3A3900.CB73F3E0@ActiveState.com> <200106271957.f5RJvC219975@odiug.digicool.com> Message-ID: <200106272053.f5RKrBA13303@mira.informatik.hu-berlin.de> > That's a separate question. On wide interpreters, surrogate pairs > "shouldn't" exist if the app plays by the rules. But they're easily > created of course! What should ord(u'\uD800\uDC00') mean on a wide > interpreter? I think it's nice if you support this. Of course, if a > length-two Unicode string is anything else than a high surrogate > followed by a low surrogate, ord() should be illegal. But then, you get unichr(ord(u'\uD800\uDC00')) <> u'\uD800\uDC00'. Is that acceptable? I'd rather prefer ord not to work on surrogate pairs. It means that code may behave differently, but that is no surprise: len(u'\U00102030') already varies depending on the width of unicode. 
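That variation is easy to demonstrate; a sketch of the same expression on the two builds (assuming the \U behaviour described in the PEP):

    >>> len(u'\U00102030')   # narrow build: stored as a surrogate pair
    2
    >>> len(u'\U00102030')   # wide build: a single character
    1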
Regards, Martin From martin@loewis.home.cs.tu-berlin.de Wed Jun 27 22:00:18 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Wed, 27 Jun 2001 23:00:18 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <200106271953.f5RJrPi19963@odiug.digicool.com> (message from Guido van Rossum on Wed, 27 Jun 2001 15:53:25 -0400) References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A3696.FFA7FCE@ActiveState.com> <200106271953.f5RJrPi19963@odiug.digicool.com> Message-ID: <200106272100.f5RL0IP13334@mira.informatik.hu-berlin.de> > When using UCS-4 mode, I was in favor of allowing unichr() and \U to > specify any value in range(0x100000000L), but that's not what Martin > and Fredrik checked in. Note that if C code somehow creates a UCS-4 > string containing something with the high bit on, ord() will currently > return a negative value on platforms where a C long is 32 bits. Couldn't it be an unenforced rule that C code also must stick to the 17 planes? There are many unenforced rules, like that you must not modify a string unless you've created it by passing a NULL char*, and not handed out a reference to anybody. Effectively, using C code might introduce undefined behaviour. On some systems, ord will return a negative value, on others, a positive one; in a future version, it may produce an error if we find too many people had problems with their C code writing large integers into unicode characters. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Jun 28 07:20:58 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 28 Jun 2001 08:20:58 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <3B3A3DC5.CA6767FD@ActiveState.com> (message from Paul Prescod on Wed, 27 Jun 2001 13:10:45 -0700) References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A3DC5.CA6767FD@ActiveState.com> Message-ID: <200106280620.f5S6Kwe01395@mira.informatik.hu-berlin.de> > What is the virtue in making the literal syntax easy and making unichr() > easy when everything else is hard? Counting characters is hard. > Addressing characters reliably is hard. Slicing reliably is hard. Why > not simplify things? Surrogates are just characters. If you want to > handle wide characters you need to build Python that way. > > I'm trying to imagine the use-case where you care about surrogates > enough to want them to be automatically generated but not enough to care > about slicing and addressing and counting and ...and is this use-case > worth breaking the invariant that len(unichr(i))==1. I'm in favour of supporting the \U notation to denote non-BMP characters even in a "narrow" installation. Whether unichr should also support them is less interesting, but it gives some consistency if it does. The rationale for supporting \U is two-fold: One, importing a module should not fail in one installation, and succeed in another (of the same Python version). Running the module may give different results, but you should be able to generate byte code. Furthermore, people using non-BMP characters in source are probably not very interested in counting the characters: They want to display them. For just displaying them, you need to represent them, and you need the fonts. String manipulation is less important. 
Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Jun 28 08:05:19 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 28 Jun 2001 09:05:19 +0200 Subject: [I18n-sig] UCS-4 configuration In-Reply-To: References: Message-ID: <200106280705.f5S75J701656@mira.informatik.hu-berlin.de> > It's nice that we got to chat about portability to Platforms from Mars, but > is anyone actually going to work on that function? It shouldn't be hard, I > just don't want to see it fall thru the cracks. If nothing happens within three weeks, I will. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Jun 28 08:08:21 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 28 Jun 2001 09:08:21 +0200 Subject: [I18n-sig] Support for "wide" Unicode characters In-Reply-To: <3B3ABFB8.84C7510B@ActiveState.com> (message from Paul Prescod on Wed, 27 Jun 2001 22:25:12 -0700) References: <3B3ABFB8.84C7510B@ActiveState.com> Message-ID: <200106280708.f5S78Lm01658@mira.informatik.hu-berlin.de> > * ord() will now accept surrogate pairs and return the ordinal of > the "wide" character. I'm still -1 on this. > ISSUE: Should sys.maxunicode be TOPCHAR or 2**32-1 or even > 2**31 on wide builds? It should be TOPCHAR, the maximum value that unichr accepts. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Jun 28 07:57:34 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 28 Jun 2001 08:57:34 +0200 Subject: [I18n-sig] Python Support for "Wide" Unicode characters In-Reply-To: <3B3A7857.1593F72@ActiveState.com> (message from Paul Prescod on Wed, 27 Jun 2001 17:20:39 -0700) References: <3B3A6438.6DA39268@ActiveState.com> <200106272319.f5RNJnO20162@odiug.digicool.com> <3B3A7857.1593F72@ActiveState.com> Message-ID: <200106280657.f5S6vYl01625@mira.informatik.hu-berlin.de> > Maybe there is a virtue in having a way to both ask for the largest > *legal* Unicode character and the largest character that will fit into a > Python character on the platform. I mean in theory the maximum Unicode > character is constant but that doesn't mean I want to declare it in my > programs explicitly. > > unicodedata.maxchar => always TOPCHAR > sys.maxunicode => some power of 2 - 1 > > I'm not entirely happy that we call a thing "sys.maxunicode" and then > tell people how to generate larger values. How about sys.maxcodeunit . > (or we could remove the whole surrogate building stuff :) ) -1. The Unicode consortium and ISO have promised that there will never be characters above 0x10ffff. Most of the characters below TOPCHAR are "unassigned", whereas the ones above TOPCHAR are "illegal" (or not even representable in UTF-16). If we were to allow putting very large numbers into Unicode strings, we'd have to check for them in all codecs also. I'd rather disallow them from Python code, and declare using them in C as undefined behaviour. > So there is no way to get the heuristic of "wchar_t if available, UCS-4 > if not". I'm not complaining, just checking. The list of options is just > two with ucs2 the default. I'd be complaining, though, if I wasn't that pleased with this PEP overall. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Jun 28 07:36:55 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Thu, 28 Jun 2001 08:36:55 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! 
In-Reply-To: <9F2D83017589D211BD1000805FA70CA703B139F7@ntxmel03.cmutual.com.au> (JMachin@Colonial.com.au) References: <9F2D83017589D211BD1000805FA70CA703B139F7@ntxmel03.cmutual.com.au> Message-ID: <200106280636.f5S6at701531@mira.informatik.hu-berlin.de> > "No" should mean "no". > > unichr() and ord() should be inverses *only* > in respect of scalar values up to sys.maxunicode. +1. Martin From mkuhn@suse.de Thu Jun 28 09:03:59 2001 From: mkuhn@suse.de (Markus Kuhn) Date: Thu, 28 Jun 2001 10:03:59 +0200 (CEST) Subject: [I18n-sig] Determine encoding from $LANG In-Reply-To: <15160.60506.589750.287186@honolulu.ilog.fr> Message-ID: On Tue, 26 Jun 2001, Bruno Haible wrote: > > A program cannot be considered properly internationalized > until it obeys the current locale (LC_ALL || LC_CTYPE || LANG). > > The programs we are waiting for are: > [...] Add to that list many of the programming languages that use Unicode internally but that do not yet automatically set the default i/o encoding correctly based on LC_ALL || LC_CTYPE || LANG. For example TCL currently uses some primitive LANG substring matching, which basically gets only a few Japanese and Russian encodings right. The TCL function unix/tclUnixInit.c:TclpSetInitialEncodings really should call libcharset or nl_langinfo(CODESET) instead: https://sourceforge.net/tracker/?func=detail&aid=418645&group_id=10894&atid=110894 I suspect that Perl and Python are not much better and don't call nl_langinfo(CODESET) or the portable libcharset wrapper around it either to properly determine the locale-dependent external encoding. References on how to determine the character encoding from the locale in a safe and portable manner: http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate http://clisp.cons.org/~haible/packages-libcharset.html http://www.opengroup.org/onlinepubs/7908799/xsh/langinfo.h.html Markus -- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: From Markus.Kuhn@cl.cam.ac.uk Thu Jun 28 09:20:32 2001 From: Markus.Kuhn@cl.cam.ac.uk (Markus Kuhn) Date: Thu, 28 Jun 2001 09:20:32 +0100 Subject: [I18n-sig] Re: Unicode 3.1 and contradictions. In-Reply-To: Your message of "27 Jun 2001 12:30:05 BST." <4a7kxykvnm.fsf@kern.srcf.societies.cam.ac.uk> Message-ID: > It is a bug to encode a non-BMP character with six > bytes by pretending that the (surrogate) values used in the UTF-16 > representation are BMP characters and encoding the character as > though it were a string consisting of those characters. It is also a > bug to interpret such a six-byte sequence as a single character. > This was clarified in Unicode 3.1. Fully agreed. Independent of what the letter of the standard says, it is absolutely essential for numerous practical security reasons that a UTF-8 decoder accept one and only one possible UTF-8 sequence as the encoding of any Unicode character. ISO 10646 is also very clear that surrogates must not appear in a UTF-8 stream and are malformed UTF-8 sequences. Unicode 3.0 was badly flawed in that respect and that has led to numerous security problems in fielded implementations.
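To make the six-byte bug concrete at the byte level (shown only for illustration; the byte values follow directly from the UTF-8 algorithm):

    # Correct UTF-8 for U+10000: a single four-byte sequence.
    good = '\xF0\x90\x80\x80'

    # The buggy six-byte form: the UTF-16 surrogates U+D800 and U+DC00
    # each encoded as though they were ordinary BMP characters.
    bad = '\xED\xA0\x80\xED\xB0\x80'

A strict decoder accepts only the first form and must reject the second as malformed.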
As I understand it, Unicode 3.1 fixed that, but in any case, no matter what the standard says, you should definitely follow the advice given in the UTF-8 decoder robustness test file http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt and accept only one single representation for every Unicode character, otherwise you just generate nice loopholes for hackers to pass critical characters through non-decoding filters. The UTF-8 representations of U+D800..U+DFFF, U+FFFE, and U+FFFF are not allowed in a UTF-8 stream and a secure UTF-8 decoder must never output any of these characters. http://www.cl.cam.ac.uk/~mgk25/unicode.html Markus -- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: From mal@lemburg.com Thu Jun 28 10:04:07 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 28 Jun 2001 11:04:07 +0200 Subject: [I18n-sig] Re: [Python-Dev] Unicode Maintenance References: <3B39CD51.406C28F0@lemburg.com> <200106271611.f5RGBn819631@odiug.digicool.com> Message-ID: <3B3AF307.6496AFB4@lemburg.com> Guido van Rossum wrote: > > > Looking at the recent burst of checkins for the Unicode implementation > > completely bypassing the standard SF procedure and possible comments > > I might have on the different approaches, I guess I've been ruled out > > as maintainer and designer of the Unicode implementation. > > > > Well, I guess that's how things go. Was nice working for you guys, > > but no longer is... I'm tired of having to defend myself against > > meta-comments about the design, uncontrolled checkins and no true > > backup about my standing in all this from Guido. > > > > Perhaps I am misunderstanding the role of a maintainer and > > implementation designer, but as it is all respect for the work I've > > put into all this seems faded. That's the conclusion I draw from recent > > postings by Martin and Fredrik and their nightly "takeover". > > > > Thanks, > > -- > > Marc-Andre Lemburg > > [For those of us to whom Marc-Andre's complaint comes as a total > surprise: there was a thread on i18n-sig about whether we should > support Unicode surrogates, followed by a conclusion to skip > surrogates and jump directly to optional support for UCS-4, followed > by some checkins that enabled a configuration choice between UCS-2 and > UCS-4, and code to make it work. As a side effect, surrogate support > in the UCS-2 version actually improved slightly.] > > Now, now, Marc-Andre. > > The only comments I recall from you on my "surrogates: just say no" > post seemed favorable, except that you proposed to go all the way and > make UCS-4 mandatory. I explained why I didn't want to go that far, > and why I didn't believe your arguments against giving users a choice. > I didn't hear back from you then, and I didn't think you could have > much of a problem with my position. > > Our process requires the use of the SF patch manager only for > controversial changes. Based on your feedback, I didn't think there > was anything controversial about the changes that Fredrik and Martin > have made! (If there was, IMO it was temporarily breaking the Windows > build and the test suite -- but that's all fixed now.) > > I don't understand where you get the idea that we lost respect for > your work! In fact, the fact that it was so easy to make the changes > suggested to me that the original design was well suited to this > particular change (as opposed to the surrogate support proposals, > which all sounded like they would require a *lot* of changes).
> > I don't think that we have very strict roles in this community anyway. > (My role as BDFL excluded -- that's why I get to write this > response. :-) I'd say that Fredrik owns SRE, because he has asserted > that ownership at various times: he's undone changes by others that > broke the 1.5.2 support, for example. > > But the Unicode support in Python isn't owned by one person: many > folks have contributed to that, including Fredrik, who designed and > wrote the original Unicode string object implementation. > > If you have specific comments about the changes made, please be > specific. If you feel slighted by meta-comments, please also be > specific. I don't think I've said anything derogatory about you or > your design. You didn't get my point. I feel responsible for the Unicode implementation design and would like to see it become a continued success. In that sense and taking into account that I am the maintainer of all this stuff, I think it is very reasonable to ask me before making any significant changes to the implementation and also respect any comments I put forward. Currently, I have to watch the checkins list very closely to find out who changed what in the implementation and then to take actions only after the fact. Since I'm not supporting Unicode as my full-time job this is simply impossible. We have the SF manager and there is really no need to rush anything around here. If I am offline or too busy with other things for a day or two, then I want to see patches on SF and not find new versions of the implementation already checked in. This has worked just fine during the last year, so I can only explain the latest actions in this direction with an urge to bypass my comments and any discussion this might cause. Needless to say that quality control is not possible anymore. Conclusion: I am not going to continue this work if this does not change. Another problem for me is the continued hostility I feel on i18n against parts of the design and some of my decisions. I am not talking about your feedback and the feedback from many other people on the list which was excellent and to high standards. But reading the postings of the last few months you will find notices of what I am referring to here (no, I don't want to be specific). If people don't respect my comments or decision, then how can I defend the design and how can I stop endless discussions which simply don't lead anywhere ? So either I am missing something or there is a need for a clear statement from you about my status in all this. If I don't have the right to comment on proposals and patches, possibly even rejecting them, then I simply don't see any ground for keeping the implementation in a state which I can maintain. And last but not least: The fun-factor has faded which was the main motor driving me into working on Unicode in the first place. Nothing much you can do about this, though :-/ > Paul Prescod offered to write a PEP on this issue. My cynical half > believes that we'll never hear from him again, but my optimistic half > hopes that he'll actually write one, so that we'll be able to discuss > the various issues for the users with the users. I encourage you to > co-author the PEP, since you have a lot of background knowledge about > the issues. I guess your optimistic half won :-) I think Paul already did all the work, so I'll simply comment on what he wrote. > BTW, I think that Misc/unicode.txt should be converted to a PEP, for > the historic record.
It was very much a PEP before the PEP process > was invented. Barry, how much work would this be? No editing needed, > just formatting, and assignment of a PEP number (the lower the better). Thanks for converting the text to PEP format, Barry. Thanks for reading this far, -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From mal@lemburg.com Thu Jun 28 10:27:35 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 28 Jun 2001 11:27:35 +0200 Subject: [I18n-sig] Support for "wide" Unicode characters References: <3B3ABFB8.84C7510B@ActiveState.com> Message-ID: <3B3AF887.5181D0CF@lemburg.com> Paul Prescod wrote: > > Round 2: I can't check in right now but I'll collect another round of > suggestions and then post this to other lists tomorrow. Here you go... > ---- > PEP: 261 > Title: Support for "wide" Unicode characters > Version: $Revision: 1.2 $ > Author: paulp@activestate.com (Paul Prescod) > Status: Draft > Type: Standards Track > Created: 27-Jun-2001 > Python-Version: 2.2 > Post-History: 27-Jun-2001 > > Abstract > > Python 2.1 unicode characters can have ordinals only up to 65535. > These characters are known as Basic Multilingual Plane characters. > There are now characters in Unicode that live on other "planes". > The largest addressable character in Unicode has the ordinal > 17 * 2**16 - 1. For readability, we will call this TOPCHAR. I would add hex notations for those who are more familiar with HEX and Unicode (which uses HEX to pinpoint code points). Also, a suggestion: I think to avoid all the problems of understanding the different terms in this PEP, I'd do two things: 1. add a Glossary (copying from the Unicode glossary) 2. use the standard Unicode terms throughout the PEP (code points, code units, etc.) The reason is that otherwise you'll get confusion about what you mean by noncharacter characters ;-) > Proposed Solution > > One solution would be to merely increase the maximum ordinal to a > larger value. Unfortunately the only straightforward > implementation of this idea is to increase the character code unit > to 4 bytes. This has the effect of doubling the size of most > Unicode strings. In order to avoid imposing this cost on every > user, Python 2.2 will allow 4-byte Unicode characters as a > build-time option. > > The 4-byte option is called "wide Py_UNICODE". The 2-byte option > is called "narrow Py_UNICODE". > > Most things will behave identically in the wide and narrow worlds. > > * the \u and \U literal syntaxes will always generate the same > data that the unichr function would. They are just different > syntaxes for the same thing. > > * unichr(i) for 0 <= i < 2**16 always returns a size-one string. > > * unichr(i) for 2**16 <= i <= TOPCHAR will always return a > string representing the character. -1. If the platform does not support the character in question, then this should raise a ValueError instead of returning anything with len() > 1. Reasoning: u[i] in Python should always refer to a code point *and* code unit in the Unicode sense. If this is not possible, raise an exception. > * BUT on narrow builds of Python, the string will actually be > composed of two characters (in the Python, not Unicode sense) > called a "surrogate pair". These two Python characters are > logically one Unicode character. > > ISSUE: Should Python return surrogate pairs on narrow builds > or should it just disallow them?
> > ISSUE: Should the upper bound of the domain of unichr and > range of ord() be TOPCHAR or 2**32-1 or even 2**31? -1. See above. > * ord() will now accept surrogate pairs and return the ordinal of > the "wide" character. > > ISSUE: Should Python accept surrogate pairs on wide > Python builds? -1. Have the codecs do the business of dealing with surrogates and ord() return the code point ordinal (isolated surrogates are code points as well; they are not Unicode characters though). > * There is an integer value in the sys module that describes the > largest ordinal for a Unicode character on the current > interpreter. sys.maxunicode is 2**16-1 on narrow builds of > Python. > > ISSUE: Should sys.maxunicode be TOPCHAR or 2**32-1 or even > 2**31 on wide builds? > > ISSUE: Should there be distinct constants for accessing > TOPCHAR and the real upper bound for the domain of > unichr? Hmm, not sure. Wouldn't it be better to simply add an attribute sys.unicodewidth == 'narrow' | 'wide' ? This leaves out all the complicated issues and redirects people to this PEP. > * Note that ord() can in some cases return ordinals higher than > sys.maxunicode because it accepts surrogate pairs on narrow > Python builds. -1. > * codecs will be upgraded to support "wide characters" > (represented directly in UCS-4, as surrogate pairs in UTF-16 and > as multi-byte sequences in UTF-8). On narrow Python builds, the > codecs will generate surrogate pairs, on wide Python builds they > will generate a single character. This is the main part of the > implementation left to be done. +1. This is how surrogates should be treated: in the codecs ! > * there are no restrictions on constructing strings that use > code points "reserved for surrogates" improperly. These are > called "lone surrogates". Better call them "isolated surrogates"; that's the term Mark Davis used and he should know. > The codecs should disallow reading > these but you could construct them using string literals or > unichr(). unichr() is not restricted to values less than either > TOPCHAR or sys.maxunicode. > > ISSUE: Should lone surrogates be allowed as input to ord even > on wide platforms where they "should" not occur? Yes, see above. Isolated surrogates are true code points. > Implementation > > There is a new (experimental) define: > > #define PY_UNICODE_SIZE 2 Doesn't sizeof(Py_UNICODE) do the same ? > There are new configure options: > > --enable-unicode=ucs2 configures a narrow Py_UNICODE, and uses > wchar_t if it fits > --enable-unicode=ucs4 configures a wide Py_UNICODE likewise With "likewise" meaning: "and uses wchar_t if it fits" ! > --enable-unicode same as "=ucs2" > > The intention is that --disable-unicode, or --enable-unicode=no > removes the Unicode type altogether; this is not yet implemented. Let's add the UCS-2/UCS-4 stuff first and only then think about adding the removal #ifdefs. > Notes > > Note that len(unichr(i))==2 for i>=2**16 on narrow machines > because of the returned surrogates. -1. See above. > This means (for example) that the following code is not portable: > > x = 2**16 > if unichr(x) in somestring: > ... > > In general, you should be careful using "in" if the character that > is searched for could have been generated from unichr applied to a > number greater than 2**16 or from a string literal greater than > 2**16. > > This PEP does NOT imply that people using Unicode need to use a > 4-byte encoding. It only allows them to do so. For example, > ASCII is still a legitimate (7-bit) Unicode-encoding.
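The portability pitfall in the Notes section can be made concrete. Here is a sketch of a membership test that gives the same answer on narrow and wide builds (contains_scalar is a hypothetical helper, not part of the PEP):

    import sys

    def contains_scalar(s, cp):
        # True if unicode string s contains code point cp.
        if cp <= 0xFFFF or sys.maxunicode > 0xFFFF:
            return unichr(cp) in s
        # Narrow build, non-BMP code point: search for the
        # equivalent two-unit surrogate pair instead.
        hi = 0xD800 + ((cp - 0x10000) >> 10)
        lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)
        return (unichr(hi) + unichr(lo)) in s

Note that on a narrow build this still cannot distinguish an intended non-BMP character from the same two code units appearing as isolated surrogates; that ambiguity is precisely the narrow-build problem the PEP describes.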
> > Rationale for Surrogate Creation Behaviour > > Python currently supports the construction of a surrogate pair > for a large unicode literal character escape sequence. This is > basically designed as a simple way to construct "wide characters" > even in a narrow Python build. > > ISSUE: surrogates can be created this way but the user still > needs to be careful about slicing, indexing, printing > etc. Another option is to remove knowledge of > surrogates from everything other than the codecs. Side note: Python uses the unicode-escape codec for interpreting the Unicode literals. This means that narrow builds will also support the full range of UCS-4 -- using surrogates if needed. This introduces an incompatibility between narrow and wide builds at run-time. PYC should not be harmed by this since they store Unicode strings using UTF-8. > Rejected Suggestions > > There were two primary solutions that were rejected. The first was > more or less the status-quo. We could officially say that UTF-16 > is the Python character encoding and require programmers to > implement wide characters in their application logic. This is a > heavy burden because emulating 32-bit characters is likely to be > very inefficient if it is coded entirely in Python. > > The other solution is to use UTF-16 (or even UTF-8) internally > (for efficiency) but present an abstraction of 32-bit characters > to the programmer. This would require a much more complex > implementation than the accepted solution. In theory, we could > move to this implementation in the future without breaking Python > code. It would just emulate a wide Python build on narrow > Pythons. > > Copyright > > This document has been placed in the public domain. > > Local Variables: > mode: indented-text > indent-tabs-mode: nil > End: -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From guido@digicool.com Thu Jun 28 12:17:03 2001 From: guido@digicool.com (Guido van Rossum) Date: Thu, 28 Jun 2001 07:17:03 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Thu, 28 Jun 2001 00:55:08 EDT." References: Message-ID: <200106281117.f5SBH3Z20788@odiug.digicool.com> OK, I'm convinced that ord() should only work on single-unit strings. If we're going to deprecate creating surrogates with \U, I think unichr() should follow suit. (My Klingon use case had a need for \U but not for unichr() doing this.) But reasonable people can argue over this. [Tim] > But there's a HUGE difference. The xrange() behaviors we're seeking to shed > have been documented for years. Oh yeah? Where? The docs for XRange objects are very vague, claiming that they "behave like tuples" and have a tolist() method. Well, they can't be concatenated, so they don't behave like tuples. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Thu Jun 28 12:25:30 2001 From: guido@digicool.com (Guido van Rossum) Date: Thu, 28 Jun 2001 07:25:30 -0400 Subject: [I18n-sig] Re: Unicode 3.1 and contradictions. In-Reply-To: Your message of "Thu, 28 Jun 2001 09:20:32 BST." References: Message-ID: <200106281125.f5SBPVc20814@odiug.digicool.com> > The UTF-8 representations of U+D800..U+DFFF, U+FFFE, and U+FFFF are not > allowed in a UTF-8 stream and a secure UTF-8 decoder must never output > any of these characters. Can you explain a bit more about the security issues? 
--Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Thu Jun 28 12:33:25 2001 From: guido@digicool.com (Guido van Rossum) Date: Thu, 28 Jun 2001 07:33:25 -0400 Subject: [I18n-sig] Support for "wide" Unicode characters In-Reply-To: Your message of "Thu, 28 Jun 2001 11:27:35 +0200." <3B3AF887.5181D0CF@lemburg.com> References: <3B3ABFB8.84C7510B@ActiveState.com> <3B3AF887.5181D0CF@lemburg.com> Message-ID: <200106281133.f5SBXQ020837@odiug.digicool.com> > > There is a new (experimental) define: > > > > #define PY_UNICODE_SIZE 2 > > Doesn't sizeof(Py_UNICODE) do the same ? Not on a Cray! And not in the C standard. Ask Tim. :-) > This introduces an incompatibility between narrow and wide > builds at run-time. PYC should not be harmed by this since they > store Unicode strings using UTF-8. Does UTF-8 transfer isolated surrogates correctly? I think that's necessary, otherwise I can't marshal or unmarshal literals containing those, which means that .pyc files for .py files containing those can't be read (or maybe aren't portable between wide and narrow interpreters). Note that I'm OK with the UTF-8 encoder recognizing hi+lo surrogate pairs and encoding them as one Unicode character, since the decoder generates surrogates for non-BMP characters on a narrow platform. --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Thu Jun 28 12:37:06 2001 From: guido@digicool.com (Guido van Rossum) Date: Thu, 28 Jun 2001 07:37:06 -0400 Subject: [I18n-sig] Support for "wide" Unicode characters In-Reply-To: Your message of "Wed, 27 Jun 2001 22:25:12 PDT." <3B3ABFB8.84C7510B@ActiveState.com> References: <3B3ABFB8.84C7510B@ActiveState.com> Message-ID: <200106281137.f5SBb6r20850@odiug.digicool.com> Whether \U can create surrogates should now be marked as an open issue as well, like for unichr(). No further comments but agree with what others have said; I like the idea of adding a Glossary and using the Unicode terminology correctly. --Guido van Rossum (home page: http://www.python.org/~guido/) From Markus.Kuhn@cl.cam.ac.uk Thu Jun 28 12:48:40 2001 From: Markus.Kuhn@cl.cam.ac.uk (Markus Kuhn) Date: Thu, 28 Jun 2001 12:48:40 +0100 Subject: [I18n-sig] Re: Unicode 3.1 and contradictions. In-Reply-To: Your message of "Thu, 28 Jun 2001 07:25:30 EDT." <200106281125.f5SBPVc20814@odiug.digicool.com> Message-ID: Guido van Rossum wrote on 2001-06-28 11:25 UTC: > > The UTF-8 representations of U+D800..U+DFFF, U+FFFE, and U+FFFF are not > > allowed in a UTF-8 stream and a secure UTF-8 decoder must never output > > any of these characters. > > Can you explain a bit more about the security issues? There are two ways of processing UTF-8 encoded UCS text: a) as a UTF-8 bytestream b) as a stream of decoded integer code values (32-bit wchar_t, etc.) Problems arise if security-relevant checks are done in one representation and interpretation of the data is done in the other. Imagine you have an application with the following processing steps: - read a UTF-8 string - apply a substring test to convince yourself that certain characters are not present in the string - decode UTF-8 - use the decoded string in an application where presence of the tested characters could be security critical The classical example is a Win32 web server, where a UTF-8 URL is fed in, tested by a script in UTF-8 to be free of the byte sequence '/../', and then UTF-8 decoded and fed into a UTF-16 API for file system access.
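A concrete sketch of the bypass being described, assuming a hypothetical decode_lax() that wrongly accepts overlong sequences (the analysis continues below):

    # 0xC0 0xAF is an overlong, illegal two-byte spelling of '/' (0x2F).
    url = '\xc0\xaf..\xc0\xafsecret'
    assert url.find('/../') == -1      # the byte-level filter sees nothing
    # decode_lax(url) would nevertheless yield u'/../secret', so the
    # filtered-out path escape reappears after decoding.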
Even though the presence of '/../' encoded in ASCII was filtered out, the same character sequence can still be passed past the filter by a clever attacker using alternative encodings that an unsafe UTF-8 decoder might accept, for instance an overlong sequence for any of the characters. This problem is most severe with non-ASCII representations of ASCII characters by overlong UTF-8 sequences, because ASCII characters often have lots of special functions associated, but it also occurs with other tests. For example, it should be perfectly legitimate to test a UTF-8 string to be free of non-BMP characters by simply testing that no byte >= 0xE0 is present, without the far less efficient use of a UTF-8 decoder. Other risks are people smuggling a UTF-8 encoded U+FFFE or U+FFFF into a system, which when decoded into UTF-16 might be interpreted as an instruction to swap the byte sex (anti-BOM) or as some generic escape-or-end-of-string/file character (U+FFFF). The golden rule that there must be exactly one single UTF-8 byte sequence that can result in the output of a certain Unicode character and that Unicode code positions reserved for special non-character use such as U+D800..U+DFFF, U+FFFE, and U+FFFF should never be generated by a UTF-8 decoder eliminates all these potential pitfalls. http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt Markus -- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: From fredrik@pythonware.com Thu Jun 28 12:55:25 2001 From: fredrik@pythonware.com (Fredrik Lundh) Date: Thu, 28 Jun 2001 13:55:25 +0200 Subject: [I18n-sig] Support for "wide" Unicode characters References: <3B3ABFB8.84C7510B@ActiveState.com> <3B3AF887.5181D0CF@lemburg.com> <200106281133.f5SBXQ020837@odiug.digicool.com> Message-ID: <00b601c0ffc9$38ae4bb0$0900a8c0@spiff> guido wrote: > > > There is a new (experimental) define: > > > > > > #define PY_UNICODE_SIZE 2 > > > > Doesn't sizeof(Py_UNICODE) do the same ? > > Not on a Cray! And not in the C standard. Ask Tim. :-) not to mention that the preprocessor doesn't understand sizeof(type)... (note that in the current implementation, the Py_UNICODE_WIDE macro is used to enable wide storage and disable the surrogate stuff. it's currently set if PY_UNICODE_SIZE >= 4, but it might be better to do it the other way around) Cheers /F From JMachin@Colonial.com.au Thu Jun 28 13:27:45 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Thu, 28 Jun 2001 22:27:45 +1000 Subject: [I18n-sig] Support for "wide" Unicode characters Message-ID: <9F2D83017589D211BD1000805FA70CA703B13A02@ntxmel03.cmutual.com.au> Guido asked: Does UTF-8 transfer isolated surrogates correctly? No. See my bug report in SF. Briefly, a lone high surrogate has its leading UTF-8 byte omitted, causing an illegal UTF-8 sequence to be generated. Here's the URL: http://sourceforge.net/tracker/?group_id=5470&atid=105470&func=detail&aid=433882 (or search for "surrogates")
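For reference, the correct three-byte form is easy to compute. A sketch (hypothetical helper, not the stdlib codec) of the standard 3-byte UTF-8 encoding for any value in U+0800..U+FFFF, a range which includes a lone surrogate such as U+D800:

    def utf8_3byte(cp):
        # Standard 3-byte UTF-8 encoding for U+0800..U+FFFF.
        assert 0x800 <= cp <= 0xFFFF
        return chr(0xE0 | (cp >> 12)) + \
               chr(0x80 | ((cp >> 6) & 0x3F)) + \
               chr(0x80 | (cp & 0x3F))

utf8_3byte(0xD800) gives '\xed\xa0\x80' -- three bytes; the bug described above drops the leading 0xED and emits only '\xa0\x80'.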
From mal@lemburg.com Thu Jun 28 14:11:04 2001 From: mal@lemburg.com (M.-A. Lemburg) Date: Thu, 28 Jun 2001 15:11:04 +0200 Subject: [I18n-sig] Support for "wide" Unicode characters References: <3B3ABFB8.84C7510B@ActiveState.com> <3B3AF887.5181D0CF@lemburg.com> <200106281133.f5SBXQ020837@odiug.digicool.com> Message-ID: <3B3B2CE8.B1A062C4@lemburg.com> Guido van Rossum wrote: > > > > There is a new (experimental) define: > > > > > > #define PY_UNICODE_SIZE 2 > > > > Doesn't sizeof(Py_UNICODE) do the same ? > > Not on a Cray! And not in the C standard. Ask Tim. :-) Ah, OK... nice sofas these Crays, BTW ;-) > > This introduces an incompatibility between narrow and wide > > builds at run-time. PYC should not be harmed by this since they > > store Unicode strings using UTF-8. > > Does UTF-8 transfer isolated surrogates correctly? I think that's > necessary, otherwise I can't marshal or unmarshal literals containing > those, which means that .pyc files for .py files containing those > can't be read (or maybe aren't portable between wide and narrow > interpreters). It handles surrogates correctly, but rejects isolated ones on input (easy to fix though) and passes them through on output. As I said before, surrogate support is far from being complete. > Note that I'm OK with the UTF-8 encoder recognizing hi+lo surrogate > pairs and encoding them as one Unicode character, since the decoder > generates surrogates for non-BMP characters on a narrow platform. That's what it currently does. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From JMachin@Colonial.com.au Thu Jun 28 14:41:08 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Thu, 28 Jun 2001 23:41:08 +1000 Subject: [I18n-sig] Support for "wide" Unicode characters Message-ID: <9F2D83017589D211BD1000805FA70CA703B13A03@ntxmel03.cmutual.com.au> [Guido van Rossum] > store Unicode strings using UTF-8. > > Does UTF-8 transfer isolated surrogates correctly? [Marc-Andre Lemburg] It handles surrogates correctly, but rejects isolated ones on input (easy to fix though) and passes them through on output. As I said before, surrogate support is far from being complete. Marc-Andre, there is a *bug* in 2.1 encoding isolated high surrogates. I reported it and you assigned it to yourself on 23 June. Lookee here: Python 2.1 (#15, Apr 16 2001, 18:25:49) [MSC 32 bit (Intel)] on win32 Type "copyright", "credits" or "license" for more information. >>> u'\ud800'.encode('utf-8') '\xa0\x80' # should be 3 bytes, not 2 >>> While the fix is trivial, IMO an appropriate answer to Guido's question would include this particular lack of correctness. Cheers, John From mal@lemburg.com Thu Jun 28 14:49:28 2001 From: mal@lemburg.com (M.-A.
Lemburg) Date: Thu, 28 Jun 2001 15:49:28 +0200 Subject: [I18n-sig] Support for "wide" Unicode characters References: <9F2D83017589D211BD1000805FA70CA703B13A03@ntxmel03.cmutual.com.au> Message-ID: <3B3B35E8.6634D032@lemburg.com> "Machin, John" wrote: > > [Guido van Rossum] > > store Unicode strings using UTF-8. > > > > Does UTF-8 transfer isolated surrogates correctly? > > [Marc-Andre Lemburg} > It handles surrogates correctly, but rejects isolated ones on input > (easy to fix though) and passes them through on output. As I said > before, surrogate is far from being complete. > > Marc-Andre, there is a *bug* in 2.1 encoding isolated high surrogates. I > reported it > and you assigned it to yourself on 23 June. Lookee here: > > Python 2.1 (#15, Apr 16 2001, 18:25:49) [MSC 32 bit (Intel)] on win32 > Type "copyright", "credits" or "license" for more information. > >>> u'\ud800'.encode('utf-8') > '\xa0\x80' # should be 3 bytes, not 2 > >>> > > While the fix is trivial, IMO an appropriate answer to Guido's question > would include > this particular lack of correctness. Thanks for the note. I was looking at the code rather than actually trying an example -- guess the latter is faster and gives better answers ;-) -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.lemburg.com/python/ From jhi@iki.fi Thu Jun 28 14:51:01 2001 From: jhi@iki.fi (Jarkko Hietaniemi) Date: Thu, 28 Jun 2001 08:51:01 -0500 Subject: [I18n-sig] Re: Determine encoding from $LANG In-Reply-To: ; from mkuhn@suse.de on Thu, Jun 28, 2001 at 10:03:59AM +0200 References: <15160.60506.589750.287186@honolulu.ilog.fr> Message-ID: <20010628085101.B21832@chaos.wustl.edu> On Thu, Jun 28, 2001 at 10:03:59AM +0200, Markus Kuhn wrote: > On Tue, 26 Jun 2001, Bruno Haible wrote: > > > > A program cannot be considered properly internationalized > > until it obeys the current locale (LC_ALL || LC_CTYPE || LANG). > > > > The programs we are waiting for are: > > [...] > > Add to that list many of the programming languages that use Unicode > internally but that do not yet set the default i/o encoding correctly > automatically based on LC_ALL || LC_CTYPE || LANG. Until very recently the term "default I/O encoding" didn't mean anything to Perl (it was native bytes, period). Now we do have a new I/O subsystem (with which we can do things like "this I/O stream is in UTF-8") but the new I/O subsystem is not yet available in any public release of Perl, only in one developer release so far (5.7.1). > I suspect that Perl and Python are not much better and don't call > nl_langinfo(CODESET) or the portable libcharset wrapper around it either No, we don't call nl_langinfo(CODESET). We still need to figure out the correct policy and place for doing that. Sorry if "the correct policy" has been already extensively discussed and answered in this thread, this is the first message that was CCed (well, which I saw, anyway) to perl-unicode. But as a general rule, Perl doesn't do much in the way of locales unless the user explicitly asks for a locale behaviour by using setlocale(). Changing that now to be more 'automatic' would break backward compatibility. > to properly determine the locale-dependent external encoding. 
> > References on how to determine the character encoding from the locale in a > safe and portable manner: > > http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate > http://clisp.cons.org/~haible/packages-libcharset.html Alas, IIUC, LGPL is currently slightly incompatible for inclusion into Perl, for something as central a piece of code as locale handling. (Note: this is just a statement of facts as far as I understand them, I do not intend or want to start discussion about software licensing politics.) > http://www.opengroup.org/onlinepubs/7908799/xsh/langinfo.h.html But thanks for the pointers. I don't know whether I will be able to smush in the use of nl_langinfo() for the upcoming public release of Perl, Perl 5.8.0, but I will certainly give some thought to the matter. -- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen From guido@digicool.com Thu Jun 28 15:14:31 2001 From: guido@digicool.com (Guido van Rossum) Date: Thu, 28 Jun 2001 10:14:31 -0400 Subject: [I18n-sig] Support for "wide" Unicode characters In-Reply-To: Your message of "Thu, 28 Jun 2001 22:27:45 +1000." <9F2D83017589D211BD1000805FA70CA703B13A02@ntxmel03.cmutual.com.au> References: <9F2D83017589D211BD1000805FA70CA703B13A02@ntxmel03.cmutual.com.au> Message-ID: <200106281414.f5SEEVX23234@odiug.digicool.com> > Guido asked: > Does UTF-8 transfer isolated surrogates correctly? > > No. See my bug report in SF. Briefly, a lone high > surrogate has its leading UTF-8 byte omitted, > causing an illegal UTF-8 sequence to be generated. > > Here's the URL: > http://sourceforge.net/tracker/?group_id=5470&atid=105470&func=detail&aid=433882 > > (or search for "surrogates") It's a bug indeed. But my question was about the definition of UTF8, not our (fallible) implementation. What *should* be the result of u'\ud800'.encode('utf8')? '\xed\xa0\x80' or an exception? And likewise, what should be the result of unicode('\xed\xa0\x80', 'utf8')? u'\ud800' or an exception? (Likewise for low surrogates; currently, u'\udc00'.encode('utf8') returns '\xed\xb0\x80', but unicode('\xed\xb0\x80', 'utf8') raises an exception.) --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Thu Jun 28 15:51:30 2001 From: guido@digicool.com (Guido van Rossum) Date: Thu, 28 Jun 2001 10:51:30 -0400 Subject: [I18n-sig] Re: Unicode 3.1 and contradictions. In-Reply-To: Your message of "Thu, 28 Jun 2001 12:48:40 BST." References: Message-ID: <200106281451.f5SEpUv23358@odiug.digicool.com> [Markus] > > > The UTF-8 representations of U+D800..U+DFFF, U+FFFE, and U+FFFF are not > > > allowed in a UTF-8 stream and a secure UTF-8 decoder must never output > > > any of these characters. [Guido] > > Can you explain a bit more about the security issues? [Markus] > There are two ways of processing UTF-8 encoded UCS text: > > a) as a UTF-8 bytestream > b) as a stream of decoded integer code values (32-bit wchar_t, etc.) > > Problems arise if security-relevant checks are done in one > representation and interpretation of the data is done in the other.
> > Imagine you have an application with the following processing steps: > > > > - read a UTF-8 string > > - apply a substring test to convince yourself that certain characters > > are not present in the string > > - decode UTF-8 > > - use the decoded string in an application where presence of the > > tested characters could be security critical I'd say that the security implementation of such an application is broken -- the check should have been done on the final data. It seems you are trying to patch up a legacy system the wrong way. Or am I missing something? How can this be a common pattern? > The classical example is a Win32 web server, where a UTF-8 URL is fed > in, tested by a script in UTF-8 to be free of the byte sequence '/../', > and then UTF-8 decoded and fed into a UTF-16 API for file system access. > Even though the presence of '/../' encoded in ASCII was filtered out, > the same character sequence can still be passed past the filter by a > clever attacker using alternative encodings that an unsafe UTF-8 decoder > might accept, for instance an overlong sequence for any of the > characters. Here you are assuming an unsafe UTF-8 decoder. I agree that a UTF-8 decoder that accepts overlong sequences is broken. But we were talking about isolated surrogates. How can passing through *isolated* surrogates cause a security violation? It's not an overlong sequence! (Assuming the decoder does the right thing for surrogate *pairs*.) > This problem is most severe with non-ASCII representations of ASCII > characters by overlong UTF-8 sequences, because ASCII characters have > often lots of special functions associated, but it also occurs with > other tests. For example, it should be perfectly legitimate to test a > UTF-8 string to be free of non-BMP characters by simply testing that no > byte >= 0xE0 is present, without the far less efficient use of a UTF-8 > decoder. Why is testing for non-BMP characters part of a security screening? Maybe you are worried that an application will over-index some table prepared for the BMP only. But Python already protects against over-indexing with an exception. Why would you want a security screening of the UTF-8 stream when you're going to decode it eventually? If you *have* to check that no decoded character is >= 2**16, faster than a separate scan would be to fold the security screening into the UTF-8 codec. > Other risks are people smuggling a UTF-8 encoded U+FFFE or U+FFFF into a > system, which when decoded into UTF-16 might be interpreted as an > instruction to swap the byte sex (anti-BOM) or as some generic > escape-or-end-of-string/file character (U+FFFF). These aren't isolated surrogates, so they would fall under a different rule (currently they pass through Python's UTF-8 codec just fine). I have the feeling that you want the UTF-8 decoder to make up for all the sloppy coding practices that might be used in the application. > The golden rule that there must be exactly one single UTF-8 byte > sequence that can result in the output of a certain Unicode character > and that Unicode code positions reserved for special non-character use > such as U+D800..U+DFFF, U+FFFE, and U+FFFF should never be generated by > a UTF-8 decoder eliminates all these potential pitfalls. Sorry, you haven't convinced me that these tests should be applied by Python's standard UTF-8 codec. Also, your use of "such as" suggests that the collection of dangerous code points is open-ended, but I find that hard to believe (since legacy codecs won't be updated).
> http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8 > http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt --Guido van Rossum (home page: http://www.python.org/~guido/) From Markus.Kuhn@cl.cam.ac.uk Thu Jun 28 16:47:59 2001 From: Markus.Kuhn@cl.cam.ac.uk (Markus Kuhn) Date: Thu, 28 Jun 2001 16:47:59 +0100 Subject: [I18n-sig] Re: Unicode 3.1 and contradictions. In-Reply-To: Your message of "Thu, 28 Jun 2001 10:51:30 EDT." <200106281451.f5SEpUv23358@odiug.digicool.com> Message-ID: Guido van Rossum wrote on 2001-06-28 14:51 UTC: > > Imagine you have an application with the following processing steps: > > > > - read a UTF-8 string > > - apply a substring test to convince yourself that certain characters > > are not present in the string > > - decode UTF-8 > > - use the decoded string in an application where presence of the > > tested characters could be security critical > > I'd say that the security implementation of such an application is > broken -- the check should have been done on the final data. It > seems you are trying to patch up a legacy system the wrong way. Or am > I missing something? How can this be a common pattern? We should not expect that any and all UTF-8 data has to be decoded before it can be processed. UTF-8 has been very carefully designed to allow much text processing (substring searching without case mapping, etc.) to be done on UTF-8 data directly. Only a few operations (display, case mapping, proper sorting) actually require a UTF-8 decoder. The name "UCS Transfer Format" is in practice misleading, because processing UTF-8 as opposed to just transferring is often the right thing to do, unless a buggy UTF-8 decoder would make that risky. > But we were talking about isolated surrogates. How can passing > through *isolated* surrogates cause a security violation? It's not an > overlong sequence! (Assuming the decoder does the right thing for > surrogate *pairs*.) OK, that is far less of a security concern. However, an isolated surrogate is usually a symptom of something else being wrong (e.g., UTF-16 strings being split at the wrong place, then UTF-8 converted, then joined again), and if not spotted will lead to incorrect UTF-8 sequences at the end. Signalling an exception might often be better than passing everything through quietly. > > This problem is most severe with non-ASCII representations of ASCII > > characters by overlong UTF-8 sequences, because ASCII characters have > > often lots of special functions associated, but it also occurs with > > other tests. For example, it should be perfectly legitimate to test a > > UTF-8 string to be free of non-BMP characters by simply testing that no > > byte >= 0xE0 is present, without the far less efficient use of a UTF-8 > > decoder. > > Why is testing for non-BMP characters part of a security screening? If a database field has a policy of not allowing non-BMP characters in a field, then that policy can be violated. How bad that is depends on the application. It was really just an example, not a specific risk. > Sorry, you haven't convinced me that these tests should be applied by > Python's standard UTF-8 codec. Also, your use of "such as" suggests > that the collection of dangerous code points is open-ended, but I find > that hard to believe (since legacy codecs won't be updated). My list of unwanted UTF-8 code points was just the one found in a note in the UTF-8 definition in ISO 10646-1:1993 (R.4): NOTE 3 - Values of x in the range 0000 D800 .. 0000 DFFF are reserved for the UTF-16 form and do not occur in UCS-4.
The values 0000 FFFE and 0000 FFFF also do not occur (see clause 8). The mappings of these code positions in UTF-8 are undefined. http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html Markus -- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: From tim@digicool.com Thu Jun 28 17:40:29 2001 From: tim@digicool.com (Tim Peters) Date: Thu, 28 Jun 2001 12:40:29 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <200106281117.f5SBH3Z20788@odiug.digicool.com> Message-ID: [Tim] > But there's a HUGE difference. The xrange() behaviors we're > seeking to shed have been documented for years. [Guido] > Oh yeah? Where? Same place as \U surrogates: in the c.l.py archives . Well, I take that back: while any number of bizarre xrange tricks have been posted over the years, I don't think I ever saw a surrogate literal example before this thread. 'twas-news-to-me-but-then-so-was-80%-of-what-xrange-did-ly y'rs - tim From paulp@ActiveState.com Thu Jun 28 19:11:59 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 28 Jun 2001 11:11:59 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A3DC5.CA6767FD@ActiveState.com> <200106280620.f5S6Kwe01395@mira.informatik.hu-berlin.de> Message-ID: <3B3B736F.316649CA@ActiveState.com> "Martin v. Loewis" wrote: > >... > > The rationale for supporting \U is two-fold: One, importing a module > should not fail in one installation, and succeed in another (of the > same Python version). Running the module may give different results, > but you should be able to generate byte code. Isn't it already the case that big Python integer literals can be legal on one platform and illegal on another? (I don't know, I just thought that was the case....) > ... Furthermore, people > using non-BMP characters in source are probably not very interested in > counting the characters: They want to display them. For just > displaying them, you need to represent them, and you need the fonts. > String manipulation is less important. What are the chances that anybody is in this situation in the near future? Can you even display these characters on Windows? Does Tk support them? And if so, on what platforms? What about the Java APIs? (once again, these are real, not rhetorical questions) Wide Python builds may be the "default" before these characters become practically usable in GUIs. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From paulp@ActiveState.com Thu Jun 28 19:13:44 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 28 Jun 2001 11:13:44 -0700 Subject: [I18n-sig] Python Support for "Wide" Unicode characters References: <3B3A6438.6DA39268@ActiveState.com> <200106272319.f5RNJnO20162@odiug.digicool.com> <3B3A7857.1593F72@ActiveState.com> <200106280657.f5S6vYl01625@mira.informatik.hu-berlin.de> Message-ID: <3B3B73D8.B9D92DE8@ActiveState.com> "Martin v. Loewis" wrote: > >... > > > So there is no way to get the heuristic of "wchar_t if available, UCS-4 > > if not". I'm not complaining, just checking. The list of options is just > > two with ucs2 the default. > > I'd be complaining, though, if I wasn't that pleased with this PEP > overall. Sorry, I don't understand the point you were making here. You may be away already so I'll take explanations from anyone who is interested. :) -- Take a recipe. Leave a recipe. Python Cookbook! 
http://www.ActiveState.com/pythoncookbook From paulp@ActiveState.com Thu Jun 28 19:28:28 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 28 Jun 2001 11:28:28 -0700 Subject: [I18n-sig] Closing some issues Message-ID: <3B3B774C.5D3F1E99@ActiveState.com> I'd like to close some issues in the PEP if there is agreement. If you feel that the following issues still deserve further discussion, just yell and I'll leave them as issues: * unichr() should never return surrogate pairs so its domain and range vary between wide and narrow Python builds. * ord() should never accept pairs so its domain and range vary between wide and narrow Python builds. * nowhere in the design will we discriminate against "lone surrogates" other than potentially the codecs. "Agreement" means everybody comes out on the same side or Guido rules. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From tree@basistech.com Thu Jun 28 19:00:47 2001 From: tree@basistech.com (Tom Emerson) Date: Thu, 28 Jun 2001 14:00:47 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <3B3B736F.316649CA@ActiveState.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A3DC5.CA6767FD@ActiveState.com> <200106280620.f5S6Kwe01395@mira.informatik.hu-berlin.de> <3B3B736F.316649CA@ActiveState.com> Message-ID: <15163.28879.54534.29084@cymru.basistech.com> Paul Prescod writes: > "Martin v. Loewis" wrote: [snip] > > ... Furthermore, people > > using non-BMP characters in source are probably not very interested in > > counting the characters: They want to display them. For just > > displaying them, you need to represent them, and you need the fonts. > > String manipulation is less important. > > What are the chances that anybody is in this situation in the near > future? Can you even display these characters on Windows? Does Tk > support them? And if so, on what platforms? What about the Java APIs? > (once again, these are real, not rhetorical questions) I can't speak for the characters in plane 1, but the characters in plane 2 have fonts available already for those who need them. Also, plane 14 contains code-points that *would* be used for both display and text processing applications. Finally I would expect that those using the ideographs in plane 2 care less about display than they do being able to encode and manipulate the data. Either the characters are used in names which must be put into databases and the like, or they are being used to encode historical documents for searching and the like. While display is important, I strongly suggest that the ability to display them does not outweigh the ability to work with strings containing them. -tree -- Tom Emerson Basis Technology Corp. Sr. Sinostringologist http://www.basistech.com "Beware the lollipop of mediocrity: lick it once and you suck forever" From paulp@ActiveState.com Thu Jun 28 19:59:29 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 28 Jun 2001 11:59:29 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A3DC5.CA6767FD@ActiveState.com> <200106280620.f5S6Kwe01395@mira.informatik.hu-berlin.de> <3B3B736F.316649CA@ActiveState.com> <15163.28879.54534.29084@cymru.basistech.com> Message-ID: <3B3B7E91.D5899327@ActiveState.com> Tom Emerson wrote: > >... 
> Used to encode > historical documents for searching and the like. While display is > important, I strongly suggest that the ability to display them does > not outweigh the ability to work with strings containing them. The ability to work with them is not at issue. The question is whether you can use them in string literals. One side of the argument says that "working with them" in narrow Python builds will be extremely difficult, so allowing them in literals and as inputs to unichr doesn't help much. The other side says that at least allowing them in literals makes them available in code in a straightforwards way. "Working with them" will still require understanding of surrogates. (in narrow Python builds!) -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From guido@digicool.com Thu Jun 28 20:37:43 2001 From: guido@digicool.com (Guido van Rossum) Date: Thu, 28 Jun 2001 15:37:43 -0400 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: Your message of "Thu, 28 Jun 2001 11:11:59 PDT." <3B3B736F.316649CA@ActiveState.com> References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A3DC5.CA6767FD@ActiveState.com> <200106280620.f5S6Kwe01395@mira.informatik.hu-berlin.de> <3B3B736F.316649CA@ActiveState.com> Message-ID: <200106281937.f5SJbit27023@odiug.digicool.com> > > The rationale for supporting \U is two-fold: One, importing a module > > should not fail in one installation, and succeed in another (of the > > same Python version). Running the module may give different results, > > but you should be able to generate byte code. > > Isn't it already the case that big Python integer literals can be legal > on one platform and illegal on another? (I don't know, I just thought > that was the case....) Yes, this is why the argument for \U as surrogate-generator is not so strong. > > ... Furthermore, people > > using non-BMP characters in source are probably not very interested in > > counting the characters: They want to display them. For just > > displaying them, you need to represent them, and you need the fonts. > > String manipulation is less important. > > What are the chances that anybody is in this situation in the near > future? Can you even display these characters on Windows? Does Tk > support them? And if so, on what platforms? What about the Java APIs? > (once again, these are real, not rhetorical questions) I don't know the answers. > Wide Python builds may be the "default" before these characters become > practically usable in GUIs. :-) --Guido van Rossum (home page: http://www.python.org/~guido/) From guido@digicool.com Thu Jun 28 20:40:09 2001 From: guido@digicool.com (Guido van Rossum) Date: Thu, 28 Jun 2001 15:40:09 -0400 Subject: [I18n-sig] Closing some issues In-Reply-To: Your message of "Thu, 28 Jun 2001 11:28:28 PDT." <3B3B774C.5D3F1E99@ActiveState.com> References: <3B3B774C.5D3F1E99@ActiveState.com> Message-ID: <200106281940.f5SJeAe27046@odiug.digicool.com> > I'd like to close some issues in the PEP if there is agreement. If you > feel that the following issues still deserve further discussion, just > yell and I'll leave them as issues: > > * unichr() should never return surrogate pairs so its domain and range > vary between wide and narrow Python builds. +1 > * ord() should never accept pairs so its domain and range vary between > wide and narrow Python builds. 
+1 > * nowhere in the design will we discriminate against "lone surrogates" > other than potentially the codecs. +1 > "Agreement" means everybody comes out on the same side or Guido rules. +1 :-) I take it that \U is still open? At this point I am +1 on making that behave platform-specific too. --Guido van Rossum (home page: http://www.python.org/~guido/) From rick@unicode.org Thu Jun 28 21:40:27 2001 From: rick@unicode.org (Rick McGowan) Date: Thu, 28 Jun 2001 13:40:27 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! Message-ID: <200106281833.OAA31487@unicode.org> I have a question... Since Unicode does define upper-plane charactes -- some 40,000 of them I believe -- and more are on the way... What would be the use in going forward with any Python implementation that doesn't handle the 21-bit space? Rick From paulp@ActiveState.com Thu Jun 28 22:16:47 2001 From: paulp@ActiveState.com (Paul Prescod) Date: Thu, 28 Jun 2001 14:16:47 -0700 Subject: [I18n-sig] Unicode surrogates: just say no! References: <200106281833.OAA31487@unicode.org> Message-ID: <3B3B9EBF.DB008390@ActiveState.com> Rick McGowan wrote: > > I have a question... > > Since Unicode does define upper-plane charactes -- some 40,000 of them I > believe -- and more are on the way... What would be the use in going > forward with any Python implementation that doesn't handle the 21-bit > space? There will be only one Python implementation and it will support all Unicode characters. As a compile time flag you can turn this support on or off based on the individual's feeling about the importance of the new characters versus the importance of conserving memory. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook From JMachin@Colonial.com.au Thu Jun 28 23:08:04 2001 From: JMachin@Colonial.com.au (Machin, John) Date: Fri, 29 Jun 2001 08:08:04 +1000 Subject: [I18n-sig] Support for "wide" Unicode characters Message-ID: <9F2D83017589D211BD1000805FA70CA703B13A04@ntxmel03.cmutual.com.au> [John Machin] > Guido asked: > Does UTF-8 transfer isolated surrogates correctly? > > No. See my bug report in SF. Briefly, a lone high > surrogate has its leading UTF-8 byte omitted, > causing an illegal UTF-8 sequence to be generated. > > Here's the URL: > http://sourceforge.net/tracker/?group_id=5470&atid=105470&func=detail&aid=43 > 3882 > > (or search for "surrogates") [Guido again] It's a bug indeed. But my question was about the definition of UTF8, not our (fallible) implementation. What *should* be the result of u'\ud800'.encode('utf8')? '\xed\xa0\x80' or an exception? And likewise, what should be the result of unicode('\xed\xa0\x80', 'utf8')? u'\ud800' or an exception? (Likewise for low surrogates; currently, u'\udc00'.encode('utf8') returns '\xed\xb0\x80', but unicode('\xed\xb0\x80', 'utf8') raise an exception.) [John Machin] OK, sorry for the misunderstanding. A UTF-8 codec can be made to transcode scalars up to at least 31 bits wide. The ISO 10646 specification allows for this. So, for marshalling and (pickling?) purposes, calling the UTF-8 codec with errors='liberal' would be the way to go. IMO, 'liberal' should still give an exception for over-long UTF-8 byte sequences -- an encoder which produces such is broken (either accidentally or deliberately) -- but should happily transcode any scalar value <= X for some X in (0x10FFFF, 0x7FFFFFFF). IMO, when errors is 'strict', upper limit should be 0xFFFF for narrow builds, and 0x10FFFF for wide builds. 
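A sketch of the strictness rule John proposes (check_scalar and the 'liberal' value are assumptions for illustration; current codecs only know 'strict', 'ignore' and 'replace'):

    import sys

    def check_scalar(cp, errors='strict'):
        if errors == 'strict':
            limit = sys.maxunicode    # 0xFFFF narrow, 0x10FFFF wide
        else:                         # proposed 'liberal'
            limit = 0x7FFFFFFF        # full 31-bit ISO 10646 space
        if not 0 <= cp <= limit:
            raise UnicodeError("scalar value out of range: 0x%X" % cp)
        return cp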
IMO, unicode(), u.encode() and the \U notation should all use 'strict' ... and perhaps the exception messages produced by the narrow build could be marketing-aligned and point the punter to the wide build. Cheers, John From martin@loewis.home.cs.tu-berlin.de Thu Jun 28 23:49:45 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 29 Jun 2001 00:49:45 +0200 Subject: [I18n-sig] Closing some issues In-Reply-To: <3B3B774C.5D3F1E99@ActiveState.com> (message from Paul Prescod on Thu, 28 Jun 2001 11:28:28 -0700) References: <3B3B774C.5D3F1E99@ActiveState.com> Message-ID: <200106282249.f5SMnj901841@mira.informatik.hu-berlin.de> > I'd like to close some issues in the PEP if there is agreement. I agree with all of those. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Jun 28 23:31:49 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 29 Jun 2001 00:31:49 +0200 Subject: [I18n-sig] Re: Unicode 3.1 and contradictions. In-Reply-To: (message from Markus Kuhn on Thu, 28 Jun 2001 16:47:59 +0100) References: Message-ID: <200106282231.f5SMVnC01808@mira.informatik.hu-berlin.de> > My list of unwanted UTF-8 code points was just the one found in a note > in the UTF-8 definition in ISO 10646-1:1993 (R.4): > > NOTE 3 - Values of x in the range 0000 D800 .. 0000 DFFF are reserved > for the UTF-16 form and do not occur in UCS-4. The values 0000 FFFE and > 0000 FFFF also do not occur (see clause 8). The mappings of these code > positions in UTF-8 are undefined. That explains a lot. Apparently, Unicode takes the stand of making the undefined well-defined, which is just in the spirit of standards: Unicode is an extension to ISO 10646, in this respect. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Jun 28 23:38:26 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 29 Jun 2001 00:38:26 +0200 Subject: [I18n-sig] Python Support for "Wide" Unicode characters In-Reply-To: <3B3B73D8.B9D92DE8@ActiveState.com> (message from Paul Prescod on Thu, 28 Jun 2001 11:13:44 -0700) References: <3B3A6438.6DA39268@ActiveState.com> <200106272319.f5RNJnO20162@odiug.digicool.com> <3B3A7857.1593F72@ActiveState.com> <200106280657.f5S6vYl01625@mira.informatik.hu-berlin.de> <3B3B73D8.B9D92DE8@ActiveState.com> Message-ID: <200106282238.f5SMcQT01809@mira.informatik.hu-berlin.de> > > > So there is no way to get the heuristic of "wchar_t if available, UCS-4 > > > if not". I'm not complaining, just checking. The list of options is just > > > two with ucs2 the default. > > > > I'd be complaining, though, if I wasn't that pleased with this PEP > > overall. > > Sorry, I don't understand the point you were making here. I still would prefer if the default was wchar_t if available, so I'd get a wide Python from distributors as default. As it stands, most distributors will ship a narrow Python 2.2, since they are unlikely to change the default settings.
Since I like the overall design of this patch very much, I'm not going to start long discussions on the detail of some default setting. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Jun 28 23:48:25 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis) Date: Fri, 29 Jun 2001 00:48:25 +0200 Subject: [I18n-sig] Unicode surrogates: just say no! In-Reply-To: <3B3B736F.316649CA@ActiveState.com> (message from Paul Prescod on Thu, 28 Jun 2001 11:11:59 -0700) References: <200106260851.f5Q8pcN10662@odiug.digicool.com> <3B385BDC.AB40A761@lemburg.com> <200106261700.f5QH0ih14770@odiug.digicool.com> <3B3A3DC5.CA6767FD@ActiveState.com> <200106280620.f5S6Kwe01395@mira.informatik.hu-berlin.de> <3B3B736F.316649CA@ActiveState.com> Message-ID: <200106282248.f5SMmPI01840@mira.informatik.hu-berlin.de> > > The rationale for supporting \U is two-fold: One, importing a module > > should not fail in one installation, and succeed in another (of the > > same Python version). Running the module may give different results, > > but you should be able to generate byte code. > > Isn't it already the case that big Python integer literals can be legal > on one platform and illegal on another? (I don't know, I just thought > that was the case....) I guess so; I'm not even sure you can exchange byte code files across machines with different sizeof(long). OTOH, I think this is a real problem, and we should not extend this problem into other areas as well. Furthermore, if you encounter a source incompatibility between installations because of very large integers, you can switch to long integers with little effort. The same is not that easy for Unicode literals. > What are the chances that anybody is in this situation in the near > future? Can you even display these characters on Windows? Does Tk > support them? And if so, on what platforms? I'm pretty sure that Tk can display them soon after fonts become available. I believe the X11 fonts support full ISO 10646. Since Tk uses UTF-8, it is also capable of representing these characters internally. For Windows, I don't know the power of TrueType/OpenType in this respect, but I'd assume they have thought of UTF-16 already. As for the fonts themselves, I've seen PDF files for the plane 2 characters, so I guess fonts are available *somewhere*. > What about the Java APIs? I could not care less about the Unicode capabilities of Java. > Wide Python builds may be the "default" before these characters become > practically usable in GUIs. That would be a good thing, since I think infrastructures need to build from ground up (operating system, programming language, GUI libraries, applications). Given that it is much easier to support representing the characters in Python than producing a font, it seems only natural that Python can represent them first. Python won't have a lot of other facilities needed for processing them (like character properties, combining, sorting, etc), but the representation should work fairly early. Regards, Martin From martin@loewis.home.cs.tu-berlin.de Thu Jun 28 23:05:16 2001 From: martin@loewis.home.cs.tu-berlin.de (Martin v.
Date: Fri, 29 Jun 2001 00:05:16 +0200
Subject: [I18n-sig] Support for "wide" Unicode characters
In-Reply-To: <3B3AF887.5181D0CF@lemburg.com> (mal@lemburg.com)
References: <3B3ABFB8.84C7510B@ActiveState.com>
	<3B3AF887.5181D0CF@lemburg.com>
Message-ID: <200106282205.f5SM5GK00908@mira.informatik.hu-berlin.de>

> > Implementation
> >
> >     There is a new (experimental) define:
> >
> >         #define PY_UNICODE_SIZE 2
>
> Doesn't sizeof(Py_UNICODE) do the same ?

No, you can't use sizeof in a preprocessor #if test.

Regards,
Martin

From martin@loewis.home.cs.tu-berlin.de Thu Jun 28 23:10:46 2001
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Fri, 29 Jun 2001 00:10:46 +0200
Subject: [I18n-sig] Re: Unicode 3.1 and contradictions.
In-Reply-To: <200106281125.f5SBPVc20814@odiug.digicool.com> (message from
	Guido van Rossum on Thu, 28 Jun 2001 07:25:30 -0400)
References: <200106281125.f5SBPVc20814@odiug.digicool.com>
Message-ID: <200106282210.f5SMAk501230@mira.informatik.hu-berlin.de>

> > The UTF-8 representations of U+D800..U+DFFF, U+FFFE, and U+FFFF are not
> > allowed in a UTF-8 stream and a secure UTF-8 decoder must never output
> > any of these characters.
>
> Can you explain a bit more about the security issues?

I don't understand the comment about filters, but one aspect is the
requirement for a canonical encoding: If you encrypt two pieces of text
or code with the same key, the original pieces must be considered equal
iff the encrypted versions are equal. Non-canonical forms break this
guarantee: the pieces might be equal even if the encrypted output is
not.

Regards,
Martin

From martin@loewis.home.cs.tu-berlin.de Fri Jun 29 00:28:17 2001
From: martin@loewis.home.cs.tu-berlin.de (Martin v. Loewis)
Date: Fri, 29 Jun 2001 01:28:17 +0200
Subject: [I18n-sig] Support for "wide" Unicode characters
In-Reply-To: <9F2D83017589D211BD1000805FA70CA703B13A04@ntxmel03.cmutual.com.au>
	(JMachin@Colonial.com.au)
References: <9F2D83017589D211BD1000805FA70CA703B13A04@ntxmel03.cmutual.com.au>
Message-ID: <200106282328.f5SNSHf02025@mira.informatik.hu-berlin.de>

> IMO, unicode(), u.encode() and the \U notation should all use
> 'strict' ... and perhaps the exception messages produced by the
> narrow build could be marketing-aligned and point the punter to the
> wide build.

Both unicode() and u.encode() support an optional errors parameter, for
which Guido proposed to accept an additional value of "lenient". The
default is "strict".

Regards,
Martin

From tim.one@home.com Fri Jun 29 04:46:10 2001
From: tim.one@home.com (Tim Peters)
Date: Thu, 28 Jun 2001 23:46:10 -0400
Subject: [I18n-sig] Support for "wide" Unicode characters
In-Reply-To: <3B3B2CE8.B1A062C4@lemburg.com>
Message-ID: 

[MAL]
> Ah, OK... nice sofas these Crays, BTW ;-)

You're going to get a Cray Education before this is over even if it
kills you -- which it may. Crays (at least in my day) made for horrible
sofas! The oh-so-inviting padded leather "seats" surrounding the box
actually covered massive cooling coils. Sit on one for 10 minutes and
your butt went numb; some poor souls who tried sleeping on them
suffered serious cases of hypothermia. And these were people who didn't
believe *anything* was smaller than 64 bits. I can't imagine what it
would do to a C weenie with heretical delusions about sizeof(short) --
if it got the chance, it would probably put you in cryonic suspension
until PCs moved to 128-bit ints.
don't-screw-with-the-icy-ghost-of-seymour-cray-ly y'rs - tim

From fredrik@pythonware.com Fri Jun 29 09:54:56 2001
From: fredrik@pythonware.com (Fredrik Lundh)
Date: Fri, 29 Jun 2001 10:54:56 +0200
Subject: [I18n-sig] Python Support for "Wide" Unicode characters
References: <3B3A6438.6DA39268@ActiveState.com>
	<200106272319.f5RNJnO20162@odiug.digicool.com>
	<3B3A7857.1593F72@ActiveState.com>
	<200106280657.f5S6vYl01625@mira.informatik.hu-berlin.de>
	<3B3B73D8.B9D92DE8@ActiveState.com>
	<200106282238.f5SMcQT01809@mira.informatik.hu-berlin.de>
Message-ID: <017301c10079$7e87db00$0900a8c0@spiff>

martin wrote:
> > Sorry, I don't understand the point you were making here.
>
> I would still prefer the default to be wchar_t if available, so that
> I'd get a wide Python from distributors by default. As it stands, most
> distributors will ship a narrow Python 2.2, since they are unlikely to
> change the default settings.

I haven't ruled out "wchar_t" as a default for 2.2, but we shouldn't
make the switch right now -- popular subsystems may not be 32-bit ready
(the xml stuff, tkinter and other gui toolkits). Just give it a little
more calendar time.

Cheers /F

From Misha.Wolf@reuters.com Fri Jun 29 20:07:10 2001
From: Misha.Wolf@reuters.com (Misha.Wolf@reuters.com)
Date: Fri, 29 Jun 2001 20:07:10 +0100
Subject: [I18n-sig] 19th Unicode Conference, September 2001, San Jose,
	CA, USA -- Register now!
Message-ID: 

   Nineteenth International Unicode Conference (IUC19)
   Unicode and the Web: The Global Connection
   http://www.unicode.org/iuc/iuc19
   September 10-14, 2001
   San Jose, CA, USA

   Register now!

   * * * * *

The Internet and the World Wide Web continue to change the shape of
computing. The goal of network computing and understandable text access
across wide, diverse groups of people has brought great momentum to
computing environments that build Unicode into their foundation.
Whether it's Internet commerce, network access to data, or highly
portable applications, Unicode makes a solid foundation for the
network, global enterprises, and software users everywhere.

The Nineteenth International Unicode Conference (IUC19) will address
topics ranging from Unicode use in the World Wide Web and in operating
systems and databases, to the latest developments with Unicode 3.1,
Java, Open Source, XML and Web protocols. Conference attendees will
include managers, software engineers, systems analysts, and product
marketing personnel responsible for the development of software
supporting Unicode, as well as those involved in all aspects of the
globalization of software and the Internet.

CONFERENCE DATES

The Conference has been extended to 5 days:
   2 days of Tutorials / Workshops
   3 days of Conference Sessions

CONFERENCE WEB SITE, PROGRAM and REGISTRATION

The Conference Program, including abstracts and speaker biographies,
and the Registration form are now available at the Conference Web site:
http://www.unicode.org/iuc/iuc19

CONFERENCE SPONSORS

Agfa Monotype Corporation
Basis Technology Corporation
Lionbridge Technologies
Microsoft Corporation
Netscape Communications
Oracle Corporation
PeopleSoft, Inc.
Reuters Ltd.
Sun Microsystems, Inc.
Trados Corporation
Trigeminal Software, Inc.
World Wide Web Consortium (W3C)
Wrox Press

GLOBAL COMPUTING SHOWCASE

Visit the Showcase to find out more about products supporting the
Unicode Standard, and products and services that can help you
globalize/localize your software, documentation and Internet content.
For details, visit the Conference Web site:
http://www.unicode.org/iuc/iuc19

CONFERENCE VENUE

DoubleTree Hotel San Jose
2050 Gateway Place
San Jose, CA 95110 USA
Tel: +1 408 453 4000
Fax: +1 408 437 2898

CONFERENCE MANAGEMENT

Global Meeting Services Inc.
4360 Benhurst Avenue
San Diego, CA 92122, USA
Tel: +1 858 638 0206 (voice)
     +1 858 638 0504 (fax)
Email: info@global-conference.com
   or: conference@unicode.org

THE UNICODE CONSORTIUM

The Unicode Consortium was founded as a non-profit organization in
1991. It is dedicated to the development, maintenance and promotion of
The Unicode Standard, a worldwide character encoding. The Unicode
Standard encodes the characters of the world's principal scripts and
languages, and is code-for-code identical to the international standard
ISO/IEC 10646. In addition to cooperating with ISO on the future
development of ISO/IEC 10646, the Consortium is responsible for
providing character properties and algorithms for use in
implementations. Today the membership base of the Unicode Consortium
includes major computer corporations, software producers, database
vendors, research institutions, international agencies and various user
groups. For further information on the Unicode Standard, visit the
Unicode Web site at http://www.unicode.org or e-mail

   * * * * *

Unicode(r) and the Unicode logo are registered trademarks of Unicode,
Inc. Used with permission.