[Python-Dev] PEP 393 Summer of Code Project

Glenn Linderman v+python at g.nevcal.com
Wed Aug 31 22:04:01 CEST 2011


On 8/31/2011 10:10 AM, Guido van Rossum wrote:
> On Tue, Aug 30, 2011 at 11:03 PM, Stephen J. Turnbull
> <stephen at xemacs.org>  wrote:
> [me]
>>   >  That sounds like a contradiction -- it wouldn't be a UTF-16 array if
>>   >  you couldn't tell that it was using UTF-16.
>>
>> Well, that's why I wrote "intended to be suggestive".  The Unicode
>> Standard does not specify at all what the internal representation of
>> characters may be, it only specifies what their external behavior must
>> be when two processes communicate.  (For "process" as used in the
>> standard, think "Python modules" here, since we are concerned with the
>> problems of folks who develop in Python.)  When observing the behavior
>> of a Unicode process, there are no UTF-16 arrays or UTF-8 arrays or
>> even UTF-32 arrays; only arrays of characters.
> Hm, that's not how I would read "process". IMO that is an
> intentionally vague term, and we are free to decide how to interpret
> it. I don't think it will work very well to define a process as a
> Python module; what about Python modules that agree about passing
> along arrays of code units (or streams of UTF-8, for that matter)?
>
> This is why I find the issue of Python, the language (and stdlib), as
> a whole "conforming to the Unicode standard" such a troublesome
> concept -- I think it is something that an application may claim, but
> the language should make much more modest claims, such as "the regular
> expression syntax supports features X, Y and Z from the Unicode
> recommendation XXX", or "the UTF-8 codec will never emit a sequence of
> bytes that is invalid according to Unicode specification YYY". (As long
> as the Unicode references are also versioned or dated.)
>
> I'm fine with saying "it is hard to write Unicode-conforming
> application code for reason ZZZ" and proposing a fix (e.g. PEP 393
> fixes a specific complaint about code units being inferior to code
> points for most types of processing). I'm not fine with saying "the
> string datatype should conform to the Unicode standard".
>
>> Thus, according to the rules of handling a UTF-16 stream, it is an
>> error to observe a lone surrogate or a surrogate pair that isn't a
>> high-low pair (Unicode 6.0, Ch. 3 "Conformance", requirements C1 and
>> C8-C10).  That's what I mean by "can't tell it's UTF-16".
> But if you can observe (valid) surrogate pairs it is still UTF-16.
>
>> And I
>> understand those requirements to mean that operations on UTF-16
>> streams should produce UTF-16 streams, or raise an error.  Without
>> that closure property for basic operations on str, I think it's a bad
>> idea to say that the representation of text in a str in a pre-PEP-393
>> "narrow" build is UTF-16.  For many users and app developers, it
>> creates expectations that are not fulfilled.
> Ok, I dig this, to some extent. However saying it is UCS-2 is equally
> bad. I guess this is why Java and .NET just say their string types
> contain arrays of "16-bit characters", with essentially no semantics
> attached to the word "character" besides "16-bit unsigned integer".
>
> At the same time I think it would be useful if certain string
> operations like .lower() worked in such a way that *if* the input were
> valid UTF-16, *then* the output would also be, while *if* the input
> contained an invalid surrogate, the result would simply be something
> that is no worse (in particular, those are all mapped to themselves).
> We could even go further and have .lower() and friends look at
> graphemes (multi-code-point characters) if the Unicode std has a
> useful definition of e.g. lowercasing graphemes that differed from
> lowercasing code points.
>
> An analogy is actually found in .lower() on 8-bit strings in Python 2:
> it assumes the string contains ASCII, and non-ASCII characters are
> mapped to themselves. If your string contains Latin-1 or EBCDIC or
> UTF-8 it will not do the right thing. But that doesn't mean strings
> cannot contain those encodings, it just means that the .lower() method
> is not useful if they do. (Why ASCII? Because that is the system
> encoding in Python 2.)

So if Python 3.3+ uses Unicode code points as its str representation,
the analogy to ASCII and Python 2 would imply that it should permit
out-of-range code points, as long as they can be represented in the
underlying data values.  Valid codecs would not create such values when
decoding, and valid codecs would not accept them when encoding.
Operations on code points, like .lower(), should apply the identity
mapping to values that are not valid code points.
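
To make that concrete, here is a small sketch of the garbage-in,
garbage-out behavior (assuming CPython 3.x, where str will hold a lone
surrogate, .lower() maps it to itself, and the strict UTF-8 codec is
what refuses to emit it):

    s = "ABC" + chr(0xD800) + "DEF"   # lone surrogate embedded in a str
    print(ascii(s.lower()))           # 'abc\ud800def' -- surrogate maps to itself
    print(ord(s[3]) == 0xD800)        # True: chr()/ord() round-trip the value

    try:                              # a conforming codec refuses to emit it
        s.encode("utf-8")
    except UnicodeEncodeError as err:
        print("strict utf-8 codec rejects it:", err.reason)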

>
>> It's true that common usage is that an array of code units that
>> usually conforms to UTF-16 may be called "UTF-16" without the closure
>> properties.  I just disagree with that usage, because there are two
>> camps that interpret "UTF-16" differently.  One side says, "we have an
>> array representation in UTF-16 that can handle all Unicode code points
>> efficiently, and if you think you need more, think again", while the
>> other says "it's too painful to have to check every result for valid
>> UTF-16, and we need a UTF-16 type that supports the usual array
>> operations on *characters* via the usual operators; if you think
>> otherwise, think again."
> I think we should just document how it behaves and not get hung up on
> what it is called. Mentioning UTF-16 is still useful because it
> indicates that some operations may act properly on surrogate pairs.
> (Also because of course character properties for BMP characters are
> respected, etc.)
>
>> Note that despite the (presumed) resolution of the UTF-16 issue for
>> CPython by PEP 393, at some point a very similar discussion will take
>> place over "characters" anyway, because users and app developers are
>> going to want a type that handles composition sequences and/or
>> grapheme clusters for them, as well as comparison that respects
>> canonical equivalence, even if it is inefficient compared to str.
>> That's why I insisted on use of "array of code points" to describe the
>> PEP 393 str type, rather than "array of characters".
> Let's call those things graphemes (Tom C's term, I quite like leaving
> "character" ambiguous) -- they are sequences of multiple code points
> that represent a single "visual squiggle" (the kind of thing that
> you'd want to be swappable in vim with "xp" :-). I agree that APIs are
> needed to manipulate (match, generate, validate, mutilate, etc.)
> things at the grapheme level. I don't agree that this means a separate
> data type is required. There are ever-larger units of information
> encoded in text strings, with ever farther-reaching (and more vague)
> requirements on valid sequences. Do you want to have a data type that
> can represent (only valid) words in a language? Sentences? Novels?

Interesting ideas.  Once you drop the requirement that every code point
be directly indexable, higher-level concepts can be abstracted: an
appropriate codec could produce a sequence of words instead of
characters.  Whether that is interesting depends on the purpose of the
application.  I've been working a bit with ebook searching algorithms
lately, and one idea is to extract a list of words from the text and
represent each word with a code, then do the same for the search
string.  The search, instead of matching characters and character
strings while skipping over punctuation and so on, can then simply look
for the appropriate sequence of word codes.  In this case, part of the
usefulness of the abstraction is the elimination of punctuation, so it
is really more an index into the character text than an encoding of
it... but if the encoding of the text extracted words, creating the
index would be extremely simple.  I don't have applications in mind
where representing sentences or novels would be particularly useful,
but representing words could be extremely useful.  Valid words?  Given
a language (or languages) and a dictionary (or dictionaries), words
could be flagged as valid or invalid for that dictionary.  Representing
invalid words could work much like representing invalid UTF-8 bytes via
the lone-surrogate (surrogateescape) error handler... possible when the
application requests it.
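
A minimal sketch of the word-code idea (the tokenizer and names here
are made up for illustration; a real ebook indexer would want a much
smarter word splitter):

    import re

    def word_codes(text, table):
        # Map each word to a small integer code, growing `table` as new
        # words are seen; punctuation simply disappears at this level.
        words = re.findall(r"[^\W\d_]+", text.lower())
        return [table.setdefault(w, len(table)) for w in words]

    book = "The cat sat -- yes, sat! -- on the mat."
    query = "sat on the mat"

    table = {}
    book_codes = word_codes(book, table)
    query_codes = word_codes(query, table)

    # Search for the query as a contiguous run of word codes.
    found = any(book_codes[i:i + len(query_codes)] == query_codes
                for i in range(len(book_codes) - len(query_codes) + 1))
    print(found)    # True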

> I think that at this point in time the best we can do is claim that
> Python (the language standard) uses either 16-bit code units or 21-bit
> code points in its string datatype, and that, thanks to PEP 393,
> CPython 3.3 and further will always use 21-bit code points (but Jython
> and IronPython may forever use their platform's native 16-bit code
> unit representing string type). And then we add APIs that can be used
> everywhere to look for code points (even if the string contains code
> units), graphemes, or larger constructs. I'd like those APIs to be
> designed using a garbage-in-garbage-out principle, where if the input
> conforms to some Unicode requirement, the output does too, but if the
> input doesn't, the output does what makes most sense. Validation is
> then limited to codecs, and optional calls.

So limiting the code point values to 21 bits (wasting 11 bits of a
32-bit unit) only serves to prevent applications from using those 11
bits when they have extra-Unicode values to represent.  There is no
shortage of 32-bit datatypes to draw from, but it seems an unnecessary
constraint if exact conformance to Unicode is not provided anyway...
conforming codecs wouldn't create such values when decoding, nor accept
them when encoding, so the constraint only serves to keep applications
from using all 32 bits of the underlying storage.
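
For reference, a quick sketch of where CPython draws the line (assuming
a wide or PEP-393 build, where chr() is bounded by the Unicode range
0..0x10FFFF rather than the full 21, let alone 32, bits):

    print(ord(chr(0x10FFFF)) == 0x10FFFF)   # True: top of the Unicode range
    try:
        chr(0x110000)                       # one past U+10FFFF
    except ValueError as err:
        print("rejected:", err)             # chr() arg not in range(0x110000)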

> If you index or slice a string, or create a string from chr() of a
> surrogate or from some other value that the Unicode standard considers
> an illegal code point, you better know what you are doing. I want
> chr(i) to be valid for all values of i in range(2**21), so it can be
> used to create a lone surrogate, or (on systems with 16-bit
> "characters") a surrogate pair. And also ord(chr(i)) == i for all i in
> range(2**21). I'm not sure about ord() on a 2-character string
> containing a surrogate pair on systems where strings contain 21-bit
> code points; I think it should be an error there, just as ord() on
> other strings of length != 1. But on systems with 16-bit "characters",
> ord() of strings of length 2 containing a valid surrogate pair should
> work.
>

Yep.  So str != Unicode.  You keep saying that :)  And others point out
how some applications would benefit from encapsulating the complexities
of Unicode semantics at various higher levels of abstraction.  Sure,
that can be tacked on by adding complex access methods to a subtype of
str, but O(1) indexing of those higher abstractions is lost when that
route is chosen.
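
As a rough sketch of what one such access method might look like (this
is only an approximation built on combining classes, not real UAX #29
grapheme segmentation, and finding the nth cluster needs a scan rather
than O(1) indexing):

    import unicodedata

    def rough_graphemes(s):
        # Group each base code point with the combining marks that
        # follow it -- a crude stand-in for real grapheme clusters.
        cluster = ""
        for ch in s:
            if cluster and unicodedata.combining(ch):
                cluster += ch
            else:
                if cluster:
                    yield cluster
                cluster = ch
        if cluster:
            yield cluster

    text = "e\u0301tude"                      # 'e' + COMBINING ACUTE ACCENT + "tude"
    print(len(text))                          # 6 code points
    print(len(list(rough_graphemes(text))))   # 5 visual "squiggles"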