[Python-Dev] PEP 393 Summer of Code Project

Guido van Rossum guido at python.org
Fri Aug 26 04:52:09 CEST 2011


On Thu, Aug 25, 2011 at 6:40 PM, Ezio Melotti <ezio.melotti at gmail.com> wrote:
> On Fri, Aug 26, 2011 at 1:54 AM, Guido van Rossum <guido at python.org> wrote:
>>
>> On Wed, Aug 24, 2011 at 3:06 AM, Terry Reedy <tjreedy at udel.edu> wrote:
>> > Excuse me for believing the fine 3.2 manual that says
>> > "Strings contain Unicode characters." (And to a naive reader, that
>> > implies
>> > that string iteration and indexing should produce Unicode characters.)
>>
>> The naive reader also doesn't know the difference between characters,
>> code points and code units. It's the advanced, Unicode-aware reader
>> who is confused by this phrase in the docs. It should say code units;
>> or perhaps code units for narrow builds and code points for wide
>> builds.
>
> For UTF-16/32 (i.e. narrow/wide), talking about "code units"[0] should be
> correct.  Also note that:
>   * for both, every "code unit" has a specific "codepoint" (including lone
> surrogates), so it might be OK to talk about "codepoints" too, but
>   * only in wide builds is every "codepoint" represented by a single,
> 32-bit "code unit".  In narrow builds, non-BMP chars are represented by a
> "code unit sequence" of two elements (i.e. a "surrogate pair").

The more I think about it the more it seems to me that the biggest
problem is that in narrow builds it is ambiguous whether (unicode)
strings contain code units, i.e. are *encoded* code points, or whether
they contain (decoded) code points. In a sense this is repeating the
ambiguity of 8-bit strings in Python 2, which are sometimes assumed to
contain ASCII or Latin-1 (i.e., code points with a limited range) or
UTF-8 (i.e., code units).
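
A minimal sketch of the distinction, assuming CPython 3.3 or later (where
PEP 393 is in effect); the variable names are illustrative only. The string
holds one code point, while its UTF-16 encoding holds two code units:

```python
# One code point outside the Basic Multilingual Plane (U+1F600).
s = "\U0001F600"

# With PEP 393, indexing and len() operate on code points.
code_points = [ord(c) for c in s]

# Encoding to UTF-16 produces 16-bit code units; a non-BMP code point
# needs two of them (a surrogate pair).
raw = s.encode("utf-16-le")
code_units = [
    int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)
]

print(len(s), [hex(cp) for cp in code_points])
print(len(code_units), [hex(cu) for cu in code_units])
```

On an old narrow build, len(s) would have reported the code unit count
instead, which is exactly the ambiguity described above.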

I know that by now I am repeating myself, but I think it would be
really good if we could get rid of this ambiguity. PEP 393 seems the
best way forward, even if it doesn't directly address what to do for
IronPython or Jython, both of which have to deal with a pervasive
native string type that contains UTF-16.

IIUC, CPython on Windows will work just fine with PEP 393, even if it
means that there is a bit more translation between Python strings and
the OS native wchar_t[] type. I assume that the data volume going
through the OS APIs is relatively constrained, since data actually
written to or read from a file will still be bytes, possibly run
through a codec (if it's a text file), and not go through one of the
wchar_t[] APIs -- the latter are used for things like filenames, which
are much smaller.

> Since "code unit" refers to the *minimal* bit combination, in UTF-8
> characters that need 2/3/4 bytes are represented with a "code unit
> sequence" made of 2/3/4 "code units" (so in UTF-8 "code units" and "code
> points" overlap only for the ASCII range).

Actually I think UTF-8 is best thought of as an encoding for code
points, not characters -- the subtle difference between these two
should be of no concern to the UTF-8 codec (unless it is a validating
codec).
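
As a rough illustration of UTF-8 as a mapping from code points to sequences
of 8-bit code units, here is a small sketch. The helper name utf8_units is
hypothetical, and the sample code points are arbitrary picks from each
length class:

```python
def utf8_units(cp):
    """Return the UTF-8 code units (bytes) that encode one code point."""
    return list(chr(cp).encode("utf-8"))

# One sample code point for each possible UTF-8 sequence length.
samples = (0x41, 0xE9, 0x20AC, 0x1F600)
lengths = [len(utf8_units(cp)) for cp in samples]

for cp in samples:
    units = " ".join(f"{u:02X}" for u in utf8_units(cp))
    print(f"U+{cp:04X} -> {units}")
```

Only the first sample, in the ASCII range, has a single code unit whose
value equals the code point.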

>> With PEP 393 we can unconditionally say code points, which is
>> much better. We should try to remove our use of "characters" -- or
>> else we should *define* our use of the term "characters" as "what the
>> Unicode standard calls code points".
>
> Character usually works fine, especially for naive readers.  Even
> Unicode-aware readers often confuse the several terms, so using a
> simple term and pointing to a more accurate description sounds like a better
> idea to me.

We may well have no choice -- there is just too much documentation
that naively refers to characters while really referring to code units
or code points.

> Note that there's also another important term[1]:
> """
> Unicode Scalar Value. Any Unicode code point except high-surrogate and
> low-surrogate code points. In other words, the ranges of integers 0 to
> D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive.
> """

This seems to involve validation. I think all validation should be
sequestered to specific APIs (e.g. certain codecs) and the string type
should not care about it. Depending on what they are doing,
applications may have to be aware of many subtleties in order to
always avoid generating "invalid" (or not well-formed -- what's the
difference?) strings.
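
A sketch of what "validation lives in the codec, not in the string type"
can look like, assuming a current CPython 3 where the strict UTF-8 codec
rejects lone surrogates and the "surrogatepass" error handler is available.
The variable names are illustrative:

```python
# The str type itself stores a lone surrogate without complaint.
lone = "\ud800"

# The strict UTF-8 codec acts as the validating API and refuses it.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The 'surrogatepass' error handler opts out of that validation,
# in both directions.
passed = lone.encode("utf-8", "surrogatepass")
restored = passed.decode("utf-8", "surrogatepass")

print(strict_ok, passed, restored == lone)
```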

> For example the UTF codecs produce sequences of "code units" (of 8, 16, 32
> bits) that represent "scalar values"[2][3]:
>
> Chapter 3 [4] says:
> """
> 3.9 Unicode Encoding Forms
> The Unicode Standard supports three character encoding forms: UTF-32,
> UTF-16, and UTF-8. Each encoding form maps the Unicode code points
> U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences. [...]

I really don't mind whether our codecs actually make exceptions for
surrogates (lone or otherwise). The only requirement I care about is
that surrogate-free strings round-trip correctly. Again, apps that
want to conform to the requirements regarding surrogates can implement
their own validation, and certainly at some point we should offer a
validation library as part of the stdlib -- but it should be up to the
app whether and when to use it.
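
For illustration, the round-trip requirement and an application-level check
could be sketched as follows. Both helper names (has_surrogates and
round_trips) are hypothetical and are not part of the stdlib:

```python
def has_surrogates(s):
    """Return True if s contains any surrogate code point."""
    return any(0xD800 <= ord(c) <= 0xDFFF for c in s)


def round_trips(s, encoding):
    """Return True if encoding and then decoding s gives s back."""
    return s.encode(encoding).decode(encoding) == s


# A surrogate-free string with BMP and non-BMP code points.
text = "caf\u00e9 \u20ac \U0001F600"

results = {enc: round_trips(text, enc) for enc in ("utf-8", "utf-16", "utf-32")}
print(has_surrogates(text), results)

# An application that cares can run the check before encoding.
print(has_surrogates("abc\udc00"))
```

The check is opt-in here: the string type accepts both inputs, and only the
application decides whether a surrogate is an error.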

>  D76 Unicode scalar value: Any Unicode code point except high-surrogate and
> low-surrogate code points.
>      • As a result of this definition, the set of Unicode scalar values
> consists of the ranges 0 to D7FF and E000 to 10FFFF, inclusive.
>  D77 Code unit: The minimal bit combination that can represent a unit of
> encoded text for processing or interchange.
> [...]
>  D79 A Unicode encoding form assigns each Unicode scalar value to a unique
> code unit sequence.
> """
>
> On the other hand, Python Unicode strings are not limited to scalar values,
> because they can also contain lone surrogates.

Right.

> I hope this helps clarify the terminology a bit and doesn't add more
> confusion, but if we want to use the Unicode terms we should get them
> right.  (Also note that I might have misunderstood something, even if I've
> been careful with the terms and I double-checked and quoted the relevant
> parts of the Unicode standard.)

I'm not more confused than I was, but I think we should reduce the
number of Unicode terms we care about rather than increase it. If we
only ever had to talk about code points and encoded byte sequences I'd
be happy -- although in practice we also need to acknowledge the
existence of characters that may be represented by multiple code
points, since islower(), lower() etc. may need these (and also the re
module). Other concepts we may have to at least acknowledge include
various normal forms, equivalence, and collation sequences (which are
language-dependent?). It would be lovely if someone wrote up an
informational PEP so that we don't all have to lug around a copy of
the Unicode standard.
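
A small sketch of why these extra concepts cannot be ignored entirely, using
the stdlib unicodedata module. The sample strings are illustrative, and the
upper() result assumes CPython 3.3 or later, which applies full case
mappings:

```python
import unicodedata

# The same user-perceived character as one code point or as two.
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Plain comparison works on code points, so the two forms differ.
same_raw = composed == decomposed

# Normalizing both to one form (NFC here) makes them comparable.
same_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Case conversion can change the number of code points.
upper_sharp_s = "\u00df".upper()

print(len(composed), len(decomposed), same_raw, same_nfc, upper_sharp_s)
```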

> Best Regards,
> Ezio Melotti
>
>
> [0]: From chapter 3 [4],
>  D77 Code unit: The minimal bit combination that can represent a unit of
> encoded text for processing or interchange.
>    • Code units are particular units of computer storage. Other character
> encoding standards typically use code units defined as 8-bit units—that is,
> octets.
>      The Unicode Standard uses 8-bit code units in the UTF-8 encoding form,
> 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the
> UTF-32 encoding form.
> [1]: http://unicode.org/glossary/#unicode_scalar_value
> [2]: Apparently Python 3 raises an error while encoding lone surrogates in
> UTF-8, but it doesn't for UTF-16 and UTF-32.
> From chapter 3 [4],
>  D91: "Because surrogate code points are not Unicode scalar values, isolated
> UTF-16 code units in the range 0xD800..0xDFFF are ill-formed."
>  D92: "Because surrogate code points are not included in the set of Unicode
> scalar values, UTF-32 code units in the range 0x0000D800..0x0000DFFF are
> ill-formed."
> I think this should be fixed.
> [3]: Note that I'm talking about codecs used to encode/decode Unicode
> strings to/from bytes here, it's perfectly fine for Python itself to
> represent lone surrogates in its *internal* representations, regardless of
> what encoding it's using.
> [4]: Chapter 3: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf

-- 
--Guido van Rossum (python.org/~guido)

