[Python-Dev] len(chr(i)) = 2?

Sun Nov 21 13:55:12 CET 2010

"Martin v. Löwis" writes:
 > Am 20.11.2010 05:11, schrieb Stephen J. Turnbull:
 > > "Martin v. Löwis" writes:
 > > 
 > >  > The term "UCS-2" is a character set that can only encode 65536
 > >  > characters; it thus refers to Unicode 1.1. According to the Unicode
 > >  > Consortium's FAQ, the term UCS-2 should be avoided these days.
 > > 
 > > So what do you propose we call the Python implementation?
 > 
 > A technically correct description would be to say that Python uses either
 > 16-bit code units or 32-bit code units; for brevity, these can be called
 > narrow and wide code units.

I agree that's technically correct.  Unfortunately, it's also useless
to anybody who doesn't already know more about Unicode than anybody
should have to know.

 > > and therefore is not UTF-16 conforming.
 > 
 > I disagree. Python does "conform" to "UTF-16"

I'm sure the codecs do.  But the Unicode standard doesn't care about
the parts of the process; it cares about what the process does as a whole.
Python's internal coding does not conform to UTF-16, and that internal
coding can, under certain conditions, escape to the outside world as
invalid "Unicode" output.

 > > AFAIK this was not supposed to change in Python 3; indexing and
 > > slicing go by code unit (isomorphic to UCS-n), not character, and due
 > > to PEP 383 4-octet builds do not conform (internally) to UTF-32, and
 > > can produce output that conforms to Unicode not at all (as a user
 > > option, of course, but it's still non-conformant).
 > 
 > What behavior specifically do you consider non-conforming, and what
 > specific specification do you think it is not conforming to? For
 > example, it *is* fully conforming with UTF-8.

Oh,

    f = open('/tmp/broken','wt',encoding='utf8',errors='surrogateescape')
    f.write(chr(int('dc80',16)))
    f.close()

for one.  That produces a non-UTF-8 file in a 32-bit-code-unit build.
You can say, "oh, but that's not really a UTF-8 codec", and I'd agree.
Nevertheless, the program is able to produce output from internal
"Unicode" strings that does not conform to Unicode at all.  A Unicode-
conforming Python implementation would error at the chr() call, or
perhaps would not provide surrogateescape error handlers.
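
To make that concrete, here is a minimal sketch of what ends up in the
file, done in memory rather than through /tmp/broken.  The variable
names (lone, raw, strict_ok, valid_utf8) are mine, purely for
illustration:

```python
# Sketch: what the surrogateescape handler emits for a lone surrogate.
lone = chr(0xDC80)  # same value as chr(int('dc80', 16)) above

# The strict UTF-8 codec refuses to encode a lone surrogate.
try:
    lone.encode('utf8')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With surrogateescape, U+DC80 is mapped back to the raw byte 0x80.
raw = lone.encode('utf8', errors='surrogateescape')

# A bare 0x80 is a continuation byte with no lead byte, so a strict
# UTF-8 decoder rejects it.
try:
    raw.decode('utf8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(strict_ok, raw, valid_utf8)
```

So the strict codec does the right thing; it is the combination of an
unchecked chr() and the surrogateescape handler that lets the invalid
byte out.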

It is, of course, possible to write Python programs that conform (and
easier than in any other language I know), but Python itself does not
conform to post-1.1 Unicode standards.  Too bad for the standards:
"Although practicality beats purity."

The point is that the internal coding is *not* UTF-16 (or -32), but it *is*
isomorphic to UCS-2 (or -4).  *That is very useful information to
users*, it's not a technical detail of interest only to Unicode geeks.
It means that if you stick to defined characters in the BMP when
giving Python input, then slicing and indexing unicode (Python 2) or
str (Python 3) objects gives only valid output even in builds with
16-bit code units.  OTOH, invalid processing (involving functions like
'chr' or input using surrogateescape codecs) can lead to invalid
output even in builds with 32-bit code units.
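
A rough sketch of the 16-bit case follows.  It emulates narrow-build
code units by going through the UTF-16-LE codec, so it should behave
the same on any build; the helper names code_units_16 and is_surrogate
are hypothetical, not anything Python provides:

```python
import struct

def code_units_16(s):
    """Return the 16-bit code units of s as a list of ints."""
    data = s.encode('utf-16-le')
    return list(struct.unpack('<%dH' % (len(data) // 2), data))

def is_surrogate(unit):
    """True if the code unit lies in the surrogate range D800..DFFF."""
    return 0xD800 <= unit <= 0xDFFF

bmp = 'na\u00efve \u65e5\u672c'   # BMP-only text
astral = 'a\U00010000b'           # one character outside the BMP

bmp_units = code_units_16(bmp)
astral_units = code_units_16(astral)

# BMP-only input: one code unit per character and no surrogates, so
# any slice of the code units is still valid text.
bmp_is_safe = (len(bmp_units) == len(bmp)
               and not any(is_surrogate(u) for u in bmp_units))

# Non-BMP input: the character takes two code units, and slicing
# between them leaves a lone high surrogate at the end.
head = astral_units[:2]
head_is_broken = is_surrogate(head[-1])

print(bmp_is_safe, [hex(u) for u in head], head_is_broken)
```

That is all "isomorphic to UCS-2" needs to mean to a user: stay in the
BMP and code-unit slicing cannot split anything.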

IMO, saying "UCS-2" or "UCS-4" tells ordinary developers most of what
they need to know about the limitations of their Python vis-a-vis full
conformance, at least with respect to the string manipulation functions.