[Python-Dev] len(chr(i)) = 2?

"Martin v. Löwis" martin at v.loewis.de
Sun Nov 21 19:51:44 CET 2010


>  > I disagree. Python does "conform" to "UTF-16"
> 
> I'm sure the codecs do.  But the Unicode standard doesn't care about
> the parts of the process, it cares about what it does as a whole.

Chapter and verse?

> Python's internal coding does not conform to UTF-16, and that internal
> coding can, under certain conditions, escape to the outside world as
> invalid "Unicode" output.

I'm fairly certain there are provisions in the Unicode standard for such
behavior (taking into account "certain conditions").

>  > What behavior specifically do you consider non-conforming, and what
>  > specific specification do you think it is not conforming to? For
>  > example, it *is* fully conforming with UTF-8.
> 
> Oh,
> 
>     f = open('/tmp/broken','wt',encoding='utf8',errors='surrogateescape')
>     f.write(chr(int('dc80',16)))
>     f.close()
> 
> for one.  That produces a non-UTF-8 file

Right. You are using an API that does not promise to create UTF-8, and
hence the output isn't UTF-8. The Unicode standard certainly allows
implementations to use character encoding schemes other than UTF-8; this
one is "UTF-8 with surrogate escapes", which is different from "UTF-8"
(IANA MIBEnum 106).
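
To make the distinction concrete, here is a minimal sketch, assuming a
Python 3 interpreter with PEP 383 support (3.1 or later). The variable
names are only illustrative. It works on in-memory strings rather than
on the file from the quoted example, but it exercises the same codec and
error handler:

```python
# A lone low surrogate, the same code point as in the quoted example.
lone = chr(0xDC80)

# The plain "utf-8" codec (strict error handling) refuses to encode it.
try:
    lone.encode('utf-8')
    strict_accepts = True
except UnicodeEncodeError:
    strict_accepts = False

# With errors='surrogateescape' the surrogate is mapped back to the raw
# byte it stands for, so the result is not well-formed UTF-8.
escaped = lone.encode('utf-8', errors='surrogateescape')

# A strict UTF-8 decoder rejects those bytes.
try:
    escaped.decode('utf-8')
    strict_decodes = True
except UnicodeDecodeError:
    strict_decodes = False

# Decoding with the same error handler round-trips the original string.
roundtrip = escaped.decode('utf-8', errors='surrogateescape')

print(strict_accepts, escaped, strict_decodes, roundtrip == lone)
```

So the strict codec and the codec with the surrogateescape handler
behave as two different encoding schemes.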

> You can say, "oh, but that's not really a UTF-8 codec", and I'd agree.

See above :-)

> Nevertheless, the program is able to produce output from internal
> "Unicode" strings that does not conform to Unicode at all.

*Any* Unicode implementation will do that, since they all have to
support legacy encodings in some form. Doing so certainly conforms to
the Unicode standard, and is in fact one of the primary Unicode design
principles.
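
As an illustration of the legacy-encoding point, a small sketch follows,
again assuming Python 3. Latin-1 is just one example of a legacy
encoding, and the names are illustrative:

```python
# A perfectly ordinary Unicode string: LATIN SMALL LETTER E WITH ACUTE.
text = '\u00e9'

# Encoding to a legacy charset yields bytes that are not in any UTF.
legacy_bytes = text.encode('latin-1')

# Those bytes are not well-formed UTF-8, yet producing them is an
# ordinary conversion to another character encoding.
try:
    legacy_bytes.decode('utf-8')
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

print(legacy_bytes, is_valid_utf8)
```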

> A Unicode-
> conforming Python implementation would error at the chr() call, or
> perhaps would not provide surrogateescape error handlers.

Chapter and verse?

> "Although practicality beats purity."

The Unicode standard itself is based on practicality. It would not
have achieved the success it did had it been based on purity alone
(and indeed, it was often rejected in cases where it put purity over
practicality, e.g. with the Hangul syllables).

Regards,
Martin

