[Python-Dev] PEP 383 update: utf8b is now the error handler

Tue May 5 23:01:49 CEST 2009

> I have three substantive comments.  First, although consequences for
> Python 3 byte interfaces (ie, "none") are explicitly stated, as far as
> I can see this PEP could apply to Python 2 as well.  I don't think
> it's intended that way.  Either way, I think you should clarify that
> point.

Done: the Python-Version header already clarifies that point.

> Second, I suggest "surrogate-replace" as the name of the error handler
> rather than "utf8b".

I think this is bike-shedding.

> Third, it is not clear to me why non-decodable ASCII should be an
> error.  There are plenty of low surrogates for the purpose.  Is there
> another technical reason?  Stupid or not, Shift-JIS- and Big5-encoded
> file systems are quite common in Asia still (including non-rewritable
> media).  I think surrogate-replacement of ASCII should at least be an
> option.

It's a security risk. If U+DCXX mapped to \xXX, then somebody could
embed U+DC2E U+DC2E U+DC2F in a character string; even if the string
gets sanitized, nobody would expect that it will actually access ../
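
A minimal sketch of that restriction, assuming Python 3.1 or later, where
the handler ended up being spelled "surrogateescape" rather than "utf8b"
(the variable names are only for illustration):

```python
# Assumption: Python 3.1+, where the PEP 383 handler is "surrogateescape".

# A non-decodable high byte is carried through as a lone low surrogate ...
name = b"caf\xe9".decode("utf-8", "surrogateescape")
assert name == "caf\udce9"

# ... and encoding with the same handler restores the original byte.
assert name.encode("utf-8", "surrogateescape") == b"caf\xe9"

# U+DC2E U+DC2E U+DC2F would correspond to the ASCII bytes "../".
# Only U+DC80..U+DCFF are mapped back to bytes, so this is rejected.
sneaky = "\udc2e\udc2e\udc2f"
try:
    sneaky.encode("utf-8", "surrogateescape")
    rejected = False
except UnicodeEncodeError:
    rejected = True

print("ASCII-range surrogates rejected:", rejected)
```

Because surrogates in the ASCII range never turn back into bytes, a
check for "../" on the decoded string cannot be bypassed this way.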

> 1.  There is no such thing as a "half-surrogate" in Unicode.  "Lone
>     surrogate" is clear enough.  Or for somewhat fancier English,
>     "isolated surrogate" or "non-syntactic surrogate".  To emphasize
>     that Python codecs will only produce them in contexts where a
>     Unicode character or high surrogate (for UTF-16 Python) is
>     syntactically required, "isolated low surrogate" or "isolated
>     trailing surrogate" might be good.[1]

Fixed. I removed the word "half" everywhere. It really doesn't mean
anything to me (it could have been called sunnygate instead, making
no difference).

I tried to understand "surrogate", and it was explained to me that
"surrogate" is something that stands for something - but then I
would argue that the two subsequent codes form a surrogate - they
stand for something else. The individual surrogate code (in Unicode
terminology) doesn't stand for anything. So don't you agree that
it is the Unicode terminology that is in error, not the PEP?

> 2.  The specification should state, and the discussion emphasize, that
>     strings which were produced by surrogate replacement *must not* be
>     used in data interchange with systems that do not specifically
>     accept such strings, and that this is the responsibility of the
>     application.[2]

No. The specification puts no requirements on applications whatsoever.
So if you propose to use MUST NOT in the RFC 2119 sense, I strongly
disagree.

Applications that desire mojibake are free to produce it; we are
consenting adults; and all that.

> 3.  In the discussion, the transition from the example of alternative
>     use of 'python-escape' to discussion of the error handler
>     interface extension is a bit abrupt.  I suggest rewriting as:
> 
>     """The extension to the encode error handler interface proposed by
>     this PEP is necessary to implement the 'utf8b' error handler,
>     because there are required byte sequences which cannot be
>     generated from replacement Unicode.  However, the encode error
>     handler interface presently requires replacement Unicode to be
>     provided in lieu of the non-encodable Unicode from the source
>     string.  Then it promptly encodes that replacement Unicode.  In
>     some error handlers, such as the 'utf8b' proposed here, it is also
>     simpler and more efficient for the error handler to provide a
>     pre-encoded replacement byte string, rather than forcing it to
>     calculate Unicode from which the encoder would create the
>     desired bytes."""

Unfortunately, I failed to understand where you want this text to
go. What paragraphs should I remove, or (if none), after which
paragraph should I insert this text?
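
For reference, the interface extension the quoted text describes can be
sketched as follows. This assumes Python 3.1 or later, where an encode
error handler may return a bytes object instead of replacement Unicode.
The handler name "demo-escape" and the function escape_to_bytes are
hypothetical and not part of the PEP:

```python
import codecs

def escape_to_bytes(exc):
    """Hypothetical encode error handler returning pre-encoded bytes."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        code = ord(ch)
        # Only lone surrogates standing for bytes 0x80..0xFF are handled.
        if 0xDC80 <= code <= 0xDCFF:
            out.append(code - 0xDC00)
        else:
            raise exc
    # Returning bytes: the encoder copies them into the output as-is
    # instead of encoding a replacement string.
    return bytes(out), exc.end

codecs.register_error("demo-escape", escape_to_bytes)

encoded = "caf\udce9".encode("utf-8", "demo-escape")
print(encoded)
```

The byte 0xE9 on its own is not valid UTF-8, so no replacement Unicode
could make the encoder produce it; the handler has to supply the byte
directly.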

Regards,
Martin