[Python-Dev] PEP 383 update: utf8b is now the error handler

Walter Dörwald walter at livinglogic.de
Thu May 7 15:20:07 CEST 2009


M.-A. Lemburg wrote:
> Antoine Pitrou wrote:
>> Martin v. Löwis <martin <at> v.loewis.de> writes:
>>> py> b'\xed\xa0\x80'.decode("utf-8","surrogates")
>>> '\ud800'
>> The point is, "surrogates" does not mean anything intuitive for an /error
>> handler/. You seem to be the only one who finds this name explicit enough,
>> perhaps because you chose it.
>> Most other handlers' names have verbs in them ("ignore", "replace",
>> "xmlcharrefreplace", etc.).
> 
> Correct.
> 
> The purpose of an error handler name is to indicate to the user
> what it does, hence the use of verbs.
> 
> Walter started with "xmlcharrefreplace", i.e. names without spaces, so
> "surrogatereplace" would be the logically correct name for the
> "replace with lone surrogates" scheme invented by Markus Kuhn.

"surrogatepass" (for the "don't complain about lone half surrogates"
handler) and "surrogatereplace" sound OK to me. However, the other
"...replace" handlers are destructive (i.e. when such a "...replace"
handler is used for encoding, decoding the result will not produce the
original unicode string). The purpose of the PEP 383 error handler,
however, is to be round-trip safe, so maybe we should choose a slightly
different name? How about "surrogateescape"?
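
To make the destructive vs. round-trip distinction concrete, here is a
minimal sketch. It assumes an interpreter in which a handler is actually
registered under the name "surrogateescape" (at the time of this mail
that name was only a proposal; Python 3.1 and later ship it). The sample
bytes and variable names are made up for illustration.

```python
raw = b"caf\xff"  # illustrative input: 0xFF is never valid in UTF-8

# "replace" is destructive: the bad byte becomes U+FFFD and is lost.
text_replace = raw.decode("utf-8", "replace")
back_replace = text_replace.encode("utf-8")

# "surrogateescape" (assumed available) maps the bad byte to a lone
# low surrogate, which the same handler turns back into the byte.
text_escape = raw.decode("utf-8", "surrogateescape")
back_escape = text_escape.encode("utf-8", "surrogateescape")

print(back_replace == raw)  # False
print(back_escape == raw)   # True
```

With "replace", the original byte cannot be recovered from the decoded
string; with the escaping handler, encoding with the same codec and
handler gives back exactly the bytes that were read.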

> The error handler for undoing this operation (i.e. when converting
> a Unicode string to some other encoding) should probably use the
> same name based on symmetry and the fact that the escaping
> scheme is meant to be used for enabling round-trip safety.

We have only one error handler registry, but we *can* have one error
handler for both directions (encoding and decoding), because the error
handler can simply check whether it was passed a UnicodeEncodeError or
a UnicodeDecodeError object.
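
A rough sketch of such a two-way handler follows. The name "demoescape"
and the function demo_escape are hypothetical, chosen so as not to clash
with any built-in handler, and the byte-to-surrogate mapping
(U+DC80..U+DCFF) is an assumption modelled on the scheme discussed in
this thread, not the actual PEP 383 implementation. It relies only on
codecs.register_error and the start/end/object attributes of the
exception objects.

```python
import codecs


def demo_escape(exc):
    """Hypothetical handler: one callable serving both directions."""
    if isinstance(exc, UnicodeDecodeError):
        # Decoding: turn each undecodable byte into a lone low surrogate.
        bad = exc.object[exc.start:exc.end]
        if any(b < 0x80 for b in bad):
            raise exc  # ASCII bytes are never escaped in this sketch
        return "".join(chr(0xDC00 + b) for b in bad), exc.end
    if isinstance(exc, UnicodeEncodeError):
        # Encoding: turn escaped surrogates back into the original bytes.
        out = bytearray()
        for ch in exc.object[exc.start:exc.end]:
            code = ord(ch)
            if not 0xDC80 <= code <= 0xDCFF:
                raise exc  # not something this handler produced
            out.append(code - 0xDC00)
        return bytes(out), exc.end
    raise exc  # e.g. UnicodeTranslateError: not handled here


codecs.register_error("demoescape", demo_escape)

raw = b"caf\xff"
text = raw.decode("utf-8", "demoescape")
restored = text.encode("utf-8", "demoescape")
print(restored == raw)  # True
```

The single registry entry is enough because the codec machinery passes
the exception instance to the handler, so the direction can be decided
with isinstance at call time.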

> BTW: It would also be appropriate to reference Markus Kuhn in the PEP
> as the inventor of the escaping scheme.
> 
> Even if only to give the reader an idea of how that scheme works and
> why (the PEP on python.org currently doesn't explain this).
> 
> It should also explain that the scheme is meant to assure round-trip
> safety and doesn't necessarily work when using transcoding, i.e.
> reading using one encoding, writing using another.

Servus,
   Walter

