[Python-Dev] Bytes path related questions for Guido

Thu Aug 28 20:43:51 CEST 2014

On Thu, 28 Aug 2014 10:54:44 -0700, Glenn Linderman <v+python at g.nevcal.com> wrote:
> On 8/28/2014 10:41 AM, R. David Murray wrote:
> > On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman <v+python at g.nevcal.com> wrote:
> >> On 8/28/2014 12:30 AM, MRAB wrote:
> >>> There'll be a surrogate escape if a byte couldn't be decoded, but just
> >>> because a byte could be decoded, it doesn't mean that it's correct.
> >>>
> >>> If you picked the wrong encoding, the other codepoints could be wrong
> >>> too.
> >> Aha! Thanks for pointing out the flaw in my reasoning. But that means it
> >> is also pretty useless to "replace_surrogate_escapes" at all, because it
> >> only cleans out the non-decodable characters, not the incorrectly
> >> decoded characters.
> > Well, replace would still be useful for ASCII+surrogateescape.
> 
> How?

Because there "can't" be any incorrectly decoded bytes in the ASCII part,
so all undecodable bytes turning into 'unrecognized character' glyphs
is useful. "can't" is in quotes because of course if you decode random
binary data as ASCII+surrogate escape you could get a mess just like any
other encoding, so this is really a "more *likely* to be useful" version
of my second point, because "real" ASCII with some junk bytes mixed in
is much more likely to be encountered in the wild than, say, utf-8 with
some junk bytes mixed in (although is probably changing as use of utf-8
becomes more widespread, so this point applies to utf-8 as well).

> > Also for
> > cases where the data stream is *supposed* to be in a given encoding, but
> > contains undecodable bytes.  Showing the stuff that incorrectly decodes
> > as whatever it decodes to is generally what you want in that case.
>
> Sure, people can learn to recognize mojibake for what it is, and maybe 
> even learn to recognize it for what it was intended to be, in limited 
> domains. But suppressing/replacing the surrogates doesn't help with 

Well, it does if the alternative is not being able to display the string
to the user at all.  And yeah, people being able to recognize mojibake
in specific problem domains is what I'm talking about...not perhaps a
great use case, but it is a use case.

> that... would it not be better to replace the surrogates with an escape 
> sequence that shows the original, undecodable, byte value?  Like  \xNN ?

Yeah, that idea has been floated as well, and I think it would indeed be
more useful than the 'unknown character' glyph.  I've also seen fonts
that display the hex code inside a box character when the code point is
unknown, which would be cool...but that can hardly be part of unicode,
can it? :)

--David