[Python-Dev] Bytes path related questions for Guido

Thu Aug 28 19:41:03 CEST 2014

On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman <v+python at g.nevcal.com> wrote:
> On 8/28/2014 12:30 AM, MRAB wrote:
> > On 2014-08-28 05:56, Glenn Linderman wrote:
> >> On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
> >>> Glenn Linderman writes:
> >>>   > On 8/26/2014 4:31 AM, MRAB wrote:
> >>>   > > On 2014-08-26 03:11, Stephen J. Turnbull wrote:
> >>>   > >> Nick Coghlan writes:
> >>>
> >>>   > > How about:
> >>>   > >
> >>>   > >     replace_surrogate_escapes(s, replacement='\uFFFD')
> >>>   > >
> >>>   > > If you want them removed, just pass an empty string as the
> >>>   > > replacement.
> >>>
> >>> That seems better to me (I had too much C for breakfast, I think).
> >>>
> >>>   > And further, replacement could be a vector of 128 characters, to do
> >>>   > immediate transcoding,
> >>>
> >>> Using what encoding?
> >>
> >> The vector would contain the transcoding. Each lone surrogate would map
> >> to a character in the vector.
> >>
> >>> If you knew that much, why didn't you use
> >>> (write, if necessary) an appropriate codec?  I can't envision this
> >>> being useful.
> >>
> >> If the data format describes its encoding, possibly containing data from
> >> several encodings in various spots, then perhaps it is best read as
> >> binary, and processed as binary until those definitions are found.
> >>
> >> But an alternative would be to read with surrogate escapes, and then
> >> when the encoding is determined, to transcode the data. Previously, a
> >> proposal was made to reverse the surrogate escapes to the original
> >> bytes, and then apply the (now known) appropriate codec. There are not
> >> appropriate codecs that can convert directly from surrogate escapes to
> >> the desired end result. This technique could be used instead, for
> >> single-byte, non-escaped encodings. On the other hand, writing specialty
> >> codecs for the purpose would be more general.
> >>
> > There'll be a surrogate escape if a byte couldn't be decoded, but just
> > because a byte could be decoded, it doesn't mean that it's correct.
> >
> > If you picked the wrong encoding, the other codepoints could be wrong
> > too.
> 
> Aha! Thanks for pointing out the flaw in my reasoning. But that means it 
> is also pretty useless to "replace_surrogate_escapes" at all, because it 
> only cleans out the non-decodable characters, not the incorrectly 
> decoded characters.

Well, replace would still be useful for ASCII+surrogateescape.  Also for
cases where the data stream is *supposed* to be in a given encoding, but
contains undecodable bytes.  Showing the stuff that incorrectly decodes
as whatever it decodes to is generally what you want in that case.

--David