[Python-Dev] Bytes path related questions for Guido
R. David Murray
rdmurray at bitdance.com
Thu Aug 28 19:41:03 CEST 2014
On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman <v+python at g.nevcal.com> wrote:
> On 8/28/2014 12:30 AM, MRAB wrote:
> > On 2014-08-28 05:56, Glenn Linderman wrote:
> >> On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
> >>> Glenn Linderman writes:
> >>> > On 8/26/2014 4:31 AM, MRAB wrote:
> >>> > > On 2014-08-26 03:11, Stephen J. Turnbull wrote:
> >>> > >> Nick Coghlan writes:
> >>>
> >>> > > How about:
> >>> > >
> >>> > > replace_surrogate_escapes(s, replacement='\uFFFD')
> >>> > >
> >>> > > If you want them removed, just pass an empty string as the
> >>> > > replacement.
> >>>
> >>> That seems better to me (I had too much C for breakfast, I think).
> >>>
> >>> > And further, replacement could be a vector of 128 characters, to do
> >>> > immediate transcoding,
> >>>
> >>> Using what encoding?
> >>
> >> The vector would contain the transcoding. Each lone surrogate would map
> >> to a character in the vector.
> >>
> >>> If you knew that much, why didn't you use
> >>> (write, if necessary) an appropriate codec? I can't envision this
> >>> being useful.
> >>
> >> If the data format describes its encoding, possibly containing data from
> >> several encodings in various spots, then perhaps it is best read as
> >> binary, and processed as binary until those definitions are found.
> >>
> >> But an alternative would be to read with surrogate escapes, and then
> >> when the encoding is determined, to transcode the data. Previously, a
> >> proposal was made to reverse the surrogate escapes to the original
> >> bytes, and then apply the (now known) appropriate codec. There are no
> >> appropriate codecs that can convert directly from surrogate escapes to
> >> the desired end result. This technique could be used instead, for
> >> single-byte, non-escaped encodings. On the other hand, writing specialty
> >> codecs for the purpose would be more general.
> >>
> > There'll be a surrogate escape if a byte couldn't be decoded, but just
> > because a byte could be decoded, it doesn't mean that it's correct.
> >
> > If you picked the wrong encoding, the other codepoints could be wrong
> > too.
>
> Aha! Thanks for pointing out the flaw in my reasoning. But that means it
> is also pretty useless to "replace_surrogate_escapes" at all, because it
> only cleans out the non-decodable characters, not the incorrectly
> decoded characters.
Well, replace would still be useful for ASCII+surrogateescape. It would
also be useful when the data stream is *supposed* to be in a given
encoding but contains undecodable bytes. In that case, showing the bytes
that decode incorrectly as whatever they decode to is generally what you
want.
--David
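
For illustration, here is a minimal sketch of the helper discussed in this
thread. The name and signature follow MRAB's suggestion, but
replace_surrogate_escapes is hypothetical and not a standard library
function; the sample bytes and the choice of Latin-1 as the "real" encoding
are also assumptions made for the example. It shows the ASCII+surrogateescape
case, the reverse-then-redecode approach, and why replacement cannot catch
bytes that decoded "successfully" under the wrong codec.

```python
import re

# The surrogateescape error handler maps each undecodable byte 0x80..0xFF
# to a lone surrogate in the range U+DC80..U+DCFF.
_ESCAPES = re.compile('[\udc80-\udcff]')


def replace_surrogate_escapes(s, replacement='\ufffd'):
    """Hypothetical helper (not in the stdlib): replace surrogate escapes."""
    # A function is passed to sub() so that any backslashes in
    # `replacement` are taken literally.
    return _ESCAPES.sub(lambda match: replacement, s)


# ASCII + surrogateescape: the non-ASCII byte becomes a lone surrogate.
raw = b'caf\xe9 ok'
text = raw.decode('ascii', 'surrogateescape')
assert text == 'caf\udce9 ok'
assert replace_surrogate_escapes(text) == 'caf\ufffd ok'
assert replace_surrogate_escapes(text, '') == 'caf ok'

# Reversing the escapes recovers the original bytes, which can then be
# decoded once the real encoding (Latin-1 here, by assumption) is known.
assert text.encode('ascii', 'surrogateescape') == raw
assert raw.decode('latin-1') == 'caf\xe9 ok'

# MRAB's point: a wrong codec that happens to decode every byte leaves no
# surrogate escapes behind, so there is nothing for the helper to replace.
mojibake = 'caf\xe9'.encode('utf-8').decode('latin-1')
assert mojibake == 'caf\xc3\xa9'
assert replace_surrogate_escapes(mojibake) == mojibake
```

Note that a string containing lone surrogates cannot be encoded as strict
UTF-8, so print it with ascii() or repr() when debugging.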