[Python-Dev] Bytes path related questions for Guido

Thu Aug 28 19:54:44 CEST 2014

On 8/28/2014 10:41 AM, R. David Murray wrote:
> On Thu, 28 Aug 2014 10:15:40 -0700, Glenn Linderman <v+python at g.nevcal.com> wrote:
>> On 8/28/2014 12:30 AM, MRAB wrote:
>>> On 2014-08-28 05:56, Glenn Linderman wrote:
>>>> On 8/27/2014 6:08 PM, Stephen J. Turnbull wrote:
>>>>> Glenn Linderman writes:
>>>>>    > On 8/26/2014 4:31 AM, MRAB wrote:
>>>>>    > > On 2014-08-26 03:11, Stephen J. Turnbull wrote:
>>>>>    > >> Nick Coghlan writes:
>>>>>
>>>>>    > > How about:
>>>>>    > >
>>>>>    > >     replace_surrogate_escapes(s, replacement='\uFFFD')
>>>>>    > >
>>>>>    > > If you want them removed, just pass an empty string as the
>>>>>    > > replacement.
>>>>>
>>>>> That seems better to me (I had too much C for breakfast, I think).
>>>>>
>>>>>    > And further, replacement could be a vector of 128 characters, to do
>>>>>    > immediate transcoding,
>>>>>
>>>>> Using what encoding?
>>>> The vector would contain the transcoding. Each lone surrogate would map
>>>> to a character in the vector.
>>>>
>>>>> If you knew that much, why didn't you use
>>>>> (write, if necessary) an appropriate codec?  I can't envision this
>>>>> being useful.
>>>> If the data format describes its encoding, possibly containing data from
>>>> several encodings in various spots, then perhaps it is best read as
>>>> binary, and processed as binary until those definitions are found.
>>>>
>>>> But an alternative would be to read with surrogate escapes, and then
>>>> when the encoding is determined, to transcode the data. Previously, a
>>>> proposal was made to reverse the surrogate escapes to the original
>>>> bytes, and then apply the (now known) appropriate codec. There are no
>>>> appropriate codecs that can convert directly from surrogate escapes to
>>>> the desired end result. This technique could be used instead, for
>>>> single-byte, non-escaped encodings. On the other hand, writing specialty
>>>> codecs for the purpose would be more general.
>>>>
>>> There'll be a surrogate escape if a byte couldn't be decoded, but just
>>> because a byte could be decoded, it doesn't mean that it's correct.
>>>
>>> If you picked the wrong encoding, the other codepoints could be wrong
>>> too.
>> Aha! Thanks for pointing out the flaw in my reasoning. But that means it
>> is also pretty useless to "replace_surrogate_escapes" at all, because it
>> only cleans out the non-decodable characters, not the incorrectly
>> decoded characters.
> Well, replace would still be useful for ASCII+surrogateescape.

How?

> Also for
> cases where the data stream is *supposed* to be in a given encoding, but
> contains undecodable bytes.  Showing the stuff that incorrectly decodes
> as whatever it decodes to is generally what you want in that case.
Sure, people can learn to recognize mojibake for what it is, and maybe
even learn to recognize it for what it was intended to be, in limited
domains. But suppressing or replacing the surrogates doesn't help with
that. Would it not be better to replace each surrogate with an escape
sequence that shows the original, undecodable byte value, like \xNN?
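
To make the options in this thread concrete, here is a minimal sketch in
Python 3. It assumes the text was decoded with errors='surrogateescape', so
each undecodable byte 0x80..0xFF became a lone surrogate U+DC80..U+DCFF.
replace_surrogate_escapes is the name MRAB proposed above and is not an
existing stdlib function; escape_surrogate_escapes and redecode are
hypothetical names used only for illustration.

```python
import re

# Lone surrogates produced by the 'surrogateescape' error handler:
# byte 0xNN (0x80..0xFF) is decoded to the code point U+DC00 + 0xNN.
_SURROGATE_ESCAPES = re.compile('[\udc80-\udcff]')


def replace_surrogate_escapes(s, replacement='\uFFFD'):
    """Replace every escaped byte with `replacement` ('' removes them)."""
    # A callable avoids re.sub interpreting backslashes in `replacement`.
    return _SURROGATE_ESCAPES.sub(lambda m: replacement, s)


def escape_surrogate_escapes(s):
    """Show each escaped byte as a visible \\xNN sequence (hypothetical)."""
    return _SURROGATE_ESCAPES.sub(
        lambda m: '\\x%02x' % (ord(m.group()) - 0xDC00), s)


def redecode(s, original='ascii', actual='latin-1'):
    """Recover the original bytes, then decode with the now-known codec.

    Only round-trips if `s` really came from `original` + surrogateescape.
    """
    return s.encode(original, 'surrogateescape').decode(actual)


raw = b'caf\xe9'                              # Latin-1 bytes, not ASCII
text = raw.decode('ascii', 'surrogateescape')  # 'caf' + '\udce9'

replaced = replace_surrogate_escapes(text)     # 'caf' + U+FFFD
removed = replace_surrogate_escapes(text, '')  # 'caf'
escaped = escape_surrogate_escapes(text)       # 'caf' + backslash + 'xe9'
recovered = redecode(text)                     # 'caf' + U+00E9
```

As MRAB notes above, none of these helps with bytes that decoded
"successfully" under the wrong codec; only the round trip through the
original bytes (redecode here) can repair those.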