[Python-Dev] Bytes path support

Sat Aug 23 10:02:25 CEST 2014

Chris Barker writes:

 > > The third is to specify the UTF-8 with the surrogate escape error
 > > handler.  This allows non-UTF-8 codes to be loaded into
 > > memory.

Read as bytes and decode incrementally.  If you hit a
UnicodeDecodeError, retry from the point of failure.
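
A minimal sketch of "decode, and retry from the failure point" might
look like the following.  The function name and the choice of latin-1
as the fallback codec are illustrative assumptions, not something
specified in this thread; only the offending bytes are handed to the
fallback, and decoding then resumes with the primary codec.

```python
def decode_with_retry(data: bytes, primary: str = "utf-8",
                      fallback: str = "latin-1") -> str:
    """Decode data with `primary`, retrying after each failure point.

    Hypothetical helper: bytes the primary codec rejects are decoded
    with `fallback` (an assumed choice), then decoding resumes.
    """
    pieces = []
    pos = 0
    while pos < len(data):
        try:
            pieces.append(data[pos:].decode(primary))
            break
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are offsets into the slice we passed.
            good_end = pos + exc.start
            bad_end = pos + exc.end
            pieces.append(data[pos:good_end].decode(primary))
            pieces.append(data[good_end:bad_end].decode(fallback))
            pos = bad_end  # retry from just past the offending bytes
    return "".join(pieces)
```

Re-slicing on every failure is quadratic in the worst case; for large
inputs codecs.getincrementaldecoder() would be the more natural tool.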

 > Just so I'm clear here -- if you write that back out, encoded as
 > utf-8 -- you'll get the exact same binary blob out as came in?

If and only if there are no changes to the content.
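
As an illustration (the sample bytes are made up), the round trip
through the surrogateescape handler can be checked like this.  Note
that the same handler has to be used when encoding; a strict encode
refuses the escaped bytes.

```python
# Valid UTF-8 ("caf\u00e9") followed by two bytes that are not valid UTF-8.
raw = b"caf\xc3\xa9 \xff\xfe tail"

text = raw.decode("utf-8", errors="surrogateescape")

# Unchanged content encodes back to the identical byte string.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# Without the handler, the escaped bytes cannot be encoded at all.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```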

 > I wonder if this would make it hard to preserve byte boundaries,
 > though.

I'm not sure what you mean by "byte boundaries".  If you mean that
they survive concatenation of such objects, yes: the uninterpretable
bytes are encoded in such a way as to be identifiable as lone bytes;
they won't be interpreted as Unicode characters.
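
A small sketch of the concatenation case, assuming the two halves of
a single two-byte UTF-8 character were decoded separately (the
variable names are illustrative).  Each stray byte becomes a lone
surrogate in the range U+DC80..U+DCFF, and joining the strings does
not turn them back into a character, although the bytes still come
out intact.

```python
# The UTF-8 encoding of U+00E9 is b"\xc3\xa9"; decode each byte alone.
head = b"\xc3".decode("utf-8", errors="surrogateescape")
tail = b"\xa9".decode("utf-8", errors="surrogateescape")

joined = head + tail

# The result is two lone surrogates, not the character U+00E9 ...
is_escaped = joined == "\udcc3\udca9"

# ... but encoding with the same handler restores the original bytes.
restored = joined.encode("utf-8", errors="surrogateescape")
```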

 > By the way, IIUC, you can also use the Python latin-1
 > decoder -- anything latin-1 will come through correctly, anything
 > not valid latin-1 will come in as garbage, but if you re-encode
 > with latin-1 the original bytes will be preserved. I think this
 > will also preserve a 1:1 relationship between character count and
 > byte count, which could be handy.

Bad idea, especially for Oleg's use case -- text decoded that way
can't be decoded with the correct codec without re-encoding it to
bytes first.  No point in abandoning codecs just because there isn't
one designed exactly for his use case.
Just read as bytes and decode piecewise in one way or another.  For
Oleg's HTML case, there's a well-understood structure that can be used
to determine retry points and a very few plausible coding systems,
which can be fairly well distinguished by the range of bytes used and
probably nearly perfectly with additional information from the
structure and distribution of apparently decoded characters.
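
To make both points concrete, here is a rough sketch.  The helper
name, the candidate list, and the sample text are assumptions for
illustration only.  It tries each candidate codec strictly on a chunk
and reports which ones accept it; latin-1 accepts every byte string,
which is why it says nothing about the real encoding.

```python
def plausible_codecs(chunk: bytes,
                     candidates=("utf-8", "koi8-r", "cp1251", "latin-1")):
    """Return the candidate codecs that decode chunk without error."""
    accepted = []
    for name in candidates:
        try:
            chunk.decode(name)  # strict: raises on undecodable bytes
        except UnicodeDecodeError:
            continue
        accepted.append(name)
    return accepted


word = "\u041f\u0440\u0438\u0432\u0435\u0442"  # a Russian word, "Privet"
utf8_chunk = word.encode("utf-8")
koi8_chunk = word.encode("koi8-r")

# latin-1 "succeeds" on the KOI8-R bytes but yields the wrong text;
# recovering the real text means going back to bytes first.
garbled = koi8_chunk.decode("latin-1")
recovered = garbled.encode("latin-1").decode("koi8-r")
```

Surviving a strict decode only narrows the field; choosing among the
survivors still needs the byte-range and character-distribution
evidence described above.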