[Python-Dev] Python3 "complexity"

Chris Barker chris.barker at noaa.gov
Thu Jan 9 22:36:05 CET 2014


This has all gotten a bit complicated because everyone has been thinking in
terms of actual encodings and actual text files. But I think the use-case
here is something different:

A file with a bunch of bytes in it, _some_ of which are ascii, and the rest
are other bytes (maybe binary data, maybe non-ascii-encoded text).

I think this is the use-case that "just worked" in py2, but doesn't in py3
-- i.e. in py3 you have to choose either the binary interpretation or the
ascii one, but you can't have both. If you choose ascii, it will barf when
you try to decode the non-ascii bytes; if you choose binary, you lose the
ability to do simple stuff with the ascii subset -- parsing, substitution, etc.

Some folks have suggested using latin-1 (or other 8-bit encoding) -- is
that guaranteed to work with any binary data, and round-trip accurately?
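
For what it's worth, here is a minimal sketch of the latin-1 question. As far
as I understand it, latin-1 maps each of the 256 byte values to the code point
with the same number, so any bytes should decode and re-encode unchanged. The
helper name roundtrip_latin1 and the sample data are made up for illustration.

```python
def roundtrip_latin1(data: bytes) -> bytes:
    """Decode bytes as latin-1 and encode them straight back."""
    text = data.decode("latin-1")   # never fails: every byte has a code point
    return text.encode("latin-1")   # fails only if text gained chars > U+00FF


# Every possible byte value survives the round trip.
all_bytes = bytes(range(256))
assert roundtrip_latin1(all_bytes) == all_bytes

# ASCII-level edits on the decoded string leave the binary tail untouched.
mixed = b"HEADER name=foo\n\x00\xff\x80\xfe"
text = mixed.decode("latin-1")
edited = text.replace("foo", "bar")
result = edited.encode("latin-1")
```

The catch is that the non-ascii bytes show up as (probably meaningless)
latin-1 characters in the string, and encoding back fails if the processing
step introduces any character above U+00FF.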

And will surrogateescape work for arbitrary binary data?
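
A small sketch of how that could be checked, assuming the "surrogateescape"
error handler from PEP 383 (Python 3.1 and later). With the ascii codec, each
byte >= 0x80 should become a lone surrogate in the range U+DC80..U+DCFF, and
encoding with the same handler should give the original bytes back. The
helper names decode_mixed and encode_mixed are hypothetical.

```python
def decode_mixed(data: bytes) -> str:
    """Decode ASCII, smuggling non-ASCII bytes through as lone surrogates."""
    return data.decode("ascii", errors="surrogateescape")


def encode_mixed(text: str) -> bytes:
    """Reverse decode_mixed: surrogates turn back into the original bytes."""
    return text.encode("ascii", errors="surrogateescape")


# Every possible byte value survives the round trip.
all_bytes = bytes(range(256))
assert encode_mixed(decode_mixed(all_bytes)) == all_bytes

# The ASCII part behaves like normal text; the rest rides along as surrogates.
sample = b"key=value\xff\x00\x80"
text = decode_mixed(sample)
escaped = text[len("key=value")]          # the character standing in for 0xff
edited = encode_mixed(text.replace("value", "other"))
```

One thing to watch: a string holding these surrogates cannot be encoded with
a strict codec, so it has to go back out through the same error handler.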

If this is a common need, then it would be nice for py3 to address it. I know
that I work with a couple of file formats that have text headers followed by
binary data (not as hard to deal with, but still harder in py3). And from
this discussion, it seems that "wire protocols" commonly mix ascii and
binary.
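
To make the header-plus-binary case concrete, here is a rough sketch. The
format is invented for illustration: an ascii header of "name: value" lines,
ended by a blank line, followed by raw binary. The function name split_header
and the separator are assumptions, not any real file format.

```python
def split_header(blob: bytes, sep: bytes = b"\n\n"):
    """Split a blob into a decoded ASCII header and an untouched binary payload."""
    head, found, payload = blob.partition(sep)
    if not found:
        raise ValueError("no header terminator found")
    # Only the header is decoded; the payload stays as bytes.
    return head.decode("ascii"), payload


blob = b"format: demo\nlength: 4\n\n\x00\x01\xfe\xff"
header, payload = split_header(blob)

# Parse the header with ordinary str methods.
fields = dict(line.split(": ", 1) for line in header.splitlines())
```

This works when the text and binary parts are cleanly separated; it is the
interleaved case that is harder to handle.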

So the decisions to be made:

Is this a use-case worth supporting in the standard library?

If so, how?
  1) add some of the basic stuff to the bytes object - i.e. string
formatting, what this all started with.
  2) create a custom encoding that could losslessly convert this mixture
to/from a unicode object. I'm not sure if that is even possible, but it
would be kind of cool.
  3) create a new object, neither a string nor a bytes object, that did what
we want (it would look a lot like the py2 string...)
  4) create a module for doing the stuff wanted with a bytes object (not
very OO)
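
As a rough idea of what option 4 might look like, here is a sketch of one
helper that works directly on bytes using the re module with bytes patterns.
The function name set_field and the "name=value" line layout are made up for
illustration; this is not a proposal for an actual API.

```python
import re


def set_field(data: bytes, name: bytes, value: bytes) -> bytes:
    """Replace the value of the first 'name=...' line, leaving other bytes alone."""
    pattern = re.compile(rb"^" + re.escape(name) + rb"=.*$", re.MULTILINE)
    # A function replacement avoids backslash interpretation in `value`.
    return pattern.sub(lambda m: name + b"=" + value, data, count=1)


data = b"name=foo\nmode=raw\n\x00\xff"
updated = set_field(data, b"mode", b"text")
```

Everything stays bytes from start to finish, at the cost of sprinkling b""
prefixes through the code.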

Does that clarify the discussion at all?

On Thu, Jan 9, 2014 at 2:15 AM, Kristján Valur Jónsson <
kristjan at ccpgames.com> wrote:

> This is the python 2 program:
> with open(fn1) as f1:
>     with open(fn2, 'w') as f2:
>         f2.write(process_text(f1.read()))
>

I think the key point here is that this worked because a common case was
ascii text and arbitrary binary mixed. As long as all the process_text()
stuff is ascii only, that would work, either with arbitrary binary data or
an ascii-compatible encoding. The fact that it would NOT work with arbitrarily
encoded data doesn't mean it's not useful for this special, but perhaps
common, case.
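
For comparison, one possible py3 version of the quoted program, as a sketch
that assumes the latin-1 approach holds up. process_text here is only a
stand-in for whatever ascii-only processing the real program does, and
copy_processed is a hypothetical wrapper name. Passing newline="" is meant to
stop the text layer from translating line endings inside the binary data.

```python
import os
import tempfile


def process_text(text: str) -> str:
    # Hypothetical stand-in for the ASCII-only processing step.
    return text.replace("old", "new")


def copy_processed(fn1: str, fn2: str) -> None:
    # latin-1 maps bytes to code points one-to-one, so nothing is lost;
    # newline="" disables newline translation on both read and write.
    with open(fn1, encoding="latin-1", newline="") as f1:
        with open(fn2, "w", encoding="latin-1", newline="") as f2:
            f2.write(process_text(f1.read()))


# Small demonstration on a temporary file mixing ASCII and binary bytes.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "in.bin")
dst = os.path.join(workdir, "out.bin")
with open(src, "wb") as f:
    f.write(b"old\r\n\xff\x00old")
copy_processed(src, dst)
with open(dst, "rb") as f:
    result = f.read()
```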

-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov