[Python-Dev] What to do for bytes in 2.6?

Sun Jan 20 05:26:43 CET 2008

On Jan 19, 2008 5:54 PM,  <glyph at divmod.com> wrote:
> On 19 Jan, 07:32 pm, guido at python.org wrote:
> >There is no way to know whether that return value means text or data
> >(plenty of apps legitimately read text straight off a socket in 2.x),
>
> IMHO, this is a stretch of the word "legitimately" ;-).  If you're
> reading from a socket, what you're getting are bytes, whether they're
> represented by str() or bytes(); correct code in 2.x must currently do a
> .decode("ascii") or .decode("charmap") to "legitimately" identify the
> result as text of some kind.
>
> Now, ad-hoc code with a fast and loose definition of "text" can still
> read arrays of bytes off a socket without specifying an encoding and get
> away with it, but that's because Python's unicode implementation has
> thus far been very forgiving, not because the data is cleanly text yet.

I would say that depends on the application, and on arrangements that
client and server may have made off-line about the encoding.
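
To make that concrete, here is a minimal sketch, assuming the two ends
have agreed on Latin-1 ahead of time. The names AGREED_ENCODING and
receive_text are made up for illustration, and the payload stands in
for whatever recv() would hand back.

```python
# Hypothetical: the encoding both ends agreed on out of band.
AGREED_ENCODING = "latin-1"

def receive_text(raw):
    """Interpret bytes read off a socket as text in the agreed encoding."""
    return raw.decode(AGREED_ENCODING)

# Stand-in for data returned by sock.recv(); 0xE9 is e-acute in Latin-1.
payload = b"caf\xe9"
text = receive_text(payload)
```

Whether the decode step is spelled out like this or left implicit, the
encoding is a property of the agreement between the two ends, not of
the socket.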

In 2.x, text can legitimately be represented as str -- there's even
the locale module to further specify how it is to be interpreted as
characters.

Sure, this doesn't work for full unicode, and it doesn't work for all
protocols used with sockets, but claiming that only fast and loose
code ever uses str to represent text is quite far from reality -- this
would be saying that the locale module is only for quick and dirty
code, which just ain't so.

> Why can't we get that warning in -3 mode just the same from something
> read from a socket and a b"" literal?

If you really want this, please think through all the consequences,
and report back here. While I have a hunch that it'll end up giving
too many false positives and at the same time too many false
negatives, perhaps I haven't thought it through enough. But if you
really think this'll be important for you, I hope you'll be willing to
do at least some of the thinking.

I believe that a constraint should be that by default (without -3 or a
__future__ import) str and bytes should be the same thing. Or, another
way of looking at this, reads from binary files and reads from sockets
(and other similar things, like ctypes and mmap and the struct module,
for example) should return str instances, not instances of a str
subclass by default -- IMO returning a subclass is bound to break too
much code. (Remember that there is still *lots* of code out there that
uses "type(x) is types.StringType" rather than "isinstance(x, str)",
and while I'd be happy to warn about that in -3 mode if we could, I
think it's unacceptable to break that in the default environment --
let it break in 3.0 instead.)
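
A small sketch of the breakage, assuming reads returned a str subclass.
TaggedStr is a hypothetical name, not a proposed type, and the exact
check is written as "type(x) is str", which is what the
types.StringType comparison amounts to in 2.x.

```python
class TaggedStr(str):
    """Hypothetical str subclass that a socket read might return."""

def is_string_exact(x):
    # Exact type comparison, the style still common in older code.
    return type(x) is str

def is_string_isinstance(x):
    # Subclass-friendly check.
    return isinstance(x, str)

plain = "abc"
tagged = TaggedStr("abc")

results = (
    is_string_exact(plain),        # plain str passes the exact check
    is_string_exact(tagged),       # the subclass instance does not
    is_string_isinstance(plain),
    is_string_isinstance(tagged),  # isinstance accepts the subclass
)
```

Code using the exact comparison would silently start treating such
values as "not a string", which is the breakage I want to avoid in the
default environment.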

> I've written lots of code that
> aggressively rejects str() instances as text, as well as unicode
> instances as bytes, and that's in code that still supports 2.3 ;).

Yeah, well, but remember, while keeping you happy is high on my list
of priorities, it's not the only priority. :-)

> >Really, the pure aliasing solution is just about optimal in terms of
> >bang per buck. :-)
>
> Not that I'm particularly opposed to the aliasing solution, either.  It
> would still allow writing code that was perfectly useful in 2.6 as well
> as 3.0, and it would avoid disturbing code that did checks of type("").

Right.
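
As a rough illustration of code that is useful in both, assuming bytes
is simply another name for str in 2.6: the helper frame() below is a
made-up example, not part of any proposal.

```python
import struct

def frame(payload):
    """Prefix payload with a 2-byte big-endian length field."""
    # With bytes aliased to str this accepts any 2.x str; under 3.0 it
    # rejects text strings and accepts only real bytes.
    if not isinstance(payload, bytes):
        raise TypeError("payload must be bytes")
    return struct.pack(">H", len(payload)) + payload

message = frame(b"hi")
```

The source reads the same in both versions; only the strictness of the
isinstance() check differs.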

> It would just remove an opportunity to get one potentially helpful
> warning.

I worry that the warning wouldn't come often enough, and that too
often it would be unhelpful. There will inevitably be some stuff where
you just have to try to convert the code using 2to3 and try to run it
under 3.0 in order to see if it works. And there's also the concern of
those who want to use 2.6 because it offers 2.5 compatibility plus a
fair number of new features, but who aren't interested (yet) in moving
up to 3.0. I expect that Google will initially be in this category
too.

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)