[Python-Dev] thoughts on the bytes/string discussion

Guido van Rossum guido at python.org
Thu Jun 24 22:59:09 CEST 2010


I see it a little differently (though there is probably a common
concept lurking in here).

The protocols you mention are intentionally designed to be
encoding-neutral as long as the encoding is an ASCII superset. This
covers ASCII itself, Latin-1, Latin-N for other values of N, MacRoman,
Microsoft's code pages (most of them anyway), UTF-8, presumably at
least some of the Japanese encodings, and probably a host of others.
But it does not cover UTF-16, EBCDIC, and others. (Encodings that have
"shift bytes" that change the meaning of some or all ordinary ASCII
characters also aren't covered, unless such an encoding happens to
exclude the special characters that the protocol spec cares about).

The protocol specs typically go out of their way to specify what byte
values they use for syntactically significant positions (e.g. ':' in
headers, or '/' in URLs), while hand-waving about the meaning of "what
goes in between" since it is all typically treated as "not of
syntactic significance". So you can write a parser that looks at bytes
exclusively, and looks for a bunch of ASCII punctuation characters
(e.g. '<', '>', '/', '&'), and doesn't know or care whether the stuff
in between is encoded in Latin-15, MacRoman or UTF-8 -- it never looks
"inside" stretches of characters between the special characters and
just copies them. (Sometimes there may be *some* sections that are
required to be ASCII, and there the equivalence of a-z and A-Z is well
defined.)
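A bytes-only parser of the kind described above can be sketched in a few
lines (a toy, editorial example -- not any stdlib parser; the function
name is made up). It splits header lines on the ASCII colon byte and
never decodes the values, so it behaves identically whether they are
Latin-1, MacRoman, or UTF-8:

```python
def parse_headers(raw: bytes) -> dict[bytes, bytes]:
    """Split b'name: value' lines on the colon byte, never decoding."""
    headers = {}
    for line in raw.split(b"\r\n"):
        if not line:
            continue
        # Only the ASCII byte b":" is syntactically significant; the
        # stretches on either side are copied without looking "inside".
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip()
    return headers

# The Latin-1 byte 0xE9 passes through untouched; it is never decoded.
hdrs = parse_headers(b"Host: example.com\r\nX-Name: Jos\xe9\r\n")
```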

But I wouldn't go so far as to claim that interpreting the protocols
as text is wrong. After all we're talking exclusively about protocols
that are designed intentionally to be directly "human readable"
(albeit as a fall-back option) -- the only tool you need to debug the
traffic on the wire or socket is something that knows which subset of
ASCII is considered "printable" and which renders everything else
safely as a hex escape or even a special "unknown" character (like
Unicode's "?" inside a black diamond).

Depending on the requirements of a specific app (or framework) it may
be entirely reasonable to convert everything to Unicode and process
the resulting text; in other contexts it makes more sense to keep
everything as bytes. It also makes sense to have an interface library
to deal with a specific protocol that treats the protocol side as
bytes but interacts with the application using text, since that is
often how the application programmer wants to treat it anyway.
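That "bytes on the protocol side, text on the application side" pattern
might be sketched like this (a hypothetical helper pair, not an existing
API; it assumes Latin-1 for the text mapping because Latin-1 maps every
byte 0x00-0xFF to the same code point, so the decode can never fail and
always round-trips):

```python
def headers_to_app(wire_headers: dict[bytes, bytes]) -> dict[str, str]:
    """Present wire bytes to the application as str, losslessly."""
    return {k.decode("latin-1"): v.decode("latin-1")
            for k, v in wire_headers.items()}

def headers_to_wire(app_headers: dict[str, str]) -> dict[bytes, bytes]:
    """Turn the application's str headers back into wire bytes."""
    return {k.encode("latin-1"): v.encode("latin-1")
            for k, v in app_headers.items()}
```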

Of course, some protocols require the application programmer to be
aware of bytes as well in *some* cases -- examples are email and HTTP,
which can be used to transfer text as well as binary data (e.g.
images). There is also the bootstrap problem where the wire data must
be partially parsed in order to find out the encoding to be used to
convert it to text. But that doesn't mean it's invalid to think about
it as text in many application contexts.
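The bootstrap problem mentioned here can be illustrated with a toy
sketch (not the email/http machinery; the function name and the details
are made up): the charset that tells us how to decode the body is itself
buried in the undecoded header bytes, so part of the message must be
parsed as bytes first.

```python
def sniff_and_decode(message: bytes) -> str:
    """Find the charset in the raw headers, then decode the body."""
    head, _, body = message.partition(b"\r\n\r\n")
    charset = "ascii"  # assumed protocol default for this sketch
    for line in head.split(b"\r\n"):
        if line.lower().startswith(b"content-type:"):
            _, sep, params = line.partition(b"charset=")
            if sep:
                # charset tokens are themselves required to be ASCII
                charset = params.split(b";")[0].strip().decode("ascii")
    return body.decode(charset)

text = sniff_and_decode(
    b"Content-Type: text/plain; charset=utf-8\r\n\r\nna\xc3\xafve")
```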

Regarding the proposal of a String ABC, I hope this isn't going to
become a backdoor to reintroduce the Python 2 madness of allowing
equivalency between text and bytes for *some* strings of bytes and not
others.

Finally, I do think that we should not introduce changes to the
fundamental behavior of text and bytes while the moratorium is in
place. Changes to specific stdlib APIs are fine however.

--Guido

On Thu, Jun 24, 2010 at 12:49 PM, Ian Bicking <ianb at colorstudy.com> wrote:
> On Thu, Jun 24, 2010 at 12:38 PM, Bill Janssen <janssen at parc.com> wrote:
>>
>> Here are a couple of ideas I'm taking away from the bytes/string
>> discussion.
>>
>> First, it would probably be a good idea to have a String ABC.
>>
>> Secondly, maybe the string situation in 2.x wasn't as broken as we
>> thought it was.  In particular, those who deal with lots of encoded
>> strings seemed to find it handy, and miss it in 3.x.  Perhaps strings
>> are more like numbers than we think.  We have separate types for int,
>> float, Decimal, etc.  But they're all numbers, and they all
>> cross-operate.  In 2.x, it seems there were two missing features: no
>> encoding attribute on str, which should have been there and should have
>> been required, and the default encoding being "ASCII" (I can't tell you
>> how many times I've had to fix that issue when a non-ASCII encoded str
>> was passed to some output function).
>
> I've started to form a conceptual notion that I think fits these cases.
>
> We've set up a system where we think of text as natively unicode, with
> encodings to put that unicode into a byte form.  This is certainly
> appropriate in a lot of cases.  But there's a significant class of problems
> where bytes are the native structure.  Network protocols are what we've been
> discussing, and are a notable case of that.  That is, b'/' is the most
> native sense of a path separator in a URL, or b':' is the most native sense
> of what separates a header name from a header value in HTTP.  To disallow
> unicode URLs or unicode HTTP headers would be rather anti-social, especially
> because unicode is now the "native" string type in Python 3 (as an aside for
> the WSGI spec we've been talking about using "native" strings in some
> positions like dictionary keys, meaning Python 2 str and Python 3 str, while
> being more exacting in other areas such as a response body which would
> always be bytes).
>
> The HTTP spec and other network protocols seem a little fuzzy on this,
> because they were written before unicode even existed, and even later activity
> happened at a point when "unicode" and "text" weren't widely considered the
> same thing like they are now.  But I think the original intention is
> revealed in a more modern specification like WebSockets, where they are very
> explicit that ':' is just shorthand for a particular byte, it is not "text"
> in our new modern notion of the term.
>
> So with this idea in mind it makes more sense to me that *specific pieces of
> text* can be reasonably treated as both bytes and text.  All the string
> literals in urllib.parse.urlunsplit() for example.
>
> The semantics I imagine are that special('/')+b'x'==b'/x' (i.e., it does not
> become special('/x')) and special('/')+'x'=='/x' (again it becomes str).  This
> avoids some of the cases of unicode or str infecting a system as they did in
> Python 2 (where you might pass in unicode and everything works fine until
> some non-ASCII is introduced).
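[These semantics could be sketched, purely hypothetically, as a str
subclass -- the name "special" comes from the post; nothing like this
exists in the stdlib, and the details here are guesses:]

```python
class special(str):
    """ASCII-only text that can combine with both bytes and str."""
    def __add__(self, other):
        if isinstance(other, bytes):
            # Fall out to plain bytes: the result is not special.
            return self.encode("ascii") + other
        # Fall out to plain str: str.__add__ never returns a subclass,
        # so the specialness does not "infect" the rest of the system.
        return str.__add__(self, other)

url_part = special('/') + b'x'   # plain bytes b'/x'
text_part = special('/') + 'x'   # plain str '/x'
```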
>
> The one place where this might be tricky is if you have an encoding that is
> not ASCII compatible.  But we can't guard against every possibility.  So it
> would be entirely wrong to take a string encoded with UTF-16 and start to
> use b'/' with it.  But there are other nonsensical combinations already
> possible, especially with polymorphic functions; we can't guard against all
> of them.  Also I'm unsure if something like UTF-16 is in any way compatible
> with the kind of legacy systems that use bytes.  Can you encode your
> filesystem with UTF-16?  I don't think you could encode a cookie with it.
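[The UTF-16 incompatibility is easy to demonstrate: every code unit is
two bytes, so the single byte 0x2F no longer delimits anything cleanly,
and splitting on a bare b'/' strands half a code unit on each side:]

```python
path = "a/b"

# Against an ASCII superset, splitting on the byte b"/" works:
utf8_parts = path.encode("utf-8").split(b"/")       # [b"a", b"b"]

# In UTF-16-LE each ASCII character carries a trailing NUL byte,
# so the same byte-level split produces mangled fragments:
utf16_parts = path.encode("utf-16-le").split(b"/")  # [b"a\x00", b"\x00b\x00"]
```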
>
>> So maybe having a second string type in 3.x that consists of an encoded
>> sequence of bytes plus the encoding, call it "estr", wouldn't have been
>> a bad idea.  It would probably have made sense to have estr cooperate
>> with the str type, in the same way that two different kinds of numbers
>> cooperate, "promoting" the result of an operation only when necessary.
>> This would automatically achieve the kind of polymorphic functionality
>> that Guido is suggesting, but without losing the ability to do
>>
>>  x = e(ASCII)"bar"
>>  a = ''.join(["foo", x])
>>
>> (or whatever the syntax for such an encoded string literal would be --
>> I'm not claiming this is a good one) which I presume would bind "a" to a
>> Unicode string "foobar" -- we'd have to work out what gets promoted to what.
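[The estr idea might be sketched like this -- entirely hypothetical,
with the class name and promotion behavior taken from the post and the
details guessed by analogy with numeric promotion (int + float gives
float, estr + str gives str):]

```python
class estr:
    """Bytes plus a known encoding; promotes to str when mixed with str."""
    def __init__(self, data: bytes, encoding: str):
        self.data = data
        self.encoding = encoding

    def __add__(self, other):
        if isinstance(other, estr) and other.encoding == self.encoding:
            # Same encoding: stay in the cheaper encoded representation.
            return estr(self.data + other.data, self.encoding)
        if isinstance(other, str):
            # Mixed operation: "promote" to the wider type, str.
            return self.data.decode(self.encoding) + other
        return NotImplemented

    def __radd__(self, other):
        if isinstance(other, str):
            return other + self.data.decode(self.encoding)
        return NotImplemented

x = estr(b"bar", "ascii")
a = "foo" + x   # promotes to the Unicode string "foobar"
```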
>
> I would be entirely happy without a literal syntax.  But as Phillip has
> noted, this can't be implemented *entirely* in a library as there are some
> constraints with the current str/bytes implementations.  Reading PEP 3003
> I'm not clear if such changes are part of the moratorium?  They seem like
> they would be (sadly), but it doesn't seem clearly noted.
>
> I think there's a *different* use case for things like
> bytes-in-a-utf8-encoding (e.g., to allow XML data to be decoded lazily), but
> that could be yet another class, and maybe shouldn't be polymorphically usable
> as bytes (i.e., treat it as an optimized str representation that is
> otherwise semantically equivalent).  A String ABC would formalize these
> things.
>
> --
> Ian Bicking  |  http://blog.ianbicking.org
>



-- 
--Guido van Rossum (python.org/~guido)