[Python-Dev] methods on the bytes object

Mon May 1 10:07:55 CEST 2006

"Martin v. Löwis" <martin at v.loewis.de> wrote:
> 
> Josiah Carlson wrote:
> >> I mean unicode strings, period. I can't imagine what "unicode strings
> >> which do not contain data" could be.
> > 
> > Binary data as opposed to text.  Input to array.fromstring(),
> > struct.unpack(), etc.
> 
> You can't/shouldn't put such data into character strings: you need
> an encoding first.

Certainly that is the case.  But how would you propose embedded bytes
data be represented? (I talk more extensively about this particular
issue later).

> Neither array.fromstring nor struct.unpack will
> produce/consume type 'str' in Python 3; both will operate on the
> bytes type. So fromstring should probably be renamed frombytes.

Um...struct.unpack() already works on unicode...
    >>> struct.unpack('>L', u'work')
    (2003792491L,)
As does array.fromstring...
    >>> a = array.array('B')
    >>> a.fromstring(u'work')
    >>> a
    array('B', [119, 111, 114, 107])

... assuming that all characters are in the 0...127 range.  But that's a
different discussion.
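
For comparison, a minimal sketch of the bytes-only behaviour Martin
describes.  It assumes an interpreter in which struct.unpack() accepts
only bytes-like input and array.array offers frombytes(), as Python 3.2
and later do; the payload is just the 'work' example from above.

```python
import array
import struct

payload = b'work'  # four raw bytes, not text

# struct.unpack() consumes bytes and yields one big-endian unsigned long.
(value,) = struct.unpack('>L', payload)

# frombytes() is the bytes-based counterpart of the old fromstring().
buf = array.array('B')
buf.frombytes(payload)

# Under this assumption, handing over a str with no encoding is rejected.
try:
    struct.unpack('>L', 'work')
except TypeError:
    str_rejected = True
else:
    str_rejected = False
```

Under those assumptions value matches the 2003792491 shown above and
buf holds the same four byte values.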

> > Certainly it is the case that right now strings are used to contain
> > 'text' and 'bytes' (binary data, encodings of text, etc.).  The problem
> > is in the ambiguity of Python 2.x str containing text where it should
> > only contain bytes. But in 3.x, there will continue to be an ambiguity,
> > as strings will still contain bytes and text (parsing literals, see the
> > somewhat recent argument over bytes.encode('base64'), etc.).
> 
> No. In Python 3, type 'str' cannot be interpreted to contain bytes.
> Operations that expect bytes and are given type 'str', and no encoding,
> should raise TypeError.

I am apparently not communicating this particular idea effectively
enough.  How would you propose that I store parsing literals for
non-textual data, and how would you propose that I set up a dictionary
to hold some non-trivial number of these parsing literals?  I don't want
a vague "you can't do X", I want a "here's the code you would use".

From what I understand, it would seem that you would suggest that I use
something like the following...

handler = {bytes('...', encoding=...).decode('latin-1'): ...,
           #or
           '\uXXXX\uXXXX...': ...,
           #or even without bytes/str
           (0xXX, 0xXX, ...): ..., }

Note how two of those examples have non-textual data inside of a Python
3.x string?  Yeah.
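
To make the three spellings concrete, here is a rough sketch.  The
4-byte tag is the start of the PNG file signature, the handler function
is a hypothetical placeholder, and bytes(list) and .decode('latin-1')
are assumed to behave as they do in released Python 3.

```python
def handle_png(data):
    """Hypothetical handler; stands in for real parsing code."""
    return 'png'

# Raw tag built from integers, so no str is involved in creating it.
png_tag = bytes([0x89, 0x50, 0x4E, 0x47])

# Spelling 1: decode the bytes into a str so it can serve as a key.
by_decoded_str = {png_tag.decode('latin-1'): handle_png}

# Spelling 2: write the same code points directly as a str literal.
by_escaped_str = {'\u0089PNG': handle_png}

# Spelling 3: avoid bytes and str entirely and key on a tuple of ints.
by_int_tuple = {(0x89, 0x50, 0x4E, 0x47): handle_png}

# The first two keys are ordinary str objects holding non-textual data.
str_keys_match = list(by_decoded_str) == list(by_escaped_str)
```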

> > We've not removed the problem, only changed it from being contained in
> > non-unicode strings to be contained in unicode strings (which are 2 or
> > 4 times larger than their non-unicode counterparts).
> 
> We have removed the problem.

Excuse me?  People are going to use '...' to represent literals of all
different kinds.  Whether these are text literals, binary data literals,
encoded binary data blobs (see the output of img2py.py from wxPython),
whatever.  We haven't removed the problem, we've only forced all
string literals to be unicode; foolish consistency and all that.

> > Within the remainder of this email, there are two things I'm trying to
> > accomplish:
> > 1. preserve the Python 2.x string type
> 
> I would expect that people try that. I'm -1.

I also expect that people will try to make it happen; I am (and I'm
certainly not a visionary when it comes to programming language features). 
I would also hope that others are able to see that immutable unicode and
mutable bytes aren't necessarily sufficient, especially when the
standard line will be something like "if you are putting binary data
inside of a unicode string, you are doing it wrong".  That matters all
the more because, unless one jumps through the hoops of defining bytes
data as bytes(list/tuple) rather than bytes('...', encoding=...), one
is technically still storing bytes data as unicode strings.

> > 2. make the bytes object more palatable regardless of #1
> 
> This might be good, but we have to be careful to not create a type
> that people would casually use to represent text.

Certainly.  But by lacking #1, we will run into a situation where Python
3.x strings will be used to represent bytes.  Understand that I'm also
trying to differentiate the two cases (and thinking further, a bytes
literal would allow users to differentiate them without needing to use
bytes('...', ...) ).

In the realm of palatability, giving bytes objects most of the current
string methods (with perhaps .read(), .seek(), and, for mutable bytes,
.write()) would, I think, go a long way (if not all the way) towards
being more than satisfactory.

> > I do, however, believe that the Python 2.x string type is very useful
> > from a data parsing/processing perspective.
> 
> You have to explain your terminology somewhat better here: What
> applications do you have in mind when you are talking about
> "parsing/processing"? To me, "parsing" always means "text", never
> "raw bytes". I'm thinking of the Chomsky classification of grammars,
> EBNF, etc. when I hear "parsing".

What does pickle.load(...) do to the files that are passed into it?  It
reads the (possibly binary) data from a file (or file-like object),
performs a particular operation based on a dictionary of expected
tokens in the file, and produces a Python object.  I would say that
pickle 'parses' the content of a file, which I presume isn't
necessarily text.

Replace pickle with a structured data storage format of your choice.
It's still parsing (at least according to my grasp of English, which
could certainly be flawed (my wife says as much on a daily basis)).
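
As a sketch of what I mean by token-driven parsing of binary data,
consider the toy format below.  It is entirely hypothetical (the 'I' and
'S' tags are not pickle's real opcodes), and it assumes an immutable,
hashable bytes type with a b'...' literal so that tokens can be
dictionary keys.

```python
import struct

def read_int(buf, pos):
    """Read a 4-byte big-endian signed int starting at pos."""
    (value,) = struct.unpack_from('>i', buf, pos)
    return value, pos + 4

def read_str(buf, pos):
    """Read a 1-byte length followed by that many UTF-8 bytes."""
    (length,) = struct.unpack_from('>B', buf, pos)
    start = pos + 1
    end = start + length
    return buf[start:end].decode('utf-8'), end

# Dictionary of expected tokens, each mapped to the operation to perform.
HANDLERS = {b'I': read_int, b'S': read_str}

def parse(buf):
    """Walk the buffer, dispatching on each one-byte token."""
    items = []
    pos = 0
    while pos < len(buf):
        token = buf[pos:pos + 1]
        value, pos = HANDLERS[token](buf, pos + 1)
        items.append(value)
    return items

sample = b'I' + struct.pack('>i', 42) + b'S' + bytes([2]) + b'hi'
parsed = parse(sample)
```

Nothing in the loop treats the input as text until read_str() explicitly
decodes a field.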

> > Look how successful and
> > effective it has been so far in the history of Python.  In order to make
> > the bytes object be as effective in 3.x, one would need to add basically
> > all of the Python 2.x string methods to it
> 
> The precondition of this clause is misguided: the bytes type doesn't
> need to be as effective, since the string type is as effective in 2.3,
> so you can do all parsing based on strings.

Not if those strings contain binary data.  I thought we were
disambiguating what Python 3.x strings are supposed to contain?  If they
contain binary data, then there isn't a disambiguation as to their
content, and we end up with the same situation we have now (only worse
because now your data takes up twice as much memory and takes twice as
much time to process).

> > (having some mechanism to use
> > slices of bytes objects as dictionary keys (if data[:4] in handler: ...
> > -> if tuple(data[:4]) in handler: ... ?) would also be nice).
> 
> You can't use the bytes type as a dictionary key because it is
> mutable. Use the string type instead.

I meant that currently, if I have data as a Python 2.x string, and I
were to perhaps handle the current portion of the string via...
    if data[:4] in handler:
...that when my data becomes bytes in 3.x (because it isn't text, and
non-text shouldn't be in the 3.x string, if I understand our discussions
about it correctly), then I would need to use...
    if tuple(data[:4]) in handler:
or even
    if data[:4].decode('latin-1') in handler:
...because data is mutable.

I was expressing that being able to leave out the tuple() (or .decode())
would be convenient, though not necessary.  If given an immutable bytes
type, data to be parsed would likely be immutable, so data[:4] could be
hashable (I personally rarely mutate input data that is being parsed,
and I would suggest, if there is a choice between mutable/immutable file
reads, etc., that it be immutable).
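
A small sketch of the difference.  It borrows bytearray as a stand-in
for the mutable bytes type and bytes as a stand-in for the immutable
one, which is an assumption about naming rather than anything decided
in this thread; the 4-byte tag is a placeholder, not a real format.

```python
# Hypothetical 4-byte record tag.
TAG = bytes([0x00, 0x01, 0x02, 0x03])

handler = {TAG: 'header'}                          # immutable bytes key
tuple_handler = {tuple(TAG): 'header'}             # tuple-of-ints key
text_handler = {TAG.decode('latin-1'): 'header'}   # str stand-in key

mutable_data = bytearray(TAG + b'payload')
immutable_data = bytes(mutable_data)

# A slice of mutable data is itself mutable, hence unhashable.
try:
    mutable_data[:4] in handler
    mutable_slice_hashable = True
except TypeError:
    mutable_slice_hashable = False

# The two workarounds for mutable data.
via_tuple = tuple(mutable_data[:4]) in tuple_handler
via_decode = mutable_data[:4].decode('latin-1') in text_handler

# With immutable data the lookup needs no conversion at all.
direct = immutable_data[:4] in handler
```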

> > So, what to do?  Rename Python 2.x str to bytes.  The name of the type
> > now confers the idea that it should contain bytes, not strings.
> 
> It seems that you want an immutable version of the bytes type. As I
> don't understand what "parsing" is, I cannot see the need for it;

I hope I've made what I consider parsing sufficiently clear.

> I think having two different bytes types is confusing.

I think that the difference between the immutable and mutable bytes types
will be clear, especially because they are given different names, and
because attempting to assign to read-only bytes would raise an
AttributeError, "'[immutable]bytes' object has no attribute '...'" (if
one wanted to be clever, one could even have it raise TypeError,
"'[immutable]bytes' object is not writable").

There are at least two other situations in which there are mutable and
immutable simple variants of the same structure: set/frozenset and
list/tuple (though the latter instance doesn't support the same API in
both objects, due to their significantly different use-cases).  One of
the reasons I think that a bytes/mutablebytes pair should have a very similar
API, is because their use-cases may very well overlap, in a similar
fashion to how set/frozenset and list/tuple sometimes overlap (I
remember a discussion about giving tuples a list.index-like method, and
even another about giving one or the other a str.find-like method).
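
The set/frozenset pairing illustrates the kind of overlap I have in
mind.  A brief sketch using only the builtin types; the variable names
are illustrative.

```python
mutable_tags = {1, 2}
frozen_tags = frozenset(mutable_tags)

# The non-mutating API is shared: both variants support union.
union_from_set = mutable_tags | {3}
union_from_frozen = frozen_tags | {3}

# Only the immutable variant can be hashed and used as a dictionary key.
lookup = {frozen_tags: 'seen'}
try:
    hash(mutable_tags)
    set_hashable = True
except TypeError:
    set_hashable = False

# Only the mutable variant carries mutating methods such as add().
frozen_can_add = hasattr(frozen_tags, 'add')
```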

I would also point out that duplication of very similar functionality in
Python types is not that uncommon (beyond set/frozenset and list/tuple).
See the 4 (soon 3) different kinds of numbers in Python 2.4, and the 5
different ways of representing dates and times.

 - Josiah