[Python-Dev] Byte string class hierarchy

Thu Aug 19 00:16:33 CEST 2004

I may have missed a crucial bit of the discussion, having been away, so 
if this is completely beside the point let me know. But my feeling is 
that the crucial bit is the type inheritance graph of all the byte and 
string types. And I wonder whether the following graph would help us 
solve most problems (aside from introducing a new one that may be a 
showstopper):

genericbytes
	mutablebytes
	bytes
		genericstring
			string
			unicode

The basic type for all bytes, buffers and strings is genericbytes. This 
abstract base type is neither mutable nor immutable, and has the 
interface that all of the types would share. Mutablebytes adds slice 
assignment and such. Bytes, on the other hand, adds hashing. 
genericstring is the magic stuff that's there already that makes 
unicode and string interoperable for hashing and dict keys and such.
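
To make the graph concrete, here is a rough sketch of it as classes. It
is written for a modern Python 3 interpreter, and every class name in it
is hypothetical (capitalised only so that it doesn't shadow the built-in
bytes). The method bodies are my guess at the minimum each level adds,
not a worked-out design.

```python
class GenericBytes:
    """Abstract base: the read-only interface every type would share."""

    def __init__(self, data=b""):
        self._data = bytes(data)   # the stored bits, never interpreted

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        return self._data[index]

    __hash__ = None                # the base type promises no hashing


class MutableBytes(GenericBytes):
    """Adds slice assignment and such; stays unhashable."""

    def __init__(self, data=b""):
        self._data = bytearray(data)

    def __setitem__(self, index, value):
        self._data[index] = value


class Bytes(GenericBytes):
    """The immutable branch: adds hashing (and equality to go with it)."""

    def __eq__(self, other):
        return isinstance(other, Bytes) and self._data == other._data

    def __hash__(self):
        return hash(self._data)


class GenericString(Bytes):
    """Placeholder for the string/unicode interoperability layer."""


class String(GenericString):
    pass


class Unicode(GenericString):
    pass
```

With this layout issubclass(Unicode, Bytes) holds while
issubclass(MutableBytes, Bytes) does not, which is the split between the
hashable and the mutable branch described above.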

Casting to a basetype is always free and doesn't copy anything, i.e. 
the bits stay the same. 'foo' in sourcecode is a string, and if you 
cast it to bytes you'll just get the bits, which is pretty much the 
same as what you get now. If you really want to make sure you get an 
8-bit ascii representation even if you run in an interpreter built with 
UCS4 as the default character set you must use 
bytes('foo'.encode('ascii')).

Casting to a subtype may cause a copy, but does not modify the bits. 
Casting sideways copies, and may modify the bits too: this is the current 
unicode encode/decode stuff. These two rules mean that unicode('foo') is 
something different from unicode(bytes('foo')), and probably illegal to 
boot, but I don't think that's too much of a problem: you shouldn't 
explicitly cast to bytes() unless you really want uninterpreted bits.
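
The three kinds of cast can be approximated with types that exist in a
modern Python 3 interpreter. This is only an analogy: memoryview stands
in for a free cast to a base type, bytearray for a cast that copies
without touching the bits, and an explicit codec for the sideways cast.
None of it is the proposed API.

```python
raw = b"foo"

# Free cast: a view onto the same bits, nothing is copied.
view = memoryview(raw)
assert view.obj is raw
assert view.tobytes() == raw

# Copying cast: a new object holding the same bits.
mutable = bytearray(raw)
assert mutable == raw
mutable[0] = ord("b")
assert raw == b"foo"          # the original is untouched

# Sideways cast: the bits are interpreted through a codec, and encoding
# the result with a different codec gives different bits.
text = raw.decode("ascii")
assert text == "foo"
ucs4_bits = text.encode("utf-32-le")   # four bytes per character
assert len(ucs4_bits) == 12
assert ucs4_bits != raw
```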

Operations like concatenation return the most specialised class. 
Mutablebytes is the only problem case here; we should probably forbid 
concatenating these with the others. The alternatives (return 
mutablebytes, return the other one, return the type of the first 
operand) all seem somewhat random.
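
A possible reading of that rule, as a sketch with hypothetical class
names (Python 3 syntax). The result type is whichever operand's class is
the subclass of the other, and mixing the mutable type with anything
else raises TypeError. Refusing operands where neither class derives
from the other is my own assumption; the rule above doesn't cover it.

```python
class GenericBytes:
    def __init__(self, data=b""):
        self.data = bytes(data)

    def __add__(self, other):
        if not isinstance(other, GenericBytes):
            return NotImplemented
        # Forbid mixing the mutable type with any of the others.
        if isinstance(self, MutableBytes) != isinstance(other, MutableBytes):
            raise TypeError("cannot concatenate mutablebytes with other types")
        # Return the most specialised of the two operand classes.
        if isinstance(other, type(self)):
            result_type = type(other)
        elif isinstance(self, type(other)):
            result_type = type(self)
        else:
            raise TypeError("neither operand type derives from the other")
        return result_type(self.data + other.data)


class MutableBytes(GenericBytes):
    pass


class Bytes(GenericBytes):
    pass


class String(Bytes):
    pass
```

So Bytes(b"a") + String(b"b") and String(b"a") + Bytes(b"b") both come
out as String, while MutableBytes(b"a") + Bytes(b"b") is refused.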

Read() is guaranteed only to return genericbytes, but if you open a 
file in textmode it will return strings, and we should add the ability 
to open files for unicode and probably mutablebytes too. I'm not sure 
about socket.recv() and such, but something similar probably holds. 
Readline() really shouldn't be allowed on files open in binary mode, 
but forbidding it may be a bit too much.

Write and friends accept genericbytes, and binary files will just dump 
the bits. Files open in text mode may convert unicode and string 
objects between representations.
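
For comparison only, the io module of a modern Python 3 interpreter
behaves roughly along these lines: a binary stream hands back the bits
as they are, and a text layer on top converts between representations.
The sketch uses in-memory streams rather than real files, and the UTF-8
encoding is an arbitrary choice for the example.

```python
import io

# Text layer on top of a binary stream: writing converts text to bits.
raw = io.BytesIO()
text_layer = io.TextIOWrapper(raw, encoding="utf-8", newline="")
text_layer.write("f\u00f6\u00f6\n")
text_layer.flush()
stored = raw.getvalue()

# Binary read: just the bits.
binary_read = io.BytesIO(stored).read()

# Text read: the bits converted back through the codec.
text_read = io.TextIOWrapper(io.BytesIO(stored), encoding="utf-8").read()
```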

The bad news (aside from any glaring holes I may have overlooked in the 
above: shoot away!) is that I don't know what to do for hash on bytes 
objects. On the one hand I would like hash('foo') == 
hash(bytes('foo')). But that leads to also wanting hash(u'foo') == 
hash(bytes(u'foo')), and we can't really have that because hash('foo') 
== hash(u'foo') is needed to make string/unicode interoperability for 
dictionaries work. Note that for the value 'foo' this isn't a problem, 
but for 'föö' (that's F O-UMLAUT O-UMLAUT) it is. So it seems that 
making hash('foo') != hash(bytes('foo')) is the only reasonable 
solution (and probably also a good idea with the future in mind: 
explicit is better than implicit, so just put a cast there if you want 
the binary bits to be interpreted as an ASCII or Unicode string!), but 
it will probably break existing code.
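
One way to see why 'foo' is harmless and 'föö' is not, sketched in
Python 3. Latin-1 and UTF-8 are stand-ins I picked for "two possible
representations of the same text", and bits_hash and bits_agree are
hypothetical helpers. A hash computed from the uninterpreted bits can
only match across representations when the bits themselves match.

```python
def bits_hash(bits):
    """Hypothetical hash for a bytes object: depends only on the bits."""
    return hash(bytes(bits))


def bits_agree(text, encodings=("latin-1", "utf-8")):
    """True if every listed encoding yields the same bits for text."""
    return len({text.encode(name) for name in encodings}) == 1


# Pure ASCII: both representations carry identical bits, so a
# bits-based hash is the same whichever one you started from.
assert bits_agree("foo")
assert bits_hash("foo".encode("latin-1")) == bits_hash("foo".encode("utf-8"))

# With the umlauts the bits differ, so there is no single "hash of the
# bits" that both representations could share.
assert not bits_agree("f\u00f6\u00f6")
```
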
--
Jack Jansen, <Jack.Jansen at cwi.nl>, http://www.cwi.nl/~jack
If I can't dance I don't want to be part of your revolution -- Emma 
Goldman