[Python-ideas] Python 3.x and bytes

Terry Reedy tjreedy at udel.edu
Thu May 19 05:10:01 CEST 2011


On 5/18/2011 4:10 PM, Ethan Furman wrote:
> As those who have to work with byte strings know, when retrieving a
> single character from a byte string, what you get back is not a byte
> string, but an int -- a rather important distinction from unicode
> strings (str).

For all sequences, slicing (if it works at all) returns a subsequence 
(possibly of length 0, which is why slicing can work with out-of-bounds 
slice points). For all (built-in) sequences except for strings, indexing 
returns a member of the sequence (which is why it raises an exception 
for out-of-bounds indexes). Leaving aside extension and user-defined 
sequences, strings are unique in instead returning a length-1 
subsequence. So bytes are normal while strings are anomalous!
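
A quick way to check the claims above in a Python 3 interpreter. This is 
only a sketch; the variable names are mine and chosen for illustration.

```python
data = b"abc"        # bytes
text = "abc"         # str
items = [1, 2, 3]    # list

# Slicing always gives a subsequence of the same type.
slices = (data[0:1], text[0:1], items[0:1])

# Out-of-bounds slice points just give an empty subsequence.
empty_slices = (data[10:20], text[10:20], items[10:20])

# Indexing gives a member: an int for bytes, the element for a list.
byte_member = data[0]
list_member = items[0]

# str is the odd one out: indexing gives a length-1 str,
# which can itself be indexed again, as often as you like.
char_member = text[0]
still_a_str = text[0][0][0]

# Out-of-bounds indexing raises IndexError for all three.
out_of_bounds_errors = 0
for seq in (data, text, items):
    try:
        seq[10]
    except IndexError:
        out_of_bounds_errors += 1

print(slices, empty_slices, byte_member, char_member, out_of_bounds_errors)
```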

Why that anomaly? The immediate reason is that Python does not have a 
separate character type. Why not? Guido might best answer (but he might 
say 'my gut instinct'), but I can think of a few reasons.

1. That is how it is in the (math) theory of strings. 'A' is both a char 
and a string of length one. There is no separate 'char' type that cannot 
be added (concatenated) to other strings of whatever length.

2. (Related) This pragmatically works best for Python.

3. Python follows Occam's principle by not introducing types without 
necessity. And a separate char type is not *necessary*.

4. Text strings are homogeneous arrays (like the arrays in the array 
module), unlike heterogeneous tuples and lists. So they need not be 
sequences of Python objects, and for efficiency, would not be even if 
there were a character type. Like other arrays, they contain the 
information needed to produce Python objects on demand without actually 
containing such objects in the way tuples, lists, and dicts do.
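
Points 1 and 4 can be illustrated roughly as follows. This is a sketch 
using the array module as a stand-in for how a string might store its 
contents; it says nothing about how str is actually implemented, and the 
names are only for illustration.

```python
from array import array

# Point 1: a "character" is just a str of length one and concatenates freely.
char = "A"
joined = char + "BC"

# By contrast, a member of a bytes object is an int, which does not.
try:
    b"ABC"[0] + b"BC"
    int_concat_failed = False
except TypeError:
    int_concat_failed = True

# Point 4: an array stores raw machine values, not Python objects.
codes = array("B", [72, 105])   # typecode "B": unsigned bytes
raw = codes.tobytes()           # the packed underlying data

# An int object is only produced when an item is asked for.
first = codes[0]

print(joined, int_concat_failed, codes.itemsize, raw, first)
```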

I do, however, understand the tendency to think of bytes as strings 
because of both Python's history and the remnant string interface.

For people using non-Latin (non-ASCII) alphabets, the 'convenience' of 
replacing some bytes with ASCII chars might be less convenient.
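
As a small illustration of that last point (a sketch; the sample words 
are arbitrary and the choice of UTF-8 is my assumption), compare how the 
repr of a bytes object looks for ASCII and non-ASCII text:

```python
ascii_word = "spam".encode("utf-8")
greek_word = "σπαμ".encode("utf-8")

# ASCII text is shown as readable characters in the repr.
ascii_repr = repr(ascii_word)

# Non-ASCII text is shown as hex escapes, two bytes per Greek letter here.
greek_repr = repr(greek_word)

print(ascii_repr)
print(greek_repr)
print(len(ascii_word), len(greek_word))
```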

-- 
Terry Jan Reedy



