unicode encoding usability problem

Nick Coghlan ncoghlan at iinet.net.au
Sun Feb 20 07:05:50 EST 2005


Martin v. Löwis wrote:
>> How about
>>
>>   b'' - 8bit string; '' unicode string
>>
>> and no automatic conversion.
> 
> 
> This has been proposed before, see PEP 332. The problem is that
> people often want byte strings to be mutable as well, so it is
> still unclear whether it is better to make the b prefix denote
> the current string type (so it would be currently redundant)
> or a newly-created mutable string type (similar to array.array).

Having "", u"", and r"" be immutable while b"" is mutable would seem rather 
inconsistent.

If you want a phased migration to 'assert (str is unicode) == True', then PEP 
332 seems to have that covered:

1. Introduce 'bytes' as an alias of str
2. Introduce b"" as an alternate spelling of r""
3. Switch str to be an alias of unicode
4. Switch "" to be an alternate spelling of u""

Trying to intermingle this with making the bytes type mutable seems to be 
begging for trouble - consider how many string-keyed dictionaries would break 
with that change (the upgrade path is non-existent - you can't stay with str, 
because you want byte strings, but you can't go to bytes, because you need 
something immutable).
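
To make the dictionary problem concrete, here is a minimal sketch. It uses 
array.array as a stand-in for a mutable byte string, which is purely an 
assumption for illustration (no mutable bytes type exists yet), and 
usable_as_dict_key is a hypothetical helper, not anything from PEP 332:

```python
import array


def usable_as_dict_key(value):
    """Return True if value can be used as a dictionary key."""
    try:
        {value: None}
    except TypeError:
        # Unhashable objects (which includes array.array) end up here.
        return False
    return True


immutable_key = "header"                    # an ordinary immutable string
mutable_key = array.array("B", [104, 101])  # stand-in for a mutable byte string

print(usable_as_dict_key(immutable_key))  # True
print(usable_as_dict_key(mutable_key))    # False
```

Any code that currently keys a dictionary on a str would hit the second case 
if its replacement type were mutable.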

An alternative would be to have "bytestr" be the immutable type corresponding to 
the current str (with b"" literals producing bytestr instances), while reserving the 
"bytes" name for a mutable byte sequence. That is, change PEP 332's upgrade path 
to look more like:

     * Add a bytestr builtin which is just a synonym for str. (2.5)
     * Add a b"..." string literal which is equivalent to raw string literals,
       with the exception that values which conflict with the source encoding
       of the containing file do not generate warnings. (2.5)
     * Warn about the use of variables named "bytestr". (2.5 or 2.6)
     * Introduce a bytestr builtin which refers to a sequence distinct from
       the str type. (2.6)
     * Make str a synonym for unicode. (3.0)

And separately:
    * Introduce a bytes builtin which is a mutable byte sequence

Alternately, add array.bytes as a subclass of array.array, that provides a nicer 
API for dealing specifically with byte strings.
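
As a rough idea of what such a subclass might look like, here is a sketch. 
The class name ByteBuffer and its as_immutable method are hypothetical, and 
the code is written against a Python 3 interpreter (where b"" literals and 
array.array.tobytes already exist), so treat it as an illustration of the 
shape of the API rather than a proposal for the actual spelling:

```python
import array


class ByteBuffer(array.array):
    """Hypothetical mutable byte sequence built on array.array."""

    def __new__(cls, data=b""):
        # Fix the typecode to unsigned bytes so callers only pass the data.
        return super().__new__(cls, "B", data)

    def as_immutable(self):
        """Return an immutable copy, suitable for use as a dictionary key."""
        return self.tobytes()


buf = ByteBuffer(b"abc")
buf[0] = ord("x")          # in-place mutation works
buf.append(ord("d"))
snapshot = buf.as_immutable()
print(snapshot)            # b'xbcd'
```

The mutable buffer and the immutable snapshot stay separate types, which is 
the split the upgrade path above relies on.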

The main point being, the replacement for 'str' needs to be immutable or the 
upgrade process is going to be a serious PITA.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at email.com   |   Brisbane, Australia
---------------------------------------------------------------
             http://boredomandlaziness.skystorm.net


