unicode encoding usability problem
Nick Coghlan
ncoghlan at iinet.net.au
Sun Feb 20 07:05:50 EST 2005
Martin v. Löwis wrote:
>> How about
>>
>> b'' - 8bit string; '' unicode string
>>
>> and no automatic conversion.
>
>
> This has been proposed before, see PEP 332. The problem is that
> people often want byte strings to be mutable as well, so it is
> still unclear whether it is better to make the b prefix denote
> the current string type (so it would be currently redundant)
> or a newly-created mutable string type (similar to array.array).
Having "", u"", and r"" be immutable while b"" is mutable would seem rather
inconsistent.
If you want a phased migration to 'assert (str is unicode) == True', then PEP
332 seems to have that covered:
1. Introduce 'bytes' as an alias of str
2. Introduce b"" as an alternate spelling of r""
3. Switch str to be an alias of unicode
4. Switch "" to be an alternate spelling of u""
Trying to intermingle this with making the bytes type mutable seems to be
begging for trouble - consider how many string-keyed dictionaries would break
with that change (the upgrade path is non-existent - you can't stay with str,
because you want byte strings, but you can't go to bytes, because you need
something immutable).
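To illustrate the dictionary concern, here is a minimal sketch. It assumes the split that present-day Python 3 provides, with an immutable bytes type and a mutable bytearray type; neither existed in this form when this was written, so treat it as an illustration of the argument rather than of any proposal above.

```python
# Immutable byte strings are hashable, so they work as dictionary keys.
headers = {b"Content-Type": b"text/plain"}
immutable_lookup = headers[b"Content-Type"]

# A mutable byte sequence is unhashable, so the same lookup raises TypeError.
try:
    headers[bytearray(b"Content-Type")]
    mutable_key_ok = True
except TypeError:
    mutable_key_ok = False
```

If the only byte string type were mutable, every dictionary keyed on byte strings would hit the second case.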
An alternative would be to have "bytestr" be the immutable type corresponding to
the current str (with b"" literals producing bytestr's), while reserving the
"bytes" name for a mutable byte sequence. That is, change PEP 332's upgrade path
to look more like:
* Add a bytestr builtin which is just a synonym for str. (2.5)
* Add a b"..." string literal which is equivalent to raw string literals,
with the exception that values which conflict with the source encoding of the
containing file do not generate warnings. (2.5)
* Warn about the use of variables named "bytestr". (2.5 or 2.6)
* Introduce a bytestr builtin which refers to a sequence distinct from the
str type. (2.6)
* Make str a synonym for unicode. (3.0)
And separately:
* Introduce a bytes builtin which is a mutable byte sequence.
Alternatively, add array.bytes as a subclass of array.array that provides a nicer
API for dealing specifically with byte strings.
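As a rough sketch of what the array.array route might look like: the class name ByteArray and its startswith helper below are hypothetical, invented for illustration only. No array.bytes type exists in the standard library, and the code is written for present-day Python 3.

```python
import array


class ByteArray(array.array):
    """Hypothetical stand-in for the suggested array.bytes type."""

    def __new__(cls, data=b""):
        # Fix the typecode to unsigned bytes so callers pass only the data.
        return super().__new__(cls, "B", data)

    def startswith(self, prefix):
        # Example of a string-like convenience method layered on the array.
        return self.tobytes().startswith(bytes(prefix))


buf = ByteArray(b"spam")
buf[0] = ord("S")   # in-place mutation, which an immutable type cannot offer
buf.extend(b"!")    # appends the byte value 33
result = buf.tobytes()
```

This keeps mutability in a separate type, so the immutable replacement for str remains usable as a dictionary key.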
The main point being, the replacement for 'str' needs to be immutable or the
upgrade process is going to be a serious PITA.
Cheers,
Nick.
--
Nick Coghlan | ncoghlan at email.com | Brisbane, Australia
---------------------------------------------------------------
http://boredomandlaziness.skystorm.net