unicode encoding usability problem

aurora aurora00 at gmail.com
Fri Feb 18 19:21:02 EST 2005


On Fri, 18 Feb 2005 21:16:01 +0100, Martin v. Löwis <martin at v.loewis.de>  
wrote:

> I'd like to point out the
> historical reason: Python predates Unicode, so the byte string type
> has many convenience operations that you would only expect of
> a character string.
>
> We have come up with a transition strategy, allowing existing
> libraries to widen their support from byte strings to character
> strings. This isn't a simple task, so many libraries still expect
> and return byte strings, when they should process character strings.
> Instead of breaking the libraries right away, we have defined
> a transitional mechanism, which allows to add Unicode support
> to libraries as the need arises. This transition is still in
> progress.

I understand, and I wasn't yelling "why can't Python be more like Java". On  
the other hand, I also want to point out that making an individual decision  
for each string isn't practical and is very error-prone. The fact that  
unicode and 8-bit strings look alike and work alike in common situations,  
and only run into problems with non-ASCII data, is very confusing for most  
people.
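
To illustrate the confusion, here is a minimal sketch. It assumes a Python 3  
interpreter and emulates the implicit 8-bit-to-unicode coercion with a  
hypothetical helper, py2_style_concat (my own name, not a real API), which  
decodes the byte string as ASCII before joining it to the unicode string:

```python
def py2_style_concat(byte_str, text):
    """Emulate the implicit coercion: decode the 8-bit string as ASCII, then join."""
    return byte_str.decode("ascii") + text


# Pure ASCII data: mixing the two string types appears to work fine.
ascii_result = py2_style_concat(b"abc", "def")

# Non-ASCII data (the UTF-8 bytes for "café"): the same code now blows up.
try:
    py2_style_concat(b"caf\xc3\xa9", "def")
    failed = False
except UnicodeDecodeError:
    failed = True
```

The same code path succeeds or fails depending only on the data, which is  
why the problem tends to surface late and far from its cause.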


> Eventually, the primary string type should be the Unicode
> string. If you are curious how far we are still off that goal,
> just try running your program with the -U option.

Lots of errors. Among them are gzip (binary?!) and strftime??

I actually quite appreciate Python's power in processing binary data as  
8-bit strings. But perhaps we should transition to using unicode as the text  
string and treat the binary string as the exception. Right now we have

   '' - 8-bit string; u'' - unicode string

How about

   b'' - 8-bit string; '' - unicode string

and no automatic conversion. Perhaps this could be activated by something  
like the encoding declarations, so that the transition can happen module by  
module.
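
A rough sketch of what "no automatic conversion" would look like in  
practice. This happens to be how Python 3 behaves, so the example assumes a  
Python 3 interpreter; the variable names are only for illustration:

```python
binary = b"caf\xc3\xa9"   # 8-bit string: the raw UTF-8 bytes
text = "caf\u00e9"        # unicode string: four characters

# Mixing the two types is refused outright, regardless of the data.
try:
    binary + text
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Conversion has to be spelled out, with the encoding named explicitly.
decoded = binary.decode("utf-8")
encoded = text.encode("utf-8")
```

The point is that the error shows up on every mixed operation, ASCII or  
not, so it is caught the first time the code runs rather than the first  
time non-ASCII data arrives.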


> Regards,
> Martin



