[Python-Dev] Python 1.6a2 Unicode bug (was Re: comparing strings and ints)

Alisa Pasic Robinson alisa@robanal.demon.co.uk
Thu, 27 Apr 2000 10:29:54 GMT


>I wrote:
>>A utf-8-encoded 8-bit string in Python is *not* a string, but a "ByteArray".
>
>Another way of putting this is:
>- utf-8 in an 8-bit string is to a unicode string what a pickle is to an
>object.
>- defaulting to utf-8 upon coercing is like implicitly trying to unpickle
>an 8-bit string when comparing it to an instance. Bad idea.
>
>Defaulting to Latin-1 is the only logical choice, no matter how
>western-culture-centric this may seem.
>
>Just

The Van Rossum Common Sense gene strikes again!  You guys owe
it to the world to have lots of children.

I agree 100%.  Let me also add that if you want to do encoding work
that goes beyond what the library gives you, you absolutely need
a 'byte array' type which makes no assumptions and does nothing
magic to its content. I have always thought of 8-bit strings as 'byte
arrays' and not 'character arrays', and doing anything magic to
them in literals or standard input is going to cause lots of trouble.

I think our proposal is BETTER than Java, Tcl, Visual Basic etc for
the following reasons:
- you can work with old fashioned strings, which are understood
by everyone to be arrays of bytes, and there is no magic
conversion going on.  The bytes in literal strings in your script file
are the bytes that end up in the program.
- you can work with Unicode strings if you want
- you are in explicit control of conversions between them
- both types have similar methods so there isn't much to learn or
remember
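
A minimal sketch of the "explicit control" point, written in present-day
Python 3 spelling (an assumption for the sake of a runnable example; in 1.6
the conversion is spelled unicode(data, encoding)).  Here bytes stands in
for the old 8-bit string, and the variable names are made up for
illustration:

```python
# Sketch only: Python 3 spelling, with bytes standing in for the 8-bit string.
raw = b"caf\xe9"                 # four bytes, exactly as written in the source
text = raw.decode("latin-1")     # explicit conversion: bytes -> Unicode
back = text.encode("latin-1")    # explicit conversion: Unicode -> bytes

# Both types offer similar methods.
upper_raw = raw.upper()          # bytes.upper() only touches ASCII letters
upper_text = text.upper()        # str.upper() also uppercases the accented letter

# Mixing the two without naming an encoding is refused rather than guessed.
try:
    raw + text
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

Nothing is converted unless the script asks for it, and the bytes that come
back from encode() are the bytes that went in.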

The 'no magic' thing is very important with Japanese, where very
often you need to roll your own codecs and look at the raw bytes;
any auto-conversion might not go through the filter you want, and
you've already lost information before you've started, especially if
your job is to repair possibly corrupt data.  Any company with
a few extra custom characters in the user-defined Shift-JIS range
is going to suddenly find their Perl scripts are failing or trashing
all their data as a result of the UTF-8 decision.
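
To make the "look at the raw bytes first" workflow concrete, here is a rough
sketch in Python 3 spelling.  The helper try_decode is hypothetical, not a
library function, and the sample data is invented; whether a particular
codec accepts bytes in the user-defined range depends on the codec, so the
sketch handles both outcomes rather than assuming one:

```python
def try_decode(raw, encoding):
    """Hypothetical helper: return (text, None) on success, or (None, error).

    The input bytes are never modified, so nothing is lost on failure.
    """
    try:
        return raw.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc

# Invented sample: two ordinary Shift-JIS characters followed by a byte pair
# whose lead byte (0xF0) falls in the user-defined range.
raw = b"\x93\xfa\x96\x7b\xf0\x40"

text, err = try_decode(raw, "shift_jis")
if err is not None:
    # The original bytes are still here to inspect or hand to a custom codec.
    print("undecodable at offset", err.start, ":", raw[err.start:err.end].hex())
else:
    print("decoded", len(text), "characters")
```

The point is only that the decision about what to do with the odd bytes
stays with the script, not with an automatic conversion.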

I'm also convinced that the majority of Python scripts won't need
to work in Unicode.  Even working with exotic languages,
there is always a native 8-bit encoding.  I have only used Unicode
when
(a) working with data that is in several languages
(b) doing conversions, which requires a 'central point'
(c) wanting to do per-character operations safely on multi-byte data
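
A small illustration of the per-character case, again in Python 3 spelling
and with an example string of my own choosing.  In Shift-JIS the second
byte of a two-byte character can coincide with an ASCII byte such as the
backslash, so byte-level searching and slicing can go wrong where the same
operations on the decoded text do not:

```python
text = "表示"                       # two characters
raw = text.encode("shift_jis")     # the same data as Shift-JIS bytes

# Searching the bytes finds a "backslash" that is really half a character.
false_hit = b"\\" in raw
real_hit = "\\" in text

# Slicing the bytes cuts a character in half; slicing the text does not.
half_char = raw[:1]
first_char = text[:1]
```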


I still haven't sorted out in my head whether the default encoding
thing is a big red herring or is important; I already have a safe way
to construct Unicode literals in my source files if I want to using
    unicode('rawdata','myencoding').
But if there has to be one I'd say the following:
- strict ASCII is an option
- Latin-1 is the more generous option that is right for the most
people, and has a 'special status' among 8-bit encodings
- UTF-8 is not one byte per character and will confuse people
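
The three options can be compared with a short sketch (Python 3 spelling,
names invented for illustration).  Latin-1's 'special status' here is that
byte value N decodes to the character with code point N, so every 8-bit
string round-trips; strict ASCII and UTF-8 both reject a lone high byte,
and UTF-8 needs two bytes for the same character:

```python
all_bytes = bytes(range(256))

# Latin-1: every byte maps to the code point with the same number.
as_text = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in as_text]
round_trip = as_text.encode("latin-1")

def decodes(raw, encoding):
    """Return True if raw is valid in the given encoding."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

high_byte = b"\xe9"                          # e-acute in Latin-1
ascii_ok = decodes(high_byte, "ascii")
utf8_ok = decodes(high_byte, "utf-8")
latin1_ok = decodes(high_byte, "latin-1")

# UTF-8 is not one byte per character.
utf8_length = len(high_byte.decode("latin-1").encode("utf-8"))
```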

Just my 2p worth,

Andy