differences between 22 and python 23

Fri Dec 5 13:18:50 EST 2003

bokr at oz.net (Bengt Richter) writes:

> If you put a sequence of those in a "string," ISTM the string should
> be thought of as having the same encoding as the characters whose
> ord() codes are stored.

So this is a matter of "conceptual correctness". I could not care
less: I thought you would bring forward real problems that would be
solved if strings had an encoding attached.

> But either way, what you wanted to specify was the latin-1 glyph
> sequence associated with the number sequence

I would use a Unicode object to represent these characters.
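
A minimal sketch of that suggestion, written for present-day Python 3
(an assumption: there, bytes is the byte string type and str is the
Unicode type that u'...' gives you in 2.3). The codes list is a made-up
example, not something from this thread.

```python
# Hypothetical ord() values: the Latin-1 codes for "Löwis".
codes = [76, 246, 119, 105, 115]

# A byte string holding those numbers; it carries no encoding of its own.
raw = bytes(codes)

# Decode once, explicitly, to get a Unicode object for the characters.
text = raw.decode("latin-1")

print(repr(raw))
print(text)
```

Latin-1 maps each byte value to the Unicode code point with the same
number, so ord() on each character of text gives back the original
codes.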

> >The answer would be more difficult for (4/5)+4.56 if 4/5 was a
> >rational number; for 1 < 0.5+0.5j, Python decides that it just cannot
> >find a result in a reasonable way. For strings-with-attached encoding,
> >the answer would always be difficult.
> Why, when unicode includes all?

Because at the end, you would produce a byte string. Then the question
is what type the byte string should have.

> >assuming it is ASCII will give the expected result, as ASCII is a
>  ^^^^^^^^ oh, ok, it's just an assumption.

Yes. I advocate that you never rely on this assumption, but I
also believe it is a reasonable one - because it would still hold if
the string were Latin-1, KOI8-R, UTF-8, Mac-Roman, ...
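
A small sketch of why the assumption is reasonable, again in Python 3
terms (an assumption, as is the sample data): bytes below 128 decode to
the same text under every one of these codecs, while a single byte
above 127 does not.

```python
codecs_to_try = ["ascii", "latin-1", "koi8-r", "utf-8", "mac-roman"]

# Pure-ASCII bytes: every codec listed yields the same characters.
ascii_bytes = b"Martin"
decoded = {name: ascii_bytes.decode(name) for name in codecs_to_try}
all_agree = len(set(decoded.values())) == 1

# One byte above 127: the result depends on the codec chosen.
# errors="replace" keeps strict codecs from raising on invalid input.
high_byte = b"\xf6"
variants = {name: high_byte.decode(name, errors="replace")
            for name in codecs_to_try}

print(all_agree)
print(variants)
```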

> >What is the advantage of having an encoding associated with byte
> >strings?
> If e.g. name had latin-1 encoding associated with it by virtue of source like
>     ...
>     # -*- coding: latin-1 -*-
>     name = 'Martin Löwis'
> 
> then on my cp437 console window, I might be able to expect to see the umlaut
> just by writing
> 
>     print name

I see. To achieve this effect, do

# -*- coding: latin-1 -*-
name = u'Martin Löwis'
print name
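
What happens on the way to the console can be sketched as follows, in
Python 3 terms (an assumption; so is the use of cp437 as the console
encoding, taken from the example above). The Unicode object is encoded
for the target, whereas raw Latin-1 bytes sent unchanged are
misinterpreted.

```python
name = "Martin L\u00f6wis"          # what u'Martin Löwis' denotes

latin1_bytes = name.encode("latin-1")  # bytes as stored in the source file
cp437_bytes = name.encode("cp437")     # bytes a cp437 console expects

# Sending the Latin-1 bytes to a cp437 console shows the wrong glyph.
misread = latin1_bytes.decode("cp437")

print(repr(latin1_bytes))
print(repr(cp437_bytes))
print(misread == name)
```

The two byte strings differ only in the byte for the umlaut, which is
exactly the byte a console with a different code page gets wrong.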

> Why should I have to do that if I have written # -*- coding: latin-1 -*-
> in the second line? Why shouldn't s='blah blah' result in s being internally
> stored as a latin-1 glyph sequence instead of an 8-bit code sequence that will
> trip up ascii assumptions annoyingly ;-)

Because adding an encoding to strings raises difficult questions, which,
when answered, will result in non-intuitive behaviour.

> >Currently, they are represented as ASCII+escapes. I see no reason to
> >change that.
> Ok, that's no biggie, but even with your name? ;-)

I use Unicode literals in source code. They can represent my name just
fine.

> interesting. Will u'...' mean Unicode in the abstract, reserving the
> choice of utf-16(le|be)/wchar or utf-8 to the implementation?

You seem to be missing an important point. u'...' is available today.

The choice of representation is currently between UCS-2/UTF-16 and
UCS-4, with UTF-8 being an unlikely candidate for implementation
choice.
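
To illustrate the difference between the abstract string and its
representations, here is a sketch in Python 3 terms (an assumption, as
is the sample text). The string has one length in characters, while
each encoding of it has its own length in bytes. On 2.3-era builds,
sys.maxunicode tells the two internal choices apart: 0xFFFF on a UCS-2
build, 0x10FFFF on a UCS-4 build.

```python
import sys

text = "L\u00f6wis"

# Byte length of the same five characters under three encodings.
encoded_sizes = {codec: len(text.encode(codec))
                 for codec in ("utf-8", "utf-16-le", "utf-32-le")}

# Each encoding round-trips to the identical Unicode string.
round_trips = all(text.encode(c).decode(c) == text for c in encoded_sizes)

print(len(text), encoded_sizes)
print(hex(sys.maxunicode))  # largest code point this build supports
```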

> Yes that seems obvious, but I had some inkling that if two modules
> m1 and m2 had different source encodings, different codes would be
> allowed in '...' literals in each, and e.g.,
> 
>     import m1,m2
>     print 'm1: %r, m2: %r' % (m1.s1, m2.s2)
> 
> might have ill-defined meaning

That is just one of the problems you run into when associating
encodings with strings. Fortunately, there is no encoding associated
with a byte string.
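
A sketch of the m1/m2 situation, in Python 3 terms (an assumption; the
choice of Latin-1 and UTF-8 as the two source encodings is also just an
example). The same name written in two differently encoded modules
gives two different byte strings, and once they are combined no single
codec recovers the text.

```python
# As if from hypothetical modules m1 (Latin-1 source) and m2 (UTF-8 source).
s1 = "L\u00f6wis".encode("latin-1")
s2 = "L\u00f6wis".encode("utf-8")

combined = s1 + b" / " + s2

# Latin-1 accepts any byte, but garbles the UTF-8 half.
as_latin1 = combined.decode("latin-1")

# UTF-8 rejects the lone Latin-1 byte outright.
try:
    combined.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(s1), repr(s2))
print(as_latin1, utf8_ok)
```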

> But if s = '...' becomes effectively s = u'...' will type('...') =>
> <type 'unicode'> ?

Of course!

> What will become of str? Will that still be the default
> pseudo-ascii-but-really-byte-string general data container that it
> is now?

Well, <type 'str'> will continue to be the byte string type, and
conversion to str() will continue to produce byte strings. It might be
reasonable to add a string() built-in some day, which is a synonym for
unicode().

Regards,
Martin