differences between 22 and python 23

Fri Dec 5 18:16:41 EST 2003

On 05 Dec 2003 19:18:50 +0100, martin at v.loewis.de (Martin v. Löwis) wrote:

>bokr at oz.net (Bengt Richter) writes:
>
>> If you put a sequence of those in a "string," ISTM the string should
>> be thought of as having the same encoding as the characters whose
>> ord() codes are stored.
>
>So this is a matter of "conceptual correctness". I could not care
>less: I thought you bring forward real problems that would be solved
>if strings had an encoding attached.
I thought I did, but the "problem" is not achieving end effects (everyone
appreciates your expert advice on that). The "problem" is a bit of
(UIAM -- and to explore the issue collaboratively is why I post) unnecessary
explicitness required to achieve an end that could possibly happen automatically
(according to a "conceptually correct[-me-if-I'm-wrong]" model ;-)

>
>> But either way, what you wanted to specify was the latin-1 glyph
>> sequence associated with the number sequence
>
>I would use a Unicode object to represent these characters.
Yes, that is an effective explicitness, but not what I was trying to get at.
>
>> >The answer would be more difficult for (4/5)+4.56 if 4/5 was a
>> >rational number; for 1 < 0.5+0.5j, Python decides that it just cannot
>> >find a result in a reasonable way. For strings-with-attached encoding,
>> >the answer would always be difficult.
>> Why, when unicode includes all?
>
>Because at the end, you would produce a byte string. Then the question
>is what type the byte string should have.
Unicode, of course, unless that coercion was not necessary, as in ascii+ascii
or latin-1 + latin-1, etc., where the result could retain the more specific
encoding attribute.
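
A minimal sketch of that coercion rule, written in Python 3 syntax for convenience. The class EncodedStr and its attributes are hypothetical names invented for illustration; nothing like it exists in Python itself.

```python
# Hypothetical illustration only: a byte string that carries an encoding tag.
class EncodedStr:
    def __init__(self, data, encoding):
        self.data = data          # raw bytes
        self.encoding = encoding  # codec name, e.g. 'ascii' or 'latin-1'

    def __add__(self, other):
        if self.encoding == other.encoding:
            # Same encoding on both sides: no coercion, keep the specific tag.
            return EncodedStr(self.data + other.data, self.encoding)
        # Mixed encodings: decode both sides and fall back to Unicode text.
        return self.data.decode(self.encoding) + other.data.decode(other.encoding)


same = EncodedStr(b'L\xf6', 'latin-1') + EncodedStr(b'wis', 'latin-1')
mixed = EncodedStr(b'L\xf6wis', 'latin-1') + EncodedStr(b' M\x94ller', 'cp437')
```

Here same keeps its latin-1 tag, while mixed comes back as plain Unicode text. The sketch does not cover what happens once a result is written out as bytes again.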

>
>> >assuming it is ASCII will give the expected result, as ASCII is a
>>  ^^^^^^^^ oh, ok, it's just an assumption.
>
>Yes. I advocate you should never make use of this assumption, but I
>also believe it is a reasonable one - because it would still hold if
>the string was Latin-1, KOI-8R, UTF-8, Mac-Roman, ...
Why not assume latin-1, if it's just a convenience assumption for certain
contexts? I suspect it would be right more often than not, given that for
other cases explicit unicode or decode/encode calls would probably be used.
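
The difference between the two assumptions can be sketched as follows (Python 3 syntax; the sample text and variable names are only illustrative). Pure ASCII text encodes to the same bytes under every codec named above, whereas a byte above 0x7f reads differently depending on which 8-bit encoding is assumed.

```python
text = 'Martin'  # pure ASCII sample
codecs_named_above = ['latin-1', 'koi8-r', 'utf-8', 'mac-roman']

# The ASCII assumption holds no matter which of these the bytes really use.
ascii_holds = all(text.encode(c) == text.encode('ascii')
                  for c in codecs_named_above)

# A latin-1 assumption is a guess for any byte above 0x7f.
high_byte = b'\xf6'
as_latin1 = high_byte.decode('latin-1')
as_koi8r = high_byte.decode('koi8-r')
guess_matters = as_latin1 != as_koi8r
```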

>
>> >What is the advantage of having an encoding associated with byte
>> >strings?
>> If e.g. name had latin-1 encoding associated with it by virtue of source like
>>     ...
>>     # -*- coding: latin-1 -*-
>>     name = 'Martin Löwis'
>> 
>> then on my cp437 console window, I might be able to expect to see the umlaut
>> just by writing
>> 
>>     print name	
>
>I see. To achieve this effect, do
>
># -*- coding: latin-1 -*-
>name = u'Martin Löwis'
>print name
Right, but that is a workaround w.r.t. the possibility I am trying to discuss.
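
For reference, a rough sketch of what is at stake on a cp437 console, in Python 3 syntax with the transcoding done by hand (the variable names are illustrative). Raw latin-1 bytes sent to such a console show the wrong glyph, while text that knows its encoding can be re-encoded for the console.

```python
name = 'Martin L\xf6wis'                # o with diaeresis is U+00F6

latin1_bytes = name.encode('latin-1')   # the bytes a latin-1 source file holds
misread = latin1_bytes.decode('cp437')  # what a cp437 console makes of them
cp437_bytes = name.encode('cp437')      # the bytes the console actually needs

shows_wrong_glyph = misread != name
```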

>
>
>> Why should I have to do that if I have written # -*- coding: latin-1 -*-
>> in the second line? Why shouldn't s='blah blah' result in s being internally
>> stored as a latin-1 glyph sequence instead of an 8-bit code sequence that will
>> trip up ascii assumptions annoyingly ;-)
>
>Because adding encoding to strings raises difficult questions, which,
>when answered, will result in non-intuitive behaviour.
Care to elaborate? I don't know what difficult questions or non-intuitive behavior
you have in mind, but I am probably not the only one who is curious ;-)

>
>> >Currently, they are represented as ASCII+escapes. I see no reason to
>> >change that.
>> Ok, that's no biggie, but even with your name? ;-)
>
>I use Unicode literals in source code. They can represent my name just
>fine.
Ok, ok ;-)

>
>> interesting. Will u'...' mean Unicode in the abstract, reserving
>> the choice of utf-16(le|be)/wchar or utf-8 to the implementation?
>
>You seem to be missing an important point. u'...' is available today.
No, I know that ;-) But I don't know how you are going to migrate towards
a more pervasive use of unicode in all the '...' contexts: whether at
some point unicode will be built into cpython as the C representation
of all internal strings, or whether it will use unicode through unicode objects
and their interfaces, which I imagine would be the way it started.
Memory-limited implementations might want to make different choices IWG,
so the cleaner the python-unicode relationship the freer those choices
are likely to be IWT. I was just speculating on these things.

>
>The choice of representation is currently between UCS-2/UTF-16 and
>UCS-4, with UTF-8 being an unlikely candidate for implementation
>choice.
>
>> Yes that seems obvious, but I had some inkling that if two modules
>> m1 and m2 had different source encodings, different codes would be
>> allowed in '...' literals in each, and e.g.,
>> 
>>     import m1,m2
>>     print 'm1: %r, m2: %r' % (m1.s1, m2.s2)
>> 
>> might have ill-defined meaning
>
>That is just one of the problems you run into when associating
                                                   ^--not ;-)
>encodings with strings. Fortunately, there is no encoding associated
>with a byte string.
So assume ascii, after having stripped away better knowledge?

It's fine to have a byte type with no encoding associated. But unfortunately,
ISTM str instances are playing a dual role as ascii-encoded strings
and byte strings. More below.

>
>> But if s = '...' becomes effectively s = u'...' will type('...') =>
>> <type 'unicode'> ?
>
>Of course!
Just checking ;-)

>
>> What will become of str? Will that still be the default
>> pseudo-ascii-but-really-byte-string general data container that it
>> is now?
>
>Well, <type 'str'> will continue to be the byte string type, and
>conversion to str() will continue to produce byte strings. It might be
>reasonable to add a string() built-in some day, which is a synonym for
>unicode().

How will the following look when s = '...' becomes effectively s = u'...' per above?

 >>> str('L\xf6wis')
 'L\xf6wis'
 >>> str(u'L\xf6wis')
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 1: ordinal not in range(128)

Will there be an ascii codec involved in that first str('...') ?
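
The second call in the session above behaves as if the unicode value were encoded with an ascii codec on the way to str. A small sketch that spells that step out, in Python 3 syntax where the encode call has to be written explicitly; the helper to_bytes and its default argument are hypothetical:

```python
name = 'L\xf6wis'  # text with one non-ASCII character, U+00F6

def to_bytes(text, encoding='ascii'):
    """Explicit text-to-bytes conversion; 'ascii' stands in for the default codec."""
    return text.encode(encoding)

try:
    to_bytes(name)
    ascii_failed = False
except UnicodeEncodeError:
    # U+00F6 has no ASCII representation, as in the traceback above.
    ascii_failed = True

latin1_bytes = to_bytes(name, 'latin-1')  # succeeds, one byte per character
```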

Hm. For a pure-bytes string type, you could define an assumed 8-bit encoding in place of ascii,
so that you could always get a unicode translation 1:1. You could do it by using a private
code area of unicode, so e.g. '\x00' to '\xff' becomes u'\ue000' to u'\ue0ff' and then e.g.,
unicode.__str__(u'\ue0ab') could render back '\xab' as the str value instead of raising
UnicodeEncodeError saying it's not in ascii range. Also, u'\ue0ab'.encode('bytes') would presumably
return the byte string '\xab'.

To get the e000-e0ff unicode, you'd do some_byte_string.decode('bytes') analogous to the apparent
some_ordinary_str.decode('ascii') that seems to be attempted in some contexts now.
BTW, is that really some_ordinary_str.decode(sys.getdefaultencoding()) ?
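
A rough sketch of the 'bytes' mapping described above, as two plain functions in Python 3 syntax. No codec named 'bytes' is registered here; bytes_to_pua and pua_to_bytes are hypothetical stand-ins for the proposed .decode('bytes') and .encode('bytes').

```python
PUA_BASE = 0xE000  # start of the private-use range suggested above

def bytes_to_pua(data):
    """Map each byte 0x00-0xff to U+E000-U+E0FF, one code point per byte."""
    return ''.join(chr(PUA_BASE + b) for b in data)

def pua_to_bytes(text):
    """Inverse mapping; rejects code points outside U+E000-U+E0FF."""
    out = bytearray()
    for ch in text:
        offset = ord(ch) - PUA_BASE
        if not 0 <= offset <= 0xFF:
            raise ValueError('code point outside U+E000-U+E0FF: %r' % ch)
        out.append(offset)
    return bytes(out)


every_byte = bytes(range(256))
round_trip_ok = pua_to_bytes(bytes_to_pua(every_byte)) == every_byte
```

Because the mapping is 1:1, any byte string survives the round trip unchanged, which is the property the proposal relies on.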

Another thing: what encoding should an ordinary user object return from __str__ ?
(I still think str instances with explicit optional mutable 'encoding' attribute slots
could be useful ;-) Or should __str__ not return type str any more, but unicode??
Or optionally either?

BTW, is '...' =(effectively)= u'...' slated for a particular future python version?

Regards,
Bengt Richter