differences between 22 and python 23

Sun Dec 7 21:38:58 EST 2003

On Mon, 8 Dec 2003 03:01:09 +0300, "Serge Orlov" <sombDELETE at pobox.ru> wrote:

>Bengt,
>
>don't take it personally but this is what happens <wink> when you use unicode
>unaware software:
>Quote from your message:
>> martin at v.loewis.de (Martin v. =?iso-8859-15?q?L=F6wis?=) wrote:
>
>Your software also doesn't specify message encoding.
>
>The real issue is to convince developers that there are many encodings in
>this world. Python should offer only one way to deal with multiple encodings.
>Your 8-bit strings with attached coding attribute is duplicating what unicode
>strings offer.
In one sense yes (both can represent all the same character sequences one way
or another), but in other senses no. Part of the concern was matters that could well be
totally hidden behind a unicode interface (internal representation/optimization issues).
Normalizing to one methodology usually saves time and space, but not in use cases
where the normalization is a useless 1:1 transformation. Then it becomes
a matter of how much overhead there is in deciding which situation applies. It may or
may not be worth it. Anyway, that's one path of discussion. Another is how to deal with
the fact that you wouldn't want to convert all current str data into unicode (e.g. data
that is really pure bytestrings and has no character interpretation).

If we were starting from scratch, it would be a lot easier to disentangle bytestrings
and charstrings. That's really the problem that led to most of my ramblings. My tacking
on .coding attributes is really a kludge to create a hybrid charstring/bytestring.
I don't like kludges, but otherwise I don't currently see a way short of major breakage,
and Martin predicts I'll run into major problems any way I might want to try going.
His opinion is nothing to sneeze at ;-)
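
To make the kludge a bit more concrete, here is a minimal sketch of a hybrid
charstring/bytestring. It is written against present-day Python 3 `bytes`
rather than the 2.x `str` under discussion, and `CodedBytes`, its `coding`
attribute and `as_text` are hypothetical names invented for illustration, not
an existing or proposed API.

```python
# Hypothetical sketch: CodedBytes, .coding and as_text() are illustrative
# names only; nothing like this exists in the standard library.
class CodedBytes(bytes):
    """A bytestring that optionally remembers the encoding of its contents."""

    def __new__(cls, data, coding=None):
        self = super().__new__(cls, data)
        # None means "pure bytes, no character interpretation".
        self.coding = coding
        return self

    def as_text(self):
        """Decode with the attached coding; refuse for pure bytestrings."""
        if self.coding is None:
            raise TypeError("no coding attached; this is pure byte data")
        return self.decode(self.coding)


# A charstring-like value: bytes plus the knowledge of how to read them.
name = CodedBytes(b"Martin L\xf6wis", coding="latin-1")

# A pure bytestring: same type, but no character meaning attached.
blob = CodedBytes(b"\x00\xff\x10")
```

The point of the sketch is only that one type can serve both uses, with the
attribute deciding whether a character interpretation exists at all.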
>
>> >> If e.g. name had latin-1 encoding associated with it by virtue of source like
>> >>     ...
>> >>     # -*- coding: latin-1 -*-
>> >>     name = 'Martin Löwis'
>> >>
>> >> then on my cp437 console window, I might be able to expect to see the umlaut
>> >> just by writing
>> >>
>> >>     print name
>> >
>> >I see. To achieve this effect, do
>> >
>> ># -*- coding: latin-1 -*-
>> >name = u'Martin Löwis'
>> >print name
>> Right, but that is a workaround w.r.t the possibility I am trying to discuss.
>
>It's not a workaround it's a solution. What you propose is a lot of effort for
It's one solution to a particular problem. That solution was proposed as a way
to get an end effect. I wasn't asking for help on how to get the end effect.
I think I can do that ;-) I was discussing another problem, namely that a bare quoted
string is not sufficient to cause the proper encoding conversions for output even though
the encoding could be determined unambiguously from the source file encoding.
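
For reference, the conversion that would have to happen behind the scenes can
be sketched as follows. This uses Python 3 spellings, and it assumes the source
coding is latin-1, the console is cp437, and the name contains an o-umlaut; the
variable names are made up for the example.

```python
# Assumed inputs: the coding declared in the source file and the console's
# code page. Both values are assumptions chosen to match the example above.
source_coding = "latin-1"
console_coding = "cp437"

# The bytes a bare quoted string would hold in a latin-1 source file.
raw = b"Martin L\xf6wis"

# Step 1: interpret the bytes using the source file's declared coding.
text = raw.decode(source_coding)

# Step 2: re-encode for the output device before writing.
console_bytes = text.encode(console_coding)

# The o-umlaut is a different byte value in each code page.
print(raw, console_bytes)
```

Both steps are fully determined once the source coding and the console coding
are known, which is why the bare string seems like it ought to be enough.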

>a little gain wrt handling multiple encoding. It's already possible to handle
I don't consider being able to eliminate unnecessary source text cruft little gain.

>multiple encoding. The time is better spent converting everything that still
>deals with 8-bit text to handle unicode.

It was always possible to handle multiple encodings, if you wanted to go to the trouble
of writing your own solution. If you are happy with the u'...' prefix as the final answer
to handling multiple character sets, and consider python evolution finished there with that issue,
that's fine. But the problem is also disentangling text uses from byte uses of str, and migrating
towards transparent use of unicode or the equivalent. This gets into language design issues that
are more interesting than just how to make a multi-char-set app work using python as it is currently.

I doubt that wholesale conversion to unicode is a good idea. E.g., I wouldn't think you'd want to read a latin-1
log file of hundreds of megabytes and get automatic conversion to 16-bit unicode for no reason.
You could have a unicode interface that hid internal
bytestrings-with-various-encodings-attached. And then you are getting into that part of
what I was discussing. The other part is how to disentangle charstrings from bytestrings
without rewriting the world. Hence proposing, for discussion, a kludge that might let str act as both.
But you didn't see me mention the PEP word anywhere yet, did you ;-)
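
A small sketch of the log file case, again in Python 3 terms: the search runs
on raw bytes with no decoding, and the size comparison shows what a blanket
conversion to a 16-bit form would cost. The function name and the in-memory
stand-in for the log are hypothetical.

```python
import io


def count_matching_lines(stream, needle):
    """Count lines containing needle, comparing raw bytes without decoding."""
    return sum(1 for line in stream if needle in line)


# Stand-in for a large latin-1 log file opened in binary mode.
log = io.BytesIO(b"ok start\nERROR caf\xe9 down\nok stop\nERROR disk\n")
errors = count_matching_lines(log, b"ERROR")

# Cost of converting one line to a 16-bit representation: every latin-1
# byte becomes two bytes in UTF-16.
sample = b"ERROR caf\xe9 down\n"
as_utf16 = sample.decode("latin-1").encode("utf-16-le")
print(errors, len(sample), len(as_utf16))
```

Nothing here needs a character interpretation until a matching line is
actually displayed, so the decoding can be deferred or skipped.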

Martin says the problem is hard. I believe him. I still like to bat ideas around with intelligent
people (of which there are a fair number posting to this group), even at the risk of striking out
some of the time. Sometimes something worthwhile emerges that perhaps no one participating would
have thought of without the interchange to trigger a key thought.

I'll be off line for a few days (pre-apologizing in case I appear to be ignoring anyone ;-)

Regards,
Bengt Richter