Proposal: require 7-bit source str's

Hallvard B Furuseth h.b.furuseth at usit.uio.no
Sun Aug 22 13:03:00 EDT 2004


Peter Otten wrote:
> Hallvard B Furuseth wrote:
>>>>> The long-term goal would be unicode throughout, IMHO.
>>>> 
>>>> Whose long-term goal for what?  For things like Internet communication,
>>>> fine.  But there are a lot of less 'global' applications where other
>>>> character encodings make more sense.
> 
> More sense? I doubt that. What does make sense is an api that abstracts from
> the encoding.

That makes sense if the application knows which encoding is in use so it
can convert at all, is 'big enough' to bother with converting back and
forth, and the encoding doesn't already provide what one needs such an
abstraction for.
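
(To make that caveat concrete, a rough sketch with a made-up single byte
standing in for some undecoded input; an encoding-abstracting operation
like upper() only helps once the program actually knows the encoding:)

    raw = b"\xe6"                      # 'æ' in ISO 8859-1, but just a byte as far as Python knows

    raw.upper()                        # b'\xe6' -- a bare byte string knows nothing about 'æ'
    raw.decode("iso-8859-1").upper()   # 'Æ' -- works once the encoding is known
    raw.decode("utf-8")                # UnicodeDecodeError -- a wrong guess gives nothing at all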

> You can then reduce the points where data in limited, i.e.
> non-unicode, encodings is imported/exported as the adoption of unicode grows
> without affecting the core of your app. IMHO chr(ord("a") - 32) is inferior
> to "a".upper() even in an all-ascii environment.

If by 'limited' you mean some character set other than Unicode, that's
not much use if the application is designed for something which has that
'limited' character set/encoding anyway.

>>> Here we disagree. Showing the right image for a character should be
>>> the job of the OS and should safely work cross-platform.
>> 
>> Yes.  What of it?
> 
> I don't understand the question.

I explained that in the next paragraph:

>> Programs that show text still need to know which character set the
>> source text has, so they can pass the OS the text it expects, or send a
>> charset directive to the OS, or whatever.

If you disagree with that, is that because you think of Unicode as The
One True Character Set which everything can assume is in use if not
otherwise specified?  That's a long way from the world I'm living in.
Besides, even if you have 'everything is Unicode', that still doesn't
necessarily mean UTF-8.  It could be UCS-4, or whatever.  Unicode or no,
displaying a character does involve telling the OS what encoding is in
use.  Or not telling it and trusting the application to handle it, which
is again what's being done outside the Unicode world.

>>> Why shouldn't I be able to store a file with a greek or chinese name?
>> 
>> If you want an OS that allows that, get an OS which allows that.
> 
> That was not the point. I was trying to say that the usefulness of a
> standard grows with its adoption.

And the thing about standards is that there are so many of them to
choose from.  Enforcing a standard somewhere in an environment where
that is not the standard is not useful.  Try the standard of driving on
the right side of the road in a country where everyone else drives on
the left side.  Standards are supposed to serve us; it's not we who are
supposed to serve standards.

>>> I wasn't able to quote Martin's
>>> surname correctly for the Python-URL. That's a mess that should be
>>> cleaned up once per OS rather than once per user. I don't see how that
>>> can happen without unicode (only). Even NASA blunders when they have to
>>> deal with meters and inches.
>> 
>> Yes, there are many non-'global' applications too where Unicode is
>> desirable.  What of it?
> 
> I don't understand the question.

You claimed one non-global application where Unicode would have been
good, as an argument that there are no non-global applications where
Unicode would not be good.

>> Just because you want Unicode, why shouldn't I be allowed to use
>> other character encodings in cases where they are more practical?
> 
> Again, my contention is that once the use of unicode has reached the tipping
> point you will encounter no cases where other encodings are more practical. 

So because you are fond of Unicode, you want to force a quick transition
on everyone else and leave us to deal with the troubles of the
transition, even in cases where things worked perfectly fine without
Unicode.

But I'm pretty sure that "tipping point" where there are no cases in
which non-Unicode is more practical is pretty close to 100% usage of
Unicode around the world.

>> For example, if one uses character set ns_4551-1 - ASCII with {|}[\]
>> replaced with æøåÆØÅ, sorting by simple byte ordering will sort text
>> correctly.  Unicode text _can't_ be sorted correctly, because of
>> characters like 'ö': Swedish 'ö' should match Norwegian 'ø' and sort
>> with that, while German 'ö' should not match 'ø' and sorts with 'o'.
> 
> Why not sort depending on the locale instead of ordinal values of the
> bytes/characters?

I'm in Norway.  Both Swedes and Germans are foreigners.
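
(Sketching the locale suggestion in Python makes the trouble visible: the
locale module gives one collation at a time, and the same names sort
differently under a Norwegian and a German locale.  The names and locale
strings below are made up, and the locales must actually be installed for
setlocale to accept them.)

    import locale

    names = ["Moller", "Møller", "Möller"]      # invented sample names

    # Norwegian collation: 'ø' sorts after 'z', at the end of the alphabet.
    locale.setlocale(locale.LC_COLLATE, "nb_NO.UTF-8")
    norwegian_order = sorted(names, key=locale.strxfrm)

    # German collation: 'ö' sorts together with 'o', nowhere near 'ø'.
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
    german_order = sorted(names, key=locale.strxfrm)

    # One process, one LC_COLLATE: a data set with both Norwegian and German
    # names in it cannot be collated "correctly" for both at once.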

>>>> In any case, both a language's short-term and long-term goals should be
>>>> to support current programming, not programming like it 'should be done'
>>>> some day in the future.
> 
> At some point you have to ask yourself whether the dirty tricks that work
> depending on the country you live in, its current orthography and the
> current state of your favourite programming language do save you some time
> at so many places in your program that one centralized api that does it
> right is more efficient even today.

Just because you are fond of Unicode and think it's the Right Solution
to everything doesn't make other ways of doing things a dirty trick.

As for dirty tricks, that's exactly what such premature standardization
leads to, and one reason I don't like it.  Like Perl and Emacs which
have decided that if they don't know which character set is in use, then
it's the character set of the current locale (if they can deduce it) -
even though they have no idea if the data they are processing have
anything to do with the current locale.  I wrote a long rant addressed
to the wrong person about that recently; please read article
<HBF.20040808avqr at bombur.uio.no> in the 'PEP 263 status check' thread.
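
(The trick is easy enough to write down in current Python too, which is
exactly the objection: the guess costs one line and has nothing to do
with where the data actually came from.  Purely an illustration:)

    import locale
    import sys

    # If nothing says what charset the input is in, pretend it is whatever
    # the current locale happens to use -- regardless of the data's origin.
    guessed = locale.getpreferredencoding()
    data = sys.stdin.buffer.read()     # bytes from who knows where
    text = data.decode(guessed)        # fine, until the data isn't in the locale's charset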

>>> If I were to add a switch to Python's
>>> string handling it would be "all-unicode".
>> 
>> Meaning what?
> 
> All strings are unicode by default. If you need byte sequences instead of
> character sequences you would have to provide a b-prefixed string.

I've been wondering about something like that myself, but it still
requires the program to be told which character set is in use so it can
convert back and forth between that and Unicode.  To get that right,
Python would need to tag I/O streams and other stuff with their
character set/encoding.  And either Python would have to guess when it
didn't know (like looking at the locale's name), or, if it didn't,
programmers would guess to get rid of the annoyance of encoding
exceptions cropping up everywhere.  Then at a later date we'd have to
clean up all the code with the bogus guesses, so the problem would
really just have been transformed into another problem...
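
(A minimal sketch of what such tagging would look like at the I/O
boundary, using the codecs module; the filenames and encodings are
invented, and the point is only that someone has to supply them --
Python cannot read them off the file.)

    import codecs

    # Whoever opens the stream declares its encoding; nothing in the file
    # itself says "this is ISO 8859-1" rather than UTF-8 or something else.
    inf = codecs.open("names.txt", "r", encoding="iso-8859-1")
    text = inf.read()                  # unicode inside the program
    inf.close()

    outf = codecs.open("names-out.txt", "w", encoding="utf-8")
    outf.write(text)                   # converted back to bytes only on the way out
    outf.close()

Get that declaration wrong, or guess it from the locale, and the data is
silently mangled or blows up later -- which is the cleanup problem above.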

-- 
Hallvard


