Proposal: require 7-bit source str's

Mon Aug 9 08:43:39 EDT 2004

Hallvard B Furuseth wrote:

>>>> The long-term goal would be unicode throughout, IMHO.
>>> 
>>> Whose long-term goal for what?  For things like Internet communication,
>>> fine.  But there are a lot of less 'global' applications where other
>>> character encodings make more sense.

More sense? I doubt that. What does make sense is an API that abstracts from
the encoding. As the adoption of unicode grows, you can then reduce the
number of points where data in limited (i.e. non-unicode) encodings is
imported or exported, without affecting the core of your app. IMHO
chr(ord("a") - 32) is inferior to "a".upper() even in an all-ASCII
environment.
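
To make the comparison concrete, here is a small sketch, written against Python 3 (where str is a Unicode string), not the Python 2 of this thread. The helper name ascii_upper and the sample characters are my own picks for illustration. The arithmetic trick is only right for a-z; the method call keeps working once other letters show up.

```python
def ascii_upper(ch):
    """The arithmetic trick: subtract 32 from the code point.

    Only correct for the 26 letters a-z (hypothetical helper, for illustration).
    """
    return chr(ord(ch) - 32)


# Compare the trick with str.upper() for an ASCII letter and two non-ASCII ones.
for ch in ["a", "\u00ff", "\u0451"]:  # a, y with diaeresis, Cyrillic io
    print(repr(ch), "trick:", repr(ascii_upper(ch)), "upper():", repr(ch.upper()))
```

For "a" both agree. For the other two the trick lands on an unrelated character, while upper() returns the real capital letter.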

>> Here we disagree. Showing the right image for a character should be
>> the job of the OS and should safely work cross-platform.
> 
> Yes.  What of it?

I don't understand the question.

> 
> Programs that show text still need to know which character set the
> source text has, so it can pass the OS the text it expects, or send a
> charset directive to the OS, or whatever.
> 
>> Why shouldn't I be able to store a file with a greek or chinese name?
> 
> If you want an OS that allows that, get an OS which allows that.

That was not the point. I was trying to say that the usefulness of a
standard grows with its adoption.

>> I wasn't able to quote Martin's
>> surname correctly for the Python-URL. That's a mess that should be
>> cleaned up once per OS rather than once per user. I don't see how that
>> can happen without unicode (only). Even NASA blunders when they have to
>> deal with meters and inches.
> 
> Yes, there are many non-'global' applications too where Unicode is
> desirable.  What of it?

I don't understand the question.

> Just because you want Unicode, why shouldn't I be allowed to use
> other character encodings in cases where they are more practical?

Again, my contention is that once the use of unicode has reached the tipping
point, you will encounter no cases where other encodings are more practical.

> For example, if one uses character set ns_4551-1 - ASCII with {|}[\]
> replaced with æøåÆØÅ, sorting by simple byte ordering will sort text
> correctly.  Unicode text _can't_ be sorted correctly, because of
> characters like 'ö': Swedish 'ö' should match Norwegian 'ø' and sort
> with that, while German 'ö' should not match 'ø' and sorts with 'o'.

Why not sort depending on the locale instead of ordinal values of the
bytes/characters?
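
A rough sketch of what I mean, in Python 3. The two key functions and their rules are hand-written toys that I made up for this example; they are not real collation data. In practice you would more likely go through something like locale.strxfrm, which depends on the locales installed on the machine. The point is only that the same Unicode strings can be ordered differently depending on which key you pick.

```python
import unicodedata

# Toy rule, for illustration only: these letters come after z in Norwegian,
# and a Swedish o-with-diaeresis is treated like the Norwegian o-with-stroke.
NORWEGIAN_TAIL = {"\u00e6": 1, "\u00f8": 2, "\u00f6": 2, "\u00e5": 3}


def norwegian_key(word):
    """Sort key that places ae, o-stroke (and o-diaeresis), a-ring after z."""
    key = []
    for ch in unicodedata.normalize("NFC", word.lower()):
        if ch in NORWEGIAN_TAIL:
            key.append((1, NORWEGIAN_TAIL[ch]))
        else:
            key.append((0, ord(ch)))
    return key


def german_key(word):
    """Sort key that strips combining marks, so o-diaeresis sorts with o."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return [ch for ch in decomposed if not unicodedata.combining(ch)]


words = ["\u00f8l", "zebra", "\u00f6l", "ost"]

by_code_point = sorted(words)
by_norwegian = sorted(words, key=norwegian_key)
by_german = sorted(words, key=german_key)

print(by_code_point)
print(by_norwegian)
print(by_german)
```

The data stays the same in all three cases; only the sort key changes.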

>>> In any case, a language's both short-term and long-term goals should be
>>> to support current programming, not programming like it 'should be done'
>>> some day in the future.

At some point you have to ask yourself whether the dirty tricks that only
work depending on the country you live in, its current orthography and the
current state of your favourite programming language really save you time,
or whether they turn up in so many places in your program that one
centralized API that does it right is more efficient even today.

> I don't know Perl 6, but Perl 5 is an excellent example of how not to do
> this.  So is Emacs' MULE, for that matter.
> 
> I recently had to downgrade to perl5.004 when perl5.8 broke my programs.
> They worked fine until they were moved to a machine where someone had
> set up the locale to use UTF-8.  Then Perl decided that my data, which
> has nothing at all to do with the locale, was Unicode data.  I tried to
> insert 'use bytes', but that didn't work.  It does seem to work in newer
> Perl versions, but it's not clear to me how many places I have to insert
> some magic to prevent that.  Nor am I interested in finding out: I just
> don't trust the people who released such a piece of crap to leave my
> non-Unicode strings alone.  In particular since _most_ of the strings
> are UTF-8, so I wonder if Perl might decide to do something 'friendly'
> with them.

I see you know more Perl than me - well, my mentioning of the zipper was
rather a lightweight digression prompted by the ongoing decorator frenzy.

>> the way to go in your case. If I were to add a switch to Python's
>> string handling it would be "all-unicode".
> 
> Meaning what?

All strings are unicode by default. If you need byte sequences instead of
character sequences you would have to provide a b-prefixed string.
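
A minimal sketch of how such a switch could look, written against Python 3, where plain literals are character strings and the b prefix gives a byte sequence. The sample word and the choice of UTF-8 and Latin-1 are assumptions for illustration only.

```python
# Bytes as they might arrive from a file or socket (UTF-8 assumed here).
raw = b"sm\xc3\xb8rbr\xc3\xb8d"

# Decode once at the input boundary; the core of the program sees characters.
text = raw.decode("utf-8")

print(len(raw), len(text))   # byte count versus character count
print(text.upper())          # character operations need no encoding knowledge

# Encode once at the output boundary, in whatever encoding the consumer wants.
latin1 = text.encode("latin-1")
print(latin1)
```

The encoding is named in exactly two places, on the way in and on the way out.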

Peter