Extended ASCII and code pages [was Re: for / while else doesn't make sense]

Steven D'Aprano steve at pearwood.info
Fri May 27 02:47:28 EDT 2016


On Fri, 27 May 2016 04:10 pm, Marko Rauhamaa wrote:

> Steven D'Aprano <steve at pearwood.info>:
>> This concept of ASCII = "all character sets", or "nearly all", or
>> "okay, maybe not nearly all of them, but just the important ones" is
>> terribly Euro-centric. The very idea would be laughable in Japan and
>> other East Asian countries, where Shift-JIS and Big5 still dominate.
> 
> Shift-JIS and Big5 are ASCII derivatives:

Gosh. Really?

If you looked at what I wrote, I said:

"Then there are the variable-width encodings which are backwards compatible
with ASCII *only* in the sense that text containing only ASCII characters
uses the same sequence of bytes as ASCII would."

and gave both Shift-JIS and Big5 as examples. But you cannot treat them
as "like ASCII" or "extended ASCII" because they are multibyte encodings.
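
To make that concrete, here is a rough sketch. It assumes the "shift_jis",
"big5" and "utf-8" codecs that ship with CPython, and the sample strings are
my own choice. Pure-ASCII text comes out as the same bytes everywhere, but in
Shift-JIS an ASCII byte value can also turn up as the second half of a
two-byte character, so a naive byte-level search for an ASCII character can
hit the middle of something else.

```python
# Sketch only: relies on the codecs bundled with CPython; the sample
# strings are illustrative, not taken from any real data.
ascii_text = "Hello, world"
for codec in ("ascii", "shift_jis", "big5", "utf-8"):
    # Pure-ASCII text encodes to identical bytes in all four codecs.
    assert ascii_text.encode(codec) == b"Hello, world"

# KATAKANA LETTER SO is the two bytes 0x83 0x5C in Shift-JIS, and 0x5C
# is the ASCII backslash.
so = "\u30bd"
sjis = so.encode("shift_jis")
utf8 = so.encode("utf-8")

# A byte-level search for a backslash "finds" one inside the Shift-JIS
# character, but never inside a UTF-8 multibyte sequence.
print(b"\\" in sjis)
print(b"\\" in utf8)
```

UTF-8 avoids this because every byte of a multibyte sequence has its high
bit set, so none of them can be mistaken for ASCII.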

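Unlike UTF-8, if you mangle a Shift-JIS or Big5 multibyte sequence, you
don't just corrupt a single character; you can corrupt a potentially
unlimited amount of subsequent text, because a trail byte can also be a
valid lead byte and the decoder has no reliable way to resynchronise.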
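
Here is a small sketch of the difference. The string is deliberately
contrived: I picked a character whose Shift-JIS trail byte (0x81) is also a
valid lead byte, and the helper drop_first_byte is my own invention for the
demonstration. It again assumes CPython's bundled codecs.

```python
# Sketch only: a contrived string, chosen because its Shift-JIS trail
# byte (0x81) is also a valid Shift-JIS lead byte.
text = "\u30e1" * 5          # KATAKANA LETTER ME, five times


def drop_first_byte(s, codec):
    """Encode s, lose the first byte, then decode whatever is left."""
    damaged = s.encode(codec)[1:]
    return damaged.decode(codec, errors="replace")


utf8_result = drop_first_byte(text, "utf-8")
sjis_result = drop_first_byte(text, "shift_jis")

# UTF-8 resynchronises at the next lead byte, so only the damaged
# character is lost.
print(utf8_result.count("\u30e1"))

# Shift-JIS pairs the remaining bytes up wrongly all the way to the end
# of the string, so none of the original characters come back.
print(sjis_result.count("\u30e1"))
```

With UTF-8, four of the five characters survive. With Shift-JIS none do, and
the same would hold however long the string was.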

I don't mind being corrected if I make a genuine mistake; in fact, I
appreciate correction. But being corrected for something I already
acknowledged? That's just arguing for the sake of arguing.



[...]
> ASCII derivatives are in wide use in the Americas and Antarctica as
> well. They have been spotted in Australia, New Zealand, Oceania and
> Africa. You shouldn't be surprized if you run into them in Asia, either.

Of course.

But they're not *all encodings*, and while they're important, there are
plenty of non-ASCII encodings and encodings which violate the "one byte
equals one character" invariant followed by ASCII and extended-ASCII
encodings.
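
A quick way to see that invariant break, again as a sketch using CPython's
own codecs and sample strings of my choosing:

```python
# Sketch only: compares character counts with encoded byte counts.
samples = [
    ("caf\u00e9", "latin-1"),               # an extended-ASCII encoding
    ("\u65e5\u672c\u8a9e", "shift_jis"),    # "Japanese language" in kanji
    ("\u4e2d\u6587", "big5"),               # "Chinese language"
]

counts = {}
for sample, codec in samples:
    data = sample.encode(codec)
    # Record (number of characters, number of bytes) for each codec.
    counts[codec] = (len(sample), len(data))
    print(codec, len(sample), "characters,", len(data), "bytes")
```

Latin-1 keeps the two numbers equal; the other two do not.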




-- 
Steven



