ASCII and Unicode [was Re: Managing Google Groups headaches]

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri Dec 6 14:00:18 EST 2013


On Fri, 06 Dec 2013 05:03:57 -0800, rusi wrote:

> Evidently (and completely inadvertently) this exchange has just
> illustrated one of the inadmissable assumptions:
> 
> "unicode as a medium is universal in the same way that ASCII used to be"

Ironically, your post was not Unicode.

Seriously. I am 100% serious.

Your post was sent using a legacy encoding, Windows-1252, also known as 
CP-1252, which is most certainly *not* Unicode. Whatever software you 
used to send the message correctly flagged it with a charset header:

Content-Type: text/plain; charset=windows-1252

Alas, the software Roy Smith uses, MT-NewsWatcher, does not handle 
encodings correctly (or at all!): it screws up the encoding and then 
sends a reply with no charset line at all. This is one bug that cannot 
be blamed on Google Groups -- or on Unicode.
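
For what it's worth, here is a rough sketch in Python -- certainly not 
what MT-NewsWatcher does internally -- of the charset-aware path: read 
the declared charset from the Content-Type header and decode the body 
with it. The message bytes are made up for illustration:

    from email import message_from_bytes

    # A made-up message; the \x85 byte is CP1252's ellipsis.
    raw = (b"Content-Type: text/plain; charset=windows-1252\r\n"
           b"\r\n"
           b"Dot dot dot \x85 in a single byte\r\n")

    msg = message_from_bytes(raw)
    charset = msg.get_content_charset()     # 'windows-1252'
    body = msg.get_payload(decode=True)     # the raw bytes of the body
    print(repr(body.decode(charset)))       # the ellipsis survives intact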


> I wrote a number of ellipsis characters ie codepoint 2026 as in:

Actually, you didn't. You wrote a number of ellipsis characters as the 
hex byte \x85 (decimal 133) in the CP1252 charset. That byte happens to 
be mapped to code point U+2026 in Unicode, but the two are as distinct 
as ASCII and EBCDIC.
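
You can see the distinction with Python's standard codecs (a sketch; the 
codec names are the usual aliases):

    import unicodedata

    raw = b"\x85"                        # the byte actually on the wire
    text = raw.decode("windows-1252")    # decoded per the declared charset
    print(text)                          # …
    print(unicodedata.name(text))        # HORIZONTAL ELLIPSIS

    # The same byte means something else entirely in other legacy charsets:
    print(repr(raw.decode("latin-1")))   # '\x85'  (the C1 control code NEL)
    print(repr(raw.decode("cp437")))     # 'à'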


> Somewhere between my sending and your quoting those ellipses became the
> replacement character FFFD

Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about 
encodings and character sets. It doesn't just assume things are ASCII; 
it makes a half-hearted attempt to be charset-aware, but does it badly. 
I can only imagine that it was written back in the Dark Ages when there 
were a lot of different charsets in use but no convention for specifying 
which one applied. Or perhaps the author was smoking crack while coding.
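
I don't know exactly what it does internally, but one common way to end 
up with U+FFFD is to decode the bytes under the wrong assumption and let 
the decoder substitute a replacement character for anything it cannot 
make sense of. A sketch:

    raw = b"dot dot dot \x85 dot"        # CP1252 bytes, ellipsis included

    # Decoded with the right charset, the ellipsis comes through:
    print(raw.decode("windows-1252"))    # dot dot dot … dot

    # Decoded as UTF-8 with substitution, the lone \x85 byte is invalid,
    # so it turns into U+FFFD REPLACEMENT CHARACTER:
    print(raw.decode("utf-8", errors="replace"))    # dot dot dot � dot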


> Leaving aside whose fault this is (very likely buggy google groups),
> this mojibaking cannot happen if the assumption "All text is ASCII" were
> to uniformly hold.

This is incorrect. People forget that ASCII has evolved since the first 
version of the standard in 1963. There have actually been five versions 
of the ASCII standard, plus one unpublished version. (And that's not 
including the things which are frequently called ASCII but aren't.)

ASCII-1963 didn't even include lowercase letters. It was also missing 
some graphic characters, such as braces, and included at least two 
characters no longer used, the up-arrow and the left-arrow. The control 
characters were also significantly different from today's.

ASCII-1965 was unpublished and unused. I don't know the details of what 
it changed.

ASCII-1967 is a lot closer to the ASCII in use today. It made 
considerable changes to the control characters, moving, adding, removing, 
or renaming at least half a dozen of them. It officially added lowercase 
letters, braces, and some other characters. It replaced the up-arrow 
character with the caret and the left-arrow with the underscore. It was 
also ambiguous, allowing variations and substitutions, e.g.:

    - character 33 was permitted to be either the exclamation 
      mark ! or the logical OR symbol |

    - consequently character 124 (vertical bar) was often 
      displayed as a broken bar ¦, which helps explain why even 
      today many keyboards show it that way

    - character 35 was permitted to be either the number sign # or 
      the pound sign £

    - character 94 could be either a caret ^ or a logical NOT ¬

Even the humble comma could be pressed into service as a cedilla.

ASCII-1968 didn't change any characters, but allowed the use of LF on its 
own. Previously, you had to use either LF/CR or CR/LF as newline.

ASCII-1977 removed the ambiguities from the 1967 standard.

The most recent version is ASCII-1986 (also known as ANSI X3.4-1986). 
Unfortunately I haven't been able to find out what changes were made -- I 
presume they were minor, and didn't affect the character set.

So as you can see, even with actual ASCII, you can have mojibake. It's 
just not normally called that. But if you are given an arbitrary ASCII 
file of unknown age, containing code 94, how can you be sure it was 
intended as a caret rather than a logical NOT symbol? You can't.

Then there are at least 30 official variations of ASCII, strictly 
speaking national variants defined by ISO 646. These 7-bit codes were 
commonly called "ASCII" by their users, despite the differences, e.g. 
replacing the dollar sign $ with the international currency sign ¤, or 
replacing the left brace { with the letter s with caron š.

One consequence of this is that the MIME charset name for ASCII text is 
"US-ASCII", despite the redundancy, because many people expect "ASCII" 
alone to mean whatever national variant they are used to.

But it gets worse: there are proprietary variations on ASCII which are 
commonly called "ASCII" but aren't, including dozens of 8-bit so-called 
"extended ASCII" character sets, which is where the problems *really* 
pile up. Back in the 1980s and early 1990s people invariably called 
these "ASCII", even though they used eight bits and contained up to 256 
characters.

Just because somebody calls something "ASCII", doesn't make it so; even 
if it is ASCII, doesn't mean you know which version of ASCII; even if you 
know which version, doesn't mean you know how to interpret certain codes. 
It simply is *wrong* to think that "good ol' plain ASCII text" is 
unambiguous and devoid of problems.


> With unicode there are in-memory formats, transportation formats eg
> UTF-8, 

And the same applies to ASCII. 

ASCII is a *seven-bit code*. It will work fine on computers where the 
word size is seven bits. If the word size is eight bits, or more, you 
have to pad the ASCII code. How do you do that? Do you pad the most 
significant end or the least significant end? That's one choice. Do you 
pad with zeroes or with ones? That's another choice. If your word size 
is more than eight bits, you might even pad *both* ends.

In C, a char is defined as the smallest addressable unit of the machine 
that can contain the basic character set, not necessarily eight bits. 
Implementations of C and C++ sometimes reserve 8, 9, 16, 32, or 36 bits 
as a "byte" and/or char. Your in-memory representation of ASCII "a" could 
easily end up as bits 001100001 or 0000000001100001.
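
You can check those two bit patterns easily enough (a sketch; the widths 
9 and 16 merely stand in for a 9-bit and a 16-bit char):

    # 'a' is ASCII 97, i.e. seven bits: 1100001
    print(format(ord("a"), "07b"))    # 1100001
    print(format(ord("a"), "09b"))    # 001100001         (padded to 9 bits)
    print(format(ord("a"), "016b"))   # 0000000001100001  (padded to 16 bits)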

And then there is the question of whether ASCII characters should be 
big-endian or little-endian. I'm referring here to bit endianness, 
rather than bytes: should the character 'a' be represented as bits 
1100001 (most significant bit to the left) or 1000011 (least significant 
bit to the left)? This can be relevant to certain networking protocols. 
Not all networking protocols are big-endian, nor are all processors. The 
Ada programming language even supports both bit orders.
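
Here is a quick sketch of the two bit orders for 'a', and what the same 
seven bits decode to if read in the other order:

    msb_first = format(ord("a"), "07b")   # '1100001'
    lsb_first = msb_first[::-1]           # '1000011'
    print(msb_first, lsb_first)
    print(chr(int(lsb_first, 2)))         # 'C' -- same bits, other order,
                                          #  different letter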

When transmitting ASCII characters, the networking protocol could include 
various start and stop bits and parity codes. A single 7-bit ASCII 
character might be anything up to 12 bits in length on the wire. It is 
simply naive to imagine that the transmission of ASCII codes is the same 
as the in-memory or on-disk storage of ASCII.
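
I'm not describing any one protocol here, but as an illustration, here 
is a sketch of classic asynchronous serial framing (one start bit, seven 
data bits sent least-significant-bit first, an even parity bit, two stop 
bits), which already brings a 7-bit character to 11 bits on the wire:

    def frame_7e2(ch):
        """Frame one ASCII character as: start bit, 7 data bits (LSB
        first), even parity bit, 2 stop bits.  Purely illustrative."""
        code = ord(ch)
        assert code < 128, "not a 7-bit ASCII character"
        data = [(code >> i) & 1 for i in range(7)]    # LSB first
        parity = sum(data) % 2                        # makes the 1-count even
        return [0] + data + [parity] + [1, 1]         # start, data, parity, stops

    bits = frame_7e2("a")
    print(bits, len(bits))    # 11 bits on the wire for one 7-bit character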

You're lucky to be active in a time when most common processors have 
standardized on a single bit-order, and when most (but not all) network 
protocols have done the same. But that doesn't mean that these issues 
don't exist for ASCII. If you get a message that purports to be ASCII 
text but looks like this:

"\tS\x1b\x1b{\x01u{'\x1b\x13!"

you should strongly suspect that it is "Hello World!" with the bits of 
each character accidentally reversed by some rogue piece of hardware.
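
A sketch of how you might confirm that suspicion -- reverse the seven 
bits of each character and see what falls out:

    def reverse7(ch):
        # Reverse the 7 data bits of a single ASCII character.
        return chr(int(format(ord(ch), "07b")[::-1], 2))

    garbled = "\tS\x1b\x1b{\x02u{'\x1b\x13B"
    print("".join(reverse7(c) for c in garbled))    # Hello World!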


-- 
Steven


