ASCII and Unicode [was Re: Managing Google Groups headaches]

Gene Heskett gheskett at wdtv.com
Fri Dec 6 14:34:54 EST 2013


On Friday 06 December 2013 14:30:06 Steven D'Aprano did opine:

> On Fri, 06 Dec 2013 05:03:57 -0800, rusi wrote:
> > Evidently (and completely inadvertently) this exchange has just
> > illustrated one of the inadmissable assumptions:
> > 
> > "unicode as a medium is universal in the same way that ASCII used to
> > be"
> 
> Ironically, your post was not Unicode.
> 
> Seriously. I am 100% serious.
> 
> Your post was sent using a legacy encoding, Windows-1252, also known as
> CP-1252, which is most certainly *not* Unicode. Whatever software you
> used to send the message correctly flagged it with a charset header:
> 
> Content-Type: text/plain; charset=windows-1252
> 
> Alas, the software Roy Smith uses, MT-NewsWatcher, does not handle
> encodings correctly (or at all!), it screws up the encoding then sends a
> reply with no charset line at all. This is one bug that cannot be blamed
> on Google Groups -- or on Unicode.
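
For what it's worth, here is a minimal Python 3 sketch of what honouring
that charset header looks like (the message text below is made up, but the
mechanics are the standard email module):

    from email import message_from_bytes

    # A toy message carrying a CP1252 ellipsis byte, correctly labelled.
    raw = b"Content-Type: text/plain; charset=windows-1252\n\nTesting\x85\n"
    msg = message_from_bytes(raw)

    charset = msg.get_content_charset()      # 'windows-1252'
    body = msg.get_payload(decode=True)      # the raw bytes of the body
    print(body.decode(charset))              # Testing…  (ends with a real U+2026)

Ignore the declared charset, as MT-NewsWatcher apparently does, and that
ellipsis byte has nowhere sensible to go.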
> 
> > I wrote a number of ellipsis characters i.e. codepoint 2026 as in:
> 
> Actually you didn't. You wrote a number of ellipsis characters, hex byte
> \x85 (decimal 133), in the CP1252 charset. That happens to be mapped to
> code point U+2026 in Unicode, but the two are as distinct as ASCII and
> EBCDIC.
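
The distinction is easy to see in Python 3 (a minimal sketch; the byte
values are the ones discussed above):

    raw = b"\x85"                       # the CP1252 ellipsis *byte*
    text = raw.decode("cp1252")         # the Unicode ellipsis *character*
    print(text)                         # …
    print(hex(ord(text)))               # 0x2026
    print(text.encode("utf-8"))         # b'\xe2\x80\xa6' -- same character,
                                        # different bytes in a different encoding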
> 
> > Somewhere between my sending and your quoting those ellipses became
> > the replacement character FFFD
> 
> Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about
> encodings and character sets. It doesn't just assume things are ASCII;
> it makes a half-hearted attempt to be charset-aware, but badly. I can
> only imagine that it was written back in the Dark Ages when there were
> a lot of different charsets in use but no conventions for specifying
> which charset was in use. Or perhaps the author was smoking crack while
> coding.
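
A plausible reconstruction of the mangling, as a Python 3 sketch (the
actual code path inside MT-NewsWatcher is anyone's guess):

    raw = b"Testing\x85 1, 2, 3"                # CP1252 bytes on the wire
    mangled = raw.decode("ascii", errors="replace")
    print(mangled)                              # Testing� 1, 2, 3
    print(hex(ord(mangled[7])))                 # 0xfffd -- the replacement character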
> 
> > Leaving aside whose fault this is (very likely buggy google groups),
> > this mojibaking cannot happen if the assumption "All text is ASCII"
> > were to uniformly hold.
> 
> This is incorrect. People forget that ASCII has evolved since the first
> version of the standard in 1963. There have actually been five versions
> of the ASCII standard, plus one unpublished version. (And that's not
> including the things which are frequently called ASCII but aren't.)
> 
> ASCII-1963 didn't even include lowercase letters. It was also missing
> some graphic characters like braces, and included at least two
> characters no longer used, the up-arrow and left-arrow. The control
> characters were also significantly different from today.
> 
> ASCII-1965 was unpublished and unused. I don't know the details of what
> it changed.
> 
> ASCII-1967 is a lot closer to the ASCII in use today. It made
> considerable changes to the control characters, moving, adding,
> removing, or renaming at least half a dozen control characters. It
> officially added lowercase letters, braces, and some others. It
> replaced the up-arrow character with the caret and the left-arrow with
> the underscore. It was ambiguous, allowing variations and
> substitutions, e.g.:
> 
>     - character 33 was permitted to be either the exclamation
>       mark ! or the logical OR symbol |
> 
>     - consequently character 124 (vertical bar) was always
>       displayed as a broken bar ¦, which explains why even today
>       many keyboards show it that way
> 
>     - character 35 was permitted to be either the number sign # or
>       the pound sign £
> 
>     - character 94 could be either a caret ^ or a logical NOT ¬
> 
> Even the humble comma could be pressed into service as a cedilla.
> 
> ASCII-1968 didn't change any characters, but allowed the use of LF on
> its own. Previously, you had to use either LF/CR or CR/LF as newline.
> 
> ASCII-1977 removed the ambiguities from the 1967 standard.
> 
> The most recent version is ASCII-1986 (also known as ANSI X3.4-1986).
> Unfortunately I haven't been able to find out what changes were made --
> I presume they were minor, and didn't affect the character set.
> 
> So as you can see, even with actual ASCII, you can have mojibake. It's
> just not normally called that. But if you are given an arbitrary ASCII
> file of unknown age, containing code 94, how can you be sure it was
> intended as a caret rather than a logical NOT symbol? You can't.
> 
> Then there are at least 30 official variations of ASCII, strictly
> speaking part of ISO-646. These 7-bit codes were commonly called "ASCII"
> by their users, despite the differences, e.g. replacing the dollar sign
> $ with the international currency sign ¤, or replacing the left brace
> { with the letter s with caron š.
> 
> One consequence of this is that the MIME charset for ASCII text is
> called "US-ASCII", despite the redundancy, because many people expect
> "ASCII" alone to mean whatever national variation they are used to.
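
Python itself keeps a pile of these historical names around as aliases; a
quick sketch:

    import codecs

    # All of these resolve to the same codec.
    for name in ("us-ascii", "ansi_x3.4-1968", "iso646-us", "646"):
        print(name, "->", codecs.lookup(name).name)    # every one prints 'ascii'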
> 
> But it gets worse: there are proprietary variations on ASCII which are
> commonly called "ASCII" but aren't, including dozens of 8-bit so-called
> "extended ASCII" character sets, which is where the problems *really*
> pile up. Invariably back in the 1980s and early 1990s people used to
> call these "ASCII" no matter that they used 8 bits and contained
> anything up to 256 characters.
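
One byte, several "extended ASCII" readings -- a small Python 3 sketch
using charsets that happen to ship with Python:

    for charset in ("cp1252", "cp437", "mac_roman", "latin_1"):
        print(charset, repr(b"\x9c".decode(charset)))

    # cp1252    'œ'      (the oe ligature)
    # cp437     '£'      (the pound sign)
    # mac_roman 'ú'      (u with acute)
    # latin_1   '\x9c'   (a C1 control character)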
> 
> Just because somebody calls something "ASCII", doesn't make it so; even
> if it is ASCII, doesn't mean you know which version of ASCII; even if
> you know which version, doesn't mean you know how to interpret certain
> codes. It simply is *wrong* to think that "good ol' plain ASCII text"
> is unambiguous and devoid of problems.
> 
> > With unicode there are in-memory formats, transportation formats eg
> > UTF-8,
> 
> And the same applies to ASCII.
> 
> ASCII is a *seven-bit code*. It will work fine on computers where the
> word-size is seven bits. If the word-size is eight bits, or more, you
> have to pad the ASCII code. How do you do that? Pad the most-significant
> end or the least-significant end? That's a choice there. How do you pad
> it, with a zero or a one? That's another choice. If your word-size is
> more than eight bits, you might even pad *both* ends.
> 
> In C, a char is defined as the smallest addressable unit of the machine
> that can contain the basic character set, not necessarily eight bits.
> Implementations of C and C++ sometimes reserve 8, 9, 16, 32, or 36 bits
> as a "byte" and/or char. Your in-memory representation of ASCII "a"
> could easily end up as bits 001100001 or 0000000001100001.
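
A short Python 3 sketch of those layouts for the letter "a" (here the
padding is all zeroes on the most-significant end, which is only one of
the choices described above):

    code = ord("a")                # 97
    print(format(code, "07b"))     # 1100001          -- bare 7-bit ASCII
    print(format(code, "09b"))     # 001100001        -- padded to a 9-bit char
    print(format(code, "016b"))    # 0000000001100001 -- padded to a 16-bit char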
> 
> And then there is the question of whether ASCII characters should be Big
> Endian or Little Endian. I'm referring here to bit endianness, rather
> than bytes: should character 'a' be represented as bits 1100001 (most
> significant bit to the left) or 1000011 (least significant bit to the
> left)? This may be relevant with certain networking protocols. Not all
> networking protocols are big-endian, nor are all processors. The Ada
> programming language even supports both bit orders.
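
As a Python 3 sketch, flipping the bit order of a single character is just
a string reversal of its 7-bit pattern:

    bits = format(ord("a"), "07b")     # '1100001', most significant bit first
    flipped = bits[::-1]               # '1000011', least significant bit first
    print(chr(int(flipped, 2)))        # 'C' -- same bits, opposite order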
> 
> When transmitting ASCII characters, the networking protocol could
> include various start and stop bits and parity codes. A single 7-bit
> ASCII character might be anything up to 12 bits in length on the wire.
> It is simply naive to imagine that the transmission of ASCII codes is
> the same as the in-memory or on-disk storage of ASCII.
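
A rough Python 3 sketch of that growth on the wire (the framing here --
one start bit, seven data bits, an even parity bit, two stop bits -- is
just one plausible old-style serial configuration, not any specific
protocol):

    def frame(ch, start_bits=1, parity=True, stop_bits=2):
        """Frame one 7-bit ASCII character for transmission (illustrative only)."""
        data = format(ord(ch), "07b")
        bits = "0" * start_bits + data           # start bit(s), then the data bits
        if parity:
            bits += str(data.count("1") % 2)     # even parity bit
        bits += "1" * stop_bits                  # stop bit(s)
        return bits

    print(frame("a"), len(frame("a")))           # 01100001111 11 -- 11 bits for 7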
> 
> You're lucky to be active in a time when most common processors have
> standardized on a single bit-order, and when most (but not all) network
> protocols have done the same. But that doesn't mean that these issues
> don't exist for ASCII. If you get a message that purports to be ASCII
> text but looks like this:
> 
> "\tS\x1b\x1b{\x02u{'\x1b\x13B"
> 
> you should suspect strongly that it is "Hello World!" which has been
> accidentally bit-reversed by some rogue piece of hardware.
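
That garbled string can be reproduced in a few lines of Python 3 (a
sketch; the "rogue hardware" is simulated by reversing each character's
7-bit pattern):

    def bit_reverse7(text):
        """Reverse the 7-bit pattern of every character in the string."""
        return "".join(chr(int(format(ord(c), "07b")[::-1], 2)) for c in text)

    garbled = bit_reverse7("Hello World!")
    print(repr(garbled))            # "\tS\x1b\x1b{\x02u{'\x1b\x13B"
    print(bit_reverse7(garbled))    # Hello World!  -- the mapping is its own inverse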

You can lay a lot of the ASCII ambiguity on D.E.C. and their VT series
terminals; anything newer than a VT100 made liberal use of the msbit in a
character.  Having written an emulator for the VT220, I can testify that
really getting it right was a right pain in the ass.  And then I added
zmodem triggers and detections.

Cheers, Gene
-- 
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Genes Web page <http://geneslinuxbox.net:6309/gene>

Mother Earth is not flat!
A pen in the hand of this president is far more
dangerous than 200 million guns in the hands of
         law-abiding citizens.


