ASCII and Unicode [was Re: Managing Google Groups headaches]

rusi rustompmody at gmail.com
Fri Dec 6 21:33:39 EST 2013


On Saturday, December 7, 2013 12:30:18 AM UTC+5:30, Steven D'Aprano wrote:
> On Fri, 06 Dec 2013 05:03:57 -0800, rusi wrote:

> > Evidently (and completely inadvertently) this exchange has just
> > illustrated one of the inadmissible assumptions:
> > "unicode as a medium is universal in the same way that ASCII used to be"

> Ironically, your post was not Unicode.

> Seriously. I am 100% serious.

> Your post was sent using a legacy encoding, Windows-1252, also known as 
> CP-1252, which is most certainly *not* Unicode. Whatever software you 
> used to send the message correctly flagged it with a charset header:

> Content-Type: text/plain; charset=windows-1252

> Alas, the software Roy Smith uses, MT-NewsWatcher, does not handle 
> encodings correctly (or at all!): it screws up the encoding and then sends 
> a reply with no charset line at all. This is one bug that cannot be blamed 
> on Google Groups -- or on Unicode.
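
How a mail client is supposed to read that declared charset is a one-liner
with the stdlib email module; a minimal Python 3 sketch (mine, purely
illustrative), using the header value quoted above:

    >>> from email.message import Message
    >>> msg = Message()
    >>> msg['Content-Type'] = 'text/plain; charset=windows-1252'
    >>> msg.get_content_charset()
    'windows-1252'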

> > I wrote a number of ellipsis characters, i.e. code point 2026, as in:

> Actually, you didn't. What you wrote was a number of ellipsis characters, 
> hex byte \x85 (decimal 133), in the CP1252 charset. That happens to be mapped to 
> code point U+2026 in Unicode, but the two are as distinct as ASCII and 
> EBCDIC.
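
The distinction is easy to see at a Python 3 prompt (my illustration, not
part of the original exchange): the byte 0x85 only becomes U+2026 once you
decode it *as* CP1252:

    >>> b'\x85'.decode('cp1252')
    '…'
    >>> hex(ord(b'\x85'.decode('cp1252')))
    '0x2026'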

> > Somewhere between my sending and your quoting those ellipses became the
> > replacement character U+FFFD

> Yes, it appears that MT-NewsWatcher is *deeply, deeply* confused about 
> encodings and character sets. It doesn't just assume things are ASCII; 
> it makes a half-hearted attempt to be charset-aware, but does so badly. I 
> can only imagine that it was written back in the Dark Ages when there were 
> a lot of different charsets in use but no conventions for specifying which 
> charset was in use. Or perhaps the author was smoking crack while coding.
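
That also explains the replacement characters: decode the CP1252 byte with
the wrong codec and ask for replacement on errors, and out comes U+FFFD. A
sketch of the failure mode (my guess at the mechanism, not a trace of
MT-NewsWatcher itself):

    >>> b'\x85'.decode('utf-8', errors='replace')
    '�'
    >>> hex(ord(b'\x85'.decode('utf-8', errors='replace')))
    '0xfffd'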

> > Leaving aside whose fault this is (very likely buggy google groups),
> > this mojibaking cannot happen if the assumption "All text is ASCII" were
> > to uniformly hold.

> This is incorrect. People forget that ASCII has evolved since the first 
> version of the standard in 1963. There have actually been five versions 
> of the ASCII standard, plus one unpublished version. (And that's not 
> including the things which are frequently called ASCII but aren't.)

> ASCII-1963 didn't even include lowercase letters. It was also missing some 
> graphic characters like braces, and included at least two characters no 
> longer used, the up-arrow and left-arrow. The control characters were 
> also significantly different from today.

> ASCII-1965 was unpublished and unused. I don't know the details of what 
> it changed.

> ASCII-1967 is a lot closer to the ASCII in use today. It made 
> considerable changes to the control characters, moving, adding, removing, 
> or renaming at least half a dozen of them. It officially added 
> lowercase letters, braces, and some others. It replaced the up-arrow 
> character with the caret and the left-arrow with the underscore. It was 
> ambiguous, allowing variations and substitutions, e.g.:

>     - character 33 was permitted to be either the exclamation 
>       mark ! or the logical OR symbol |

>     - consequently character 124 (vertical bar) was always 
>       displayed as a broken bar ¦, which explains why even today
>       many keyboards show it that way

>     - character 35 was permitted to be either the number sign # or 
>       the pound sign £

>     - character 94 could be either a caret ^ or a logical NOT ¬

> Even the humble comma could be pressed into service as a cedilla.

> ASCII-1968 didn't change any characters, but allowed the use of LF on its 
> own. Previously, you had to use either LF/CR or CR/LF as newline.

> ASCII-1977 removed the ambiguities from the 1967 standard.

> The most recent version is ASCII-1986 (also known as ANSI X3.4-1986). 
> Unfortunately I haven't been able to find out what changes were made -- I 
> presume they were minor, and didn't affect the character set.

> So as you can see, even with actual ASCII, you can have mojibake. It's 
> just not normally called that. But if you are given an arbitrary ASCII 
> file of unknown age, containing code 94, how can you be sure it was 
> intended as a caret rather than a logical NOT symbol? You can't.

> Then there are at least 30 official variations of ASCII, strictly 
> speaking part of ISO-646. These 7-bit codes were commonly called "ASCII" 
> by their users, despite the differences, e.g. replacing the dollar sign $ 
> with the international currency sign ¤, or replacing the left brace 
> { with the letter s with caron š.

> One consequence of this is that the MIME charset name for ASCII text is 
> "US-ASCII", despite the redundancy, because many people expect "ASCII" 
> alone to mean whatever national variation they are used to.

> But it gets worse: there are proprietary variations on ASCII which are 
> commonly called "ASCII" but aren't, including dozens of 8-bit so-called 
> "extended ASCII" character sets, which is where the problems *really* 
> pile up. Invariably back in the 1980s and early 1990s people used to call 
> these "ASCII" even though they used 8 bits and contained anything up 
> to 256 characters.
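
To make that concrete (my example, using only codecs that ship with
Python): the very same byte means three different things under three
different "extended ASCII" code pages:

    >>> b'\x85'.decode('cp1252')   # Windows "ANSI"
    '…'
    >>> b'\x85'.decode('cp437')    # old IBM PC code page
    'à'
    >>> b'\x85'.decode('latin-1')  # ISO-8859-1: a C1 control character
    '\x85'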

> Just because somebody calls something "ASCII" doesn't make it so; even if 
> it is ASCII, that doesn't mean you know which version of ASCII; even if 
> you know which version, that doesn't mean you know how to interpret 
> certain codes. It simply is *wrong* to think that "good ol' plain ASCII 
> text" is unambiguous and devoid of problems.

> > With Unicode there are in-memory formats, transportation formats, e.g.
> > UTF-8, 

> And the same applies to ASCII. 

> ASCII is a *seven-bit code*. It will work fine on computers where the 
> word-size is seven bits. If the word-size is eight bits, or more, you 
> have to pad the ASCII code. How do you do that? Pad the most-significant 
> end or the least significant end? That's a choice there. How do you pad 
> it, with a zero or a one? That's another choice. If your word-size is 
> more than eight bits, you might even pad *both* ends.

> In C, a char is defined as the smallest addressable unit of the machine 
> that can contain the basic character set, not necessarily eight bits. 
> Implementations of C and C++ sometimes reserve 8, 9, 16, 32, or 36 bits 
> as a "byte" and/or char. Your in-memory representation of ASCII "a" could 
> easily end up as bits 001100001 or 0000000001100001.
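
Those two bit patterns are just "a" (0x61) padded with zeros at the
most-significant end to a 9-bit and a 16-bit word; easy to check (purely
illustrative):

    >>> format(ord('a'), '07b')    # bare 7-bit ASCII
    '1100001'
    >>> format(ord('a'), '09b')    # padded to a 9-bit word
    '001100001'
    >>> format(ord('a'), '016b')   # padded to a 16-bit word
    '0000000001100001'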

> And then there is the question of whether ASCII characters should be Big 
> Endian or Little Endian. I'm referring here to bit endianness, rather 
> than bytes: should character 'a' be represented as bits 1100001 (most 
> significant bit to the left) or 1000011 (least significant bit to the 
> left)? This may be relevant with certain networking protocols. Not all 
> networking protocols are big-endian, nor are all processors. The Ada 
> programming language even supports both bit orders.

> When transmitting ASCII characters, the networking protocol could include 
> various start and stop bits and parity codes. A single 7-bit ASCII 
> character might be anything up to 12 bits in length on the wire. It is 
> simply naive to imagine that the transmission of ASCII codes is the same 
> as the in-memory or on-disk storage of ASCII.

> You're lucky to be active in a time when most common processors have 
> standardized on a single bit-order, and when most (but not all) network 
> protocols have done the same. But that doesn't mean that these issues 
> don't exist for ASCII. If you get a message that purports to be ASCII 
> text but looks like this:

> "\tS\x1b\x1b{\x01u{'\x1b\x13!"

> you should suspect strongly that it is "Hello World!" which has been 
> accidentally bit-reversed by some rogue piece of hardware.
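
A throwaway Python 3 sketch (mine, not from the original post) that
reverses the bit order of each 7-bit code reproduces exactly that garbage
from "Hello World!":

    def bit_reverse7(ch):
        # Reverse the 7 data bits of one ASCII character.
        return chr(int(format(ord(ch), '07b')[::-1], 2))

    print(repr(''.join(bit_reverse7(c) for c in "Hello World!")))
    # prints: "\tS\x1b\x1b{\x02u{'\x1b\x13B"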

Oof! That's a lot of data to digest! Thanks anyway.

There's one thing I want to get into:

> Your post was sent using a legacy encoding, Windows-1252, also known as 
> CP-1252, which is most certainly *not* Unicode. Whatever software you 
> used to send the message correctly flagged it with a charset header:

What the hell! I am using Firefox 25.0 on Debian testing and posting via GG (Google Groups).

$ locale
shows me:
LANG=en_US.UTF-8

and a bunch of other things, all en_US.UTF-8.

For the most part, when I point FF at any site and go to View ->
Character Encoding, it says Unicode (UTF-8).

However, when I go to anything in the Python archives:
https://mail.python.org/pipermail/python-list/2013-December/

FF shows it as Western (Windows-1252).

That seems to suggest that something is not right with the Python
mailing list config. No??
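
One way to check whether the blame lies with the server or with FF's
fallback guess (a stdlib-only sketch; the URL is the archive page above):
if this prints None, the server sent no charset at all and the browser is
simply guessing.

    import urllib.request

    url = "https://mail.python.org/pipermail/python-list/2013-December/"
    with urllib.request.urlopen(url) as resp:
        # charset declared in the Content-Type response header, or None
        print(resp.headers.get_content_charset())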


