Unicode 7

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri May 2 04:45:41 EDT 2014


On Thu, 01 May 2014 19:02:48 -0700, Rustom Mody wrote:

> I dont know how one causally connects the 'headaches' but Ive seen -
> mojibake

Mojibake is certainly more common with multiple encodings, but the 
solution to that is Unicode, not ASCII.
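
Here is a minimal sketch of how mojibake arises when two parties disagree 
about the encoding, assuming Python 3. The sample string and the choice of 
UTF-8 and Latin-1 are purely illustrative:

```python
# Text written by one party and encoded as UTF-8...
original = "café £5"
data = original.encode("utf-8")

# ...then read by another party who wrongly assumes Latin-1.
garbled = data.decode("latin-1")      # 'cafÃ© Â£5'

# Undoing the wrong step recovers the text, but only if you know
# (from out-of-band information) which two encodings were involved.
recovered = garbled.encode("latin-1").decode("utf-8")
print(repr(garbled), repr(recovered))
```

The text survives intact whenever both ends agree on the encoding; 
retreating to ASCII would simply have lost the é and the £.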

In fact, in your blog post you even link to a post of mine where I 
explain that ASCII has gone through multiple backwards-incompatible 
changes over the decades, which means you can have a limited form of 
mojibake even in pure ASCII. Between the changes across versions of 
ASCII and the ambiguous characters allowed by the standard, you needed 
some sort of out-of-band metadata to tell you whether the author 
intended an @ or a `, a | or a ¬, a £ or a #, to mention only a few.
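
To illustrate the # versus £ case: in US-ASCII, byte 0x23 is #, while in 
the British national variant of ISO 646 the same byte is £. The override 
tables and the render() helper below are hand-written for illustration 
only; they are not a real codec:

```python
# Byte 0x23 as interpreted by two different 7-bit character sets.
# These one-entry tables are illustrative, not complete definitions.
US_ASCII = {0x23: "#"}
ISO646_GB = {0x23: "£"}

def render(data, overrides):
    """Decode 7-bit bytes, applying a variant's overrides to US-ASCII."""
    return "".join(overrides.get(b, chr(b)) for b in data)

data = b"Price: #5"
as_us = render(data, US_ASCII)    # 'Price: #5'
as_gb = render(data, ISO646_GB)   # 'Price: £5'
print(as_us, "|", as_gb)
```

Same bytes, two readings, and nothing in the bytes themselves to tell 
you which was meant.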

It's only since the 1980s that ASCII, actual 7-bit US ASCII, has become 
an unambiguous standard. But that's okay, because that merely allowed 
people to create dozens of 7-bit and 8-bit variations on ASCII, all 
incompatible with each other, and *call them ASCII* regardless of the 
actual standard name.

Between the ambiguities in actual ASCII and the common practice of 
labelling non-ASCII encodings as ASCII, I can categorically say that 
mojibake has always been 
possible in so-called "plain text". If you haven't noticed it, it was 
because you were only exchanging documents with people who happened to 
use the same set of characters as you.


> - unicode 'number-boxes' (what are these called?) 

They are missing character glyphs, and they have nothing to do with 
Unicode. They are due to deficiencies in the text font you are using.
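
One way to convince yourself that the character is intact even when your 
font shows a box is to ask Python what it is. This sketch uses only the 
standard unicodedata module; describe() is a hypothetical helper name:

```python
import unicodedata

def describe(ch):
    """Return the code point and official name of a single character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))

# These work whether or not your font has a glyph for the character.
print(describe("\u00b6"))   # U+00B6 PILCROW SIGN
print(describe("\u20ac"))   # U+20AC EURO SIGN
```

If the name comes back correctly, the data is fine and only the rendering 
is at fault.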

Admittedly with Unicode's 0x110000 possible code points (and even more 
possible glyphs, since a single code point can have multiple glyphs) it 
isn't surprising that most font designers have neither the time, the 
skill nor the desire to create 
a glyph for every single code point. But then the same applies even for 
more restrictive 8-bit encodings -- sometimes font designers don't even 
bother providing glyphs for *ASCII* characters.

(E.g. they may only provide glyphs for uppercase A...Z, not lowercase.)
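
As for the size of the job facing a font designer, the arithmetic on the 
number of code points can be checked directly. A quick sketch, assuming 
Python 3.3 or later, where sys.maxunicode is always 0x10FFFF:

```python
import sys

# Code points run from U+0000 to U+10FFFF inclusive.
highest = sys.maxunicode            # 0x10FFFF
total = highest + 1                 # 0x110000 == 1114112

# 2048 of those (U+D800..U+DFFF) are surrogates, which can never be
# characters in their own right.
surrogates = 0xDFFF - 0xD800 + 1
non_surrogate = total - surrogates  # 1112064
print(hex(total), total, non_surrogate)
```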

> - Worst of all what we
> *dont* see -- how many others dont see what we see?

Again, this is a deficiency of the font. There are very few code points in 
Unicode which are intended to be invisible, e.g. space, newline, zero-
width joiner, control characters, etc., but they ought to be equally 
invisible to everyone. No printable character should ever be invisible in 
any decent font.
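
If you want to check whether a character is one of those that is supposed 
to be invisible, its Unicode general category is a reasonable first guess. 
The set of categories below is a rough heuristic of my own, not an 
official Unicode property, and expected_invisible() is a hypothetical 
helper:

```python
import unicodedata

# Categories normally drawn with no visible ink: space separators,
# line and paragraph separators, control and format characters.
# Heuristic only; a few format characters do have visible glyphs.
INVISIBLE_CATEGORIES = {"Zs", "Zl", "Zp", "Cc", "Cf"}

def expected_invisible(ch):
    """Guess from the general category whether ch should render blank."""
    return unicodedata.category(ch) in INVISIBLE_CATEGORIES

samples = {
    "space": " ",
    "newline": "\n",
    "zero-width joiner": "\u200d",
    "letter A": "A",
}
report = {name: expected_invisible(ch) for name, ch in samples.items()}
print(report)
```

Anything this reports as False that still shows up blank on your screen 
points to the font rather than to the text.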


> I never knew of any of this in the good ol days of ASCII

You must have been happy with a very impoverished set of symbols, then.


> ¶ Passive voice is often the best choice in the interests of political
> correctness
> 
> It would be a pleasant surprise if everyone sees a pilcrow at start of
> line above

I do.



-- 
Steven D'Aprano
http://import-that.dreamwidth.org/


