[Tutor] name shortening in a csv module output

Steven D'Aprano steve at pearwood.info
Fri Apr 24 04:46:20 CEST 2015


On Fri, Apr 24, 2015 at 12:33:57AM +0100, Alan Gauld wrote:
> On 24/04/15 00:15, Jim Mooney wrote:
> >Pretty much guesswork.
> >Alan Gauld
> >-- 
> >This all sounds suspiciously like the old browser wars
> 
> It's more about history. Early text encodings all worked in a single
> byte, which is limited to 256 patterns.

Oh it's much more complicated than that!

The first broadly available standard encoding was ASCII, which was 
*seven bits*, not even a full byte. It was seven bits so that there was 
a spare bit available for a parity check (error detection) when sending 
over teleprinter links or some such thing.

(There were other encodings from around the same era, like EBCDIC, an 
8-bit code used on IBM mainframes.)

With seven bits, you can only have 128 characters. Thirty-three of 
those are used for control characters (which used to be important; now 
only two or three of them are still commonly used, but we're stuck with 
them forever), so ASCII is greatly impoverished, even for American 
English. (E.g. there is no cent sign.)
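
You can see how cramped that is from Python 3's interactive 
interpreter (a quick sketch; the 'ascii' codec only accepts code 
points 0 to 127):

>>> 'Hello World!'.encode('ascii')   # plain English text is fine
b'Hello World!'
>>> '¢'.encode('ascii')              # but not even the cent sign fits
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character '\xa2' in position 0: ordinal not in range(128)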

Hence the 1970s through 90s saw an explosion of "code pages" on DOS and 
Windows systems, and national standards which extended ASCII to use a 
full 8 bits (plus a few 16-bit and variable-width standards as well). 
Fortunately people rarely swapped documents from one platform to 
another.

Then, in the late 1990s, Internet access started becoming common, and 
what do people do on the Internet? They exchange files. Chaos and 
confusion everywhere...


> That's simply not enough to cover all the alphabets around. So people
> developed their own encodings for every computer platform, printer,
> country and combinations thereof. Unicode is an attempt to get away
> from that, but the historical junk is still out there. And Unicode is
> not yet the de-facto standard.

Not quite. Unicode is an actual real standard, and pretty much everyone 
agrees that it is the only reasonable choice[1] as a standard. It is 
quite close to being a standard in practice as well as on paper: 
Unicode is more or less the language of the Internet; something like 
85% of web pages now use UTF-8; Linux and Mac desktops use UTF-8; 
Windows provides optional Unicode interfaces for its programming APIs; 
and most modern languages (with the exception of PHP, what a surprise) 
provide at least moderate Unicode support.

Sadly, though, you are right that there are still historical documents 
to deal with, and (mostly Windows) systems which even today still 
default to legacy encodings rather than Unicode. We'll be dealing with 
legacy encodings for decades, especially since East Asian countries lag 
behind the West in Unicode adoption, but Unicode has won.


> Now the simple thing to do would be just have one enormous character
> set that covers everything. That's Unicode 32 bit encoding.

No it isn't :-)

The character set and the encoding are *not the same*. We could have a 
two-character set with a 32-bit encoding, if we were nuts:

Two characters only: A and B

A encodes as four bytes 00 00 00 00
B encodes as four bytes FF FF FF FF

and every other combination of bytes is an error.

The set of characters provided by Unicode is the same regardless of how 
those characters are encoded to and from bytes.
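
In Python terms, a str deals purely in code points; the encoding only 
comes into play when you convert the text to bytes. A tiny sketch:

>>> ord('A')                 # the code point is just a number
65
>>> 'A'.encode('utf-8')      # how it becomes bytes depends on the encoding
b'A'
>>> 'A'.encode('utf-16-be')
b'\x00A'
>>> 'A'.encode('utf-32-be')
b'\x00\x00\x00A'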

The official standard is the "Universal Character Set", or UCS, which 
supports 1,114,112 different "code points", numbered from U+0000 to 
U+10FFFF. Each code point represents one of:

- a character (letter, digit, symbol, etc.)
- a control character
- a private-use character
- an unassigned value, reserved for future use
- a special-purpose code point (e.g. the byte order mark, variation 
  selectors, surrogates)
- a combining accent or other combining mark

and so on. Unicode can be considered the same as the UCS plus a bunch 
of other stuff, such as rules for sorting.
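
Python's unicodedata module will tell you what kind of code point you 
are looking at (a small sketch):

>>> import unicodedata
>>> unicodedata.name('é'), unicodedata.category('é')  # Ll = lowercase letter
('LATIN SMALL LETTER E WITH ACUTE', 'Ll')
>>> unicodedata.category('\N{COMBINING ACUTE ACCENT}')  # Mn = combining mark
'Mn'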

Notice that UCS and Unicode deliberately use much less than the full 
4,294,967,296 possible values of a 32-bit quantity. You can fit Unicode 
in 21 bits.

Historically, the UCS started at 16 bits, and may at one point have 
reserved the full 32 bits, but it is now guaranteed to use no more than 
21 bits: U+10FFFF will be the highest code point forever.
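
Python 3 enforces that limit:

>>> chr(0x10FFFF)    # the very last code point
'\U0010ffff'
>>> chr(0x110000)    # one past the end
Traceback (most recent call last):
  ...
ValueError: chr() arg not in range(0x110000)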


> The snag is
> that it takes 4 bytes for every character, which is a lot of 
> space/bandwidth.
> So more efficient encodings were introduced such as Unicode "16 bit" and
> "8 bit", aka utf-8.

Historically, the Universal Character Set and Unicode were only 16 
bits, so the first UCS encoding was UCS-2, which uses two bytes for 
every code point. But they soon realised that they needed more than 
65,536 code points, and UCS-2 is now obsolete.

The current versions of Unicode and the UCS use 21 bits, and it is 
guaranteed that there will be no future expansion. Between them, the 
standards include a bunch of encodings:

UCS-4 uses a full 32 bits per code point; it is defined in the UCS 
standard.

UTF-32 does the same, except in the Unicode standard. They are 
effectively identical, just different names because they live in 
different standards.

UTF-16 is a variable-width encoding, using either 2 or 4 bytes per code 
point. It is effectively an extension of the obsolete UCS-2.

UTF-8 is a variable-width encoding, using 1, 2, 3, or 4 bytes per code 
point. The mechanism works for up to six bytes, but the standard 
guarantees that it will never exceed 4 bytes.

UTF-7 is a "seven-bit clean" version, for systems which cannot deal with 
text data with the high-bit set.

There are a few others in the various standards, but nobody uses them 
:-)
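
You can compare the main encodings from Python (a sketch; the "-be" 
codecs are used here only to leave out the byte order mark that 
Python's plain 'utf-16' and 'utf-32' codecs prepend):

>>> for ch in 'Aπ🐍':   # ASCII letter, Greek letter, emoji
...     print(ascii(ch), len(ch.encode('utf-8')),
...           len(ch.encode('utf-16-be')), len(ch.encode('utf-32-be')))
...
'A' 1 2 4
'\u03c0' 2 2 4
'\U0001f40d' 4 4 4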



> UTF-8 is becoming a standard but it has the complication that it's a
> variable-width standard where a character can be anything from a single
> byte up to 4 bytes long. The "rarer" the character the longer its
> encoding. And unfortunately the nature of the coding is such that it
> looks a lot like other historical encodings, especially some of the
> Latin ones. So you can get something almost sane out of a text by
> decoding it as utf8 but not quite totally sane. And it may be some time
> into your usage before you realise you are actually using the wrong
> encoding!

That's by design! If you use nothing but ASCII characters, then UTF-8 
and ASCII are identical.

If you save a text file containing "Hello World!" using any of the 
other UTF encodings (UTF-16 and -32 in particular), you will get 
something which an old, dumb, non-Unicode-aware application cannot deal 
with. It will treat it as a binary file, because it contains lots of 
zero bytes; C programs will choke on it, since a zero byte looks like 
the end of a string to them.

But with UTF-8, so long as you limit yourself to only ASCII characters, 
old ASCII programs will continue to work fine. This was a very important 
requirement that allowed Unicode to become popular. Editors can ship 
defaulting to UTF-8, and so long as their users only type ASCII 
characters, other non-Unicode programs won't know the difference.
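
All of this is easy to demonstrate (a sketch; 'utf-16-be' is used so 
the output doesn't depend on your machine's byte order):

>>> 'Hello World!'.encode('utf-8') == 'Hello World!'.encode('ascii')
True
>>> 'Hello World!'.encode('utf-16-be')     # full of zero bytes
b'\x00H\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d\x00!'
>>> 'café'.encode('utf-8').decode('latin-1')   # the "almost sane" effect
'cafÃ©'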





[1] There are a very few people in Japan who use TRON instead, because 
TRON separates Japanese characters from their identical Chinese and 
Korean characters -- think about having a separate English A, French A 
and German A. Pretty much nobody else uses it, and precious few 
Japanese. It's not quite 100% due to Japanese nationalism; there are 
some good reasons for wanting to separate the characters, but it's 
probably 97% Japanese nationalism. The Japanese, Chinese and Korean 
governments, as well as linguists, are all in agreement that despite a 
few minor differences, the three languages share a common character set.


-- 
Steve

