[Tutor] name shortening in a csv module output
Steven D'Aprano
steve at pearwood.info
Fri Apr 24 04:46:20 CEST 2015
On Fri, Apr 24, 2015 at 12:33:57AM +0100, Alan Gauld wrote:
> On 24/04/15 00:15, Jim Mooney wrote:
> >Pretty much guesswork.
> >Alan Gauld
> >--
> >This all sounds suspiciously like the old browser wars
>
> Its more about history. Early text encodings all worked in a single byte
> which is
> limited to 256 patterns.
Oh it's much more complicated than that!
The first broadly available standard encoding was ASCII, which was
*seven bits*, not even a full byte. It was seven bits so that there was
an extra bit available for error correction when sending over telex or
some such thing.
(There were other encodings older than ASCII, like EBCDIC, an 8-bit
encoding used on IBM mainframes.)
With seven bits, you can only have 128 characters. Thirty-odd of those
are control characters (which used to be important; now only two or
three of them are still commonly used, but we're stuck with them
forever), so ASCII is greatly impoverished, even for American English.
(E.g. there is no cent symbol.)
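You can see ASCII's limits directly in Python (this example is mine,
not from the original post) -- the 'ascii' codec only covers code
points 0 through 127, so the cent sign (U+00A2) is rejected:

```python
# Plain English text fits in ASCII...
print("Hello".encode("ascii"))   # b'Hello'

# ...but the cent sign does not:
try:
    "price: 99¢".encode("ascii")
except UnicodeEncodeError as e:
    print("cannot encode:", e.reason)
```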
Hence the 1970s through 90s saw an explosion of vendor "code pages"
and national standards which extended ASCII to use a full 8 bits
(plus a few 16-bit and variable-width standards as well). Fortunately
people rarely swapped documents from one platform to another.
Then, in the late 1990s, Internet access started becoming common, and
what do people do on the Internet? They exchange files. Chaos and
confusion everywhere...
> That's simply not enough to cover all the
> alphabets
> around. So people developed their own encodings for every computer
> platform,
> printer, county and combinations thereof. Unicode is an attempt to get away
> from that, but the historical junk is still out there. And unicode is
> not yet
> the de-facto standard.
Not quite. Unicode is an actual real standard, and pretty much everyone
agrees that it is the only reasonable choice[1] as a standard. It is
quite close to being a standard in practice as well as on paper: Unicode
is more or less the language of the Internet: something like 85% of web
pages now use UTF-8, Linux and Mac desktops use UTF-8, Windows provides
optional Unicode interfaces for its programming APIs, and most modern
languages (with the exception of PHP, what a surprise) provide at least
moderate Unicode support.
Sadly though you are right that there are still historical documents to
deal with and (mostly Windows) systems which even today still default to
using legacy encodings rather than Unicode. We'll be dealing with legacy
encodings for decades, especially since East Asian countries lag behind
the West for Unicode adoption, but Unicode has won.
> Now the simple thing to do would be just have one enormous character
> set that covers everything. That's Unicode 32 bit encoding.
No it isn't :-)
The character set and the encoding are *not the same*. We could have a
two-character set with a 32-bit encoding, if we were nuts:
Two characters only: A and B
A encodes as four bytes 00 00 00 00
B encodes as four bytes FF FF FF FF
and every other combination of bytes is an error.
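That deliberately silly character set is easy to implement, which makes
the point concrete. Here's a sketch of that hypothetical codec in
Python (the names ENCODE/DECODE are mine):

```python
# Toy codec: a two-character set with a 32-bit encoding.
# 'A' -> 00 00 00 00, 'B' -> FF FF FF FF, anything else is an error.
ENCODE = {"A": b"\x00\x00\x00\x00", "B": b"\xff\xff\xff\xff"}
DECODE = {v: k for k, v in ENCODE.items()}

def encode(text):
    # Raises KeyError for any character outside the two-character set.
    return b"".join(ENCODE[c] for c in text)

def decode(data):
    if len(data) % 4:
        raise ValueError("truncated input")
    # Raises KeyError for any four-byte group that isn't valid.
    return "".join(DECODE[data[i:i+4]] for i in range(0, len(data), 4))

print(encode("ABBA"))
print(decode(encode("ABBA")))  # round-trips back to 'ABBA'
```

Two characters, four bytes each: the character set and the encoding are
independent choices.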
The set of characters provided by Unicode is the same regardless of how
those characters are encoded to and from bytes.
The official standard is the "Universal Character Set", or UCS, which
supports 1,114,112 different "code points", numbered from U+0000 to
U+10FFFF. Each code point is one of:
- a character (letter, digit, symbol, etc.)
- a control character
- a private-use code point
- a code point reserved for future use
- a non-character (e.g. BOM, variation selectors, surrogates)
- a combining accent
and so on. Unicode can be considered the same as UCS plus a bunch of
other stuff, such as rules for sorting.
Notice that UCS and Unicode deliberately use much less than the full
4,294,967,296 possible values of a 32-bit quantity. You can fit Unicode
in 21 bits.
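You can check that arithmetic from the Python prompt (my example, not
from the original post):

```python
# The highest code point is U+10FFFF, which fits in 21 bits.
print(0x10FFFF)                 # 1114111, so 1114112 code points in all
print((0x10FFFF).bit_length())  # 21
print(repr(chr(0x10FFFF)))      # the last code point Python accepts
try:
    chr(0x110000)               # one past the end...
except ValueError as e:
    print(e)                    # ...is rejected outright
```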
Historically, the UCS started at 16 bits, and at one point reserved the
full 32 bits, but it is now guaranteed to use no more than 21 bits:
U+10FFFF will be the highest code point forever.
> The snag is
> that it takes 4 bytes for every character, which is a lot of
> space/bandwidth.
> So more efficient encodings were introduced such as Unicode "16 bit" and
> "8 bit", aka utf-8.
Historically, the Universal Character Set and Unicode were originally
only 16 bits, so the first UCS encoding was UCS-2, which uses two bytes
for every code point. But they soon realised that they needed more than
65,536 code points, and UCS-2 is now obsolete.
The current version of Unicode and the UCS uses 21 bits, with a
guarantee that there will be no future expansion. The standards include
a bunch of encodings:
UCS-4 uses a full 32-bits per code point; it is defined in the UCS
standard.
UTF-32 does the same, except in the Unicode standard. They are
effectively identical, just different names because they live in
different standards.
UTF-16 is a variable-width encoding, using either 2 bytes or 4. It is
effectively an extension to the obsolete UCS-2.
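You can watch UTF-16's variable width from Python (my illustration):
code points up to U+FFFF take two bytes, anything higher takes four,
as a "surrogate pair". I use 'utf-16-be' here so the output isn't
padded with a byte-order mark.

```python
# Basic Multilingual Plane character: two bytes.
print(len("A".encode("utf-16-be")))           # 2

# U+1F600 (an emoji) is above U+FFFF: four bytes.
print(len("\U0001F600".encode("utf-16-be")))  # 4
```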
UTF-8 is a variable-width encoding, using 1, 2, 3, or 4 bytes. The
mechanism works for up to six bytes, but the standard guarantees that it
will never exceed 4 bytes.
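Here is one character from each UTF-8 width class, again my own
example rather than anything from the original post:

```python
# U+0041, U+00E9, U+20AC, U+1F600: 1, 2, 3 and 4 bytes respectively.
samples = ["A", "é", "€", "\U0001F600"]
for ch in samples:
    print("U+%04X -> %d byte(s)" % (ord(ch), len(ch.encode("utf-8"))))
```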
UTF-7 is a "seven-bit clean" version, for systems which cannot deal with
text data with the high-bit set.
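UTF-7 keeps every byte below 128 by smuggling non-ASCII characters
through a modified base64; Python still ships the codec, so you can
try it yourself (my example):

```python
data = "café".encode("utf-7")
print(data)                        # the é becomes an ASCII-safe escape
print(all(b < 128 for b in data))  # True: seven-bit clean
print(data.decode("utf-7"))       # round-trips back to 'café'
```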
There are a few others in the various standards, but nobody uses them
:-)
> UTF-8 is becoming a standard but it has the complication that its a
> variable
> width standard where a character can be anything from a single byte up
> to 4 bytes long. The "rarer" the character the longer its encoding. And
> unfortunately the nature of the coding is such that it looks a lot like
> other historical encodings, especially some of the Latin ones. So you can
> get something almost sane out of a text by decoding it as utf8 but not
> quite totally sane. And it maybe some time into your usage before you
> realise you are actually using the wrong code!
That's by design! If you use nothing but ASCII characters, then UTF-8
and ASCII are identical.
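Easy to verify (my example): pure-ASCII text produces byte-for-byte
identical output under both codecs.

```python
s = "Hello World!"
print(s.encode("ascii") == s.encode("utf-8"))  # True
print(s.encode("utf-8"))                       # b'Hello World!'
```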
If you save a text file "Hello World!" using any of the other UTF
encodings (UTF-16, -32 in particular), you will get something which an
old, dumb non-Unicode aware application cannot deal with. It will treat
it as a binary file, because it contains lots of zero bytes. C programs
will choke on it.
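Here is where those zero bytes come from (my demonstration): for the
same ASCII text, UTF-8 output contains no NUL bytes, while UTF-16 and
UTF-32 are full of them, so a C-style string routine would stop at the
first one.

```python
s = "Hello World!"
for enc in ("utf-8", "utf-16", "utf-32"):
    data = s.encode(enc)
    print(enc, len(data), b"\x00" in data)
```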
But with UTF-8, so long as you limit yourself to only ASCII characters,
old ASCII programs will continue to work fine. This was a very important
requirement that allowed Unicode to become popular. Editors can ship
defaulting to UTF-8, and so long as their users only type ASCII
characters, other non-Unicode programs won't know the difference.
[1] There are a very few people in Japan who use TRON instead, because
TRON separates Japanese characters from their identical Chinese and
Korean characters -- think about having a separate English A, French A
and German A. Pretty much nobody else uses it, and precious few
Japanese. It's not quite 100% due to Japanese nationalism, there are
some good reasons for wanting to separate the characters, but it's
probably 97% Japanese nationalism. The Japanese, Chinese and Korean
governments, as well as linguists, are all in agreement that despite a
few minor differences, the three languages share a common character set.
--
Steve