Changing filenames from Greeklish => Greek (subprocess complain)

Chris Angelico rosuav at gmail.com
Sat Jun 8 14:47:53 EDT 2013


On Sun, Jun 9, 2013 at 4:01 AM, Νικόλαος Κούρας <nikos.gr33k at gmail.com> wrote:
> Hold on!
>
> In the beginning there was ASCII with values 0-127, and then there was
> Unicode with ASCII's 0-127 plus I don't know how many more?
>
> Now ASCII needs 1 byte to store a single character, while Unicode needs 2
> bytes to store a character, and that is because it has more than 256
> (2^8) characters to store?
>
> Is this correct?

No. Let me start from the beginning.

Computers don't work with characters, or strings, natively. They work
with numbers. To be specific, they work with bits; and it's only by
convention that we can work with anything larger. For instance,
there's a VERY common convention around the PC world that a set of
bits can be interpreted as a signed integer (two's complement: if the
highest bit is set, the number is negative). There are also standards
for floating-point (IEEE
754), and so on.
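
(If you want to see that convention in action, here's a quick sketch
at a Python 3 prompt - the same eight bits, read both ways:)

>>> int.from_bytes(b'\xff', 'big', signed=False)  # all eight bits set
255
>>> int.from_bytes(b'\xff', 'big', signed=True)   # same bits, read as signed
-1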

ASCII is a character set. It defines a mapping of numbers to
characters - for instance, @ is 64, SOH is 1, $ is 36, etcetera,
etcetera. There are 128 such mappings. Since they all fit inside a
7-bit number, there's a trivial way to represent ASCII characters in a
PC's 8-bit byte: you just leave the high bit clear and use the other
seven. There have been various schemes for using the eighth bit -
serial ports with parity, WordStar (I think) marking the ends of
words, and most notably, Extended ASCII schemes that give you another
whole set of 128 characters. And that was the beginning of Code Pages,
because nobody could agree on what those extra 128 should be.
Norwegians used Norwegian, the Greeks were taught their Greek,
Arabians created themselves an Arabian codepage with the speed of
summer lightning, and Hebrews allocated from 255 down to 128, which is
absolutely frightening. But I digress.
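
(You can watch that disagreement happen at a Python 3 prompt: here's
one and the same byte, interpreted per two different ISO-8859
codepages - Latin-1 and Greek:)

>>> b'\xc5'.decode('latin-1')
'Å'
>>> b'\xc5'.decode('iso8859-7')
'Ε'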

There were a variety of multi-byte schemes devised at various times,
but we'll ignore all of them and jump straight to Unicode. With
Unicode, there's (theoretically) no need to use any other system ever
again, because whatever character you want, it'll exist in Unicode. In
theory, of course; there are debates over that. Now, Unicode
currently defines an "address space" of a little over 20 bits (code
points run from 0 to 0x10FFFF), and in a throwback to the first
programming I ever did, it's a segmented system: seventeen planes
(numbered 0 through 16) of 65,536 characters each. (Fortunately the
planes are identified by low numbers, not high numbers, and there's
no stupidity of overlapping planes the way the 8086 did with memory!)
The highest planes are special (plane 14 has a few special-purpose
characters, planes 15 and 16 are for private use), and most of the
middle ones have no characters assigned to them, so for the most
part, you'll see characters from the first three planes.
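
(A quick illustration at a Python 3 prompt - a code point's plane is
just its high-order bits:)

>>> hex(ord('A'))     # Basic Latin: plane 0
'0x41'
>>> hex(ord('Ω'))     # Greek: still plane 0
'0x3a9'
>>> hex(ord('𝐀'))     # mathematical bold capital A: plane 1
'0x1d400'
>>> ord('𝐀') >> 16    # the plane number, directly
1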

So what do we now have? A mapping of characters to "code points",
which are numbers. (I'm leaving aside the issues of combining
characters and such for the moment.) But computers don't work with
numbers, they work with bits. Somehow we have to store those bits in
memory.

There are a good few ways to do that; one is to note that every
Unicode character can be represented inside 32 bits, so we can use the
standard integer scheme safely. (Since they fit inside 31 bits, we
don't even need to care if it's signed or unsigned.) That's called
UTF-32 or UCS-4, and it's a great way to handle the full Unicode range
in a manner that makes a Texan look agoraphobic. Wide builds of Python
up to 3.2 did this. Or you can try to store them in 16-bit numbers,
but then you have to worry about the ones that don't fit in 16 bits,
because it's really hard to squeeze 21 bits of information into 16
bits of storage. UTF-16 is one way to do this; special numbers
("surrogates") mean "grab another number". It has its issues, but is
(in my opinion,
unfortunately) fairly prevalent. Narrow builds of Python up to 3.2 did
this. Finally, you can use a more complicated scheme that uses
anywhere from 1 to 4 bytes for each character, by carefully encoding
information into the top bits of each byte - if the top bit is set,
you're looking at part of a multi-byte character. That's how UTF-8
works, and it's probably the most prevalent
disk/network encoding.
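
(Here's a rough size comparison at a Python 3 prompt, using a short
Greek string; I'm picking the explicit big-endian codecs so no byte
order mark muddies the counts:)

>>> s = 'Νίκος'
>>> len(s)
5
>>> len(s.encode('utf-32be'))   # four bytes per character, always
20
>>> len(s.encode('utf-16be'))   # two bytes each - these are all in the BMP
10
>>> len('𝐀'.encode('utf-16be')) # but outside the BMP, a surrogate pair
4
>>> len(s.encode('utf-8'))      # ASCII takes one byte, Greek takes two
10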

All of the UTF-X systems are called "UCS Transformation Formats" (UCS
meaning Universal Character Set, roughly "Unicode"). They are mappings
from Unicode numbers to bytes. Between Unicode and UTF-X, you have a
mapping from character to byte sequence.

> Now UTF-8, latin-iso, greek-iso etc. are WAYS of storing characters onto
> the hard drive?

The ISO standard 8859 specifies a number of ASCII-compatible
encodings, referred to as ISO-8859-1 through ISO-8859-16. You've been
working with ISO-8859-1, also called Latin-1, and ISO-8859-7, which
has your Greek characters in it. These are all ways of translating
characters into numbers; and since they all fit within 8 bits, they're
most commonly represented on PCs with single bytes.
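
(For instance, Greek small alpha is one byte in ISO-8859-7, two in
UTF-8, and simply not representable in Latin-1:)

>>> 'α'.encode('iso8859-7')
b'\xe1'
>>> 'α'.encode('utf-8')
b'\xce\xb1'
>>> 'α'.encode('latin-1')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'latin-1' codec can't encode character '\u03b1' in position 0: ordinal not in range(256)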

> So taken from the above example (the closest I could think of), the way I
> understand them is:
>
> A 'string' can be of (Unicode's or ASCII's) type, and that type needs a
> way (that's a charset) to store this string onto the HDD as a sequence of
> bytes?

A Python 3 'string' is always a series of Unicode characters. How
they're represented in memory doesn't matter, but as of Python 3.3
(PEP 393) it's a fairly compact scheme that uses 1, 2, or 4 bytes per
character - whichever is the smallest that fits every character in
that string. To store that string on your hard disk, send it across a
network, or transmit it to another process, you need to encode it as
bytes, somehow. The UCS Transformation Formats are specifically
designed for this, and most of the time, UTF-8 is going to be the best
option. It's compact, it's well known, and usually, it'll do
everything you want. The only thing it won't do is let you quickly
locate the Nth character, which is why it makes a poor in-memory
format.
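
(A typical round trip, sketched at a Python 3 prompt - the filename
here is just an example:)

>>> s = 'Νικόλαος'
>>> with open('greek.txt', 'w', encoding='utf-8') as f:
...     f.write(s)      # encoded to UTF-8 bytes on the way out
...
8
>>> with open('greek.txt', encoding='utf-8') as f:
...     f.read()        # decoded back to a str on the way in
...
'Νικόλαος'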

Fortunately, Python lets us hide away pretty much all those details,
just as it lets us hide away the details of what makes up a list, a
dictionary, or an integer. You can safely assume that the string "foo"
is a string of three characters, which you can work with as
characters. The chr() and ord() functions let you switch between
characters and numbers, and str.encode() and bytes.decode() let you
switch between characters and byte sequences. Once you get your head
around the differences between those three, it all works fairly
neatly.
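
(Concretely, at a Python 3 prompt:)

>>> ord('Δ')                     # character -> number
916
>>> chr(916)                     # number -> character
'Δ'
>>> 'Δ'.encode('utf-8')          # character -> byte sequence
b'\xce\x94'
>>> b'\xce\x94'.decode('utf-8')  # byte sequence -> character
'Δ'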

Chris Angelico


