Changing filenames from Greeklish => Greek (subprocess complain)

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sun Jun 9 02:25:10 EDT 2013


On Sun, 09 Jun 2013 07:46:40 +0300, Νικόλαος Κούρας wrote:

> Why does every character in a character set need to be associated with
> a numeric value?

Because computers are digital, not analog, and because bytes are numbers.

Here are a few of the 256 possible bytes, written in binary, decimal and 
hexadecimal:

0b00000000 0 0x00
0b00000001 1 0x01
0b00000010 2 0x02
[...]
0b01111111 127 0x7F
0b10000000 128 0x80
[...]
0b11111110 254 0xFE
0b11111111 255 0xFF
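
If you want to reproduce that table yourself, here is a minimal sketch 
for Python 3. The helper name show_byte is my own invention for this 
example, not anything built in:

```python
def show_byte(n):
    """Return one byte value written in binary, decimal and hexadecimal."""
    return "0b{:08b} {} 0x{:02X}".format(n, n, n)

# The same rows as in the table above.
for n in (0, 1, 2, 127, 128, 254, 255):
    print(show_byte(n))
```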


EVERYTHING in a computer is a number, because everything is stored as 
bytes. Text is stored as bytes. Sound files are stored as bytes. Images 
are stored as bytes. Programs are stored as bytes. So everything is 
stored as numbers. But the *meaning* we give to those numbers depends on 
what we do with them, whether we treat them as characters, bitmapped 
images, floating point values, or something else.
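
To make that concrete, here is a rough Python 3 sketch that takes the 
same four bytes and reads them four different ways. The byte values and 
the variable names are arbitrary choices of mine; the little-endian 
format codes come from the standard struct module:

```python
import struct

raw = bytes([0x41, 0x42, 0x43, 0x44])     # four bytes: 65, 66, 67, 68

as_numbers = list(raw)                     # four small integers
as_text = raw.decode("ascii")              # four characters
as_uint32 = struct.unpack("<I", raw)[0]    # one little-endian 32-bit integer
as_float = struct.unpack("<f", raw)[0]     # one little-endian 32-bit float

print(as_numbers, repr(as_text), as_uint32, as_float)
```

The bytes never change. Only the interpretation does.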

Once we decide we want to store the character "A" as bytes, we need to 
decide which number it should be. That is the job of the charset.

ASCII:

65 <--> 'A'
66 <--> 'B'
67 <--> 'C'
etc.
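
In Python 3 you can see this mapping with the built-in ord() and chr() 
functions. A small sketch:

```python
for ch in "ABC":
    number = ord(ch)              # character -> number
    assert chr(number) == ch      # number -> character
    print(number, "<-->", repr(ch))

# Encoding with the ASCII charset stores exactly those numbers as bytes.
encoded = "ABC".encode("ascii")
print(list(encoded))
```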


> I mean couldn't we just have characters sets that wouldn't have numeric
> associations like:
> 
> 'A'  => encoding process(i.e. utf-8) => bytes bytes => decoding
> process(i.e. utf-8) =>  character 'A'

No. How would you store it in a computer's memory, or on a hard drive? By 
carving a tiny, microscopic "A" onto the hard drive? How would you read 
it back?

It is theoretically possible to build an analog computer, out of 
clockwork, or water flowing through pipes, or something, but nobody 
really bothers because it is much harder and not very useful.


> An ordinal = ordered numbers like 7,8,9,10 and so on?

Yes.


> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for 
> values up to 256?

Because then how do you tell when you need one byte, and when you need 
two? If you read two bytes, and see 0x4C 0xFA, does that mean two 
characters, with ordinal values 0x4C and 0xFA, or one character with 
ordinal value 0x4CFA?

UTF-8 solves this problem by reserving some values to mean "this byte, on 
its own", and others to mean "this byte, plus the next byte, together", 
and so forth, up to four bytes.

If you look up UTF-8 on Wikipedia, you will see more about this.
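
Here is a simplified sketch of that idea in Python 3. The function 
utf8_sequence_length is a made-up helper, and it only looks at the bit 
pattern of the first byte; a real decoder also has to reject overlong 
forms and a few other invalid sequences, which I ignore here:

```python
def utf8_sequence_length(lead_byte):
    """How many bytes the UTF-8 sequence starting with lead_byte occupies."""
    if lead_byte < 0x80:            # 0xxxxxxx: this byte, on its own
        return 1
    if 0xC0 <= lead_byte < 0xE0:    # 110xxxxx: this byte plus one more
        return 2
    if 0xE0 <= lead_byte < 0xF0:    # 1110xxxx: this byte plus two more
        return 3
    if 0xF0 <= lead_byte < 0xF8:    # 11110xxx: this byte plus three more
        return 4
    # 10xxxxxx is a continuation byte and can never start a sequence.
    raise ValueError("not a valid first byte: 0x{:02X}".format(lead_byte))

# Ordinals 0x4C, 0xFA and 0x4CFA from the example, plus one above 0xFFFF.
for ch in ("\u004c", "\u00fa", "\u4cfa", "\U0001F600"):
    encoded = ch.encode("utf-8")
    assert utf8_sequence_length(encoded[0]) == len(encoded)
    print("U+{:04X}".format(ord(ch)), [hex(b) for b in encoded])
```

Note that ordinal 0xFA does not come out as the single byte 0xFA. It 
takes two bytes, and that is exactly what keeps it from being confused 
with part of a longer sequence.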

> UTF-8 and UTF-16 and UTF-32
> I thought the number beside UTF- was to declare how many bits the 
> character set was using to store a character into the hdd, no?

Not exactly, but close. UTF-32 always uses a single 32-bit (4 byte) value 
per code point. UTF-16 mostly uses one 16-bit value, but sometimes it 
combines two 16-bit values to make a surrogate pair. UTF-8 uses 8-bit 
values, but sometimes it combines two, three or four of them to represent 
a single code point.
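
You can check the sizes for yourself in Python 3. In this sketch the 
helper encoded_sizes and the sample characters are my own choices, and I 
use the "-le" codec variants so that no byte order mark is added to the 
counts:

```python
def encoded_sizes(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32."""
    return tuple(len(ch.encode(enc))
                 for enc in ("utf-8", "utf-16-le", "utf-32-le"))

# Latin A, Greek capital omega, the euro sign, and an emoji above U+FFFF.
for ch in ("A", "\u03a9", "\u20ac", "\U0001F600"):
    print("U+{:04X}".format(ord(ch)), encoded_sizes(ch))
```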

> > "Narrow" Unicode uses two bytes per character. Since two bytes is only
> > enough for about 65,000 characters, not 1,000,000+, the rest of the
> > characters are stored as pairs of two-byte "surrogates".
> 
> Can you please explain this line "the rest of the characters are stored 
> as pairs of two-byte "surrogates"" more easily for me to understand it?
> I'm still having trouble understanding what a surrogate is.

Look up UTF-16 and "surrogate pair" on Wikipedia.

But basically, there are 65,536 different possible 16-bit values 
available for UTF-16 to use. Some of those values are reserved to mean 
"this value is not a character, it is half of a surrogate pair". Since 
they are *pairs*, they must always come in twos. A surrogate pair 
together makes up one valid character. Half of a surrogate pair, on its 
own, is an error.
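
For the curious, here is a sketch of the arithmetic in Python 3. The 
function surrogate_pair is a hypothetical helper I wrote for this 
example; the constants 0xD800 and 0xDC00 are the start of the reserved 
high and low surrogate ranges:

```python
def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its two UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a pair")
    offset = code_point - 0x10000        # a 20-bit number
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = surrogate_pair(0x1F600)
print(hex(high), hex(low))

# Python's own UTF-16 encoder should produce the same two 16-bit values.
expected = bytes([high >> 8, high & 0xFF, low >> 8, low & 0xFF])
assert "\U0001F600".encode("utf-16-be") == expected

# Half of a pair followed by an ordinary character is rejected.
try:
    b"\xd8\x3d\x00\x41".decode("utf-16-be")
except UnicodeDecodeError as err:
    print("error:", err.reason)
```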


A lot of this complexity exists for historical reasons. For example, 
when Unicode was first invented, it had room for only about 65 thousand 
characters, and a fixed 16 bits was all you needed. But it was soon 
learned that 65 thousand was not enough (there are more than 65,000 
Asian characters alone!) and so UTF-16 developed the trick with 
surrogate pairs to cover the extras.


[...]
> When locale to linux system is set to utf-8 that would mean that the 
> linux applications, should try to encode string into hdd by using 
> system's default encoding to utf-8 and read them back from bytes by
> also using utf-8. Is that correct?

Yes.
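
A small Python 3 sketch of that round trip. The helper round_trip is 
something I made up for illustration, and I pass the encoding explicitly 
so the result does not depend on your machine; as far as I know, when 
you leave it out, open() in text mode falls back to whatever 
locale.getpreferredencoding() reports:

```python
import locale
import os
import tempfile

# What the current locale asks for; the value varies from system to system.
print(locale.getpreferredencoding())

def round_trip(text, encoding="utf-8"):
    """Write text to a temporary file, then return (raw bytes, decoded text)."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)                  # characters -> bytes on disk
        with open(path, "rb") as f:
            raw = f.read()                 # the bytes actually stored
        with open(path, "r", encoding=encoding) as f:
            decoded = f.read()             # bytes -> characters again
    finally:
        os.remove(path)
    return raw, decoded

greek = "\u03b1\u03b2\u03b3"               # the Greek letters alpha, beta, gamma
raw, decoded = round_trip(greek)
print(raw, decoded == greek)
```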



-- 
Steven


