Changing filenames from Greeklish => Greek (subprocess complain)

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sun Jun 9 08:31:44 EDT 2013


On Sun, 09 Jun 2013 02:00:46 -0700, Νικόλαος Κούρας wrote:

> Steven wrote:
>>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>>> values up to 256?
> 
>>Because then how do you tell when you need one byte, and when you need
>>two? If you read two bytes, and see 0x4C 0xFA, does that mean two
>>characters, with ordinal values 0x4C and 0xFA, or one character with
>>ordinal value 0x4CFA?
> 
> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I
> meant up to 256, not above 256.

Think about it. Draw up a big table of one million plus characters:

Ordinal   Character
========  ====================
0         NUL control code
1         SOH control code
...
84        LATIN CAPITAL LETTER T
85        LATIN CAPITAL LETTER U
...
255       LATIN SMALL LETTER Y WITH DIAERESIS
256       LATIN CAPITAL LETTER A WITH MACRON
...
8485      OUNCE SIGN


and so on, all the way to 1114111. Now, suppose you read a file, and see 
two bytes, shown in decimal: 84 followed by 85, or in hexadecimal, 0x54 
followed by 0x55.

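How do you tell whether that means two characters, T followed by U, or a 
single character, ℥ (OUNCE SIGN, ordinal 8485)? (Strictly, the bytes 0x54 
0x55 taken together make 0x5455, not 0x2125; the decimal digits are only 
there to make the ambiguity easy to see.)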
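
The same ambiguity can be shown with real bytes. This is only a sketch: 
the bytes 0x21 0x25 and the names raw, as_two and as_one are picked for 
illustration, and Latin-1 stands in for "one byte per character".

```python
raw = bytes([0x21, 0x25])

# Read as two one-byte characters (Latin-1 maps each byte to one ordinal).
as_two = raw.decode('latin-1')

# Read as a single big-endian 16-bit value.
as_one = raw.decode('utf-16be')

print(repr(as_two), repr(as_one))
```

The same two bytes come out as '!%' under the first reading and as ℥ 
under the second, so the reader has to know the rule in advance.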

With UTF-32, you can, because every value takes exactly the same space. 
So a T followed by a U is:

0x00000054
0x00000055

while a single ℥ is:

0x00002125

and it is easy to tell them apart: each block of 4 bytes is exactly one 
character. But notice how many NUL bytes there are? In the three 
characters shown, there are eight NUL bytes. Most text will be filled 
with NUL bytes, which is very wasteful.
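
That count is easy to check. A minimal sketch, assuming Python 3.5 or 
later (for bytes.hex); 'utf-32be' is used so no byte order mark is 
added, and the variable names are only illustrative.

```python
# Encode T, U and OUNCE SIGN as big-endian UTF-32 (no byte order mark).
text = 'TU\u2125'
data = text.encode('utf-32be')

# Every code point takes exactly four bytes.
chunks = [data[i:i + 4] for i in range(0, len(data), 4)]
for ch, chunk in zip(text, chunks):
    print(repr(ch), chunk.hex())

total_bytes = len(data)
nul_bytes = data.count(b'\x00')
print(total_bytes, nul_bytes)
```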

UTF-8 is designed to be compact, and also to be backwards-compatible with 
ASCII. Characters which are in ASCII are encoded as a single byte, so no 
null bytes are used for padding (except for NUL itself, of course). So 
the three characters TU℥ will be:

0x54
0x55
0xE2
0x84
0xA5

Five bytes in total, instead of 12 for UTF-32. The only tricky part is 
that the character with ordinal value 0xE2 (decimal 226, â) cannot be 
encoded as the single byte 0xE2, otherwise you would mistake the three 
bytes 0xE2 0x84 0xA5 as starting with 'â' followed by two more 
characters. And indeed, 'â' is encoded as two bytes:

0xC3
0xA2

Likewise, the character with ordinal value 0xC3 (decimal 195, Ã) is also 
encoded as two bytes:

0xC3
0x83

And so on. This way, there is never any confusion as to whether (say) 
three bytes are three one-byte characters, or one three-byte character.
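
These byte sequences can be confirmed from Python. A short sketch, 
assuming Python 3.5 or later; the names samples, encoded and decoded are 
only illustrative.

```python
# UTF-8 bytes for the characters discussed above, shown in hexadecimal.
samples = ['T', 'U', '\u2125', '\u00e2', '\u00c3']  # T, U, ℥, â, Ã
encoded = {ch: ch.encode('utf-8') for ch in samples}
for ch, raw in encoded.items():
    print(repr(ch), len(raw), raw.hex())

# Going the other way, the five bytes can only be read as T, U, ℥.
decoded = bytes([0x54, 0x55, 0xE2, 0x84, 0xA5]).decode('utf-8')
print(repr(decoded))
```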


>>> UTF-8 and UTF-16 and UTF-32
>>> I thought the number after UTF- declared how many bits the
>>> character set uses to store a character on the hdd, no?
> 
>>Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
>>UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
>>values to make a surrogate pair.
> 
> A surrogate pair is like hitting, for example, Ctrl-A, which is a
> combination character that consists of 2 different characters? Is this
> what a surrogate is? A pair of 2 chars?

Yes, a surrogate pair is a pair of two "characters". But they're not 
*real* characters. They don't exist in any human language. They are just 
values that tell the program "these go together, and count as a single 
character".

(This is why Unicode prefers to talk about *code points* rather than 
characters. Some code points are characters, and some are not.)
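
To see a surrogate pair, encode a code point above 0xFFFF as UTF-16. 
This is a sketch only: U+1D11E (MUSICAL SYMBOL G CLEF) is my own choice 
of example, and the variable names are illustrative.

```python
# U+1D11E does not fit in 16 bits, so UTF-16 stores it as two
# 16-bit units: a high surrogate followed by a low surrogate.
ch = '\U0001D11E'
data = ch.encode('utf-16be')
units = [int.from_bytes(data[i:i + 2], 'big') for i in range(0, len(data), 2)]
print([hex(u) for u in units])

# Recombine the pair by hand to recover the original code point.
high, low = units
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(code_point))
```

Neither 0xD834 nor 0xDD1E is a character on its own; only the pair 
means anything.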

>>UTF-8 uses 8-bit values, but sometimes it combines two, three or four of
>>them to represent a single code-point.
> 
> 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)

Correct, one byte. (Strictly, ord('a') is 97; 65 is 'A'. Either way it 
is below 128.)


> 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is >
> 127 ) 

That looks like two characters to me, 'α' followed by '΄'. That will take 
4 bytes, two for 'α' and two for '΄'.


> 'a Chinese ideogram' to be utf8 encoded needs 4 bytes to be stored
> ? (since ordinal >  65000 )

Not necessarily four bytes. Could be three. Depends on the ideogram.

> The amount of bytes needed to store a character solely depends on the
> character's ordinal value in the Unicode table?

Yes.
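
For UTF-8 the rule can be written down in a few lines. A sketch, not 
anything official: utf8_length is a made-up helper, the sample 
characters are my own choice, and surrogate code points are not handled.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for one code point (surrogates excluded)."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4

# 'a', Greek alpha, Greek tonos, a common CJK ideograph, a rarer one.
examples = ['a', '\u03b1', '\u0384', '\u4e2d', '\U00020000']
for ch in examples:
    print(hex(ord(ch)), utf8_length(ord(ch)), len(ch.encode('utf-8')))
```

The last two columns agree for every row, and the two ideographs show 
why "three or four" depends on which one you pick.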


>>UTF-8 solves this problem by reserving some values to mean "this byte,
>>on its own", and others to mean "this byte, plus the next byte,
>>together", and so forth, up to four bytes.
> 
> Some of the utf-8 bits that are used to represent a character's ordinal
> value are also being used to separate or join the ordinal values
> themselves? Can you give an example please? How are they being
> separated?

Did you look up UTF-8 on Wikipedia like I suggested?
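
In the meantime, the marker bits are easy to see from Python. A rough 
sketch: utf8_bits is a made-up helper name, and the bit patterns in the 
comment are the standard UTF-8 lead and continuation byte layouts.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of ch as a list of 8-digit binary strings."""
    return [format(b, '08b') for b in ch.encode('utf-8')]

# Lead byte patterns: 0xxxxxxx (1 byte), 110xxxxx (2 bytes),
# 1110xxxx (3 bytes), 11110xxx (4 bytes).
# Continuation bytes always look like 10xxxxxx.
for ch in ['T', '\u03b1', '\u2125']:  # T, α, ℥
    print(repr(ch), utf8_bits(ch))

# Stripping the marker bits from E2 84 A5 gives back the ordinal of ℥.
b0, b1, b2 = '\u2125'.encode('utf-8')
ordinal = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
print(hex(ordinal))
```

The leading bits of each byte say "I stand alone", "I start a group of 
N" or "I continue a group"; the remaining x bits carry the ordinal.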


>>Computers are digital and work with numbers.
> 
> So character 'A' <-> 65 (in decimal uses in charset's table)  <->
> 01011100 (as binary stored in disk) <-> 0xEF (as hex, when we open the
> file with a hex editor)
> 
> Is this how the thing works? (above values are fictional)

You can check this in Python:


py> c = 'A'
py> ord(c)
65
py> bin(65)
'0b1000001'
py> hex(65)
'0x41'


py> c = 'α'
py> ord(c)
945
py> c.encode('utf-8')
b'\xce\xb1'
py> c.encode('utf-16be')
b'\x03\xb1'
py> c.encode('utf-32be')
b'\x00\x00\x03\xb1'
py> c.encode('iso-8859-7')
b'\xe1'


-- 
Steven


