Changing filenames from Greeklish => Greek (subprocess complain)

Andreas Perstinger andipersti at gmail.com
Mon Jun 10 04:15:38 EDT 2013


On 10.06.2013 09:10, nagia.retsina at gmail.com wrote:
> On Sunday, 9 June 2013 3:31:44 PM UTC+3, user Steven D'Aprano wrote:
>
>> py> c = 'α'
>> py> ord(c)
>> 945
>
> The number 945 is the ordinal value of the character 'α' in the unicode charset, correct?

Yes, the Unicode character set is essentially a big numbered list of
characters. The entry with number 945 (counting from 0) happens to be 'α'.
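
A minimal sketch of that mapping, assuming Python 3: ord() gives the
number of a character and chr() goes the other way. The variable names
are only for illustration.

```python
# Unicode assigns every character an integer, its code point.
c = 'α'
code_point = ord(c)            # character -> number
as_hex = hex(code_point)       # code points are usually written in hex (U+03B1)
round_trip = chr(code_point)   # number -> character

print(code_point, as_hex, round_trip)
```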

> The command in the python interactive session to show me how many bytes
> this character will take upon encoding to utf-8 is:
>
>>>> s = 'α'
>>>> s.encode('utf-8')
> b'\xce\xb1'
>
> I see that the encoding of this char takes 2 bytes. But why two exactly?

That's how the encoding is designed. Haven't you read the Wikipedia
article that has already been mentioned several times?

> How do i calculate how many bits are needed to store this char into bytes?

You need to understand how UTF-8 works. Read the Wikipedia article.
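
In short, UTF-8 picks the number of bytes from the size of the code
point. The following is only a rough sketch of that rule: the helper
names utf8_length and encode_two_bytes are made up for illustration, and
the sketch ignores special cases such as surrogates.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for a code point (hypothetical helper)."""
    if code_point < 0x80:        # fits in 7 bits:  0xxxxxxx
        return 1
    if code_point < 0x800:       # fits in 11 bits: 110xxxxx 10xxxxxx
        return 2
    if code_point < 0x10000:     # fits in 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return 3
    return 4                     # up to 21 bits:   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx


def encode_two_bytes(code_point):
    """Build the 2-byte form by hand (only meaningful for 0x80..0x7FF)."""
    first = 0b11000000 | (code_point >> 6)            # marker 110 + top 5 bits
    second = 0b10000000 | (code_point & 0b00111111)   # marker 10 + low 6 bits
    return bytes([first, second])


print(utf8_length(ord('a')), utf8_length(ord('α')))
print(encode_two_bytes(ord('α')) == 'α'.encode('utf-8'))
```

945 needs more than 7 bits, so it cannot use the 1-byte form; its bits
are spread over two bytes, which gives 0xce and 0xb1.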

> Trying to do the same here but it gave me no bytes back.
>
>>>> s = 'a'
>>>> s.encode('utf-8')
> b'a'

The encode method returns a bytes object. Its length tells you how
many bytes there are:

 >>> len(b'a')
1
 >>> len(b'\xce\xb1')
2

When it displays a bytes object, the Python interpreter shows every byte
whose value is a printable ASCII character as that character, and all
other bytes as escape sequences:

 >>> ord(b'a')
97
 >>> hex(97)
'0x61'
 >>> b'\x61' == b'a'
True

The Python designers decided to display this byte as b'a' rather than
b'\x61'.
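
To see the raw values regardless of how they are displayed, you can turn
the bytes object into a list of integers. A small sketch, assuming
Python 3 (the name data is arbitrary):

```python
data = 'aα'.encode('utf-8')

# The 'a' byte is displayed as a character, the two bytes of 'α' as \x escapes.
print(data)

# Iterating over a bytes object yields plain integers in the range 0..255.
print(list(data))
print(len(data))
```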

>>py> c.encode('utf-8')
>> b'\xce\xb1'
>
> 2 bytes here. why 2?

Same as your first question.

>> py> c.encode('utf-16be')
>> b'\x03\xb1'
>
> 2 bytes here also. but why 3 different bytes? the ordinal value of
> char 'a' is the same in unicode. the encoding system just takes the
> ordinal value and encodes it, but since it uses 2 bytes, shouldn't these
> 2 bytes be the same?

'utf-16be' is a different encoding scheme, thus it uses other rules to 
determine how each character is translated into a byte sequence.
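
For a character like 'α', UTF-16BE simply stores the code point as a
2-byte big-endian number, which is why the result looks different from
UTF-8. A sketch of that comparison, assuming Python 3; it only holds for
characters whose code point fits in 16 bits (surrogates excluded):

```python
c = 'α'
code_point = ord(c)

# The code point written as two bytes, most significant byte first.
utf16 = code_point.to_bytes(2, 'big')
print(utf16 == c.encode('utf-16be'))

# Same character, three encodings, three different byte sequences.
print(c.encode('utf-8'), c.encode('utf-16be'), c.encode('iso-8859-7'))
```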

>> py> c.encode('iso-8859-7')
>> b'\xe1'
>
> And also does '\x' mean that the value is being represented in hex?
> and when i bin(6) i see '0b1000001'
>
> I should expect to see 8 bits of 1s and 0's. what is the 'b' trying to say?
>
'\x' is an escape sequence and means that the following two characters 
should be interpreted as a number in hexadecimal notation (see also the 
table of allowed escape sequences: 
http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals 
).
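
A short sketch of how the two hex digits after '\x' relate to the byte
value, assuming Python 3 (the variable names are arbitrary):

```python
raw = b'\xe1'

# Indexing a bytes object gives the integer value of that byte.
value = raw[0]
print(value, hex(value))

# The two hex digits 'e1' denote the same number.
print(int('e1', 16) == value)

# In ISO-8859-7 this single byte stands for 'α'.
decoded = raw.decode('iso-8859-7')
print(decoded)
```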

'0b' tells you that the number is printed in binary notation.
Leading zeros are usually discarded when a number is printed:
 >>> bin(70)
'0b1000110'
 >>> 0b100110 == 0b00100110
True
 >>> 0b100110 == 0b0000000000100110
True

It's the same with decimal notation. You wouldn't say 00123 is different 
from 123, would you?
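
If you want to see the leading zeros, you can ask for a fixed width with
format(). A minimal sketch, assuming Python 3; the widths 8 and 16 are
just the ones that fit the examples in this thread:

```python
n = 70
plain = bin(n)               # prefix '0b', no leading zeros
padded = format(n, '08b')    # exactly 8 binary digits, no prefix

# 16 digits for the code point of 'α', matching the two bytes 0x03 0xb1.
alpha_bits = format(ord('α'), '016b')

print(plain, padded, alpha_bits)
```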

Bye, Andreas


