Changing filenames from Greeklish => Greek (subprocess complain)

Steven D'Aprano steve+comp.lang.python at pearwood.info
Mon Jun 10 07:59:03 EDT 2013


On Mon, 10 Jun 2013 00:10:38 -0700, nagia.retsina wrote:

> On Sunday, 9 June 2013 3:31:44 PM UTC+3, Steven D'Aprano wrote:
> 
>> py> c = 'α'
>> py> ord(c)
>> 945
> 
> The number 945 is the ordinal value of the character 'α' in the Unicode
> charset, correct?

Correct.
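
A minimal sketch of that round trip, assuming Python 3 (where str is
Unicode); the variable name code_point is only for illustration:

```python
# ord() maps a one-character string to its Unicode code point;
# chr() maps the code point back to the character.
c = 'α'
code_point = ord(c)

print(code_point)             # 945
print(hex(code_point))        # 0x3b1, conventionally written U+03B1
print(chr(code_point) == c)   # True
```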


> The command in the python interactive session to show me how many bytes
> this character will take upon encoding to utf-8 is:
> 
> >>> s = 'α'
> >>> s.encode('utf-8')
> b'\xce\xb1'
> 
> I see that the encoding of this char takes 2 bytes. But why two exactly?

Because that's how UTF-8 works. If it was a different encoding, it might 
be 4 bytes, or 2, or 1, or 101, or 7, or 3. But it is UTF-8, so it takes 
2 bytes. If you want to understand how UTF-8 works, look it up on 
Wikipedia. 
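
For the impatient, here is a rough sketch of the rule, assuming Python 3.
UTF-8 picks the number of bytes from the range the code point falls in,
and 945 (0x3B1) falls in the two-byte range. The helper names utf8_length
and utf8_two_byte are made up for this illustration; they are not part of
Python.

```python
def utf8_length(ch):
    """Return how many bytes UTF-8 uses for a single character."""
    cp = ord(ch)
    if cp < 0x80:        # U+0000..U+007F: plain ASCII, 1 byte
        return 1
    if cp < 0x800:       # U+0080..U+07FF: 2 bytes
        return 2
    if cp < 0x10000:     # U+0800..U+FFFF: 3 bytes
        return 3
    return 4             # U+10000..U+10FFFF: 4 bytes


def utf8_two_byte(ch):
    """Build the 2-byte UTF-8 form by hand: 110xxxxx 10xxxxxx."""
    cp = ord(ch)
    if not 0x80 <= cp < 0x800:
        raise ValueError('not in the two-byte range')
    # The top 5 bits go in the lead byte, the low 6 bits in the trailer.
    return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])


# Cross-check the hand-written rule against the real encoder.
for ch in 'aα€':
    assert utf8_length(ch) == len(ch.encode('utf-8'))

print(utf8_length('α'))      # 2
print(utf8_two_byte('α'))    # b'\xce\xb1'
```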


> How do I calculate how many bits are needed to store this char as
> bytes?

Every byte is made of 8 bits. There are two bytes. So multiply 8 by 2.
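
The same arithmetic as a small sketch, assuming Python 3. Note the
difference between the bits the code point itself needs and the bits the
encoded form occupies; the variable names are only illustrative.

```python
s = 'α'
encoded = s.encode('utf-8')

n_bytes = len(encoded)              # number of bytes after encoding
n_bits = 8 * n_bytes                # 8 bits per byte
payload_bits = ord(s).bit_length()  # bits needed for the number 945 itself

print(n_bytes)        # 2
print(n_bits)         # 16
print(payload_bits)   # 10
```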


> Trying to do the same here but it gave me no bytes back.
> 
> >>> s = 'a'
> >>> s.encode('utf-8')
> b'a'

There is a byte there. The byte is printed by Python as b'a', which in my 
opinion is a design mistake. That makes it look like a string, but it is 
not a string, and would be better printed as b'\x61'. But regardless of 
the display, it is still a single byte.
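
A quick way to convince yourself, sketched for Python 3 (bytes.hex()
needs 3.5 or later):

```python
encoded = 'a'.encode('utf-8')

print(len(encoded))           # 1, a single byte
print(encoded[0])             # 97, indexing bytes gives an int
print(encoded == b'\x61')     # True, b'a' and b'\x61' are the same object value
print(encoded.hex())          # 61
```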

 
>> py> c.encode('utf-8')
>> b'\xce\xb1'
> 
> 2 bytes here. why 2?

Because that's how UTF-8 works.


>> py> c.encode('utf-16be')
>> b'\x03\xb1'
> 
> 2 bytes here also. but why 3 different bytes?

Because it is a different encoding.


> the ordinal value of char 'a' is the same in unicode.

The same as what?


> the encoding system just takes the ordinal value and encodes it, but
> since it uses 2 bytes, should these 2 bytes be the same?

No.

That's like saying that since a dog in Germany has four legs and one 
head, and a dog in France has four legs and one head, dog should be 
spelled "Hund" in both Germany and France.

Different encodings are like different languages. They spell the same 
word differently.
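
To see the different "spellings" side by side, here is a small sketch,
assuming Python 3 and the four codecs already used in this thread. Each
one produces different bytes, and each decodes back to the same character.
The dict name spellings is just for illustration.

```python
c = 'α'
spellings = {}

for name in ('utf-8', 'utf-16be', 'utf-32be', 'iso-8859-7'):
    data = c.encode(name)
    # Decoding with the same codec always recovers the original character.
    assert data.decode(name) == c
    spellings[name] = data
    print(name, data, len(data))

# utf-8 b'\xce\xb1' 2
# utf-16be b'\x03\xb1' 2
# utf-32be b'\x00\x00\x03\xb1' 4
# iso-8859-7 b'\xe1' 1
```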


>> py> c.encode('utf-32be')
>> b'\x00\x00\x03\xb1'
> 
> every char here takes exactly 4 bytes to be stored. okay.
> 
>> py> c.encode('iso-8859-7')
>> b'\xe1'
> 
> And also, does '\x' mean that the value is being represented in hex?
> And when I bin(6) I see '0b1000001'
> 
> I should expect to see 8 bits of 1s and 0s. What is the 'b' trying to
> say?

"b" for Binary.

Just like 0o1234 uses octal, "o" for Octal.

And 0x123EF uses hexadecimal. "x" for heXadecimal.
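
A short sketch of the three prefixes, assuming Python 3 and using 65 as
the example value (65 is the number whose binary form is 1000001). bin()
drops leading zeros, so to see a full 8 bits you have to ask for padding
with format(). And yes, \x inside a bytes or string literal introduces a
two-digit hexadecimal escape.

```python
n = 65

print(bin(n))    # 0b1000001
print(oct(n))    # 0o101
print(hex(n))    # 0x41

# All three literals are just different spellings of the same integer.
print(0b1000001 == 0o101 == 0x41 == 65)   # True

# Zero-padded to 8 binary digits, without the 0b prefix.
padded = format(n, '08b')
print(padded)    # 01000001

# \x41 is the byte with hexadecimal value 41, which is ASCII 'A'.
print(b'\x41' == b'A')   # True
```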



-- 
Steven


