How to turn a string into a list of integers?

Kurt Mueller kurt.alfred.mueller at gmail.com
Fri Sep 5 16:41:16 EDT 2014


On 05.09.2014 at 21:16, Kurt Mueller <kurt.alfred.mueller at gmail.com> wrote:
> On 05.09.2014 at 20:25, Chris "Kwpolska" Warrick <kwpolska at gmail.com> wrote:
>> On Sep 5, 2014 7:57 PM, "Kurt Mueller" <kurt.alfred.mueller at gmail.com> wrote:
>>> Could someone please explain the following behavior to me:
>>> Python 2.7.7, MacOS 10.9 Mavericks
>>> 
>>>>>> import sys
>>>>>> sys.getdefaultencoding()
>>> 'ascii'
>>>>>> [ord(c) for c in 'AÄ']
>>> [65, 195, 132]
>>>>>> [ord(c) for c in u'AÄ']
>>> [65, 196]
>>> 
>>> My obviously wrong understanding:
>>> 'AÄ' in 'ascii' is two characters:
>>>     one with ord A=65 and
>>>     one with ord Ä=196 in ISO 8859-1 <depends on code table>
>>>     --> why [65, 195, 132]?
>>> u'AÄ' is a Unicode string
>>>     --> why [65, 196]?
>>> 
>>> It is just the other way round from what I would expect.
>> 
>> Basically, the first string is just a bunch of bytes, as provided by your terminal — which sounds like UTF-8 (perfectly logical in 2014).  The second one is converted into a real Unicode representation. The codepoint for Ä is U+00C4 (196 decimal). It's just a coincidence that it also matches latin1 aka ISO 8859-1 as Unicode starts with all 256 latin1 codepoints. Please kindly forget encodings other than UTF-8.
> 
> So:
> 'AÄ' is a UTF-8 string represented by 3 bytes:
> A -> 0x41      -> 65           first byte, decimal
> Ä -> 0xC3 0x84 -> 195 and 132  second and third bytes, decimal
> 
> u'AÄ' is a Unicode string represented by 2 bytes?
> A -> U+0041 -> 65   first byte, decimal; is the 00 omitted or not yielded by ord()?
> Ä -> U+00C4 -> 196  second byte, decimal; is the 00 omitted or not yielded by ord()?

After reading the ord() manual, the second case should read:
u'AÄ' is a Unicode string represented by 2 Unicode characters.
If Python was built with UCS2 Unicode, then each character's code point must
be in the range 0..65535 (16 bits, U+0000..U+FFFF).
A -> U+0041 ->  65 first  character, decimal (code point)
Ä -> U+00C4 -> 196 second character, decimal (code point)
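
A quick way to check this reading, as a minimal sketch rather than anything
definitive: the names text, data, code_points and byte_values are made up
for illustration. The string is written with an escape so the result does
not depend on the terminal or source encoding, and bytearray() is used so
the snippet should behave the same on Python 2.7 and Python 3.

```python
import sys

# Illustrative names only; none of them come from the thread.
text = u'A\u00c4'              # Unicode string: 'A' followed by U+00C4
data = text.encode('utf-8')    # the same text as a UTF-8 byte string

# ord() on a Unicode string yields one code point per character.
code_points = [ord(c) for c in text]

# Iterating a bytearray yields one integer per byte.
byte_values = list(bytearray(data))

print(code_points)   # [65, 196]
print(byte_values)   # [65, 195, 132]

# Largest code point ord() can return on this build: 65535 on a narrow
# (UCS2) build, 1114111 on wide builds and on Python 3.3 or later.
print(sys.maxunicode)
```

So ord() returns whole code points for a Unicode string, not bytes with a
leading 00 dropped, while the plain string is iterated byte by byte.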


Am I right now?

-- 
Kurt Mueller, kurt.alfred.mueller at gmail.com



