String encoding in Py2.7

Thomas Jollans tjol at tjol.eu
Tue May 29 05:09:54 EDT 2018


On 2018-05-29 09:55, ftg at lutix.org wrote:
> Hello,
> Using Python 2.7 (will switch to Py3 soon, but before that I'd like to understand how string encoding worked)

Oh dear. This is probably the exact wrong way to go about it: the
interplay between string encoding, unicode and bytes is much less clear
and easy to understand in Python 2.

> Could you please tell me if I understood correctly what goes on in Python's mind:
> in a .py file:
> if I write s="héhéhé", if my file is declared as unicode coding, python will store in memory s='hx82hx82hx82'

No, it doesn't. At the very least, you're missing some backslashes, and
as far as I know 0x82 encodes é only in old DOS code pages such as CP437
and CP850, not in UTF-8 or the ISO 8859 family.

On my system, I see

>>> s = 'héhéhé'
>>> s
'h\xc3\xa9h\xc3\xa9h\xc3\xa9'

My system uses UTF-8. If your PC is set up to use an encoding like ISO
8859-15 or Windows-1252, you should see

'h\xe9h\xe9h\xe9'

The \x?? sequences are just Python's notation for a single byte, written
as two hexadecimal digits.
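
To make the difference between these byte sequences concrete, here is a
minimal sketch that encodes the same text explicitly instead of relying
on the system or file encoding. It is written to run on Python 3 and on
Python 2.7; the variable names are only illustrative, and the text is
spelled with \xe9 escapes so the result does not depend on how the
snippet itself is saved.

```python
# The six characters h, é, h, é, h, é as a unicode string (é is U+00E9).
text = u'h\xe9h\xe9h\xe9'

# The same text turned into bytes under three different encodings.
utf8_bytes = text.encode('utf-8')          # é becomes the two bytes C3 A9
latin9_bytes = text.encode('iso-8859-15')  # é becomes the single byte E9
cp850_bytes = text.encode('cp850')         # é becomes the single byte 82

for name, data in [('utf-8', utf8_bytes),
                   ('iso-8859-15', latin9_bytes),
                   ('cp850', cp850_bytes)]:
    # repr() shows the raw bytes using the \x?? notation.
    print(name, len(data), repr(data))
```

The text is the same in all three cases; only the bytes used to store it
differ.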

> however this is not yet unicode for python interpreter this is just raw bytes. Right?

Right, this is a bunch of bytes:

>>> s
'h\xe9h\xe9h\xe9'
>>> [ord(c) for c in s]
[104, 233, 104, 233, 104, 233]
>>> [hex(ord(c)) for c in s]
['0x68', '0xe9', '0x68', '0xe9', '0x68', '0xe9']
>>>


> By the way, why is 'h' not turned into a hex value? Because it is already in the ASCII table?

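That's just how Python 2 displays byte strings: bytes that are printable
ASCII characters are shown as those characters, and everything else is
shown as a \x?? escape.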
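
The display rule can be imitated by hand. This is only a rough sketch of
the idea, not the actual repr() implementation (the real one also
escapes quotes, backslashes and a few control characters specially), and
show_byte is a made-up helper name. It runs on Python 2.7 and Python 3.

```python
# The Latin-1 bytes from the example above; a bytearray yields integers
# when iterated, on both Python 2.7 and Python 3.
data = bytearray(b'h\xe9h\xe9h\xe9')

values = list(data)

def show_byte(byte):
    """Hypothetical helper: printable ASCII as-is, the rest as \\x??."""
    if 32 <= byte < 127:
        return chr(byte)
    return '\\x%02x' % byte

displayed = ''.join(show_byte(b) for b in data)
print(values)
print(displayed)
```

So 'h' (0x68) is a byte like any other; it is simply shown as a letter
because it falls in the printable ASCII range.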

> If I want python interpreter to recognize my string as unicode I have to declare it as unicode s=u'héhéhé' and magically python will look for those 
> hex values 'x82' in the Unicode table. Still OK?

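With the u prefix, Python decodes the bytes in your source file into
Unicode code points, using the encoding declared for the file. It does
not look raw byte values up in the Unicode table. It so happens that for
some characters in some encodings the byte value is equal to the code
point (é is U+00E9 and also the byte 0xE9 in ISO 8859-1), but that's
neither here nor there.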
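
A small sketch of the distinction between a code point and its encoded
bytes, runnable on Python 2.7 and Python 3 (the variable names are just
for illustration):

```python
e_acute = u'\xe9'                    # the single character é

code_point = ord(e_acute)            # its Unicode code point, U+00E9

# Encoded forms of that one character, as lists of byte values.
latin1 = list(bytearray(e_acute.encode('latin-1')))
utf8 = list(bytearray(e_acute.encode('utf-8')))

# Going the other way: bytes only become text once you name the encoding.
decoded = b'h\xc3\xa9h\xc3\xa9h\xc3\xa9'.decode('utf-8')

print(hex(code_point), latin1, utf8, len(decoded))
```

In Latin-1 the byte value happens to equal the code point; in UTF-8 the
same character takes two bytes, neither of which is 0xE9.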

> Now: how come when I declare s='héhéhé', print(s) displays well 'héhéhé'? Is it because of my shell windows that is dealing well with unicode? Or is it 
> because the print function is magic?

It's because the print statement is magic.

Actually, this *only* works if the encoding of your file matches the
encoding your console expects: print writes the bytes out as they are,
and the console decodes them. The two usually match as long as you stay
on the same PC, but this assumption can fall apart quite easily when you
move code and data between systems, especially if they use different
operating systems or (human) languages.
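
What a mismatch looks like can be simulated without a second machine.
This sketch (Python 2.7 or Python 3, names illustrative) decodes bytes
with the "wrong" encoding, which is roughly what a console does when it
receives bytes in an encoding it does not expect. A real terminal will
typically show replacement characters rather than raise an exception, so
the try/except is only a stand-in for "this goes wrong".

```python
text = u'h\xe9h\xe9h\xe9'            # héhéhé

utf8_bytes = text.encode('utf-8')
latin1_bytes = text.encode('latin-1')

# UTF-8 bytes interpreted by a Latin-1 console: each é turns into two
# unrelated characters.
seen_on_latin1_console = utf8_bytes.decode('latin-1')

# Latin-1 bytes interpreted as UTF-8: 0xE9 followed by 'h' is not a
# valid UTF-8 sequence.
try:
    latin1_bytes.decode('utf-8')
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False

print(repr(seen_on_latin1_console), valid_as_utf8)
```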


Just use Python 3. There, the print function is not magic, which makes
life so much more logical.


-- Thomas


