String encoding in Py2.7

Chris Angelico rosuav at gmail.com
Tue May 29 06:01:41 EDT 2018


On Tue, May 29, 2018 at 5:55 PM,  <ftg at lutix.org> wrote:
> Hello,
> Using Python 2.7 (will switch to Py3 soon, but before that I'd like to understand how string encoding works).
> Could you please tell me if I have understood correctly what occurs in Python's mind:
> in a .py file:
> if I write s="héhéhé", if my file is declared as unicode coding, python will store in memory s='h\x82h\x82h\x82'
> however this is not yet unicode for python interpreter this is just raw bytes. Right?
> By the way, why is 'h' not turned into a hex value? Because it is already in the ASCII table?
> If I want python interpreter to recognize my string as unicode I have to declare it as unicode s=u'héhéhé' and magically python will look for those
> hex values '\x82' in the Unicode table. Still OK?
> Now: how come when I declare s='héhéhé', print(s) displays 'héhéhé' correctly? Is it because my shell window handles Unicode well? Or is it
> because the print function is magic?

What actually happens is this:

1) Your file contains bytes. Hopefully, your editor will follow the
same encoding that you've declared for your file, but that's not
guaranteed.
2) The string contains bytes. If you ask for the representation of
that string, some of them will be shown as characters, but that 'h' is
exactly the same as "\x68".
3) Printing that string sends those bytes to your console.
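To make point 2 concrete, here is a minimal sketch. It is written for Python 3, where the bytes type plays roughly the role that a plain (non-u) string plays in 2.7, and it assumes the editor saved the file as UTF-8. The variable names are just for illustration.

```python
# Python 3 sketch: a bytes object stands in for a Python 2.7 plain string.
raw = "héhéhé".encode("utf-8")  # assumption: the file was saved as UTF-8

# repr() shows ASCII-range bytes as characters and the rest as \x escapes.
shown = repr(raw)
print(shown)

# 'h' and "\x68" are the same single byte; repr just picks the readable form.
same_byte = b"h" == b"\x68"
print(same_byte)

# Each é takes two bytes in UTF-8, so six characters are stored as nine bytes.
byte_count = len(raw)
print(byte_count)
```

With a different file encoding the escaped bytes would be different, which is exactly why the string on its own can't tell you what characters it holds.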

If EVERYTHING is using the exact same encoding, it all appears to work
correctly. But if your editor saves the file as UTF-8 and your console
is set to Codepage 1252, you'll get nonsense.
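A rough way to see that mismatch without touching your console settings is to do the two steps by hand. This is a Python 3 sketch under the assumption that the editor wrote UTF-8 and the console reads Codepage 1252; the names are made up for the example.

```python
text = "héhéhé"

# What the editor writes to disk (assumed here to be UTF-8).
utf8_bytes = text.encode("utf-8")

# What a console set to Codepage 1252 makes of those same bytes:
# every two-byte é is read as two unrelated single-byte characters.
garbled = utf8_bytes.decode("cp1252")
print(garbled)

# When both sides agree on the encoding, the text survives the round trip.
round_trip = utf8_bytes.decode("utf-8")
print(round_trip == text)
```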

When you use the u-prefix string literal, here's what happens:

1) As above, your file contains bytes, hopefully in the encoding that
you've declared.
2) The Python interpreter reads those bytes using the encoding you
declared. The string contains Unicode characters.
3) Those characters really are characters. They are LATIN SMALL LETTER
H and LATIN SMALL LETTER E WITH ACUTE.
4) Printing that string causes those characters to be sent to your
console in the best way possible for Unicode characters (Python and
the console can negotiate that between them).
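The same idea can be sketched in Python 3, where every string literal behaves like a 2.7 u-prefixed one. The unicodedata lookup and the two encodings below are only there to illustrate the steps above; they are not what print does internally.

```python
import unicodedata

s = "héhéhé"  # in Python 2.7 this would be written u"héhéhé"

# The string holds six characters, not nine bytes.
char_count = len(s)
print(char_count)

# Each distinct character has an official Unicode name.
names = [unicodedata.name(ch) for ch in sorted(set(s))]
print(names)

# Only on output are the characters turned into bytes, and which bytes
# you get depends on the encoding the console expects.
as_utf8 = s.encode("utf-8")
as_cp1252 = s.encode("cp1252")
print(as_utf8)
print(as_cp1252)
```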

If you switch to Python 3 and remove the file's encoding declaration,
here's what happens:

1) Your file contains bytes. Your editor should use UTF-8 because it's
the most logical default; check for this but it's probably going to
happen without any effort.
2) The Python interpreter reads those bytes and understands them as
UTF-8. Your string, like all your source code, contains Unicode
characters.
3) As in the second case, you have those two characters.
4) As above, Python and the console negotiate the best way to display text.
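If you want to watch steps 1 to 3 happen, one possible sketch is to hand Python 3 some source code as raw bytes and let it decode them itself. The source text, the "<utf8-example>" filename and the namespace dict are invented for this example.

```python
# Step 1: the "file" is just bytes (assumed to be UTF-8, with no declaration).
source_bytes = 's = "héhéhé"\n'.encode("utf-8")

# Step 2: given bytes with no encoding declaration, compile() reads them as UTF-8.
namespace = {}
exec(compile(source_bytes, "<utf8-example>", "exec"), namespace)

# Step 3: the resulting string holds characters, so its length is 6.
s = namespace["s"]
print(len(source_bytes), "bytes of source")
print(len(s), "characters in s")
```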

It's a LOT easier. All you have to do is make sure everything's using
UTF-8, and then the defaults will just work. For most text editors,
you don't need to think about this at all, because they're already
configured that way.
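If you'd rather check than trust the defaults, something like this sketch reports what Python 3 thinks it is using. The report_encodings helper is a made-up name, and the console and locale values will differ from machine to machine.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python 3 is currently using (illustrative helper)."""
    return {
        # Default for str.encode() / bytes.decode(); always UTF-8 in Python 3.
        "default": sys.getdefaultencoding(),
        # What print() encodes to; depends on the console or redirection.
        "stdout": getattr(sys.stdout, "encoding", None),
        # What the operating system locale prefers for text files.
        "locale": locale.getpreferredencoding(False),
        # Used for file names.
        "filesystem": sys.getfilesystemencoding(),
    }


report = report_encodings()
for name, value in sorted(report.items()):
    print(name, "=", value)
```

If "stdout" or "locale" comes back as something other than UTF-8, that is the place to look when output turns into nonsense.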

As a bonus: Since, in Python 3, *all* your source code is Unicode
text, you can actually switch things around.

>>> s="héhéhé"
>>> héhéhé="s"
>>> print(s)
héhéhé
>>> print(héhéhé)
s

Yep, that's non-ASCII letters in a variable name. Doesn't bother
Python 3 at all!

ChrisA


