Encoding of Python 2 string literals

Steven D'Aprano steve at pearwood.info
Wed Jul 22 10:38:16 EDT 2015


On Wed, 22 Jul 2015 08:17 pm, anatoly techtonik wrote:

> Hi,
> 
> Is there a way to know encoding of string (bytes) literal
> defined in source file? For example, given that source:
> 
>     # -*- coding: utf-8 -*-
>     from library import Entry
>     Entry("текст")
> 
> Is there any way for Entry() constructor to know that
> string "текст" passed into it is the utf-8 string?

No.

The Entry constructor will receive a BYTE string, not a Unicode string,
containing some sequence of bytes.

If the coding cookie is accurate, then it will be the UTF-8 encoding of that
string, namely:

'\xd1\x82\xd0\xb5\xd0\xba\xd1\x81\xd1\x82'

If you print those bytes, at least under Linux, your terminal will probably
interpret them as UTF-8 and display them as текст, but don't be fooled: the
string has length 10 (not 5).
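
To see the length difference for yourself, here is a minimal sketch. It
assumes the file really is saved as UTF-8, and it uses the u'' prefix so
that it runs on Python 2 and on 3.3 or later; the names text and data are
just for illustration.

```python
# -*- coding: utf-8 -*-
text = u"текст"               # five Unicode code points
data = text.encode("utf-8")   # the bytes a UTF-8 source file holds for the literal

print(len(text))   # 5
print(len(data))   # 10
```

Each Cyrillic letter takes two bytes in UTF-8, which is where the 10 comes
from.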

If the coding cookie is not accurate, you will get something else. Probably
garbage, possibly a syntax error. Let's say you saved the text file using
the koi8-r encoding, but the coding cookie says utf-8. Then the text file
will actually contain the bytes \xd4\xc5\xcb\xd3\xd4, but Python will try to
read those bytes as UTF-8, and they are not valid UTF-8. So at best you will
get a syntax error, at worst garbage text.
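
You can reproduce the mismatch without touching any files. The sketch below
encodes the word as koi8-r and then tries to decode the result as UTF-8;
the latin-1 step is my own addition, there only to show what "garbage"
looks like when a decode happens to succeed with the wrong codec.

```python
# -*- coding: utf-8 -*-
text = u"текст"
koi8_bytes = text.encode("koi8-r")   # what an editor saving as KOI8-R writes

try:
    koi8_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    # The KOI8-R bytes do not form a valid UTF-8 sequence.
    decoded_ok = False

# latin-1 maps every byte to some character, so this never fails,
# but the result is mojibake rather than the original word.
garbage = koi8_bytes.decode("latin-1")

print(repr(koi8_bytes))
print(decoded_ok)
```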


The right way to deal with this is to use an actual Unicode string:

Entry(u"текст")

and make sure that the file is saved using UTF-8, as the encoding cookie
says.
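
As a rough sketch of what that means on the library side: the class below
is a hypothetical stand-in for library.Entry, not the real thing, and the
encoding parameter is my own assumption. The idea is that Unicode strings
are accepted as they are, while byte strings are only accepted when the
caller states the encoding, since the constructor cannot work it out.

```python
# -*- coding: utf-8 -*-
text_type = type(u"")   # unicode on Python 2, str on Python 3


class Entry(object):
    """Hypothetical stand-in for library.Entry."""

    def __init__(self, value, encoding=None):
        if isinstance(value, text_type):
            # Already text: nothing to guess.
            self.text = value
        elif encoding is not None:
            # Bytes plus an explicit encoding from the caller.
            self.text = value.decode(encoding)
        else:
            # Bytes alone carry no record of how they were encoded.
            raise TypeError("byte string given without an encoding")


entry = Entry(u"текст")
same = Entry(b"\xd1\x82\xd0\xb5\xd0\xba\xd1\x81\xd1\x82", encoding="utf-8")
print(entry.text == same.text)
```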

> I need to better prepare SCons for Python 3 migration.

The first step is to use proper Unicode strings (u'' literals) in Python 2.

It is acceptable to drop support for Python 3.1 and 3.2, and only support
3.3 and later. The advantage of this is that 3.3 supports the u'' string
prefix. If you must support 3.1 and 3.2 as well, there is no good solution,
just ugly ones.
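
One example of the ugly kind, as a sketch only: a helper function (here
hypothetically named u, similar in spirit to the one the six library
provides) that takes an ASCII-only literal with \u escapes and returns text
on every version. It is an illustration of the workaround style, not a
recommendation.

```python
import sys

if sys.version_info[0] >= 3:
    def u(s):
        # Python 3: a plain literal is already text, escapes included.
        return s
else:
    def u(s):
        # Python 2: the literal is bytes holding literal backslash-u
        # sequences, so decode those escapes into a unicode object.
        return s.decode("unicode_escape")

word = u("\u0442\u0435\u043a\u0441\u0442")
print(len(word))   # 5
```

The catch is that every non-ASCII character has to be written as an escape,
so the source is no longer readable as Russian.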


-- 
Steven



