python 2.7 and unicode (one more time)

Chris Angelico rosuav at gmail.com
Thu Nov 20 10:15:22 EST 2014


On Fri, Nov 21, 2014 at 1:14 AM, Francis Moreau <francis.moro at gmail.com> wrote:
> Hi,
>
> Thanks for the "from __future__ import unicode_literals" trick, it makes
> that switch much less intrusive.
>
> However it seems that I will suddenly be trapped by all modules which
> are not prepared to handle unicode. For example:
>
>  >>> from __future__ import unicode_literals
>  >>> import locale
>  >>> locale.setlocale(locale.LC_ALL, 'fr_FR')
>  Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
>    File "/usr/lib64/python2.7/locale.py", line 546, in setlocale
>      locale = normalize(_build_localename(locale))
>    File "/usr/lib64/python2.7/locale.py", line 453, in _build_localename
>      language, encoding = localetuple
>  ValueError: too many values to unpack
>
> Is the locale module an exception and in that case I'll fix it by doing:
>
>  >>> locale.setlocale(locale.LC_ALL, b'fr_FR')
>
> or is a (big) part of the modules in python 2.7 still not ready for
> unicode and in that case I have to decide which prefix (u or b) I should
> manually add ?

Sadly, there are quite a lot of parts of Python 2 that simply don't
handle Unicode strings. But you can probably keep all of those down to
just a handful of explicit b"whatever" strings; most places should
accept unicode as well as str. What you're seeing here is a prime
example of one of this author's points (caution, long post):

http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/

"""The lesson of Python 3 is: give programmers a Unicode string type,
*make it the default*, and encoding issues will /mostly/ go away."""

There's a whole ecosystem to Python 2 - some in the standard library,
heaps more in the rest of the world - and a lot of it was written on
the assumption that a byte is a character is an octet. When you pass
Unicode strings to functions written to expect byte strings, sometimes
you win, and sometimes you lose... even with the standard library
itself. But the Python 3 ecosystem has been written on the assumption
that strings are Unicode. It's only a narrow set of programs
("boundary code", where you're moving text across networks and stuff
like that) where the Python 2 model is easier to work with; and the
recent Py3 releases have been progressively working to relieve that
pain.

The absolute worst case is a function which exists in Python 2 and 3,
and requires a byte string in Py2 and a text string in Py3. Sadly,
that may be exactly what locale.setlocale() is. For that, I would
suggest explicitly passing stuff through str():

locale.setlocale(locale.LC_ALL, str('fr_FR'))

In Python 3, 'fr_FR' is already a str, so passing it through str()
will have no significant effect. (Though it would be worth commenting
that, to make it clear to a subsequent reader that this is Py2 compat
code.) In Python 2 with unicode_literals active, 'fr_FR' is a unicode,
so passing it through str() will encode it to ASCII, producing a byte
string that setlocale should be happy with.
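
A minimal sketch of that pattern, assuming a small wrapper (native_str is a hypothetical name, not something the locale module provides). The u prefix stands in for what unicode_literals does to a bare literal in Python 2, and the actual setlocale call uses 'C' only because that locale exists everywhere:

```python
import locale


def native_str(text):
    """Return text as the interpreter's native str type.

    Hypothetical helper name, not part of the locale module.
    Python 2: unicode -> byte string (implicit ASCII encoding).
    Python 3: str -> str, unchanged.
    """
    return str(text)


# The u prefix mimics a bare literal under unicode_literals in Python 2;
# Python 3.3+ accepts the prefix and treats it as a no-op.
name = native_str(u'fr_FR')

# 'C' is used for the real call only because it is always available;
# 'fr_FR' behaves the same way on systems with that locale installed.
current = locale.setlocale(locale.LC_ALL, native_str(u'C'))
```

Note that this only works for locale names that are pure ASCII; in Python 2, str() on a unicode containing anything else raises UnicodeEncodeError.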

By the way, the reason for the strange error message is clearer in
Python 3, which chains in another exception:

>>> locale.setlocale(locale.LC_ALL, b'fr_FR')
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/locale.py", line 498, in _build_localename
    language, encoding = localetuple
ValueError: too many values to unpack (expected 2)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/locale.py", line 594, in setlocale
    locale = normalize(_build_localename(locale))
  File "/usr/local/lib/python3.5/locale.py", line 507, in _build_localename
    raise TypeError('Locale must be None, a string, or an iterable of two strings -- language code, encoding.')
TypeError: Locale must be None, a string, or an iterable of two strings -- language code, encoding.

So when it gets the wrong type of string, it attempts to unpack it as
an iterable; it yields five values (the five bytes or characters,
depending on which way it's the wrong type of string), but it's
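expecting two. Fortunately, str() will deal with this, since it gives
each version its own native string type. But make sure you don't have
the b prefix, or str() in Py3 will hand back the repr of the bytes
object ("b'fr_FR'", quotes and prefix included) instead of the text!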
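
Both effects can be reproduced without touching the locale module at all. This is only an illustration of the unpacking step and of str() on a bytes literal, not a copy of what locale.py does internally; try_unpack is a hypothetical name:

```python
import sys


def try_unpack(value):
    """Unpack value into (language, encoding), or return the error text."""
    try:
        language, encoding = value
    except ValueError as exc:
        return str(exc)
    return (language, encoding)


# A five-item string is iterated item by item, so unpacking into two names fails.
message = try_unpack(b'fr_FR')

# A real two-item iterable unpacks cleanly.
pair = try_unpack(('fr_FR', 'UTF-8'))

# In Python 3, str() on bytes (with no encoding argument) returns the repr.
# In Python 2, b'fr_FR' is already a str and comes back unchanged.
wrapped = str(b'fr_FR')
```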

ChrisA


