[Python-Dev] Unicode locale values in 2.7

Thu Dec 3 12:19:30 CET 2009

While researching http://bugs.python.org/issue7327, I've come to the
conclusion that trunk handles locales incorrectly with regard to Unicode.
Fixing this would be the first step toward resolving that issue for
float and Decimal locale-aware formatting.

The issue concerns the locale "cs_CZ.UTF-8", and the "thousands_sep"
value (among others). On Linux, the C struct lconv contains '\xc2\xa0'
for thousands_sep, which is the UTF-8 encoding of U+00A0. In py3k this
is handled by calling mbstowcs (which is locale-aware) and then
PyUnicode_FromWideChar, so the value is converted to u"\xa0"
(non-breaking space).
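
As a rough illustration of that conversion, here is a minimal sketch
in Python 3 syntax. It is not the py3k C code: a bytes object stands in
for the char* value in struct lconv, and decoding as UTF-8 stands in
for mbstowcs under the cs_CZ.UTF-8 locale. Both are assumptions made
for the example.

```python
# Raw bytes as reported in struct lconv for cs_CZ.UTF-8 (from the text above).
raw_thousands_sep = b'\xc2\xa0'

# Stand-in for mbstowcs + PyUnicode_FromWideChar: decode using the
# locale's encoding, assumed here to be UTF-8.
decoded = raw_thousands_sep.decode('utf-8')

# Two bytes on the C side become a single code point, U+00A0.
print(len(raw_thousands_sep), len(decoded))
print(decoded == '\u00a0')
```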

But in trunk, the value is just used as-is. So when formatting a decimal,
for example, '\xc2\xa0' is just inserted into the result, such as:
>>> format(Decimal('1000'), 'n')
'1\xc2\xa0000'
This doesn't make much sense, and causes an error when converting it to
unicode:
>>> format(Decimal('1000'), u'n')
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/root/python/trunk/Lib/decimal.py", line 3609, in __format__
     return _format_number(self._sign, intpart, fracpart, exp, spec)
   File "/root/python/trunk/Lib/decimal.py", line 5704, in _format_number
     return _format_align(sign, intpart+fracpart, spec)
   File "/root/python/trunk/Lib/decimal.py", line 5595, in _format_align
     result = unicode(result)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)
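
The failure can be reproduced without decimal or any locale setup. The
sketch below uses Python 3 syntax, with bytes playing the role of a 2.x
str; the explicit decode('ascii') models the implicit ASCII decoding
that unicode(result) performs in trunk, and the UTF-8 decode at the end
assumes the locale encoding is UTF-8.

```python
# The formatted result as trunk produces it: raw UTF-8 bytes embedded
# between ASCII digits.
result = b'1\xc2\xa0000'

try:
    # Models unicode(result) in 2.x, which decodes with the ASCII codec.
    result.decode('ascii')
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Decoding with the locale's actual encoding (assumed UTF-8) works.
text = result.decode('utf-8')
print(len(text))
```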

I believe that the correct solution is to do what py3k does in locale,
which is to convert the struct lconv values to unicode. But since this
would be a disruptive change if universally applied, I'd like to propose
that we only convert to unicode if the values won't fit into a str.

So the algorithm would be something like:
1. call mbstowcs
2. if every value in the result is in the range [32, 126], return a str
3. otherwise, return a unicode
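
The three steps above can be sketched as follows. This is only a model
of the proposal, not the C implementation: it is written in Python 3
syntax, so bytes plays the role of a 2.x str and str plays the role of
unicode. The helper name convert_lconv_value and its encoding parameter
are hypothetical, and decoding with that encoding stands in for
mbstowcs.

```python
def convert_lconv_value(raw, encoding):
    """Hypothetical helper modelling steps 1-3 of the proposal."""
    # Step 1: stand-in for mbstowcs, using the locale's encoding.
    text = raw.decode(encoding)
    # Step 2: if every code point is printable ASCII, keep a byte string
    # (a str in 2.x terms). An empty value also takes this branch.
    if all(32 <= ord(ch) <= 126 for ch in text):
        return text.encode('ascii')
    # Step 3: otherwise return text (a unicode in 2.x terms).
    return text


# A comma separator stays a byte string.
comma = convert_lconv_value(b',', 'utf-8')
# The cs_CZ.UTF-8 thousands_sep becomes text.
nbsp = convert_lconv_value(b'\xc2\xa0', 'utf-8')
print(type(comma).__name__, type(nbsp).__name__)
```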

This would mean that for most locales, the current behavior in trunk
wouldn't change: the locale.localeconv() values would continue to be
str. Only for those locales where the values wouldn't fit into a str
would unicode be returned.

Does this seem like an acceptable change?

Eric.

PS: Thanks to Mark Dickinson and others on IRC and on the issue for
helping to formulate this.