[Python-Dev] Unicode locale values in 2.7
Eric Smith
eric at trueblade.com
Thu Dec 3 12:19:30 CET 2009
While researching http://bugs.python.org/issue7327, I've come to the
conclusion that trunk handles locales incorrectly with regard to Unicode.
Fixing this would be the first step toward resolving that issue, which
concerns locale-aware formatting of float and Decimal.
The issue concerns the locale "cs_CZ.UTF-8", and the "thousands_sep"
value (among others). The C struct lconv (in Linux) contains '\xc2\xa0'
for thousands_sep. In py3k this is handled by calling mbstowcs (which is
locale-aware) and then PyUnicode_FromWideChar, so the value is converted
to u"\xa0" (non-breaking space).
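As a small illustration of that conversion, written in Python 3 terms: the sketch below assumes the locale's encoding is UTF-8 and uses bytes.decode() as a stand-in for the mbstowcs / PyUnicode_FromWideChar pair. It is not the code py3k actually runs.

```python
import unicodedata

# Raw bytes as found in struct lconv for cs_CZ.UTF-8 (per the description above).
raw_thousands_sep = b"\xc2\xa0"

# Assumption: the locale encoding is UTF-8, so a plain decode stands in
# for the locale-aware mbstowcs() call.
decoded = raw_thousands_sep.decode("utf-8")

# The two bytes collapse into the single code point U+00A0.
code_point = ord(decoded)
name = unicodedata.name(decoded)
print(hex(code_point), name)
```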
But in trunk, the value is just used as-is. So when formatting a decimal,
for example, '\xc2\xa0' is just inserted into the result, such as:
>>> format(Decimal('1000'), 'n')
'1\xc2\xa0000'
This doesn't make much sense: the result is a str containing raw UTF-8
bytes, and it causes an error when it is converted to unicode:
>>> format(Decimal('1000'), u'n')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/python/trunk/Lib/decimal.py", line 3609, in __format__
    return _format_number(self._sign, intpart, fracpart, exp, spec)
  File "/root/python/trunk/Lib/decimal.py", line 5704, in _format_number
    return _format_align(sign, intpart+fracpart, spec)
  File "/root/python/trunk/Lib/decimal.py", line 5595, in _format_align
    result = unicode(result)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)
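The failure can be reproduced in isolation. The following is a sketch in Python 3 terms, where an explicit decode("ascii") stands in for the implicit ASCII decoding that unicode(result) performs in trunk; the variable names are illustrative only.

```python
# The formatted result from trunk, containing the raw UTF-8 separator bytes.
raw_result = b"1\xc2\xa0000"

# Stand-in for unicode(result) in 2.x, which decodes using ASCII.
try:
    raw_result.decode("ascii")
except UnicodeDecodeError as exc:
    # Record where decoding failed and which byte caused it.
    failure = (exc.start, raw_result[exc.start])
else:
    failure = None

# Decoding with the encoding the bytes were actually produced in succeeds.
as_text = raw_result.decode("utf-8")
print(failure, repr(as_text))
```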
I believe that the correct solution is to do what py3k does in locale,
which is to convert the struct lconv values to unicode. But since this
would be a disruptive change if universally applied, I'd like to propose
that we only convert to unicode if the values won't fit into a str.
So the algorithm would be something like:
1. call mbstowcs
2. if every value in the result is in the range [32, 126], return a str
3. otherwise, return a unicode
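The three steps above can be sketched as follows. This is only an illustration in Python 3 terms, not a proposed patch: bytes stands in for the 2.x str type, str stands in for unicode, a decode with an explicitly passed encoding stands in for mbstowcs, and the function name convert_lconv_value is hypothetical.

```python
def convert_lconv_value(raw, encoding="utf-8"):
    """Return bytes if the value is printable ASCII, else decoded text.

    `raw` is the byte string taken from struct lconv. `encoding` is an
    assumption standing in for the locale's multibyte encoding.
    """
    # Step 1: stand-in for mbstowcs(), decoding the multibyte string.
    text = raw.decode(encoding)

    # Step 2: every code point in [32, 126] means the value fits in a
    # plain byte string, so the existing behavior is kept.
    if all(32 <= ord(ch) <= 126 for ch in text):
        return text.encode("ascii")

    # Step 3: otherwise return the decoded (unicode) value.
    return text


# A comma separator stays a byte string; the Czech separator becomes text.
print(repr(convert_lconv_value(b",")))
print(repr(convert_lconv_value(b"\xc2\xa0")))
```

Note that an empty value, which is common in localeconv() results, passes the range check trivially and so stays a byte string in this sketch.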
This would mean that for most locales, the current behavior in trunk
wouldn't change: the locale.localeconv() values would continue to be
str. Only for those locales where the values wouldn't fit into a str
would unicode be returned.
Does this seem like an acceptable change?
Eric.
PS: Thanks to Mark Dickinson and others on IRC and on the issue for
helping to formulate this.