[Tutor] lc_ctype and re.LOCALE

Fri Jan 29 08:14:07 EST 2016

On Thu, Jan 28, 2016 at 2:23 PM, Albert-Jan Roskam
<sjeik_appie at hotmail.com> wrote:
> Out of curiosity, I wrote the throw-away script below to find a character that is classified
> (--> LC_CTYPE) as digit in one locale, but not in another.

The re module is the wrong tool for this. The re.LOCALE flag is only
meaningful for byte strings, and even then \d matches only ASCII 0-9
as decimal digits; it doesn't call the isdigit() ctype function. Using
Unicode with re.LOCALE is wrong, since the current locale doesn't
affect the meaning of a Unicode character. Starting with Python 3.6,
doing this raises a ValueError.
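
As a rough sketch of both points, assuming Python 3.6 or later (the
names data, ascii_digits and str_pattern_error are only for
illustration), a bytes pattern accepts re.LOCALE but still matches
only ASCII digits, while a str pattern is rejected:

```python
import re

data = b'0123456789\xb2\xb3\xb9'

# With a bytes pattern, re.LOCALE is accepted, but \d still matches
# only the ASCII digits 0-9, not the high bytes.
ascii_digits = re.findall(rb'\d', data, re.LOCALE)
print(ascii_digits)

# With a str pattern, Python 3.6+ refuses the re.LOCALE flag.
try:
    re.compile(r'\d', re.LOCALE)
    str_pattern_error = None
except ValueError as exc:
    str_pattern_error = exc
print(repr(str_pattern_error))
```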

The POSIX ctype functions such as isalnum and isdigit accept only a
single value in the range 0-255, or EOF (-1). In a UTF-8 locale they
return 0 for the whole range 128-255, because lead bytes and trailing
bytes aren't characters on their own. Even if this range has valid
characters in a given locale, it's meaningless to pass a Unicode value
from the Latin-1 block unless the locale uses Latin-1 as its codeset.
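
To see why a single-byte ctype call can't classify such a character
under UTF-8, here is a small Python 3 sketch (the variable names are
only for illustration). It compares the UTF-8 and Latin-1 encodings of
SUPERSCRIPT TWO:

```python
superscript_two = '\u00b2'

utf8_bytes = superscript_two.encode('utf-8')
latin1_bytes = superscript_two.encode('latin-1')

# UTF-8 needs a lead byte and a trailing byte, both in 128-255, so
# neither byte on its own stands for the character.
print(list(utf8_bytes))

# Latin-1 uses a single byte whose value equals the code point.
print(list(latin1_bytes))
```

Only in the Latin-1 case does the byte value passed to a ctype
function coincide with the Unicode code point.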

Python 2's str.isdigit() method calls the locale-aware C isdigit()
function. However, all of the locales on my Linux system use UTF-8, so
I have to switch to Windows to demonstrate two locales that differ
with respect to isdigit(). You could use PyWin32 or ctypes to iterate
over all the locales known to Windows, if it mattered that much to
you.

The English locale (codepage 1252) includes superscript digits 1, 2, and 3:

    >>> import locale, re, unicodedata
    >>> locale.setlocale(locale.LC_CTYPE, 'English_United Kingdom')
    'English_United Kingdom.1252'
    >>> [chr(x) for x in range(256) if chr(x).isdigit()]
    ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '\xb2', '\xb3', '\xb9']
    >>> unicodedata.name('\xb9'.decode('1252'))
    'SUPERSCRIPT ONE'
    >>> unicodedata.name('\xb2'.decode('1252'))
    'SUPERSCRIPT TWO'
    >>> unicodedata.name('\xb3'.decode('1252'))
    'SUPERSCRIPT THREE'

Note that using the re.LOCALE flag doesn't match these superscript digits:

    >>> re.findall(r'\d', '0123456789\xb2\xb3\xb9', re.LOCALE)
    ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

The Windows Greek locale (codepage 1253) substitutes "Ή" for superscript 1:

    >>> locale.setlocale(locale.LC_CTYPE, 'Greek_Greece')
    'Greek_Greece.1253'
    >>> [chr(x) for x in range(256) if chr(x).isdigit()]
    ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '\xb2', '\xb3']

    >>> unicodedata.name('\xb9'.decode('1253'))
    'GREEK CAPITAL LETTER ETA WITH TONOS'
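
The same comparison can be approximated without changing the locale,
by decoding each byte with the matching code page and asking Unicode
about the result. This is only a sketch for Python 3, where
str.isdigit() consults the Unicode database rather than the locale, so
it isn't the same mechanism as the Python 2 sessions above; describe()
is a hypothetical helper, not part of any library:

```python
import unicodedata

def describe(byte_value, codepage):
    """Decode one byte under a code page; return (Unicode name, isdigit)."""
    char = bytes([byte_value]).decode(codepage)
    return unicodedata.name(char), char.isdigit()

# Bytes 0xb2 and 0xb3 mean the same thing in both code pages, but
# 0xb9 is a superscript digit in cp1252 and a Greek letter in cp1253.
for codepage in ('cp1252', 'cp1253'):
    for byte_value in (0xb2, 0xb3, 0xb9):
        print(codepage, hex(byte_value), describe(byte_value, codepage))
```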