[Tutor] lc_ctype and re.LOCALE

Albert-Jan Roskam sjeik_appie at hotmail.com
Sun Jan 31 14:48:07 EST 2016



> From: eryksun at gmail.com
> Date: Fri, 29 Jan 2016 07:14:07 -0600
> Subject: Re: [Tutor] lc_ctype and re.LOCALE
> To: tutor at python.org
> CC: sjeik_appie at hotmail.com
> 
> On Thu, Jan 28, 2016 at 2:23 PM, Albert-Jan Roskam
> <sjeik_appie at hotmail.com> wrote:
> > Out of curiosity, I wrote the throw-away script below to find a character that is classified
> > (--> LC_CTYPE) as digit in one locale, but not in another.
> 
> The re module is the wrong tool for this. The re.LOCALE flag is only
> for byte strings, and in this case only ASCII 0-9 are matched as
> decimal digits. It doesn't call the isdigit() ctype function. Using
> Unicode with re.LOCALE is wrong. 


Ok, good to know. In my original Python-2 version of the script I did convert the ordinal to a byte string, but it was still a UTF-8 byte string.
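For anyone verifying this on Python 3: a minimal sketch (assuming CPython 3.6 or later) confirming both points above, namely that `\d` in a bytes pattern stays ASCII-only even under re.LOCALE, and that mixing re.LOCALE with a str pattern now raises an exception:

```python
import re

# In a bytes pattern, \d matches only ASCII 0-9; the re.LOCALE flag
# does not make it call the C library's isdigit().
data = bytes(range(256))
assert re.findall(rb"\d", data, re.LOCALE) == [
    b"0", b"1", b"2", b"3", b"4", b"5", b"6", b"7", b"8", b"9"
]

# Since Python 3.6, combining re.LOCALE with a str pattern is an error.
try:
    re.compile(r"\d", re.LOCALE)
except ValueError as exc:
    print(exc)
```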



> The current locale doesn't affect the
> meaning of a Unicode character. Starting with 3.6 doing this will
> raise an exception.


I find it strange that specifying either re.LOCALE or re.UNICODE is still the "special case". IMHO it is a historical anomaly that ASCII is the "normal case": matching accented characters should not require any special flags.
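In Python 3 that is in fact the default: str patterns are Unicode-aware without any flag. A quick illustration:

```python
import re

# str patterns match Unicode word characters by default, no flags needed.
print(re.findall(r"\w+", "café naïve"))    # ['café', 'naïve']

# \d covers any Unicode decimal digit (category Nd), e.g. ARABIC-INDIC
# DIGIT THREE, but not SUPERSCRIPT TWO (category No).
print(re.findall(r"\d", "7\u0663\u00b2"))  # ['7', '٣']
```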


> The POSIX ctype functions such as isalnum and isdigit are limited to a
> single code in the range 0-255 and EOF (-1). For UTF-8, the ctype
> functions return 0 in the range 128-255 (i.e. lead bytes and trailing
> bytes aren't characters). Even if this range has valid characters in a
> given locale, it's meaningless to use a Unicode value from the Latin-1
> block, unless the locale uses Latin-1 as its codeset.
> 
> Python 2's str uses the locale-aware isdigit() function. However, all
> of the locales on my Linux system use UTF-8, so I have to switch to
> Windows to demonstrate two locales that differ with respect to
> isdigit(). 


In other words: LC_CTYPE is only relevant with single-byte codepage encodings?
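For contrast, Python 3's str methods ignore LC_CTYPE entirely; a small check (using the "C" locale, which is available everywhere):

```python
import locale
import unicodedata

# str.isdigit() consults Unicode character properties, not LC_CTYPE,
# so changing the locale does not change the answer.
locale.setlocale(locale.LC_CTYPE, "C")
print("\u00b2".isdigit())              # True: ² has Numeric_Type=Digit
print(unicodedata.category("\u00b2"))  # 'No' (Number, other)
```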


> You could use PyWin32 or ctypes to iterate over all the
> locales known to Windows, if it mattered that much to you.
> 
> The English locale (codepage 1252) includes superscript digits 1, 2, and 3:
> 
>     >>> locale.setlocale(locale.LC_CTYPE, 'English_United Kingdom')
>     'English_United Kingdom.1252'
>     >>> [chr(x) for x in range(256) if chr(x).isdigit()]
>     ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '\xb2', '\xb3', '\xb9']
>     >>> unicodedata.name('\xb9'.decode('1252'))
>     'SUPERSCRIPT ONE'
>     >>> unicodedata.name('\xb2'.decode('1252'))
>     'SUPERSCRIPT TWO'
>     >>> unicodedata.name('\xb3'.decode('1252'))
>     'SUPERSCRIPT THREE'


Is character classification also related to the compatibility form of Unicode normalization?
>>> unicodedata.normalize("NFKD", u'\xb3')
u'3'

(see also http://unicode.org/reports/tr15/)
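They do seem related in that the extra characters str.isdigit() accepts beyond `\d` are largely compatibility forms, which NFKC/NFKD fold to plain digits. A small sketch:

```python
import unicodedata

s = "\u00b9\u00b2\u00b3"                  # superscripts ¹²³
folded = unicodedata.normalize("NFKD", s)
print(folded)                             # '123'
print(s.isdigit(), s.isdecimal())         # True False
print(folded.isdecimal())                 # True: folded to real decimals
```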


> Note that using the re.LOCALE flag doesn't match these superscript digits:
> 
>     >>> re.findall(r'\d', '0123456789\xb2\xb3\xb9', re.LOCALE)
>     ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']



Ok, now I am upset. I did not expect this at all! I would have expected the re results to be in line with the str.isdigit() results. If LC_CTYPE is not relevant, isn't "re.DIACRITIC" a better name for the re.LOCALE flag?
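In Python 3 a related mismatch exists even without locales: `\d` in a str pattern matches only category Nd (decimal digits), while str.isdigit() also accepts characters with Numeric_Type=Digit, such as the superscripts. A minimal demonstration:

```python
import re

s = "5\u00b2"                 # '5' is category Nd, '²' is No
print(re.findall(r"\d", s))   # ['5']: \d skips the superscript
print(s.isdigit())            # True: isdigit() is broader than \d
print("\u00b2".isdecimal())   # False: isdecimal() tracks Nd, like \d
```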


 
> The Windows Greek locale (codepage 1253) substitutes "Ή" for superscript 1:
> 
>     >>> locale.setlocale(locale.LC_CTYPE, 'Greek_Greece')
>     'Greek_Greece.1253'
>     >>> [chr(x) for x in range(256) if chr(x).isdigit()]
>     ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '\xb2', '\xb3']
> 
>     >>> unicodedata.name('\xb9'.decode('1253'))
>     'GREEK CAPITAL LETTER ETA WITH TONOS'

Ok, I switched to Windows to see this with my own eyes. Checked the regex. Strange, but fun to know.

Thanks a lot for your thorough reply!

