Correct handling of case in unicode and regexps

Sun Feb 24 14:28:27 EST 2013

On 23 Feb, 15:26, Devin Jeanpierre <jeanpierr... at gmail.com> wrote:
> Hi folks,
>
> I'm pretty unsure of myself when it comes to unicode. As I understand
> it, you're generally supposed to compare things in a case insensitive
> manner by case folding, right? So instead of a.lower() == b.lower()
> (the ASCII way), you do a.casefold() == b.casefold()
>
> However, I'm struggling to figure out how regular expressions should
> treat case. Python's re module doesn't "work properly" to my
> understanding, because:
>
>     >>> import re
>     >>> a = 'ss'
>     >>> b = 'ß'
>     >>> a.casefold() == b.casefold()
>     True
>     >>> re.match(re.escape(a), b, re.UNICODE | re.IGNORECASE)
>     >>> # oh dear!
>
> In addition, it seems improbable that this ever _could_ work. Because
> if it did work like that, then what would the value be of
> re.match('s', 'ß', re.UNICODE | re.IGNORECASE).end() ? 0.5?
>
> I'd really like to hear the thoughts of people more experienced with
> unicode. What is the ideal correct behavior here? Or do I
> misunderstand things?

-----

I'm just wondering whether there is a real issue here. After all,
this is only a question of conventions. Unicode has some
conventions; regex modules may (indeed, have to) adopt some
conventions too.

It seems to me that the safest way is to preprocess the text
that has to be examined.
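
A minimal sketch of what such preprocessing could look like for a
literal search: case-fold both the pattern and the text before handing
them to re. The helper name casefold_search is made up for this
illustration; it is not part of the re module. Note that the match
offsets then refer to the folded text, not the original, which is one
possible convention for the "0.5" question above.

```python
import re

def casefold_search(literal, text):
    """Search for a literal string in text, ignoring case via case folding.

    Hypothetical helper for illustration. The offsets of the returned
    match refer to the case-folded text, which may be longer than the
    original (for example, 'ß' folds to 'ss').
    """
    folded_text = text.casefold()
    pattern = re.escape(literal.casefold())
    return re.search(pattern, folded_text)

print(casefold_search('ss', 'ß') is not None)  # True
print(casefold_search('s', 'ß').end())         # 1, an offset into 'ss'
```

The price of this convention is that offsets cannot be mapped back to
the original string without extra bookkeeping.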

Proposed case study:
How should ss/ß/SS/ẞ be interpreted?

'Richard-Strauss-Straße'
'Richard-Strauss-Strasse'
'RICHARD-STRAUSS-STRASSE'
'RICHARD-STRAUSS-STRAẞE'
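
For what it is worth, here is how str.casefold() (Python 3.3+) treats
these four spellings, compared with str.lower(). This only shows the
convention Unicode case folding has chosen; whether that convention is
the "right" one for street names is another matter.

```python
names = [
    'Richard-Strauss-Straße',
    'Richard-Strauss-Strasse',
    'RICHARD-STRAUSS-STRASSE',
    'RICHARD-STRAUSS-STRAẞE',
]

# Full case folding maps both 'ß' and 'ẞ' to 'ss',
# so all four spellings collapse to a single string.
folded = {name.casefold() for name in names}
print(folded)        # {'richard-strauss-strasse'}

# lower() keeps 'ß' (and lowers 'ẞ' to 'ß'), so two forms remain.
lowered = {name.lower() for name in names}
print(len(lowered))  # 2
```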

The situation with sorting is more or less the same.
Unicode cannot do everything, and it may be necessary to
preprocess the input.

E.g., here is a function I once wrote for fun. It sorts French
words (without unicodedata or locale).

>>> import libfrancais
>>> z = ['oeuf', 'œuf', 'od', 'of']
>>> zo = libfrancais.sortedfr(z)
>>> zo
['od', 'oeuf', 'œuf', 'of']
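
The libfrancais module is not shown here, so the following is only a
rough sketch of the preprocessing idea, not the actual sortedfr
implementation. The names LIGATURES, expand_ligatures and
sorted_ligature_aware are invented for this illustration, and the
sketch handles ligatures only; it makes no attempt at accents or other
French collation rules.

```python
# Hypothetical stand-in, NOT the actual libfrancais.sortedfr code.
LIGATURES = {'œ': 'oe', 'Œ': 'OE', 'æ': 'ae', 'Æ': 'AE'}

def expand_ligatures(word):
    """Replace each ligature by its two-letter expansion."""
    return ''.join(LIGATURES.get(ch, ch) for ch in word)

def sorted_ligature_aware(words):
    """Sort words by their ligature-expanded, case-folded form.

    sorted() is stable, so words with equal keys (such as 'oeuf'
    and 'œuf') keep their relative input order.
    """
    return sorted(words, key=lambda w: expand_ligatures(w).casefold())

z = ['oeuf', 'œuf', 'od', 'of']
print(sorted_ligature_aware(z))  # ['od', 'oeuf', 'œuf', 'of']
```

A plain sorted(z) would put 'œuf' last, since 'œ' has a higher code
point than any ASCII letter.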

jmf