Case-insensitive sorting of strings (Python newbie)

Chris Angelico rosuav at gmail.com
Fri Jan 23 14:56:19 EST 2015


On Sat, Jan 24, 2015 at 6:14 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Well, if Python can't, then who can? Probably nobody in the world, not
> generically, anyway.
>
> Example:
>
>     >>> print("re\u0301sume\u0301")
>     résumé
>     >>> print("r\u00e9sum\u00e9")
>     résumé
>     >>> print("re\u0301sume\u0301" == "r\u00e9sum\u00e9")
>     False
>     >>> print("\ufb01nd")
>     ﬁnd
>     >>> print("find")
>     find
>     >>> print("\ufb01nd" == "find")
>     False
>
> If equality can't be determined, words really can't be sorted.

Ah, that's a bit easier to deal with. Just use Unicode normalization.

>>> import unicodedata
>>> print(unicodedata.normalize("NFC", "re\u0301sume\u0301") == unicodedata.normalize("NFC", "r\u00e9sum\u00e9"))
True
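
To see what normalization is doing there, here is a minimal sketch
(the variable names are mine, purely for illustration). NFC composes
each base letter plus combining accent into one code point where it
can, and NFD goes the other way, so either form gives the two
spellings a common representation:

```python
import unicodedata

decomposed = "re\u0301sume\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = "r\u00e9sum\u00e9"      # precomposed U+00E9

# The two strings render alike but differ in length and content.
print(len(decomposed), len(composed))

# NFC turns the decomposed spelling into the composed one...
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed)

# ...and NFD turns the composed spelling into the decomposed one.
nfd = unicodedata.normalize("NFD", composed)
print(nfd == decomposed)
```

Which form you pick matters less than applying the same one to both
sides of the comparison.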

It's a bit verbose, but if you're doing a lot of comparisons, you
probably want a key function that folds together everything you
want treated the same way, for instance:

import unicodedata

def key(s):
    """Normalize a Unicode string for comparison purposes.

    Composes (NFC), strips leading and trailing whitespace, and
    case-folds.
    """
    return unicodedata.normalize("NFC", s).strip().casefold()

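Then it's much tidier (the ligature case works because casefold()
expands U+FB01 to "fi"; NFC on its own leaves the ligature intact):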

>>> print(key("re\u0301sume\u0301") == key("r\u00e9sum\u00e9"))
True
>>> print(key("\ufb01nd") == key("find"))
True
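
Since the thread started with sorting, here is a rough sketch of the
same key function passed to sorted(). The word list is made up for
illustration. Note that this orders by the code points of the folded
keys, which is not the same as locale-aware collation; it only makes
sure that case and composition differences don't affect the order:

```python
import unicodedata

def key(s):
    """Compose (NFC), strip surrounding whitespace, and case-fold."""
    return unicodedata.normalize("NFC", s).strip().casefold()

# Hypothetical sample data: mixed case, plus both spellings of "résumé".
words = ["Zebra", "apple", "r\u00e9sum\u00e9", "Banana", "re\u0301sume\u0301 "]

# sorted() compares the keys but returns the original strings.
result = sorted(words, key=key)
print(result)
```

The two spellings of "résumé" end up adjacent, in their original
relative order, because sorted() is stable and their keys are equal.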

You may want to go further, too; for search comparisons, you'll
probably want NFKC normalization, and to translate each run of
Unicode whitespace into a single U+0020, or completely strip out
zero-width no-break spaces (U+FEFF), and maybe zero-width spaces
(U+200B) too, and so on. It all depends on what you mean by
"equality". But certainly a basic NFC or NFD normalization is safe
for general work.
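
A looser key for search might look something like the sketch below.
The name search_key and the exact set of characters dropped are my
own assumptions, not anything standard; adjust them to whatever
"equal" needs to mean for your data:

```python
import re
import unicodedata

# Assumed set of invisible characters to drop: U+200B ZERO WIDTH SPACE
# and U+FEFF ZERO WIDTH NO-BREAK SPACE. Mapping to None deletes them.
_ZERO_WIDTH = dict.fromkeys([0x200B, 0xFEFF])

def search_key(s):
    """Fold a string aggressively for search-style comparison."""
    s = s.translate(_ZERO_WIDTH)               # remove zero-width characters
    s = unicodedata.normalize("NFKC", s)       # compatibility-compose (ligatures, NBSP, ...)
    s = re.sub(r"\s+", " ", s).strip()         # collapse whitespace runs to one U+0020
    return s.casefold()

print(search_key("\ufb01nd\u00a0 \u200bme") == search_key("find me"))
```

NFKC throws away distinctions that sometimes matter (superscripts,
full-width forms and so on), so I'd keep this for matching and hang
on to the original string for display.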

ChrisA


