Case-insensitive sorting of strings (Python newbie)
Chris Angelico
rosuav at gmail.com
Fri Jan 23 14:56:19 EST 2015
On Sat, Jan 24, 2015 at 6:14 AM, Marko Rauhamaa <marko at pacujo.net> wrote:
> Well, if Python can't, then who can? Probably nobody in the world, not
> generically, anyway.
>
> Example:
>
> >>> print("re\u0301sume\u0301")
> résumé
> >>> print("r\u00e9sum\u00e9")
> résumé
> >>> print("re\u0301sume\u0301" == "r\u00e9sum\u00e9")
> False
> >>> print("\ufb01nd")
> find
> >>> print("find")
> find
> >>> print("\ufb01nd" == "find")
> False
>
> If equality can't be determined, words really can't be sorted.
Ah, that's a bit easier to deal with. Just use Unicode normalization.
>>> import unicodedata
>>> print(unicodedata.normalize("NFC","re\u0301sume\u0301") == unicodedata.normalize("NFC","r\u00e9sum\u00e9"))
True
It's a bit verbose, but if you're doing a lot of comparisons, you
probably want to make a key-function that folds together everything
that you want to be treated the same way, for instance:
def key(s):
    """Normalize a Unicode string for comparison purposes.

    Composes, case-folds, and trims excess spaces.
    """
    return unicodedata.normalize("NFC",s).strip().casefold()
Then it's much tidier:
>>> print(key("re\u0301sume\u0301") == key("r\u00e9sum\u00e9"))
True
>>> print(key("\ufb01nd") == key("find"))
True
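Since the thread started with sorting, here's a rough sketch of the same key function handed to sorted(). The word list is made up for illustration, and note that this gives you code point order on the folded strings, not proper locale-aware collation:

```python
import unicodedata

def key(s):
    """Compose, trim, and case-fold a string for comparison purposes."""
    return unicodedata.normalize("NFC", s).strip().casefold()

# Illustrative data: two spellings of the same word, one decomposed
# (e + U+0301) and one precomposed (U+00E9), mixed in with ASCII words.
words = ["Zebra", "re\u0301sume\u0301", "apple", "R\u00e9sum\u00e9", "Banana"]

# sorted() is stable, so strings with equal keys keep their input order.
ordered = sorted(words, key=key)
print(ordered)
```

The two spellings of "résumé" end up adjacent because their keys are equal, and case no longer affects the order.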
You may want to go further, too; for search comparisons, you'll want
to use NFKC normalization, and probably translate all strings of
Unicode whitespace into single U+0020s, or completely strip out
zero-width non-breaking spaces (and maybe zero-width breaking spaces,
too), etc, etc. It all depends on what you mean by "equality". But
certainly a basic NFC or NFD normalization is safe for general work.
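As a rough sketch of what that more aggressive folding might look like (search_key and the exact set of characters stripped are my own assumptions here; adjust to taste):

```python
import re
import unicodedata

# Characters to drop entirely (an illustrative choice, not a complete list):
# U+FEFF ZERO WIDTH NO-BREAK SPACE and U+200B ZERO WIDTH SPACE.
_ZERO_WIDTH = dict.fromkeys([0xFEFF, 0x200B])

def search_key(s):
    """Fold a string aggressively for search-style comparison."""
    # Compatibility composition: folds ligatures, no-break spaces, etc.
    s = unicodedata.normalize("NFKC", s)
    # Mapping a code point to None in translate() deletes it.
    s = s.translate(_ZERO_WIDTH)
    # For str patterns, \s matches Unicode whitespace, so every run of
    # whitespace collapses to a single U+0020.
    s = re.sub(r"\s+", " ", s).strip()
    return s.casefold()

print(search_key("\ufb01nd") == search_key("find"))
print(search_key("  Re\u0301sume\u0301\u00a0\u2003 CV\ufeff ") == search_key("r\u00e9sum\u00e9 cv"))
```

Whether you want this much folding depends on the application; for display or round-tripping you'd keep the original string and use the key only for matching.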
ChrisA