[Tutor] ignoring diacritical signs

Steven D'Aprano steve at pearwood.info
Mon Dec 2 17:20:42 CET 2013


Oh, I forgot...

On Mon, Dec 02, 2013 at 06:11:04AM -0800, Albert-Jan Roskam wrote:
>         if self.ignorecase:
>             value = value.lower()

The right way to do case-insensitive comparisons is to use str.casefold, 
not str.lower. Unfortunately, casefold is only available in Python 3.3 
and later, so for older versions you're stuck with lower (or maybe upper, 
if you prefer). I usually put this at the top of my module:


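try:
    ''.casefold  # Probe for str.casefold (Python 3.3 and later).
except AttributeError:
    # Older Python: fall back to lower() as the nearest equivalent.
    def casefold(s):
        return s.lower()
else:
    def casefold(s):
        return s.casefold()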


then just use the custom casefold function.
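
To see why the difference matters, here's a quick sketch using the German 
sharp s. The helper name same_text is just something I made up for 
illustration, and it calls str.casefold directly, so it assumes Python 3.3 
or later:

```python
def same_text(a, b):
    """Hypothetical helper: caseless comparison using str.casefold."""
    return a.casefold() == b.casefold()

# lower() leaves the sharp s alone, so the two spellings never match.
lower_match = "Straße".lower() == "STRASSE".lower()

# casefold() expands the sharp s to "ss", so they do match.
fold_match = same_text("Straße", "STRASSE")

print(lower_match, fold_match)  # prints: False True
```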

Case-folding isn't entirely right either. It will give the wrong results 
in Turkish and Azerbaijani and one or two other languages, because they 
have both a dotted and a dotless I, but it's as close as you're going to 
get without full locale awareness.
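
A rough sketch of the problem, again assuming Python 3.3 or later (the 
constant and variable names are mine, purely for illustration). In Turkish, 
I pairs with dotless ı and İ pairs with dotted i, but casefold applies the 
language-neutral rules:

```python
DOTTED_CAPITAL_I = "\u0130"   # İ
DOTLESS_SMALL_I = "\u0131"    # ı

# The pairing casefold() knows about: plain I with plain i.
ascii_pair = "I".casefold() == "i".casefold()

# The pairings a Turkish reader expects, which casefold() misses.
turkish_dotless_pair = "I".casefold() == DOTLESS_SMALL_I.casefold()
turkish_dotted_pair = DOTTED_CAPITAL_I.casefold() == "i".casefold()

print(ascii_pair, turkish_dotless_pair, turkish_dotted_pair)
# prints: True False False
```

The dotted capital İ folds to "i" followed by a combining dot above, which 
is why it doesn't compare equal to a plain "i".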

http://gizmodo.com/382026/a-cellphones-missing-dot-kills-two-people-puts-three-more-in-jail

By the way, that dot on the lowercase i and j, and on the uppercase 
dotted İ in Turkish, is called a tittle, and is technically a diacritic 
too. Next time you come across somebody bitching about how all those 
weird Unicode accents are a waste of time, you can reply "Is that rıght?"


-- 
Steven

