[Tutor] ignoring diacritical signs

Steven D'Aprano steve at pearwood.info
Mon Dec 2 16:53:44 CET 2013


On Mon, Dec 02, 2013 at 06:11:04AM -0800, Albert-Jan Roskam wrote:
> Hi,
> 
> I created the code below because I want to compare two fields while 
> ignoring the diacritical signs.

Why would you want to do that? That's like comparing two fields while 
ignoring the difference between "e" and "i", or "s" and "z", or "c" and 
"k". Or indeed between "s", "z", "c" and "k".

*only half joking*


I think the right way to ignore diacritics and other combining marks is 
with a function like this:

import unicodedata

def strip_marks(s):
    decomposed = unicodedata.normalize('NFD', s)
    base_chars = [c for c in decomposed if not unicodedata.combining(c)]
    return ''.join(base_chars)


Example:

py> strip_marks("I will coöperate with Müller's résumé mañana.")
"I will cooperate with Muller's resume manana."


Beware: stripping accents may completely change the meaning of the word 
in many languages! Even in English, stripping the accents from "résumé" 
makes the word ambiguous (do you mean a CV, or the verb to start 
something again?). In other languages, stripping accents may completely 
change the word, or even turn it into nonsense.
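
With that caveat in mind, here is a minimal sketch of the comparison you 
asked about, assuming Python 3. The name same_ignoring_marks is my own 
invention for illustration, not anything from the standard library:

```python
import unicodedata

def strip_marks(s):
    # Same approach as above: decompose, then drop the combining marks.
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

def same_ignoring_marks(a, b):
    # Hypothetical helper: two fields are "equal" if they match once
    # all combining marks have been stripped from both.
    return strip_marks(a) == strip_marks(b)

print(same_ignoring_marks("Müller", "Muller"))            # True
print(same_ignoring_marks("re\u0301sume\u0301", "résumé"))  # True
print(same_ignoring_marks("Müller", "Mueller"))           # False
```

The second call shows a side benefit: a pre-composed é and an e followed 
by a combining acute compare equal, because both are decomposed first. 
The third shows the limit: this knows nothing about German conventions.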

For example, I understand that in Danish, å is not the letter a with a 
circle accent on it, but a distinct letter of the alphabet which should 
not be touched. And I haven't even considered non-Western European 
languages, like Greek, Polish, Russian, Arabic, Hebrew...

Another issue: depending on the language, it may be better to replace 
certain accents with letter combinations. For example, a German might 
prefer to see Müller transformed to Mueller. (Although Herr Müller 
probably won't, as people tend to be very sensitive about their names.)

Also, the above function leaves LATIN CAPITAL LETTER O WITH STROKE as Ø 
instead of stripping the stroke: Unicode defines no decomposition for Ø, 
so NFD has no combining mark to split off. I'm not sure whether that is 
an oversight or by design. Likewise for the lowercase version. You might 
want to do some post-processing:


def strip_marks2(s):
    decomposed = unicodedata.normalize('NFD', s)
    result = ''.join([c for c in decomposed if not unicodedata.combining(c)])
    # Post-process letter O with stroke, which NFD leaves untouched.
    return result.replace('Ø', 'O').replace('ø', 'o')


If you have a lot of characters to post-process (e.g. ß to "ss" or "sz") 
I recommend you look into the str.translate method, which is more 
efficient than repeatedly calling replace.
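
A rough sketch of what that might look like, assuming Python 3 (where 
str.maketrans accepts a dict). The table contents and the name 
strip_marks3 are only examples; fill the table in to suit your data:

```python
import unicodedata

# Example table of characters that NFD does not decompose.
# Keys must be single characters; values may be longer strings.
POST_PROCESS = str.maketrans({'Ø': 'O', 'ø': 'o', 'ß': 'ss'})

def strip_marks3(s):
    decomposed = unicodedata.normalize('NFD', s)
    stripped = ''.join(c for c in decomposed if not unicodedata.combining(c))
    # A single pass over the string, however many entries the table has.
    return stripped.translate(POST_PROCESS)

print(strip_marks3("Søren drives down the Straße"))
# Soren drives down the Strasse
```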

No *simple* function can take into account the myriad of language- 
specific rules for accents. The best you can do is code up a limited set 
of rules for whichever languages you care about, and in the general case 
fall back on just stripping accents like an ignorant American.

(No offence intended to ignorant Americans *wink*)


> I thought it'd be cool to overload 
> __eq__ for this. Is this a good approach, or have I been fixated too 
> much on using the __eq__ special method?

This isn't Java, no need for a class :-)

On the other hand, if you start building up a set of language-specific 
normalization functions, a class might be what you want. For example:

class DefaultAccentStripper:
    exceptions = {'Ø': 'O', 'ø': 'o'}
    mode = 'NFD'  # Or possibly 'NFKD' for some uses?
    def __call__(self, s):
        decomposed = []
        for c in s:
            if c in self.exceptions:
                decomposed.append(self.exceptions[c])
            else:
                decomposed.append(unicodedata.normalize(self.mode, c))
        # Join before filtering: the list items can be several characters
        # long, and unicodedata.combining() accepts only one character.
        result = ''.join([c for c in ''.join(decomposed) if not 
                          unicodedata.combining(c)])
        return result

class GermanAccentStripper(DefaultAccentStripper):
    exceptions = DefaultAccentStripper.exceptions.copy()
    exceptions.update({'Ä': 'AE', 'ä': 'ae', 'Ë': 'EE', 'ë': 'ee',
                       'Ï': 'IE', 'ï': 'ie', 'Ö': 'OE', 'ö': 'oe',
                       # there seems to be a pattern here...
                       'Ü': 'UE', 'ü': 'ue',
                       'ß': 'sz',
                       })

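class DanishAccentStripper(DefaultAccentStripper):
    # Map Å to itself so that it is kept. The default Ø exceptions are
    # deliberately not copied: ø is a distinct letter in Danish too.
    exceptions = {'Å': 'Å', 'å': 'å'}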


And there you go, three accent-strippers. Just instantiate the classes, 
once, and you're ready to go:

accent_stripper = GermanAccentStripper()
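
To see what calling the instances gives you, here is a self-contained 
sketch, assuming Python 3.5 or later for the {**...} syntax. It condenses 
the classes above, trims the German table to the umlauts and ß, and joins 
the pieces before filtering because unicodedata.combining accepts only a 
single character. The sample text is made up:

```python
import unicodedata

class DefaultAccentStripper:
    exceptions = {'Ø': 'O', 'ø': 'o'}
    mode = 'NFD'
    def __call__(self, s):
        # Replace exceptions whole; decompose everything else.
        pieces = [self.exceptions[c] if c in self.exceptions
                  else unicodedata.normalize(self.mode, c) for c in s]
        # Join first, then drop the combining marks one character at a time.
        return ''.join(c for c in ''.join(pieces)
                       if not unicodedata.combining(c))

class GermanAccentStripper(DefaultAccentStripper):
    exceptions = {**DefaultAccentStripper.exceptions,
                  'Ä': 'AE', 'ä': 'ae', 'Ö': 'OE', 'ö': 'oe',
                  'Ü': 'UE', 'ü': 'ue', 'ß': 'sz'}

class DanishAccentStripper(DefaultAccentStripper):
    exceptions = {'Å': 'Å', 'å': 'å'}

text = "Müller, Ångström, Øre, café"
print(DefaultAccentStripper()(text))  # Muller, Angstrom, Ore, cafe
print(GermanAccentStripper()(text))   # Mueller, Angstroem, Ore, cafe
print(DanishAccentStripper()(text))   # Muller, Ångstrom, Øre, cafe
```

Note that the exceptions are looked up one character at a time, so they 
only match pre-composed input. If your data might contain decomposed 
text, normalizing it to NFC before calling the stripper would be safer.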



-- 
Steven

