[Tutor] ignoring diacritical signs

Mon Dec 2 21:08:49 CET 2013

On Mon, 12/2/13, Steven D'Aprano <steve at pearwood.info> wrote:

 Subject: Re: [Tutor] ignoring diacritical signs
 To: tutor at python.org
 Date: Monday, December 2, 2013, 4:53 PM

 On Mon, Dec 02, 2013 at 06:11:04AM -0800, Albert-Jan Roskam wrote:
 > Hi,
 > 
 > I created the code below because I want to compare two fields while
 > ignoring the diacritical signs.

 Why would you want to do that? That's like comparing two fields while
 ignoring the difference between "e" and "i", or "s" and "z", or "c" and
 "k". Or indeed between "s", "z", "c" and "k".

 *only half joking*

====> ;-) Unaccented characters that really should be accented are a fact of life. We often need to merge datasets and if one of them comes from a system that dates back to the Pleistocene... well...

 I think the right way to ignore diacritics and other combining marks is
 with a function like this:

 import unicodedata

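 def strip_marks(s):
     # NFD splits each accented character into base letter + combining marks.
     decomposed = unicodedata.normalize('NFD', s)
     # Keep only the characters that are not combining marks.
     base_chars = [c for c in decomposed if not unicodedata.combining(c)]
     return ''.join(base_chars)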

 Example:

 py> strip_marks("I will coöperate with Müller's résumé mañana.")
 "I will cooperate with Muller's resume manana."

====> woaaah, very different approach compared to mine. Nice! I have to read up on unicodedata. I have used it a few times (e.g. where the re module is not enough), but many of the abbreviations are still a mystery to me. This seems a good start: http://www.unicode.org/reports/tr44/tr44-6.html

 Beware: stripping accents may completely change the meaning of the word
 in many languages! Even in English, stripping the accents from "résumé"
 makes the word ambiguous (do you mean a CV, or the verb to start
 something again?). In other languages, stripping accents may completely
 change the word, or even turn it into nonsense.

 For example, I understand that in Danish, å is not the letter a with a
 circle accent on it, but a distinct letter of the alphabet which should
 not be touched. And I haven't even considered non-Western European
 languages, like Greek, Polish, Russian, Arabic, Hebrew...

=====> Similarly, ñ is a letter in Spanish and Tagalog. So they have (at least?) 27 letters in their alphabet.
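To see why an NFD-based stripper touches letters such as å and ñ even though they count as letters in their own right, it helps to look at what NFD produces for them. A minimal sketch; describe is a hypothetical helper name, not something from the mail above:

```python
import unicodedata

def describe(ch):
    # Hypothetical helper: names of the code points in the NFD form of ch.
    return [unicodedata.name(c) for c in unicodedata.normalize('NFD', ch)]

# U+00E5 is å and U+00F1 is ñ, written as escapes so the source file's
# own normalization cannot affect the result.
for ch in ('\u00e5', '\u00f1'):
    print(ch, describe(ch))
```

Both come out as a base letter followed by a combining mark (COMBINING RING ABOVE and COMBINING TILDE respectively), which is exactly what the combining() filter removes.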

 Another issue: depending on the language, it may be better to replace
 certain accents with letter combinations. For example, a German might
 prefer to see Müller transformed to Mueller. (Although Herr Müller
 probably won't, as people tend to be very sensitive about their names.)

=====> Strangely, the Nazi Goebbels is never referred to as "Göbbels".

 Also, the above function leaves LATIN CAPITAL LETTER O WITH STROKE as Ø
 instead of stripping the stroke. I'm not sure whether that is an
 oversight or by design. Likewise for the lowercase version. You might
 want to do some post-processing:

 def strip_marks2(s):
     # Post-process letter O with stroke.
     decomposed = unicodedata.normalize('NFD', s)
     result = ''.join([c for c in decomposed if not unicodedata.combining(c)])
     return result.replace('Ø', 'O').replace('ø', 'o')
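The reason Ø needs this special handling can be checked directly: in the Unicode character database it has no canonical decomposition, so NFD leaves it alone. A small sketch for comparison with Ö, which does decompose:

```python
import unicodedata

O_STROKE = '\u00d8'      # Ø, LATIN CAPITAL LETTER O WITH STROKE
O_DIAERESIS = '\u00d6'   # Ö, LATIN CAPITAL LETTER O WITH DIAERESIS

# Ø has an empty decomposition mapping, so NFD returns it unchanged.
print(repr(unicodedata.decomposition(O_STROKE)))
print(unicodedata.normalize('NFD', O_STROKE) == O_STROKE)

# Ö decomposes into O (U+004F) plus COMBINING DIAERESIS (U+0308).
print(unicodedata.decomposition(O_DIAERESIS))
```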

 If you have a lot of characters to post-process (e.g. ß to "ss" or "sz")
 I recommend you look into the str.translate method, which is more
 efficient than repeatedly calling replace.

====> Efficiency certainly counts here, with millions of records to check. It may even be more important than readability. Then again, accented letters are fairly rare in my language.
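A minimal sketch of the str.translate idea, assuming a small hand-written table. The names SPECIAL_CASES and strip_marks_translate are made up for illustration, and the choice of "ss" for ß is just one of the two options mentioned:

```python
import unicodedata

# Hypothetical replacement table; extend it for the languages you care about.
# str.maketrans accepts a dict of single characters mapped to strings.
SPECIAL_CASES = str.maketrans({'Ø': 'O', 'ø': 'o', 'ß': 'ss'})

def strip_marks_translate(s):
    # One pass over the string handles all special cases at once,
    # then NFD + combining() removes the remaining marks.
    decomposed = unicodedata.normalize('NFD', s.translate(SPECIAL_CASES))
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(strip_marks_translate("Søren Müller wohnt in der Straße"))
```

The table is built once at import time, so each call costs a single translate pass rather than one replace call per special character.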

 No *simple* function can take into account the myriad of
 language-specific rules for accents. The best you can do is code up a
 limited set of rules for whichever languages you care about, and in the
 general case fall back on just stripping accents like an ignorant
 American.

 (No offence intended to ignorant Americans *wink*)

====> You are referring to this recipe, right? http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/
;-)

 > I thought it'd be cool to overload __eq__ for this. Is this a good
 > approach, or have I been fixated too much on using the __eq__ special
 > method?

 This isn't Java, no need for a class :-)

 On the other hand, if you start building up a set of language-specific
 normalization functions, a class might be what you want. For example:

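 class DefaultAccentStripper:
     exceptions = {'Ø': 'O', 'ø': 'o'}
     mode = 'NFD'  # Or possibly 'NFKD' for some uses?
     def __call__(self, s):
         decomposed = []
         for c in s:
             if c in self.exceptions:
                 decomposed.append(self.exceptions[c])
             else:
                 decomposed.append(unicodedata.normalize(self.mode, c))
         # Each list item may hold several characters (e.g. 'AE', or a
         # base letter plus its combining mark), so join before filtering:
         # unicodedata.combining() accepts only a single character.
         result = ''.join([c for c in ''.join(decomposed)
                           if not unicodedata.combining(c)])
         return result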

 class GermanAccentStripper(DefaultAccentStripper):
     exceptions = DefaultAccentStripper.exceptions.copy()
     exceptions.update({'Ä': 'AE', 'ä': 'ae', 'Ë': 'EE', 'ë': 'ee',
                        'Ï': 'IE', 'ï': 'ie', 'Ö': 'OE', 'ö': 'oe',
                        # there seems to be a pattern here...
                        'Ü': 'UE', 'ü': 'ue',
                        'ß': 'sz',
                        })

 class DanishAccentStripper(DefaultAccentStripper):
     exceptions = {'Å': 'Å', 'å': 'å'}

 And there you go, three accent-strippers. Just instantiate the classes,
 once, and you're ready to go:

 accent_stripper = GermanAccentStripper()

====> very slick. Cool!

====> regarding casefold (in your next mail). What is the difference between lower and casefold?

Help on built-in function casefold:

casefold(...)
    S.casefold() -> str

    Return a version of S suitable for caseless comparisons.

>>> "Alala alala".casefold() == "Alala alala".lower()
True
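For plain ASCII text the two methods give the same result, as above. A small sketch of a case where they differ, using the German ß (the example words are arbitrary):

```python
word = "Stra\u00dfe"   # "Straße", with ß written as an escape

# lower() only maps to lowercase, so ß stays as it is.
print(word.lower())

# casefold() is more aggressive and expands ß to "ss",
# which makes caseless comparison with "STRASSE" succeed.
print(word.casefold())
print(word.casefold() == "STRASSE".casefold())
print(word.lower() == "STRASSE".lower())
```

So casefold is the one intended for comparisons, while lower is for display in lowercase.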

====> And then this article............. sheeeeeesshhh!!!! What a short fuse! Wouldn't it be easier to say "Look, man, the diacritics of my phone suck"