[Tutor] ignoring diacritical signs
Steven D'Aprano
steve at pearwood.info
Mon Dec 2 16:53:44 CET 2013
On Mon, Dec 02, 2013 at 06:11:04AM -0800, Albert-Jan Roskam wrote:
> Hi,
>
> I created the code below because I want to compare two fields while
> ignoring the diacritical signs.
Why would you want to do that? That's like comparing two fields while
ignoring the difference between "e" and "i", or "s" and "z", or "c" and
"k". Or indeed between "s", "z", "c" and "k".
*only half joking*
I think the right way to ignore diacritics and other combining marks is
with a function like this:
import unicodedata

def strip_marks(s):
    decomposed = unicodedata.normalize('NFD', s)
    base_chars = [c for c in decomposed if not unicodedata.combining(c)]
    return ''.join(base_chars)
Example:
py> strip_marks("I will coöperate with Müller's résumé mañana.")
"I will cooperate with Muller's resume manana."
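To see what the decomposition step is actually doing, here's a small probe
(nothing beyond the standard library):

```python
import unicodedata

# NFD splits a precomposed character into its base character plus
# combining marks.  'é' (U+00E9) becomes 'e' followed by U+0301
# COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize('NFD', '\u00e9')
print([hex(ord(c)) for c in decomposed])  # ['0x65', '0x301']

# combining() returns a non-zero combining class for marks, and 0 for
# base characters -- which is exactly what the filter above relies on.
print(unicodedata.combining('\u0301'))  # non-zero: a combining mark
print(unicodedata.combining('e'))       # 0: a base character
```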
Beware: stripping accents may completely change the meaning of the word
in many languages! Even in English, stripping the accents from "résumé"
makes the word ambiguous (do you mean a CV, or the verb to start
something again?). In other languages, stripping accents may change the
word entirely, or even turn it into nonsense.
For example, I understand that in Danish, å is not the letter a with a
circle accent on it, but a distinct letter of the alphabet which should
not be touched. And I haven't even considered non-Western European
languages, like Greek, Polish, Russian, Arabic, Hebrew...
Another issue: depending on the language, it may be better to replace
certain accents with letter combinations. For example, a German might
prefer to see Müller transformed to Mueller. (Although Herr Müller
probably won't, as people tend to be very sensitive about their names.)
Also, the above function leaves LATIN CAPITAL LETTER O WITH STROKE as Ø
instead of stripping the stroke. I'm not sure whether that is an
oversight or by design. Likewise for the lowercase version. You might
want to do some post-processing:
def strip_marks2(s):
    # Post-process letter O with stroke, which has no NFD decomposition.
    decomposed = unicodedata.normalize('NFD', s)
    result = ''.join([c for c in decomposed if not unicodedata.combining(c)])
    return result.replace('Ø', 'O').replace('ø', 'o')
If you have a lot of characters to post-process (e.g. ß to "ss" or "sz")
I recommend you look into the str.translate method, which is more
efficient than repeatedly calling replace.
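A sketch of that approach: build the translation table once with
str.maketrans, which accepts a dict mapping characters to replacement
strings of any length. The particular mappings (and the sample word) are
just illustrative.

```python
import unicodedata

# Build the table once; translate() then does all the replacements in a
# single pass over the string.
TABLE = str.maketrans({'Ø': 'O', 'ø': 'o', 'ß': 'ss',
                       'Æ': 'AE', 'æ': 'ae'})

def strip_marks3(s):
    decomposed = unicodedata.normalize('NFD', s)
    stripped = ''.join(c for c in decomposed
                       if not unicodedata.combining(c))
    return stripped.translate(TABLE)

print(strip_marks3('Køße'))  # 'Kosse' (a made-up word, for illustration)
```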
No *simple* function can take into account the myriad of language-
specific rules for accents. The best you can do is code up a limited set
of rules for whichever languages you care about, and in the general case
fall back on just stripping accents like an ignorant American.
(No offence intended to ignorant Americans *wink*)
> I thought it'd be cool to overload
> __eq__ for this. Is this a good approach, or have I been fixated too
> much on using the __eq__ special method?
This isn't Java, no need for a class :-)
On the other hand, if you start building up a set of language-specific
normalization functions, a class might be what you want. For example:
class DefaultAccentStripper:
    exceptions = {'Ø': 'O', 'ø': 'o'}
    mode = 'NFD'  # Or possibly 'NFKD' for some uses?
    def __call__(self, s):
        decomposed = []
        for c in s:
            if c in self.exceptions:
                decomposed.append(self.exceptions[c])
            else:
                decomposed.append(unicodedata.normalize(self.mode, c))
        # normalize() may return multi-character strings, and combining()
        # only accepts a single character, so join first and then filter
        # character by character.
        result = ''.join(c for c in ''.join(decomposed)
                         if not unicodedata.combining(c))
        return result
class GermanAccentStripper(DefaultAccentStripper):
    exceptions = DefaultAccentStripper.exceptions.copy()
    exceptions.update({'Ä': 'AE', 'ä': 'ae', 'Ë': 'EE', 'ë': 'ee',
                       'Ï': 'IE', 'ï': 'ie', 'Ö': 'OE', 'ö': 'oe',
                       # there seems to be a pattern here...
                       'Ü': 'UE', 'ü': 'ue',
                       'ß': 'sz',
                       })
class DanishAccentStripper(DefaultAccentStripper):
    exceptions = {'Å': 'Å', 'å': 'å'}
And there you go, three accent-strippers. Just instantiate the classes
once, and you're ready to go:
accent_stripper = GermanAccentStripper()
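Here's a condensed, self-contained sketch of the three classes in action,
showing how the Danish subclass preserves å/Å while the default stripper
flattens them (the sample strings are illustrative):

```python
import unicodedata

# Condensed version of the stripper classes, joining before filtering
# since combining() only accepts a single character.
class DefaultAccentStripper:
    exceptions = {'Ø': 'O', 'ø': 'o'}
    mode = 'NFD'
    def __call__(self, s):
        parts = [self.exceptions.get(c, unicodedata.normalize(self.mode, c))
                 for c in s]
        return ''.join(c for c in ''.join(parts)
                       if not unicodedata.combining(c))

class GermanAccentStripper(DefaultAccentStripper):
    exceptions = dict(DefaultAccentStripper.exceptions)
    exceptions.update({'Ä': 'AE', 'ä': 'ae', 'Ö': 'OE', 'ö': 'oe',
                       'Ü': 'UE', 'ü': 'ue', 'ß': 'sz'})

class DanishAccentStripper(DefaultAccentStripper):
    # Map å/Å to themselves so they bypass decomposition entirely.
    exceptions = {'Å': 'Å', 'å': 'å'}

print(DefaultAccentStripper()('Århus'))       # 'Arhus'
print(DanishAccentStripper()('Århus'))        # 'Århus' -- untouched
print(GermanAccentStripper()('Müller grüßt')) # 'Mueller grueszt'
```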
--
Steven