Nice unicode -> ascii translation?

"Martin v. Löwis" martin at v.loewis.de
Mon Aug 7 08:27:50 EDT 2006


crowell at mit.edu wrote:
> The trick is finding the right XXXX.  Has someone attempted this
> before, or am I stuck writing my own solution?

In this specific example, there is a different approach, using
the Unicode character database:

import unicodedata

def strip_combining(s):
    # Expand pre-combined characters into base character + combining marks
    s1 = unicodedata.normalize("NFD", s)
    r = []
    for c in s1:
        # keep only the non-combining characters
        if not unicodedata.combining(c):
            r.append(c)
    return u"".join(r)

py> strip_combining(u'B\xe9la Fleck')
u'Bela Fleck'

As the accented characters get decomposed into base character
plus combining accent, this strips off all accents in the
string.
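
To make the decomposition step concrete, here is a small sketch that
inspects what NFD does to a single accented character. The variable
names are only illustrative; the calls are the standard unicodedata
functions used above.

```python
import unicodedata

# U+00E9 (LATIN SMALL LETTER E WITH ACUTE) as one pre-combined character
precomposed = u"\xe9"

# NFD splits it into the base letter followed by a combining mark
decomposed = unicodedata.normalize("NFD", precomposed)

for c in decomposed:
    # combining() returns 0 for base characters and a non-zero
    # canonical combining class for combining marks
    print("U+%04X %s combining=%d"
          % (ord(c), unicodedata.name(c), unicodedata.combining(c)))
```

The base letter reports a combining class of 0 and is kept, while the
accent reports a non-zero class and is dropped by strip_combining.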

Of course, it is still fairly limited. If you have non-Latin
scripts (Greek, Cyrillic, Arabic, Kanji, ...), this approach
fails, and you would need a transliteration database for them.
There is none built into Python, and I couldn't find a
transliteration database that transliterates all Unicode characters
into ASCII, either.
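
As a rough illustration of that limitation, the sketch below runs the
same accent-stripping idea on a Latin and a Greek string. The is_ascii
helper is hypothetical, added here only to check the result; it is not
part of any library.

```python
import unicodedata

def strip_combining(s):
    # Same idea as the function above: decompose, then drop combining marks
    decomposed = unicodedata.normalize("NFD", s)
    return u"".join(c for c in decomposed if not unicodedata.combining(c))

def is_ascii(s):
    # Hypothetical helper: True if every character is in the ASCII range
    return all(ord(c) < 128 for c in s)

latin = strip_combining(u"B\xe9la Fleck")
# Greek word "alpha" written with an accented first letter (U+03AC)
greek = strip_combining(u"\u03ac\u03bb\u03c6\u03b1")

print(is_ascii(latin))  # the Latin string is plain ASCII after stripping
print(is_ascii(greek))  # the Greek string loses its accent but stays Greek
```

The accent on the Greek string is removed just as for the Latin one,
but the base letters are still Greek, so the result cannot be encoded
as ASCII without a separate transliteration step.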

Regards,
Martin


