Looking for UNICODE to ASCII Conversion Example Code

Sat Oct 19 05:19:12 EDT 2013

On Fri, 18 Oct 2013 13:45:53 -0700, caldwellinva wrote:

> Hi!
> 
> I am looking for an example of a UNICODE to ASCII conversion example
> that will remove diacritics from characters (and leave the characters,
> i.e., Klüft to Kluft) as well as handle the conversion of other
> characters, like große to grosse.

Seems like a nasty thing to do, akin to stripping the vowels from English 
text just because Hebrew didn't write them. But if you insist, there's 
always this:

http://code.activestate.com/recipes/251871

although it is nowhere near complete, and it's pretty ugly code too.

Perhaps a cleaner method might be to use a combination of Unicode 
normalisation forms and a custom translation table. Here's a basic 
version to get you started, written for Python 3:

import unicodedata

# Do this once. It may take a while.
table = {}
for n in range(128, 0x110000):
    # Use unichr in Python2
    expanded = unicodedata.normalize('NFKD', chr(n))
    keep = [c for c in expanded if ord(c) < 128]
    if keep:
        table[n] = ''.join(keep)
    else:
        # None to delete, or use some other replacement string.
        table[n] = None

# Add extra transformations.
# In Python2, every string needs to be a Unicode string u'xyz'.
table[ord('ß')] = 'ss'
table[ord('\N{LATIN CAPITAL LETTER SHARP S}')] = 'SS'
table[ord('Æ')] = 'AE'
table[ord('æ')] = 'ae'
table[ord('Œ')] = 'OE'
table[ord('œ')] = 'oe'
table[ord('ﬁ')] = 'fi'
table[ord('ﬂ')] = 'fl'
table[ord('ø')] = 'oe'
table[ord('Ð')] = 'D'
table[ord('Þ')] = 'TH'
# etc.

# Say you don't want control characters in your string, you might 
# escape them using caret ^C notation:
for i in range(32):
    table[i] = '^%c' % (ord('@') + i)

table[127] = '^?'

# But it's probably best if you leave newlines, tabs etc. alone...
for c in '\n\r\t\f\v':
    del table[ord(c)]

# Add any more transformations you like here. Perhaps you want to
# transliterate Russian and Greek characters to English?
# table[ord(some_character)] = 'replacement'

# In Python2 there is no unicode.maketrans; skip this step and pass the
# dict straight to unicode.translate.
table = str.maketrans(table)

That's a fair chunk of work, but it only needs to be done once, at the start 
of your application. Then you call it like this:

cleaned = 'some Unicode string'.translate(table)
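
To check the table approach against the two words from the original question, here is a cut-down sketch. The function name build_ascii_table and its limit parameter are my own inventions for the demo, and only two of the extra transformations are included:

```python
import unicodedata

def build_ascii_table(limit=0x110000):
    """Map each non-ASCII code point below `limit` to its ASCII remainder."""
    table = {}
    for n in range(128, limit):
        expanded = unicodedata.normalize('NFKD', chr(n))
        keep = ''.join(c for c in expanded if ord(c) < 128)
        # None deletes the character when nothing ASCII survives.
        table[n] = keep or None
    # Letters that NFKD does not decompose need explicit entries.
    table[ord('ß')] = 'ss'
    table[ord('ø')] = 'oe'
    return table

# Covering only code points below 0x3000 keeps this demo quick;
# call build_ascii_table() with no argument for the full range.
demo_table = build_ascii_table(0x3000)
kluft = 'Klüft'.translate(demo_table)
grosse = 'große'.translate(demo_table)
print(kluft, grosse)   # Kluft grosse
```

In Python 3, str.translate accepts a plain dict keyed by code point, so the demo passes the table in directly.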

If you really want to be fancy, you can extract the name of each Unicode 
code point (if it has one!) and parse the name. Here's an example:

py> unicodedata.name('ħ')
'LATIN SMALL LETTER H WITH STROKE'
py> unicodedata.lookup('LATIN SMALL LETTER H')
'h'

but I'd only do that after the normalization step, if at all.
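
A rough sketch of that name-parsing idea, assuming the only pattern you care about is "<BASE NAME> WITH <something>". The helper name base_letter is made up for this example, and plenty of code points will not fit the pattern:

```python
import unicodedata

def base_letter(ch):
    """Return the character named before ' WITH ' in ch's name, else ch."""
    try:
        name = unicodedata.name(ch)
    except ValueError:
        return ch  # code point has no name
    base_name, sep, _ = name.partition(' WITH ')
    if not sep:
        return ch  # nothing to strip
    try:
        return unicodedata.lookup(base_name)
    except KeyError:
        return ch  # the shortened name is not a real character

print(base_letter('ħ'))   # h
print(base_letter('ø'))   # o
print(base_letter('ł'))   # l
```

None of these three letters has a decomposition, so NFKD leaves them alone, and that is the case where the name trick earns its keep.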

Too much work for your needs? Well, you can get about 80% of the way in 
only a few lines of code:

cleaned = unicodedata.normalize('NFKD', unistr)
for before, after in (
        ('ß', 'ss'), ('Æ', 'AE'), ('æ', 'ae'), ('Œ', 'OE'), ('œ', 'oe'),
        # put any more transformations here...
        ):
    cleaned = cleaned.replace(before, after)

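# NFKD leaves each accent behind as a combining character; 'ignore' drops
# those (and anything else that is still non-ASCII).
cleaned = cleaned.encode('ascii', 'ignore').decode('ascii')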
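
The error handler at the encode step matters, because NFKD leaves each accent behind as a separate combining character. One alternative, sketched here with helper names (strip_marks, to_ascii) that are purely my own, is to drop the combining marks explicitly and then encode with 'replace', so that anything which really has no ASCII form shows up as '?' instead of vanishing:

```python
import unicodedata

def strip_marks(text):
    """Decompose text and drop combining marks such as U+0308."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

def to_ascii(text, replacements=(('ß', 'ss'), ('Æ', 'AE'), ('æ', 'ae'))):
    cleaned = strip_marks(text)
    for before, after in replacements:
        cleaned = cleaned.replace(before, after)
    # Anything still non-ASCII becomes '?', so failures stay visible.
    return cleaned.encode('ascii', 'replace').decode('ascii')

print(to_ascii('Klüft'))   # Kluft
print(to_ascii('große'))   # grosse
print(to_ascii('Ωmega'))   # ?mega
```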

Another method would be this:

http://effbot.org/zone/unicode-convert.htm

which is focused on European languages. But it might suit your purposes.

> There used to be a program called any2ascii.py
> (http://www.haypocalc.com/perso/prog/python/any2ascii.py) that worked
> well, but the link is now broken and I can't seem to locate it.
> 
> I have seen the page Unicode strings to ASCII ...nicely,
> http://www.peterbe.com/plog/unicode-to-ascii, but am looking for a
> working example.

He has a working example. How much hand-holding are you looking for?

Quoting from that page:

    I'd much rather that a word like "Klüft" is converted to 
    "Kluft" which will be more human readable and still correct.

The author is wrong. That's like saying that changing the English word 
"car" to "cer" is still correct -- it absolutely is not correct, and even 
if it were, what is he implying with the quip about "more human 
readable"? That Germans and other Europeans aren't human?

If an Italian said:

    I'd much rather that a word like "jump" is converted to 
    "iump" which will be more human readable and still correct.

we'd all agree that he was talking rubbish.

Make no mistake, this sort of simple-minded stripping of accents and 
diacritics is an extremely ham-fisted thing to do. To strip out letters 
without changing the meaning of the words is, at best, hard to do right 
and requires good knowledge of the linguistic rules of the language 
you're translating. And at worst, it's outright impossible. For instance, 
in German I believe it is quite acceptable to translate 'ü' to 'ue', 
except in names: Herr Müller will probably be quite annoyed if you call 
him Herr Mueller, and Herr Mueller will probably be annoyed too, and both 
of them will be peeved to be confused with Herr Muller.
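
To make the German case concrete, here is a small sketch of a language-specific table. The table and the function name transliterate_german are assumptions of mine, not any standard, and the point is mostly to show the information loss:

```python
import unicodedata

# Hypothetical German-specific table: umlauts become vowel + 'e'.
GERMAN = str.maketrans({
    'ä': 'ae', 'ö': 'oe', 'ü': 'ue',
    'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue',
    'ß': 'ss',
})

def transliterate_german(text):
    # Compose first so that 'u' + combining diaeresis matches the 'ü' key.
    return unicodedata.normalize('NFC', text).translate(GERMAN)

print(transliterate_german('Müller'))    # Mueller
# Two different surnames now collide, and the mapping cannot be undone:
print(transliterate_german('Müller') == transliterate_german('Mueller'))   # True
```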

-- 
Steven