utf - string translation

Wed Nov 29 16:52:16 EST 2006

Fredrik Lundh wrote:
> John Machin wrote:
>
> > 3. ... and to check for missing maps. The OP may be working only with
> > French text, and may not care about Icelandic and German letters, but
> > other readers who stumble on this (and miss past thread(s) on this
> > topic) may like something done with \xde (capital thorn),  \xfe (small
> > thorn) and \xdf (sharp s aka Eszett).
>
> I did post links to code that does this to this thread, several days ago...
>

Ah yes, I missed that  -- and your posting doesn't advertise that the
code fixed the "one character should be mapped to two" cases :-)

This code
(http://effbot.python-hosting.com/file/stuff/sandbox/text/unaccent.py)
looks generally very good, but I'm left wondering why "AE" and "OE" are
in the table, not "Ae" and "Oe":
[snip]
    0xc6: u"AE", # LATIN CAPITAL LETTER AE <<<=== ??
    0xd0: u"D",  # LATIN CAPITAL LETTER ETH
    0xd8: u"OE", # LATIN CAPITAL LETTER O WITH STROKE <<<=== ??
    0xde: u"Th", # LATIN CAPITAL LETTER THORN
[snip]

Another point: there are many non-latin1 characters that could be
mapped to ASCII. For example:
    u"\u0141ukasziewicz".translate(unaccented_map())
doesn't work unless an entry is added to the no-decomposition table:
    0x0141: u"L", # LATIN CAPITAL LETTER L WITH STROKE
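
A small sketch of that point, written for Python 3 (where every str is Unicode). The extra_map dict below is a hypothetical stand-in for the no-decomposition table in unaccent.py, holding only entries quoted in this thread; it is not the real table. It shows that normalization alone leaves the L-with-stroke untouched, whereas an explicit entry handles it:

```python
import unicodedata

# Hypothetical mini-table standing in for the no-decomposition table
# in unaccent.py; only entries quoted in this thread are included.
extra_map = {
    0x00C6: "AE",  # LATIN CAPITAL LETTER AE
    0x00DE: "Th",  # LATIN CAPITAL LETTER THORN
    0x0141: "L",   # LATIN CAPITAL LETTER L WITH STROKE
}

name = "\u0141ukasziewicz"

# The character has no decomposition, so NFKD leaves it as it was.
print(unicodedata.decomposition("\u0141") == "")
print(unicodedata.normalize("NFKD", name) == name)

# str.translate accepts a dict keyed by code point; the values may be
# strings of any length, which covers the one-to-two cases as well.
print(name.translate(extra_map))
```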

It looks like generating extra entries like that could be done, with
the aid of unicodedata.name():

LATIN CAPITAL LETTER X WITH blahblah -> "X"
LATIN SMALL LETTER X WITH blahblah -> "X".lower()

This would require a fair bit of care -- obviously there are special
cases like LATIN CAPITAL LETTER O WITH STROKE. Eyeballing by regional
experts is probably required.
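
For what it's worth, here is a rough sketch of how such candidates might be generated (Python 3). The function name candidate_entries, the regular expression and the default code point range are my own assumptions, not anything in unaccent.py. It only proposes entries for characters that have no decomposition and whose name matches the "LETTER X WITH" pattern, and its output is a list of guesses to be reviewed, not a finished table:

```python
import re
import unicodedata

# Matches names such as "LATIN CAPITAL LETTER L WITH STROKE".
# Names like "LATIN CAPITAL LETTER AE" or "... THORN" do not match.
_NAME_RE = re.compile(r"^LATIN (CAPITAL|SMALL) LETTER ([A-Z]) WITH ")

def candidate_entries(first=0x00C0, last=0x024F):
    """Return {code point: ASCII letter} guesses for characters with
    no decomposition whose name follows the pattern above."""
    entries = {}
    for cp in range(first, last + 1):
        ch = chr(cp)
        if unicodedata.decomposition(ch):
            continue  # a decomposition exists; leave it to that path
        m = _NAME_RE.match(unicodedata.name(ch, ""))
        if m:
            case, letter = m.groups()
            entries[cp] = letter if case == "CAPITAL" else letter.lower()
    return entries

guesses = candidate_entries()
for cp in (0x0141, 0x0142, 0x00D8):
    print(hex(cp), unicodedata.name(chr(cp)), "->", guesses[cp])
```

Note that this proposes plain "O" for 0xd8, where the hand-written table has "OE", which is exactly the kind of special case that needs a human to look at it.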

Cheers,
John