utf - string translation

John Machin sjmachin at lexicon.net
Wed Nov 29 14:53:15 EST 2006


Frederic Rentsch wrote:

> Try this:
>
> from_characters   =
> '\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xff\xe7\xe8\xe9\xea\xeb'
> to_characters     =
> 'AAAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaaiiiionoooooouuuuyyceeee'
> translation_table = string.maketrans (from_characters, to_characters)
> translated_string = string.translate (original_string, translation_table)
>

A few observations on the above:

1. This assumes that "original_string" is a str object, and the text is
encoded in latin1 or similar (e.g. cp1252).
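To illustrate point 1, here is a rough sketch in Python 3 terms, where the quoted str-based recipe corresponds to translating bytes. The sample word and the two-entry table are made up for illustration. They are not the OP's data.

```python
# Python 3 sketch; the sample text and the two-entry table are illustrative only.
sample = 'caf\xe9'  # "cafe" with e-acute

# Two entries borrowed from the quoted map: \xc3 -> A and \xe9 -> e.
byte_table = bytes.maketrans(b'\xc3\xe9', b'Ae')

# latin1: one byte per character, so the byte table applies as intended.
as_latin1 = sample.encode('latin1').translate(byte_table)

# UTF-8: e-acute is the two bytes \xc3\xa9, and the lead byte \xc3 is
# mistaken for A-tilde while \xa9 is left behind.
as_utf8 = sample.encode('utf-8').translate(byte_table)

print(as_latin1)
print(as_utf8)

# Decoding first and translating code points avoids the problem.
text_table = {0xe9: 'e'}
decoded = sample.encode('utf-8').decode('utf-8').translate(text_table)
print(decoded)
```

If the data really is UTF-8, as the subject line hints, decoding before translating is probably the safer route.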

2. Presentation of the map could be improved greatly, along the lines
of:

import pprint
import unicodedata
fromc = \
[snip]
toc = 'AAAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaaiiiionoooooouuuuyyceeee'
assert len(fromc) == len(toc)
tups = list(zip(unicode(fromc, 'latin1'), toc))
tups.sort()
tupsu = [(x[1], x[0], unicodedata.name(x[0], '** no name **'))
         for x in tups]
pprint.pprint(tupsu)

which produces:

[('A', u'\xc0', 'LATIN CAPITAL LETTER A WITH GRAVE'),
 ('A', u'\xc1', 'LATIN CAPITAL LETTER A WITH ACUTE'),
[snip]
 ('D', u'\xd0', 'LATIN CAPITAL LETTER ETH'),
[snip]
 ('Y', u'\xdd', 'LATIN CAPITAL LETTER Y WITH ACUTE'),
 ('a', u'\xe0', 'LATIN SMALL LETTER A WITH GRAVE'),
[snip]
 ('o', u'\xf0', 'LATIN SMALL LETTER ETH'),
[snip]
 ('y', u'\xfd', 'LATIN SMALL LETTER Y WITH ACUTE'),
 ('y', u'\xff', 'LATIN SMALL LETTER Y WITH DIAERESIS')]

This makes it a lot easier to see what is going on, and check for
weirdness, like the inconsistent treatment of \xd0 and \xf0: capital eth
maps to 'D' but small eth maps to 'o'.
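That kind of check can also be automated. The following is a minimal sketch, written for Python 3, that reuses the from/to strings from the quoted post. The case-consistency rule (an uppercase source and its lowercase counterpart should map to the same letter, apart from case) is my own assumption about what "consistent" ought to mean here.

```python
# Python 3 sketch: flag pairs where upper- and lowercase sources disagree.
fromc = ('\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf'
         '\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd'
         '\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xec\xed\xee\xef'
         '\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xff'
         '\xe7\xe8\xe9\xea\xeb')
toc = 'AAAAAAACEEEEIIIIDNOOOOOOUUUUYaaaaaaaiiiionoooooouuuuyyceeee'
assert len(fromc) == len(toc)

mapping = dict(zip(fromc, toc))

mismatches = []
for src in fromc:
    low = src.lower()
    # Only uppercase sources whose lowercase form is also in the table.
    if low != src and low in mapping:
        if mapping[low] != mapping[src].lower():
            mismatches.append((src, mapping[src], low, mapping[low]))

for src, dst, low, low_dst in mismatches:
    print(repr(src), '->', dst, 'but', repr(low), '->', low_dst)
```

With the quoted table, the only pair this should report is \xd0 and \xf0.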

3. ... and to check for missing maps. The OP may be working only with
French text, and may not care about Icelandic and German letters, but
other readers who stumble on this (and miss past thread(s) on this
topic) may like something done with \xde (capital thorn), \xfe (small
thorn) and \xdf (sharp s aka Eszett).
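For what it's worth, the missing entries can be listed mechanically. This is a sketch for Python 3 that scans only the 0xC0-0xFF part of latin1. The 'Th', 'th' and 'ss' replacements at the end are one common choice that I am assuming for illustration, not something settled in this thread.

```python
import unicodedata

# Source characters from the quoted table, as Python 3 text.
fromc = ('\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf'
         '\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd'
         '\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xec\xed\xee\xef'
         '\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xff'
         '\xe7\xe8\xe9\xea\xeb')

# Letters in 0xC0-0xFF that have no entry in the table.
# (\xd7 and \xf7 are the multiplication and division signs, not letters.)
missing = [chr(cp) for cp in range(0xC0, 0x100)
           if unicodedata.category(chr(cp)).startswith('L')
           and chr(cp) not in fromc]
for ch in missing:
    print(hex(ord(ch)), unicodedata.name(ch))

# Hypothetical multi-character replacements, keyed by code point.
extra = {0xde: 'Th', 0xfe: 'th', 0xdf: 'ss'}
print('Stra\xdfe'.translate(extra))
```

A dict-based translate is used here because a one-to-one table like the one string.maketrans builds cannot express a replacement longer than one character.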

Cheers,
John



