Replace accented chars with unaccented ones
Josiah Carlson
jcarlson at nospam.uci.edu
Mon Mar 15 21:19:00 EST 2004
Jeff Epler wrote:
> You have two options. First, convert the string to Unicode and use code
> like the following:
>
> replacements = [(u'\xe9', 'e'), ...]
> def remove_accents(u):
>     for a, b in replacements:
>         u = u.replace(a, b)
>     return u
>
>
> >>> remove_accents(u'\xe9')
> u'e'
>
> Second, if you are using a single-byte encoding (iso8859-1, for
> instance), then work with byte strings:
> import string
> replacement_map = string.maketrans('\xe9...', 'e...')
> def remove_accents(s):
>     return s.translate(replacement_map)
>
>
> >>> remove_accents('\xe9')
> 'e'
>
> If you want to have strings like u'é' in your programs, you have to
> include a line at the top of the source file that tells Python the
> encoding, like the following line does:
> # -*- coding: utf-8 -*-
> (except you have to name the encoding your editor uses, if it's not
> utf-8). See http://python.org/peps/pep-0263.html
>
> Once you've done that, you can write
> replacements = [(u'é', 'e'), ...]
> instead of using the \xXX escape for it.
Translating the replacement pairs into a dictionary would result in a
significant speedup for large numbers of replacements.
# 'replacements' is the list of (accented, plain) pairs quoted above;
# this assumes every accented key is a single character.
mapping = dict(replacements)
def multi_replace(inp, mapping=mapping):
    # one dict lookup per character; unmapped characters pass through
    return u''.join([mapping.get(i, i) for i in inp])
One pass through the string gives an O(len(inp)) algorithm, much better
(running-time wise) than the repeated replace() calls, which run in
O(len(inp) * len(replacements)) time as given.
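To make the comparison concrete, here is a minimal side-by-side sketch of the two approaches. It is written for Python 3, where plain string literals are already Unicode, and the four pairs and the sample string are made up for illustration; they are nowhere near a complete accent table.

```python
# Illustrative pairs only (assumed for this sketch, not a full table).
replacements = [('\xe9', 'e'), ('\xe8', 'e'), ('\xe0', 'a'), ('\xf1', 'n')]

def remove_accents(u, pairs=replacements):
    # Loop version: one full scan of the string per pair,
    # so roughly len(u) * len(pairs) work.
    for a, b in pairs:
        u = u.replace(a, b)
    return u

mapping = dict(replacements)

def multi_replace(inp, mapping=mapping):
    # Dictionary version: one lookup per character in a single pass;
    # characters without an entry are passed through unchanged.
    return ''.join([mapping.get(ch, ch) for ch in inp])

sample = 'caf\xe9 \xe0 la cr\xe8me, se\xf1or'
print(multi_replace(sample))   # cafe a la creme, senor
```

The two functions agree here because every key is a single character and no replacement is itself a key. With multi-character keys the per-character lookup would not apply, and the loop version would be the one to keep.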
- Josiah