trying to strip out non ascii.. or rather convert non ascii

MRAB python at mrabarnett.plus.com
Sat Oct 26 17:07:58 EDT 2013


On 26/10/2013 21:11, bruce wrote:
> hi..
>
> getting some files via curl, and want to convert them from what i'm
> guessing to be unicode.
>
> I'd like to convert a string like this::
> <div class="profName"><a href="ShowRatings.jsp?tid=1312168">Alcántar,
> Iliana</a></div>
>
> to::
> <div class="profName"><a href="ShowRatings.jsp?tid=1312168">Alcantar,
> Iliana</a></div>
>
> where I convert the
> " á " to " a"
>
> which appears to be a shift of 128, but I'm not sure how to accomplish this..
>
> I've tested using the different decode/encode functions using
> utf-8/ascii with no luck.
>
> I've reviewed stack overflow, as well as a few other sites, but
> haven't hit the aha moment.
>
> pointers/comments would be welcome.
>
Why do you want to do that?

The short answer is that these days you should be working with Unicode
rather than trying to force text into ASCII.

The longer answer is that you could normalise the string to the NFKD
form, which splits an accented letter such as 'á' into the base letter
'a' plus a combining accent, and then discard any codepoints outside the
ASCII range:

>>> # Python 3; on Python 2 write the literal as u"Alcántar"
>>> import unicodedata
>>> t = unicodedata.normalize("NFKD", "Alcántar")
>>> "".join(c for c in t if ord(c) < 0x80)
'Alcantar'

The disadvantage, of course, is that it throws away every codepoint that
has no ASCII decomposition: 'ø' and 'ß', for example, simply disappear,
as does any text in a non-Latin script.
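
As a rough sketch, the NFKD approach can be wrapped in a small function to
see what survives and what doesn't. The name strip_to_ascii is made up for
illustration, and the decode step assumes the downloaded bytes are UTF-8,
which the original post doesn't confirm:

```python
import unicodedata


def strip_to_ascii(text):
    """Decompose with NFKD, then drop everything outside the ASCII range."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if ord(c) < 0x80)


# Assumption: the bytes fetched by curl are UTF-8 encoded.
raw = "Alcántar, Iliana".encode("utf-8")
print(strip_to_ascii(raw.decode("utf-8")))  # accent dropped, base letter kept

# Letters with no decomposition are lost entirely rather than converted.
print(strip_to_ascii("Øystein Straße"))
```

The second call prints 'ystein Strae': the 'Ø' and 'ß' have no ASCII base
letter to fall back on, so they vanish.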

Have a look at Unidecode:

http://pypi.python.org/pypi/Unidecode
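
A minimal sketch of how it might be used, assuming the package has been
installed (it isn't in the standard library) and exposes the unidecode()
function. The to_ascii helper and its NFKD fallback are hypothetical, just
so the snippet still runs when Unidecode isn't available:

```python
import unicodedata

try:
    from unidecode import unidecode  # third-party; assumed to be installed
except ImportError:
    unidecode = None


def to_ascii(text):
    """Transliterate to ASCII with Unidecode if present, else strip via NFKD."""
    if unidecode is not None:
        return unidecode(text)
    # Fallback: decompose, then silently drop anything that isn't ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Alcántar, Iliana"))
```

Unlike the plain NFKD filter, Unidecode tries to substitute an ASCII
approximation for characters that don't decompose instead of dropping
them, so the two paths can give different results on such input.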



