trying to strip out non ascii.. or rather convert non ascii

Tim Chase python.list at tim.thechases.com
Sat Oct 26 22:17:29 EDT 2013


On 2013-10-26 21:54, Roy Smith wrote:
> In article <mailman.1628.1382838024.18130.python-list at python.org>,
>  Tim Chase <python.list at tim.thechases.com> wrote:
>> I'd be just as happy if Python provided a "sloppy string compare"
>> that ignored case, diacritical marks, and the like.
> 
> The problem with putting fuzzy matching in the core language is
> that there is no general agreement on how it's supposed to work.
> 
> There are, however, third-party libraries which do fuzzy matching.
> One popular one is jellyfish
> (https://pypi.python.org/pypi/jellyfish/0.1.2).

Bookmarking and archiving your email for future reference.

> Don't expect you can just download and use it right out of the box,
> however. You'll need to do a little thinking about which of the
> several algorithms it includes makes sense for your application.

I'd be content with a baseline that decomposes the string (NFKD
normalization) and then strips out the combining diacritical marks,
something akin to MRAB's

  from unicodedata import normalize
  "".join(c for c in normalize("NFKD", s) if ord(c) < 0x80)

and tweaking it if that was insufficient.
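
For the archives, here is a minimal sketch of what such a baseline and
one possible tweak might look like. The function names (ascii_only,
strip_marks, sloppy_equal) are made up for illustration and are not
part of any library; str.casefold() assumes Python 3.3 or later. The
tweak filters on unicodedata.combining() instead of the code point, so
it drops only the combining marks and keeps letters that have no
decomposition (such as "ß"), which the ord(c) < 0x80 test would
silently remove.

```python
from unicodedata import combining, normalize


def ascii_only(s):
    """MRAB-style baseline: decompose, then keep only ASCII characters."""
    return "".join(c for c in normalize("NFKD", s) if ord(c) < 0x80)


def strip_marks(s):
    """Decompose, then drop combining marks but keep everything else."""
    return "".join(c for c in normalize("NFKD", s) if not combining(c))


def sloppy_equal(a, b):
    """Hypothetical 'sloppy compare': ignore diacritics and case."""
    return strip_marks(a).casefold() == strip_marks(b).casefold()


if __name__ == "__main__":
    print(ascii_only("café"))            # accent removed, base letter kept
    print(strip_marks("Straße"))         # no combining marks, so unchanged
    print(sloppy_equal("Café", "CAFE"))  # compares equal under this scheme
```

Whether dropping or keeping the undecomposable characters is the right
behaviour depends on the application, which is rather Roy's point.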

Thanks for the link to Jellyfish.

-tkc





