trying to strip out non ascii.. or rather convert non ascii

Tim Chase python.list at tim.thechases.com
Sat Oct 26 21:41:58 EDT 2013


On 2013-10-26 22:24, Steven D'Aprano wrote:
> Why on earth would you want to throw away perfectly good
> information? 

The main reason I've needed to do it in the past is for normalization
of search queries.  When a user wants to find something containing
"pingüino", I want to have those results come back even if they type
"pinguino" in the search box.

For the same reason searches are often normalized to ignore case.
The difference between "Polish" and "polish" is visually just
capitalization, but most folks don't think twice about

  if term.upper() in datum.upper():
    it_matches()

I'd be just as happy if Python provided a "sloppy string compare"
that ignored case, diacritical marks, and the like.

  unicode_haystack1 = u"pingüino"
  unicode_haystack2 = u"¡Miré un pingüino!"
  needle = u"pinguino"
  if unicode_haystack1.sloppy_equals(needle):
    it_matches()
  if unicode_haystack2.sloppy_contains(needle):
    it_contains()

As a matter of fact, I'd even be happier if Python did the heavy
lifting, since I wouldn't have to think about whether I want my code
to force upper-vs-lower for the comparison. :-)

-tkc









More information about the Python-list mailing list