locating strings approximately

Thu Jun 29 08:02:50 EDT 2006

In article <44a315ad$0$31638$e4fe514c at news.xs4all.nl>,
Irmen de Jong  <irmen.NOSPAM at xs4all.nl> wrote:
>BBands wrote:
>> I'd like to see if a string exists, even approximately, in another. For
>> example if "black" exists in "blakbird" or if "beatles" exists in
>> "beatlemania". The application is to look though a long list of songs
>> and return any approximate matches along with a confidence factor. I
>> have looked at edit distance, but that isn't a good choice for finding
>> a short string in a longer one. I have also explored
>> difflib.SequenceMatcher and .get_close_matches, but what I'd really
>> like is something like:
>> 
>> a = FindApprox("beatles", "beatlemania")
>> print a
>> 0.857
>> 
>> Any ideas?
>> 
>>     jab
>> 
>
>I collected a few pointers in this article:
>
>http://www.razorvine.net/frog/user/irmen/article/2005-05-28/53
>
>It contains one or two additional methods besides the ones you mentioned.
>Sorry, it's in Dutch, but you can find the links without much trouble i guess.

For those who don't read Dutch, here's a translation of that page:

========================

Python </frog/user/irmen/category/2>

By fuzzy string comparision, I mean comparing two strings and getting
a result that indicates how similar the strings are to each
other. That is, it's not an absoulte string compaision (that only
tells if the two strings are absolutely the same or not), but a more
quantative string comparision.

It can be used to, for example, to remove typos from commands, a
specific form of spelling checker or for other natural language
processing.

Python has a few modules which offer a fuzzy string comparision.

â€”

QWERTY - or protein distances

Recently, BoB Hooft posted an interesting module in comp.lang.python
which uses an algorithm that is normally used for comparing proteins,
but now works for ASCII strings. The result is a float which is
negative if the strings are very different and increases up to
len(string) + 1 if the strings are identical, the score can be viewd
as 'the number of correspondingn letters'. The algorithm uses the
layout of the QWERTY keyboard to specifiy the distance between the
stings and therefore is restricted to ASCII strings (and only makes
sense if one uses a QWERTY keyboard), but therefor is very well suited
to compare an imput command with all the possible commands and thus to
identify typos. 

Why not use a normalised score between 0.0 and 1.0? 
Rob: It doesn't matter if you only want the 'best score' from a number
of strings. Negative results appear if the strings are not similar and
also don't even have the same length.

The module can be found here: : comparestrings.py
<http://starship.python.net/crew/hooft/comparestrings.py>.

*Python standard library*

In de standard lib hebben we difflib met de functie get_close_matches
<http://docs.python.org/lib/module-difflib.html#l2h-922>. Groot voordeel
dat deze dus standaard bij Python zit.

The standard library has difflib with the function get_close_matches
<http://docs.python.org/lib/module-difflib.html#l2h-922>. The big
advantage is that this is standard with Python.

*Febrl (biometrisch)*

Febrl <https://sourceforge.net/projects/febrl/> has a number of string
comparision functions, including the 'edit distance' or 'Levenshtein
distance' function. For the latter, there is an implementation on
Useless Python
<http://www.uselesspython.com/download.php?script_id=108>.

*Apse*

An extension module (written in C):  Apse
<http://www.personal.psu.edu/staff/i/u/iua1/python/apse/>. 

Requires python, SWIG, and a C compiler

*Java*

SecondString <http://sourceforge.net/projects/secondstring/> is an
open source package with Java implementations of a large number of
string comparision algorithms. They should not be too difficult to
port to Python.

-- 
Jim Segrave           (jes at jes-2.demon.nl)