Spell-check engine?
Mike C. Fletcher
mcfletch at rogers.com
Sun Oct 20 19:24:48 EDT 2002
Well, I've been playing with this a little more. It seems to me that
about 70% of the complexity of the aspell code is just getting around
the limitations of C++. The engine itself seems to be composed of a few
fairly minimal parts:
(limited) edit distance algorithm:
provides a mechanism for comparing two strings of chars and
generating an integer "difference" measure. The "limited" feature
basically gives you the ability to say "anything over this number is a
'big' difference", which allows short-circuiting the check when you are
looking for similar strings. That short-circuiting is going to be
critical with the way the algorithm is used (see below). Difflib, fast
as it is, probably wouldn't work in this usage.
I've ripped this algorithm out of aspell and made it into a
Python module with SWIG. However, I have only used one of the variants
(the plain limited version). I haven't wrapped the "typo" code, for
instance.
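          For concreteness, here's a pure-Python sketch of the "limited"
idea -- just a plain Levenshtein computation with a row-minimum cutoff,
nothing like as optimised as the aspell code (and the
limitedEditDistance name is mine, not aspell's):

def limitedEditDistance( a, b, limit ):
    """Edit distance between a and b, short-circuiting to limit + 1
    as soon as no alignment can possibly come in at or under limit."""
    # standard dynamic-programming table, built one row at a time
    previous = list(range(len(b) + 1))
    for i in range(len(a)):
        current = [i + 1]
        for j in range(len(b)):
            current.append(min(
                previous[j + 1] + 1,            # deletion
                current[j] + 1,                 # insertion
                previous[j] + (a[i] != b[j]),   # substitution
            ))
        # every cell in later rows is >= the minimum of this row, so
        # once the whole row is over the limit we can bail out early
        if min(current) > limit:
            return limit + 1
        previous = current
    return previous[-1]

That early bail-out is the whole point: when you're scanning thousands
of "soundslike" keys, most candidates get rejected after a couple of
rows instead of paying for the full table.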
("genetic") phonetic compression algorithm:
Compresses a word according to rules defined in rule-files
available for a large number of languages. I re-implemented this
according to the documentation (including the ability to read the same
rule-files), rather than trying to use the code in aspell. There
doesn't appear to be a decent suite of test data to see if I've properly
constructed my version (I know I haven't implemented priorities
correctly, for instance).
          Since then I've realised that what I had taken for an
incredibly obscure algorithm for doing the phonetic compression is
mostly just C++ cruft for parsing the data file. The algorithm I
created is _nothing_ like the aspell algorithm (which looks very
heavily optimised), so likely I should just rip that little method out
and turn it into another extension module.
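          To give a flavour of the rule-application step, here's my
naive greedy longest-match version with a made-up rule table (the real
aspell rule files add priorities, context conditions and so on):

# made-up rule table: spelling pattern -> sound code
toyRules = {
    'ph': 'F', 'gh': 'F', 'ck': 'K', 'c': 'K', 'k': 'K',
    'a': 'A', 'e': 'A', 'i': 'A', 'o': 'A', 'u': 'A',
}

def toSound( word, rules=toyRules ):
    """Compress word to a "soundslike" string, longest match first."""
    word = word.lower()
    longest = max([len(pattern) for pattern in rules.keys()])
    out = []
    i = 0
    while i < len(word):
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in rules:
                out.append(rules[chunk])
                i = i + size
                break
        else:
            # no rule matched: pass the character through
            out.append(word[i].upper())
            i = i + 1
    return ''.join(out)

# toSound('phonetic') == toSound('fonetik') == 'FANATAK'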
typo data: (I haven't touched this yet)
This appears to be a simple mapping for common keyboards that
can be read and used to weight the distance algorithm. I must be
missing something in how it's applied, because this looks ridiculously
simple (you load up around 20 two-key sets, and down-weight any
substitution that involves both characters) but it doesn't appear to be
hooked up to the edit distance algo AFAICT.
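          If it _were_ hooked up, I'd guess it would look something
like the following (the pairs here are invented examples, not the
contents of a real typo file):

# invented sample of adjacent-key pairs; the real data files would
# supply these per keyboard layout
adjacentPairs = [ 'qw', 'we', 'er', 'rt', 'ui', 'io', 'op',
                  'sd', 'df', 'fg', 'gh', 'jk', 'kl',
                  'xc', 'cv', 'nm' ]
adjacentKeys = {}
for pair in adjacentPairs:
    adjacentKeys[ (pair[0], pair[1]) ] = 1
    adjacentKeys[ (pair[1], pair[0]) ] = 1

def substitutionCost( a, b ):
    """Down-weight substitutions between adjacent keys."""
    if a == b:
        return 0
    if (a, b) in adjacentKeys:
        return 1    # probably a slipped finger
    return 2        # ordinary substitution

You'd plug substitutionCost in where the flat (a[i] != b[j]) term
appears in the edit-distance sketch above, scaling the insert/delete
costs to match.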
filters, converters, tokenisers:
Lots of stuff for turning unstructured text into "words" for
checking. Includes handling of character sets and the like. I think
almost all of this could easily be handled with Python's re and string
modules, so I don't have any strong interest in porting it. For an
engine, it seems likely that providing these services as their own
engine is a more appropriate tack (if it's done at all).
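e.g. the word-extraction part really is a one-liner with re (the
pattern below is a first approximation of mine; real filters for TeX,
HTML and the like would sit in front of it):

import re

# keep internal apostrophes ("don't") but drop other punctuation
wordPattern = re.compile( r"[A-Za-z]+(?:'[A-Za-z]+)?" )

def tokenise( text ):
    return wordPattern.findall( text )

# tokenise( "Don't re-check URLs!" ) -> ["Don't", 're', 'check', 'URLs']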
From there, basically, here's how the matching goes:
class WordSet:
    def __init__( self, exactTable=None, soundTable=None ):
        # exactTable maps word -> 1; soundTable maps soundslike -> {word: 1}
        self.exactTable = exactTable or {}
        self.soundTable = soundTable or {}
    def check( self, word ):
        return word in self.exactTable
    def equalSound( self, sound ):
        """Find all words in this set with an exactly-equal "soundslike" value"""
        return self.soundTable.get( sound, {} )
    def similarSound( self, sound, distance=1 ):
        """Find all words in this set with a similar "soundslike" value (expensive)"""
        # either generate an exhaustive list of diff'd sound values from sound
        # or do a brute-force iteration through the entire soundTable computing
        # the limited-edit-distance value (the limitedEditDistance sketch
        # above); the brute-force version:
        found = {}
        for candidate, words in self.soundTable.items():
            if limitedEditDistance( sound, candidate, distance ) <= distance:
                found.update( words )
        return found

def check( word ):
    """Check whether a word is in the current dictionary (set of wordsets)"""
    for wordset in activeSets:
        if wordset.check( word ):
            return wordset
    return None

def suggest( word, editDistance=1 ):
    """Suggest words which might be correct spellings of the word"""
    found = {}
    sound = currentPhonetic.convert( word )
    for wordset in activeSets:
        found.update( wordset.equalSound( sound ))
    for wordset in activeSets:
        found.update( wordset.similarSound( sound, distance=editDistance ))
    return found
So, to work properly, you need to have your word-sets compiled into two
tables: the first is just used for "is-in" tests (i.e. by the "check"
methods) with the key == the word itself. The second maps
"soundslike" -> { word: 1, word: 1, ...} (a dict, so the suggest code
can update() straight into its result set). You'd want the keys of that
table to be easily iterable if using the brute-force search mechanism
for "similar" sound matches. Note that that's a _really_ slow approach
if you're having to do len(entire dictionary) sound comparisons for
each of len(document) checks, so you'd want to be pretty sure you only
call similarSound a few times/minute.
From where I sit, it looks as though it would be possible to provide an
engine that could "consume" a simple word-list to feed itself. I
haven't yet figured out how to use the (binary) rwl and cwl files from
aspell's language modules, however. That would be pretty useful for
getting the engine fed with high-quality data.
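The plain word-list case, at least, is trivial (the loadWordList helper
and file name are just illustrative):

def loadWordList( filename, phonetic ):
    """Build the two tables above from a one-word-per-line list;
    phonetic is anything with a convert( word ) -> soundslike method."""
    exactTable = {}
    soundTable = {}
    for line in open( filename ):
        word = line.strip()
        if word:
            exactTable[ word ] = 1
            soundTable.setdefault( phonetic.convert( word ), {} )[ word ] = 1
    return exactTable, soundTable

# e.g. english = WordSet( *loadWordList( 'english.wl', currentPhonetic ))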
Anyway, going back to playing now. Have fun,
Mike
Terry Hancock wrote:
>On Friday 18 October 2002 01:00 pm, python-list-request at python.org wrote:
>
>
>>Anyone have a fairly standard spell-check engine available for use from
>>Python? That is, something with an interface something like this:
>>
>>
>
>I don't know of any ready-made solutions, and could use one myself, so
>if you find something, please post. I immediately imagine using a wrapper
>for ispell or aspell or the pspell library though I am ignorant of how to do
>that. I was able to find a few interesting leads, though:
>
>ispell: http://fmg-www.cs.ucla.edu/geoff/ispell.html
>
>aspell: http://aspell.sourceforge.net
>
>
...
>Alternatively, if you want a pure-python approach, I'm pretty sure that
>the standard library module "difflib" could be persuaded to fuzzy-match
>words. It will fuzzy-match sequences, and a word can be regarded as
>a sequence of characters. E.g.:
>
...
_______________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://members.rogers.com/mcfletch/