Spell-check engine?
Mike C. Fletcher
mcfletch at rogers.com
Sun Oct 20 19:24:48 EDT 2002
Well, I've been playing with this a little more. It seems to me that
about 70% of the complexity of the aspell code is just getting around
the limitations of C++. The engine itself seems to be composed of a few
fairly minimal parts:
(limited) edit distance algorithm:
provides a mechanism for comparing two strings of chars and
generating an integer "difference" measure. The "limited" feature
basically gives you the ability to say "anything over this number is a
'big' difference", which allows short-circuiting the check when you are
looking for similar strings. That short-circuiting is going to be
critical with the way the algorithm is used (see below). Difflib, fast
as it is, probably wouldn't work in this usage.
I've ripped this algorithm out of aspell and made it into a
Python module with SWIG. However, I have only used one of the variants
(the plain limited version). I haven't wrapped the "typo" code, for
instance.
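          For concreteness, here's a pure-Python sketch of the "limited"
idea -- just a plain Levenshtein computation with a row-minimum cutoff,
nothing like as optimised as the aspell code (and the
limitedEditDistance name is mine, not aspell's):

def limitedEditDistance( a, b, limit ):
    """Edit distance between a and b, short-circuiting to limit + 1
    as soon as no alignment can possibly come in at or under limit."""
    # standard dynamic-programming table, built one row at a time
    previous = list(range(len(b) + 1))
    for i in range(len(a)):
        current = [i + 1]
        for j in range(len(b)):
            current.append(min(
                previous[j + 1] + 1,            # deletion
                current[j] + 1,                 # insertion
                previous[j] + (a[i] != b[j]),   # substitution
            ))
        # every cell in later rows is >= the minimum of this row, so
        # once the whole row is over the limit we can bail out early
        if min(current) > limit:
            return limit + 1
        previous = current
    return previous[-1]

That early bail-out is the whole point: when you're scanning thousands
of "soundslike" keys, most candidates get rejected after a couple of
rows instead of paying for the full table.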
("genetic") phonetic compression algorithm:
Compresses a word according to rules defined in rule-files
available for a large number of languages. I re-implemented this
according to the documentation (including the ability to read the same
rule-files), rather than trying to use the code in aspell. There
doesn't appear to be a decent suite of test data to see if I've properly
constructed my version (I know I haven't implemented priorities
correctly, for instance).
          Since then I've realised that what I had taken for an
incredibly obscure algorithm for doing the phonetic compression is
mostly just C++ cruft for parsing the data file. The algorithm I
created is _nothing_ like the aspell algorithm (which looks very
heavily optimised), so likely I should just rip that little method out
and turn it into another extension module.
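          To give a flavour of the rule-application step, here's my
naive greedy longest-match version with a made-up rule table (the real
aspell rule files add priorities, context conditions and so on):

# made-up rule table: spelling pattern -> sound code
toyRules = {
    'ph': 'F', 'gh': 'F', 'ck': 'K', 'c': 'K', 'k': 'K',
    'a': 'A', 'e': 'A', 'i': 'A', 'o': 'A', 'u': 'A',
}

def toSound( word, rules=toyRules ):
    """Compress word to a "soundslike" string, longest match first."""
    word = word.lower()
    longest = max([len(pattern) for pattern in rules.keys()])
    out = []
    i = 0
    while i < len(word):
        for size in range(min(longest, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in rules:
                out.append(rules[chunk])
                i = i + size
                break
        else:
            # no rule matched: pass the character through
            out.append(word[i].upper())
            i = i + 1
    return ''.join(out)

# toSound('phonetic') == toSound('fonetik') == 'FANATAK'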
typo data: (I haven't touched this yet)
This appears to be a simple mapping for common keyboards that
can be read and used to weight the distance algorithm. I must be
missing something in how it's applied, because this looks ridiculously
simple (you load up around 20 two-key sets, and down-weight any
substitution that involves both characters) but it doesn't appear to be
hooked up to the edit distance algo AFAICT.
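          If it _were_ hooked up, I'd guess it would look something
like the following (the pairs here are invented examples, not the
contents of a real typo file):

# invented sample of adjacent-key pairs; the real data files would
# supply these per keyboard layout
adjacentPairs = [ 'qw', 'we', 'er', 'rt', 'ui', 'io', 'op',
                  'sd', 'df', 'fg', 'gh', 'jk', 'kl',
                  'xc', 'cv', 'nm' ]
adjacentKeys = {}
for pair in adjacentPairs:
    adjacentKeys[ (pair[0], pair[1]) ] = 1
    adjacentKeys[ (pair[1], pair[0]) ] = 1

def substitutionCost( a, b ):
    """Down-weight substitutions between adjacent keys."""
    if a == b:
        return 0
    if (a, b) in adjacentKeys:
        return 1    # probably a slipped finger
    return 2        # ordinary substitution

You'd plug substitutionCost in where the flat (a[i] != b[j]) term
appears in the edit-distance sketch above, scaling the insert/delete
costs to match.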
filters, converters, tokenisers:
Lots of stuff for turning unstructured text into "words" for
checking. Includes handling of character sets and the like. I think
almost all of this could easily be handled with Python's re and string
modules, so I don't have any strong interest in porting it. For an
engine, it seems likely that providing these services as their own
engine is a more appropriate tack (if it's done at all).
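e.g. the word-extraction part really is a one-liner with re (the
pattern below is a first approximation of mine; real filters for TeX,
HTML and the like would sit in front of it):

import re

# keep internal apostrophes ("don't") but drop other punctuation
wordPattern = re.compile( r"[A-Za-z]+(?:'[A-Za-z]+)?" )

def tokenise( text ):
    return wordPattern.findall( text )

# tokenise( "Don't re-check URLs!" ) -> ["Don't", 're', 'check', 'URLs']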
From there, basically, here's how the matching goes:
class WordSet:
    def __init__( self, exactTable=None, soundTable=None ):
        # exactTable maps word -> 1; soundTable maps soundslike -> {word: 1}
        self.exactTable = exactTable or {}
        self.soundTable = soundTable or {}
    def check( self, word ):
        return word in self.exactTable
    def equalSound( self, sound ):
        """Find all words in this set with an exactly-equal "soundslike" value"""
        return self.soundTable.get( sound, {} )
    def similarSound( self, sound, distance=1 ):
        """Find all words in this set with a similar "soundslike" value (expensive)"""
        # either generate an exhaustive list of diff'd sound values from sound
        # or do a brute-force iteration through the entire soundTable computing
        # the limited-edit-distance value (the limitedEditDistance sketch
        # above); the brute-force version:
        found = {}
        for candidate, words in self.soundTable.items():
            if limitedEditDistance( sound, candidate, distance ) <= distance:
                found.update( words )
        return found

def check( word ):
    """Check whether a word is in the current dictionary (set of wordsets)"""
    for wordset in activeSets:
        if wordset.check( word ):
            return wordset
    return None

def suggest( word, editDistance=1 ):
    """Suggest words which might be correct spellings of the word"""
    found = {}
    sound = currentPhonetic.convert( word )
    for wordset in activeSets:
        found.update( wordset.equalSound( sound ))
    for wordset in activeSets:
        found.update( wordset.similarSound( sound, distance=editDistance ))
    return found
So, to work properly, you need to have your word-sets compiled into two
tables: the first is just used for "is-in" tests (i.e. by the "check"
methods) with the key == the word itself. The second maps
"soundslike" -> { word: 1, word: 1, ...} (a dict, so the suggest code
can update() straight into its result set). You'd want the keys of that
table to be easily iterable if using the brute-force search mechanism
for "similar" sound matches. Note that that's a _really_ slow approach
if you're having to do len(entire dictionary) sound comparisons for
each of len(document) checks, so you'd want to be pretty sure you only
call similarSound a few times/minute.
From where I sit, it looks as though it would be possible to provide an
engine that could "consume" a simple word-list to feed itself. I
haven't yet figured out how to use the (binary) rwl and cwl files from
aspell's language modules, however. That would be pretty useful for
getting the engine fed with high-quality data.
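The plain word-list case, at least, is trivial (the loadWordList helper
and file name are just illustrative):

def loadWordList( filename, phonetic ):
    """Build the two tables above from a one-word-per-line list;
    phonetic is anything with a convert( word ) -> soundslike method."""
    exactTable = {}
    soundTable = {}
    for line in open( filename ):
        word = line.strip()
        if word:
            exactTable[ word ] = 1
            soundTable.setdefault( phonetic.convert( word ), {} )[ word ] = 1
    return exactTable, soundTable

# e.g. english = WordSet( *loadWordList( 'english.wl', currentPhonetic ))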
Anyway, going back to playing now. Have fun,
Mike
Terry Hancock wrote:
>On Friday 18 October 2002 01:00 pm, python-list-request at python.org wrote:
>
>
>>Anyone have a fairly standard spell-check engine available for use from
>>Python? That is, something with an interface something like this:
>>
>>
>
>I don't know of any ready-made solutions, and could use one myself, so
>if you find something, please post. I immediately imagine using a wrapper
>for ispell or aspell or the pspell library though I am ignorant of how to do
>that. I was able to find a few interesting leads, though:
>
>ispell: http://fmg-www.cs.ucla.edu/geoff/ispell.html
>
>aspell: http://aspell.sourceforge.net
>
>
...
>Alternatively, if you want a pure-python approach, I'm pretty sure that
>the standard library module "difflib" could be persuaded to fuzzy-match
>words. It will fuzzy-match sequences, and a word can be regarded as
>a sequence of characters. E.g.:
>
...
_______________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://members.rogers.com/mcfletch/