Spell-check engine?

Mike C. Fletcher mcfletch at rogers.com
Mon Oct 21 03:14:58 EDT 2002


Okay, I think I'm done playing for today, so here's how it stands:

    The (Python) phonetic compression algorithm appears to work 
approximately correctly with the aspell phonet.dat files.  There is still 
no support for rule priority, "follow-on" rules, or accent stripping.  It 
does, however, seem to produce at least vaguely intelligible results.

    I've got word-set classes which store individual word-sets using 
simple bsddb btree tables.  The wordsets provide for exact, 
exact-soundslike and similar-soundslike searches (as understood by aspell).

    The edit-distance code works as expected.  It still doesn't have 
typo-pair support.

    There is an object (SpellDict) which provides a simple API for a 
collection of some number of word-sets.

    Still no support for reading in the compiled aspell files (and quite 
frankly, I don't see a nice way to do it).
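
For readers who want something concrete for the edit-distance step mentioned above, here is a minimal sketch of plain Levenshtein distance. It is not the code from this post: the function name edit_distance is an assumption, every edit costs 1, and there is no typo-pair weighting (which matches the status above, but only by omission).

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings.

    Insertions, deletions and substitutions all cost 1.
    Hypothetical helper for illustration; no typo-pair weighting.
    """
    # previous[j] holds the distance between a[:i-1] and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match for free
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(edit_distance("speling", "spelling"))
```

Only two rows of the table are kept at a time, so memory stays proportional to the length of the second word. Typo-pair support would presumably replace the fixed substitution cost with a lookup, but that is a guess at the design rather than something stated here.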

The SpellDict provides this API:

class SpellDict( object ):
    """Binds together a set of wordsets
    """
    def __init__(
        self,
        sets = (),
        name = "",
    ):
        """Create a new SpellDict with the given sets and a
        user-friendly name

        sets is a sequence object (iterable) of instantiated WordSet objects
        name is a user-friendly name for this collection of word sets
        (dictionary)
        """
    def check( self, word):
        """Check whether a word is in dictionary

        If the word is in the dictionary, returns the
        first set in which the word appears.
        """
    def suggest( self, word, distance=1 ):
        """Suggest words to replace word

        distance is a metric determining how far
        the soundslike value can diverge to still allow
        for a "similar" ranking.  Higher values will
        dramatically increase the running time and catch
        far more possible matches.
        """

with everything basically working as advertised (AFAICT).
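
To make the API above easier to picture, the following is a rough, self-contained sketch of how check and suggest might behave. Nearly everything in it is an assumption made for illustration: MemoryWordSet is a hypothetical in-memory stand-in for the bsddb-backed word-sets, crude_soundslike is a toy replacement for the phonet.dat rules (it just drops non-initial vowels and doubled letters), and the ranking of suggestions by edit distance is a guess. Only the SpellDict method signatures come from the listing above.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (every edit costs 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def crude_soundslike(word):
    """Toy soundslike key; NOT the aspell phonet rules."""
    out = []
    for i, ch in enumerate(word.lower()):
        if i > 0 and ch in "aeiou":
            continue  # drop vowels after the first letter
        if out and out[-1] == ch:
            continue  # collapse doubled letters
        out.append(ch)
    return "".join(out)


class MemoryWordSet(object):
    """Hypothetical in-memory word-set (the post uses bsddb btrees)."""
    def __init__(self, words, soundslike=crude_soundslike):
        self.soundslike = soundslike
        self.words = set(words)
        self.by_sound = {}  # soundslike key -> list of words
        for word in self.words:
            self.by_sound.setdefault(soundslike(word), []).append(word)

    def exact(self, word):
        return word in self.words

    def exact_soundslike(self, word):
        return sorted(self.by_sound.get(self.soundslike(word), []))

    def similar_soundslike(self, word, distance=1):
        # Linear scan over all keys: this is why larger distances get slow.
        key = self.soundslike(word)
        found = []
        for sound, words in self.by_sound.items():
            if edit_distance(key, sound) <= distance:
                found.extend(words)
        return sorted(found)


class SpellDict(object):
    """Binds together a set of wordsets (sketch of the API above)."""
    def __init__(self, sets=(), name=""):
        self.sets = list(sets)
        self.name = name

    def check(self, word):
        # Return the first set containing the word, else None.
        for word_set in self.sets:
            if word_set.exact(word):
                return word_set
        return None

    def suggest(self, word, distance=1):
        candidates = set()
        for word_set in self.sets:
            candidates.update(word_set.similar_soundslike(word, distance))
        # Assumed ranking: closest spelling first, then alphabetical.
        return sorted(candidates, key=lambda c: (edit_distance(word, c), c))


if __name__ == "__main__":
    english = MemoryWordSet(["spelling", "spilling", "spell", "check"])
    speller = SpellDict([english], name="demo")
    print(speller.check("spell") is english)
    print(speller.suggest("speling"))
```

Raising distance widens the soundslike match, so more candidates are compared and returned, which fits the docstring's warning about running time.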

If there's any interest, I'll set up a SourceForge project tomorrow, post 
all the code, and move updates off the Python list :) .
Enjoy all,
Mike


Mike C. Fletcher wrote:

> Well, I've been playing with this a little more.  It seems to me that 
> about 70% of the complexity of the aspell code is just getting around 
> the limitations of C++.  The engine itself seems to be composed of a 
> few fairly minimal parts:

[previous status deleted]
...





