Spell-check engine?
Mike C. Fletcher
mcfletch at rogers.com
Mon Oct 21 03:14:58 EDT 2002
Okay, I think I'm done playing for today, so here's how it stands:
The (python) phonetic compression algo appears to work approximately
correctly with the aspell phonet.dat files. There is still no
rule-priority, "follow-on rule", or accent-stripping support. It does,
however, seem to produce at least vaguely intelligible results.
I've got word-set classes which store individual word-sets using
simple bsddb btree tables. The wordsets provide for exact,
exact-soundslike and similar-soundslike searches (as understood by aspell).
The edit-distance code works as expected. It still doesn't have
typo-pair support.
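For anyone following along, here is a minimal sketch of the kind of edit-distance calculation meant here. It is not the code from this post: it assumes plain Levenshtein distance with unit costs for insert, delete and substitute, and the name `edit_distance` is made up for illustration. Typo-pair weighting is left out, as above.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance between two strings.

    Assumption: every insertion, deletion and substitution costs 1.
    """
    # previous[j] holds the distance between the prefix of a processed
    # so far and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca from a
                current[j - 1] + 1,      # insert cb into a
                previous[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]
```

Only two rows of the table are kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.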
There is an object (SpellDict) which provides a simple API for a
collection of some number of word-sets.
Still no support for reading in the compiled aspell files (and quite
frankly, I don't see a nice way to do it).
The SpellDict provides this API:
    class SpellDict( object ):
        """Binds together a set of wordsets
        """
        def __init__(
            self,
            sets = (),
            name = "",
        ):
            """Create a new SpellDict with the given sets and a
            user-friendly name

            sets is a sequence object (iterable) of instantiated WordSet objects
            name is a user-friendly name for this collection of word sets
                (dictionary)
            """
        def check( self, word):
            """Check whether a word is in dictionary

            If the word is in the dictionary, returns the
            first set in which the word appears.
            """
        def suggest( self, word, distance=1 ):
            """Suggest words to replace word

            distance is a metric determining how far
            the soundslike value can diverge to still allow
            for a "similar" ranking.  Higher values will
            dramatically increase the running time and catch
            far more possible matches.
            """
with everything basically working as advertised (AFAICT).
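To make the API above a little more concrete, here is a rough, self-contained sketch of how check() and suggest() could behave. Everything in it is an assumption for illustration: `ToyWordSet` is an in-memory stand-in for the bsddb-backed word-sets, `toy_soundslike` is a crude placeholder rather than the phonet.dat algorithm, `SketchSpellDict` is not the real class, and returning None for an unknown word is a guess, since the docstring only says what happens when the word is found.

```python
def toy_soundslike(word):
    """Hypothetical phonetic code: lowercase, keep the first letter,
    drop later vowels, collapse repeated letters."""
    out = []
    for i, ch in enumerate(word.lower()):
        if i > 0 and ch in "aeiou":
            continue
        if out and out[-1] == ch:
            continue
        out.append(ch)
    return "".join(out)


def code_distance(a, b):
    """Unit-cost Levenshtein distance between two soundslike codes."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


class ToyWordSet(object):
    """In-memory stand-in for a word-set (the real ones use bsddb)."""

    def __init__(self, name, words):
        self.name = name
        self.words = set(words)
        self.by_sound = {}
        for w in self.words:
            self.by_sound.setdefault(toy_soundslike(w), set()).add(w)


class SketchSpellDict(object):
    """Illustrative version of the SpellDict API, not the real class."""

    def __init__(self, sets=(), name=""):
        self.sets = list(sets)
        self.name = name

    def check(self, word):
        # Return the first set containing the word (None is an assumption).
        for ws in self.sets:
            if word in ws.words:
                return ws
        return None

    def suggest(self, word, distance=1):
        # Collect words whose soundslike code is within `distance`
        # edits of the code for `word`.
        target = toy_soundslike(word)
        found = set()
        for ws in self.sets:
            for code, words in ws.by_sound.items():
                if code_distance(code, target) <= distance:
                    found |= words
        return sorted(found)


main = ToyWordSet("main", ["hello", "help", "world"])
personal = ToyWordSet("personal", ["helo"])
speller = SketchSpellDict([main, personal], name="demo")
```

With these toy sets, `speller.check("helo")` returns the `personal` set, and raising `distance` from 0 to 1 in `speller.suggest("hallo", ...)` widens the result to include "help". The scan over every soundslike code also shows why larger distances get slow.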
If there's any interest, I'll set up a SourceForge project tomorrow, post
all the code, and move updates off the Python list :) .
Enjoy all,
Mike
Mike C. Fletcher wrote:
> Well, I've been playing with this a little more. It seems to me that
> about 70% of the complexity of the aspell code is just getting around
> the limitations of C++. The engine itself seems to be composed of a
> few fairly minimal parts:
[previous status deleted]
...