Aspell and
Mike C. Fletcher
mcfletch at rogers.com
Mon Oct 21 18:24:35 EDT 2002
Sorry if I caused offense by using your code w/out notifying you. I
didn't really think you'd be interested in a project that's so early in
its life-cycle (I only just set up the SourceForge project in the last
hour). I cancelled a message I wrote to you Saturday night because I
figured you'd be too busy to be answering questions from the likes of me
before I have anything that's actually working :) .
As for compiling Aspell on Win32, I hadn't tried the MinGW version
of GCC. I had noticed the post about the VC++ compilation patch, but
your comment on it seemed to suggest that it would require quite a bit
of work to be acceptable. Given that I have no great C/C++ skill, it is
easier for me to build the infrastructure in Python and only use C/C++
for a few key algorithms than it is to try and modify a complex C/C++
project.
Too bad about not being able to use the *.rws files directly, but in
considering it, I'm leaning toward giving (GUI) tools to both dictionary
creators and users for generating redistributable files for both
dictionaries and word-sets. From the sound of it, it should be easy to
allow users to generate distributables for either system. If they have
Aspell installed we'll offer the word-list-(de)compress functionality;
otherwise I'll only accept/generate uncompressed lists.
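To make that concrete, here's a rough sketch of how the Python side might
wrap Aspell's word-list-compress tool when it's present (the `c` compress
mode and stdin/stdout usage are per the Aspell docs; the function name and
fallback behaviour are just my own illustration, not working project code):

```python
import shutil
import subprocess

def compress_word_list(words, cwl_path, tool="word-list-compress"):
    """Write an iterable of words out as a .cwl file using Aspell's
    word-list-compress utility, if it is available on the PATH.
    Raises RuntimeError when the tool is missing, so the caller can
    fall back to distributing an uncompressed list instead."""
    if shutil.which(tool) is None:
        raise RuntimeError(
            "%s not found on PATH; distribute an uncompressed list" % tool)
    # word-list-compress reads a sorted word list on stdin ("c" mode)
    # and writes the compressed form on stdout.
    data = "\n".join(sorted(set(words))).encode("utf-8") + b"\n"
    with open(cwl_path, "wb") as out:
        subprocess.run([tool, "c"], input=data, stdout=out, check=True)
```

The same wrapper with the tool's "d" mode would handle decompression for
users importing an existing *.cwl.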
I am somewhat at a loss for how you access the "compressed" files. I'd
thought they were using a b-tree or similar index, but it doesn't seem
that way when I look at the code for word-list-compress. Are you
loading the whole word-set into memory? That should make it fast, but
doesn't it consume a lot of space? I'm currently using bsddb tables on
disk, with an in-memory hash-table implementation for temporary
word-sets (such as per-document and per-application sets).
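The temporary word-set idea is simple enough to sketch: an in-memory set
layered over whatever persistent base table is in use (bsddb in my case).
The class and names here are illustrative, not the project's actual code:

```python
class TempWordSet:
    """In-memory word-set for per-document or per-application words,
    layered over a persistent base set (e.g. a bsddb table on disk).
    Membership checks consult the temporary words first, then the base."""

    def __init__(self, base=None):
        # base can be any container supporting "in" (set, dbm keys, ...)
        self.base = base if base is not None else frozenset()
        self.words = set()

    def add(self, word):
        # Normalise case so lookups are case-insensitive.
        self.words.add(word.lower())

    def __contains__(self, word):
        w = word.lower()
        return w in self.words or w in self.base
```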
I'll have to look at the typo-weighting code, as I'm not sure where to
hook it into the edit-distance algorithm. It would seem that you'd need
each "swap" to be a lookup into the typo table. I'm looking at making a
set of ranking algos based on:
  * set meta-data
      - user-specific sets have higher rank than system sets
      - dictionaries declare a set's "commonality" ranking (e.g. the
        English dict has levels 10, 20, ... 90)
      - might allow for "formality" rankings (e.g. slang word-sets have
        lower ranking in Business dictionaries and higher in Informal
        dictionaries), and similarly "technicality", "political
        correctness" or whatever key you want.  Each would be a float
        factor; sets which don't include the meta-data just get the
        default values.  Each dictionary would then include the set
        meta-data to determine the ranking of suggestions within
        itself.  Most likely this would use a single float value at run
        time (basically the product of the various set weightings).
  * frequency tracking
      - individual user's word-frequency tracking (optional).  If it's
        tracked, may as well use it.
      - individual user's typo-frequency tracking (optional).  It might
        be useful to track the frequency of typos for a given user to
        generate the weightings (i.e. if a correction is reported,
        increment the diff (i -> o) frequency record as well as the
        whole-word correction's frequency record).
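The "single float value at run time" part of the above might be sketched
like this (the key names and the default of 1.0 are just assumptions for
illustration):

```python
def combined_weight(set_meta,
                    keys=("commonality", "formality", "technicality"),
                    default=1.0):
    """Collapse a word-set's meta-data into one float rank factor:
    the product of its per-key weightings.  Sets that don't include a
    given key just get the default value, so old or minimal sets still
    rank sensibly."""
    weight = 1.0
    for key in keys:
        weight *= float(set_meta.get(key, default))
    return weight
```

A dictionary would compute this once per set at load time and multiply it
into each suggestion's score.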
Anyway, rather than blathering on at you, I suppose I'll do some more
work now. Have fun,
Mike
Kevin Atkinson wrote:
>[CC to Aspell-devel for a public record of our conversation, please
>continue to do so unless you have a good reason not to.]
>
>I was browsing though Usenet groups on the search term "Aspell" as I do
>from time to time to see what people are saying about Aspell and I
>came across your thread "Spell-check engine?" to comp.lang.python.
>
>Although the LGPL gives you the right to reuse my code I would
>have appreciated a note to that effect. You could have saved yourself a
>good deal of effort by contacting me first.
>
>A few points I want to address:
>
>The Aspell library should compile on Win32 using the MinGW version of Gcc,
>which means that the CygWin library does not need to be pulled in. It can
>now also be compiled using VC++ with a user-contributed patch, but that is
>completely unsupported by me.
>
>Do not even think about using the *.rws files: that is a compiled
>dictionary format internal to Aspell and can change at any time. For
>example the next Aspell release, 0.51, will change the format of the
>compiled word lists in a non-trivial way. Using the *.cwl files, however,
>is rather easy: they are just word lists compressed with the
>word-list-compress utility distributed with Aspell. The process is
>extremely simple and can easily be reimplemented in any language.
>
>When edit distances are computed, each "edit" has a weight associated with
>it. When typo analysis is used the weights are significantly different
>from those of the normal edit distance algorithm. The basic algorithm is
>the same however.
>
>If you have any other questions I will be happy to address them.
>
>
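(For the record, the typo-weighted edit distance Kevin describes above
might be sketched roughly like this; the per-pair typo table, the default
substitution cost and the insert/delete cost are illustrative guesses on
my part, not Aspell's actual weights:)

```python
def weighted_edit_distance(a, b, typo_cost, default=2.0, indel=1.0):
    """Classic dynamic-programming edit distance, except that each
    substitution ("swap") looks its weight up in a typo table, so that
    likely typos (e.g. adjacent keyboard keys) cost less than arbitrary
    substitutions.  typo_cost maps (char_from, char_to) -> float."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * indel          # deletions from a
    for j in range(1, m + 1):
        d[0][j] = j * indel          # insertions into a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                swap = 0.0
            else:
                swap = typo_cost.get((a[i - 1], b[j - 1]), default)
            d[i][j] = min(d[i - 1][j] + indel,       # delete
                          d[i][j - 1] + indel,       # insert
                          d[i - 1][j - 1] + swap)    # substitute
    return d[n][m]
```

With a table entry like ("i", "o") at 0.5, "fir" -> "for" scores much
better than an arbitrary one-letter change, which is exactly the hook the
typo table needs.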
_______________________________________
Mike C. Fletcher
Designer, VR Plumber, Coder
http://members.rogers.com/mcfletch/