Aspell and

Mike C. Fletcher mcfletch at rogers.com
Mon Oct 21 18:24:35 EDT 2002


Sorry if I caused offense by using your code w/out notifying you.  I 
didn't really think you'd be interested in a project that's so early in 
its life-cycle (I only just set up the SourceForge project in the last 
hour).  I cancelled a message I wrote to you Saturday night because I 
figured you'd be too busy to be answering questions from the likes of me 
before I have anything that's actually working :) .

As for compiling Aspell on Win32, I hadn't tried the MinGW version 
of GCC.  I had noticed the post about the VC++ compilation patch, but 
your comment on it seemed to suggest that it would require quite a bit 
of work to be acceptable.  Given that I have no great C/C++ skill, it is 
easier for me to build the infrastructure in Python and only use C/C++ 
for a few key algorithms than it is to try and modify a complex C/C++ 
project.

Too bad about using the *.rws files directly, but in considering it, I'm 
leaning toward giving (GUI) tools to both dictionary creators and users 
for generating redistributable files for both dictionaries and 
word-sets.  From the sound of it, it should be easy to allow users to 
generate distributables for either system.  If they have aspell 
installed we'll offer the word-list-(de)compress functionality, 
otherwise I'll only accept/generate uncompressed lists.
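A rough sketch of that fallback, in Python.  The names read_word_list and find_compressor are hypothetical, the "d" argument reflects my understanding that word-list-compress decompresses from stdin to stdout, and decoding as latin-1 is purely an assumption about the word lists:

```python
import shutil
import subprocess


def find_compressor(finder=shutil.which):
    """Return the path to Aspell's word-list-compress, or None if absent."""
    return finder("word-list-compress")


def read_word_list(data, compressed, finder=shutil.which):
    """Return the words contained in `data` (bytes).

    Compressed lists are only accepted when the Aspell utility is
    installed; otherwise only uncompressed lists can be read.
    """
    if compressed:
        tool = find_compressor(finder)
        if tool is None:
            raise RuntimeError(
                "compressed list given but word-list-compress is not installed"
            )
        # Assumption: "d" selects decompression, stdin -> stdout.
        result = subprocess.run(
            [tool, "d"], input=data, stdout=subprocess.PIPE, check=True
        )
        data = result.stdout
    # Assumption: 8-bit word lists, one word per line.
    return data.decode("latin-1").split()
```

The finder parameter is only there so the "Aspell not installed" path can be exercised without touching the real PATH.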

I am somewhat at a loss for how you access the "compressed" files.  I'd 
thought they were using a b-tree or similar index, but it doesn't seem 
that way when I look at the code for word-list-compress.  Are you 
loading the whole word-set into memory?  That should make it fast, but 
doesn't it consume a lot of space?  I'm currently using bsddb tables on 
disk, with an in-memory hash-table implementation for temporary 
word-sets (such as per-document and per-application sets).
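To make that layering concrete, here is a minimal sketch.  It is not my actual code: LayeredWordSet is a hypothetical name, and the standard-library dbm.dumb module stands in for the bsddb tables so the example runs anywhere:

```python
import dbm.dumb


class LayeredWordSet:
    """Persistent words live on disk; temporary words live in memory."""

    def __init__(self, disk_path):
        # On-disk key/value table (stand-in for a bsddb table).
        self._disk = dbm.dumb.open(disk_path, "c")
        # In-memory set for per-document or per-application words.
        self._temporary = set()

    def add_persistent(self, word):
        self._disk[word.encode("utf-8")] = b"1"

    def add_temporary(self, word):
        self._temporary.add(word)

    def __contains__(self, word):
        # Check the cheap in-memory layer first, then the disk table.
        if word in self._temporary:
            return True
        return word.encode("utf-8") in self._disk

    def close(self):
        self._disk.close()
```

A lookup such as `"colour" in word_set` then only touches the disk when the in-memory layer misses.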

I'll have to look at the typo-weighting code, as I'm not sure where to 
hook it into the leditdistance algorithm.  It would seem that you'd need 
each "swap" to be a lookup into the typo table.  I'm looking at making a 
set of ranking algos based on:

    set meta-data
        user-specific sets have higher rank than system sets
        dictionaries declare each set's "commonality" ranking (e.g. the
          English dict has levels 10,20,...90)
        might allow for "formality" rankings (e.g. slang word-sets have
          lower ranking in Business dictionaries and higher in Informal
          dictionaries).  Similarly "technicality", "political
          correctness" or whatever key you want.  Each would be a float
          factor; sets which don't include the meta-data just get the
          default values.  Each dictionary would then include the set
          meta-data to determine the ranking of suggestions within
          itself.  Most likely would use a single float value at run
          time (basically the product of the various set weightings)

    frequency tracking
        individual user's word-frequency tracking (optional).  If it's
          tracked, may as well use it.
        individual user's typo-frequency tracking (optional).  It might
          be useful to track the frequency of typos for a given user to
          generate the weightings (i.e. if a correction is reported,
          increment the diff (i -> o) frequency record as well as the
          whole-word correction's frequency record).
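
The ideas in the list above can be sketched roughly as follows.  Everything here is illustrative: the names set_weight, TypoTracker and weighted_distance are hypothetical, the cost formula 1 / (1 + count) is just one plausible choice, and only same-length corrections feed the per-character table.  It is not how Aspell computes its weights:

```python
from collections import Counter

DEFAULT_FACTOR = 1.0  # used when a set omits a piece of meta-data


def set_weight(meta, keys=("commonality", "formality", "technicality")):
    """Collapse a set's float factors into one run-time weight (their product)."""
    weight = 1.0
    for key in keys:
        weight *= meta.get(key, DEFAULT_FACTOR)
    return weight


class TypoTracker:
    """Per-user record of reported corrections."""

    def __init__(self):
        self.pair_counts = Counter()  # (typed_char, intended_char) -> count
        self.word_counts = Counter()  # (typed_word, corrected_word) -> count

    def record_correction(self, typed, corrected):
        self.word_counts[(typed, corrected)] += 1
        # Simplification: only same-length corrections are diffed.
        if len(typed) == len(corrected):
            for a, b in zip(typed, corrected):
                if a != b:
                    self.pair_counts[(a, b)] += 1

    def substitution_cost(self, a, b):
        """Frequently seen swaps become cheaper (assumed formula)."""
        if a == b:
            return 0.0
        return 1.0 / (1.0 + self.pair_counts[(a, b)])


def weighted_distance(typed, candidate, tracker):
    """Edit distance where each substitution is a lookup into the typo table."""
    prev = [float(j) for j in range(len(candidate) + 1)]
    for i, a in enumerate(typed, 1):
        cur = [float(i)]
        for j, b in enumerate(candidate, 1):
            cur.append(min(
                prev[j] + 1.0,                                # delete a
                cur[j - 1] + 1.0,                             # insert b
                prev[j - 1] + tracker.substitution_cost(a, b),  # swap a -> b
            ))
        prev = cur
    return prev[-1]
```

How the distance and the set weight are combined into a final suggestion order is left open here; dividing the distance by the weight would be one option.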

Anyway, rather than blathering on at you, I suppose I'll do some more 
work now.  Have fun,
Mike

Kevin Atkinson wrote:

>[CC to Aspell-devel for a public record of our conversation, please 
>continue to do so unless you have a good reason not to.]
>
>I was browsing through Usenet groups on the search term "Aspell" as I do 
>from time to time to see what people are saying about Aspell and I 
>came across your thread "Spell-check engine?" on comp.lang.python.
>
>Although the LGPL gives you the right to reuse my code I would 
>have appreciated a note to that effect.  You could have saved yourself a 
>decent deal of effort by contacting me first.
>
>A few points I want to address:
>
>The Aspell library should compile on Win32 using the MinGW version of GCC, 
>which means that the Cygwin library does not need to be pulled in.  It can 
>now also compile using VC++ with a user-contributed patch, but that is 
>completely unsupported by me.
>
>Do not even think about using the *.rws files, as it is a compiled
>dictionary format internal to Aspell and can change at any time.  For
>example, the next Aspell release, 0.51, will change the format of the
>compiled word lists in a non-trivial way.  However, using the *.cwl files
>is rather easy.  All the *.cwl files are just word lists compressed with
>the word-list-compress utility distributed with Aspell.  The process is
>extremely simple and can easily be written in any language.
>
>When edit distances are computed each "edit" has a weight associated with 
>it.  When typo analysis is used the weights are significantly different 
>from the normal edit distance algorithm.  The basic algorithm is the same 
>however.
>
>If you have any other questions I will be happy to address them.
>  
>
_______________________________________
  Mike C. Fletcher
  Designer, VR Plumber, Coder
  http://members.rogers.com/mcfletch/






