Building a word list from multiple files

Larry Bates lbates at syscononline.com
Thu Nov 18 09:18:12 EST 2004


Manu wrote:
> Hi,
> 
> Here's what i want to accomplish.
> I want to make a list of frequently occurring words in a group of
> files, along with the number of occurrences of each.
> The brute force method would be to read each file as a string, split
> it, and load the words into a dict with the words as keys and the
> number of occurrences as values.
> Then load the next file and iterate through the new words,
> incrementing the value if there is a match or adding a new key/value
> pair if there is none.
> repeat for all files.
> 
> is there a better way ??
> 
>  
> Thanks in advance.
> Manu

Manu,

There are some things we would need to know to answer your
question specifically.  I've tried to answer it with some
"assumptions" about your data/usage:

1) How large are the files you are reading (e.g. can they
fit in memory)?

If not, you will need to read the file a line at a time
and process each line individually.
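
A minimal line-at-a-time sketch (the function name and the use of
a plain dict are my own choices, not from Manu's code):

```python
def count_words(filenames):
    """Count word occurrences across files, one line at a time,
    so even very large files never have to fit in memory."""
    counts = {}
    for name in filenames:
        with open(name) as f:
            for line in f:
                for word in line.split():
                    # Increment an existing count or start a new one.
                    counts[word] = counts.get(word, 0) + 1
    return counts
```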

2) Are the words in the file separated by some consistent
character (e.g. space, tab, comma)?

If not, you will probably need to use regular expressions
to handle all the different punctuation that might separate
the words: quotes, commas, periods, colons, semicolons, etc.
A simple string split won't handle these properly.
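
A sketch of the regular-expression approach (the pattern shown is
one reasonable choice, not the only one):

```python
import re

# \w+ matches runs of letters, digits and underscores; everything
# else (quotes, commas, periods, colons, semicolons, ...) acts as
# a separator.
word_re = re.compile(r"\w+")

def split_words(line):
    # Lowercase first so "The" and "the" count as the same word.
    return word_re.findall(line.lower())
```

Note that this splits contractions like "don't" into two words;
you can extend the pattern if that matters for your data.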

3) Do the "files" change a lot?

If not, preprocess the files and use shelve to save a
dictionary that has already been built.  When you add or
change one of the files, rerun this process to recreate
and shelve the new dictionary.  In your main program,
load the shelved dictionary produced by the preprocessing
step so that you don't have to process all the files every
time.
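
A sketch of that preprocessing step (the shelf name "wordcounts"
and the "counts" key are assumptions; the counting loop repeats
the brute-force approach from the question):

```python
import shelve

def preprocess(filenames, shelf_name="wordcounts"):
    """Build the word-count dictionary once and persist it."""
    counts = {}
    for name in filenames:
        with open(name) as f:
            for line in f:
                for word in line.split():
                    counts[word] = counts.get(word, 0) + 1
    # Save the finished dictionary so the main program can
    # simply load it instead of reprocessing every file.
    db = shelve.open(shelf_name)
    db["counts"] = counts
    db.close()

def load_counts(shelf_name="wordcounts"):
    """What the main program calls to get the saved dictionary."""
    db = shelve.open(shelf_name)
    counts = db["counts"]
    db.close()
    return counts
```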

Hope this info helps,
Larry Bates
Syscon, Inc.
