Building a word list from multiple files

Manu manu.1982 at gmail.com
Thu Nov 18 23:16:30 EST 2004


Hi,
> 1) How large are the files you are reading (e.g. can they
> fit in memory)?

The files are email messages.
I will be using the built-in email module to extract only the parts whose
content type is plain text or HTML, so no line-by-line processing is
possible unless I write my own parser for email.
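
To make the extraction step concrete, here is a minimal sketch in current Python of pulling the text/plain and text/html parts out of one message with the standard email module. The function name extract_text_parts is made up for illustration, and falling back to ASCII when no charset (or an unknown one) is declared is an assumption, not something the messages guarantee.

```python
import email


def extract_text_parts(raw_message):
    """Return (content_type, text) pairs for the text/plain and text/html parts.

    extract_text_parts is a hypothetical helper name used for this sketch.
    """
    msg = email.message_from_string(raw_message)
    parts = []
    # walk() visits the message itself and every nested MIME part.
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        # decode=True undoes base64 / quoted-printable and returns bytes.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        # Assumption: fall back to ASCII if the charset is missing or unknown.
        charset = part.get_content_charset() or "ascii"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            text = payload.decode("ascii", errors="replace")
        parts.append((ctype, text))
    return parts
```

Each returned text is a whole message body held in memory at once, which is why the processing here is per message rather than per line.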

> 2) Are the words in the file separated with some consistent
> character (e.g. space, tab, csv, etc).

In the case of HTML mail I only extract the text and strip off the
tags.
Since this is regular text I expect no special separators, and as I
understand it, split() by default treats any run of whitespace as the
delimiter. This will work fine for my purposes.
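
A rough sketch of those two steps, stripping tags and then splitting on whitespace, could look like the following. It uses the standard html.parser module; the names TextExtractor, html_to_text and count_words are hypothetical, and keeping a word -> count dict is only an assumption about what the word list should hold.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text between tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text outside of tags.
        self.chunks.append(data)


def html_to_text(html):
    """Return the text content of an HTML string with the tags dropped."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join with a space so words in adjacent elements do not run together.
    return " ".join(parser.chunks)


def count_words(text, counts=None):
    """Add the words of text to counts (word -> occurrences) and return it."""
    if counts is None:
        counts = {}
    # split() with no argument splits on any run of whitespace.
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts
```

Note that this simple extractor also keeps the contents of script and style elements, and that split() leaves punctuation attached to words, so "list" and "list." would count as different words.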


> If not, preprocess the files and use shelve to save a
> dictionary that has already been processed.  When you

This is what I was planning to do. Once the processing is done for a
set of files they are never processed again. I was going to store the
dict as a string in a file and then use eval() to get it back.
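
For comparison, here is a small sketch of both storage options in current Python. The first pair writes the dict as a string but reads it back with ast.literal_eval() rather than eval(), since literal_eval() only accepts Python literals and will not execute arbitrary code from the file; that substitution is my own suggestion, not part of the plan above. The second pair uses shelve as suggested in the quoted text. All function names and the "counts" key are made up for illustration.

```python
import ast
import shelve


def save_counts_as_text(counts, path):
    """Write the dict to a text file as its repr() string."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(repr(counts))


def load_counts_from_text(path):
    """Read the dict back; literal_eval parses literals only, unlike eval()."""
    with open(path, encoding="utf-8") as f:
        return ast.literal_eval(f.read())


def save_counts_shelve(counts, path):
    """Store the dict in a shelve database under the (assumed) key 'counts'."""
    with shelve.open(path) as db:
        db["counts"] = counts


def load_counts_shelve(path):
    """Load the dict stored by save_counts_shelve."""
    with shelve.open(path) as db:
        return db["counts"]
```

The text-file approach only works while the dict holds plain literals (strings, numbers, lists, and so on); shelve pickles the values, so it copes with more types but the file is not human-readable.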


Thanks
Manu


