[SciPy-User] Cleaning/feature extraction of e-mail messages
Florian Lindner
mailinglists at xgm.de
Sun Nov 24 06:50:01 EST 2013
Hello,
I want to use scikit-learn for mail classification (not spam detection). I
haven't really worked with machine learning software before (besides end-user
spam filters).
What I have done so far:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(input='filename', preprocessor=mail_preprocessor,
                             decode_error="ignore")
X = vectorizer.fit_transform(["testmail2"])
testmail2 is a raw email message (taken from a server's maildir). I set
decode_error because of UTF-8 decoding issues that I decided to ignore for
the time being.
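One way to sidestep the decoding issue, rather than ignoring it, might be to parse the message from bytes and let the email package decode each part with the charset the mail itself declares. The following is only a sketch under that assumption: plain_text_from_bytes and the sample message are made up for illustration, and the resulting strings would have to be passed to the vectorizer with input='content' instead of input='filename'.

```python
import email
from email import policy


def plain_text_from_bytes(raw_bytes):
    """Return the preferred text/plain body of a raw message, or '' if none."""
    # policy.default yields EmailMessage objects, which provide get_body()
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    if body is None:
        return ""
    # get_content() undoes the transfer encoding and decodes the text
    # using the charset declared in the part's Content-Type header
    return body.get_content()


# Hypothetical Latin-1 message; a maildir file would be read with open(path, "rb")
sample = (
    b"From: Alice <alice@example.org>\r\n"
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"Meet me at the caf=E9\r\n"
)
text = plain_text_from_bytes(sample)
```

Here the quoted-printable, Latin-1 encoded body comes back as an ordinary Python string, so no UTF-8 decoding of the raw file is involved.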
This works perfectly for the scikit-learn part. But one challenge (for me)
seems to be preparing the mail for feature extraction.
My idea would be to take the text/plain parts of the mails, maybe additionally
the From header.
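import email

def mail_preprocessor(raw_message):
    # Parse the raw message text into a Message object
    msg = email.message_from_string(raw_message)
    msg_body = ""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            # On Python 3 the decoded payload is bytes; decode with the part's charset
            charset = part.get_content_charset() or "utf-8"
            msg_body += payload.decode(charset, errors="replace")
    msg_body = msg_body.lower()
    msg_body = msg_body.replace("\n", " ")
    msg_body = msg_body.replace("\t", " ")
    return msg_body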
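Adding the From header could be sketched roughly as below. The name mail_with_sender and the "from_" prefix are arbitrary choices for illustration, not anything scikit-learn prescribes, and the sample message is invented. Note that the vectorizer's default tokenization splits on punctuation, so the address would likely end up as several tokens unless a custom token pattern is used.

```python
import email
from email import policy
from email.utils import parseaddr


def mail_with_sender(raw_message):
    """Return a sender token followed by the lowercased text/plain body."""
    msg = email.message_from_string(raw_message, policy=policy.default)
    # parseaddr splits "Name <addr>" into (name, addr)
    _, address = parseaddr(str(msg.get("From", "")))
    parts = [
        part.get_content()
        for part in msg.walk()
        if part.get_content_type() == "text/plain" and not part.is_attachment()
    ]
    # Lowercase and collapse newlines, tabs and repeated spaces
    body = " ".join(" ".join(parts).lower().split())
    # The "from_" prefix is only a marker to tell the sender apart from body words
    sender_token = ("from_" + address.lower()) if address else ""
    return (sender_token + " " + body).strip()


# Hypothetical multipart message with a plain and an HTML alternative
sample = (
    "From: Alice Example <Alice@Example.org>\n"
    "Subject: Lunch\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Hello\tWorld\n"
    "--XX\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<p>Hello <b>World</b></p>\n"
    "--XX--\n"
)
result = mail_with_sender(sample)
```

Only the text/plain alternative contributes to the body here; the HTML part is skipped.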
I know that this may be slightly off-topic and I apologize if it's too off-topic.
Is there already some code in the wild that prepares mail messages for feature
extraction? The topic seems to be much more involved than I had expected,
with issues like HTML, MIME encodings, multipart messages, ...
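To make the HTML and multipart part of the question concrete, here is a rough standard-library sketch that prefers a text/plain body and falls back to an HTML body with the tags stripped. The names _TextExtractor, html_to_text and best_effort_body are hypothetical, and a dedicated HTML-to-text library may well handle real-world mail better than this.

```python
import email
from email import policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Strip tags from an HTML string and normalize whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def best_effort_body(raw_message):
    """Return the plain body if present, else the HTML body without tags."""
    msg = email.message_from_string(raw_message, policy=policy.default)
    # get_body() walks multipart structures and honours the preference order
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is None:
        return ""
    content = body.get_content()
    if body.get_content_type() == "text/html":
        content = html_to_text(content)
    return " ".join(content.split())


# Hypothetical HTML-only message
sample = (
    "From: bob@example.org\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Hello &amp; <b>welcome</b></p></body></html>\n"
)
result = best_effort_body(sample)
```

The result could then go through the same lowercasing step as the plain-text case before being handed to the vectorizer.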
Thanks!
Florian