[Tutor] Extracting body of all email messages from an mbox file on computer

grishma govani grishma20 at gmail.com
Thu Sep 11 10:22:52 CEST 2008


Yes, I used part of the code from the second link.
I am using the mailbox module too.

I have the e-mails from Gmail in a file on my computer. I have used
the code below to extract all the headers. As you can see, for now I am
using the raw text stored in document as my body. I just want to extract
the plain text and leave out all the HTML, the duplicates of the plain
text, and all the other information such as the content type, From, etc.
Can anyone help me out?

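import mailbox  # Python 2: UnixMailbox yields rfc822.Message objects


def passthrough_filter(msg, document):
    """Print the sender, subject and raw body of the message and
    return the document unchanged.
    """
    # getaddr() returns a (real name, e-mail address) pair
    from_addr = msg.getaddr('From')[1]
    subject = msg.get('Subject')
    # Read but not used yet
    content_type = msg.get('Content-Type')
    content_disp = msg.get('Content-Disposition')
    print "From:", from_addr
    print "Subject:", subject
    print "Attachment:", None
    print "Body:", document
    print '\n'
    return document


# The function has to be defined before the loop that calls it.
mb = mailbox.UnixMailbox(file('tmp/automated/Feedback', 'r'))
fout = file('Feedback.txt', 'w')  # opened for output, nothing written yet
msg = mb.next()

while msg is not None:
    # Raw message text: all MIME parts, HTML and part headers included
    document = msg.fp.read()
    document = passthrough_filter(msg, document)
    msg = mb.next()

fout.close()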
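
For reference, here is a minimal sketch of one way the plain-text extraction described above might be done. It assumes Python 3.5 or later and uses mailbox.mbox together with the email package, rather than the UnixMailbox class used in the code above. The function names plain_text_body and extract_bodies are made up for illustration. Returning only the first text/plain part is a simplification that drops the HTML alternative and any later copies of the text.

```python
import mailbox


def plain_text_body(message):
    """Return the first text/plain part of a message as a string ('' if none)."""
    for part in message.walk():
        # Skip multipart containers, HTML alternatives and other non-text parts.
        if part.get_content_type() != "text/plain":
            continue
        # Skip plain-text files that were sent as attachments.
        if part.get_content_disposition() == "attachment":
            continue
        payload = part.get_payload(decode=True)  # bytes, transfer encoding undone
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            return payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name in the header: fall back to UTF-8.
            return payload.decode("utf-8", errors="replace")
    return ""


def extract_bodies(mbox_path):
    """Yield (from, subject, body) for every message in the mbox file."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            yield message.get("From"), message.get("Subject"), plain_text_body(message)
    finally:
        box.close()
```

Something like `for sender, subject, body in extract_bodies('tmp/automated/Feedback'): print(body)` would then print only the plain-text bodies, assuming the file is in mbox format.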




On 10 Sep 2008, at 22:09, Kent Johnson wrote:

> On Wed, Sep 10, 2008 at 4:06 PM, grishma govani  
> <grishma20 at gmail.com> wrote:
>> Hello Everybody,
>>
>> I have been trying to extract the body of all the email messages
>> from an mbox file.
>
> How are you doing this? Have you seen the mailbox module and this  
> recipe:
> http://docs.python.org/lib/mailbox-mbox.html
> http://code.activestate.com/recipes/157437/
>
> Kent
