Reading Huge UnixMailbox Files

Tue Apr 26 16:17:51 EDT 2011

On Tue, Apr 26, 2011 at 12:39 PM, Brandon McGinty
<brandon.mcginty at gmail.com> wrote:
> List,
> I'm trying to import hundreds of thousands of e-mail messages into a
> database with Python.
> However, some of these mailboxes are so large that they are giving
> errors when being read with the standard mailbox module.
> I created a buffered reader, that reads chunks of the mailbox, splits
> them using the re.split function with a compiled regexp, and imports
> each chunk as a message.
> The regular expression work is where the bottle-neck appears to be,
> based on timings.
> I'm wondering if there is a faster way to do this, or some other method
> that you all would recommend.
>
> Brandon McGinty

Is it traditional mbox, or the more recent mbox that uses a
Content-length header?

Either way, you could probably read the mbox files line by line, and
yield a string corresponding to one message - one message at a time.

Traditional mbox is easier - you just look for lines that start with
"From " (the regex "^From ") - if a message actually wanted to include
such a line in its body, the MTA or delivery agent should have prepended
a > to it to avoid ambiguity.
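
A minimal sketch of that line-by-line approach, assuming a traditional
mbox opened in binary mode. The name iter_mbox_messages is made up for
illustration and is not part of the standard mailbox module. Only one
message is held in memory at a time, and no regular expression is used:

```python
def iter_mbox_messages(fileobj):
    """Yield each message as bytes from a traditional mbox file object.

    Hypothetical helper: assumes fileobj was opened in binary mode and
    that body lines starting with "From " were escaped as ">From ".
    """
    lines = []
    for line in fileobj:
        # A line starting with "From " begins the next message.
        if line.startswith(b"From ") and lines:
            yield b"".join(lines)
            lines = []
        lines.append(line)
    if lines:
        # Flush the last message in the file.
        yield b"".join(lines)
```

Each yielded chunk still begins with its "From " separator line, so it
could be used like this (the path is a placeholder):

    with open("some.mbox", "rb") as f:
        for raw in iter_mbox_messages(f):
            ...  # parse raw, e.g. with email.message_from_bytes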

With the Content-length header, you need to understand a little more
about the header lines - this header gives the length of the message
body in bytes, so a reader can skip straight over the body and you
don't need the ugly > escape for From's.
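
A rough sketch of the Content-length variant, under the assumptions that
the file is opened in binary mode, that every message carries a correct
Content-length header, and that the value counts the bytes of the body
only. The name iter_mboxcl_messages is made up for illustration:

```python
def iter_mboxcl_messages(fileobj):
    """Yield each message as bytes from an mbox that uses Content-Length.

    Hypothetical helper: assumes a binary file object and a correct
    Content-Length (body size in bytes) on every message.
    """
    while True:
        line = fileobj.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"From "):
            continue  # skip blank separator lines between messages
        parts = [line]
        length = None
        # Read header lines up to the blank line that ends the headers.
        while True:
            line = fileobj.readline()
            parts.append(line)
            if line in (b"\n", b"\r\n", b""):
                break
            if line[:1] in b" \t":
                continue  # folded continuation of the previous header
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("message has no Content-Length header")
        # Read the body in one go; "From " lines inside it are harmless.
        parts.append(fileobj.read(length))
        yield b"".join(parts)
```

If the header is missing this sketch raises ValueError; real mailboxes
can also contain wrong lengths, so falling back to the "From " scan for
such messages may be worth considering.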