Distinguishing between maildir, mbox, and MH files/directories?

Dan Stromberg drsalists at gmail.com
Sun Aug 31 16:59:07 EDT 2014


On Sun, Aug 31, 2014 at 11:45 AM, Tim Chase
<python.list at tim.thechases.com> wrote:
> Tinkering around with a little script, I found myself with the need
> to walk a directory tree and process mail messages found within.
> Sometimes these end up being mbox files (with multiple messages
> within), sometimes it's a Maildir structure with messages in each
> individual file and extra holding directories, and sometimes it's a
> MH directory.  To complicate matters, there's also the possibility of
> non-{mbox,maildir,mh} files such as binary MUA caches appearing
> alongside these messages.
>
> Python knows how to handle each just fine as long as I tell it what
> type of file to expect.  But is there a straight-forward way to
> distinguish them?  (FWIW, the *nix "file" utility is just reporting
> "ASCII text", sometimes "with very long lines", and sometimes
> erroneously flags them as C or C++ files‽).
>
> All I need is "is it maildir, mbox, mh, or something else" (I don't
> have to get more complex for the "something else") inside an os.walk
> loop.

If you find a directory full of numbered files (and optionally,
numbered filenames preceded by commas), that's probably an MH folder.
I don't like regexes that much, but I'd probably use one for this.
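
A rough sketch of that MH check, for illustration only. The function name
looks_like_mh and the 0.8 threshold are assumptions made up for this
example, not anything MH defines; tune the threshold to your data.

```python
import re

# Per the description above: MH message files are plain numbers,
# optionally preceded by a comma.
MH_NAME = re.compile(r",?[1-9][0-9]*")

def looks_like_mh(filenames, threshold=0.8):
    """Guess whether a list of filenames came from an MH folder.

    `threshold` (hypothetical) is the fraction of non-hidden files
    that must look like MH message names.
    """
    # Skip dotfiles such as .mh_sequences so they don't skew the ratio.
    candidates = [f for f in filenames if not f.startswith(".")]
    if not candidates:
        return False
    hits = sum(1 for f in candidates if MH_NAME.fullmatch(f))
    return hits / len(candidates) >= threshold
```

Inside an os.walk loop this could be called as looks_like_mh(filenames)
for each (dirpath, dirnames, filenames) triple.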

If you find a directory full of Maildir-style files, that's probably
Maildir.  You could probably match this with a regex too.
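
One way this might look, as a sketch under a couple of assumptions. A
Maildir normally has cur, new and tmp subdirectories, so the example
checks for those first. The filename regex assumes the conventional
"time.unique.host" naming with an optional ":2,FLAGS" suffix; real
mailboxes may deviate from it. The names looks_like_maildir and
MAILDIR_NAME are made up for this example.

```python
import os
import re

# Assumed pattern: <seconds>.<unique>.<host>, optionally followed by
# ":2," and flag letters (as seen in cur/).
MAILDIR_NAME = re.compile(r"\d+\.[^:/]+\.[^:/]+(?::2,[A-Za-z]*)?")

def looks_like_maildir(dirpath):
    """Guess whether dirpath is the top of a Maildir."""
    subdirs = [os.path.join(dirpath, name) for name in ("cur", "new", "tmp")]
    if not all(os.path.isdir(d) for d in subdirs):
        return False
    names = []
    for d in subdirs[:2]:  # only look at cur/ and new/
        names.extend(os.listdir(d))
    # Strict on purpose: one stray file makes this return False.
    # An empty Maildir (no messages yet) still counts as a Maildir.
    return all(MAILDIR_NAME.fullmatch(n) for n in names)
```

Requiring every filename to match will produce false negatives if
anything else has been dropped into cur/ or new/; relaxing it to "most
names match" is an easy change.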

If you find a file with lots of '^From ' in it, that's probably an
mbox file.  However, you could have an mbox file with only one
'^From ', so watch out.
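
A minimal sketch of the mbox test. Counting the separator lines follows
the idea above; checking that the very first line starts with "From " is
an extra assumption added here to cope with the single-message case. The
function names are invented for this example. Files are opened in binary
mode so that stray binary MUA caches don't raise decoding errors.

```python
def count_from_lines(path):
    """Count lines that start with the mbox separator 'From '."""
    count = 0
    with open(path, "rb") as fh:
        for line in fh:
            # Note the trailing space: a 'From:' header does not match.
            if line.startswith(b"From "):
                count += 1
    return count

def looks_like_mbox(path):
    """Guess whether path is an mbox file.

    Assumption: an mbox file begins with a 'From ' separator line, so
    even a one-message mbox is caught by looking at the first line.
    """
    with open(path, "rb") as fh:
        return fh.readline().startswith(b"From ")
```

A message body can contain unescaped lines beginning with "From ", so the
count is only a rough estimate of how many messages the file holds.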

This will probably give some false positives and/or false negatives,
depending on your data, but perhaps it's easier than classifying
things manually.


