The email package and KLEZ mails

Tue May 28 13:34:11 EDT 2002

On Tue, 28 May 2002 21:34:22 +1000, Anthony Baxter
<anthony at interlink.com.au> wrote in comp.lang.python in article
<mailman.1022585747.21423.python-list at python.org>:

> 
> > In my experience, incorrect MIME structure is one of the numerous
> > hints about mail being SPAM.  I do not remember a single false positive.
> 
> I wish. I have to deal with end-user email, and trust me, it's not all
> spam.

I concur with Anthony. I have written an email filter package using the
email module and if you use the strict Parser class included in that
module, it does throw away too much good email (because any good mail
thrown away is too much). Moreover, as I've mentioned in other posts and
email correspondence, if you're writing software for end users, you really
can't just tell them: "Oh, all those mails that caused errors...they were
just non-RFC compliant. Probably SPAM or virus." First off, it's not 100%
correct. Secondly, why is it that the three other mail readers I use
(Agent, Pegasus, and PocoMail) are all able to parse these messages? I also
agree with the idea that applications must be strict in what they write and
liberal in what they accept.

In extensive correspondence with Barry Warsaw on this matter a few months
back, we came to the understanding that the Parser he provides in the email
module is intended to be a strict, RFC compliant Parser. The design of the
email module allows Python programmers to plug in their own Parser class
and use it with the rest of the email module to get the flexibility and
functionality that they need. Barry is open to including other types of
Parsers, but his point of view seems to be that if the strict Parser
provided in the email module cannot parse the email, then the Python
programmer should decide how to handle this and write appropriate code.

I have written a "smart parser" class that I am using in my email filter. I
use this class instead of the Parser class provided with the email module.
I provide the code below for all interested parties. It really does a
pretty good job of handling most mails that the strict Parser cannot
handle. If the "smart parser" class below cannot handle your email, then
there is an undocumented HeaderParser class in the email module (see the
source code for the Parser class) which is your next best chance.
Otherwise, you will have to write your own routines for parsing the
message.
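
The fallback order described above can be sketched as follows. This is
only an illustration, and it assumes the present-day Python 3 email
package rather than the 2002 module: policy.strict stands in for the strict
Parser, HeaderParser is the last resort, and parse_with_fallback is a
hypothetical helper name. A smart parser such as the one in this post
would be tried between the two.

```python
from email import policy
from email.errors import MessageDefect, MessageError
from email.parser import HeaderParser, Parser


def parse_with_fallback(text):
    """Return (message, level), trying the strictest parser first."""
    attempts = [
        # Strict: any structural defect is raised as an exception.
        ("strict", Parser(policy=policy.strict).parsestr),
        # Last resort: parse the headers only, keep the body as one string.
        ("headers-only", HeaderParser().parsestr),
    ]
    for level, parse in attempts:
        try:
            return parse(text), level
        except (MessageError, MessageDefect):
            continue
    raise ValueError("no parser could handle the message")


# A multipart message whose declared boundary never appears in the body.
broken = (
    "Subject: test\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "The boundary line never appears in this body.\n"
)
msg, level = parse_with_fallback(broken)
print(level, msg["Subject"])
```

Returning the level alongside the message lets the caller log how often
each fallback is actually needed.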

The "smart parser" class below is adapted from the Parser code provided in
the email module. I have been using this in a production environment for a
couple of months now, and have quite a number of other beta testers also
using it, and we get almost no mails that cannot be parsed.

Caution: Because this module makes "assumptions" about the structure of the
message, if the received email is not RFC compliant and you use one of the
Generators to flatten the message (which is what happens when you print
it), the output may not be identical to the raw message that was received.
If you might need the original raw message later, save it separately in
your code.
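
One simple way to follow that advice is to carry the raw text alongside the
parsed object. The sketch below is only an illustration under assumptions:
it uses the present-day Python 3 email package, and StoredMail and
load_mail are hypothetical names, not part of the email module.

```python
import email
from collections import namedtuple

# Hypothetical container: the untouched raw text travels with the parsed
# message object.
StoredMail = namedtuple("StoredMail", ["raw", "message"])


def load_mail(raw_text):
    """Parse raw_text but keep the original string for later use."""
    return StoredMail(raw=raw_text,
                      message=email.message_from_string(raw_text))


raw = "Subject: hello\n\nbody text\n"
mail = load_mail(raw)
# Use mail.message for filtering decisions; forward or archive mail.raw
# when byte-for-byte fidelity matters.
print(mail.message["Subject"])
```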

Code follows the signature. Enjoy,

--
Sheila King
http://www.thinkspot.net/sheila/
http://www.k12groups.org/
http://www.FutureQuest.net

#####   CODE FOR SMART PARSER CLASS #####

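from StringIO import StringIO

from email import Errors
from email.Parser import Parser

# Line separator used to rejoin folded header values
NL = '\n'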

class smart_Parser(Parser):

    def parse(self, fp):
        root = self._class()
        self._parseheaders(root, fp)
        self._parsebody(root, fp)
        return root

    def parsestr(self, text):
        return self.parse(StringIO(text))

    def _parseheaders(self, container, fp):
        # Parse the headers, returning a list of header/value pairs.
        # None as the header means the Unix-From header.
        lastheader = ''
        lastvalue = []
        lineno = 0
        while 1:
            line = fp.readline()[:-1]
            if not line or not line.strip():
                break
            lineno += 1
            # Check for initial Unix From_ line
            if line.startswith('From '):
                if lineno == 1:
                    container.set_unixfrom(line)
                    continue
                else:
                    raise Errors.HeaderParseError(
                        'Unix-from in headers after first rfc822 header')
            #
            # Header continuation line
            if line[0] in ' \t':
                if not lastheader:
                    raise Errors.HeaderParseError(
                        'Continuation line seen before first header')
                lastvalue.append(line)
                continue
            # Normal, non-continuation header.  BAW: this should check to
            # make sure it's a legal header, e.g. doesn't contain spaces.
            # Also, we should expose the header matching algorithm in the
            # API, and allow for a non-strict parsing mode (that ignores the
            # line instead of raising the exception).
            i = line.find(':')
            if i < 0:
                raise Errors.HeaderParseError(
                    'Not a header, not a continuation')
            if lastheader:
                container[lastheader] = NL.join(lastvalue)
            lastheader = line[:i]
            lastvalue = [line[i+1:].lstrip()]
        # Make sure we retain the last header
        if lastheader:
            container[lastheader] = NL.join(lastvalue)

    def _parsebody(self, container, fp):
        boundary = container.get_boundary()
        isdigest = (container.get_type() == 'multipart/digest')
        if boundary:
            preamble = epilogue = None
            separator = '--' + boundary
            payload = fp.read()
            start = payload.find(separator)
            if start < 0:
                container.add_payload(payload)
                return
            if start > 0:
                preamble = payload[0:start]
            start += len(separator) + 1 + isdigest
            terminator = payload.find('\n' + separator + '--', start)
            if terminator < 0:
                terminator = len(payload)
            if terminator + len(separator) + 3 < len(payload):
                epilogue = payload[terminator + len(separator) + 3:]
            if isdigest:
                separator += '\n\n'
            else:
                separator += '\n'
            parts = payload[start:terminator].split('\n' + separator)
            for part in parts:
                # Skip empty parts.  Removing them from the list while
                # iterating over it would skip the part that follows.
                if not part.strip():
                    continue
                msgobj = self.parsestr(part)
                container.preamble = preamble
                container.epilogue = epilogue
                if not isinstance(container.get_payload(), type([])):
                    container.set_payload([msgobj])
                else:
                    container.add_payload(msgobj)
        elif container.get_type() == 'message/delivery-status':
            # This special kind of type contains blocks of headers separated
            # by a blank line.  We'll represent each header block as a
            # separate Message object
            blocks = []
            while 1:
                blockmsg = self._class()
                self._parseheaders(blockmsg, fp)
                if not len(blockmsg):
                    # No more header blocks left
                    break
                blocks.append(blockmsg)
            container.set_payload(blocks)
        elif container.get_main_type() == 'message':
            # Create a container for the payload, but watch out for there
            # not being any headers left
            try:
                msg = self.parse(fp)
            except Errors.HeaderParseError:
                msg = self._class()
                self._parsebody(msg, fp)
            container.add_payload(msg)
        else:
            container.add_payload(fp.read())
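
#####   ILLUSTRATION: LENIENT BOUNDARY HANDLING #####

The lenient part of _parsebody above is how it treats a missing opening or
closing boundary. The standalone sketch below isolates that idea so it can
be tried without the email module. It is written for present-day Python 3,
split_multipart_leniently is a hypothetical name, and it is a simplified
illustration rather than a drop-in replacement (for example, it ignores the
multipart/digest case and assumes boundary lines have no trailing spaces).

```python
def split_multipart_leniently(payload, boundary):
    """Return (preamble, parts, epilogue); parts is a list of raw strings."""
    separator = "--" + boundary
    start = payload.find(separator)
    if start < 0:
        # No opening boundary at all: treat the whole body as one part
        # instead of raising an error.
        return None, [payload], None
    preamble = payload[:start] or None
    start += len(separator) + 1  # skip the boundary line and its newline
    closing = "\n" + separator + "--"
    end = payload.find(closing, start)
    if end < 0:
        # No closing boundary: assume the last part runs to the end.
        end = len(payload)
        epilogue = None
    else:
        epilogue = payload[end + len(closing):].lstrip("\n") or None
    parts = payload[start:end].split("\n" + separator + "\n")
    # Drop parts that are empty or whitespace only.
    return preamble, [p for p in parts if p.strip()], epilogue


# The closing "--XYZ--" line is missing, yet both parts are recovered.
truncated = "--XYZ\nfirst\n--XYZ\nsecond\n"
print(split_multipart_leniently(truncated, "XYZ"))
```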