[Email-SIG] API thoughts

Thu Mar 3 17:13:41 CET 2011

On Thu, 03 Mar 2011 16:28:32 +0100, Steffen Daode Nurpmeso <sdaoden at googlemail.com> wrote:
> On Wed, Mar 02, 2011 at 07:50:20PM -0500, R. David Murray wrote:
> > That is, if the defects list is non-empty,
> > the message is technically malformed.  Of course, that information by
> > itself isn't necessarily useful, which is why defects is a list
> > of defects.
> > "is_processable" lies in the eyes of the application.
> > What defects is it capable of dealing with?  The email package
> > can't know that.  So, again, that's why defects is a list.
> > 
> > Let me clarify what I mean by the policy controlling "what, exactly, is
> > a defect".  The idea here is that when parsing an email, each deviance
> > from the RFCs counts as a defect (the current email package, by the way,
> > only detects a small number of such defects!).  But when parsing, say,
> > an http stream, non-ascii characters in headers are perfectly legal.
> > So it seems to make sense that the HTTP policy would change what counts
> > as a defect during the operation of the parser.
> 
> So i would hope for '.all_defects[]' and (policy-adjusted) 
> '.defects[]'.  I would hope for 
> '.had_header_defects(policy_only=True)', 
> '.had_payload_defects(policy_only=True)'.

Well, what is a defect for an HTTP parse is not the same as what is
a defect for an email parse, so I don't know what "all defects" would
consist of.  The recovery decisions the parser makes can also be affected
by the policy, so there can't, as far as I can see, be a single list of
"all defects" that applies to all parses.

Currently the email package does not report header defects.  When it does,
my plan is that each Header will have its own defect list, and likewise
each message body (using a recursive definition).  How the defects list
on the Message object interacts with this is an interesting API question
worthy of discussion.  Perhaps we do, after all, have some sort of
"has_defects" method that queries the constituent parts, and perhaps a
function that returns a list of parts with defects, possibly divided
between headers and body as you suggest.
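
To make the recursive idea concrete, here is a minimal sketch. Every name
in it (Header, Part, has_defects, defective_parts) is hypothetical and only
illustrates the shape of the question; none of it is the email package's API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Header:
    # Hypothetical: a parsed header that carries its own defect list.
    name: str
    value: str
    defects: List[str] = field(default_factory=list)


@dataclass
class Part:
    # Hypothetical: a message or sub-part; 'defects' holds body-level defects.
    headers: List[Header] = field(default_factory=list)
    subparts: List["Part"] = field(default_factory=list)
    defects: List[str] = field(default_factory=list)

    def has_defects(self) -> bool:
        """True if this part, any of its headers, or any sub-part has a defect."""
        return (
            bool(self.defects)
            or any(h.defects for h in self.headers)
            or any(p.has_defects() for p in self.subparts)
        )

    def defective_parts(self) -> Iterator[Tuple[str, object]]:
        """Yield ('header', Header) or ('body', Part) for each defective piece."""
        for h in self.headers:
            if h.defects:
                yield ("header", h)
        if self.defects:
            yield ("body", self)
        for p in self.subparts:
            yield from p.defective_parts()


# A two-level message whose only defect sits on a header of the inner part.
inner = Part(headers=[Header("Subject", "caf\xe9", ["non-ASCII in header"])])
outer = Part(headers=[Header("To", "a@example.com")], subparts=[inner])
kinds = [kind for kind, _ in outer.defective_parts()]
```

In this sketch the top-level query answers "is anything wrong anywhere?",
while the generator lets an application decide which of the defective
pieces it can cope with.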

> Doing so would fill the huge hole in between 'not len(defects)' 
> and the detailed inspection of a defects list which consists of 
> a highly differentiated tree of classes.

Yeah, the number of different defect classes involved in this scheme
worries me a little bit.

> The parser has to parse- and does encounter all of these anyway, 
> and an application cannot re-collect this (dropped) information 
> except with expensive effort, i.e. at least choosing a different, 
> stricter policy followed by another parse of the bogus mail.

Why recollect?  The list is there (and, as I indicated above, will be
associated with the part that contains the error).  The list of defects
will be *all* the defects detected by that policy: all RFC deviance
(well, perhaps not quite all...see below).  Defects don't normally raise
errors, so there's no reason not to look for all of the relevant ones
(and indeed, we are probably only detecting the ones that actually affect
the parsing).

That is, if you parse an HTTP stream, encountering a non-ASCII character
is *not* a defect.  It doesn't make any sense to me to report an
"if this were an email this would be a defect" defect.  And if the
header for some strange reason included an RFC2047 encoded word that
was invalidly formed...well, in an HTTP parse that would *technically*
violate the RFC, but in practice it really means that the data should
just be passed through as is.  That is, it's not a defect, and we
would be wasting time even *looking* for RFC2047 encoded words.
(Unless someone finds a browser or server that generates them!)
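
As a rough illustration of a policy deciding what counts as a defect, the
sketch below checks the same header value under two policies. The Policy
class, its non_ascii_is_defect flag, and header_defects() are assumptions
made up for this example, not anything the email package provides.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Policy:
    # Hypothetical knob: does a non-ASCII character in a header count as a defect?
    non_ascii_is_defect: bool


# Two illustrative policies; the names are assumptions for this sketch.
EMAIL = Policy(non_ascii_is_defect=True)
HTTP = Policy(non_ascii_is_defect=False)


def header_defects(value: str, policy: Policy) -> List[str]:
    """Return the defects found in one header value under the given policy."""
    defects = []
    # Only look for the problem at all when the policy considers it a defect.
    if policy.non_ascii_is_defect and any(ord(ch) > 127 for ch in value):
        defects.append("non-ASCII character in header value")
    return defects


value = "caf\xe9"
email_defects = header_defects(value, EMAIL)  # one defect recorded
http_defects = header_defects(value, HTTP)    # nothing recorded
```

Note that under the HTTP policy the check is skipped entirely, so there is
no "would have been a defect in an email" entry to report.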

In other words, in the base package I don't think there are "strict"
and "less strict" parsing policies; rather there are *different* parsing
policies depending on the context.  As far as I can see, it makes no sense
to parse an HTTP stream, and then reparse it as if it were an email stream.
Now, it might be useful to design a "very_strict" policy that did extra
work looking for RFC defects that a normal parse wouldn't detect (I can't
think of any off the top of my head, but the email RFCs are so complex
that I'm sure there are some), but in that case if you parsed it with
the less-strict (normal) policy those defects would *not* be noticed
by the parser.  In any case, I think such a validating parser/policy is
out of scope for the current package.

--David