Handling bad tags with SGMLParser

Ken Causey ken at ineffable.com
Thu Mar 7 11:48:52 EST 2002


I'm working with a piece of software which uses SGMLParser to parse
HTML content so that a small subset of tags can be handled specially. 
The result is then passed on to a browser for rendering.  Failure is
not an option.

I'm running into a problem with sgmllib in Python 2.1.2 with tags of
the form:

<![blah]> where 'blah' could be anything.

In 1.5.2 any non-comment tag starting with ! was simply ignored.  I'm
not sure when this changed, but 2.1.2 now treats such tags as
declarations and puts them through a bit more parsing.  This is great.
The problem arises when the tag does not parse properly as a
declaration, which the above tag does not.
SGMLParser.parse_declaration() raises an SGMLParseError on anything
that doesn't pass.

The user of SGMLParser needs to be able to handle invalid tags.  This
handling may be complex, or as simple as ignoring the tag and asking
SGMLParser to skip it and move along.  As far as I can tell,
SGMLParser offers no way to do this.
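
One possible workaround, sketched here under my own assumptions rather
than as anything sgmllib provides, is to strip such tags from the input
before it ever reaches the parser.  The names BAD_DECL and
strip_bad_declarations are hypothetical, and the pattern assumes that
'blah' never contains a '>' character.

```python
import re

# Hypothetical pattern: matches tags of the form <![blah]>, where
# 'blah' is any run of characters that does not contain '>'.
BAD_DECL = re.compile(r"<!\[[^>]*\]>")


def strip_bad_declarations(html):
    """Return html with every <![...]> tag removed.

    Comments (<!-- ... -->) and ordinary declarations such as
    <!DOCTYPE ...> are left alone because they do not start with '<!['.
    """
    return BAD_DECL.sub("", html)


if __name__ == "__main__":
    sample = "<p>before<![blah]>after</p>"
    # The cleaned string is what would then be fed to the parser.
    print(strip_bad_declarations(sample))
```

This only sidesteps the problem for this one tag form; it does not
give the parser's user a general hook for invalid tags.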

As a side note, the error message raised is particularly
uninformative, as it simply includes the first character of the tag,
in other words always '<'.

Thanks,

Ken Causey


