HTML parser problem

Sat Nov 9 02:38:41 EST 2002

sanjay wrote:
>  Any one has suggestion for following problem. Some word documents
> have been converted to HTML page in Ms-Word. Want to filter html tags
> like..
> <o:p></o:p>,
> <![if !supportEmptyParas]> <![endif]>
> <?xml:namespace prefix = o ns =
> "urn:schemas-microsoft-com:office:office" />, etc. I couldn't solve
> using SGMLParser.

If you need a pure Python solution, here's what I did with some
help from Martin v. Löwis.

    ======================

import htmllib, sgmllib, formatter, StringIO

# From Martin v. L"owis, on c.l.py, 10/31/02
def _get_unicode_entitydefs():
     import htmlentitydefs
     entitydefs = htmlentitydefs.entitydefs.copy()
     for k,v in entitydefs.items():
         if v.startswith('&#'):
             v = int(v[2:-1])
         else:
             v = ord(v)
         entitydefs[k] = unichr(v)
     return entitydefs

class MyHTMLParser(htmllib.HTMLParser):
     entitydefs = _get_unicode_entitydefs()

     def handle_charref(self, name):
         try:
             c = unichr(int(name))
         except ValueError:
             # either it isn't an integer or it's
             # outside the supported Unicode range
             # (my Python only goes up to 65535)
             c = '?'
         self.handle_data(c)

     def handle_image(self, arg, *args):
         pass

def convert_to_text(s):
     file = StringIO.StringIO()
     form = formatter.AbstractFormatter(formatter.DumbWriter(file=file))
     p = MyHTMLParser(form)
     try:
         p.feed(s)
     except sgmllib.SGMLParseError, err:
         # Error in the parse
         print "HTML wasn't HTML -- cannot index: %s" % (err,)
         return
     p.close()

     return file.getvalue()
===============================

*HOWEVER*, the SGML parser upon which this is built will fail for
some MS-Word HTML.  For example, in your text you have

   <![if !supportEmptyParas]> <![endif]>

The "<!" is recognized in sgmllib.SGMLParser.goahead, which
passes it on to "self.parse_declaration", implemented in
markupbase.py:

     def parse_declaration(self, i):
         # This is some sort of declaration; in "HTML as
         # deployed," this should only be the document type
         # declaration ("<!DOCTYPE html...>").
         rawdata = self.rawdata
         j = i + 2
         assert rawdata[i:j] == "<!", "unexpected call to parse_declaration"
         if rawdata[j:j+1] in ("-", ""):
             # Start of comment followed by buffer boundary,
             # or just a buffer boundary.
             return -1
         # in practice, this should look like: ((name|stringlit) S*)+ '>'
         n = len(rawdata)
         decltype, j = self._scan_name(j, i)

The method "_scan_name" begins:

     def _scan_name(self, i, declstartpos):
         rawdata = self.rawdata
         n = len(rawdata)
         if i == n:
             return None, -1
         m = _declname_match(rawdata, i)
         if m:

Where _declname_match is

_declname_match = re.compile(r'[a-zA-Z][-_.a-zA-Z0-9]*\s*').match

But this requires the first character after "<!" to be a letter,
and "[" is most definitely not a letter.  So the match fails and
the code branches to:

             self.updatepos(declstartpos, i)
             self.error("expected name token")
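
The failure is easy to reproduce in isolation.  A minimal sketch,
reusing the regular expression quoted above; the helper name
"declaration_name" is hypothetical and only mimics the first step of
_scan_name, not the rest of the markupbase logic:

```python
import re

# Pattern copied from the markupbase.py excerpt quoted above.
_declname_match = re.compile(r'[a-zA-Z][-_.a-zA-Z0-9]*\s*').match

def declaration_name(rawdata):
    """Return the name that follows "<!", or None if none can be scanned.

    Hypothetical helper for illustration only.
    """
    assert rawdata.startswith("<!")
    m = _declname_match(rawdata, 2)   # start scanning just past "<!"
    if m is None:
        return None                   # the real parser reports an error here
    return m.group().strip()

# An ordinary document type declaration scans fine ...
doctype = declaration_name('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">')
# ... but the MS-Word construct starts with "[", so no name is found.
word = declaration_name('<![if !supportEmptyParas]> <![endif]>')
```

Here "doctype" comes back as a name while "word" is None, which is
the case where the parser gives up with "expected name token".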

As far as I could tell, MS-HTML is wrong and it should be using
<!--[if ! ....

that is, the "<!--" instead of just "<!".  (Other MS-HTML uses
this construct.)

In other words, this is a long-winded way of saying that you can't
do this directly with the standard Python library's parser.  As
Martin suggested, try "HTMLTidy" and/or (as suggested by Gerhard
Häring) one of the text browsers (lynx -dump, links -dump, w3m -dump).
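
If neither is an option, a rough pre-filter might at least get the
three constructs from the original question out of the parser's way
before the text is fed to it.  This is only a sketch: the name
"strip_word_markup" and the patterns are assumptions made for
illustration, they cover just those three constructs, and they have
not been checked against the full range of what MS-Word emits.

```python
import re

# Hypothetical patterns, one per construct quoted in the question.
# <![if ...]> and <![endif]> (but not the valid <!--[if ...]> form)
_CONDITIONAL = re.compile(r'<!\[(?:if\b[^\]]*|endif)\]>', re.IGNORECASE)
# <?xml:namespace prefix = o ns = "..." />, possibly split across lines
_XML_NAMESPACE = re.compile(r'<\?xml:namespace\b[^>]*>', re.IGNORECASE)
# <o:p>, </o:p> and other tags in the "o:" namespace
_OFFICE_TAG = re.compile(r'</?o:[A-Za-z][^>]*>')

def strip_word_markup(html):
    """Remove the MS-Word constructs above, leaving everything else as-is."""
    for pattern in (_CONDITIONAL, _XML_NAMESPACE, _OFFICE_TAG):
        html = pattern.sub('', html)
    return html
```

The filtered string could then be passed on to something like
convert_to_text above; anything the patterns miss will still trip
the parser in the same way.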

					Andrew
					dalke at dalkescientific.com