HTML parser problem
Andrew Dalke
adalke at mindspring.com
Sat Nov 9 02:38:41 EST 2002
sanjay wrote:
> Any one has suggestion for following problem. Some word documents
> have been converted to HTML page in Ms-Word. Want to filter html tags
> like..
> <o:p></o:p>,
> <![if !supportEmptyParas]> <![endif]>
> <?xml:namespace prefix = o ns =
> "urn:schemas-microsoft-com:office:office" />, etc. I couldn't solve
> using SGMLParser.
If you need a pure Python solution, here's what I did with some
help from Martin v. Löwis.
======================
import StringIO
import formatter
import htmllib
import sgmllib

# From Martin v. L"owis, on c.l.py, 10/31/02
def _get_unicode_entitydefs():
    # Map every HTML entity name to a single Unicode character.
    import htmlentitydefs
    entitydefs = htmlentitydefs.entitydefs.copy()
    for k, v in entitydefs.items():
        if v.startswith('&#'):
            # Stored as a character reference such as '&#913;'
            v = int(v[2:-1])
        else:
            # Stored as a single Latin-1 character
            v = ord(v)
        entitydefs[k] = unichr(v)
    return entitydefs

class MyHTMLParser(htmllib.HTMLParser):
    entitydefs = _get_unicode_entitydefs()

    def handle_charref(self, name):
        try:
            c = unichr(int(name))
        except ValueError:
            # either it isn't an integer or it's
            # outside the supported Unicode range
            # (my Python only goes up to 65535)
            c = '?'
        self.handle_data(c)

    def handle_image(self, arg, *args):
        # Images contribute nothing to the text output.
        pass

def convert_to_text(s):
    file = StringIO.StringIO()
    form = formatter.AbstractFormatter(formatter.DumbWriter(file=file))
    p = MyHTMLParser(form)
    try:
        p.feed(s)
    except sgmllib.SGMLParseError, err:
        # Error in the parse
        print "HTML wasn't HTML -- cannot index: %s" % (err,)
        return
    p.close()
    return file.getvalue()
===============================
*HOWEVER*, the SGML parser upon which this is built will fail for
some MS-Word HTML. For example, in your text you have
    <![if !supportEmptyParas]> <![endif]>
The "<!" gets identified in sgmllib.SGMLParser.goahead which
passes it on to "self.parse_declaration" which is implemented
in markupbase.py:
    def parse_declaration(self, i):
        # This is some sort of declaration; in "HTML as
        # deployed," this should only be the document type
        # declaration ("<!DOCTYPE html...>").
        rawdata = self.rawdata
        j = i + 2
        assert rawdata[i:j] == "<!", "unexpected call to parse_declaration"
        if rawdata[j:j+1] in ("-", ""):
            # Start of comment followed by buffer boundary,
            # or just a buffer boundary.
            return -1
        # in practice, this should look like: ((name|stringlit) S*)+ '>'
        n = len(rawdata)
        decltype, j = self._scan_name(j, i)
The method "_scan_name" does
    def _scan_name(self, i, declstartpos):
        rawdata = self.rawdata
        n = len(rawdata)
        if i == n:
            return None, -1
        m = _declname_match(rawdata, i)
        if m:
Where _declname_match is
    _declname_match = re.compile(r'[a-zA-Z][-_.a-zA-Z0-9]*\s*').match
But this requires the name after "<!" to start with a letter, and "["
is most definitely not a letter. So the match fails and the code
branches to
        self.updatepos(declstartpos, i)
        self.error("expected name token")
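The failing match can be reproduced in isolation. This is only a sketch: it copies the pattern quoted above and applies it to two sample strings (the DOCTYPE string is an assumption chosen for contrast), starting at offset 2, which is the first character after "<!".

```python
import re

# Same pattern as _declname_match in markupbase.py, as quoted above.
_declname_match = re.compile(r'[a-zA-Z][-_.a-zA-Z0-9]*\s*').match

doctype = '<!DOCTYPE html>'               # sample input, not from the original post
ms_word = '<![if !supportEmptyParas]>'    # the construct from the question

# Offset 2 is the first character after "<!".
doctype_match = _declname_match(doctype, 2)
ms_word_match = _declname_match(ms_word, 2)

print(repr(doctype_match.group()))  # 'DOCTYPE ' (the pattern also consumes trailing whitespace)
print(ms_word_match)                # None, because "[" is not a letter
```

The None result is what sends _scan_name down the "expected name token" branch.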
As far as I could tell, MS-HTML is wrong and it should be using
    <!--[if ! ....
that is, the "<!--" instead of just "<!". (Other MS-HTML uses
this construct.)
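If only the handful of constructs quoted in the question matter, one possible workaround is to delete them with regular expressions before the text reaches the parser. This is a rough sketch and not something from the original exchange: the function name strip_word_markup and all three patterns are hypothetical, they cover only the examples shown in this thread, and other MS-Word output may well need more.

```python
import re

# Hypothetical patterns covering only the constructs quoted in this thread.
_MS_CONDITIONAL = re.compile(r'<!\[(?:if\b[^\]>]*|endif)\]>', re.IGNORECASE)
_MS_OFFICE_TAG = re.compile(r'</?o:[^>]*>', re.IGNORECASE)
_MS_XML_PI = re.compile(r'<\?xml:namespace\b[^>]*>', re.IGNORECASE)

def strip_word_markup(html):
    """Remove <![if ...]>, <![endif]>, <o:...> tags and <?xml:namespace ...>."""
    for pattern in (_MS_CONDITIONAL, _MS_OFFICE_TAG, _MS_XML_PI):
        html = pattern.sub('', html)
    return html

sample = '<p><![if !supportEmptyParas]>&nbsp;<![endif]><o:p></o:p></p>'
cleaned = strip_word_markup(sample)
print(cleaned)
```

The cleaned string no longer contains a "<![" sequence, so parse_declaration is never asked to handle one. Whether the rest of a given Word document then parses is a separate question.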
In other words, this is a long-winded way of saying that you can't
do this with the standard Python library as it stands. As Martin
suggested, try "HTMLTidy", and/or (as suggested by Gerhard Häring)
try one of the text browsers (lynx -dump, links -dump, w3m -dump).
Andrew
dalke at dalkescientific.com