HTMLParser tag contents
Grant Griffin
g2 at seebelow.org
Sun May 7 11:51:48 EDT 2000
Grant Griffin wrote:
>
> Oleg Broytmann wrote:
> >
> > On 5 May 2000, Grant Griffin wrote:
> > > Perhaps I misspoke. I agree that the solution would probably have to occur at
> > > the level of SGMLParser, but I guess my question remains: can it do that? If so,
> > > how?
> > >
> > > In looking at the SGMLParser source code, it doesn't appear to have any
> > > mechanism to capture the contents of a tag.
> >
> > You cannot "just do it" :) You need to write a class inherited from
> > SGMLParser, define the methods for capturing <BODY> and from this point
> > forward capture ALL text and ALL tags until </BODY>.
>
> Thanks for the tip, Oleg.
I experimented with your approach and it worked, but I finally decided
just to use 're', to preserve HTMLParser's other features. Here's what
I came up with:
---
import re
from htmllib import HTMLParser  # Python 1.5/2.x standard library

class HTMLParserEx(HTMLParser):
    # re.S lets "." match newlines; re.I ignores case. Flags combine with "|".
    head_re = re.compile(r"<\s*head.*?>(.*)<\s*/head\s*>", re.S | re.I)
    body_re = re.compile(r"<\s*body.*?>(.*)<\s*/body\s*>", re.S | re.I)

    def __init__(self, formatter, verbose=0):
        self.alldata = ''
        HTMLParser.__init__(self, formatter, verbose)

    def reset(self):
        self.alldata = ''
        HTMLParser.reset(self)

    def feed(self, data):
        # Keep a copy of everything fed in, then parse as usual.
        self.alldata = self.alldata + data
        HTMLParser.feed(self, data)

    def get_text(self, text_re, debug=0):
        # Return group 1 of the first match, or '' if nothing matches.
        srch = text_re.search(self.alldata)
        if srch is None:
            if debug:
                print 'No match: Data is:'
                print self.alldata
            return ''
        else:
            return srch.group(1)

    def get_body_text(self, debug=0):
        return self.get_text(HTMLParserEx.body_re, debug)

    def get_head_text(self, debug=0):
        return self.get_text(HTMLParserEx.head_re, debug)
---
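For comparison, here is a rough sketch of the subclassing approach Oleg described: switch on capturing at <BODY> and record all text and tags until </BODY>. It is only an illustration under stated assumptions. It uses the `html.parser.HTMLParser` class from current Python rather than the `htmllib`/`sgmllib` classes discussed in this thread, the class name `BodyCapture` is hypothetical, and comments and declarations inside the body are simply dropped.

```python
from html.parser import HTMLParser


class BodyCapture(HTMLParser):
    """Collect the markup found between <body> and </body> (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=False reports entities such as &amp; as separate events
        HTMLParser.__init__(self, convert_charrefs=False)
        self.in_body = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            # get_starttag_text() returns the start tag as written in the input
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are recorded once, as written
        if self.in_body:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif self.in_body:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.in_body:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.in_body:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.in_body:
            self.parts.append("&#%s;" % name)

    def get_body_text(self):
        return "".join(self.parts)


parser = BodyCapture()
parser.feed("<html><head><title>t</title></head>"
            "<body bgcolor='white'><p>Hello &amp; <b>bye</b></p></body></html>")
parser.close()
body = parser.get_body_text()
```

The trade-off against the regular-expression version is that this one rebuilds the body from parser events, so end tags come back lower-cased and anything the parser does not report (comments, for instance) is lost, whereas the 're' version returns the original text untouched.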
I'm a Python tyro, so comments and criticisms from 'experts' are
welcome.
(at-times-like-this,-i'd-prefer-more-library-documentation-and
-examples-to-reading-open-source-<wink>)-ly y'rs,
=g2
p.s. It took a long time for me to figure out how to combine flags for
're'. I finally found that in 'Contents_of_Module_re.html', but I would
like to kindly recommend to the python docs folks that they might want
to spread some redundant needles on that point throughout the re doc
haystack; in particular, I didn't find any examples of combined flags in
the docs.
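
For anyone else hunting for the same needle, a small sketch of combining 're' flags with the bitwise OR operator follows. The pattern and sample string are made up for illustration; only `re.compile`, `re.S` (DOTALL) and `re.I` (IGNORECASE) come from the library.

```python
import re

# Flags behave like bit masks, so several can be combined with bitwise OR.
# re.S (DOTALL) lets "." match newlines; re.I (IGNORECASE) ignores letter case.
title_re = re.compile(r"<title>(.*?)</title>", re.S | re.I)

html = "<HTML><HEAD><TITLE>Two\nlines</TITLE></HEAD></HTML>"

match = title_re.search(html)
title = match.group(1)

# With only one of the two flags the same pattern fails on this input:
# without re.S the "." cannot cross the newline, and
# without re.I "<title>" cannot match "<TITLE>".
only_ignorecase = re.compile(r"<title>(.*?)</title>", re.I).search(html)
only_dotall = re.compile(r"<title>(.*?)</title>", re.S).search(html)
```

Here `title` holds the two-line title text, while both single-flag searches return None.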
--
_____________________________________________________________________
Grant R. Griffin g2 at dspguru.com
Publisher of dspGuru http://www.dspguru.com
Iowegian International Corporation http://www.iowegian.com