HTMLParser tag contents

Grant Griffin g2 at seebelow.org
Sun May 7 11:51:48 EDT 2000


Grant Griffin wrote:
> 
> Oleg Broytmann wrote:
> >
> > On 5 May 2000, Grant Griffin wrote:
> > > Perhaps I misspoke.  I agree that the solution would probably have to occur at
> > > the level of SGMLParser, but I guess my question remains: can it do that? If so,
> > > how?
> > >
> > > In looking at the SGMLParser source code, it doesn't appear to have any
> > > mechanism to capture the contents of a tag.
> >
> >    You cannot "just do it" :) You need to write a class inherited from
> > SGMLParser, define the methods for capturing <BODY> and from this point
> > forward capture ALL text and ALL tags until </BODY>.
> 
> Thanks for the tip, Oleg.

I experimented with your approach and it worked, but I finally decided
just to use 're', to preserve HTMLParser's other features.  Here's what
I came up with:

---
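import re
from htmllib import HTMLParser


class HTMLParserEx(HTMLParser):
    # Keep a copy of everything fed to the parser so that the raw contents
    # of <head> and <body> can be pulled out with 're' afterwards.
    head_re = re.compile(r"<\s*head.*?>(.*)<\s*/head\s*>", re.S | re.I)
    body_re = re.compile(r"<\s*body.*?>(.*)<\s*/body\s*>", re.S | re.I)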

    def __init__(self, formatter, verbose=0):
        self.alldata = ''
        HTMLParser.__init__(self, formatter, verbose)

    def reset(self):
        self.alldata = ''
        HTMLParser.reset(self)

    def feed(self, data):
        self.alldata = self.alldata + data
        HTMLParser.feed(self, data)
        
    def get_text(self, text_re, debug=0):
        srch = text_re.search(self.alldata)
        if srch is None:
            if debug:
                print 'No match: Data is:'
                print self.alldata
            return ''
        else:
            return srch.group(1)

    def get_body_text(self, debug=0):
        return self.get_text(HTMLParserEx.body_re, debug)

    def get_head_text(self, debug=0):
        return self.get_text(HTMLParserEx.head_re, debug)
---
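
For comparison, here is a rough sketch of the subclassing approach Oleg
described: start capturing at <BODY> and keep every tag and every piece of
text until </BODY>.  It is only an illustration under some assumptions: it
is written for present-day Python 3 and its html.parser.HTMLParser (htmllib
no longer exists there), and the class name BodyCapture and its attributes
are made up for this example.  Comments inside the body are dropped, and
end tags come back lowercased because the parser normalizes tag names.

```python
from html.parser import HTMLParser


class BodyCapture(HTMLParser):
    """Collect the markup between <body> and </body> (hypothetical name)."""

    def __init__(self):
        # convert_charrefs=False so that &amp; and &#38; reach the
        # handle_entityref / handle_charref hooks instead of being decoded.
        super().__init__(convert_charrefs=False)
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True       # capture everything after <body ...>
        elif self.in_body:
            self.chunks.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>
        if self.in_body:
            self.chunks.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False      # stop capturing at </body>
        elif self.in_body:
            self.chunks.append("</%s>" % tag)

    def handle_data(self, data):
        if self.in_body:
            self.chunks.append(data)

    def handle_entityref(self, name):
        if self.in_body:
            self.chunks.append("&%s;" % name)

    def handle_charref(self, name):
        if self.in_body:
            self.chunks.append("&#%s;" % name)

    def get_body_text(self):
        return "".join(self.chunks)


parser = BodyCapture()
parser.feed('<html><head><title>T</title></head>'
            '<body bgcolor="white"><p>Hi &amp; <b>bye</b><br/></p></body></html>')
parser.close()
body = parser.get_body_text()
```

The trade-off versus the 're' version is that every handler has to be
overridden just to record what went by, which is why the regular
expressions ended up looking simpler to me.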

I'm a Python tyro, so comments and criticisms from 'experts' are
welcome.

(at-times-like-this,-i'd-prefer-more-library-documentation-and
  -examples-to-reading-open-source-<wink>)-ly y'rs,

=g2
p.s.  It took a long time for me to figure out how to combine flags for
're'.  I finally found that in 'Contents_of_Module_re.html', but I would
like to kindly recommend to the python docs folks that they might want
to spread some redundant needles on that point throughout the re doc
haystack; in particular, I didn't find any examples of combined flags in
the docs.
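
For anyone else hunting for the same needle, a small sketch of combining
flags: they are bit flags, so they are joined with the bitwise OR operator
'|'.  The pattern and the sample string below are made up for illustration.

```python
import re

# re.S (DOTALL) lets "." match newlines; re.I (IGNORECASE) ignores case.
# Several flags are combined with bitwise OR.
title_re = re.compile(r"<title>(.*?)</title>", re.S | re.I)

sample = "<TITLE>Tag\ncontents</TITLE>"   # made-up input
title = title_re.search(sample).group(1)

# The long spellings name the same flags.
same_flags = (re.S | re.I) == (re.DOTALL | re.IGNORECASE)

# Equivalent inline form: flags embedded at the start of the pattern.
inline_re = re.compile(r"(?is)<title>(.*?)</title>")
inline_title = inline_re.search(sample).group(1)
```

Without re.S the "." would stop at the newline, and without re.I the
uppercase <TITLE> would not match at all, so this sample needs both.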
-- 
_____________________________________________________________________

Grant R. Griffin                                       g2 at dspguru.com
Publisher of dspGuru                           http://www.dspguru.com
Iowegian International Corporation            http://www.iowegian.com


