HTMLParser tag contents

Grant Griffin g2 at seebelow.org
Wed May 10 03:52:51 EDT 2000


Aahz Maruch wrote:
> 
> In article <3913CBAA.64BB at seebelow.org>,
> Grant Griffin  <g2 at seebelow.org> wrote:
> >
> >I've been trying to figure out how to use HTMLParser.  My immediate need
> >is to extract the entire <BODY> of a file.  (I could do that with 're',
> >but I'm trying to learn HTMLParser.)  Sure, HTMLParser will return a
> >tag's _attributes_, but I can't figure out how to get to the tag's
> >_contents_.  Can it do that?
> 
> Take a look at the handle_data() method.

That's a good suggestion, Aahz--thanks.  I tried it, and it worked, but
there was a problem: "handle_data" gives you the _text_ inside a tag,
but not the tags contained in it.  So you have to write a subclass of
SGMLParser and override "unknown_starttag" and "unknown_endtag" to
append the contained tags to the text from "handle_data".  But those
hooks are only called for tags that have no handler of their own, so
every contained tag must be left unhandled.  Also, to capture the
contents of just one specific tag, such as <BODY>, "start_body" and
"end_body" have to set and clear a flag that tells your overridden
"handle_data" when to capture.  All this works, but it seems awkward
because it doesn't generalize; it has to be done on a per-tag basis.
And because the contained tags must stay unhandled, it prevents you
from using the capabilities of HTMLParser.
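
For what it's worth, here is a minimal sketch of that flag-based
workaround.  It assumes Python 3's "html.parser" module rather than the
htmllib/sgmllib classes discussed in this thread, so the hook names
differ ("handle_starttag"/"handle_endtag" instead of the unknown-tag
hooks).  "BodyGrabber" and "extract_body" are names made up for
illustration.

```python
from html.parser import HTMLParser


class BodyGrabber(HTMLParser):
    """Collect what lies between <body> and </body>, nested tags included."""

    def __init__(self):
        # convert_charrefs=False so entities such as &amp; reach
        # handle_entityref/handle_charref and can be copied through as-is.
        super().__init__(convert_charrefs=False)
        self.in_body = False   # the per-tag flag
        self.parts = []        # captured fragments, joined on request

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            # get_starttag_text() returns the start tag as written in the source
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied once, verbatim.
        if self.in_body:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif self.in_body:
            # End tags are rebuilt from the (lower-cased) tag name.
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if self.in_body:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.in_body:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if self.in_body:
            self.parts.append("&#%s;" % name)

    def contents(self):
        return "".join(self.parts)


def extract_body(html):
    """Hypothetical helper: parse a whole document and return the body contents."""
    parser = BodyGrabber()
    parser.feed(html)
    parser.close()
    return parser.contents()
```

Note that the flag and the "body" checks are hard-wired, which is the
per-tag awkwardness in question, and that comments and declarations
inside the body are silently dropped by this sketch.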

Therefore, for Python 1.6, I would like to recommend that SGMLParser be
modified to provide a method called "get_tag_contents" (or whatever)
which can be called at the point of any "end_xxx" to convey the tag's
contents (which would include not only text but contained tags and their
text).  (The reason SGMLParser has to be modified is that the index into
its "rawdata" buffer is local to its parsing routine.)
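
To make the idea concrete, here is a rough sketch of what such a
facility could look like.  It is not an existing API: it assumes
Python 3's "html.parser" module, whose "getpos()" method exposes the
current line and column, and it slices the original source between a
tag's start and end.  "ContentsParser", "feed_document" and "contents"
are hypothetical names, and the whole document is assumed to be fed in
one call.

```python
from html.parser import HTMLParser


class ContentsParser(HTMLParser):
    """Hypothetical sketch: keep the raw source found inside each tag."""

    def __init__(self):
        super().__init__()
        self.source = ""
        self.line_starts = [0]   # absolute index at which each line begins
        self.open_tags = []      # stack of (tag, index where contents begin)
        self.contents = {}       # tag name -> list of raw contents strings

    def feed_document(self, text):
        # Assumption: the entire document arrives in this single call.
        self.source = text
        self.line_starts = [0]
        for i, ch in enumerate(text):
            if ch == "\n":
                self.line_starts.append(i + 1)
        self.feed(text)
        self.close()

    def _index(self):
        # Convert getpos()'s (line, column) into an index into self.source.
        lineno, offset = self.getpos()
        return self.line_starts[lineno - 1] + offset

    def handle_starttag(self, tag, attrs):
        # Contents begin right after the start tag's own text.
        start = self._index() + len(self.get_starttag_text())
        self.open_tags.append((tag, start))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no contents to record.
        pass

    def handle_endtag(self, tag):
        end = self._index()      # position of the "</" of this end tag
        # Find the nearest matching open tag; stray end tags are ignored.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                start = self.open_tags[i][1]
                del self.open_tags[i:]
                self.contents.setdefault(tag, []).append(self.source[start:end])
                break
```

Because the contents are sliced from the source rather than rebuilt
from handler calls, nothing is specific to <BODY>: after parsing,
"contents" holds the raw contents of every tag that was closed,
contained tags and all.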

(surely-i'm-not-the-first-(or-last)-person-to-ever-want-to-capture
  -a-tag's-contents!)-ly y'rs,

=g2
-- 
_____________________________________________________________________

Grant R. Griffin                                       g2 at dspguru.com
Publisher of dspGuru                           http://www.dspguru.com
Iowegian International Corporation	      http://www.iowegian.com


