how to get text between HTML tags with URLLIB??

Alex Martelli alex at magenta.com
Sat Aug 19 05:13:43 EDT 2000


"Roy Katz" <katz at Glue.umd.edu> wrote in message
news:Pine.GSO.4.21.0008181829180.905-100000 at y.glue.umd.edu...
> Hello,
>
> I'm writing a link-checker using urllib.  I'm using the following
> technique to extract bookmark info from netscape bookmark files:
>
>   # override HTMLlib's start_a to hook in our registration function
>   class wcheck_parse( htmllib.HTMLParser ):
>       links = []
>       def start_a( self, attrs ):  self.links.append( link_type( attrs ) )

OK so far.

> And then I use HTMLlib to open and parse the bookmark file:
>
>   parser = wcheck_parse( formatter.NullFormatter() )
>   parser.feed( open( bkfname ).read()  )

Where the data come from is 'orthogonal' to how you process it.


> The problem lies in overriding start_a();  'attrs' will contain the url,
> time visited, time created, etc., but it will *not* give me the text
> between the HTML tags. So for example, for the following

Right, because processing is sequential; at tag start, you still
have not 'seen' the contents.

> link I will not get the text between the starting and closing href tags:
>
>  <a href=http://wacky.roey.com > 'Roey's Wacky Server of Fun!' </a href>
>
> This is really frustrating.  Why isn't this mentioned in the urllib docs?

Maybe you mean htmllib, because urllib just has to do with getting at
the document stream, *nothing* to do with how you then process it.

> Same thing for the text between the <DT>...</DT> tags (or for any other
> tags!)  Bottom line:  how do I get the text between html tags?

By accumulating the data as it arrives and processing it at the end tag.
That's how sequential stream processing works.  The architectural
alternative is to build an in-memory object model of the document (the
DOM approach), which is heavier but more powerful.

Consider, for example, this...:

import htmllib
import formatter

# override start_a, end_a and handle_data to accumulate the anchor text
class wcheck_parse(htmllib.HTMLParser):
    def __init__(self,*arg,**kw):
        self.accum=''
        apply(htmllib.HTMLParser.__init__,(self,)+arg,kw)
    def start_a(self, attrs):
        print 'start_a:', attrs
        self.accum=''
    def end_a(self):
        print 'end_a:', self.accum
    def handle_data(self, data):
        # print 'data:',data
        self.accum = self.accum+data


parser = wcheck_parse(formatter.NullFormatter())
parser.feed(open(r'c:\a.htm').read())


With the HTML file:
Before
<a href=http://wacky.roey.com > 'Roey's <b>Wacky</b> Server of Fun!' </a>
After


Note I've changed the incorrect close tag '</a href>' and added a <b> in the
anchor text just for fun.

This produces the output:

start_a: [('href', 'http://wacky.roey.com')]
end_a:  'Roey's Wacky Server of Fun!'

Note the <b> tag is not shown, since this class does not handle it: the
handle_data method just receives three separate pieces of data and
accumulates them in self.accum.  One can of course do better: the
start-tag method (start_a here) sets up a data structure to receive the
data as it arrives, handle_data ignores incoming data unless that
structure is in fact set up, and the end-tag method (end_a) uses the data
in the structure and dismantles it again.  If you want to preserve nested
tags, such as the above <b>, you need to handle them too; methods
unknown_starttag and unknown_endtag may be handy for this.
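
A rough sketch of that refinement follows.  It is an illustration, not the
code from this thread: it assumes Python 3's html.parser module (htmllib
was removed in Python 3), so the hooks are handle_starttag/handle_endtag
rather than start_a/end_a, and the names AnchorTextCollector and
anchor_links are made up for the example.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collect (href, text) pairs, keeping nested tags in the anchor text."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self.href = None
        self.parts = None  # None means "not currently inside an <a>"

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # Set up the structure that will receive the anchor's contents.
            self.href = dict(attrs).get('href')
            self.parts = []
        elif self.parts is not None:
            # Preserve nested tags such as <b> (attributes are dropped
            # here to keep the sketch short).
            self.parts.append('<%s>' % tag)

    def handle_endtag(self, tag):
        if self.parts is None:
            return  # not inside an anchor: nothing to record
        if tag == 'a':
            # Use the accumulated data, then dismantle the structure.
            self.links.append((self.href, ''.join(self.parts).strip()))
            self.href = None
            self.parts = None
        else:
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        # Ignore data unless the structure is in fact set up.
        if self.parts is not None:
            self.parts.append(data)


def anchor_links(html):
    """Return a list of (href, anchor text) pairs found in html."""
    parser = AnchorTextCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling anchor_links() on the contents of the sample file gives one pair per
anchor, with the <b>...</b> markup kept inside the text and the 'Before' and
'After' lines ignored.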


This 'callback'-based approach is handy if you only want to manipulate
a few things.  For anything really advanced or sophisticated the DOM is
better, if you can afford it: it has to build an in-memory model of the
whole document, while the callback-based approach may be able to run
more cheaply, and this matters for huge input documents.

Still, it's surely POSSIBLE to do anything with a callback-based approach
(that's how the DOM is built up, for example:-), though it's not
necessarily _easy_:-).
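
To make that last remark concrete, here is a minimal sketch of building a
tiny in-memory tree out of the very same callbacks.  It is not a real DOM
implementation: it assumes Python 3's html.parser, the Node, TreeBuilder
and build_tree names are invented for the example, and void elements such
as <br> are not special-cased.

```python
from html.parser import HTMLParser


class Node:
    """A very small stand-in for a DOM element (hypothetical, not W3C DOM)."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []  # Node instances or plain strings

    def text(self):
        """Concatenate all text below this node, dropping the markup."""
        return ''.join(c if isinstance(c, str) else c.text()
                       for c in self.children)

    def find_all(self, tag):
        """Yield every descendant element with the given tag name."""
        for child in self.children:
            if isinstance(child, Node):
                if child.tag == tag:
                    yield child
                yield from child.find_all(tag)


class TreeBuilder(HTMLParser):
    """Build a Node tree from the sequential start/end/data callbacks."""

    def __init__(self):
        super().__init__()
        self.root = Node('#document')
        self.stack = [self.root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the innermost matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def build_tree(html):
    """Parse html and return the root Node of the resulting tree."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Once the tree exists, getting the text between tags no longer depends on
the order of the callbacks: something like
[(a.attrs.get('href'), a.text()) for a in build_tree(data).find_all('a')]
does it, at the cost of keeping the whole document in memory.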


I *think* Python should have a DOM-based approach among its
core-library functionality, and some note as to how it may be much
easier, etc, etc.


Alex





