how to get text between HTML tags with URLLIB??
Alex Martelli
alex at magenta.com
Sat Aug 19 05:13:43 EDT 2000
"Roy Katz" <katz at Glue.umd.edu> wrote in message
news:Pine.GSO.4.21.0008181829180.905-100000 at y.glue.umd.edu...
> Hello,
>
> I'm writing a link-checker using urllib. I'm using the following
> technique to extract bookmark info from netscape bookmark files:
>
> # override HTMLlib's start_a to hook in our registration function
> class wcheck_parse( htmllib.HTMLParser ):
>     links = []
>     def start_a( self, attrs ): self.links.append( link_type( attrs ) )
OK so far.
> And then I use HTMLlib to open and parse the bookmark file:
>
> parser = wcheck_parse( formatter.NullFormatter() )
> parser.feed( open( bkfname ).read() )
Where the data come from is 'orthogonal' to how you process it.
> The problem lies in overriding start_a(); 'attrs' will contain the url,
> time visited, time created, etc., but it will *not* give me the text
> between the HTML tags. So for example, for the following
Right, because processing is sequential; at tag start, you still
have not 'seen' the contents.
> link I will not get the text between the starting and closing href tags:
>
> <a href=http://wacky.roey.com > 'Roey's Wacky Server of Fun!' </a href>
>
> This is really frustrating. Why isn't this mentioned in the urllib docs?
Maybe you mean htmllib, because urllib just has to do with getting at
the document stream, *nothing* to do with how you then process it.
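To make that division of labor concrete, here's a sketch (I'm assuming
html.parser from more recent Pythons as a stand-in for htmllib; the
callback shape is the same): feed() just takes a string, and whether
that string came from urllib, from open(), or from a literal is
entirely its caller's business.

```python
# The parser knows nothing about where the text came from: feed() takes
# a string, so urlopen(...).read() and open(...).read() feed it alike.
# (html.parser here stands in for htmllib on newer Pythons.)
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start-tag
        if tag == 'a':
            self.links.append(dict(attrs))

# This string could equally well have been fetched with urllib;
# the parsing side neither knows nor cares.
page = '<a href="http://wacky.roey.com">Fun!</a>'
parser = LinkCollector()
parser.feed(page)
print(parser.links)  # [{'href': 'http://wacky.roey.com'}]
```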
> Same thing for the text between the <DT>...</DT> tags (or for any other
> tags!) Bottom line: how do I get the text between html tags?
By accumulating the data as they arrive and processing them at the
end-tag. That's how sequential stream processing works; the
architectural alternative to it is building an in-memory object
model for the whole document, the DOM approach, which is heavier
to do though more powerful.
Consider, for example, this...:
import htmllib
import formatter

# override HTMLlib's start_a to hook in our registration function
class wcheck_parse(htmllib.HTMLParser):
    def __init__(self, *arg, **kw):
        self.accum = ''
        apply(htmllib.HTMLParser.__init__, (self,)+arg, kw)
    def start_a(self, attrs):
        print 'start_a:', attrs
        self.accum = ''
    def end_a(self):
        print 'end_a:', self.accum
    def handle_data(self, data):
        # print 'data:', data
        self.accum = self.accum + data

parser = wcheck_parse(formatter.NullFormatter())
parser.feed(open(r'c:\a.htm').read())
With the HTML file:
Before
<a href=http://wacky.roey.com > 'Roey's <b>Wacky</b> Server of Fun!' </a>
After
Note I've changed the incorrect close tag '</a href>' and added a <b> in the
anchor text just for fun.
This produces the output:
start_a: [('href', 'http://wacky.roey.com')]
end_a: 'Roey's Wacky Server of Fun!'
Note the <b> tag is not shown, since it's not handled by this class: the
handle_data method just receives three separate pieces of data and
accumulates them in self.accum. One can of course do better: the
start_tag method sets up a data structure to receive the data as they
come, handle_data ignores incoming data unless the structure is in
fact set up, end_tag uses all the data in the structure and dismantles
it again. If you want to conserve nested tags, such as the above <b>,
you need to handle them too; methods unknown_starttag and
unknown_endtag may be handy for this.
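Here's a sketch of that scheme (using html.parser from more recent
Pythons as a stand-in for htmllib; the callbacks have the same shape):
the start-tag sets up the receiving structure, handle_data appends only
while it exists, the end-tag reads it out and dismantles it, and nested
tags such as the <b> are re-emitted as text so they survive in the
captured anchor body.

```python
# Structure-based accumulation: None means "not inside an <a>", a list
# means "inside one, collecting pieces". Nested tags are conserved by
# re-inserting them as text into the accumulator.
# (html.parser stands in for htmllib on newer Pythons.)
from html.parser import HTMLParser

class AnchorText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.accum = None          # no structure set up yet
        self.anchors = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.accum = []        # set up the receiving structure
        elif self.accum is not None:
            self.accum.append('<%s>' % tag)   # conserve nested start-tag
    def handle_endtag(self, tag):
        if tag == 'a':
            self.anchors.append(''.join(self.accum))
            self.accum = None      # dismantle the structure
        elif self.accum is not None:
            self.accum.append('</%s>' % tag)  # conserve nested end-tag
    def handle_data(self, data):
        if self.accum is not None: # ignore data outside any <a>
            self.accum.append(data)

p = AnchorText()
p.feed("Before <a href='http://wacky.roey.com'>Roey's <b>Wacky</b> Server</a> After")
print(p.anchors)  # ["Roey's <b>Wacky</b> Server"]
```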
This 'callback'-based approach is handy if you only want to manipulate
a few things, but the DOM is really better as soon as you need to do
anything advanced or sophisticated, if you can afford it: it needs to
build an in-memory model of the whole document, while the callback-based
one may be able to run more cheaply, and this matters for huge input
documents.
Still, it's surely POSSIBLE to do anything with a callback-based approach
(that's how the DOM is built up, for example:-), though it's not
necessarily _easy_:-).
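For instance, here's a minimal sketch of building a DOM-like tree out
of nothing but callbacks (html.parser from more recent Pythons standing
in for htmllib): keep a stack of open nodes, push on each start-tag,
pop on each end-tag, and hang data off whatever node is on top.

```python
# A tree built from sequential callbacks: the stack's top element is
# always the innermost currently-open node.
from html.parser import HTMLParser

class Node:
    def __init__(self, tag, attrs=()):
        self.tag = tag
        self.attrs = dict(attrs)
        self.children = []         # Nodes and text strings, in order

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node('#document')
        self.stack = [self.root]
    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)  # attach to open parent
        self.stack.append(node)               # and make it the new top
    def handle_endtag(self, tag):
        if len(self.stack) > 1:               # never pop the root
            self.stack.pop()
    def handle_data(self, data):
        self.stack[-1].children.append(data)

t = TreeBuilder()
t.feed("<a href='x'>hi <b>there</b></a>")
a = t.root.children[0]
print(a.tag, [c.tag if isinstance(c, Node) else c for c in a.children])
# a ['hi ', 'b']
```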
I *think* Python should have a DOM-based approach among its
core-library functionality, and some note as to how it may be much
easier, etc, etc.
Alex