Python web client anyone?

Sun Oct 14 20:57:42 EDT 2001

On Monday 15 October 2001 10:25, Paul Rubin wrote:
> Richard Jones <richard at bizarsoftware.com.au> writes:
> > > Thanks, this appears to include an HTTP client, which is a start, but
> > > I was looking for something that actually parses the HTML on the
> > > retrieved page like LWP does.  I wonder if there's some way to do that
> > > with the XML libraries (though HTML is generally not well-formed
> > > XML--for example it usually has unterminated <P> tags).  Any
> > > thoughts?
> >
> > htmllib?
> >
> > If you want quick and simple DOM extraction, I have a module that extends
> > HTMLParser...
>
> Perl LWP is a module for easily writing robot web clients.

There's been a lot of work done with Python robots in the past. Search the 
web. For a good laff, read:

  http://www.w3.org/Tools/Python/Overview.html

Specifically: "
I have written a robot that does this, except it doesn't check for
valid SGML -- it just tries to map out the entire web.  I believe I
found roughly 50 or 60 different sites (this was maybe 2 months ago --
I'm sorry, I didn't save the output).  It took the robot about half a
day (a saturday morning) to complete.
"

The demo Guido mentions is no longer distributed with Python. There is a web 
tool in the Tools/webchecker directory of the distribution though.

>  It doesn't
> exactly make a DOM, but it's the same idea, so DOM extraction would be
> fine.  What I *really* want is to be able to easily find link objects
> (anchor tags) based on the anchor text, which LWP for some reason
> doesn't do, but DOM extraction would be a start.  By "anchor text" I
> mean the text in <a href=blah.html>this is the anchor text</a>.  The
> client should be able to find some "underlined" text on the page it
> retrieves, and "click" on the linked document.

My SimpleDOM module will do this. Here's some off-the-top-of-my-head usage:

>>> import SimpleDOM, urllib
>>> page = urllib.urlopen('http://www.w3c.org/').read()
>>> dom = SimpleDOM.SimpleDOMParser()
>>> dom.parseString(page)
>>> len(dom.getByName('a'))
144
>>> dom.getByName('a')[0]
<SimpleDOMNode "a" {'title': 'W3C Activities', 'class': 'bannerLink', 'href': 'Consortium/Activities'} (1 elements)>

Note that the parser is not particularly generous in accepting bad input. While 
trying out the example above, I also fed it these two pages, and it choked on both:

 1. www.python.org              (mismatched tag)
 2. www.bizarsoftware.com.au    (*blush* bad entity)
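
For the anchor-text lookup asked about above, here is a minimal sketch that
uses only the standard library. It is not SimpleDOM and makes no claims about
SimpleDOM's API beyond the calls shown in the session above. It assumes
Python 3 (html.parser and urllib.parse rather than the Python 2 modules used
in that session). The names AnchorCollector and find_link, and the
example.com base URL, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect (anchor text, href) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # Only anchors that carry an href count as links.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace so matching is not layout-sensitive.
            text = " ".join("".join(self._text).split())
            self.links.append((text, self._href))
            self._href = None


def find_link(html, anchor_text, base_url=""):
    """Return the URL of the first link whose text matches, else None."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    for text, href in parser.links:
        if text == anchor_text:
            # "Clicking" the link means resolving href against the page URL.
            return urljoin(base_url, href)
    return None


# Unterminated <p> tags and an unquoted href, as in the quoted example.
sample = '<p>Intro<p>See <a href=blah.html>this is the anchor text</a> now'
print(find_link(sample, "this is the anchor text", "http://example.com/docs/"))
```

The returned URL can then be fetched with whatever HTTP client is at hand.
Unlike a strict parser, html.parser does not reject mismatched tags, so this
approach trades DOM structure for tolerance of sloppy pages.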

> I may not have read the htmllib docs carefully enough but it looks more
> intended for formatting/displaying HTML than parsing it.  Are your
> DOM extensions available?

  http://bigboy.bizarsoftware.com.au/~richard/SimpleDOM.py

     Richard