scripting browsers from Python

Thu Jun 2 02:18:58 EDT 2005

On Wed, 01 Jun 2005 22:27:44 +0000, John J. Lee wrote:

> Olivier Favre-Simon <olivier.favre-simon at club-internet.fr> writes:
> 
>> On Tue, 31 May 2005 00:52:33 -0700, Michele Simionato wrote:
>> 
>> > I would like to know what is available for scripting browsers from
>> > Python.
> [...]
>> ClientForm	http://wwwsearch.sourceforge.net/ClientForm/
>> 
>> I use it for automation of POSTs of entire image directories to
>> imagevenue.com/imagehigh.com/etc hosts.
> 
> This doesn't actually address what the OP wanted: it's not a browser.

Yep, I didn't read carefully enough. He really wants browser scripting,
not web scraping.

> 
> 
>> The only drawbacks I've found are:
>> - does not support nested forms (since forms are returned in a list)
> 
> Nested forms??  Good grief.  Can you point me at a real life example of
> such HTML?  Can probably fix the parser to work around this.

What I mean is: the parser does not detect a missing </form>, so it
thinks that there are nested forms and raises a ParseError.

Browsers have an easier time spotting non-matching form tags, because
they can use the surrounding matching table or div tags to infer that the
form is closed (DOM approach).

Not easy with a SAXish approach like HTMLParser.

I don't mean that nested forms should be supported; they are crap (is this
even legal code?)
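
To illustrate the kind of workaround I have in mind, here is a minimal
sketch. It is not ClientForm's actual code: FormCollector and
collect_forms are made-up names, and it assumes the html.parser module
of a current Python 3. A second <form> start tag is simply taken as
proof that the previous </form> was forgotten.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Hypothetical collector: a new <form> implicitly closes the open one."""

    def __init__(self):
        super().__init__()
        self.forms = []       # finished forms, in document order
        self._current = None  # form being built, or None

    def _close_current(self):
        # Store the form being built, if there is one.
        if self._current is not None:
            self.forms.append(self._current)
            self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # A <form> while another is open: assume </form> is missing.
            self._close_current()
            self._current = {"attrs": dict(attrs), "controls": []}
        elif tag in ("input", "select", "textarea") and self._current is not None:
            # Only the control name is recorded in this sketch.
            self._current["controls"].append(dict(attrs).get("name"))

    def handle_endtag(self, tag):
        if tag == "form":
            self._close_current()

    def close(self):
        super().close()
        # A form still open at the end of the document is kept too.
        self._close_current()


def collect_forms(html):
    parser = FormCollector()
    parser.feed(html)
    parser.close()
    return parser.forms
```

This guesses the form boundary from the next <form> only; it does not
look at table or div tags the way a browser can.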

> 
> 
>> - does not like ill-formed HTML (Uses HTMLParser as the underlying
>> parser. you may pass a parser class as parameter (say SGMLParser for
>> greater acceptance of stupid HTML code) but it's tricky because there
>> is no well defined parser interface)
> 
> Titus Brown says he's trying to fix sgmllib (to some extent, at least).
> 
> Also, you can always feed stuff through mxTidy.
> 
> I'd like to have a reimplementation of ClientForm on top of something
> like BeautifulSoup...
> 
> 
> John

Taken separately, ClientForm, HTMLParser and SGMLParser each work well.

But it would be cool if competent people in the HTML parsing domain joined
up and defined a base parser interface, the same way smart guys did with
WSGI for web servers.

That way, libs like ClientForm would not raise, say, an AttributeError
when some custom parser class does not implement a given attribute.

Adding an otherwise unused attribute to a parser just in case it one day
has to interoperate with ClientForm sounds silly. And what if ClientForm
changes the attributes it expects?

No, really: whatever the chosen codebase, a common parser interface would
be great.
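
A rough sketch of what such a contract could look like, only to make
the idea concrete. Everything here is hypothetical: REQUIRED_MEMBERS,
missing_members and require_parser are names I made up, the method list
is just the handlers HTMLParser already has, and no existing library
defines this. The point is that a consumer checks the parser class once,
up front, instead of hitting an AttributeError halfway through a page.

```python
from html.parser import HTMLParser

# Hypothetical minimal contract a form library could rely on.
REQUIRED_MEMBERS = (
    "feed",
    "close",
    "handle_starttag",
    "handle_endtag",
    "handle_data",
)


def missing_members(parser_class):
    """Return the required method names that parser_class does not provide."""
    return [
        name
        for name in REQUIRED_MEMBERS
        if not callable(getattr(parser_class, name, None))
    ]


def require_parser(parser_class):
    """Fail early, with a readable message, if the contract is not met."""
    missing = missing_members(parser_class)
    if missing:
        raise TypeError(
            "%s lacks required parser methods: %s"
            % (parser_class.__name__, ", ".join(missing))
        )
    return parser_class


class IncompleteParser:
    """Example of a custom parser that only implements part of the contract."""

    def feed(self, data):
        pass
```

With this, require_parser(HTMLParser) passes, while
require_parser(IncompleteParser) raises a TypeError that names the
missing methods.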