[Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike [was: Re: Python version...]

Sun Nov 30 17:18:56 EST 2003

John J Lee spoo'd forth:
> On Sun, 30 Nov 2003, Stuart Langridge wrote:
>> > Is this aimed at the standard library?  xml.dom.ext.reader.HtmlLib?
>> Um. What I was looking for was something that could parse HTML
>> (including invalid HTML) and give me a DOM tree. I tried Twisted's
> 
> Fine, but what we're talking about here is what should go into Python's
> standard library.

True enough. I fear, though, that without *something* that can cope
with invalid HTML, a WWW::Mechanize-style thing is going to be pretty
darn hard...

> [...]
>> I think
>> that a DOM parser for HTML is pretty important, even if that parser
>> *actually* just does "convert broken HTML to valid XHTML and then feed
>> it to minidom" or something similar. Are there any others?
> 
> There are lots of XML DOM implementations for Python (only one HTML DOM
> implementation, though: 4DOM -- and that's out of date), including the one
> that's already in the standard library.  Parsing arbitrary HTML is hard,
> though (xml.dom.ext.reader.HtmlLib doesn't even manage to generate an HTML
> DOM from arbitrary *correct* HTML, and correct HTML is not often seen in
> the wild ;-).  tidylib is the only sane way I know of.  See below.

*nod* Your notes on tidylib are useful -- I didn't know about it. That
said, as long as it isn't in the stdlib it's in the same position as
HtmlLib: better maintained, true, but still nothing that a stdlib
module could depend on.
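
To make the "tolerate broken HTML, then hand it to minidom" idea concrete,
here is a rough sketch in current Python. It is not tidylib and makes no
attempt to repair markup the way tidy does: it only assumes that the
stdlib's html.parser reports tags without choking on tag soup, and it
closes whatever is left open at the end. The names SoupToDom, parse_html
and the synthetic "root" wrapper element are made up for illustration.

```python
from html.parser import HTMLParser
from xml.dom import minidom

# Elements that never have content, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class SoupToDom(HTMLParser):
    """Build a minidom tree from possibly invalid HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.doc = minidom.Document()
        # minidom allows only one element under the Document, so everything
        # hangs off an artificial wrapper element (an assumption of this sketch).
        root = self.doc.createElement("root")
        self.doc.appendChild(root)
        self.stack = [root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        element = self.doc.createElement(tag)
        for name, value in attrs:
            # Valueless attributes such as <input disabled> arrive as None.
            element.setAttribute(name, name if value is None else value)
        self.stack[-1].appendChild(element)
        if tag not in VOID:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, along with anything
        # left open inside it; ignore end tags that match nothing.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tagName == tag:
                del self.stack[i:]
                return

    def handle_data(self, data):
        self.stack[-1].appendChild(self.doc.createTextNode(data))


def parse_html(text):
    parser = SoupToDom()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return parser.doc


dom = parse_html('<P>Hello <b>world<p>Second<br>line <A HREF="x.html">link')
print(dom.documentElement.toxml())
```

Unclosed elements simply nest (the second <p> ends up inside the <b>),
which is exactly the kind of thing a real tidy pass would sort out
properly, so treat this as a demonstration of the plumbing only.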

>> > Why isn't it a subclass of urllib2.OpenerDirector (or, better, from
> [...]
>> Because I didn't know about it. This is because "urllib.urlopen" is
>> hardwired into my fingers, and then I just overrode it with
>> ClientCookie when I needed cookie handling. I'm entirely happy to have
>> it work totally differently; this was really a proof-of-concept to get
>> the ball rolling rather than a submission for direct inclusion.
> 
> Sure (you don't mean proof-of-concept, but I know what you mean). 

Very true, yes, and thanks :)
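
For what it's worth, here is roughly what the OpenerDirector route could
look like. This is a sketch in current Python, where OpenerDirector lives
in urllib.request and cookie handling of the ClientCookie kind lives in
http.cookiejar. The class name Browser and its cookies attribute are
invented for the example; it installs only a handful of the handlers that
build_opener() would normally set up.

```python
import http.cookiejar
import urllib.request


class Browser(urllib.request.OpenerDirector):
    """An OpenerDirector with its own cookie jar (hypothetical sketch)."""

    def __init__(self):
        super().__init__()
        self.cookies = http.cookiejar.CookieJar()
        handlers = [
            urllib.request.HTTPHandler(),
            urllib.request.HTTPDefaultErrorHandler(),
            urllib.request.HTTPRedirectHandler(),
            urllib.request.HTTPErrorProcessor(),
            # Stores cookies from responses and sends them on later requests.
            urllib.request.HTTPCookieProcessor(self.cookies),
        ]
        # HTTPSHandler is only defined when Python was built with SSL support.
        if hasattr(urllib.request, "HTTPSHandler"):
            handlers.append(urllib.request.HTTPSHandler())
        for handler in handlers:
            self.add_handler(handler)


browser = Browser()
# browser.open(url) would now behave like urlopen(), but with cookies kept
# in browser.cookies between requests.
```

Subclassing this way means anything written against open() keeps working,
and extra behaviour can be bolted on as further handlers.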

> Should tidylib be in the standard library?  On one hand, I lean towards
> "no", because HTML is (in theory) on the way out.  OTOH, if it's going to
> take another thirty years for HTML to completely go away, that may be a
> silly attitude to take!  Opinions?  If it were to be in the std. lib., I
> guess somebody would need to write a non-ctypes wrapper.

I really think that HTML is not going away any time soon. Moreover,
there are still issues with XHTML (like which content-type to serve it
as). It's certainly reasonable to make tools only *produce* newer
variants, but you have to be able to consume all kinds of invalid
rubbish or you'll never be able to look at the web at all :)

> [...]
>> > No .forward() / .backward() methods?
>>
>> Didn't think of them until after I sent the message out. They'd be
>> pretty trivial to implement, though, although I don't know what you'd
>> do about the "This page contains POSTDATA" issue that browsers get.
> [...]
> 
> You're allowed to do whatever you like, really (RFC 2616 section 13.13).

Re-posting and not re-posting are both iffy, though, hence the choice
that browsers offer. Admittedly, backward() and forward() could take a
repostData parameter, but you'd have to know beforehand whether you
wanted to re-post, since a script can't be prompted the way a browser
user can. Hm.
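
A minimal sketch of that repostData idea, in current Python, might look
like the following. Everything here is hypothetical: the History class,
the repost_data spelling, and the fetch callable standing in for whatever
actually performs the request. By default it replays the stored response,
and it only re-sends POST data when explicitly asked to.

```python
class History:
    """Linear browsing history; `fetch` is any callable(url, data) -> response."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._entries = []  # (url, data, response); data is None for a GET
        self._pos = -1

    def open(self, url, data=None):
        response = self._fetch(url, data)
        # A fresh request discards any "forward" entries, as browsers do.
        del self._entries[self._pos + 1:]
        self._entries.append((url, data, response))
        self._pos += 1
        return response

    def _move(self, step, repost_data):
        new_pos = self._pos + step
        if not 0 <= new_pos < len(self._entries):
            raise IndexError("no history entry in that direction")
        url, data, response = self._entries[new_pos]
        if data is not None and repost_data:
            # The caller decided up front that re-sending the POST is safe.
            response = self._fetch(url, data)
            self._entries[new_pos] = (url, data, response)
        self._pos = new_pos
        return response

    def backward(self, repost_data=False):
        return self._move(-1, repost_data)

    def forward(self, repost_data=False):
        return self._move(+1, repost_data)
```

Defaulting to the stored response means a plain backward() never re-sends
POST data behind the caller's back; whether that is the right default is
precisely the open question.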

sil

-- 
Medio tutissimus ibis.
(You will travel safest in a middle course)
	   -- family motto