[Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike

Tue Dec 2 11:13:22 EST 2003

Simon Willison spoo'd forth:
> Stuart Langridge wrote:
>> I don't see that tidy's ability to tidy HTML per se is useful, but I
>> think that it's very useful in that it can take invalid HTML and
>> convert it to valid XHTML. That way, we can get a DOM tree from invalid
>> HTML, which is very useful...
> 
> Is there any way we could get a DOM tree from invalid HTML using pure 
> Python tools? The HTML tools in the Python standard library at the 
> moment are all pure Python. Could we even use the existing sgmllib 
> module (or an extension of it) to create our own DOM tree from invalid HTML?

Presumably we could (existing tools such as HtmlLib or microdom already
do it); I was just hoping we wouldn't have to implement it ourselves if
we didn't need to :)
I'm not all that keen on sgmllib, either -- parsing invalid HTML strikes
me as pretty hard, given how much effort browsers put into it. I don't
know, however, whether the hard part is *displaying* it right rather
than just *parsing* it.
Thought: Grail was a browser written in Python, so it might already have
done this?
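
To make the "pure Python" route concrete, here is a rough sketch of
building a DOM tree from tag soup with nothing but the standard library.
It is only an illustration under some loud assumptions: it uses
html.parser.HTMLParser as a stand-in for sgmllib (which is not available
in Python 3), xml.dom.minidom for the tree, and the names
LenientDOMBuilder, parse_soup, VOID and the synthetic "root" element are
all made up for this example. It does not try to reproduce how browsers
recover from errors; it just ignores stray end tags and closes anything
still left open.

```python
from html.parser import HTMLParser
from xml.dom.minidom import Document

# Assumption: a small, incomplete set of elements that never take an end tag.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class LenientDOMBuilder(HTMLParser):
    """Hypothetical helper: turn possibly invalid HTML into a minidom tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.doc = Document()
        # Synthetic root so that stray top-level text and tags have a parent.
        root = self.doc.createElement("root")
        self.doc.appendChild(root)
        self.stack = [root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = self.doc.createElement(tag)
        for name, value in attrs:
            # Valueless attributes arrive as None; store an empty string.
            node.setAttribute(name, value or "")
        self.stack[-1].appendChild(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, plus anything
        # left open inside it. An end tag with no match is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tagName == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Skip whitespace-only runs to keep the tree small.
        if data.strip():
            self.stack[-1].appendChild(self.doc.createTextNode(data))


def parse_soup(html):
    """Parse a string of tag soup and return a minidom Document."""
    builder = LenientDOMBuilder()
    builder.feed(html)
    builder.close()
    return builder.doc
```

For example, parse_soup("<p>one<b>bold</p><br>two</i>") yields a tree
that serialises (via documentElement.toxml()) as
<root><p>one<b>bold</b></p><br/>two</root>: the unclosed <b> is closed
along with its <p>, and the stray </i> is dropped. Note that an input
like "<li>a<li>b" would nest the second <li> inside the first, which is
not what a browser does, so this really only covers the *parsing* half
of the problem.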

sil

-- 
2. Make it halfway normal. I don't have any use for
laser-beam-shooting pocket combs, or non-existent existents existing
within their own existences, or ballpoint pens made out of lettuce.
	   -- CardinalT dictates rules for the raif Silly Game