[Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike

Wed Dec 3 10:40:58 EST 2003

On Wed, 3 Dec 2003, Casey Duncan wrote:
> On Wed, 3 Dec 2003 14:23:00 +0000 (GMT) John J Lee <jjl at pobox.com> wrote:
[...]
> > from tidy import tidy
> > xhtml = tidy(html)
>
> That would be a pretty easy wrapper methinks. At first that was pretty
> much all I thought tidylib would do, but it exposes its object model in
> such a way that you could parse HTML directly to a DOM if you wanted to.

Loss is inevitable if you're tidying.  How could it be otherwise?

Usually you don't get huge DOMs from HTML documents, unlike XML, so that's
not a major problem -- I hope!  Marc-Andre's page talks about poor
performance from HTMLTidy due to character-based operation, but I don't
know how severe that is or whether it's been addressed in tidylib.

4DOM seems damn slow (I may be unfairly blaming 4DOM, since I'm using a
hacked version with JavaScript interpretation on top, so it could easily
be my fault, or the fault of the JS code I'm running), but of course there
are faster, more compliant implementations, so that shouldn't be a
problem.

Finally, DOM *processing* might well be faster if you use tidylib just as
a tidier and hand its output to a separate DOM implementation than if you
use tidylib's own object model as the DOM (especially if you wrap the
tidy-DOM to get a real, compliant DOM).
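
A rough sketch of the tidier-only approach, to make it concrete: tidy to
XHTML text, then parse that with an ordinary DOM (here the standard
library's xml.dom.minidom). The `tidy` argument and `fake_tidy` are made
up for illustration; they stand in for the hypothetical tidy() wrapper
quoted above and are not tidylib's actual API.

```python
from xml.dom import minidom


def html_to_dom(html, tidy):
    """Tidy messy HTML into XHTML text, then parse it with a stock DOM.

    `tidy` is any callable that takes HTML text and returns well-formed
    XHTML text. It is an assumption standing in for a real tidy binding.
    """
    xhtml = tidy(html)
    return minidom.parseString(xhtml)


def fake_tidy(html):
    # Placeholder only: a real tidier would repair `html`. This one
    # ignores its input and returns a fixed, well-formed document.
    return "<html><body><p>Hello</p></body></html>"


doc = html_to_dom("<p>Hello", fake_tidy)
paragraphs = doc.getElementsByTagName("p")
print(paragraphs[0].firstChild.data)
```

All the DOM work after the parse happens in whatever DOM implementation
you picked, so the tidier's own object model never enters into it.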

John