[Web-SIG] HTML parsers and DOM; WWW::Mechanize work-alike

Wed Dec 3 10:08:59 EST 2003

On Wed, 3 Dec 2003 14:23:00 +0000 (GMT)
John J Lee <jjl at pobox.com> wrote:

> On Tue, 2 Dec 2003, Casey Duncan wrote:
> [...]
> > OTOH, if anyone has a better idea, I'm all ears. What kind of api do people want?
> [...]
> 
> from tidy import tidy
> xhtml = tidy(html)

That would be a pretty easy wrapper methinks. At first that was pretty much all I thought tidylib would do, but it exposes its object model in such a way that you could parse HTML directly to a DOM if you wanted to.

If you merely use tidy to create XHTML and then parse that, you are doing a DOM parse twice. Not only is that inefficient, it's probably lossy (depending on how strict the conversion is). Cycles are cheap, so I'm willing to live with inefficiency if it means forward progress in functionality. The lossiness, though, might not be so great.
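
To make the two-pass idea concrete, here is a minimal sketch of the second pass only, using the stdlib's xml.dom.minidom. The XHTML string is a hand-written stand-in for whatever a tidy() wrapper would return (it is not real tidy output), and parse_xhtml and text_of are hypothetical helper names.

```python
from xml.dom import minidom

# Hand-written stand-in for the output of a tidy() wrapper (an assumption,
# not real tidy output).
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Example</title></head>"
    "<body><p>Hello <b>world</b></p></body>"
    "</html>"
)


def parse_xhtml(text):
    """Second pass: parse already-tidied XHTML into a DOM document."""
    return minidom.parseString(text)


def text_of(node):
    """Concatenate all text found beneath a DOM node."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        else:
            parts.append(text_of(child))
    return "".join(parts)


doc = parse_xhtml(xhtml)
title = doc.getElementsByTagName("title")[0]
paragraph = doc.getElementsByTagName("p")[0]
print(text_of(title))
print(text_of(paragraph))
```

Anything tidy dropped or rewritten in the first pass is already gone by the time this parse runs, which is where the lossiness would come from.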

So maybe the approach should be:

1. Expose the basic functionality that the tidy binary has as a Python function and see how we like it. I think this is worthwhile regardless of whether it makes it into the stdlib.

2. Think about whether we want/need a direct HTML->DOM parser. And then decide how much we need it 8^)

3. Go get a beer and think about something entirely different.
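
Step 1 could be sketched roughly as below, shelling out to the tidy binary rather than binding tidylib. This is only an illustration under assumptions: it presumes a tidy executable on the PATH, the flags are the command-line options as I understand them, and build_tidy_command and the executable parameter are hypothetical names.

```python
import shutil
import subprocess


def build_tidy_command(executable="tidy"):
    """Return the argument list used to convert HTML to XHTML.

    The flags are assumed from the HTML Tidy command line: quiet output,
    XHTML conversion, UTF-8 in and out, warnings suppressed.
    """
    return [executable, "-quiet", "-asxhtml", "-utf8", "--show-warnings", "no"]


def tidy(html, executable="tidy"):
    """Run HTML text through the tidy binary and return XHTML text."""
    if shutil.which(executable) is None:
        raise RuntimeError("tidy executable not found: %r" % executable)
    proc = subprocess.run(
        build_tidy_command(executable),
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy exits with 0 on success, 1 if there were warnings, 2 on errors.
    if proc.returncode >= 2:
        raise ValueError(proc.stderr.decode("utf-8", "replace"))
    return proc.stdout.decode("utf-8")
```

With that in place, the quoted usage becomes xhtml = tidy(html). A tidylib binding would avoid the subprocess overhead, but this is enough to try the API out.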

-Casey