HTML DOM parser?

Thu Jul 31 21:45:56 EDT 2003

Paul Rubin <http://phr.cx@NOSPAM.invalid> writes:

> Is there an HTML DOM parser available for Python?  Preferably one that
> does a reasonable job with the crappy HTML out there on real web
> pages, that doesn't get upset about unterminated tables and stuff like
> that.  Many extra points if it understands Javascript.  Application is
> a screen scraping web robot.  Thanks.

glork.  I just started working on this myself.

Email me if you'd like the code, such as it is.  I've wrapped the
Mozilla JS interpreter but am currently stuck on a segfault, so I
could certainly do with a collaborator.

I'm using utidylib (to clean up the HTML first) and 4DOM (the DOM
implementation from PyXML).
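
To give a rough idea of the "forgive the markup, then build a DOM" approach,
here is a minimal sketch. It is not my code and it does not use utidylib or
4DOM: it swaps in the standard library's html.parser and xml.dom.minidom so
that it runs without extra installs. The names LenientDOMBuilder, parse_html,
VOID and the synthetic "root" element are all made up for illustration, and
the recovery rule (close back to the nearest matching open tag, ignore stray
closers) is an assumption, far cruder than what Tidy does.

```python
from html.parser import HTMLParser
from xml.dom.minidom import Document

# Elements that never take a closing tag in HTML (partial list, assumption).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class LenientDOMBuilder(HTMLParser):
    """Hypothetical helper: builds a minidom tree from untidy HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.doc = Document()
        # Synthetic wrapper so fragments with several top-level nodes fit.
        self.stack = [self.doc.appendChild(self.doc.createElement("root"))]

    def handle_starttag(self, tag, attrs):
        node = self.doc.createElement(tag)
        for name, value in attrs:
            node.setAttribute(name, value or "")
        self.stack[-1].appendChild(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element.
        # A closer with no matching opener is silently ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tagName == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Skip whitespace-only runs to keep the tree small.
        if data.strip():
            self.stack[-1].appendChild(self.doc.createTextNode(data))


def parse_html(text):
    """Parse possibly malformed HTML and return a minidom Document."""
    builder = LenientDOMBuilder()
    builder.feed(text)
    builder.close()  # flushes any trailing text
    return builder.doc


# Unterminated cells, row and paragraph: still yields a usable tree.
doc = parse_html("<table><tr><td>one<td>two</table><p>after")
cells = doc.getElementsByTagName("td")
para = doc.getElementsByTagName("p")[0]
```

Note that this does not auto-close an open <td> when the next <td> arrives, so
the second cell ends up nested inside the first. Getting that right is exactly
the job a real tidying pass is for.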

Mind you, if you actually want to get a job done <wink>, for a
quick-but-bulky (and somewhat closed) solution, try PyKDE (KHTML /
KJS) or IE automation (MSHTML / JScript).  Mozilla + XPCOM is another
option, but I think it requires rebuilding Mozilla to get PyXPCOM
support.  There's also httpunit (in Java, usable from Jython).

John