HTML DOM parser?

David Boddie davidb at mcs.st-and.ac.uk
Fri Aug 1 18:20:05 EDT 2003


Paul Rubin <http://phr.cx@NOSPAM.invalid> wrote in message news:<7x7k5y5wfh.fsf_-_ at ruckus.brouhaha.com>...
> Is there an HTML DOM parser available for Python?  Preferably one that
> does a reasonable job with the crappy HTML out there on real web
> pages, that doesn't get upset about unterminated tables and stuff like
> that.  Many extra points if it understands Javascript.  Application is
> a screen scraping web robot.  Thanks.

As John J. Lee points out in another message in this thread, the KHTML
bindings in PyKDE might be useful, particularly for cutting through the
JavaScript to earn those extra points.

However, what's wrong with the following (it needs PyXML installed)?

>>> from xml.dom.ext.reader import HtmlLib
>>> reader = HtmlLib.Reader()
>>> document = reader.fromUri("http://www.python.org/index.html")

The help for xml.dom.ext.reader.HtmlLib says:

DESCRIPTION
    Components for reading HTML files using htmllib.py.
    WWW: http://4suite.com/4DOM         e-mail: support at 4suite.com

    Copyright (c) 2000 Fourthought Inc, USA.   All Rights Reserved.
    See  http://4suite.com/COPYRIGHT  for license and copyright information

I've no idea how it performs on badly written pages.
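
If PyXML isn't to hand, the same idea can be roughed out with only the
standard library. The following is a sketch, not what HtmlLib does
internally: it assumes Python 3, where the parser lives in html.parser,
and the names TolerantDomBuilder, parse_html and the synthetic "root"
element are made up for illustration. It ignores stray end tags and
silently closes anything left open, but it does not apply HTML's implied
end tag rules, so the tree can nest more deeply than a browser's would.

```python
from html.parser import HTMLParser
from xml.dom.minidom import Document

# Elements that never take an end tag in HTML.
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}


class TolerantDomBuilder(HTMLParser):
    """Hypothetical helper: build a minidom tree from sloppy HTML."""

    def __init__(self):
        super().__init__()
        self.document = Document()
        # A synthetic wrapper element (an assumption of this sketch), so
        # that pages with several top-level nodes still fit in one tree.
        root = self.document.createElement("root")
        self.document.appendChild(root)
        self.stack = [root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        element = self.document.createElement(tag)
        for name, value in attrs:
            # Attributes without a value arrive as None.
            element.setAttribute(name, value or "")
        self.stack[-1].appendChild(element)
        if tag not in VOID:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close the nearest matching open element, along with anything
        # left unterminated inside it. Never pop the synthetic root.
        for index in range(len(self.stack) - 1, 0, -1):
            if self.stack[index].tagName == tag:
                del self.stack[index:]
                return
        # No matching open element: ignore the stray end tag.

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].appendChild(self.document.createTextNode(data))


def parse_html(text):
    """Return a minidom Document for the given HTML string."""
    builder = TolerantDomBuilder()
    builder.feed(text)
    builder.close()
    return builder.document
```

With that, parse_html(page_source).getElementsByTagName("a") gives the
anchor elements of a page, and getAttribute("href") on each one gives its
target, which is most of what a scraping robot needs from the DOM.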

David
-- 
http://www.boddie.org.uk/david/Projects/



