How *extract* data from XHTML Transitional web pages? got xml.dom.minidom troubles..

James Graham jg307 at cam.ac.uk
Fri Mar 2 19:27:25 EST 2007


seberino at spawar.navy.mil wrote:
> I'm trying to extract some data from an XHTML Transitional web page.
> 
> What is best way to do this?

May I suggest html5lib [1]? It is based on the parsing section of the 
WHATWG "HTML5" spec [2], which is in turn based on the behavior of major 
web browsers, so it should parse more or less* any invalid markup you 
throw at it. Despite the name, html5lib works with any (X)HTML 
document. Out of the box, you can choose to produce a minidom tree, 
an ElementTree, or a "simpletree" - a lightweight DOM-like tree 
specific to html5lib.
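
To make that concrete, here is a rough sketch of pulling link targets 
out of sloppy markup using both the minidom and ElementTree builders. 
It assumes a reasonably recent html5lib, where the html5lib.parse() 
convenience function and its treebuilder / namespaceHTMLElements 
arguments are available; the 0.2-era API may well differ. The sample 
markup and the function names are made up for illustration.

```python
import html5lib

# Made-up sample input: deliberately sloppy markup with unclosed
# <p> and <li> elements, the kind that trips up a strict XML parser.
MARKUP = """<html><head><title>Demo</title></head>
<body><p>First paragraph<p>Second paragraph
<ul><li><a href="/a">A</a><li><a href="/b">B</a></ul>
</body></html>"""


def extract_links_dom(markup):
    """Parse into an xml.dom.minidom tree and collect href values."""
    doc = html5lib.parse(markup, treebuilder="dom")
    return [a.getAttribute("href") for a in doc.getElementsByTagName("a")]


def extract_links_etree(markup):
    """Parse into an ElementTree and collect href values.

    namespaceHTMLElements=False keeps tag names plain ("a" rather
    than "{http://www.w3.org/1999/xhtml}a"), which keeps lookups short.
    """
    root = html5lib.parse(
        markup, treebuilder="etree", namespaceHTMLElements=False
    )
    return [a.get("href") for a in root.iter("a")]


if __name__ == "__main__":
    print(extract_links_dom(MARKUP))
    print(extract_links_etree(MARKUP))
```

Which tree to pick is mostly a matter of taste: the minidom route keeps 
the getElementsByTagName() calls you may already have, while the 
ElementTree route tends to be terser.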

If you are happy to pull from SVN, I recommend that version; it has a 
few bug fixes over the 0.2 release, as well as improved features such as 
better error reporting and detection of the encoding from <meta> 
elements (the next release is imminent).

[1] http://code.google.com/p/html5lib/
[2] http://whatwg.org/specs/web-apps/current-work/#parsing

* There might be a problem if, for example, the document uses a 
character encoding that Python does not support; otherwise it should 
parse anything.


