How *extract* data from XHTML Transitional web pages? got xml.dom.minidom troubles..
James Graham
jg307 at cam.ac.uk
Fri Mar 2 19:27:25 EST 2007
seberino at spawar.navy.mil wrote:
> I'm trying to extract some data from an XHTML Transitional web page.
>
> What is best way to do this?
May I suggest html5lib [1]? It's based on the parsing section of the
WHATWG "HTML5" spec [2], which is in turn based on the behavior of major
web browsers, so it should parse more or less* any invalid markup you
throw at it. Despite the name "html5lib", it works with any (X)HTML
document. Out of the box, you can choose to produce a minidom tree,
an ElementTree, or a "simpletree" - a lightweight DOM-like
html5lib-specific tree.
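
As a rough illustration of the minidom option, here is a minimal sketch that pulls the links out of a small XHTML Transitional page. It assumes a current html5lib release in which html5lib.parse(markup, treebuilder="dom") returns a minidom document; the 0.2 release and the SVN version mentioned in this message may expose a different interface. The names SAMPLE, parse_to_minidom, text_of and extract_links are made up for the example. If html5lib is not installed, the sketch falls back to xml.dom.minidom, which only copes with well-formed input.

```python
from xml.dom import minidom

try:
    import html5lib  # third-party package; assumed to be a current release
except ImportError:
    html5lib = None

# Hypothetical sample page, invented for this example.
SAMPLE = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example</title></head>
  <body>
    <p>See <a href="http://www.python.org/">Python</a> and
       <a href="http://code.google.com/p/html5lib/">html5lib</a>.</p>
  </body>
</html>"""


def parse_to_minidom(markup):
    """Return a minidom Document for the given markup."""
    if html5lib is not None:
        # Browser-style parsing: tolerant of invalid markup.
        return html5lib.parse(markup, treebuilder="dom")
    # Fallback: strict XML parsing, fails on markup that is not well formed.
    return minidom.parseString(markup)


def text_of(node):
    """Concatenate the text found in all descendants of a DOM node."""
    parts = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            parts.append(text_of(child))
    return "".join(parts)


def extract_links(markup):
    """Return a list of (href, link text) pairs for every <a href=...>."""
    document = parse_to_minidom(markup)
    return [
        (a.getAttribute("href"), text_of(a).strip())
        for a in document.getElementsByTagName("a")
        if a.hasAttribute("href")
    ]


if __name__ == "__main__":
    for href, text in extract_links(SAMPLE):
        print(href, text)
```

Once the tree is built, the extraction code only uses the ordinary minidom interface, so the same functions work regardless of which parser produced the document.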
If you are happy to pull from SVN I recommend that version; it has a few
bug fixes over the 0.2 release as well as improved features including
better error reporting and detection of encoding from <meta> elements
(the next release is imminent).
[1] http://code.google.com/p/html5lib/
[2] http://whatwg.org/specs/web-apps/current-work/#parsing
* There might be a problem if, for example, the document uses a character
encoding that Python does not support; otherwise it should parse anything.