html DOM
Gabriel Genellina
gagsl-py2 at yahoo.com.ar
Sun Mar 30 00:58:53 EDT 2008
On Sun, 30 Mar 2008 00:19:08 -0300, Michael Wieher
<michael.wieher at gmail.com> wrote:
> Was this not of any use?
>
> http://www.boddie.org.uk/python/HTML.html
>
> I think, since HTML is a sub-set of XML, any XML parser could be adapted
> to do this...
That's not true. A perfectly valid HTML document might not even be
well-formed XML: some closing tags are optional, attribute values may be
unquoted, tags may be written in uppercase, etc. Example:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd">
<HTML><TITLE>Invalid xml</title><p Id=Abc>a</html>
The above document validates with no errors at http://validator.w3.org .
If you are talking about XHTML documents, yes, they *should* be valid XML
documents.
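As a quick check of the claim above, here is a minimal sketch that feeds the
example document to a strict XML parser. It assumes a current Python 3 with
the standard xml.etree.ElementTree module; the helper name
is_well_formed_xml is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The HTML 4.01 example from this message, unchanged.
HTML_DOC = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd">
<HTML><TITLE>Invalid xml</title><p Id=Abc>a</html>'''

# An XML-style rewrite of the same content: lowercase tags that match,
# quoted attribute values, every element explicitly closed.
XML_STYLE_DOC = ('<html><head><title>ok</title></head>'
                 '<body><p id="Abc">a</p></body></html>')


def is_well_formed_xml(text):
    """Return True if a strict XML parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed_xml(HTML_DOC))       # False
print(is_well_formed_xml(XML_STYLE_DOC))  # True
```

The parser gives up at the first problem it meets (here the mismatched
<TITLE>...</title> pair), so it never gets as far as the unquoted attribute.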
> I doubt there's an HTML-specific version, but I would imagine you
> could wrap any XML parser, or really, create your own that derives from
> the XML-parser-class...
The problem is that many HTML and XHTML pages that you find on the web
aren't valid, and some are ridiculously invalid. Browsers have a "quirks"
mode and can more or less guess the writer's intent, but only because
HTML tags have a known meaning. A generic XML parser, on the other hand,
usually just refuses to continue parsing an ill-formed document. You can't
simply "adapt any XML parser to do that".
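To illustrate the difference, here is a rough sketch using the lenient
html.parser module from the Python 3 standard library. It is only a stand-in
for an HTML-aware parser, not a browser's quirks mode; the class TagCollector
and the function collect are names invented for this example.

```python
from html.parser import HTMLParser

# The body of the earlier example: uppercase tags, a mismatched-case
# closing tag, an unquoted attribute, and an unclosed <p>.
SOUP = '<HTML><TITLE>Invalid xml</title><p Id=Abc>a</html>'


class TagCollector(HTMLParser):
    """Record start tags and non-blank text instead of failing (illustrative)."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        self.start_tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def collect(markup):
    """Parse markup leniently and return (start_tags, text_chunks)."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()
    return parser.start_tags, parser.text


tags, text = collect(SOUP)
print(tags)
print(text)
```

Unlike a strict XML parser, this one reports every tag it recognizes and
keeps going. Deciding what the resulting structure should be (for example,
where the unclosed <p> ends) is still left to the caller.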
BeautifulSoup, for example, does a very good job of interpreting the
"tag soup" and extracting data from it, and may be useful to the OP.
http://www.crummy.com/software/BeautifulSoup/
--
Gabriel Genellina