html DOM

Sun Mar 30 00:58:53 EDT 2008

On Sun, 30 Mar 2008 00:19:08 -0300, Michael Wieher
<michael.wieher at gmail.com> wrote:

> Was this not of any use?
>
> http://www.boddie.org.uk/python/HTML.html
>
> I think, since HTML is a sub-set of XML, any XML parser could be adapted
> to do this...

That's not true. A perfectly valid HTML document might not even be
well-formed XML: some closing tags are optional, attribute values need not
be quoted, tags may be written in uppercase, etc. Example:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"  
"http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd">
<HTML><TITLE>Invalid xml</title><p Id=Abc>a</html>

The above document validates with no errors at http://validator.w3.org
If you are talking about XHTML documents, then yes, they *should* be valid
XML documents.
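
To make the difference concrete, here is a minimal sketch using only the
standard library's xml.etree.ElementTree. The helper name
is_well_formed_xml and the XHTML_DOC sample are my own, made up for
illustration; HTML_DOC is the document shown above.

```python
import xml.etree.ElementTree as ET

# The valid HTML 4.01 document from above.
HTML_DOC = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/1999/REC-html401-19991224/strict.dtd">\n'
    '<HTML><TITLE>Invalid xml</title><p Id=Abc>a</html>'
)

# Hypothetical XHTML-style rewrite: lowercase tags, quoted attribute,
# every element explicitly closed.
XHTML_DOC = (
    '<html><head><title>Valid xml</title></head>'
    '<body><p id="Abc">a</p></body></html>'
)

def is_well_formed_xml(text):
    """Return True if a generic XML parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # Mismatched tag case, unquoted attributes, missing end tags...
        return False
    return True

print(is_well_formed_xml(HTML_DOC))   # False
print(is_well_formed_xml(XHTML_DOC))  # True
```

The first call fails as soon as the parser meets </title> closing
<TITLE>, since XML tag names are case sensitive.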

> I doubt there's an HTML-specific version, but I would imagine you
> could wrap any XML parser, or really, create your own that derives from
> the XML-parser-class...

The problem is that many HTML and XHTML pages found on the web aren't
valid; some are ridiculously invalid. Browsers have a "quirks" mode and can
more or less guess the writer's intent only because HTML tags carry some
meaning. A generic XML parser, on the other hand, usually just refuses to
continue parsing an ill-formed document. You can't simply "adapt any XML
parser to do that".

BeautifulSoup, for example, does a very good job of interpreting the "tag
soup" and extracting data from it, and may be useful to the OP.
http://www.crummy.com/software/BeautifulSoup/
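
As a rough illustration of the lenient approach, without installing
anything, the sketch below uses the standard library's html.parser module
rather than BeautifulSoup itself. It only tokenizes the input and does not
build a tree, so it is a stand-in for the idea, not for BeautifulSoup's
API. The class TagSoupCollector and the function collect are names I made
up for this example.

```python
from html.parser import HTMLParser

class TagSoupCollector(HTMLParser):
    """Record start tags and text without demanding well-formed input."""

    def __init__(self):
        super().__init__()
        self.start_tags = []  # list of (tag_name, attribute_dict)
        self.text = []        # non-blank text fragments, in order

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive already lowercased.
        self.start_tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

def collect(markup):
    """Feed markup to the lenient parser and return what it saw."""
    parser = TagSoupCollector()
    parser.feed(markup)
    parser.close()
    return parser.start_tags, parser.text

# Same body as the example document: mixed case, unquoted attribute,
# and no closing tag for <p>.
TAG_SOUP = '<HTML><TITLE>Invalid xml</title><p Id=Abc>a</html>'

tags, text = collect(TAG_SOUP)
for name, attributes in tags:
    print(name, attributes)
print(text)
```

Where an XML parser would stop at the first mismatch, this one keeps going
and should report the html, title and p tags plus the two text fragments.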

-- 
Gabriel Genellina