[Web-SIG] DOM-based templating

Ian Bicking ianb at colorstudy.com
Fri Jun 3 18:41:38 CEST 2005


James Y Knight wrote:
> On Jun 3, 2005, at 2:18 AM, Ian Bicking wrote:
> 
>> * Can parse HTML, not just XHTML.  Not the crazy HTML browsers parse,
>> but unambiguous well-formed HTML.  I don't like the idea of putting the
>> HTML through tidy; that's fine for a screen-scraper, but is way too
>> defensive for this kind of thing.
> 
> 
> Ha ha. As far as I've seen, there is no python module that can do this.
> And yes, by "this" I do mean actual correct HTML, not random crud real
> browsers have to put up with. Even unambiguous well-formed HTML is
> difficult to parse into a useful DOM.

ZPT seems to do okay.  It's picky about a few things, but when it's 
picky it also gives errors, so you can get by.  For instance, I think 
this example you give:

 > Dealing with the optional closing of tags in HTML is somewhat
 > irritating as well. Here's an example of a correct HTML document:
 > "<title>Hello</title><table><tr><td><p>Foo<tr><td>Bar</table>".
 > Microdom gets the table there wrong -- it has a list of which opening
 > tags can close which others, but only looks at the current level, not
 > up the tree, to find the tag to close. So it creates a structure like:
 > table[tr[td[p["Foo", tr[td["Bar"]]]]]]. Besides the obvious issue of
 > the tr being inside the p, the DOM should probably include inferred
 > elements as well, such as html, head, body, and tbody.

That would cause an error in ZPT.  Always a good principle -- when in 
doubt, don't guess; it's better to complain than to guess incorrectly. 
Well, that can be hard in some cases, like when you are parsing someone 
else's markup.  But the idea here is that you are parsing markup that 
you also control.
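
To make the "complain, don't guess" idea concrete, here is a minimal
sketch of a strict checker.  It is not how ZPT does it; it is just an
illustration built on the standard library's html.parser, and the names
StrictParser, StrictMarkupError, check and VOID_TAGS are made up for
this example.  It refuses any document that relies on optional closing
tags, including the one quoted above.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a deliberately short,
# illustrative list, not the full set from the HTML spec).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class StrictMarkupError(Exception):
    """Raised when the parser would otherwise have to guess."""


class StrictParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open element names

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        expected = self.open_tags[-1] if self.open_tags else None
        if expected != tag:
            # No inference about implied end tags: report and stop.
            line = self.getpos()[0]
            raise StrictMarkupError(
                f"line {line}: </{tag}> does not match open <{expected}>"
            )
        self.open_tags.pop()


def check(markup):
    """Raise StrictMarkupError unless every element is explicitly closed."""
    parser = StrictParser()
    parser.feed(markup)
    parser.close()
    if parser.open_tags:
        raise StrictMarkupError(f"unclosed elements: {parser.open_tags}")


check("<table><tr><td><p>Foo</p></td></tr></table>")  # accepted silently

try:
    check("<title>Hello</title><table><tr><td><p>Foo<tr><td>Bar</table>")
except StrictMarkupError as exc:
    print("rejected:", exc)
```

The second document is valid HTML, but accepting it means inferring
where the p, td and tr elements end.  For templates you write yourself,
rejecting it and making the author close the tags seems like the safer
trade.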

> Getting whitespace rules right, in particular, is quite difficult.
> While the actual rules from the actual HTML spec are easy enough, they
> are not what you think they are, and not what anybody has implemented.
> The actual rules are much more complex, and depend on the element the
> whitespace is near and also seem to allow whitespace to float into and
> out of elements with relative abandon. Quick, tell me when the space in
> here is significant: "<span> foo</span>"?

Um, lemme think... x<span> foo</span>!  I think I'll just retract the 
whitespace issue, though; it can be solved better at other levels.

> Just inserting all the whitespace from the original document into the
> DOM is a pretty safe thing to do, but it'd be nice to not have to do
> that, as you end up with excessive numbers of text nodes that have no
> meaning.

Well, they have some meaning regardless, because someone thought they 
made the document prettier, and it makes the resulting document look 
more like the original.  That's important, even if the browser doesn't care.

One issue might be that the DOM has a fairly heavy representation of 
text.  Maybe this isn't an issue when text is just a Python string. 
ElementTree has a weird but seemingly light way of representing text. 
OTOH, I don't see any problem with just putting strings in the list of 
children.
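
For comparison, a rough sketch of the two representations.  The
ElementTree part uses the real text and tail attributes; the Node class
and the from_etree helper are hypothetical, only there to show what
"strings in the list of children" could look like.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<p>Hello <b>big</b> world</p>")
b = root[0]

# ElementTree has no text nodes.  Text before the first child lives on
# the parent's .text; text following a child lives on that child's .tail.
print(repr(root.text))  # 'Hello '
print(repr(b.text))     # 'big'
print(repr(b.tail))     # ' world'


class Node:
    """Hypothetical element whose children list mixes Node and str."""

    def __init__(self, tag, children=None):
        self.tag = tag
        self.children = list(children or [])


def from_etree(elem):
    """Convert an ElementTree element into the list-of-children form."""
    children = []
    if elem.text:
        children.append(elem.text)
    for child in elem:
        children.append(from_etree(child))
        if child.tail:
            children.append(child.tail)
    return Node(elem.tag, children)


p = from_etree(root)
# p.children is now ['Hello ', <Node b>, ' world'], in document order.
for item in p.children:
    print(item if isinstance(item, str) else f"<{item.tag}>")
```

With plain strings in the list, document order is simply list order,
and whitespace-only runs cost one small string each rather than a full
text-node object.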

-- 
Ian Bicking  /  ianb at colorstudy.com  /  http://blog.ianbicking.org

