[Web-SIG] DOM-based templating
Ian Bicking
ianb at colorstudy.com
Fri Jun 3 18:41:38 CEST 2005
James Y Knight wrote:
> On Jun 3, 2005, at 2:18 AM, Ian Bicking wrote:
>
>> * Can parse HTML, not just XHTML. Not the crazy HTML browsers parse,
>> but unambiguous well-formed HTML. I don't like the idea of putting the
>> HTML through tidy; that's fine for a screen-scraper, but is way too
>> defensive for this kind of thing.
>
>
> Ha ha. As far as I've seen, there is no python module that can do this.
> And yes, by "this" I do mean actual correct HTML, not random crud real
> browsers have to put up with. Even unambiguous well-formed HTML is
> difficult to parse into a useful DOM.
ZPT seems to do okay. It's picky about a few things, but when it's
picky it also gives errors, so you can get by. For instance, I think
this example you give:
> Dealing with the optional closing of tags in HTML is somewhat
> irritating as well. Here's an example of a correct HTML document:
> "<title>Hello</title><table><tr><td><p>Foo<tr><td>Bar</table>".
> Microdom gets the table there wrong -- it has a list of which opening
> tags can close which others, but only looks at the current level, not
> up the tree, to find the tag to close. So it creates a structure like:
> table[tr[td[p["Foo", tr[td["Bar"]]]]]]. Besides the obvious issue of
> the tr being inside the p, the DOM should probably include inferred
> elements as well, such as html, head, body, and tbody.
That would cause an error in ZPT. Always a good principle -- when in
doubt, don't guess; it's better to complain than to guess incorrectly.
Well, that can be hard in some cases, like when you are parsing someone
else's markup. But the idea here is that you are parsing markup that
you also control.
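
To make the "complain, don't guess" idea concrete, here is a minimal sketch built on Python's standard html.parser module. It is not how ZPT does it; the names StrictParser, StrictHTMLError and parse_strict are made up for illustration, and the tree is just nested lists. The only rule is that an end tag must match the innermost open element, so any omitted closing tag becomes an error rather than an inference.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML (partial list, for the sketch).
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}


class StrictHTMLError(Exception):
    """Raised instead of guessing where an element was meant to end."""


class StrictParser(HTMLParser):
    """Hypothetical strict parser: builds nested lists, infers nothing."""

    def __init__(self):
        super().__init__()
        self.root = ["#document"]
        self.stack = [self.root]  # innermost open element is stack[-1]

    def handle_starttag(self, tag, attrs):
        node = [tag]
        self.stack[-1].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in VOID:
            return  # e.g. the end half of <br/>; nothing was pushed
        open_tag = self.stack[-1][0]
        if open_tag != tag:
            line, col = self.getpos()
            raise StrictHTMLError(
                "line %d, col %d: </%s> does not match open <%s>"
                % (line, col, tag, open_tag)
            )
        self.stack.pop()

    def handle_data(self, data):
        # Text is stored as a plain string among the children.
        self.stack[-1].append(data)


def parse_strict(markup):
    parser = StrictParser()
    parser.feed(markup)
    parser.close()
    if len(parser.stack) > 1:
        raise StrictHTMLError("unclosed <%s>" % parser.stack[-1][0])
    return parser.root


# The document from the quoted message relies on optional end tags,
# so this sketch rejects it instead of building a wrong tree.
try:
    parse_strict("<title>Hello</title><table><tr><td><p>Foo<tr><td>Bar</table>")
except StrictHTMLError as exc:
    print("rejected:", exc)
```

With every tag closed explicitly the same markup parses without complaint, which is a fair price when you control the templates yourself.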
> Getting whitespace rules right, in particular, is quite difficult.
> While the actual rules from the actual HTML spec are easy enough, they
> are not what you think they are, and not what anybody has implemented.
> The actual rules are much more complex, and depend on the element the
> whitespace is near and also seem to allow whitespace to float into and
> out of elements with relative abandon. Quick, tell me when the space in
> here is significant: "<span> foo</span>"?
Um, lemme think... x<span> foo</span>! I think I'll just retract the
whitespace issue, though; it can be solved better at other levels.
> Just inserting all the whitespace from the original document into the
> DOM is a pretty safe thing to do, but it'd be nice to not have to do
> that, as you end up with excessive numbers of text nodes that have no
> meaning.
Well, they have some meaning regardless, because someone thought they
made the document prettier, and it makes the resulting document look
more like the original. That's important, even if the browser doesn't care.
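
A quick sketch of what keeping the whitespace buys you, using the standard xml.etree.ElementTree as a stand-in (it is an XML parser, not the HTML parser under discussion, and the sample markup is made up). Because the whitespace-only text is kept, serializing the tree gives back the original layout.

```python
import xml.etree.ElementTree as ET

# Made-up sample: the indentation exists only to make the source readable.
source = "<ul>\n  <li>one</li>\n  <li>two</li>\n</ul>"

root = ET.fromstring(source)

# The whitespace the author typed is still in the tree...
whitespace_only = [
    text
    for text in [root.text] + [li.tail for li in root]
    if text is not None and text.strip() == ""
]

# ...so the serialized output looks like the input.
output = ET.tostring(root, encoding="unicode")
print(len(whitespace_only), output == source)
```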
One issue might be that the DOM has a fairly heavy representation of
text. Maybe this isn't an issue when text is just a Python string.
ElementTree has a weird but seemingly light way of representing text.
OTOH, I don't see any problem with just putting strings in the list of
children.
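
For comparison, a small sketch of the two representations, using the standard xml.etree.ElementTree. The helper children_with_text is hypothetical, not part of ElementTree. ElementTree keeps text in two attributes: .text holds the text before the first child, and each child's .tail holds the text that follows that child. The helper flattens that into one list of children where text is just a Python string.

```python
import xml.etree.ElementTree as ET


def children_with_text(elem):
    """Hypothetical helper: return elem's content as one list that
    mixes plain strings and child elements, in document order."""
    items = []
    if elem.text:
        items.append(elem.text)  # text before the first child
    for child in elem:
        items.append(child)
        if child.tail:
            items.append(child.tail)  # text after this child
    return items


p = ET.fromstring("<p>Hello <b>big</b> world</p>")

# ElementTree's own view: text hangs off .text and .tail.
print(repr(p.text), repr(p[0].text), repr(p[0].tail))

# The "strings in the list of children" view.
print([item if isinstance(item, str) else item.tag for item in children_with_text(p)])
```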
--
Ian Bicking / ianb at colorstudy.com / http://blog.ianbicking.org