[Doc-SIG] How to traverse a document object

Paul Moore gustav@morpheus.demon.co.uk
Wed, 24 Oct 2001 22:31:08 +0100


On Tue, 23 Oct 2001 22:07:14 -0400, David Goodger
<goodger@users.sourceforge.net> wrote:

>Paul Moore wrote:
>> a. I'm inserting new methods into the node classes, which is a little
>>    presumptuous of me.
>
>If you insert methods from outside the module, yes, you *would* be
>presumptuous. If you edit the nodes.py module and submit a patch,
>though, you'll be a contributing developer!

Fair enough :-) The real point, I guess, was that I wasn't sure if what
I was doing counted as a prototype for a generally useful feature, or
just a one-off piece of code for the particular task I was attempting.
On reflection, it looks like a general thing (basically, *all* writers
will need to tree-walk the nodes, so it's general...)

>> b. I need to know the class names of the node classes, which seems to
>>    be too tied to implementation specifics.
>
>The document tree is meant to be a specific document/DTD/schema
>implementation only, not a generic DOM.

I'm not sure I understand what you mean here. That's probably because I
know almost nothing of XML, and in particular, I find the terminology
confusing (what do you mean by a "generic DOM"?)

Regardless, the point is probably moot, as this looks more like a
prototype for a patch (which means that being tied to the implementation
details is emphatically *not* an issue) as opposed to a separate module.

>> Nevertheless, it's a pretty good tree walking model. It might be nice
>> to have this in the node classes, except that there may not be a
>> single "correct" walk order (a bit like the normal preorder,
>> postorder, inorder issues).
>
>I don't know how you'd do inorder with document trees. ;-)

OK, so I got a bit over-excited there :-)

>It's most useful to have a hook on the way in and on the way out of an
>element. I'm sure we can come up with a useful set of methods, perhaps
>without having to reinvent the wheel, using polymorphism and without
>resorting to magic. SAX comes to mind. Suggestions, references, and/or
>patches welcome.

Yes, (what little I know of) SAX struck me as relevant. I tried looking
in the Python library manual (not XML/SAX specific, I know, but the only
documentation I had to hand) and that seemed to imply that SAX was
related to the parsing end of things rather than to the tree-walking
end. There were some vague hints about tree walker classes and visitors
in the PyXML modules, but unless I missed it, there's virtually no
documentation for those modules, so I couldn't find out any more.
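
As a rough sketch of the "hook on the way in and on the way out" idea,
something along these lines might do. The Node and Visitor classes and
the enter_/leave_ naming are hypothetical, made up for illustration;
they are not the actual nodes.py API, nor anything taken from SAX or
PyXML.

```python
class Node:
    """Hypothetical stand-in for a document tree node (not the real nodes.py)."""

    def __init__(self, name, children=None, text=""):
        self.name = name
        self.children = children or []
        self.text = text

    def walk(self, visitor):
        """Call enter_<name> on the way in and leave_<name> on the way out."""
        enter = getattr(visitor, "enter_" + self.name, visitor.default_enter)
        enter(self)
        for child in self.children:
            child.walk(visitor)
        leave = getattr(visitor, "leave_" + self.name, visitor.default_leave)
        leave(self)


class Visitor:
    """Base class: node types without a specific hook fall through to these."""

    def default_enter(self, node):
        pass

    def default_leave(self, node):
        pass


class TraceVisitor(Visitor):
    """Records every hook call, with a special case for paragraphs."""

    def __init__(self):
        self.events = []

    def default_enter(self, node):
        self.events.append("enter " + node.name)

    def default_leave(self, node):
        self.events.append("leave " + node.name)

    def enter_paragraph(self, node):
        self.events.append("paragraph: " + node.text)


doc = Node("document", [Node("section", [Node("paragraph", text="Hello")])])
tracer = TraceVisitor()
doc.walk(tracer)
print("\n".join(tracer.events))
```

A writer would subclass Visitor and override only the hooks it cares
about, so the walk order lives in one place and the node class names
appear only in method names.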

Question: This document model isn't a "real" XML DOM (by my reading of
your comments). So we end up reinventing technologies like DOM tree
walkers, etc. My naive reaction is "why aren't we using the XML DOM,
then, so we get this sort of thing for free"? Presumably there *are*
good reasons - I'd be interested to know what they are. (The biggest
sign of a problem, I suspect, would be if the asdom() method was heavily
used, implying that people habitually generated a "real" DOM to handle
this thing - presumably because of failings in the DPS DOM).
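
For comparison, this is roughly what walking a standard DOM looks like
with the xml.dom.minidom module from the Python library. The XML
string and its element names are invented for the example; nothing
here is taken from the DPS code.

```python
from xml.dom import minidom


def element_names(node, depth=0):
    """Yield (depth, tag name) for each element below node, in document order."""
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            yield depth, child.tagName
            yield from element_names(child, depth + 1)


# Made-up sample document, purely for illustration.
dom = minidom.parseString(
    "<document><section><title>T</title>"
    "<paragraph>P</paragraph></section></document>"
)
names = list(element_names(dom))
for depth, name in names:
    print("  " * depth + name)
```

Even here the traversal itself is hand-written recursion over
childNodes, so how much really comes "for free" is open to question.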

>> .. [Offtopic] I would normally indent this list if I was writing
>>    "plain" (ie, no markup) text. I'm not sure what effect such
>>    indentation would have on reST. I get the impression I'd get an
>>    extra "blockquote" element that I didn't want.
>
>Correct.

Experimentation had confirmed this for me.

>>    Is this harmful?
>
>Depends on what you mean by "harmful". :-)

"Not what I want" :-) Seriously, I guess the question is how
"blockquote" elements are going to be visibly marked up in the various
output formats. Until we actually *have* some more formats, the question
is hard to answer.

>>    Can it be avoided? In "plain" text, I *really* prefer the look
>>    of lists when they are indented.
>
>A transform could be written that looks for a block quote containing
>only a list, and extracts the list from within the block quote. The
>spec could be changed to specify that this will happen. But what if we
>*want* a list inside a block quote? How else would we write it?
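
A minimal sketch of such a transform, assuming a hypothetical Node
class and assumed element names ("block_quote", "bullet_list",
"enumerated_list"); the real DPS classes and names may differ.

```python
class Node:
    """Hypothetical minimal tree node; not the real DPS node classes."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children if children is not None else []


LIST_NAMES = ("bullet_list", "enumerated_list")  # assumed element names


def unwrap_list_only_block_quotes(node):
    """Replace each block_quote whose only child is a list with that list."""
    for index, child in enumerate(node.children):
        # Handle descendants first, so nested cases are simplified bottom-up.
        unwrap_list_only_block_quotes(child)
        if (child.name == "block_quote"
                and len(child.children) == 1
                and child.children[0].name in LIST_NAMES):
            node.children[index] = child.children[0]
    return node


doc = Node("document", [
    Node("block_quote", [Node("bullet_list")]),                     # unwrapped
    Node("block_quote", [Node("paragraph"), Node("bullet_list")]),  # kept
])
unwrap_list_only_block_quotes(doc)
print([child.name for child in doc.children])
```

Note that the sketch cannot tell a deliberately quoted list from one
that was merely indented for looks, which is exactly the ambiguity
raised in the question above.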

I think that's basically my point. A lot depends on what a block quote
is intended to signify. I view blockquotes as basically a way of
displaying a quotation, or something similar, without the "..." around
it. As such, it would be a pretty rare thing. In LaTeX, I'd use the
"quote" environment for this. My copy of "HTML - The Definitive Guide"
says that <blockquote> tags cause the contents to be set off from the
main text, usually with indented right and left margins, and *sometimes
in italicised typeface* (my emphasis). And it does point out that it's
intended for quotes. This isn't the sort of thing I'd want to use
often...

Maybe the blockquote element in the DPS model should be redefined (or
just better defined) to clarify the intended use. I can see a number of
possibilities:

- It is intended for block quotations, and so the HTML <blockquote> and
  LaTeX quote environments are appropriate. I can't see this form
  getting much use.
- It is for general text which is indented on both left and right. This
  would be more useful, for text to "stand out" from the surroundings.
  But it doesn't match the HTML <blockquote> tag - so there may be
  problems implementing this in an HTML writer.
- It is for text indented on the left only. This is something people
  actually do a lot. It also matches the look of the source text, so
  it's fairly easy to understand. But again, it may be harder to
  implement, and it smacks of "visual" formatting, rather than "logical"
  formatting.

>> Sorry, this has all turned into a bit of a brain-dump. But that's
>> probably because I'm feeling that I'm having to invent something
>> that I expected to be part of the basic infrastructure.
>
>Great value for the money though!

Oh, definitely!

>
>> Is it simply that no-one's got to the point of needing this
>> implemented yet?
>
>In my case, yes. I haven't tackled the output end yet.

I think that's the problem. I can't conceptualise the data structure in
isolation - I need to grasp how it translates into real output. That's
not a criticism of what you've done, it's just that I work differently.
And I need to get a handle on getting stuff out of the object model
before I can completely understand it. That's a real chicken-and-egg
problem, though, as I'm trying to understand the model so that I can
implement an output program :-)

>What are you trying to do exactly? My interest is half-piqued. Provide
>me a bit of stimulus and it may become fully-piqued.

Write an output processor, at the most basic level. More specifically,
write a framework with which output processors can be built fairly
simply. My first concrete implementation is likely to be some sort of
dump of a summary of the structure - tweakable, so that different levels
of structure can be seen. With that, I'll be able to explore the model
for a given document. Once I get that, I'm looking to implement a LaTeX
output processor. It shouldn't be hard to do others (such as yet another
HTML writer). If it is, I'll have done something wrong...
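
The "tweakable" structure dump could be sketched like this. The Node
class, the dump_structure function and its max_depth parameter are all
hypothetical names chosen for the example, not existing code.

```python
class Node:
    """Hypothetical minimal tree node, standing in for the real node classes."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children if children is not None else []


def dump_structure(node, max_depth=None, depth=0, lines=None):
    """Return indented element names, eliding anything below max_depth."""
    if lines is None:
        lines = []
    lines.append("    " * depth + node.name)
    if max_depth is None or depth < max_depth:
        for child in node.children:
            dump_structure(child, max_depth, depth + 1, lines)
    elif node.children:
        # Mark that deeper structure exists but was not shown.
        lines.append("    " * (depth + 1) + "...")
    return lines


doc = Node("document", [
    Node("section", [Node("title"), Node("paragraph")]),
])
summary = dump_structure(doc, max_depth=1)
full = dump_structure(doc)
print("\n".join(summary))
print("\n".join(full))
```

With max_depth=1 only the document and its sections are listed, with
"..." standing in for the omitted levels; leaving max_depth unset
dumps the whole tree.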

One thing that has already become clear to me is that it will be *far*
easier to write output processors for "structural" markup languages
(HTML, (La)TeX, DocBook, Texinfo, etc) than for "layout" oriented
languages (PDF, PostScript, etc). Sufficiently so that I doubt it will
ever be realistic to go direct to such formats - you'd have to implement
line and page breaking algorithms, etc, etc.

>> Thanks for listening,
>
>Any time. Cheaper than psychotherapy.

Gibber, gibber...

Paul.