[Doc-SIG] Re: DPS and DOM trees

Tony J Ibbs (Tibs) tony@lsl.co.uk
Thu, 30 Aug 2001 10:12:15 +0100


David Goodger wrote:
> Why is a DOM tree a prerequisite for HTML output? I don't follow.

Hmm - I hadn't thought about it much.

Firstly, I want to be able to take advantage of the parsers other people
are writing for the DPS project (I only need to provide the extra bits
for the Python specific structures outwith the docstring), and as I
understand it, those work off the DOM tree?

Secondly, I felt the DOM interface was more "published" (or, perhaps,
obvious by inspection, and less likely to change) than the "internal"
interface. I'm still at a *very* early stage in navigating around the
source code.

My "understanding" that DOM is the official interface between parser and
formatter may, of course, be, erm, flawed (I know Guido suggested it
need not be, and using Python classes directly can be much more
pleasant). Hmm - have I missed something in the documentation again?

Where is the actual structure of the document *as Python datastructure*
defined? Is the stuff in nodes.py the class structure for the Python
tree structure? (Of course, by the time you answer this, I may have
worked it out - time folding is fun.)
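To make concrete what I mean by "the Python tree structure", here's a
rough sketch of the *kind* of thing I'm imagining - every name below is
made up for illustration, and is emphatically not the real nodes.py API:

```python
# Hypothetical sketch of a nodes.py-style internal tree - illustrative
# names only, not the actual DPS classes.
class Node:
    def __init__(self, tagname, children=(), **attributes):
        self.tagname = tagname
        self.children = list(children)
        self.attributes = attributes

    def walk(self, visit):
        # Depth-first traversal, calling visit() on each node.
        visit(self)
        for child in self.children:
            child.walk(visit)

doc = Node('document',
           [Node('section',
                 [Node('title'), Node('paragraph')])])

names = []
doc.walk(lambda n: names.append(n.tagname))
print(names)
```

If nodes.py looks anything like this, then the "internal" interface is
just attribute access and traversal, with no DOM in sight.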

Somehow I can't help feeling that the infoset is a natural fit for my
thinking about this "document" we're working with, as well -
particularly as we *are* talking about a document of sorts. That's what
SGML (and later XML) was designed to work with. Still doesn't mean we
*have* to use DOM at any particular stage, but...

> > I would thus like to propose that the user provide a DOM document
> > instance to the DOM generator methods
>
> Is there a reason to provide a DOM *document instance* rather than a
> constructor?

Hmm - I thought that in DOM the document instance (or perhaps class)
*was* the constructor - but I (think I) see what you mean.

I'll retrieve the latest snapshot again (I try to do so daily) and have
a look (I'm having a good run on getting stuff done in the evening at
the moment - this won't last, as school term starts again next
Wednesday, and life gets more frenetic).

> Are the DOM implementations standard enough for this kind of
> interoperability?

I believe that some of the methods are predetermined by the standard
itself. But I may be wrong (in which case, of course, life is only
*slightly* more complex). I've yet to get fully to grips with the
Python XML stuff.
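For what it's worth, the core construction methods *are* fixed by the
W3C DOM spec, so at least that much should be portable between
implementations - e.g. with the standard library's minidom:

```python
# Building a tiny document tree using only W3C-standard DOM methods
# (createDocument, createElement, createTextNode, appendChild), which
# any conforming implementation must provide.
from xml.dom.minidom import getDOMImplementation

impl = getDOMImplementation()
dom = impl.createDocument(None, "document", None)
root = dom.documentElement

para = dom.createElement("paragraph")
para.appendChild(dom.createTextNode("Hello"))
root.appendChild(para)

print(root.toxml())
```

Anything beyond those standard methods (convenience accessors and the
like) is where implementations may diverge.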

> The Python docstring mode model that's evolving in my mind goes
> something like this:
>
> 1. Extract the docstring/namespace tree from the module(s) and/or
>    package(s).
>
> 2. Run the parser on each docstring in turn, producing a forest of
>    trees (internal data structure as per nodes.py).
>
> 3. Run various transformations on the individual docstring trees.
>    Examples: resolving cross-references; resolving hyperlinks;
>    footnote auto-numbering; first field list -> bibliographic
>    elements.
>
> 4. Join the docstring trees together into a single tree, running more
>    transformations (such as creating various sections like "Module
>    Attributes", "Functions", "Classes", "Class Attributes", etc.; see
>    the DPS spec/ppdi.dtd).
>
> 5. Pass the resulting unified tree to the output formatter.

OK - my model is not dissimilar, but goes like:

1. Parse the Python module(s) [remembering we may have a package]
   This locates the docstrings, amongst other things.

2. Trim the tree to lose stuff we didn't need (!).

3. Parse the docstrings (this might, instead, be done at the time
   that each docstring is "discovered").

4. Integrate the docstring into the tree - this *may* be as simple
   as having "thing.docstring = <docstring instance>".

5. Perform internal resolutions on the docstring (footnotes, etc.)

6. Perform intra-module/package resolutions on the docstring
   (so this is when we work out that `Fred` in *this* docstring
   refers to class Fred over here in the datastructure).

7. Format.
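The seven steps above can be sketched as a driver loop - every function
name here is invented, with trivial stand-ins so the shape is at least
runnable, and none of it is real DPS code:

```python
# Hypothetical driver for the seven-step model; all names invented.
def parse_modules(paths):
    # 1. Parse the module(s) - pretend each path yields one "thing"
    #    carrying a raw docstring.
    return [{"name": p, "raw": "doc for %s" % p} for p in paths]

def parse_docstring(raw):
    # 3. Stand-in for the reST parser: wrap the text in a node dict.
    return {"kind": "docstring", "text": raw, "resolved": False}

def process_package(paths):
    tree = parse_modules(paths)               # 1. parse module(s)
    tree = [t for t in tree if t["raw"]]      # 2. trim unwanted stuff
    for thing in tree:
        node = parse_docstring(thing["raw"])  # 3. parse the docstring
        thing["docstring"] = node             # 4. integrate into tree
        node["resolved"] = True               # 5. internal resolutions
    # 6. intra-module/package resolutions would go here
    return [t["docstring"]["text"] for t in tree]   # 7. format

print(process_package(["mod_a", "mod_b"]))
```

The point of the sketch is just that steps 3-5 happen per-docstring,
while steps 2, 6 and 7 work on the whole tree.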

Now, it may be that a DOM tree makes sense for the datastructure for
some of those steps - certainly if that's what the formatter is going to
want.

Also note that, if we go via a DOM tree, we can "cheat" with references,
a bit like Unix linkers do - thus if we have::

    :class:`Fred`

we might store that with a *reference* to
"#<packagename>.<modulename>.Fred", and trust that there is a class in
the DOM tree declaring its id to be the same. This naive approach gets
one quick satisfaction for links that are easy, but of course gives one
no error response for unsatisfied links.
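The linker-style cheat amounts to a dictionary lookup - here's a tiny
sketch (hypothetical names and id scheme), including the "dangling"
links that the naive approach would silently let through:

```python
# Linker-style reference resolution: each reference records an id
# string, and resolution is a lookup over the ids declared in the tree.
def resolve(references, declared_ids):
    resolved, dangling = {}, []
    for ref in references:
        if ref in declared_ids:
            resolved[ref] = declared_ids[ref]
        else:
            # No error response - the link just stays unsatisfied.
            dangling.append(ref)
    return resolved, dangling

ids = {"#mypkg.mymod.Fred": "<class Fred>"}
ok, bad = resolve(["#mypkg.mymod.Fred", "#mypkg.mymod.Joe"], ids)
print(ok, bad)
```

Collecting the `dangling` list, rather than discarding it, is the
obvious cheap upgrade over what a Unix linker's "trust me" approach
gives you.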

In my mind, I don't think of a great tidal change at the interface
between module documentation and docstring - I think of it all as one
structure (so there might be internal structure to a parameter list, and
there might be internal structure to a docstring - same idea).

> I've had trouble reconciling the roles of input parser and output
> formatter with the idea of "modes". Does the mode govern the
> transformation of the input, the output, or both? Perhaps the mode
> should be split into two.

A mode needs to:

1. Provide plugins for parsing - this *may* go so far as to subsume
   the DPS functionality into a new program, as I'm doing for Python.
   In this case the "plugin" for parsing may be virtual - I just need
   to ferret around in the docstring looking for things that are
   already there, perhaps.

2. Provide plugins for formatting - again, these may subsume a
   DPS parser process. In the Python case, I clearly want to
   *use* the normal HTML formatter for HTML output, but with
   extra support "around it" for the Python specific infrastructure.

I think this will become much clearer once I have an example (!).

Interestingly, I see directives as being split similarly - for a typical
(non-pragma) directive, one needs both code to parse the directive
content (or to manipulate what DPS/reST has already parsed), and code to
format it within a formatter. Clearly, formatters should cope well with
directives they *don't* understand, though!
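A sketch of that two-sided directive idea - a directive registers a
parse half and a format half, and the formatter degrades gracefully on
directives it doesn't know. All names here are invented for the sake
of the example:

```python
# Hypothetical directive registry: each entry pairs a parse function
# with a format function.
directives = {}

def register(name, parse, fmt):
    directives[name] = (parse, fmt)

def format_node(name, content):
    if name in directives:
        parse, fmt = directives[name]
        return fmt(parse(content))
    # Unknown directive: pass the raw content through rather than
    # crashing - the formatter "copes well".
    return "[unhandled directive %r] %s" % (name, content)

register("note", lambda c: c.strip(), lambda c: "NOTE: " + c)
print(format_node("note", "  be careful  "))
print(format_node("mystery", "whatever"))
```

The fallback branch is the important part: a formatter that dies on an
unknown directive makes every new directive a flag day for every
formatter.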

> For example, say the source of our input is a Python module. Our
> "input mode" should be "Python Docstring Mode". It discovers (from
> ``__docformat__``) that the input parser is "reStructuredText". If we
> want HTML, we'll specify the "HTML" output formatter. But there's a
> piece missing. What *kind* or *style* of HTML output do we want?
> PyDoc-style, LibRefMan style, etc. (many people will want to specify
> and control their own style). Is the output style specific to a
> particular output format (XML, HTML, etc.)? Is the style specific to
> the input mode? Or can/should they be independent?
>
> I envision interaction between the input parser, an "input
> mode" (would
> control steps 1, 2, & 3), a "transformation style" (would control step
> 4), and the output formatter. The same intermediate data format would
> be used between each of these, gaining detail as it progresses.
>
> This requires thought.

Agreed - I think it will emerge as we try to do it.

This is, of course, another reason to consider the DOM as a useful
tool - people can be expected to learn about the DOM independently of
us, and to understand (in general) how to manipulate DOM trees. There
are standard ways of documenting the content of a DOM tree (DTD,
XMLSchema, etc.). There are Useful Tools.

Asking each person who wants to write a thingy to understand *our* tree
structure in depth may be more powerful in some ways, but more onerous
in others.

> I've absorbed the new features of 2.0 too completely to want to revert
> to 1.5.2. String methods and augmented assignment have become second
> nature. Of course, it's possible to back-port to 1.5.2 (the new syntax
> is all so much sugar, if sweet). If enough need arises, fine.

So far I've avoided doing this. But I think that for a tool like this,
we can probably make the break. And I do want to use the new syntactic
sugar (heck, OO is only some *very nice* syntactic sugar, albeit in
chunks rather than granules).

> A larger issue might be that relating to the docstring extraction
> mechanism. Won't the code that uses compiler.py be very
> version-dependent?

Not my experience. Indeed, that's one of its values - the compiler
package *itself* may differ from Python version to Python version, but
(except in dark corners) the code using it need not - and a try/except
or so will cope with those dark corners.
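(For readers coming to this later: the compiler package is a Python-2
era thing. A comparable docstring extraction with the stdlib ast
module, its eventual successor, might look like the following - a
sketch, not the pydps code:)

```python
# Walk a module's source and collect docstrings for the module itself
# and for each class/function definition, using the stdlib ast module.
import ast

def collect_docstrings(source):
    tree = ast.parse(source)
    found = {"<module>": ast.get_docstring(tree)}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            found[node.name] = ast.get_docstring(node)
    return found

src = '"""Module doc."""\n\ndef f():\n    "f doc"\n'
print(collect_docstrings(src))
```

The same point holds either way: the parse tree's *shape* changes
between versions, but the small slice of it you need for docstrings
stays stable.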

> It would be very convenient if both the DPS and
> PyChecker used a common module for module parsing;
> this module could be maintained independently.

Hmm. It should be doable - it's just that there would be a lot more
information extracted than we need. pydps/visit.py (what was until last
night's work dps_visit.py) is a good start towards that, I think.

> > Tibs (who narrowly avoided starting to write docstrings for existing
> > code last night, instead of writing new code)
>
> We need more of those too!

Yes...

> /David (who tends to write docstrings when confused by his own code)

Ah - that's why I like them too - but I always expect to be confused by
my code!

Whilst I'm at it - update on my code:

	http://www.tibsnjoan.co.uk/reST/pydps.tgz

contains::

    pydps/
        visit.py   -- broadly, what was dps_visit.py
        dom.py     -- the DPS tree and DOM tree stuff
        test.py

I may change the name of "dom.py" - indeed, I may change all sorts of
things!

Tibs
--
Tony J Ibbs (Tibs)      http://www.tibsnjoan.co.uk/
"Bounce with the bunny. Strut with the duck.
 Spin with the chickens now - CLUCK CLUCK CLUCK!"
BARNYARD DANCE! by Sandra Boynton
My views! Mine! Mine! (Unless Laser-Scan ask nicely to borrow them.)