[XML-SIG] Parsing XML for deeply nested structures

Michael McLay mclay@nist.gov
Thu, 10 Aug 2000 21:17:49 -0400 (EDT)


travish writes:
 > Actually, I want something between the two APIs that appear to be present
 > (lexing and generating an AST).  For example, in the reduce phase
 > of a shift-reduce parser like yacc (which corresponds to a close-tag
 > event from an "event driven API"), one is given the ability to
 > 'condense' all of the subtrees of this particular node, requiring
 > neither a full AST nor keeping track of the stack of nested tags
 > you may currently be processing in.  This would be extremely handy
 > for (e.g.) converting XML to nested data structures.

[...]

 > All of the examples I've seen have a fixed, shallow tag hierarchy and so
 > are toy problems which don't encounter these complexities.

There are major efforts underway to develop XML-based standards for
engineering data, so it is likely that this kind of problem will
become very common soon.  Think of any product category and you'll
find someone working on an XML mapping.

I am working with a standards group and Georgia Tech on an XML Schema
for representing the manufacturing data needed to produce a printed
circuit board and a printed circuit assembly.  This is a fairly easy
example to grok for anyone who has ever seen a printed circuit board.
It requires a deeply nested XML tag set with a corresponding deeply
nested set of structures that must be referenced by CAD and CAM
software.  

The XML Schema for the GenCAM standard is at:

    http://www.fis.marc.gatech.edu/xml/ipc-schema.html#IPC2511

The example file is at:

   http://www.gencam.org/examples/dieter6.xml

This is a typical example of a nested structure.  The GenCAM
description of a printed circuit board contains about 18 top-level
sections.  For this example I'll explain the interaction between the
PRIMITIVES and ROUTES sections.  A PRIMITIVES section has a list of
GROUP objects.  Each GROUP is a separate namespace, and every GROUP
namespace name is unique within a GenCAM file.  One group might
hold standard colors and another might hold line descriptions.  It is
up to the vendor to decide how to partition the namespaces.  In this
example the GROUP contains a PAINTDESC object definition, which defines
the fill used inside polygons and other closed shapes.

The ROUTES section follows the PRIMITIVES section.  ROUTES contains a
list of GROUP objects.  A ROUTES GROUP contains a
list of ROUTE objects, and a ROUTE (which represents a copper trace
etched on the printed circuit board) contains a list of geometry
objects, such as PATH, PLANE, VIA, TESTPAD...

Representing pointers between objects in XML is a special case problem 
that is very common in engineering data structures.  A small
example extracted from the dieter6.xml file will illustrate the
problem.  The example contains only one GROUP and only one ROUTE in
that group.  A PCB design would typically have between 100 and 100k
unique routes.   

<GENCAM>
  <PRIMITIVES>
    <GROUP primitive_group_id="prim4" >
      <PAINTDESC paintdesc_name="filled" paint_type="FILL" />
    </GROUP>
  </PRIMITIVES>
</GENCAM>
<GENCAM>
  <ROUTES>
    <GROUP route_group_id="route1" >
      <ROUTE net_name="Ground" net_class="GROUND" >
        <PATH layers_ref="lay1:2" linedesc_ref="prim4:signalwidth" >
          <POLYLINE >
            <STARTAT start_xy="(1300,1400)" />
            <LINETO end_xy="(1200,2200)" />
          </POLYLINE>
        </PATH>
        <PLANE layers_ref="lay1:2" paintdesc_ref="prim4:filled" >
          <POLYGON >
            <STARTAT start_xy="(0,0)" />
            <LINETO end_xy="(1200,0)" />
            <LINETO end_xy="(1200,2400)" />
            <LINETO end_xy="(0,2400)" />
            <ENDLINE />
          </POLYGON>
        </PLANE>
        <COMPPIN component_ref="cmp1:R1" pattern_pin_ref="Pin1" />
      </ROUTE>
    </GROUP>
  </ROUTES>
</GENCAM>

In the ROUTE definition there is a PLANE object defined using the
statement: 

        <PLANE layers_ref="lay1:2" paintdesc_ref="prim4:filled" >

The paintdesc_ref attribute specifies a relationship between
the PLANE object in the ROUTE and the PAINTDESC object defined in the
PRIMITIVES section.  Splitting the paintdesc_ref string on its first
':' yields the name of the GROUP and the name of the PAINTDESC that
is to be used to fill the PLANE.
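
To make the split concrete, here is a minimal Python sketch.  The
helper name split_ref is my own invention for illustration, not part
of GenCAM or of any library, and I am assuming that a reference with
an empty group or object name should be rejected.

```python
def split_ref(ref):
    """Split a GenCAM-style reference on its first ':'.

    Returns (group_id, object_name).  Hypothetical helper, not part of
    any GenCAM toolkit.
    """
    group_id, sep, name = ref.partition(":")
    if not sep or not group_id or not name:
        raise ValueError(f"malformed reference: {ref!r}")
    return group_id, name


print(split_ref("prim4:filled"))  # ('prim4', 'filled')
```

Only the first ':' is treated as the separator, so a colon inside the
object name would be kept as part of the name.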

Resolving the object pointer between the PLANE and the PAINTDESC isn't
done automatically by an XML parser.  We might have been able to use
one of the standard features of XML, such as XPath, to define the
relationship, but this seemed like a natural point for breaking between
the standard off-the-shelf XML object handling and the custom code
that will be required to populate the structures of the CAD and CAM
tool.
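
As a rough illustration of where that custom code could start, here is
a sketch in Python using the standard xml.etree.ElementTree module.  It
makes two passes: the first builds a dictionary keyed on (group id,
paintdesc name), the second looks up each PLANE's paintdesc_ref in it.
The function names and the trimmed sample are assumptions of mine; in
particular the sample puts PRIMITIVES and ROUTES under a single GENCAM
root, which may not match how the real file is laid out.

```python
import xml.etree.ElementTree as ET

# Trimmed sample; a single GENCAM root is assumed here.
SAMPLE = """\
<GENCAM>
  <PRIMITIVES>
    <GROUP primitive_group_id="prim4">
      <PAINTDESC paintdesc_name="filled" paint_type="FILL" />
    </GROUP>
  </PRIMITIVES>
  <ROUTES>
    <GROUP route_group_id="route1">
      <ROUTE net_name="Ground" net_class="GROUND">
        <PLANE layers_ref="lay1:2" paintdesc_ref="prim4:filled" />
      </ROUTE>
    </GROUP>
  </ROUTES>
</GENCAM>
"""


def build_paintdesc_index(root):
    """Pass 1: map (group_id, paintdesc_name) to the PAINTDESC element."""
    index = {}
    for group in root.findall("./PRIMITIVES/GROUP"):
        group_id = group.get("primitive_group_id")
        for desc in group.findall("PAINTDESC"):
            index[(group_id, desc.get("paintdesc_name"))] = desc
    return index


def resolve_planes(root, index):
    """Pass 2: pair every PLANE with the PAINTDESC it refers to."""
    resolved = []
    for plane in root.iter("PLANE"):
        group_id, _, name = plane.get("paintdesc_ref").partition(":")
        resolved.append((plane, index[(group_id, name)]))
    return resolved


root = ET.fromstring(SAMPLE)
index = build_paintdesc_index(root)
for plane, desc in resolve_planes(root, index):
    print(plane.get("paintdesc_ref"), "->", desc.get("paint_type"))
    # prints: prim4:filled -> FILL
```

Each reference costs one dictionary lookup, so the resolving pass
should grow roughly linearly with the number of references.  A
dangling reference raises KeyError in this sketch; a real tool would
probably want to report it instead.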

This breaking point between standard XML parsing and building custom
objects is probably a pattern language, but I wouldn't know how to
define it.  I'm still struggling with the hand-off between the two.

There is a huge legacy code base of engineering software that will
eventually be reading and writing this XML format.  The industry I'm
working with is just one of many that are converting their engineering
data to XML format.  Most likely each tool vendor will do this by
attaching an XML parser to the existing code and populating their
existing data structures with the object definitions as the XML parser
reads the data.  Is there an easy way to associate which data
structures should be populated as each of the XML tags is
encountered?  I was thinking it might be possible to automatically
generate a parse tree directly from the XML Schema definition.
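
One possible way to make that association, sketched in Python with the
standard xml.sax module, is a table that maps tag names to "reduce"
functions called on each close-tag event, much like the reduce step
travish describes.  Everything named here (ReducingHandler, REDUCERS,
the reduce_* functions) is hypothetical, and the table is written by
hand rather than generated from the XML Schema.  The handler does keep
a stack internally, but the per-tag functions never see it; each one
receives only its own attributes and its already-reduced children.

```python
import xml.sax

SAMPLE = b"""\
<PATH layers_ref="lay1:2" linedesc_ref="prim4:signalwidth">
  <POLYLINE>
    <STARTAT start_xy="(1300,1400)" />
    <LINETO end_xy="(1200,2200)" />
  </POLYLINE>
</PATH>
"""


def reduce_default(tag, attrs, children):
    # Fallback for tags with no entry in the table: a generic node.
    return {"tag": tag, "attrs": attrs, "children": children}


def reduce_point(tag, attrs, children):
    # "(1300,1400)" -> (1300, 1400)
    key = "start_xy" if tag == "STARTAT" else "end_xy"
    x, y = attrs[key].strip("()").split(",")
    return (int(x), int(y))


def reduce_polyline(tag, attrs, children):
    # The children have already been reduced to (x, y) tuples.
    return {"tag": tag, "points": children}


# Hand-written table associating tags with the structure to build.
REDUCERS = {
    "STARTAT": reduce_point,
    "LINETO": reduce_point,
    "POLYLINE": reduce_polyline,
}


class ReducingHandler(xml.sax.ContentHandler):
    """Calls a per-tag reduce function on every close-tag event."""

    def __init__(self, reducers):
        super().__init__()
        self.reducers = reducers
        self.children = [[]]  # bottom frame collects the reduced root
        self.attrs = []

    def startElement(self, name, attrs):
        self.attrs.append(dict(attrs.items()))
        self.children.append([])

    def endElement(self, name):
        children = self.children.pop()
        attrs = self.attrs.pop()
        reduce = self.reducers.get(name, reduce_default)
        self.children[-1].append(reduce(name, attrs, children))

    @property
    def result(self):
        return self.children[0][0]


handler = ReducingHandler(REDUCERS)
xml.sax.parseString(SAMPLE, handler)
result = handler.result
print(result["children"][0])
# {'tag': 'POLYLINE', 'points': [(1300, 1400), (1200, 2200)]}
```

A vendor could plug constructors for existing data structures into the
table in place of the dictionaries and tuples used here.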

I'd appreciate feedback on the approach used to define pointers
between structures.  Is there a standard way that would also be
efficient for a file that may contain millions of these references? 
I would be interested in seeing the example rewritten using any
alternative notations, such as XPath.
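
For comparison, here is a sketch of what an XPath-style lookup of the
same reference might look like, using the limited XPath subset
supported by Python's xml.etree.ElementTree.  The paintdesc_xpath
helper is hypothetical, and the sample again assumes a single GENCAM
root holding both sections.

```python
import xml.etree.ElementTree as ET

# Trimmed sample; a single GENCAM root is assumed here.
SAMPLE = """\
<GENCAM>
  <PRIMITIVES>
    <GROUP primitive_group_id="prim4">
      <PAINTDESC paintdesc_name="filled" paint_type="FILL" />
    </GROUP>
  </PRIMITIVES>
  <ROUTES>
    <GROUP route_group_id="route1">
      <ROUTE net_name="Ground" net_class="GROUND">
        <PLANE layers_ref="lay1:2" paintdesc_ref="prim4:filled" />
      </ROUTE>
    </GROUP>
  </ROUTES>
</GENCAM>
"""


def paintdesc_xpath(ref):
    """Turn 'group:name' into a path expression for the PAINTDESC."""
    group_id, _, name = ref.partition(":")
    return (
        f"./PRIMITIVES/GROUP[@primitive_group_id='{group_id}']"
        f"/PAINTDESC[@paintdesc_name='{name}']"
    )


root = ET.fromstring(SAMPLE)
plane = root.find(".//PLANE")
path = paintdesc_xpath(plane.get("paintdesc_ref"))
desc = root.find(path)
print(path)
print(desc.get("paint_type"))  # FILL
```

Each find() call walks the tree again, so for a file with millions of
references this is likely to be much slower than a lookup table built
once; the path notation is mainly attractive because it is standard.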