xml.etree and namespaces -- why?

Axy axy at declassed.art
Wed Oct 19 11:23:56 EDT 2022


I mean, it's worth looking at the BeautifulSoup source to see how they
do that. With BS I work with attributes exactly as you want, and I
explicitly tell BS to use the lxml parser.

Axy.

On 19/10/2022 14:25, Robert Latest via Python-list wrote:
> Hi all,
>
> For the impatient: Below the longish text is a fully self-contained Python
> example that illustrates my problem.
>
> I'm struggling to understand xml.etree's handling of namespaces. I'm trying to
> parse an Inkscape document which uses several namespaces. From etree's
> documentation:
>
>      If the XML input has namespaces, tags and attributes with prefixes in the
>      form prefix:sometag get expanded to {uri}sometag where the prefix is
>      replaced by the full URI.
>
> Which means that given an Element e, I cannot directly access its attributes
> using e.get() because in order to do that I need to know the URI of the
> namespace. So rather than doing this (see example below):
>
>      label = e.get('inkscape:label')
>
> I need to do this:
>
>      label = e.get('{' + uri_inkscape_namespace + '}label')
>
> ...which is the method mentioned in etree's docs:
>
>      One way to search and explore this XML example is to manually add the URI
>      to every tag or attribute in the xpath of a find() or findall().
>      [...]
>      A better way to search the namespaced XML example is to create a
>      dictionary with your own prefixes and use those in the search functions.
>
> Good idea! Better yet, that dictionary or rather, its reverse, already exists,
> because etree has used it to unnecessarily mangle the namespaces in the first
> place. The documentation doesn't mention where it can be found, but we can
> just use the 'xmlns:' attributes of the <svg> root element to rebuild it. Or
> so I thought, until I found out that etree deletes exactly these attributes
> before handing the <svg> element to the user.
>
> I'm really stumped here. Apart from the fact that I think XML is bloated shit
> anyway and has no place outside HTML, I just don't get the purpose of etree's
> way of working:
>
> 1) Evaluate 'xmlns:' attributes of the <svg> element
> 2) Use that info to replace the existing prefixes by {uri}
> 3) Realizing that using {uri} prefixes is cumbersome, suggest to
>     the user to build their own prefix -> uri dictionary
>     to undo the effort of doing 1) and 2)
> 4) ...but withholding exactly the information that existed in the original
>     document by deleting the 'xmlns:' attributes from the <svg> tag
>
> Why didn't they leave the whole damn thing alone? Keep <svg> intact and keep
> the 'prefix:key' attributes literally as they are. For anyone wanting to use
> the {uri} prefixes (why would they?) they could have thrown in a helper
> function for the prefix->URI translation.
>
> I'm assuming that etree's designers knew what they were doing in order to make
> my life easier when dealing with XML. Maybe I'm missing the forest for the
> trees. Can anybody enlighten me? Thanks!
>
>
> #### self-contained example
> import xml.etree.ElementTree as ET
>
> def test_svg(xml):
>     root = ET.fromstring(xml)
>     for e in root.iter():
>         print(e.tag)  # tags are shown prefixed with {URI}
>         if e.tag.endswith('svg'):
>             # Since namespaces are defined inside the <svg> tag, let's use the
>             # info from the 'xmlns:' attributes to undo etree's URI prefixing
>             print('Element <svg>:')
>             for k, v in e.items():
>                 print('  %s: %s' % (k, v))
>             # ...but alas: the 'xmlns:' attributes have been deleted by the parser
>
> xml = '''<?xml version="1.0" encoding="UTF-8" standalone="no"?>
> <!-- Created with Inkscape (http://www.inkscape.org/) -->
>
> <svg
>     width="210mm"
>     height="297mm"
>     viewBox="0 0 210 297"
>     version="1.1"
>     id="svg285"
>     inkscape:version="1.2.1 (9c6d41e410, 2022-07-14)"
>     sodipodi:docname="test.svg"
>     xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
>     xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
>     xmlns="http://www.w3.org/2000/svg"
>     xmlns:svg="http://www.w3.org/2000/svg">
>    <sodipodi:namedview
>       id="namedview287"
>       pagecolor="#ffffff"
>       bordercolor="#000000"
>       borderopacity="0.25"
>       inkscape:showpageshadow="2"
>       inkscape:pageopacity="0.0"
>       inkscape:pagecheckerboard="0"
>       inkscape:deskcolor="#d1d1d1"
>       inkscape:document-units="mm"
>       showgrid="false"
>       inkscape:zoom="0.2102413"
>       inkscape:cx="394.78447"
>       inkscape:cy="561.25984"
>       inkscape:window-width="1827"
>       inkscape:window-height="1177"
>       inkscape:window-x="85"
>       inkscape:window-y="-8"
>       inkscape:window-maximized="1"
>       inkscape:current-layer="layer1" />
>    <defs
>       id="defs282" />
>    <g
>       inkscape:label="Ebene 1"
>       inkscape:groupmode="layer"
>       id="layer1">
>      <rect
>         style="fill:#aaccff;stroke-width:0.264583"
>         id="rect289"
>         width="61.665253"
>         height="54.114403"
>         x="33.978813"
>         y="94.38559" />
>    </g>
> </svg>
> '''
>
> if __name__ == '__main__':
>     test_svg(xml)
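
On the question of where the prefix -> URI mapping went: staying within the
standard library, ET.iterparse() can report each namespace declaration as a
'start-ns' event, so the mapping can be collected while parsing. Below is a
minimal sketch of that idea, not a definitive recipe. The helper names
parse_with_nsmap and qname, and the cut-down SAMPLE document, are made up for
illustration and are not part of ElementTree.

```python
import io
import xml.etree.ElementTree as ET


def parse_with_nsmap(xml_text):
    """Parse XML text and return (root, {prefix: uri}).

    Hypothetical helper: the mapping is collected from 'start-ns' events,
    which iterparse() emits for every xmlns declaration it encounters.
    """
    nsmap = {}
    root = None
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start", "start-ns")):
        if event == "start-ns":
            prefix, uri = payload  # the default namespace has prefix ''
            nsmap[prefix] = uri
        elif root is None:
            root = payload  # the first 'start' event carries the root element
    return root, nsmap


def qname(name, nsmap):
    """Expand 'prefix:local' to '{uri}local' (hypothetical helper).

    Unprefixed names are returned unchanged, since unprefixed attributes
    are not in any namespace.
    """
    prefix, sep, local = name.partition(":")
    if not sep:
        return name
    return "{%s}%s" % (nsmap[prefix], local)


# Cut-down stand-in for the Inkscape document in the quoted message.
SAMPLE = '''<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape">
  <g inkscape:label="Ebene 1" id="layer1"/>
</svg>'''

if __name__ == "__main__":
    root, nsmap = parse_with_nsmap(SAMPLE)
    for e in root.iter():
        label = e.get(qname("inkscape:label", nsmap))
        if label is not None:
            print(e.get("id"), label)  # prints: layer1 Ebene 1
```

If a document binds the same prefix to different URIs on different elements,
this sketch keeps only the last binding, so it assumes that prefixes are
declared once, as in the Inkscape file above.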

