Finding all instances of a string in an XML file

Peter Otten __peter__ at web.de
Fri Jun 21 02:16:00 EDT 2013


Jason Friedman wrote:

> I have XML which looks like:
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <!DOCTYPE KMART SYSTEM "my.dtd">
> <LEVEL_1>
>   <LEVEL_2 ATTR="hello">
>     <ATTRIBUTE NAME="Property X" VALUE ="2"/>
>   </LEVEL_2>
>   <LEVEL_2 ATTR="goodbye">
>     <ATTRIBUTE NAME="Property Y" VALUE ="NULL"/>
>     <LEVEL_3 ATTR="aloha">
>       <ATTRIBUTE NAME="Property X" VALUE ="3"/>
>     </LEVEL_3>
>     <ATTRIBUTE NAME="Property Z" VALUE ="welcome"/>
>   </LEVEL_2>
> </LEVEL_1>
> 
> The "Property X" string appears twice and I want to output the "path"
> that leads to all such appearances.  In this case the output would be:
> 
> LEVEL_1 {}, LEVEL_2 {"ATTR": "hello"}, ATTRIBUTE {"NAME": "Property X",
> "VALUE": "2"}
> LEVEL_1 {}, LEVEL_2 {"ATTR": "goodbye"}, LEVEL_3 {"ATTR": "aloha"},
> ATTRIBUTE {"NAME": "Property X", "VALUE": "3"}
> 
> My actual XML file is 2000 lines and contains up to 8 levels of nesting.

That's still small, so you can parse the whole file into memory and walk the
tree recursively:

xml = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE KMART SYSTEM "my.dtd">
<LEVEL_1>
  <LEVEL_2 ATTR="hello">
    <ATTRIBUTE NAME="Property X" VALUE ="2"/>
  </LEVEL_2>
  <LEVEL_2 ATTR="goodbye">
    <ATTRIBUTE NAME="Property Y" VALUE ="NULL"/>
    <LEVEL_3 ATTR="aloha">
      <ATTRIBUTE NAME="Property X" VALUE ="3"/>
    </LEVEL_3>
    <ATTRIBUTE NAME="Property Z" VALUE ="welcome"/>
  </LEVEL_2>
</LEVEL_1>
"""

import xml.etree.ElementTree as etree

tree = etree.fromstring(xml)

def walk(elem, path, token):
    path += (elem,)
    if token in elem.attrib.values():
        yield path
    # Iterate over the element directly; getchildren() was deprecated
    # and is gone as of Python 3.9.
    for child in elem:
        for match in walk(child, path, token):
            yield match

for path in walk(tree, (), "Property X"):
    print(", ".join("{} {}".format(elem.tag, elem.attrib) for elem in path))
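
Note that formatting elem.attrib directly prints Python's dict repr, which
uses single quotes rather than the double quotes in your expected output.
Below is a minimal sketch of one way to get the double-quoted form, assuming
JSON-style quoting is acceptable and Python 3.3+ is available (for "yield
from"). The names find_paths, format_path and SAMPLE are mine, chosen for
illustration only, and the sample omits the XML declaration and DOCTYPE for
brevity.

```python
import json
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document (declaration and DOCTYPE omitted).
SAMPLE = """<LEVEL_1>
  <LEVEL_2 ATTR="hello">
    <ATTRIBUTE NAME="Property X" VALUE="2"/>
  </LEVEL_2>
  <LEVEL_2 ATTR="goodbye">
    <ATTRIBUTE NAME="Property Y" VALUE="NULL"/>
    <LEVEL_3 ATTR="aloha">
      <ATTRIBUTE NAME="Property X" VALUE="3"/>
    </LEVEL_3>
    <ATTRIBUTE NAME="Property Z" VALUE="welcome"/>
  </LEVEL_2>
</LEVEL_1>
"""


def find_paths(elem, token, path=()):
    """Yield a tuple of elements (root first) for every element
    that has `token` as one of its attribute values."""
    path += (elem,)
    if token in elem.attrib.values():
        yield path
    for child in elem:
        yield from find_paths(child, token, path)


def format_path(path):
    """Render a path as 'TAG {json attributes}, TAG {...}, ...'."""
    return ", ".join(
        "{} {}".format(elem.tag, json.dumps(elem.attrib)) for elem in path
    )


root = ET.fromstring(SAMPLE)
lines = [format_path(p) for p in find_paths(root, "Property X")]
print("\n".join(lines))
```

json.dumps() is only used here as a convenient way to get double quotes; if
your attribute values contain non-ASCII characters you may want to pass
ensure_ascii=False.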




