busting-out XML sections

Robert Roy rjroy at takingcontrol.com
Fri Oct 6 10:05:36 EDT 2000


On Thu, 05 Oct 2000 18:25:07 -0400, Thomas Gagne
<tgagne at ix.netcom.com> wrote:

>I have a file that looks like:
>
><batch>
><order>
><order_head/> (only one of these)
><order_detail/> (multiples of these)
></order>
><order>....
>
></batch>
>
>What I need is a good approach to take each <order></order> section and send
>the entire contents within it to a separate process.  I'm not worried about
>how to send it, I'm trying to figure out the best way to grab the text between
>the <order> tags.
>
>If I use sax, I'd have to write methods for everything that might appear
>between the tags and accumulate the text (remember, I don't want to change
>anything) into an instance variable.

You can use a fairly simple xmllib-based parser (see below). This
saves you from writing methods for each tag. However, it will convert
your empty elements from the form '<element />' to the form
'<element></element>', so it is perhaps not an ideal solution given
that you do not want to change anything.

It can also be done with a fairly simple regular expression. The only
tricky part is to make sure that it does not also match 'orders' but
still allows for attributes in the order element. 

import re
r = re.compile(r'<order(>|\s.*?>)(.*?)</order>', re.I|re.M|re.S)
# dummy is the group used to catch attributes etc
for dummy, item in r.findall(data):
    print item
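
A rough sketch of the same regular-expression idea in current Python 3
syntax (an assumption, since the snippet above targets 1.5.2). This
variant returns each order with its tags included, and it requires
either '>' or whitespace right after 'order' so that <orders> and
<order_head> are skipped. The names ORDER_RE and find_orders are made
up for illustration. It assumes well-formed input with no nested or
self-closing <order> elements and no order tags hidden inside comments
or CDATA sections.

```python
import re

# Hypothetical names, for illustration only.
# "<order" must be followed by ">" or by whitespace plus attributes,
# which rules out <orders> and <order_head>.
ORDER_RE = re.compile(r"<order(?:\s[^>]*)?>.*?</order\s*>", re.S)


def find_orders(data):
    """Return every <order>...</order> section of data, tags included."""
    # With no capturing groups, findall returns the whole matches.
    return ORDER_RE.findall(data)


if __name__ == "__main__":
    sample = (
        "<orders>\n"
        '<order id="1"><order_head/><order_detail/></order>\n'
        "<order><order_head/></order>\n"
        "</orders>"
    )
    for chunk in find_orders(sample):
        print(chunk)
```

Because the text is sliced out rather than rebuilt, empty elements such
as <order_head/> come through unchanged.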
	

>
>I thought of using nawk or grep or python but then my scripting language would
>have to know how to parse XML to make sure it correctly detects the tags it's
>looking for.  That would be too much effort.

No matter what there is going to be some work involved here. How much
depends on many factors such as the quality of the incoming xml,
etc... Given a guaranteed quality of incoming data you can take a lot
of shortcuts. 

Also, who is responsible for the process that takes the order data? If
you are writing that too, then you are going to have to parse the xml
one way or another so you might as well bite the bullet and do it up
front.

>
>Neither solution sounds appealing.  It amounts to a lot of code for what's
>really a simple problem.  Maybe there's something in the SAX stuff that would
>allow me to grab everything between (and including) the <order> tags.
>
>
>--
>.tom
>
>
>

good luck!

# note that this assumes 1.5.2 and uses xmllib
# would use pyexpat for 2.0
 
import xmllib, string

class myparser(xmllib.XMLParser):
    def __init__(self):
        xmllib.XMLParser.__init__(self)
        self.currentdata = []
        
    def handle_data(self, data):
        self.currentdata.append(data)

    def start_order(self, attrs):
        self.currentdata = []
                
    def end_order(self):
        print string.join(self.currentdata,'')

    def unknown_starttag(self, tag, attrs):
        self.currentdata.append('<%s' % tag)     
        for attr in attrs.items():
            self.currentdata.append(' %s="%s"' % attr) 
        self.currentdata.append('>')     
                
    def unknown_endtag(self, tag):
        self.currentdata.append('</%s>' % tag)     

    # put these here assuming that we 
    # do not want to expand entity refs at this point
    def unknown_entityref(self, ref):
        self.currentdata.append('&%s;' % ref)
    # so known refs do not get translated
    handle_entityref = unknown_entityref

    def unknown_charref(self, ref):
        self.currentdata.append('&#%s;' % ref)
    # so known refs do not get translated
    handle_charref = unknown_charref

### some test code
data= """\
<orders>
    <order>
        <order_head clientname="tom"/>
        <order_detail item="tiger"/>
    </order>
    <order>
        <order_head clientname="dick"/>
        <order_detail item="dog"/>
    </order>
    <order>
        <order_head clientname="harry"/>
        <order_detail item="hamster"/>
    </order>
</orders>

"""

if __name__ == "__main__":

    p=myparser()
    p.feed(data)
    p.close()
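
For 2.0 and later, where the note above suggests pyexpat, one possible
approach is sketched below in current Python 3 syntax. Instead of
rebuilding the tags, it records the parser's CurrentByteIndex at the
start and end of each order and slices the original bytes, so nothing
inside the order is changed. The function name split_orders is made up
for illustration. It assumes the whole document is held in memory as
bytes and that the end tag contains no '>' before its closing '>'.

```python
import xml.parsers.expat


def split_orders(data, tag="order"):
    """Return the raw bytes of each outermost <tag>...</tag> in data.

    Hypothetical helper: data must be a bytes object holding the
    complete document, because expat reports byte offsets.
    """
    parser = xml.parsers.expat.ParserCreate()
    chunks = []
    starts = []  # byte offsets of the currently open <tag> elements

    def start_element(name, attrs):
        if name == tag:
            # Offset of the '<' that opens this start tag.
            starts.append(parser.CurrentByteIndex)

    def end_element(name):
        if name == tag:
            begin = starts.pop()
            # CurrentByteIndex is at the '<' of the end tag, so extend
            # the slice up to and including the '>' that closes it.
            stop = data.index(b">", parser.CurrentByteIndex) + 1
            if not starts:  # only keep outermost elements
                chunks.append(data[begin:stop])

    parser.StartElementHandler = start_element
    parser.EndElementHandler = end_element
    parser.Parse(data, True)
    return chunks


if __name__ == "__main__":
    sample = (
        b"<batch>"
        b'<order id="1"><order_head/><order_detail/></order>'
        b"<order><order_head/></order>"
        b"</batch>"
    )
    for chunk in split_orders(sample):
        print(chunk.decode("utf-8"))
```

Since expat does the tag matching, this needs no handler per element
and will raise an error on input that is not well-formed.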





