Convert XML to SQL

Wed Jan 26 05:51:06 EST 2011

Stefan Behnel, 26.01.2011 10:29:
> Johann Spies, 26.01.2011 10:07:
>> I an not a Python newbie but working with xml is new to me.
>>
>> I get data through a soap connection, using suds, and want to convert that
>> to objects which I can use to populate a rather complex database.
>
> Your problem description is pretty comprehensive in general. It would be
> helpful to see an example snippet of the XML that you parse.

[example received in private e-mail]

>> I have been able to parse the xml using
>>
>> tree = etree.iterparse(infile,events=("start","end")) but it seems like a
>> lot of work to get that to sql-objects.
>>
>> I have seen references to lxml.objectify and have created a object
>> containing the contents of a whole file using
>>
>> tree = objectify.parse(fileobject)
>>
>> That object contains for example the data of 605 records and I do not know
>> how to use it. I could not figure out from the lxml.objectify documentation
>> how to do it.
>>
>> In the end I want to use data from about 54 fields of each records. I would
>> like to have a list of dictionaries as a result of the parsing. From there
>> it should not be too difficult to create sql.
>
> I think iterparse() is a good way to deal with this, as is objectify.
> iterparse() has the advantage that you can dispose of handled records, thus
> keeping memory usage low (if that's an issue here).
>
> Using objectify, you would usually do something like this:
>
> tree = objectify.parse(fileobject)
> root = tree.getroot()
> for record in root.xml_record_tag:
>     title_name = record.title.text
>
> It really just depends on what your XML looks like. In the above, I assumed
> that each record hangs directly below the root tag and is called
> "xml_record_tag". I also assumed that each record has a "title" tag with
> text content.

The example you sent me is almost perfect for lxml.objectify. Basically, 
you'd do something like this:

     for record in root.REC:
         refs = [ ref.text for ref in record.item.refs.ref ]
         publisher = record.item.copyright.publisher.text
         for issue in record.issue:
             units = [ unit.text for unit in issue.units.unit ]

and so on. The same in ET:

     for record in root.findall('REC'):
         refs = [ ref.text for ref in record.findall('item/refs/ref') ]
         publisher = record.findtext('item/copyright/publisher')
         for issue in record.findall('issue'):
             units = [ unit.text for unit in issue.findall('units/unit') ]

Not much harder either.

Stefan