Parse bad xml file

David Jobes david.jobes.sony at gmail.com
Fri Oct 10 08:33:40 EDT 2014


On Friday, October 10, 2014 8:21:17 AM UTC-4, Peter Otten wrote:
> David Jobes wrote:
>
> > I was given a badly or poorly formatted xml file that I need to convert to
> > a csv file:
>
> There are no "badly formatted" XML files, only valid and invalid ones.
> Fortunately the following looks like the beginning of a valid one.
>
> > <?xml version="1.0"?>
> > <resultset xmlns:dyn="http://exslt.org/dynamic">
> > <table name="SIGNATURE">
> > <column name="ID" type="String">	</column>
> > <column name="NUM" type="Integer">	</column>
> > <column name="SEVERITY_ID" type="Integer">	</column>
> > <column name="NAME" type="String">	</column>
> > <column name="CLASS" type="String">	</column>
> > <column name="PRODUCT_CATEGORY_ID" type="Integer">	</column>
> > <column name="PROTOCOL" type="String">	</column>
> > <column name="TAXONOMY" type="String">	</column>
> > <column name="CVE_ID" type="String">	</column>
> > <column name="BUGTRAQ_ID" type="String">	</column>
> > <column name="DESCRIPTION" type="String">	</column>
> > <column name="MESSAGE" type="String">	</column>
> > <column name="FILTERTYPE" type="String">	</column>
> > <data>
> > <r>
> > <c>00000001-0001-0001-0001-000000000027</c>
> > <c>27</c>
> > <c>2</c>
> > <c>0027: IP Options: Record Route (RR)</c>
> > <c>Network_equip</c>
> > <c>10</c>
> > <c>ip</c>
> > <c>100741885</c>
> > <c>2001-0752,1999-1339,1999-0986</c>
> > <c>870</c>
> > <c></c>
> > <c></c>
> > <c></c>
> > </r>
> >
> > I have been able to load and read the file line by line,
>
> XML doesn't have an idea of lines, so don't do that. Instead let a parser
> make sense of the document structure.
>
> > but once I get to
> > the r line and try to process each c (column) that is where it blows up. I
> > need to be able to split the lines and place each one of the r (row) on a
> > single line for the csv.
> >
> > I have a list set for each one of the headers based on the col name field,
> > I just haven't been able to format it properly.
>
> Here's a simple script using ElementTree, to introduce you to basic xml
> handling with Python's stdlib. If you are lucky it might even work ;)
>
> import csv
> import sys
> from xml.etree import ElementTree
>
> SOURCEFILE = "xml_to_csv.xml"
>
> tree = ElementTree.parse(SOURCEFILE)
> table = tree.find("table")
> column_names = [c.attrib["name"] for c in table.findall("column")]
> writer = csv.writer(sys.stdout)
> writer.writerow(column_names)
> for row in table.find("data").findall("r"):
>     writer.writerow([field.text for field in row.findall("c")])

That did it, thank you, and in a lot fewer lines of code than I had. I was trying to use strings and regex. I will read up more on the xml.etree stuff.
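
For anyone reading this thread later, here is a minimal self-contained sketch of the same ElementTree approach, run against an inline sample instead of a file. The SAMPLE string is a shortened copy of the data above with the closing tags filled in as an assumption about how the real file ends, and the function name table_to_csv is only illustrative. It also shows two things that are easy to trip over with strings and regex: an empty <c></c> element has text None, and a field containing commas (the CVE_ID list) has to be quoted, which the csv module does on its own.

```python
import csv
import io
from xml.etree import ElementTree

# Shortened sample modeled on the file in this thread; the closing
# </data>, </table> and </resultset> tags are assumed.
SAMPLE = """<?xml version="1.0"?>
<resultset xmlns:dyn="http://exslt.org/dynamic">
<table name="SIGNATURE">
<column name="ID" type="String"></column>
<column name="NUM" type="Integer"></column>
<column name="NAME" type="String"></column>
<column name="CVE_ID" type="String"></column>
<column name="MESSAGE" type="String"></column>
<data>
<r>
<c>00000001-0001-0001-0001-000000000027</c>
<c>27</c>
<c>0027: IP Options: Record Route (RR)</c>
<c>2001-0752,1999-1339,1999-0986</c>
<c></c>
</r>
</data>
</table>
</resultset>
"""


def table_to_csv(xml_text):
    """Return the first <table> in xml_text as CSV text (hypothetical helper)."""
    root = ElementTree.fromstring(xml_text)
    table = root.find("table")

    out = io.StringIO()
    # lineterminator="\n" keeps the output easy to compare; the csv
    # module's default is "\r\n".
    writer = csv.writer(out, lineterminator="\n")

    # Header row: the name attribute of every <column> element.
    writer.writerow([col.attrib["name"] for col in table.findall("column")])

    # One CSV row per <r>; empty <c></c> elements have text None,
    # so map those to an empty string explicitly.
    for row in table.find("data").findall("r"):
        writer.writerow([(cell.text or "") for cell in row.findall("c")])

    return out.getvalue()


if __name__ == "__main__":
    print(table_to_csv(SAMPLE), end="")
```

Writing to an io.StringIO rather than sys.stdout is just a choice that makes the result easy to inspect; passing an open file object to csv.writer works the same way.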


