xml data or other?

Prasad, Ramit ramit.prasad at jpmorgan.com
Mon Nov 19 16:42:00 EST 2012


Artie Ziff wrote:
> 
> On 11/9/12 5:50 AM, rusi wrote:
> > On Nov 9, 5:54 pm, Artie Ziff <artie.z... at gmail.com> wrote:
> > # submit correctedinput to etree
> I was very grateful to get the "leg up" on getting started down that
> right path with my coding. Many thanks to you, rusi. I took your
> excellent advices and have this working.
> 
> class Converter():
>      PREFIX = """<?xml version="1.0"?>
>      <data>
>      """
>      POSTFIX = "</data>"
>      def __init__(self, data):
>          self.data = data
>          self.writeXML()
>      def writeXML(self):
>          pattern = re.compile('<testname=(.*)>')
>          replaceStr = r'<testname name="\1">'
>          xmlData = re.sub(pattern, replaceStr, self.data)
>          self.dataXML = self.PREFIX + xmlData.replace("\\", "/") +
> self.POSTFIX
> 
> ###  main
> # input to script is directory:
> # sanitize trailing slash
> testPkgDir = sys.argv[1].rstrip('/')
> # Within each test package directory is doc/testcase
> tcDocDir = "doc/testcases"
> # set input dir, containing broken files
> tcTxtDir = os.path.join(testPkgDir, tcDocDir)
> # set output dir, to write proper XML files
> tcXmlDir = os.path.join(testPkgDir, tcDocDir + "_XML")
> if not os.path.exists(tcXmlDir):
>      os.makedirs(tcXmlDir)
> # iterate through files in input dir
> for filename in os.listdir(tcTxtDir):
>      # set filepaths
>      filepathTXT = os.path.join(tcTxtDir, filename)
>      base = os.path.splitext(filename)[0]
>      fileXML = base + ".xml"
>      filepathXML = os.path.join(tcXmlDir, fileXML)
>      # read broken file, convert to proper XML
>      with open(filepathTXT) as f:
>          c = Converter(f.read())
>          xmlFO = open(filepathXML, 'w')   # xmlFileObject
>          xmlFO.write(c.dataXML)
>          xmlFO.close()
> 
> ###
> 
> Writing XML files so to see whats happening. My plan is to
> keep xml data in memory and parse with xml.etree.ElementTree.
> 
> Unfortunately, xml parsing fails due to angle brackets inside
> description tags. In particular, xml.etree.ElementTree.parse()
> aborts on '<' inside xml data such as the following:
> 
> <testname name="cron_test.sh">
>      <description>
>          This testcase tests if crontab <filename> installs the cronjob
>          and cron schedules the job correctly.
>      <\description>
> 
> ##
> 
> What is right way to handle the extra angle brackets?
> Substitute on line-by-line basis, if that works?
> Or learn to write a simple stack-style parser, or
> recursive descent, it may be called?

I think your description text should be in a CDATA section.
http://en.wikipedia.org/wiki/CDATA#CDATA_sections_in_XML

~Ramit


This email is confidential and subject to important disclaimers and
conditions including on offers for the purchase or sale of
securities, accuracy and completeness of information, viruses,
confidentiality, legal privilege, and legal entity disclaimers,
available at http://www.jpmorgan.com/pages/disclosures/email.  



More information about the Python-list mailing list