Extracting XML from HTML

Mon Sep 17 16:31:19 EDT 2007

Hi,

I am trying to extract some XML that is embedded in an HTML document
returned by a form-based web page, and I cannot figure out how to do
it. I thought I could use the minidom module, but parsing the page
just raises an ExpatError:

Traceback (most recent call last):
  File "\\mcisnt1\repl$\Scripts\PythonPackages\Development\clippy\xml_parser.py", line 69, in ?
    inst = ApptParser(url)
  File "\\mcisnt1\repl$\Scripts\PythonPackages\Development\clippy\xml_parser.py", line 19, in __init__
    xml = self.getXml(url)
  File "\\mcisnt1\repl$\Scripts\PythonPackages\Development\clippy\xml_parser.py", line 30, in getXml
    doc = xml.dom.minidom.parse(f)
  File "C:\Python24\lib\xml\dom\minidom.py", line 1915, in parse
    return expatbuilder.parse(file)
  File "C:\Python24\lib\xml\dom\expatbuilder.py", line 928, in parse
    result = builder.parseFile(file)
  File "C:\Python24\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
ExpatError: mismatched tag: line 1, column 357

Here's a sample of the HTML:

<html>
<body>
lots of screwy text including divs and spans
<Row status="o">
    <RecordNum>1126264</RecordNum>
    <Make>Mitsubishi</Make>
    <Model>Mirage DE</Model>
</Row>
</body>
</html>

What's the best way to get at the XML? Do I need to parse the page
with HTMLParser first and then feed the result to minidom, or is
there a better way?
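
One possible approach, offered only as a rough sketch: minidom most
likely fails because the surrounding HTML is not well-formed XML, so
the idea is to pull each <Row> fragment out of the page text with a
regular expression and hand only that fragment to minidom. This
assumes every <Row> block is itself well-formed and that Row elements
are never nested. The names extract_rows and ROW_RE are hypothetical,
chosen for illustration, and the sample string mirrors the page above.

```python
import re
import xml.dom.minidom

# Assumption: each record sits in a non-nested <Row ...>...</Row> block.
# ROW_RE is an illustrative pattern, not part of any library.
ROW_RE = re.compile(r"<Row\b.*?</Row>", re.DOTALL)


def extract_rows(html):
    """Return one dict per <Row> fragment found in the HTML text."""
    rows = []
    for fragment in ROW_RE.findall(html):
        # Only the fragment reaches expat, never the surrounding HTML.
        doc = xml.dom.minidom.parseString(fragment)
        row = doc.documentElement
        record = {"status": row.getAttribute("status")}
        for child in row.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                # Join the text nodes directly under each child element.
                text = "".join(
                    n.data for n in child.childNodes
                    if n.nodeType == n.TEXT_NODE
                )
                record[child.tagName] = text
        rows.append(record)
    return rows


sample = """<html>
<body>
lots of screwy text including divs and spans
<Row status="o">
    <RecordNum>1126264</RecordNum>
    <Make>Mitsubishi</Make>
    <Model>Mirage DE</Model>
</Row>
</body>
</html>"""

for record in extract_rows(sample):
    print(record)
```

If the markup around the rows is messier than this sample (say, a
literal "</Row>" inside a comment or script), a regular expression
will not be reliable, and parsing the whole page with an
HTML-tolerant parser first would be the safer route.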

Thanks a lot!

Mike