Extracting xml from html

kyosohma at gmail.com
Tue Sep 18 15:31:48 EDT 2007


On Sep 17, 4:51 pm, "Gabriel Genellina" <gagsl-... at yahoo.com.ar>
wrote:
> On Mon, 17 Sep 2007 17:31:19 -0300, <kyoso... at gmail.com> wrote:
>
> > I am attempting to extract some XML from an HTML document that I get
> > returned from a form based web page. For some reason, I cannot figure
> > out how to do this. I thought I could use the minidom module to do it,
> > but all I get is a screwy traceback:
>
> > Traceback (most recent call last):
> >   File "C:\Python24\lib\xml\dom\expatbuilder.py", line 207, in parseFile
> >     parser.Parse(buffer, 0)
> > ExpatError: mismatched tag: line 1, column 357
>
> So your HTML is not a well-formed XML document, like many HTML pages, and
> you can't use an XML parser on it. (Even a valid HTML document may not be
> valid XML.) Let's try with some mismatched tags:

> Depending on your document, you may prefer to extract the XML blocks using
> BeautifulSoup and then parse each one using BeautifulStoneSoup (the XML
> parser) or xml.etree.ElementTree.
>
> --
> Gabriel Genellina

Thanks for the reply. I already knew about BeautifulSoup but I was
hoping to avoid installing *yet another module* on my PC. I got it to
work with lxml, but it's not very pretty. See my reply to Stefan.
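
For anyone who finds this thread later, here is a minimal standard-library
sketch of the two points above: minidom rejects a page with mismatched tags,
and an embedded XML block can be cut out and parsed on its own. This is not
the lxml code mentioned above. The sample page, the <record> element name,
and both helper functions are assumptions made up for illustration, and the
code targets a current Python 3 rather than 2.4.

```python
import re
import xml.etree.ElementTree as ET
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical page: HTML with an unclosed <br> plus one embedded XML block.
# The <record> element name is an assumption made for this sketch.
PAGE = """<html><body><p>Results<br></p>
<record id="1"><name>spam</name><qty>3</qty></record>
</body></html>"""


def parse_whole_page(text):
    """Try to parse the whole page as XML; return None if it is not well formed."""
    try:
        return minidom.parseString(text)
    except ExpatError:
        # Raised for problems such as "mismatched tag" (the unclosed <br> here).
        return None


def extract_xml_blocks(text, tag):
    """Find each <tag ...>...</tag> span and parse it as its own XML document."""
    name = re.escape(tag)
    pattern = re.compile(rf"<{name}\b.*?</{name}>", re.DOTALL)
    return [ET.fromstring(match.group(0)) for match in pattern.finditer(text)]


if __name__ == "__main__":
    print(parse_whole_page(PAGE))  # the page as a whole is not valid XML
    for block in extract_xml_blocks(PAGE, "record"):
        print(block.get("id"), block.findtext("name"))
```

The non-greedy pattern stops at the first closing tag, so it will cut a block
short if the same element is nested inside itself. For messier input, a
tolerant parser such as BeautifulSoup or lxml is the safer choice.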

Mike



