Extracting xml from html

George Sakkis george.sakkis at gmail.com
Tue Sep 18 17:11:27 EDT 2007


On Sep 18, 3:31 pm, kyoso... at gmail.com wrote:
> On Sep 17, 4:51 pm, "Gabriel Genellina" <gagsl-... at yahoo.com.ar>
> wrote:
>
>
>
> > On Mon, 17 Sep 2007 17:31:19 -0300, <kyoso... at gmail.com> wrote:
>
> > > I am attempting to extract some XML from an HTML document that I get
> > > returned from a form based web page. For some reason, I cannot figure
> > > out how to do this. I thought I could use the minidom module to do it,
> > > but all I get is a screwy traceback:
>
> > > Traceback (most recent call last):
> > >   File "C:\Python24\lib\xml\dom\expatbuilder.py", line 207, in
> > > parseFile
> > >     parser.Parse(buffer, 0)
> > > ExpatError: mismatched tag: line 1, column 357
>
> > So your HTML, like many HTML pages, is not a well-formed XML document, and
> > you can't use an XML parser on it (even a valid HTML document may not be
> > valid XML). Let's try with some mismatched tags:
> > Depending on your document, you may prefer to extract the XML blocks using
> > BeautifulSoup, and then parse each one using BeautifulStoneSoup (the XML
> > parser) or xml.etree.ElementTree
>
> > --
> > Gabriel Genellina
>
> Thanks for the reply. I already knew about BeautifulSoup but I was
> hoping to avoid installing *yet another module* on my PC.

That's a poor excuse when the module is self-contained in a single file.
"Installing" it can be as simple as putting it in the same directory as
the module that imports it. Given that you can do in 2 lines what
took you around 15 with lxml, I wouldn't think twice about it.
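
To make the approach discussed in this thread concrete, here is a minimal
sketch using only the standard library. It is not the BeautifulSoup
two-liner itself: a plain regular expression stands in for the extraction
step, and the sample page, the <record> tag and the helper names are all
hypothetical. It shows the whole page failing under minidom while the
embedded block parses fine on its own with xml.etree.ElementTree.

```python
import re
import xml.etree.ElementTree as ET
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical page: the unclosed <br> makes it invalid XML,
# but the embedded <record> block is well formed on its own.
HTML = (
    "<html><body><p>Results<br></p>"
    "<record id='1'><name>spam</name><qty>3</qty></record>"
    "</body></html>"
)


def parse_whole_page(text):
    """Return True if the whole page parses as XML, False otherwise."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        # Same failure mode as the "mismatched tag" traceback quoted above.
        return False


def extract_blocks(text, tag):
    """Find each <tag>...</tag> block and parse it with ElementTree.

    Assumes the blocks are not nested inside each other; a regex is only
    a stand-in here for a tolerant HTML parser such as BeautifulSoup.
    """
    pattern = re.compile(r"<%s\b.*?</%s>" % (tag, tag), re.DOTALL)
    return [ET.fromstring(m.group(0)) for m in pattern.finditer(text)]


whole_page_ok = parse_whole_page(HTML)
records = extract_blocks(HTML, "record")
```

The regex is the fragile part: it is good enough when you know the tag
name and the blocks are flat, which is exactly where a tolerant parser
earns its keep on messier pages.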

George



