Extracting xml from html

kyosohma at gmail.com
Mon Sep 17 17:14:41 EDT 2007


On Sep 17, 4:01 pm, Paul Boddie <p... at boddie.org.uk> wrote:
> On 17 Sep, 22:31, kyoso... at gmail.com wrote:
>
> > What's the best way to get at the XML? Do I need to somehow parse it
> > using the HTMLParser and then parse that with minidom or what?
>
> Probably easiest is to use an XML processing toolkit or library which
> supports HTML parsing. Since the libxml2 library (written in C) makes
> a fairly good job of HTML parsing, I would suggest either libxml2dom
> (for a DOM-like API) or lxml (for an ElementTree-like API) as suitable
> Python wrappers of libxml2. Of course, HTMLParser or SGMLParser should
> work, but the programming style is a bit more convoluted unless you're
> used to XML processing using a SAX-like API.
>
> Paul
>
> P.S. I'm biased towards libxml2dom, being the developer, but I use it
> routinely and it generally does the job for me.

I have lxml installed, and I appear to have libxml2dom installed as
well. I know lxml has decent docs, but I don't see much for yours. Is
this the only place to go?

http://www.boddie.org.uk/python/libxml2dom.html
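
For comparison, the standard-library route I asked about (HTMLParser
first, then minidom) might look roughly like the sketch below. It is
only a sketch under assumptions: it supposes the XML sits entity-escaped
inside a <pre> element, which may not match the real page, and
PreTextCollector, extract_xml and the sample page are names and data
made up for illustration. It uses the Python 3 module layout
(html.parser).

```python
from html.parser import HTMLParser
from xml.dom import minidom


class PreTextCollector(HTMLParser):
    """Hypothetical helper: collect the text found inside <pre> elements."""

    def __init__(self):
        # convert_charrefs defaults to True, so "&lt;" reaches
        # handle_data() already decoded as "<".
        super().__init__()
        self.depth = 0      # > 0 while we are inside a <pre>
        self.chunks = []    # pieces of text seen inside <pre>

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_xml(html_text):
    """Return a minidom Document built from the text inside <pre>."""
    collector = PreTextCollector()
    collector.feed(html_text)
    collector.close()
    # Assumption: all collected text forms one well-formed XML document.
    return minidom.parseString("".join(collector.chunks).strip())


# Made-up sample page; the real HTML may be laid out differently.
page = """<html><body><h1>Report</h1>
<pre>&lt;order id="42"&gt;&lt;item&gt;widget&lt;/item&gt;&lt;/order&gt;</pre>
</body></html>"""

doc = extract_xml(page)
print(doc.documentElement.getAttribute("id"))
print(doc.getElementsByTagName("item")[0].firstChild.data)
```

The event-handler class is the "more convoluted" SAX-like style Paul
mentions; a toolkit with its own HTML parser would likely replace the
whole class with a single parse call plus a lookup.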

Mike



