How to use XML parsing tools on this one specific URL?

Paul Boddie paul at boddie.org.uk
Sun Mar 4 14:28:13 EST 2007


seberino at spawar.navy.mil wrote:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
>
> http://moneycentral.msn.com/companyreport?Symbol=BBBY

Yes, thank you Microsoft!

> I can't validate it and xml.dom.minidom.parseString won't work on it.
>
> If this was just some teenager's web site I'd move on.  Is there any
> hope of avoiding regular expression hacks to extract the data from this
> page?

The standards adherence from Microsoft services is clearly at "teenage
level", but here's a recipe:

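# Python 2 code; libxml2dom is a third-party package built on libxml2.
import libxml2dom
import urllib

# Fetch the page and parse it with libxml2's forgiving HTML parser.
f = urllib.urlopen("http://moneycentral.msn.com/companyreport?Symbol=BBBY")
d = libxml2dom.parse(f, html=1)
f.close()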

You now have a DOM document, d, holding libxml2's interpretation of
the HTML. Sadly, PyXML's HtmlLib doesn't seem to work with the given
document. Other tools may give acceptable results, however.
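
As an illustration of the "other tools" route, here is a minimal sketch
using html.parser from the Python 3 standard library, which also
tolerates unclosed tags. It is not the libxml2dom recipe above and has
not been tried on the MSN page; the CellCollector class, the
extract_cells function and the sample markup (including its figures)
are made up for the example.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, tolerating sloppy markup."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # A new <td> starts a fresh cell even if the previous
            # one was never closed.
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        # Closing a cell, row or table ends the current cell.
        if tag in ("td", "tr", "table"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


def extract_cells(markup):
    """Return the stripped text of each table cell in the markup."""
    parser = CellCollector()
    parser.feed(markup)
    parser.close()
    return [cell.strip() for cell in parser.cells]


# Hypothetical, deliberately ill-formed sample: unclosed <td> and <b>.
sample = ("<table><tr><td>Last Price<td>40.12</tr>"
          "<tr><td>Volume<td><b>1.2M</td></table>")
cells = extract_cells(sample)
print(cells)
```

The same approach should work on a fetched page by passing the decoded
response text to extract_cells, although a real page will need more
specific selection than "every table cell".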

Paul



