How use XML parsing tools on this one specific URL?
Paul Boddie
paul at boddie.org.uk
Sun Mar 4 14:28:13 EST 2007
seberino at spawar.navy.mil wrote:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
>
> http://moneycentral.msn.com/companyreport?Symbol=BBBY
Yes, thank you Microsoft!
> I can't validate it and xml.dom.minidom.parseString won't work on it.
>
> If this was just some teenager's web site I'd move on. Is there any
> hope avoiding regular expression hacks to extract the data from this
> page?
The standards adherence from Microsoft services is clearly at "teenage
level", but here's a recipe:
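import libxml2dom
import urllib
# Fetch the page (urllib.urlopen is the Python 2 API); keep the URL on one line.
f = urllib.urlopen("http://moneycentral.msn.com/companyreport?Symbol=BBBY")
# html=1 asks for libxml2's HTML parser rather than its strict XML parser.
d = libxml2dom.parse(f, html=1)
f.close()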
You now have a document which contains a DOM providing libxml2's
interpretation of the HTML. Sadly, PyXML's HtmlLib doesn't seem to
work with the given document. Other tools may give acceptable results,
however.
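As one illustration of the "other tools" route, here is a minimal sketch that uses only the standard library's html.parser (Python 3) rather than libxml2dom. It is not the recipe above, just a lenient alternative: it collects the text of table cells and tolerates unclosed tags. The names CellTextCollector, extract_cells and SAMPLE are made up for this example, and the sample markup is invented, not taken from the MSN page.

```python
from html.parser import HTMLParser


class CellTextCollector(HTMLParser):
    """Collect the text of every <td>/<th> cell, tolerating sloppy markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.cells = []
        self._buffer = None  # None means we are not inside a cell

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            # A previous cell that was never closed ends where the next begins.
            self._flush()
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            # Collapse runs of whitespace and skip empty cells.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.cells.append(text)
            self._buffer = None


def extract_cells(markup):
    """Return the text of each table cell found in the markup string."""
    parser = CellTextCollector()
    parser.feed(markup)
    parser.close()
    parser._flush()  # pick up a cell left open at the end of the input
    return parser.cells


# Invented, deliberately ill-formed sample (unclosed <td>, <b> and <tr>).
SAMPLE = ("<table><tr><td>Last Price<td><b>40.12</td></tr>"
          "<tr><td>Volume</td><td>1,234</table>")

if __name__ == "__main__":
    print(extract_cells(SAMPLE))
```

To try it on a real page you would pass the decoded page text to extract_cells. Whether the cells come out in a useful order depends entirely on how the page lays out its tables, so treat this as a starting point.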
Paul