How use XML parsing tools on this one specific URL?
Jorge Godoy
jgodoy at gmail.com
Sun Mar 4 12:53:58 EST 2007
"seberino at spawar.navy.mil" <seberino at spawar.navy.mil> writes:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
Yes... And Microsoft is responsible for a lot of the ill-formed pages on the
web, be it on their own website or in pages generated by their applications.
>
> http://moneycentral.msn.com/companyreport?Symbol=BBBY
>
> I can't validate it and xml.minidom.dom.parseString won't work on it.
>
> If this was just some teenager's web site I'd move on. Is there any
> hope avoiding regular expression hacks to extract the data from this
> page?
It all depends on what data you want. A non-validating parser would probably
be able to extract some things. Another option is to pass the page through a
cleanup tool that can fix the markup, like tidy, and then parse the result.
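
To illustrate the first option, here is a minimal sketch using the
standard library's html.parser, which does not insist on well-formed
input. It is written for current Python 3. The SAMPLE markup, the
CellCollector class and the two helper functions are made up for this
illustration; the sample is not taken from the real page, which would
need its own tag handling.

```python
from html.parser import HTMLParser
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical ill-formed snippet: unquoted attributes, unclosed <td>, <tr>
# and <br> tags. It is NOT copied from the page discussed in this thread.
SAMPLE = """<html><body>
<table>
<tr><td class=label>Last Price<td>41.25<br>
<tr><td class=label>Volume<td>1,200,000
</table>
</body></html>"""


class CellCollector(HTMLParser):
    """Collect the text of every <td>, tolerating missing close tags."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # A new cell implicitly ends the previous one.
            self.cells.append("")
            self._in_cell = True
        elif tag in ("tr", "table"):
            self._in_cell = False

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self._in_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self._in_cell:
            self.cells[-1] += data


def strict_parse_ok(markup):
    """Return True if a strict XML parser (minidom) accepts the markup."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        return False


def extract_cells(markup):
    """Return the stripped text of each table cell found by the lenient parser."""
    collector = CellCollector()
    collector.feed(markup)
    collector.close()
    return [cell.strip() for cell in collector.cells]


if __name__ == "__main__":
    print("strict XML parse succeeded:", strict_parse_ok(SAMPLE))
    print("cells:", extract_cells(SAMPLE))
```

The strict parser rejects the sample outright, while the lenient one
still yields the cell text, so no regular expressions are needed. If you
go the tidy route instead, the cleaned-up output should be acceptable to
xml.dom.minidom.parseString directly.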
--
Jorge Godoy <jgodoy at gmail.com>