How to use XML parsing tools on this one specific URL?
Nikita the Spider
NikitaTheSpider at gmail.com
Sun Mar 4 14:21:43 EST 2007
In article <1173030156.276363.174250 at i80g2000cwc.googlegroups.com>,
"seberino at spawar.navy.mil" <seberino at spawar.navy.mil> wrote:
> I understand that the web is full of ill-formed XHTML web pages but
> this is Microsoft:
>
> http://moneycentral.msn.com/companyreport?Symbol=BBBY
>
> I can't validate it and xml.dom.minidom.parseString won't work on it.
>
> If this were just some teenager's web site I'd move on. Is there any
> hope of avoiding regular expression hacks to extract the data from this
> page?
Valid XHTML is scarcer than hen's teeth. Luckily, someone else has
already written the ugly regex parsing hacks for you. Try Connelly
Barnes' HTMLData:
http://oregonstate.edu/~barnesc/htmldata/
Or BeautifulSoup as others have suggested.
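
For a rough idea of why a tolerant parser helps where minidom gives up, here is a minimal sketch. It uses the standard library's html.parser as a stand-in, not HTMLData or BeautifulSoup, and the sample markup, the CellCollector class and the helper names are all made up for illustration; the real page will need its own extraction logic.

```python
from html.parser import HTMLParser
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Made-up, deliberately sloppy markup: unquoted attribute, unclosed <td> and <br>.
SAMPLE = (
    "<html><body><table><tr>"
    "<td class=label>Last Price<td>41.25<br>"
    "</table></body></html>"
)


class CellCollector(HTMLParser):
    """Collect the text of each <td> cell, tolerating unclosed tags."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        # A new <td> implicitly ends the previous one.
        if tag == "td":
            self.cells.append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        # Closing the cell, row, or table ends the current cell.
        if tag in ("td", "tr", "table"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data


def strict_parse_ok(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        return False


def extract_cells(markup):
    """Return the stripped text of every table cell in the markup."""
    collector = CellCollector()
    collector.feed(markup)
    collector.close()
    return [cell.strip() for cell in collector.cells]


if __name__ == "__main__":
    print("well-formed XML?", strict_parse_ok(SAMPLE))
    print("cells:", extract_cells(SAMPLE))
```

The strict parser rejects the sample outright, while the tolerant one still yields the cell text. HTMLData and BeautifulSoup go further by building a navigable structure for you.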
--
Philip
http://NikitaTheSpider.com/
Whole-site HTML validation, link checking and more