SimplePrograms challenge
Steven Bethard
steven.bethard at gmail.com
Wed Jun 13 02:48:09 EDT 2007
Stefan Behnel wrote:
> Steven Bethard wrote:
>> If you want to parse invalid HTML, I strongly encourage you to look into
>> BeautifulSoup. Here's the updated code:
>>
>> import ElementSoup # http://effbot.org/zone/element-soup.htm
>> import cStringIO
>>
>> tree = ElementSoup.parse(cStringIO.StringIO(page2))
>> for a_node in tree.getiterator('a'):
>>     url = a_node.get('href')
>>     if url is not None:
>>         print url
>>
[snip]
>
> Here's an lxml version:
>
> from lxml import etree as et # http://codespeak.net/lxml
> html = et.HTML(page2)
> for href in html.xpath("//a/@href[string()]"):
>     print href
>
> Doesn't count as a 15-liner, though, even if you add the above HTML code to it.
Definitely better than the HTMLParser code. =) Personally, I still
prefer the xpath-less version, but that's only because I can never
remember what all the line noise characters in xpath mean. ;-)
STeVe
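
For readers on Python 3, here is a rough sketch of the xpath-less loop next to a path-based lookup, using only the standard library's xml.etree.ElementTree. It is not the code from the thread: the `page2` sample below is a hypothetical stand-in for the page discussed earlier, and it is deliberately well-formed, because ElementTree (unlike BeautifulSoup or lxml's HTML parser) will not accept broken markup. The helper names `hrefs_loop` and `hrefs_path` are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the thread's `page2`. Kept well-formed on purpose:
# ElementTree raises ParseError on invalid HTML.
page2 = """<html><body>
<a href="http://www.python.org/">Python</a>
<a name="top">anchor without href</a>
<a href="">empty href</a>
<a href="/about.html">About</a>
</body></html>"""


def hrefs_loop(page):
    """xpath-less version: walk every <a> and keep the hrefs that are present."""
    tree = ET.parse(io.StringIO(page))
    urls = []
    for a_node in tree.iter('a'):      # iter() replaces the old getiterator()
        url = a_node.get('href')       # None when the attribute is missing
        if url is not None:
            urls.append(url)
    return urls


def hrefs_path(page):
    """Path version: ElementTree's limited XPath subset can select <a> elements
    that carry an href attribute, but not the attribute nodes themselves."""
    root = ET.fromstring(page)
    return [a.get('href') for a in root.findall('.//a[@href]')]


if __name__ == '__main__':
    for url in hrefs_loop(page2):
        print(url)
```

Both helpers keep the empty `href=""`, because they only test whether the attribute exists. The `[string()]` predicate in the lxml expression is stricter: it is true only for a non-empty attribute value, so that version would skip the empty link.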