SimplePrograms challenge

Steven Bethard steven.bethard at gmail.com
Wed Jun 13 02:48:09 EDT 2007


Stefan Behnel wrote:
> Steven Bethard wrote:
>> If you want to parse invalid HTML, I strongly encourage you to look into
>> BeautifulSoup. Here's the updated code:
>>
>>     import ElementSoup # http://effbot.org/zone/element-soup.htm
>>     import cStringIO
>>
>>     tree = ElementSoup.parse(cStringIO.StringIO(page2))
>>     for a_node in tree.getiterator('a'):
>>         url = a_node.get('href')
>>         if url is not None:
>>             print url
>>
[snip]
> 
> Here's an lxml version:
> 
>   from lxml import etree as et   #  http://codespeak.net/lxml
>   html = et.HTML(page2)
>   for href in html.xpath("//a/@href[string()]"):
>       print href
> 
> Doesn't count as a 15-liner, though, even if you add the above HTML code to it.

Definitely better than the HTMLParser code. =) Personally, I still
prefer the XPath-less version, but that's only because I can never
remember what all the line-noise characters in XPath mean. ;-)
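
For reference, the pieces of that XPath expression map onto the loop
version roughly as shown below. This is only a sketch under some
assumptions: it uses the standard library's xml.etree.ElementTree and
Python 3 syntax rather than ElementSoup or lxml, `sample` is a made-up,
well-formed stand-in for page2, and `extract_hrefs` is a hypothetical
helper name. ElementTree will not cope with invalid HTML, so this only
illustrates what the expression selects.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed stand-in for page2 (an assumption, not the real page).
sample = """<html><body>
<a href="http://example.com/one">one</a>
<a name="anchor">no href at all</a>
<p><a href="">empty href</a> <a href="http://example.com/two">two</a></p>
</body></html>"""


def extract_hrefs(markup):
    """Return non-empty href values of all <a> elements, in document order."""
    root = ET.fromstring(markup)
    hrefs = []
    for a_node in root.iter('a'):    # //a        every <a>, at any depth
        url = a_node.get('href')     # /@href     its href attribute, if any
        if url:                      # [string()] keep only non-empty values
            hrefs.append(url)
    return hrefs


print(extract_hrefs(sample))
# prints ['http://example.com/one', 'http://example.com/two']
```

One difference worth noting: the ElementSoup loop quoted above tests
`url is not None`, so it would also print an empty href, whereas the
`[string()]` predicate skips empty values.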

STeVe


