SimplePrograms challenge

Stefan Behnel stefan.behnel-n05pAM at web.de
Wed Jun 13 02:26:00 EDT 2007


Steven Bethard wrote:
> Rob Wolfe wrote:
>> Steven Bethard <steven.bethard at gmail.com> writes:
>>> I'd hate to steer a potential new Python developer to a clumsier
>>
>> "clumsier"???
>> Try to parse this with your program:
>>
>> page2 = '''
>>      <html><head><title>URLs</title></head>
>>      <body>
>>      <ul>
>>      <li><a href="http://domain1/page1">some page1</a></li>
>>      <li><a href="http://domain2/page2">some page2</a></li>
>>      </body></html>
>>      '''
> 
> If you want to parse invalid HTML, I strongly encourage you to look into
> BeautifulSoup. Here's the updated code:
> 
>     import ElementSoup # http://effbot.org/zone/element-soup.htm
>     import cStringIO
> 
>     tree = ElementSoup.parse(cStringIO.StringIO(page2))
>     for a_node in tree.getiterator('a'):
>         url = a_node.get('href')
>         if url is not None:
>             print url
> 
>>> I know that the wiki page is supposed to be Python 2.4 only, but I'd
>>> rather have no example than an outdated one.
>>
>> This example is by no means "outdated".
> 
> Given the simplicity of the ElementSoup code above, I'd still contend
> that using HTMLParser here shows too complex an answer to too simple a
> problem.

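(For comparison, here is a minimal sketch of the HTMLParser style being discussed. It is not the program Rob posted, just an illustration of what the stdlib approach tends to look like, and it assumes the page2 string quoted above:)

  from HTMLParser import HTMLParser

  class LinkParser(HTMLParser):
      # print the href of every <a> tag that actually carries one
      def handle_starttag(self, tag, attrs):
          if tag == 'a':
              href = dict(attrs).get('href')
              if href:
                  print href

  LinkParser().feed(page2)
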
Here's an lxml version:

  from lxml import etree as et   #  http://codespeak.net/lxml
  html = et.HTML(page2)
  for href in html.xpath("//a/@href[string()]"):
      print href

Doesn't count as a 15-liner, though, even if you add the above HTML code to it.
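
For reference, a self-contained version of that snippet, with the quoted page2 string inlined, would look roughly like this and comes to about a dozen lines:

  from lxml import etree as et   #  http://codespeak.net/lxml

  page2 = '''
      <html><head><title>URLs</title></head>
      <body>
      <ul>
      <li><a href="http://domain1/page1">some page1</a></li>
      <li><a href="http://domain2/page2">some page2</a></li>
      </body></html>
      '''

  html = et.HTML(page2)
  for href in html.xpath("//a/@href[string()]"):
      print href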

Stefan


