SimplePrograms challenge

Steven Bethard steven.bethard at gmail.com
Tue Jun 12 18:06:38 EDT 2007


Rob Wolfe wrote:
> Steven Bethard <steven.bethard at gmail.com> writes:
>> I'd hate to steer a potential new Python developer to a clumsier
> 
> "clumsier"???
> Try to parse this with your program:
> 
> page2 = '''
>      <html><head><title>URLs</title></head>
>      <body>
>      <ul>
>      <li><a href="http://domain1/page1">some page1</a></li>
>      <li><a href="http://domain2/page2">some page2</a></li>
>      </body></html>
>      '''

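If you want to parse invalid HTML (your page2 never closes its <ul>), I 
strongly encourage you to look into BeautifulSoup. Here's the updated 
code, using ElementSoup, which runs BeautifulSoup and hands back an 
ElementTree: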

     import ElementSoup # http://effbot.org/zone/element-soup.htm
     import cStringIO

     tree = ElementSoup.parse(cStringIO.StringIO(page2))
     for a_node in tree.getiterator('a'):
         url = a_node.get('href')
         if url is not None:
             print url

>> I know that the wiki page is supposed to be Python 2.4 only, but I'd
>> rather have no example than an outdated one.
> 
> This example is by no means "outdated".

Given the simplicity of the ElementSoup code above, I'd still contend 
that HTMLParser is too complex an answer to so simple a problem.
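
For comparison, here is a rough sketch of what an HTMLParser-based link 
extractor tends to look like. It is not Rob's actual code: the class name 
LinkCollector and the urls attribute are made up for illustration, and it 
is written against Python 3's html.parser module rather than the 2.x 
HTMLParser module discussed in this thread.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for
        # attributes written without a value.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.urls.append(value)


# Same test page as in the quoted message, including the unclosed <ul>.
page2 = '''
     <html><head><title>URLs</title></head>
     <body>
     <ul>
     <li><a href="http://domain1/page1">some page1</a></li>
     <li><a href="http://domain2/page2">some page2</a></li>
     </body></html>
     '''

parser = LinkCollector()
parser.feed(page2)
parser.close()
for url in parser.urls:
    print(url)
```

The parser is event-based, so the unclosed <ul> does not bother it, but 
you pay for that with a subclass, a callback, and your own state to 
collect the results.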

STeVe


