SimplePrograms challenge
Steven Bethard
steven.bethard at gmail.com
Tue Jun 12 18:06:38 EDT 2007
Rob Wolfe wrote:
> Steven Bethard <steven.bethard at gmail.com> writes:
>> I'd hate to steer a potential new Python developer to a clumsier
>
> "clumsier"???
> Try to parse this with your program:
>
> page2 = '''
> <html><head><title>URLs</title></head>
> <body>
> <ul>
> <li><a href="http://domain1/page1">some page1</a></li>
> <li><a href="http://domain2/page2">some page2</a></li>
> </body></html>
> '''
If you want to parse invalid HTML, I strongly encourage you to look into
BeautifulSoup. Here's the updated code:
import ElementSoup # http://effbot.org/zone/element-soup.htm
import cStringIO

tree = ElementSoup.parse(cStringIO.StringIO(page2))
for a_node in tree.getiterator('a'):
    url = a_node.get('href')
    if url is not None:
        print url
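
For context, here is a minimal sketch of why page2 is a problem for a strict parser. It is not from the original thread: it is written for Python 3 with only the standard library, it assumes the earlier program used a strict XML parser such as ElementTree, and the helper name strict_parse_error is hypothetical. The <ul> in page2 is never closed, so a strict parser rejects the document at </body>.

```python
import xml.etree.ElementTree as ET

# The page from the quoted message; note that <ul> is never closed.
page2 = '''
<html><head><title>URLs</title></head>
<body>
<ul>
<li><a href="http://domain1/page1">some page1</a></li>
<li><a href="http://domain2/page2">some page2</a></li>
</body></html>
'''


def strict_parse_error(markup):
    """Return the parser's error message if markup is not well-formed XML,
    or None if it parses cleanly. (Hypothetical helper, for illustration.)"""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None


error = strict_parse_error(page2)
print(error)  # some message about the unclosed tag, not None
```

A tolerant parser such as BeautifulSoup repairs the missing </ul> instead of raising, which is why the ElementSoup version above can still walk the tree.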
>> I know that the wiki page is supposed to be Python 2.4 only, but I'd
>> rather have no example than an outdated one.
>
> This example is by no means "outdated".
Given the simplicity of the ElementSoup code above, I'd still contend
that using HTMLParser here shows too complex an answer to too simple a
problem.
STeVe