SimplePrograms challenge
Stefan Behnel
stefan.behnel-n05pAM at web.de
Wed Jun 13 02:26:00 EDT 2007
Steven Bethard wrote:
> Rob Wolfe wrote:
>> Steven Bethard <steven.bethard at gmail.com> writes:
>>> I'd hate to steer a potential new Python developer to a clumsier
>>
>> "clumsier"???
>> Try to parse this with your program:
>>
>> page2 = '''
>> <html><head><title>URLs</title></head>
>> <body>
>> <ul>
>> <li><a href="http://domain1/page1">some page1</a></li>
>> <li><a href="http://domain2/page2">some page2</a></li>
>> </body></html>
>> '''
>
> If you want to parse invalid HTML, I strongly encourage you to look into
> BeautifulSoup. Here's the updated code:
>
> import ElementSoup # http://effbot.org/zone/element-soup.htm
> import cStringIO
>
> tree = ElementSoup.parse(cStringIO.StringIO(page2))
> for a_node in tree.getiterator('a'):
>     url = a_node.get('href')
>     if url is not None:
>         print url
>
>>> I know that the wiki page is supposed to be Python 2.4 only, but I'd
>>> rather have no example than an outdated one.
>>
>> This example is by no means "outdated".
>
> Given the simplicity of the ElementSoup code above, I'd still contend
> that using HTMLParser here shows too complex an answer to too simple a
> problem.
Here's an lxml version:
from lxml import etree as et # http://codespeak.net/lxml
html = et.HTML(page2)
for href in html.xpath("//a/@href[string()]"):
    print href
Doesn't count as a 15-liner, though, even if you add the above HTML code to it.
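For comparison, here is a rough sketch of what the HTMLParser approach debated above might look like, written for Python 3 with only the standard library (the module is html.parser there). The names LinkCollector and extract_hrefs are made up for illustration and do not come from the thread, and the page2 string is copied from the quoted message. Treat it as one possible way to do it, not as the code from the wiki page.

```python
# Python 3 sketch using only the standard library.
# LinkCollector and extract_hrefs are illustrative names, not from the thread.
from html.parser import HTMLParser

# The invalid HTML from the quoted message (note the unclosed <ul>).
PAGE2 = '''
<html><head><title>URLs</title></head>
<body>
<ul>
<li><a href="http://domain1/page1">some page1</a></li>
<li><a href="http://domain2/page2">some page2</a></li>
</body></html>
'''


class LinkCollector(HTMLParser):
    """Collect non-empty href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for
        # attributes written without a value.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_hrefs(markup):
    """Return the href values of all <a> tags in markup, in document order."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.hrefs


if __name__ == "__main__":
    for href in extract_hrefs(PAGE2):
        print(href)
```

The parser only reports tags as it meets them and does not build a tree, so the missing </ul> is not a problem here. It is still a good deal longer than the ElementSoup and lxml versions.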
Stefan