SimplePrograms challenge
Steven Bethard
steven.bethard at gmail.com
Tue Jun 12 18:09:44 EDT 2007
Steven Bethard wrote:
> Rob Wolfe wrote:
>> Steven Bethard <steven.bethard at gmail.com> writes:
>>> I'd hate to steer a potential new Python developer to a clumsier
>>
>> "clumsier"???
>> Try to parse this with your program:
>>
>> page2 = '''
>> <html><head><title>URLs</title></head>
>> <body>
>> <ul>
>> <li><a href="http://domain1/page1">some page1</a></li>
>> <li><a href="http://domain2/page2">some page2</a></li>
>> </body></html>
>> '''
>
> If you want to parse invalid HTML, I strongly encourage you to look into
> BeautifulSoup. Here's the updated code:
>
> import ElementSoup # http://effbot.org/zone/element-soup.htm
> import cStringIO
>
> tree = ElementSoup.parse(cStringIO.StringIO(page2))
> for a_node in tree.getiterator('a'):
>     url = a_node.get('href')
>     if url is not None:
>         print url
I should also have pointed out that the above ElementSoup code can also
parse the following text::
    <html><head><title>URLs</title></head>
    <body>
    <ul>
    <li<a href="http://domain1/page1">some page1</a></li>
    <li><a href="http://domain2/page2">some page2</a></li>
    </body></html>
where the HTMLParser code raises an HTMLParseError.
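
For comparison, here is a minimal sketch of the event-based approach using the Python 3 standard library's html.parser module. It is not the HTMLParser code from earlier in this thread: the names LinkCollector and extract_links are hypothetical, made up for illustration. It only shows that an event-based parser builds no tree, so the unclosed <ul> in page2 does not bother it. How it treats the "<li<a" text above may depend on the Python version, so no output is claimed for that input.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag seen (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.urls.append(value)


def extract_links(html_text):
    """Feed the whole document to the parser and return the collected URLs."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.urls


# The page2 sample quoted above: the <ul> element is never closed.
page2 = '''
<html><head><title>URLs</title></head>
<body>
<ul>
<li><a href="http://domain1/page1">some page1</a></li>
<li><a href="http://domain2/page2">some page2</a></li>
</body></html>
'''

for url in extract_links(page2):
    print(url)
```

On page2 this prints both URLs, since each <a> start tag is reported as its own event regardless of what is or is not closed around it.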
STeVe