SimplePrograms challenge

Steven Bethard steven.bethard at gmail.com
Tue Jun 12 18:09:44 EDT 2007


Steven Bethard wrote:
> Rob Wolfe wrote:
>> Steven Bethard <steven.bethard at gmail.com> writes:
>>> I'd hate to steer a potential new Python developer to a clumsier
>>
>> "clumsier"???
>> Try to parse this with your program:
>>
>> page2 = '''
>>      <html><head><title>URLs</title></head>
>>      <body>
>>      <ul>
>>      <li><a href="http://domain1/page1">some page1</a></li>
>>      <li><a href="http://domain2/page2">some page2</a></li>
>>      </body></html>
>>      '''
> 
> If you want to parse invalid HTML, I strongly encourage you to look into 
> BeautifulSoup. Here's the updated code:
> 
>     import ElementSoup # http://effbot.org/zone/element-soup.htm
>     import cStringIO
> 
>     tree = ElementSoup.parse(cStringIO.StringIO(page2))
>     for a_node in tree.getiterator('a'):
>         url = a_node.get('href')
>         if url is not None:
>             print url

I should also have pointed out that the above ElementSoup code can also
parse the following text::

     <html><head><title>URLs</title></head>
     <body>
     <ul>
     <li<a href="http://domain1/page1">some page1</a></li>
     <li><a href="http://domain2/page2">some page2</a></li>
     </body></html>

where the HTMLParser code raises an HTMLParseError.
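
For comparison, here is a minimal sketch of the same link extraction using only the standard library, assuming Python 3, where the module is called html.parser and (from 3.5 on) no longer defines HTMLParseError. This is not the HTMLParser code discussed earlier in the thread; the LinkCollector class and the extract_urls function are hypothetical names made up for illustration. It is run on the page2 text with the missing </ul>. How it treats the broken "<li<a" line depends on the parser version, so check that case against your own interpreter.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collect href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.urls.append(value)


def extract_urls(html_text):
    """Feed the text to the parser and return the collected URLs."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.urls


# Same sample as page2 in the thread: the <ul> is never closed.
page2 = '''
     <html><head><title>URLs</title></head>
     <body>
     <ul>
     <li><a href="http://domain1/page1">some page1</a></li>
     <li><a href="http://domain2/page2">some page2</a></li>
     </body></html>
     '''

for url in extract_urls(page2):
    print(url)
```

The event-based parser never builds a tree, so the unclosed <ul> does not matter here. If you need a tree to walk, as in the ElementSoup code, this sketch is not a substitute.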

STeVe


