SimplePrograms challenge

Wed Jun 13 06:21:17 EDT 2007

--- Steven Bethard <steven.bethard at gmail.com> wrote:

> Stefan Behnel wrote:
> > Steven Bethard wrote:
> >> If you want to parse invalid HTML, I strongly
> encourage you to look into
> >> BeautifulSoup. Here's the updated code:
> >>
> >>     import ElementSoup #
> http://effbot.org/zone/element-soup.htm
> >>     import cStringIO
> >>
> >>     tree =
> ElementSoup.parse(cStringIO.StringIO(page2))
> >>     for a_node in tree.getiterator('a'):
> >>         url = a_node.get('href')
> >>         if url is not None:
> >>             print url
> >>
> [snip]
> > 
> > Here's an lxml version:
> > 
> >   from lxml import etree as et   # 
> http://codespeak.net/lxml
> >   html = et.HTML(page2)
> >   for href in html.xpath("//a/@href[string()]"):
> >       print href
> > 
> > Doesn't count as a 15-liner, though, even if you
> add the above HTML code to it.
> 
> Definitely better than the HTMLParser code. =)
> Personally, I still 
> prefer the xpath-less version, but that's only
> because I can never 
> remember what all the line noise characters in xpath
> mean. ;-)
> 

I think there might be other people who will balk at
the xpath syntax, simply due to their unfamiliarity
with it.  And, on the other hand, if you really like
the xpath syntax, then the program really becomes more
of an endorsement for xpath's clean syntax than for
Python's.  To the extent that Python enables you to
implement an xpath solution cleanly, that's great, but
then you have the problem that lxml is not batteries
included.

I do hope we can find something to put on the page,
but I'm the wrong person to decide on it, since I
don't really do any rigorous HTML screen scraping in
my current coding.  (I still think it's a common use
case, even though I don't do it myself.)

I suggested earlier that maybe we post multiple
solutions.  That makes me a little nervous, to the
extent that it shows that the Python community has a
hard time coming to consensus on tools sometimes. 
This is not a completely unfair knock on Python,
although I think the reason multiple solutions tend to
emerge for this type of thing is precisely due to the
simplicity and power of the language itself.

So I don't know.  What about trying to agree on an XML
parsing example instead?  

Thoughts?

____________________________________________________________________________________
Pinpoint customers who are looking for what you sell. 
http://searchmarketing.yahoo.com/