[Tutor] Request review: A DSL for scraping a web page

Joe Farro joe.farro at gmail.com
Sat Apr 4 11:42:39 CEST 2015


Joe Farro <joe.farro <at> gmail.com> writes:

> 
> Thanks, Peter.
> 
> Peter Otten <__peter__ <at> web.de> writes:
> 
> > Can you give a real-world example where your DSL is significantly cleaner 
> > than the corresponding code using bs4, or lxml.xpath, or lxml.objectify?

Peter, I worked up what I hope is a fairly representative example. It scrapes
metadata from the 10 newest web-scraping questions on Stack Overflow, and it
is implemented twice: once with bs4 and once with take.

https://github.com/tiffon/take-examples/tree/master/samples/stackoverflow

I've posted on the bs4 discussion group asking for feedback on the bs4
version to make sure it's up to snuff. (The post is in new-member
purgatory, at the moment.)

In my opinion, take's inability to define subroutines is a brutal
deficiency, compared with being able to define functions like
`get_poster_details()` and `get_comment_activity()` in the bs4 version.
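
To illustrate the point about subroutines, here is a minimal sketch of how
helper functions let a scraper be factored into reusable pieces. It is not the
code in the repository: it uses the standard library's xml.etree.ElementTree
as a stand-in for bs4 so it runs without extra packages, and the markup, class
names, and function bodies are all assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; NOT the real Stack Overflow HTML.
SAMPLE = """
<div class="question">
  <a class="title">How to scrape a table?</a>
  <div class="poster">
    <a class="name">alice</a>
    <span class="reputation">1234</span>
  </div>
  <div class="comments">
    <span class="comment">first</span>
    <span class="comment">second</span>
  </div>
</div>
""".strip()


def get_poster_details(question):
    """Return the poster's name and reputation for one question element."""
    poster = question.find("./div[@class='poster']")
    return {
        "name": poster.findtext("./a[@class='name']"),
        "reputation": int(poster.findtext("./span[@class='reputation']")),
    }


def get_comment_activity(question):
    """Return a small summary of the comments on one question element."""
    comments = question.findall(
        "./div[@class='comments']/span[@class='comment']"
    )
    return {"count": len(comments)}


def scrape_question(question):
    """Combine the helpers; each one handles a single part of the page."""
    return {
        "title": question.findtext("./a[@class='title']"),
        "poster": get_poster_details(question),
        "comments": get_comment_activity(question),
    }


question = ET.fromstring(SAMPLE)
result = scrape_question(question)
print(result)
```

Each helper receives the question element and returns one piece of the
result, so the same extraction logic can be reused wherever it is needed
instead of being repeated inline.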

On the bright side, I do like that the indentation of the take templates
semi-reflects the structure of the HTML document. However, the
indentation doesn't (always) reflect the hierarchy of the data being
generated, which seems like it would be clearer.

Feedback is definitely welcome.

Thanks again!


