[Tutor] Request review: A DSL for scraping a web page

Peter Otten __peter__ at web.de
Thu Apr 2 13:43:37 CEST 2015


Joe Farro wrote:

> The package implements a DSL that is intended to make web-scraping a bit
> more maintainable :)
> 
> I generally find my scraping code ends up rather chaotic, with the
> querying, regex manipulations, conditional processing, conversions, etc.
> too close together and sometimes interwoven. It's stressful.

Everything is cleaner than a bunch of regular expressions. It's just that 
sometimes they give results more quickly, and as reliably as you can get 
without adding a JavaScript engine to your script.
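
For example, when all you need is the href values, a quick regex can be 
enough (a rough sketch; html_text is assumed to hold the page source):

import re

# Pull every href attribute value out of the raw HTML.
# Crude, but often good enough for simple pages.
urls = re.findall(r'href="([^"]*)"', html_text)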

> The DSL attempts to mitigate this by doing only two things:
> finding stuff and saving it as a string. The post-processing is left to be
> done down the pipeline. It's almost just a configuration file.
> 
> Here is an example that would get the text and URL for every link in a
> page:
> 
>     $ a
>         save each: links
>             | [href]
>                 save: url
>             | text
>                 save: link_text
> 
> 
> The result would be something along these lines:
> 
>     {
>         'links': [
>             {
>                 'url': 'http://www.something.com/hm',
>                 'link_text': 'The text in the link'
>             },
>             # etc... another dict for each <a> tag
>         ]
>     }
> 

With Beautiful Soup you could write this:

import bs4

soup = bs4.BeautifulSoup(...)  # feed it the page's HTML

links = [
    {
        "url": a["href"],      # raises KeyError if an <a> has no href
        "link_text": a.text
    }
    for a in soup("a")         # soup("a") finds all <a> tags
]

and for many applications you wouldn't even bother with the intermediate 
data structure.
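
For instance, a minimal sketch that just walks the links directly 
(the printing is only illustrative):

for a in soup("a"):
    # a.get("href") returns None instead of raising when the
    # attribute is missing.
    print(a.get("href"), a.text)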

Can you give a real-world example where your DSL is significantly cleaner 
than the corresponding code using bs4, or lxml.xpath, or lxml.objectify?
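
For comparison, a sketch of the same extraction with lxml and XPath 
(html_text is again assumed to hold the fetched page):

import lxml.html

tree = lxml.html.fromstring(html_text)
links = [
    {"url": a.get("href"), "link_text": a.text_content()}
    for a in tree.xpath("//a")
]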

> The hope is that having all the selectors in one place will make them more
> manageable and possibly simplify the post-processing.
> 
> This is my first go at something along these lines, so any feedback is
> welcomed.

Your code on GitHub looks good to me (though there are too few docstrings), 
but like Alan I'm not prepared to read it completely. Do you have specific 
questions?


