[Tutor] Request review: A DSL for scraping a web page

Peter Otten __peter__ at web.de
Thu Apr 2 13:43:37 CEST 2015


Joe Farro wrote:

> The package implements a DSL that is intended to make web-scraping a bit
> more maintainable :)
> 
> I generally find my scraping code ends up rather chaotic, with the
> querying, regex manipulations, conditional processing, conversions, etc.
> too close together and sometimes interwoven. It's stressful.

Everything is cleaner than a bunch of regular expressions. It's just that 
sometimes they give results more quickly, and as reliably as you can get 
without adding a JavaScript engine to your script.
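
For example, when all you need is the href values, a quick regex can be 
enough (a rough sketch; html_text is assumed to hold the page source):

import re

# Pull every href attribute value out of the raw HTML.
# Crude, but often good enough for simple pages.
urls = re.findall(r'href="([^"]*)"', html_text)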

> The DSL attempts to mitigate this by doing only two things:
> finding stuff and saving it as a string. The post-processing is left to be
> done down the pipeline. It's almost just a configuration file.
> 
> Here is an example that would get the text and URL for every link in a
> page:
> 
>     $ a
>         save each: links
>             | [href]
>                 save: url
>             | text
>                 save: link_text
> 
> 
> The result would be something along these lines:
> 
>     {
>         'links': [
>             {
>                 'url': 'http://www.something.com/hm',
>                 'link_text': 'The text in the link'
>             },
>             # etc... another dict for each <a> tag
>         ]
>     }
> 

With Beautiful Soup you could write this:

import bs4

soup = bs4.BeautifulSoup(...)  # feed it the page's HTML

links = [
    {
        "url": a["href"],      # raises KeyError if an <a> has no href
        "link_text": a.text
    }
    for a in soup("a")         # soup("a") finds all <a> tags
]

and for many applications you wouldn't even bother with the intermediate 
data structure.
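
For instance, a minimal sketch that just walks the links directly 
(the printing is only illustrative):

for a in soup("a"):
    # a.get("href") returns None instead of raising when the
    # attribute is missing.
    print(a.get("href"), a.text)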

Can you give a real-world example where your DSL is significantly cleaner 
than the corresponding code using bs4, or lxml.xpath, or lxml.objectify?
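
For comparison, a sketch of the same extraction with lxml and XPath 
(html_text is again assumed to hold the fetched page):

import lxml.html

tree = lxml.html.fromstring(html_text)
links = [
    {"url": a.get("href"), "link_text": a.text_content()}
    for a in tree.xpath("//a")
]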

> The hope is that having all the selectors in one place will make them more
> manageable and possibly simplify the post-processing.
> 
> This is my first go at something along these lines, so any feedback is
> welcomed.

Your code on GitHub looks good to me (though there are too few docstrings), 
but like Alan I'm not prepared to read it completely. Do you have specific 
questions?


