[Baypiggies] Scraping with authentication: Scrapy vs BeautifulSoup?

Ryan Larrabure ryan at larrabure.org
Tue Jun 28 17:25:18 CEST 2011


If you're scraping HTML, all reasonable roads seem to lead to xpath.
I'd use httplib2 and lxml.  Avoid mechanize.  Its form handling is
very poor (it'll read forms stored inline within javascript tags).
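
For what it's worth, here is a minimal sketch of the fetch-then-xpath
approach for a form-based login. It is an illustration under assumptions,
not a tested recipe for any particular site: it swaps in the standard
library's urllib and http.cookiejar for httplib2 to keep the dependencies
down to lxml, and the helper names (make_session, login_request, extract,
scrape) are made up for this example.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener

from lxml import html


def make_session():
    """Return an opener that stores cookies and resends them on later requests."""
    return build_opener(HTTPCookieProcessor(CookieJar()))


def login_request(login_url, fields):
    """Build the POST request that submitting the login form would send."""
    body = urlencode(fields).encode("ascii")
    return Request(login_url, data=body, method="POST")


def extract(page_source, xpath):
    """Parse HTML (str or bytes) and return whatever the XPath selects."""
    return html.fromstring(page_source).xpath(xpath)


def scrape(login_url, fields, target_url, xpath):
    """Log in, then fetch target_url with the session cookie and query it."""
    session = make_session()
    # The login response is discarded; only the cookies it sets matter here.
    session.open(login_request(login_url, fields)).close()
    with session.open(target_url) as response:
        return extract(response.read(), xpath)
```

A call might look like scrape("http://example.com/login",
{"username": "alice", "password": "secret"}, "http://example.com/private",
"//a/@href"), where the URLs and field names are placeholders; the real
field names have to be read off the site's login form, and sites that add
hidden tokens to the form will need those fetched and submitted as well.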

On Mon, Jun 27, 2011 at 3:07 PM, Dwight Hubbard
<dwight_hubbard at yahoo.com> wrote:
> For scraping with authentication I find the twill module is very good.
>
> ________________________________
> From: Glen Jarvis <glen at glenjarvis.com>
> To: Stephen McInerney <spmcinerney at hotmail.com>
> Cc: "<baypiggies at python.org>" <baypiggies at python.org>
> Sent: Saturday, June 25, 2011 6:48 PM
> Subject: Re: [Baypiggies] Scraping with authentication: Scrapy vs
> BeautifulSoup?
>
> Stephen,
>     Beautiful soup really just parses the HTML. It doesn't (have to)
> retrieve the page for you.
>     You can use the built-in urllib libraries, or the third-party httplib2,
> to retrieve the page (also with authentication) and then use BeautifulSoup
> to parse the page.
> Cheers,
>
> Glen
> On Jun 25, 2011, at 1:42 PM, Stephen McInerney <spmcinerney at hotmail.com>
> wrote:
>
>
> What do people use for scraping on a website requiring (login form-based)
> authentication?
>
> BeautifulSoup: does not handle authentication or cookies
> Scrapy: does but more heavyweight paradigm to learn, incl. XPath
>
> Some discussion:
> http://stackoverflow.com/questions/4328271/best-way-for-a-beginner-to-learn-screen-scraping-with-python
>
> Thanks,
> Stephen
>
> _______________________________________________
> Baypiggies mailing list
> Baypiggies at python.org
> To change your subscription options or unsubscribe:
> http://mail.python.org/mailman/listinfo/baypiggies
>

