[Tutor] What can I do if I'm banned from a website??

Steven D'Aprano steve at pearwood.info
Thu Oct 11 02:56:42 CEST 2012


On 11/10/12 07:35, Benjamin Fishbein wrote:

> I've been scraping info from a website with a url program I wrote. But
>now I can't open their webpage, no matter which web browser I use. I
>think they've somehow blocked me. How can I get back in? Is it a
>temporary block?

How the hell would we know??? Ask the people running the web site.

If you have been breaking the terms and conditions of the web site, you
could have broken the law (computer trespass). I don't say this because
I approve of or agree with the law, but when you scrape websites with
anything other than a browser, that's the chance you take.


> And can I get in with the same computer from a different wifi?

*rolls eyes*

You've been blocked once. You want to get blocked again?

A lot of this depends on what the data is, why it is put on the web in
the first place, and what you intend doing with it.

Wait a week and see if the block is undone. Then:

* If the web site gives you an official API for fetching data, USE IT.

* If not, keep to their web site's T&C. If the T&C allows scraping under
   conditions (usually something along the lines of limiting how fast, or
   at what times, you can scrape), OBEY THOSE CONDITIONS and don't be selfish.

* If you think the webmaster will be reasonable, ask permission first.
   (I don't recommend that you volunteer the information that you were
   already blocked once.) If he's not a dick, he'll probably say yes,
   under conditions (again, usually to do with time and speed).

* If you insist on disregarding their T&C, don't be a dick about it.
   Always be an ethical scraper; there's a code sketch after this list
   showing what that looks like. If the police come knocking, at least
   you can say that you tried to avoid any harm from your actions. It
   could make the difference between jail and a good behaviour bond.

   - Make sure you download slowly: pause for at least a few seconds
     between each download, or even a minute or three.

   - Limit the rate that you download: you might be on high speed ADSL2,
     but the faster you slurp files from the website, the less bandwidth
     they have for others.

   - Use a cache so you aren't hitting the website again and again for
     the same files.

   - Obey robots.txt.
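
Here's a minimal sketch of what those points look like in practice,
using only the Python 3 standard library. The URL, contact address and
cache directory below are placeholders, not anything specific to your
case, and the fixed pause doubles as a crude rate limit:

import time
import urllib.parse
import urllib.request
import urllib.robotparser
from pathlib import Path

USER_AGENT = "MyScraper/0.1 (contact: you@example.com)"  # placeholder
CACHE_DIR = Path("cache")  # simple on-disk cache
CACHE_DIR.mkdir(exist_ok=True)

# Read the site's robots.txt once, up front.
robots = urllib.robotparser.RobotFileParser("http://example.com/robots.txt")
robots.read()

def fetch(url):
    """Politely fetch url: check the cache, obey robots.txt, then pause."""
    cache_file = CACHE_DIR / urllib.parse.quote(url, safe="")
    if cache_file.exists():
        # Cached: don't hit the web site again for the same file.
        return cache_file.read_bytes()
    if not robots.can_fetch(USER_AGENT, url):
        raise ValueError("robots.txt forbids fetching %s" % url)
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        data = response.read()
    cache_file.write_bytes(data)
    time.sleep(5)  # pause at least a few seconds between downloads
    return data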


Consider pausing for a random interval of (say) 0 to 90 seconds between
downloads to more accurately mimic a human using a browser. Also
consider changing your user-agent. Ethical scraping suggests putting
your contact details in the user-agent string. Defensive scraping
suggests mimicking Internet Explorer as much as possible.
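
As a sketch, again assuming Python 3; the IE string below is just one
plausible example of a browser user-agent, and the contact address is a
placeholder:

import random
import time
import urllib.request

# Ethical: identify yourself and give contact details.
POLITE_AGENT = "MyScraper/0.1 (contact: you@example.com)"

# Defensive: mimic a mainstream browser (an IE 9 string, for example).
BROWSER_AGENT = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"

def fetch_like_a_human(url, user_agent=POLITE_AGENT):
    """Fetch url after a random 0-90 second pause, so requests
    don't arrive with machine-like regularity."""
    time.sleep(random.uniform(0, 90))
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        return response.read()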

More about ethical scraping:

http://stackoverflow.com/questions/4384493/how-can-i-ethically-and-legally-scrape-data-from-a-public-web-site



-- 
Steven
